Title: Discovering and Assembling Data in Tabular Data Lakes
Time: December 24, 2024, 15:00
Venue: 美高梅4688集团amB404
Speaker: Zhifeng Bao
Speaker's nationality: Chinese
Speaker's affiliation: RMIT University
Speaker's biography: Professor Zhifeng Bao leads the Big Data and Database Group at RMIT University and is an Honorary Senior Fellow at The University of Melbourne. He previously co-directed the RMIT Center of Information Discovery and Data Analytics. He obtained his PhD in Computer Science from the National University of Singapore, where he received the Best PhD Thesis Award. His recent research focuses on data management and governance, particularly DB4AI and AI4DB. In DB4AI, he investigates how to identify suitable datasets, uncover hidden relationships, tackle data quality issues, and meet diverse user needs. In AI4DB, he studies how machine learning can optimize database operations, including index selection, query optimization, and cardinality estimation for both low- and high-dimensional data. He has received several honors, including the Australian Research Council Future Fellowship, the Computing Research and Education Association of Australasia (CORE) Award for Outstanding Research, the Google Faculty Research Awards, and the Best Paper Award Runner-up at KDD’19. He is a PC Co-chair of the full paper track at CIKM’24 and has served as an Associate Editor for PVLDB, SIGMOD, and ICDE. He also chairs the Data Management and Data Science field for the CORE 2026 conference ranking committee. In addition to his academic work, he provides consultancy to various organizations, including the City of Melbourne on its Smart City Project and the Victoria Department of Health and Human Services on data quality initiatives.
Abstract: Data lakes have emerged as vital repositories for storing vast quantities of heterogeneous data, presenting immense opportunities as well as significant challenges for data-driven research and applications. This talk introduces a systematic guide to effectively harnessing tabular data in data lakes, focusing on three key tasks: 1) Dataset Discovery – identifying relevant datasets that align with various user intents and inputs; 2) Dataset-Level Assemblage – assembling the discovered datasets into a unified and comprehensive resource that meets various user requirements; 3) Data Points-Level Assemblage – optimizing the selection of data points from the assembled dataset, curating the subset most effective for typical downstream tasks such as machine learning model training. By addressing these tasks, our guided framework transforms fragmented raw data into high-quality, application-ready datasets. The talk will cover the problem formulations, challenges, and methodologies involved, and will highlight open questions where effective and efficient data preparation is crucial. Ultimately, we aim to explore the potential for developing an intelligent, personalized data preparation agent to automate and optimize these processes for real-world applications.
Host: 王胜 (Wang Sheng)