COM 3590 Data Cleaning and Transformation

In real-world situations, data scientists must be able to use data from many dirty, autonomous, and heterogeneous data sources that are far from being ready to be analyzed. Preparing the data for analysis (often referred to as data wrangling) involves four different tasks: cleaning, sampling, transformation, and integration. For each of these tasks, interactive tools are useful both for preparing small data sets as well as for investigating the general quality or structure of a large data set. When dealing with large data sets measuring in many thousands or millions of rows, however, programmatic quantitative approaches are an absolute necessity to make data preparation a realistic task. This course covers both interactive tools and quantitative approaches to each of these tasks. Because data preparation is a focus of significant R&D and small advances may have major impacts on one's productivity, the course also introduces students to the communities of research and practice that continue to advance the state of the art enabling students to stay abreast of valuable advances in this area.

Credits

3