Guest blog by Professor Chris Williams, University of Edinburgh.
Chris was due to speak at DataTech20, which was unfortunately cancelled due to COVID-19. This article is based on the presentation that was prepared for DataTech20 in Glasgow and The Turing Present: AI UK in London, both of which were due to be held in March 2020.
We hear so much about artificial intelligence (AI) and deep learning these days that it is tempting to believe that such approaches to extracting insight and predictions from data can now solve a wide variety of problems. There are indeed notable successes, such as predicting the objects present in an image, but images are a particular kind of rather clean data. However, many areas give rise to messy data, and the analyst often needs to spend up to 80% of their time cleaning and preparing such data (a process known as data wrangling). Examples of messy data include electronic healthcare records, published government tables, and household survey data.
The ‘Artificial intelligence for data analytics (AIDA)‘ project based at The Alan Turing Institute is an initiative to address these problems. In this post we’ll take a closer look at the challenge of messy data and share a few tools and insights that we’ve developed to make the lives of data scientists and data engineers easier.
In the AIDA project we have identified the following data engineering challenges:
- Data organisation
- Data parsing: e.g. reading in CSVs
- Obtaining (or inferring) a data dictionary: Basic types & semantics
- Data integration: Combining data from multiple sources
- Data transformation
- Data quality
- Canonicalisation: e.g. UK, U.K. and United Kingdom
- Missing data: identification and repair
- Anomaly detection and repair
- Non-stationarity, e.g. change points
- Feature engineering
First, it is important to realise that predictive models (such as deep neural networks) are just one part of the data analytics process. This process starts with a problem specification, asks what data are available to address this question, then prepares the data before carrying out predictive modelling. After this the model needs to be evaluated and then deployed. This is not a linear process, there can be many feedback loops that require revisiting earlier steps.
Second, whilst research such as that being done on self-driving vehicles ultimately aims to remove the human aspect, we find that for the process of data cleaning human insights can be crucial. Thus we do not aim to fully automate data engineering but instead focus on ‘semi-automated’ tools that keep the human in the loop, and let the analyst guide the overall process and provide key insights.
AIDA combines multidisciplinary work from machine learning, semantic technologies, and programming languages to:
- Build AI assistants for individual tasks
- Build an open-source platform (called Wrattler) and integrate the assistants into the platform
- Provide exemplar use cases of real-world data wrangling
Of course we are not the only people to work on problems in data engineering, this is an interdisciplenary topic that benefits from insight from the databases, data mining, and programming languages communities. Below we give several examples of tools and insights generated by the AIDA project.
(Other work done in the AIDA project includes the filling in of missing data in tables, see ‘Handling Incomplete Heterogeneous Data using VAEs‘, and identifying and repairing outliers in tabular data, see ‘Robust Variational Autoencoders for Outlier Detection in Mixed-Type Data‘.)
Often an analyst is given a number of datasets from successive time periods (let’s say different years). If each set has the same structure and format it is easy to collate them together. But usually this isn’t the case: column names might have changed, the columns have been re-ordered, new columns have been added, etc. Datadiff is an AI assistant that seeks to match the columns between dataset A and dataset B, based on the idea that there should be a simple transformation that makes the distribution of the data in the matched columns the same. This might involve swapping, inserting or deleting columns, a linear transformation (e.g. change of units) etc.
For each pair of columns in A and B, we first identify the best single transformation between them and compute an associated weight. We then find the best weighted bipartite matching to identify how the columns of A and B should match up. For example we have used datadiff to match data about UK broadband performance over various years.
ptype: Probabilistic type inference
To analyse data properly, we need to know its datatype, e.g. integer, date, string. But missing and anomalous data can mess up the automatic identification of data types, as shown in the graphic below. If we are not able to identify “Error” as an anomalous entry, and and “Null” as an indicator of missing data, these columns will be identified as being of string type, which is incorrect and cause problems downstream if not addressed.
Our solution is a method for probabilistic type inference. This seeks to explain the data in each column in terms of a model that can generate appropriate entries, or anomalies or missing data indicators. We have built probabilistic finite state machines (PFSMs) for each datatype, and then run probabilistic inference to infer the column type given the data. Experiments show that ptype performs better than its competitors in identifying column types, and is particularly strong on identifying anomalous or missing type cells.
You might think that a CSV file would be comma separated, right? But unfortunately real life is not that simple. In a CSV file one needs to specify the column delimiter, quote character and escape character to specify the “dialect”. CleverCSV seeks to automatically identify the dialect, based on an objective function that maximises both the uniformity of rows by pattern, and the uniformity of the identified columns by type. The method achieves 97% overall accuracy on large corpus of CSV files, and improves accuracy on messy CSV files by 22%. Together with ptype, this paves the way to automatically converting datasets from CSV format to properly typed data frames ready for analysis.
ColNet: Embedding the semantics of web tables for column type prediction
Suppose we have a column of data with entries which include “Mute swan”, “Eider duck” and “Wandering albatross”. A human would likely suspect that the column describes types of bird, or perhaps a more general class like “species”. Identifying such semantic column types can be useful for data quality checking, and potentially for integrating other information known about the semantic type.
How can an AI assistant predict the semantic type from the data itself? ColNet works by first retrieving column cells’ corresponding entries in a relevant knowledge base, and returning the classes of the matched entities as a set of candidate classes for annotation. It then scores each candidate class using a customised neural network classifier. ColNet was shown to outperform state-of-the-art approaches.
The standard notebook architecture for data science (e.g. Jupyter) involves the notebook interacting with a server. However, this has limitations regarding reproducibility, lacks the ability to rollback state, and is limited to a single programming language. Wrattler is an architecture that overcomes these issues, by allowing for reproducibility and versioning, polyglot programming, and providing a platform for AIDA’s AI assistants.
One of the main problems with research on data wrangling and data cleaning is that few organisations are willing or able to share both the original messy data and the final clean data. Datasets shared on Kaggle or other platforms usually come pre-cleaned and ready to use, even though it is incredibly valuable for data scientists and researchers to see the process of data cleaning and understand all the steps that needed to be performed.
To further understanding of the process of data wrangling, the AIDA team has curated four datasets and documented the data wrangling process. These datasets are from different domains and include the Tundra Traits dataset of plant measurements, the Household Electricity Survey, a dataset of Electronic Health Records, and the aforementioned Ofcom Broadband survey. For each dataset the analysis and data wrangling process is available online.
A common view is that up to 80% of a data mining project is spent on data understanding and data preparation. The AIDA team seeks to produce AI assistants and a software framework to reduce the time and effort needed for data engineering tasks. The work highlighted above illustrates a diverse set of issues covering data organization and quality. So far we have focused on tabular data, but data engineering issues also apply to time series, images, graph data etc. There is still lots to do!
The AIDA project has received funding from The Alan Turing Institute, the UK Government’s Defence and Security Programme in support of The Alan Turing Institute, and starter funding from the Lloyds Register Foundation.