Synthetic Data Part 1: What, Why and How it can Financially Boost Scotland’s Economy

Blog by Euan Gardner, Senior Information Development Manager, Strategic Development Team, NHS NSS

1. Introduction

Guess who’s back, not-so-slim shady’s back, tell a friend, and this time I’m talking about synthetic data. This blog is about the concept of synthetic data. If you want a discussion on the methods (and safeguards) that The Data Lab and NHS NSS are using to generate high-quality, low level synthetic medical data, that’s coming in the next blog.

I need you to not fall asleep at this juncture, so keep reading, as there is a huge economic benefit (yes, actual money) to Scotland in terms of synthetic data and the Digital Economy (sections 4 & 5). Rather than start hitting you with definitions about the hows and whats of synthetic data, I want to set up the why.

Hands up if you think AI is going to cure every disease, kill us all, coalesce the vapors ethereal to disrupt the dynamic market sector of pencil sharpening using cognitive intelligent systems through agile start-up focussed data lakes with a feature focus, ninja growth hacking birds-eye vision and emphasis on organic data-fed stories. Also Blockchain.

Now I’ve wiped the vomit off myself after writing that, there’s just one slight problem with these views. There’s a complete lack of data to actually build anything resembling the systems needed to do any of the above – minus the pencil sharpener, really got that right first time.

Yes, on the news you see the horrors of data being used to disrupt (actual definition) elections or the phenomenal ability of accurate speech translation, but these systems require data to actually be built.

The reason the likes of Google and Facebook show up on the news with amazing breakthroughs is that they have literal server farms dedicated to holding petabytes of valuable user data. If they didn’t have this key advantage then their ML/AI divisions would just get really good at playing cards.

So what about using some of the petabytes of high quality health, education and government data Scotland has to, instead, do some good and really improve people’s lives?

2. The Why of Synthetic Data

Turns out if you just released the data and put the First Minister’s medical records up on social media, Nicola would be really angry and send the Information Commissioner squad round – understandably and rightfully so. So the question needs to be asked:

How can we safely get high quality, detailed, analytically useful data into the hands of people who need it for innovation to make Scotland the leader in data technologies and generate better outcomes, such as health, for everyone?

There have been various attempts at doing this, with the main focus historically having been on Open Data. This is a great start but there are two main issues with it. Firstly, there is a risk that information you don’t want released gets out there anyway. Secondly, to combat that risk, the data is distorted (perturbed) and usually aggregated to a high level.

The problem is that if you aggregate you lose much of the valuable information (especially things like geography) and may have problems interlinking datasets – lowering the analytical usefulness of the data. Higher level data also diminishes the real power in areas like medicine as it’s the individualised, low level data analytics that creates things like personalised medicine.

Keep aggregating and you discover amazing trends like the older someone is, the more likely they’ll die, and breathing is the number two cause of death, second only to dying. Now, before statisticians hit me, I’m exaggerating obviously, but the point remains: if we want accurate, amazing medical insights and AI systems (aka personalised medicine), we need safe, high quality, low level data.

3. The What of Synthetic Data

So how do we balance privacy with analytical need and innovation? Synthetic Data.

What is synthetic data? It’s data that has been created by some generating process (usually a model) and that has all the statistical trends, data types, column names, structures etc. of the real data – see section 4 for more on the quality of the synthesis. Crucially, however, the data and the people contained in the synthetic data are completely made up.

In practice, this means that people aged 45-50 have certain disease profiles, health outcomes, histories etc. that are different from people aged 25-30. You then get the model to learn what a typical person aged 45-50 ‘looks like’ in terms of the data, so that you can create similar fake people that resemble real people.

You’re saying it’s basically taking the real data, shaking it and throwing it back out? Nope, not in the slightest. The idea is that the model learns ‘latent representations’ which are the underlying patterns of variables that control the overall structure of the data.

You then have two choices: you can let the model figure out how these latent variables are distributed (auto create synthetic data) or you can manually control the distribution, i.e. you can make accurate custom datasets (more on this later). To make these ideas clearer, let’s construct an example.

3.1 Example: Making Synthetic Dugs (Dogs)

Let’s make a synthetic dug (dog). What are the latent features of a dog? Well they usually have 4 or fewer legs, say ‘woof’, are fluffy, and have wagging tails. Draw me a dog based on the latent variables described, or don’t. I’ll bet it’s different from my own drawing (below) because you don’t draw like you’re allergic to art.


Figure 1 – A dug (dog) I constructed from latent variables.

Figure 2 – A dug (dog) constructed from latent variables by someone who can draw.

The more trained a person is as an artist, or the greater the variety of dogs they have seen, the more realistic the pictures of dogs they can create from the latent variables listed above. As you can see, even among the two examples here (Figures 1 and 2) and your own, they are different but still fundamentally represent a dog.

If you showed someone an image (minus Figure 1) then they would state that it is a dog. Alternatively, an image classifier would identify your own image or Figure 2 as a dog, meaning you could increase the amount of data available to your classifier model with synthetic data.

The key point is that from just a few latent variables you can create lots of realistic (sometimes photorealistic) pictures of dogs, but crucially these dogs have never existed. They’re simply created from the average of a person’s skill and knowledge about dogs that has been learned over their life.

The more we train a person on these latent representations, the more accurate their drawings of dogs should be. Models that can generate synthetic data are the same as people in this sense. They learn latent variables from data so that they can then sample these and build up a fully synthetic dataset; the more data we feed the model, the wider the range of latent representations and data it can generate.

The aim of generative models is (explicitly) not to copy the original dataset but rather to learn aspects of it that they can then recompile back into data that fits the latent representation criteria – the same as we have done above.
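For the code-minded, the "learn, then sample" idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not the method The Data Lab and NHS NSS are using: the records are invented, `fit` and `sample` are hypothetical names, and the "model" is just a pair of frequency tables. Age band is observed here rather than truly latent, but it plays the same role: a small controlling variable from which the rest of a fake person is generated. Passing `band_weights` stands in for manually controlling the distribution.

```python
import random
from collections import Counter, defaultdict

# Toy "real" records of (age_band, condition). Entirely invented for illustration.
real_records = [
    ("25-30", "asthma"), ("25-30", "none"), ("25-30", "none"), ("25-30", "none"),
    ("45-50", "diabetes"), ("45-50", "hypertension"), ("45-50", "diabetes"), ("45-50", "none"),
]

def fit(records):
    """Learn counts for age_band and for condition within each age_band."""
    band_counts = Counter(band for band, _ in records)
    cond_counts = defaultdict(Counter)
    for band, cond in records:
        cond_counts[band][cond] += 1
    return band_counts, cond_counts

def sample(model, n, rng, band_weights=None):
    """Draw n made-up people; band_weights optionally overrides the learned age mix."""
    band_counts, cond_counts = model
    weights = band_weights or band_counts
    bands = list(weights)
    people = []
    for _ in range(n):
        # First pick the controlling variable, then generate the rest from it.
        band = rng.choices(bands, weights=[weights[b] for b in bands])[0]
        conds = list(cond_counts[band])
        cond = rng.choices(conds, weights=[cond_counts[band][c] for c in conds])[0]
        people.append((band, cond))
    return people

rng = random.Random(0)
model = fit(real_records)
synthetic = sample(model, 1000, rng)                                  # learned age mix
older_only = sample(model, 1000, rng, band_weights={"45-50": 1})      # custom age mix
```

A frequency table fitted to a handful of rows can still reproduce rare real combinations, so this shows the mechanics only, not the safeguards a real synthesis pipeline would need.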

4. The Quality of Synthetic Data

Much like the images above, synthetic data has varying levels of quality, which are linked to its use.

Basic synthetic data typically has the same variable names and some of the categories/values as the real data. The caveat is that the distributions are in no way similar to the real data. This is typically suitable for basic software testing or getting to grips with the data but offers no real analytical value. As with all synthetic data, the data isn’t real, so even though it’s analytically limited, it’s still safe.
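A minimal sketch of what "basic" looks like in practice, assuming a hypothetical schema (the column names and categories below are invented, not taken from any NHS dataset): every value is drawn uniformly at random from its allowed categories, so the structure is right and the distributions are deliberately meaningless.

```python
import random

# Hypothetical schema: column name -> allowed categories (invented for illustration).
schema = {
    "age_band": ["0-17", "18-44", "45-64", "65+"],
    "sex": ["F", "M"],
    "admission_type": ["elective", "emergency"],
}

def basic_synthetic_rows(schema, n, rng):
    """Right columns, right categories, no attempt to match real distributions."""
    return [{col: rng.choice(values) for col, values in schema.items()} for _ in range(n)]

rows = basic_synthetic_rows(schema, 5, random.Random(1))
```

Rows like these are enough to check that a script loads, filters and joins without crashing, which is about all they are good for.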

Standard quality data is that which is quite similar to the real data. Most of the main variables are captured fairly well. This can be useful in some settings for basic training in data analysis and there have been quite a few successful attempts at this. The main issue in this area is that the models often miss the important edge cases that really assist in building personalised systems. The interlinking between the variables may or may not be captured, so you may get accuracy per column without a true distribution across all of the variables.
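The gap between "accurate per column" and "accurate across columns" is easy to show with a sketch. The data below is invented and deliberately extreme, and shuffling each column on its own is only a stand-in for a model that nails every column separately while ignoring how they link together; it is not a synthesis method anyone should use.

```python
import random
from collections import Counter

# Invented records of (age_band, condition) with a perfect link between the two.
real = [("18-44", "none")] * 50 + [("65+", "dementia")] * 50

def shuffle_columns_independently(rows, rng):
    """Keep each column's values exactly, but break the link between columns."""
    ages = [a for a, _ in rows]
    conds = [c for _, c in rows]
    rng.shuffle(ages)
    rng.shuffle(conds)
    return list(zip(ages, conds))

synthetic = shuffle_columns_independently(real, random.Random(2))

real_joint = Counter(real)            # only two combinations ever occur
synthetic_joint = Counter(synthetic)  # combinations that never occurred now appear
```

Each column of `synthetic` has exactly the same counts as the real one, yet the combined table now contains young people with dementia, which the real data never did.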

4.1 Truly Valuable Synthetic Data

High quality synthetic data encompasses two main areas. The first area is the ability to create accurate time-series based synthetic data. A synthetic dataset of the pathways (journeys) that patients take through hospitals during their stay(s) would be an example of exceptionally valuable synthetic data.

The second is the ability to make either exceptionally accurate distributions or modify the distributions in such a way that you can create powerful custom datasets. An example would be the ability to create a synthetic dataset where a number of patients have a condition of interest. That way you could ask a hypothetical question like: “what would the co-occurrence of conditions in Scotland look like if we had half the population with Diabetes as a diagnosed condition?” and test scenario responses on the data.

The payoff of accuracy in high quality synthetic data is mainly twofold. Firstly, researchers can not only build their test scripts but also get approximate expected data values (later to be confirmed on the real data). Secondly, AI and machine learning algorithms can not only be pre-trained on synthetic data to help their development but also made more accurate by being adversarially trained with synthetic examples.
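The diabetes "what if" can be sketched as follows. Every number here is invented for illustration (the prevalence, the conditional rates, and the idea that two probabilities are all a model has learned), and `sample_population` and `rate` are hypothetical helpers. The point is only the mechanism: keep what was learned about how conditions relate, override one distribution, and resample.

```python
import random

# Hypothetical "learned" quantities. Invented numbers, not real Scottish statistics.
learned = {
    "p_diabetes": 0.05,
    "p_hypertension_given": {True: 0.6, False: 0.2},  # keyed by diabetes status
}

def sample_population(params, n, rng, p_diabetes=None):
    """Sample n fake people; p_diabetes overrides the learned prevalence if given."""
    p = params["p_diabetes"] if p_diabetes is None else p_diabetes
    people = []
    for _ in range(n):
        diabetes = rng.random() < p
        hypertension = rng.random() < params["p_hypertension_given"][diabetes]
        people.append({"diabetes": diabetes, "hypertension": hypertension})
    return people

def rate(people, key):
    """Share of people for whom the given condition flag is True."""
    return sum(p[key] for p in people) / len(people)

rng = random.Random(3)
baseline = sample_population(learned, 20000, rng)
scenario = sample_population(learned, 20000, rng, p_diabetes=0.5)  # the "what if"
```

Comparing `rate(baseline, "hypertension")` with `rate(scenario, "hypertension")` then shows how the co-occurring condition shifts under the scenario, at least within this toy's assumptions.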

For these reasons, NHS NSS and The Data Lab have partnered to develop high-quality, detailed (in data terms, low level) synthetic data on hospital admissions. It’s early days but there have been some amazing results with generating multiple columns of ICD-10 codes, with close to 10,000 possible codes per column. The next blog will cover more on this. Yeah, be excited.

4.2 Information Governance Wins Too

What’s the most difficult part of being in Information Governance (IG)? Saying no to a really, really good project because it’s simply too risky and we all need to keep privacy and rights safe. You’re not being unkind or difficult but you need to ensure people’s rights are kept. Synthetic data can make your life easier.

Instead of having to deny a request for information you can issue safe synthetic data for the individual/organisation/company requesting the data to develop their idea on. This means when they approach you again they can not only be exceptionally specific in the data that they need but can also evidence why they need the data.

As IG, you can then take their proposal and consider what to do next without the rush of time-critical requests. Additionally, you may be able to take the code/solution that was developed and test it on the real data that you hold and return high level accuracy/success results. This provides a solid buffer between the real data and the people requesting the data while also facilitating innovation.
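One way that buffer could look in code, as a sketch only: the data holder runs the requester's solution on real rows and hands back nothing but aggregates. The function name, the row layout (`features`, `label`) and the `min_rows` threshold are all assumptions made for illustration, and the "real" rows below are stand-ins generated on the spot.

```python
def evaluate_privately(predict, real_rows, min_rows=10):
    """Run the requester's predict function on real rows; return aggregates only."""
    if len(real_rows) < min_rows:
        # Assumed safeguard: refuse to report on very small groups.
        raise ValueError("too few rows to report an aggregate safely")
    correct = sum(predict(row["features"]) == row["label"] for row in real_rows)
    return {"n_rows": len(real_rows), "accuracy": correct / len(real_rows)}

# Stand-in rows playing the part of the real data held by the agency.
stand_in_rows = [{"features": {"age": a}, "label": a >= 65} for a in range(40, 90)]

# The requester's solution, developed on synthetic data.
result = evaluate_privately(lambda f: f["age"] >= 65, stand_in_rows)
```

The requester sees `result` and never the rows themselves; what counts as a safe aggregate is an Information Governance decision, not something this sketch settles.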

5. The True Economic Value of Synthetic Data

Imagine you were a researcher, start-up, or citizen with an amazing idea that could really benefit Scotland. It could be health related, it could be defence related, but you’re actually going to help people! You have a fever, and the only prescription is more data. Yeah, you ain’t getting the data.

The reality is that if we crack general data synthesis (we’re getting closer literally every day), we can safely give you representative data that you can fully develop your idea on. The benefit is that, as the people in the data don’t exist, it’s fully legally compliant (GDPR etc.) and you can experiment on it openly; the public are safe and you’re making the world better. Everybody wins.

If the synthetic data is of high enough quality (that’s the aim) then you can actually test your idea to see how it would realistically perform on the real data (or get expected statistical values). That way if you develop a system to detect diabetes but it’s terrible, you might find that it actually detects glaucoma far better than you expected – congrats, new idea. You can then confirm these results by submitting your idea to the agency that houses the data, who could work to confirm your findings in a secure, safe manner on the real data.

A key concern in the use of data is the ethics and representation of communities within analyses. If you are developing a system that aims to assist in predicting the length of stay in hospital for people but your data only has a tiny number of people from an underrepresented community, then it’s difficult to ensure your model accounts for true variation in people. Synthetic data generation systems would allow you to construct large, representative datasets that contain many examples of people from the underrepresented community. This means that your model is much better trained to deal with the wide variation in people it needs to – a fairer and more accurate model, what more do you want?
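As a rough sketch of the idea, assume a generator that has already learned something per group (here just a hypothetical mean length of stay, with invented numbers) and is asked for the same number of synthetic people from each group. The names `group_params` and `generate_balanced` are illustrative, and the exponential distribution is a convenience, not a claim about real lengths of stay.

```python
import random

# Hypothetical per-group parameters: mean length of stay in days (invented).
group_params = {
    "majority": {"mean_stay": 4.0},
    "underrepresented": {"mean_stay": 6.0},
}

def generate_balanced(group_params, per_group, rng):
    """Generate the same number of synthetic stays for every group."""
    rows = []
    for group, params in group_params.items():
        for _ in range(per_group):
            # expovariate takes a rate, so use 1 / mean to get the assumed mean stay.
            stay = rng.expovariate(1.0 / params["mean_stay"])
            rows.append({"group": group, "length_of_stay_days": round(stay, 1)})
    rng.shuffle(rows)
    return rows

balanced = generate_balanced(group_params, 500, random.Random(4))
```

The hard part this toy skips is learning a trustworthy picture of the underrepresented group from very few real examples in the first place; balancing the output only helps if that picture is good.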

If you truly want a thriving Digital Economy then safe access to accurate synthetic data is the necessary linchpin. If organisations, both public and private, regardless of size can share or access data then this can only serve to swell the economy. If Scotland wants to be the leader in Artificial Intelligence and Data Analytics, then we need to lead the world in giving our people access to data.

No other country in the world has cracked this yet; there are some synthetic datasets in Scotland and abroad but nothing close to what is needed to form the basis of a thriving Digital Economy. That being said, synthetic data would also allow researchers to safely collaborate across countries. It’s no secret that Canada is a world leader in AI with similar views to Scotland, and synthetic data would allow world class researchers/citizens/companies in Canada and Scotland to develop ideas much more efficiently.

Lastly, let’s tackle a scary idea. Imagine, right, if Scotland could give its citizens and researchers free synthetic data but charge companies a fair price for synthetic data. This money could then be reinvested into things like the NHS.

I know, I’ve had a nosebleed from the sheer controversy of government projects not giving away really valuable things for free to private companies. See here’s the thing, we can scale prices for access to synthetic data to allow SMEs fair access while also charging more for custom data sets for larger established companies.

Say a small medical start-up needs synthetic data to assess the impact of a drug they’re developing; we can give them a custom synthetic dataset. That way, the start-up wins by getting the data it needs, Scotland wins as it gets money to invest in services and most importantly, we the people win as our information is safe (as the people are synthetic in the data) and we’re getting healthier.

5.1 Closing Remarks

Mediocrity solves nothing and only serves to hinder progress; the real way to get value and solve problems is to aim for things that are actually useful to people. High-quality synthetic data is what satisfies true innovation: accurate, powerful and pre-tested AI algorithms/data insights for all. It allows citizens, organisations and companies to produce the things we were promised from AI and Data Science.

I don’t have all the answers but I do know one thing: Scotland really has the potential to lead the world in data and AI and be a powerhouse of a Digital Economy. Synthetic data is the foundation; let’s build the house together.

The Sequel
Not-so-slim-shady will return in: Synthetic Data Part 2 – NHS NSS Electric Boogaloo.

Disclaimer

Literally every word of this article is entirely my own view and doesn’t in any way, shape or form represent those of (but not limited to): my employer, The Data Lab, The Scottish Government. Additionally, any ideas, projects, or directions of ideas in the article in no way represent the course or any decisions made; they are simply my own views.

So yeah, if you hate any part of this, blame me and send me nice emails detailing why you think I’m worse than stepping on Lego: @SwearyStats (Twitter).

Image Credit: KolonjaArt, Pixabay
