What is synthetic data?
And how can you use it to fuel AI breakthroughs?
It’s hard to believe, but the rise of artificial intelligence has, in some ways, created data scarcity. Not a shortage, per se. We have an astonishing amount of data that’s growing exponentially (estimates show that 120 zettabytes were created in 2023). And that number could more than double by 2027!
No, our current data problem is suitability, not quantity. Synthetic data – a product of generative AI – may be the answer.
In this article, we'll discuss synthetic data’s vital place in our data-hungry AI initiatives, how businesses can use synthetic data to unlock growth and the ethical challenges yet to be solved.
What is synthetic data? And why do we need it?
Simply put, synthetic data is algorithmically generated data that mimics real-world data. It could be randomly generated – 100,000 birth dates. Easy.
Usually, though, synthetic data fills a gap in fit-for-purpose data: 100,000 birth dates of women who recently registered to vote. Tough.
Synthetic data’s real sweet spot, however, is found in the rare edge cases: a data set of male prostate cancer patients younger than 35 years old, or images of wear patterns in bronze piston rings, for example. See where this is going? That specificity – that rarity – makes the data harder to get and, in some cases, riskier to use.
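The "easy" case mentioned above, 100,000 randomly generated birth dates, can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the date range and the uniform distribution are assumptions made for the example, not something the article specifies. The harder cases (recently registered voters, rare patient groups) cannot be produced this way, because they need a model of real-world patterns rather than a random number generator.

```python
import random
from datetime import date, timedelta


def random_birth_dates(n, earliest=date(1940, 1, 1), latest=date(2005, 12, 31), seed=42):
    """Return n birth dates drawn uniformly between earliest and latest (inclusive)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    span_days = (latest - earliest).days
    return [earliest + timedelta(days=rng.randint(0, span_days)) for _ in range(n)]


birth_dates = random_birth_dates(100_000)
print(len(birth_dates), min(birth_dates), max(birth_dates))
```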
Accenture's Chief Data Scientist Fernando Lucini explains in a podcast conversation with SAS strategic advisor Kimberly Nevala that synthetic data can also help with data privacy. Private personal information (PPI) is closely guarded in health care, the public sector and even retail. When we can’t risk exposing PPI, we need replacement data to analyze.
“We ask (AI to create …) data with the same patterns but none of the characteristics of the original data. In simple terms (synthetic data) is machine-generated data that is a facsimile – not a copy, but a facsimile – of the signals and patterns within the original data,” Lucini explains.
Key data equivalents:
1 yottabyte (YB) = 1,000 zettabytes
1 zettabyte (ZB) = 1,000 exabytes
1 exabyte (EB) = 1,000 petabytes
1 petabyte (PB) = 1,000 terabytes
1 terabyte (TB) = 1,000 gigabytes
1 gigabyte (GB) = 1,000 megabytes
1 megabyte (MB) = 1,000 kilobytes
1 kilobyte (KB) = 1,000 bytes
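To put the 120-zettabyte estimate in perspective, the equivalents above can be turned into a small conversion helper. This is a sketch under the assumption that every step is a decimal factor of 1,000, as listed; the `UNITS` list and `convert` function are names invented for the example.

```python
# Decimal storage units, smallest to largest; each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def convert(value, from_unit, to_unit):
    """Convert value between two of the units listed in UNITS."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps
    return value / 1000 ** (-steps)


zb_2023 = 120  # estimate quoted in the article
print(convert(zb_2023, "ZB", "B"))   # the same amount in bytes
print(convert(zb_2023, "ZB", "YB"))  # the same amount in yottabytes
```

In other words, 120 zettabytes is 1.2 × 10^23 bytes, or a little over a tenth of a yottabyte.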
Benefits of synthetic data
Access to large, diverse and authentic data is crucial for training robust AI models. But getting that kind of real-world data can be tough given increasing privacy concerns, legal restrictions, and high data acquisition and annotation costs.
Synthetic data can be created with labels and annotations already baked in, which saves time and resources. And because the links to real individuals have been severed, it can be used without exposing sensitive information, so data privacy is built in.
What about anonymized data, you ask? According to Edwin van Unen, SAS Principal Customer Advisor, anonymization isn’t the answer either. It is inadequate, laborious and inconsistent.
“Its poor quality makes it almost impossible to use for advanced analytics tasks such as AI or machine learning modeling and dashboarding,” explains van Unen.
Synthetic data changes the game here. It mirrors the original statistical properties and correlations. The data sets are highly useful for testing and training precise predictive models with no need to mask sensitive information. This “synthetic twin” approach helps counteract bias and achieves near-perfect anonymity.
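One simple way to see whether a synthetic data set "mirrors the original statistical properties and correlations" is to compute the same summary statistics on both sets and compare them. The sketch below is illustrative only: the two columns (age and spend), the data-generating process, and the function names are hypothetical, and the "synthetic" rows are just a second random sample standing in for a generator's output. A real evaluation would use many more measures than these three.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(rows):
    """Mean of each column plus the correlation between the two columns."""
    ages = [r[0] for r in rows]
    spend = [r[1] for r in rows]
    return {
        "mean_age": sum(ages) / len(ages),
        "mean_spend": sum(spend) / len(spend),
        "corr_age_spend": pearson(ages, spend),
    }


def make_rows(n, rng):
    # Hypothetical process: spend rises with age, plus noise.
    rows = []
    for _ in range(n):
        age = rng.gauss(50, 10)
        rows.append((age, 0.5 * age + rng.gauss(0, 5)))
    return rows


real = make_rows(2000, random.Random(1))       # stand-in for the original data
synthetic = make_rows(2000, random.Random(2))  # stand-in for a generator's output

real_stats, synth_stats = summarize(real), summarize(synthetic)
gaps = {name: abs(real_stats[name] - synth_stats[name]) for name in real_stats}
print(gaps)  # small gaps suggest the two sets share these properties
```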
A look at four basic types of synthetic data and how they’re often used
- Synthetic structured data represents individuals, products and other entities and their activities or attributes – including customers and their purchasing habits, or patients and their symptoms, medications and diagnoses.
- Synthetic images are crucial for training object detection, image classification and segmentation. These images are useful for early cancer detection, drug discovery and clinical trials, or teaching self-driving cars. Synthetic images can be used for rare edge cases where little data is available, like horizontally oriented traffic signals.
- Synthetic text can be tailored to enable robust, versatile natural language processing (NLP) models for translation, sentiment analysis and text generation for applications such as fraud detection and stress testing.
- Synthetic time series data (including sensor data) can be used for radar systems, IoT sensor readings, and light detection and ranging (lidar). It can be valuable for predictive maintenance and autonomous vehicle systems, where more data can help ensure safety and reliability.
Creating synthetic data: When to use SMOTE vs. GAN
Generating data with business rules and business logic is not a new concept. AI adds a layer of accuracy to data generation by introducing algorithms that can use existing data to automatically model appropriate values and relationships.
Two popular AI techniques for generating synthetic data are:
- Synthetic minority oversampling technique (SMOTE).
- Generative adversarial network (GAN).
SMOTE is an intelligent interpolation technique. It works by using a sample of real data and generating data points between random points and their nearest neighbors. In this way, SMOTE allows you to focus on points of interest, such as underrepresented classes, and create similar points to balance the data set and improve overall accuracy in predictive models.
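The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration, not a reference implementation: the `smote` function, its parameters and the toy minority-class points are all assumptions made for the example. It picks a random minority sample, finds its `k` nearest minority neighbors, and places a new point somewhere on the line segment between the sample and one of those neighbors.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbors of base among the other minority points.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical underrepresented class with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1), (1.4, 2.2)]
new_points = smote(minority, n_new=10)
print(new_points)
```

Because every new point lies between two existing minority points, the method stays inside the region the real samples already cover. That makes it a good fit for rebalancing classes, and a poor fit for producing data beyond what was observed.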
A GAN, on the other hand, generates data by training a sophisticated deep learning model to represent the original data. A GAN comprises two neural networks: a generator that creates synthetic data and a discriminator that tries to detect it. This iterative adversarial relationship produces increasingly realistic synthetic data, until the discriminator can no longer easily tell the difference between synthetic and real data. The training process can be time-consuming and often requires graphics processing units (GPUs), but it can capture highly nonlinear, complex relationships among variables and thus produce very accurate synthetic data. It can also generate data at or beyond the boundaries of the original data, potentially representing novel data that would otherwise be neglected.
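The adversarial loop itself can be shown with a deliberately tiny toy, written in plain Python so the gradients are visible. Everything here is an assumption made for illustration: the "real" data is a one-dimensional Gaussian, the generator has a single shift parameter, and the discriminator is a single logistic unit rather than a deep network. A practical GAN would use multilayer networks in a deep learning framework, and this sketch makes no claim about how such models behave.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_mean=3.0, steps=2000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    shift = 0.0      # generator parameter: G(z) = z + shift
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + shift for _ in range(batch)]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0  # d(loss)/d(logit) for a real sample
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)        # d(loss)/d(logit) for a fake sample
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimize -log D(fake), the non-saturating loss.
        # d(loss)/d(shift) = (D(x) - 1) * w, because d(logit)/d(shift) = w.
        grad_shift = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / batch
        shift -= lr * grad_shift
    return shift, (w, c)


shift, (w, c) = train_toy_gan()
synthetic = [random.gauss(0.0, 1.0) + shift for _ in range(1000)]
print(f"learned shift: {shift:.2f}")
```

The two updates pull in opposite directions: the discriminator learns which side of the number line the real data sits on, and the generator moves its output toward whatever the discriminator currently scores as real. Because this discriminator is linear, it can only see where the data is centered, not how it is spread, which is one reason real GANs need far more expressive networks.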
A test: Synthetic data versus anonymized data
SAS and a partner tested synthetic data’s viability as an alternative to anonymized data using a real-world telecom customer’s churn data set (read the blog post, Using AI-generated synthetic data for easy and fast access to high-quality data). Van Unen explained that the team assessed the outcome on data quality, legal validity and usability.
What they learned:
- Synthetic data retained the original statistical properties and business logic, including “deep hidden statistical patterns.” Comparatively, anonymization destroyed underlying correlations.
- Synthetic data models predicted churn similarly to those trained on original data. Meanwhile, anonymized data models performed poorly.
- Synthetic data can be used to train models and understand key data characteristics, protecting privacy by reducing or eliminating the need to access the original data.
- Synthetic data generation processes are reproducible. Anonymization is variable, inconsistent and more manual.
“This case study reinforces the idea that AI-generated synthetic data provides fast, easy access to high-quality data for analytics and model development,” affirms van Unen. “Its privacy-by-design approach makes analysis, testing and development more agile.”
Ethical considerations of synthetic data
As synthetic data use becomes more widespread, synthetic data vaults will also become more prevalent. These shared repositories will foster collaboration, data democratization and cross-pollination of ideas. But they could inadvertently underwrite bias, hide data privacy infractions and perpetuate unfair data practices.
Contrary to popular belief, Lucini argues, synthetic data is neither automatically private nor privacy-preserving. If not implemented with the right controls and testing, synthetic data generation can still lead to privacy leaks.
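One basic test for this kind of leak is to check whether any synthetic record is an exact or near copy of a real one. The sketch below is an assumption-laden illustration rather than a complete privacy audit: the function names, the toy records and the distance threshold are all hypothetical, and passing this check does not by itself show that a data set is private.

```python
import math


def nearest_real_distance(row, real_rows):
    """Euclidean distance from one synthetic row to its closest real row."""
    return min(math.dist(row, r) for r in real_rows)


def flag_near_copies(real_rows, synthetic_rows, threshold):
    """Return synthetic rows that sit within threshold of some real row."""
    return [s for s in synthetic_rows
            if nearest_real_distance(s, real_rows) <= threshold]


# Hypothetical toy records: (age, annual income in thousands).
real_rows = [(30, 50.0), (45, 82.0), (27, 39.0)]
synthetic_rows = [(30, 50.0), (41, 70.0), (33, 44.5)]

leaks = flag_near_copies(real_rows, synthetic_rows, threshold=1.0)
print(leaks)  # [(30, 50.0)]
```

Here the first synthetic record reproduces a real person's values exactly, so it is flagged. In practice the columns would need to be put on comparable scales before measuring distance, and the threshold would have to be chosen with care.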
"Generative models can be a ‘black box.’ To ensure responsible use, they require rigorous validation, which the industry has not yet fully developed. We have to approach synthetic data with great care to avoid unintended consequences," says Natalya Spicer, a Synthetic Data Product Manager at SAS.
The right to privacy is black and white – we can regulate it, put rules around it, and everyone can be bound by those rules. Fairness and bias are not as simple to regulate. If those subjective decisions are left to individuals, the effects could be long-lasting. So we need enterprise-level governance until there are more comprehensive government regulations.
“We built SAS® Viya® to serve as an enterprise platform for the compliant use of data and analytics, which is crucial with the acceleration of AI and synthetic data,” says Spicer. “SAS Viya has full traceability regarding how models are created, all the way back to raw data and the models used to analyze its accuracy.”
The future of synthetic data and AI
As artificial intelligence and data science advance, synthetic data will become increasingly important. The synergy between synthetic data and emerging techniques will enable the creation of even more sophisticated and realistic synthetic data sets, further pushing the boundaries of what is possible.
Governance will play an important role as the use of synthetic data evolves. Organizations must implement robust governance frameworks, data auditing practices, and clear communication around the limitations and appropriate use cases for synthetic data. Policies for labeling and identifying the use of synthetic data will also become crucial to avoid misuse and misunderstanding. By embracing the power of synthetic data, data scientists can unlock new frontiers of innovation, develop more robust and reliable AI models, and drive transformation that positively impacts our world.