The Endless Thirst For More Data
As the world has become more digitized, it has produced and demanded ever-increasing amounts of data. This poses a problem, as that data is often tied to real people and real companies with serious privacy concerns.
This has become an even bigger issue with the emergence of AI, which can not only run statistical analyses on batches of data but also comb through a dataset in depth at every level, from a single individual to billions of numerical entries.
Data is now so essential to the modern economy that demand for real, high-quality data has grown exponentially. At the same time, stricter data privacy rules and ever-larger AI models have made gathering and labeling real data increasingly difficult or impractical. – IBM Research
This is why synthetic data was invented as a solution. It replicates real-world data without containing any private information that could cause issues. It can also be modified and adapted to specific use cases, rare situations, or anything else the statistician or tester using it might need.
Here too, AI has been transformative. On one hand, AI is very useful for generating better synthetic data, going beyond the purely statistical methods used until now. On the other hand, synthetic data is equally useful for training AI models, from simulated 3D models of proteins for drug discovery to virtual streets for self-driving cars.
Synthetic Data Explained
Synthetic data refers to datasets that are artificially generated but retain the underlying statistical properties of the original data on which they are based.
Synthetic data acts as a complement to real-world data and provides a few key advantages that allow researchers and analysts to expand on initial results collected from surveys, experiments, and measurements:
- Training AI models with synthetic data allows us to increase the overall volume of data when high-quality real data is in short supply.
- In sectors like finance and healthcare, data is in limited supply, time-consuming to obtain, or difficult to access.
The research firm Gartner estimates that by 2030, synthetic data will overtake real data in training AI models. Gartner also predicts that by 2026, 75% of businesses will employ generative AI to create synthetic customer data.
Types Of Synthetic Data
Partially synthetic data takes a real-world dataset and replaces portions of it with artificial values. This is usually done for privacy reasons and is common in clinical research, where the real identities of patients and their medical records are anonymized.
Fully synthetic data is an entirely generated dataset that estimates the characteristics of real data and tries to emulate them as faithfully as possible: attributes, patterns, and relationships. This can be done, for example, to supply data missing from a real dataset, such as financial records lacking examples of fraudulent activity, which are needed to train an AI for fraud detection.
Hybrid synthetic data combines real data with fully synthetic data.
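To make the distinction concrete, here is a minimal Python sketch of partial synthesis; the column names, records, and perturbation rule are illustrative assumptions, not a production anonymization method:

```python
# Minimal sketch of partially synthetic data: real records are kept,
# but identifying fields are replaced with artificial values.
# All column names and values here are hypothetical examples.
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

real = pd.DataFrame({
    "name":      ["Alice Gray", "Bob Stone", "Carol Reyes"],
    "age":       [34, 51, 29],
    "diagnosis": ["A", "B", "A"],
})

partial = real.copy()
# Replace the direct identifier with a synthetic placeholder...
partial["name"] = [f"patient_{i:04d}" for i in range(len(partial))]
# ...and perturb quasi-identifiers slightly so they no longer match a real person.
partial["age"] = partial["age"] + rng.integers(-2, 3, size=len(partial))

print(partial)  # statistical content (diagnosis, approximate age) is preserved
```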
How To Generate Synthetic Data
Statistical methods are by far the oldest way to generate synthetic data, dating back to the 1930s with the synthesis of audio and voice, which led to software synthesizers from the 1970s onward.
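As a minimal illustration of this classic approach, the sketch below fits a simple distribution to stand-in "real" measurements and samples new synthetic points from it (the numbers are invented for the example):

```python
# Minimal sketch of the statistical approach: fit a simple distribution
# to real measurements, then sample new synthetic points from the fit.
import numpy as np

rng = np.random.default_rng(42)
real = rng.normal(loc=170.0, scale=8.0, size=1_000)   # stand-in for real measurements

mu, sigma = real.mean(), real.std()                   # estimate the distribution
synthetic = rng.normal(loc=mu, scale=sigma, size=10_000)

print(f"real mean/std:      {real.mean():.2f} / {real.std():.2f}")
print(f"synthetic mean/std: {synthetic.mean():.2f} / {synthetic.std():.2f}")
```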
Variational autoencoders (VAEs) are neural networks that produce variations on the data they are trained on. These systems are often used to generate synthetic images, as well as other data types used in machine learning.
Source: IBM
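For the curious, here is a heavily simplified, untrained VAE sketch in PyTorch. The layer sizes and dimensions are arbitrary assumptions, and a real VAE would be trained with a reconstruction loss plus a KL-divergence term before being used to generate data:

```python
# Minimal VAE sketch (untrained, for illustration only): an encoder maps
# data to a latent distribution, a decoder maps latent samples back to
# data space; new synthetic samples come from decoding random latents.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, data_dim=8, latent_dim=2):
        super().__init__()
        self.encoder = nn.Linear(data_dim, 16)
        self.to_mu = nn.Linear(16, latent_dim)
        self.to_logvar = nn.Linear(16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim)
        )

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

vae = TinyVAE()
# After training, synthetic data is generated by decoding random latent vectors:
with torch.no_grad():
    synthetic = vae.decoder(torch.randn(5, 2))
print(synthetic.shape)  # 5 synthetic 8-dimensional records
```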
A related approach is the generative adversarial network (GAN), a major technique in generative artificial intelligence. A GAN is made of two neural networks:
- One generates data that tries to look like the real data set.
- Another one compares the generated data to a real data set.
The second network gives feedback to the first until the first can generate a synthetic dataset as close as possible to the real one.

Source: Wikipedia
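A toy version of this feedback loop can be written in a few lines of PyTorch. Here the "real data set" is just a 1-D Gaussian, and the network sizes, learning rates, and step count are arbitrary assumptions:

```python
# Minimal GAN sketch: a generator learns to mimic 1-D "real" data
# (a Gaussian) while a discriminator learns to tell real from fake.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))   # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2_000):
    real = torch.randn(64, 1) * 2 + 5          # stand-in "real" dataset
    fake = G(torch.randn(64, 4))

    # 1. Discriminator: score real samples as 1, generated samples as 0.
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2. Generator: try to fool the discriminator into scoring fakes as real.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

with torch.no_grad():
    sample = G(torch.randn(1_000, 4))
# The synthetic distribution should drift toward mean ~5 and std ~2.
print(f"synthetic mean/std: {sample.mean():.2f} / {sample.std():.2f}")
```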
Transformer models use the attention-based architecture behind many modern AIs, including ChatGPT (where the “T” stands for “transformer”). They “guess” the most statistically probable output sequence by focusing on the most important tokens in the input sequence.
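The "focusing" step is the attention mechanism. Below is a minimal NumPy sketch of scaled dot-product attention with random toy weights, showing how each token's output is a weighted mix of all tokens:

```python
# Sketch of the attention mechanism at the heart of transformers: each
# output is a weighted mix of the inputs, with weights ("attention")
# concentrated on the most relevant tokens. Toy sizes, random weights.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8                       # 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, d))       # stand-in token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)                     # similarity of each token pair
scores -= scores.max(-1, keepdims=True)           # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
output = weights @ V                              # attend to the important tokens

print(weights.round(2))  # each row sums to 1: how much each token "focuses" on the others
```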
Lastly, agent-based modeling goes one step further and creates “agents”: mini-AIs that simulate interactions and behaviors to produce synthetic data. For example, individual agents can represent individual people in an epidemiology study, each with its own contact rate, infection risk, and so on.
(We explored the future role of AI agents in the workplace and daily life in “AI’s Killer App: How AI Agents Could Change Everything”)
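A minimal sketch of this idea, with every parameter invented for illustration: a population of agents with individual contact rates and infection risks, whose simulated infection events become a synthetic dataset.

```python
# Minimal agent-based sketch for an epidemiology-style simulation: each
# agent has its own contact rate and infection risk, and the simulation's
# event log becomes a synthetic dataset. All parameters are illustrative.
import random

random.seed(1)

class Agent:
    def __init__(self, i):
        self.id = i
        self.contacts_per_day = random.randint(1, 10)    # individual behavior
        self.infection_risk = random.uniform(0.01, 0.2)  # individual susceptibility
        self.infected = False

agents = [Agent(i) for i in range(200)]
agents[0].infected = True
log = []  # synthetic dataset of infection events

for day in range(30):
    for a in [a for a in agents if a.infected]:
        for _ in range(a.contacts_per_day):
            other = random.choice(agents)
            if not other.infected and random.random() < other.infection_risk:
                other.infected = True
                log.append({"day": day, "source": a.id, "case": other.id})

print(f"{len(log)} synthetic infection events over 30 days")
```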
Synthetic Data Advantages
Control & Customization
Because the data is created from scratch, it is much easier to produce the right dataset for a given task, for example, training an AI system.
It can also be created to the exact specifications and needs of a business or researcher.
Efficiency
Generating data removes the need for the expensive and time-consuming collection of real data, at least as long as the generated synthetic data is close enough to data from the real world.
Synthetic data also comes prelabeled, which removes the tedious manual step of having a human label every data point, describing each image, sentence, or audio file so that an automated system can understand it.
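A small sketch of why generated data comes prelabeled: since we produce each record ourselves, the label is known by construction (the feature distributions below are invented for the example):

```python
# Sketch of "prelabeled" synthetic data: because we generate each record,
# we get its label for free. Feature ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)

def make_transactions(n, fraudulent):
    # Assume fraudulent transactions come from a shifted amount distribution.
    amounts = rng.lognormal(mean=6.0 if fraudulent else 3.5, sigma=1.0, size=n)
    labels = np.full(n, int(fraudulent))   # the label comes with the data
    return np.column_stack([amounts, labels])

data = np.vstack([make_transactions(950, False), make_transactions(50, True)])
rng.shuffle(data)
print(data[:5])  # [amount, label] pairs, no human labeling step required
```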
Privacy
Fully synthetic data has no privacy issues at all, as it is not tied to any real-life individuals or businesses. Other forms of synthetic data are a good way to anonymize and “clean” real data of any protected information, be it personal data or copyrighted and otherwise protected intellectual property.

Source: Mostly AI
More Diverse Data
Real-world datasets that are too small can miss edge cases or underrepresented groups. This is a problem when training AIs, as the resulting model may completely ignore the existence of these cases.
By expanding the initial dataset and artificially adding the missing cases the designer knows should exist, the resulting hybrid synthetic data can be more accurate and representative of real situations.
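As a simplified illustration, the sketch below grows a rare class by jittering its few real examples, a SMOTE-like idea in spirit; the noise scale and class sizes are arbitrary assumptions:

```python
# Sketch of rebalancing a dataset by synthesizing extra examples of a
# rare class, here by adding small noise to real minority samples.
import numpy as np

rng = np.random.default_rng(3)
common = rng.normal(0.0, 1.0, size=(990, 2))   # majority class
rare = rng.normal(4.0, 0.5, size=(10, 2))      # underrepresented edge cases

# Create synthetic minority points near the real ones.
extra = rare[rng.integers(0, len(rare), size=200)] + rng.normal(0, 0.1, size=(200, 2))
balanced_rare = np.vstack([rare, extra])

print(f"rare class grew from {len(rare)} to {len(balanced_rare)} examples")
```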
Synthetic Data Limits
Data Loss
Even if, ideally, synthetic data is virtually identical to real data, some information can be lost in the process. This is especially true with strong anonymization, so a balance between privacy and fidelity must sometimes be struck.
Bias
Because synthetic data tries hard to replicate real-world datasets, it is also likely to replicate any errors, biases, or problems found in them. It is therefore often important to mix multiple real-life datasets from different regions, demographic groups, time frames, etc., when creating synthetic data.
“The fidelity of synthetic data is calculated by comparing it to real-world data through statistical and analytical tests. This includes an assessment of how well the synthetic data preserves key statistical properties, such as means, variances, and correlations between variables.”
Raul Salles de Padua – Director of Engineering, AI and Quantum at Multiverse Computing
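A basic version of such a fidelity check might look like the following sketch, which compares means, variances, and correlations between a stand-in real table and a synthetic one (both generated here for the example):

```python
# Sketch of a basic fidelity check as described above: compare means,
# variances, and correlations between real and synthetic tables.
import numpy as np

rng = np.random.default_rng(11)
real = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5_000)
synth = rng.multivariate_normal([0.02, -0.01], [[1.05, 0.55], [0.55, 0.98]], size=5_000)

for name, d in [("real", real), ("synthetic", synth)]:
    corr = np.corrcoef(d, rowvar=False)[0, 1]
    print(f"{name:9s} means={d.mean(0).round(2)} vars={d.var(0).round(2)} corr={corr:.2f}")
```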
Model Collapse
AI training can fail when a model starts to train on too much of its own output. Training on AI-generated data degrades quality, and that degraded output becomes the input of the next training cycle, leading to the “degeneration” of the AI model and, ultimately, its collapse.
For this reason, mixing real data with synthetic data is generally recommended.
“Training on samples from another generative model can induce a distribution shift, which—over time—causes model collapse. This in turn causes the model to misperceive the underlying learning task.
To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time.”
AI models collapse when trained on recursively generated data – Nature.
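A toy demonstration of the mechanism: repeatedly fit a simple model to samples drawn from the previous generation's fit, with no fresh real data. The fitted distribution drifts away from the original as sampling error compounds (all numbers are illustrative):

```python
# Toy demonstration of collapse: each "generation" fits a Gaussian to
# samples from the previous generation's fit. With no access to the
# original data, estimation error compounds and the fit drifts.
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=200)   # generation 0: "real" data

for gen in range(1, 6):
    mu, sigma = data.mean(), data.std()       # fit the current generation
    data = rng.normal(mu, sigma, size=200)    # next model sees only synthetic output
    print(f"gen {gen}: mean={mu:+.3f} std={sigma:.3f}")
```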
Synthetic Data Use Cases
Self-Driving
As real-life data on city streets can be difficult to collect in sufficient quantity, most self-driving AI companies use synthetic data to some extent. These simulated streets, complete with lifelike bicycles, cars, pedestrians, and random moving objects, help train the self-driving AI with many more hours of virtual experience, decreasing the overall cost of training.
Finance
From predictive models for investment and risk (trading, banking, insurance) to fraud detection, finance companies use synthetic data to better detect risk, fraud, and money laundering.
Here, the use case is not only to detect these risks properly but also for the companies’ management teams to demonstrate to regulators and stakeholders that every effort is being made to detect and avoid these issues, potentially preventing billions in losses or fines.
Healthcare
By increasing the total “experience” of an AI in training, synthetic data can help train models later used in epidemiology, medical image and lab result analysis, or clinical trials.
Such AIs can later be tested retroactively on known cohorts and population studies, validating the accuracy of their predictions.
Synthetic Data Provider – Tonic.ai
Most companies using synthetic data tend to rely on external providers who specialize in this field.
One example of this is Tonic.ai, which can integrate with virtually every database, allowing for data mining, development, and testing using the client’s own real data.

Source: Tonic.ai
Among the services offered by the company are the following:

Source: Tonic.ai
Tonic.ai’s tools are used by many large corporations, including eBay, American Express (see below), Volvo, Cigna, and Walgreens.
American Express Company (AXP)
One of the world’s leading credit card providers, American Express has been at the forefront of using synthetic data for business purposes, employing deep learning on Nvidia hardware since before 2020.
AI Uses for Customers
Notably, it was reported to use “AI-generated fake fraud patterns to sharpen its models’ ability to detect rare or uncommon swindles”.
“These techniques have a substantial impact on the customer experience, allowing American Express to improve the speed of detection and prevent losses by automating the decision-making process.”
Dmitry Efimov – vice president of machine learning research at American Express
It also uses AI and synthetic data to streamline credit risk assessment by incorporating factors such as social behavior and real-time market conditions.
AI, especially generative AI, is also used to improve customer service and reduce how often the company’s chatbot proves unable to answer customers’ requests.
Meanwhile, AI algorithms analyze customers’ spending behaviors, preferences, and transaction histories to suggest tailored offers and rewards.
Internal AI Uses
Internally, AI has allowed American Express to reduce IT ticket escalations through a reactive problem-solving system, and the company’s 9,000 engineers now use GitHub Copilot for coding assistance.
It also helps the 5,000 travel counselors advising the firm’s most elite Centurion (black) card and Platinum card members.
“Travel counselors get stretched across a lot of different areas. For instance, one customer may be asking about must-visit sites in Barcelona, while the next is enquiring about Buenos Aires’ five-star restaurants. It’s trying to keep all that in somebody’s head, right?”
Hilary Packer, Amex EVP and CTO
American Express Overview
Besides AI and synthetic data, American Express is a solid financial company, expecting revenue growth of 8-10% in 2025, in line with its long-term goal for revenue growth, and earnings-per-share growth of 12-16%.
The company is also quickly expanding internationally after a long period of focusing mostly on the US market, with 15% year-over-year growth in international card services billed business.