Since OpenAI’s ChatGPT kicked off the generative AI boom in 2022, one thing has become clear: building accurate, reliable, and efficient AI models requires plenty of the right data. The problem is that the best data, especially “expert” data from specialized fields like health and finance, is in short supply. Even as AI companies mine the internet for fresh information, their models remain constantly hungry and need to be fed.
San Francisco-based startup Gretel AI has long believed that the most satisfying solution is to create fake food that tastes as good as the real thing. The company helps clients like EY, Google, and the U.S. Department of Justice generate synthetic data: artificially generated data that mimics the characteristics of real-world data. And it’s becoming increasingly easy to make. Today, for example, Gretel announced the general availability of a generative AI-powered system that can create synthetic tabular datasets, that is, text and numeric data arranged in columns and rows like an Excel spreadsheet, using only natural language prompts like those used with ChatGPT.
Let’s say a bank wants to create a synthetic dataset that resembles its own customer data but doesn’t contain any real personal names or information. Using Gretel’s Navigator product, the bank can instruct the system to create millions of fictitious names, identities, dates, amounts, and account balances based on, say, Gretel’s dataset or the bank’s own data. The resulting computer-generated data doesn’t contain any real customer information, so Gretel claims it doesn’t violate customer privacy and can generate enough data to train powerful, accurate models.
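The bank scenario above can be illustrated with a few lines of ordinary code. The sketch below is purely hypothetical and does not use Gretel’s API: it generates fictitious customer rows whose columns mimic the shape of real bank records (name, account-opening year, balance) while containing no real personal information. The name pools and value ranges are invented for illustration.

```python
import random

# Hypothetical name pools; a real system like Gretel's would learn
# realistic distributions from seed data rather than hard-coded lists.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Eve", "Frank"]
LAST_NAMES = ["Ng", "Smith", "Garcia", "Okafor", "Ivanov", "Chen"]

def synthetic_customers(n, seed=42):
    """Generate n fictitious customer rows: name, year opened, balance."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [
        {
            "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            "year_opened": rng.randint(1995, 2023),
            "balance": round(rng.uniform(0.0, 250_000.0), 2),
        }
        for _ in range(n)
    ]

rows = synthetic_customers(5)
```

Scale `n` up to millions and you have a training-ready table that resembles customer data without containing any: the privacy claim rests on the fact that no row traces back to a real person.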
Synthetic data will become popular in 2024 as data scarcity forces companies to look to other sources to build general models or fine-tune them for specific tasks, said Ali Golshan, co-founder and CEO of Gretel. Golshan, who previously co-founded two security-focused startups, noted that the company started in 2020 as a way to generate privacy-conscious data (the name Gretel comes from the classic story of Hansel and Gretel, who left breadcrumbs to find their way home). The company “wanted to stop people from leaving digital breadcrumbs,” while also giving developers a way to access useful data, especially in highly regulated industries.
“We hadn’t really thought about the situation of data scarcity before. That was our ChatGPT moment,” he said. But now data scarcity, along with data privacy and security, are why companies are turning to synthetic data as an option for training AI models.
Golshan emphasizes that generating synthetic data isn’t about spitting out a ton of low-quality, useless data (think Reddit posts). “People think synthetic data is interchangeable with fake data or junk data and that we need more of it,” he says. “That’s what creates this spiral of harmful illusions and hallucinations. There has to be a quality piece.” He adds that what will drive business over the next 20 years is taking the big AI investments built on “messy, public, privacy-invading data” and “pairing that with sensitive, proprietary, domain-specific data that’s unique and can drive models forward.”
He also pushed back against the idea that synthetic data is “not as good” as real data, and against warnings about the danger of AI models training on their own hallucinations and misinformation. Because the company primarily serves businesses, organizations, and governments, Gretel’s work typically starts with a seed of data the client already has, such as patient data, fraud data, or transaction data. “That acts as the boundary and the gate for how you build the rest of your data,” he said.
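The “boundary and gate” idea can be sketched in a few lines: fit simple per-column statistics to a small real (seed) dataset, then sample synthetic values only within the range the seed data establishes. This is a toy illustration of the concept under that assumption, not Gretel’s actual method, which is based on generative models rather than a plain statistical fit; the seed values below are invented.

```python
import random
import statistics

# Toy "seed" of real transaction amounts (illustrative values only).
seed_amounts = [12.5, 40.0, 7.99, 120.0, 55.25, 18.75, 99.99, 33.10]

def fit_bounds(values):
    """Learn simple statistics from the seed data: mean, stdev, range."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "lo": min(values),
        "hi": max(values),
    }

def sample_within(bounds, n, seed=0):
    """Sample n synthetic values from a normal fit, clamped to the seed
    data's observed range, so the seed acts as a gate on the output."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(bounds["mean"], bounds["stdev"])
        out.append(min(max(x, bounds["lo"]), bounds["hi"]))  # clamp to range
    return out

bounds = fit_bounds(seed_amounts)
synthetic = sample_within(bounds, 100)
```

The clamping step is the point: no matter how much synthetic data is generated, every value stays inside the envelope the real seed data defines, which is the rough intuition behind Golshan’s boundary claim.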
Gretel’s latest product enables companies to generate data on topics where information is scarce. The company’s technology focuses on very specific data to improve individual tasks within a client’s internal systems, rather than generating problematic data based on millions of pages scraped from the internet.
Gretel isn’t the only company trying to corner the market on synthetic data for training AI models: startups like SynthLabs, Synthetaic, and Clearbox AI are all vying to supply companies with the computer-generated data they need.
This got Golshan and his co-founders thinking about the future. Soon, Golshan says, companies will be able to make money by offering synthetic data generated from their own proprietary datasets for purchase by others. For example, organizations that hold lots of data but haven’t built AI models could sell other companies access to synthetic versions of that data for training their models.
To that end, Golshan said Gretel’s next big move is to build a synthetic data and model exchange: “We’re going to enable companies and customers to train models on their own data, have mathematical guarantees that the data is secure, and then allow someone to ‘subscribe’ to that model, generate data, and pay only for what they use,” he explained.
This will take Gretel to a new level of “providing a secure interface for private data, eliminating exploitative approaches to data mining and collection,” he added. It will also mean that companies like Anthropic and OpenAI, which have built huge AI models on huge amounts of data, will no longer need to enter into licensing agreements with every company whose data they want to use, he said.
In terms of funding, Gretel closed its Series B round in 2021, bringing its total raised to $68 million. Golshan said the startup has plenty of funding left and “still has about two years to go.” But in this “moment” for synthetic data, he said he sees an opportunity to build the next Databricks or Snowflake (two of the biggest data cloud platforms), or even the next OpenAI.
“We’re pretty aggressive with this because we have a big influence,” he said. “We envision building a secure, high-quality, next-generation data business, and this is a huge opportunity given the need.”