Nothing New Under the Sun
Current Large Language Models are, at bottom, pattern-learning, mix-and-match data repositories. They cobble responses together from their enormous training datasets; it only looks like they're generating new ideas when they answer a query. Everything is a recombination of data they already have. LLMs are very effective at tasks like text generation, translation, summarization, and even limited reasoning because they discern and learn patterns in human language. But they're completely limited by the data they have access to.
Newer LLMs are constantly being developed, made more relevant and effective by more sophisticated algorithms and novel data sources.
Still, the first LLMs were trained on everything publicly available on the Internet, which was mostly human-generated content.
Since then, transformer-based LLMs have been churning their own generated content back onto that same Internet. And that opens a whole Pandora's box of new problems when it comes to data quality.
AI: Inbred Trippin': Spreading the Disease
We can find a parallel for what happens to LLMs that feed on their own content in the tales of H.P. Lovecraft. In "The Shadow Over Innsmouth," a severely isolated town's population interbreeds with demonic ocean entities called the Deep Ones in exchange for wealth and prosperity. The hybrid offspring eventually degenerate into deformed and, er… fishy examples of inbreeding and madness.
Ayal Steinberg, General Manager of Technical Community and Client Engineering at IBM, breaks down how relevant this parallel is with two AI phenomena, MAD and Hapsburg AI.
In Model Autophagy Disorder (MAD), LLMs begin to deteriorate in performance due to over-reliance on self-generated, synthetic data. The model feeds on its own output more and more, leading to a feedback loop that degrades the model’s functionality.
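That feedback loop can be illustrated with a toy sketch. Here the "model" is nothing but a Gaussian fitted to its training data, not a real LLM, and each generation trains only on samples drawn from the previous generation's model. Under those assumptions, the diversity of the data (its standard deviation) collapses over successive generations:

```python
import random
import statistics

def train(data):
    """'Train' a toy model: fit a Gaussian (mean, stdev) to the data."""
    return statistics.mean(data), statistics.stdev(data)

def generate(model, n, rng):
    """Sample n synthetic data points from the fitted model."""
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]  # the original "human" data
model = train(data)
first_sigma = model[1]

# Each generation is trained exclusively on the previous generation's output.
for generation in range(1000):
    data = generate(model, 20, rng)
    model = train(data)

print(f"stdev of the data: {first_sigma:.3f} -> {model[1]:.3f}")
```

The small estimation error at each step compounds, so the fitted spread drifts toward zero: the model ends up confidently generating a narrow sliver of what the original data contained, which is the statistical skeleton of MAD.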
Researchers at Rice and Stanford University named the phenomenon after mad cow disease, an illness caused when cows eat feed made from the bones of other cows.
Hapsburg AI describes the same phenomenon: LLMs trained on synthetic data become generic, standardized and, well, useless. The Hapsburgs were European royalty who inbred for generations; they became subject to a host of genetic disorders, and their gene pool was horrifically homogenous.
We're already headed there. Most data generated in the last 18 months is AI-generated rather than human-generated, potentially accounting for 90% of online information.
As a result, AI hallucinations abound. Models continuously produce inaccurate or fabricated content, which more and more people depend on. Biased and inaccurate content prevails. Read more specific examples on Steinberg's Selling Data LinkedIn blog.
There are some possible fixes. Retrieval-Augmented Generation, or RAG, is an AI model architecture that grounds responses in externally retrieved data, but it's still not used across the board and adds architectural complexity. Nor is it a panacea. One of RAG's developers, Patrick Lewis of Cohere, describes RAG outputs as “low hallucination” rather than hallucination-free. Read the Wired breakdown of RAG here.
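The RAG pattern itself is simple to sketch: retrieve the documents most relevant to a query, then prepend them to the prompt so the model answers from fresh external facts rather than from its parametric memory alone. The corpus, the word-overlap scoring (a stand-in for real embedding similarity), and the prompt template below are all illustrative placeholders, not any particular RAG implementation:

```python
CORPUS = [
    "Mad cow disease is caused by feeding cows meal made from other cows.",
    "The Habsburg dynasty suffered genetic disorders from repeated inbreeding.",
    "Retrieval-Augmented Generation grounds model answers in external documents.",
]

def score(query, doc):
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, corpus, k=2):
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, corpus):
    """Assemble the augmented prompt that would be handed to the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What causes mad cow disease?", CORPUS)
print(prompt)
```

The model's answer is only as good as the retrieved context, which is why Lewis calls the outputs "low hallucination" rather than hallucination-free: retrieval can still surface irrelevant or stale documents.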
Although RAG models, transparency, standardized governance and data vetting AI agents are promising treatments for the AI inbred trippin' disease, they're certainly not the cure. That's why there's a good chance data marketplaces for LLMs are going to skyrocket in value within the next few years.
Data Marketplaces: Delivering the Goods
Data marketplaces like Innodata, Defined.ai, and Databricks already facilitate the buying, selling, and exchanging of datasets for training LLMs. They support a wide range of applications like machine learning, generative AI, fraud detection, content moderation, autonomous driving, geospatial analysis, and facial recognition, spanning almost every industry imaginable.
Data marketplace Revelate shares a survey by NewVantage Partners in which 91.6% of executives from big companies believed data selling and analytics would continue to increase over the years. That was in 2019. They've only been proven correct.
Reddit expects over $66 million in revenue this year from licensing its data to LLM developers, and it intends to license its staggering trove of data to the megalith OpenAI. Other companies like Shutterstock, Freepik, Tumblr, and WordPress have also made deals to license their content for LLM training.
And NVIDIA reported record first-quarter revenue of $22.6 billion, up 23% from the previous quarter and up 427% from a year ago. It recently unveiled the NVIDIA Blackwell platform and the Blackwell-powered DGX SuperPOD™ for generative AI supercomputing.
“The next industrial revolution has begun — companies and countries are partnering with NVIDIA to shift the trillion-dollar traditional data centers to accelerated computing and build a new type of data center — AI factories — to produce a new commodity: artificial intelligence.
“AI will bring significant productivity gains to nearly every industry and help companies be more cost- and energy-efficient, while expanding revenue opportunities.
“Our data center growth was fueled by strong and accelerating demand for generative AI training and inference on the Hopper platform. Beyond cloud service providers, generative AI has expanded to consumer internet companies, and enterprise, sovereign AI, automotive and healthcare customers, creating multiple multibillion-dollar vertical markets." - Jensen Huang, founder and CEO of NVIDIA.
But any organization collecting data has the opportunity to monetize it, especially as tools like this come to the forefront. Revelate's blog gives the hopeful data merchant some pointers, but it's best to seek out the data marketplace most relevant to what you offer.
Not Your TradFi's Data: Issues With Selling Data
There are several issues with selling data for LLMs, however. Traditional data sales depend on marginal data points that, by nature, go stale, like the corporate emails of sales leads, so the buyer purchases a subscription to keep them fresh. That's a sustainable business model for the data seller, and it simply doesn't work for LLMs. Once data has been used to train a model, it can't be taken back. That complicates licensing and pricing, especially because LLMs generate their own data from what they've absorbed, which might include learning to gather comparable data from free sources on their own. Their ability to create new datasets makes controlling downstream usage difficult.
Alex Izydorczyk gives an expanded discourse regarding these issues on Substack: LLM Data Sales: A Market for Lemons?
In Conclusion
If you look at who is selling their data and how much money is already changing hands, it's obvious that data marketplaces for training LLMs are going to grow exponentially. Business is seriously booming, but there are still issues to address, and human-generated data marketplaces aren't necessarily a fix for all of them. There must be moar fresh data, simply because LLM performance degrades over time without it, and we've all seen the consequences of that, from silly to financially damning. But stringent checks, legal requirements, fair business models and other data qualifications make selling your own data harder than it sounds.
Everything is so nascent. No one really knows what they're doing yet. And that's part of the wild ride AI is taking the world on.