Data Is the New Gold in the Generative AI Era
The generative AI race has opened up another battleground, one which will dictate the next stage of AI development.
The next big challenge in generative AI development will be data: securing access to enough human input to replicate human responses.
Which could mean that social platforms are better placed to lead the charge, with the AI chatbots from Meta and xAI having direct access to more human data inputs than anyone else. Google, too, has access to Search queries and review inputs. But smaller players, without such access, could be left out in the cold as publishers look to lock down their content in order to control access and maximize profit.
The latest push on this front is a petition signed by thousands of well-known artists which calls for a ban on the unlicensed use of creative works for training generative AI. Publisher Penguin Random House is also taking a stand against the use of its authors’ work for AI training, while several news publications are also now organizing official licensing deals with individual AI developers for their output.
If official regulations are implemented as a result of this shift, ones which would rightfully ensure that copyright holders can profit from licensing their works, that will limit access to the massive data inputs needed to train AI models. Which will then leave smaller developers with either bad or worse choices: either scrape whatever data they can from the broader web (and more publishers are updating their robots.txt files to disallow unlicensed use of their data, as in the example below), or, worse, use AI-generated content to further train their AI models.
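For reference, that robots.txt blocking is done with standard crawler directives. The exact user-agent names vary by publisher, but a typical file that blocks the known AI training crawlers (for example, OpenAI’s GPTBot, Google’s Google-Extended token, and Common Crawl’s CCBot) looks something like this:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

Worth noting: these directives rely on crawlers voluntarily honoring them, and don’t technically prevent scraping, which is part of why publishers are also pursuing legal and licensing routes.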
Relying on AI-generated content is a pathway to an erosion of AI outputs, with the ongoing use of AI content to build large language models (LLMs) effectively poisoning the system, and compounding errors in the dataset, a dynamic often described as “model collapse”. That’s not sustainable, which means that data inputs from humans are going to be in high demand, which will likely put Meta, X, and Reddit in the driver’s seat.
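To illustrate why that degradation happens, the effect can be reproduced with a toy experiment: fit a simple model to some data, replace the data with the model’s own output, and repeat. The minimal Python sketch below is purely illustrative (a single Gaussian standing in for a vastly more complex LLM training loop, with made-up sample sizes), but it shows the same compounding effect, with the spread of the “training data” steadily collapsing over successive generations:

    import numpy as np

    rng = np.random.default_rng(0)

    # Generation zero: "human" data, drawn from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=100)

    for generation in range(1, 301):
        # Fit a toy model (a single Gaussian) to whatever data is currently available.
        mu, sigma = data.mean(), data.std()
        # The next generation is trained only on synthetic output from that model.
        data = rng.normal(loc=mu, scale=sigma, size=100)
        if generation % 100 == 0:
            print(f"generation {generation}: spread of training data = {sigma:.3f}")

Each cycle loses a little of the original distribution’s detail, and the errors compound, which is why fresh human input stays so valuable.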
Reddit CEO Steve Huffman highlighted this in an interview this week, noting that:
“The source of artificial intelligence is actual intelligence, and that’s what you find on Reddit.”
Reddit has already inked a data-sharing deal with Google to help power the search giant’s Gemini AI experiments, and that could prove to be a key collaboration for the future of Google’s tools.
The question then is which social platform has the most valuable data for AI model creation?
Meta has an assortment of content from billions of human users, though posting frequency has declined in recent years in favor of video consumption in its apps. Which is why Threads could be a valuable component, and why the Threads algorithm may favor posts that ask questions, as a means to help train Meta’s AI systems.
X, too, sees over 200 million original posts and replies uploaded to its platform every day, but the nature of those posts matters when it comes to training a system to understand human-like interaction and provide accurate responses.
Which is why Reddit, as Huffman notes, could be the best platform for AI training.
Subreddit communities are built around Q&A-style engagement, with users posing questions and serving up relevant answers, which are then upvoted or downvoted in the app. Building an AI tool around that understanding, alongside each developer’s own AI models, could provide the most accurate responses, and it’ll be interesting to see how that ends up fueling Google’s AI efforts, and what Google ends up paying for the ongoing privilege.
It also means, though, that others could end up falling away in the race.
OpenAI, for example, doesn’t have an ongoing feed of fresh data, other than LinkedIn content via its partnership with Microsoft. Will that eventually impede the development of ChatGPT, as more publishers lock down their content and remove it from AI training?
It’s a valid consideration for the future development of AI models, as without fresh data sources, such tools could quickly lose relevance. Which would see users shift to other models.
So who wins out in this case? Meta? xAI? Google?
Right now, it does seem like one of these three will eventually have the better model, and will lead the way with the next wave of gen AI tools.
Or, we’re going to start seeing big deals on exclusive data inputs, and more niche AI models built around different data sets.
That could end up being a more beneficial and logical progression, one which would change the landscape of generative AI development.