Question Posts May Become a Key Focus for AI Training Data
LLMs need question and answer text, and social platforms are looking to incentivize it.
As generative AI becomes a bigger focus, the next big push will be on the data side: ensuring that AI projects have the best datasets, so that they can provide better, more human-like answers to the questions posed to these systems.
Because if the data inputs are no good, or are not broad enough, then the outputs produced will ultimately prove underwhelming. That’s why Google has cut a deal with Reddit to use its data, why X has upped the price of its API access, and why OpenAI has struck agreements with several major publishers, including Condé Nast just this week.
Better quality data means better generative AI responses, and it’s interesting to see how platforms are now moving to improve their data ingestion processes, in order to enhance their own resources and tools.
For example, Meta recently launched a new web crawler to drag back more data from the open web for its Llama models.
As reported by Fortune:
“[Meta’s] crawler, named the ‘Meta External Agent’, was launched last month according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or ‘scrapes,’ all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.”
Google, of course, also scrapes the web for its Search results, and has something of an advantage in this regard because a) it’s already been collecting this data for some time, and b) publishers can’t realistically block it, because blocking Google’s crawler bot also means dropping out of its Search results, which would hurt their business.
But many publishers are now actively blocking LLM crawlers, in order to stop AI companies from stealing their data, with OpenAI being a particular focus for those looking to maintain control of their info.
Meta’s new crawler, however, is apparently not yet seeing mass blocking, which could give Meta another way to gather more inputs to train its advancing large language models.
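For context, this kind of blocking is typically done through a site’s robots.txt file, which tells crawlers which pages they may fetch. As a rough sketch only, the Python below parses a hypothetical robots.txt that shuts out two AI crawlers while leaving everything else, including Google’s Search crawler, open. The rules, the example.com URL and the `crawler_access` helper are illustrative assumptions, not any publisher’s actual setup, and the user-agent tokens (`GPTBot` for OpenAI, `meta-externalagent` for Meta) should be checked against each company’s current documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher (an assumption for illustration):
# block two AI training crawlers, allow every other bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = crawler_access(ROBOTS_TXT, ["Googlebot", "GPTBot", "meta-externalagent"])
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is a request rather than an enforcement mechanism, so it only works against crawlers that choose to honor it.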
Meta, for its part, claims that it already has a heap of info, in the form of public Facebook and IG posts. At 3 billion active users, Meta does have a broad corpus of content to pull from, but then again, the way people post on Facebook doesn’t really align with the AI chatbot use case, where users ask questions, much as they would in Google Search.
And Google, really, only has half of the data in this respect: It has the questions, but it sources the answers from third-party websites. Hence the Reddit deal, with the text from Reddit’s expert forums, which often include more question and answer type interactions, proving highly valuable for LLM training.
X, too, claims that it has more of these types of interactions, though the main selling point of its Grok chatbot is real-time updates, providing up-to-the-minute inputs direct from X posts. The accuracy of those inputs may be more questionable, but from these examples, you can see how AI developers are looking to source the best inputs, relevant to the Q and A use case, to boost their AI tools.
And that could guide social platform algorithms and policy.
X, for example, now has its Creator Ad Revenue Share program, which rewards users for ads displayed within the replies to their X posts. That incentivizes users to pose engaging questions, questions that people want to respond to. These may also be the kinds of questions that people look to pose to Grok, and by driving creators to prompt such responses, X could be aligning users around providing the data that it needs for its own LLM.
Meta’s also looking to drive the same on Threads, with its “Threads Bonus Program” offering incentives for creators based on post view counts.
You drive more views of your Threads by maximizing engagement, and you can drive more engagement by posing questions.
As such, social platforms have multiple drivers to push users in this direction, which they could further incentivize by amplifying questions in user feeds.
Because again, the best inputs for more human-like AI responses are actual human answers to questions, and the more that Meta and X can prompt such responses in their apps, the more insight they have to train and improve their AI systems.
That could see more question-bait being posted in social apps, and drive more reach for related queries.
So if you’re looking to boost your social media engagement, it may be worth checking out tools like Answer the Public, which provides an overview of common searches based around your chosen keyword.
Not every question will resonate with your audience, but the ones that do may well get big amplification.