Large language models are trained on all kinds of data, most of which appears to have been collected without anyone’s knowledge or consent. Now you at least have a choice: whether to allow Google to use your web content as material to feed Bard AI and any future models it decides to make.
It’s as simple as disallowing the “Google-Extended” user agent in your site’s robots.txt, the document that tells automated web crawlers what content they may access.
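Per Google’s announcement, opting out is a two-line addition to the robots.txt file at your site’s root, using the standard user-agent/disallow syntax:

```
# Block Google's AI-training crawler token from the entire site.
# Googlebot (search indexing) is unaffected by this rule.
User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended is a control token rather than a separate crawler; regular Googlebot continues to index pages for search, but content it fetches will no longer be used for the AI products covered by the token.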
Although Google claims to develop its AI in an ethical, inclusive manner, AI training is a meaningfully different use case from web indexing.
“We’ve also heard from web publishers that they want greater choice and control over how their content is used in emerging generative AI use cases,” Danielle Romain, the company’s VP of Trust, wrote in a blog post, as if this came as a surprise.
Interestingly, the word “train” does not appear in the post, even though that is very clearly what this data is used for: as raw material for training machine learning models.
Instead, the VP of Trust asks whether you don’t actually want to “help improve Bard and Vertex AI generative APIs” and “help these AI models become more accurate and capable over time.”
See, it’s not about Google taking something from you. It’s about whether you’re willing to help.
On the one hand, this is probably the best way to pose the question, since consent is an important part of this equation and an affirmative choice to contribute is exactly what Google ought to be asking for. On the other hand, the fact that Bard and its other models were in fact trained on truly enormous amounts of data culled from users without their consent robs this framing of any authenticity.
The inescapable truth underscored by Google’s actions is that it exploited unfettered access to the web’s data, got what it needed, and is now asking for permission after the fact so it can appear that consent and ethical data collection are priorities. If they were, we would have had this option years ago.
Coincidentally, Medium just announced that it will block crawlers like these site-wide until a better, more granular solution comes along. And it is far from the only one.