Why Did Google Gemini “Leak” Chat Data? via @sejournal, @martinibuster

This is why Google Gemini chat pages seemingly leaked onto the Internet The post Why Did Google Gemini “Leak” Chat Data? appeared first on Search Engine Journal.

Feb 14, 2024 - 21:03

0 105

Why Did Google Gemini “Leak” Chat Data? via @sejournal, @martinibuster

It only took twenty four hours after Google’s Gemini was publicly released for someone to notice that chats were being publicly displayed in Google’s search results. Google quickly responded to what appeared to be a leak. The reason how this happened is quite surprising and not as sinister as it first appears.

@shemiadhikarath tweeted:

“A few hours after the launch of @Google Gemini, search engines like Bing have indexed public conversations from Gemini.”

They posted a screenshot of the site search of gemini.google.com/share/

But if you look at the screenshot, you’ll see that there’s a message that says, “We would like to show you a description here but the site won’t allow us.”

By early morning on Tuesday February 13th the Google Gemini chats began dropping off of Google search results, Google was only showing three search results. By the afternoon the number of leaked Gemini chats showing in the search results had dwindled to just one search result.

Screenshot of Google's search results for pages indexed from the Google Gemini chat subdomain

How Did Gemini Chat Pages Get Created?

Gemini offers a way to create a link to a publicly viewable version of a private chat.

Google does not automatically create webpages out of private chats. Users create the chat pages through a link at the bottom of each chat.

Screenshot Of How To Create a Shared Chat Page

Screenshot of how to create a public webpage of a private Google Gemini Chat

Why Did Gemini Chat Pages Get Indexed?

The obvious reason for why the chat pages were crawled and indexed is because Google forgot to put a robots.txt in the root of the Gemini subdomain, (gemini.google.com).

A robots.txt file is a document for controlling crawler activity on websites. A publisher can block specific crawlers by using commands standardized in the Robots.txt Protocol.

I checked the robots.txt at 4:19 AM on February 13th and saw that one was in place:

Google Gemini robots.txt file

I next checked the Internet Archive to see how long the robots.txt file has been in place and discovered that it was there since at least February 8th, the day that the Gemini Apps were announced.

Screenshot From Internet Archive

Screenshot of Google Gemini robots. txt from Internet Archive showing it was there on February 8, 2024.

That means that the obvious reason for why the chat pages were crawled is not the correct reason, it’s just the most obvious reason.

Although the Google Gemini subdomain had a robots.txt that blocked web crawlers from both Bing and Google, how did they end up crawling those pages and indexing them?

Two Ways Private Chat Pages Discovered And Indexed

There may be a public link somewhere. Less likely but maybe possible is that they were discovered through browsing history linked from cookies.

It’s likelier that there’s a public links.

I asked Bill Hartzer (@bhartzer) about it and he discovered a public link for one of the indexed pages:

Public link to a Google Gemini shared chat page

So now we know that it’s highly likely that a public link caused these Gemini Chat pages to be crawled and indexed.

Bill Hartzer offered this observation:

“Even though the Gemini URL is being blocked in the robots.txt file, there is a link to the Gemini URL in a blog comment, so that Gemini URL is getting indexed.

This just goes to show that Google will still index URLs that are blocked from crawling in the robots.txt file.

If Google really wanted to make sure that Gemini URL is not indexed, they would ALLOW crawling in the robots.txt file and add a noindex meta tag on the pages. Maybe Google should follow it’s own advice here?”

Why Did Chat Pages Begin Dropping Out Of Search Results?

But if there’s a public link then why did Google start dropping chat pages altogether? Did Google create an internal rule for the search crawler to exclude webpages from the /share/ folder from the search index, even if they’re publicly linked?

Insights Into How Bing and Google Search Index Content

Now here’s the really interesting part for all the search geeks interested in how Google and Bing index content.

The Microsoft Bing search index responded to the Gemini content differently from how Google search did. While Google was still showing three search results in the early morning of February 13th, Bing was only showing one result from the subdomain. There was a seemingly random quality to what was indexed and how much of it.

Why Did Gemini Chat Pages Leak?

Here are the known facts:

Google had a robots.txt in place since the February 8th. Both Google and Bing indexed pages from the gemini.google.com subdomain. Both Google and Bing may have discovered links to the chats and subsequently indexed them. The search engines indexed the content regardless of the robots.txt and then began dumping them.

That brings us back to the question of why these pages started dropping off of the search results of both Google and Bing. My guess is that the Google Gemini chat pages are low quality webpages that are not worth showing for what are essentially longtail searches (site:gemini.google.com/share/). There’s really no useful reason to surface these pages in the search results.

Content that is blocked by Robots.txt can still be discovered, crawled and end up in the search index and if the pages are useful they can also rank, unless they are not useful. I think this may be the case.