Google Reveals Two New Web Crawlers via @sejournal, @martinibuster

Google announces details of two new crawlers that are optimized for scraping images and videos for research and development purposes The post Google Reveals Two New Web Crawlers appeared first on Search Engine Journal.

May 18, 2024 - 21:04

0 92

Google Reveals Two New Web Crawlers via @sejournal, @martinibuster

Google announces two new crawlers that are for scraping images and videos for research and development purposes

Google revealed details of two new crawlers that are optimized for scraping image and video content for “research and development” purposes. Although the documentation doesn’t explicitly say so, it’s presumed that there is no impact in ranking should publishers decide to block the new crawlers.

It should be noted that the data scraped by these crawlers are not explicitly for AI training data, that’s what the Google-Extended crawler is for.

GoogleOther Crawlers

The two new crawlers are versions of Google’s GoogleOther crawler that was launched in April 2023. The original GoogleOther crawler was also designated for use by Google product teams for research and development in what is described as one-off crawls, the description of which offers clues about what the new GoogleOther variants will be used for.

The purpose of the original GoogleOther crawler is officially described as:

“GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development.”

Two GoogleOther Variants

There are two new GoogleOther crawlers:

GoogleOther-Image GoogleOther-Video

The new variants are for crawling binary data, which is data that’s not text. HTML data is generally referred to as text files, ASCII or Unicode files. If it can be viewed in a text file then it’s a text file/ASCII/Unicode file. Binary files are files that can’t be open in a text viewer app, files like image, audio, and video.

The new GoogleOther variants are for image and video content. Google lists user agent tokens for both of the new crawlers which can be used in a robots.txt for blocking the new crawlers.

1. GoogleOther-Image

User agent tokens:

GoogleOther-Image GoogleOther

Full user agent string:

GoogleOther-Image/1.0

2. GoogleOther-Video

User agent tokens:

GoogleOther-Video GoogleOther

Full user agent string:

GoogleOther-Video/1.0

Newly Updated GoogleOther User Agent Strings

Google also updated the GoogleOther user agent strings for the regular GoogleOther crawler. For blocking purposes you can continue using the same user agent token as before (GoogleOther). The new Users Agent Strings are just the data sent to servers to identify the full description of the crawlers, in particular the technology used. In this case the technology used is Chrome, with the model number periodically updated to reflect which version is used (W.X.Y.Z is a Chrome version number placeholder in the example listed below)

The full list of GoogleOther user agent strings:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; GoogleOther) Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleOther) Chrome/W.X.Y.Z Safari/537.36

GoogleOther Family Of Bots

These new bots may from time to time show up in your server logs and this information will help in identifying them as genuine Google crawlers and will help publishers who may want to opt out of having their images and videos scraped for research and development purposes.