Google Says They Deploy Hundreds Of Undocumented Crawlers via @sejournal, @martinibuster

Google’s Gary Illyes and Martin Splitt published a podcast about Googlebot, explaining that it’s not just one standalone thing but hundreds of crawlers across different products and services, most of which are not publicly documented.

What Googlebot Is

Gary clarifies that “Googlebot” is a historical name from the early days, when Google ran just a single crawler. Google now operates many crawlers across different products, but the name stuck even though it no longer refers to a single thing.

Further, he explains that Googlebot is not the crawling infrastructure itself. Googlebot is actually just one client of a larger internal crawling service.

Martin Splitt asked:

“How can I imagine Googlebot? How does our crawling infrastructure roughly look like?”

Gary answered:

“I mean, calling it Googlebot, that’s a misnomer. And it’s something that back in the days, perhaps early 2000s, it worked well because back then we probably had one crawler because we had one product. But then soon after another product came out, I think that was AdWords. And then we started having more crawlers and then more products came out and then more crawlers and then more crawlers.

But the Googlebot name that somehow stuck. Generally when we were talking about our crawling infrastructure in general, then we tended to call it Googlebot, but that was wildly inaccurate because Googlebot was just one thing that was communicating with our crawler infrastructure.”

Crawling Infrastructure Has A Name

Gary next explains that the crawling infrastructure has an internal name within Google but he declined to say what that name is.

He continued:

“Googlebot is not our crawler infrastructure. Our crawler infrastructure doesn’t have an external name. It has an internal name. Doesn’t matter what it is. Let’s call it Jack. And it is, I don’t know how to put it. It’s software as a service, if you like. SaaS. Right? then, so Jack has API endpoints, so to say. And then you can call those API endpoints to do a fetch from the internet.

And then when you do those API calls, then you also need to specify some parameters like how long are you willing to wait for, for the bytes to come back or what is your user agent that you want to send? What is the robots.txt product token that you want to obey and all these parameters.

And we do set a default parameter for most of these things, not all of them, but most of these things. So you can generally omit them, which makes these calls simpler, I guess, because you don’t have to specify all the stuff. But otherwise, it’s really just an API call to something in the cloud or on some random data center. And then that will perform a fetch for you as a software developer or a product.

So this product, because we can call it a product at this point, even if it’s internal, this has been around for a very, very, very, very long time. …But in essence, it’s always been doing the same thing. It’s basically you tell it, fetch something from the internet without breaking the internet. And then it will do that if the restrictions on the site allow it. That’s it. Like if I wanted to put it in one sentence, that would be it.”
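Gary's description suggests a shared fetch service with API endpoints, per-call parameters (timeout, user agent, robots.txt product token), and sensible defaults. The sketch below is purely illustrative: every name in it (`FetchService`, `FetchRequest`, the default values) is invented, since Google's internal service is not public.

```python
# Hypothetical sketch of the kind of fetch-service API Gary describes.
# All names and defaults are invented for illustration; the internal
# service ("Jack") is not publicly documented.
from dataclasses import dataclass


@dataclass
class FetchRequest:
    url: str
    timeout_seconds: float = 30.0      # how long to wait for the bytes
    user_agent: str = "Googlebot/2.1"  # user agent string to send
    product_token: str = "Googlebot"   # robots.txt product token to obey


class FetchService:
    """Stand-in for an internal crawling service with API endpoints."""

    def fetch(self, request: FetchRequest) -> str:
        # A real implementation would check robots.txt against
        # request.product_token before downloading request.url.
        return f"fetched {request.url} as {request.user_agent}"


# Because defaults cover most parameters, callers can omit them,
# which is what makes the API calls simple:
service = FetchService()
result = service.fetch(FetchRequest(url="https://example.com/"))
```

The point of the sketch is the shape of the interface, not the implementation: one shared service, many internal clients, each identified by its own user agent and product token.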

Hundreds Of Crawlers SEOs Don’t Know About

Not all of the Googlebot crawlers are documented; there are many that SEOs don’t know about. Gary said that many internal Google teams use the crawling infrastructure for different purposes, and that there are potentially dozens or hundreds of internal crawlers, of which only the major ones are documented publicly.

Smaller, low-volume crawlers often go undocumented due to practical limitations, but if a crawler becomes large enough, it may be reviewed and documented.

Picking up on the theme of there being multiple clients (crawlers), Gary continued:

“…we try to document a big chunk of them, but Google is a big company, so there’s lots of teams that want to fetch from the internet. So there’s lots of crawlers, lots of named crawlers, which means that we would need to document dozens, if not hundreds of different crawlers or special crawlers or fetches.”

Gary explains why documenting hundreds of crawlers is not feasible:

“And on a simple HTML page, that’s kind of infeasible. So we kind of try to draw a line and say that if the crawler is really tiny, meaning that it doesn’t fetch too much from the internet, then we try not to document it because the real estate on the crawler site, developers.google.com slash crawlers, is actually quite valuable.

We might try to deal with that differently, but for the moment basically just major crawlers and special crawlers and fetches are documented because, quite literally because of lack of space.”

Difference Between Crawlers And Fetchers

Gary explains that both crawlers and fetchers fall under the Googlebot umbrella but are actually different things.

He explains what the difference is:

“So the simplest way to explain it is that Crawlers are doing work in batch and then Fetchers do work on individual URL basis, meaning that you give a URL to a Fetcher and then it will fetch just one URL. You cannot give it a list of URLs to fetch.

And then for crawlers, it’s a constant stream usually of URLs and it’s running continuously for your team and fetching for your team from the internet.

And internally, we also have this policy that fetches need to be in some way user controlled. Basically, there’s someone on the other end who’s waiting for the response of the fetcher.

While with crawlers it’s like just do it when you have the time.”
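The distinction Gary draws can be sketched as two interfaces: a fetcher takes one URL and returns one response to a waiting caller, while a crawler works through a continuous stream of URLs in the background. The function names below are invented for illustration, not Google's actual API.

```python
# Hypothetical sketch of the crawler-vs-fetcher distinction described
# in the podcast. Names are invented for illustration only.
from typing import Iterable, Iterator


def fetch(url: str) -> str:
    """Fetcher: handles exactly one URL; someone is waiting for the response."""
    return f"body of {url}"


def crawl(url_stream: Iterable[str]) -> Iterator[str]:
    """Crawler: batch work over a continuous stream of URLs, no one waiting."""
    for url in url_stream:
        yield fetch(url)  # "do it when you have the time"


# A fetcher call returns a single response immediately:
page = fetch("https://example.com/robots.txt")

# A crawler drains a stream of URLs continuously:
results = list(crawl(["https://example.com/a", "https://example.com/b"]))
```

The key design difference is latency expectations: fetches are user-controlled and synchronous, while crawls are background batch work that can be scheduled whenever capacity allows.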

Martin and Gary say that many of the crawlers and fetchers they use internally are not documented. Gary explained that he has a tool that triggers an alert when a crawler or fetcher crosses a specific threshold of requests per day. He then follows up with the team responsible to find out what it’s doing and why, and to verify that it isn’t doing something accidentally. If a crawler is fetching a noticeably large number of URLs, he decides whether to document it so that the web ecosystem can know about it.
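The volume check Gary describes amounts to flagging any crawler whose daily request count crosses a threshold so a human can review it. The sketch below illustrates that logic only; the threshold value and crawler names are invented, and Gary's actual tool is not public.

```python
# Hypothetical sketch of a daily-volume threshold check like the one
# Gary describes. The threshold and crawler names are invented.
DAILY_FETCH_THRESHOLD = 1_000_000


def crawlers_to_review(daily_fetch_counts: dict[str, int]) -> list[str]:
    """Return names of crawlers whose daily volume warrants follow-up."""
    return [
        name
        for name, count in daily_fetch_counts.items()
        if count >= DAILY_FETCH_THRESHOLD
    ]


counts = {"tiny-internal-tool": 1_200, "big-product-crawler": 4_500_000}
flagged = crawlers_to_review(counts)  # → ["big-product-crawler"]
```

A flagged crawler then gets a human review: what is it doing, is that intentional, and is it large enough that the public crawler documentation should mention it.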

Listen to the Search Off The Record Podcast here:

Featured Image by Shutterstock/TarikVision