

Resolving "urlresolver"

Google IPs repeat no-robots runs


Pfui

1:22 am on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google's repeatedly running a basically unknown UA from several of its bare (no-host) IPs. E.g.:

74.125.44.86 - - [nn/Jul/2011:12:34:56 -0700] "HEAD / HTTP/1.1" 403 0 "-" "urlresolver"

robots.txt? NO
Had I visited GWT, or any sign-in G site? NO
Twitter-swarm? NO

In a recent thread, "a legitimate blank?" [webmasterworld.com...] , g1smd and lucy24 reported the same "urlresolver" hits and conduct from additional G IPs:

74.125.75.17 - - [nn-Jun-2011:19:11:56 +0200] "HEAD / HTTP/1.1" 200 482 "-" "urlresolver"
66.249.82.195 - - [02/Jun/2011:16:29:17 -0700] "HEAD / HTTP/1.1" 200 269 "-" "urlresolver"

So let's see... HEAD hits for root, and no robots.txt. And after 'commanding' us to jump through GWT and Web Preview and no-cloaking and countless other hoops to stay in G's good graces. Not cool.
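For anyone grepping their own logs for these hits, the combined-log line above can be pulled apart with a short script. A minimal sketch: the regex assumes standard Apache combined log format, and the day-of-month stays elided as "nn", exactly as in the excerpt.

```python
import re

# The log line from the post (day-of-month elided as "nn", as in the excerpt)
LOG = ('74.125.44.86 - - [nn/Jul/2011:12:34:56 -0700] '
       '"HEAD / HTTP/1.1" 403 0 "-" "urlresolver"')

# Apache combined log format: IP, identd, user, [time], "request",
# status, bytes, "referer", "user-agent"
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

hit = PATTERN.match(LOG).groupdict()
print(hit["ip"], hit["method"], hit["status"], hit["ua"])
# → 74.125.44.86 HEAD 403 urlresolver
```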

[edited by: incrediBILL at 4:46 am (utc) on Jul 24, 2011]

lucy24

4:39 am on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You could argue that if all you're planning to visit is the root directory there's no reason to consult robots.txt because what's it going to say? "Stay out of the directory that you're already in"? Gets a bit recursive there...

Yes, I realize you can disallow individual files by name. (Yo! Bad robot! Here's a list of everything I don't want you to look at!) But at that point I kinda doubt you're relying on the uniformed doorman (robots.txt) to protect you. That's a job for an armed security guard (htaccess and/or config file) ;)

Incidentally, I noticed earlier today (different thread, forum next door) that the imagebot doesn't seem to need no steenking robots.txt either. Sigh.

Pfui

6:35 am on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Am not sure I understand that argument, sorry. Ideally, the point of retrieving robots.txt would be to heed it by reviewing its rules before retrieving any other file. That's why this --

User-agent: *
Disallow: /

-- literally does say this: "Stay out of the directory that you're already in" a.k.a. root a.k.a. everything.

FWIW, many of G's UAs do not read/heed robots.txt. urlresolver is yet another one. That's why I whitelist by UA and IP/Host, and why in my log excerpt, "urlresolver" earned a 403.
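For reference, the UA side of that blocking can be done with a couple of mod_rewrite lines. A minimal sketch only, assuming Apache with mod_rewrite enabled; a real whitelist would also check IP/Host as described above:

```apache
RewriteEngine On
# Send the bare "urlresolver" UA away with a 403, as in the log excerpt
RewriteCond %{HTTP_USER_AGENT} ^urlresolver$ [NC]
RewriteRule .* - [F]
```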

lucy24

8:05 am on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google outsources its robots.txt reading. Look at your logs and you'll see that the visits to robots.txt are completely independent of the regular robot-type visits. In the case of ordinary html files they do eventually catch up to robots.txt changes. (Why it would take a computer network up to a week in real time to process this information is a serious mystery. Do they use carrier pigeons?) Anyway, they don't behave like garden-variety Good Robots who begin each individual visit with a look at robots.txt and then adjust their behavior accordingly.

My point was that it doesn't make a lot of sense to say "stay out of the directory you're already in" because the only way you can receive this instruction is by going into the directory. And then what do you do on future visits? How would you find out if the rules have been changed and you now are allowed into the top directory? The site owner isn't going to send out an announcement; you have to go in and look.

Staffa

9:08 am on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The site owner isn't going to send out an announcement; you have to go in and look.

Oh yes, he/she does.
I did it in the last couple of weeks. Gbot seems to think that it's the holiday season so the cat (webmaster) is likely away, let the mice (bots) dance.
Several bot IPs came and went straight for what is disallowed in robots.txt. Solution: I redirected them first to read that file, and after a few redirected visits I lifted the redirect; those same IPs then behaved according to my rules in robots.txt.
The bot may well have been vexed but hey, that's life.
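Staffa's trick, as I read it, looks something like this in Apache terms. A hedged sketch, not Staffa's actual rules; the "/bait/" path and the IP (a reserved documentation address) are placeholders:

```apache
RewriteEngine On
# Temporarily bounce a misbehaving bot IP from a disallowed path to robots.txt
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
RewriteRule ^bait/ /robots.txt [R=302,L]
# After a few redirected visits, remove the rules above and watch
# whether the same IP now obeys the Disallow lines.
```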

Pfui

1:17 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



lucy24:

1.) "Google outsources its robots.txt reading. Look at your logs and you'll see that the visits to robots.txt are completely independent of the regular robot-type visits."

Outsources? Pardon?

Googlebot requests robots.txt; others of G's UAs may or may not. On my sites, Google Web Preview does not. urlresolver does not. A gazillion AppEngine-Google UAs do not. Feedfetcher-Google (if it looks/quacks like a bot...) does not. Etc.

Perhaps a log entry to illustrate outsourcing?

2.) In an ideal world...

Bot first asks for and then is allowed to see:

/robots.txt

Bot sees:

User-agent: *
Disallow: /

Bot leaves.

OR

Bot first asks for and then is allowed to see:

/robots.txt

Bot sees:

User-agent: *
Disallow: /bait/

Bot ignores only the /bait/ directory.

ETC.

Lather. Rinse. Repeat.

If/when you change what Bot's allowed to see, you change robots.txt. So the next time Bot comes by and asks for (your revised) robots.txt, it sees its new rule(s), your 'announcements.'

More how-it-works and Google bot-specific info here (click the "Manually create a robots.txt file" drop-down) [google.com...]
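The ideal fetch-then-obey flow above maps directly onto the standard parsers, e.g. Python's stdlib robotparser. A minimal sketch using the second rule set quoted above:

```python
from urllib.robotparser import RobotFileParser

# The second example from above: everything allowed except /bait/
rules = """
User-agent: *
Disallow: /bait/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved Bot checks each URL against the rules before fetching it
print(rp.can_fetch("urlresolver", "/bait/trap.html"))  # → False
print(rp.can_fetch("urlresolver", "/index.html"))      # → True
```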

g1smd

5:09 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you want to understand more about how it works, think about this interesting situation:

Google SERPs sometimes list robots.txt files and index their content. There is even a "cache" link for many of these, leading to a cached copy of the file.

Adding Disallow: /robots.txt to the robots.txt file soon removes the robots.txt entry from the SERPs, but does not stop Google coming back to read the file to see the list of what they should not be indexing.

Disallow: / stops all access to all URLs within the site for the purpose of indexing. It does not stop access to the robots.txt file for site access control, and it does not stop access to the WMT account ID file (e.g. google2b27d8288a99e5a6.html) for the purpose of verification.

lucy24

7:43 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: sigh ::

Simple, repeatedly observed fact: a change in robots.txt does not result in an immediate change in behavior toward roboted-out directories by robots who have read the text. It can take up to a week. Sometimes far more.

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

Possibly I need to buy a new dictionary. In mine, "visit" and "index" are entirely different concepts. And, of course, "all robots" means "all robots including you, even if I have elsewhere singled you out for personal mention".

I particularly love the mental picture raised by the answer given in this Q&A [robotstxt.org]. :: snicker ::

g1smd

9:01 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I always recommend adding entries to robots.txt several days before the URLs become active. Based on your observations, making that change perhaps a week or more in advance may be in order nowadays.

Possibly I need to buy a new dictionary. In mine, "visit" and "index" are entirely different concepts.

Yes, visit/spider and index are two separate processes. Disallow is supposed to stop bots visiting particular URLs. However, if there is a link to a disallowed URL, search engines may still list that URL as a URL-only entry in the SERPs. Yahoo also used to attempt to construct a title for that entry by using the anchor text of any links pointing to that URL. It is the meta robots noindex tag that stops the content appearing in the SERPs.

However, you need to be aware of the following example involving a "Disallow: /" rule. It says do not visit ANY URLs beginning with "/" on the site. However, that does not stop the bot from visiting "/robots.txt" (which itself begins with "/") to find out what it should not be visiting.

And, of course, "all robots" means "all robots including you, even if I have elsewhere singled you out for personal mention".

No it does not. If you address a robot by name, it reads ONLY that section. The "User-agent: *" section applies to all other robots, i.e. all those not specifically named.

If a particular rule in the User-agent: * section should apply to a specific robot as well, then that rule needs to be duplicated into the section for that specific robot.

Robots read only one section of the file: the one that specifically addresses them by name. Only if there is no specific named mention for that robot will the User-agent: * section be read.
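That precedence rule (a named section wins; "*" covers only the unnamed rest) is what the standard parsers implement too. A quick check with Python's stdlib robotparser, using a hypothetical file made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Googlebot gets its own section; everyone else falls under "*"
rules = """
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The named bot reads ONLY its own section: /private/ is off-limits,
# but the blanket "Disallow: /" in the * section does not apply to it.
print(rp.can_fetch("Googlebot", "/page.html"))     # → True
print(rp.can_fetch("Googlebot", "/private/x"))     # → False
# An unnamed bot gets the * section: everything disallowed.
print(rp.can_fetch("SomeOtherBot", "/page.html"))  # → False
```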

lucy24

10:06 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



g1smd says:
If you address a robot by name, it reads ONLY that section. The "User-agent: *" section applies to all other robots, i.e. all those not specifically named.


the robots.txt page says:
The "User-agent: *" means this section applies to all robots.

g1smd

10:09 pm on Jul 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check out what Matt Cutts and Vanessa Fox said in 2006.
[webmasterworld.com...]
That stuff is still true today. Read jdMorgan's comments too.

Pfui

12:58 am on Jul 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Alas, urlresolver doesn't ask for robots.txt in the first place.

g1smd, you gave it a 200 -- do you have any info on what G's doing with this critter? What did it do, if anything, after you let it in? Has it ever gone for other than root; other than HEAD?

keyplyr

1:34 am on Jul 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"urlresolve" may fall into that gray area of link-checker, which has always claimed exemption from following robots.txt directives since no actual indexing was done. I'm not defending the behavior though. I'm one that feels all/any agent requesting files from my server, for any reason, should adhere to a standard.

lucy24

1:48 am on Jul 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"urlresolve" may fall into that gray area of link-checker, which has always claimed exemption from following robots.txt directives since no actual indexing was done.

If you're talking about the W3C link checker, there's no "always" about it. A couple of years back I was forced to install Checklink locally because the files I most often needed to check were located inside a no-robots directory, and Checklink refused to enter them. That is, it would confirm that the file itself existed, but it wouldn't follow internal # links into other documents, only within the document I was checking. Even explicitly allowing Checklink -- that is, by "Disallow:" followed by nothing, not by "Allow:", which is apparently iffy -- didn't make any difference.

Check out what Matt Cutts and Vanessa Fox said in 2006.
{link}
That stuff is still true today. Read jdMorgan's comments too.

:: mopping brow ::

Whew. One curious thing I've now noticed is that when people talk about "sections" of robots.txt they're talking about UA-delimited sections. I always had mine set up the other way around, with a separate pair of statements for each blocked directory. (That's "had" in the past tense because I've just put up a completely different file in order to test an unrelated hunch.)

Pfui

2:36 am on Jul 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



(Methinks extended robots.txt convos would be best-suited for the "Sitemaps, Meta Data, and robots.txt" forum, [webmasterworld.com...] please;)

Getting back to "urlresolver" --

keyplyr, yep. Yep. GMTA.

What irks me most about this UA is that it's unknown. At least the other G UAs -- those that read/heed robots.txt and those that don't -- relate to some known G purpose/function, if not benefit to site owners.

"urlresolver" is unapologetically, totally cloaked from the get-go, plus it hits from bare G IPs. Whatcha doin', G? Hmm?

lucy24

3:18 am on Jul 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's easy to find out what an urlresolver is. Problem is, it isn't a robot. (I have to sidetrack and point to this [hamletbatista.com] simply because it's got such cool pictures :))
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

I don't see anything in there that would require a house call. Is it sending out robotic interns for random spot checks to verify that the URLresolving is working correctly?
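The quoted description boils down to: join relative hrefs against the page URL, map each absolute URL to a docID, and store (source, target) docID pairs. A toy sketch of that bookkeeping -- purely illustrative, my reading of the paper's description rather than anything Google runs, with all names and URLs made up:

```python
from urllib.parse import urljoin

def resolve_anchors(anchors, doc_ids):
    """anchors: (source_url, relative_href) pairs, e.g. from a crawl.
    Returns (source_docID, target_docID) link pairs, assigning fresh
    docIDs as previously unseen absolute URLs turn up."""
    links = []
    for source, href in anchors:
        target = urljoin(source, href)             # relative -> absolute
        for url in (source, target):
            doc_ids.setdefault(url, len(doc_ids))  # new docID if unseen
        links.append((doc_ids[source], doc_ids[target]))
    return links

doc_ids = {}
links = resolve_anchors(
    [("http://example.com/a/", "../b.html"),
     ("http://example.com/b.html", "a/c.html")],
    doc_ids,
)
print(doc_ids)  # URLs -> docIDs
print(links)    # the "links database": pairs of docIDs
```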

Pfui

2:56 am on Aug 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google's "urlresolver" may or may not be the "URLresolver" mentioned by a non-G blogger in a four-year-old analysis of G's inner workings, or the Apache/Java-based "URLresolver", or any other so-called 'url resolver'.

"urlresolver" may or may not even do what its name purports.

All we know is that Google's "urlresolver" is not a real person visiting in real time; it's just another robot [en.wikipedia.org] executing an automated, unknown task.

Oh, and as of today, no longer is it just HEAD'ing to root:

66.249.82.65 - - "HEAD /dir/file.html HTTP/1.1" 403 0 "-" "urlresolver"

lucy24

7:03 am on Aug 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All we know is that Google's "urlresolver" is not a real person visiting in real time; it's just another robot executing an automated, unknown task.

It's not either-or. A robot is a tool. Someone throws a brick through your window, you don't question the brick about its motives-- but you also don't throw up your hands and say "Who knows why bricks do what they do." You find out who threw it. And you might also ask the brick factory why they painted it that strange color.

dstiles

9:40 pm on Aug 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You'd get VERY few replies, sensible or otherwise, to such questions from bot drivers. :)

Pfui

12:26 am on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Another from 66.249.82. --

66.249.82.162 - - [1n/Aug/2011:12:34:56 -0700] "HEAD / HTTP/1.1" 403 0 "-" "urlresolver"

Curiously, an html-only hit from a domain with "analytics" and "seo" in its name came literally in the next second. (CUE TWILIGHT ZONE THEME:)

Pfui

1:23 am on Aug 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



74.125.75.4
urlresolver

robots.txt? NO

Today, finally, a confirmed Tweeted-link follower -- approx. 60 minutes after the swarming crowd.

dstiles

7:05 pm on Aug 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



With urlresolver? Interesting!

I have the IP range 74.125.75/24 listed as google feedfetcher/preview/translate. I could see preview being in there (someone listed the site from a recent tweet/whatever it's called), but urlresolver looks odd.

Pfui

7:59 pm on Aug 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would've missed the post-swarm Twitter tie-in but for the fact the linked-to file was part of a frameset and bot hits only 'take' one of the framed pages.

dstiles

9:09 pm on Aug 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So how does google know to follow the twitter link? I thought it did not have access to T data?

Presumably it's getting the info from googletoolbar - would that seem (un)reasonable?

Pfui

9:30 pm on Aug 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



G could be mining the gazillion Twitter swarmer apps' public (read: free until they can figure out how to monetize it) data, and/or the shorteners like http://bit.ly

Not sure about no-access... I just G'd twitter plus two words I knew would be in a Tweet with a link and voila.

keyplyr

2:01 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As far as the Twitter connection, I only see urlresolver requesting web pages I post on Twitter using shortened URLs, hence the apropos "urlresolver."
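Which would make sense: resolving a shortened URL is just chasing its redirect chain to the final target. A toy model of the idea -- the dict stands in for live HEAD responses, the short links are made-up identifiers, and a real resolver would issue network requests instead:

```python
def resolve(url, redirects, limit=10):
    """Follow a chain of redirects until a final URL, a loop, or the
    hop limit is reached. `redirects` maps URL -> Location header."""
    seen = set()
    for _ in range(limit):
        if url not in redirects or url in seen:
            break
        seen.add(url)
        url = redirects[url]
    return url

# Hypothetical chain: a bitly link pointing through t.co (made-up IDs)
chain = {
    "http://bit.ly/abc123": "http://t.co/xyz",
    "http://t.co/xyz": "http://example.com/page.html",
}
print(resolve("http://bit.ly/abc123", chain))
# → http://example.com/page.html
```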

Pfui

5:28 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So apparently G is stealthily mining tweets, using bare IPs and an unannounced, unknown UA, with no overt connection or clue as to what G's doing, why it's urlresolving, or its plans for what it mines from tweeted context, tweeted links, or end-site content.

Oh, and all without asking for robots.txt, while threatening us if we cloak anything.

If 'urlresolved' tweets used goo.gl, I might be less irked. But when I spot a swarm, I check search.twitter.com and locate the original tweet. I've yet to see G's shortener used. Ever.

keyplyr, do you use that one, or some other?

keyplyr

7:21 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I mostly use bitly, but sometimes use TinyURL or Ow.ly.

I have never used t.co (the Twitter proprietary URL shortener, currently in beta) nor G's URL shortener.

dstiles

7:32 pm on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would never use a shortener. VERY open to trojan-implanters. A few try to check the source URL, but it's easy to cloak for that, and domains are cheap and easy enough to buy.

It occurred to me to wonder if G gets any of these links from scraping their email services, as well as GTB.

Robert Charlton

4:55 am on Sep 19, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Simple, repeatedly observed fact: a change in robots.txt does not result in an immediate change in behavior toward roboted-out directories by robots who have read the text. It can take up to a week. Sometimes far more.

Several questions raised by the above and ensuing discussion are covered in this thread, worth a look...

robots.txt - Google's JohnMu Tweets a tip
http://www.webmasterworld.com/google/4143083.htm [webmasterworld.com]