User Agent = Feedfetcher-Google

smallcompany

11:31 pm on Nov 24, 2008 (gmt 0)

User Agent = Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)

IP is 72.14.199.x

I recently moved one of my sites from shared hosting to a VPS with a dedicated IP.

I believe that this IP belongs to Google, does it?

Anyhow, I'm getting a bunch of 404s, as this UA visits every day asking for things like

gallery2/main.php?g2_view=rss.SimpleRender&g2_itemId=16

How do I fix this? Is there a way to "tell" Google to stop asking for this RSS feed, or whatever this is?

Is there any harm if I ban it? Should I use robots.txt or .htaccess?

Thanks

Samizdata

11:05 am on Nov 25, 2008 (gmt 0)

It is a legitimate Google robot for fetching RSS feeds.

The requests are for a file that is part of Gallery, a popular PHP image gallery package.

If you don't have Gallery installed (possibly the previous IP user did) you can try:

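# Answer any request whose path starts with "gallery" with 410 (needs mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_HOST} .
RewriteRule ^gallery - [G]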

This should return a 410 Gone response. Unfortunately, Google tends to treat it the same as a 404 and will probably continue to request the file for several months before giving up.

If you have no RSS feeds you want indexed you can safely block the bot:

RewriteCond %{HTTP_USER_AGENT} Feedfetcher
RewriteRule .* - [F]

...

Samizdata

11:27 am on Nov 25, 2008 (gmt 0)

For the avoidance of doubt, I assumed that the "User Agent = " was not part of the actual user-agent string - if it was then I would definitely block it.

[webmasterworld.com...]

...

smallcompany

6:44 pm on Nov 25, 2008 (gmt 0)

Thanks.

No, "User Agent" was not part of the UA string itself.

I don't use this, so I blocked it via .htaccess. I get too many 404s every day. I guess it may come from a past submission or a wrong one, as it is requesting via the IP address, not my site's name.

Can this be abused? I mean, is it possible that someone would try harming your site that way, by submitting your IP to, in this case, Feedfetcher?
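
Since the requests arrive addressed to the bare IP rather than the site's name, a narrower alternative to banning the user agent might be to answer only those requests with 410 Gone. This is just a sketch: example.com is a placeholder for the real host name, and the gallery2/ prefix is taken from the logged request above.

```apache
RewriteEngine On
# Match only when the Host header is NOT the site's own name
# (example.com is a placeholder; an optional port is allowed)
RewriteCond %{HTTP_HOST} !^(www\.)?example\.com(:[0-9]+)?$ [NC]
# Answer requests for the old Gallery paths with 410 Gone
RewriteRule ^gallery2/ - [G]
```

Requests that use the proper host name would be unaffected by this rule.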

Samizdata

7:14 pm on Nov 25, 2008 (gmt 0)

Can this be abused?

I wouldn't want to give a definitive answer, but the only tools I have seen using the 72.14.199.nn IP range are Feedfetcher and Googlebot-Mobile (both of which are also known to use 209.85.238.nn). I don't block them myself, and doubt that Feedfetcher itself could be the source of mischief.

But then I am constantly surprised by what is possible.

...

GaryK

7:24 pm on Nov 25, 2008 (gmt 0)

Wouldn't the only way to submit the feed to Google be via Webmaster Tools? If so, Google will crawl a submitted site only after it finds the validation code you have to put in a meta tag or a file on the site you want crawled. Regardless, I'm not sure how having Feedfetcher look for a non-existent file could harm your site.

That seems to suggest you, or someone with your login credentials, has to submit the site. So it's probably what Samizdata suggested: the feed from whoever had the IP before you is attracting Feedfetcher, and it will eventually give up and go away with no harm done.

Demaestro

7:29 pm on Nov 25, 2008 (gmt 0)

Wouldn't the only way to submit the feed to Google be via Webmaster Tools?

No. The iGoogle personal page (google.com/ig) lets you add RSS feeds to your personal home page, and Feedfetcher is the UA Google sends when that page requests feeds from a site.

It might just be one person who added something from that site to their iGoogle page.

GaryK

7:36 pm on Nov 25, 2008 (gmt 0)

I stand corrected. Thanks for enlightening me. I've never used iGoogle other than for a brief test. :)

Even with this new knowledge I can't see how the lack of a feed file could harm a site unless the file is supposed to be there. In this case it's not.

Demaestro

7:52 pm on Nov 25, 2008 (gmt 0)

I agree that the chance of harm here is nil.

Especially if it is iGoogle, because I am not even sure Google does anything other than fetch the feed. I don't think it indexes anything at that time, including the response headers.

dstiles

11:41 pm on Nov 25, 2008 (gmt 0)

I know from experience that feedfetcher visits sites that don't have RSS feeds.

If the visits are due to some moron adding a site into their google page then surely google should drop the attempt after being shown a 404 or 403?

incrediBILL

6:04 pm on Nov 26, 2008 (gmt 0)

Feedfetcher also steps off the path and tries to access pages indexed in the feed.

I allow raw access to my RSS feed but swat Feedfetcher and many others when they try to access anything else.
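
A rough .htaccess sketch of that kind of policy, assuming the feed lives at /feed.xml (a made-up path, so substitute the real one) and that matching on the "Feedfetcher" substring is enough to identify the bot:

```apache
RewriteEngine On
# Apply only to requests whose User-Agent contains "Feedfetcher"
RewriteCond %{HTTP_USER_AGENT} Feedfetcher [NC]
# ...and only when the request is NOT for the feed itself (hypothetical path)
RewriteCond %{REQUEST_URI} !^/feed\.xml$
# Everything else gets 403 Forbidden
RewriteRule ^ - [F]
```

Both conditions must hold for the rule to fire, so the feed stays reachable while every other URL returns 403 to that user agent.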

Samizdata

6:31 pm on Nov 26, 2008 (gmt 0)

Despite a misdirected link on Google's support site (as found by a Google search for "Feedfetcher Google") I eventually located - thanks to a third party site - the official Google Feedfetcher FAQ:

[google.com...]

I won't bother to quote it, as I'm sure the interested parties here will read it thoroughly.

...

GaryK

10:15 pm on Nov 26, 2008 (gmt 0)

Lovely. Like I needed to read that at the end of this totally crappy day. Still, thanks Samizdata. That was an interesting link.

smallcompany

12:03 am on Nov 27, 2008 (gmt 0)

the official Google Feedfetcher FAQ

I was there on the day before I initiated this post. All is good except this:

How do I request that Google not retrieve some or all of my site's feeds?
Since Feedfetcher requests are all user-initiated, it does not follow the typical robots.txt guidelines for robots. For detailed instructions about how to prevent Feedfetcher from requesting all or part of your site, please see our Removals page.

There is nothing on the Removals page relating to Feedfetcher.

When I first saw that, naturally (organically), I entered the term "feedfetcher" into the search box and got this:

Your search - feedfetcher - did not match any answers in our Help Center.
Please edit your search terms and try again.

Oh man...?!

I "killed" it in .htaccess via banning the user agent.

dstiles

2:34 am on Nov 28, 2008 (gmt 0)

I see no technical reason why Feedfetcher can't obey robots.txt. On the first visit, and periodically thereafter (say, every couple of days), it could check robots.txt for permission and behave accordingly.

Samizdata

3:08 am on Nov 28, 2008 (gmt 0)

Google would probably say that it's not a crawler (it is supposed to only fetch submitted feeds and not follow links) and is therefore not covered by the Robots Exclusion Protocol.

I sympathise with your view - few webmasters use a robots.txt file, but those who do would probably prefer it to apply to all non-human requests. The people who make bots do not like being thwarted though, and can interpret the (voluntary) protocol to suit themselves.

It's a jungle out there.

...

GaryK

5:02 pm on Nov 28, 2008 (gmt 0)

It's a jungle out there.

And Bill is our alpha male lion. :)