Sitemaps, Meta Data, and robots.txt Forum
Google appears to be ignoring 301'd URLs
Googlebot indexing site by IP, even though server is set up to 301
rowan194
9:13 pm on Aug 14, 2012 (gmt 0)

I have my server set up to 301 any request for the non-preferred hostnames example.com or 123.123.123.123 to the preferred domain www.example.com.

This is intended to prevent duplicate content indexing, and keep everything on a single consistent domain. This redirect has been in place for over a year, probably more like 18+ months.
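(For reference, the redirect logic is roughly along these lines. This is just a minimal PHP sketch, assuming every page request is routed through a PHP front controller; the hostnames and IP are placeholders, and the same thing could equally be done with a mod_rewrite rule.)

<?php
// Minimal sketch of the canonical-host 301, assuming all page requests
// pass through a PHP front controller. Hostnames and IP are placeholders.
$canonical = 'www.example.com';
$host = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : '';

if (strcasecmp($host, $canonical) !== 0) {
    // Request arrived on example.com or 123.123.123.123:
    // redirect permanently to the preferred host, keeping path and query string.
    header('Location: http://' . $canonical . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}
// ...normal page handling for www.example.com continues here...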

This doesn't seem to have stopped Googlebot though - roughly every fourth page fetch at the moment is Googlebot trying to index the site by its IP, even though every such attempt is 301'd to www.example.com.

Shouldn't a 301 response effectively remove that URL from the index? Google still has 600k+ pages showing when I search for "site:123.123.123.123" ... remember this 301 redirect has been in place for some time.

There are three things I can think of to do:

1) Add 123.123.123.123 to GWT and set a lower crawl rate. This probably won't remove the pages from the index, and crawl rate will revert to what G decides is appropriate in 90 days anyway.

2) Start returning a 404 or 410 (Gone) status when Googlebot attempts a fetch from 123.123.123.123. (I can't do this universally because G is still referring searchers to the IP-only hostname!)

3) Set up a custom robots.txt for 123.123.123.123 that disallows everything. I presume this will remove all 123.123.123.123 URLs from the SERPs, but I also presumed that 301'ing everything would too :)

Thanks in advance for any tips.

 

g1smd
10:38 pm on Aug 14, 2012 (gmt 0)

Leave the 301 in place. If it brings visitors to your site, don't block them.

Google visits URLs forever irrespective of what response code you return. They do this just in case a URL ever comes back to life with real content.

The 301 is the best response. Google will drop those URLs from the SERPs eventually. Don't force a quicker drop unless you want fewer visitors to your site for the next three to six months.

Don't block with robots.txt or return 404.

Register every possible hostname variant (www/non-www/IP) in Webmaster Tools and look at the crawl stats and crawl error reports for each.

rowan194
11:38 pm on Aug 14, 2012 (gmt 0)

Referrals from Google by hostname:
123.123.123.123 -> 4.2%
example.com -> 1.1%
www.example.com -> 94.7%

I'd be happy to lose 5% of my visitors if Gbot stopped banging my site 3 times for every possible page (IP, example.com, www.example.com). I'm having trouble understanding why Google is ranking example.com and the IP, since they are 100% duplicate content.

I think part of the problem is that my site has so many pages that what's in the index is stale. I checked my logs for the first few results of "site:123.123.123.123" and Googlebot last fetched the URLs 6 months (!) ago. Perhaps this is the reason that duplicate content detection isn't working, as the page content generally changes (sometimes drastically) every 3-4 months.

The sheer number of pages on my site causes Gbot to fetch at a decent rate (100,000+ per hostname per day if not manually configured in GWT), which is why doing it in triplicate is even more of an issue.

g1smd
11:48 pm on Aug 14, 2012 (gmt 0)

I'd be happy to lose 5% of my visitors if Gbot stopped banging my site 3 times for every possible page

...but they're not! As you said...

Googlebot last fetched the URLs 6 months (!) ago

Yep, so they aren't banging your site at all. Your real pages are ranking, some of the old versions of your pages are still showing as Supplemental Results for the IP, and the redirect gets those visitors to the right place.

I'd say you're getting an extra 5% added to your visitor totals, for no outlay at all. I'd not block it.

rowan194
11:56 pm on Aug 14, 2012 (gmt 0)

No, I was only talking about the first few results for site:123.123.123.123 - those few that I checked were last fetched by Googlebot 6 months ago.

Googlebot is still busy fetching every other page it can; 75,000 fetch attempts to 123.123.123.123 in the past 24 hours. All 301'd, as they have been for the past 12+ months.

g1smd
12:52 am on Aug 15, 2012 (gmt 0)

75,000 fetch attempts to 123.123.123.123 in the past 24 hours

That's a lot. Almost one per second - there are only 86,400 seconds in a day!

However, if you still have 600,000 URLs showing for the IP, it might be a good while yet before crawling tails off.

What's the trend for the crawl stats for the IP as shown in WMT?

rowan194
7:26 am on Aug 15, 2012 (gmt 0)

I did some quick calcs and realised this is most likely going to be an ongoing problem. To fetch 200 million pages from my site at 1 per second would take over 6 years (200 million seconds is roughly 6.3 years).

As an interim measure I've created a GWT account for 123.123.123.123 to try to slow Googlebot down; unfortunately they won't let me change the crawl rate yet.

"We do not have enough information about your site at this time to allow changing the crawl rate. Please visit again later."

This is probably an edge case as most sites would not have 200m+ pages. Still, it's very frustrating that I need to create 3 different GWT accounts for the same site, and that I have to manually set the crawl rate every 90 days. I wish Googlebot would recognise and respect the crawl-delay directive... other popular crawlers do.

phranque
10:30 am on Aug 15, 2012 (gmt 0)

it might help solve part of your problem to use GWT to set the preferred domain to www.example.com rather than example.com.

Official Google Webmaster Central Blog: Setting the preferred domain:
http://googlewebmastercentral.blogspot.com/2006/09/setting-preferred-domain.html

rowan194
10:56 am on Aug 15, 2012 (gmt 0)

Thanks for the suggestion, phranque, but I already did that a year ago. Googlebot is still fetching from example.com, even though that is also 301'd on every load and has been set as the NON-preferred domain for a year.

GWT seems to have no way to associate 123.123.123.123 with a preferred domain (or any domain) - many options say "Restricted to root level domains only"

So I'm left with 3 verified sites that are actually identical, with www.example.com preferred over example.com, yet Googlebot is still fetching from all three: www.example.com (200), example.com (301) and 123.123.123.123 (301). It really is a waste of resources to fetch everything 3 times, particularly because I have a huge number of pages.

I've added the "X-Robots-Tag: noindex" header to the 301 redirect to try to send a stronger message to Googlebot that I do not want these URLs in their index.
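In the redirect logic that's just one extra header emitted before the Location header - something like this (same PHP sketch and placeholder hostnames as above):

// Sketch: add a noindex hint to the 301 responses for the non-preferred hostnames.
header('X-Robots-Tag: noindex');
header('Location: http://www.example.com' . $_SERVER['REQUEST_URI'], true, 301);
exit;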

phranque
12:56 pm on Aug 15, 2012 (gmt 0)

the "X-Robots-Tag: noindex" header is technically the same to google as a meta robots noindex element.

have you actually verified in your server access logs that googlebot is getting the 301 and/or have you tried fetch as googlebot in GWT for the IP and example.com hostname?

rowan194
4:51 pm on Aug 15, 2012 (gmt 0)

My access logs go back to September 2011, so I did a little experiment: I pulled out the first few page fetches by Googlebot from that time to see when the subsequent fetches happened (and which hostname they were directed at).

page #1:
28/Sep/2011:11:00:04 ... 301 ... example.com
28/Sep/2011:11:00:32 ... 200 ... www.example.com
20/Jun/2012:23:40:51 ... 200 ... www.example.com
(nearly 9 months between re-fetches from www.example.com, and no re-fetch from example.com after nearly 11 months)

page #2:
28/Sep/2011:11:00:05 ... 301 ... example.com
28/Sep/2011:11:00:33 ... 200 ... www.example.com
20/May/2012:10:28:50 ... 200 ... www.example.com

page #3:
28/Sep/2011:11:00:06 ... 301 ... example.com
28/Sep/2011:11:00:36 ... 200 ... www.example.com
15/Jan/2012:08:54:03 ... 200 ... www.example.com

So it looks like G may actually be marking 301'd content on example.com as effectively deleted and not coming back; it's just that I have so many pages that the average gap between the last 200 fetch of a page (pre-June 2011) and the "new" 301 fetch is impossibly large. It could take literally years for Googlebot to fetch, and react to, a 301 for every page it has queued on the wrong hostnames. The domain has been active for 5+ years, so it's possible a lot of what is still showing in the index was fetched before I set the preferred domain in GWT a year ago.

I would guess that, apart from manual intervention by Google, a custom robots.txt disallowing everything on example.com and 123.123.123.123 would be the only way to quickly shed the unwanted URLs from their index.

g1smd
7:13 pm on Aug 15, 2012 (gmt 0)

200 million pages

Ah, if you had said that up front, the thinking might have been a little different.

You have two real choices for the IP indexing:

One is to use robots.txt to disallow requests arriving at the IP. This needs very careful scripting to ensure that the disallow-everything robots.txt is not served for www or non-www hostname requests. I would rewrite requests for robots.txt to /robots.php, detect the requested hostname in the PHP script, and serve the right content from there (see the sketch below).

The other option is to carry on serving the 301 redirect and let Google continue fixing their data that way.
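The robots.php for the first option might look something like this. It's a rough sketch only: it assumes a rewrite such as RewriteRule ^robots\.txt$ /robots.php [L] so that every hostname's /robots.txt request reaches the script, and the hostnames/IP are placeholders.

<?php
// Rough sketch: serve a different robots.txt depending on the requested hostname.
// Assumes /robots.txt is rewritten to this script for every hostname.
header('Content-Type: text/plain');

$host = isset($_SERVER['HTTP_HOST']) ? strtolower($_SERVER['HTTP_HOST']) : '';

if ($host === 'www.example.com') {
    // Preferred hostname: allow normal crawling.
    echo "User-agent: *\n";
    echo "Disallow:\n";
} else {
    // Bare IP or non-www hostname: disallow everything.
    echo "User-agent: *\n";
    echo "Disallow: /\n";
}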

rowan194
6:50 pm on Aug 28, 2012 (gmt 0)

FWIW: I verified the IP-only version of my site 2 weeks ago, but GWT is still reporting:

"We do not have enough information about your site at this time to allow changing the crawl rate. Please visit again later."

I don't know whether this is again unique to my 200m+ page site, or whether there's a more generic reason they are not yet letting me change the crawl rate. (Googlebot has been crawling the IP version of my site for, literally, years.)
