Forum Moderators: Robert Charlton & goodroi

Googlebot still indexing dead pages after months (googlebot 403 410)

     
1:41 pm on Oct 10, 2014 (gmt 0)

Junior Member from IT 

10+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


I used to rewrite the URLs of a photo gallery (Coppermine Gallery) with .htaccess rules.
After noticing a few errors, I emptied the .htaccess file a few months ago and reverted to the original URL structure.
However, as of today, Googlebot is still indexing thousands of broken links.
You can see some of them by typing
site:example.com keyword htmlalbums

in a Google search box.

I can't figure out why Google is still trying to crawl those pages, even though they haven't existed for months.

Given that a 404 error is apparently not enough, I was thinking about returning 410 for those pages. Will Googlebot then stop indexing them in the near future, and so stop wasting server resources?

All the affected URLs contain the string htmlalbums and used to reside inside the /gallery folder.

I wonder if putting this RedirectMatch in the .htaccess file inside the /gallery folder will do the trick. I tried it in a regex tester, and it should work.
RedirectMatch gone (.*)htmlalbums(.*)


Any help would be much appreciated.

[edited by: aakk9999 at 2:35 pm (utc) on Oct 10, 2014]
3:56 pm on Oct 10, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


However, as of today, Googlebot is still indexing thousands of broken links.


Are you saying that new URLs (which return 404) are being indexed, so that the number of these pages in Google's index is increasing, or that the already indexed pages are not yet being dropped from the index?
4:21 pm on Oct 10, 2014 (gmt 0)

Junior Member from IT 

10+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


Well, GWT shows a constantly increasing number of broken links (25,000 in June, about 108,000 as of now), even though Google search shows only about 1,800 results.
If, for some reason, the fact that the links are dead is not enough for Googlebot to stop, then I wonder if HTTP 410 could be my last resort.
Needless to say, the /gallery folder is accessible to anybody.
4:23 pm on Oct 10, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


Ignore WMT
4:50 pm on Oct 10, 2014 (gmt 0)

Junior Member from IT 

10+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


Of course, but I can't ignore the thousands of wrong results showing up in Google, nor the useless Googlebot pings that waste resources on my server.
If nonexistent pages are still indexed after months and Googlebot still tries to grab a fresh version of them, then there's something wrong.
Would
RedirectMatch gone (.*)htmlalbums(.*)
be of any help in returning HTTP 410 for pages containing the string "htmlalbums"?
6:35 pm on Oct 10, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


Returning 410 for them might help [such URLs are usually dropped faster and stop being revisited as frequently sooner], and it certainly won't hurt. As far as resources go, a 404 costs essentially nothing to serve, even thousands of them, so I'd recommend a custom error page for both 404s and 410s to give you a chance of capturing visitors from any URL Google actually happens to show in the results.
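
A minimal sketch of the custom error page idea, assuming Apache with .htaccess overrides enabled. The two page paths are placeholders, not files mentioned in this thread:

```apache
# Serve custom pages for missing (404) and permanently removed (410) URLs.
# /errors/404.html and /errors/410.html are hypothetical paths.
ErrorDocument 404 /errors/404.html
ErrorDocument 410 /errors/410.html
```

Use a local path rather than a full http:// URL here. With a full URL, Apache answers with a redirect, so the client never sees the original 404 or 410 status.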
8:11 pm on Oct 10, 2014 (gmt 0)

Junior Member from IT 

10+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


That's a wise idea; in fact, I set up a custom 404 error page a while ago.

I've put the RedirectMatch rule in the /gallery folder, and it seems to deliver the actual pages correctly to users and bots while returning HTTP 410 for the nonexistent URLs.
GOOGLEBOT_IP - - [10/Oct/2014:22:18:45 +0200] "GET /gallery/displayimage-1.htmlalbums/userpics/10001/mostra-26-1234-_Grant_Park_.htmlalbums/viaggio-chicago/albums/viaggio-washington/mostra-26-1230-_Millennium_Park_Il_Cloud_Gate_2_.html?pid=1234 HTTP/1.1" 410 637 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
GOOGLEBOT_IP - - [10/Oct/2014:22:18:46 +0200] "GET /gallery/displayimage.php?album=36&pid=1676 HTTP/1.1" 200 9097 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
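
For reference, RedirectMatch patterns are not anchored, so the leading and trailing (.*) groups should not be needed. A shorter sketch that should behave the same way, assuming mod_alias is enabled and the file sits in the /gallery folder:

```apache
# Return 410 Gone for any URL path that contains "htmlalbums".
# RedirectMatch tests the URL path only, not the query string.
RedirectMatch gone htmlalbums
```

Both log lines above fit this: the first request has "htmlalbums" in its path and gets a 410, while displayimage.php?album=36&pid=1676 does not and is served normally.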
 
