Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Googlebot is obsessed with index.php... how best to fix?
Sgt_Kickaxe




msg:4525806
 7:15 am on Dec 7, 2012 (gmt 0)

Server logs indicate that for almost every URL Googlebot visits, it also attempts the same URL with index.php, e.g. example.com/index.php or example.com/index.php/great-content. It usually hits the index.php version first.

My htaccess redirects any request containing index.php to the non-index.php equivalent. Needless to say, that's a lot of redirecting going on and a lot of wasted crawl budget. To compound matters, if I make a page return 404 or 410, Googlebot ends up on a chain of redirects.

- Almost no incoming links contain index.php in the url and certainly no urls on my site do either.

For the first 6 months of this site's life the URLs DID contain index.php, because the host gave me only limited access to htaccess. I then changed hosts and implemented the fix mentioned above; this was over 6 years ago.

Should I bite the bullet and have any request for the index.php version of a URL return 410? Other options?

 

deadsea




msg:4525860
 11:33 am on Dec 7, 2012 (gmt 0)

I would keep up with the 301 redirects. We had a site where we changed every single url on the site in 2002. 10 years later Googlebot was still crawling the old style urls and we were still serving 301 redirects for them. Googlebot never seems to forget these things. But it didn't seem to hurt our SEO efforts either.

levo




msg:4525872
 11:58 am on Dec 7, 2012 (gmt 0)

Switch to 410. Also make sure that the server doesn't redirect

example.com/index.php/content to www.example.com/index.php/content
example.com/index.php/content?somequerystringthatyouredirect to example.com/index.php/content

If you have pages that redirect and end up with 404/410, Google keeps crawling them (at least it crawls them more often)
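To make the chain levo describes concrete, here is a minimal Python sketch that walks a redirect table and flags URLs that redirect at least once and still end in 404/410. The ROUTES table and the function names are hypothetical and only mirror the example URLs above; they are not the poster's actual rules.

```python
# Hypothetical routing table: URL -> (status, redirect target or None).
# The entries mirror the example URLs in the post and are assumptions.
ROUTES = {
    "http://example.com/index.php/content":
        (301, "http://www.example.com/index.php/content"),
    "http://www.example.com/index.php/content":
        (301, "http://www.example.com/content"),
    "http://www.example.com/content": (410, None),
}


def trace(url, routes, max_hops=10):
    """Follow redirects in the table; return (visited URLs, final status)."""
    hops = [url]
    for _ in range(max_hops):
        # Unknown URLs are treated as 404 in this sketch.
        status, target = routes.get(url, (404, None))
        if status not in (301, 302) or target is None:
            return hops, status
        url = target
        hops.append(url)
    return hops, None  # gave up after max_hops: possible redirect loop


def is_wasteful(url, routes):
    """True when a URL redirects at least once and still ends in 404/410."""
    hops, status = trace(url, routes)
    return len(hops) > 1 and status in (404, 410)
```

Running is_wasteful over every old-style URL you know about would list the requests that cost Googlebot several hops only to reach a dead page.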

g1smd




msg:4525876
 12:11 pm on Dec 7, 2012 (gmt 0)

@levo There is no reason at all to retain "index.php" within the URLs. At least that part of the URL should be dumped.

I would continue with the 301 redirects. Google will request every URL they have ever seen, forever.

Sgt_Kickaxe




msg:4525879
 12:23 pm on Dec 7, 2012 (gmt 0)

Google will request every URL they have ever seen, forever.


That's what it feels like, for sure!

Since I'm redirecting non-www to www, I haven't worried too much about redirecting to remove /index.php/. But Googlebot doesn't request my non-www copies very often; it's the /index.php/ version they want on 90% of requests, and that's how often they still visit the index.php version first.

I also see this in my logs a lot for pages that are 410 (expired content), after Googlebot requests the index.php version of course.

"GET www.example.com/expired-content HTTP/1.1" 410 636 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 "redirect-handler" "redirect:/index.php"


I don't remember seeing the parts in bold until fairly recently.
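One way to put a number on the "90% of requests" impression is to count it from the log. This is a rough Python sketch; the regular expression assumes the field layout of the single log line quoted above (request, status, size, referer, user agent) and the function names are made up for illustration, so adjust both to the real log format.

```python
import re

# Assumed layout, inferred only from the one log line quoted in the post:
# "METHOD url HTTP/x.y" status size "referer" "user agent" ...
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


def index_php_share(lines):
    """Fraction of Googlebot requests whose URL contains /index.php."""
    total = hits = 0
    for line in lines:
        record = parse(line)
        # Matching on the user-agent string alone does not prove the
        # request really came from Google; it is only a rough filter.
        if record is None or "Googlebot" not in record["agent"]:
            continue
        total += 1
        if "/index.php" in record["url"]:
            hits += 1
    return hits / total if total else 0.0
```

Calling index_php_share(open("access.log")) on a real file (the file name is a placeholder) gives the share as a float between 0 and 1.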

g1smd




msg:4525883
 1:07 pm on Dec 7, 2012 (gmt 0)

If non-www with index.php redirects to www with index.php then that's part of the problem.

The first rule should redirect any request with index.php to www without it.

A later rule should redirect all non-www to www, but this rule will never be activated for requests with index.php within, because the earlier rule will already have taken care of all problems (the index.php and the www).
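The rule ordering g1smd describes can be sketched in Python to show why every request needs at most one hop. This is an illustration of the logic only, not htaccess syntax; CANONICAL_HOST and the function names are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumed canonical hostname


def strip_index_php(path):
    """Return the path without a leading /index.php, or None if absent."""
    if path == "/index.php":
        return "/"
    if path.startswith("/index.php/"):
        return path[len("/index.php"):]
    return None


def redirect_target(url):
    """Return the single 301 target for a URL, or None if it is canonical."""
    parts = urlsplit(url)

    # Rule 1: anything with index.php goes straight to www WITHOUT it,
    # fixing the hostname and the path in the same hop.
    clean_path = strip_index_php(parts.path)
    if clean_path is not None:
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, clean_path, parts.query, ""))

    # Rule 2: remaining non-www requests go to www. Requests containing
    # index.php never reach this rule because rule 1 already returned.
    if parts.netloc != CANONICAL_HOST:
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, ""))

    return None
```

With this ordering, http://example.com/index.php/great-content goes to http://www.example.com/great-content in one redirect rather than two.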

levo




msg:4525885
 1:21 pm on Dec 7, 2012 (gmt 0)

Google will request every URL they have ever seen, forever.


It does, but it requests redirected URLs much more often. Redirecting URLs with index.php won't slow down Googlebot on those pages. Well, you're already doing it...

My website had kind of a similar problem, Googlebot used to hit non-existing pages more than existing pages (~5-to-1). I've fixed it by returning 410 before any canonical redirections.

My suggestion is to return 410, and make sure you don't have redirects (non-www fixes, query string dropping etc.) that end up with 404s.
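A minimal Python sketch of "410 before any canonical redirections", as one reading of levo's advice: decide whether the content is gone first, and only then consider redirects. GONE_PATHS, the hostname, and the respond function are hypothetical names used for illustration.

```python
CANONICAL_HOST = "www.example.com"   # assumed canonical hostname
GONE_PATHS = {"/expired-content"}    # hypothetical list of removed pages


def respond(host, path):
    """Return (status, location) for a request, checking 410 first."""
    # Work out what the canonical path would be, without redirecting yet.
    clean = path
    if clean == "/index.php":
        clean = "/"
    elif clean.startswith("/index.php/"):
        clean = clean[len("/index.php"):]

    # Gone content answers 410 immediately, whatever form the URL took,
    # so no redirect ever leads to a dead page.
    if clean in GONE_PATHS:
        return 410, None

    # Only live content gets a canonical redirect, and only a single one.
    if clean != path or host != CANONICAL_HOST:
        return 301, "http://" + CANONICAL_HOST + clean

    return 200, None
```

Here a request for example.com/index.php/expired-content gets a direct 410 instead of a 301 followed by a 410.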

Sgt_Kickaxe




msg:4525895
 2:06 pm on Dec 7, 2012 (gmt 0)

g1smd, I have the non-www to www check placed after the index.php to www non-index.php redirect, so all is good on that front. The redirects themselves seem fine; I just wish they weren't happening so frequently. It seems that Google is defaulting to looking for versions with index.php first right now.

deadsea




msg:4525913
 2:48 pm on Dec 7, 2012 (gmt 0)

Redirects are so cheap and fast that they shouldn't be putting much load on your server.

They also shouldn't impact your crawl budget much. Googlebot seems to be limited by "time spent downloading" rather than "number of requests". So quick redirects shouldn't limit your crawl budget very much.

Sgt_Kickaxe




msg:4525931
 3:40 pm on Dec 7, 2012 (gmt 0)

That's probably very true, deadsea. Still, I'd like to say "hey Googlebot, stop checking the index.php versions first almost.every.single.time please, they will never exist!" but the redirects say they do exist and have moved. A good dose of 410 *might* change that, eventually? If nothing else it might reduce the number of "not selected" in GWT; I'm curious to know if these are included in that total.

Str82u




msg:4525934
 4:01 pm on Dec 7, 2012 (gmt 0)

Not sure why Google's doing that (outside of the overstated fact that once they find a URL they won't stop looking for it, ever). One trend I've seen in official (government) websites is moving to CMSs that use the /index.php/ handler for SEF page names. I have a CMS like this too, and it can remove /index.php/ if you want. In your case, though, it's up to Google whether or not they drop those references. A 410 might not have the intended effect with them; it's as if they keep a permanent archive of URLs to compare old and new versions of sites so they don't miss a thing.

Sgt_Kickaxe




msg:4525955
 5:53 pm on Dec 7, 2012 (gmt 0)

Now that I'm looking more closely, what exactly is going on with "redirect-handler" "redirect:/index.php" in the Googlebot entries in my server logs? It also occasionally says "redirect-handler" "/var/chroot/home/content/..." instead of showing a URL as host.

Could these be the source of my headaches?


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved