deadsea

msg:4525860 | 11:33 am on Dec 7, 2012 (gmt 0) |
I would keep up with the 301 redirects. We had a site where we changed every single URL in 2002. Ten years later Googlebot was still crawling the old-style URLs and we were still serving 301 redirects for them. Googlebot never seems to forget these things. But it didn't seem to hurt our SEO efforts either.
|
levo

msg:4525872 | 11:58 am on Dec 7, 2012 (gmt 0) |
Switch to 410. Also make sure that the server doesn't redirect
example.com/index.php/content to www.example.com/index.php/content
example.com/index.php/content?somequerystringthatyouredirect to example.com/index.php/content
If you have pages that redirect and end up with 404/410, Google keeps crawling them (at least it crawls them more often).
|
g1smd

msg:4525876 | 12:11 pm on Dec 7, 2012 (gmt 0) |
@levo There is no reason at all to retain "index.php" within the URLs. At least that part of the URL should be dumped. I would continue with the 301 redirects. Google will request every URL they have ever seen, forever.
|
Sgt_Kickaxe

msg:4525879 | 12:23 pm on Dec 7, 2012 (gmt 0) |
| Google will request every URL they have ever seen, forever. |
| That's what it feels like, for sure! Since I'm redirecting non-www to www, I haven't worried too much about redirecting to remove /index.php/. But Googlebot doesn't request my non-www copies very often; it's the /index.php/ versions they want on 90% of requests, and that's how often they still visit the index.php version first. I also see this in my logs a lot for pages that are 410 (expired content), after Googlebot requests the index.php version of course. | "GET www.example.com/expired-content HTTP/1.1" 410 636 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 "redirect-handler" "redirect:/index.php" |
| I don't remember seeing the parts in bold ("redirect-handler" "redirect:/index.php") until fairly recently.
|
g1smd

msg:4525883 | 1:07 pm on Dec 7, 2012 (gmt 0) |
If non-www with index.php redirects to www with index.php then that's part of the problem. The first rule should redirect any request with index.php to www without it. A later rule should redirect all non-www to www, but this rule will never be activated for requests with index.php within, because the earlier rule will already have taken care of all problems (the index.php and the www).
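To make the ordering concrete, here is a minimal Python sketch of the two rules as a single decision function. It is not the actual server config (that would normally live in the rewrite rules); the names `CANONICAL_HOST`, `strip_index_php` and `redirect_target` are made up for illustration, and it assumes index.php only ever appears as the first path segment.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for illustration: the canonical hostname is the www one.
CANONICAL_HOST = "www.example.com"


def strip_index_php(path):
    """Drop a leading /index.php segment: /index.php/content -> /content."""
    if path == "/index.php":
        return "/"
    if path.startswith("/index.php/"):
        return path[len("/index.php"):]
    return path


def redirect_target(url):
    """Return the single 301 target for url, or None if it is already canonical."""
    parts = urlsplit(url)

    # Rule 1: anything containing index.php goes straight to www WITHOUT it,
    # fixing the hostname and the path in one hop.
    new_path = strip_index_php(parts.path)
    if new_path != parts.path:
        return urlunsplit((parts.scheme, CANONICAL_HOST, new_path, parts.query, ""))

    # Rule 2: remaining non-www requests. This is never reached for index.php
    # URLs because rule 1 has already handled them.
    if parts.netloc != CANONICAL_HOST:
        return urlunsplit((parts.scheme, CANONICAL_HOST, parts.path, parts.query, ""))

    return None
```

With this ordering, a request for example.com/index.php/content gets one redirect to www.example.com/content instead of a chain of two.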
|
levo

msg:4525885 | 1:21 pm on Dec 7, 2012 (gmt 0) |
| Google will request every URL they have ever seen, forever. |
| It does, but it requests redirected URLs much more often. Redirecting URLs with index.php won't slow down Googlebot on those pages. Well, you're already doing it... My website had a similar problem: Googlebot used to hit nonexistent pages more than existing pages (~5-to-1). I fixed it by returning 410 before any canonical redirects. My suggestion is to return 410, and make sure you don't have redirects (non-www fixes, query string dropping etc.) that end up with 404s.
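Roughly what "410 before any canonical redirects" looks like, as a small Python sketch rather than real server config. `GONE_PATHS`, `CANONICAL_HOST`, `normalize_path` and `respond` are hypothetical names for illustration, and the list of expired paths is assumed to be keyed by the clean (no index.php) path.

```python
from urllib.parse import urlsplit

# Assumptions for illustration only.
CANONICAL_HOST = "www.example.com"
GONE_PATHS = {"/expired-content"}  # hypothetical list of removed pages


def normalize_path(path):
    """Drop a leading /index.php segment so lookups use the clean path."""
    if path == "/index.php":
        return "/"
    if path.startswith("/index.php/"):
        return path[len("/index.php"):]
    return path


def respond(url):
    """Return (status, location) for a requested URL."""
    parts = urlsplit(url)
    path = normalize_path(parts.path)

    # Check for removed content FIRST, whatever host or index.php form was
    # requested, so no redirect ever leads to a dead page.
    if path in GONE_PATHS:
        return (410, None)

    # Only live pages get the canonical redirect, in a single hop.
    if path != parts.path or parts.netloc != CANONICAL_HOST:
        return (301, f"{parts.scheme}://{CANONICAL_HOST}{path}")

    return (200, None)
```

So example.com/index.php/expired-content answers 410 directly instead of 301 followed by 410.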
|
Sgt_Kickaxe

msg:4525895 | 2:06 pm on Dec 7, 2012 (gmt 0) |
g1smd, I have the non-www to www check placed after the rule that redirects index.php to www without index.php, so all is good on that front. The redirects themselves seem fine; I just wish they weren't happening so frequently. It seems that Google is defaulting to looking for versions with index.php first right now.
|
deadsea

msg:4525913 | 2:48 pm on Dec 7, 2012 (gmt 0) |
Redirects are so cheap and fast that they shouldn't be putting much load on your server. They also shouldn't impact your crawl budget much: Googlebot seems to be limited by "time spent downloading" rather than "number of requests", so quick redirects cost very little of it.
|
Sgt_Kickaxe

msg:4525931 | 3:40 pm on Dec 7, 2012 (gmt 0) |
That's probably very true, deadsea. Still, I'd like to say "hey googlebot, stop checking the index.php versions first almost.every.single.time please, they will never exist!" but the redirects say they do exist and have moved. A good dose of 410 *might* change that, eventually? If nothing else it might reduce the number of "not selected" in GWT; I'm curious to know if these are included in that total.
|
Str82u

msg:4525934 | 4:01 pm on Dec 7, 2012 (gmt 0) |
Not sure why Google's doing that (outside of the overstated fact that once they find a URL they won't stop looking for it, ever). One trend I've seen in official (government) websites is moving to CMSs that use the /index.php/ handler for SEF page names. I have a CMS like this too, and it can remove /index.php/ if you want. In your case, though, it's up to Google whether or not they drop those references. A 410 might not have the intended effect with them; it's as if they keep a permanent archive of URLs to compare old and new versions of sites so they don't miss a thing.
|
Sgt_Kickaxe

msg:4525955 | 5:53 pm on Dec 7, 2012 (gmt 0) |
Now that I'm looking more closely, what exactly is going on with "redirect-handler" "redirect:/index.php" in the server log lines for Googlebot? It also occasionally says "redirect-handler" "/var/chroot/home/content/..." instead of showing a URL as host. Could these be the source of my headaches?
|