Forum Moderators: Robert Charlton & goodroi


Lots of new "errors" in GWT coming from pages found years ago.

         

Sgt_Kickaxe

10:22 am on Nov 26, 2010 (gmt 0)



A year ago I moved a site from one host to another because I needed .htaccess control and my old host would not allow it. The workaround for not having .htaccess control had been to add /index.php/ to every page URL, which allowed WordPress permalinks to function.

When I made the move I removed all instances of /index.php/ from every link, file, and image. I also added a rule to my .htaccess file that rewrites /index.php/ to /, so that no pages display the workaround and all of them are 301 redirected to the version without it.
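A rule along these lines would do what's described above. This is a hypothetical sketch, not the poster's actual file, and it assumes mod_rewrite is enabled in an .htaccess context (where patterns match the path without its leading slash):

```apache
RewriteEngine On
# 301-redirect any URL whose path begins with /index.php/
# to the same path with that segment stripped out.
RewriteRule ^index\.php/(.*)$ /$1 [R=301,L]
```

The [R=301] flag makes the redirect permanent, which is what tells search engines to transfer the old URL's standing to the new one; [L] stops further rule processing for the matched request.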

It went quite well and indexing/traffic did not suffer. Then, out of the blue, Google Webmaster Tools began displaying lots of "Crawl Errors" for pages that contain /index.php/. Likewise, any regular page that has expired or been removed shows dozens of URLs pointing to it, all containing /index.php/.

By "out of the blue" I mean it's been over a year now, so why now? I'm inclined to ignore it since traffic still isn't suffering, but I'm curious what changed on Google's end, because nothing changed on mine. Ideas?

tedster

6:54 pm on Nov 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This was mentioned as a side topic in another thread, too. Some members are finding valuable data in this "historical dump", such as old backlinks that they can reclaim. Others are not giving it much attention.

I'm not sure about "why now?" I guess Google is trying to be helpful and share more of what they see.

iamlost

7:46 pm on Nov 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suspect it is a growing pain of the Colossus/GFS2 framework/filesystem and the Caffeine search-indexing implementation, with all the historic index layers being churned up.

Interesting that they appear to archive/replace and not overwrite or delete old data... :)

Sgt_Kickaxe

12:26 am on Nov 27, 2010 (gmt 0)



Interesting that they appear to archive/replace and not overwrite or delete old data... :)


Actually that's very interesting, considering the site has had a noarchive meta tag in place since the change. I needed to know when all of my pages in Google had been updated before changing other things, and noarchive lets me know when a new page replaces an old one (the cache disappears).
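For reference, the noarchive directive being discussed is set with the standard robots meta tag in each page's head:

```html
<!-- Tells search engines not to show a cached copy of this page;
     Google then omits the "Cached" link for it in the SERPs. -->
<meta name="robots" content="noarchive">
```

As the thread goes on to note, this only suppresses the public cache link; it does not stop the engine from keeping its own internal copy.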

So this is happening even with, or in spite of, noarchive. The files are obviously still cached by Google, but the cache link is removed from the SERPs. The most obvious conclusion would be that Google's SERPs and GWT run from different data sets, or at least on different timelines. I'm not sure that's the case, however, because of the time factor: it's been a year-plus, and GWT already shows last week's data.

It raises the question: a site with noindex (not just noarchive) shouldn't appear in the SERPs at all, but Google will likely keep a cached copy of it for themselves, so... do links from a noindex site interact with the rest of the web? I suppose that's for another thread.

I wish noarchive actually meant "do not keep a copy of this site" and not just "hide the copy".

tedster

2:38 am on Nov 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wish noarchive actually meant "do not keep a copy of this site" and not just "hide the copy".

Right - noarchive just means "don't make a copy visible to the public". Without private copies, Google couldn't maintain their ranking analysis of all the URLs they are crawling. Every time one page changes, they'd need to re-crawl all the others simultaneously in real time.