Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 49 message thread spans 2 pages.
WMT - Web crawl glitch

 12:46 am on Jul 10, 2008 (gmt 0)

Is anyone seeing glitches in the web crawl section of WMT?

It may be linked to this report: Webmaster Tools Content Analysis Glitch [webmasterworld.com]

When checking "Pages Not Found", they are reported as 404s [not found] - we have over 5,000 of them.

These pages have redirects on them to valid pages.



 2:39 am on Jul 10, 2008 (gmt 0)

These pages have redirects on them to valid pages.

What kind of redirect, technically?


 8:04 am on Jul 10, 2008 (gmt 0)

I should have said "some". Initially, the ones we can see have 301s on them.

But it will take some time to look through all the URLs, as we suspect there may be others with different factors.

[edited by: Whitey at 8:04 am (utc) on July 10, 2008]


 11:02 am on Jul 10, 2008 (gmt 0)

Whitey, every day a new nightmare for you. Actually, when you are reporting, I am often affected too. ATM there seem to be some messy things going on at Google, creating possible collateral damage.


 11:42 am on Jul 11, 2008 (gmt 0)

I've been meaning to mention this for a week or two now, but was giving Google more time to update their data.

I have an incoming link that someone mistyped on a forum about a month ago. What they typed as a URL does not exist on the site.

As soon as I was alerted to it, by it appearing in the "Pages Not Found (404 Errors)" report in WMT, I opened up .htaccess and added a single line of code to redirect calls for /blohbloh.html to the /blahblah/ folder, with a 301 redirect.

The non-existing (404) URL still remains in the WMT "Pages Not Found (404 Errors)" report. The "Last Calculated" date for that single URL still remains almost a month old too.

I am not sure whether Google has it programmed such that it needs to see a "200 OK" HTTP status before the URL will be removed from the error list (if it does work like that, then that would be a programming error, in my view), or whether I just need to give it more time to pick up the 301 status and then act on it.

I am keeping an eye open (in the "External Links" report in WMT) to see when the other (correct) link from that same forum has its "Last Found" date updated. The "Last Found" date for that other URL has been updated at least once since the forum post was originally discovered (and the "Last Found" date for that link, in the External Links section, matches the "Last Updated" date [NOT "Last Calculated" date for the single URL -- that's a month old still] for the whole of the Web Crawl section of the report), but I am not sure if that new date is the day before, or the day after, I had amended the .htaccess file to add the redirect.

Once the date updates again, I will be sure. I had been holding off posting here, until then, in case it was a false alarm...
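
For anyone wanting to confirm what a URL really returns before blaming WMT, a minimal Python sketch along these lines may help. It sends a single HEAD request without following redirects, so a 301 shows up as a 301. The function names `fetch_status` and `describe` are made up for illustration, and nothing here says how Google itself checks the URL.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Send one HEAD request and return (status, Location header).

    http.client does not follow redirects, so a 301 is reported as a 301
    rather than as the status of the redirect target.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status, location=None):
    """Turn a status code into a short label (hypothetical helper)."""
    if status in (301, 302, 303, 307, 308):
        return f"{status} redirect to {location}"
    if status == 404:
        return "404 Not Found"
    if status == 200:
        return "200 OK"
    return f"{status} (other)"
```

Calling `describe(*fetch_status(url))` on the mistyped URL should report a 301 and its target if the .htaccess rule is working. Some servers answer HEAD differently from GET, so treat the result as a first check only.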


 1:12 am on Jul 12, 2008 (gmt 0)

Yes, it's going to be several days before I can provide any further input either.

Some scrutiny by other members in the meantime may help to start revealing some anomalies, either in the report or in the way sites are crawled.


 1:19 am on Jul 12, 2008 (gmt 0)

*** Whitey, every day a new nightmare for you ***

I think, in reality, 90% (maybe 99%) of users miss these little things or just dismiss them, when in fact some of them can be the tip of something very important.


 2:01 am on Jul 12, 2008 (gmt 0)

100% agreed ... research in these areas is vital.


 11:46 pm on Jul 17, 2008 (gmt 0)

OK - we've done some checking:

The headers are fine. It just looks like a WMT problem, as it is also saying there are a whole bunch of URLs restricted by robots.txt, which is clearly not the case. They are just URLs from an old site.

We also have 404s showing on fully cached and indexed pages that are ranking well.

If there is this range of problems inside WMT, I just wonder if there is any correlation to the SERPs.

Also, as before at Content Analysis Glitch [webmasterworld.com], it looks like the reports of problems are pretty scant, so either folks are not adopting WMT, it's not widespread, or folks are paying little regard to the real things that matter on their sites.

Although it's a good step forward, I guess full adoption of WMT is going to be limited if things don't work, as confidence in the system will be limited.

Any other reports?

[edited by: Whitey at 11:47 pm (utc) on July 17, 2008]


 12:05 am on Jul 18, 2008 (gmt 0)

Hmm. The "Crawl Stats" graphs updated a couple of days ago. Is this a weekly event now? Seems a bit like it to me.
Anyway, on the busiest days, according to the graphs, Google pulled up to 20 pages from the site per day.

I know for a fact that on many days they actually pulled about 50 to 80 pages. On some days they did pull fewer than ten pages.


 12:11 am on Jul 18, 2008 (gmt 0)

Another issue is that the date and time stamps used to build the filenames for the CSV exports are always incorrect.

The "hours" field is wrong. The date/time is supposed to be UTC/Z, but appears to be a random number of hours behind reality. At certain times of the day it runs some 8 or 9 hours slow (from memory), and at other times it can be (I think) some 16 or 17 hours wrong.

The date is usually right (or lagging by a day). The "minutes" and "seconds" are always right. It is the "hours" that are wrong. I need to save a file once per hour for a whole 24 hours to see the range of problems because I suspect that there are TWO issues interacting there - one of which only occurs when you have moved over to a new date, while Google is still in "yesterday" at their time zone.

If I go to save a file right now... the filename ends: ..._20080717T051122Z.csv

The date is now 2008-07-18 and the time is now 00:11:22 in the UTC/Z time zone, so this is lagging by 19 hours.

Clock time in the UK is currently 01:11 am as the UK is on DST (British Summer Time) which is UTC+0100 (or Z+1).


 10:45 pm on Jul 18, 2008 (gmt 0)

OK. Another test, just before midnight UK time.

Current UTC/Z Date/Time is 2008-07-18 at 22:33:44 but Google adds this _20080718T033344Z.csv as a part of the filename instead.

That's wrong by 19 hours.
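
The arithmetic for the two filenames quoted above can be checked with a short Python sketch. It parses each filename stamp as if it really were UTC and subtracts it from the actual UTC time reported in the post. The second function tests one possible explanation, which is purely an assumption and not something anyone in the thread confirmed: that the stamp is US Pacific daylight time (UTC-7) written with a 12-hour clock and then labelled "Z". The names `lag_hours` and `pacific_12h_stamp` are made up for illustration.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def lag_hours(filename_stamp, actual_utc):
    """Hours by which the filename stamp trails the real UTC time."""
    stamped = datetime.strptime(filename_stamp, STAMP_FORMAT)
    return (actual_utc - stamped).total_seconds() / 3600


def pacific_12h_stamp(actual_utc, utc_offset_hours=-7):
    """ASSUMPTION: local Pacific time, 12-hour clock (%I), mislabelled as Z."""
    local = actual_utc + timedelta(hours=utc_offset_hours)
    return local.strftime("%Y%m%dT%I%M%SZ")


# (filename stamp, actual UTC time) pairs taken from the two posts above.
SAMPLES = [
    ("20080717T051122Z", datetime(2008, 7, 18, 0, 11, 22)),
    ("20080718T033344Z", datetime(2008, 7, 18, 22, 33, 44)),
]

for stamp, actual in SAMPLES:
    print(stamp, lag_hours(stamp, actual), pacific_12h_stamp(actual) == stamp)
```

Both samples work out to a lag of exactly 19 hours, and both match the hypothetical Pacific 12-hour stamp. Under that assumption the lag would flip between about 7 and 19 hours depending on the time of day, but two data points are not enough to be sure.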


 3:10 am on Jul 19, 2008 (gmt 0)

I think, g1smd, it's a bit like the case of the hacker who was able to manipulate the online voting on a well-known USA show and pondered how it was that their security was so lax. The answer was "they couldn't care less" about online voting, and I suspect the answer to your questions will be the same.


 7:56 am on Jul 19, 2008 (gmt 0)

I do wonder whether it exposes some underlying bug with data collection and/or processing, or whether it's just a bug with making the timestamp for the file. We'll likely never know for sure.


 10:58 pm on Jul 19, 2008 (gmt 0)

Re: #:369623

OK. This is a bug as far as I can see.

The WMT data for incoming links has been updated again today. That data shows a particular page on some other site was last crawled on July 11th, by way of Google finding a good link to my site, the proof being that the link was "Last Calculated" on that date.

Over in the Crawl Errors report, the supposedly 404 URL that is listed has actually been returning a 301 redirect for several weeks now, and yet it continues to show in WMT as a 404 error. The WMT report continues to show the URL as Last Calculated on June 16th, and the report page itself as being Last Updated on July 4th. That duff link is on the same page as was reported as being successfully crawled on July 11th in the linking report, yet we are to believe that the other link to my site on that very same page has not been discovered to have changed status from 404 to 301 yet.

[edited by: g1smd at 11:14 pm (utc) on July 19, 2008]


 11:13 pm on Jul 19, 2008 (gmt 0)

Any correlation to the way the SERPs are calculated [potentially incorrectly], or is WMT totally separate?


 12:01 pm on Jul 20, 2008 (gmt 0)

It looks like these threads might be evolving from "WMT Content Analysis Glitch" probs to larger 301 indexing issues.


It's not confined to GWT- I can confirm it's dropping some URLs from the index. The 404 errors seem to go away when the URLs are reindexed by standard Googlebot.
High level pages are much less likely to be dropped from the index due to this 404 issue but deeper level pages with less page rank are being scrubbed from the index.

The main site was 301-canonicalized at birth, so it's rare to ever encounter a non-canonical link

Lucky you :) I have dmoz listings pointing to the www.example.com and the example.com domains so no such luck for me.

Hmm ... this is what I can see too; I just wanted to see another data point.


 12:59 pm on Jul 23, 2008 (gmt 0)

I had to re-verify a site in WMT the other day. I use the "verify site using uploaded HTML file" method.

The WMT error message said that the file returned "200 OK". Huh? I'm not sure why the site is no longer verified (Sidenote: I only looked at the numbers in the filename, didn't note the CASE of the word "Google" in the error message; see below). Clicking the link instantly re-verified the file as OK.

Anyway, there was a request for /GOOGLE46ce6a4b4870a4376.html earlier that day (which shows as a 404 Error in the site stats), instead of the correct filename of /google46ce6a4b4870a4376.html (all lower-case) being requested. As I haven't got access to the raw log files on this site, I can't look into this further.

Notice that the "GOOGLE" part of the URL was requested in upper case, but none of the letters in the hexadecimal number were. Is this a WMT bug, or some sort of attempt at error correction, or perhaps a test to see that the site doesn't simply return 200 OK for any and all URL requests: lower-case should return 200 OK and upper-case should return 404 (except servers using IIS would return 200 OK for both {I use Apache}).

Note that Google also regularly asks for /no_exist46ce6a4b4870a4376.html which appears to be a test of what the standard 404 Error Page looks like.

Note: This isn't the real Google WMT ID number. I changed some of the digits in this example.


 9:50 am on Jul 25, 2008 (gmt 0)

It doesn't look like they are working to fix the problem. In fact, I think it's getting worse.

Whitey, you still seeing the same errors?


 4:31 pm on Jul 25, 2008 (gmt 0)


The TOP SEARCH QUERIES in my GWT have not been showing up for more than a week now. Previously, July data was showing up, but now it is gone. It's been 7-8 days now!
Anybody else having the same problem?


[edited by: tedster at 4:33 pm (utc) on July 25, 2008]
[edit reason] moved from another location [/edit]


 4:41 pm on Jul 25, 2008 (gmt 0)

I am too, techman. Our biggest site, which is PR7, hasn't updated for search terms in WMT since the 13th. Googlebot is still ripping away at the site and traffic is unchanged. The cache also dates to the 13th for the homepage.

I'm thinking that, with the rollout of the new PR and the new look to the cache, some backend issues have arisen that they are working on. If you look at Google's official forums about WMT, they are overflowing with these very same issues.


 6:56 pm on Jul 25, 2008 (gmt 0)

WMT had been showing "We Last Visited Your Home Page on July 10th", for quite a while and the message has been updated today to say "We Last Visited Your Home Page on July 16th".

July 16th is a very long time ago; that's a huge time lag for that fact to appear in WMT. It's usually just a couple of days. Why the long delay in that fact appearing now?


 9:08 pm on Jul 25, 2008 (gmt 0)

Notice that the "GOOGLE" part of the URL was requested in upper case, but none of the letters in the hexadecimal number were. Is this a WMT bug, or some sort of attempt at error correction, or perhaps a test to see that the site doesn't simply return 200 OK for any and all URL requests: lower-case should return 200 OK and upper-case should return 404 (except servers using IIS would return 200 OK for both {I use Apache}).

Yes, it is a case-sensitivity test. A server might return 200-OK, 404-Not Found, or a 301 redirect to the correct-cased URL, depending on its case-handling and configuration. They want to know what to expect. :)
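
The three requests described in this exchange (the exact verification file, the same name with "GOOGLE" upper-cased, and a `no_exist` name) can be sketched in Python as below. This is only an illustration of the idea, not Google's actual logic. The helpers `probe_paths` and `interpret` and the wording of the notes are hypothetical, and the ID is the altered example from the post above.

```python
PREFIX = "google"


def probe_paths(verification_file):
    """Build the three paths mentioned in the thread from one filename."""
    if not verification_file.startswith(PREFIX):
        raise ValueError("expected a name starting with 'google'")
    rest = verification_file[len(PREFIX):]
    return {
        "exact": "/" + verification_file,
        "upper_prefix": "/" + PREFIX.upper() + rest,
        "no_exist": "/no_exist" + rest,
    }


def interpret(statuses):
    """Map the three observed status codes to plain-language notes."""
    notes = []
    if statuses["exact"] != 200:
        notes.append("verification file not served")
    if statuses["no_exist"] == 200:
        notes.append("server answers 200 OK for missing pages")
    upper = statuses["upper_prefix"]
    if upper == 200:
        notes.append("paths look case-insensitive")
    elif upper in (301, 302):
        notes.append("wrong-case paths are redirected")
    elif upper == 404:
        notes.append("paths look case-sensitive")
    return notes


paths = probe_paths("google46ce6a4b4870a4376.html")
print(paths["upper_prefix"])  # /GOOGLE46ce6a4b4870a4376.html
print(interpret({"exact": 200, "upper_prefix": 404, "no_exist": 404}))
```

Keeping path generation separate from interpretation means the status codes can come from any source, such as log files or a manual header check.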



 9:49 pm on Jul 25, 2008 (gmt 0)

Whitey, you still seeing the same errors?
- Still the same.

 6:57 pm on Jul 26, 2008 (gmt 0)

Re: #:3696231 and #3702701

The Links Lists (both External and Internal) updated again today in WMT, with data up until July 20th. They have found some new links and re-verified some of the existing ones.

Again the Crawl Error (mentioned above) remains, with month old data showing for it. This is despite the fact that they have looked at that Page "X" the link is on, several more times since I first mentioned the problem. Google hasn't noticed that the URL (on my site) in the link on that Page "X" (on that other site) has changed from being a 404 to be a 301 redirect to some real content. The Crawl Error is from June 16th or thereabouts, and the Crawl Error page still says that the page was last updated on July 4th.

Another oddity. On that Page "X" of the other site, it also has links to two other pages on my site. It links to Page "A" on my site and to Page "B" on my site. WMT says that the link to Page "A" was last found on July 11th, and the link to Page B was last found on July 16th.

Surely it would have found both at the same time? Both of those links are on the same page on the other site. At the same time it could/should have rectified the Crawl Error, no?


 7:13 pm on Jul 26, 2008 (gmt 0)

I still have no access to the Search Queries data for the first week of July, now billed as "three weeks ago", in WMT.

"Data Unavailable" it says.


 8:07 pm on Jul 26, 2008 (gmt 0)

g1smd - Are you seeing this potentially reflected in your site's results?

It's a worry when seeing the wrong interpretation applied like this.

I know the reporting systems are supposed to be separated, but surely there must be an operational link [sorry to repeat this question against previous assertions]?


 8:25 pm on Jul 26, 2008 (gmt 0)

The site is too new to really be getting a handle as to what else is going on.

All I can say is that the last two pages of 80-odd pages continually evade being indexed by Google (but then again Live Search only just found those pages in recent days), and the site:domain.com/* search on Google indicates that only about 20 pages are in the main index.


 8:47 pm on Jul 26, 2008 (gmt 0)

I still have no access to the Search Queries data for the first week of July, now billed as "three weeks ago", in WMT.

I experienced the same thing a few days ago but it only lasted about two days before returning. It happened across four of my sites with their ages ranging from 4 to 8 years, and sizes ranging from ~10 pages to ~950 pages. I think that WMT was simply having problems with itself.

Another thing, I have 2 sitemaps that are telling me they have one error each. The detail explains that there is one page on each of the two sites involved that is giving a high response time, both accessed on the 24th. Again, I strongly suspect that was a problem outside of my sites. Possibly a point in time where the internet got sluggish or simply a googlebot problem. I don't really have a clue, just guessing.

On the bright side, the ~950 page site just finished being rebuilt from the ground up with a few new pages added. Googlebot just spent the last day and a half spidering it and my fingers are crossed that I'll start seeing some visitors from Google soon. Traffic from Google next to disappeared over the last couple of years after I made a major change (read "mistake") to that site and blew off all of the URLs that it had. 301s caught a lot of it but it's been suffering far too long.


 2:52 am on Jul 28, 2008 (gmt 0)

I just noticed this new alert:

URLs not followed
When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot because they contained too many redirects. Please change the URLs in your Sitemap that redirect and replace them with the destination URL (the redirect target). All valid URLs will still be submitted.

302 (Moved temporarily)
Jul 25, 2008

Doesn't make any sense. More WMT glitches, I imagine: the URL returns a 200 OK with no redirects at all.

What do you make of this?

[edited by: tedster at 5:41 am (utc) on July 28, 2008]
[edit reason] moved from another location [/edit]
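
To check a Sitemap URL against an alert like the one quoted above, the redirect chain can be walked by hand. The Python sketch below assumes a caller-supplied `fetch(url)` function that returns `(status, location)` without following redirects. `resolve`, `fetch` and the `max_hops` limit of 5 are all hypothetical, since the thread does not say what limit Googlebot uses.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(url, fetch, max_hops=5):
    """Follow redirects one hop at a time.

    Returns (final_url, final_status, hops), where hops lists each
    (status, url) pair that redirected. Raises RuntimeError if there are
    more than max_hops redirects (an arbitrary limit for this sketch).
    """
    hops = []
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, hops
        hops.append((status, url))
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects, last URL {url}")


# Stand-in for real HTTP requests: a tiny table of canned responses.
pages = {
    "http://www.example.com/old": (302, "/new"),
    "http://www.example.com/new": (200, None),
}
print(resolve("http://www.example.com/old", pages.__getitem__))
```

An empty `hops` list with a final status of 200 would match what the poster reports seeing. A non-empty list gives the destination URL that the alert asks to be put in the Sitemap instead.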

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved