Googlebot is combining parts of different URLs

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Googlebot is combining parts of different URLs

AWSwS

3:49 pm on Jan 14, 2010 (gmt 0)

I don't know if anyone has had this issue before. But any information would be wonderful if anyone has.

i've recently been receiving hits from googlebot around 10am GMT on our site, but the links its attempting to hit are incorrect and have never been in that structure.

i have two styles of links

SEO: description_prodid.htm
Dynamic: dynamicfile.asp?x=y

googlebot has been hitting the site with links like: description/dynamicfile.asp?x=y

cutting off the product code and .htm from the seo link.

This makes no sense and no matter what i search in google, i receive no information to help me solve this issue.

Any ideas?

tedster

1:56 am on Jan 16, 2010 (gmt 0)

One of the things that googlebot does is sort of a "stress test" for your URLs. It's a way of discovering how servers are configured for URL variations, as well as testing whether the servers HTTP responses (such as 404) are valid. This helps them protect their crawling and indexing resources from getting lost in endless loops that some servers create.

This kind of crawling also probes to see if their are any URLs available that they have not discovered.

I'm not saying for sure that this is what you're seeing - it could simply be that some website somewhere has these malformed URL in their links. Scraper sites are often automated, and their programming also gets buggy. Heck, googlebot's programming can also get buggy, for that matter.

Unless you see googlebot using up a lot of budgeted crawling cycles for your domain with this kind of thing (I mean, to the detriment of your real URL indexing) I wouldn't worry about it.

herculano

2:04 am on Jan 16, 2010 (gmt 0)

sounds like a canonical issue...?

helpnow

2:23 am on Jan 16, 2010 (gmt 0)

Oh yeah, I know about this!

In one section of my site, I have some URLs in this format site.com/Subject109999.

Database driven, of course. Now, that database tops out at, say, 101000. 101001 does not exist. In fact, a long time ago, we wrote our site so that if you did hit 101001, or any other number, it would just go to the home page: it had nothing to show.

Well, in September 2009, googlebot started guessing on numbers, and it grabbed about 40,000 (yes, 40,000) different URLs in this sequence.

They don't exist. Our database cannot kick them out. And no one would link to those 40,000 URLs. It was the result of googlebot drawing some assumptions, and going fishing.

The end result was, I got 40,000 bad URLs indexed in the SERPs on this one section of my site. Oh, and googlebot did them sequentially, so, they go from 1000 to 41,000, then stop. Probably that is as far as it got until I stopped it with a noindex.

The crappy thing is, they are tough to get out. Oh, I figured this out back in October, and immediately did the obligatory noindex. No problem. However, they are not coming out of the SERPs very fast.

So, I am deleting the URLs by hand using the URL Removal tool. Problem is, I cannot delete the directory, because I have to safeguard the first 1000 good URLs, and focus on deleting out all the ones above 1000. Thank god I found a greasmonkey function that preps them 100 at a time for me. Oh, and I have also discovered that the URL Removal Tool craps out around 1000 per day. I do one batch of 100 every hour or so, and do as many batches as I can during the day. I still have about 26,000 to dump out, which will take me another 3-4 weeks.

Oh, and to add some comedy to the whole situation, google also sent me an email complaining about how I have too many URLs. Thanks.

I had another similar situation with 5000 duplicates of my home page from September too. My home page disappeared from the SERPs. Gone! I chased it with httpd.conf, and then noindex. Waited 3 weeks. They were coming out at a rate of about 100 a week. That meant about a year until they'd all be gone. I finally said, scr-w this, used the URL Removal Tool to dump the directory, they were all gone in 6 hours or so, and 2 days later, my home page was back with my 8 sitelinks.

Ah, and one more thing in this tale: I can search a specific string from the 40,000 + pages, and google reports 1 result. ie the strign seardch says the 40000 don't exist. However, about once a week, it reveals more - it will give me a list of 10-50 that do exist. And sometimes it also reveals the real number, it will report 1-10 of around 30,000 results, but still only display about 10-50. So, don't trust what google reports - they are hidden!

These 2 experiences have really adjusted my impression of numbers reported in google serps, and how googlebot works. Don't trust either. Assume nothing.

Even right now, I just checked,and it reports 1 result on those 26,000 remaining. I know it is still around 26,000 because I have another tool which checked each URL, one by one, to get the real count and to see if it is indexed or not. Checking the URLs, URL by URL, and not by a string, reveals that there are 26,000 that exist in the SERPs.

Crazy!

So that's my whole story, showing how I know it is true that googlebot is doing some fuzzy logic on the back end, surmising which URLs might exist based upon patterns it feels it has detected. Even if those URLs don't exist and there are no links to them. So make sure your site is airtight and all exceptions are noindexed, and do not trust string searches - search for specific URLs.

TheMadScientist

2:33 am on Jan 16, 2010 (gmt 0)

In fact, a long time ago, we wrote our site so that if you did hit 1001, or any other number, it would just go to the home page

I only skimmed the rest of the post, but got to the 'using the removal tool' and my first thought is if you can redirect to the home page you can serve a 404 OR 410 (a 410 will be dropped sooner), so personally, I would quit worrying about the removal tool and serve up a proper code for a pages (locations) that do not exist with a custom error page containing links to prominent locations on the site and a noindex meta tag...

helpnow

2:40 am on Jan 16, 2010 (gmt 0)

Sorry, MadScientist, you should have read my whole post, as painful as it may have been. ; ) There is Truth in it. I did the 410 (httpd.conf), since that is what I was told to do, and waited 3 weeks. They were going out of the SERPs at the rate of 100 a week. Then I did the URL removal Tool, they were all gone in hours, and my home page was back in 2 days.

Quit worrying? -grin- It was the difference between having or not having my home page / sitelinks in the SERPs. It was worth worrying about.

On this, due to my epxerience, I must respectfully disagree with you.

I am sharing my experience here in the hope that it may help someone else. Do not be shy to use the URL Removal Tool - it works fast and it can fix big problems.

TheMadScientist

3:00 am on Jan 16, 2010 (gmt 0)

Like I said I didn't read the whole post, mainly, because I thought 'WOW, why didn't they do this from the start or a year ago, when there have been large numbers of posts here talking about the handling of errors, error pages, server response codes and many of those posts (or threads) very clearly state redirecting non-existent pages to the home page is NOT recommended?' (Most of the posts I'm referring to have been made by very respected members, including jdMorgan and g1smd who are both staples in the Apache forum.)

If that's what you had to do to get your home page back in the index, then HOPEFULLY future readers will read both posts and realize they need to serve proper status codes for all non-existent pages to keep from having your obvious headache, time invested and concerns. IOW: I hope they learn from your 'not-so-fun' experience...

IMO On the scale of 'importance' proper server codes and handling of errors ranges from very important to critical.

I'm not trying to make light of your situation or say I would not have done the same thing if faced with the situation, but also, in my defense, all your post says is you 'chased it with the httpd.conf', and personally, I have no idea what that means, so I posted what I would have done, which is (now) obviously what you started with after realizing there was an issue. (I usually look for 'key points' when I skim, and I did skim yours for any note stating you changed them to proper codes, but did not see 404 or 410 mentioned anywhere and serving the proper code is the first correction... IOW: The proper correction is what I was looking for while skimming... The URL removal tool may assist in speed, but the proper, long term correction is to serve the correct codes.)