
Googlebot crawls thousands of fake 404 pages


marcob

10:25 am on Aug 12, 2011 (gmt 0)

10+ Year Member



We are desperate and need help with the following.

Googlebot generates and crawls tens of thousands of 404 pages on our site.
We cannot find why, or where, these links are generated.

GWT says these URLs are generated internally on our site; it even gives source URLs. But those source URLs are also 404 pages.

e.g. it reports the following 404 URL:
http://www.example.com/sparen/lenen/autoleningen-vergelijken/rate/_______________________________bank-van-de-post_61814.html?filter_order=tbl.rate&filter_order_Dir=asc&type=newcar&sbmt=bank&limit=5&start=5

and claims it has been internally linked from the same page:
http://www.example.com/sparen/lenen/autoleningen-vergelijken/rate/______________________________bank-van-de-post_61814.html?filter_order=tbl.rate&filter_order_Dir=asc&type=newcar&sbmt=bank&limit=5&start=5

This is rubbish, of course.

We now have more than 100,000 of these 404 URLs in GWT and still cannot find what is generating them. The source is always another 404 URL of the same structure.
Can anyone help?

[edited by: tedster at 3:26 pm (utc) on Aug 12, 2011]
[edit reason] switch to example.com [/edit]

tedster

3:32 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello marcob, and welcome to the forums.

Do a lot of those 404 URLs include a chain of underscore characters?

marcob

3:41 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



Yes, most of these 404 URLs (not all) include these underscores, and the number of underscore characters seems to increase as the problem gets worse.

matrix_jan

3:58 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



I had a similar issue; a 301 is your only solution. Don't wait for G to pull those 404s out of WT. Determine what they all have in common and 301 them to the nearest folder at least, using either .htaccess or your page generator... Act quickly; it will take months for the 404s to clear out. G-bot sometimes tries to crawl a made-up page, which it thinks will exist since similar pages exist too...

marcob

5:49 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



Thank you all for your responses.

@matrix_jan: in your similar issue, did you also have tens of thousands of these fake URLs? Were they also generated internally? And why use a 301 on them? I'm a little scared of that, because it might confuse G even more.

lucy24

8:44 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I smell an .htaccess issue. There have been several very similar problems reported recently next door in the Apache forum: nonexistent documents were being created and then crawled. Don't even search; just eyeball the last month or two of topic titles.

g1smd

8:51 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First, stop Google adding to this mess.

Add this near the start of the .htaccess file:

# The "-" means "no substitution"; [G] sends "410 Gone".
RewriteRule __ - [G]


Anything with a double (or more) underscore will be flagged as "Gone".

Next comes the much harder job: finding the likely combination of relative linking and botched rewriting that is causing the problem.
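One hypothetical shape such a loop can take (the rule and the link below are invented for illustration, not taken from marcob's site): a catch-all rewrite answers 200 for any matching URL, so fake pages render instead of 404-ing, and the template then mints the next fake link.

# Hypothetical catch-all: every .html request under /sparen/ is handed to
# one script, so made-up URLs return 200 instead of 404.
RewriteRule ^sparen/.*\.html$ /index.php [L]

If that script builds a sort or filter link by prepending a character to the requested filename (say, "_" plus the current filename), every crawl of a fake page yields a link to a new fake page one underscore longer, which would match the growing chains seen here.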

matrix_jan

11:42 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



@marcob

I started 301-ing when the number of 404s in WT reached 600-700... There is no stopping this: if you don't redirect those fake pages, the number is going to increase, and very dramatically. One page links to two fake pages, those two link to four... and so on... so act quickly.

Why did I 301? First comes panic, then comes reaction, and then comes the research and thinking... Say Google (or someone) wants to access http://www.example.com/category/subcategory/somefakestuff.html. For me the best way was to redirect (301) the fake stuff to the subcategory... Now do your own thinking and do what's best for your website. After a few months of 301-ing I still have 100+ 404s showing up in WT, even though they are 301-ed.
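A minimal .htaccess sketch of that approach, assuming (hypothetically) that the fake URLs are the ones containing a double underscore; this is untested, so adapt the pattern and target to whatever your own bogus URLs have in common:

# 301 any double-underscore URL up to the folder it sits in;
# the trailing "?" drops the bogus query string.
RewriteRule ^(.+/)[^/]*__[^/]*$ /$1? [R=301,L]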

In your post you say that the page is linked from the same page, but it looks like one of the two URLs has one more _ than the other.

marcob

12:56 pm on Aug 13, 2011 (gmt 0)

10+ Year Member



Thank you all for your input, and especially matrix_jan.

You are right, I didn't notice it at first, but indeed one of the two URLs has more underscores than the other. Nevertheless, they are both 404 pages.

We loaded the pages through Google Webmaster Tools and everything seems normal, so I guess that rules out an .htaccess problem.
We have more than 100,000 404 pages now. We will start to 301 them.

Sgt_Kickaxe

1:37 pm on Aug 13, 2011 (gmt 0)



This isn't new. When you create a dynamic parameter (anything with ?, &, or =), Google seems to guess at some possibilities to see what will happen. If you have a standard &page=1, Google will test it with &page=800. If you run a standard CMS, Google will test the back end for various things, such as whether remote publishing is on or off.

Some of this MUST apply to ranking factors. Blogs with remote publishing on may score lower than those with it off. Blogs that return a true 404 on pages Google has never seen and doesn't expect to find will likely rank higher than blogs that return an empty template page for the random variable.

Make sure your parameters are locked down and that nothing but a 404 is returned for pages that do not exist.
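A minimal sketch of that lockdown in .htaccess, assuming (hypothetically) that legitimate listings never go past page 99; mod_rewrite can force the 404 for anything else:

# Hypothetical example: force a 404 whenever the "page" parameter
# has three or more digits, since no real listing goes that deep.
RewriteCond %{QUERY_STRING} (^|&)page=\d{3,}
RewriteRule ^ - [R=404,L]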

Marcob, those 404s are what's supposed to happen when Google crawls pages that don't exist, so I wouldn't worry about it.

1script

5:05 pm on Aug 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We have more than 100,000 404 pages now. We will start to 301 them.

I'm afraid this approach can backfire. It's like you're saying: "Yes, I confirm that these were actual valid URLs, but now I want them all changed to http://www.example.com/#*$!.html". Now you have 100,000 pages, all with the same content (if there was anything after that 404 response header), pointing to this one page. It'll probably just get killed for duplicate content. And if you're redirecting those non-existent error pages onto an important category page (worst of all, the homepage), a page that was propping up other content pages with internal links, then its demotion can damage the entire site.

This is a doomsday scenario, I realize that. But I have to say that none of my past attempts to "clean" errors from WMT, by either asking to remove the erroneous page URLs from the index (Crawler Access -> Remove URL) or 301-ing them elsewhere, resulted in anything other than wasted time (at best) or disaster (at worst). If the number of bogus pages you're trying to get rid of is high in proportion to the total number of pages on the site, it may lead to complete destruction of the site's rank: I have two such sites receiving Google search referrals in the single digits, down from about 1,000/day, after such attempts.

I also tried to make sure a 410 gets returned instead of a 404 if the URL is really, really wrong and could never exist. It did pretty much nothing: a couple of years later, Gbot still returns for those URLs. It frustrates the heck out of me, because on some days 20% of Googlebot visits are wasted on re-visiting those bad URLs.

Anyway, this is just my observation and may not be applicable to your particular situation. I just wanted to say: "be careful".

g1smd

6:24 pm on Aug 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, all of that. That's why I recommended the [G] for "410 Gone" in the code snippet above.

I just removed 50,000 URLs from the index using that method.

1script

6:58 pm on Aug 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just removed 50,000 URLs from the index using that method.

How can you be sure they're actually removed? I still see Googlebot coming for those 410-ed URLs long afterwards. A stupid programming error injected tens of thousands of bogus URLs into category listings on a forum site in the fall of 2009. I noticed and fixed the error, returning 410 on malformed URLs, in January 2010, and I'm still looking at roughly 100 hits by Googlebot on some of those URLs today.
They are static-looking and stand out easily because they all look like

/something-Page1-1.htm and /something-Page2-2.htm

whereas the actual proper URLs are

/something-1.htm and /something-2.htm.

I made sure URLs like this could no longer be produced on the site back in January 2010, and I've kept tripping over them in my stats ever since.
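For a pattern that distinct, the 410 can be served by a single mod_rewrite rule; a minimal sketch, assuming the bogus URLs always carry the "-PageN-N" suffix described above (untested):

# Serve "410 Gone" for the bogus /something-Page1-1.htm URLs while
# leaving the real /something-1.htm URLs untouched.
RewriteRule -Page\d+-\d+\.htm$ - [G]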

g1smd

7:17 pm on Aug 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google never forgets a URL and will check it periodically, essentially forever. You can't stop that. The 410 response does keep all those URLs out of the index, though. It also stops the proliferation of further links.

If you want crawling to stop, then also add this to your robots.txt file:
User-agent: *
Disallow: /*__

(however, some duff URLs will probably still appear in the SERPs, as URL-only entries, for a very long time).

You still need the .htaccess rules in place for user agents that ignore the robots.txt file entry.