Google Indexing Nonexistent Files

Forum Moderators: goodroi

Tandem

9:22 pm on Apr 11, 2009 (gmt 0)

My Error Logs show quite a few 404 entries.
The referer is usually google.gr, google.ee, google.com.br, etc.

Please keep in mind, this is not a case of broken links, renamed or removed files. The files in question never existed on the sites (as far as I know).

Also, the sites and the directories that the index entries point to all have robots.txt files with the following:
User-agent: *
Disallow: /

The sites are for private use and are not indexed by SEs. I am aware that bots can ignore the robots.txt files.
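As a rough illustration of what those two robots.txt lines mean to a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The `can_crawl` helper, the `example.com` host and the file name are placeholders made up for this example, not anything from the sites in question.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and the file name are hypothetical placeholders.
    print(can_crawl("Googlebot", "http://example.com/some-serial-free.html"))
```

Note that this only answers "may the crawler fetch the URL?". It says nothing about whether a search engine may list the URL, which is the distinction the rest of the thread turns on.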

What concerns me is that the file names usually are something along the lines:
....serial-free.html
....CD-key-changer.html
...something-sex.html and so on.

Does anyone have any ideas about what is going on? How do these end up in Google's index?

g1smd

9:44 pm on Apr 11, 2009 (gmt 0)

Are they indexing them (i.e. those URLs appear in the SERPs) or are they merely requesting them from the server (and there's no sign of those URLs in the SERPs)?

Tandem

9:51 pm on Apr 11, 2009 (gmt 0)

I checked the SERPs and yes, these non-existent files are in the index.

[edited by: Tandem at 9:52 pm (utc) on April 11, 2009]

Chris_R

9:57 pm on Apr 11, 2009 (gmt 0)

Are they cached?

If so - there is a problem.

If not - and you have them returning a 404 - you can use webmaster tools to delete them.

robots.txt doesn't tell Google not to index something - just not to crawl it. If Google can't see the pages as 404s (as you have prevented it from crawling them) - it will sometimes index them if there are links to them.

This puts you into a vicious cycle. You have to use either webmaster tools - or the X-Robots-Tag header (through Apache or whatever) - robots.txt itself won't fix this.
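For anyone unsure what "the X-Robots-Tag header" looks like on the wire, here is a minimal sketch using Python's built-in `http.server`, assuming you only want to see the response shape and not run it in production. The handler name `NotFoundNoIndexHandler`, the `make_server` helper and the loopback address are assumptions for illustration; on a real site the header would normally be set in the web server configuration. Also keep in mind that a crawler can only see this header on URLs it is allowed to fetch.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundNoIndexHandler(BaseHTTPRequestHandler):
    """Answer every GET with 404 plus an X-Robots-Tag: noindex header."""

    def do_GET(self):
        body = b"Not Found\n"
        self.send_response(404)
        # Tells compliant search engines not to list this URL.
        self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; remove to see request logging


def make_server(port: int = 0) -> HTTPServer:
    """Build a server bound to the loopback address (port 0 = any free port)."""
    return HTTPServer(("127.0.0.1", port), NotFoundNoIndexHandler)
```

To try it, call `make_server(8000).serve_forever()` and request any path on that port; the response status should be 404 and the headers should include `X-Robots-Tag: noindex`.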

Tandem

10:02 pm on Apr 11, 2009 (gmt 0)

Chris:

Yes, the sites are returning 404, that is how I found out about this.

Since the files were never on that server, why would any of the Google bots add them to their index?

My understanding was that Googlebot verifies the existence of files before indexing them.

Chris_R

10:14 pm on Apr 11, 2009 (gmt 0)

My understanding was that Googlebot verifies the existence of files before indexing them.

They can't in your case - as they have been blocked from crawling them. Google works on anchor text and links to a great extent. If something is linked to but can't be crawled, Google can still index it. It thinks content is there (because it is linked to). Google has done this from the very beginning. Their original papers talk about being able to see an email as a relevant result - even though it couldn't crawl it.

So in your case...

1. Google sees link from somewhere else
2. Google adds your URL to list to crawl
3. Google tries your site, but finds robots.txt prohibiting it from crawling.
4. Google adds your URL to index based on anchor text and incoming links.
5. Google tries your site again.

Still banned - it never sees 404 - and therefore never removes it. Eventually the links to it will probably drop off.
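The steps above can be sketched as a toy decision function. This is only a simplified model of the explanation in this post, not Google's actual algorithm; the `Url` class, its field names and the state labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Url:
    has_inbound_link: bool  # step 1: a link to the URL exists somewhere else
    robots_blocked: bool    # step 3: robots.txt disallows crawling it
    status: int             # what the server would return if it were fetched


def index_state(url: Url) -> str:
    """Toy model of the steps above (hypothetical labels)."""
    if not url.has_inbound_link:
        return "unknown"  # never discovered, so never queued for crawling
    if url.robots_blocked:
        # The crawler never fetches the URL, so the 404 is never observed
        # and the entry can persist based on links and anchor text alone.
        return "indexed (URL only)"
    if url.status == 404:
        return "dropped"  # the crawler saw the 404 and can remove the entry
    return "indexed (with content)"


if __name__ == "__main__":
    # The situation described in this thread: linked to, blocked, returns 404.
    print(index_state(Url(has_inbound_link=True, robots_blocked=True, status=404)))
```

The point of the sketch is the second branch: while the URL is blocked, the value of `status` is never consulted at all.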

But you can remove it using webmaster tools.

Keep in mind that links like this usually don't show up for anything competitive. You may see them when doing a site: search, but usually not when searching for something competitive.

g1smd

10:30 pm on Apr 11, 2009 (gmt 0)

*** these non-existent files are in the index. ***

Those URLs can only be indexed for content if they return 200 OK or 302 and the bots reached the pages.

If those URLs are robots.txt excluded, then there should only be URL-only entries in the SERPs for those URLs (even if the URLs were returning 404 - because bots would never get to see that).

If there is content listed in the title and/or snippet and/or content is shown in Google's cache for those URLs, then those URLs *must* by definition have returned a 200 OK status at some time in the past and actually returned real content too. In that case you need to look to some sort of server exploit or hack having been perpetrated against your site.

[edited by: g1smd at 10:32 pm (utc) on April 11, 2009]

Tandem

10:31 pm on Apr 11, 2009 (gmt 0)

I appreciate your input Chris.

My main concern is not so much that these are showing up in a SE index, but that someone is creating these links in the first place.

[edited by: Tandem at 11:23 pm (utc) on April 11, 2009]

Tandem

10:42 pm on Apr 11, 2009 (gmt 0)

Thanks g1smd,
that is what I am worried about, some malicious intent behind all this.

g1smd

10:42 pm on Apr 11, 2009 (gmt 0)

You can't stop people creating links, but you should let search engine bots see your site returning "404 Not Found" for those requests.

Tandem

10:54 pm on Apr 11, 2009 (gmt 0)

Not sure how to do that.
How do I configure this so that directories with limited access return a 404?

Chris_R

11:52 pm on Apr 11, 2009 (gmt 0)

If your (real) content is in a subdirectory - you could restrict Google just to that directory with robots.txt.

How do I configure this so that directories with limited access return a 404?

It still returns a 404 - Google just won't see it

some malicious intent behind all this.

Usually it is not the case - it is either typos, someone brute forcing webspam, or something else.

Google takes this into account - so I wouldn't worry too much about it.

I run into this kind of stuff all the time -- if it isn't that many files - using webmaster tools is VERY easy. PLUS, even if you do throw up a 404 - if the links are still out there - Google will still come back looking for them.

As long as the pages are in the robots.txt - Google will allow you to remove them through webmaster tools.

Other than that - you will have to use an X-Robots-Tag header.

I have Google coming back looking for pages that are 3 years old sometimes.

In the future - putting your own stuff in subdirectories (if you don't already) gives you a lot of flexibility with Google.

Tandem

12:00 am on Apr 12, 2009 (gmt 0)

Thank you Chris.

z329224946

4:53 am on Apr 12, 2009 (gmt 0)

I think this is Google's problem.

martinibuster

5:19 am on Apr 12, 2009 (gmt 0)

One reason not mentioned, and it may or may not apply, is an IP address change in tandem with faulty DNS settings. This will cause another site now hosted on your former IP to show content under your domain name. This happened to me and I lost all my rankings for two months because of this mistake caused by my web host. The host had some internal settings screwed up; it had nothing to do with my registrar settings. Google still thinks the content is on my site, and a site search returns URLs that don't exist on my server. But at least my rankings have returned, though not as well as before.

Tandem

6:22 pm on Apr 12, 2009 (gmt 0)

I don't think this is the case as I have been using the same host for a long time now with the same IP.

This is the latest entry:
[Sun Apr 12 06:26:29 2009] [error] [client 78.160.###.###] File does not exist: /home/SITENAME/public_html/DIR/Smileys, referer: [google.com.tr...]
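For anyone who wants to tally entries like this one, here is a minimal sketch that pulls the timestamp, client, missing path and referer out of error-log lines of the shape shown above. The regular expression is an assumption based only on that single sample line (it would misread paths that contain a comma), and the names `parse_missing_file` and `count_missing_paths` are made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the single sample line quoted in this post.
LINE_RE = re.compile(
    r"\[(?P<time>[^\]]+)\] \[error\] \[client (?P<client>[^\]]+)\] "
    r"File does not exist: (?P<path>[^,]+)(?:, referer: (?P<referer>.+))?$"
)


def parse_missing_file(line: str):
    """Return a dict of fields for a 'File does not exist' line, else None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_missing_paths(lines) -> Counter:
    """Count how often each missing path was requested."""
    counts = Counter()
    for line in lines:
        entry = parse_missing_file(line)
        if entry:
            counts[entry["path"]] += 1
    return counts


if __name__ == "__main__":
    # The log entry exactly as posted (IP and referer already masked).
    sample = (
        "[Sun Apr 12 06:26:29 2009] [error] [client 78.160.###.###] "
        "File does not exist: /home/SITENAME/public_html/DIR/Smileys, "
        "referer: [google.com.tr...]"
    )
    print(parse_missing_file(sample))
```

Feeding the whole error log through `count_missing_paths` gives a quick view of which nonexistent files are requested most often.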

My site is no longer among the results because I removed it from the index using Google's Remove URLs tool.

[edited by: Tandem at 6:27 pm (utc) on April 12, 2009]

martinibuster

6:37 pm on Apr 12, 2009 (gmt 0)

Yes, it might not be your issue, but just to be clear for others who might experience this: the phenomenon I experienced occurred with the same web host; no change of host occurred. The only thing that changed was the IP, due to their screwup, plus faulty settings on their end, which caused traffic for my domain to show up on the old IP, displaying someone else's content while still showing my domain as the URL.

Josun

5:28 am on May 20, 2009 (gmt 0)

Did you try deleting all of your files from your web host's server completely?

Tandem

5:49 am on May 20, 2009 (gmt 0)

There is nothing to delete. The files in question never existed on the sites.

baanmaha

6:34 am on Jun 20, 2009 (gmt 0)

Thank you for the advice.