Please keep in mind, this is not a case of broken links, renamed or removed files. The files in question never existed on the sites (as far as I know).
Also, the sites and the directories that the indexed URLs point to all have robots.txt files with the following:
User-agent: *
Disallow: /
The sites are for private use and are not indexed by SEs. I am aware that bots can ignore the robots.txt files.
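As a quick check of what those two robots.txt lines mean to a well-behaved crawler, here is a minimal sketch using Python's standard urllib.robotparser. The example.com URL and the helper name is_crawl_allowed are placeholders for illustration, not anything from the sites in question.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL; any path on the site is disallowed by "Disallow: /".
    print(is_crawl_allowed("Googlebot", "http://example.com/DIR/some-page.html"))
```

This only says whether fetching is permitted. It says nothing about whether the URL can be listed in an index.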
What concerns me is that the file names are usually something along the lines of:
....serial-free.html
....CD-key-changer.html
...something-sex.html and so on.
Does anyone have any idea what is going on? How do these end up in Google's index?
If so - there is a problem.
If not - and you have them returning a 404 - you can use webmaster tools to delete them.
robots.txt doesn't tell Google not to index something - just not to crawl it. If Google can't see that the pages return 404 (because you have prevented it from crawling them), it will sometimes index them anyway if there are links pointing to them.
This puts you into a vicious cycle. You have to use either Webmaster Tools or the X-Robots-Tag header (set through Apache or whatever server you use) - robots.txt by itself won't fix this.
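To illustrate the X-Robots-Tag option, here is a minimal sketch of a WSGI handler that answers with a 404 and adds an X-Robots-Tag: noindex header. The app itself, the port, and the plain-text body are assumptions made for the example; on a real site the header would normally be set in the web server configuration. Note that a crawler only sees this header if it is allowed to fetch the URL, so it has no effect on URLs that robots.txt blocks.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Answer every request with 404 and ask crawlers not to index the URL."""
    body = b"Not Found\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header-based equivalent of a robots "noindex" meta tag.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("404 Not Found", headers)
    return [body]


if __name__ == "__main__":
    # Hypothetical local test server; host and port are arbitrary.
    with make_server("127.0.0.1", 8000, app) as server:
        server.serve_forever()
```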
My understanding was that Googlebot verifies the existence of files before indexing them.
It can't in your case, because it has been blocked from crawling them. Google relies on anchor text and links to a great extent. If someone links to a URL that can't be crawled, Google can still index it. It assumes content is there (because it is linked to). Google has done this from the very beginning. Their original papers talk about being able to return an email address as a relevant result even though it couldn't be crawled.
So in your case...
1. Google sees link from somewhere else
2. Google adds your URL to list to crawl
3. Google tries your site, but finds robots.txt prohibiting it from crawling.
4. Google adds your URL to index based on anchor text and incoming links.
5. Google tries your site again.
Still banned - it never sees 404 - and therefore never removes it. Eventually the links to it will probably drop off.
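The steps above can be sketched as a toy decision function. This is only a model of the behaviour described in this thread, not Google's actual algorithm; the UrlState fields and the returned labels are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    has_inbound_links: bool   # step 1: someone links to the URL
    blocked_by_robots: bool   # step 3: robots.txt disallows crawling
    http_status: int          # what the server would return if fetched


def listing_after_crawl_attempt(url: UrlState) -> str:
    """Toy model of the cycle described above (labels are illustrative)."""
    if not url.has_inbound_links:
        return "not discovered"
    if url.blocked_by_robots:
        # The crawler never fetches the page, so it never sees the 404.
        return "indexed from links only"
    if url.http_status == 404:
        return "dropped"
    return "crawled"


if __name__ == "__main__":
    blocked = UrlState(has_inbound_links=True, blocked_by_robots=True, http_status=404)
    print(listing_after_crawl_attempt(blocked))
```

In this model the only way out of the "indexed from links only" state is for the links to disappear or for the block to be lifted so the 404 becomes visible, which matches the cycle described in the post.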
But you can remove it using webmaster tools.
Keep in mind that URLs like this usually don't show up for anything competitive. You may see them when doing a site: query, but usually not when searching for a competitive term.
Those URLs can only be indexed for content if they return 200 OK or 302 and the bots reached the pages.
If those URLs are robots.txt excluded, then there should only be URL-only entries in the SERPs for those URLs (even if the URLs were returning 404 - because bots would never get to see that).
If there is content listed in the title and/or snippet and/or content is shown in Google's cache for those URLs, then those URLs *must* by definition have returned a 200 OK status at some time in the past and actually returned real content too. In that case you need to look to some sort of server exploit or hack having been perpetrated against your site.
[edited by: g1smd at 10:32 pm (utc) on April 11, 2009]
How do I configure this so that directories with limited access return a 404?
It still returns a 404 - Google just won't see it.
some malicious intent behind all this.
Usually it is not the case - it is either typos, someone brute forcing webspam, or something else.
Google takes this into account - so I wouldn't worry too much about it.
I run into this kind of stuff all the time. If it isn't that many files, using Webmaster Tools is VERY easy. Plus, even if you do throw up a 404, Google will still come back looking for the pages as long as the links are still out there.
As long as the pages are in the robots.txt - Google will allow you to remove them through webmaster tools.
Other than that - you will have to use an X-Robots-Tag header.
I have Google coming back looking for pages that are 3 years old sometimes.
In the future - putting your own stuff in subdirectories (if you don't already) gives you a lot of flexibility with Google.
This is the latest entry:
[Sun Apr 12 06:26:29 2009] [error] [client 78.160.###.###] File does not exist: /home/SITENAME/public_html/DIR/Smileys, referer: [google.com.tr...]
My site is no longer among the results because I removed it from the index using Google's Remove URLs tool.
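For anyone wanting to tally which non-existent files are being requested and where the visitors come from, here is a minimal sketch that parses an Apache "File does not exist" error-log line like the one above. The regular expression and the function name parse_missing_file are assumptions based on that single sample line, so other log formats may need adjusting.

```python
import re

# Pattern derived from the sample error-log entry quoted above.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<level>[^\]]+)\] "
    r"\[client (?P<client>[^\]]+)\] "
    r"File does not exist: (?P<path>.+?)"
    r"(?:, referer: (?P<referer>.+))?$"
)


def parse_missing_file(line):
    """Return the fields of a 'File does not exist' line, or None if no match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "[Sun Apr 12 06:26:29 2009] [error] [client 78.160.###.###] "
        "File does not exist: /home/SITENAME/public_html/DIR/Smileys, "
        "referer: [google.com.tr...]"
    )
    print(parse_missing_file(sample))
```

Counting the path and referer values over a whole log would show whether the requests come mostly from search results or from other sites linking to the bogus URLs.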
[edited by: Tandem at 6:27 pm (utc) on April 12, 2009]