Forum Moderators: Robert Charlton & goodroi
From robots.txt FAQ [robotstxt.org]:
How do I prevent robots scanning my site?
The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:
User-agent: *
Disallow: /
From Introduction [robotstxt.org]:
...These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
From Google FAQ [google.com]:
1. How should I request that Google not crawl part or all of my site?
The standard for robot exclusion given at [robotstxt.org...] provides for a file called robots.txt that you can put on your server to exclude Googlebot and other web crawlers. (Googlebot has a user-agent of "Googlebot".)...
It's also Mozilla Bot; who knows what the purpose of this bot is?
You're not the only person it has happened to:
http://www.google.com/search?hl=en&q=googlebot+mozilla [google.com]
Anybody can see and study all the pages in the public directories of the site.
We should simply be aware that pages forbidden in robots.txt may influence the rank of the pages that are allowed.
Vadim.
Whatever it means technically doesn't matter. Google will still read the files, but not list them.
If robots.txt says
User-agent: whatever
Disallow: /foo/
then accessing anything within /foo/ is a no-no for that bot, and it is irrelevant whether it would be listed in a SERP or not. The point of disallowing a directory, or even /, is precisely that the bot should not crawl it.
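The check described above can be sketched with Python's standard-library robots.txt parser (the domain and paths here are hypothetical, and a real crawler would fetch robots.txt over HTTP rather than parse a string):

```python
import urllib.robotparser

# A hypothetical robots.txt with one disallowed directory.
robots_txt = """\
User-agent: *
Disallow: /foo/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Anything under /foo/ is off-limits to every compliant bot...
print(rp.can_fetch("Googlebot", "http://example.com/foo/page.html"))
# ...while the rest of the site remains crawlable.
print(rp.can_fetch("Googlebot", "http://example.com/bar/page.html"))
```

A compliant bot makes exactly this check before every request, which is why whether the URL appears in a SERP is a separate question from whether the bot may fetch it.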
Slightly OT:
It is also well known that Googlebot-Image behaves badly and ignores anything listed under User-agent: * in robots.txt. You need to copy all the lines listed under User-agent: * into a new User-agent: Googlebot-Image section.
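A minimal sketch of that workaround (the /images/ path is hypothetical): the rules under User-agent: * are simply duplicated in a section addressed to the image bot by name.

```
User-agent: *
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /images/
```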
Interestingly, Yahoo-MMCrawler does not behave well either.
Wrong again.
They will show the files as URL-only entries in the SERPs. They appear as URL-only precisely because Google was asked not to index the content. Robots.txt says nothing about recording that a URL simply exists, so Google does record it.
You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days.
Right. The basic idea of robots.txt was to stop bots from wasting bandwidth that a site pays for. If I have a site with 10,000 pages and I don't care whether it is listed in Google, I don't want to pay for the bandwidth of Googlebot accessing those pages over and over again.
Googlebot could have fallen into the bot trap, or indexed banned pages, because of the problem Googlebot has with 302 redirects. Maybe a 302 redirect caused Googlebot to hit your robot trap without knowing it had been redirected there from a different site, so it did not realize it needed to request the robots.txt file for your site.
Possible? yes/no
It is definitely some kind of problem, but we can only speculate on what it is.
As several people mentioned before, robots.txt is a ban for the robot, and they should obey it. Every robot should check robots.txt before accessing any documents on a given domain. It's as simple as that.
Even wget on *nix systems obeys robots.txt by default (during recursive retrieval). If you try to download a page that is disallowed, it will be skipped.
Semantics: that's still not indexing them for public consumption.
> Every robot should check robots.txt before accessing
> any documents on a given domain. It's simple as that.
I agree. Unfortunately, that is not Google's interpretation of the worthless, toothless, all-but-useless robots exclusion standard.
The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.
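For example, a standard Robots META tag that asks compliant robots neither to index the page nor to follow its links looks like this (it goes inside the page's head element):

```html
<meta name="robots" content="noindex, nofollow">
```

Unlike a robots.txt disallow, which prevents crawling but not the recording of the URL, a noindex META tag lets the bot fetch the page and explicitly tells it not to index it.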
g1smd said: "They will show the files as URL-only entries in the SERPs. They appear as URL-only, only because Google is asked to not index the content. Robots.txt says nothing about recording that the URL simply exists, so Google does record it. You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days."
Hello everybody
So what g1smd says is correct, as far as I have seen for my website: after you add links to robots.txt, they don't show up with a snippet, rather only as a URL.
But if I want that URL-only entry to be removed as well, do I have to type something like this into Google's remove URL tool:
" [somename.com...] "
Hope someone can make it clear for me.
Thanks a lot to everyone.
Regards,
KaMran:-)