|My robots.txt restricts crawling of anything in the /legal directory. My privacy page, which is in the /legal directory, restricts crawling and indexing via meta tags. |
If your robots.txt prohibits crawling of the directory, then how is Google supposed to see your meta tags?
That's just an additional precaution: if it ever gets to the page, I tell it not to index it.
|what I like is that MSN fully obeys the robots.txt and NoIndex tags, while Google shows those links. |
Did Googlebot try and retrieve URLs forbidden in robots.txt? That is what the Standard for Robots Exclusion is all about - retrieval. The main reason it was introduced was to stop robots getting 'lost' in infinite URL spaces generated by CGI programs - not to stop a search engine linking to a page.
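To make the "retrieval" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the `may_fetch` helper and the example.com URLs are all assumptions made up for illustration; they only mimic the /legal setup described above. The check answers one question only: may this robot fetch this URL?

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mimicking the setup described in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /legal/
"""


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots.txt permits this user agent to retrieve the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only.
privacy_allowed = may_fetch("http://www.example.com/legal/privacy.html")
home_allowed = may_fetch("http://www.example.com/index.html")
print(privacy_allowed, home_allowed)
```

Nothing in this check says anything about linking to or listing the URL; a well-behaved robot that gets a "no" here simply never downloads the page, so it never sees any meta tags inside it either.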
py9jmas, okay, so what is the way to tell a search engine not to link to a page? And what is the use of listing (linking to) a page and just increasing the page count if the page should not be indexed?
>Did Googlebot try and retrieve URLs forbidden in robots.txt? That is what the Standard for Robots Exclusion is all about - retrieval. The main reason it was introduced was to stop robots getting 'lost' in infinite URL spaces generated by CGI programs - not to stop a search engine linking to a page.
Right. Look at the name of the file: robots.txt. Basically it is how a site tells a spider "I don't want your bot wasting the bandwidth *I* pay for". The idea wasn't privacy. If someone wants privacy, then don't put the content on the WWW without password protection. Anything less is nothing but an attempt at security by obscurity.
|what I like is that MSN fully obeys the robots.txt |
Errhhhmmm! Are you talking about the same MSN and the same internet as the rest of us? 'Cos some of their bots have been totally ignoring robots.txt whenever they feel like it for a long time now ... and are currently doing so again ...
Maybe the name of the game is to eventually be able to put up a bigger "indexed pages" number on the "Search page" than Google ... but if they keep this up there are gonna be some very specific robot bans going in all over ...
On the other hand Redmond could send out checks for all the bandwidth they are costing us while they do their market research ....< only in my dreams >
|That's just an additional precaution: if it ever gets to the page, I tell it not to index it. |
It's very possible for pages within prohibited areas to be displayed in the SERPs. This can happen when Google knows the page exists because there are links pointing to it. Very often the page will appear in the results as a title with no description. The title will be based on the anchor text, I assume.
What mack said. If a page is in robots.txt, we won't crawl it, but we can still return it as a search result if we have good evidence that the page is relevant to a query. In this case, we'll return just the url (no title and no cached page because we didn't fetch the page itself).
Here's a good example of why that can help users. For a long time, the California Department of Motor Vehicles (DMV) had a robots.txt that didn't let search engines crawl their site. But for a query like "california dmv" we could still return the proper url, even though we weren't able to fetch the page.
sdani, if you don't want the page to show up at all, you can guarantee that by letting Google see the noindex meta tag by fetching that page.
For the curious readers: we were eventually able to convince the DMV to let search engines crawl the site, but we did have to make an appointment and then wait in line for a while. ;)
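For anyone who wants to see the noindex advice spelled out, here is a rough sketch in Python. It is only a toy model of the behaviour described in this thread, not Google's actual logic: the `RobotsMetaParser` class, the `listing_for` function, the result strings and the sample page are all made up for illustration. The point it tries to show is that the meta tag can only take effect if the page is allowed to be fetched.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def listing_for(blocked_by_robots_txt: bool, html: str) -> str:
    """Toy model of the outcomes described in the thread (an assumption)."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any meta tag in it goes unseen.
        return "url-only listing possible"
    parser = RobotsMetaParser()
    parser.feed(html)
    if "noindex" in parser.directives:
        return "not listed"
    return "normal listing"


# Hypothetical privacy page carrying a noindex meta tag.
PAGE = (
    "<html><head><title>Privacy</title>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>...</body></html>"
)

print(listing_for(True, PAGE))   # blocked in robots.txt
print(listing_for(False, PAGE))  # crawlable, noindex seen
```

In this model, blocking the page in robots.txt and adding noindex to it work against each other: the first stops the robot from ever reading the second.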
Thanks GoogleGuy. I did not know that if I allow crawling in robots.txt and specify the noindex meta tag, then the URL will not show up at all.
I think this works (for me at least).