
Forum Moderators: mack


What I like better about MSN compared to Google

so far

   
10:47 pm on Jul 4, 2004 (gmt 0)

10+ Year Member



My robots.txt restricts crawling of anything in the /legal directory. My privacy page, which is in the /legal directory, also restricts crawling and indexing via meta tags.
1. On a site:mydomain.com search, Google still lists my privacy page as one of the web pages on my site, because it found the link on my home page (although it does not show any content from that page and obeys the noindex tag).
2. MSN shows all the pages of my site if I do a site:domain.com, EXCEPT this privacy page link and another one where I have specified NoIndex.

So... what I like is that MSN fully obeys the robots.txt and NoIndex tags, while Google shows those links... maybe just to boost the number of pages in its index.
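For anyone following along, the setup described above would look something like this (the domain and file names are illustrative, not taken from the actual site). First, a robots.txt at the site root:

```
# robots.txt -- forbids compliant crawlers from fetching anything under /legal
User-agent: *
Disallow: /legal
```

and then, as a second line of defence, in the <head> of the privacy page itself:

```html
<!-- belt-and-braces: only seen if a bot fetches the page anyway -->
<meta name="robots" content="noindex">
```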

8:30 am on Jul 9, 2004 (gmt 0)

10+ Year Member



My robots.txt restricts crawling of anything in the /legal directory. My privacy page, which is in the /legal directory, also restricts crawling and indexing via meta tags.

If your robots.txt prohibits crawling of the directory, then how is Google supposed to see your meta tags?

10:10 am on Jul 9, 2004 (gmt 0)

10+ Year Member



That's just an additional precaution: if a bot ever gets to the page, I tell it not to index it.
10:45 am on Jul 9, 2004 (gmt 0)

10+ Year Member



what I like is that MSN fully obeys the robots.txt and NoIndex tags, while Google shows those links..

Did Googlebot try and retrieve URLs forbidden in robots.txt? That is what the Standard for Robots Exclusion is all about - retrieval. The main reason it was introduced was to stop robots getting 'lost' in infinite URL spaces generated by CGI programs - not to stop a search engine linking to a page.

[robotstxt.org...]
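py9jmas's point — that robots.txt governs retrieval, not listing — can be seen with Python's standard-library robots.txt parser. The domain and paths here are hypothetical, mirroring the thread's /legal example:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules described earlier in the thread
rules = [
    "User-agent: *",
    "Disallow: /legal",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler must not *fetch* the privacy page...
print(parser.can_fetch("Googlebot", "http://www.example.com/legal/privacy.html"))  # False

# ...but the rest of the site is fair game, and nothing in the
# standard forbids an engine from listing a disallowed URL it has
# only ever seen as a link on another page.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))  # True
```

The exclusion standard only answers "may I fetch this URL?" — which is exactly why a link-only listing does not violate it.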

11:26 am on Jul 9, 2004 (gmt 0)

10+ Year Member



py9jmas, okay, so what is the way to tell a search engine not to link to a page? And what is the use of listing (linking) a page, just inflating the page count, if the page should not be indexed?
12:34 pm on Jul 9, 2004 (gmt 0)

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>Did Googlebot try and retrieve URLs forbidden in robots.txt? That is what the Standard for Robots Exclusion is all about - retrieval. The main reason it was introduced was to stop robots getting 'lost' in infinite URL spaces generated by CGI programs - not to stop a search engine linking to a page.

Right. Look at the name of the file: robots.txt. Basically it is how a site tells a spider "I don't want your bot wasting the bandwidth *I* pay for". The idea wasn't privacy. If someone wants privacy, then don't put the content on the WWW without password protection. Anything less is nothing but an attempt at security by obscurity.

12:49 pm on Jul 9, 2004 (gmt 0)

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



what I like is that MSN fully obeys the robots.txt

Errhhhmmm! Are you talking about the same MSN and the same internet as the rest of us? ...cos some of their bots have been totally ignoring robots.txt whenever they feel like it for a long time now... and are currently doing so again...

Maybe the name of the game is to eventually be able to put up a bigger "indexed pages" number on the search page than Google... but if they keep this up there are gonna be some very specific robot bans going in all over...

On the other hand, Redmond could send out checks for all the bandwidth they are costing us while they do their market research... < only in my dreams >

12:53 pm on Jul 9, 2004 (gmt 0)

WebmasterWorld Administrator mack is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



That's just an additional precaution: if a bot ever gets to the page, I tell it not to index it.

It's very possible for pages within prohibited areas to be displayed in the SERPs. This can happen when Google knows the page exists because there are links pointing to it. Very often the page will appear in the results as a title with no description. The title will be based on anchor text, I assume.

Mack.

9:54 pm on Jul 9, 2004 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



What mack said. If a page is in robots.txt, we won't crawl it, but we can still return it as a search result if we have good evidence that the page is relevant to a query. In this case, we'll return just the url (no title and no cached page because we didn't fetch the page itself).

Here's a good example of why that can help users. For a long time, the California Department of Motor Vehicles (DMV) had a robots.txt that didn't let search engines crawl their site. But for a query like "california dmv" we could still return the proper url, even though we weren't able to fetch the page.

sdani, if you don't want the page to show up at all, you can guarantee that by letting Googlebot fetch the page so it can see the noindex meta tag.

For the curious readers: we were eventually able to convince the DMV to let search engines crawl the site, but we did have to make an appointment and then wait in line for a while. ;)
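Putting GoogleGuy's advice into practice would mean something like the following (paths illustrative, matching the /legal example from earlier in the thread). Stop disallowing the page in robots.txt so it can be fetched:

```
# robots.txt -- /legal is no longer disallowed, so bots can fetch
# the privacy page and see its meta tag
User-agent: *
Disallow:
```

and let the meta tag on the privacy page do the work on its own:

```html
<!-- /legal/privacy.html: now that the page can be fetched, this tag
     is actually seen and the URL drops out of the results entirely -->
<meta name="robots" content="noindex">
```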

10:02 pm on Jul 9, 2004 (gmt 0)

10+ Year Member



Thanks GoogleGuy... I did not know that if I allow crawling in robots.txt and specify a noindex meta tag, then the URL will not show up at all.

I think this works (for me at least).
SD