Forum Moderators: Robert Charlton & goodroi

Google Indexing URLs blocked by robots.txt

benjones

1:05 pm on Feb 27, 2007 (gmt 0)

10+ Year Member



Hi,

We recently made a change to our site regarding the link tags in the HTML. We run a classifieds site, and the detail of each advert pops up in a new window.

Before February, the href attribute contained only a '#' to stop it loading a URL, and an onclick handler actually opened the JavaScript popup window.

When we redesigned the site after February, the href now contains the real URL of the popup, but the onclick handler still opens the JavaScript popup and returns false to stop the href from taking effect.
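To illustrate, the change was roughly like this (the script name and parameters are made up for the example, not our real ones):

Old markup -- the href carries no URL and the onclick does all the work:

<!-- example only: /detail.cgi and the ad parameter are hypothetical -->
<a href="#" onclick="window.open('/detail.cgi?ad=12345','detail','width=600,height=400'); return false;">View advert</a>

New markup -- the href carries the real URL, but the onclick still opens the popup and returns false:

<a href="/detail.cgi?ad=12345" onclick="window.open(this.href,'detail','width=600,height=400'); return false;">View advert</a>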

The scripts that pop up the advert detail are disallowed in our robots.txt file, and Google has kept away from them for a few years now. Since the relaunch, however, Google has started indexing the URLs of the popups. It doesn't seem to actually visit the pages; it just lists the URLs it comes across, as there is no cached page or page title in the listings!
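Our robots.txt has disallowed the popup script all along, something like this (again using the example path, not the real one):

# example only: the real script path differs
User-agent: *
Disallow: /detail.cgi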

Can someone please let me know if this behaviour is correct? Is Google just indexing the URLs and ignoring the pages?

thanks,

Ben

jdMorgan

1:49 pm on Feb 27, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Robots.txt has nothing to do with indexing -- defined as putting the URL in the index. As designed, robots.txt asks robots not to *fetch* the Disallowed URL.

As such, it was intended primarily as a bandwidth control mechanism, for example, to keep robots from spidering 'infinite' dynamic URL spaces in cgi-bin directories.

Therefore, what you are seeing is that Googlebot has complied with your robots.txt by not actually fetching the pages, despite finding their URLs on the pages that it is allowed to fetch. As a result, you get what we have traditionally been calling a "URL-only listing" in the SERPs. This is because Google and several of the other majors have been pursuing what they call the "deep Web," and seem to think that there is some great value in these URL-only listings (I disagree, but then, I wasn't asked). ;)

If you want to eliminate these listings, you should *allow* the pages to be fetched by removing their Disallow lines in robots.txt, and instead use the on-page HTML <meta name="robots" content="noindex,nofollow"> tag. This functions differently, and results in the "don't mention it" behaviour you want.
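For example, each popup detail page would carry the tag in its <head> section (just a sketch, assuming the popups are ordinary HTML pages):

<head>
<title>Advert detail</title>
<!-- tells compliant robots not to index this page or follow its links -->
<meta name="robots" content="noindex,nofollow">
</head>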

Remember that in order to read this meta-tag, the spider must fetch the page, so both steps must be done -- the Disallow for the page must be removed from robots.txt so that its meta-tag can be read.
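Using the example path from above, that means dropping the Disallow line from robots.txt (an empty Disallow value permits everything, if nothing else needs blocking):

# example only -- the popup script is no longer Disallowed
User-agent: *
Disallow:

and letting the noindex,nofollow meta-tag on each popup page keep those URLs out of the index.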

More info: [robotstxt.org...]

Jim