
Sitemaps, Meta Data, and robots.txt Forum

Google indexing blocked content
fatpeter
10+ Year Member

Msg#: 4338354 posted 6:53 am on Jul 12, 2011 (gmt 0)

A while ago I had the following directive in robots.txt:

User-Agent: *
Disallow: /cgi-bin/

but I had a problem with AdSense not showing adverts on pages below /cgi-bin/,

for example cgi-bin/links/showpicture.cgi?ID=14063

I didn't want any content on the site under /cgi-bin/ indexed, as it is all dupe content, and the previous directive seemed to work just fine.

I changed the directive to:

User-Agent: *
Disallow: /cgi-bin/

User-Agent: MediaPartners-Google
Allow: /cgi-bin/

to allow the AdSense bot.

Now Google has started to index 80,000 pages under /cgi-bin/.

Is my directive wrong? I've searched and searched but I can't find a reason why they are indexing these pages...
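For reference, here is the whole file as described above, annotated with how each crawler should match it under standard robots.txt group selection (a crawler obeys only the most specific User-agent group that matches it, and nothing else). The comments are an interpretation added here, not part of the original file:

# Applies to any crawler without a more specific group of its own.
# Googlebot has no group here, so it falls back to this one and
# should stay out of /cgi-bin/.
User-Agent: *
Disallow: /cgi-bin/

# The AdSense crawler matches this group exactly, obeys only this
# group, and so may fetch /cgi-bin/ pages to target adverts.
User-Agent: MediaPartners-Google
Allow: /cgi-bin/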

 

g1smd
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

Msg#: 4338354 posted 7:41 am on Jul 12, 2011 (gmt 0)

You have allowed Google in. Try adding

User-agent: Googlebot
Disallow: /cgi-bin


to what you already have, or add the meta robots noindex tag to all of the pages in /cgi-bin.
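Assembling that suggestion with the file from the first post, a sketch of the combined robots.txt might look like this:

User-Agent: *
Disallow: /cgi-bin/

# With a group of its own, Googlebot obeys this group instead of
# falling back to the * group.
User-agent: Googlebot
Disallow: /cgi-bin

User-Agent: MediaPartners-Google
Allow: /cgi-bin/

The alternative, the meta robots noindex tag, goes in the <head> of every page served from /cgi-bin/:

<meta name="robots" content="noindex">

One caveat: the noindex can only take effect if Googlebot is allowed to fetch the page and read it, so the two fixes are alternatives rather than a pair. Also worth knowing: Google can list a robots.txt-blocked URL as a bare, title-less entry when other pages link to it, which may be why blocked pages still show up in the index.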
Pfui
WebmasterWorld Senior Member, 5+ Year Member

Msg#: 4338354 posted 6:49 pm on Jul 20, 2011 (gmt 0)

Then, after your robots.txt and sitemap file and meta tags are good to go, use Google Webmaster Tools to confirm everything. And then remove a directory whole-hog from Googlebot's reach:

-> Site configuration
--> Crawler access (where you can test robots.txt)
---> Remove URL

See also the following page linked-to from "Crawler access": "Do I need to make changes to content I want removed from Google?"

fatpeter
10+ Year Member

Msg#: 4338354 posted 8:17 pm on Jul 20, 2011 (gmt 0)

Hi

Thanks for the replies. What I don't understand is this...

I had cgi-bin blocked from all robots with an exception for User-agent: Mediapartners-Google.
When I tested a URL under /cgi-bin/ in Webmaster Tools, it said:


www.?.com/cgi-bin/?ID=?

Googlebot
Blocked by line 59: Disallow: /cgi-bin

Googlebot Mediapartners-Google
Allowed by line 21: Allow: /cgi-bin/


So you would think that would be OK.

I've made some changes, but a week later there are still over 100,000 cgi-bin pages showing for a site: command.

As for removing them with the "Remove URL" tool:

Will that work for a directory?

Do I really need to remove them if they are banned in robots.txt?

lucy24
WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month

Msg#: 4338354 posted 9:17 pm on Jul 20, 2011 (gmt 0)

Yes, you can remove whole directories, but not right this second because they're fixing a teeny weeny little bug in the "Remove" tool (different thread, I think over in the Google subforum). You don't have to remove them, but if you don't, they will stick around for months if not years.

The googlebot goes by its own rules ;)
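Connecting the answers above: if the aim is to get the pages out of the index rather than merely stop them being crawled, one sequence sometimes used (a sketch, not something proposed in this thread) is to let Googlebot re-crawl the pages long enough to see a noindex on each one, then restore the block afterwards:

# Step 1: temporarily let Googlebot back into /cgi-bin/ so it can
# re-crawl the pages and see the noindex tag each one now carries.
User-agent: Googlebot
Allow: /cgi-bin/

# (each /cgi-bin/ page meanwhile serves:)
# <meta name="robots" content="noindex">

# Step 2: once the pages have dropped out of the index, restore:
User-agent: Googlebot
Disallow: /cgi-bin

The directory removal in Webmaster Tools, as described above, is the quicker route once the tool is working again.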
