
Sitemaps, Meta Data, and robots.txt Forum

How do I stop g bot from crawling directories?
shazam
msg:4368078 - 3:05 am on Sep 28, 2011 (gmt 0)

So on a few wp sites I tried this global translator plugin. It was not such a good decision, so I removed it. I also removed all the directories that it made:

/gl/
/it/
/po/ etc.....

I then blocked all these directories in the robots.txt file.
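
(For reference, the blocking rules look something like this - a minimal sketch covering only the three directories named above; the real file lists every directory the plugin created:)

User-agent: *
Disallow: /gl/
Disallow: /it/
Disallow: /po/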

As we all know, google doesn't respect the robots.txt these days, so I keep getting 66.249.68.245 poking around where it doesn't belong. The end result is pesky duplicate content, since all these directories no longer exist and automatically redirect to the same URL with the /gl/ removed.

The only option I know of is to manually rebuild all these directories and place something like:

<Files *>
# Apache 2.2 access control: refuse every request for anything in this directory
Order allow,deny
Deny from all
</Files>

in the .htaccess for these directories.

I can't seem to find an .htaccess or other method for stopping g bot without rebuilding directories that have no business being rebuilt. They were deleted because I don't want them, and I don't want pesky bots poking around.

Before it gets suggested, no, I am not going to give g more data by putting these sites in webmaster tools and using their remove url tool.

I would prefer to have these directories redirect to the English pages like they already do. This offers the best user experience, but of course is in conflict with g as I am surely getting pinged for dup content.

Has anyone ever tested simply blocking 66.249.68.245 from their sites? Does g just send another IP to ignore the robots.txt? Does your site get penalized or de-indexed?
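
(For reference, the kind of block I mean would be something like this in the root .htaccess - Apache 2.2 syntax, using the IP from my logs:)

Order allow,deny
Allow from all
# deny just the crawler IP seen in the logs
Deny from 66.249.68.245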


lucy24
msg:4368090 - 4:22 am on Sep 28, 2011 (gmt 0)

If you start trying to do it by IP you will be playing whack-a-mole with g### for the rest of your life.

Are there any legitimate human browsers that happen to contain the string "bot" in their names? If not, a simple

# match "bot" anywhere in the User-Agent, case-insensitively
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# return 403 Forbidden for the removed language directories
RewriteRule ^(gl|it|po)/ - [F]

should get rid of everyone. (That's assuming top-level directories; otherwise use [^/]+/ in place of the ^ anchor.) G### can hardly complain about being refused access to a place they weren't supposed to be trying in the first place.

Come to think of it, 403 doesn't get listed in GWT Crawl Errors, does it? Only 404 and 410.

tedster
msg:4368096 - 4:52 am on Sep 28, 2011 (gmt 0)

> I would prefer to have these directories redirect to the English pages like they already do. This offers the best user experience, but of course is in conflict with g as I am surely getting pinged for dup content.

The whole robots.txt issue should just evaporate if the redirect is technically sound. The URL should change in the browser's address bar when you request the old URL, and a header checker should show a 301 status for the original URLs.

If the bot gets that experience, then you don't have duplicate content.
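
For example, a rule along these lines (a minimal sketch, assuming mod_rewrite in the site root .htaccess and the three directory names from the first post) gives the bot exactly that experience:

RewriteEngine On
# Strip the old language prefix and issue a single external 301
# to the English URL.
RewriteRule ^(gl|it|po)/(.*)$ /$2 [R=301,L]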

shazam
msg:4368696 - 12:00 pm on Sep 29, 2011 (gmt 0)

Thanks for the suggestions.

I would rather just block ALL traffic, not just the bots. Blocking only the bots may appear to be cloaking if they do a manual inspection.

The whole thing is just silly. A guy should be able to simply delete some folders/pages and redirect them to the old English-only versions without worrying about nonsense like dup content or cloaking. These days, almost anything you do will get penalized in one form or another. I miss the days when we just had to make good websites with quality content and the search engines followed along.

It's also starting to look like robots.txt files are a complete waste of time.

phranque
msg:4368727 - 1:30 pm on Sep 29, 2011 (gmt 0)

410 Gone:

RewriteRule ^(gl|it|po)/ - [G]
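
A slightly fuller sketch, assuming the rule sits in the site root .htaccess with mod_rewrite enabled:

RewriteEngine On
# Answer 410 Gone for anything under the removed language directories.
# [G] sends the 410 and implies [L], so no later rules are applied.
RewriteRule ^(gl|it|po)/ - [G]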

lucy24
msg:4368908 - 9:18 pm on Sep 29, 2011 (gmt 0)

<drifting o/t>
> The whole thing is just silly. A guy should be able to simply delete some folders/pages and redirect them to the old English-only versions without worrying about nonsense like dup content or cloaking. These days, almost anything you do will get penalized in one form or another. I miss the days when we just had to make good websites with quality content and the search engines followed along.

I know the feeling. I've got one directory that is roboted-out-- perfectly fine for humans, I just don't want it indexed. And IT FEELS SO ### GOOD to be able to move files and rearrange files and split files and fiddle with links without once having to think about how many things I will now need to add to htaccess-- and keep them there for all eternity-- for the sole benefit of search engines.
</end drift>

And oops. Sorry, phranque, of course you're right. [G], not [F].

g1smd
msg:4404969 - 5:38 pm on Jan 8, 2012 (gmt 0)

> automatically redirect to the same URL with the /gl/ removed, is the pesky duplicate content

> simply delete some folders/pages and redirect them to the old English-only versions without worrying about such nonsense like dup content

Redirects are not duplicate content.

Duplicate content is the same content served with a 200 OK status at multiple URLs.

The danger of "redirecting everything" is "infinite URL space".
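
For example (a sketch using the directory names from this thread): if every /gl/* request blindly 301s to /*, then /gl/gl/gl/anything redirects to /gl/gl/anything, then to /gl/anything, and so on, and none of those URLs ever returns a 404 - a crawler can request an unbounded number of them. Testing that the stripped URL actually exists keeps the URL space finite:

RewriteEngine On
# Redirect only when the stripped path maps to a real file; anything
# else falls through to an ordinary 404 or 410.
# (Assumes file-based URLs; WordPress permalinks need a different test.)
RewriteCond %{DOCUMENT_ROOT}/$2 -f
RewriteRule ^(gl|it|po)/(.*)$ /$2 [R=301,L]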
