
How do I stop g bot from crawling directories?

shazam

3:05 am on Sep 28, 2011 (gmt 0)



So on a few WP sites I tried this global translator plugin. It was not such a good decision, so I removed it. I also removed all the directories it made:

/gl/
/it/
/po/ etc.....

I then blocked all these directories in the robots.txt file.
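
For illustration, a minimal robots.txt sketch of that kind of block, assuming the directories sit at the site root:

User-agent: *
Disallow: /gl/
Disallow: /it/
Disallow: /po/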

Since, as we all know, Google doesn't always respect robots.txt these days, I keep getting 66.249.68.245 poking around where it doesn't belong. And since these directories no longer exist and automatically redirect to the same URL with the /gl/ removed, the end result is pesky duplicate content.

The only option I know of is to manually rebuild all these directories and place something like:

<Files *>
# Apache 2.2 access control: refuse every request in this directory
Order allow,deny
Deny from all
</Files>

in the .htaccess for those directories.

I can't seem to find a .htaccess or other method for stopping g bot without rebuilding directories that have no business being rebuilt. They were deleted because I don't want them, and I don't want pesky bots poking around.

Before it gets suggested: no, I am not going to give g more data by putting these sites in Webmaster Tools and using their URL removal tool.

I would prefer to have these directories redirect to the English pages like they already do. This offers the best user experience, but of course is in conflict with g as I am surely getting pinged for dup content.

Has anyone ever tested simply blocking 66.249.68.245 from their sites? Does g just send another IP to ignore the robots.txt? Does your site get penalized or de-indexed?
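
For reference, the kind of IP block being asked about would look something like this in .htaccess (Apache 2.2 syntax; a sketch of the question, not a recommendation):

# Allow everyone except this single Googlebot IP
Order allow,deny
Allow from all
Deny from 66.249.68.245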

lucy24

4:22 am on Sep 28, 2011 (gmt 0)




If you start trying to do it by IP you will be playing whack-a-mole with g### for the rest of your life.

Are there any legitimate human browsers that happen to contain the string "bot" in their names? If not, a simple

# Forbid any user-agent containing "bot" from the removed directories
# ([NC] added here so the match is case-insensitive):
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule ^(gl|it|po)/ - [F]

should get rid of everyone. (That's assuming top-level directories; otherwise use [^/]+/ in place of the ^ anchor.) G### can hardly complain about being refused access to a place they weren't supposed to be trying in the first place.
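
Spelled out, the variant mentioned above for directories nested one level down would look something like this (a sketch; the pattern is unanchored, so it matches the language directory at any depth):

RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule [^/]+/(gl|it|po)/ - [F]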

Come to think of it, 403 doesn't get listed in GWT Crawl Errors, does it? Only 404 and 410.

tedster

4:52 am on Sep 28, 2011 (gmt 0)




I would prefer to have these directories redirect to the English pages like they already do. This offers the best user experience, but of course is in conflict with g as I am surely getting pinged for dup content.

The whole robots.txt issue should just evaporate if the redirect is technically sound. The URL should change in the browser's address bar when you request the old URL, and a header checker should show a 301 status for the original URLs.

If the bot gets that experience, then you don't have duplicate content.
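
For illustration, a minimal sketch of such a technically sound redirect in .htaccess, assuming top-level language directories whose paths map one-to-one onto the English URLs:

# Permanently redirect /gl/..., /it/..., /po/... to the English URL
RewriteEngine On
RewriteRule ^(?:gl|it|po)/(.*)$ /$1 [R=301,L]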

shazam

12:00 pm on Sep 29, 2011 (gmt 0)



Thanks for the suggestions.

I would rather just block ALL traffic, not just the bots; treating bots differently may look like cloaking if they do a manual inspection.

The whole thing is just silly. A guy should be able to simply delete some folders/pages and redirect them to the old English-only version without worrying about nonsense like dup content or cloaking. These days almost anything you do will get penalized in one form or another. I miss the days when we just had to make good websites with quality content and the search engines followed along.

It's also starting to look like robots.txt files are a complete waste of time.

phranque

1:30 pm on Sep 29, 2011 (gmt 0)




410 Gone:

# Serve 410 Gone for anything under the removed directories
RewriteRule ^(gl|it|po)/ - [G]

lucy24

9:18 pm on Sep 29, 2011 (gmt 0)




<drifting o/t>
The whole thing is just silly. A guy should be able to simply delete some folders/pages and redirect them to the old English-only version without worrying about nonsense like dup content or cloaking. These days almost anything you do will get penalized in one form or another. I miss the days when we just had to make good websites with quality content and the search engines followed along.

I know the feeling. I've got one directory that is roboted-out-- perfectly fine for humans, I just don't want it indexed. And IT FEELS SO ### GOOD to be able to move files and rearrange files and split files and fiddle with links without once having to think about how many things I will now need to add to .htaccess-- and keep them there for all eternity-- for the sole benefit of search engines.
</end drift>

And oops. Sorry, phranque, of course you're right. [G], not [F].

g1smd

5:38 pm on Jan 8, 2012 (gmt 0)




automatically redirect to the same URL with the /gl/ removed ... pesky duplicate content

simply delete some folders/pages and redirect them to the old English-only version without worrying about nonsense like dup content

Redirects are not duplicate content.

Duplicate content is same content served with 200 OK status at multiple URLs.

The danger of "redirecting everything" is "infinite URL space".
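
To make that danger concrete, a catch-all redirect like the one sketched earlier will happily 301 URLs that never existed:

# /gl/no-such-page gets a 301 to /no-such-page instead of a 404,
# so a crawler can invent endless /gl/... URLs and receive a
# redirect for every one of them:
RewriteRule ^(?:gl|it|po)/(.*)$ /$1 [R=301,L]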
 
