Can I refuse Googlebot access via htaccess/ web.config?

Forum Moderators: Robert Charlton & goodroi

scooterdude

10:26 am on Jan 2, 2013 (gmt 0)

Hi All

This is rhetorical, of course :)

I now know that Google crawls every URL it finds, whether blocked by robots.txt, noindexed or nofollowed, and indexes them all regardless. OK, sometimes they provide no cache or details, yet sometimes I've seen the page's description under the URL.

Thing is, the pages, largely product pages, are many (000s) and the content comes directly from the supplier, so I try to keep them out of the index.

Crawling these URLs endlessly means Google doesn't get to crawl the pages I've worked on.

I finally realised why sites I thought had a few hundred pages were indicated as having many (000s).

So, if I block just Googlebot from such pages, does anyone know how Google reacts?

goodroi

6:52 pm on Jan 2, 2013 (gmt 0)

I regularly use htaccess to block Google from accessing certain pages and I have not noticed any issues. I don't recall using it to block 100,000s of pages on one site.

Google tends to crawl pages with more traffic and backlinks going to them. If you want Google to pay more attention to a page, see what you can do to boost traffic and/or backlinks.

scooterdude

12:29 am on Jan 3, 2013 (gmt 0)

Hi Goodroi

Uncontrolled, the site could easily reach seven-figure URL totals to crawl. Naturally they'd rank for nothing and not even get indexed, but all that crawling for zip outcome has been painful.

How do you block Gbot?

Can I redirect to a 403 page? Would that stop them from requesting that directory? For example, all calls to URLs in

"/blocked-widgets/"

sent to a single page returning 403 Forbidden.

Sgt_Kickaxe

3:20 am on Jan 3, 2013 (gmt 0)

Redirecting doesn't get rid of the URLs. What you want is for your indexed URLs to return their content and for every other possible URL to return a 404. I realize that's difficult when it's a category page as you mentioned, but you could force a 404 on any URL ending with a trailing slash (or vice versa if your URLs always append one).

404 is the only virtually surefire way to get Google to stop weighing a page against your domain. If a URL has EVER had content on it, Google will indeed crawl it for eternity.
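
A minimal sketch of the trailing-slash idea in .htaccess, assuming mod_rewrite is enabled and that none of the site's real pages end in a slash (both are assumptions, not details given in this thread):

```apache
# .htaccess in the site root
RewriteEngine On

# Return 404 for any request path that ends in a slash.
# In .htaccess context the home page "/" is seen as an empty path,
# so this pattern does not match it.
RewriteRule /$ - [R=404,L]
```

Directory-style URLs (a hypothetical /blog/, say) would also return 404 under this rule, so it only suits sites where that is intended.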

lucy24

10:06 am on Jan 3, 2013 (gmt 0)

Can I redirect to a 403 page

Do you mean what you said? Issue a 301 redirect to the physical page that you use as your 403 page? Why would this be better than returning a 403 outright? I'd think it would be vastly worse, since g### would read it as something perilously close to a Soft 404. With a redirect they wouldn't see your 403 page thousands of separate times; they'd simply put it on their shopping list for later. And then they'd notice that all your pages are redirecting to the same place.
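
For comparison, a rough sketch of returning the 403 outright at the requested URL, with no redirect involved. It assumes mod_rewrite is enabled and reuses the /blocked-widgets/ path from the question; the user-agent list is illustrative only, and user-agent strings can be spoofed:

```apache
# .htaccess in the site root
RewriteEngine On

# Only act on requests whose User-Agent looks like one of these crawlers
# (illustrative list, matched case-insensitively)
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot) [NC]

# Answer 403 Forbidden for anything under /blocked-widgets/;
# "-" means no substitution, so the URL itself is never redirected
RewriteRule ^blocked-widgets/ - [F]
```

Human visitors are unaffected because the condition only matches the listed user agents.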

scooterdude

11:14 am on Jan 3, 2013 (gmt 0)

Thanks all, I'll give that a shot.

g1smd

2:04 pm on Jan 3, 2013 (gmt 0)

Return the 4xx status code at the originally requested URL to signify that particular URL does not exist or no longer exists.

scooterdude

2:34 pm on Jan 3, 2013 (gmt 0)

Return the 4xx status code at the originally requested URL to signify that particular URL does not exist or no longer exists.

I now intend to return 4xx at the URL, as recommended by most.

However, my conundrum is that the URLs do exist.

I want human users to use them,

but block Googlebot/Bingbot et al. from visiting/indexing/requesting these URLs, which are already barred via robots.txt.

Might G, with its penchant for penalties, one day decide that this was cloaking?

g1smd

2:58 pm on Jan 3, 2013 (gmt 0)

If they exist, and people use them, then 404 is not appropriate.

The robots.txt will stop Google crawling them, but the URLs may still appear in the SERPs as URL-only entries.

If you want them out of the index, do not block with robots.txt but instead add the meta robots noindex tag on the page. They have to request the URL in order to know that you don't want it to appear in the SERPs.
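
The same noindex signal can also be sent as an HTTP response header (X-Robots-Tag) rather than a meta tag in the page. A rough sketch for .htaccess, assuming mod_setenvif and mod_headers are available and reusing the /blocked-widgets/ path from earlier in the thread; NOINDEX_SECTION is a made-up variable name:

```apache
# .htaccess in the site root

# Flag requests whose URL path starts with /blocked-widgets/
SetEnvIf Request_URI "^/blocked-widgets/" NOINDEX_SECTION

# Add the noindex header only to flagged responses
Header set X-Robots-Tag "noindex" env=NOINDEX_SECTION
```

As with the meta tag, the URLs must stay crawlable for the header to be seen. If these URLs are internally rewritten to a script, the variable may not survive the rewrite, so the response headers are worth checking before relying on this.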

scooterdude

4:28 pm on Jan 3, 2013 (gmt 0)

er no

I am familiar with the noindex route. My issue is to prevent spiders/robots from crawling potentially vast numbers of URLs which I don't want indexed,

whilst allowing people to use those URLs.

Bot activity can get so intense that a dedicated server, which I got to resolve the issue, would be flattened for hours, and the bots turn out to be Gbot, Bingbot, Baidu, Yandex: spiders I have no intention of banning.

My fear is the cloaking label, and knowing G, they'd not even tell one that that was their thinking; one would have to read their "mind/penalty" :)

g1smd

4:56 pm on Jan 3, 2013 (gmt 0)

Well, robots.txt will keep legit bots out of the site, but URLs may still turn up in the SERPs.

scooterdude

5:02 pm on Jan 3, 2013 (gmt 0)

I set a robot honey trap once or twice. Would you like to know who ignored robots.txt? :)