Stopping access to SEs

JuliaT

3:59 pm on Jun 2, 2006 (gmt 0)

I know nothing about server management so please forgive me if this is either in the wrong place or if I make no sense at all! ;)

I am building a site, and certain pages (member details) will be in a subdomain that I do not want any SEs to crawl. I plan to exclude that entire directory in robots.txt.

It would seem, though, that Google doesn't pay attention to robots.txt and may still follow links to content in this subdomain, so I have been told that we could enforce the block on the server with mod_rewrite.

Now this is where I am totally lost as I haven't got a clue what they are talking about and would like some help if at all possible!

I am told by the hosting company that the server is a Windows Server, so can anyone point me in the right direction or let me know what to ask the hosting company to do for me?

Thanks!

mrMister

6:23 pm on Jun 2, 2006 (gmt 0)

Google will obey robots.txt if it is used correctly.

mod_rewrite is an Apache module. There are similar rewrite modules for IIS, but if you're on shared hosting it's unlikely that your hosting provider will have them installed.
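For reference, here is a minimal sketch of what "used correctly" could look like in this case, assuming the member pages live on their own hostname (members.example.com is a made-up name). robots.txt is read per hostname, so a subdomain needs its own file at its own root; a Disallow line in the main site's file does not cover it. The Python below uses the standard library's urllib.robotparser to sanity-check the rules before the file is uploaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, to be served at http://members.example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Check a sample member page against the rules above.
    print(is_allowed(ROBOTS_TXT, "Googlebot",
                     "http://members.example.com/profile.html"))
```

This only tells well-behaved crawlers not to fetch the pages. It does nothing against a client that never reads robots.txt.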

JuliaT

6:44 pm on Jun 2, 2006 (gmt 0)

Thanks for that but can you be a bit more specific?

It is our own server, that hosts several of our own sites plus some for clients, so I'm sure we can do whatever is necessary to stop G on this site's subdomain.

As for the robots.txt, can you tell me how to do this properly? I have read several other threads about this. Some recommended using meta noindex,nofollow, but others said this hasn't worked for them, and several people on here have also said that G ignores the robots.txt disallow. I'm therefore looking for something that will definitely stop G even looking at these pages.

Thanks again for your help :)

tigger

6:53 pm on Jun 2, 2006 (gmt 0)

I'm also having similar problems stopping G crawling members' pages that I want kept hidden, so any further input would be appreciated.

M3Guy

10:40 pm on Jun 2, 2006 (gmt 0)

I'm looking for the same kind of solution, so any help would be greatly appreciated

mattglet

12:08 am on Jun 3, 2006 (gmt 0)

Probably the most common method to make a "members only" area is to have each page you want protected check for a cookie or session variable.

The best way to set the cookie or session variable is to create a login page that requires human input to validate that the visitor can access the area.

As for protecting the pages, you can create Server Side Includes in classic ASP (which would contain the code that checks the cookie/session values), or use Access Roles in .NET.
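The cookie check described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not ASP or .NET code; the cookie name session_id and the VALID_SESSIONS store are assumptions made up for the example. A crawler never logs in, so it never presents a valid cookie and is redirected to the login page.

```python
from http.cookies import SimpleCookie

# Hypothetical store of session IDs issued by the login page.
VALID_SESSIONS = {"abc123"}


def check_access(cookie_header: str) -> int:
    """Return the status a protected page should send.

    200 if the request carries a known session cookie,
    302 (redirect to the login page) otherwise.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("session_id")  # hypothetical cookie name
    if morsel is not None and morsel.value in VALID_SESSIONS:
        return 200
    return 302
```

In a real site the session store would live on the server (a database or the framework's session object), not in a module-level set.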

tigger

7:53 am on Jun 3, 2006 (gmt 0)

Thanks for the feedback, but when I say "members pages" that was just a way to describe them. They are not pages that people will need to visit or log in to, just pages I DON'T want G crawling.

I need to set up either a folder or a subdomain that G can't access at all. I had hoped the noindex, nofollow would do the trick, but G is still crawling those pages. Like JuliaT, I've never had to do this before, so I am at a complete loss as to the best way forward.

JuliaT

8:24 am on Jun 3, 2006 (gmt 0)

Thanks, but what I really need to know is how to 'properly' use robots.txt, and also what server change needs to be made with mod_rewrite to block Google from this particular subdomain.

mrMister

1:37 pm on Jun 4, 2006 (gmt 0)

You need to install something like:

[isapirewrite.com...]

JuliaT

2:48 pm on Jun 4, 2006 (gmt 0)

Thanks, but my hosting company have said they can set whatever permissions are needed. I need to tell them what to do, though, and that's what I'm completely stuck on, as I've never dabbled in this at all.

Is there anyone who can help by telling me what to tell them to stop Google from being able to get to the sub-domain?

TypicalSurfer

3:05 pm on Jun 4, 2006 (gmt 0)

You say you are on a Windows server. Is that just the OS, or are you running Windows server software? You could be running the Apache server on a Windows box, and that would give you the option of using .htaccess to block bots.

If you don't know your server software, go to netcraft.com and enter your website URL, and it should report back what you are running. From there you can develop a plan to proceed.
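The same information is usually visible in the Server response header, so a quick check can also be scripted. The sketch below is one possible way to do it with Python's standard library; www.example.com is a placeholder, and the function names are made up for this example. Some servers hide or alter this header, so treat the result as a hint.

```python
import urllib.request


def fetch_server_header(url: str) -> str:
    """Send a HEAD request and return the Server header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Server", "")


def classify_server(server_header: str) -> str:
    """Map a Server header value to a rough server family."""
    value = server_header.lower()
    if "microsoft-iis" in value:
        return "IIS"
    if "apache" in value:
        return "Apache"
    return "unknown"


if __name__ == "__main__":
    # Placeholder URL; replace with the site being checked.
    print(classify_server(fetch_server_header("http://www.example.com/")))
```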

JuliaT

3:29 pm on Jun 4, 2006 (gmt 0)

It says "Microsoft-IIS on Windows 2000" so it is windows on windows!

mrMister

6:35 pm on Jun 4, 2006 (gmt 0)

"Is there anyone who can help by telling me what to tell them to stop Google from being able to get to the sub-domain?"

You need to install an IIS rewrite filter (such as ISAPI_Rewrite, linked above) and then modify its httpd.ini file to include the mod_rewrite-style expression.

M3Guy

8:18 pm on Jun 4, 2006 (gmt 0)

So if that is done, will it actually stop G visiting the subdomain? Or would it be required to use mod_rewrite, like on an Apache box, and show G a 403 when it looks for any URL within the subdomain?

Thanks in advance

mrMister

2:30 am on Jun 5, 2006 (gmt 0)

You use the same rewrite code as you would on Apache.
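To make the 403 idea concrete, here is a rough sketch of the decision such a rule makes, written in Python for readability. The hostname members.example.com and the names in the code are assumptions for illustration. The comment shows the Apache mod_rewrite form; whether an IIS rewrite filter accepts that exact syntax depends on the product and version, so check its documentation before copying it.

```python
# Apache mod_rewrite form of the same idea (hostname is hypothetical):
#   RewriteEngine On
#   RewriteCond %{HTTP_HOST} ^members\.example\.com$ [NC]
#   RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
#   RewriteRule .* - [F]

BLOCKED_HOST = "members.example.com"  # hypothetical subdomain
BLOCKED_AGENTS = ("googlebot",)       # substrings, matched case-insensitively


def status_for(host: str, user_agent: str) -> int:
    """Return 403 for a blocked crawler on the protected host, else 200."""
    hostname = host.lower().split(":")[0]  # drop any :port suffix
    agent = user_agent.lower()
    if hostname == BLOCKED_HOST and any(name in agent for name in BLOCKED_AGENTS):
        return 403
    return 200
```

Matching on the User-Agent string only stops clients that identify themselves honestly; anything that fakes a browser UA passes straight through.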

M3Guy

10:00 am on Jun 5, 2006 (gmt 0)

OK, that's good to know, but I have no idea about Apache either; I've simply seen it mentioned elsewhere lol

Would anyone know of a good place to find out how to set all this up? Beyond standard design, SEO, etc., I have absolutely no idea.

Thanks in advance

M3Guy

10:35 pm on Jun 20, 2006 (gmt 0)

OK, all sorted.

Got a rewrite program for IIS and have had all the coding checked by their bods, and it does seem to have kept our loving bot out of the folder I wanted to restrict access to, for now...

Pfui

12:57 am on Jun 21, 2006 (gmt 0)

Sorry I'm clearly late to the party, M3Guy, but I thought I'd at least bring a few tips --

Google and the majors misbehave, but they usually heed robots.txt. (In my experience, they're okay approx. 80-90% of the time.) However, FAR greater problems will come not from the majors but from the literally thousands of bots/crawlers/scrapers/whatevers that don't even bother to ask for, let alone heed, robots.txt.

So feel free to peruse the robots.txt Forum [webmasterworld.com], ditto the Search Engine Spider Identification Forum [webmasterworld.com]. Bot-watchers keep a sharp eye out for troublemakers of all kinds and provide IPs, UAs, and Hosts for your blocking pleasure:)

(Both of those forums (fora?), and others of their ilk, are in the larger The Search Engine World [webmasterworld.com] area.)
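For bots that ignore robots.txt, the IP lists mentioned above can be applied with a simple range check. The sketch below uses Python's standard ipaddress module; the ranges are placeholder documentation addresses, not real bot IPs, and the function name is made up for the example.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks, NOT real bot IPs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

In practice the same check would normally live in the web server or firewall configuration, with the list kept up to date from the sources above.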

And when it comes to mod_rewrite kinds of details -- all of which can be head-bangingly tricky, regardless of server -- the Apache Web Server Forum [webmasterworld.com] has a TON of rewrite-related posts and the occasional non-Apache clarification and/or pointer to resources. (Keep an eye out for Moderator jdMorgan's posts -- he's your best source for code examples that work the first time:)

Last but not least, when you're looking for posts that have gone before, or researching certain RewriteCond statements or checking on whether a UA is a known threat, your quickest, easiest bet is to search WebmasterWorld [webmasterworld.com].

Good luck!