I am in the process of building a site, but certain pages (the member details) will sit in a subdomain that I do not want any SEs to crawl. I plan to exclude that entire directory in robots.txt.
It would seem, though, that Google doesn't pay attention to robots.txt and may still follow links to content in this subdomain, so I have been told that we could enforce the block on the server with mod_rewrite.
Now this is where I am totally lost as I haven't got a clue what they are talking about and would like some help if at all possible!
The hosting company tells me the server is a Windows Server, so can anyone point me in the right direction or let me know what to ask the hosting company to do for me?
Thanks!
It is our own server, that hosts several of our own sites plus some for clients, so I'm sure we can do whatever is necessary to stop G on this site's subdomain.
As for the robots.txt, can you tell me how to do this properly? I have read several other threads about this. Some recommended using meta noindex, nofollow, but others said this hasn't worked for them, and several people on here have also said that G ignores the robots.txt disallow. I'm therefore looking for something that will definitely stop G from even looking at these pages.
Thanks again for your help :)
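For what it's worth, robots.txt is read per host, so a subdomain needs its own file at its own root rather than a rule in the main site's file. Here is a rough Python sketch for checking what a given robots.txt would tell a well-behaved crawler. The hostname members.example.com and the helper name is_allowed are made up for illustration, and this only shows what a compliant bot should do; it does nothing to stop one that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, assumed to be served at the root of the
# members subdomain (e.g. http://members.example.com/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler should treat every path on this host as off limits.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://members.example.com/profile"))
```

Because the rules are advisory, this is best treated as a first layer, with a real access check on the server behind it.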
The best way to set the cookie or session variable is to create a login page which would require human input to validate they can access the area.
As for protecting the pages, you can create Server Side Includes in classic ASP (which would contain the code that checks the cookie/session values), or use Access Roles in .NET.
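To make the idea concrete, here is a minimal sketch of that kind of check. It is written in Python as a small WSGI app rather than classic ASP or .NET, purely for illustration; the cookie name member_session, the /login path, and the in-memory session set are all assumptions, not anything from your setup. The point is only that a request without a valid session never sees the member page, and a crawler cannot fill in a login form.

```python
from http.cookies import SimpleCookie

# Hypothetical session store; a real site would use its framework's sessions.
VALID_SESSIONS = {"abc123"}
SESSION_COOKIE = "member_session"  # assumed cookie name


def has_valid_session(environ) -> bool:
    """Check the request's Cookie header for a known session id."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    morsel = cookie.get(SESSION_COOKIE)
    return morsel is not None and morsel.value in VALID_SESSIONS


def members_app(environ, start_response):
    """Serve member content only to requests carrying a valid session."""
    if not has_valid_session(environ):
        # No session: send the visitor (or bot) to the login page instead.
        start_response("302 Found", [("Location", "/login")])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Member details"]
```

The same check would sit at the top of every protected page, which is what the include file does in the ASP version.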
I need to set up either a folder or subdomain that G can't access at all. I had hoped the noindex, nofollow would do the trick, but G is still crawling those pages. As with JuliaT, this is something I've never had to do before, so I am at a complete loss as to the best way forward.
[isapirewrite.com...]
Is there anyone who can help by telling me what to tell them to stop Google from being able to get to the sub-domain?
If you don't know your server software, go to netcraft.com and enter your website URL; it should report back what you are running. From there you can develop a plan to proceed.
Google and the majors misbehave, but they usually heed robots.txt. (In my experience, they're okay approx. 80-90% of the time.) However, FAR greater problems will come not from the majors but from the literally thousands of bots/crawlers/scrapers/whatevers that don't even bother to ask for, let alone heed, robots.txt.
So feel free to peruse the robots.txt Forum [webmasterworld.com], ditto the Search Engine Spider Identification Forum [webmasterworld.com]. Bot-watchers keep a sharp eye out for troublemakers of all kinds and provide IPs, UAs, and Hosts for your blocking pleasure:)
(Both of those forums (fora?), and others of their ilk, are in the larger The Search Engine World [webmasterworld.com] area.)
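As a rough idea of what blocking by UA and IP amounts to, here is a small Python sketch. The deny lists are invented examples (the IP range is one reserved for documentation), and should_block is a hypothetical helper, not part of any server. On a live site the same decision would normally be made in the server's rewrite or access rules, and User-Agent strings are easily faked, so treat this as one layer only.

```python
import ipaddress

# Hypothetical deny lists; real entries would come from your own logs
# or from the bot-watching threads mentioned above.
BLOCKED_UA_SUBSTRINGS = ("badbot", "scraper")
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def should_block(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a blocked UA or IP range."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
        return True
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: it cannot match any listed network.
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A request that comes back True would get a 403 instead of the page.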
And when it comes to mod_rewrite kinds of details -- all of which can be head-bangingly tricky, regardless of server -- the Apache Web Server Forum [webmasterworld.com] has a TON of rewrite-related posts and the occasional non-Apache clarification and/or pointer to resources. (Keep an eye out for Moderator jdMorgan's posts -- he's your best source for code examples that work the first time:)
Last but not least, when you're looking for posts that have gone before, or researching certain RewriteCond statements or checking on whether a UA is a known threat, your quickest, easiest bet is to search WebmasterWorld [webmasterworld.com].
Good luck!