
Sitemaps, Meta Data, and robots.txt Forum

    
GoogleBot muxing URLS and I want to block it
webdude

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3686808 posted 12:25 pm on Jun 30, 2008 (gmt 0)

I am having a strange problem where Googlebot has found URLs that don't exist, yet they serve a 200. I would like to block all of these URLs on the site. This has caused over 5000 pages of dupe content. I have not been able to find where the URLs are coming from, whether malicious or not, and would like to make sure I am getting the syntax right before I add this to the robots.txt file.

All good URLs should contain ?_function=(whatever) at the beginning of the query string, like this...

mysite.com/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1

Googlebot (or some other entity) is finding query strings like this...

mysite.com/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1

...and other variations of this, BUT all the muxed variations start with ?ForumMasterThreads_uid1=(n)

I would like to disallow all of these but need to make sure to allow all others. Is this the correct syntax and would this work?

User-agent: Googlebot
Disallow: /*?ForumMasterThreads_uid1

By the way, I checked, double checked, triple checked and yes... quadruple checked my code and there are no pages that generate these URLs. Any help would be appreciated.

 

physics

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3686808 posted 4:25 pm on Jun 30, 2008 (gmt 0)

I think you'll have to do this at the .htaccess level. Rewrite the request to a 404 if it has the strange parameter. For query string examples see:

[webmasterworld.com...]
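
For anyone not on Apache, the same idea can be sketched in application code rather than in .htaccess. This is only a rough illustration in Python, not anything from the linked thread: the function name `status_for` is made up, and it assumes the only bad requests are the ones whose query string begins with ForumMasterThreads_uid1, as described in the first post.

```python
from urllib.parse import urlsplit

# Hypothetical marker: the parameter that leads every unwanted URL variant.
BAD_LEADING_PARAM = "ForumMasterThreads_uid1="

def status_for(url: str) -> int:
    """Return the HTTP status a request for `url` should get.

    Assumption: a query string that *starts* with the forum-thread
    parameter is one of the unwanted variants and should be a 404.
    """
    query = urlsplit(url).query
    if query.startswith(BAD_LEADING_PARAM):
        return 404
    return 200

good = "/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
bad = "/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1"

print(status_for(good))  # 200
print(status_for(bad))   # 404
```

The check would have to run before the application builds the page, whatever the server platform is.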

webdude

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3686808 posted 6:16 pm on Jun 30, 2008 (gmt 0)

Well, that's fine for Apache, etc... what about windoze? I can achieve the same thing using robots.txt, can I not? I do not want any queries going to...

?ForumMasterThreads

Receptional Andy



 
Msg#: 3686808 posted 6:31 pm on Jun 30, 2008 (gmt 0)

Your disallow line is syntactically correct and will work for Googlebot: the asterisk is a wildcard, the question mark is treated as a literal character rather than an operator, and robots exclusion rules are prefix-matching.

Note, though, that parameter order in a URL is not normally significant. This URL would work in most instances:
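
If you want to sanity-check a rule before uploading it, the three matching behaviours above can be approximated in a few lines of Python. This is a rough sketch of the matching as described here, not Google's actual implementation; the function names are made up, and the trailing `$` end-anchor it also handles is a Googlebot extension that this thread's rule doesn't use.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Googlebot-style robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every piece (so "?" stays a literal character),
    # then join the pieces with ".*" so each "*" acts as a wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query: str, pattern: str) -> bool:
    # re.match anchors only at the start, which gives prefix matching.
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None

RULE = "/*?ForumMasterThreads_uid1"
good = "/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
bad = "/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1"

print(is_disallowed(good, RULE))  # False
print(is_disallowed(bad, RULE))   # True
```

The good URL stays crawlable because the rule needs the literal "?" to be followed immediately by ForumMasterThreads_uid1, and in the good URL that parameter follows an "&" instead.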

appfile.xyz?ForumMasterThreads_uid1=222&_function=detail&start=1&ForumNumber=1

So, as long as you're sure you aren't linking to parameters in different orders things should be OK.
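
To illustrate the point about parameter order, here is a small Python sketch using the standard library's URL parser. The http:// scheme and the helper name `params` are assumptions added for the example; the query strings are the ones from this thread.

```python
from urllib.parse import urlsplit, parse_qsl

def params(url: str) -> dict:
    """Return the query parameters as a dict, which ignores their order."""
    return dict(parse_qsl(urlsplit(url).query))

a = "http://mysite.com/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
b = "http://mysite.com/appfile.xyz?ForumMasterThreads_uid1=222&_function=detail&start=1&ForumNumber=1"

print(params(a) == params(b))  # True: same parameters, different order
print(a == b)                  # False: as strings they are two distinct URLs
```

An application that reads parameters by name will typically treat both as the same request, while a crawler sees two different addresses, which is where the duplicate content comes from.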

Incidentally, there's a chance Google is creating these URLs as a result of its 'form crawling' behaviour [webmasterworld.com].

srtech

5+ Year Member



 
Msg#: 3686808 posted 5:25 pm on Jul 1, 2008 (gmt 0)

With Windows you can log into the IIS manager and manually click on each file and re-direct it. I just did this today for a bunch of old .html files that we had which are now being re-directed to .asp files. I checked with Google and they had duplicate content. Over time, all of the .html's will switch over to .asp's in their index once Google picks up on all of the 301 re-directs.

g1smd

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3686808 posted 2:29 pm on Jul 11, 2008 (gmt 0)

The robots.txt rule will block spidering, but the URLs may well hang around in Google SERPs as URL-only listings for a very long time after that.

You are better off in the long run adding a redirect, but the robots rule is a good thing to start off with.
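
As a rough idea of what such a redirect could compute, here is a Python sketch that works out the 301 target for a reordered URL. It assumes the canonical form is simply "_function first, everything else in its existing order", based on the first post; the function name `redirect_target` is made up, and the actual redirect would still have to be wired into the server or application.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def redirect_target(url: str):
    """Return the canonical URL to 301 to, or None if `url` is already canonical.

    Assumption: canonical means the "_function" parameter comes first and
    the remaining parameters keep their relative order.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # sorted() is stable: "_function" pairs move to the front,
    # all other pairs keep the order they arrived in.
    ordered = sorted(pairs, key=lambda kv: kv[0] != "_function")
    if ordered == pairs:
        return None
    return urlunsplit(parts._replace(query=urlencode(ordered)))

good = "/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
bad = "/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1"

print(redirect_target(bad) == good)  # True
print(redirect_target(good))         # None
```

A URL with no _function parameter at all also comes back as None here, so those would need separate handling (a 404, for instance).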
