GoogleBot muxing URLS and I want to block it - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi


GoogleBot muxing URLS and I want to block it

webdude

12:25 pm on Jun 30, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I am having a strange problem where Googlebot has found URLs that don't exist, yet they serve a 200. I would like to block all of these URLs on the site. This has caused over 5000 pages of dupe content. I have not been able to find where the URLs are coming from, whether malicious or not, and would like to make sure I am getting the syntax right before I add this to the robots.txt file.

All good URLs should contain a ?_function=(whatever) at the beginning of the query string, like this...

mysite.com/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1

Googlebot (or some other entity) is finding query strings like this...

mysite.com/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1

...and other variations of this, BUT all the muxed variations start with ?ForumMasterThreads_uid1=(n)

I would like to disallow all of these but need to make sure to allow all others. Is this the correct syntax and would this work?

User-agent: Googlebot
Disallow: /*?ForumMasterThreads_uid1

By the way, I checked, double checked, triple checked and yes... quadruple checked my code and there are no pages that generate these URLs. Any help would be appreciated.

physics

4:25 pm on Jun 30, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I think you'll have to do this at the .htaccess level. Rewrite the request to a 404 if it has the strange parameter. For query string examples see:

[webmasterworld.com...]

webdude

6:16 pm on Jun 30, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Well that's fine for Apache, etc... what about windoze? I can achieve the same thing using robots.txt, can I not? I do not want any queries going to...

?ForumMasterThreads

Receptional Andy

6:31 pm on Jun 30, 2008 (gmt 0)

Your disallow line is syntactically correct and will work for Googlebot: asterisk is a wildcard, a question mark is treated as a character rather than an operator, robots exclusion is prefix-matching.
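To check the rule against the two example URLs before uploading it, the matching described above can be approximated in a few lines of Python. This is only a sketch of the wildcard and prefix-matching behaviour as described in this post, not Google's actual matcher; the function names are made up for illustration.

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard robots.txt path pattern into a regex.

    '*' matches any run of characters, a trailing '$' anchors the end,
    and every other character (including '?') is taken literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query: str, pattern: str) -> bool:
    # re.match only anchors at the start, which gives prefix matching.
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None

RULE = "/*?ForumMasterThreads_uid1"
good = "/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
bad = "/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1"

print("good URL blocked:", is_disallowed(good, RULE))
print("bad URL blocked:", is_disallowed(bad, RULE))
```

The good URL is not blocked because its "?" is followed by _function, so the literal sequence ?ForumMasterThreads_uid1 never appears in it.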

Note, though, that parameter order in a URL is not normally significant. This URL would work in most instances:

appfile.xyz?ForumMasterThreads_uid1=222&_function=detail&start=1&ForumNumber=1

So, as long as you're sure you aren't linking to parameters in different orders things should be OK.
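A quick way to see why the order doesn't matter to most applications: once the query string is parsed into name/value pairs, both orderings give the same result. The sketch below uses Python's standard urllib.parse; the http:// scheme is assumed, and how the real application behind appfile.xyz parses its parameters is not known from this thread.

```python
from urllib.parse import urlsplit, parse_qsl

def params(url: str) -> dict:
    # Parse the query string into a name -> value mapping.
    # Repeated names would collapse to the last value here.
    return dict(parse_qsl(urlsplit(url).query))

a = "http://mysite.com/appfile.xyz?_function=detail&ForumMasterThreads_uid1=222&start=1&ForumNumber=1"
b = "http://mysite.com/appfile.xyz?ForumMasterThreads_uid1=222&_function=detail&start=1&ForumNumber=1"

same = params(a) == params(b)
print("same parameters:", same)
```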

Incidentally, there's a chance Google is creating these URLs as a result of its 'form crawling' behaviour [webmasterworld.com].

srtech

5:25 pm on Jul 1, 2008 (gmt 0)

10+ Year Member

With Windows you can log into the IIS manager and manually click on each file and re-direct it. I just did this today for a bunch of old .html files that we had which are now being re-directed to .asp files. I checked with Google and they had duplicate content. Over time, all of the .html's will switch over to .asp's in their index once Google picks up on all of the 301 re-directs.

g1smd

2:29 pm on Jul 11, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member


The robots.txt rule will block spidering, but the URLs may well hang around in Google SERPs as URL-only listings for a very long time after that.

You are better off in the long run adding a redirect, but the robots rule is a good thing to start off with.
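As a rough sketch of the redirect logic, the function below works out where a muxed URL would be 301-redirected so that _function comes first again. It is platform-neutral Python rather than IIS or Apache configuration, the name redirect_target is hypothetical, and keeping the other parameters in their incoming order is an assumption (the thread only says good URLs start with ?_function=).

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

LEAD_PARAM = "_function"  # parameter that good URLs start with

def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if not pairs or pairs[0][0] == LEAD_PARAM:
        return None  # no query string, or already in the good form
    lead = [p for p in pairs if p[0] == LEAD_PARAM]
    if not lead:
        return None  # no _function at all; left alone in this sketch
    rest = [p for p in pairs if p[0] != LEAD_PARAM]
    # Rebuild the query with _function first, other parameters unchanged.
    return urlunsplit(parts._replace(query=urlencode(lead + rest)))

muxed = "http://mysite.com/appfile.xyz?ForumMasterThreads_uid1=222&start=1&_function=detail&ForumNumber=1"
print(redirect_target(muxed))
```

The server would send a 301 to the returned URL and serve the page normally when the function returns None.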