Forum Moderators: Robert Charlton & goodroi


Webmaster Tools 'Web Crawl' and bogus query string links

         

doughayman

12:11 am on Jun 9, 2008 (gmt 0)

10+ Year Member



Hi,

I tried to embed this issue in another topic and it got lost, so I decided to start a thread on it.

In Google WT Web Crawl, I have successfully blocked various URLs through my judicious use of ROBOTS.TXT "disallow" clauses. Specifically, I have a ROBOTS.TXT disallow clause of:

Disallow: /*?

I added this to my ROBOTS.TXT file since I was getting Google crawl entries of:

www.domain.com/index.htm (which I want crawled & indexed)

AND

www.domain.com/index.htm?ref=someotherdomain.com

My ROBOTS.TXT disallow clause above successfully blocked this latter URL.

This latter URL is not referenced on my site, and is not found anywhere on the Net (yet). Additionally, I do not offer any sort of subaffiliate program, so this link means nothing to me.
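For context, here is a minimal sketch of how that rule sits in a ROBOTS.TXT file. I'm showing it under a Googlebot user-agent section, since the * wildcard is a Googlebot extension rather than part of the original robots.txt standard (the exact user-agent grouping in my file is assumed here):

```
User-agent: Googlebot
Disallow: /*?
```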

So, here are some questions that I have on this:

1) Where is Google finding this link, and what does it mean? Does it refer to someone trying to create a scraper site that isn't yet indexed?

2) If I didn't block this 2nd URL via ROBOTS.TXT, and it got crawled and indexed by Google, would this create an opportunity for a duplicate content penalty with the 1st link mentioned above? [NOTE: that is my thought, and why I decided to block it in ROBOTS.TXT]

3) Assuming that this is somehow tied to malicious acts, is there any sort of means to alert Google to this?

Thanks in advance,

Doug

tedster

2:52 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's my take:

1. It can be very hard to know where Google gets a URL. It might come from toolbar data, direct submission, someone else's page that was once indexed but no longer is, a cloaked page, a Google test of how your domain handles this "invented" URL -- probably more, too!

2. Yes, there's an opportunity for trouble in the SERPs - especially if this happens a lot.

3. Handling it technically as you did is the best thing to do.

doughayman

3:05 pm on Jun 9, 2008 (gmt 0)

10+ Year Member



Thanks for your input, Ted.

One concern I have, though, is that by blocking:

www.domain.com/index.htm?ref=someotherdomain

am I also effectively blocking:

www.domain.com/index.htm ?

The reason I ask is that the block of the 1st link above first occurred on June 4th of this month, and that is when I saw a precipitous drop in traffic (I have another thread dedicated to that issue).

tedster

3:08 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could avoid that potential problem by using a 301 redirect that drops the query string, and then removing the Disallow rule from robots.txt.
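On Apache, one way to sketch that is with mod_rewrite in .htaccess -- the trailing "?" in the substitution is what drops the query string on the redirect. This is just an illustration for the index.htm case in this thread; adjust the pattern for your own server and URLs:

```apache
RewriteEngine On
# Only fire when a query string is present
RewriteCond %{QUERY_STRING} .
# 301 to the clean URL; the trailing "?" strips the query string
RewriteRule ^index\.htm$ /index.htm? [R=301,L]
```

That way www.domain.com/index.htm?ref=someotherdomain.com permanently redirects to www.domain.com/index.htm, consolidating everything onto the one URL instead of relying on robots.txt to hide the variants.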

doughayman

3:46 pm on Jun 9, 2008 (gmt 0)

10+ Year Member



Yes, good idea!