Forum Moderators: Robert Charlton & goodroi


Did src= markers inflate our pages indexed?

Major decrease in # of pages indexed


vicyankees

2:30 am on Jun 27, 2005 (gmt 0)

10+ Year Member



With the most recent Google update, we saw a MAJOR decrease in the number of pages indexed, most notably in Google. We went from 2.2 million pages indexed down to just over 900k.

My theory involves the source markers we append to many URLs as an easy way to internally recognize where the traffic to a specific page came from -- email, a specific advertisement placement on the site, etc. I suspect each of these src markers is being treated as a separate and unique page -- is that true? If so, how do we fix this issue?

Thanks in advance for anyone's input on this...

Vic

tedster

5:48 pm on Jul 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That definitely could figure into it. Every query string creates a different url, even if you serve the same on-page content -- and it sounds like Google just zapped some obvious duplicate content for you.

As a rule of thumb, most click-trail tracking is best done with cookies. You only want Google to see one url for one bit of content, and not alternate urls that get the same thing. You can use robots.txt, robots meta tags and so on to limit what you want Gbot to see.
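For the robots.txt route, Googlebot honors wildcard patterns in Disallow lines (an extension beyond the original robots.txt standard, so other spiders may ignore it). A minimal sketch, assuming the tracking parameter is always named src:

```
User-agent: Googlebot
Disallow: /*?src=
Disallow: /*&src=
```

Note that Disallow only stops crawling; it does not consolidate the link value of the blocked URLs into the clean URL.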

vicyankees

12:14 am on Jul 3, 2005 (gmt 0)

10+ Year Member



Can I use robots.txt to have all ?src= URLs ignored and have them considered the main URL...

ie...

www.domain.com/page1.html?src=blah

and have the robots.txt tell the spiders that the page name really should be...

www.domain.com/page1.html

Is that possible with robots.txt or any other method? I don't want them to exclude all of these pages, because they are heavily linked, and some have a PR of 6 while others have 4s. My main domain has a PR of 8 and I want to be sure to take full advantage of this.

Thanks for your help...

Vic

joeduck

12:22 am on Jul 3, 2005 (gmt 0)

10+ Year Member



Note that site:yoursite.com gives a rough estimate, not an exact count.

vicyankees

12:50 am on Jul 3, 2005 (gmt 0)

10+ Year Member



Yes - but if it drops by half, that's a GREAT deal and a good "measure"... We are now down to 616,000, and if I do a site:___ search combined with inurl:?src= I only get 7,000, which has remained consistent for the last few days while the overall count continues to drop...

joeduck

4:17 pm on Jul 7, 2005 (gmt 0)

10+ Year Member



vic -

Google follows the robots.txt exclusion protocol, but I'm not sure you can use it for the src parameter. You can exclude a directory from crawling with a "Disallow: /dir/" line.

vicyankees

5:05 pm on Jul 7, 2005 (gmt 0)

10+ Year Member



Thanks - but I cannot disallow an entire directory, as the pages I want indexed are the same as the ones with the src marker.

SebastianX

6:50 pm on Jul 7, 2005 (gmt 0)

10+ Year Member



AFAIK Google accepts cloaking to prevent it from indexing duplicate content. Try something like

$userAgent = getenv("HTTP_USER_AGENT");
$queryString = getenv("QUERY_STRING");
$domain = getenv("HTTP_HOST");
$isSpider = FALSE;
if (stristr($userAgent, "Googlebot")) {
    $isSpider = TRUE;
}
if ($isSpider AND !empty($queryString)) {
    // 301 to the same script with the query string stripped
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://" . $domain . $_SERVER["PHP_SELF"]);
    exit;
}

This PHP code will redirect Googlebot to the clean URL w/o ?src=anything, while all other user agents are tracked.
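If the site runs on Apache, the same conditional redirect can be sketched at the server level with mod_rewrite instead of PHP. Hypothetical .htaccess rules, assuming mod_rewrite is enabled and the tracking parameter is named src:

```
RewriteEngine On
# Only redirect requests identifying as Googlebot
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# ...and only when an src= parameter is present in the query string
RewriteCond %{QUERY_STRING} (^|&)src= [NC]
# 301 to the same path; the trailing ? strips the query string
RewriteRule ^(.*)$ /$1? [R=301,L]
```

This keeps the redirect logic out of every tracked page, but it is the same cloaking approach, with the same risks.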

tedster

3:23 am on Jul 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's another idea -- and depending on your situation it might be easier to execute. If you can change ?src= to ?id= then Google has stated that they will definitely NOT crawl those urls.

"Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index."

[google.com...]

You'd just want to be sure that Google crawls some version of those urls.

SebastianX

7:23 am on Jul 8, 2005 (gmt 0)

10+ Year Member



>Here's another idea -- and depending on your situation it might be easier to execute. If you can change ?src= to ?id= then Google has stated that they will definitely NOT crawl those urls.

That will not work. Google *does* index URLs that carry an 'id' variable with a short value in the query string.