Forum Moderators: open
Please help!
Not sure why, but I wish they would stop: all of them show the same title and description. This causes two problems for me:
1. It screws up my ROI tracking.
2. Duplicate content: the main page is in the index six times, but there is only one page.
Our solution to prevent new tracking codes being picked up was to put all inbound tracking links through a script which extracts the tracking code and then 301s the requester to a clean URL. So far this seems to have worked OK and we haven't seen any new codes appearing in Google.
Unfortunately once the pages _are_ in the index it seems to be very hard to get them out again. The only way I could think of (aside from asking Google nicely to do it for you) was to detect requests for pages using the old tracking methods and redirect them to a clean URL in the same way as the tracking script. The problem is that you need to do this for _all_ the possible landing pages :(
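A minimal sketch of that tracking-code stripper in Python, assuming the tracking parameter is called `ref` (as in the examples later in this thread); the helper name is just illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def clean_url(url, tracking_params=("ref",)):
    """Return (tracking_code, clean_url): the extracted code for ROI
    logging, and the URL with tracking parameters stripped, which is
    what you would 301 the requester to."""
    parts = urlsplit(url)
    kept, code = [], None
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in tracking_params:
            code = value            # record this for ROI tracking
        else:
            kept.append((key, value))  # keep non-tracking parameters
    return code, urlunsplit(parts._replace(query=urlencode(kept)))
```

So `clean_url("http://www.mydomain.com/index.html?ref=123")` gives back the code `"123"` and the clean target `http://www.mydomain.com/index.html` for the 301.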
If anyone has a better solution I'd like to hear it because we still have pages in the index that use tracking codes which we stopped using 6 months ago!
Any idea how I can prevent Google from indexing these?
Also, I see Google indexing incorrect URLs, some with spaces, some totally wrong URLs like: www.domain.com/%1Fnmasbdkjabsdlkvjb%20slkhs876d98fyts
All of my pages are static HTML, so I have no idea how this stuff is getting in.
Any idea how I can prevent Google from indexing these?
The only protection I know of is to make sure that if those pages are requested from your server by Google, the response is either a 404 (if the page is actually invalid) or to 301 them to a clean URL that you don't mind appearing in the SERPs. Not sure how other SEs handle 301s though.
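A rough sketch of that 404-vs-301 decision in Python (the set of valid paths here is purely illustrative; in practice it would be your real list of static pages):

```python
from urllib.parse import urlsplit

# Illustrative whitelist of the pages that actually exist on the site.
VALID_PATHS = {"/", "/index.html", "/products.html"}

def respond(request_url):
    """Decide what to send back for a crawled URL: 404 for genuinely
    invalid paths, 301 to the clean path if tracking junk is attached,
    200 otherwise. Returns (status, target_path_or_None)."""
    parts = urlsplit(request_url)
    if parts.path not in VALID_PATHS:
        return 404, None              # page doesn't exist: let it drop out
    if parts.query:
        return 301, parts.path        # real page + query string: redirect clean
    return 200, parts.path            # serve the page normally
```

For example, a mangled URL like the one quoted above would get a 404, while `index.html?ref=123` would get a 301 to `/index.html`.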
Also, I see Google indexing incorrect URLs, some with spaces, some totally wrong URLs like: www.domain.com/%1Fnmasbdkjabsdlkvjb%20slkhs876d98fyts
All of my pages are static HTML, so I have no idea how this stuff is getting in.
Google sometimes shows spaces in SERP URLs even when there are none in the link - my guess is that this is protection against screen scraping.
As to the totally wrong URLs - I can't explain that one.
My suspicion is that these bad URLs being picked up by Google are down to the increase in so-called "directories" that are filled with affiliate and PPC links and scrape content from search engines. The scraping seems to be a bit hit-and-miss sometimes, and things get mangled.
Put up a robots.txt to block the pages with tracking codes, or use the other methods explained here:
[google.com...]
Then go to
[services.google.com:8882...]
Register and submit changes.
I have removed many pages this way before, when Google crawled a lot of useless redirect scripts. It took around one day.
www.mydomain.com/?ref=123
or
www.mydomain.com/index.html?ref=123
You can't put this in robots.txt:
User-agent: *
Disallow: /?ref=123
Disallow: /index.html?ref=123
can you? I'm happy to be proved wrong on this :)
I guess you could do it using the robots metatag, provided your page is dynamic, by checking the querystring in the page code and serving a noindex if a reference code is set. But if your pages are static HTML this wouldn't work either.
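A minimal sketch of that meta-tag approach, assuming a dynamic page and a tracking parameter called `ref` (the function name is illustrative):

```python
def robots_meta(query_params):
    """Pick the robots metatag for a page based on its query string:
    emit noindex when a tracking code is present, so the tracked copy
    never enters the index."""
    if "ref" in query_params:
        return '<meta name="robots" content="noindex,follow">'
    return '<meta name="robots" content="index,follow">'
```

The dynamic page would call this while rendering its `<head>`, so `?ref=123` requests carry noindex while clean requests stay indexable.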
They are correct!
www.mydomain.com/?ref=123
or
www.mydomain.com/index.html?ref=123
You can't put this in robots.txt:
User-agent: *
Disallow: /?ref=123
Disallow: /index.html?ref=123
BUT, remember that
Disallow: /?ref=123 will not only DISALLOW
www.mydomain.com/?ref=123
but also
www.mydomain.com/?ref=123456
So if all your tracking URLs start with /?ref you just need one line:
Disallow: /?ref
or one more for safety:
Disallow: /index.html?ref
I recommend adding the meta tag too, as a backup, but not as the only solution. I always discourage people from using the meta tag alone to stop robots. With robots.txt, a disallowed file is never fetched at all; with the meta tag, the page has to be fetched before the bot knows it can't index it. GoogleBot is a hungry monster and will eat up a lot of bandwidth for nothing. And since GoogleBot crawls a certain number of links per site, you reduce its chances of crawling your more useful pages. Lastly, and most importantly, using the meta robots tag to stop a bot will usually end up with the page appearing in Google's database with no title, description or contents (just the URL).
Regarding Google including just the URL of a page in the index when you use the robots metatag, I was told that the opposite is true:
[webmasterworld.com...]