Forum Moderators: Robert Charlton & goodroi


Removing pages with session IDs


Fiver

5:48 pm on Nov 1, 2006 (gmt 0)

10+ Year Member



With reference to this thread (now closed):
[webmasterworld.com...]

I wish to remove pages with session IDs from a large site that suffers from many indexed supplementals.

I would use a line like
Disallow: /*sessionid

And then tell the google removal tool to revisit the robots.txt

My issue is this:
There are many pages on this site indexed via external links to pages with the sessionid variable - the linking sites are partners. There aren't any links on the actual site that contain these partner session IDs.

Will the removal bot simply crawl the site and remove URLs it comes across with *sessionid in them? Or is the removal bot smart enough to crawl all pages currently indexed in Google for that site?

All pages we want removed are now served automatically with noindex nofollow when they have sessionid in the url. But google's cache of these pages is months old, and it simply isn't crawling them because the links are from external pages, which it apparently rarely visits.

I could build a huge stupid sitemap to link to all of these pages so google will crawl them again, but that seems backwards. I could also remove them with the removal tool one at a time... but there are far too many.
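For reference, the conditional noindex described above boils down to checking the URL before writing out the robots meta tag. A rough sketch in Python - purely illustrative, since the thread doesn't say what the site actually runs on:

def robots_meta_tag(url):
    # Serve noindex,nofollow for any URL carrying a session ID,
    # and the normal index,follow tag for everything else.
    if "sessionid" in url.lower():
        return '<meta name="robots" content="noindex,nofollow">'
    return '<meta name="robots" content="index,follow">'

print(robots_meta_tag("/page?partner=acme&sessionid=ABC123"))
# <meta name="robots" content="noindex,nofollow">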

g1smd

6:40 pm on Nov 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Disallow: * notation is only understood by Googlebot, so that * code must go in the User-Agent: Googlebot section of the file.

If you have a User-Agent: Googlebot section in your file, then Google will ignore everything in the User-Agent: * section.
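For example, a minimal robots.txt along those lines might look like this (assuming the session parameter really does appear as "sessionid" in the URLs - substitute the actual name):

User-agent: Googlebot
Disallow: /*sessionid

User-agent: *
Disallow:

Anything from the User-agent: * section that should still apply to Google has to be repeated inside the Googlebot section, because once Googlebot finds its own section it ignores the general one.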

Fiver

6:57 pm on Nov 1, 2006 (gmt 0)

10+ Year Member



alright g1smd, thanks, that's understood. I'll let the other engines find the noindex metas in their own time, it's just google I want to help along. poor girl's a little slow.

any idea of how the removal bot does its thing?

AjiNIMC

9:52 am on Nov 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google also removes session ID parameters from URLs when the parameter has a standard name. Matt Cutts said recently on an e-marketing talk show that if Google sees that a lot of websites use a parameter like a session ID which returns the same page, they start ignoring it.

Supplemental pages can have various other causes as well.
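In practice that kind of normalisation just means dropping the known session parameter before comparing URLs. A rough sketch in Python - the parameter name "sessionid" is only an example, not something Google has confirmed:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_session_param(url, param="sessionid"):
    # Remove the session parameter so otherwise-identical pages
    # collapse to a single canonical URL.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() != param]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))

print(strip_session_param("http://example.com/page?id=7&sessionid=ABC123"))
# http://example.com/page?id=7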

mcskoufis

7:08 am on Nov 5, 2006 (gmt 0)

10+ Year Member



This has been puzzling me for quite some time now.

Even though Google says that it supports the wildcard (*) in robots.txt, as explained at the following link, the URL removal tool simply does not recognise this line!

To remove dynamically generated pages, you'd use this robots.txt entry:

User-agent: Googlebot
Disallow: /*?

[google.com...]

On the URL Removal tool it says:

URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?

But if you have this line in your robots.txt and you test a URL against it in the relevant Sitemaps tool, it says that the URL is being blocked.

I am totally confused, but it seems to be working.

The other strange behaviour with the removal tool is that it will remove some URLs which now return 404, but not others. My blog has an archive feature with links to date pages, all of which were indexed by Google. I have now removed those pages, and the URL removal tool says "request denied"...

There must be a way to disallow ALL dynamic URLs on a site. I can't believe that the wildcard is not yet a standard.
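For what it's worth, Google's wildcard matching in robots.txt is just prefix matching with * standing for "any run of characters". A rough approximation of the idea in Python (not Google's actual code):

import re

def googlebot_blocks(disallow_pattern, url_path):
    # Translate the Disallow pattern into a regex anchored at the
    # start of the path, with * matching any sequence of characters.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in disallow_pattern)
    return re.match(regex, url_path) is not None

print(googlebot_blocks("/*?", "/products.php?id=5&sessionid=ABC"))  # True
print(googlebot_blocks("/*?", "/products/widgets.html"))            # False

This is only an approximation; it ignores things like the $ end-of-URL anchor that Google also supports.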

tedster

3:27 pm on Nov 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's a difference between what you can put in the robots.txt for Googlebot (it accepts wildcards) and what the URL removal tool will accept (it must be one unique URL at a time).

The "request denied" message is probably about some tehnical issue or other. However, I would not advise over using the removal tool -- save it for important issues and let Google handle your ordinary 404 responses however it sees fit.

By trying to "micro-manage" the Google index, many people have accidentally created significant ranking problems. No other search engine that I know of even has such a tool -- I suggest treating it like the very heavy duty action that it is.

mcskoufis

6:03 pm on Nov 5, 2006 (gmt 0)

10+ Year Member



There's a difference between what you can put in the robots.txt for Googlebot (it accepts wildcards) and what the URL removal tool will accept (it must be one unique URL at a time).

Apart from the "remove url" option it also allows to remove content according to the robots.txt. For this option I am talking about.

Fiver

2:25 pm on Nov 15, 2006 (gmt 0)

10+ Year Member



So my only solution is to build a huge and absurd sitemap to link to all of these pages so google will crawl them again, and take notice of the noindex tag?

have to admit, that seems a tad counter-intuitive.

g1smd

1:29 pm on Nov 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If Google has previously crawled a URL then they will eventually crawl it again - in their own time - with or without a sitemap.