

Using Robots.txt to Exclude Duplicate Database pages

         

sid560

2:09 pm on Nov 11, 2006 (gmt 0)

10+ Year Member



I've noticed that many variations of my database-driven pages are getting indexed and usually turn supplemental. Would it be a good idea to disallow these URL patterns in robots.txt?

URLs like these:
Disallow: /*reply_to_ad.cfm*
Disallow: /*sort_by=*
Disallow: /*my_chk_list*
Disallow: /*&lst_start=*
Disallow: /*session_key*

These all end up supplemental and are probably a drag on the rankings of my most important pages. Good idea? Any others I should add?
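
For context, here is a minimal sketch of how patterns like these might sit in a complete robots.txt file. The patterns are the ones listed above; the choice of user-agent section matters, as the replies below explain, so take the grouping here as illustrative only.

# Sketch only: the patterns from the post grouped under a single user-agent.
# The trailing * on each pattern is redundant (a prefix match is implied)
# but harmless.
User-agent: Googlebot
Disallow: /*reply_to_ad.cfm*
Disallow: /*sort_by=*
Disallow: /*my_chk_list*
Disallow: /*&lst_start=*
Disallow: /*session_key*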

tedster

11:51 pm on Nov 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think this is a solid approach, and I use it on several websites. I would rather CHOOSE which URL Google indexes for a given bit of content, and in some cases even which sorted or filtered version of the data gets spidered. Why let Googlebot run in circles when you can help it go straight for the good stuff?

g1smd

12:25 am on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, do add those, but ONLY in the User-agent: Googlebot section - other bots do not yet understand wildcard patterns.

Also be aware that if you have a User-agent: Googlebot section, ALL instructions for Google must go in that section. Google completely ignores the User-agent: * section when a User-agent: Googlebot section is present.
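
To illustrate the point, here is a hedged sketch; the /cgi-bin/ and /images/ paths are hypothetical placeholders, not from this thread. Because Google reads only the most specific matching section, anything kept in the * section has to be repeated in the Googlebot section alongside the wildcard rules.

# Other crawlers read this section; Google ignores it once a
# Googlebot section exists.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

# Google reads ONLY this section, so the general rules are repeated
# here together with the wildcard patterns.
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /*sort_by=*
Disallow: /*session_key*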

tedster

12:44 am on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, Yahoo announced last Friday that Slurp now supports wildcards in robots.txt, too.

[webmasterworld.com...]

g1smd

5:03 pm on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, I can see the robots.txt file getting longer and longer with repetition per agent.
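
As a rough illustration of that repetition (a sketch only, reusing patterns from earlier in the thread), supporting wildcards for both Googlebot and Slurp means carrying a near-identical block per agent:

User-agent: Googlebot
Disallow: /*sort_by=*
Disallow: /*&lst_start=*
Disallow: /*session_key*

User-agent: Slurp
Disallow: /*sort_by=*
Disallow: /*&lst_start=*
Disallow: /*session_key*

The robots.txt format does allow several User-agent lines to share one block of rules, which would cut down the duplication, though whether every crawler handles that grouping correctly is worth checking.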

Asia_Expat

10:06 pm on Nov 12, 2006 (gmt 0)

10+ Year Member



I added some wildcard exclusions to my robots.txt file a week or two ago to manage my database-driven forum installation, and I am already seeing some very promising improvements in traffic. I agree with Tedster... this appears to be a solid method. It is also compliant, white hat, and uncomplicated (once you've figured out exactly which URLs you should be indexing).

[edited by: Asia_Expat at 10:07 pm (utc) on Nov. 12, 2006]

g1smd

10:15 pm on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Read back a few months to see how I also managed to get a 40 000-thread forum (one that exposed more than 10 URLs for every thread, as well as almost another half a million "you are not logged in" pages) reindexed as 40 000 threads with one URL per thread, plus a few thousand thread index pages.

It has taken just over a year for everything to fall into place. Previously it had about 750 000 indexed URLs, and very many were marked as Supplemental.