Forum Moderators: phranque

spiders and an infinite number of pages

How can I steer spiders in my infinite number of pages?


amoore

7:06 pm on Feb 15, 2002 (gmt 0)

10+ Year Member




I have a database-driven calendar site that creates an infinite number of pages, like example.com/2/13/2002/foo.html, example.com/2/14/2002/foo.html, example.com/2/15/2002/foo.html, and so on. The pages are all (almost) the same default template page unless there is something in the database that corresponds to that date. Additionally, each one has a link to the pages for the next and previous date. (It's actually a good deal more complex than that, but I assume you don't care.)

This works great for spiders: they eat up all those pages and apparently index them a lot better than if I had used a URL scheme that gives away the fact that they are dynamic, for instance by putting the date in a query string.

This is a problem, though, because a spider can wander around in this endless site for days and keep finding pages. Most are identical, especially in the distant past or future, but most spiders don't seem to care (especially Scooter). I would like to steer, or at least encourage, the spiders to stay in the material that is somewhat current, or in the date ranges that have more useful information.

I can think of some ways to do it: for instance, use robots.txt to prevent them from hitting years other than this one (or so), or make my application return a redirect (or something) for pages well in the distant past or future, or some other hacks. I bet you guys can come up with a more creative solution, though, or perhaps have even solved something similar. Any good ideas?
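The redirect idea above can be sketched as follows. This is a hypothetical illustration, not the poster's actual code: any requested date outside a window around today is clamped to the nearest boundary, and the application would issue a 301 to the resulting URL. The window size and URL pattern are assumptions based on the examples in the post.

```python
from datetime import date, timedelta

# How far into the past/future spiders may roam (assumed value).
WINDOW = timedelta(days=365)

def clamp_date(requested: date, today: date) -> date:
    """Return the requested date if in range, else the nearest boundary."""
    low, high = today - WINDOW, today + WINDOW
    if requested < low:
        return low
    if requested > high:
        return high
    return requested

def target_url(requested: date, today: date) -> str:
    """URL to serve; the caller would issue a 301 if it differs from the request."""
    d = clamp_date(requested, today)
    return f"/{d.month}/{d.day}/{d.year}/foo.html"
```

A request for a date in 1990 would then redirect to the page one year back from today, collapsing the infinite tail of identical pages into a single boundary page.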

Brett_Tabke

5:32 am on Feb 20, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Unwitting spider traps are a fact of life in the dynamic sphere. I use random URLs here on some links as a low-tech but highly effective means of cache busting. It can be a very big problem.

You have the right idea about putting up "stoppers" in the form of dead-end links, such as missing years in your calendar.
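The "stopper" idea can be shown in a minimal sketch: simply omit the next/previous links once a page falls outside the span of years that actually has data. The year bounds here are hypothetical placeholders, not values from the thread.

```python
from datetime import date, timedelta

# Assumed bounds of the years that actually contain real content.
FIRST_YEAR, LAST_YEAR = 2001, 2003

def has_next_link(page_date: date) -> bool:
    """Emit a 'next' link only while the following day is still in range."""
    return (page_date + timedelta(days=1)).year <= LAST_YEAR

def has_prev_link(page_date: date) -> bool:
    """Emit a 'previous' link only while the preceding day is still in range."""
    return (page_date - timedelta(days=1)).year >= FIRST_YEAR
```

A spider walking the next links then hits a natural dead end on the last day of the last populated year, rather than crawling forever.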

You really should be careful, though. I have no doubt you are going to get spotted by some SE; they don't take kindly to those types of DB sites. You'll get flagged for dupe content on all those pages that differ only by date.

Damian

6:05 am on Feb 20, 2002 (gmt 0)

10+ Year Member



Maybe you can have the forward and backward links generated dynamically on the fly (i.e. from an SSI), so that they return no links, or different links, if the year is not this year and the IP is on your spider IP list?
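A rough sketch of this suggestion, under stated assumptions: the spider IP list, the function name, and the URL pattern are all invented for illustration. Known spider IPs get no navigation links once the page drifts outside the current year; everyone else gets the normal next/previous links.

```python
from datetime import date, timedelta

# Example addresses only -- a real list would come from observed crawler IPs.
SPIDER_IPS = {"192.0.2.10", "192.0.2.11"}

def nav_links(page_date: date, client_ip: str, today: date) -> dict:
    """Build next/prev links, hiding them from spiders on stale years."""
    if client_ip in SPIDER_IPS and page_date.year != today.year:
        return {}  # dead end for spiders outside the current year
    prv = page_date - timedelta(days=1)
    nxt = page_date + timedelta(days=1)
    return {
        "prev": f"/{prv.month}/{prv.day}/{prv.year}/foo.html",
        "next": f"/{nxt.month}/{nxt.day}/{nxt.year}/foo.html",
    }
```

The trade-off is that serving different links to spiders than to users is a form of cloaking, which (as Brett notes above) search engines may penalize if detected.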

Tapolyai

6:26 am on Feb 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or you could just update your robots.txt file to exclude anything other than one year's worth of info.

Any other spider that gets trapped... you would want trapped...
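One wrinkle with the robots.txt approach: Disallow rules match path prefixes only, and in the month-first layout from the original post (/2/15/2002/) the year is the third path segment, so it cannot be targeted by a prefix. The hypothetical example below therefore assumes the URLs are restructured to put the year first:

```
# Hypothetical robots.txt, assuming year-first URLs like /2002/2/15/foo.html
User-agent: *
Disallow: /2000/
Disallow: /2001/
Disallow: /2003/
```

The out-of-range years would need to be listed (or generated) explicitly, since the original robots.txt standard has no wildcard support.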