Our site is dynamic and DB-driven... we have session IDs in the URL and IPs in the URL... so I suspect that was part of the problem. But in the past (two years now) Google has successfully crawled our site. So I'm a little perplexed about what to do.
Other than redesigning our site, does anyone have suggestions that would allow Google to crawl, but not bring us to our knees?
Thanks,
JP
Make sure your TOS says the site is only to be accessed by true users and then hit 'em with a lawsuit for not reading and obeying it ;)
I'm all for changing robots.txt so that it has to explicitly say a site may be crawled by spiders....that should get the SEs' index sizes back to something manageable...LOL ;)
Heck, the law says if some uninvited Joe decided to wander through my home in the middle of the day I am perfectly entitled to blast him with a 12 gauge....same should apply to spiders online :)
Seriously though, Google needs to put a little more thought into what its bot is doing with regard to dynamic URLs. This is not your problem to fix, it is theirs.
This is not your problem to fix, it is theirs.
Google specifically ask webmasters not to serve googlebot with an SID.
Fairly easy to do this by user-agent.
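A minimal sketch of that idea in Python, assuming the site builds its own links and appends the session as an "SID" query parameter. The names build_link, is_crawler and CRAWLER_TOKENS are made up for illustration, and matching on the substring "googlebot" is only a rough check, not a verified way to identify the bot.

```python
# Hypothetical list of lowercase substrings that mark a crawler's user-agent.
CRAWLER_TOKENS = ("googlebot",)


def is_crawler(user_agent):
    """Return True if the user-agent string looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def build_link(path, session_id, user_agent):
    """Append the session ID only for visitors that are not known crawlers."""
    if is_crawler(user_agent):
        return path  # crawlers get the clean URL
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}SID={session_id}"
```

A crawler then sees /item?id=5 while a normal browser gets /item?id=5&SID=... on the same page.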
Seriously though, Google needs to put a little more thought into what its bot is doing with regard to dynamic URLs.
I agree with that though - 99% of SIDs actually have "SID=" at the end of the URL - no idea why they can't just ignore everything from "SID=" onwards.
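For what it's worth, here is roughly what that normalisation could look like in Python. It is only a sketch: it assumes the session is carried in a query parameter literally named "SID", and it drops just that parameter rather than truncating everything after it, which is a slightly more careful variant of the idea. The function name strip_session_id is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="SID"):
    """Return the URL with the session parameter removed (case-insensitive)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.upper() != param.upper()
    ]
    # Rebuild the URL with the remaining query parameters only.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two URLs that differ only in their SID then collapse to the same address, so a crawler doing this would fetch the page once instead of once per session.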
TJ
Make sure your TOS says the site is only to be accessed by true users and then hit 'em with a lawsuit for not reading and obeying it ;)
I don't see that any such statement is necessary. It behoves all search engines to manage their robots in a manner that is not detrimental to the sites they are visiting. If I write a program that brings a website to its knees it's likely to be called a D.O.S. attack and if I'm caught I should rightly end up in court. I do not believe that a defence based on the site's T.O.S. would be effective.
Kaled.
IANAL, but assuming it was unintentional I'd say not likely in the US. I'd never rule against Google if I were the juror. Anyone who is savvy enough to know how to serve up dynamic URLs should know about robots.txt.
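As a rough illustration of the robots.txt route, the sketch below uses Python's standard urllib.robotparser to check what a given rule set would block. The rules and the /cart/ path are hypothetical, standing in for whatever part of a site generates the session-heavy URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of a session-heavy path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if the rules above let this agent fetch the URL."""
    return parser.can_fetch(agent, url)
```

With these rules, allowed() rejects anything under /cart/ and accepts the rest of the site, which is a quick way to sanity-check a rule set before deploying it.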
1. Email Google your situation and ask them if they can pace the crawling of your site.
2. Or...Get a more powerful server and a reliable host. Think of it this way: if those spider hits were unique human visitors instead, you would still have the same problem.