Jakarta Commons-HttpClient/3.0-rc2 from 216.178.35.203 amongst others.
I tried to send them a quick message via their web form, but that just threw an error. Oh well, let it eat 403s then...
Just one line :)
client.getParams().setParameter("http.useragent","MySpace News (http://news.myspace.com)");
We should start to petition these companies to fix their UAs.
It's one lousy line of code left out by one lazy or incompetent programmer, and that signals a quality problem serious enough that the bot shouldn't be allowed to crawl in the first place.
Every bot should provide the following:
1) an actual UA identifying the source
2) a link to a page with more information about what they want, and what parts of robots.txt they honor
3) a form to contact them about bugs and/or crawl removal if robots.txt fails to stop the bot
Without those, it should just be an automatic block, party over, done.
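For anyone who wants to automate that rule on the server side, here is a rough sketch of the check in Java. The class name, method names and prefix list are all made up for illustration, and the list is surely incomplete. It only covers what can be read from the UA string itself (points 1 and 2); whether a contact form exists can't be checked this way.

```java
import java.util.Locale;

// Hypothetical helper, not part of any library: decides from the User-Agent
// string alone whether a request looks like an unidentified default-library bot.
final class UserAgentPolicy {

    // Assumed list of stock library UA prefixes (lowercase); extend to taste.
    private static final String[] DEFAULT_LIBRARY_PREFIXES = {
        "jakarta commons-httpclient/",
        "java/"
    };

    private UserAgentPolicy() {}

    // Point 1: block a missing/empty UA or one that is just a library default.
    static boolean shouldBlock(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true;
        }
        String ua = userAgent.trim().toLowerCase(Locale.ROOT);
        for (String prefix : DEFAULT_LIBRARY_PREFIXES) {
            if (ua.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Point 2: does the UA carry a link to an info page?
    static boolean hasInfoLink(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return ua.contains("http://") || ua.contains("https://");
    }
}
```

With this, the stock "Jakarta Commons-HttpClient/3.0-rc2" string gets blocked, while a UA like the one-liner posted above passes and also carries an info link. Don't use hasInfoLink on its own as a block rule, since ordinary browser UAs have no link in them either.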