Forum Moderators: open
Just a heads up.
Requested the same two specific pages, six times over fourteen minutes.
No robots.txt, no images.
Although most everybody already has that UA denied (in one form or another).
The subnet provider (which is the same company as the backbone) offers a specific explanation on document retrieval.
viw1219675461840484619yvmtlibwww-perl/5.801
kxd1219610210589019775qelflibwww-perl/5.801
wuw1219636009107147216fobklibwww-perl/5.801
(CPAN shows the current mod is v5.814, circa July 25, 2008.)
I don't know if that's simply a misconfigured script, or if the prefixed data is intentionally designed to break "^libwww-perl" rewrites, etc. FWIW
designed to break "^libwww-perl" rewrites
That's one reason I mostly don't use anchors: a plain "libwww-perl" match would zap any variation on that theme.
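The difference is easy to demonstrate. A minimal Python sketch, using one of the actual UA strings logged above, shows how the random prefix defeats an anchored pattern while an unanchored match still catches it:

```python
import re

# One of the UA strings from the logs above: random junk prefixed
# to the real libwww-perl identifier.
ua = "viw1219675461840484619yvmtlibwww-perl/5.801"

anchored = re.compile(r"^libwww-perl")   # the common anchored rewrite pattern
unanchored = re.compile(r"libwww-perl")  # matches anywhere in the string

print(bool(anchored.search(ua)))    # False: the prefix defeats the anchor
print(bool(unanchored.search(ua)))  # True: still caught
```

The same logic carries over to Apache rewrite rules: dropping the `^` from the pattern makes it a substring match rather than a prefix match.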
Then again, I whitelist, so any variation wouldn't pass the whitelist in the first place; it's limited to googlebot, slurp, msnbot, teoma, MSIE, Firefox and Opera.
Everything else goes away.
I also post-filter MSIE, Firefox and Opera for bad keywords like "crawl" or "download" or "http:" addresses and dump those as well.
The downside is mobile devices get whacked but the upside is I don't get many of them in the first place.
[edited by: incrediBILL at 7:46 pm (utc) on Aug. 25, 2008]
I let Safari, Mozilla, Netscape and Konqueror pass, but only because they satisfy my browser filter rules.
Why would anyone let Lynx in? It's usually used by tools that strip off the HTML just to scrape the text so it's kicked to the curb.
Anyone who installs junk that adds promotional HREFs to the UA gets the boot, so this month 800+ hits from one browser plug-in with its promotional URL went straight into the trash ;)
[edited by: incrediBILL at 4:29 am (utc) on Aug. 26, 2008]