Forum Moderators: open
123.125.71.104 is one of the Chinese bot sources - I block that, but it's useful to know that both JP and CN are using the same UA. But are they feeding their results to the same search engine?
iptables -A INPUT -s 119.63.192.0/21 -j DROP
service iptables save
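Before committing a CIDR drop like the one above, it can help to sanity-check which addresses the mask actually covers. A minimal sketch in plain shell (the helper names and the test IPs are illustrative, not from the thread):

```shell
#!/bin/sh
# Convert dotted-quad to a 32-bit integer; runs in a subshell so
# the IFS change does not leak out.
to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# in_cidr IP CIDR -> prints "yes" if IP falls inside CIDR, else "no".
in_cidr() {
    mask=$(( (0xFFFFFFFF << (32 - ${2#*/})) & 0xFFFFFFFF ))
    if [ $(( $(to_int "$1") & mask )) -eq $(( $(to_int "${2%/*}") & mask )) ]; then
        echo yes
    else
        echo no
    fi
}

in_cidr 119.63.196.10 119.63.192.0/21    # inside the /21 -> prints "yes"
in_cidr 123.125.71.104 119.63.192.0/21   # the other bot source -> prints "no"
```

119.63.192.0/21 spans 119.63.192.0 through 119.63.199.255, so 123.125.71.104 would need its own rule.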
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider+
Disallow: /

Next step:
iptables -A INPUT -s 119.63.192.0/21 -j DROP
service iptables save
I don't know if it's a valid concern, but what if they don't handle a 403 HTTP response properly and keep pounding the server?
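One way to settle that concern is to watch the access log after the 403 is in place: if the hit count keeps climbing, the bot is ignoring the response and the iptables drop is the right tool. A rough sketch, assuming Apache's combined log format (status code in field 9; the function name and log path are my own):

```shell
#!/bin/sh
# bot_hits LOGFILE -- count Baiduspider requests and how many got a 403.
# The combined-log layout is an assumption; adjust the field number
# if your LogFormat differs.
bot_hits() {
    total=$(grep -c -i 'baiduspider' "$1")
    denied=$(awk 'tolower($0) ~ /baiduspider/ && $9 == 403 {n++} END {print n+0}' "$1")
    echo "baiduspider hits: $total (403s: $denied)"
}
```

Run it against the live log, e.g. `bot_hits /var/log/apache2/access.log`, before and after adding the block, and compare the totals.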
"Extractor" is one of the so-called standard deny terms.
Others: crawler, spider, download, harvest, email, larbin, Nutch, link, PHP, Reaper, Wget, fetch, curl, libwww, and any variations of wording with similar definitions.
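Terms like those are typically wired into Apache with SetEnvIfNoCase plus a deny rule. A minimal sketch using a few of the terms above (the env-variable name and directory path are my own, and this is the Apache 2.2-era Order/Allow/Deny syntax):

```
# Flag requests whose User-Agent contains a deny term (case-insensitive)
SetEnvIfNoCase User-Agent (crawler|spider|harvest|larbin|nutch|wget|curl|libwww) bad_bot

<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>
```

Be careful with broad terms like "link" or "PHP" in such a pattern - they match legitimate UAs too, which is presumably why they're listed as candidates rather than defaults.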
Do they call themselves by any other name?
Btw, Wikipedia says: "The user-agent string of Baidu search engine is baiduspider." [en.wikipedia.org...]
I must be missing something here -- no mod_rewrite? -- because a solution's always seemed as easy as:
RewriteCond %{HTTP_USER_AGENT} baidu [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]
(Note the leading slash: %{REQUEST_URI} starts with "/", so without it the exception never matches and even robots.txt gets the 403.)
(10,000 pages crawled daily by a company that provides no discernible benefit? No way.)