Forum Moderators: Ocean10000

BaiDuSpider

     
8:55 am on Mar 29, 2001 (gmt 0)

New User

10+ Year Member

joined:Apr 6, 2004
posts:6
votes: 0


Does anybody know anything about it?
I have no other info on it.
5:36 pm on Mar 29, 2001 (gmt 0)

Senior Member

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 17, 2000
posts:2924
votes: 0


Can't say I have any info on it. If you could dig up the IP, I could do some snooping for you.
8:19 pm on Mar 29, 2001 (gmt 0)

Senior Member

WebmasterWorld Senior Member jeremy_goodrich is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 4, 2000
posts:3468
votes: 0


211.100.24.91

That's the IP...I've seen it too, only a few times. This one is from China.

Not sure if it's the same one. I found it on only three or four pages, and on one of them the bot, if it was one, appears to have made a malformed GET request.

Jeremy, The artist formerly known as Han Solo

FlemmingLeer

7:16 pm on Apr 8, 2001 (gmt 0)

Inactive Member
Account Expired

 
 


Hi,

I also recently became aware of this spider.

The IP address belongs to net263.com, which uses BaiDuSpider from www.baidu.com (a supplier of search results to other search engines, like Inktomi).

[apnic.net...]

The BaiDuSpider was also spotted here in Denmark.

Namaste

10:31 am on Apr 9, 2001 (gmt 0)

New User

10+ Year Member

joined:Apr 6, 2004
posts:6
votes: 0


thanks a lot
3:29 pm on Aug 17, 2001 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 22, 2001
posts:2450
votes: 0


Here's my recently compiled list of Baidu IPs:

162.105.207.192
202.103.134.196
202.108.250.226
211.100.24.91
211.100.24.92
211.100.24.93

3:28 pm on Aug 20, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2001
posts:1609
votes: 0


add this:

202.108.250.243

10:56 pm on Aug 20, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 14, 2002
posts:378
votes: 0


I have also had the Baidu spider from:

211.100.25.*

7:48 pm on Oct 22, 2002 (gmt 0)

Full Member

10+ Year Member

joined:Oct 22, 2002
posts:217
votes: 0


Also: 202.108.250.199
5:33 pm on Oct 24, 2002 (gmt 0)

New User

10+ Year Member

joined:Feb 8, 2004
posts:8
votes: 1


+ 202.108.250.197
4:45 pm on Oct 25, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


I had it from:

202.108.250.195

This thing is a ***! I have had it read robots.txt, but it sure does not follow it!

dave

[edited by: eelixduppy at 9:42 pm (utc) on Feb. 18, 2009]

TheOddOne

3:51 am on Oct 27, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


I've had hits from:
202.108.250.195
202.108.250.198
202.108.250.199
202.108.250.241
202.108.250.242
202.108.250.243

So it appears that they are using
202.108.250.0/24 for this. I've been getting at least 2-3 hits a day from this thing...
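
For anyone who wants to check logged addresses against that range before adding a firewall rule, here is a minimal sketch using Python's standard ipaddress module. The list holds addresses reported in this thread; the helper name in_baidu_net is made up for illustration, and the /24 itself is only the guess made above, not a range confirmed by Baidu.

```python
import ipaddress

# Addresses reported in this thread (not an official list).
REPORTED = [
    "202.108.250.195",
    "202.108.250.198",
    "202.108.250.199",
    "202.108.250.241",
    "202.108.250.242",
    "202.108.250.243",
    "211.100.24.91",
]

# The suspected block, written out in full CIDR notation.
BAIDU_NET = ipaddress.ip_network("202.108.250.0/24")


def in_baidu_net(ip: str) -> bool:
    """Return True if ip falls inside the suspected 202.108.250.0/24 block."""
    return ipaddress.ip_address(ip) in BAIDU_NET


# Addresses from the thread that a rule for this /24 alone would NOT cover.
outside = [ip for ip in REPORTED if not in_baidu_net(ip)]
print(outside)  # ['211.100.24.91']
```

So a rule for the /24 alone would still miss the 211.100.24.x addresses listed earlier in the thread.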

--ToO

3:22 am on Oct 28, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


I am running BaiDuSpider and am very sorry to hear that.
According to this spider's design, that should be impossible.
If someone can give me the robots.txt and a piece of the log from your site to study, it will be helpful.
Thank you for bringing this bug forward.
4:39 am on Oct 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Hi Chenjk:

How about this:

User-agent: BaiDuSpider
Disallow: /

and, from my log for TODAY:

202.108.250.195 - - [27/Oct/2002:04:35:49 -0700] "GET / HTTP/1.1" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:04:36:35 -0700] "GET [**********.com...] HTTP/1.0" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:04:36:39 -0700] "GET [**********.com...] HTTP/1.1" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:04:37:01 -0700] "GET / HTTP/1.0" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:14:20:45 -0700] "GET / HTTP/1.1" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:14:20:49 -0700] "GET [**********.com...] HTTP/1.0" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:14:21:10 -0700] "GET [**********.com...] HTTP/1.1" 403 750 "-" "BaiDuSpider"
202.108.250.195 - - [27/Oct/2002:14:21:28 -0700] "GET / HTTP/1.0" 403 750 "-" "BaiDuSpider"

The ONLY reason there are not more is that I have had to block the IPs this spider uses at my firewall... and I just have not yet added 202.108.250.195.

Despite the pretty clear instructions in MY robots.txt, this spider does not seem to understand to stay away!
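
For anyone who wants to pull the same evidence out of their own logs, here is a rough sketch that counts BaiDuSpider requests per IP and status code. It assumes the Apache "combined" log format, which is what the lines above look like; the pattern LOG_RE and the function baidu_hits are names invented for this example, and the two sample lines are copied from the log excerpt above.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def baidu_hits(lines):
    """Count requests per (ip, status) whose user-agent mentions baiduspider."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        # Compare in lower case so "BaiDuSpider" and "baiduspider" both count.
        if match and "baiduspider" in match.group("agent").lower():
            counts[(match.group("ip"), int(match.group("status")))] += 1
    return counts


sample = [
    '202.108.250.195 - - [27/Oct/2002:04:35:49 -0700] "GET / HTTP/1.1" 403 750 "-" "BaiDuSpider"',
    '202.108.250.195 - - [27/Oct/2002:04:37:01 -0700] "GET / HTTP/1.0" 403 750 "-" "BaiDuSpider"',
]
print(baidu_hits(sample))
```

Lines that do not fit the assumed format are skipped silently, so a custom LogFormat would need its own pattern.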

dave

[edited by: littleman at 7:19 pm (utc) on Oct. 28, 2002]
[edit reason] took the links out [/edit]

4:48 am on Oct 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Carfac,

I think that should be:

User-agent: Baidu
Disallow: /

[bar.baidu.com...]

6:18 am on Oct 28, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


Hi Carfac,
I think I got it.
First, we used "baiduspider" to check robots.txt in the Baidu spider, but it appears as "BaiDuSpider" in your site log. This is a bug introduced with a system update. I will fix it ASAP.
Second, our spider is an incremental spider. It checks robots.txt and stores it locally. It cannot detect that your robots.txt has changed until the next check (usually two days later). It may therefore break the protocol if robots.txt changes before the next check. I will adjust the strategy to reduce these cases.

Thank you very much for your help. It would be very kind if you could re-check it later and tell me the result.
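
To illustrate the first point: the original robots.txt guidelines recommend matching the User-agent name case-insensitively, so "BaiDuSpider" and "baiduspider" should be treated as the same robot. The sketch below uses Python's standard urllib.robotparser, which lower-cases both names before comparing. It is only an illustration of that matching rule, not Baidu's code; the helper name allowed and the example.com URL are made up.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The record carfac posted, with the mixed-case spelling.
ROBOTS = """\
User-agent: BaiDuSpider
Disallow: /
"""

# Both spellings hit the same record, so both are refused.
print(allowed(ROBOTS, "baiduspider", "http://www.example.com/"))  # False
print(allowed(ROBOTS, "BaiDuSpider", "http://www.example.com/"))  # False
# A robot the file does not name is unaffected.
print(allowed(ROBOTS, "Googlebot", "http://www.example.com/"))    # True
```

A crawler that compares the names with exact case would miss the record in this file, which is the kind of mismatch described above.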

3:22 pm on Oct 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


KeyMaster:

Thanks for that update- I will fix that.

chenjk:

Glad to be of some help!

dave

4:02 am on Nov 5, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


I have fixed these bugs. Please re-check it. If you have any questions about baiduspider, please send email to baiduspider@baidu.com. I will reply soon.
Thank you very much.
4:36 am on Nov 5, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


chenjk,

In all the excitement here, one thing did not get mentioned:

Your spider will not be welcome on many web sites simply because it does not provide a link to more information or a contact address in its user-agent string.

Here is the Googlebot user-agent string as seen in my logs:

 Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Therefore, if I as a webmaster have a question about "What is Googlebot?" I can go to that URL and find out. Information is provided about how to report bugs, what is the spider name that should be used in robots.txt, and more.

Here is the user-agent string from Slurp, which is Inktomi's spider:

 Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)

They give both a web page URL and an e-mail contact address.

I suggest you change your user-agent string to

 BaiDuSpider/1.0 (BaiDuSpiderATbaidu.com)

so that webmasters can contact you. It will build some trust and reduce the number of webmasters who will block BaiDu as an unknown spider.

If you have a web page about the spider, that would be even better!

 BaiDuSpider/1.0 (BaiDuSpiderATbaidu.com; http://www.baidu.com/spiderinfo.html)

(You may want several pages for different language versions of the information page)
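
As a rough illustration of why this matters to webmasters, here is a small sketch that pulls any contact URL or e-mail address out of a user-agent string, the way a log-review script might before deciding whether to block an unknown spider. The function name contact_info and both patterns are invented for this example and are deliberately loose; they are not a complete URL or e-mail grammar.

```python
import re

# Loose patterns: a URL runs until whitespace, ';' or ')'.
URL_RE = re.compile(r'https?://[^\s;)]+')
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def contact_info(user_agent: str) -> dict:
    """Return the URLs and e-mail addresses found in a user-agent string."""
    return {
        "urls": URL_RE.findall(user_agent),
        "emails": EMAIL_RE.findall(user_agent),
    }


# User-agent strings quoted in this thread.
agents = [
    "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",
    "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)",
    "BaiDuSpider",
]
for agent in agents:
    info = contact_info(agent)
    has_contact = bool(info["urls"] or info["emails"])
    print(agent, "->", info, "contact given" if has_contact else "no contact")
```

Of the three strings, only the bare "BaiDuSpider" yields nothing to follow up on.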

Thanks,
Jim

2:55 am on Nov 22, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


Thank you for your valuable advice.
We have updated the user-agent string. It now looks like
baiduspider+(+http://www.baidu.com/search/spider.htm)
We have only a Chinese page for now and will supply pages in other languages soon.
3:47 am on Nov 22, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


chenjk,

Running your spider information page through AltaVista's translator in Chinese-to-English mode works reasonably well.

For those who want the "quick answer", according to the information on that page, you may disallow BaiDuSpider using


User-agent: baiduspider
Disallow: /

Do we have any reports of this spider abusing sites with the correct User-agent in robots.txt?
How about sites using "User-agent: *"?

Thanks for the response!

Jim

4:27 am on Nov 22, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


jdMorgan,
There was a bug in our spider system about a month ago, and I have fixed it. Since then I have not received any abuse reports about it.

If a site disallows baiduspider or * in its robots.txt, BaiduSpider will not crawl pages on that site. On our information page, however, we suggest that webmasters use the user-agent "baiduspider" to disallow our spider, so that they do not disallow by mistake other spiders that they want to allow.

Thanks for your suggestion.

4:47 am on Nov 22, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


chenjk,

Thank you for fixing the bug in BaiDuSpider, and for adding the link to the information page in the user-agent string. Both of these improvements will help to prevent your spider from being banned by more webmasters.

Jim

5:59 am on Dec 9, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


It appears that BaiDuSpider is back, and without the contact info in the UA string:

202.108.250.199 - - [08/Dec/2002:04:19:19 -0500] "GET /robots.txt HTTP/1.1" 200 1331 "-" "BaiDuSpider"
202.108.250.199 - - [08/Dec/2002:04:19:20 -0500] "GET / HTTP/1.1" 200 16416 "-" "BaiDuSpider"

Jim

7:14 am on Dec 10, 2002 (gmt 0)

New User

10+ Year Member

joined:Oct 28, 2002
posts:6
votes: 0


That is not so bad.
This log records a visit from our dns-agent.
As I have mentioned, our spider stores robots.txt locally and checks it periodically. This is done by our dns-agent.
The dns-agent just refreshes robots.txt and checks whether your site is case-sensitive or not. It does not follow any links. It just gets the homepage, checks the response headers and drops the page.
So I did not set its user-agent to the new one.
Does this break the robots protocol?
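
The "store robots.txt locally and check it periodically" idea can be sketched roughly as below. This is not Baidu's actual code, which is not shown anywhere in this thread; the class RobotsCache, its parameters, and the fake fetcher and clock are all hypothetical and exist only to show the refresh logic.

```python
import time
from typing import Callable, Dict, Tuple


class RobotsCache:
    """Keep one robots.txt copy per host; refetch once it is max_age seconds old."""

    def __init__(self, fetch: Callable[[str], str], max_age: float,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch        # hypothetical downloader: host -> robots.txt text
        self._max_age = max_age    # seconds a cached copy is trusted
        self._clock = clock        # injectable so the demo needs no real waiting
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._store.get(host)
        # Refetch when nothing is cached or the copy has reached max_age.
        if cached is None or now - cached[0] >= self._max_age:
            cached = (now, self._fetch(host))
            self._store[host] = cached
        return cached[1]


# Demo with a fake clock and a fake fetcher instead of real HTTP.
fake_now = [0.0]
fetch_log = []


def fake_fetch(host: str) -> str:
    fetch_log.append(host)
    return "robots.txt version %d" % len(fetch_log)


cache = RobotsCache(fake_fetch, max_age=2 * 24 * 3600, clock=lambda: fake_now[0])
first = cache.get("www.example.com")    # fetched
fake_now[0] = 24 * 3600                 # one day later: cached copy still used
second = cache.get("www.example.com")
fake_now[0] = 2 * 24 * 3600             # two days later: copy is stale, refetched
third = cache.get("www.example.com")
print(first, "|", second, "|", third)
```

With a two-day max_age, a robots.txt changed right after a fetch can be ignored for up to two days, which matches the behavior described earlier in the thread. A shorter max_age narrows that window at the cost of more requests for robots.txt.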
3:42 am on Jan 23, 2003 (gmt 0)

New User

10+ Year Member

joined:Jan 23, 2003
posts:22
votes: 0


Does anybody have a list of IPs for this bot? I don't have any content that would suit them, so I don't want my bandwidth sucked up by a search engine halfway around the world.
5:02 am on Jan 23, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


aodonline,

This spider is now obeying robots.txt.

Add the following lines to your robots.txt file, followed by a blank line:


User-agent: baiduspider
Disallow: /

You can also use the AltaVista translator [babelfish.altavista.com] to translate baidu's robot info page [baidu.com].

Here's some introductory info [searchengineworld.com] on robots.txt

HTH,
Jim

2:55 am on Jan 24, 2003 (gmt 0)

New User

10+ Year Member

joined:Jan 19, 2003
posts:22
votes: 0


Three recent visits from the new BaiDuSpider. Note new agent spelling, and no check of robots.txt - hence a nice cozy placement in my ban list. :o

202.108.250.198 - - [21/Jan/2003:07:16:02 -0600] "GET / HTTP/1.1" 200 24627 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.250.198 - - [22/Jan/2003:15:59:04 -0600] "GET / HTTP/1.1" 200 769 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.250.198 - - [23/Jan/2003:08:38:01 -0600] "GET / HTTP/1.1" 200 769 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

3:39 am on Jan 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


delete