
Will this prevent googlebot from crawling?

Taiwan Spider Won't Stop Crawling.....


mrfeelingfine

8:19 pm on Jul 30, 2002 (gmt 0)

10+ Year Member



I've created a robots.txt disallow for the Openfind data gatherer, Openbot/3.0+, a Taiwan crawler, so it now gets a 403 message when it attempts to crawl.

I've emailed the address given in the log file numerous times, but it keeps trying to crawl anyway, and the mail to the address they suggest on their site keeps bouncing back.

I'm getting about 60 attempts a day from this crawler. Since it's getting the 403, will this interfere when googlebot comes?

Should I do anything else??? Thanks.

Chris_R

8:56 pm on Jul 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Correct me if I am wrong, but a robots.txt file will not give any sort of error to any bot.

It is simply a text file that robots can choose to ignore if they wish.

You are better off using .htaccess to ban that bad bot. Look it up on google - there are tons of examples.

If you have the robots.txt file set up right - it will not ban googlebot.
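For reference, a ban like the one described above might look something like this in .htaccess (a sketch assuming an Apache server with mod_setenvif; the "Openbot" User-Agent match is an assumption - check the exact string in your own access log):

```
# Flag requests whose User-Agent contains "Openbot" (case-insensitive)
# -- the token to match is an assumption; copy it from your access log
SetEnvIfNoCase User-Agent "Openbot" bad_bot

# Deny flagged requests with a 403, allow everyone else
<Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Limit>
```

This returns a 403-Forbidden only to requests matching the flagged User-Agent, so googlebot and other spiders are unaffected.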

jdMorgan

9:44 pm on Jul 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Openfind is a bad bot? It may not serve your customers if you don't sell to a Chinese clientele
(here or abroad), but calling it a "bad bot" and placing it in company with e-mail address
harvesters and site-suckers is a bit harsh.

Every webmaster must make the decision for him or herself as to what constitutes a bad bot. Heck,
I even found a legitimate user of larbin in my logs over the weekend! I e-mailed them and
warned them to clean up that "larbin@unspecified e-mail" bit because of what we do here on the
Search Engine Spider Identification forum... :)

If a robot ignores robots.txt AND serves no useful purpose to your customer base and/or the web
at large, then it's a "bad bot" in my book - and gets a "403 with prejudice". YMMV (a lot)

Jim

mrfeelingfine

10:09 pm on Jul 30, 2002 (gmt 0)

10+ Year Member



Howdy folks,

Thanks for the replies...

Just trying to figure what might constitute an "overload" of crawling from one bot in addition to another or multiples.

IOW, is it common to have a simultaneous deep crawl from numerous bots?

Could this create a problem with your server? Could the bots mess each other up as this is happening, or is the server just popping out the requested info like the chef with the Ginsu knife, and all is well?

Thanks.....

jdMorgan

10:21 pm on Jul 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mrfeelingfine,

Please take a minute or two and define your problem completely...

Are you getting a bunch of different spiders, like Googlebot, Openfind, and Lycos, all at once?
That is just random coincidence.

Or are you getting two or three "copies" of the same spider, but coming in from different
remote_hosts (slightly different IP address or domain names) at the exact same time? This
can happen, although most of the big SE spiders try to avoid it - mostly successfully, but
not always.

If your site hosts a lot of high-bandwidth content and a lot of simultaneous users, and those
users are experiencing difficulties (interruptions in streaming media, for example), then, yes,
your server is getting overloaded, and you may have to ban anything that hits you with simultaneous
requests except for customers. Otherwise, your server is likely set up to handle (at the very
least) 10 simultaneous requests as a totally-normal situation.

Please let us know your exact problem - the replies might then be more useful to you. :)

Jim

mrfeelingfine

11:21 pm on Jul 30, 2002 (gmt 0)

10+ Year Member



Hello all,

To clarify, and apologies for not doing so.

Today, and for the last 2 weeks I've been getting, on average, about 80/day *attempts* by Openbot/3.0 to crawl.

I checked, and realized that I have Openbot banned in .htaccess and in robots.txt.

So, Openbot is getting a 403 for everything they try to crawl. They seem to use just one IP.

**************

I have not seen googlebot in weeks, but I have gotten some crawling from scooter.

**************

My concern is that even with all these 403s, that these 80/day "visits" might prevent other robots from crawling with success.

My knowledge is limited, but my sense was that Openbot was basically a nuisance (for us, at least; our market is not Taiwan or China).

We have about 5000 pages of content so I don't imagine that Openbot is going to stop its "knocking" any time soon.

We don't have high traffic, but my real key goal right now is to get our pages indexed by google, foremost and above all else.

mrfeelingfine

11:27 pm on Jul 30, 2002 (gmt 0)

10+ Year Member



>>Openfind is a bad bot? It may not serve your customers if you don't sell to a Chinese clientele (here or abroad), but calling it a "bad bot" and placing it in company with e-mail address harvesters and site-suckers is a bit harsh.

And, if I may add, I'm intrigued by the comment:

>>Chinese clientele (here or abroad)

Is this suggesting that openfind.com has a strong USA user base? Anywhere on the net where I could understand *their* consumer market better?

I certainly have no intent of being harsh on them, just bewildered by various info I've read about them....

jdMorgan

8:30 pm on Jul 31, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mrfeelingfine,

OK, I see what your concern is. 80 hits per day is nothing to be concerned about. 80 hits
per second on a heavily-oversubscribed shared server might be...

Your reference to using robots.txt to return a 403-Forbidden server code also confused me, so
just in case:

Try Disallowing Openbot in robots.txt to tell them not to spider your site. Then they might not
come back as often and bother your server. (They may check in once in a while anyway, but should
grab robots.txt, find out they're still not welcome, and then leave.)

If they don't fetch and abide by robots.txt, just use .htaccess (on Apache Server) or a similar
approach to return a 403-Forbidden code to them and maybe they will eventually give up. Or, you
could try sending them a very simply-worded e-mail, asking that they stop spidering your site.

I haven't the foggiest idea where to find the demographics on the search engines used by foreign
nationals in the US or other countries. It would be interesting to know some hard numbers. However,
you can bet that the expatriate Taiwan Chinese will preferentially check with the "home boys" if a
decent search engine by and for Nationalist Chinese from Taiwan gets the bugs out, gets some
publicity and becomes widely-known in their communities.

At any rate, their 80 hits per day is not going to interfere with your success in getting spidered
by other search engines.

HTH,
Jim