
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
How to Stop BecomeBot?
shilmy




msg:1528771
 3:29 am on Jun 23, 2005 (gmt 0)

Hi,

Recently, BecomeBot has been very actively crawling my sites, eating a lot of bandwidth.

I've created a robots.txt with content like this:

# Disallow BecomeBot
User-agent: BecomeBot
Disallow: /

But the robot seems to ignore it.

Is there a way to stop BecomeBot completely from crawling my sites?

Thanks.

Regards,
Sjarief

 

Clint




msg:1528772
 9:41 am on Jun 23, 2005 (gmt 0)

If you are sure that's its correct name and it's not working, you'll have to find the IP(s) of the bot, and if you use cPanel, put them in the "IP Deny" area. Otherwise, contact them and ask them about it.

I just checked my access logs and I see I have the Become bot visiting, but only one entry. (Its IP is 64.124.85.78 if you want to block it in cPanel.) The entry references this page [become.com...] (Authoritative URL). According to that page, you're using the correct robots.txt lines, so it looks like you're going to have to ask them, or else IP Deny 64.124.85. since they use that ENTIRE IP range!
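If you're not on cPanel, the same range block can be done at the server level. This is just a sketch, assuming Apache 2.2-style .htaccess syntax; the 64.124.85 range is the one mentioned above:

# Deny BecomeBot's published crawl range (64.124.85.*)
# Apache 2.2 syntax; on Apache 2.4 use "Require not ip 64.124.85" instead
Order Allow,Deny
Allow from all
Deny from 64.124.85

Unlike robots.txt, this doesn't depend on the bot choosing to cooperate.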

Why do you want to block their bot? Have you heard something bad about them? I just went to become.com and I can't find my site in it, which is strange since it's visiting my site. So I'm wondering about submitting to them, but I don't even see where you can submit a site. Ahh, I see now they say this: "At this time, we do not accept submissions from webmasters." I did some searches to test it, and in my case I got nothing but non-relevant results. Looks pretty bad.

Clint




msg:1528773
 9:51 am on Jun 23, 2005 (gmt 0)

Interesting. The Become bot only visited ONE of my pages. It first hit my robots.txt file, then only ONE webpage. I searched their SE for the product on that page, and I was 1st on the 1st page. I searched for other specific products I carry, and I'm also on the first page for them. Strange that it chose one of my FORWARDED domains that point to my MAIN domain as the 1st spot in one search! FAIK, coincidentally, that could be the very first time their bot has ever visited. I don't know why it chose this one page and no others to visit! It wasn't even an entry page.

Their SE doesn't seem to do very well with the generic searches I first tried, like "[blue widget] sales" or dealers; all I saw were non-relevant results. But when searching for a specific product, it does better. I like the fact that they have a rating system where you can rate each search, and I like their "auto complete" of searches. I've never seen that before with any SE.

shilmy




msg:1528774
 9:59 am on Jun 23, 2005 (gmt 0)

Thanks. If robots.txt cannot stop them, I will use cPanel to stop it.

The only reason I want to stop it is that it crawls my sites heavily and deeply, so it eats a lot of bandwidth, and in my logs I haven't seen any visitors referred by them. So I think it's just wasted bandwidth.

Regards,
Sjarief

Clint




msg:1528775
 12:00 pm on Jun 23, 2005 (gmt 0)

You know, if I were you, and this is just me, I'd be GLAD the bot is crawling your site. They are brand new from what I saw (still in Beta), and with their new kind of "indexing intelligence" and the features I mentioned, they could be a big hit. It's just going to take some time after they are out of Beta for their name to get around and the referrals to come. Something to think about. Looks like they came out in March.

With Google "freaking out" every few months, we need ALL the SE's we can find.

(From March) "Last week, Become.com sent out a press release that talked about its patent-pending ranking algorithm, dubbed AIR. AIR stands for Affinity Index Ranking and, based on claims from Become, is the next-generation search engine ranking algorithm."

Go to G and enter "become search engine" in quotes, and look at the hit on marketingshift.com; it's the 4th hit in my area.

shilmy




msg:1528776
 3:41 pm on Jun 23, 2005 (gmt 0)

Maybe you are right; we need more SEs so we don't depend only on The Big G.

The problem with BecomeBot on my site is that it eats about 100MB of bandwidth in a snap. It crawls heavy and deep.

You know, my sites use a script to showcase products from Amazon. If it is not stopped, can you imagine a bot crawling the whole Amazon store at once?

I'm a newbie with robots.txt; is there a way to limit the bot so it does not crawl the whole Amazon store at once?

Clint




msg:1528777
 4:51 pm on Jun 23, 2005 (gmt 0)

You may want to start a new topic for that. I don't know if you can set time limits that small in the robots.txt file, but I think you can slow the crawl rate, or just block it from accessing the Amazon pages. It's not like they need the help. A bot accessing another site isn't going to affect your bandwidth, since it's not on your server, unless the Amazon pages ARE ON your server, as in MyDomain.com/amazon/whatever.
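If those product pages do live under a path on your own server, you can keep the bot out of just that section rather than the whole site. A sketch, assuming a hypothetical /amazon/ directory (substitute whatever path your script actually uses):

User-agent: BecomeBot
Disallow: /amazon/

That only helps if the bot honors robots.txt, of course, which is the original complaint in this thread.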

Span




msg:1528778
 5:39 pm on Jun 23, 2005 (gmt 0)

You can control the rate at which your site is crawled by using the Crawl-Delay feature, which lets you specify the number of seconds between requests to your site. Note that it may take quite a long time to crawl a site if there are many pages and the Crawl-Delay is set high. You could specify an interval of 30 seconds between requests with an entry like this:

User-agent: BecomeBot
Crawl-Delay: 30
Disallow: /cgi-bin


vabtz




msg:1528779
 5:48 pm on Jun 23, 2005 (gmt 0)

are you kidding?

the engine is spam infested.

Reid




msg:1528780
 5:50 pm on Jun 23, 2005 (gmt 0)

If the bot is ignoring robots.txt, then you should contact the bot's developer or ban it, period.
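A server-level ban doesn't rely on the bot's cooperation at all. A sketch of a user-agent block, assuming Apache with mod_rewrite enabled in .htaccess (the "BecomeBot" string is the agent name from the posts above):

RewriteEngine On
# Return 403 Forbidden to anything identifying itself as BecomeBot
RewriteCond %{HTTP_USER_AGENT} BecomeBot [NC]
RewriteRule .* - [F,L]

Keep in mind a user-agent string can be faked, which is why some people fall back on IP-range denies instead.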

vabtz




msg:1528781
 6:41 pm on Jun 23, 2005 (gmt 0)

you could log in as root and try this command:
/etc/init.d/httpd shutdown

Big_Gig




msg:1528782
 3:09 pm on Jun 29, 2005 (gmt 0)

Hey guys,

I recently had this problem too. They spidered thousands of my pages in one day - totally took down the dedicated server!

I found this piece of code on their site which has worked well for me. It tells the bot to wait at least 30 seconds in between page requests:

User-agent: BecomeBot
Crawl-Delay: 30
Disallow: /cgi-bin

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved