Is this spider useful?

         

joker197cinque

8:08 am on Oct 4, 2005 (gmt 0)

10+ Year Member



MJ12bot/v1.0.2

Do you ban it from your pages?

Regards.

jatar_k

3:14 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



They have a bot page. Did you check it out?

Lord Majestic

3:17 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Older forum threads on this subject:

[webmasterworld.com...]
[webmasterworld.com...]

Disclaimer: I am the bot's creator.

wilderness

6:53 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you ban it from your pages?

What I or any other webmaster does should NOT have any influence or correlation on what you do!

Each webmaster must determine (on their own) what is beneficial or detrimental to their own website(s).

ncw164x

7:10 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am the bot's creator

Could you explain where the data is used on the sites that you spider? Your bot visits my servers often and takes thousands of pages.

Lord Majestic

7:26 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The data is used to build a public WWW search engine. Lots of details, including a link to the alpha version of the search engine (under heavy development, with regular updates posted), are on the web site with the bot's page. Not everything is indexed yet, but it will be within months.

Using gzip compression on your webserver is a good way to reduce the traffic used by the bot. If you are concerned about the frequency of requests, then using Crawl-Delay in robots.txt will help -- the bot will honor it even if it's specified for other, more famous bots than ours.
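For anyone who wants to see how a Crawl-Delay rule gets read, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the list of other bot names, and the effective_crawl_delay helper are all made up for illustration; the fallback to other bots' entries only mimics the behaviour described above and is not MJ12bot's actual code. Note that Crawl-delay is a non-standard directive, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the only Crawl-delay is set for a different crawler.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5

User-agent: *
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def effective_crawl_delay(parser, own_agent, other_agents):
    """Return the delay for own_agent, else the first delay found for another agent.

    Assumption: this fallback order is illustrative only.
    """
    delay = parser.crawl_delay(own_agent)
    if delay is not None:
        return delay
    for agent in other_agents:
        delay = parser.crawl_delay(agent)
        if delay is not None:
            return delay
    return None


parser = load_rules(ROBOTS_TXT)
delay = effective_crawl_delay(
    parser, "MJ12bot/v1.0.2", ["Googlebot", "Slurp", "msnbot"]
)
print(delay)
```

The standard parser alone finds no delay for MJ12bot here, because it only matches the bot's own entry or the `*` entry; the helper is what picks up the value set for Slurp. To set a delay for MJ12bot directly, put the Crawl-delay line under its own User-agent entry.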

ncw164x

7:32 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, I am not concerned at all if the pages spidered are being put to good use even if it is in the near future.

Lord Majestic

7:36 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



the pages spidered are being put to good use even if it is in the near future

The best use :)

The index right now is 44 million pages, mainly crawled Nov '04 to Feb '05, but I expect to have 10-20 times more indexed by the end of the year, and to have the rest done in Q1 2006.

I will do my best to make sure that by this time next year even wilderness thinks it is good to be crawled by MJ12bot ;)

wilderness

8:02 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



where the data is used on the sites that you spider

Or where it will go in the event of your demise?

Perhaps?

"Jack"

[google.com...]

Lord Majestic

8:29 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or where it will go in the event of your demise?

It will probably go to the Silicon Heaven (tm).

The data per se has zero value (from a warehousing point of view) -- it's what can be done with the data (i.e. indexing it and making it searchable with fast and relevant results) that matters.

surfin2u

8:53 pm on Oct 7, 2005 (gmt 0)

10+ Year Member



Or where it will go in the event of your demise?

It will probably go to the Silicon Heaven (tm).

That's funny! I hope that I go there too. ;-)

bcolflesh

8:56 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's funny! I hope that I go there too.

If you're bad, you go to Silicone Heaven - which is so bad it's good.

wilderness

8:56 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's funny! I hope that I go there too. ;)

Perhaps you and majestic can arrive on the same boat ;)

ncw164x

9:03 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wilderness, I take it you are not too keen on Lord Majestic's site ripper, err, I mean spider ;)

wilderness

9:37 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I take it you are not too keen

I'm not too keen on many things when it comes to bots visiting my sites.

Some of them we must cough, gag and choke on, and hope they act prudently.

Ex1: I've been getting some sparse activity from AOL Transit. Transit is the web browser that AOL folks use when they are not at home. This is unlike the other AOL visits with blank UA and Referer fields or the limited UA of Mozilla Ver. This thing uses a standard UA.

Others we may deny outright.
For "ME":
RIPE, and major portions of APNIC and LACNIC.
Even many North American providers who fail to cooperate in enforcing their own UAG's.

Others we may deny and hope it doesn't affect our standing.

Ex: I've seen others mention the Google from 202. It had left my sites alone until recently, and now they are eating 403's.
As a result, Google has been testing whether I'm spoofing pages.
(One such test came from a Savvis/Layered Technologies range with two different UA's, Media Partners and a standard browser, with neither reading robots.txt and both spidering anyway.)

As previously stated:
Each webmaster must decide what is beneficial and detrimental to their respective websites.

"Jack" is detrimental to my websites ;)
Hopefully somebody will come along and sink the boat that he is hoping will take him to Silicon Heaven ;)
However, I still wish him the best of success and a healthy appetite for 403's ;)

Lord Majestic

10:01 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is "Jack" referring to MJ12bot?!?!

wilderness

10:12 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is "Jack" referring to MJ12bot?!?!

Jack is an example of another bot provided in a previous post.
Did you view the link I provided?

Lord Majestic

11:09 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My bad, sorry. Too much work today and still more to do :(

surfin2u

12:52 am on Oct 9, 2005 (gmt 0)

10+ Year Member



I must be in silicone heaven because the MJ12bot has decided to visit my site today!

Lord Majestic

1:24 am on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is now a good chance of that happening to many people -- yesterday we crawled 20 million URLs, which is a pretty high number for a small project. :)