GigaBlast Part 2

Forum Moderators: bakedjake

New search engine

Key_Master

9:46 pm on Mar 16, 2002 (gmt 0)

continued from: [webmasterworld.com...]

208.254.87.133 is the IP used by the GigaBot spider. I don't know if there are others. I never submitted my site to GigaBlast but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.

I must admit, I'm a little concerned about the robots.txt issue.
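A minimal sketch of the kind of filtering described above, assuming an Apache-style access log in which the client address is the first field. The function names are made up for illustration; only the IP comes from this thread.

```python
from collections import Counter

GIGABOT_IP = "208.254.87.133"  # IP reported in this thread; others may exist


def hits_from_ip(log_lines, ip=GIGABOT_IP):
    """Return the log lines whose first field (client address) equals ip."""
    return [line for line in log_lines if line.split(" ", 1)[0] == ip]


def requested_paths(lines):
    """Count requested paths, read from the quoted request field."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field on this line
        request = parts[1].split()  # e.g. ["GET", "/index.html", "HTTP/1.0"]
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts
```

Feeding the filtered lines into `requested_paths` shows at a glance whether a visit was a home-page check or a deeper crawl.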

ScottM

3:25 pm on Mar 17, 2002 (gmt 0)

Submitted 2 sites.

The first site has only the index page listed, the second site got every page spidered.

Very similar sites, and layout, different content.

The only reason I can think of is the meta tag <meta name="robots" content="ALL"> on the site with only one page spidered.

Perhaps the spider mistakes this as a 'disallow'?
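For reference, "ALL" is conventionally read as "index, follow", and the values are treated case-insensitively. The sketch below shows that conventional reading; it says nothing about how Gigabot actually parses the tag, and `parse_robots_meta` is a hypothetical name.

```python
def parse_robots_meta(content):
    """Return (may_index, may_follow) for a robots meta content value.

    Unknown or missing tokens leave the permissive defaults in place,
    so "ALL" and "index, follow" both yield (True, True).
    """
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    may_index, may_follow = True, True
    if "none" in tokens:  # "none" is shorthand for "noindex, nofollow"
        may_index = may_follow = False
    if "noindex" in tokens:
        may_index = False
    if "nofollow" in tokens:
        may_follow = False
    return may_index, may_follow
```

A parser that only recognised lowercase values, or that treated any unrecognised value as restrictive, could produce the one-page result described above.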

skirril

3:28 pm on Mar 17, 2002 (gmt 0)

Matt,

congratulations on that site/SE.

I submitted my site, and it got spidered/added almost instantly. I think it will make a great addition if a few problems are solved (mostly speed).

Here are the details:

- Add a page detailing the User Agent & behaviour of your robot, and put its address into the robot's UA string (like: gigabot/1.0; [gigablast.com...])

- Add a button for users to click if they feel the results are not relevant to their query, or to report "dead links".

I know this has some potential for abuse, but I think in the alpha/beta phase it can be helpful.

Other things (e.g. for advanced search): only return sites that validate against the DTD they specify.

In advanced search there needs to be some text explaining the options.

Otherwise, very good site, good luck with it :)

ideavirus

4:50 pm on Mar 17, 2002 (gmt 0)

Ahh...It's impressive...

Added two sites of mine...and instantaneous...got indexed...!..An interesting development here, I would say...!

Guess..now the "mirror" is re-named as " cached "...sounds more relevant...!

As mentioned before...the site/SE badly needs a More professional look and feel...yeah..I understand..you are tweaking the algo ATM...

Good luck from mee TOOOO...!

Cheers
:)

<added>
matt...How about adding the number of pages found for a particular query in the SERPs..sure must be on the TODO list..!
</added>

Robert Charlton

7:48 pm on Mar 17, 2002 (gmt 0)

Matt - On some queries, I'm seeing the same page returned 3 or 4 times....

skirril

8:52 pm on Mar 17, 2002 (gmt 0)

Oh yes.

208.254.87.133 - - [17/Mar/2002:15:51:02 +0100] "GET /index.html HTTP/1.0" 200 6321 "-" "-"

you are currently sending no user agent; please send one in the future

Skirril
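The last quoted field of a combined-format log line is the user agent, and "-" there means none was sent. A small sketch for spotting such requests, assuming the combined log format shown in the post above; the pattern and function name are illustrative only.

```python
import re

# Combined log format: host ident user [time] "request" status size "referer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def has_user_agent(line):
    """Return True if the log line records a non-empty User-Agent."""
    match = COMBINED.match(line)
    if match is None:
        raise ValueError("not a combined-format log line")
    return match.group("agent") not in ("", "-")
```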

MarkHutch

9:11 pm on Mar 17, 2002 (gmt 0)

Matt, I just submitted one of our URLs and your spider went right to work. Nice job...

greektomi

10:44 pm on Mar 17, 2002 (gmt 0)

This sure is a ton of fun!!!!

I'm glad to hear this is 'pre-beta'; that means gigablast has even more potential than I originally thought :)

Kudos to you Matt!!! It seems that for now this is a one man job and as such is ultra-impressive.

I'm sending positive energy to this site and giving you as much word as my mouth can handle.

BEST OF LUCK!!!!

Greektomi

Chris_D

12:09 am on Mar 18, 2002 (gmt 0)

Hey Matt,

This is great!! Not only does it automatically get spidered and indexed - but your spider starts following the links from the submitted sites immediately!

This is incredibly impressive stuff - well done. Our techos here are in absolute awe - they are having a field day at the moment!

Keep up the great work!!!

Bonus

2:21 am on Mar 18, 2002 (gmt 0)

Is Gigablast down now or is it just me?

bobriggs

2:40 am on Mar 18, 2002 (gmt 0)

Down from here. I did a few queries, some worked, some returned internal server error 500. Now it is down.

mattdwells

5:08 am on Mar 18, 2002 (gmt 0)

the qa i get here is fantastic!

if you're interested here's some changes i made today due mostly to suggestions i've received on this thread:
. filtered naughtiness from top 5 queries
. increased default # lines in result summaries from 1 to 3
. decreased # of chars in summary lines by about half
. fixed critical bug in my communications layer
. removed trailing slash from root urls in search results
. added User-Agent: gigabot/1.0 to spiders
. fixed one bug in my dns client, another is still there
. banned search boss' 14,000+ IPs

this message board has been awesome!
Thanks very very much for all the posts, everybody!

matt
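Two of the changes in the list above are easy to picture in code. This is only a sketch under stated assumptions, not Gigablast's code: `display_url` and `build_request` are hypothetical names, and the only detail taken from the thread is the "gigabot/1.0" User-Agent value.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def display_url(url):
    """Drop the trailing slash from a root URL; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.path == "/" and not parts.query and not parts.fragment:
        parts = parts._replace(path="")
    return urlunsplit(parts)


def build_request(url, user_agent="gigabot/1.0"):
    """Build (but do not send) a request that identifies the spider."""
    return Request(url, headers={"User-Agent": user_agent})
```

Only the bare root path is trimmed, so a URL such as a directory listing keeps its slash.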

mattdwells

5:11 am on Mar 18, 2002 (gmt 0)

oh i forgot one more change.
a lot of people from here were searching for their urls like: "www.mysite.com", but since the page itself didn't have the word "mysite" on it, no search results would be found. so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself.

matt
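One way to picture the host-field change described above, as a rough sketch only. The weight of 1.5, the tokenizer, and the function name are assumptions for illustration; the post says only that host terms get "a slight weight" over terms in the document.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

HOST_WEIGHT = 1.5  # hypothetical value; the actual weight is not given


def term_weights(url, body_text, host_weight=HOST_WEIGHT):
    """Return per-term weights for a page, counting host terms slightly higher."""
    weights = Counter()
    # Each occurrence of a term in the document body counts 1.0.
    for term in re.findall(r"[a-z0-9]+", body_text.lower()):
        weights[term] += 1.0
    # Terms from the URL's host field are added with the extra weight.
    host = urlsplit(url).hostname or ""
    for term in re.findall(r"[a-z0-9]+", host.lower()):
        weights[term] += host_weight
    return weights
```

With this, a query for "mysite" can match http://www.mysite.com/ even when the word never appears on the page itself.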

MarkHutch

5:23 am on Mar 18, 2002 (gmt 0)

Matt, nice job again. I noticed that the number of sites in the index went to only 21,000. Did you have to start over? I submitted my index pages again. I hope that was ok.

bobriggs

5:28 am on Mar 18, 2002 (gmt 0)

Yuk. I don't know what happened, but the results I'm seeing don't match anything (much). Loaded with spam. Unless you like casino sites... ;)

bigrockman

5:34 am on Mar 18, 2002 (gmt 0)

I am new here today from WA forums and I am very impressed with this forum. Looks like I stumbled upon a great search engine in the making. Blew me away! Added url and BAMM, indexed immediately. GO MAN GO

(edited by: bigrockman at 5:46 am (utc) on Mar. 18, 2002)

mattdwells

5:35 am on Mar 18, 2002 (gmt 0)

yeah, i had to reset. :( one of the machines crashed and i was debugging its redundant twin remotely when i lost my telnet connection. i'm guessing comcast has some kind of anti-nat software that did it. i'll have to be more careful in the future, but since this is pre-beta i'm not worried too much about it.

as for the casino spam, i'll have to compile a domain/ip list of the offenders.

matt

mattdwells

5:41 am on Mar 18, 2002 (gmt 0)

also, the search results won't be that great until about 3 days from now when the link-analysis system has had some time to do its magic. It needs a fairly large corpus of documents to work with.

matt

MarkHutch

5:42 am on Mar 18, 2002 (gmt 0)

I just checked my logs and your spider looks like it's working great. I just submitted my index pages and the spider is following the links just fine. I noticed it only got the files in my root directories. Does it spider sub-directories too or just the files in the root??

bigrockman

5:44 am on Mar 18, 2002 (gmt 0)

I am a newbie, but how do I add this search engine to my list of them? suggestions matt?

bobriggs

6:07 am on Mar 18, 2002 (gmt 0)

Obviously pre-alpha, as you mentioned, so I know I'm jumping the gun here a little.

Any plans to start your own crawl, (sans submissions) maybe start with ODP and expand from there? Where would you start?

Or for that matter, if anyone on this forum were starting as mattdwells, where would you start from?

mattdwells

7:14 am on Mar 18, 2002 (gmt 0)

i do do my own crawl in addition to handling the submissions. one of the sites i start with is yahoo.

matt

mattdwells

7:18 am on Mar 18, 2002 (gmt 0)

when you submit a url to the spider, it should spider that page and all of the links on it fairly quickly. the links on the linking pages will be spidered, too, but their priority is not as high. it may be a few days before they get spidered.

In the future, once my index is bigger and there are not so many urls competing for spider time, they should be added quickly as well. if you really need them in for now, you'll have to submit them one at a time via the addUrl page.

matt
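The priority scheme described above (the submitted page and its links first, links found on those pages later) can be sketched as a two-level priority queue. This is an illustration under assumptions, not the spider's actual design: the `Frontier` class, the depth cutoff, and the two priority levels are all hypothetical.

```python
import heapq
import itertools


class Frontier:
    """Crawl queue that serves shallow URLs before deeper ones."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: first in, first out

    def add(self, url, depth):
        """Queue url once. depth 0 = submitted page, 1 = its links, 2+ = deeper."""
        if url in self._seen:
            return
        self._seen.add(url)
        priority = 0 if depth <= 1 else 1  # lower number is served first
        heapq.heappush(self._heap, (priority, next(self._order), url, depth))

    def pop(self):
        """Return the next (url, depth) to fetch."""
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)
```

Submitting a deeper page through the addUrl page corresponds to calling `add(url, 0)`, which moves it into the high-priority group.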

MarkHutch

7:45 am on Mar 18, 2002 (gmt 0)

Thank you for the info, Matt. It looks like your spider can handle about 300 pages every 5 to 10 seconds. (Just looking at the index page advance) That's a very fast rate, indeed.

starec

7:49 am on Mar 18, 2002 (gmt 0)

"so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself"

Psssst! Don't tell us! Otherwise it would be no fun.

Another thing: manual spam handling is not really scalable, is it? Good, aggressive auto filters are an essential part of any good public SE.

heini_dutch

9:21 am on Mar 18, 2002 (gmt 0)

Hi Matt, well done (already). I always love it when someone dares to compete with the Majors.

Question: What about the robots.txt file? It seems your robot isn't requesting robots.txt (according to my log files).

Keep up the good work!
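For anyone wondering what honouring robots.txt involves, here is a short sketch using Python's standard `urllib.robotparser`. The sample rules and the `allowed` helper are made up for illustration, and nothing here reflects how Gigabot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep gigabot out of /private/, allow everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: gigabot
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A well-behaved spider requests /robots.txt from each host first, which is why the absence of that request in the logs stands out.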

French Connexion

10:04 am on Mar 18, 2002 (gmt 0)

Hi Matt,

I like the job.
I just added my site and it was there on the fly.
Now I tried some of my KW. I get listed 4 or 5 times on the first page. That is also great; I am happy as an SEO.
But as a user I would rather see only one link per site (same url): the most relevant one, with an option to view the other pages from the site. Do you have any plans along these lines?
I don't know if anyone shares this view?
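The "one link per site" idea above is often called host clustering. A minimal sketch, assuming the results arrive as a list of URLs already ordered by relevance; `cluster_by_host` is a hypothetical name and grouping by hostname is just one possible definition of "site".

```python
from urllib.parse import urlsplit


def cluster_by_host(ranked_urls):
    """Group ranked URLs by host.

    Returns a list of (best_url, other_urls) pairs, one per host, in the
    order each host first appears in the ranking.
    """
    clusters = {}
    for url in ranked_urls:
        host = (urlsplit(url).hostname or "").lower()
        clusters.setdefault(host, []).append(url)
    return [(urls[0], urls[1:]) for urls in clusters.values()]
```

The first element of each pair is what the results page would show; the rest would sit behind a "more from this site" link.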

Duke_of_Url

10:12 am on Mar 18, 2002 (gmt 0)

Matt

Great work, please keep it up!! I added 5 sites, went straight to the search and bingo! they were there.

TallTroll

10:19 am on Mar 18, 2002 (gmt 0)

Keewwl! I just added a site, with a real time watch open in another browser, and saw Gigabot/1.0 come through and scoop the top few pages (framed site, intrasite linking not yet up to scratch)

Most impressed thus far

engine

10:24 am on Mar 18, 2002 (gmt 0)

Hi Matt,

Good work.

Do my eyes deceive me: Has the "full pages indexed" reduced since last night?

Continued: [webmasterworld.com...]

This 59 message thread spans 2 pages.
