GigaBlast Part 2

New search engine

     
9:46 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


continued from: [webmasterworld.com...]


208.254.87.133 is the IP used by the GigaBot spider. I don't know if there are others. I never submitted my site to GigaBlast, but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently at 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.

I must admit, I'm a little concerned about the robots.txt issue.
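
For anyone who wants to filter their own logs for this IP as described above, here is a minimal sketch. It assumes an Apache-style access log where the client address is the first whitespace-separated field; the function name `hits_from` and the sample lines are made up for illustration.

```python
SPIDER_IP = "208.254.87.133"  # the Gigabot address mentioned in this post

def hits_from(lines, ip=SPIDER_IP):
    """Return the log lines whose first field (client address) equals ip."""
    matches = []
    for line in lines:
        fields = line.split(None, 1)
        # Compare the whole first field so "1208.254.87.133" does not match.
        if fields and fields[0] == ip:
            matches.append(line.rstrip("\n"))
    return matches

# Hypothetical sample lines, invented for this example.
sample = [
    '208.254.87.133 - - [16/Mar/2002:04:22:14 +0000] "GET / HTTP/1.0" 200 6321 "-" "-"',
    '10.0.0.7 - - [16/Mar/2002:04:23:01 +0000] "GET /about.html HTTP/1.0" 200 1200 "-" "Mozilla/4.0"',
]
for hit in hits_from(sample):
    print(hit)
```

In practice you would pass an open log file instead of the `sample` list, since a file object yields lines the same way.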

3:25 pm on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 1, 2002
posts:1017
votes: 0


Submitted 2 sites.

The first site has only the index page listed, the second site got every page spidered.

Very similar sites, and layout, different content.

The only reason I can think of is a meta tag <meta name="robots" content="ALL">? (On the site with only one page spidered)

Perhaps the spider mistakes this as a 'disallow'?
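
For reference, under the usual robots meta convention "ALL" means "index, follow" and "NONE" means "noindex, nofollow", so a spider should not treat it as a restriction. A minimal sketch of that interpretation follows; the function name `parse_robots_meta` is hypothetical, and this is not a claim about how Gigabot actually parses the tag.

```python
def parse_robots_meta(content):
    """Map a robots meta content string to (may_index, may_follow)."""
    may_index, may_follow = True, True  # default when nothing is restricted
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            may_index, may_follow = True, True
        elif token == "none":
            may_index, may_follow = False, False
        elif token == "noindex":
            may_index = False
        elif token == "nofollow":
            may_follow = False
        elif token == "index":
            may_index = True
        elif token == "follow":
            may_follow = True
        # Unknown tokens are ignored rather than treated as restrictions.
    return may_index, may_follow

print(parse_robots_meta("ALL"))
print(parse_robots_meta("noindex, follow"))
```

A parser that only looks for "index" and "follow" and falls back to "disallow" on anything else would mishandle "ALL", which would match the behaviour described here.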

3:28 pm on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 19, 2000
posts:193
votes: 0


Matt,

congratulations on that site/SE.

I submitted my site, and it got spidered/added almost instantly. I think it will make a great addition if a few problems are solved (mostly speed).

Here are the details:

- Add a page detailing the User-Agent and behaviour of your robot, and put its URL into the robot's UA string (like: gigabot/1.0; [gigablast.com...])

- Add a button for users to click if they feel the results are not relevant to their query, or to report "dead links".

I know this has some potential for abuse, but I think in the alpha/beta phase it can be helpful.

Other things (e.g. for advanced search): only return sites that validate against the DTD they specify.

In advanced search there needs to be some text explaining the options.

Otherwise, very good site, good luck with it :)

4:50 pm on Mar 17, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:June 16, 2001
posts:386
votes: 0


Ahh...It's impressive...

Added two sites of mine...and instantaneous...got indexed...!..An interesting development here, I would say...!

Guess..now the "mirror" is re-named as " cached "...sounds more relevant...!

As mentioned before...the site/SE badly needs a More professional look and feel...yeah..I understand..you are tweaking the algo ATM...

Good luck from mee TOOOO...!

Cheers
:)

<added>
matt...How about adding the number of pages found for a particular query in the SERPs..sure must be on the TODO list..!
</added>

7:48 pm on Mar 17, 2002 (gmt 0)

Moderator from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11309
votes: 163


Matt - On some queries, I'm seeing the same page returned 3 or 4 times....
8:52 pm on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 19, 2000
posts:193
votes: 0


Oh yes.

208.254.87.133 - - [17/Mar/2002:15:51:02 +0100] "GET /index.html HTTP/1.0" 200 6321 "-" "-"

You are currently sending no user agent; please send one in the future.
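
If you want to spot such requests in your own logs, here is a minimal sketch. It assumes the Apache Combined Log Format, where the last two quoted fields are the referer and the user agent, as in the line above; the names `LINE_RE` and `missing_user_agent` are made up for this example.

```python
import re

# Combined Log Format: the last two quoted fields are referer and user agent.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def missing_user_agent(line):
    """True if the request had no user agent, False if it had one,
    None if the line is not in the expected format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("agent") in ("", "-")

line = ('208.254.87.133 - - [17/Mar/2002:15:51:02 +0100] '
        '"GET /index.html HTTP/1.0" 200 6321 "-" "-"')
print(missing_user_agent(line))
```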

Skirril

9:11 pm on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:742
votes: 0


Matt, I just submitted one of our URLs and your spider went right to work. Nice job...

greektomi

10:44 pm on Mar 17, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


This sure is a ton of fun!!!!

I'm glad to hear this is 'pre-beta' that means gigablast has even more potential than I originally thought :)

Kudos to you Matt!!! It seems that for now this is a one man job and as such is ultra-impressive.

I'm sending positive energy to this site and giving you as much word as my mouth can handle.

BEST OF LUCK!!!!

Greektomi

12:09 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 25, 2001
posts:660
votes: 0


Hey Matt,

This is great!! Not only does it automatically get spidered and indexed - but your spider starts following the links from the submitted sites immediately!

This is incredibly impressive stuff - well done. Our techos here are in absolute awe - they are having a field day at the moment!

Keep up the great work!!!

Bonus

2:21 am on Mar 18, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


Is Gigablast down now or is it just me?
2:40 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


Down from here. I did a few queries, some worked, some returned internal server error 500. Now it is down.
5:08 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


the qa i get here is fantastic!

if you're interested, here are some changes i made today, mostly due to suggestions i've received on this thread:
. filtered naughtiness from top 5 queries
. increased default # lines in result summaries from 1 to 3
. decreased # of chars in summary lines by about half
. fixed critical bug in my communications layer
. removed trailing slash from root urls in search results
. added User-Agent: gigabot/1.0 to spiders
. fixed one bug in my dns client, another is still there
. banned search boss' 14,000+ IPs

this message board has been awesome!
Thanks very very much for all the posts, everybody!

matt

5:11 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


oh i forgot one more change.
a lot of people from here were searching for their urls like: "www.mysite.com", but since the page itself didn't have the word "mysite" on it, no search results would be found. so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself.

matt
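
A rough sketch of the host-field idea described in the post above, for readers who want to see it concretely. Everything here is an assumption made for illustration: the weights (1.0 per body occurrence, 1.1 for a host term), the tokenizer, and the name `term_weights` are not taken from Gigablast.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

BODY_WEIGHT = 1.0
HOST_WEIGHT = 1.1  # assumed value: "slightly" more than one body occurrence

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_weights(url, body_text):
    """Score terms from the page body plus terms from the URL's host."""
    weights = Counter()
    for term in tokenize(body_text):
        weights[term] += BODY_WEIGHT
    host = urlsplit(url).hostname or ""
    for term in tokenize(host):
        weights[term] += HOST_WEIGHT
    return weights

w = term_weights("http://www.mysite.com/", "Welcome to my home page")
print(w["mysite"], w["welcome"])
```

With this, a query for "mysite" finds the page even though the word never appears in the body text.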

5:23 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:742
votes: 0


Matt, nice job again. I noticed that the number of sites in the index went to only 21,000. Did you have to start over? I submitted my index pages again. I hope that was ok.
5:28 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


Yuk. I don't know what happened, but the results I'm seeing don't match anything (much). Loaded with spam. Unless you like casino sites... ;)

bigrockman

5:34 am on Mar 18, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


I am new here today from WA forums and I am very impressed with this forum. Looks like I stumbled upon a great search engine in the making. Blew me away! Added url and BAMM, indexed immediately. GO MAN GO

(edited by: bigrockman at 5:46 am (utc) on Mar. 18, 2002)

5:35 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


yeah, i had to reset. :( one of the machines crashed and i was debugging its redundant twin remotely when i lost my telnet connection. i'm guessing comcast has some kind of anti-nat software that did it. i'll have to be more careful in the future, but since this is pre-beta i'm not worried too much about it.

as for the casino spam, i'll have to compile a domain/ip list of the offenders.

matt

5:41 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


also, the search results won't be that great until about 3 days from now when the link-analysis system has had some time to do its magic. It needs a fairly large corpus of documents to work with.

matt

5:42 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:742
votes: 0


I just checked my logs and your spider looks like it's working great. I just submitted my index pages and the spider is following the links just fine. I noticed it only got the files in my root directories. Does it spider sub-directories too or just the files in the root??

bigrockman

5:44 am on Mar 18, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


I am a newbie, but how do I add this search engine to my list of them? suggestions matt?
6:07 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


Obviously pre-alpha, as you mentioned, so I know I'm jumping the gun here a little.

Any plans to start your own crawl, (sans submissions) maybe start with ODP and expand from there? Where would you start?

Or for that matter, if anyone on this forum were starting as mattdwells, where would you start from?

7:14 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


i do do my own crawl in addition to handling the submissions. one of the sites i start with is yahoo.

matt

7:18 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


when you submit a url to the spider, it should spider that page and all of the links on it fairly quickly. the links found on those linked pages will be spidered, too, but their priority is not as high. it may be a few days before they get spidered.

In the future, once my index is bigger and there are not so many urls competing for spider time, they should be added quickly as well. if you really need them in for now, you'll have to submit them one at a time via the addUrl page.

matt
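
The scheduling described in the post above can be pictured as a priority queue keyed on link depth. This is only a sketch under assumptions: two priority levels, depth counted from the submitted page, and the class name `Frontier` are all invented here and say nothing about the real spider's internals.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # smaller number is served first

class Frontier:
    """URL queue: the submitted page and its direct links go first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: FIFO within a priority

    def add(self, url, depth):
        """depth 0 = submitted page, 1 = linked from it, 2+ = further out."""
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        priority = HIGH if depth <= 1 else LOW
        heapq.heappush(self._heap, (priority, next(self._order), url, depth))

    def pop(self):
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)

frontier = Frontier()
frontier.add("http://example.com/deep.html", 2)
frontier.add("http://example.com/", 0)
frontier.add("http://example.com/a.html", 1)
while frontier:
    print(frontier.pop())
```

Even though the depth-2 page was queued first, it comes out last, which mirrors the "may be a few days" wait for deeper links.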

7:45 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:742
votes: 0


Thank you for the info, Matt. It looks like your spider can handle about 300 pages every 5 to 10 seconds (just looking at how the index page count advances), or roughly 30 to 60 pages per second. That's a very fast rate, indeed.
7:49 am on Mar 18, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 17, 2001
posts:409
votes: 0


so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself

Psssst! Don't tell us! Otherwise it would be no fun.

Another thing: manual spam handling is not really scalable, is it? Good and aggressive auto filters are an essential part of any good public SE.

heini_dutch

9:21 am on Mar 18, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


Hi Matt, well done (already). I always love it when someone dares to compete with the Majors.

Question: What about reading the robots.txt file? It seems your robot isn't requesting robots.txt (according to my log files).

Keep up the good work!
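
For illustration, this is roughly the check a well-behaved robot makes before fetching a page, sketched with Python's standard `urllib.robotparser`. The robots.txt content and the `allowed` helper are made up for this example, and the rules are parsed from a string so that nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt, invented for this example.
robots_txt = """\
User-agent: gigabot
Disallow: /private/

User-agent: *
Disallow:
"""

print(allowed(robots_txt, "gigabot/1.0", "http://example.com/private/page.html"))
print(allowed(robots_txt, "gigabot/1.0", "http://example.com/index.html"))
```

A robot that skips this step shows up in the logs exactly as described: page requests but no request for /robots.txt.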

French Connexion

10:04 am on Mar 18, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


Hi Matt,

I like the job.
I just added my site and it was there on the fly.
Now I tried some of my keywords and I get listed 4 or 5 times on the first page. That is also great; I am happy as an SEO.
But as a user I would rather see only one link per site (same URL): the most relevant one, with an option to view the other pages from that site. Do you have any plans along these lines?
I don't know if anyone else shares this view.
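
To make the suggestion concrete, here is a minimal sketch of collapsing results to one entry per site. It assumes results arrive as (url, score) pairs and that "same site" means the same host name; the function name `cluster_by_site` and the sample data are invented for illustration.

```python
from urllib.parse import urlsplit

def cluster_by_site(results):
    """Group (url, score) results by host, best-scoring page first.

    Returns one dict per host: the top URL, its score, and the
    remaining URLs from that host for a "more from this site" link.
    """
    clusters = {}
    for url, score in sorted(results, key=lambda r: r[1], reverse=True):
        host = urlsplit(url).hostname
        if host not in clusters:
            clusters[host] = {"top": url, "score": score, "more": []}
        else:
            clusters[host]["more"].append(url)
    return list(clusters.values())

# Hypothetical results, made up for this example.
results = [
    ("http://a.example/1", 0.5),
    ("http://b.example/", 0.8),
    ("http://a.example/2", 0.9),
]
for entry in cluster_by_site(results):
    print(entry["top"], "+", len(entry["more"]), "more")
```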

10:12 am on Mar 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 7, 2002
posts:150
votes: 0



Matt

Great work, please keep it up!! I added 5 sites, went straight to the search and bingo! they were there.

R

10:19 am on Mar 18, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 6, 2001
posts:880
votes: 0


Keewwl! I just added a site, with a real time watch open in another browser, and saw Gigabot/1.0 come through and scoop the top few pages (framed site, intrasite linking not yet up to scratch)

Most impressed thus far

10:24 am on Mar 18, 2002 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:22282
votes: 236


Hi Matt,

Good work.

Do my eyes deceive me, or has the "full pages indexed" count dropped since last night?



Continued: [webmasterworld.com...]