Forum Library, Charter, Moderators: bakedjake

Alternative Search Engines Forum

This 65 message thread spans 3 pages.
GigaBlast Part 3

 11:17 am on Mar 18, 2002 (gmt 0)

Continued from: [webmasterworld.com...]

Looks impressive so far indeed. I'm really curious about any increase/decrease in relevance, once there's a significant number of sites indexed.

A few things to note, most of which you probably know already:

  • Always respect robots.txt for all pages.

  • The spider needs to do some load balancing, so that it doesn't fetch too many pages from the same site in a short time. The recommended rate is about one page per minute per site (http://www.robotstxt.org/wc/robots.html).

  • Make sure that the images on your site are served with headers for creation date, size, and expiry date, so that the client can cache them. This will noticeably reduce the bandwidth requirements on your own system.

  • Only list one of www.example.com/ and www.example.com/index.html (or index.htm, default.htm, .asp, .php, etc.), at least if they contain the same text.

  • Cluster the results, so that one site can't dominate the SERPs for any keyword combination.

  • I'm sure there's a lot more work waiting for you... ;)
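The crawler-politeness points above (respect robots.txt, roughly one page per minute per site) can be sketched in a few lines. This is an illustrative sketch, not GigaBlast's code; the agent name "Gigabot" and the 60-second delay are assumptions.

```python
# Minimal sketch of a "polite" fetch policy: honor robots.txt and
# rate-limit to roughly one page per minute per site.
# Illustrative only -- not GigaBlast's actual implementation.
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "Gigabot"      # assumed agent name, for illustration
PER_HOST_DELAY = 60.0       # ~one page per minute per site

robots_cache = {}           # host -> parsed RobotFileParser
last_fetch = {}             # host -> timestamp of last request

def allowed(url):
    """Check robots.txt for the URL's host, caching the parsed rules."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass            # unreachable robots.txt: parser treats as allow-all
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(USER_AGENT, url)

def polite_delay(url):
    """Sleep until at least PER_HOST_DELAY since this host's last fetch."""
    host = urlparse(url).netloc
    wait = last_fetch.get(host, 0) + PER_HOST_DELAY - time.time()
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.time()
```

A fetch loop would then call `allowed(url)` first and `polite_delay(url)` before each request.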

    Thierry Zoller

     11:42 am on Mar 18, 2002 (gmt 0)

    Matt, you could reduce bandwidth usage heavily if you'd optimize that HTML. You already removed the blanks - that's good - but how about using CSS to reduce the size of the result pages? IMHO you could save up to 30-40%, since you use the same tags on and on :)


     11:43 am on Mar 18, 2002 (gmt 0)

    Good work Matt, and good luck.

    At this time of day: submitted, spidered, and listed within 3 minutes ;)

    On the robots.txt issue - I agree - a good bot should obey the rules (more work). The downside with robots.txt is that it's down to the spider to enforce the rules - it would be nice to have an Apache module that sent 403s based on the robots.txt rules.
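The server-side enforcement idea here - a module that returns 403 for requests robots.txt disallows - can be sketched as a WSGI middleware. This is a hypothetical stand-in for such an Apache module, not a real one:

```python
# Sketch: enforce robots.txt on the server side by returning 403 for
# disallowed bot requests. Hypothetical illustration, not a real module.
import urllib.robotparser

def make_robots_enforcer(app, robots_lines):
    """Wrap a WSGI app so requests disallowed by robots.txt get a 403."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "*")
        path = environ.get("PATH_INFO", "/")
        if not rp.can_fetch(agent, path):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden by robots.txt\n"]
        return app(environ, start_response)

    return middleware
```

In practice the rules would be re-read from the site's own robots.txt file rather than passed in as lines.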

    Me, I don't bother with robots.txt - if there is a bad bot, mod_rewrite sorts it.

    Also of interest to all on this topic is [webmasterworld.com...] - an interesting argument on the price of a search engine, in particular Google. With Matt here demonstrating what one determined and skilled individual can do, maybe that lowers the price further?


     5:09 pm on Mar 18, 2002 (gmt 0)

    The sites I added don't seem to be there any more... looks like I've got to add them again.

    And I like the search by IP... :)


     6:28 pm on Mar 18, 2002 (gmt 0)

    Most of our sites that we submitted still have not been hit by the spider, and of course not added. There must be some kind of DNS problem still occurring. And yes, he had to reset the index last night, so many sites that were in would have to be resubmitted, I believe.

    Sure got our minds off Google for a few days huh? :)


     7:09 pm on Mar 18, 2002 (gmt 0)

    Matt is banning sites by IP. It's reasonable to assume that some innocent sites on server farms using a shared IP would be banned because of the actions of a few. Just another benefit of having a dedicated IP.


     7:41 pm on Mar 18, 2002 (gmt 0)

    Could you explain what a dedicated IP is?
    What about sites in the same C class but not linking? What is a server farm?


     8:00 pm on Mar 18, 2002 (gmt 0)

    Matt, I wonder if there is a way to turn off the document caching.


     10:00 pm on Mar 18, 2002 (gmt 0)

    Hey Matt,

    Good luck with your new search engine. It's really fast at spidering... I'll be using it for sure.


     10:14 pm on Mar 18, 2002 (gmt 0)

    Matt, looks great and felt great to see listings so quick... hundreds of listings so quick. ;)

    One concern -- CGI scripts are not being filtered out of the SERPs. For example, search on links.cfg and you'll see what I mean. I'd say that's a target for some serious abuse.
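A filter of the kind suggested here might look like the sketch below. The patterns are illustrative guesses, not GigaBlast's actual rules:

```python
# Sketch: drop obvious script and config URLs before indexing.
# The pattern list is a guess for illustration, not GigaBlast's rules.
import re
from urllib.parse import urlparse

SCRIPT_PATTERN = re.compile(r"(/cgi-bin/|\.cfg$|\.cgi$|\.pl$)", re.IGNORECASE)

def indexable(url):
    """Return False for URLs that look like raw scripts or config files."""
    path = urlparse(url).path
    return not SCRIPT_PATTERN.search(path)
```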


     11:01 pm on Mar 18, 2002 (gmt 0)

    Update, Matt. I submitted the index page from several of our sites yesterday afternoon. I just checked our log files and the spider is now moving out past the root directory into our sub-directories. You said yesterday that this would happen and it did. Just wanted to let you know that the spider is continuing to follow links. I'll keep you updated as it moves along.

    (edited by: MarkHutch at 6:55 pm (utc) on Mar. 26, 2002)


     11:26 pm on Mar 18, 2002 (gmt 0)

    Matt, this is fun for all webmasters - seeing plug-and-play spider action in motion.

    Does Gigabot support crawling with variables in the URL, such as www.domain.org/cgi-bin/cs_compare?state=ca
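One common way to handle dynamic URLs like the example above is to crawl them but canonicalize the query string, so the same page isn't indexed twice under reordered parameters. A sketch under that assumption, not Gigabot's actual policy:

```python
# Sketch: canonicalize dynamic URLs by sorting query parameters, and
# skip URLs with too many parameters (likely session IDs / crawler traps).
# Illustrative policy, not Gigabot's actual behavior.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url, max_params=3):
    """Return a canonical form of the URL, or None to skip it."""
    parts = urlparse(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if len(params) > max_params:
        return None                     # skip: probably a crawler trap
    query = urlencode(sorted(params))   # stable parameter order
    return urlunparse(parts._replace(query=query))
```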

    Great job and wishing you all the luck.

    Michael Weir

     11:39 pm on Mar 18, 2002 (gmt 0)

    Fast, good listings, and easy to use. Good work...keep it nice and simple. :)


     6:57 am on Mar 19, 2002 (gmt 0)

    I guess the IP ban is in effect on my site... I submitted twice and checked for the results, and I did get the results.

    But today when I checked, the results are gone... none of my pages are in the index.



     7:13 am on Mar 19, 2002 (gmt 0)

    ideavirus, when did you submit? It seems like people keep missing this, even though it's been said a few times in this thread (which is now spread across three parts, so I guess it's not surprising that things are missed): yesterday Matt lost his entire database and had to start spidering again from scratch. So anything submitted before that happened Sunday evening is probably no longer there.

    brotherhood of LAN

     7:25 am on Mar 19, 2002 (gmt 0)

    His spider must be out on the rampage then - the site is very slow.

    So I see about the database being reset... he has gone from one and a half million pages to 50,000.


     8:21 am on Mar 19, 2002 (gmt 0)

    hi guys,

    if you notice the number of docs shrinking it's because i reset the database.

    i won't do this to you once the thing is officially released, but it may happen again before then.

    thank you,


     3:46 pm on Mar 19, 2002 (gmt 0)

    1) Did the database reset again? A load of sites I put in yesterday seem to have been dumped.

    2) Intermittently, I see the "Last 5" only returning 4 results. I'm clicking fairly fast, so I don't think it's a blank line coming through.

    3) Is it my imagination, or is there loads of German content in there? I've seen more German language lines in the SERPs from Gigablast than anywhere else I remember outside of a dedicated German-language engine.


     4:16 pm on Mar 19, 2002 (gmt 0)

    Hi Matt,

    All the best for your project.

    One thing, the last 5 searches is turning into :

    1. forum of its own
    2. A place for people asking for mafia connections :)
    3. A free advertising board

    Will you keep this feature?




     4:25 pm on Mar 19, 2002 (gmt 0)

    LOL steve_1881 sure is getting some attention, isn't he?


     4:36 pm on Mar 19, 2002 (gmt 0)

    It's getting a bit out of hand now; they are advertising cocaine (where's that pen?).

    It's a shame people have to ruin what was/is a good idea for a SE.




     4:38 pm on Mar 19, 2002 (gmt 0)

    Better filter needed, or a hand-mod time delay


     6:11 pm on Mar 19, 2002 (gmt 0)

    1) Did the database reset again? A load of sites I put in yesterday seem to have been dumped.

    It appears that he did. I believe he posted about it just previous to your post.

    The force respider option is gone as well. I can only guess that the last five searches is a study tool for Matt at this point. I can't imagine leaving it in place.

    Matt is using a black list he got free from squidGuard. I am wondering how many others are using this and what the criteria are to be placed on it. I found sites of ours, and others we know, on these lists for no apparent reason. There were even some IPs on our server that have yet to be developed; they don't even have an index page, and they're on this black list. Oh, BTW, if you have an adult site you're trying to get listed, from what we're reading on these lists that would be the most likely reason it isn't.

    Matt: you might want to rethink using this data alone to decide what you do and don't want in your database.
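The over-blocking described above is easy to reproduce: squidGuard-style domain lists match a host and all of its subdomains, and IP entries catch every site sharing that address. A sketch of that lookup, with a made-up list for illustration:

```python
# Sketch of a squidGuard-style blacklist lookup: an entry bans the host
# itself and every subdomain under it. The list contents are made up.
def blacklisted(host, domain_list):
    """True if host, or any parent domain of host, appears in the list."""
    labels = host.lower().split(".")
    # Check "www.badsite.example", then "badsite.example", then "example".
    return any(".".join(labels[i:]) in domain_list for i in range(len(labels)))
```

This is why an innocent site on a shared IP, or a new subdomain of a listed domain, gets swept up.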


     4:06 am on Mar 20, 2002 (gmt 0)

    I'm impressed...reallllly quick!


     8:43 am on Mar 20, 2002 (gmt 0)

    looks like another brownout! back to 26,000 pages.

    Gentlemen, start your engines... again.

    Starters - please re-nominate.


     1:18 pm on Mar 20, 2002 (gmt 0)

    The submit url is temporarily out of order guys. Time to get back to the real world of SEO for a bit. ;)


     8:24 pm on Mar 20, 2002 (gmt 0)

    Well, it's at 181,536.
    I just resubmitted!
    I hope this one takes off well.

    Best wishes Matt,
    and of course if any of the free logo designs coming to you are not to your liking, you know who to ask ;)


     8:30 pm on Mar 20, 2002 (gmt 0)

    Hello Matt. By now you are probably pulling your hair out trying to get this thing to work right, but I hope you find and get all the bugs worked out soon. You're providing a much needed service and I hope you become successful and make millions of $$$ for your effort.


     8:44 pm on Mar 20, 2002 (gmt 0)

    I think, Matt, that you should seek technical help in sharing the tasks: some good programmer pals, maybe ones you've not been in touch with for a long time, or some guys from this forum who are apt for the task.

    Imagine tripling the efficiency - wow, that would speed up 2 months to 20 days, I guess (if the hardware is not the limitation, I mean).

    Cheers anyways,


     3:36 am on Mar 21, 2002 (gmt 0)

    Hello again, Matt. I hope you're still reading comments here in the forum. I submitted our site again today because the database got reset again. I just checked our logs and noticed that you are now requesting a robots.txt file. That's good. However, I did notice that the "Gigablast 1.0" part is no longer part of the user-agent ID. You might want to turn that back on if it's not too much trouble. Our sites are Linux based and connected to an OC-3 line, so the crawling speed is no big deal to us, but some folks are going to get upset when search engines pull pages at a fast pace. According to our log, your search engine pulled about 250 pages in about 2 minutes. Some webmasters might get upset at such a fast crawl rate. In a nutshell, you're doing a fantastic job with this new search tool. Any idea when you're going to finish your beta test? I'm considering adding a link to your search engine on some of our sites once you've got everything worked out. Keep up the good work...

    (edited by: MarkHutch at 12:55 am (utc) on Mar. 28, 2002)

    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved