
Forum Moderators: bakedjake

GigaBlast Part 2

New search engine

   
9:46 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



continued from: [webmasterworld.com...]


208.254.87.133 is the IP used by the GigaBot spider. I don't know if there are others. I never submitted my site to GigaBlast but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.

I must admit, I'm a little concerned about the robots.txt issue.

10:00 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Seems to have trouble with active server pages
10:04 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I collected that info on Feb 20th. Perhaps that IP is no longer relevant?
10:18 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Littleman,

What is the connection between gigablast.com and brainbot.de?

10:26 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member



:) probably nothing, it seems to be a case of mistaken identity.
10:26 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I did change a page - 1 character - and then looked at the cache, it was the old page.

Now I find 208.254.87.133 on this site 11 minutes after the page updated. No UA, no referer. This was from the raw logs. Funny: it was an unconditional GET request for the root page, but it did not show up in my AXS logs. So, I stand corrected. Could be a problem with no UA sent (and the AXS script).
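
For anyone who wants to repeat this check against their own raw logs, here is a minimal sketch. It assumes the logs are in Apache "combined" format, and the two sample lines are made up for illustration (only the IP comes from this thread). The function name anonymous_hits is hypothetical.

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def anonymous_hits(lines, ip):
    """Yield (time, request) for hits from `ip` that sent no referer and no user agent."""
    for line in lines:
        m = COMBINED.match(line)
        if not m or m.group("ip") != ip:
            continue
        # Apache writes "-" when a header is missing.
        if m.group("referer") in ("", "-") and m.group("agent") in ("", "-"):
            yield m.group("time"), m.group("request")

# Hypothetical sample lines, not taken from any real log.
sample = [
    '208.254.87.133 - - [16/Mar/2002:22:15:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "-"',
    '192.0.2.10 - - [16/Mar/2002:22:16:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
]
hits = list(anonymous_hits(sample, "208.254.87.133"))
print(hits)
```

Reading the raw log directly sidesteps any stats script that silently drops requests with empty headers.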

10:31 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, AXS has a bug. Get a visit with no agent and referrer and AXS won't log it.
10:31 pm on Mar 16, 2002 (gmt 0)

10+ Year Member



Bug :
[gigablast.com...]

Look for the mirror link, might be able to inject code into the page that way.

10:44 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Key. I was looking for little's IP in the raw logs, but never got around to 208.254.87.133. So no UA is being sent, yet.

You can imagine what I thought when it looked as if the pages had been pre-spidered and the cache looked as if it was just pulling the current page - SEO trap.

10:49 pm on Mar 16, 2002 (gmt 0)

10+ Year Member



Matt:

I hope your project kicks butt!

Just out of curiosity, is there a reason why some urls get added quickly and others don't (like not at all today)? If they don't get added should we resubmit them, and if so, how long should we wait to do so? What does the force respider box mean on the add url page?

11:39 pm on Mar 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Matt, if you can ever spare the time, how about giving us a virtual tour of Gigablast. Tell us about its background, its location, its hardware and systems, and what your short and long term plans are. And tell us how this community might be able to assist you in building Gigablast.
12:24 am on Mar 17, 2002 (gmt 0)

10+ Year Member



ouch! i just got done recovering from a crash due to a long query. i'm surprised none of the other long queries crashed it. Thanks to whoever did the query:
"holy cow. i'll be a milliionaire if i can just get people to use this damn site"
I'm glad my system is redundant so i can recover quickly and easily from such bugs.

Thanks to wharsono i made the logo smaller.
you also shouldn't be able to put images and javascript into the front page via the last 5 queries mechanism now.

Thanks to Thierry Zoller for pointing out a bug. i think that one should be fixed now.

my bot doesn't use the user-agent tag yet, but should soon. it's also my policy to ban cloakers that abuse the search engine at my discretion, so be warned.

if your url doesn't get added quickly try checking the "force" option on the addUrl page. "force" tells gigablast to spider the url now even if it may already be in the queue for a later spidering. If still no luck it may be my custom, fritzy dns client. it has problems getting the ips for some sites. this is at the top of my to-fix queue.

i'll be putting up a history and objective for gigablast within the next week or so onto the about page.

thanks for all the testing fellow webmasters. good luck optimizing!

matt

12:27 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Good luck. It is definitely promising, and spidering at a phenomenal rate.
12:45 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Hey, working hard there, mattdwells? ;-)

Looks like you sorted out most of the exploits with the "last 5 search" function. I would bet there are some more hidden, but I haven't found any yet!

The logo is better smaller. Maybe the page would look better if it felt more 'centered', the search box is too much to the right. I am sure you are just concentrating on the engine ATM!

I'll shut my face now and let you get on with tweaking your code. It's kind of exciting, seeing the 'birth of a search engine' :-)

12:52 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Submitted a URL today. By the time I clicked the back button and did a search, the home page was already indexed (and #1 to boot). Yep, does feel like Infoseek.

The index seems to be growing at around 100 pages a second. If they can sustain this rate, their index will be bigger than Google's by the end of the year. :)
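
A rough check of that arithmetic, as a sketch only: it assumes the 100 pages per second estimate holds around the clock from the date of this post to the end of the year, and it makes no claim about the size of any other engine's index.

```python
from datetime import date

PAGES_PER_SECOND = 100        # rate estimated in the post above
SECONDS_PER_DAY = 86_400

start = date(2002, 3, 17)     # date of the post
end = date(2002, 12, 31)      # "end of the year"

days = (end - start).days
pages_added = PAGES_PER_SECOND * SECONDS_PER_DAY * days
print(f"{days} days -> {pages_added:,} pages added")
```

At a sustained 100 pages per second that works out to roughly 2.5 billion pages added over the remaining 289 days of 2002.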

12:55 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Matt:

I love being devil's advocate and trying to find the bugs - if someone were to search on:

"Gigablast - just testing - trying to find a bug - with a very long query and !@#$%^&*()_~!@#$%^&*()_+}{¦{<?><?><"

It generates a 500 internal error. Don't know if it is because of length or a special character - but that kills the search.

12:57 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good luck mattdwells.

I also had this observation: My site uses SSI, and all of the pages that do not change very often use the X-Bit Hack method. These are the only pages included in this index on my site. Other pages, which are 2 levels down in the site structure, do not return the last-modified date header. Some pages which are on the same level, are included in the index, and they return a last-modified header.

Can I assume that if the pages are a certain number of levels down in the subdirectory structure AND do not return a last modified date/time that they are not included in the index?

The only reason I'm making this assumption is because webmasterworld.com does not show a last modified date in the cache page, but does show the last date spidered.
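
One way to test the assumption is to compare which pages actually send a Last-Modified header. The following is a sketch, not a description of how Gigablast decides what to index; the helper names last_modified and fetch_headers are hypothetical, and fetch_headers needs network access.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def last_modified(headers):
    """Return the Last-Modified value as a datetime, or None if absent or unparseable."""
    # A plain dict lookup is case-sensitive; the headers object returned by
    # urlopen() is case-insensitive.
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None

def fetch_headers(url):
    """Issue a HEAD request and return the response headers (requires network access)."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
        return resp.headers

# Offline example with a hand-written header value.
stamp = last_modified({"Last-Modified": "Sat, 16 Mar 2002 04:22:14 GMT"})
print(stamp)
```

Running last_modified(fetch_headers(url)) over the indexed and non-indexed pages would show whether the header really correlates with inclusion.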

1:23 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



none of the sites i added got spidered, and none appear to be listed.

i'll try again in a little while ...

1:27 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Search Boss has it well spammed with their 5000+ domains of 100-800 pages each in every category under the sun.
1:29 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Good luck Matt! The world could use more free spidering engines.
1:57 am on Mar 17, 2002 (gmt 0)



I just went back to visit and noticed that 1 site had the top 30 or so positions for my keywords. I wonder if the engine will eventually collapse all of these pages from the same site. Otherwise the most optimized site is going to get the first three or four pages of listings.
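
To illustrate what collapsing pages from the same site could look like, here is a minimal sketch that caps the number of results per hostname while keeping rank order. The function name cluster_by_site, the per_site limit, and the example URLs are all assumptions for illustration, not Gigablast's method.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def cluster_by_site(ranked_urls, per_site=2):
    """Keep at most `per_site` results per hostname, preserving the original rank order."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        # urlsplit() lowercases the hostname; fall back to "" for malformed URLs.
        host = urlsplit(url).hostname or ""
        if seen[host] < per_site:
            seen[host] += 1
            kept.append(url)
    return kept

# Hypothetical ranked results dominated by one site.
results = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.com/c.html",
    "http://www.example.org/index.html",
    "http://www.example.com/d.html",
]
clustered = cluster_by_site(results)
print(clustered)
```

Grouping on the full hostname is the simplest choice; grouping on the registered domain would also catch sites spread across many subdomains, at the cost of needing a suffix list.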
2:02 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Matt, when filtering "<" and ">" in the search string, the replacements are correct in the "last 5" display, but the "[cached]" link has "& gt;" and "& lt;" mixed up... ;)
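
For reference, the usual mapping is "<" to the "lt" entity and ">" to the "gt" entity. A minimal sketch using Python's standard library, assuming the recent queries are rendered as list items (the function name render_recent_queries is hypothetical and says nothing about how Gigablast is built):

```python
from html import escape

def render_recent_queries(queries):
    """Return one <li> per query, with each query escaped for safe inclusion in HTML."""
    # escape() replaces &, <, > and, by default, both quote characters.
    return "\n".join(f"<li>{escape(q)}</li>" for q in queries)

markup = render_recent_queries(["<script>alert(1)</script>", "cats & dogs"])
print(markup)
```

Applying the same escaping function everywhere a query is echoed, including link text such as the cached link, avoids the mismatch described above.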
2:37 am on Mar 17, 2002 (gmt 0)

10+ Year Member



search boss is an abuser. does anyone have a list of his ips so i can ban him?

btw, there's an open project going on at
[linugen.com...]

there's also some free blacklists you can get from squidguard. evidently searchboss has avoided these.

i try to incorporate these lists into my blacklist on a regular basis.

if you have search boss' ips... please! can i have them???

thanks,
matt

2:41 am on Mar 17, 2002 (gmt 0)

10+ Year Member



> I just went back to visit and noticed that
> 1 site had the top 30 or so positions for
> my keywords. I wonder if the engine will
> eventually collapse all of these pages
> from the same site. Otherwise the most
> optimized site is going to get the first
> three or four pages of listings.

site clustering is at the top of my TODO list...
matt

2:42 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hey Matt, want to share some info on what is powering your search engine?
2:46 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I seem to have problems getting the robot to follow links
2:51 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Most of the sites I have submitted, a fairly small number comparatively I'm sure, have not been spidered or added. I see that you said you are having trouble picking up some of the IPs. I'm not sure how these particular sites would be affected as they are all very similar and on the same server. Some are being added and spidered immediately and others have not been hit even when forcing a respider. I have not seen the spider following links until I submit each section's index page, then it will list all of the pages linked on that index page. Just FYI, I'm sure all these things will be shaken out in the end. All in all I'm impressed.
3:20 am on Mar 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>i try to incorporate these lists into my blacklist on a regular basis

<LOL> Well, Matt, I see someone has already blacklisted my competitors on linugen. Check them out ... www.msn.com, www.yahoo.com, www.ebay.com. </LOL>

Looks to me like this vigilante police site is just a tool for abuse by the spammers.

7:08 am on Mar 17, 2002 (gmt 0)

10+ Year Member



Nice one, Matt! :)
Quick, efficient, neat and tidy - this layperson loves it!
Good to hear your responsive comments, too.
Blast on!
3:05 pm on Mar 17, 2002 (gmt 0)

10+ Year Member



Hmm, still most of the sites that I have submitted have not been added or crawled. Anyone else having this problem? I don't want to keep resubmitting them and forcing the spidering, as it didn't work the first time I retried. Guess I'll just wait until the IP bug is fixed.

<note> Matt, does this have anything to do with sites not having a DNS lookup? I know some of mine are coming back with nothing, but some are resolving and are still not getting hit. Just a thought.</note>

(edited by: Jill at 3:54 pm (utc) on Mar. 17, 2002)

This 59 message thread spans 2 pages: 59
 
