
Forum Moderators: bakedjake


GigaBlast Part 2

New search engine

     
9:46 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


continued from: [webmasterworld.com...]


208.254.87.133 is the IP used by the GigaBot spider. I don't know if there are others. I never submitted my site to GigaBlast, but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently at 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.
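
For anyone who wants to filter their own logs the same way, here is a minimal sketch in Python. It assumes the client IP is the first whitespace-separated field on each line (as in Apache's common log format). The function name and the sample lines are made up for illustration.

```python
GIGABOT_IP = "208.254.87.133"  # the IP reported in this thread; there may be others


def hits_from_ip(log_lines, ip=GIGABOT_IP):
    """Return the log lines whose first field (the client IP) equals `ip`."""
    matches = []
    for line in log_lines:
        fields = line.split(None, 1)
        # Compare the whole field so 208.254.87.13 does not match 208.254.87.133.
        if fields and fields[0] == ip:
            matches.append(line)
    return matches


# Made-up sample lines, not real log data.
sample = [
    '208.254.87.133 - - [16/Mar/2002:04:22:14 +0000] "GET / HTTP/1.0" 200 5120 "-" "-"',
    '10.0.0.7 - - [16/Mar/2002:04:25:01 +0000] "GET /about.html HTTP/1.0" 200 2048 "-" "ExampleBrowser"',
]
for hit in hits_from_ip(sample):
    print(hit)
```

Matching on the full first field rather than on a substring avoids false hits from longer addresses that merely start with the same digits.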

I must admit, I'm a little concerned about the robots.txt issue.

10:00 pm on Mar 16, 2002 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Seems to have trouble with active server pages
10:04 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 17, 2000
posts:2924
votes: 0


I collected that info on Feb 20th. Perhaps that IP is no longer relevant?
10:18 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Littleman,

What is the connection between gigablast.com and brainbot.de?

10:26 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 17, 2000
posts:2924
votes: 0


:) Probably nothing, it seems to be a case of mistaken identity.
10:26 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


I did change a page - 1 character - and then looked at the cache, it was the old page.

Now I find 208.254.87.133 on this site 11 minutes after the page updated. No UA, no referer. This was from the raw logs. Funny, it was an unconditional GET request for the root page, but it did not show up in my AXS logs. So, I stand corrected. Could be a problem with no UA being sent (and the AXS script).

10:31 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Yes, AXS has a bug. Get a visit with no agent and referrer and AXS won't log it.
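
Until that is fixed, the raw logs are the place to look. A rough sketch in Python for spotting such visits, assuming the Apache "combined" log format, where the last two quoted fields are the referrer and the user agent and a missing value is written as "-". The function name and sample lines are hypothetical.

```python
import re

# Tail of a "combined" log line: ... "referrer" "user-agent"
TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def is_blank_visit(line):
    """True if the line has neither a referrer nor a user agent."""
    m = TAIL.search(line)
    if m is None:
        return False  # not in the assumed format
    return m.group("referer") in ("", "-") and m.group("agent") in ("", "-")


# Made-up sample lines, not real log data.
blank = '208.254.87.133 - - [16/Mar/2002:22:15:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "-"'
normal = '10.0.0.7 - - [16/Mar/2002:22:16:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "ExampleBrowser"'
print(is_blank_visit(blank), is_blank_visit(normal))
```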
10:31 pm on Mar 16, 2002 (gmt 0)

New User

10+ Year Member

joined:Mar 12, 2002
posts:30
votes: 0


Bug :
[gigablast.com...]

Look for the mirror link, might be able to inject code into the page that way.

10:44 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


Thanks Key. I was looking for little's IP in the raw logs, but never got around to 208.254.87.133. So no UA is being sent, yet.

You can imagine what I thought when it looked as if the pages had been pre-spidered and the cache looked as if it was just pulling the current page - SEO trap.

10:49 pm on Mar 16, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


Matt:

I hope your project kicks butt!

Just out of curiosity, is there a reason why some URLs get added quickly and others don't (like not at all today)? If they don't get added, should we resubmit them, and if so, how long should we wait to do so? What does the force respider box mean on the add url page?

11:39 pm on Mar 16, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 29, 2000
posts:1133
votes: 0


Matt, if you can ever spare the time, how about giving us a virtual tour of Gigablast? Tell us about its background, its location, its hardware and systems, and what your short- and long-term plans are. And tell us how this community might be able to assist you in building Gigablast.
12:24 am on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


ouch! i just got done recovering from a crash due to a long query. i'm surprised none of the other long queries crashed it. Thanks to whoever did the query:
"holy cow. i'll be a milliionaire if i can just get people to use this damn site"
I'm glad my system is redundant so i can recover quickly and easily from such bugs.

Thanks to wharsono i made the logo smaller.
you also shouldn't be able to put images and javascript into the front page via the last 5 queries mechanism now.

Thanks to Thierry Zoller for pointing out a bug. i think that one should be fixed now.

my bot doesn't use the user-agent tag yet, but should soon. it's also my policy to ban cloakers that abuse the search engine at my discretion, so be warned.

if your url doesn't get added quickly try checking the "force" option on the addUrl page. "force" tells gigablast to spider the url now even if it may already be in the queue for a later spidering. If still no luck it may be my custom, fritzy dns client. it has problems getting the ips for some sites. this is at the top of my to-fix queue.

i'll be putting up a history and objective for gigablast within the next week or so onto the about page.

thanks for all the testing fellow webmasters. good luck optimizing!

matt

12:27 am on Mar 17, 2002 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Good luck. It is definitely promising, and spidering at a phenomenal rate.
12:45 am on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:June 11, 2001
posts:134
votes: 0


Hey, working hard there, mattdwells? ;-)

Looks like you sorted out most of the exploits with the "last 5 search" function. I would bet there are some more hidden, but I haven't found any yet!

The logo is better smaller. Maybe the page would look better if it felt more 'centered', the search box is too much to the right. I am sure you are just concentrating on the engine ATM!

I'll shut my face now and let you get on with tweaking your code. It's kind of exciting, seeing the 'birth of a search engine' :-)

12:52 am on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 18, 2001
posts:49
votes: 0


Submitted a URL today. By the time I clicked the back button and did a search, the home page was already indexed (and #1 to boot). Yep, does feel like Infoseek.

The index seems to be growing at around 100 pages a second. If they can sustain this rate, their index will be bigger than Google's by the end of the year. :)
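
As a back-of-the-envelope check on that rate, here is a quick calculation in Python. The 100 pages a second figure is the estimate from the post above. Treating it as constant through December 31 is purely an assumption for the arithmetic.

```python
from datetime import date

PAGES_PER_SECOND = 100        # growth rate estimated in the post above
SECONDS_PER_DAY = 86_400

start = date(2002, 3, 17)     # date of the post
end = date(2002, 12, 31)      # "end of the year"

days_left = (end - start).days
pages_added = PAGES_PER_SECOND * SECONDS_PER_DAY * days_left
print(days_left, pages_added)  # 289 2496960000
```

That works out to roughly 2.5 billion pages added over the rest of the year, if the rate held.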

12:55 am on Mar 17, 2002 (gmt 0)

Full Member

10+ Year Member

joined:June 28, 2000
posts:280
votes: 0


Matt:

I love being devil's advocate and trying to find the bugs - if someone were to search on:

"Gigablast - just testing - trying to find a bug - with a very long query and !@#$%^&*()_~!@#$%^&*()_+}{¦{<?><?><"

It generates a 500 internal error. Don't know if it is because of length or a special character - but that kills the search.

12:57 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2001
posts:748
votes: 0


Good luck mattdwells.

I also had this observation: My site uses SSI, and all of the pages that do not change very often use the X-Bit Hack method. These are the only pages on my site included in this index. Other pages, which are 2 levels down in the site structure, do not return the last-modified date header. Some pages which are on the same level are included in the index, and they return a last-modified header.

Can I assume that if the pages are a certain number of levels down in the subdirectory structure AND do not return a last modified date/time that they are not included in the index?

The only reason I'm making this assumption is because webmasterworld.com does not show a last modified date in the cache page, but does show the last date spidered.
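
To test that assumption page by page, it helps to check which responses actually carry the header. A small sketch in Python, assuming the response headers are already available as a plain name-to-value mapping. The helper name and the sample header values are invented for illustration, and this says nothing about what Gigabot itself does with the header.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent or unparseable."""
    for name, value in headers.items():
        if name.lower() == "last-modified":  # header names are case-insensitive
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                return None
    return None


# Invented examples: one page sends the header, one does not.
with_header = {"Content-Type": "text/html", "Last-Modified": "Sat, 16 Mar 2002 04:22:14 GMT"}
without_header = {"Content-Type": "text/html"}
print(last_modified(with_header), last_modified(without_header))
```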

1:23 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 26, 2001
posts:1076
votes: 0


none of the sites i added got spidered, and none appear to be listed.

i'll try again in a little while ...

1:27 am on Mar 17, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 24, 2001
posts:501
votes: 0


Search Boss has it well spammed with their 5000+ domains of 100-800 pages each in every category under the sun.
1:29 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 17, 2000
posts:2924
votes: 0


Good luck Matt! The world could use more free spidering engines.
1:57 am on Mar 17, 2002 (gmt 0)

Junior Member

joined:Feb 14, 2002
posts:63
votes: 0


I just went back to visit and noticed that 1 site had the top 30 or so positions for my keywords. I wonder if the engine will eventually collapse all of these pages from the same site. Otherwise the most optimized site is going to get the first three or four pages of listings.
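
To make "collapse" concrete, one simple form of it is capping how many results a single host may contribute. A minimal sketch in Python. This is only an illustration of the idea, not how Gigablast works or plans to work, and the function name, the limit of two per site and the example URLs are all assumptions.

```python
from urllib.parse import urlsplit


def cluster_by_site(ranked_urls, max_per_site=2):
    """Keep results in ranked order, but at most `max_per_site` per hostname."""
    seen = {}   # hostname -> number of results kept so far
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if seen.get(host, 0) < max_per_site:
            kept.append(url)
            seen[host] = seen.get(host, 0) + 1
    return kept


# Made-up ranking where one site dominates.
ranked = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.com/c.html",
    "http://www.example.org/widgets.html",
]
print(cluster_by_site(ranked))
```

With the default limit, the third example.com page is dropped and the example.org page moves up to third place.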
2:02 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 10, 2001
posts:1550
votes: 10


Matt, when filtering "<" and ">" in the search string, the replacements are correct in the "last 5" display, but the "[cached]" link has "& gt;" and "& lt;" mixed up... ;)
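
For reference, "<" should become the lt entity and ">" the gt entity. A quick sketch in Python using the standard library's `html.escape`, which does this mapping. The `render_recent_queries` helper is hypothetical and only stands in for something like the "last 5" display.

```python
import html


def render_recent_queries(queries):
    """Build an HTML list of recent queries with markup characters escaped."""
    items = [f"<li>{html.escape(q, quote=True)}</li>" for q in queries]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(html.escape("<b>"))  # &lt;b&gt;
print(render_recent_queries(["<script>alert(1)</script>", "cheap widgets"]))
```

Escaping at output time means a query containing tags is shown as text instead of being interpreted by the browser.
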
2:37 am on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


search boss is an abuser. does anyone have a list of his ips so i can ban him?

btw, there's an open project going on at
[linugen.com...]

there's also some free blacklists you can get from squidguard. evidently searchboss has avoided these.

i try to incorporate these lists into my blacklist on a regular basis.
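
A rough sketch of what merging such lists could look like, in Python. It assumes each list is plain text with one domain per line and treats "#" lines as comments. The function names and example domains are made up, and this is not a description of how Gigablast actually handles its blacklist.

```python
def merge_blacklists(*lists):
    """Combine several iterables of domain lines into one lowercase set."""
    merged = set()
    for entries in lists:
        for entry in entries:
            domain = entry.strip().lower()
            if domain and not domain.startswith("#"):
                merged.add(domain)
    return merged


def is_blacklisted(host, blacklist):
    """True if the host or any of its parent domains is on the blacklist."""
    parts = host.lower().rstrip(".").split(".")
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


# Made-up list contents.
list_a = ["spam.example", "# a comment", ""]
list_b = ["Doorway.Example\n"]
blacklist = merge_blacklists(list_a, list_b)
print(is_blacklisted("www.spam.example", blacklist), is_blacklisted("good.example", blacklist))
```

Since anyone can submit entries to an open list, merged entries probably deserve a manual review before they are used to ban anything.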

if you have search boss' ips... please! can i have them???

thanks,
matt

2:41 am on Mar 17, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 16, 2002
posts:65
votes: 0


> I just went back to visit and noticed that
> 1 site had the top 30 or so positions for
> my keywords. I wonder if the engine will
> eventually collapse all of these pages
> from the same site. Otherwise the most
> optimized site is going to get the first
> three or four pages of listings.

site clustering is at the top of my TODO list...
matt

2:42 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member littleman is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 17, 2000
posts:2924
votes: 0


Hey Matt, want to share some info on what is powering your search engine?
2:46 am on Mar 17, 2002 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


I seem to have problems getting the robot to follow links
2:51 am on Mar 17, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


Most of the sites I have submitted (a fairly small number, comparatively, I'm sure) have not been spidered or added. I see that you said you are having trouble picking up some of the IPs. I'm not sure how these particular sites would be affected, as they are all very similar and on the same server. Some are being added and spidered immediately and others have not been hit even when forcing a respider. I have not seen the spider following links until I submit each section's index page; then it will list all of the pages linked on that index page. Just FYI, I'm sure all these things will be shaken out in the end. All in all I'm impressed.
3:20 am on Mar 17, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 29, 2000
posts:1133
votes: 0


>>i try to incorporate these lists into my blacklist on a regular basis

<LOL> Well, Matt, I see someone has already blacklisted my competitors on linugen. Check them out ... www.msn.com, www.yahoo.com, www.ebay.com. </LOL>

Looks to me like this vigilante police site is just a tool for abuse by the spammers.

7:08 am on Mar 17, 2002 (gmt 0)

New User

10+ Year Member

joined:Dec 18, 2001
posts:33
votes: 0


Nice one, Matt! :)
Quick, efficient, neat and tidy - this layperson loves it!
Good to hear your responsive comments, too.
Blast on!
3:05 pm on Mar 17, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


Hmm, still most of the sites that I have submitted have not been added or crawled. Anyone else having this problem? I don't want to keep resubmitting them and forcing the spidering, as it didn't work the first time I retried. Guess I'll just wait until the IP bug is fixed.

<note> Matt, does this have anything to do with sites not having a DNS lookup? I know some of mine are coming back with nothing, but some are and are still not getting hit. Just a thought.</note>

(edited by: Jill at 3:54 pm (utc) on Mar. 17, 2002)

This 59 message thread spans 2 pages: 59