Forum Library, Charter, Moderators: bakedjake

Alternative Search Engines Forum

This 59 message thread spans 2 pages.
GigaBlast Part 2
New search engine
Key_Master




msg:462348
 9:46 pm on Mar 16, 2002 (gmt 0)

continued from: [webmasterworld.com...]


208.254.87.133 is the IP used by the GigaBot spider. I don't know if there are others. I never submitted my site to GigaBlast but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.

I must admit, I'm a little concerned about the robots.txt issue.

 

brotherhood of LAN




msg:462349
 10:00 pm on Mar 16, 2002 (gmt 0)

Seems to have trouble with active server pages

littleman




msg:462350
 10:04 pm on Mar 16, 2002 (gmt 0)

I collected that info on Feb 20th. Perhaps that IP is no longer relevant?

Key_Master




msg:462351
 10:18 pm on Mar 16, 2002 (gmt 0)

Littleman,

What is the connection between gigablast.com and brainbot.de?

littleman




msg:462352
 10:26 pm on Mar 16, 2002 (gmt 0)

:) probably nothing, it seems to be a case of mistaken identity.

bobriggs




msg:462353
 10:26 pm on Mar 16, 2002 (gmt 0)

I did change a page - 1 character - and then looked at the cache, it was the old page.

Now I find 208.254.87.133 on this site 11 minutes after the page updated. No UA, no referer. This was from the raw logs. Funny: it was an unconditional GET request for the root page, but it did not show up in my AXS logs. So, I stand corrected. Could be a problem with no UA sent (and the AXS script).

Key_Master




msg:462354
 10:31 pm on Mar 16, 2002 (gmt 0)

Yes, AXS has a bug. Get a visit with no agent and referrer and AXS won't log it.
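
One way to catch visits that a stats script drops is to scan the raw access log directly. This is a minimal sketch, assuming the server writes Apache "combined" format, where a missing referer or user-agent is logged as "-". The function name and the sample lines are made up for illustration and are not taken from anyone's actual logs.

```python
import re

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def anonymous_hits(lines, ip=None):
    """Yield parsed hits that carry neither a referer nor a user-agent."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        hit = m.groupdict()
        blank = hit["referer"] in ("", "-") and hit["agent"] in ("", "-")
        if blank and (ip is None or hit["ip"] == ip):
            yield hit

# Made-up sample lines, for illustration only.
sample = [
    '208.254.87.133 - - [16/Mar/2002:04:22:14 +0000] "GET / HTTP/1.0" 200 5120 "-" "-"',
    '10.0.0.5 - - [16/Mar/2002:04:23:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
]
hits = list(anonymous_hits(sample, ip="208.254.87.133"))
```

Passing `ip=None` lists every request that arrived with no referer and no user-agent, which is a quick way to spot other spiders that behave the same way.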

Thierry Zoller




msg:462355
 10:31 pm on Mar 16, 2002 (gmt 0)

Bug :
[gigablast.com...]

Look for the mirror link, might be able to inject code into the page that way.

bobriggs




msg:462356
 10:44 pm on Mar 16, 2002 (gmt 0)

Thanks Key. I was looking for little's IP in the raw logs, but never got around to 208.254.87.133. So no UA is being sent, yet.

You can imagine what I thought when it looked as if the pages had been pre-spidered and the cache looked as if it was just pulling the current page - SEO trap.

Jill




msg:462357
 10:49 pm on Mar 16, 2002 (gmt 0)

Matt:

I hope your project kicks butt!

Just out of curiosity, is there a reason why some urls get added quickly and others don't (like not at all today)? If they don't get added should we resubmit them, and if so, how long should we wait to do so? What does the force respider box mean on the add url page?

mayor




msg:462358
 11:39 pm on Mar 16, 2002 (gmt 0)

Matt, if you can ever spare the time, how about giving us a virtual tour of Gigablast. Tell us about its background, its location, its hardware and systems, and what your short- and long-term plans are. And tell us how this community might be able to assist you in building Gigablast.

mattdwells




msg:462359
 12:24 am on Mar 17, 2002 (gmt 0)

ouch! i just got done recovering from a crash due to a long query. i'm surprised none of the other long queries crashed it. Thanks to whoever did the query:
"holy cow. i'll be a milliionaire if i can just get people to use this damn site"
I'm glad my system is redundant so i can recover quickly and easily from such bugs.

Thanks to wharsono i made the logo smaller.
you also shouldn't be able to put images and javascript into the front page via the last 5 queries mechanism now.

Thanks to Thierry Zoller for pointing out a bug. i think that one should be fixed now.

my bot doesn't use the user-agent tag yet, but should soon. it's also my policy to ban cloakers that abuse the search engine at my discretion, so be warned.

if your url doesn't get added quickly try checking the "force" option on the addUrl page. "force" tells gigablast to spider the url now even if it may already be in the queue for a later spidering. If still no luck it may be my custom, fritzy dns client. it has problems getting the ips for some sites. this is at the top of my to-fix queue.

i'll be putting up a history and objective for gigablast within the next week or so onto the about page.

thanks for all the testing fellow webmasters. good luck optimizing!

matt
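
The "force" behaviour described above (spider the url now even if it is already queued for later) can be pictured with a small priority queue. This is only a toy sketch: the class, its method names, and the numeric due times are hypothetical and say nothing about how Gigablast actually schedules its spider.

```python
import heapq

class SpiderQueue:
    """Toy crawl queue: each URL has a due time; the earliest due is fetched first."""

    def __init__(self):
        self._heap = []   # (due_time, url) entries, possibly stale
        self._due = {}    # url -> current due time

    def add(self, url, due, force=False):
        if force:
            due = 0  # schedule ahead of every normally queued URL
        elif url in self._due:
            return  # already queued; keep the existing schedule
        self._due[url] = due
        heapq.heappush(self._heap, (due, url))

    def pop(self):
        """Return the next URL to fetch, or None if the queue is empty."""
        while self._heap:
            due, url = heapq.heappop(self._heap)
            if self._due.get(url) == due:  # ignore stale entries
                del self._due[url]
                return url
        return None

q = SpiderQueue()
q.add("http://a.example/", due=100)
q.add("http://b.example/", due=50)
q.add("http://a.example/", due=100, force=True)  # jumps the queue
order = [q.pop(), q.pop(), q.pop()]
```

Rather than searching the heap for the old entry, the forced add pushes a new one and lets `pop` discard the stale copy when it surfaces.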

brotherhood of LAN




msg:462360
 12:27 am on Mar 17, 2002 (gmt 0)

Good luck. It is definitely promising, and spidering at a phenomenal rate.

electro




msg:462361
 12:45 am on Mar 17, 2002 (gmt 0)

Hey, working hard there, mattdwells? ;-)

Looks like you sorted out most of the exploits with the "last 5 search" function. I would bet there are some more hidden, but I haven't found any yet!

The logo is better smaller. Maybe the page would look better if it felt more 'centered', the search box is too much to the right. I am sure you are just concentrating on the engine ATM!

I'll shut my face now and let you get on with tweaking your code. It's kind of exciting, seeing the 'birth of a search engine' :-)

EX_S




msg:462362
 12:52 am on Mar 17, 2002 (gmt 0)

Submitted a URL today. By the time I clicked the back button and did a search, the home page was already indexed (and #1 to boot). Yep, does feel like Infoseek.

The index seems to be growing at around 100 pages a second. If they can sustain this rate, their index will be bigger than Google's by the end of the year. :)
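
For a sense of scale, the arithmetic behind that estimate can be worked out directly. The sketch below assumes the 100 pages a second quoted above is sustained without interruption from the date of this post to the end of 2002.

```python
from datetime import date

PAGES_PER_SECOND = 100          # rate quoted in the post
start = date(2002, 3, 17)       # date of the post
end = date(2002, 12, 31)

seconds_left = (end - start).days * 24 * 60 * 60
pages_added = PAGES_PER_SECOND * seconds_left
print(f"{pages_added:,} pages added by year end")
```

That comes to roughly 2.5 billion additional pages. How it compares with Google depends on the index size you assume for Google, which the post does not give.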

Bradley




msg:462363
 12:55 am on Mar 17, 2002 (gmt 0)

Matt:

I love being devil's advocate and trying to find the bugs - if someone were to search on:

"Gigablast - just testing - trying to find a bug - with a very long query and !@#$%^&*()_~!@#$%^&*()_+}{¦{<?><?><"

It generates a 500 internal error. Don't know if it is because of length or a special character - but that kills the search.

bobriggs




msg:462364
 12:57 am on Mar 17, 2002 (gmt 0)

Good luck mattdwells.

I also had this observation: My site uses SSI, and all of the pages that do not change very often use the X-Bit Hack method. These are the only pages included in this index on my site. Other pages, which are 2 levels down in the site structure, do not return the last-modified date header. Some pages which are on the same level, are included in the index, and they return a last-modified header.

Can I assume that if the pages are a certain number of levels down in the subdirectory structure AND do not return a last modified date/time that they are not included in the index?

The only reason I'm making this assumption is because webmasterworld.com does not show a last modified date in the cache page, but does show the last date spidered.
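
For anyone wanting to test that theory against their own pages, the check comes down to whether a response carries a Last-Modified header. (With Apache's `XBitHack full`, SSI files that have the group-execute bit set are served with one.) A minimal sketch, assuming you already have the response headers as a dict; the function name is illustrative.

```python
from email.utils import parsedate_to_datetime

def last_modified(headers):
    """Return the Last-Modified time as a datetime, or None if absent or malformed."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("last-modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None

with_header = last_modified({"Last-Modified": "Sat, 16 Mar 2002 04:22:14 GMT"})
without_header = last_modified({"Content-Type": "text/html"})
```

Running this over pages at each directory level would show whether the indexed and unindexed pages really split along the header, or along depth.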

Crazy_Fool




msg:462365
 1:23 am on Mar 17, 2002 (gmt 0)

none of the sites i added got spidered, and none appear to be listed.

i'll try again in a little while ...

nell




msg:462366
 1:27 am on Mar 17, 2002 (gmt 0)

Search Boss has it well spammed with their 5000+ domains of 100-800 pages each in every category under the sun.

littleman




msg:462367
 1:29 am on Mar 17, 2002 (gmt 0)

Good luck Matt! The world could use more free spidering engines.

Bogglesworld




msg:462368
 1:57 am on Mar 17, 2002 (gmt 0)

I just went back to visit and noticed that 1 site had the top 30 or so positions for my keywords. I wonder if the engine will eventually collapse all of these pages from the same site. Otherwise the most optimized site is going to get the first three or four pages of listings.

bird




msg:462369
 2:02 am on Mar 17, 2002 (gmt 0)

Matt, when filtering "<" and ">" in the search string, the replacements are correct in the "last 5" display, but the "[cached]" link has "& gt;" and "& lt;" mixed up... ;)
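
The usual fix for this class of bug is to escape every user-supplied string with one shared routine before echoing it into a page, so the "last 5" display and the cached link cannot disagree. A minimal sketch using Python's standard library; the wrapper function name is hypothetical and this is not Gigablast's code.

```python
import html

def render_recent_query(query):
    """Escape a user-supplied query before echoing it into an HTML page."""
    # html.escape turns & < > (and quotes, by default) into entities,
    # so the browser shows the text instead of interpreting it as markup.
    return html.escape(query)

shown = render_recent_query('<script>alert("hi")</script>')
```

Because "<" always maps to the lt entity and ">" to the gt entity in one place, the two cannot get swapped in only one of the views.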

mattdwells




msg:462370
 2:37 am on Mar 17, 2002 (gmt 0)

search boss is an abuser. does anyone have a list of his ips so i can ban him?

btw, there's an open project going on at
[linugen.com...]

there's also some free blacklists you can get from squidguard. evidently searchboss has avoided these.

i try to incorporate these lists into my blacklist on a regular basis.

if you have search boss' ips... please! can i have them???

thanks,
matt

mattdwells




msg:462371
 2:41 am on Mar 17, 2002 (gmt 0)

> I just went back to visit and noticed that
> 1 site had the top 30 or so positions for
> my keywords. I wonder if the engine will
> eventually collapse all of these pages
> from the same site. Otherwise the most
> optimized site is going to get the first
> three or four pages of listings.

site clustering is at the top of my TODO list...
matt
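
In its simplest form, site clustering means capping how many results any one hostname may contribute to a results page. The sketch below shows that idea only; the function name and the limit of two per site are assumptions, not a description of what Gigablast will implement.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def cluster_by_site(ranked_urls, per_site=2):
    """Keep at most `per_site` results per hostname, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < per_site:
            seen[host] += 1
            kept.append(url)
    return kept

results = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/x",
]
clustered = cluster_by_site(results)
```

Grouping on the full hostname treats each subdomain as a separate site, so a spammer with thousands of domains would still need other countermeasures.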

littleman




msg:462372
 2:42 am on Mar 17, 2002 (gmt 0)

Hey Matt, want to share some info on what is powering your search engine?

brotherhood of LAN




msg:462373
 2:46 am on Mar 17, 2002 (gmt 0)

I seem to have problems getting the robot to follow links

Jill




msg:462374
 2:51 am on Mar 17, 2002 (gmt 0)

Most of the sites I have submitted, a fairly small number comparatively I'm sure, have not been spidered or added. I see that you said you are having trouble picking up some of the IPs. I'm not sure how these particular sites would be affected as they are all very similar, on the same server. Some are being added and spidered immediately and others have not been hit even when forcing a respider. I have not seen the spider following links until I submit each section's index page, then it will list all of the pages linked on that index page. Just FYI, I'm sure all these things will be shaken out in the end. All in all I'm impressed.

mayor




msg:462375
 3:20 am on Mar 17, 2002 (gmt 0)

>>i try to incorporate these lists into my blacklist on a regular basis

<LOL> Well, Matt, I see someone has already blacklisted my competitors on linugen. Check them out ... www.msn.com, www.yahoo.com, www.ebay.com. </LOL>

Looks to me like this vigilante police site is just a tool for abuse by the spammers.

mgswebaus




msg:462376
 7:08 am on Mar 17, 2002 (gmt 0)

Nice one, Matt! :)
Quick, efficient, neat and tidy - this layperson loves it!
Good to hear your responsive comments, too.
Blast on!

Jill




msg:462377
 3:05 pm on Mar 17, 2002 (gmt 0)

Hmm, still most of the sites that I have submitted have not been added or crawled. Anyone else having this problem? I don't want to keep resubmitting them and forcing the spidering, as it didn't work the first time I retried. Guess I'll just wait until the IP bug is fixed.

<note> Matt, does this have anything to do with sites not having a DNS lookup? I know some of mine are coming back with nothing, but some are and are still not getting hit. Just a thought.</note>

(edited by: Jill at 3:54 pm (utc) on Mar. 17, 2002)
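
To check from your own end whether a hostname resolves at all, a small helper around the standard resolver is enough. This is a sketch only: it uses your system's resolver, not Gigablast's custom DNS client, so a name can resolve here and still fail there. The `resolver` parameter exists only so the lookup can be swapped out for testing.

```python
import socket

def resolves(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for `hostname`, or None if the lookup fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return None
```

Calling `resolves("www.example.com")` returns an address string when the lookup succeeds and `None` when it does not.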

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved