208.254.87.133 is the IP used by the Gigabot spider. I don't know if there are others. I never submitted my site to Gigablast, but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently at 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering for this IP now to better learn its behavior and patterns.
I must admit, I'm a little concerned about the robots.txt issue.
Now I find 208.254.87.133 on this site 11 minutes after the page updated. No UA, no referer. This was from the raw logs. Funny: it was an unconditional GET request for the root page, but it did not show up in my AXS logs. So, I stand corrected. The missing UA could be what trips up the AXS script.
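For anyone who wants to do the same kind of filtering on raw logs, here is a minimal sketch. It assumes an Apache-style "combined" log format, which is a guess about the setup and not something stated above; the function and pattern names are made up for illustration. A missing, empty, or "-" user-agent field is reported as None.

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

GIGABOT_IP = "208.254.87.133"  # the IP reported in this thread


def hits_from(lines, ip=GIGABOT_IP):
    """Yield (time, request, agent) for each log line that came from `ip`."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("ip") == ip:
            agent = m.group("agent")
            # A missing field, "" or "-" all mean no UA was sent.
            if agent in (None, "", "-"):
                agent = None
            yield m.group("time"), m.group("request"), agent
```

Pass it an open log file, for example `for hit in hits_from(open("access.log")): print(hit)`, where the file name is a placeholder.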
Look for the mirror link, might be able to inject code into the page that way.
Thanks to wharsono i made the logo smaller.
you also shouldn't be able to put images and javascript into the front page via the last 5 queries mechanism now.
Thanks to Thierry Zoller for pointing out a bug. i think that one should be fixed now.
my bot doesn't use the user-agent tag yet, but should soon. it's also my policy to ban cloakers that abuse the search engine at my discretion, so be warned.
if your url doesn't get added quickly try checking the "force" option on the addUrl page. "force" tells gigablast to spider the url now even if it may already be in the queue for a later spidering. If still no luck it may be my custom, fritzy dns client. it has problems getting the ips for some sites. this is at the top of my to-fix queue.
i'll be putting up a history and objective for gigablast within the next week or so onto the about page.
thanks for all the testing fellow webmasters. good luck optimizing!
matt
Looks like you sorted out most of the exploits with the "last 5 search" function. I would bet there are some more hidden, but I haven't found any yet!
The logo is better smaller. Maybe the page would look better if it felt more 'centered'; the search box is too far to the right. I am sure you are just concentrating on the engine ATM!
I'll shut my face now and let you get on with tweaking your code. It's kind of exciting, seeing the 'birth of a search engine' :-)
The index seems to be growing at around 100 pages a second. If they can sustain this rate, their index will be bigger than Google's by the end of the year. :)
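As a rough back-of-the-envelope check on that rate, here is a small sketch. It assumes a constant 100 pages a second and a start date of March 17, 2002 (taken from the edit stamp in this thread, so only approximate); the function name is made up.

```python
from datetime import date

PAGES_PER_SECOND = 100          # the rate estimated above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_added(start, end, rate=PAGES_PER_SECOND):
    """Pages added between two dates at a constant crawl rate."""
    return (end - start).days * SECONDS_PER_DAY * rate


# Assumed start date: Mar. 17, 2002.
total = pages_added(date(2002, 3, 17), date(2002, 12, 31))
print(f"{total:,} pages")  # 2,496,960,000 pages
```

That is about 8.64 million pages a day, or roughly 2.5 billion by year end if the rate never dropped.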
I love being devil's advocate and trying to find the bugs - if someone were to search on:
"Gigablast - just testing - trying to find a bug - with a very long query and !@#$%^&*()_~!@#$%^&*()_+}{¦{<?><?><"
It generates a 500 internal error. Don't know if it is because of length or a special character - but that kills the search.
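One way to narrow down whether length or a special character is the trigger is to build separate test queries. The sketch below only prepares URL-encoded variants of the query; it does not send anything, and it says nothing about how Gigablast actually parses its input. The function name `probes` is made up.

```python
from urllib.parse import quote_plus

QUERY = ('Gigablast - just testing - trying to find a bug - with a very long '
         'query and !@#$%^&*()_~!@#$%^&*()_+}{¦{<?><?><')


def probes(query):
    """Return URL-encoded variants that separate length from special characters."""
    plain = "".join(c for c in query if c.isalnum() or c == " ")
    special = "".join(c for c in query if not (c.isalnum() or c == " "))
    return {
        # The original query, safely encoded for use in a URL.
        "full": quote_plus(query),
        # Same length as the original, but letters, digits and spaces only.
        "length_only": quote_plus(plain.ljust(len(query), "x")),
        # Only the punctuation, so the query is short.
        "specials_only": quote_plus(special),
    }


for name, encoded in probes(QUERY).items():
    print(name, encoded)
```

If only the "length_only" variant fails, length is the likely culprit; if only "specials_only" fails, the punctuation can be bisected further.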
I also had this observation: My site uses SSI, and all of the pages that do not change very often use the X-Bit Hack method. These are the only pages included in this index on my site. Other pages, which are 2 levels down in the site structure, do not return the last-modified date header. Some pages which are on the same level, are included in the index, and they return a last-modified header.
Can I assume that if the pages are a certain number of levels down in the subdirectory structure AND do not return a last modified date/time that they are not included in the index?
The only reason I'm making this assumption is because webmasterworld.com does not show a last modified date in the cache page, but does show the last date spidered.
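To test that assumption against your own pages, it helps to list which URLs actually return a Last-Modified header. Here is a minimal sketch of the header check alone; the function name is made up, and it takes any mapping of response headers so it can be tried without touching the network.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent or malformed."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                return None
    return None


# Example with a hand-written header set (not a real server response):
print(last_modified({"Content-Type": "text/html"}))  # None
```

The headers could come from `urllib.request.urlopen(url).headers`. Comparing the result for indexed and non-indexed pages at the same depth would show whether the header or the depth is the deciding factor.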
btw, there's an open project going on at
[linugen.com...]
there's also some free blacklists you can get from squidguard. evidently searchboss has avoided these.
i try to incorporate these lists into my blacklist on a regular basis.
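A minimal sketch of what merging such lists might look like, assuming each list is plain text with one domain per line and "#" starting a comment (an assumption about the file layout, not a description of matt's actual process). The function names and the allow-list idea are illustrative only; the allow list is there so that a bad entry in a community list can be overridden.

```python
def load_domains(lines):
    """Parse one-domain-per-line text into a set of lowercase domains."""
    domains = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()  # drop comments and whitespace
        if entry:
            domains.add(entry.rstrip("."))
    return domains


def merge_blacklists(*sources, allow=()):
    """Union several blacklists, then remove anything on the allow list."""
    merged = set()
    for source in sources:
        merged |= load_domains(source)
    return merged - load_domains(allow)
```

Each source can be an open file or a list of strings, for example `merge_blacklists(open("a.txt"), open("b.txt"), allow=open("allow.txt"))`, where the file names are placeholders.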
if you have search boss' ips... please! can i have them???
thanks,
matt
site clustering is at the top of my TODO list...
matt
<LOL> Well, Matt, I see someone has already blacklisted my competitors on linugen. Check them out ... www.msn.com, www.yahoo.com, www.ebay.com. </LOL>
Looks to me like this vigilante police site is just a tool for abuse by the spammers.
<note> Matt, does this have anything to do with sites not having a DNS lookup? I know some of mine are coming back with nothing, but some are and are still not getting hit. Just a thought.</note>
(edited by: Jill at 3:54 pm (utc) on Mar. 17, 2002)