Matt, you could reduce bandwidth usage heavily if you'd optimize that HTML. You already removed the blanks, that's good, but how about using CSS to reduce the size of the result pages? IMHO you could save up to 30-40%, since you use the same tags over and over :)
Good work Matt, and good luck.
This time of day: submitted, spidered and listed within 3 mins ;)
On the robots.txt issue - I agree - a good bot should obey the rules (more work). The downside with robots.txt is that it's down to the spider to enforce the rules - it would be nice to have an Apache module that sent 403s based on the robots.txt rules.
Me, I don't bother with robots.txt - if there's a bad bot, mod_rewrite sorts it.
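For anyone curious what "mod_rewrite sorts it" looks like in practice: there's no stock Apache module that enforces robots.txt as far as I know, but you can deny known bad bots by User-Agent in .htaccess. A minimal sketch - the bot names here are made-up examples, and the [F] flag is what sends the 403 the post above asks for:

```apache
# Deny requests from known bad bots by User-Agent.
# The bot names below are hypothetical examples, not a real blocklist.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "EmailSiphon" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "BadBot"      [NC]
# Match any URL, substitute nothing, return 403 Forbidden.
RewriteRule .* - [F,L]
```

The catch is you have to maintain the User-Agent list yourself, and a truly bad bot can just fake its User-Agent, which is why some people block by IP instead.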
Also of interest to all on this topic is [webmasterworld.com...] - an interesting argument on the price of a search engine, in particular Google. With Matt here demonstrating what one determined and skilled individual can do - maybe that lowers the price further?
The sites I added don't seem to be there any more... looks like I've got to add them again...
And I like the search by IP... :)
Most of the sites we submitted still have not been hit by the spider, and of course not added. Must be some kind of DNS problem still occurring. And yes, he had to reset the index last night, so many sites that were in would have to be resubmitted, I believe.
Sure got our minds off Google for a few days huh? :)
Matt is banning sites by IP. It's reasonable to assume that some innocent sites on server farms using a shared IP would be banned because of the actions of a few. Just another benefit of having a dedicated IP.
Could you explain dedicated IP?
What about sites in the same C class but not linking to each other? And what is a server farm?
Matt, I wonder if there is a way to turn off the document caching.
Good luck to your new Search Engine. It's real fast at spidering... I'll be using it for sure.
Matt, looks great and felt great to see listings so quick... hundreds of listings so quick. ;)
One concern -- cgi scripts are not being filtered out of the SERPs. For example, search on links.cfg and you'll see what I mean. I'd say that's a target for some serious abuse.
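The fix for that is presumably an index-time URL filter. A minimal sketch of the idea in Python - the blocked extensions and directories here are my own guesses for illustration, not what Gigablast actually does:

```python
from urllib.parse import urlparse

# Extensions and path fragments that usually indicate scripts or config
# files rather than indexable documents (assumed list, illustration only).
BLOCKED_EXTENSIONS = (".cgi", ".cfg", ".conf", ".ini", ".pl")
BLOCKED_DIRS = ("/cgi-bin/",)

def should_index(url: str) -> bool:
    """Return False for URLs that look like scripts or config files."""
    path = urlparse(url).path.lower()
    if path.endswith(BLOCKED_EXTENSIONS):
        return False
    return not any(d in path for d in BLOCKED_DIRS)
```

With a list like that, the links.cfg example above would be dropped before it ever reached the SERPs.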
Update, Matt. I submitted the index page from several of our sites yesterday afternoon. I just checked our log files and the spider is now moving out past the root directory into our sub-directories. You said yesterday that this would happen and it did. Just wanted to let you know that the spider is continuing to follow links. I'll keep you updated as it moves along.
(edited by: MarkHutch at 6:55 pm (utc) on Mar. 26, 2002)
Matt, this is fun for all webmasters to see a plug and play spider action in motion.
Does Gigabot support crawling URLs with variables in them, such as www.domain.org/cgi-bin/cs_compare?state=ca
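I don't know whether it does, but crawling dynamic URLs like that usually means canonicalizing the query string first, so the same page reached with parameters in a different order isn't fetched twice. A hypothetical sketch of that step:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url: str) -> str:
    """Sort query parameters so equivalent dynamic URLs compare equal.

    A crawler can use the canonical form as its "already seen" key, so
    ...?state=ca&sort=name and ...?sort=name&state=ca count as one page.
    """
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))
```

Engines that skip dynamic URLs entirely often do so because of session IDs and infinite parameter spaces, so supporting them at all is a deliberate choice.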
Great job and wishing you all the luck.
Fast, good listings, and easy to use. Good work...keep it nice and simple. :)
Guess the IP ban is effective on my site... I submitted twice and checked for the results... I did get results...
But today when I checked, the results are gone... none of my pages are in the index.
ideavirus, when did you submit? It seems like people keep missing this, even though it's been said a few times in this thread (which is now spread across three posts, so I guess it's not surprising that things are missed) but yesterday Matt lost his entire database and had to start spidering again from scratch. So anything submitted before that happened Sunday evening is probably no longer there.
|brotherhood of LAN|
His spider must be out on the rampage then; the site is very slow.
So I see about the database being rewritten... he has gone from one and a half million to 50,000.
If you notice the number of docs shrinking, it's because I reset the database.
I won't do this to you once the thing is officially released, but it may happen again before then.
1) Did the d/base reset again? A load of sites I put in yesterday seem to have been dumped.
2) Intermittently, I see the "Last 5" only returning 4 results. I'm clicking fairly fast, so I don't think it's a blank line coming through
3) Is it my imagination, or is there loads of German content in there? I've seen more German-language lines in the SERPs from Gigablast than anywhere else I remember outside of a dedicated German-language engine.
All the best for your project.
One thing: the "last 5 searches" feature is turning into:
1. A forum of its own
2. A place for people asking for mafia connections :)
3. A free advertising board
Will you keep this feature?
LOL steve_1881 sure is getting some attention, isn't he?
it is getting a bit out of hand now, they are advertising cocaine (where's that pen).
It's a shame people have to ruin what was/is a good idea for a SE.
Better filter needed, or a hand-mod time delay
|1) Did the d/base reset again? A load of sites I put in yesterday seem to have been dumped |
It appears that he did. I believe he posted about it just previous to your post.
The force respider option is gone as well. I can only guess that the last five searches is a study tool for Matt at this point. I can't imagine leaving it in place.
Matt is using a blacklist he got free from squidGuard. I am wondering how many others are using this and what the criteria are for being placed on it. I found sites of ours, and of others we know, on these lists for no apparent reason. There were even some IPs on our server that have yet to be developed - they don't even have an index page, and they're on this blacklist. Oh, BTW, if you have an adult site you're trying to get listed, from what we're reading on these lists, this would be the most likely reason it isn't.
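For anyone who wants to check whether they're on one of those lists: the squidGuard domain files are just one domain per line, and matching normally covers subdomains too, which is why an innocent site can be caught by a listed parent domain. A quick sketch (the file path is hypothetical):

```python
def load_blacklist(path: str) -> set:
    """Load a squidGuard-style domains file: one domain per line."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_blacklisted(host: str, blacklist: set) -> bool:
    """True if the host or any parent domain of it is listed."""
    parts = host.lower().split(".")
    # Check www.example.com, then example.com, then com -- a listed
    # parent domain blocks every subdomain under it.
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))
```

That parent-domain matching is exactly how sites sharing a domain or server with one bad actor can end up blocked together.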
Matt: you might want to rethink using this data alone to decide what you do and don't want in your database.
I'm impressed...reallllly quick!
looks like another brownout! back to 26,000 pages.
Gentlemen, start your engines... again...
Starters - please re-nominate.
The submit URL is temporarily out of order, guys. Time to get back to the real world of SEO for a bit. ;)
Well it's 181,536.
I just resubmitted !
I hope this one takes off well.
Best wishes Matt,
and of course, if any of the free logo designs coming to you are not to your liking, you know who to ask ; )
Hello Matt. By now you are probably pulling your hair out trying to get this thing to work right, but I hope you find and get all the bugs worked out soon. You're providing a much needed service and I hope you become successful and make millions of $$$ for your effort.
I think, Matt, that you should seek technical help from some good programmer pals to share the tasks - perhaps those you've not been in touch with for a while, or some guys from this forum who are apt for the task.
Imagine tripling efficiency: that would cut 2 months down to about 20 days, I guess (if the hardware isn't the limitation, I mean). Wow.
Hello again, Matt. I hope you're still reading comments here in the forum. I submitted our site again today because the database got reset again.

I just checked our logs and noticed that you are now requesting a robots.txt file. That's good. However, I did notice that the "Gigablast 1.0" part is no longer part of the ID. Might want to turn that back on if it's not too much trouble.

Our sites are Linux based and connected to an OC-3 line, so the spidering speed is no big deal to us, but some folks are going to get upset when search engines pull pages at a fast pace. According to our log, your search engine pulled about 250 pages in about 2 minutes. Some webmasters might get upset at such a fast crawl rate.

In a nutshell, you're doing a fantastic job with this new search tool. Any idea when you're going to finish your beta test? I'm considering adding a link to your search engine on some of our sites once you've got everything worked out. Keep up the good work...
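For webmasters who do mind a crawl rate of roughly 2 pages a second, the usual way to ask a spider to slow down is a Crawl-delay line in robots.txt. It's a non-standard extension that many (not all) crawlers honor, and I don't know whether Gigablast's spider does; the user-agent token below is also just my guess at what it would be:

```
# Ask this spider (token is a guess) to wait 10 seconds between fetches.
User-agent: Gigabot
Crawl-delay: 10

# Everything else: keep all bots out of the scripts directory.
User-agent: *
Disallow: /cgi-bin/
```

Since the spider only fetches robots.txt once per visit, changes here can take a while to have any effect.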
(edited by: MarkHutch at 12:55 am (utc) on Mar. 28, 2002)