208.254.87.133 is the IP used by the Gigabot spider. I don't know if there are others. I never submitted my site to Gigablast, but it has been spidered on and off for a few months (ODP data?). Usually the home page gets hit every one to two days by Gigabot, most recently at 04:22:14 on 03/16/02. Sometimes it makes a deep crawl. I'm filtering my logs for this IP now to better learn its behavior and patterns.
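A minimal sketch of that kind of log filtering, assuming an Apache common/combined-format access log; the log filename is a placeholder:

```python
# Minimal sketch: pull Gigabot's hits out of an access log so its crawl
# pattern can be studied. Assumes the common/combined log format, where
# the client IP is the first field; "access.log" is a placeholder name.
GIGABOT_IP = "208.254.87.133"

with open("access.log") as log:
    for line in log:
        if line.startswith(GIGABOT_IP):
            fields = line.split()
            # fields[3] is the timestamp (with a leading "["),
            # fields[6] is the requested path in this log format.
            print(fields[3].lstrip("["), fields[6])
```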
I must admit, I'm a little concerned about the robots.txt issue.
The first site has only the index page listed; the second site got every page spidered.
They are very similar sites with a similar layout, just different content.
The only difference I can think of is a meta tag, <meta name="robots" content="ALL">, on the site with only one page spidered.
Perhaps the spider mistakes this for a 'disallow'?
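For reference, a minimal sketch of how a crawler would normally read the robots meta tag; "ALL" is equivalent to "index, follow", so a correct parser should never treat it as a disallow. The URL is a placeholder and the class is mine, not Gigabot's:

```python
# Minimal sketch of robots meta handling: only "noindex"/"nofollow"
# (or "none") restrict the crawler; "ALL" permits everything.
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = attrs.get("content", "").lower()
            self.noindex = "noindex" in directives or directives.strip() == "none"
            self.nofollow = "nofollow" in directives or directives.strip() == "none"

parser = RobotsMetaParser()
parser.feed(urlopen("http://example.com/").read().decode("utf-8", "replace"))
print("index:", not parser.noindex, "follow:", not parser.nofollow)
```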
Congratulations on that site/SE.
I submitted my site and it got spidered and added almost instantly. I think it will make a great addition if a few problems are solved (mostly speed).
Here are the details:
- Add a page detailing the User-Agent and behaviour of your robot, and put its URL into the robot's UA string (like: gigabot/1.0; [gigablast.com...] ); a small sketch follows this list.
- Add a button for users to click if they feel the results are not relevant to their query, or to report dead links.
I know this has some potential for abuse, but I think it can be helpful in the alpha/beta phase.
- Other things (e.g. for advanced search): only return sites that validate against the DTD they specify.
- In advanced search there needs to be some text explaining the options.
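On the first suggestion, a minimal sketch of a fetcher that identifies itself with a descriptive UA string, assuming Python; the version string and info URL are placeholders, not Gigabot's real ones:

```python
# Minimal sketch: send a User-Agent that names the robot and points to
# a page describing its behaviour. URL and version are placeholders.
from urllib.request import Request, urlopen

USER_AGENT = "gigabot/1.0 (+http://www.gigablast.com/spider.html)"

def fetch(url):
    req = Request(url, headers={"User-Agent": USER_AGENT})
    return urlopen(req).read()
```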
Otherwise, very good site, good luck with it :)
Added two sites of mine... and they got indexed instantly! An interesting development here, I would say!
I guess the "mirror" link is now renamed "cached"... sounds more relevant!
As mentioned before, the site/SE badly needs a more professional look and feel... yeah, I understand, you are tweaking the algo at the moment.
Good luck from me too!
Cheers
:)
<added>
Matt, how about adding the number of pages found for a particular query to the SERPs? It must be on the TODO list already!
</added>
I'm glad to hear this is 'pre-beta'; that means Gigablast has even more potential than I originally thought :)
Kudos to you, Matt! It seems that for now this is a one-man job, and as such it is ultra-impressive.
I'm sending positive energy to this site and giving you as much word of mouth as I can.
BEST OF LUCK!!!!
Greektomi
This is great!! Not only does it automatically get spidered and indexed - but your spider starts following the links from the submitted sites immediately!
This is incredibly impressive stuff - well done. Our techos here are in absolute awe - they are having a field day at the moment!
Keep up the great work!!!
If you're interested, here are some changes I made today, due mostly to suggestions I've received in this thread:
. filtered naughtiness from top 5 queries
. increased default # lines in result summaries from 1 to 3
. decreased # of chars in summary lines by about half
. fixed critical bug in my communications layer
. removed trailing slash from root urls in search results (see the sketch after this list)
. added User-Agent: gigabot/1.0 to spiders
. fixed one bug in my dns client, another is still there
. banned search boss' 14,000+ IPs
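A minimal sketch of the trailing-slash normalization mentioned in the list above; the helper name and the exact rule are my assumptions, not Gigablast's actual code:

```python
# Minimal sketch: collapse "http://example.com/" and "http://example.com"
# to one canonical form for display, leaving deeper paths untouched.
from urllib.parse import urlsplit

def canonical_root(url):
    parts = urlsplit(url)
    if parts.path in ("", "/") and not parts.query:
        return f"{parts.scheme}://{parts.netloc}"
    return url

assert canonical_root("http://example.com/") == "http://example.com"
assert canonical_root("http://example.com/page.html") == "http://example.com/page.html"
```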
This message board has been awesome!
Thanks very, very much for all the posts, everybody!
matt
As for the casino spam, I'll have to compile a domain/IP list of the offenders.
matt
Any plans to start your own crawl (sans submissions), maybe starting with ODP and expanding from there? Where would you start?
Or, for that matter, if anyone on this forum were starting out the way mattdwells is, where would you start from?
In the future, once my index is bigger and there are not so many URLs competing for spider time, they should be added quickly as well. If you really need them in for now, you'll have to submit them one at a time via the addUrl page.
matt
So I now index the URL's host field with a slight weight boost over what it would get if it actually appeared in the document itself.
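A minimal sketch of that host-field weighting, assuming a simple term-frequency index; the boost factor, index layout, and function name are illustrative assumptions, not Gigablast's actual implementation:

```python
# Minimal sketch: tokens from the URL's host get a slightly higher score
# than the same tokens occurring in the body text. The 1.2 boost and the
# in-memory index structure are assumptions for illustration only.
from collections import defaultdict
from urllib.parse import urlsplit

HOST_BOOST = 1.2

def index_document(index, url, body_text):
    host_terms = urlsplit(url).netloc.lower().replace(".", " ").split()
    for term in body_text.lower().split():
        index[term][url] = index[term].get(url, 0.0) + 1.0
    for term in host_terms:
        index[term][url] = index[term].get(url, 0.0) + HOST_BOOST

index = defaultdict(dict)
index_document(index, "http://example.com/", "welcome to the example site")
```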
Another thing: manual spam handling is not really scalable, is it? Good, aggressive automatic filters are an essential part of any good public SE.
Question: what about the robots.txt file? It seems your robot isn't spidering robots.txt (according to my log files).
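For reference, a minimal sketch of the robots.txt check a well-behaved crawler makes before fetching a page, using Python's standard library; the URLs are placeholders:

```python
# Minimal sketch: fetch robots.txt once per host and consult it before
# requesting any page; this is the request the poster expects to see
# in the server logs.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # performs the actual robots.txt request

if rp.can_fetch("gigabot/1.0", "http://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```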
Keep up the good work!
Nice job!
I just added my site and it was there on the fly.
Then I tried some of my keywords and I get listed 4 or 5 times on the first page. That is also great; I'm happy as an SEO.
But as a user I would rather see only one link per site (same URL host): the most relevant one, with an option to view the other pages from that site (a small sketch of the idea follows below). Do you have any plans in this direction?
I don't know if anyone else shares this view.
Good work.
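A minimal sketch of the result-collapsing idea from the post above, assuming results arrive as (score, url) pairs; the function name and input format are mine, not Gigablast's:

```python
# Minimal sketch: keep only the highest-scoring page per host so each
# site shows up once in the SERP; the input format is an assumption.
from urllib.parse import urlsplit

def collapse_by_host(results):
    """results: iterable of (score, url) pairs."""
    best = {}
    for score, url in results:
        host = urlsplit(url).netloc
        if host not in best or score > best[host][0]:
            best[host] = (score, url)
    return sorted(best.values(), reverse=True)

results = [(0.9, "http://a.com/x"), (0.8, "http://a.com/y"), (0.7, "http://b.com/")]
print(collapse_by_host(results))  # one entry per host, best page first
```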
Do my eyes deceive me, or has the "full pages indexed" count gone down since last night?