Submitted 2 sites.
The first site has only the index page listed, the second site got every page spidered.
Very similar sites, and layout, different content.
The only reason I can think of is a meta tag <meta name="robots" content="ALL">? (On the site with only one page spidered)
Perhaps the spider mistakes this as a 'disallow'?
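For anyone wondering how a spider might read that tag: by convention "ALL" in a robots meta tag means the same as "index, follow", so it should not act as a disallow. Here is a minimal Python sketch of one way to interpret the content attribute. The function name `parse_robots_meta` is made up for illustration, and this is not how gigabot actually parses the tag (that is not known from this thread).

```python
def parse_robots_meta(content):
    """Return (index, follow) for the content of a robots meta tag.

    Hypothetical helper: tokens are comma-separated and case-insensitive.
    Unknown tokens are ignored, and the default is to index and follow.
    """
    index, follow = True, True
    for token in content.split(","):
        token = token.strip().lower()
        if token == "all":
            index, follow = True, True
        elif token == "none":
            index, follow = False, False
        elif token == "noindex":
            index = False
        elif token == "nofollow":
            follow = False
        elif token == "index":
            index = True
        elif token == "follow":
            follow = True
    return index, follow


# The tag from the post above: content="ALL"
all_result = parse_robots_meta("ALL")
```

A parser that only checks for a literal "index" or "follow" token and treats everything else as a restriction would get "ALL" wrong, which would match the symptom described.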
Congratulations on that site/SE.
I submitted my site, and it got spidered/added almost instantly. I think it will make a great addition if a few problems are solved (mostly speed).
Here's the details:
- Add a page detailing User Agent & behaviour of your robot, and put that into the robot UA String (like: gigabot/1.0; [gigablast.com...])
- Add some button for users to click if they feel the results are not relevant to their query, or to report "dead links".
I know this has some potential for abuse, but I think in the alpha/beta phase it can be helpful
Other things (e.g. for advanced search): only return sites that validate against the DTD they specify.
In advanced search there needs to be some text explaining the options.
Otherwise, very good site, good luck with it :)
Added two sites of mine...and instantaneous...got indexed...!..An interesting development here, i would say...!
Guess..now the "mirror" is re-named as " cached "...sounds more relevant...!
As mentioned before...the site/SE badly needs a More professional look and feel...yeah..I understand..you are tweaking the algo ATM...
Good luck from mee TOOOO...!
matt...How about adding the number of pages found for a particular query in the SERPs..sure must be on the TODO list..!
Matt - On some queries, I'm seeing the same page returned 3 or 4 times....
22.214.171.124 - - [17/Mar/2002:15:51:02 +0100] "GET /index.html HTTP/1.0" 200 6321 "-" "-"
you are currently sending no user agent, please do in the future
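To illustrate what the log line above is missing, here is a small Python sketch of a request that carries a User-Agent header with a bot name and an info-page URL. The bot name `examplebot/1.0` and the URL `http://example.com/bot.html` are placeholders invented for this example, not the real gigabot string.

```python
import urllib.request

# Hypothetical UA string: bot name/version plus a page describing the robot.
BOT_UA = "examplebot/1.0 (+http://example.com/bot.html)"


def build_request(url, user_agent=BOT_UA):
    """Build (but do not send) a request that identifies the robot."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/index.html")
# urllib stores header names capitalized, hence "User-agent" here.
sent_ua = req.get_header("User-agent")
```

With a header like this, the last field of the access log would show the bot string instead of "-".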
Matt, I just submitted one of our URL's and your spider went right to work. Nice job...
This sure is a ton of fun!!!!
I'm glad to hear this is 'pre-beta' that means gigablast has even more potential than I originally thought :)
Kudos to you Matt!!! It seems that for now this is a one man job and as such is ultra-impressive.
I'm sending positive energy to this site and giving you as much word as my mouth can handle.
BEST OF LUCK!!!!
This is great!! Not only does it automatically get spidered and indexed - but your spider starts following the links from the submitted sites immediately!
This is incredibly impressive stuff - well done. Our techos here are in absolute awe - they are having a field day at the moment!
Keep up the great work!!!
Is Gigablast down now or is it just me?
Down from here. I did a few queries, some worked, some returned internal server error 500. Now it is down.
the qa i get here is fantastic!
if you're interested here's some changes i made today due mostly to suggestions i've received on this thread:
. filtered naughtiness from top 5 queries
. increased default # lines in result summaries from 1 to 3
. decreased # of chars in summary lines by about half
. fixed critical bug in my communications layer
. removed trailing slash from root urls in search results
. added User-Agent: gigabot/1.0 to spiders
. fixed one bug in my dns client, another is still there
. banned search boss' 14,000+ IPs
this message board has been awesome!
Thanks very very much for all the posts, everybody!
oh i forgot one more change.
a lot of people from here were searching for their urls like: "www.mysite.com", but since the page itself didn't have the word "mysite" on it, no search results would be found. so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself.
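A rough Python sketch of that idea, for the curious. The boost value of 1.5, the function name `term_weights`, and the simple word-count scoring are all assumptions made for illustration; the post only says the host gets "a slight weight" over a normal occurrence in the document.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

HOST_BOOST = 1.5  # assumed value, slightly more than one body occurrence


def term_weights(url, body_text, host_boost=HOST_BOOST):
    """Weight each term: 1.0 per body occurrence, host_boost per host term."""
    weights = Counter()
    for word in re.findall(r"[a-z0-9]+", body_text.lower()):
        weights[word] += 1.0
    host = urlsplit(url).hostname or ""
    for word in re.findall(r"[a-z0-9]+", host.lower()):
        weights[word] += host_boost
    return weights


w = term_weights("http://www.mysite.com/", "Welcome to my home page")
```

Here "mysite" ends up searchable even though it never appears in the page text, which is the case described in the post.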
Matt, nice job again. I noticed that the number of sites in the index went to only 21,000. Did you have to start over? I submitted my index pages again. I hope that was ok.
Yuk. I don't know what happened, but the results I'm seeing don't match anything (much). Loaded with spam. Unless you like casino sites... ;)
I am new here today from WA forums and I am very impressed with this forum. Looks like I stumbled upon a great search engine in the making. Blew me away! Added url and BAMM, indexed immediately. GO MAN GO
(edited by: bigrockman at 5:46 am (utc) on Mar. 18, 2002)
yeah, i had to reset. :( one of the machines crashed and i was debugging its redundant twin remotely when i lost my telnet connection. i'm guessing comcast has some kind of anti-nat software that did it. i'll have to be more careful in the future, but since this is pre-beta i'm not worried too much about it.
as for the casino spam, i'll have to compile a domain/ip list of the offenders.
also, the search results won't be that great until about 3 days from now when the link-analysis system has had some time to do its magic. It needs a fairly large corpus of documents to work with.
I just checked my logs and your spider looks like it's working great. I just submitted my index pages and the spider is following the links just fine. I noticed it only got the files in my root directories. Does it spider sub-directories too or just the files in the root??
I am a newbie, but how do I add this search engine to my list of them? suggestions matt?
Obviously pre-alpha, as you mentioned, so I know I'm jumping the gun here a little.
Any plans to start your own crawl, (sans submissions) maybe start with ODP and expand from there? Where would you start?
Or for that matter, if anyone on this forum were starting as mattdwells, where would you start from?
i do do my own crawl in addition to handling the submissions. one of the sites i start with is yahoo.
when you submit a url to the spider, it should spider that page and all of the links on it fairly quickly. the links on the linking pages will be spidered, too, but their priority is not as high. it may be a few days before they get spidered.
In the future, once my index is bigger and there are not so many urls competing for spider time, they should be added quickly as well. if you really need them in for now, you'll have to submit them one at a time via the addUrl page.
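The prioritization described above could be sketched in Python roughly like this. It is only an illustration under assumed rules: the class name `Frontier`, the depth numbering (0 for a submitted page, 1 for links on it, 2 and up for links further out), and the priority values are not from the post.

```python
import heapq
import itertools


class Frontier:
    """URL queue where submitted pages and their direct links go first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def add(self, url, depth):
        """depth 0 = submitted url, 1 = link on it, 2+ = links further out."""
        if url in self._seen:
            return
        self._seen.add(url)
        priority = 0 if depth <= 1 else depth  # lower number is served sooner
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


f = Frontier()
f.add("http://example.com/deep.html", 2)
f.add("http://example.com/", 0)
f.add("http://example.com/about.html", 1)
order = [f.pop() for _ in range(len(f))]
```

The deeper link was queued first but comes out last, which is the "may be a few days" behaviour when many urls compete for spider time.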
Thank you for the info, Matt. It looks like your spider can handle about 300 pages every 5 to 10 seconds. (Just looking at the index page advance) That's a very fast rate, indeed.
|so i now index the url's host field with a slight weight over what it would get if it actually appeared in the document itself |
Psssst! Don't tell us! Otherwise it would be no fun.
Another thing: manual spam handling is not really scalable, is it? Good and aggressive auto filters are an essential part of any good public SE.
Hi Matt, well done (already). I always love it when someone dares to compete with the Majors.
Question: What about the robots.txt file? It seems like your robot isn't requesting it (according to my log files).
Keep up the good work!
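For reference, checking robots.txt before fetching a page can be sketched with Python's standard `urllib.robotparser` module. The rules and URLs below are made-up examples, the helper name `make_checker` is hypothetical, and nothing here reflects how gigabot itself is implemented.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt_lines):
    """Parse robots.txt lines that were already downloaded by the spider."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


# Example rules a webmaster might publish (invented for this sketch).
rules = make_checker([
    "User-agent: gigabot",
    "Disallow: /private/",
])

blocked = rules.can_fetch("gigabot/1.0", "http://example.com/private/x.html")
allowed = rules.can_fetch("gigabot/1.0", "http://example.com/index.html")
```

A well-behaved robot would request /robots.txt once per host, so a webmaster should see that request in the logs before any page fetches.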
I like the job,
I just added my site and it is there on the fly.
Now I tried some of my KW. I get listed 4 or 5 times on the first page. That is also great, I am happy as an SEO.
But as a user I would rather see only one link per site (same host): the most relevant one, plus an option to view the other pages from the site. Do you have any plans along these lines?
I don't know if anyone shares this view?
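The kind of site grouping asked for above could look something like this in Python. It is only a sketch: the function name `cluster_by_site` is invented, and grouping by exact hostname is an assumption (a real engine might group by domain or IP instead).

```python
from collections import OrderedDict
from urllib.parse import urlsplit


def cluster_by_site(ranked_urls):
    """Group a relevance-ordered url list by host.

    Returns (best_url, other_urls) pairs, in order of each host's best hit.
    """
    clusters = OrderedDict()
    for url in ranked_urls:
        host = urlsplit(url).hostname or url
        clusters.setdefault(host, []).append(url)
    return [(urls[0], urls[1:]) for urls in clusters.values()]


results = cluster_by_site([
    "http://a.example/x.html",
    "http://b.example/",
    "http://a.example/y.html",
])
```

The first element of each pair is the link to show, and the second is what a "more results from this site" option would reveal.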
Great work, please keep it up!! I added 5 sites, went straight to the search and bingo! they were there.
Keewwl! I just added a site, with a real time watch open in another browser, and saw Gigabot/1.0 come through and scoop the top few pages (framed site, intrasite linking not yet up to scratch)
Most impressed thus far
Do my eyes deceive me: Has the "full pages indexed" reduced since last night?