Forum Moderators: bakedjake
There is a souped-up ODP clone [yuntis-wrld.ecsl.cs.sunysb.edu]. All pages are cached. Lots of neat things available on 'Information about URL' page (see example [yuntis-wrld.ecsl.cs.sunysb.edu]).
There's lots of stuff, which I need to explore in more depth later.
I will be spending quite a few hours there, I can see already!
I wrote them about this, mentioning that most hostmasters put their Disallow: rules there for a reason and probably won't like finding their Disallow:ed links in a search engine's result list.
I got the answer that the "Robots exclusion standard covers only what should not be visited. It might be a policy of a search engine to also disregard links to robots.txt-blocked pages, but Yuntis currently does not have such policy."
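That distinction — "visiting" versus "indexing" — is exactly what the exclusion standard encodes. A minimal sketch with Python's standard-library robots.txt parser (the rules and URLs here are hypothetical examples, not Yuntis's actual configuration):

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt that blocks /private/
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The standard only answers "may this crawler FETCH this URL?"
# It says nothing about whether links pointing at the URL may be indexed.
blocked = rp.can_fetch("*", "http://example.com/private/page.html")
allowed = rp.can_fetch("*", "http://example.com/public/page.html")
print(blocked, allowed)
```

A well-behaved crawler never requests the blocked URL, but nothing in the protocol stops it from listing the URL in results based on anchor text from pages that link to it — which is the policy gap being complained about here.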
I still cannot see any reason to publish Disallow:ed links, and I don't quite understand why they pollute their database with unverified links of unknown quality that they never visit.
Regards,
R.
[yuntis-wrld.ecsl.cs.sunysb.edu...]
Take a look at some of the textual matches tools...very interesting. Their reference to 'subexpression' is interesting as well in that it states, "not limited to words" so they are trying to mine as much data as possible in the text they parse & store.
I wonder what, if anything, else they could dig out of web pages with such parsing available to, say, an API related application such as the Google API?
There are lots of possibilities, I think...it might be interesting, just as an example, to see if such a tool could pull pricing information better from web pages selling products.
Very interesting stuff.
Google does the same thing: if they find a link, they publish it, whether or not robots.txt allows them to fetch the page.
The only work-around for Google is to allow the 'bot in robots.txt, and then use a <meta name="robots" content="noindex"> tag on the page itself.
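To spell that out (file paths here are just placeholder examples): robots.txt must NOT block the page, so the 'bot can fetch it and see the meta tag:

```
# robots.txt — no Disallow rule covering the page,
# so the crawler is permitted to fetch it
User-agent: *
Disallow:
```

```html
<!-- in the <head> of the page itself -->
<meta name="robots" content="noindex">
```

The counter-intuitive part is that Disallow:ing the page makes things worse: the crawler then never fetches it, never sees the noindex tag, and may still list the bare URL from inbound links.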
Oh... also, Ask Jeeves works the same way, as far as I know.
HTH,
Jim