Forum Moderators: bakedjake
There is a souped-up ODP clone [yuntis-wrld.ecsl.cs.sunysb.edu]. All pages are cached. Lots of neat things available on 'Information about URL' page (see example [yuntis-wrld.ecsl.cs.sunysb.edu]).
There's lots of stuff, which I need to explore in more depth later.
I will be spending quite a few hours there, I can see already!
I wrote them about this, mentioning that most hostmasters put their Disallow: rules there for a reason and probably won't like finding their Disallow:ed links in a search engine's result list.
I got the answer that the "Robots exclusion standard covers only what should not be visited. It might be a policy of a search engine to also disregard links to robots.txt-blocked pages, but Yuntis currently does not have such policy."
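That distinction — "visiting" versus "indexing" — is exactly what the exclusion standard encodes. A minimal sketch with Python's standard-library robots.txt parser (the rules and URLs here are hypothetical examples, not Yuntis's actual configuration):

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt that blocks /private/
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The standard only answers "may this crawler FETCH this URL?"
# It says nothing about whether links pointing at the URL may be indexed.
blocked = rp.can_fetch("*", "http://example.com/private/page.html")
allowed = rp.can_fetch("*", "http://example.com/public/page.html")
print(blocked, allowed)
```

A well-behaved crawler never requests the blocked URL, but nothing in the protocol stops it from listing the URL in results based on anchor text from pages that link to it — which is the policy gap being complained about here.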
I still cannot see any reason to publish Disallow:ed links, and I don't quite understand why they pollute their database with unverified links of unknown quality that they never visit.
Regards,
R.
[yuntis-wrld.ecsl.cs.sunysb.edu...]
Take a look at some of the textual matches tools...very interesting. Their reference to 'subexpression' is interesting as well in that it states, "not limited to words" so they are trying to mine as much data as possible in the text they parse & store.
I wonder what, if anything, else they could dig out of web pages with such parsing available to, say, an API related application such as the Google API?
There are lots of possibilities, I think...it might be interesting, just as an example, to see if such a tool could pull pricing information better from web pages selling products.
Very interesting stuff.
Google does the same thing: if they find a link, they publish it, whether or not robots.txt allows them to fetch the page.
The only work-around for Google is to allow the 'bot in robots.txt, and then use a <meta name="robots" content="noindex"> tag on the page itself.
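To spell that out (file paths here are just placeholder examples): robots.txt must NOT block the page, so the 'bot can fetch it and see the meta tag:

```
# robots.txt — no Disallow rule covering the page,
# so the crawler is permitted to fetch it
User-agent: *
Disallow:
```

```html
<!-- in the <head> of the page itself -->
<meta name="robots" content="noindex">
```

The counter-intuitive part is that Disallow:ing the page makes things worse: the crawler then never fetches it, never sees the noindex tag, and may still list the bare URL from inbound links.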
Oh... also, Ask Jeeves works the same way, as far as I know.
HTH,
Jim