[inktomi.com...]
For some frequently asked questions on Slurp and Inktomi's web crawling and
search technology, please see later in this email.
Some web site administrators do not want robots to index their site, or
certain areas of their site. This is particularly true for sections that
contain dynamically generated pages, CGI scripts, and so on. There is a de
facto standard called the "Robots Exclusion Standard" (RES) which allows web
administrators to tell robots which areas of the site they are allowed to
visit, and which are off limits.
RES involves putting a file called "robots.txt" in the document root of the
web site. A robot parses this file to determine what site restrictions
exist. Every robot that visits your site should request the robots.txt file
from your web server. If there is no robots.txt file, the RES specifies that
the robot can visit all parts of the site if it wishes.
Slurp obeys robots.txt restrictions. Every time Slurp visits your site it
will request the robots.txt file. If your site doesn't have a robots.txt,
Slurp will obey its own rules on which URLs to visit (for example, Slurp
doesn't visit the standard places to store CGI scripts even if robots.txt
allows it to). If you don't have a problem with Slurp's visits, or other
robots' visits, then you don't need a robots.txt file. If you just want to stop
getting "robots.txt not found" errors in your server logs, you can simply
create an empty robots.txt file.
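As a rough illustration of how a robot might interpret such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the www.example.com URLs and the "OtherBot" agent name are made up for the example; this is not Slurp's own parser, and it does not model Slurp's extra built-in rules.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt: keep Slurp out of two areas, allow everyone else.
SAMPLE = """\
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow:
"""

restricted = build_parser(SAMPLE)
empty = build_parser("")  # an empty robots.txt file places no restrictions

slurp_cgi = restricted.can_fetch("Slurp", "http://www.example.com/cgi-bin/search")
slurp_home = restricted.can_fetch("Slurp", "http://www.example.com/index.html")
other_cgi = restricted.can_fetch("OtherBot", "http://www.example.com/cgi-bin/search")
empty_any = empty.can_fetch("Slurp", "http://www.example.com/anything")

print(slurp_cgi, slurp_home, other_cgi, empty_any)
```

The empty-file case mirrors the advice above: an empty robots.txt stops the "not found" log entries without restricting any robot.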
For more background on Slurp, please see:
[inktomi.com...]
For more information on the RES, please see:
[info.webcrawler.com...]
A number of frequently asked questions regarding Slurp and Inktomi's search
system are listed below.
If your question is not answered by them, or if you still need further
assistance, please send an email with a full problem description to
slurp-help@inktomi.com.
Thank you for your interest in Slurp!
Q - How do I add myself to your database ?
A - We provide the backend technology to search portals; however, Inktomi does
not provide a mechanism for general Internet users to submit new web sites
directly into the search index. In order for your site to be listed with a
specific search portal you'll need to contact the portal. Your Add URL requests
must be submitted directly to each of the portal sites in which you would like
your URL to appear. This process varies from site to site, so please visit the
portal for the specific details on how to submit your web site to their search
index.
Q - I have moved my web pages to a new site, please update my URLs to match
the new address.
A - Our web database is automatically updated without manual intervention and
reflects the content of the Web. To add your new pages to our database you
should have incoming links to them updated; this should ensure that we
discover your new addresses on our next database update cycle. For faster
addition please add them via one of our partner portals. To have your old URLs
removed from our database you should either remove them from the old web
server or block access to them via a RES robots.txt file. When our next
database update cycle discovers that they are no longer retrievable, they
will be removed from the database.
Q - We've changed our web server's IP address, please note the updated IP
address.
A - Providing your DNS update has propagated around the Internet, we will
access the new site correctly on our next crawl cycle.
Q - I added my site at one of your partner portals but only one page is
listed, or nothing is listed still.
A - There is a delay between user Add URL requests and the URL being
added into the database. This is because our architecture has to crawl and
index the URLs, as well as validate them for quality.
Furthermore, please be aware that adding a single URL will not cause every
page on your site to be added to our database; it is necessary to specifically
add each individual URL that you would like to be indexed.
Q - I submitted my web page to a search site, and it used to be listed in
search results, but it isn't listed any more. Why did you remove it ?
A - If you submitted your URL for adding to our database via a search portal,
then it has probably aged out of our database. We have to crawl and index an
enormous number of URLs and so older ones will eventually age out if they do
not pass certain relevance criteria. We suggest checking for your URL's
existence within our database every 21 days, and re-adding it if necessary.
Q - I submitted a list of URLs but you only crawled robots.txt and /. Why ?
A - Our crawling system normally retrieves these URLs first; the other URLs
will be crawled soon.
Q - I have changed the description for my site, please update it in your
database.
A - If you have changed the title or description for a URL then it will get
picked up the next time we crawl your site. The Inktomi database is refreshed
on an ongoing basis, and as we note changes to page descriptions we update
them within our database automatically. The database is refreshed over a
cycle of approximately 30 days.
Q - I searched for something and found this site in the result list. I
consider it to be offensive, please remove it from your database.
A - Inktomi has a policy of not automatically deleting documents from our
database. The database is simply a reflection of the Internet and is crawled
and indexed with little human intervention. We do not remove these documents
automatically, specifically for the protection of our partners and website
administrators, as doing so could expose Inktomi to liability from website
administrators.
Q - Whenever I search for my name, I get a list of sites with content which I
find offensive. This is very embarrassing for me, please fix it.
A - This normally happens because the page contents contain your first name
and your last name within the same page, but not necessarily next to each
other. If you are searching for your name by entering your first name and last
name, for example by entering John Doe in the search box, then the default on
most search portals is to return pages which contain the word John OR the word
Doe, and those which contain both words will be listed higher in the results.
If you instead try searching for your name by enclosing it within quotes, for
example by entering "John Doe", then this will return only pages with the word
John and the word Doe next to each other, i.e. only those pages which list your
name.
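The difference between the two query styles can be sketched in a few lines of Python. This is only a toy model of the behaviour described above, not how any portal actually ranks results; the page names, page texts and the scoring by number of matching words are all assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def or_search(query, docs):
    """Return pages containing ANY query word; pages matching more words first."""
    terms = set(tokenize(query))
    scored = []
    for name, text in docs.items():
        hits = len(terms & set(tokenize(text)))
        if hits:
            scored.append((hits, name))
    return [name for hits, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


def phrase_search(phrase, docs):
    """Return only pages where the query words appear next to each other."""
    wanted = tokenize(phrase)
    n = len(wanted)
    results = []
    for name, text in docs.items():
        words = tokenize(text)
        if any(words[i:i + n] == wanted for i in range(len(words) - n + 1)):
            results.append(name)
    return sorted(results)


# Made-up pages for the example.
DOCS = {
    "a.html": "John Doe maintains this page.",
    "b.html": "John Smith met Jane Doe yesterday.",
    "c.html": "A page about doe and deer.",
    "d.html": "Nothing relevant here.",
}

print(or_search("John Doe", DOCS))      # a.html, b.html and c.html all match
print(phrase_search("John Doe", DOCS))  # only a.html has the words adjacent
```

Note that b.html contains both words but belongs to two different people, which is exactly the situation the answer above describes.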
Q - Who are you and why are you accessing my site repeatedly, why are you
"hacking" my site ?
A - Slurp is Inktomi Corporation's web-indexing robot. What you are seeing in
your web server logs are not hacking attempts nor a web user repeatedly
accessing parts of your site, but normal regular visits by the Slurp robot to
your web site. If you wish to restrict the parts of your site that Slurp
visits, then we advise use of a Robots Exclusion Standard "robots.txt" file.
For more information on RES files, please see:
[info.webcrawler.com...]
If the machine that we are accessing is no longer running a web server, but
you are seeing access attempts on your firewall logs, then it is possible that
we have followed a link to it from another web page. These attempts should
stop shortly afterwards, when we discover that we cannot access the server.
Slurp accesses web sites for which it has followed target links from within
other web sites, or for which Internet users have submitted URL addition
requests. It does not use any sort of probing, network scanning, or
port-scanning intrusion techniques to locate web servers.
Q - Someone at your site keeps trying to FTP into my home machine, please ask
them to stop it.
Q - Someone at your company is trying to portscan my machine, my software
reports connection attempts on port 21.
A - These are neither hacking attempts nor port scanning attempts, but standard
attempts to log in to an FTP server on your machine. Port 21
is the Internet standard port for FTP servers. The access attempts are coming
from part of Inktomi Corporation's web crawling and indexing system. The
particular part of our system which is attempting to FTP login to your machine
is our media indexing system. Users who are hosting media FTP sites submit
their URLs or IP addresses to us to have us index their FTP sites.
We are probably attempting to FTP to your machine either because someone has
incorrectly submitted your IP address to us, or because you have a dynamic IP
address which changes each time you connect to your service provider and
someone has submitted their own dynamic IP address which you have now picked
up. We specifically ask people not to submit dynamic IP addresses;
unfortunately, sometimes they do.
In these cases, if you can supply us with information on your machine - your
hostname or IP address, and whether you are using a dynamic IP address (and if
so, what the dynamic IP range is or whom we can contact at your ISP to find
out what it is) - then we can have our media crawler indexing system stop
accessing your machine.
Q - I was just portscanned, someone from *.inktomi.com accessed my machine
looking for web servers on different ports. Please stop this.
A - These are not port scanning attempts, but regular attempts to check for the
existence of web servers on your machine which our crawling system is
attempting to crawl.
If, during our crawl and index of the web, we found URLs which referenced your
machine - either on port 80 or any other port that was referenced in a URL -
then our crawling system will periodically attempt to connect to your machine
on those ports in order to validate that a web server still does run on them.
If you wish this to be stopped, then please supply us with your machine's
hostname and IP address/addresses, and the ports in question. If you are
hosting multiple virtual machines and/or multiple virtual IP addresses on a
single machine then please supply us with a list of all of them so that we
can block access to them all.
Q - I wish to filter out all Inktomi crawler accesses from my web server logs,
please supply me with a list of IP addresses which your crawlers use.
A - Inktomi has a variety of crawling subsystems which span a number of
different IP subnets and IP address ranges. The common link between all of our
crawling systems is that they contain the string "Slurp" in the HTTP
User-Agent. We would therefore recommend that you filter based on User-Agent
rather than IP address as this ensures that you will filter all current and
forthcoming crawlers.
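A minimal Python sketch of filtering on User-Agent rather than IP address follows. It assumes your server writes the common "combined" log format, where the User-Agent is the last double-quoted field; the sample log lines, addresses and agent strings are invented for the example.

```python
def is_slurp(log_line: str) -> bool:
    """True if the last quoted field (the User-Agent) contains "Slurp"."""
    parts = log_line.rsplit('"', 2)
    if len(parts) < 3:
        return False  # line has no quoted User-Agent field
    return "Slurp" in parts[1]


# Made-up log lines; note the two crawler hits come from different addresses.
LOG_LINES = [
    '192.0.2.10 - - [01/Jan/2001:00:00:01 +0000] "GET /robots.txt HTTP/1.0" '
    '200 0 "-" "Slurp/0.0 (made-up example agent)"',
    '192.0.2.55 - - [01/Jan/2001:00:00:02 +0000] "GET /index.html HTTP/1.0" '
    '200 512 "-" "Mozilla/4.0 (compatible)"',
    '198.51.100.7 - - [01/Jan/2001:00:00:03 +0000] "GET / HTTP/1.0" '
    '200 512 "-" "Mozilla/3.0 (Slurp.so/1.0; made-up example agent)"',
]

kept = [line for line in LOG_LINES if not is_slurp(line)]
print(len(kept))
```

Because the match is on the agent string, both crawler entries are dropped even though they arrive from different subnets.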
Q - Why are you ignoring my robots.txt file and accessing my root / document
even though I have listed it in robots.txt ?
A - Slurp retrieves the root document for internal use. However, if you
have disallowed it in robots.txt then it will not be indexed or added to our
search database, nor will links from it be followed.
Q - Why are you ignoring my robots.txt file for directory <x> ?
A - If you are referring to the root directory /, then please see the above
answer. If not then we advise checking your robots.txt file to ensure that you
have the directory specified correctly, and bear in mind RES directory rules.
For example, many servers automap www.yourserver.com/~user to
www.yourserver.com/home/user; however, these are not classed as the same
under RES rules, and so you would need a robots.txt file entry for both ~user
and /home/user.
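The ~user and /home/user case can be checked with Python's standard urllib.robotparser. This is a sketch only: the two robots.txt bodies and the page.html path are assumptions, and the module's prefix matching stands in for a robot's RES handling rather than reproducing Slurp itself.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str, agent: str = "Slurp") -> bool:
    """Check one path against robots.txt text using plain RES prefix rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://www.yourserver.com" + path)


# Hypothetical files: one lists only the ~user form, the other lists both.
ONE_ENTRY = "User-agent: *\nDisallow: /~user/\n"
BOTH_ENTRIES = "User-agent: *\nDisallow: /~user/\nDisallow: /home/user/\n"

tilde_blocked = not allowed(ONE_ENTRY, "/~user/page.html")
alias_still_open = allowed(ONE_ENTRY, "/home/user/page.html")
alias_blocked = not allowed(BOTH_ENTRIES, "/home/user/page.html")

print(tilde_blocked, alias_still_open, alias_blocked)
```

The robot compares URL paths as text and knows nothing about the server's internal mapping, which is why the second Disallow line is needed.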
In addition, for performance reasons, and to reduce the load on your web
server, Slurp caches robots.txt files internally. This means that after
updating your robots.txt file and disallowing a directory, Slurp might still
make access attempts for up to 7 days.
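To see why a change can take up to 7 days to be noticed, here is a small Python sketch of a time-limited cache. It is a guess at the general technique, not Inktomi's implementation; the RobotsCache class, the fetch callback and the dates are all hypothetical.

```python
from datetime import datetime, timedelta


class RobotsCache:
    """Keep each host's robots.txt for a fixed period before re-fetching."""

    def __init__(self, fetch, ttl=timedelta(days=7)):
        self._fetch = fetch    # callable: host -> robots.txt text
        self._ttl = ttl        # assumed maximum age of a cached copy
        self._entries = {}     # host -> (time fetched, robots.txt text)

    def get(self, host, now):
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]   # still fresh: the server is not contacted
        body = self._fetch(host)
        self._entries[host] = (now, body)
        return body


# Simulated web site whose administrator edits robots.txt after the first crawl.
site = {"www.example.com": "User-agent: *\nDisallow:\n"}
cache = RobotsCache(fetch=lambda host: site[host])
start = datetime(2001, 1, 1)

first = cache.get("www.example.com", start)
site["www.example.com"] = "User-agent: *\nDisallow: /private/\n"
day3 = cache.get("www.example.com", start + timedelta(days=3))
day8 = cache.get("www.example.com", start + timedelta(days=8))

print(day3 == first, "/private/" in day8)
```

On day 3 the crawler still works from the old copy, so visits to the newly disallowed directory can continue until the cached copy expires.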
Q - Do you index ASP pages and .shtml pages ?
A - Slurp does; however, it only follows static links, and we recommend the
avoidance of dynamically generated href links.
Q - Do you index dynamic links ?
A - Slurp does not, as following dynamic links can cause problems. In
particular, session data within a query string is a problem. We recommend adding
your static URLs with the portals that you use.
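One way to picture the session data problem is with the Python sketch below: the same page reached in two visits gets two different URLs, so a crawler would treat it as two documents. The URLs and the "session" parameter name are invented, and treating every URL with a query string as dynamic is a simplifying assumption, not a statement of Slurp's actual rule.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def is_dynamic(url: str) -> bool:
    """Assumption for this sketch: any URL with a query string is dynamic."""
    return bool(urlsplit(url).query)


def drop_param(url: str, name: str) -> str:
    """Return the URL with one query parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return parts._replace(query=urlencode(kept)).geturl()


# Two visits to the same catalogue page, each given its own session ID.
VISIT_1 = "http://www.example.com/catalog.asp?item=7&session=abc123"
VISIT_2 = "http://www.example.com/catalog.asp?item=7&session=xyz789"
STATIC = "http://www.example.com/catalog/item7.html"

looks_different = VISIT_1 != VISIT_2
same_page = drop_param(VISIT_1, "session") == drop_param(VISIT_2, "session")

print(is_dynamic(VISIT_1), is_dynamic(STATIC), looks_different, same_page)
```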
Q - Do you index frames ?
A - Slurp does not index frames; indexing frames would lead users to a frame
directly, which is not what web site designers desire.
Q - Does Slurp support the <META noindex> HTML tag ?
A - Slurp does support the noindex tag.
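The tag is conventionally written as <meta name="robots" content="noindex">. A short Python sketch of how a robot could detect it, using the standard html.parser module, is shown here; the RobotsMetaParser class and the sample pages are hypothetical and this is not Slurp's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Made-up pages for the example.
BLOCKED = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
OPEN = '<html><head><meta name="description" content="noindex"></head></html>'

print(has_noindex(BLOCKED), has_noindex(OPEN))
```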
Q - Does Slurp follow redirections ?
A - No, Slurp does not follow redirections.
Q - Why do you keep asking for documents which don't exist or have been moved ?
A - Slurp is attempting to re-index them for our database. Now that it knows
they no longer exist they will be dropped from our database, and if they have
a new location then they will be discovered on our next crawl cycle.
Q - How can I get my page to rank higher in your search results ?
A - You should ensure that your title, and keyword/description HTML meta tags
correctly describe your page. Think carefully about key terms that your users
will search on, and use them to construct your page. Documents are ranked
higher if a search term is in the title and users are more likely to click a
link if the title matches what they are looking for.
Generally the more meta tag keywords the better; however, you should ensure
that only keywords which are relevant to your site are used. Slurp uses
artificial intelligence techniques to look for sites which have been created
using many irrelevant tags in an attempt by their authors to force them into
search result lists which have nothing to do with their true content (known as
"spamming").
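As a quick self-check of the advice on titles and meta tags, the Python sketch below pulls the title and the keyword/description meta tags out of a page and reports which of your key terms are missing from the title. The HeadSummary class, the sample page and the term list are made up; this only inspects your own HTML and says nothing about how ranking is actually computed.

```python
from html.parser import HTMLParser


class HeadSummary(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = {key: (value or "") for key, value in attrs}
            name = attributes.get("name", "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html: str) -> HeadSummary:
    summary = HeadSummary()
    summary.feed(html)
    return summary


def missing_from_title(html: str, terms):
    """List the key terms that do not appear in the page title."""
    title = summarize(html).title.lower()
    return [term for term in terms if term.lower() not in title]


# Made-up page for the example.
PAGE = (
    "<html><head><title>Acme Widgets - Blue Widget Catalog</title>"
    '<meta name="keywords" content="widgets, blue widgets">'
    '<meta name="description" content="Catalog of blue widgets.">'
    "</head><body></body></html>"
)

print(missing_from_title(PAGE, ["widgets", "catalog", "gadgets"]))
```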
If the above does not answer your question and you still require further
assistance, please send an email with a full problem description to
slurp-help@inktomi.com.
-----------------------------------
Inktomi - Essential to the Internet
I've always heard that submitting to one of their partner sites results in being included at all of the others. Are they now saying otherwise with this point?
> In order for your site to be listed with a specific search portal you'll need to contact the portal. Your Add URL requests must be submitted directly to each of the portal sites in which you would like your URL to appear. This process varies from site to site, so please visit the portal for the specific details on how to submit your web site to their search index.
Doesn't it look like they're saying to submit to ALL of them? It also wouldn't be a bad idea to have this information available at each of the partner sites, if this is the case. And how about MSN - there is no place to add the URL. Is it possible that there's a change coming?
At any rate, with Ink's business model, relying on their partners to communicate, which isn't always reliable, they could easily "lose control," which appears to have happened, with the confusion that abounds.
Here's a quote from a thread in the FAST forum [webmasterworld.com] that reflects what a typical consumer or webmaster could be thinking (sorry, I'm quoting *me*):
"If a search engine like FAST is opening a door of open communication with webmasters, they are doing their part to make it a cooperative venture rather than an adversarial one, and in the long run everyone will gain, including and especially the searchers, most of whom don't have time or patience to sift through pages of junk to find what they're looking for.
IMHO, FAST is at the forefront of trying to establish lines of communication, highly commendable as well as good business sense. It's quality marketing, showing respect and makes them worthy of our respect.
I believe one of the reasons Inktomi is losing ground is because of total lack of contact or communication with webmasters and consumers. Big mistake. My highly "uneducated guess" is that we could very well start to see FAST start to make inroads in areas where Ink is losing ground."
This email appears to be an attempt to remedy their situation and clear up some of the confusion and problems that exist.
Edited by: Marcia
Interesting - this seems to confirm link pop is a factor and they are inviting spam!
>I've always heard that submitting to one of their partner sites results in being included at all of the others. Are they now saying otherwise with this point?
I think they are referring to localised filtering.
I seem to remember, months before paid inclusion even started, that sites would disappear and need to be re-submitted as a matter of course.
Or is this point being made to encourage paid submissions with guaranteed respidering?
For example, data from Direct Hit.
>I seem to remember, months before paid inclusion even started, that sites would disappear and need to be re-submitted as a matter of course.
Sites in the so-called "permanent DB" were less prone to drop out.
>Or is this point being made to encourage paid submissions with guaranteed respidering?
:)
Link pop;
Has anyone noticed that links to a site are not dropped by Ink? I continue to feed Ink new links but the old ones are not being dropped.
"certain relevance criteria"
Link pop and directory listings
I would guess that someone submitted a mail to them using your e-mail address as the one to reply to.
"Your Add URL requests must be submitted directly to _each_ of the portal sites in which you would like your URL to appear."
I don't think that there is any ambiguity here. They're not talking about localised filtering but about submitting the same URL to each site that you want it to appear on.
Ergo, there are separate databases for each partner, and a page will not permeate through the Ink partners unless it is in the 'main' database, which is propagated by links.
Either that or this sentence has been REALLY badly phrased.