Every week FAST's crawler fetches robots.txt & leaves without crawling

   
4:53 pm on Jul 10, 2002 (gmt 0)

10+ Year Member



I read the FAQs at FAST and lazerzubb's small FAST FAQ [webmasterworld.com] as well, but couldn't find an answer. My question is rather simple: Why is FAST not crawling my (private) site (see profile)?

FAST-WebCrawler/3.6 or 3.5 fetched my robots.txt on the following dates:

  • [26/Jun/2002:21:27:58 +0200],
  • [19/Jun/2002:22:49:46 +0200],
  • [12/Jun/2002:21:53:34 +0200],
  • [11/Jun/2002:19:34:45 +0200],
  • [29/May/2002:14:32:54 +0200],
  • [19/May/2002:08:50:56 +0200],
but never crawled a single page! :( Why?

What can/should I do? I am in DMOZ and have links to my pages. My pages rank well in Google, Altavista, Teoma, ... Could you have a look at my "robots.txt" -- it's valid according to the Robots.txt Validator [searchengineworld.com].
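As a cross-check alongside an online validator, the rules can also be tested locally with Python's standard `urllib.robotparser`. This is only a rough sketch: the robots.txt content below is hypothetical (the real file is not shown in this thread), and FAST's crawler may interpret rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# the actual file discussed in the thread is not reproduced here.
SAMPLE_ROBOTS_TXT = """\
User-agent: Microsoft URL Control
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if Python's parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/widgets/blue.html", "/private/notes.html"):
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "FAST-WebCrawler/3.6", path))
```

If a check like this reports the crawler as allowed, the rules themselves are probably not what is keeping it away.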

9:35 pm on Jul 10, 2002 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



luma,

Your robots.txt looks OK to me... The only thing I see is that your
three Disallows for "Microsoft URL Control" are not likely to work -
those user-agents likely won't check robots.txt at all. You should
probably block these in .htaccess (for Apache server) instead.

Does Fast index any of the sites that link to your site? If so, they
should pick up your site quickly.

In your log files, what server code does your server return when Fast
requests robots.txt?

Have you recently moved your site?

Have you tried Fast's "Submit a Site" process?

Another "picky" thing about robots.txt is that it is a Unix-format file;
make sure you don't have carriage-return/linefeed pairs as the
end-of-line characters. Most robots.txt validators will catch this problem,
though, so I assume this is not your problem. If it is, and you're on a
PC, you can edit it in MS Word and use the "Save as" options, specifying
ASCII text, LF only.
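The line-ending check described above can be done without a validator. The following is a minimal sketch; the function names are made up for illustration, and it assumes the file has been read as raw bytes.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF only."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    sample = b"User-agent: *\r\nDisallow:\r\n"
    print(count_line_endings(sample))
    print(to_unix_line_endings(sample))
```

To check a real file, read it with `open("robots.txt", "rb").read()` so that Python does not translate the line endings on the way in.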

So, good question - Anyone else?

Jim

1:41 am on Jul 11, 2002 (gmt 0)

10+ Year Member



Hi jdMorgan, thanks for your answer and thanks to everyone who has been checking.

Your robots.txt looks OK to me... The only thing I see is that your three Disallows for "Microsoft URL Control" are not likely to work - those user-agents likely won't check robots.txt at all. You should probably block these in .htaccess (for Apache server) instead.
I do use .htaccess for some 301s but haven't figured out blocking UAs.

Does Fast index any of the sites that link to your site? If so, they should pick up your site quickly.
Yes other pages linking to me are in.

In your log files, what server code does your server return when Fast requests robots.txt?
200 OK

Have you recently moved your site?
No. But I only started adding real content and getting links to it a couple of months ago.

Have you tried Fast's "Submit a Site" process?
I am sure I submitted a page or two a couple of months ago (free submit) and did so again a couple of days ago. I might be wrong cause I don't keep notes...

Another "picky" thing about robots.txt is that it is a Unix-format file;
I am using Linux myself and checked again but everything seems right.

So, good question - Anyone else?
Thanks for your help. You see, I think I double-checked everything and really can't find anything. :(

Well, I just read thread Kudos to AllTheWeb - Customer Service [webmasterworld.com], so maybe there's hope after all. I used the AllTheWeb.com: Send Feedback to FAST [alltheweb.com] form a couple of days ago. But I also wanted to be sure that there's not a general problem (that could affect other search engines as well).

Guess I will just have to wait and hope that Google will never start ignoring me. It's just that I don't want to put all eggs in one basket ...

2:15 am on Jul 12, 2002 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



luma,

I looked at the source of a couple of your pages, and I still don't see anything
wrong. However, I should add that I'm not familiar with XHTML. I also searched
for your site on Fast using your URL, and your name and the primary subject of
your site. No luck.

I have a couple of observations and maybe they will help...

The DOCTYPE statement should probably not be broken into two lines. (I know the
W3C HTML validator is very picky about spacing and capitalization of letters in
DOCTYPE). Did you try the validators at www.w3c.org?

The DOCTYPE says it is English, but the other tags say it is Deutsch (it's
both, yes, but these may need to agree).

You have several repeats in your meta keywords which gain nothing, and reduce
the value of the ones that follow. (Not related to your problem, but true).

Try adding one blank line at the end of your robots.txt. I'm not at all sure
it's required, but the specification calls the line break a "record separator",
so maybe some robots think the record didn't end?

These are all pure guesses, but I agree it is not a good idea to have all
of your eggs in one basket, and Fast looks like a nice second or third basket.

Fast is usually fast. I made some changes to my site, and within a week, Fast
had found them and updated its index. So, if you do fix something, you
should see results soon.

It looks like I will have to bookmark your site and come back to read about
URL filtering, etc. Very nice browser feature!

Jim

10:16 am on Jul 12, 2002 (gmt 0)

10+ Year Member



When we start talking about the insides of DOCTYPE, it becomes an SGML discussion. A little trickier than HTML, but not difficult:

The DOCTYPE statement should probably not be broken into two lines. (I know the
W3C HTML validator is very picky about spacing and capitalization of letters in
DOCTYPE).

That shouldn't be a problem. A line break between the Formal Public Identifier (FPI) and the system identifier is completely legitimate, and in fact, sorta traditional. A few very old browsers have trouble with it, but no legitimate validator or robot should complain. Besides, it doesn't sound like Fast is even requesting the HTML files, so the SGML probably isn't the problem.

The DOCTYPE says it is English, but the other tags say it is Deutsch (it's
both, yes, but these may need to agree).

They're not supposed to agree (in this case). The language code in the FPI identifies the language used to create the markup language, not the language of the document content. HTML's DTDs are all written in English, so the HTML FPIs always use EN.
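To make the structure of an FPI concrete, here is a small sketch that splits one into its fields. The XHTML 1.0 Strict identifier is used purely as an example (the thread does not say which DOCTYPE luma's pages declare), the class and function names are made up for illustration, and only the common four-part form is handled.

```python
from typing import NamedTuple


class FormalPublicIdentifier(NamedTuple):
    registration: str  # "-" for an unregistered owner, "+" for a registered one
    owner: str         # who maintains the public text, e.g. "W3C"
    text_class: str    # kind of public text, e.g. "DTD"
    description: str   # e.g. "XHTML 1.0 Strict"
    language: str      # language the DTD itself is written in, e.g. "EN"


def parse_fpi(fpi: str) -> FormalPublicIdentifier:
    """Split a four-part FPI such as '-//W3C//DTD XHTML 1.0 Strict//EN'."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError(f"expected four '//'-separated fields, got {len(parts)}")
    registration, owner, class_and_description, language = parts
    # The first word of the third field is the public text class.
    text_class, _, description = class_and_description.partition(" ")
    return FormalPublicIdentifier(registration, owner, text_class, description, language)


if __name__ == "__main__":
    print(parse_fpi("-//W3C//DTD XHTML 1.0 Strict//EN"))
```

The trailing `EN` lands in the `language` field, which describes the DTD and not the page content, so a German page with this DOCTYPE is not contradicting itself.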

Looking at the source code for luma's home page, I'd personally be more concerned about starting the page with an empty comment tag. Starting a page with a comment feels like bad karma to me.

9:11 pm on Jul 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>Why is FAST not crawling my (private) site

don't panic, your site is not the only one not being fully spidered. a couple of weeks back, brett mentioned that FAST's spider wasn't crawling as normal. it could be that FAST are changing the crawl schedule, or they might be stopping the free crawling, or maybe something else. we'll have to wait and see.

2:41 am on Jul 15, 2002 (gmt 0)

10+ Year Member



Did you try the validators at www.w3c.org

All (but one) of my pages should validate.

Try adding one blank line at the end of your robots.txt

I just checked WMW's robots.txt and it doesn't have one either. But I'll give it a try.

I'd personally be more concerned about starting the page with an empty comment tag. Starting a page with a comment feels like bad karma to me.

I am not sure what you are talking about. I don't have an empty comment tag, do I?

we'll have to wait and see.

I guess that pretty much sums it up. ;)

Thanks for all of your help.

9:01 am on Jul 15, 2002 (gmt 0)

10+ Year Member



I am not sure what you are talking about. I don't have an empty comment tag, do I?

As it turns out, you don't. I was using my roommate's computer last week (mine was dismantled while I worked on a hardware problem), and I didn't realize the advertising filter he's using with IE alters the source code of HTML pages.

9:21 pm on Aug 14, 2002 (gmt 0)

10+ Year Member



It's been some time, and yesterday I found the following in my logfile:
access.log.33.2:66.77.73.254 - - [13/Aug/2002:15:04:43 +0200]
"GET /widgets/blue.html HTTP/1.0" 200 21053 www.domain.com "-"
"FAST-WebCrawler/3.6/FirstPage (crawler @fast.no;
http*//fast.no/support.php?c=faqs/crawler)" "-"

It fetched two pages but no robots.txt. The last time "FAST-WebCrawler/3.6 (atw-crawler at fast dot no; http*//fast.no/support/crawler.asp)" fetched robots.txt was on July 23rd. It didn't fetch any pages.
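Counting robots.txt fetches against page fetches by hand gets tedious, so a short script can do the tallying. This is a rough sketch that assumes each log entry sits on a single line in a layout like the one quoted above; the function name and the `agent_marker` parameter are made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp, request line, and status code of a log entry
# laid out like the one quoted in this post.
REQUEST_RE = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"\s+(?P<status>\d{3})'
)


def summarize_crawler_hits(lines, agent_marker="FAST-WebCrawler"):
    """Count robots.txt fetches versus page fetches for one crawler."""
    counts = Counter()
    for line in lines:
        if agent_marker not in line:
            continue  # some other visitor
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # line does not look like a request entry
        kind = "robots.txt" if match.group("path") == "/robots.txt" else "page"
        counts[kind] += 1
    return dict(counts)


if __name__ == "__main__":
    with open("access.log", encoding="utf-8", errors="replace") as log:
        print(summarize_crawler_hits(log))
```

Passing `agent_marker="FirstPage"` would restrict the count to the FirstPage variant of the crawler.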

So, is the Fast-Firstpage crawler the regular guy or did they finally read my e-mail and send some special bot? What do you think, (when) will those pages make it in the index?

11:19 am on Aug 15, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Sorry to hear about your problems with Fast, luma. Things have been "fuzzy" lately, with reports of irregularities in spidering/updates.

It's very strange that the crawler didn't ask for the robots.txt. It should do that every time it accesses the site. Are you sure?

Actually I haven't noticed the "first page" crawler before. It could be anything from a regular crawler to a special bot they use to check a site out manually (I doubt that - manual checks would be too time consuming).

First page could mean a bot for new sites that they haven't got in the database yet?

Come on guys and girls, help luma out and check if you have the first page crawler in your logs:

"FAST-WebCrawler/3.6/FirstPage (crawler @fast.no; http*//fast.no/support.php?c=faqs/crawler)" "-"

-I can't get the url to the crawler info to resolve either.

11:21 am on Aug 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WebCrawler 3.6 is one of those that hits some pages on some of my sites sometimes. it's been a long time since FAST regularly and fully spidered any of my sites or the sites i look after. luma, i'd say you just gotta keep on waiting ..... relax and move onto something else for now and see what happens in a couple of months time ...
11:22 am on Aug 15, 2002 (gmt 0)

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Firstpage is not new, see here:
[webmasterworld.com...]
[webmasterworld.com...]
11:44 am on Aug 15, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



>first page not new

Ooops! Gotta use that Site Search before pushing the submit button.

Thanks heini ;)

7:56 pm on Aug 15, 2002 (gmt 0)

10+ Year Member



Well, FirstPage sounds better than Nirvana. ;)

You should be able to access FAST's crawler pages:

FAST-WebCrawler/3.6 atw-crawler at fast dot no;
FAST Web Crawler >> FAQs
[fast.no...]

FAST-WebCrawler/3.6/FirstPage crawler @fast.no
FAST Customer Support
[fast.no...]

That second address gets redirected to [fast.no...] where you'll find a link to FAST's Web Crawler FAQ (see above).

8:14 pm on Aug 15, 2002 (gmt 0)

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



FAST-WebCrawler/3.6/FirstPage _IS_ the regular guy - just checked a site where this was the only Fast crawler this month, walking all the way through the site.

>What do you think, (when) will those pages make it in the index?
Luma - no predictions here. My impression is that Fast is working on something which has somewhat slowed down its crawling cycles and its pickup of new sites.

1:01 pm on Aug 28, 2002 (gmt 0)

10+ Year Member



I am in! :) They updated part of their index, I guess. I watched a query that got me three results. Now it's up to five, and one of them is one of the three pages the crawler was fetching. Hope it will fetch some more pages and not just the three that are in dmoz.

Thank you for helping and cheering me up. :)

2:14 pm on Aug 28, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Congrats Luma :)

Hopefully it will spawn some traffic too?

Let us know how it goes and what partners you get referrals from?

2:52 pm on Aug 28, 2002 (gmt 0)

10+ Year Member



I'm glad luma's issue is solved but it bothers me that this user agent has been in my logs nearly every day lately. It always gets my index.asp page first, then it comes back later -- sometimes the next day -- to get my robots.txt file, and then it disappears for one or two days. The robots.txt file does not exclude FAST except for the directories that I exclude all user agents from.
3:11 pm on Sep 3, 2002 (gmt 0)

10+ Year Member



And out again. :( Only the two 1B sized links (see thread Weird results (1 B) size on "more hits from" click [webmasterworld.com]) are in. I was watching one specific query that changed from 5 results to 9 and now 8.

That's it. I will completely ignore FAST (what a funny name for something this slow) until they crawl and list all of my site.

<random google success story>
A friend of mine finally got his domain online. I linked to two of his pages (frameset and deep) on Aug. 28th. On Sept. 1st those two pages were in!
</random google success story>

 
