
Deprecated - Altavista, Alltheweb.com Forum

    
Every week FAST's crawler fetches robots.txt & leaves without crawling
luma

10+ Year Member



 
Msg#: 541 posted 4:53 pm on Jul 10, 2002 (gmt 0)

I read the FAQs at FAST and lazerzubb's small FAST FAQ [webmasterworld.com] as well, but couldn't find an answer. My question is rather simple: Why is FAST not crawling my (private) site (see profile)?

FAST-WebCrawler/3.6 or 3.5 fetched my robots.txt on the following dates:

  • [26/Jun/2002:21:27:58 +0200],
  • [19/Jun/2002:22:49:46 +0200],
  • [12/Jun/2002:21:53:34 +0200],
  • [11/Jun/2002:19:34:45 +0200],
  • [29/May/2002:14:32:54 +0200],
  • [19/May/2002:08:50:56 +0200],
but never crawled a single page! :( Why?

What can/should I do? I am in DMOZ and have links to my pages. My pages rank well in Google, Altavista, Teoma, ... Could you have a look at my "robots.txt" -- it's valid according to the Robots.txt Validator [searchengineworld.com].

 

jdMorgan

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 541 posted 9:35 pm on Jul 10, 2002 (gmt 0)

luma,

Your robots.txt looks OK to me... The only thing I see is that your
three Disallows for "Microsoft URL Control" are not likely to work -
those user-agents likely won't check robots.txt at all. You should
probably block these in .htaccess (for Apache server) instead.

Does Fast index any of the sites that link to your site? If so, they
should pick up your site quickly.

In your log files, what server code does your server return when Fast
requests robots.txt?

Have you recently moved your site?

Have you tried Fast's "Submit a Site" process?

Another "picky" thing about robots.txt is that it is a Unix-format file;
make sure you don't have carriage-return/linefeed pairs as the end-of-line
characters. Most robots.txt validators will catch this problem
though, so I assume this is not your problem. If it is, and you're on a
PC, you can edit it in MS Word and use the "Save as" options, specifying
ASCII text, LF only.
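
For anyone who would rather check the end-of-line characters directly than trust a validator, here is a minimal Python sketch. It counts the line-ending styles in the raw bytes of a robots.txt file and converts them to LF only. The function names and the sample content are made up for illustration; whether any particular crawler actually stumbles over CRLF is not established in this thread.

```python
def inspect_line_endings(data: bytes) -> dict:
    """Count each end-of-line style in the raw bytes of a file."""
    crlf = data.count(b"\r\n")
    # Bare LFs and bare CRs are what remains once CRLF pairs are subtracted.
    lf_only = data.count(b"\n") - crlf
    cr_only = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf_only, "cr": cr_only}


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF pairs and bare CRs to single LFs."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # Hypothetical robots.txt content saved with DOS-style line endings.
    sample = b"User-agent: *\r\nDisallow: /private/\r\n"
    print(inspect_line_endings(sample))
    print(to_unix_line_endings(sample))
```

To check a real file, read it in binary mode (open(path, "rb")) so that Python does not translate the line endings before they are counted.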

So, good question - Anyone else?

Jim

luma

10+ Year Member



 
Msg#: 541 posted 1:41 am on Jul 11, 2002 (gmt 0)

Hi jdMorgan, thanks for your answer, and thanks to everyone who has been checking.

Your robots.txt looks OK to me... The only thing I see is that your three Disallows for "Microsoft URL Control" are not likely to work - those user-agents likely won't check robots.txt at all. You should probably block these in .htaccess (for Apache server) instead.
I do use .htaccess for some 301s but haven't figured out blocking UAs.

Does Fast index any of the sites that link to your site? If so, they should pick up your site quickly.
Yes other pages linking to me are in.

In your log files, what server code does your server return when Fast requests robots.txt?
200 OK

Have you recently moved your site?
No. But I only started adding real content and getting links to it a couple of months ago.

Have you tried Fast's "Submit a Site" process?
I am sure I submitted a page or two a couple of months ago (free submit) and did so again a couple of days ago. I might be wrong cause I don't keep notes...

Another "picky" thing about robots.txt is that it is a Unix-format file;
I am using Linux myself and checked again but everything seems right.

So, good question - Anyone else?
Thanks for your help. You see, I think I double-checked everything and really can't find anything. :(

Well, I just read the thread Kudos to AllTheWeb - Customer Service [webmasterworld.com], so maybe there's hope after all. I used the AllTheWeb.com: Send Feedback to FAST [alltheweb.com] form a couple of days ago. But I also wanted to be sure that there's not a general problem (one that could affect other search engines as well).

Guess I will just have to wait and hope that Google will never start ignoring me. It's just that I don't want to put all eggs in one basket ...

jdMorgan

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 541 posted 2:15 am on Jul 12, 2002 (gmt 0)

luma,

I looked at the source of a couple of your pages, and I still don't see anything
wrong. However, I should add that I'm not familiar with XHTML. I also searched
for your site on Fast using your URL, and your name and the primary subject of
your site. No luck.

I have a couple of observations and maybe they will help...

The DOCTYPE statement should probably not be broken into two lines. (I know the
W3C HTML validator is very picky about spacing and capitalization of letters in
DOCTYPE). Did you try the validators at www.w3c.org?

The DOCTYPE says it is English, but the other tags say it is Deutsch (it's
both, yes, but these may need to agree).

You have several repeats in your meta keywords which gain nothing, and reduce
the value of the ones that follow. (Not related to your problem, but true).

Try adding one blank line at the end of your robots.txt. I'm not at all sure
it's required, but the specification calls the line break a "record separator",
so maybe some robots think the record didn't end?

These are all pure guesses, but I agree it is not a good idea to have all
of your eggs in one basket, and Fast looks like a nice second or third basket.

Fast is usually fast. I made some changes to my site, and within a week, Fast
had found them and updated its index. So, if you do fix something, you
should see results soon.

It looks like I will have to bookmark your site and come back to read about
URL filtering, etc. Very nice browser feature!

Jim

mbauser2

10+ Year Member



 
Msg#: 541 posted 10:16 am on Jul 12, 2002 (gmt 0)

When we start talking about the insides of DOCTYPE, it becomes an SGML discussion. A little trickier than HTML, but not difficult:

The DOCTYPE statement should probably not be broken into two lines. (I know the
W3C HTML validator is very picky about spacing and capitalization of letters in
DOCTYPE).

That shouldn't be a problem. A line break between the Formal Public Identifier (FPI) and the system identifier is completely legitimate, and in fact, sorta traditional. A few very old browsers have trouble with it, but no legitimate validator or robot should complain. Besides, it doesn't sound like Fast is even requesting the HTML files, so the SGML probably isn't the problem.

The DOCTYPE says it is English, but the other tags say it is Deutsch (it's
both, yes, but these may need to agree).

They're not supposed to agree (in this case). The language code in the FPI identifies the language used to create the markup language, not the language of the document content. HTML's DTDs are all written in English, so the HTML FPIs always use EN.

Looking at the source code for luma's home page, I'd personally be more concerned about starting the page with an empty comment tag. Starting a page with a comment feels like bad karma to me.

Crazy_Fool

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 541 posted 9:11 pm on Jul 12, 2002 (gmt 0)

>>Why is FAST not crawling my (private) site

don't panic, your site is not the only one not being fully spidered. a couple of weeks back, brett mentioned that FAST's spider wasn't crawling as normal. it could be that FAST are changing the crawl schedule, or they might be stopping the free crawling, or maybe something else. we'll have to wait and see.

luma

10+ Year Member



 
Msg#: 541 posted 2:41 am on Jul 15, 2002 (gmt 0)

Did you try the validators at www.w3c.org

All (but one) of my pages should validate.

Try adding one blank line at the end of your robots.txt

I just checked WMW's robots.txt and it doesn't have one either. But I'll give it a try.

I'd personally be more concerned about starting the page with an empty comment tag. Starting a page with a comment feels like bad karma to me.

I am not sure what you are talking about. I don't have an empty comment tag, do I?

we'll have to wait and see.

I guess that pretty much sums it up. ;)

Thanks for all of your help.

mbauser2

10+ Year Member



 
Msg#: 541 posted 9:01 am on Jul 15, 2002 (gmt 0)

I am not sure what you are talking about. I don't have an empty comment tag, do I?

As it turns out, you don't. I was using my roommate's computer last week (mine was dismantled while I worked on a hardware problem), and I didn't realize the advertising filter he's using with IE alters the source code of HTML pages.

luma

10+ Year Member



 
Msg#: 541 posted 9:21 pm on Aug 14, 2002 (gmt 0)

It's been some time, and yesterday I found the following in my logfile:
access.log.33.2:66.77.73.254 - - [13/Aug/2002:15:04:43 +0200]
"GET /widgets/blue.html HTTP/1.0" 200 21053 www.domain.com "-"
"FAST-WebCrawler/3.6/FirstPage (crawler @fast.no;
http*//fast.no/support.php?c=faqs/crawler)" "-"

It fetched two pages but no robots.txt. The last time "FAST-WebCrawler/3.6 (atw-crawler at fast dot no; http*//fast.no/support/crawler.asp)" fetched robots.txt was on July 23rd. It didn't fetch any pages.
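
A tally like the one described above (which crawler variant fetched robots.txt, and which fetched actual pages) can be pulled out of an access log with a short script. This is only a sketch: it assumes each log entry sits on a single line in the format quoted above, and the second sample line is invented for illustration (documentation IP address and a made-up timestamp), not taken from luma's log.

```python
import re
from collections import defaultdict

# Request line and status code, e.g. "GET /widgets/blue.html HTTP/1.0" 200
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')
# Quoted user-agent field that starts with FAST-WebCrawler
AGENT_RE = re.compile(r'"(?P<agent>FAST-WebCrawler[^"]*)"')

SAMPLE_LINES = [
    # Entry quoted in the post above, joined onto one line.
    '66.77.73.254 - - [13/Aug/2002:15:04:43 +0200] '
    '"GET /widgets/blue.html HTTP/1.0" 200 21053 www.domain.com "-" '
    '"FAST-WebCrawler/3.6/FirstPage (crawler @fast.no; '
    'http*//fast.no/support.php?c=faqs/crawler)" "-"',
    # Invented entry: address, time and size are placeholders.
    '192.0.2.1 - - [23/Jul/2002:12:00:00 +0200] '
    '"GET /robots.txt HTTP/1.0" 200 512 www.domain.com "-" '
    '"FAST-WebCrawler/3.6 (atw-crawler at fast dot no; '
    'http*//fast.no/support/crawler.asp)" "-"',
]


def summarize_fast_hits(lines):
    """Count robots.txt fetches and page fetches per FAST crawler variant."""
    summary = defaultdict(lambda: {"robots": 0, "pages": 0})
    for line in lines:
        request = REQUEST_RE.search(line)
        agent = AGENT_RE.search(line)
        if not request or not agent:
            continue  # not a FAST request, or not a parsable line
        # Keep only the product token, e.g. "FAST-WebCrawler/3.6/FirstPage".
        variant = agent.group("agent").split(" ", 1)[0]
        kind = "robots" if request.group("path") == "/robots.txt" else "pages"
        summary[variant][kind] += 1
    return dict(summary)


if __name__ == "__main__":
    for variant, counts in summarize_fast_hits(SAMPLE_LINES).items():
        print(variant, counts)
```

Grouping by the product token keeps "FAST-WebCrawler/3.6" and "FAST-WebCrawler/3.6/FirstPage" apart, which is the distinction being asked about here.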

So, is the Fast-Firstpage crawler the regular guy or did they finally read my e-mail and send some special bot? What do you think, (when) will those pages make it in the index?

Rumbas

WebmasterWorld Administrator 10+ Year Member



 
Msg#: 541 posted 11:19 am on Aug 15, 2002 (gmt 0)

Sorry to hear about your problems with Fast, luma. Things have been "fuzzy" lately, with reports of irregularities in spidering/updates.

It's very strange that the crawler didn't ask for the robots.txt. It should do that every time it accesses the site. Are you sure?

Actually I haven't noticed the "first page" crawler before. It could be anything from a regular crawler to a special bot they use to check a site out manually (I doubt that - manual checks would be too time consuming).

First page could mean a bot for new sites that they haven't got in the database yet?

Come on guys and girls, help luma out and check if you have the first page crawler in your logs:

"FAST-WebCrawler/3.6/FirstPage (crawler @fast.no; http*//fast.no/support.php? c=faqs/crawler)" "-"

-I can't get the url to the crawler info to resolve either.

Crazy_Fool

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 541 posted 11:21 am on Aug 15, 2002 (gmt 0)

WebCrawler 3.6 is one of those that hits some pages on some of my sites sometimes. it's been a long time since FAST regularly and fully spidered any of my sites or the sites i look after. luma, i'd say you just gotta keep on waiting ..... relax and move onto something else for now and see what happens in a couple of months time ...

heini

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 541 posted 11:22 am on Aug 15, 2002 (gmt 0)

Firstpage is not new, see here:
[webmasterworld.com...]
[webmasterworld.com...]

Rumbas

WebmasterWorld Administrator 10+ Year Member



 
Msg#: 541 posted 11:44 am on Aug 15, 2002 (gmt 0)

>first page not new

Ooops! Gotta use that Site Search before pushing the submit button.

Thanks heini ;)

luma

10+ Year Member



 
Msg#: 541 posted 7:56 pm on Aug 15, 2002 (gmt 0)

Well, FirstPage sounds better than Nirvana. ;)

You should be able to access FAST's crawler pages:

FAST-WebCrawler/3.6 atw-crawler at fast dot no;
FAST Web Crawler >> FAQs
[fast.no...]

FAST-WebCrawler/3.6/FirstPage crawler @fast.no
FAST Customer Support
[fast.no...]

That second address gets redirected to [fast.no...] where you will find a link to FAST's Web Crawler FAQ (see above).

heini

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 541 posted 8:14 pm on Aug 15, 2002 (gmt 0)

FAST-WebCrawler/3.6/FirstPage _IS_ the regular guy - just checked a site where this was the only Fast crawler this month, walking all the way through the site.

>What do you think, (when) will those pages make it in the index?
Luma - no predictions here. My impression is that Fast is working on something which has somewhat slowed down its crawling cycles and its picking up of new sites.

luma

10+ Year Member



 
Msg#: 541 posted 1:01 pm on Aug 28, 2002 (gmt 0)

I am in! :) They updated part of their index, I guess. I watched a query that got me three results. Now it's up to five, and one is one of the three pages the crawler was fetching. Hope it will fetch some more pages and not just the three that are in dmoz.

Thank you for helping and cheering me up. :)

Rumbas

WebmasterWorld Administrator 10+ Year Member



 
Msg#: 541 posted 2:14 pm on Aug 28, 2002 (gmt 0)

Congrats Luma :)

Hopefully it will spawn some traffic too?

Let us know how it goes and what partners you get referrals from?

Pushycat

10+ Year Member



 
Msg#: 541 posted 2:52 pm on Aug 28, 2002 (gmt 0)

I'm glad luma's issue is solved but it bothers me that this user agent has been in my logs nearly every day lately. It always gets my index.asp page first, then it comes back later -- sometimes the next day -- to get my robots.txt file, and then it disappears for one or two days. The robots.txt file does not exclude FAST except for the directories that I exclude all user agents from.
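
One way to double-check that a robots.txt really does leave a given crawler free to fetch a page is Python's standard urllib.robotparser module. The sketch below is an illustration only: the robots.txt content and the paths are hypothetical stand-ins, not the actual files discussed in this thread, and Python's parser is not necessarily how FAST's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes some directories for all user agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    agent = "FAST-WebCrawler/3.6/FirstPage"
    for path in ("/index.asp", "/private/notes.html"):
        print(path, is_allowed(ROBOTS_TXT, agent, path))
```

If this reports a page as allowed and the crawler still stays away, the robots.txt rules themselves are probably not the cause.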

luma

10+ Year Member



 
Msg#: 541 posted 3:11 pm on Sep 3, 2002 (gmt 0)

And out again. :( Only the two 1B sized links (see thread Weird results (1 B) size on "more hits from" click [webmasterworld.com]) are in. I was watching one specific query that changed from 5 results to 9 and now 8.

That's it. I will completely ignore FAST (what a funny name for something this slow) until they crawl and list all of my site.

<random google success story>
A friend of mine finally got his domain online. I linked to two of his pages (frameset and deep) on Aug. 28th. On Sept. 1st those two pages were in!
</random google success story>
