Indexing scripted pages; non-.html endings - (deprecated) Google News Archive forum at WebmasterWorld - WebmasterWorld

Indexing scripted pages; non-.html endings

How do dynamic pages get crawled?

dblobaum

7:32 pm on Nov 26, 2002 (gmt 0)

10+ Year Member

Much of the website for the University of Chicago Press at www.press.example.edu/ is indexed by Google, and our homepage appears (from the Toolbar and the directory ranking) to have reasonably high PageRank. Many other pages on our site are also included in the index.

There is, however, a whole class of our pages that does not appear in the index, and I wonder why not and how these pages can get included. These are the pages on which we describe our individual publications, such as this one: www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl
Our robots.txt file does not exclude these pages.
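
For anyone who wants to double-check the robots.txt point, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents and the /private/ path are made up for illustration; they are not the site's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The publication URL from the post, and a hypothetical excluded URL.
PAGE = "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl"
HIDDEN = "http://www.press.example.edu/private/x.html"

allowed = is_allowed(ROBOTS_TXT, "Googlebot", PAGE)
blocked = is_allowed(ROBOTS_TXT, "Googlebot", HIDDEN)
```

This only confirms what the file permits. It says nothing about whether a crawler actually chooses to fetch the page.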

Perhaps the formulation of the URL leads to its exclusion? Which part? There are 6.5 million Google-indexed pages that include "cgi-bin" in their URL. However there are only 8 that include "hfs.cgi". Or perhaps the ".ctl" extension is the problem. Is there any way that we can tell the Google crawler to go ahead and index these pages?

I realize that dynamic pages can be problematic. But many dynamic pages do get indexed and I should think that our high PR would allow our dynamic pages into the index. Is there a specific bit of the URL that causes the robot not to crawl these pages?

Thanks for your help.

Dean Blobaum
Chicago

[edited by: ciml at 8:08 pm (utc) on Nov. 26, 2002]
[edit reason] No specifics please. [/edit]

ciml

8:44 pm on Nov 26, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Welcome to WebmasterWorld [webmasterworld.com], Dean.

If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.

.ctl may be a different matter. We discussed this topic recently [webmasterworld.com] but I don't think a definitive answer was reached.

I don't want to jump to an unfounded conclusion here, but it appears that Google doesn't crawl URLs that look like they might have unknown file extensions. The WWW approach would be to use <A href="/whatever.ctl" type="text/html">, but I don't think I've ever seen it used by any user agent application, or any HTML document.

There should be no reason not to use .gif URLs for HTML and .html URLs for GIFs, as long as the Web server advertises the content-type correctly. Seeing as both IE5 and Googlebot seem to guess content-type from URLs, this becomes a moot point.
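
To make the distinction concrete, here is a rough sketch of the two approaches: trusting the server's Content-Type header versus guessing from the URL ending. The extension table and function names are assumptions for illustration only; nobody here knows how Googlebot or IE5 actually decide.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed table of endings a client might recognise; purely illustrative.
KNOWN_EXTENSIONS = {
    ".html": "text/html",
    ".htm": "text/html",
    ".shtml": "text/html",
    ".gif": "image/gif",
}

def guess_type_from_url(url):
    """Guess a media type from the last path segment's extension, or None."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return KNOWN_EXTENSIONS.get(extension)

def effective_type(url, content_type_header=None):
    """Prefer the advertised Content-Type; fall back to the URL guess."""
    if content_type_header:
        # Drop parameters such as "; charset=ISO-8859-1".
        return content_type_header.split(";", 1)[0].strip().lower()
    return guess_type_from_url(url)

PAGE = "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl"
by_url = guess_type_from_url(PAGE)
by_header = effective_type(PAGE, "text/html; charset=ISO-8859-1")
```

A client that relies only on the URL guess has nothing to go on for a .ctl ending, while one that reads the header sees text/html regardless of the ending.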

NFFC

9:47 pm on Nov 26, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi Dean, and welcome to WebmasterWorld,

>formulation of the URL leads to its exclusion? Which part?

Too many "." imho.

dblobaum

3:51 pm on Nov 27, 2002 (gmt 0)

10+ Year Member

NFFC said:

>>formulation of the URL leads to its exclusion? Which part?

>Too many "." imho.

The Googlebot expects a file extension after the first "." perhaps? I don't know; I don't think data bears that out. If you drop this in the Google search box: "allinurl:s.cgi" you get 16,900 results. Not all have multiple "." in the URL but many do. (You can probably substitute any letter of the alphabet for "s" and get some results.)

The file extension of ".ctl" may be a more likely offender. Is there any way of searching for specific extensions in Google?

Thanks for your help.

Dean Blobaum
Chicago

ciml

6:08 pm on Nov 27, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Yes! The filetype: operator doesn't just work with the types in the drop-down list on the advanced search page.

word filetype:ctl

0 results.

This is fun, though:

word filetype:htm
word filetype:html
word filetype:shtml

dblobaum

7:17 pm on Nov 27, 2002 (gmt 0)

10+ Year Member

>Yes! The filetype: operator doesn't just work with the types in the drop-down list on the advanced search page.

>word filetype:ctl

>0 results.

>This is fun, though:

Thanks, ciml!

More fun? Try your initials, maybe. I got 314 for my filetype. Maybe we'll end all our pages with that extension.

db
chicago

NFFC

7:25 pm on Nov 27, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

db, are any of the problem URLs linked to from a static/regular URL?

ciml

7:30 pm on Nov 27, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Use "+com" as the search and you'll find many .ctl URLs.

But all of them have ?something after the .ctl

Google search for +com filetype:ctl [google.com]
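
For comparing URL shapes like these, a small sketch that separates the path ending from any query string and counts the dots in the path. The example.com URL is hypothetical; the other is the publication URL from the first post.

```python
import posixpath
from typing import NamedTuple
from urllib.parse import urlsplit

class UrlShape(NamedTuple):
    extension: str     # extension of the last path segment, e.g. ".ctl"
    has_query: bool    # True when something follows a "?" in the URL
    dots_in_path: int  # number of "." characters in the path

def url_shape(url: str) -> UrlShape:
    """Summarise the parts of a URL discussed in this thread."""
    parts = urlsplit(url)
    extension = posixpath.splitext(parts.path)[1].lower()
    return UrlShape(extension, bool(parts.query), parts.path.count("."))

# URL ending in .ctl with nothing after it.
bare = url_shape("http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl")
# Hypothetical URL with a query string after the .ctl.
with_query = url_shape("http://www.example.com/page.ctl?id=1")
```

Both report a .ctl extension on the path, but only the second has a query string, which is the difference seen in the search results.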

dblobaum

7:33 pm on Nov 27, 2002 (gmt 0)

10+ Year Member

NFFC asked:

>db, are any of the problem URLs linked to from a static/regular URL?

Yes. All or almost all of them are linked from static pages within our own site. A significant number of them are linked from other sites and pages (that is, Google-indexed sites and pages).

Thanks.

Dean

dblobaum

6:18 pm on Dec 2, 2002 (gmt 0)

10+ Year Member

On Nov. 26 ciml said:

>If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.

>.ctl may be a different matter. We discussed this topic recently but I don't think a definitive answer was reached.

Today I looked for pages in Google with hfs.cgi/00/ in the URL [allinurl:hfs.cgi/00/]. There are now 257 pages in the index, 256 of which have a .ctl extension. So it appears that Google has started to crawl and index these pages within the last few days.

However, the pages have no PR--according to the Toolbar at any rate (which perhaps I ought to take with more than a few grains of salt). Why is that, when our site itself has a PR of 9?

DB

ciml

2:12 pm on Dec 3, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Good news, Dean. You needn't worry about the Toolbar; your listings seem to be from the Everflux [webmasterworld.com] so the PR won't show until the next Google Update [webmasterworld.com]. Some types of URLs don't show, but without "?" characters I think yours will.

The +com filetype:ctl [google.com] search doesn't show a sudden proliferation of .ctl endings; I wonder why?

- You are the only person who uses URLs ending in .ctl.

- Other people will get their URLs ending in .ctl indexed at the full update.

- URLs ending in .ctl weren't the problem, and it just took you a long time to get yours indexed.

I don't think it's the last of these.