Google News Archive Forum

    
Indexing scripted pages; non-.html endings
How do dynamic pages get crawled?
dblobaum

10+ Year Member



 
Msg#: 7189 posted 7:32 pm on Nov 26, 2002 (gmt 0)

Much of the website for the University of Chicago Press at www.press.example.edu/ is indexed by Google and our homepage appears (from the Toolbar and the directory ranking) to have reasonably high PageRank. Many other pages on our site are also included in the index.

There is however a whole class of our pages that do not appear in the index and I wonder why they do not and how they can get included. These are the pages on which we describe our individual publications such as this one: www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl
Our robots.txt file does not exclude these pages.
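
For anyone who wants to double-check the robots.txt point, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents and the helper name is_allowed are made up for illustration; they are not the actual file from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", page))
```

If a check like this says the URL is allowed, robots.txt can be ruled out and the URL format itself becomes the more likely suspect.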

Perhaps the formulation of the URL leads to its exclusion? Which part? There are 6.5 million Google-indexed pages that include "cgi-bin" in their URL. However there are only 8 that include "hfs.cgi". Or perhaps the ".ctl" extension is the problem. Is there any way that we can tell the Google crawler to go ahead and index these pages?

I realize that dynamic pages can be problematic. But many dynamic pages do get indexed and I should think that our high PR would allow our dynamic pages into the index. Is there a specific bit of the URL that causes the robot not to crawl these pages?

Thanks for your help.

Dean Blobaum
Chicago

[edited by: ciml at 8:08 pm (utc) on Nov. 26, 2002]
[edit reason] No specifics please. [/edit]

 

ciml

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 8:44 pm on Nov 26, 2002 (gmt 0)

Welcome to WebmasterWorld [webmasterworld.com], Dean.

If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.

.ctl may be a different matter. We discussed this topic recently [webmasterworld.com] but I don't think a definitive answer was reached.

I don't want to jump to an unfounded conclusion here, but it appears that Google doesn't crawl URLs that look like they might have unknown file extensions. The WWW approach would be to use <A href="/whatever.ctl" type="text/html">, but I don't think I've ever seen it used by any user agent application, or any HTML document.

There should be no reason not to use .gif URLs for HTML and .html URLs for GIFs, as long as the Web server advertises the content-type correctly. Seeing as both IE5 and Googlebot seem to guess content-type from URLs, this becomes a moot point.
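
To make the "guessing content-type from the URL" idea concrete, here is a rough Python sketch. It only illustrates the hypothesis discussed above and is not Googlebot's actual logic; the KNOWN_EXTENSIONS set and both function names are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical allow-list. Whether a real crawler keeps such a list,
# and what is on it, is unknown.
KNOWN_EXTENSIONS = {"", ".htm", ".html", ".shtml", ".cgi", ".php", ".asp"}


def url_extension(url):
    """Return the lower-cased extension of the last path segment, or ''."""
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    return posixpath.splitext(last_segment)[1].lower()


def looks_crawlable(url):
    """Guess from the URL alone, without fetching, whether it is a page."""
    return url_extension(url) in KNOWN_EXTENSIONS


if __name__ == "__main__":
    for url in (
        "http://www.press.example.edu/",
        "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl",
    ):
        print(url, looks_crawlable(url))
```

A crawler working this way would skip the .ctl URL without ever seeing the Content-Type header the server sends, which is why advertising the type correctly would not help.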

NFFC

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 9:47 pm on Nov 26, 2002 (gmt 0)

Hi Dean and welcome to WebmasterWorld,

>formulation of the URL leads to its exclusion? Which part?

Too many "." imho.

dblobaum

10+ Year Member



 
Msg#: 7189 posted 3:51 pm on Nov 27, 2002 (gmt 0)

NFFC said:

>>formulation of the URL leads to its exclusion? Which part?

>Too many "." imho.

The Googlebot expects a file extension after the first "." perhaps? I don't know; I don't think data bears that out. If you drop this in the Google search box: "allinurl:s.cgi" you get 16,900 results. Not all have multiple "." in the URL but many do. (You can probably substitute any letter of the alphabet for "s" and get some results.)

The file extension of ".ctl" may be a more likely offender. Is there any way of searching for specific extensions in Google?

Thanks for your help.

Dean Blobaum
Chicago

ciml

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 6:08 pm on Nov 27, 2002 (gmt 0)

Yes! The filetype: operator doesn't just work with the types in the drop-down list on the advanced search page.

word filetype:ctl

0 results.

This is fun, though:

word filetype:htm
word filetype:html
word filetype:shtml
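
If you want to run a batch of these comparisons, a small Python sketch like the following builds the search URLs. The function name search_url and the base URL are assumptions for illustration, and nothing here fetches or counts results.

```python
from urllib.parse import urlencode

# Assumed search endpoint, used only to build links.
SEARCH_BASE = "https://www.google.com/search"


def search_url(term, extension):
    """Build a search URL for `term filetype:extension`."""
    query = f"{term} filetype:{extension}"
    # urlencode turns the space into '+' and the colon into '%3A'.
    return SEARCH_BASE + "?" + urlencode({"q": query})


if __name__ == "__main__":
    for ext in ("ctl", "htm", "html", "shtml"):
        print(search_url("word", ext))
```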

dblobaum

10+ Year Member



 
Msg#: 7189 posted 7:17 pm on Nov 27, 2002 (gmt 0)

>Yes! The filetype: operator doesn't just work with the types in the drop-down list on the advanced search page.

>word filetype:ctl

>0 results.

>This is fun, though:

Thanks, ciml!

More fun? Try your initials, maybe. I got 314 for my filetype. Maybe we'll end all our pages with that extension.

db
chicago

NFFC

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 7:25 pm on Nov 27, 2002 (gmt 0)

db, are any of the problem URLs linked to from a static/regular URL?

ciml

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 7:30 pm on Nov 27, 2002 (gmt 0)

Use "+com" as the search and you'll find many .ctl URLs.

But all of them have ?something after the .ctl

Google search for +com filetype:ctl [google.com]

dblobaum

10+ Year Member



 
Msg#: 7189 posted 7:33 pm on Nov 27, 2002 (gmt 0)

NFFC asked:

>db, are any of the problem URLs linked to from a static/regular URL?

Yes. All or almost all of them are linked from static pages within our own site. A significant number of them are linked from other sites and pages (that is, Google-indexed sites and pages).

Thanks.

Dean

dblobaum

10+ Year Member



 
Msg#: 7189 posted 6:18 pm on Dec 2, 2002 (gmt 0)

On Nov. 26 ciml said:

>If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.

>.ctl may be a different matter. We discussed this topic recently but I don't think a definitive answer was reached.

Today I looked for pages in Google with hfs.cgi/00/ in the URL [allinurl:hfs.cgi/00/]. There are now 257 pages in the index, 256 of which have a .ctl extension. So it appears that Google has started to crawl and index these pages within the last few days.

However, the pages have no PR--according to the Toolbar at any rate (which perhaps I ought to take with more than a few grains of salt). Why is that, when our site itself has a PR of 9?

DB

ciml

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 7189 posted 2:12 pm on Dec 3, 2002 (gmt 0)

Good news, Dean. You needn't worry about the Toolbar; your listings seem to be from the Everflux [webmasterworld.com] so the PR won't show until the next Google Update [webmasterworld.com]. Some types of URLs don't show, but without "?" characters I think yours will.

The +com filetype:ctl [google.com] search doesn't show a sudden proliferation of .ctl endings; I wonder why?

- You are the only person who uses URLs ending in .ctl

- Other people will get their URLs ending in .ctl indexed at the full update.

- URLs ending in .ctl weren't the problem, and it just took you a long time to get yours indexed.

I don't think it's the last one.
