There is, however, a whole class of our pages that does not appear in the index, and I wonder why not and how they can get included. These are the pages on which we describe our individual publications, such as this one: www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl
Our robots.txt file does not exclude these pages.
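For anyone who wants to double-check a claim like that, here is a minimal sketch using Python's standard urllib.robotparser. The rules below are hypothetical placeholders, not our actual robots.txt, and the is_allowed helper is just a name made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt lines let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines, no network access
    return parser.can_fetch(agent, url)


# Hypothetical rules for illustration only (not the real file).
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

URL = "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl"

# Nothing in the sample rules covers /cgi-bin/, so the page is allowed.
print(is_allowed(SAMPLE_RULES, URL))
# Adding a Disallow for /cgi-bin/ would block it.
print(is_allowed(SAMPLE_RULES + ["Disallow: /cgi-bin/"], URL))
```

If the first check comes back allowed for the real file, robots.txt can be ruled out and the URL itself becomes the more likely suspect.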
Perhaps the formulation of the URL leads to its exclusion? Which part? There are 6.5 million Google-indexed pages that include "cgi-bin" in their URL. However, there are only 8 that include "hfs.cgi". Or perhaps the ".ctl" extension is the problem. Is there any way that we can tell the Google crawler to go ahead and index these pages?
I realize that dynamic pages can be problematic. But many dynamic pages do get indexed and I should think that our high PR would allow our dynamic pages into the index. Is there a specific bit of the URL that causes the robot not to crawl these pages?
Thanks for your help.
[edited by: ciml at 8:08 pm (utc) on Nov. 26, 2002]
[edit reason] No specifics please. [/edit]
If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.
.ctl may be a different matter. We discussed this topic recently [webmasterworld.com] but I don't think a definitive answer was reached.
I don't want to jump to an unfounded conclusion here, but it appears that Google doesn't crawl URLs that look like they might have unknown file extensions. The WWW approach would be to use <A href="/whatever.ctl" type="text/html">, but I don't think I've ever seen it used by any user agent application, or any HTML document.
There should be no reason not to use .gif URLs for HTML and .html URLs for GIFs, as long as the Web server advertises the content-type correctly. Seeing as both IE5 and Googlebot seem to guess content-type from URLs, this becomes a moot point.
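To make the distinction concrete, here is a rough sketch of the two ways a client could decide what a URL contains: guessing from the extension, or reading the Content-Type header the server sends. The extension table and both function names are assumptions made up for illustration; nobody outside Google knows what Googlebot actually does.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical table of extensions a crawler might recognise (an assumption).
KNOWN_EXTENSIONS = {
    ".html": "text/html",
    ".htm": "text/html",
    ".gif": "image/gif",
}


def guess_from_url(url):
    """Guess a media type from the URL's extension; None if it is unknown."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return KNOWN_EXTENSIONS.get(ext)


def declared_type(content_type_header):
    """Extract the media type from a Content-Type header value."""
    return content_type_header.split(";", 1)[0].strip().lower()


url = "http://www.press.example.edu/cgi-bin/hfs.cgi/00/15333.ctl"
print(guess_from_url(url))                            # extension is not in the table
print(declared_type("text/html; charset=ISO-8859-1"))  # what a server might declare
```

A client that trusts the header would treat the .ctl page as ordinary HTML; one that guesses from the URL would have nothing to go on.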
>>formulation of the URL leads to its exclusion? Which part?
>Too many "." imho.
The Googlebot expects a file extension after the first "." perhaps? I don't know; I don't think data bears that out. If you drop this in the Google search box: "allinurl:s.cgi" you get 16,900 results. Not all have multiple "." in the URL but many do. (You can probably substitute any letter of the alphabet for "s" and get some results.)
The file extension of ".ctl" may be a more likely offender. Is there any way of searching for specific extensions in Google?
Thanks for your help.
>This is fun, though:
More fun? Try your initials, maybe. I got 314 for my filetype. Maybe we'll end all our pages with that extension.
>db, are any of the problem URLs linked to from a static/regular URL?
Yes. All or almost all of them are linked from static pages within our own site. A significant number of them are linked from other sites and pages (that is, Google-indexed sites and pages).
>If you have plenty of PageRank in the page that links to the missing page, then you should normally expect it to be spidered despite the cgi-bin or .cgi URL components.
>.ctl may be a different matter. We discussed this topic recently but I don't think a definitive answer was reached.
Today I looked for pages in Google with hfs.cgi/00/ in the URL [allinurl:hfs.cgi/00/]. There are now 257 pages in the index, 256 of which have a .ctl extension. So it appears that Google has started to crawl and index these pages within the last few days.
However, the pages have no PR, at least according to the Toolbar (which perhaps I ought to take with more than a few grains of salt). Why is that, when our site itself has a PR of 9?
The +com filetype:ctl [google.com] search doesn't show a sudden proliferation of .ctl endings; I wonder why?
- You are the only person who uses URLs ending in .ctl
- Other people will get their URLs ending in .ctl indexed at the full update.
- URLs ending in .ctl weren't the problem, and it just took you a long time to get yours indexed.
I don't think it's the last of these.