Forum Moderators: open
*However*...
Almost 70 of the new 'pages' are actually images. This is in the main Google search results, not Google Image Search, where my site still has zero results.
I'm not sure why this is happening, but I guess Google is relying almost entirely on filename suffixes to determine what type of file a URL is. This is unfortunate. All of the images in question are either of the form /somesite.com/Archive/187 or /somesite.com/Archive/187?display=size
The files *do* have an appropriate Content-Type header (such as image/jpeg), but they are showing up in the result pages as 'File Format: Unrecognized'.
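For what it's worth, suffix-only type guessing is easy to illustrate with Python's standard library, which sniffs the type from the filename in much the same way (the paths below are the placeholder URLs from the post):

```python
import mimetypes

# Guessing the type from the URL path alone, as Google appears to do:
# the extensionless archive URLs yield no type at all.
print(mimetypes.guess_type("/somesite.com/Archive/187"))
# -> (None, None): no suffix, no guess

print(mimetypes.guess_type("/somesite.com/Archive/photo.jpg"))
# -> ('image/jpeg', None): the suffix alone decides
```

Only the server's Content-Type header can disambiguate the first case, which is exactly what Google seems to be ignoring.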
Does anyone have a clue as to what's going on? I find it a bit hard to believe that Google just ignores the files' content-type completely.
I guess Google is relying on filename suffixes almost entirely in determining what type of file a URL is.
I'm afraid so. As you're seeing 'Format: Unrecognized' in the results (presumably with a rather pointless 'View as HTML'), the URLs are being fetched.
If you use the Robots Exclusion Protocol via /robots.txt, you'll save bandwidth since Google won't fetch the URLs, but the URLs will most likely still be listed.
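A minimal /robots.txt along those lines (assuming the images all live under /Archive/, as in the posted URLs) would be:

```
User-agent: Googlebot
Disallow: /Archive/
```

As said, this stops the fetches, but the bare URLs can still show up in results.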
One approach would be to put the images inside HTML documents, where you can add the META robots tag. META robots "noindex" causes Google not to list the URL if Googlebot fetches the content.
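The wrapper page only needs the one tag in its <head>; a sketch (the title and alt text are placeholders):

```html
<html>
<head>
  <title>Archive image 187</title>
  <meta name="robots" content="noindex">
</head>
<body><img src="/Archive/187" alt="description of the image"></body>
</html>
```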
Alternatively, I'd be tempted just to keep those links out of the HTML source for User-Agent "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" (or more sensibly anything starting "Googlebot/"). This is search engine cloaking, but it's hard to imagine the Google search quality folk throwing a site out for withholding links to images.
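The prefix test itself is trivial; a sketch in Python (the helper name is made up, and how you wire it into your page generation depends on your setup):

```python
def withhold_image_links(user_agent: str) -> bool:
    # Hide the image links from any User-Agent beginning "Googlebot/".
    # Note this does NOT match "Googlebot-Image", which uses "-" not "/",
    # so the image robot would still see the links.
    return user_agent.startswith("Googlebot/")

print(withhold_image_links("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))  # True
print(withhold_image_links("Googlebot-Image/1.0"))  # False
print(withhold_image_links("Mozilla/5.0"))  # False
```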
The original image is located at somesite.com/archive/188, and additional sizes/formats are located at somesite.com/archive/188?display=size, but the default HTML page for the image is at somesite.com/archive/188/view
This means that if I have Google block access to the original file, I will also be blocking access to the HTML page (at least with the robots.txt option).
I suppose I can do some work to put the default HTML rendering at /188 and the original file at /188/original, or some such. I'll have to think on it some more.
In any case, I don't really want to block Google from downloading the images at all (well, perhaps I don't need Google to download *all* the sizes), I just want them to show up in the Image search results instead of the main results.
I am still battling disbelief over Google ignoring the content-type header. It makes no sense to me to rely exclusively on the filename in this way.
Yet, this does cause a problem with Google. Personally, I'd be inclined to protect Googlebot from those links, and just have the image once in an <IMG> element with sensible alt text for the image robot. Alternatively, hiding the links from User-Agents beginning 'Googlebot/' would not hide them from the image robot, as it is "Googlebot-Image" ("-" not "/").
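So each page would carry the image exactly once, along these lines (the alt text is a placeholder):

```html
<img src="/archive/188?display=small" alt="short description of the photo">
```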
> battling disbelief
Perhaps I should also mention that Google doesn't index "/index.html" even if the content returned is different from "/" :-)
Hmm.
Well, what is included on the page in an <img> tag is the /188?display=small version. I *need* to include links on the page to the other versions, including the original image.
Perhaps I could put these links into a pop-up, but I suspect this will impact usability.
> Alternatively, hiding the links from User-Agents beginning 'Googlebot/' would not hide them from the image robot, as it is "Googlebot-Image" ("-" not "/").
I am extremely reluctant to start hiding content for any reason.
>> battling disbelief
> Perhaps I should also mention that Google doesn't index "/index.html" even if the content returned is different from "/" :-)
That's hardly the same thing, is it? I'm not complaining that Google is crawling or indexing my site, just that they're ignoring a basic and extremely informative building block of HTTP.
I mean, it sounds as though a webmaster could hide their content from Google just by giving the page a name like /page.jpg, which is insane.
Would Javascript-cloaking be much different? Although in that case you'd be sending the same string of characters to Googlebot, you'd purposely be making links available to 'normal' users and not the robot. Not that I'd challenge your motivation for considering it in this case; after all, Google's own AdSense service uses technology also used by some cloakers engaging in high-risk SEO.
I also would avoid that approach from a usability perspective.
A similar approach is to use a form (with a local URL for the action), but you'd need to use submit buttons or an image submit. Also, you'd need to give each 'link' its own form or use software to serve or redirect to the correct image.
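A sketch of the form approach (the action path is an assumption; your server would need to map the submit back to the right image):

```html
<!-- One form per image 'link'; robots generally don't submit forms. -->
<form method="post" action="/archive/188/original">
  <input type="submit" value="View original size">
</form>
```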
I do not suggest that any of the options I mention are ideal.
My "/index.html" vs "/" comment just relates to Google inferring more from HTTP than is safe. Whether it stems from filename conventions or from the DirectoryIndex default of a popular server, I see rather similar arguments for and against a popular search engine making these kinds of assumptions when building its robots.
Yes, I understood that.
*However*, my point is that in my case, Google is inferring far *less* from HTTP than is safe. I'm not aware of any browser that simply ignores the content-type header wholesale.
This still seems odd at least, if not downright pathological.