Forum Moderators: open
*However*...
Almost 70 of the new 'pages' are actually images. This is in the main Google search results, not Google Image Search, where my site still has zero results.
I'm not sure why this is happening, but I guess Google is relying almost entirely on filename suffixes to determine what type of file a URL is. This is unfortunate. All of the images in question are either of the form /somesite.com/Archive/187 or /somesite.com/Archive/187?display=size
The files *do* have an appropriate Content-Type header (such as image/jpeg), but they are showing up in the result pages as 'File Format: Unrecognized'.
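For what it's worth, suffix-only type guessing is easy to illustrate with Python's standard library, which sniffs the type from the filename in much the same way (the paths below are the placeholder URLs from the post):

```python
import mimetypes

# Guessing the type from the URL path alone, as Google appears to do:
# the extensionless archive URLs yield no type at all.
print(mimetypes.guess_type("/somesite.com/Archive/187"))
# -> (None, None): no suffix, no guess

print(mimetypes.guess_type("/somesite.com/Archive/photo.jpg"))
# -> ('image/jpeg', None): the suffix alone decides
```

Only the server's Content-Type header can disambiguate the first case, which is exactly what Google seems to be ignoring.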
Does anyone have a clue as to what's going on? I find it a bit hard to believe that Google just ignores the files' content-type completely.
I guess Google is relying on filename suffixes almost entirely in determining what type of file a URL is.
I'm afraid so. As you're seeing 'Format: Unrecognized' in the results (presumably with a rather pointless 'View as HTML'), the URLs are being fetched.
If you use the Robots Exclusion Protocol via /robots.txt, you'll save bandwidth since Google won't fetch the URLs, but the URLs will most likely still be listed.
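A minimal /robots.txt along those lines (assuming the images all live under /Archive/, as in the posted URLs) would be:

```
User-agent: Googlebot
Disallow: /Archive/
```

As said, this stops the fetches, but the bare URLs can still show up in results.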
One approach would be to put the images inside HTML documents, where you can add the META robots tag. META robots "noindex" causes Google not to list the URL if Googlebot fetches the content.
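The wrapper page only needs the one tag in its <head>; a sketch (the title and alt text are placeholders):

```html
<html>
<head>
  <title>Archive image 187</title>
  <meta name="robots" content="noindex">
</head>
<body><img src="/Archive/187" alt="description of the image"></body>
</html>
```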
Alternatively, I'd be tempted just to keep those links out of the HTML source for User-Agent "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" (or more sensibly anything starting "Googlebot/"). This is search engine cloaking, but it's hard to imagine the Google search quality folk throwing a site out for withholding links to images.
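The prefix test itself is trivial; a sketch in Python (the helper name is made up, and how you wire it into your page generation depends on your setup):

```python
def withhold_image_links(user_agent: str) -> bool:
    # Hide the image links from any User-Agent beginning "Googlebot/".
    # Note this does NOT match "Googlebot-Image", which uses "-" not "/",
    # so the image robot would still see the links.
    return user_agent.startswith("Googlebot/")

print(withhold_image_links("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))  # True
print(withhold_image_links("Googlebot-Image/1.0"))  # False
print(withhold_image_links("Mozilla/5.0"))  # False
```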
The original image is located at somesite.com/archive/188, and additional sizes/formats are located at somesite.com/archive/188?display=size, but the default HTML page for the image is at somesite.com/archive/188/view
This means that if I have Google block access to the original file, I will also be blocking access to the HTML page (at least with the robots.txt option).
I suppose I can do some work to put the default HTML rendering at /188 and the original file at /188/original, or some such. I'll have to think on it some more.
In any case, I don't really want to block Google from downloading the images at all (well, perhaps I don't need Google to download *all* the sizes), I just want them to show up in the Image search results instead of the main results.
I am still battling disbelief over Google ignoring the content-type header. It makes no sense to me to rely exclusively on the filename in this way.
Yet, this does cause a problem with Google. Personally, I'd be inclined to protect Googlebot from those links, and just have the image once in an <IMG> element with sensible alt text for the image robot. Alternatively, hiding the links from User-Agents beginning 'Googlebot/' would not hide them from the image robot, as it is "Googlebot-Image" ("-" not "/").
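So each page would carry the image exactly once, along these lines (the alt text is a placeholder):

```html
<img src="/archive/188?display=small" alt="short description of the photo">
```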
> battling disbelief
Perhaps I should also mention that Google doesn't index "/index.html" even if the content returned is different from "/" :-)
Hmm.
Well, what is included on the page in an <img> tag is the /188?display=small version. I *need* to include links on the page to the other versions, including the original image.
Perhaps I could put these links into a pop-up, but I suspect this will impact usability.
> Alternatively, hiding the links from User-Agents beginning 'Googlebot/' would not hide them from the image robot, as it is "Googlebot-Image" ("-" not "/").
I am extremely reluctant to start hiding content for any reason.
>> battling disbelief
> Perhaps I should also mention that Google doesn't index "/index.html" even if the content returned is different from "/" :-)
That's hardly the same thing, is it? I'm not complaining that Google is crawling or indexing my site, just that they're ignoring a basic and extremely informative building block of HTTP.
I mean, it sounds as though a webmaster could hide their content from Google just by giving the page a name like /page.jpg, which is insane.
Would Javascript-cloaking be much different? Although in that case you'd be sending the same string of characters to Googlebot, you'd purposely be making links available to 'normal' users and not the robot. Not that I'd challenge your motivation for considering it in this case; after all, Google's own AdSense service uses technology also used by some cloakers engaging in high-risk SEO.
I also would avoid that approach from a usability perspective.
A similar approach is to use a form (with a local URL for the action), but you'd need to use submit buttons or an image submit. Also, you'd need to give each 'link' its own form or use software to serve or redirect to the correct image.
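A sketch of the form approach (the action path is an assumption; your server would need to map the submit back to the right image):

```html
<!-- One form per image 'link'; robots generally don't submit forms. -->
<form method="post" action="/archive/188/original">
  <input type="submit" value="View original size">
</form>
```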
I do not suggest that any of the options I mention are ideal.
My "/index.html" vs "/" comment just relates to Google inferring more from HTTP than is safe. Whether it stems from filename conventions or from the DirectoryIndex default of a popular server, I see rather similar arguments for and against a popular search engine making these kinds of assumptions when building its robots.
Yes, I understood that.
*However*, my point is that in my case, Google is inferring far *less* from HTTP than is safe. I'm not aware of any browser that simply ignores the content-type header wholesale.
This still seems odd at least, if not downright pathological.