|Google Images indexing - how do these guys get so many?|
Just reading an article on a photo business news site and they're discussing Google Images with several of the stock photo agencies.
One company has over 56,000 images indexed. Another has over 2.8 million.
I used to have over 4000 original <high-profile niche> photos indexed and that dropped to about 430 after the big Google Images massacre in Summer '06.
Since then we've added about another 2000 photos. Would love to know how to get the numbers back up.
[edited by: Robert_Charlton at 10:41 pm (utc) on July 23, 2008]
[edit reason] removed specifics [/edit]
So would a lot of members here, as you can see through a Site Search [webmasterworld.com]. Google Images is a pretty mysterious service.
What kinds of things have you studied about the successful sites? Are their backlinks stronger, or more varied, or pointing to more deep pages? Is there something in common with the site structure?
It would help a lot of people if we could discuss this in more detail without actually naming the specific websites.
Well, one thing about the stock photo sites is that their photo descriptions are varied and quite specific:
Man buying widget
Woman fixing widget in Texas field
Chinese students studying widgets
All the sorts of things that people making brochures about widgets are looking for.
In some categories there are only so many ways to skin a cat. An example I've used here before:
Tedster at the Webmasterworld ceremony. If I'm taking several photos (say, hundreds at this prestigious event), there's some duplication in titles.
Tedster at the Webmasterworld ceremony
ianevans at the Webmasterworld ceremony
Robert Charlton at the Webmasterworld ceremony
Only one or two words difference. Now, if I was trying to sell stock I might try titling the photo:
Tedster in powder blue tuxedo at webmasterworld ceremony.
Again though, it's a difference in timing. A stock photo company is all about finding a picture of a woman in red eating an ice cream in Houston. So they have editors whose only job is to tag the hair, the hair length, the dress, the dress colour, designer, style, the watch...
As you can see, a news site like mine doesn't have the time to spend on that much tagging.
If I take 500 photos at an event, I also don't have time to rename each photo to some of the specifics we see suggested, like red-widget.jpg, tedster-in-widget.jpg.
I did start doing that on some thumbnail pages as I created the thumbnails automatically through PHP and so created names based on who was in the photo. But then you hear that Google favours the larger images, so...
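For anyone wanting to automate the naming, here is a rough sketch of building a descriptive filename from who is in the photo. The poster did this in PHP; Python is used here only for illustration, and the function name `photo_filename` and its parameters are made up for this example.

```python
import re

def photo_filename(names, event="", ext="jpg"):
    """Build a lowercase, hyphenated filename from the people and event in a photo."""
    text = " ".join(list(names) + [event])
    # Keep only runs of letters and digits, then join them with hyphens.
    slug = "-".join(re.findall(r"[a-z0-9]+", text.lower()))
    return f"{slug}.{ext}"

print(photo_filename(["Robert Charlton"], "2008 widget conference"))
# robert-charlton-2008-widget-conference.jpg
```

The same function could name the larger image as well as the thumbnail, so the descriptive name isn't limited to the small version.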
Tagging. Specific Tagging. Tagging. More tagging. Tagging. More specific tagging. Oh, and tagging.
By tagging do you mean embedded Exif data or more basic on-page tagging?
I have been pondering GIS quite a bit lately. Google seems pretty good at having unique image results. How's this done? Image dimensions and file size, Exif data or maybe some sort of crazy algo that actually views the image to detect a signature?
|Tagging. Specific Tagging. Tagging. More tagging. Tagging. More specific tagging. Oh, and tagging. |
So you'd be suggesting, um, tagging?
As I said in my post, I agree that works for the stock photo sellers who have nothing but time but it doesn't necessarily work for the news sites.
From an appearance point of view, it also works for the sellers because it's okay for them to have oodles of tags on a page. But for a publisher, it doesn't look as good.
Do you have your images in an xml sitemap?
The image pages are in the sitemap, but Google says it ignores any image files in the sitemap.
Google doesn't care whether the website is about news or stock photos. The indexing scheme works the same in either case.
I suspect that a good 1-sentence summary of the article associated with a picture would also make a good meta tag for the picture, whether it's (to invent some WWII newsreel-type headlines) "Adolf Hitler visits rocket factory in Denmark" or "HMS Victorious sinks after torpedo attack in the Bight of Biscay" or "Aachen Cathedral on fire after tank battle" -- object, action, location, occasion.
I shouldn't wonder if that kind of summary (as, say, an H3-level heading CSS'ed down to a bold-font lead paragraph) wouldn't benefit news pages also.
The point is, it really isn't much work, for someone who's used to churning out columns of news text. It's just a slightly different way of thinking about a kind of work you're probably doing already.
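As a rough illustration of the "object, action, location, occasion" pattern, a caption could be assembled from separate fields. This is only a sketch: `build_caption` and its field names are hypothetical, and nothing here is known to be how Google weighs such text.

```python
def build_caption(subject, action, location="", occasion=""):
    """Join subject, action, optional location and optional occasion into one line."""
    parts = [subject.strip(), action.strip()]
    if location.strip():
        parts.append("in " + location.strip())
    if occasion.strip():
        parts.append(occasion.strip())
    return " ".join(parts)

print(build_caption("Aachen Cathedral", "on fire", occasion="after tank battle"))
# Aachen Cathedral on fire after tank battle
```

The point of separate fields is that each one forces a distinct fact about the picture, which is what makes one caption differ from the next.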
Aye there's the rub...
If I was just posting one photo of the big event in a news article I could write a caption like "Hutcheson waves to admirers as he enters the exhibitor hall at the Google Search News Extravaganza."
But we're posting 500 photos of a big event quickly, so our data entry is name ¦ event and the database then kicks out "Hutcheson at the Google Search News Extravaganza". That's not much differentiation, and there's a chance of duplicates if I take more than one photo of you.
Perhaps I should draw up a list of synonyms and the database can take turns spewing out variations of "at the" so one person "attends" and another one "poses at". Still automatic but I guess a primitive form of making things different.
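A minimal sketch of that rotation idea, assuming the database hands back (name, event) rows. The phrase list and the function name `rotate_captions` are invented for illustration; only "attends" and "poses at" come from the post.

```python
from itertools import cycle

# Hypothetical connecting phrases; extend the list as needed.
PHRASES = ["at the", "attends the", "poses at the"]

def rotate_captions(rows, phrases=PHRASES):
    """Build one caption per (name, event) row, cycling through the phrases."""
    phrase_cycle = cycle(phrases)
    return [f"{name} {next(phrase_cycle)} {event}" for name, event in rows]

rows = [
    ("Hutcheson", "Google Search News Extravaganza"),
    ("Hutcheson", "Google Search News Extravaganza"),
    ("Tedster", "Google Search News Extravaganza"),
]
for caption in rotate_captions(rows):
    print(caption)
```

Note the limit of the approach: only the connecting words change, so once the same person has more photos than there are phrases, exact duplicates come back.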
|...so our data entry is name ¦ event |
ianevans - It sounds like your database needs another field. If that's the only differentiation you make, you certainly can't expect Google to make any further distinctions. I suggest that you add a field for a full-fledged description... that is if you really care about getting these indexed.
|Perhaps I should draw up a list of synonyms and the database can take turns spewing out variations of "at the" so one person "attends" and another one "poses at". Still automatic but I guess a primitive form of making things different. |
IMO, that's not going to be sufficient. I recommend that you reread hutcheson's post. He sums it up very well. You're going to have to make up some descriptions that do have enough differentiation if you want them indexed separately.
A PS to the above... it also seems to me that for separate images to rank from the same site, the distinctions you make in your descriptions have to be interesting to searchers.
It may be that your images are not distinct enough for you to differentiate them in terms that are likely to be searched. This is essentially not much different from, say, a clothing site trying to get a separate page ranking for each size it offers of the same product. That's not going to happen, and trying to make it happen by getting a great many paginated pages indexed may well be counterproductive.
Maybe you should simply focus your efforts toward getting one representative page indexed for each subject... and realize that what matters here isn't image count, but rather it's subject count. Perhaps the other sites get more images indexed because they in fact have more different subjects.
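The "one representative per subject" idea can be sketched roughly like this. Which photo counts as representative is really an editorial choice; here the first photo seen for each subject stands in for that choice, and `pick_representatives` is a made-up name.

```python
def pick_representatives(photos):
    """Return the first photo id seen for each subject, in input order."""
    chosen = {}
    for photo_id, subject in photos:
        # setdefault keeps the first photo id and ignores later ones.
        chosen.setdefault(subject, photo_id)
    return chosen

photos = [
    (101, "Robert Charlton"),
    (102, "Robert Charlton"),
    (103, "Jane Doe"),
]
print(pick_representatives(photos))
# {'Robert Charlton': 101, 'Jane Doe': 103}
```

The pages for the chosen ids are the ones worth the extra description work; the rest can stay with the automatic caption.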
This is starting to circle back to something in a previous thread, so I'll make a comment and head off to a new thought.
If Robert Charlton is the most sought-after person to attend the widget convention, we might have eight photos of him, six of hutcheson and one of Jane Doe. As I pondered in a previous thread, I should noindex seven of Robert's eight and add trivia to the one:
"Robert Charlton, who moderates Google Search News, has now made over 4700 posts." Since, for my needs, it doesn't really matter if ALL the photos are indexed, just one of EACH person attending the widget event. If I go by that tack, I can also rename that ONE photo file to robert-charlton-2008-widget-conference.jpg which is a heckuva lot easier than renaming 500 after a long night of widgeting. And even if dupes are noindex'd that's still a few thousand more photos than right now.
Which brings me to the new question. Is the descriptive text just a matter of proximity or, if there are divs, does it need to be in the same div?
I know some people have suggested:
but some CSS designs call for wrapping the image in its own div to add drop shadows, so you'd have:
Does that matter to Google? And...almost finished my 4:41am rambling...is there any issue with navigation text? Usually a slideshow will have Previous and Next. Can't really change those too much, but does sticking the caption in the link help, e.g.:
<a href="/photo/3" title="Robert Charlton at the widget show">Next</a>
Thanks for your patience in letting me think out loud!
Re Google's VisualRank (http://www.www2008.org/papers/pdf/p307-jingA.pdf [www2008.org]) :
Judging by Google Image Search results, the VisualRank stuff seems to be in use, at least somewhat.
They start by looking at image similarities to, "find a single most representative, or “canonical” image from image search results."
Maybe, when possible, try to find a best match, out of your group of similar images, for a theoretical canonical image and focus on that image's page. Or crop one of the photos to make it a better match.
I have an image site. A lot of the images would be a good match to a canonical image on the subject. Many times, the subject is the only object in the image, and it's on a white background. Often it's depicted at its most recognizable angle.
I typically have a lead-in thumbnail that is a crop of the larger image. Cropping creates a more interesting image when it's a small image, and it doesn't matter to me if the thumbnails are in the Image Search results. It makes sense that Google would infer that the image that's linked to by the lead-in thumbnail is the image which should be higher in the results (and the linked-to image is always a larger file size, as well). Some of my linked-to images are more of a scene. Possibly these should be broken up into parts and separated, with the overall scene linked to by all of its parts. That would allow for a better chance for the pieces to match the canonical image associated with whatever word or phrase the user types in. (As recently as a few months ago, both the thumbnail and larger image would be listed in the GIS results. Now the larger image is listed but the thumbnail is buried.)
A quote from the white paper:
|...if a user is viewing an image, other related (similar) images may also be of interest. In particular, if image u has a visual-hyperlink to image v, then there is some probability that the user will jump from u to v. Intuitively, images related to the query will have many other images pointing to them, and will therefore be visited often (as long as they are not an isolated and in a small clique). The images which are visited often are deemed important... |
It sounds like when it comes to GIS, more emphasis is placed on user clicks than there is in the main index. As users become more savvy, the larger file sizes are going to get more clicks, since that's one of the few clues you have as to the quality of the image you're about to click on.
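The random walk in the quoted passage can be sketched as a PageRank-style iteration over an image similarity matrix. This is a toy version under stated assumptions: the similarity numbers are invented, the damping value of 0.85 is just the usual PageRank default, and how the real system measures similarity between two images is not shown here.

```python
def visual_rank(similarity, damping=0.85, iterations=50):
    """Power iteration over a column-normalised image similarity matrix."""
    n = len(similarity)
    # Column sums, so each image passes on its rank in proportion to similarity.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            incoming = sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            # A damped walk: follow a "visual hyperlink" or jump to a random image.
            new_rank.append(damping * incoming + (1 - damping) / n)
        rank = new_rank
    return rank

# Toy data: images 0, 1 and 2 resemble each other; image 3 is an outlier.
similarity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
ranks = visual_rank(similarity)
best = max(range(len(ranks)), key=ranks.__getitem__)
print(best, [round(r, 3) for r in ranks])
```

In this toy set the image that is most similar to the others ends up with the highest score and the outlier with the lowest, which is the "canonical image" effect the paper describes.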
|...One method to reduce the computational cost is to precluster web images based using metadata such as text, anchor text, similarity or connectivity of the web pages on which they were found, etc. For example, images associated with “Paris”, “Eiffel Tower”, “Arc de Triomphe” are more likely to share similar visual features than random images... |
What I think they're saying is that because it's cheaper to process text than images, they 1) take the text you provide on your page, 2) take the links coming into the page, and 3) take words that are commonly associated with the words you used, even if those associated words aren't on your page. From these 3 sources they then have a group of words that are related to your image. I don't remember seeing anything like this in the main index: searching on the word "Paris" and finding a page about the Eiffel Tower that makes no reference at all to the word "Paris". I'm guessing the reason they would do this in the Image Search is that a lot of sites with images aren't as likely to have the right words on the page. Aside from the time it takes to enter the text, it's hard to presume what text to enter and it can look strange.
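A rough sketch of what text-based preclustering might look like: group images by the words found around them, so that only images sharing a term need the expensive visual comparison. The file names, terms and the function `precluster` are all hypothetical.

```python
from collections import defaultdict

def precluster(image_terms):
    """Map each lowercased term to the set of images whose metadata mentions it."""
    clusters = defaultdict(set)
    for image, terms in image_terms.items():
        for term in terms:
            clusters[term.lower()].add(image)
    return dict(clusters)

# Hypothetical metadata: words taken from page text and anchor text per image.
image_terms = {
    "tower1.jpg": ["Paris", "Eiffel Tower"],
    "tower2.jpg": ["Eiffel Tower"],
    "arch.jpg": ["Paris", "Arc de Triomphe"],
}
clusters = precluster(image_terms)
print(sorted(clusters["paris"]))
# ['arch.jpg', 'tower1.jpg']
```

Here tower2.jpg never mentions "Paris", but it shares a cluster with tower1.jpg, which does. That is one way an image could pick up an association with a word that isn't on its own page.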
In the searches I've done, I don't see things as unrelated-to-each-other-looking as "Paris" coming up in an "Eiffel Tower" search. Rather, Google seems to be taking their canonical images and comparing them to each other and then making an association between two similar canonical images. Maybe they look for both a match between 2 canonicals and a matched word grouping to get a close enough text association. The similar images are just barely sprinkled into the normal results. Again, I think as users get more sophisticated and look at the title before clicking, these are less likely to be clicked unless they really stand out for some reason.
The canonical image matcher (or whatever it is) doesn't seem to like images put into non-rectangular formats. In other words, if you have a picture of a red box but put it inside a white circle for graphic design reasons, your red box is going to be buried in the results. The silhouette seems to be important.
The color also seems to be important at times. If the silhouette is a match, but the canonical image is a strikingly different color from your image, your image won't do as well as if it was the same color. This makes me wonder if there is some sort of tally, where if most of the "red boxes" that are matched are, in fact, red, the ones that aren't get fewer points. Otherwise, when finding a canonical car image, what color would it be?
A commonly searched word or phrase is less "canonical image" focused and has more of a variety of types of images in the results: some obscure item in the news vaguely related to the phrase, a photo of someone's tattoo, a photo of someone's pet with that name, etc. As you add specifics to your search or if you do a less common one-word search, you can pretty much guess what the overall canonical image looks like by how similar the top results are to each other.
[edited by: tedster at 6:59 am (utc) on July 31, 2008]
[edit reason] make link clickable [/edit]
Nice find on that paper, piney. I notice that it focuses on "Product images" - I wonder if the algo is different for general images. I can see how it might make sense to create separate taxonomies with algo differences.
I also want to point to an earlier thread and the following information:
|Google recently entered a patent application that offers a lot of clues on how images can be automatically processed for search. Here's the patent application [appft1.uspto.gov]. Notice these possibilities, especially when there is little or no data/metadata directly associated with an image: |
- Images can be auto-tagged according to shapes, colors, and textures. This may involve breaking down images into smaller tiles and tagging those tiles.
- Images can be compared to other indexed images from around the web that have similar extracted features. Then keywords that are semantically related to those other images may be imported and used to tag the image that is being classified.
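The second possibility, borrowing keywords from visually similar images, could look something like the sketch below. It assumes each image has already been reduced to a short feature vector; the vectors, the 0.9 threshold and the function names are invented for illustration and are not taken from the patent application.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def borrow_keywords(features, indexed, threshold=0.9):
    """Collect keywords from indexed images whose features resemble the new image."""
    keywords = set()
    for other_features, other_keywords in indexed:
        if cosine(features, other_features) >= threshold:
            keywords.update(other_keywords)
    return keywords

# Hypothetical index of (feature vector, keywords) pairs.
indexed = [
    ([0.9, 0.1, 0.0], {"red", "box"}),
    ([0.0, 0.2, 0.9], {"blue", "car"}),
]
tags = borrow_keywords([0.8, 0.2, 0.0], indexed)
print(sorted(tags))
# ['box', 'red']
```

An untagged image that looks like the indexed red box picks up "red" and "box" without any text of its own, which is the behaviour the bullet describes.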
Google's challenge here is how to associate accurate keywords with images. They invented a two-person game to try to enhance this data, and they do some pretty adventurous keyword expansion across different domains. I think that's also part of the webmaster's challenge - making the mark-up extremely clear as to which text relates directly to the image.