|Using php to compare two images|
| 9:28 pm on May 14, 2010 (gmt 0)|
So I'm building this app, and part of it is dedicated to ingesting a feed of text and images. The way the feed is sent to me, the images are always named the same: id-1.jpg, id-2.jpg, etc. So I started wondering what would happen if the managers decided to change an image in their application: it would come to me with the same name. I would either need to process ALL the images into multiple sizes every time I process the feed, or somehow check each photo against what I had last time. So here's what I am thinking about doing:
When I process the feed I read the image with file_get_contents, and since that is kind of large I take an md5 of it and store that in the db along with other details about the image. Then the next time I process the feed, or the images from a specific record are requested, I can check whether I still have the most up-to-date image.
I ran a test with two nearly identical JPEGs, both all white but one with a single black pixel, and the source and the md5 of that source both differed between the two files, so that worked.
The curious thing, though, is that when I got the same photos from two different feed dates (photos that my system should treat as identical) and tested them, the md5 was the same but the source was not.
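A minimal sketch of the fingerprint check described above, assuming the stored md5 comes back from the db as a string (or null for an image never seen before). The function names and the example path are hypothetical, not part of any existing app.

```php
<?php
// Hypothetical helper: fingerprint of the raw image bytes.
function image_fingerprint(string $bytes): string
{
    return md5($bytes);
}

// Hypothetical helper: true when the feed image does not match the
// fingerprint stored on a previous run (or when nothing is stored yet).
function image_has_changed(string $bytes, ?string $storedFingerprint): bool
{
    if ($storedFingerprint === null) {
        return true; // never seen before, so it needs processing
    }
    return !hash_equals($storedFingerprint, image_fingerprint($bytes));
}

// Usage (the path is an assumption):
// $bytes = file_get_contents('/path/to/feed/id-1.jpg');
// if (image_has_changed($bytes, $fingerprintFromDb)) { /* resize, then store the new md5 */ }
```

Only the 32-character hash has to be kept, so the full-size original can be dropped after resizing and the next feed run still has something to compare against.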
Why would the md5 be the same while the source wasn't?
Does this sound like a logically sound way to go about it?
| 10:57 am on May 15, 2010 (gmt 0)|
Hi there the_hat.
Your idea is logically sound; I just wonder about using the md5 to create a key specific to that image. Is there a function in the gd lib that does something similar? Try reading the file contents in hex to compare.
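A rough sketch of the hex comparison suggested here, assuming both files fit in memory. It finds the first byte offset where two byte strings differ, which can then be dumped with bin2hex. The function name and the paths are hypothetical.

```php
<?php
// Hypothetical helper: offset of the first differing byte, or null
// when the two strings are byte-for-byte identical.
function first_difference_offset(string $a, string $b): ?int
{
    $limit = min(strlen($a), strlen($b));
    for ($i = 0; $i < $limit; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // Equal up to the shorter length: they differ only if the lengths differ.
    return strlen($a) === strlen($b) ? null : $limit;
}

// Usage (paths are assumptions):
// $old = file_get_contents('/path/to/old/id-1.jpg');
// $new = file_get_contents('/path/to/new/id-1.jpg');
// $at  = first_difference_offset($old, $new);
// if ($at !== null) {
//     echo bin2hex(substr($old, $at, 16)), "\n", bin2hex(substr($new, $at, 16)), "\n";
// }
```

Comparing the hex strings gives the same answer as comparing the raw bytes; the hex form is just easier to read when checking where two files diverge.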
| 10:11 pm on May 15, 2010 (gmt 0)|
Matthew1980, thanks for the reply. It looks like there is a gd module called cmp_image, but that would mean I would have to read two images and compare them instead of reading one image and comparing it to data stored in the db. The images right now total around 40,000, so I was thinking that running the external gd lib once and going to the db would reap me a fairly sizable time savings vs. running gd twice per image. Let alone the fact that I was planning on totally dropping the supplied full-size photo in favor of smaller photos that I can display more easily within my design. Not sure the cmp_image module is smart enough to equate a resized version of a photo to its full-size version.
I'm looking for a way to take a fingerprint of an image that I can store in my db before doing away with that version. Then when I recheck photos I can read the image, check it against the fingerprint I have of the previous image that came through with that name, and decide...
| 3:51 pm on May 19, 2010 (gmt 0)|
Are these truly JPEGs -- with associated JPEG EXIF data?
If so, you might get away with simply comparing the first 512
(or even 256) bytes of each image file, or the MD5SUM of that.
That's the method I've used in the past to check for new, updated
images from remote web cams.
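A minimal sketch of the "first N bytes" check, assuming the files are readable from the local filesystem. The function name, the 512-byte default, and the example path are assumptions for illustration only.

```php
<?php
// Hypothetical helper: md5 of only the first $length bytes of a file.
function head_fingerprint(string $path, int $length = 512): string
{
    // The offset/length arguments make file_get_contents read just the head.
    $head = file_get_contents($path, false, null, 0, $length);
    if ($head === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return md5($head);
}

// Usage (the path is an assumption):
// $changed = head_fingerprint('/path/to/feed/id-1.jpg') !== $fingerprintFromDb;
```

The trade-off is that a change that only touches bytes beyond the first $length goes unnoticed, so this is a speed shortcut and not a full comparison.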
| 8:09 pm on May 19, 2010 (gmt 0)|
Hi there Jonesy,
Forgive me for asking, but I am interested in the first 256/512 bytes angle. Does this act as a sort of FAT system for the rest of the image data? If so, that's a new one on me!
The way I understand it, image data is stored as hex data, which is why, when you compress an image file, the compression engine opens the file in hex and then selects chunks of data that repeat, i.e. multiple FF FF FF FF chunks would be treated as FF x4 (same for any other combination of digits), so long as they repeat more than twice in a row. This is why I suggested the hex option, if it's even possible.
I experimented with something similar in C, but it never saw the light of day. I would be interested to see how this one pans out!
| 1:42 pm on May 20, 2010 (gmt 0)|
@Jonesy: I am sure that they are when they are given to their application, but I'm not sure if they still retain that information when they are pushed to me. I'll have to check on that. Thanks.
| 2:22 am on May 25, 2010 (gmt 0)|
"The first 256/512 bytes angle" merely counts on either the metadata or the actual image data in the JPEG being different for different images -- even for 'identical' images from a fixed-position camera. There'll be a different timestamp, or different image content(s), or even different lighting resulting in different JPEG compression.
I dunno if it would work in all situations. However, I tested it for a few instances (all fixed position web cams) where I wanted to employ it, and IWFM. IIRC, I started out at 1024 bytes and pruned it back to just 256 bytes with no problems missing a new image.
Back to the OP, if your images can be fetched over HTTP, perhaps the "HEAD" request method might do The Trick.
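A rough sketch of the HEAD idea, assuming the remote server actually sends ETag or Last-Modified headers (not every server does) and PHP 7.1 or later, where get_headers accepts a stream context. The function names and the URL are hypothetical.

```php
<?php
// Hypothetical helper: fetch only the response headers with a HEAD request.
function fetch_head_headers(string $url): array
{
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = get_headers($url, true, $context);
    // Header names are case-insensitive, so normalise the keys.
    return $headers === false ? [] : array_change_key_case($headers, CASE_LOWER);
}

// Hypothetical helper: build a comparable signature from ETag and
// Last-Modified, or null when the server sent neither.
function remote_image_signature(array $headers): ?string
{
    $pick = function ($value) {
        // After redirects get_headers() can return an array of values; keep the last.
        return is_array($value) ? end($value) : $value;
    };
    $etag = $pick($headers['etag'] ?? null);
    $modified = $pick($headers['last-modified'] ?? null);
    if ($etag === null && $modified === null) {
        return null;
    }
    return ($etag ?? '') . '|' . ($modified ?? '');
}

// Usage (the URL is an assumption):
// $signature = remote_image_signature(fetch_head_headers('http://example.com/id-1.jpg'));
// if ($signature === null || $signature !== $signatureFromDb) { /* download and reprocess */ }
```

When it works, this avoids downloading the image body at all; a null signature means the server gave nothing to compare, so falling back to a content hash would be the safe choice.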