PHP Scrape Image Src

         

username

5:49 am on Feb 1, 2009 (gmt 0)

Hi all, I need to scrape a few sites for their image sources (<img src="..." />). I am quite close to solving the problem, but I am a little stuck and need some help. So far, I have:

<?php
$link = 'http://www.example.com/';
$page = file_get_contents($link);
// Match any file name ending in .gif, .jpg or .jpeg (case-insensitive)
preg_match_all('~([a-z0-9\.\_\-]+(\.gif|\.jpe?g))~i', $page, $matches);

for ($i = 0; $i < count($matches[0]); $i++) {
    $source = $link . $matches[0][$i]; // base URL + matched file name
    $filenames = $matches[1][$i];      // file name, including the extension
    $ext = $matches[2][$i];            // extension, including the dot

    echo "$source<br/><br/>";
    echo "$filenames<br/><br/>";
    echo "$ext<br/><br/>";
}

Unfortunately, this does not grab the entire URL, nor does it account for a site whose image source is <img src="../pic.jpg"> or some other non-absolute path.

Does anyone have a bulletproof script or a good regular expression to solve this problem?

Thanks heaps.

PHP_Chimp

11:01 am on Feb 1, 2009 (gmt 0)

Unfortunately this does not grab their entire URL...

No, it grabs any text that has .gif or .jpg/.jpeg at the end.
Why not narrow your search down by having your regex look only inside image tags?

'~<img ....... >~i';
What you have at the moment will pick up not only image tags but also plain text such as this_is_a_pic.jpg, so it is not going to give you correct results.

As you know, an image tag will always start with '<img ' and end with '>', and there will normally be no other >'s within that tag (unless an attribute value contains an unescaped one, which is rare).
You are then looking for the src="...." attribute.

nor does it account for if the site's image source is <img src="../pic.jpg"> or other non absolute paths.

No, it won't show an absolute path, as that will start with a /; it will, however, show a relative path. So you either need to change your $link so that it doesn't end with a /, or discount the / when it appears at the start of your src attribute by putting /? at the start of your pattern.
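Rather than juggling the trailing slash by hand, the src can be resolved against the page URL in one place. Below is a minimal sketch, assuming PHP 7 or later. The function name resolve_image_url is made up for illustration and is not from the original post; it ignores corner cases such as a <base> tag in the page or slashes inside a query string.

```php
<?php
// Hypothetical helper (not from the thread): resolve an <img> src against the page URL.
function resolve_image_url(string $base, string $src): string
{
    // Already absolute (http://, https://, ...): nothing to do.
    if (preg_match('~^[a-z][a-z0-9+.\-]*://~i', $src)) {
        return $src;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $root   = $scheme . '://' . $host . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative src, e.g. //cdn.example.com/pic.jpg
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    if (strpos($src, '/') === 0) {
        // Root-relative src: it replaces the whole path of the page.
        $path = $src;
    } else {
        // Document-relative src: append it to the directory of the page.
        $basePath = $parts['path'] ?? '/';
        $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path     = $dir . $src;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $root . '/' . implode('/', $out);
}
```

For example, resolve_image_url('http://www.example.com/gallery/index.html', '../pic.jpg') gives http://www.example.com/pic.jpg, and a src of '/img/b.jpg' on the same site gives http://www.example.com/img/b.jpg.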

So how about something like:


'~<img .*src="([a-z0-9\.\_\-]+(\.gif|\.jpe?g))".*>~i';
// or you could split up your searches
'~<img([^>]+)>~i'; // find all image tags
'~src=[\'"]([\w\.\-]+(\.gif|\.jpe?g))[\'"]~i'; // get the src attribute

\w is A-Za-z0-9 and _, so it saves you a bit of space when writing the regex.
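If you want to convince yourself of that equivalence, a quick check over the ASCII range is enough. This is only a sketch and assumes the pattern is used without the u modifier; with Unicode mode switched on, \w can match more than these characters.

```php
<?php
// Compare \w with the explicit character class for every ASCII character.
$differences = [];
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    if (preg_match('~^\w$~', $char) !== preg_match('~^[A-Za-z0-9_]$~', $char)) {
        $differences[] = $code; // record any character the two classes disagree on
    }
}
```

After the loop, $differences should be empty, meaning both classes accept exactly the same ASCII characters.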

You may find that the split regex runs faster, as it avoids the .* that I put in the first one to allow for spaces and other attributes. You are also only looking for gif and jpg images. While these are common, you are missing out on all of the other image types. This may or may not be something you are worried about.
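Putting the split approach together, here is a rough sketch, assuming PHP 7 or later. The function name extract_image_sources and the default extension list (which adds png to the gif and jpg types discussed above) are illustrative choices, not something from the original post. It accepts any src value inside an image tag and then filters on the extension, so relative paths such as ../pic.jpg come through untouched.

```php
<?php
// Hypothetical helper (not from the thread): return the src of every <img> tag
// whose file name ends in one of the given extensions.
function extract_image_sources(string $html, array $extensions = ['gif', 'jpg', 'jpeg', 'png']): array
{
    $sources = [];

    // Step 1: find all image tags.
    preg_match_all('~<img\b[^>]*>~i', $html, $tags);

    // Build the extension alternation, e.g. gif|jpg|jpeg|png.
    $alternation = implode('|', array_map(function ($e) {
        return preg_quote($e, '~');
    }, $extensions));

    foreach ($tags[0] as $tag) {
        // Step 2: get the src attribute, quoted with either ' or ".
        // The leading \s stops attributes such as data-src from matching.
        if (!preg_match('~\ssrc\s*=\s*(["\'])(.*?)\1~i', $tag, $m)) {
            continue;
        }
        // Step 3: keep only the wanted image types.
        if (preg_match('~\.(?:' . $alternation . ')$~i', $m[2])) {
            $sources[] = $m[2];
        }
    }

    return $sources;
}
```

Called on a page containing <img alt="x" src="../pic.jpg"> and the plain text this_is_a_pic.jpg, it returns only ../pic.jpg, because the plain text never sits inside an image tag. Unquoted src values and URLs with a query string after the extension are not handled in this sketch.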