How to read the content of a page

impact

12:24 pm on Jan 26, 2010 (gmt 0)

Hello,

I am trying to create a bookmark feature on my site. I need a script which can read specific contents of a web page. This is what I have so far:

$url = 'http://www.domain.com/home.php';
$contents = file_get_contents($url);

I have no idea how to search for and read the contents of "<title></title>" and "<meta name='description' content = 'This is my home page'>".

I have very limited knowledge of PHP. I would appreciate it if someone could tell me how to read the text between the title tags, and the page description, of any page I am trying to bookmark.

Thank you,

jatar_k

6:43 pm on Jan 26, 2010 (gmt 0)

You have pulled the whole page into a string, so you can now use a regular expression to find everything between <title> and </title>.

Something along these lines should work:
preg_match('~<title[^>]*>(.*?)</title>~is', $contents, $match)

Try that; the text between the tags ends up in $match[1].
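
A minimal sketch of the preg_match approach, for illustration only. The sample string stands in for whatever file_get_contents() returned, and the exact pattern (with the i and s modifiers) is an assumption rather than the only way to write it.

```php
<?php
// Hypothetical sample page standing in for the result of file_get_contents($url).
$contents = "<html><head>\n<title>\n  My Home Page\n</title>\n</head><body></body></html>";

// ~ is the regex delimiter, so the / in </title> needs no escaping.
// i = case-insensitive, s = let . match newlines as well.
if (preg_match('~<title[^>]*>(.*?)</title>~is', $contents, $match)) {
    // $match[0] is the whole match, $match[1] is the text between the tags.
    $title = trim($match[1]);
} else {
    $title = ''; // no <title> element found
}

echo $title, "\n";
```

preg_match returns 1 on a match and 0 otherwise, so the if also covers pages that have no title at all.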

rocknbil

9:46 pm on Jan 26, 2010 (gmt 0)

These are typed on the fly but should get you going in the right direction.

$contents = file_get_contents($url);

Strip out the newlines; this will make it a little more reliable, since . does not match a newline by default . . .

$contents = preg_replace('/[\n\r]+/',' ',$contents);

$title = preg_replace('/.*?<title>([^<]+)<\/title>.*/i',"$1",$contents);

$meta_desc = preg_replace('/.*?<\s*meta.*?description.*?content.*?\'*"*([^\'">]+)\'*"*\s*\/*\s*>.*/i',"$1",$contents);
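
One thing to watch with preg_replace: when the pattern does not match, it hands back the whole page unchanged. A rough sketch of the same idea with preg_match, so that a miss can be told apart from a hit. The helper name first_capture and the sample page are made up for illustration, and the meta pattern is a slightly tightened variant, not the exact one above.

```php
<?php
// Hypothetical helper: returns capture group 1, or null when the pattern does not match.
function first_capture(string $pattern, string $contents): ?string
{
    return preg_match($pattern, $contents, $m) ? trim($m[1]) : null;
}

// Sample page standing in for file_get_contents($url).
$contents = "<html><head>\n<title>My Home Page</title>\n"
          . "<meta name='description' content = 'This is my home page'>\n"
          . "</head><body></body></html>";
$contents = preg_replace('/[\n\r]+/', ' ', $contents);

$title = first_capture('/<title>([^<]+)<\/title>/i', $contents);

// [^>]*? keeps the match inside one tag; [\s=]+ eats the = and any spaces around it.
$meta_desc = first_capture(
    '/<\s*meta[^>]*?description[^>]*?content[\s=]+[\'"]*([^\'">]+)/i',
    $contents
);

// There is no <h1> in the sample, so this comes back as null instead of the whole page.
$missing = first_capture('/<h1>([^<]+)<\/h1>/i', $contents);
```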

An explanation of WTH I'm thinking: the title preg should be easy to figure out if you grok this.

'/ = the PHP string delimiter (') and the stock regex delimiter (/); note this means I have to escape any ' and any / in the pattern itself.

.*? = Zero or more of **any** character, with the ? making it non-greedy to keep it from slurping up the entire string.

<\s*meta.*?description.*?content = Putting these all together as they work together. You never know what order the attributes will come in; it's not unusual to see content= before the description, and this will FAIL in that instance. You don't want to match on just meta and content, as that would match a lot of meta tags. You may need to do an "or" here because of this.

So the start of the pattern is a <, and there may or may not be spaces, so zero or more spaces followed by meta, then description and content, with zero or more of any character between. More restrictive, and perhaps more accurate, might be

<\s*meta\s+name\s*=\s*[\'"]?description[\'"]?\s+content
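
For the "or" mentioned above, one possible shape is an alternation with one branch per attribute order. This is only a sketch: the function name meta_description and the test tags are invented for the example, and it assumes both attributes live inside the same tag.

```php
<?php
// Value pattern reused in both branches: optional quote, then text up to the next quote or >.
$value = '[\'"]?([^\'">]+)';

// Branch 1: name=description comes before content=.
// Branch 2: content= comes before name=description.
$pattern = '/<\s*meta\b(?:'
         . '[^>]*?\bname\s*=\s*[\'"]?description\b[^>]*?\bcontent\s*=\s*' . $value
         . '|'
         . '[^>]*?\bcontent\s*=\s*' . $value . '[^>]*?\bname\s*=\s*[\'"]?description\b'
         . ')/i';

function meta_description(string $pattern, string $html): ?string
{
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Group 1 is filled by the first branch, group 2 by the second.
    $hit = ($m[1] ?? '') !== '' ? $m[1] : $m[2];
    return trim($hit);
}

echo meta_description($pattern, '<meta name="description" content="Name first">'), "\n";
echo meta_description($pattern, "<meta content='Content first' name='description'>"), "\n";
```

Using [^>]*? instead of .*? between the pieces keeps each branch from wandering into the next tag.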

.*? = after "content" we can find =, spaces, or both. Hopefully. A better one here might be [\s=]+

\'*"* = you have your attributes single quoted, IMO bad style but acceptable; it validates, and is often seen in PHP due to its lack of qq support (like in Perl), so OK. But it means you have to check for both kinds of quote, so this means zero or more of either. Zero is good too: it's entirely possible you will find an unquoted meta desc. tag.

([^\'">]+) = one or more of any character that is not a ", ' or >. Given the above, if the meta desc is unquoted, excluding just ' and " would not stop the match, hence the >. It will also fail if there are quote marks within the description, which is reason two I don't like ' as attribute delimiters - double quotes in **properly** coded output should be &quot;. Not sure how I'd work around this one, but there's always a way. Maybe just the tired old specific characters cure-all:

([a-z0-9\s\-\'\.,!&%\$]+) (etc., no need for A-Z as it's case insensitive, see below)
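
Another possible workaround for quote marks inside the description, sketched here as an assumption rather than a tested fix: capture the opening quote and require the same one to close the value with a backreference. It only handles quoted values; an unquoted content= would still need the character-class approach.

```php
<?php
// (["\']) captures the opening quote; \1 requires the same quote to close the value,
// so the other kind of quote may appear inside the description.
$pattern = '/<\s*meta[^>]*?description[^>]*?content\s*=\s*(["\'])(.*?)\1/i';

// Hypothetical tag with double quotes inside a single-quoted attribute.
$html = "<meta name='description' content='The \"best\" home page'>";

// Group 1 is the quote character, group 2 is the description itself.
$desc = preg_match($pattern, $html, $m) ? $m[2] : null;

echo $desc, "\n";
```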

The parentheses () store this value in $1, this is what we are hopefully capturing. I've escaped ' because that is what I'm using to define the regex in PHP - which is different from the regex delimiter, /.

\'*"*\s*\/*\s* = after the captured part, we may have any of these characters or NONE of them, hence all "zero or more." Examples,

blah" />
blah'>
blah>
blah >

Note also escaped '
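
A quick way to check the tail of the pattern against those four endings. The sample tags are made up around the endings listed, and the slash is written as \/* so that it is optional like everything else in the tail.

```php
<?php
// The four endings from the list, each placed in a hypothetical meta tag.
$samples = [
    '<meta name="description" content="blah" />',
    "<meta name='description' content='blah'>",
    '<meta name="description" content=blah>',
    '<meta name="description" content=blah >',
];

// Value, then optional quotes, optional spaces, optional slash, optional spaces, then >.
$pattern = '/content[\s=]+[\'"]*([^\'">]+)\'*"*\s*\/*\s*>/i';

$found = [];
foreach ($samples as $tag) {
    // trim() drops the trailing space the capture picks up in the last sample.
    $found[] = preg_match($pattern, $tag, $m) ? trim($m[1]) : null;
}

print_r($found);
```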

>.* = The end of the pattern, so it can be followed by zero or more of anything.

/i' = regex end delimiter, case insensitive, end PHP delimiter of the regex itself.

As said, will probably need some tweaking, but gives you some ideas on how to tackle it.
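
If the regex gets too fiddly, a different technique is to let PHP's DOM extension parse the page and read the elements directly. This is a swapped-in alternative, not the regex method above, and it assumes the dom extension is enabled (it usually is by default). The sample page is invented for the example.

```php
<?php
// Sample page standing in for file_get_contents($url); note content= comes first here.
$contents = '<html><head><title>My Home Page</title>'
          . '<meta content="This is my home page" name="Description">'
          . '</head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($contents);
libxml_clear_errors();

// Text of the first <title> element, if there is one.
$titles = $doc->getElementsByTagName('title');
$title = $titles->length > 0 ? trim($titles->item(0)->textContent) : null;

// Walk the <meta> elements and pick the one whose name is "description" in any case.
$meta_desc = null;
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strtolower($meta->getAttribute('name')) === 'description') {
        $meta_desc = trim($meta->getAttribute('content'));
        break;
    }
}

echo $title, "\n", $meta_desc, "\n";
```

Attribute order and quoting style stop mattering this way, since the parser deals with them.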