
PHP Server Side Scripting Forum

    
HTML parser in php?
this is frustrating
SubZeroGTS




msg:1316354
 10:12 pm on Jan 29, 2003 (gmt 0)

i've been at it all day, and i just can't make a php script to parse HTML that will work for more than 1 or 2 sites.

what i need it to do is extract all links (URLs), words (as one large string snapshot of the page, and then an array of each individual word), and META tags...

is there a good script/code available online somewhere which works?
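
For reference, all three things asked for here (links, words, META tags) can be pulled out with PHP's DOM extension instead of a hand-rolled parser. The following is only a sketch under assumptions: `extractPageData()` is a made-up name, the DOM extension has to be installed, the input has to be a non-empty HTML string, and "words" is taken to mean runs of letters, digits and apostrophes in the body text.

```php
<?php
// Sketch: pull links, visible words and META tags out of an HTML string.
// extractPageData() is a hypothetical helper name, not an existing function.

function extractPageData(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Every href found on an <a> element, exactly as written in the page.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }

    // META tags keyed by their lower-cased name attribute.
    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $m) {
        if ($m->hasAttribute('name')) {
            $meta[strtolower($m->getAttribute('name'))] = $m->getAttribute('content');
        }
    }

    // Remove script and style elements so their contents do not count as words.
    foreach (['script', 'style'] as $tagName) {
        $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // One whitespace-normalised string for the page, then the word array.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? trim(preg_replace('/\s+/', ' ', $body->textContent)) : '';
    $words = $text === ''
        ? []
        : preg_split('/[^\p{L}\p{N}\']+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return ['links' => $links, 'text' => $text, 'words' => $words, 'meta' => $meta];
}
```

The links come back exactly as written in the page, so relative ones still have to be resolved against the page's own URL.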

 

jatar_k




msg:1316355
 10:27 pm on Jan 29, 2003 (gmt 0)

i did a search on google for php html parser and came up with a bunch of different possibilities.

SubZeroGTS




msg:1316356
 1:40 am on Jan 30, 2003 (gmt 0)

wow, i found a fantastic one, by some russian dude

got a question though: is there any way to 'resolve' links?

a lot of sites use relative links to other files, so how do you tack on the domain/directory info?

(e.g., [microsoft.com...] links to "../wow.htm"... how do i turn that into [microsoft.com...]?)

jatar_k




msg:1316357
 2:02 am on Jan 30, 2003 (gmt 0)

maybe store the page's domain and directory, then do the logical math to build the proper path: drop the file name, go up one directory for each "../", and append what is left.
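
Building the proper path from the page's own address might look roughly like this. It is a simplified sketch, not a full RFC 3986 resolver: `resolveUrl()` is a hypothetical helper, the `example.com` addresses in the usage note are placeholders, and only ordinary http(s) base URLs with a host are assumed.

```php
<?php
// Sketch: resolve a link found on a page against that page's URL.
// resolveUrl() is a hypothetical helper covering common http(s) cases only.

function resolveUrl(string $base, string $rel): string
{
    // Already absolute (has a scheme such as http: or mailto:): leave it alone.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $rel)) {
        return $rel;
    }

    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    $authority = ($b['host'] ?? '') . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative link: only the scheme comes from the base.
    if (strncmp($rel, '//', 2) === 0) {
        return $scheme . ':' . $rel;
    }
    // Empty or fragment-only link: keep the base path and query.
    if ($rel === '' || $rel[0] === '#') {
        $query = isset($b['query']) ? '?' . $b['query'] : '';
        return "$scheme://$authority$basePath$query$rel";
    }
    // Query-only link: keep the base path, replace the query.
    if ($rel[0] === '?') {
        return "$scheme://$authority$basePath$rel";
    }

    // Split off ?query / #fragment so dot segments are only handled in the path.
    $cut = strcspn($rel, '?#');
    $suffix = (string) substr($rel, $cut);
    $path = substr($rel, 0, $cut);

    if ($path[0] !== '/') {
        // Relative path: replace the file name part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Walk the segments, dropping "." and letting ".." remove one directory.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $out);
    // Keep a trailing slash when the path pointed at a directory.
    if ($out && preg_match('#/\.{0,2}$#', $path)) {
        $resolved .= '/';
    }
    return "$scheme://$authority$resolved$suffix";
}
```

With a page at `http://www.example.com/dir/sub/page.htm`, a link to `../wow.htm` resolves to `http://www.example.com/dir/wow.htm`.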

kmarcus




msg:1316358
 2:28 am on Jan 30, 2003 (gmt 0)

here is a "pseudo html" parser i wrote some time ago. it is relatively inefficient, but nice for 15 minutes of coding.

<?php

// Token types reported through $Attr.
define ('ATTR_COMMENT', 1);
define ('ATTR_SCRIPT', 2);
define ('ATTR_TAG', 3);
define ('ATTR_TEXT', 4);

// Head, Tail are pointers into the entire document at Text.
// On entry Head is where the next token starts; on return Tail is the
// offset of that token's last character (inclusive) and Attr is its type.
// Head is not advanced here: the caller continues at Tail + 1.
// Returns false once Head is past the end of Text, true otherwise.

function ParseHTML ($Text, &$Head, &$Tail, &$Attr)
{
    if ($Head >= strlen ($Text)) {
        return false;
    }

    $Chunk = substr ($Text, $Head);
    $ChunkLen = strlen ($Chunk);

    // default: an unterminated token runs to the end of the document
    $Tail = $Head + $ChunkLen - 1;

    if ($Chunk [0] == '<') {
        if (strncmp ($Chunk, "<!--", 4) == 0) {
            $Attr = ATTR_COMMENT;
            $x = strpos ($Chunk, "-->");
            if ($x !== false) {
                $Tail = $Head + $x + 2;
            }
        } else if (strncasecmp ($Chunk, "<script", 7) == 0) {
            // swallow everything up to and including </script>
            $Attr = ATTR_SCRIPT;
            $x = stripos ($Chunk, "</script>");
            if ($x !== false) {
                $Tail = $Head + $x + 8;
            }
        } else {
            // catch all other tags!
            $Attr = ATTR_TAG;
            $x = strpos ($Chunk, ">");
            if ($x !== false) {
                $Tail = $Head + $x;
            }
        }
    } else {
        // plain text up to (but not including) the next tag
        $Attr = ATTR_TEXT;
        $x = strpos ($Chunk, "<");
        if ($x !== false) {
            $Tail = $Head + $x - 1;
        }
    }

    return true;
}

?>
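
Once a tokenizer like the one above hands back a tag chunk, the links and META data still have to be dug out of its attributes. A possible way to do that is sketched below. `parseTag()` is a hypothetical helper that is not part of the parser above, and the regular expression only covers the usual `name="value"`, `name='value'` and `name=value` forms.

```php
<?php
// Sketch: split one tag chunk such as '<a href="x.htm">' into its name and
// attributes. parseTag() is a hypothetical helper name.

function parseTag(string $tag): ?array
{
    // Tag name right after '<'; closing tags like </a> give null.
    if (!preg_match('/^<\s*([a-z][a-z0-9]*)/i', $tag, $m)) {
        return null;
    }

    // name="value", name='value' or unquoted name=value
    $pattern = '/([a-z][a-z0-9_:\-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    preg_match_all($pattern, substr($tag, strlen($m[0])), $found, PREG_SET_ORDER);

    $attrs = [];
    foreach ($found as $f) {
        // At most one of the three value groups is non-empty per match.
        $value = $f[2] . ($f[3] ?? '') . ($f[4] ?? '');
        $attrs[strtolower($f[1])] = html_entity_decode($value);
    }

    return ['name' => strtolower($m[1]), 'attrs' => $attrs];
}
```

For an `a` tag the link is then in `$result['attrs']['href']`, and for a `meta` tag the interesting parts are `name` and `content`.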

andreasfriedrich




msg:1316359
 7:28 pm on Jan 31, 2003 (gmt 0)

A function that resolves relative URIs [webmasterworld.com] can be found in the Bag-O-Tricks for PHP II [webmasterworld.com].
