
Forum Moderators: coopster & jatar k


HTML parser in PHP?

this is frustrating

   
10:12 pm on Jan 29, 2003 (gmt 0)

10+ Year Member



I've been at it all day, and I just can't write a PHP script to parse HTML that works for more than one or two sites.

What I need it to do is extract all links (URLs), the words (both as one large string snapshot of the page and as an array of individual words), and the META tags.

Is there a good script available online somewhere that actually works?

10:27 pm on Jan 29, 2003 (gmt 0)

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I did a search on Google for "php html parser" and came up with a bunch of different possibilities.
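
For the three things asked for above (links, words, META tags), here is a minimal sketch of one possible approach on a current PHP build with the bundled DOM extension. It is not one of the scripts from the search results; the function name `extractPageData` and the shape of the returned array are made up for illustration.

```php
<?php
// Hypothetical helper: pull links, META tags and words out of an HTML string.
// Assumes PHP 7+ with ext-dom enabled.
function extractPageData(string $html): array
{
    $empty = ['links' => [], 'meta' => [], 'text' => '', 'words' => []];
    if (trim($html) === '') {
        return $empty; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings quietly.
    $prev = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    // Links: the href of every <a> element, exactly as written in the page.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }

    // META tags: name => content, with the name lowercased.
    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $m) {
        if ($m->hasAttribute('name')) {
            $meta[strtolower($m->getAttribute('name'))] = $m->getAttribute('content');
        }
    }

    // Remove script and style elements so their source is not counted as text.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Text snapshot: body text with runs of whitespace collapsed to one space.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? trim(preg_replace('/\s+/', ' ', $body->textContent)) : '';

    // Words: split on anything that is not a letter, digit or apostrophe.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);

    return ['links' => $links, 'meta' => $meta, 'text' => $text, 'words' => $words ?: []];
}
```

Call it with the page source, e.g. `$data = extractPageData($html);`, then read `$data['links']`, `$data['meta']`, `$data['text']` and `$data['words']`. The links come back as written in the page, so relative ones still need resolving against the page URL.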

1:40 am on Jan 30, 2003 (gmt 0)

10+ Year Member



Wow, I found a fantastic one, by some Russian guy.

Got a question though: is there any way to "resolve" links?

A lot of sites link to other files with relative paths only. How do you tack on the domain and directory info?

(e.g., a page on [microsoft.com...] links to "../wow.htm". How do I turn that into [microsoft.com...]?)

2:02 am on Jan 30, 2003 (gmt 0)

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member



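Maybe store the domain name and the directory of the current page, then do the logical math to build the proper path: drop one directory level for each "../" and prepend what is left.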
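
A rough sketch of that "logical math", assuming PHP 7+. The function name `resolveUrl` and the example.com URLs are made up for illustration, and this is a simplification rather than a full RFC 3986 resolver: user info in the base URL is ignored, and a bare "#fragment" link loses the base query string.

```php
<?php
// Hypothetical helper (not from the thread): resolve $rel against page URL $base.
function resolveUrl(string $base, string $rel): string
{
    // A link that already carries a scheme (http:, mailto:, ...) is left alone.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }

    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    $root = $scheme . '://' . ($b['host'] ?? '')
          . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative link ("//host/path"): only the scheme is borrowed.
    if (strncmp($rel, '//', 2) === 0) {
        return $scheme . ':' . $rel;
    }

    // Set any ?query or #fragment aside so it is not treated as path.
    $cut = strcspn($rel, '?#');
    $relPath = (string) substr($rel, 0, $cut);
    $suffix = (string) substr($rel, $cut);

    if ($relPath === '') {
        $path = $basePath;                       // "?x=1" or "#top"
    } elseif ($relPath[0] === '/') {
        $path = $relPath;                        // root-relative link
    } else {
        // Keep the directory of the base page, drop its file name.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Collapse "." and ".." segments; never climb above the site root.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    // A path ending in "." or ".." names a directory, so keep the trailing slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return $root . implode('/', $out) . $suffix;
}
```

With a page at `http://example.com/products/office/index.htm`, a link to `../wow.htm` resolves to `http://example.com/products/wow.htm`.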

2:28 am on Jan 30, 2003 (gmt 0)

10+ Year Member



Here is a "pseudo HTML" parser I wrote some time ago. It is relatively inefficient, but not bad for 15 minutes of coding.

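<?php

// Chunk types reported through $Attr. The values are arbitrary; the
// constants were used but never defined in the code as first posted.
define ('ATTR_TEXT', 0);
define ('ATTR_TAG', 1);
define ('ATTR_COMMENT', 2);
define ('ATTR_SCRIPT', 3);

// Head and Tail are offsets into the entire document at Text.
// Head is where the next chunk starts. On return, Tail is the offset of
// the chunk's last character (inclusive) and Attr is the chunk type.
// Returns false when nothing is left to parse, true otherwise.
// Intended loop: start at $Head = 0, call ParseHTML, read the chunk with
// substr ($Text, $Head, $Tail - $Head + 1), then set $Head = $Tail + 1.
// stripos () needs PHP 5 or later.

function ParseHTML ($Text, &$Head, &$Tail, &$Attr)
{
    $Chunk = (string) substr ($Text, $Head);
    $ChunkLen = strlen ($Chunk);

    if ($ChunkLen == 0) {
        return false;
    }

    // Default: an unterminated chunk runs to the last character of the document.
    $Tail = $Head + $ChunkLen - 1;

    if ($Chunk [0] == '<') {
        if (strncmp ($Chunk, "<!--", 4) == 0) {
            // comment: ends at the last character of "-->"
            $Attr = ATTR_COMMENT;
            $x = strpos ($Chunk, "-->");
            if ($x !== false) {
                $Tail = $Head + $x + 2;
            }
        } else if (strncasecmp ($Chunk, "<script", 7) == 0) {
            // script block: ends at the last character of "</script>"
            $Attr = ATTR_SCRIPT;
            $x = stripos ($Chunk, "</script>");
            if ($x !== false) {
                $Tail = $Head + $x + 8;
            }
        } else {
            // catch all other tags!
            $Attr = ATTR_TAG;
            $x = strpos ($Chunk, ">");
            if ($x !== false) {
                $Tail = $Head + $x;
            }
        }
    } else {
        // plain text: ends just before the next "<"
        $Attr = ATTR_TEXT;
        $x = strpos ($Chunk, "<");
        if ($x !== false) {
            $Tail = $Head + $x - 1;
        }
    }

    return true;
}

?>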
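
To get from the ATTR_TAG chunks to actual links or META values, each tag string still has to be picked apart. A small sketch of one way to do that with a regular expression, assuming PHP 7.1+; the helper name `tagAttribute` is made up here and is not part of the parser above. It handles double-quoted, single-quoted and unquoted values, but it is not a full attribute parser.

```php
<?php
// Hypothetical helper: return the value of attribute $name from a single tag
// string such as '<a href="x.htm">', or null when the attribute is absent.
function tagAttribute(string $tag, string $name): ?string
{
    // (?<![\w-]) keeps "href" from matching inside "data-href".
    $pattern = '~(?<![\w-])' . preg_quote($name, '~')
             . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))~i';

    if (!preg_match($pattern, $tag, $m)) {
        return null;
    }

    // Exactly one of the three groups took part in the match, and PHP drops
    // trailing groups that did not, so the last array entry is the value.
    return html_entity_decode(end($m), ENT_QUOTES);
}
```

For example, `tagAttribute('<a class=x HREF="../wow.htm">', 'href')` gives `../wow.htm`, and `tagAttribute("<meta name='keywords' content='a, b'>", 'content')` gives `a, b`.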

7:28 pm on Jan 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A function that resolves relative URIs [webmasterworld.com] can be found in the Bag-O-Tricks for PHP II [webmasterworld.com].