Forum Moderators: coopster
What I need it to do is extract all links (URLs), the page's words (both as one large string snapshot of the page and as an array of each individual word), and the META tags...
is there a good script/code available online somewhere which works?
Got a question though: is there any way to 'resolve' relative links?
A lot of sites link to other files with relative paths — how do you tack on the domain/directory info?
(i.e., [microsoft.com...] links to "../wow.htm" — how do I turn that into [microsoft.com...]?)
<?php
// Head, Tail are pointers into the entire document at Text
function ParseHTML ($Text, &$Head, &$Tail, &$Attr)
{
$Chunk = substr ($Text, $Head);
$ChunkLen = strlen ($Chunk);
if ($Chunk [0] == '<') {
if (($Chunk [1] == '!') &&
($Chunk [2] == '-') &&
($Chunk [3] == '-')) {
$x = strpos ($Chunk, "-->");
if ($x > 0) {
$x += 2;
$Tail = $Head + $x;
$Attr = ATTR_COMMENT;
} else {
$Tail = $Head + $ChunkLen;
$Attr = ATTR_COMMENT;
}
} else if (strncasecmp ($Chunk, "<script", 7) == 0) {
$x = stripos ($Chunk, "</script>");
if ($x > 0) {
$x += 8;
$Tail = $Head + $x;
$Attr = ATTR_SCRIPT;
} else {
$Tail = $Head + $ChunkLen;
$Attr = ATTR_SCRIPT;
}
} else {
// catch all other tags!
$x = strpos ($Chunk, ">");
if ($x > 0) {
$Tail = $Head + $x;
$Attr = ATTR_TAG;
} else {
$Tail = $Head + $ChunkLen;
$Attr = ATTR_TAG;
}
}
} else {
$x = strpos ($Chunk, "<");
if ($x > 0) {
$Tail = $Head + $x - 1;
$Attr = ATTR_TEXT;
} else {
$Tail = $Head + $ChunkLen;
$Attr = ATTR_TEXT;
}
}
}
?>