


HTML parser in php?

this is frustrating

     
10:12 pm on Jan 29, 2003 (gmt 0)

New User

10+ Year Member

joined:Jan 24, 2003
posts:40
votes: 0


i've been at it all day, and i just can't make a php script to parse HTML that will work for more than 1 or 2 sites.

what i need it to do is extract all links (URLs), words (as one large string snapshot of the page, and then an array of each individual word), and META tags...

is there a good script/code available online somewhere which works?

10:27 pm on Jan 29, 2003 (gmt 0)

Administrator

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 24, 2001
posts:15755
votes: 0


i did a search on google for php html parser and came up with a bunch of different possibilities.

1:40 am on Jan 30, 2003 (gmt 0)

New User

10+ Year Member

joined:Jan 24, 2003
posts:40
votes: 0


wow, i found a fantastic one, by some russian dude

got a question though, is there any way to 'resolve' links?

a lot of sites only link to other files, how do you tack on the domain/directory info?

(e.g., [microsoft.com...] links to "../wow.htm"... how do i turn that into [microsoft.com...]?)

2:02 am on Jan 30, 2003 (gmt 0)

Administrator

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 24, 2001
posts:15755
votes: 0


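maybe store the base URL of the page (domain plus directory) and then work out the path logic to build the full URL: drop the file name from the base, go up one directory for each "../" in the link, and append what's left.

2:28 am on Jan 30, 2003 (gmt 0)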

Junior Member

10+ Year Member

joined:July 27, 2002
posts:75
votes: 0


here is a "pseudo html" parser i wrote some time ago. it is relatively inefficient, but nice for 15 minutes of coding.

<?php

// Chunk types reported through Attr
define ('ATTR_TEXT', 0);
define ('ATTR_TAG', 1);
define ('ATTR_COMMENT', 2);
define ('ATTR_SCRIPT', 3);

// Head, Tail are pointers into the entire document at Text.
// On return, Tail is the offset of the last character of the chunk
// that starts at Head, and Attr says what kind of chunk it is.
// Returns false when there is nothing left to parse at Head.
// Note: stripos () needs PHP 5 or later.

function ParseHTML ($Text, &$Head, &$Tail, &$Attr)
{
    $Chunk = (string) substr ($Text, $Head);
    $ChunkLen = strlen ($Chunk);

    if ($ChunkLen == 0) {
        return false;
    }

    if ($Chunk [0] == '<') {
        if (strncmp ($Chunk, "<!--", 4) == 0) {
            // comment: runs through the closing "-->"
            $Attr = ATTR_COMMENT;
            $x = strpos ($Chunk, "-->");
            $End = ($x === false) ? false : $x + 2;
        } else if (strncasecmp ($Chunk, "<script", 7) == 0) {
            // script block: runs through the closing "</script>"
            $Attr = ATTR_SCRIPT;
            $x = stripos ($Chunk, "</script>");
            $End = ($x === false) ? false : $x + 8;
        } else {
            // catch all other tags!
            $Attr = ATTR_TAG;
            $End = strpos ($Chunk, ">");
        }
    } else {
        // plain text: runs up to the character before the next "<"
        $Attr = ATTR_TEXT;
        $x = strpos ($Chunk, "<");
        $End = ($x === false) ? false : $x - 1;
    }

    // unterminated chunk: take everything up to the end of the document
    if ($End === false) {
        $End = $ChunkLen - 1;
    }

    $Tail = $Head + $End;
    return true;
}

?>
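
For the three things asked for at the top of the thread (links, page text and words, META tags), here is a minimal sketch that leans on PHP's bundled DOM extension instead of hand-rolled string scanning. It assumes PHP 5 or later with the DOM extension enabled. The function name extract_page_parts and the shape of the returned array are made up for illustration, so treat it as a starting point rather than a finished parser.

```php
<?php
// Sketch only: extract_page_parts() is a hypothetical helper name.
// Assumes PHP 5+ with the bundled DOM extension (DOMDocument).

function extract_page_parts($html)
{
    $result = array('links' => array(), 'meta' => array(), 'text' => '', 'words' => array());
    if ($html === '') {
        return $result; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Links: the href attribute of every <a> element, exactly as written in the page.
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $result['links'][] = $a->getAttribute('href');
        }
    }

    // META tags: lowercased name => content.
    foreach ($doc->getElementsByTagName('meta') as $m) {
        if ($m->hasAttribute('name')) {
            $result['meta'][strtolower($m->getAttribute('name'))] = $m->getAttribute('content');
        }
    }

    // Remove <script> and <style> elements so their contents are not counted as text.
    foreach (array('script', 'style') as $tag) {
        $nodes = $doc->getElementsByTagName($tag);
        while ($nodes->length > 0) {
            $node = $nodes->item(0);
            $node->parentNode->removeChild($node);
        }
    }

    // Text: everything inside <body>, with runs of whitespace collapsed to one space.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body) {
        $result['text'] = trim(preg_replace('/\s+/', ' ', $body->textContent));
    }

    // Words: a plain split on spaces, so punctuation stays attached to the words.
    if ($result['text'] !== '') {
        $result['words'] = explode(' ', $result['text']);
    }

    return $result;
}
```

The links come back exactly as they appear in the page, so relative ones like "../wow.htm" still need resolving against the page's own URL.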

7:28 pm on Jan 31, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 22, 2002
posts:1782
votes: 0


A function that resolves relative URIs [webmasterworld.com] can be found in the Bag-O-Tricks for PHP II [webmasterworld.com].
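
For anyone who just wants the idea, the resolving step can be sketched roughly like this. This is not the Bag-O-Tricks function, only an illustration: resolve_url is a made-up name, www.example.com is a placeholder host, and the code assumes the base is a well-formed absolute http or https URL. It follows the usual steps (drop the file name from the base path, append the link, then fold away "." and ".." segments) and skips the rarer corner cases.

```php
<?php
// Sketch only: resolve_url() is a hypothetical helper.
// Assumes $base is an absolute URL with a scheme and a host.

function resolve_url($base, $rel)
{
    // Link already has a scheme (http:, mailto:, ...): return it untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    // Scheme-relative link such as //www.example.com/page.htm
    if (substr($rel, 0, 2) === '//') {
        return $b['scheme'] . ':' . $rel;
    }

    $basePath = isset($b['path']) ? $b['path'] : '/';

    // Split any ?query or #fragment off the link so it is not treated as path.
    $cut = strcspn($rel, '?#');
    $suffix = (string) substr($rel, $cut);
    $rel = (string) substr($rel, 0, $cut);

    if ($rel === '') {
        // Empty link, or only a query/fragment: stay on the base document.
        $path = $basePath;
        if (isset($b['query']) && ($suffix === '' || $suffix[0] === '#')) {
            $suffix = '?' . $b['query'] . $suffix;
        }
    } elseif ($rel[0] === '/') {
        // Root-relative link: replaces the whole base path.
        $path = $rel;
    } else {
        // Relative link: keep the base directory, drop the base file name.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $rel;
    }

    // Fold the path: "." is skipped, ".." removes the previous segment.
    $segments = explode('/', ltrim($path, '/'));
    $out = array();
    foreach ($segments as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }

    // A trailing "." or ".." names a directory, so keep the trailing slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return $root . '/' . implode('/', $out) . $suffix;
}
```

Calling resolve_url('http://www.example.com/a/b/c.htm', '../wow.htm') walks up from /a/b/ to /a/ and appends wow.htm. Extra "../" segments that would climb above the site root are simply ignored.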