
PHP Server Side Scripting Forum

    
HTML parser in php?
this is frustrating
SubZeroGTS

10+ Year Member



 
Msg#: 864 posted 10:12 pm on Jan 29, 2003 (gmt 0)

i've been at it all day, and i just can't make a php script to parse HTML that will work for more than 1 or 2 sites.

what i need it to do is extract all links (URLs), words (as one large string snapshot of the page, and then an array of each individual word), and META tags...

is there a good script/code available online somewhere which works?
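
For readers on a current PHP version, here is a minimal sketch of the three extractions asked for above (links, META tags, and the page text both as one string and as a word array). It uses the DOM extension that ships with PHP 5 and later, which was not available when this thread was written. `ExtractPageData` and its array keys are hypothetical names chosen for this example, and the input is assumed to be non-empty HTML.

```php
<?php

// Hypothetical helper: returns links, META tags, page text and words.
function ExtractPageData ($Html)
{
    $Doc = new DOMDocument ();
    // real-world markup is rarely valid, so collect libxml warnings quietly
    $Previous = libxml_use_internal_errors (true);
    $Doc->loadHTML ($Html);
    libxml_clear_errors ();
    libxml_use_internal_errors ($Previous);

    // every href found on an <a> element, exactly as written in the page
    $Links = array ();
    foreach ($Doc->getElementsByTagName ('a') as $Anchor) {
        if ($Anchor->hasAttribute ('href')) {
            $Links [] = $Anchor->getAttribute ('href');
        }
    }

    // <meta name="..." content="..."> pairs, keyed by lower-cased name
    $Meta = array ();
    foreach ($Doc->getElementsByTagName ('meta') as $Tag) {
        if ($Tag->hasAttribute ('name')) {
            $Meta [strtolower ($Tag->getAttribute ('name'))] = $Tag->getAttribute ('content');
        }
    }

    // drop script and style elements so their source is not counted as words
    foreach (array ('script', 'style') as $Name) {
        $Nodes = $Doc->getElementsByTagName ($Name);
        for ($i = $Nodes->length - 1; $i >= 0; $i--) {
            $Node = $Nodes->item ($i);
            $Node->parentNode->removeChild ($Node);
        }
    }

    // visible text as one string with whitespace collapsed, then split into words
    $Body = $Doc->getElementsByTagName ('body')->item (0);
    $Text = $Body ? trim (preg_replace ('/\s+/', ' ', $Body->textContent)) : '';
    $Words = preg_split ('/[^\pL\pN\']+/u', $Text, -1, PREG_SPLIT_NO_EMPTY);

    return array ('links' => $Links, 'meta' => $Meta, 'text' => $Text, 'words' => $Words);
}
```

The links come back exactly as the page wrote them, so relative ones still need resolving against the page URL. Pages that do not declare a character set may need converting before they are loaded.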

 

jatar_k

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 864 posted 10:27 pm on Jan 29, 2003 (gmt 0)

i did a search on google for php html parser and came up with a bunch of different possibilities.

SubZeroGTS

10+ Year Member



 
Msg#: 864 posted 1:40 am on Jan 30, 2003 (gmt 0)

wow i found one fantastic one, by some russian dude

gotta question though, is there any way to 'resolve' links?

a lot of sites use relative links that only name a file. how do you tack on the domain/directory info?

(e.g., a page at [microsoft.com...] links to "../wow.htm"... how do i turn that into [microsoft.com...]?)

jatar_k

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 864 posted 2:02 am on Jan 30, 2003 (gmt 0)

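maybe store the domain name and the directory of the page you fetched, and then do the logical math on each relative link (back up one directory for every "../") to build the proper path.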
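
A rough sketch of that path math, for illustration only. `ResolveUrl` is a hypothetical helper, not the function referenced elsewhere in this thread. It is simplified from the reference-resolution algorithm in RFC 3986, assumes the base is an absolute http(s) URL without user:password parts, and collapses doubled slashes in paths. The example.com addresses in the usage note are placeholders.

```php
<?php

// Hypothetical helper: resolve $Relative against the page URL $Base.
function ResolveUrl ($Base, $Relative)
{
    // already absolute (has a scheme such as http: or mailto:): nothing to do
    if (preg_match ('#^[a-z][a-z0-9+.-]*:#i', $Relative)) {
        return $Relative;
    }

    $Parts = parse_url ($Base);
    $Root = $Parts ['scheme'] . '://' . $Parts ['host'];
    if (isset ($Parts ['port'])) {
        $Root .= ':' . $Parts ['port'];
    }
    $BasePath = isset ($Parts ['path']) ? $Parts ['path'] : '/';

    if (substr ($Relative, 0, 2) == '//') {
        // protocol-relative link: reuse the scheme of the base
        return $Parts ['scheme'] . ':' . $Relative;
    }
    if ($Relative == '' || $Relative [0] == '#') {
        // empty or fragment only: same document
        $Query = isset ($Parts ['query']) ? '?' . $Parts ['query'] : '';
        return $Root . $BasePath . $Query . $Relative;
    }
    if ($Relative [0] == '?') {
        // query only: same path, new query string
        return $Root . $BasePath . $Relative;
    }

    // split off query string / fragment so that dots in them are left alone
    $Cut = strcspn ($Relative, '?#');
    $Suffix = substr ($Relative, $Cut);
    $Path = substr ($Relative, 0, $Cut);

    if ($Path [0] != '/') {
        // relative path: start from the directory of the base document
        $Path = substr ($BasePath, 0, strrpos ($BasePath, '/') + 1) . $Path;
    }

    // walk the segments, dropping "." and backing up one level on ".."
    $Out = array ();
    $Segments = explode ('/', $Path);
    $Last = end ($Segments);
    foreach ($Segments as $Segment) {
        if ($Segment == '..') {
            array_pop ($Out);
        } else if ($Segment != '.' && $Segment != '') {
            $Out [] = $Segment;
        }
    }
    $Resolved = '/' . implode ('/', $Out);
    // keep a trailing slash when the path named a directory
    if (count ($Out) > 0 && ($Last == '' || $Last == '.' || $Last == '..')) {
        $Resolved .= '/';
    }

    return $Root . $Resolved . $Suffix;
}
```

Called as `ResolveUrl ('http://www.example.com/a/b/page.htm', '../wow.htm')`, this gives `http://www.example.com/a/wow.htm`: the file name is dropped from the base path, then "../" removes one directory.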

kmarcus

10+ Year Member



 
Msg#: 864 posted 2:28 am on Jan 30, 2003 (gmt 0)

here is a "pseudo html" parser i wrote some time ago. it is relatively inefficient, but nice for 15 minutes of coding.

<?php

// Token types returned through $Attr. The original post did not include
// these definitions; the values here are arbitrary.
define ('ATTR_COMMENT', 1);
define ('ATTR_SCRIPT', 2);
define ('ATTR_TAG', 3);
define ('ATTR_TEXT', 4);

// Head, Tail are pointers into the entire document at Text.
// Head is where the token starts; Tail is set to its last character.
// Returns false when there is nothing left to parse.

function ParseHTML ($Text, &$Head, &$Tail, &$Attr)
{
    $Chunk = substr ($Text, $Head);
    $ChunkLen = strlen ($Chunk);

    if ($ChunkLen == 0) {
        return false;
    }

    // default: the token runs to the end of the document
    $Tail = $Head + $ChunkLen - 1;

    if ($Chunk [0] == '<') {
        if (strncmp ($Chunk, "<!--", 4) == 0) {
            $Attr = ATTR_COMMENT;
            $x = strpos ($Chunk, "-->");
            // strpos returns false when not found, so test with !==
            if ($x !== false) {
                $Tail = $Head + $x + 2;
            }
        } else if (strncasecmp ($Chunk, "<script", 7) == 0) {
            $Attr = ATTR_SCRIPT;
            $x = stripos ($Chunk, "</script>");
            if ($x !== false) {
                $Tail = $Head + $x + 8;
            }
        } else {
            // catch all other tags!
            $Attr = ATTR_TAG;
            $x = strpos ($Chunk, ">");
            if ($x !== false) {
                $Tail = $Head + $x;
            }
        }
    } else {
        // plain text runs up to, but not including, the next "<"
        $Attr = ATTR_TEXT;
        $x = strpos ($Chunk, "<");
        if ($x !== false) {
            $Tail = $Head + $x - 1;
        }
    }
    return true;
}

?>
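
The post does not show how the function is meant to be called. The sketch below is one plausible driver loop, assuming Tail is the offset of a token's last character so that the next token starts at Tail + 1. To keep the example runnable on its own it uses `NextToken`, a hypothetical cut-down stand-in with a similar calling convention (tags and text only, no comment or script handling). `CollectText` is likewise an illustrative name.

```php
<?php

// Hypothetical, cut-down stand-in for a tokenizer with a Head/Tail
// calling convention: it only tells tags from text.
function NextToken ($Text, $Head, &$Tail, &$IsTag)
{
    if ($Head >= strlen ($Text)) {
        return false;                       // nothing left
    }
    $IsTag = ($Text [$Head] == '<');
    // a tag ends at its ">", text ends just before the next "<"
    $x = strpos ($Text, $IsTag ? '>' : '<', $Head);
    if ($x === false) {
        $Tail = strlen ($Text) - 1;         // token runs to the end of the document
    } else {
        $Tail = $IsTag ? $x : $x - 1;
    }
    return true;
}

// Walk the document token by token, keeping only the text between tags.
function CollectText ($Text)
{
    $Pieces = array ();
    $Head = 0;
    while (NextToken ($Text, $Head, $Tail, $IsTag)) {
        $Token = substr ($Text, $Head, $Tail - $Head + 1);
        if (!$IsTag && trim ($Token) != '') {
            $Pieces [] = trim ($Token);
        }
        $Head = $Tail + 1;                  // next token starts right after this one
    }
    return $Pieces;
}
```

The stand-in searches with an offset instead of copying the rest of the document on every call. That copy is a likely source of the inefficiency mentioned in the post, though the post itself does not say so.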

andreasfriedrich

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 864 posted 7:28 pm on Jan 31, 2003 (gmt 0)

A function that resolves relative URIs [webmasterworld.com] can be found in the Bag-O-Tricks for PHP II [webmasterworld.com].
