HTML extraction

Forum Moderators: coopster

turbohost

7:28 am on Sep 9, 2003 (gmt 0)

Hi,

I was wondering whether it is possible to extract HTML code/content from a website with a PHP script. If so, which functions should I use?

Philip

acidic

8:42 am on Sep 9, 2003 (gmt 0)

<?php

$fp = fopen("http://mysite.com", "r");
$data = fread($fp, 1024*1024);
fclose($fp);

echo $data;
?>

This will put the contents of "http://mysite.com" into the string variable $data.

mogwai

2:47 pm on Sep 9, 2003 (gmt 0)

Also, consider using the Snoopy class

[snoopy.sourceforge.net...]

"Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."

turbohost

10:07 pm on Oct 1, 2003 (gmt 0)

Hi Acidic,

I just see a short part of the web page as a result of this php-script. How do I get the whole page? Is it also possible to get the html code from this page? What's the coding I've got to use? I've bought a big book about php but this part is not very well explained.

Turbohost

coopster

11:12 pm on Oct 1, 2003 (gmt 0)

I just see a short part of the web page as a result of this php-script. How do I get the whole page?

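1024*1024 looks like plenty, but on a network stream a single fread call returns as soon as some data has arrived, so it can hand back only the first part of the page. Read in a loop until feof() reports the end of the stream. You can read about the fread function [us3.php.net] for the details.

Is it also possible to get the html code from this page?

This does give you the raw HTML code. It is stored in the variable called $data. It's just that when you echo the data back out to your browser, the browser renders it as a page instead of showing the source. To see what I mean, send a plain-text header first and try it again:

<?php
$fp = fopen("http://mysite.com", "r");
if (!$fp) die("Could not open the URL");
$data = '';
// fread() on a network stream returns one chunk at a time, so loop until EOF.
while (!feof($fp)) {
    $data .= fread($fp, 8192);
}
fclose($fp);
// Plain text makes the browser show the source instead of rendering it.
header("Content-type: text/plain");
echo $data;
?>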
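If you still want an upper bound on how much gets read (like the 1024*1024 above), here is a rough sketch of a capped read loop. The function name read_up_to is made up for this example, and the demo uses an in-memory stream rather than a real URL so it runs without network access:

```php
<?php
// Hypothetical helper: read from $fp until EOF or until $limit bytes are collected.
function read_up_to($fp, $limit)
{
    $data = '';
    while (!feof($fp) && strlen($data) < $limit) {
        // Ask only for what is still missing, at most 8192 bytes per call.
        $chunk = fread($fp, min(8192, $limit - strlen($data)));
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing more to read: return what we have
        }
        $data .= $chunk;
    }
    return $data;
}

// Demo on an in-memory stream (an assumption made to keep the sketch offline).
$fp = fopen('php://memory', 'r+');
fwrite($fp, str_repeat('x', 20000));
rewind($fp);
echo strlen(read_up_to($fp, 1024 * 1024)), "\n";
fclose($fp);
```

With a handle from fopen("http://mysite.com", "r") the call should look the same. The cap only stops a huge page from filling memory.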

gethan

11:50 am on Oct 2, 2003 (gmt 0)

I think the file() function is a little easier to use.

<?php
// Get a file into an array. In this example we'll go through HTTP to get
// the HTML source of a URL.
$lines = file('http://www.example.com/');

// Loop through our array, show html source as html source; and line numbers too.
foreach ($lines as $line_num => $line) {
    echo "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br>\n";
}

// Another example, let's get a web page into a string. See also file_get_contents().
$html = implode('', file('http://www.example.com/'));
?>

from: [hu.php.net...]
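Since the manual snippet points at file_get_contents(), here is a minimal sketch of that route. The wrapper name fetch_source and the 10-second timeout are assumptions for illustration, not something from the manual page quoted above:

```php
<?php
// Hypothetical helper: return the whole resource as one string, or null on failure.
function fetch_source($url, $timeoutSeconds = 10)
{
    // The 'http' timeout option only applies to http:// URLs.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    // @ hides the warning on failure; the return value is checked instead.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

Calling fetch_source('http://www.example.com/') should give you the page source as a single string, which you can pass through htmlspecialchars() before echoing if you want to see the markup in the browser.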

turbohost

6:58 pm on Oct 2, 2003 (gmt 0)

Maybe this is too much to ask, but how do I extract a string followed by a number out of the HTML text?

Thanks,
Turbohost

coopster

8:09 pm on Oct 2, 2003 (gmt 0)

preg_match [us3.php.net] or preg_match_all [us3.php.net], depending on whether you want to find one occurrence or all occurrences. Here's a regex to get you started:


$string = ' string722 abcd abc123def456 789 ';
// Match a letter, then any word characters, ending in digits at a word boundary.
preg_match_all('/([[:alpha:]]\w*\d+\b)/', $string, $matches);
for ($i = 0; $i < count($matches[1]); $i++) {
    print $matches[1][$i] . '<br />';
}
// prints:
// string722
// abc123def456
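To run that kind of pattern against a fetched page, one possible approach is to strip the tags first and then search for a known label followed by a number. This is only a sketch: the helper name extract_labelled_numbers, the sample HTML and the "Price:" label are all invented for the example.

```php
<?php
// Hypothetical helper: find every "<label> <number>" pair in the visible text of a page.
function extract_labelled_numbers($html, $label)
{
    // Drop the tags so the pattern only sees text content.
    $text = strip_tags($html);
    // preg_quote() keeps characters in $label from being read as regex syntax.
    $pattern = '/' . preg_quote($label, '/') . '\s*(\d+)/i';
    preg_match_all($pattern, $text, $matches);
    return $matches[1]; // the captured numbers, as strings
}

// Made-up sample markup standing in for a fetched page.
$html = '<p>Price: <b>42</b> EUR</p><p>price: 7 EUR</p>';
print_r(extract_labelled_numbers($html, 'Price:'));
```

The numbers come back as strings, so cast them with (int) if you need to do arithmetic. strip_tags() is a blunt tool, and a page with scripts or unusual markup may need more careful handling.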

turbohost

4:25 am on Oct 3, 2003 (gmt 0)

Thx coopster! :->
