Forum Moderators: coopster

HTML extraction

         

turbohost

7:28 am on Sep 9, 2003 (gmt 0)

10+ Year Member



Hi,

I was wondering if it is possible to extract the HTML code/content of a website via a PHP script. If so, which functions can I use to do this?

Philip

acidic

8:42 am on Sep 9, 2003 (gmt 0)

10+ Year Member



<?php

$fp = fopen("http://mysite.com", "r");
$data = fread($fp, 1024*1024);
fclose($fp);

echo $data;
?>

This will put the contents of "http://mysite.com" into the string variable $data.

mogwai

2:47 pm on Sep 9, 2003 (gmt 0)

10+ Year Member



Also, consider using the Snoopy class

[snoopy.sourceforge.net...]

"Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."

turbohost

10:07 pm on Oct 1, 2003 (gmt 0)

10+ Year Member



Hi Acidic,

I just see a short part of the web page as a result of this PHP script. How do I get the whole page? Is it also possible to get the HTML code from this page? What code do I need to use? I've bought a big book about PHP, but this part is not very well explained.

Turbohost

coopster

11:12 pm on Oct 1, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I just see a short part of the web page as a result of this PHP script. How do I get the whole page?

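1024*1024 looks like plenty, but a single fread() call on a network stream can return before the whole page has arrived, so you may get only the first chunk. Call fread() in a loop until feof() returns true; the fread function [us3.php.net] page covers the details.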

Is it also possible to get the html code from this page?

This does give you the raw html code. It is stored in the variable called $data. It's just that when you echo the data back out to your browser, the browser is rendering the data like it thinks it is supposed to. To see what I mean, throw a header out first and try it again:

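<?php
$fp = fopen("http://mysite.com", "r");
// One fread() on a network stream may return only part of the page,
// so keep reading until end of file.
$data = "";
while (!feof($fp)) {
    $data .= fread($fp, 8192);
}
fclose($fp);
header("Content-type: text/plain");
echo $data;
?>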
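
If you would rather not manage the read yourself, stream_get_contents() reads everything that is left on an open handle in one call. Here is a rough sketch of that idea. The helper name read_whole_stream() is made up for this example, and an in-memory stream stands in for the HTTP handle so it runs without touching the network:

```php
<?php
// read_whole_stream() is a hypothetical helper name, not a PHP built-in.
function read_whole_stream($fp): string
{
    // stream_get_contents() keeps reading until the end of the stream.
    $data = stream_get_contents($fp);
    if ($data === false) {
        throw new RuntimeException('Could not read from stream');
    }
    return $data;
}

// Stand-in for fopen("http://mysite.com", "r"): an in-memory stream
// holding more data than one small read would return.
$fp = fopen('php://memory', 'w+');
fwrite($fp, str_repeat('<p>row</p>', 5000));
rewind($fp);

$data = read_whole_stream($fp);
fclose($fp);

echo strlen($data), " bytes read\n";
```

With a real URL you would pass the handle returned by fopen() instead of the in-memory one.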

gethan

11:50 am on Oct 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think the file() function is a little easier to use.

<?php
// Get a file into an array. In this example we'll go through HTTP to get
// the HTML source of a URL.
$lines = file ('http://www.example.com/');

// Loop through our array, show html source as html source; and line numbers too.
foreach ($lines as $line_num => $line) {
    echo "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br>\n";
}

// Another example, let's get a web page into a string. See also file_get_contents().
$html = implode ('', file ('http://www.example.com/'));
?>

from: [hu.php.net...]
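
Since the manual snippet points at file_get_contents(), here is a small sketch of that route with a failure check added. fetch_page() is a hypothetical helper name and the 10-second timeout is an arbitrary choice; fetching http:// URLs this way also assumes allow_url_fopen is enabled in php.ini:

```php
<?php
// fetch_page() is a hypothetical helper; the default timeout is an
// arbitrary value chosen for illustration.
function fetch_page(string $url, float $timeout = 10.0): ?string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);

    // @ hides the warning raised on failure; the false return is checked instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Usage (commented out so the sketch does not hit the network when run):
// $html = fetch_page('http://www.example.com/');
// if ($html === null) {
//     echo "Could not fetch the page\n";
// }
```

Returning null on failure keeps the caller from treating an error as an empty page.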

turbohost

6:58 pm on Oct 2, 2003 (gmt 0)

10+ Year Member



Maybe this is too much to ask, but how do I extract a string followed by a number from the HTML text?

Thanks,
Turbohost

coopster

8:09 pm on Oct 2, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



preg_match [us3.php.net] or preg_match_all [us3.php.net], depending on whether you want to find one occurrence or all occurrences. Here's a regex to get you started:

$string = ' string722 abcd abc123def456 789 ';
preg_match_all("/([[:alpha:]]\w*\d+\b)/", $string, $matches);
for ($i = 0; $i < count($matches[1]); $i++) {
    print $matches[1][$i] . '<br />';
}
// prints:
// string722
// abc123def456
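
If the number you want always follows a known label in the page, it may be easier to strip the tags first and then match the label directly. This is only a sketch: extract_numbers_after() is a hypothetical helper, and the "Visitors:" label and sample HTML are invented for the example:

```php
<?php
// extract_numbers_after() is a hypothetical helper, and the "Visitors:"
// label used below is made-up sample data.
function extract_numbers_after(string $html, string $label): array
{
    // Drop the markup so tags between the label and the number do not get in the way.
    $text = strip_tags($html);

    // preg_quote() escapes any regex metacharacters in the label.
    $pattern = '/' . preg_quote($label, '/') . '\s*(\d+)/';

    preg_match_all($pattern, $text, $matches);

    // $matches[1] holds the captured digit runs; convert them to integers.
    return array_map('intval', $matches[1]);
}

$html = '<p>Visitors: <b>1024</b></p><p>Visitors: 7</p>';
print_r(extract_numbers_after($html, 'Visitors:'));
```

For the sample string this returns the integers 1024 and 7; a label that never appears gives an empty array.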

turbohost

4:25 am on Oct 3, 2003 (gmt 0)

10+ Year Member



Thx coopster! :->