Forum Moderators: coopster

Data Extraction with PHP

eslobrown

10:51 pm on Dec 18, 2009 (gmt 0)

10+ Year Member



Hello All,

I would like to spider a directory and copy the files to another directory.

This is the mashup code I came up with from a couple of other scripts I found:

<?php
$a= 1;
$b= 2;
while ($a <= $b)
if(!copy("http://www.example.com/$a/", "$a.html"))
{
echo("failed to copy file");
}

?>

What this is meant to say is that each file named 1 through 10 should be copied to a local file with the same name and the extension .html.

It looks like this script is currently copying the first file over and over and not moving on to the next one in the sequence.

Any help would be greatly appreciated.

Eslo Brown

TheMadScientist

11:23 pm on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You're not incrementing your counter in the loop...
And it's only set to copy files 1 through 2, unless you set $b = 10.


<?php
$a= 1;
$b= 10;
while ($a <= $b) {
if(!copy("http://www.example.com/$a/", "$a.html")) {
echo("failed to copy file");
}
$a++;
}
?>

eslobrown

3:45 pm on Dec 19, 2009 (gmt 0)

10+ Year Member



Thanks TheMadScientist!

Follow-up: Is there a way within that same script to insert content into each of those files right after the <body> tag?

In other words, I want to insert the same header right after the open <body> tag of each of the copied files.

Is it possible with another script?

Thank you so much for your help.

CyBerAliEn

5:22 pm on Dec 22, 2009 (gmt 0)

10+ Year Member



If the file is being written/copied over to your own system: Yes

A simple/quick way: use PHP's file handling functions to open the file and read it. Then scan the content for "<body>" (you could do this by getting the file content as a string and using strpos() to find the first body tag). Then re-create the content string with your "header" spliced in. A quick/dirty approach would be something like:

$newContent = substr($oldContent, 0, $endBodyIndex) . $myHeader . substr($oldContent, $endBodyIndex);

where $oldContent is a string of the file contents, $myHeader is a string of your custom header, and $endBodyIndex is the character index (in the original/old content) just past the end of the opening body tag, so the tag itself is kept and the header lands right after it.

Then you would write/close the file and be done. You would add this extra code inside your original code just before the while loop terminates (to process your header before moving to the next file).

Do a search for "PHP file system functions". There are a lot of options for handling files: opening them, reading them, writing them, closing them, etc.
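
The splice described above can be sketched as a small function. This is only an illustration: insertAfterBodyTag() is a made-up name, and it assumes the page has a single, ordinary opening body tag (possibly with attributes). It lowercases a copy of the string rather than calling stripos(), since stripos() needs PHP 5.

```php
<?php
// Hypothetical helper: returns $html with $header inserted right after the
// opening <body ...> tag, or $html unchanged if no such tag is found.
function insertAfterBodyTag($html, $header)
{
    // Search a lowercased copy so <BODY> and <body> both match.
    $start = strpos(strtolower($html), '<body');
    if ($start === false) {
        return $html;
    }
    // The tag may carry attributes, so look for its closing '>'.
    $end = strpos($html, '>', $start);
    if ($end === false) {
        return $html;
    }
    $end++; // index just past the '>'
    return substr($html, 0, $end) . $header . substr($html, $end);
}

// Small local demo, no network or files involved.
echo insertAfterBodyTag(
    '<html><body class="x"><p>Hi</p></body></html>',
    '<div id="hdr">Header</div>'
);
```

Inside the copy loop, the returned string is what would be written back to "$a.html".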

eslobrown

3:52 pm on Dec 23, 2009 (gmt 0)

10+ Year Member



Update: So I got the script to work thanks to all the help from Webmasterworld members. See below:

<?php
$a= 1;
$b= 10;
while ($a <= $b) {
if(!copy("http://www.example.com/$a/", "$a.html")) {
echo("failed to copy file");
}

$file = file_get_contents("$a.html");
$file = str_replace("WHAT YOU WANT TO REPLACE","WHAT TO REPLACE IT WITH", $file);
$myFile = "$a.html";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $file);
fclose($fh);

$a++;
}
?>

Here is the only problem. Some of the files I am trying to copy are actually blank files. I need a way to tell the script to skip the files that have a specific attribute, in my case a specific title tag (i.e. "Blank Title Tag").

Thanks in advance.

StoutFiles

4:00 pm on Dec 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<?php
$a= 1;
$b= 10;
while ($a <= $b) {
if(!copy("http://www.example.com/$a/", "$a.html")) {
echo("failed to copy file");
}

$file = file_get_contents("$a.html");
if(strpos($file,"Blank Title Tag") == false)
{
$file = str_replace("WHAT YOU WANT TO REPLACE","WHAT TO REPLACE IT WITH", $file);
$myFile = "$a.html";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $file);
fclose($fh);
}

$a++;
}
?>

eslobrown

5:56 pm on Dec 23, 2009 (gmt 0)

10+ Year Member



Hey Stoutfiles,

Thanks for the help but the script is still copying the blank files with the title tag "Blank Title Tag".

The only difference between mine and yours is that the title tag on the blank docs is just the company name. The ones that are not blank have something in front of the company name in the title tag, so in the script I reference this: if(strpos($file,"<title>Company Name</title>") == false)

Here is my code:

<?php
$a= 1;
$b= 10;
while ($a <= $b) {
if(!copy("http://www.example.com/$a/", "$a.html")) {
echo("failed to copy file");
}

$file = file_get_contents("$a.html");
if(strpos($file,"<title>Company Name</title>") == false)
{
$file = str_replace("WHAT YOU WANT TO REPLACE","WHAT TO REPLACE IT WITH", $file);
$myFile = "$a.html";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $file);
fclose($fh);
}

$a++;
}
?>

Thanks.

StoutFiles

6:47 pm on Dec 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could be that false is incorrect, and you should use 0 or "" instead. But run the code below and check the echoed output to see what you're getting for the files that work and the ones that don't.

<?php
$a= 1;
$b= 10;
while ($a <= $b) {
if(!copy("http://www.example.com/$a/", "$a.html")) {
echo("failed to copy file");
}

$file = file_get_contents("$a.html");
if(strpos($file,"<title>Company Name</title>") == false)
{
$pos = strpos($file,"<title>Company Name</title>")
echo "Position: ".$pos;
$file = str_replace("WHAT YOU WANT TO REPLACE","WHAT TO REPLACE IT WITH", $file);
$myFile = "$a.html";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $file);
fclose($fh);
}

$a++;
}
?>
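
On the "== false" question: strpos() returns the integer position of the match, or false when there is no match. A match at position 0 compares equal to false under ==, so the usual advice is to test with === or !==. The sketch below only demonstrates that difference; containsText() is a hypothetical helper name and the sample strings are made up.

```php
<?php
// Hypothetical helper: true only when $needle really occurs in $haystack.
function containsText($haystack, $needle)
{
    return strpos($haystack, $needle) !== false;
}

$page   = '<title>Company Name</title><body></body>';
$needle = '<title>Company Name</title>';

$pos = strpos($page, $needle); // int 0: found at the very start

var_dump($pos == false);  // bool(true): loose comparison, 0 looks like "not found"
var_dump($pos === false); // bool(false): strict comparison, the string was found
var_dump(containsText($page, $needle)); // bool(true)
```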

eslobrown

7:11 pm on Dec 23, 2009 (gmt 0)

10+ Year Member



The file is now giving me an error msg:

Parse error: parse error in script.php on line 13

eslobrown

9:14 pm on Dec 23, 2009 (gmt 0)

10+ Year Member



Hey Stoutfiles,

I think I know why your original solution isn't working.

What the script says is to copy the file, then if the particular string exists, to not copy it.

What it should say is to copy the file and if the string exists, to delete the file.

Preferably, however, it should look at the original file and not copy it at all.

Nuno

CyBerAliEn

10:03 pm on Dec 23, 2009 (gmt 0)

10+ Year Member



eslo, you seem to have caught the "issue".

Your code is two blocks:
(1) get the file and copy it
(2) grab the file contents & splice in custom header

But since you want to AVOID copying files with the "blank title"... you need to adjust the code so that it:
(1) grabs the contents
(2) checks for criteria (title)
(3) if criteria met: splice in header/etc and save file to server; otherwise ignore file and move on

So you want to modify your code so that it is something more like:

<?php
$a= 1;
$b= 10;
$title = "<title>Company Name</title>";
while ($a <= $b)
{
//Grab File Contents
$contents = file_get_contents("http://www.example.com/{$a}/");
if ($contents===false)
{
echo "Error: File contents could not be retrieved!";
}
else
{
//Check for Title
if (strpos($contents,$title)===false)
{
//Manipulate Contents
/*do str_replace etc here to change file contents before it is written*/
//Copy File
$results = file_put_contents("{$a}.html",$contents);
if ($results===false) { echo "Error: File could not be saved/written."; }
}
}
//Move on to the next file
$a++;
}
?>

This code is NOT tested but should work. Just modify the values as needed and add the code you need to "manipulate" the content (ie: your str_replace).

eslobrown

12:07 am on Dec 24, 2009 (gmt 0)

10+ Year Member



Hey CyBerAliEn,

Thanks for the help. I tried your code and am getting the following error:

Fatal error: Call to undefined function: file_put_contents() in ...

Thanks.

Eslo

rocknbil

3:36 am on Dec 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This (most likely) means you're on a PHP version older than 5, where file_put_contents() does not exist. Alternatively,


// Figure out your write mode, a for append, w for overwrite
$filemode="w";
$file=NULL;
// php 5 only, recoded for 4+ compatibility
//file_put_contents("{$a}.html",$contents);
if (is_writable("{$a}.html")) {
if (!$file = fopen("{$a}.html",$filemode)) {die("Cannot open {$a}.html in $filemode mode"); }
if (fwrite($file, $contents) === FALSE) {die("Cannot write to {$a}.html"); }
fclose($file);
}
else { die("file is not writable"); }
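
One caveat with the check above: is_writable() returns true only for a file that already exists, so it reports "not writable" for a file the script is about to create. A possible workaround, sketched here with a hypothetical helper named canWriteTo(), is to fall back to checking the directory when the file is not there yet.

```php
<?php
// Hypothetical helper: true if $path can be written, whether or not it exists yet.
function canWriteTo($path)
{
    if (file_exists($path)) {
        // Existing file: ask about the file itself.
        return is_writable($path);
    }
    // New file: it can be created only if its directory is writable.
    return is_writable(dirname($path));
}
```

In the snippet above, canWriteTo("{$a}.html") would take the place of is_writable("{$a}.html").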

eslobrown

6:45 am on Dec 24, 2009 (gmt 0)

10+ Year Member



The following code tells me the file is not writeable:

<?php
$a= 1;
$b= 10;
$title = "<title>Company Name</title>";
while ($a <= $b)
{
//Grab File Contents
$contents = file_get_contents("http://www.example.com/8200254/product/{$a}/");
if ($contents===false)
{
echo "Error: File contents could not be retrieved!";
}
else
{
//Check for Title
if (strpos($contents,$title)===false)
{
//Manipulate Contents
/*do str_replace etc here to change file contents before it is written*/
$file = str_replace("WHAT TO REPLACE","WHAT TO REPLACE IT WITH", $file);
//Copy File
// Figure out your write mode, a for append, w for overwrite
$filemode="w";
$file=NULL;
// php 5 only, recoded for 4+ compatibility
//file_put_contents("{$a}.html",$contents);
if (is_writable("{$a}.html")) {
if (!$file = fopen("{$a}.html",$filemode)) {die("Cannot open {$a}.html in $filemode mode"); }
if (fwrite($file, $contents) === FALSE) {die("Cannot write to {$a}.html"); }
fclose($file);
}
else { die("file is not writable"); }
}
}
$a++;
}
?>

If I take out the following lines:

if (is_writable("{$a}.html")) {
AND

else { die("file is not writable"); }
}

It copies the correct files (without the ones with the bad title tag) but it does not perform the find and replace:

$file = str_replace("WHAT TO REPLACE","WHAT TO REPLACE IT WITH", $file);

I can't figure out what I'm doing wrong.

Thanks.

ALKateb

10:58 am on Dec 24, 2009 (gmt 0)

10+ Year Member



The $file variable is no longer the right variable to run str_replace on. You see, you wrote: $contents = file_get_contents("http://www.example.com/8200254/product/{$a}/");

So the str_replace function should now be working on $contents, not on $file:

$contents = str_replace("WHAT TO REPLACE","WHAT TO REPLACE IT WITH", $contents);
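
To keep the check and the replacement working on the same variable, the two steps can be folded into one small function. This is just a sketch: preparePage() and its parameter names are made up for illustration, and the fetching and writing stay in the loop as before.

```php
<?php
// Hypothetical helper: returns the text to save, or null if the page
// carries the "blank" title and should be skipped.
function preparePage($contents, $blankTitle, $search, $replace)
{
    // Strict comparison, so a match at position 0 still counts as found.
    if (strpos($contents, $blankTitle) !== false) {
        return null;
    }
    return str_replace($search, $replace, $contents);
}

// Local demo with made-up page content.
$page = '<title>Widgets - Company Name</title><body>OLD</body>';
$out  = preparePage($page, '<title>Company Name</title>', 'OLD', 'NEW');
echo ($out === null) ? "skipped\n" : $out . "\n";
```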

eslobrown

3:00 pm on Dec 24, 2009 (gmt 0)

10+ Year Member



Thanks ALKateb. That fixed it. One final question: is there a way to tell the script to continue from where it quit?

There are about 60K files that need to be "spidered" with only about 3000 valid pages. Unfortunately the script keeps quitting. It would be great if the script found the last file generated and continued copying from there.

In any case, thanks to everyone who helped. I definitely could not have put this together without your help.

Happy Holidays!

eslobrown

4:30 pm on Dec 31, 2009 (gmt 0)

10+ Year Member



Hey Everyone,

Can anyone help with this last request? I need the script to restart once it fails.

Thanks.

Eslo Brown

eslobrown

7:48 pm on Jan 1, 2010 (gmt 0)

10+ Year Member



Bump. Anyone have any ideas? Thanks.
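
One possible approach to resuming, untested against the setup in this thread: since most of the 60K numbers produce no file, it may be more reliable to record the last number tried than to look for the last file generated. The sketch below assumes a small text file for that purpose; loadStart(), saveProgress() and the file name progress.txt are all hypothetical. It uses fopen()/fwrite() rather than file_put_contents() so that it does not depend on PHP 5.

```php
<?php
// Returns the number to start from: one past the last saved number,
// or $default when no progress has been recorded yet.
function loadStart($progressFile, $default)
{
    if (!file_exists($progressFile)) {
        return $default;
    }
    $last = (int) trim(file_get_contents($progressFile));
    return $last + 1;
}

// Records that number $n has been fully processed.
function saveProgress($progressFile, $n)
{
    $fh = fopen($progressFile, 'w');
    if ($fh === false) {
        return false;
    }
    fwrite($fh, (string) $n);
    fclose($fh);
    return true;
}
```

In the loop from this thread, $a would start at loadStart('progress.txt', 1) instead of 1, and saveProgress('progress.txt', $a) would be called just before $a++. If the script is quitting because of PHP's execution time limit (an assumption, the thread does not say why it stops), calling set_time_limit(0) at the top may also help where the host allows it.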