... get all of the html and write them to files locally on my machine.
Unless you have some particular requirement, there are applications that will do this sort of thing for you. They basically crawl a website and save all of the HTML and images locally for offline browsing.
One that I've used in the past: HTTrack
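If you'd rather script it, HTTrack also has a command-line client. Assuming the httrack binary is installed and on your PATH (and using example.com as a stand-in for the real site), a minimal mirror command looks something like:
httrack "http://www.example.com/" -O "C:\mirror\example"
where -O sets the local folder the copy is written to.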
The client doesn't know the password to get the files...
Is it not possible to contact the host and reset the password? If all is legit that is! ;)
<?php
$urls = array("", "dreamkit.html", "contents.html",
"buy.html", "about.html","contact.html");
get_web_page("website", $urls, "http://www.example.com");
//this function takes a path to write to, an array of
//URLs to get HTML from, and a domain name. example usage:
//get_web_page("folder", array("test.html", "test\test.html", "test\test2\test3.html"), "http://www.someplace.com");
//(include the scheme, e.g. http://, so parse_url can find it later)
//it creates subdirectories, so within "folder" you would have the file
//test.html and the folder test; within folder\test
//you would have test.html and the folder test2, and so on.
//this way the local copy keeps the same directory tree.
//IMPORTANT!
//if you also want to read the base domain,
//i.e. www.example.com, you need to add an empty string
//"" to your array, like the last entry in this array:
//array("test.html", "test\test.html", "test\test2\test3.html", "");
//the base domain is then written out as wwwIndex_com.html
function get_web_page($writeTo, $arrayOfUrls, $domain){
//go through all of the urls
foreach($arrayOfUrls as $i){
//initialize this back to its original state
$tempTo = $writeTo;
$url = $domain;
//check to see if there is a backslash in the
//path. if there is, then we need
//to make a directory for each folder
//(compare against false, since strpos returns 0 for a match at position 0)
if(strpos($i, "\\") !== false){
//if there are slashes create an array
//to get as many folders as we need
$arrName = explode("\\", $i);
$fileName = end($arrName);
//a counter
$count = 0;
//now add the folders to the write to directory
while($count < count($arrName) - 1){
$tempTo .= "\\".$arrName[$count];
$url .= "/".$arrName[$count];
//see if the directory exists
//if not then we create it
if(!(is_dir($tempTo))){
if(!(mkdir($tempTo))){
echo("There was an error in creating the directory $tempTo");
}//if !mkdir
}//if !is_dir
$count++;
}//while
$tempTo .= "\\$fileName";
}//if strpos
else{
$tempTo .= "\\$i";
$url .="/".$i;
}//else
//if i is blank then we are just
//going to read the domain
if($i == ""){
$url = $domain;
$tempTo = "{$writeTo}\\wwwIndex_com.html";
}//if
//make sure the page exists
if(page_exists($url)){
//now we need to get the html
//and write it to the file
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
//curl doesn't throw exceptions; it returns false on failure
if($html === false){
echo("There was an error in trying to open the connection to $url<br>");
var_dump(curl_error($ch));
die();
}//if curl failed
//write the html to a file
//("w" overwrites the file, so re-running the script doesn't append duplicates)
$temp = fopen($tempTo, "w");
if($temp === false || fwrite($temp, $html) === false){
echo("There was an error in writing the file $tempTo");
die();
}//if write failed
fclose($temp);
curl_close($ch);
}//if
else{
echo("The domain $url does not exist<br>");
}
}//foreach
echo("The directories and files were successfully written");
}//get_web_page
function page_exists($url){
$parts=parse_url($url);
if(!$parts) return false; /* the URL was seriously wrong */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
/* set the user agent - might help, doesn't hurt */
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
/* try to follow redirects */
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
/* timeout after the specified number of seconds. assuming that this script runs
on a server, 20 seconds should be plenty of time to verify a valid URL. */
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
/* don't download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
/* handle HTTPS links */
if($parts['scheme']=='https'){
/* skip certificate checks; CURLOPT_SSL_VERIFYHOST only accepts 0 or 2 */
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}
$response = curl_exec($ch);
curl_close($ch);
/* get the status code from HTTP headers */
if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
/* the status code is in the first capture group */
$code = intval($matches[1]);
} else {
return false;
}
/* see if code indicates success */
return (($code>=200) && ($code<400));
}
?>
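For what it's worth, the call at the top of the script (example.com and its page names are just the placeholder values from the snippet) should leave you with a layout like this:
website\wwwIndex_com.html
website\dreamkit.html
website\contents.html
website\buy.html
website\about.html
website\contact.html
None of those entries contain a backslash, so no subfolders are created; a hypothetical entry like "docs\help.html" would instead be written to website\docs\help.html, mirroring the site's directory tree.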
There isn't a version of HTTrack for Vista. Any other ideas?
Hmm, strange. I see an advertised 'Vista version' available for download from another website, but not on the official HTTrack website. However, there is hope in the HTTrack forums:
Subject: Re: HTTrack for Vista
Author: Tony
Date: 16/06/2009 23:07
I have downloaded the XP version and it is working just fine in Vista
Also, the HTTrack download page mentions a Mozilla (I assume Firefox) extension called SpiderZilla that does something similar.
<edit>Thanks for sharing your code.</edit>