Forum Moderators: coopster


Read directories of a website.


andrewsmd

1:50 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know if this is even possible, but I need to get all of the valid URLs of a domain. Is there a way to do this with PHP? Here's what I'm trying to do: I'm getting ready to make edits to a basic HTML site for a client. The client doesn't know the password to get at the files, so I'm going to use curl and PHP to fetch all of the HTML and write it to files locally on my machine. My original plan was just to create an array of the URLs on the site and read all of them, but I was afraid I might miss a page. Is there any way you can point PHP at www.someplace.com and have it find all of the URLs within it? I'm not sure, just wondering. I do have my own dedicated server, so I can install any extensions. If not, no big deal; I'll just go through manually and collect all of the pages. Thanks,
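
One way to start on the crawl idea is to fetch a page with curl and pull the href values out of it with the DOM extension. A minimal sketch (the domain is a placeholder, and this doesn't recurse or resolve relative URLs):

<?php
//minimal sketch: fetch one page and list the href values found on it
//assumptions: curl and the DOM extension are installed; the URL is a placeholder;
//relative links, duplicates and recursion are not handled here
$start = "http://www.example.com/";

$ch = curl_init($start);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);

if($html !== false){
    $doc = new DOMDocument();
    @$doc->loadHTML($html); //suppress warnings from imperfect HTML
    foreach($doc->getElementsByTagName("a") as $a){
        echo $a->getAttribute("href")."\n";
    }
}
?>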

penders

2:24 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... get all of the html and write them to files locally on my machine.

There are applications that will do this sort of thing for you - unless you have some particular requirement. They basically crawl a website and save all HTML and images locally for offline browsing.

One that I've used in the past: HTTrack

The client doesn't know the password to get the files...

Is it not possible to contact the host and reset the password? If all is legit that is! ;)

andrewsmd

5:41 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's legit, but the client can't remember which email address they used, and the company won't give us the password without the email or the last four digits of the credit card used to pay for the site. Of course, this idiot doesn't have that card anymore, or a receipt. The subscription is about up, so I just need to get all of the files and we are going to set the site up with another host. Thanks, I'll check out HTTrack.

andrewsmd

5:42 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There isn't a version of HTTrack for Vista. Any other ideas?

LifeinAsia

6:10 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Can you have the company send the password to the registered e-mail? Assuming "this idiot" still uses that e-mail address, he should get the password that way.

andrewsmd

6:51 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is the code I wrote to do it, in case anyone wants it. Just make sure you either change the path it writes to or create a folder called website wherever this PHP file is located. This WILL overwrite any files that are already in the directory, so you've been warned.

<?php

$urls = array("", "dreamkit.html", "contents.html",
"buy.html", "about.html","contact.html");
get_web_page("website", $urls, "http://www.example.com");

//this function takes a path to write to,
//an array of relative paths to fetch, and a domain name. example usage:
//get_web_page("folder", array("test.html", "test\\test.html", "test\\test2\\test3.html"), "http://www.someplace.com");
//note that backslashes inside double-quoted strings must be escaped as \\
//it will create subdirectories as needed, so within folder you would have
//the file test.html and the folder test; within folder\test
//you would have test.html and the folder test2
//this way the local copy keeps the same directory tree
//IMPORTANT!
//if you also want to read the base domain itself
//i.e. www.example.com, add an empty ""
//entry to your array, as in the call above;
//the base page is written as wwwIndex_com.html
function get_web_page($writeTo, $arrayOfUrls, $domain){

//go through all of the urls
foreach($arrayOfUrls as $i){

//initialize this back to its original state
$tempTo = $writeTo;
$url = $domain;
//check to see if there is a backslash in the
//path. if there is, then we need
//to make a directory for each folder
//(compare against false so a match at position 0 still counts)
if(strpos($i, "\\") !== false){

//if there are slashes create an array
//to get as many folders as we need
$arrName = explode("\\", $i);
$fileName = end($arrName);

//a counter
$count = 0;

//now add the folders to the write to directory
while($count < count($arrName) - 1){

$tempTo .= "\\".$arrName[$count];
$url .= "/".$arrName[$count];
//see if the directory exists
//if not then we create it
if(!(is_dir($tempTo))){

if(!(mkdir($tempTo))){

echo("There was an error in creating the directory $tempTo");

}//if !mkdir

}//if !is_dir

$count++;
}//while

$tempTo .= "\\$fileName";
}//if strpos
else{
$tempTo .= "\\$i";
$url .="/".$i;
}//else

//if i is blank then we are just
//going to read the domain
if($i == ""){
$url = $domain;
$tempTo = "{$writeTo}\\wwwIndex_com.html";
}//if

//make sure the page exists
if(page_exists($url)){
//now we need to get the html
//and write it to the file
//curl does not throw exceptions; it returns false on failure
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
if($html === false){
echo("There was an error in trying to open the connection to $url<br>");
var_dump(curl_error($ch));
die();
}//if curl failed

//write the html to a file
//"w" truncates any existing file, which is the overwrite behaviour warned about above
$temp = fopen($tempTo, "w");
if(fwrite($temp, $html) === false){

echo("There was an error in writing the file $tempTo");
die();

}//if fwrite failed

fclose($temp);

curl_close($ch);
}//if
else{
echo("The domain $url does not exist<br>");
}

}//foreach
echo("The directories and files were successfully written");
}//get_web_page

function page_exists($url){
$parts=parse_url($url);
if(!$parts) return false; /* the URL was seriously wrong */

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);

/* set the user agent - might help, doesn't hurt */
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

/* try to follow redirects */
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

/* timeout after the specified number of seconds. assuming that this script runs
on a server, 20 seconds should be plenty of time to verify a valid URL. */
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);

/* don't download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);

/* handle HTTPS links */
if($parts['scheme']=='https'){
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); /* 2 checks the certificate's host name; newer curl versions reject the value 1 */
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}

$response = curl_exec($ch);
curl_close($ch);

/* get the status code from the HTTP headers */
if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
$code = intval($matches[1]); /* use the captured group, not the whole $matches array */
} else {
return false;
}

/* see if code indicates success */
return (($code>=200) && ($code<400));
}
?>

[edited by: andrewsmd at 7:03 pm (utc) on July 29, 2009]

[edited by: dreamcatcher at 3:30 pm (utc) on July 30, 2009]
[edit reason] removed url. [/edit]

penders

6:53 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There isn't a version of HTTrack for Vista. Any other ideas?

Hhhmm strange, I see an advertised 'Vista version' available for download from another website, but not on the official HTTrack website. However, in the HTTrack forums there is hope:

Subject: Re: HTTrack for Vista
Author: Tony
Date: 16/06/2009 23:07

I have downloaded the XP version and it is working just fine in Vista

But also, on the HTTrack download page there is mention of a Mozilla (I assume Firefox) extension called SpiderZilla which does something similar.

<edit>Thanks for sharing your code.</edit>

andrewsmd

7:10 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, to be honest, I really wanted to write it just to see if I could do it. Now I just have to work on reading all of that HTML to find any .js, .css, and image files and download them to the correct directories.
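
A rough sketch of that next step using the DOM extension (the file path below is just the one this script would have produced; actually downloading the assets is left out):

<?php
//sketch: pull asset URLs (.js, .css, images) out of a saved HTML file
//assumptions: the DOM extension is available; the path is an example from the script above
$html = file_get_contents("website\\wwwIndex_com.html");

$doc = new DOMDocument();
@$doc->loadHTML($html); //suppress warnings from imperfect HTML

$assets = array();
foreach($doc->getElementsByTagName("script") as $s){
    if($s->getAttribute("src") != "") $assets[] = $s->getAttribute("src");
}
foreach($doc->getElementsByTagName("link") as $l){
    if($l->getAttribute("href") != "") $assets[] = $l->getAttribute("href"); //stylesheets etc.
}
foreach($doc->getElementsByTagName("img") as $img){
    if($img->getAttribute("src") != "") $assets[] = $img->getAttribute("src");
}

print_r($assets);
?>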

penders

7:33 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, to be honest, I really wanted to write it just to see if I could do it.

A good reason :)

bkeep

5:51 am on Aug 1, 2009 (gmt 0)

10+ Year Member



This may be too late to be of any use, but you could try:

wget -r http://example.com

This should grab all of the public-facing files, following links and grabbing images recursively. wget should be standard on any Linux box. I am not sure whether you need to grab files from remote sites as well, but you may want to read the man page.
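
If it helps to keep everything in PHP, the same wget run could be kicked off from a script. A minimal sketch (assumes wget is installed and on the PATH; the URL is a placeholder):

<?php
//sketch: run a recursive wget mirror from PHP
//assumptions: wget is installed and on the PATH; the URL is a placeholder
$url = escapeshellarg("http://www.example.com/");

//-r  recurse through links
//-p  also fetch page requisites (images, CSS, JS)
//-P  put everything under ./website
passthru("wget -r -p -P website " . $url);
?>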