tool for indexing website

would like a list of all files in a website

         

pixeltierra

9:03 pm on Nov 7, 2007 (gmt 0)

10+ Year Member



I frequently have to re-design / restructure websites. It's not always easy to eyeball a website and know how "big" it is, or if I've gotten to every page.

Does anyone know of a tool that, given a domain, produces a simple list, grouped by directory, of every file linked to within the site? Something like this:

/dir1/file.ext
/dir1/file.ext
/dir1/file.ext
/dir1/file.ext
/dir1/file.ext
/dir1/file.ext

/dir2/file.ext
/dir2/file.ext
/dir2/file.ext
/dir2/file.ext
/dir2/file.ext

/dir3/dir/file.ext
/dir3/dir/file.ext
/dir3/dir/file.ext
/dir3/dir/file.ext
/dir3/dir/file.ext
/dir3/dir/file.ext

...

I guess I could write one, but why bother...

willybfriendly

9:24 pm on Nov 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Xenu might do this for you. It's been a while since I used it.

thecoalman

2:46 am on Nov 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just used WebReaper to generate a couple of thousand static files from a forum locally; it took about 10 minutes. Having said that, it appears to be an evil program. lol

jtara

4:10 pm on Nov 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does the indexing tool have to work via the HTTP interface? And does it have to list URLs, or would files be OK? (I know you said "files", but you apparently *meant* URLs.)

If you have access to a Linux shell on the server, "du" will do what you want, and give you file sizes (and the aggregate sizes of files in directories) to boot.
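For anyone who can run a script on the server but not a shell, here is a rough Python sketch of the same idea. It is not du itself: it sums apparent file sizes in bytes rather than disk blocks, and the function name `directory_sizes` is just an illustrative choice. Like du, it walks everything under a directory whether or not anything links to it.

```python
import os


def directory_sizes(root):
    """Return {directory: total bytes of the files in it and below it}."""
    totals = {}
    # topdown=False visits subdirectories before their parents, so each
    # child's total is already known when the parent is processed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = 0
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):  # skip symlinks, including broken ones
                size += os.path.getsize(full)
        for name in dirnames:
            # Symlinked directories are not walked, so they contribute 0.
            size += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = size
    return totals


# Example use (the path is a placeholder):
# for directory, size in sorted(directory_sizes("/var/www/site").items()):
#     print(size, directory)
```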

pixeltierra

6:39 pm on Nov 11, 2007 (gmt 0)

10+ Year Member



jtara:

I usually don't have shell access, but I can run a script that executes shell commands. I've looked into the du command and it seems to work on every file in a dir recursively, whereas I want "every file linked to within the site". I should add that I mean only files internal to the site.

URL paths would be fine, and so would internal file system paths. Doesn't matter.

It has to be only those files that can be viewed publicly from the site, since those are the files I'd have to re-format. I'm trying to estimate the size of my workload, not the total size of all files on the site.

I'm sure you all know what I mean. Many sites have atrocious markup that cannot be controlled with CSS without major house-cleaning. If a site has 20 files that need to be cleaned, I can estimate my time easily. The same goes for 200 files. I just need to know how many there are. The problem is that it's not always easy to know how many files I will have to re-format. I want my cost estimates for clients to be as accurate as possible.

Hope that makes sense. Maybe I should just write one myself. In the past, I have relied on the number of pages indexed in Google, which is unreliable for obvious reasons.
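If it does come to writing one, a minimal sketch in Python might look like the following. It assumes the goal is "every internal URL reachable by following links from a start page", uses only the standard library, and all of the names (`crawl`, `group_by_directory`, `fetch_html`, the `max_pages` cap) are illustrative choices, not an existing tool. It does not honor robots.txt, throttle requests, or see links generated by JavaScript, and URLs that differ only in their query string are counted once.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen
import posixpath


class LinkParser(HTMLParser):
    """Collects href and src attribute values from one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def fetch_html(url):
    """Return the body as text, or "" for non-HTML or failed requests."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def crawl(start_url, fetch=fetch_html, max_pages=500):
    """Breadth-first crawl of one host; returns the set of internal URL paths."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    paths = set()
    while queue and len(paths) < max_pages:
        url = queue.popleft()
        paths.add(urlparse(url).path or "/")
        parser = LinkParser()
        parser.feed(fetch(url))
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, link)).url
            parts = urlparse(absolute)
            internal = parts.scheme in ("http", "https") and parts.netloc == host
            if internal and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return paths


def group_by_directory(paths):
    """Return {directory: sorted list of paths in that directory}."""
    groups = defaultdict(list)
    for path in sorted(paths):
        groups[posixpath.dirname(path)].append(path)
    return dict(groups)


# Example use (the domain is a placeholder):
# for directory, files in sorted(group_by_directory(crawl("http://www.example.com/")).items()):
#     print("\n".join(files), end="\n\n")
```

The total number of paths returned gives the file count for an estimate. The `fetch` parameter is there so the crawl logic can be tried against a dictionary of canned pages before pointing it at a live site.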