Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Image Crawler to remove unused images in a website
fm86




msg:4425031
 12:35 pm on Mar 5, 2012 (gmt 0)

Hello people!

I have a folder (images) in my website which contains, in a rather unstructured way, all the images used on the site. Over time this folder has grown, and many of the files in it are no longer used. Now it's time for some cleanup, and I need to choose the best strategy for removing all the unused files while preserving the used ones.

The best idea that has come to mind is to write a program that crawls the whole website, following every link, and writes to a file every request made to the images folder. I would then delete all the files that are not in that list.

Before I start reinventing the wheel, I would like to know whether this idea makes sense and whether there are already libraries that perform part of this task. Any suggestions are welcome!
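
A minimal sketch of the last step of this plan (working out which files are not in the list), assuming a POSIX shell with find, sort and grep. The function name list_unused_images and the one-file-name-per-line format of the list are assumptions made for illustration; no existing tool is implied to produce that file. The function only prints candidates and deletes nothing.

```shell
# list_unused_images IMAGES_DIR USED_LIST   (hypothetical helper)
# IMAGES_DIR: folder holding the image files.
# USED_LIST:  text file with one image file name per line (assumed format).
# Prints every file name found in IMAGES_DIR that is not in USED_LIST.
list_unused_images() {
    # -v invert the match, -x match whole lines only, -F literal strings,
    # -f read the names to exclude from USED_LIST
    find "$1" -type f -exec basename {} \; | sort -u | grep -vxF -f "$2"
}
```

Called as list_unused_images images used.txt, it prints one unused file name per line, so the result can be reviewed before anything is removed.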

 

g1smd




msg:4425040
 1:32 pm on Mar 5, 2012 (gmt 0)

Xenu's Link Sleuth will do this, but only if you also give it FTP access to scan the server filesystem.

fm86




msg:4425065
 2:32 pm on Mar 5, 2012 (gmt 0)

Thanks for the interesting reply! This sounds like a really cool piece of software. I have a question though, in case you are an expert user: I just read the software's specs and I wonder whether it can detect that an image is required through a CSS declaration such as background-image: url(myUrl). It doesn't look like it checks the GET requests made for files.

If it does, would you be so kind as to give me some advice on how to configure it for my purpose? Besides, my site only runs locally for now, so I guess no FTP is needed.
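
Independently of what Xenu does, the images referenced from style sheets can be listed directly. The following is only a sketch, assuming a grep that supports the -o and -h options (GNU and BSD grep both do); css_image_urls is a made-up name and the pattern is a simple approximation, not a CSS parser.

```shell
# css_image_urls FILE...   (hypothetical helper)
# Prints every distinct url(...) target found in the given CSS files,
# with the surrounding url( ), quotes and spaces stripped.
css_image_urls() {
    grep -ohiE 'url\([^)]*\)' "$@" |
        sed -e 's/^[^(]*(//' \
            -e 's/)$//' \
            -e "s/^[[:space:]]*[\"']*//" \
            -e "s/[\"']*[[:space:]]*\$//" |
        sort -u
}
```

Note that url() is also used for fonts and data: URIs, so the output may contain entries that are not files in the images folder.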

g1smd




msg:4425069
 2:56 pm on Mar 5, 2012 (gmt 0)

After Xenu scans the website via HTTP (the site therefore needs to be running on an HTTP server such as Apache), it asks for the FTP credentials so it can look in all the folders to find any files that were not accessed during the HTTP scan. Those are the unused files.

I have no idea if Xenu looks for files mentioned in style sheets. I have never considered that possibility. I would hope that it does. It is quite easy to test whether it does or not.

fm86




msg:4425438
 7:51 am on Mar 6, 2012 (gmt 0)

Hmm... Seems it won't work for JS and CSS:

"Please be careful with removing files when listed in an orphan report. Especially navbar mouseover images will be seen as 'orphans', because Xenu cannot find links to it."
(from [integralworld.net...])

Xenu is cool but doesn't seem to be the perfect solution in this case. I would need something that checks the GET requests made to the server. That way, all requests triggered by CSS, JS and AJAX would be captured as well.

Any further suggestions?
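
One way to look at the GET requests themselves is the web server's access log. This is a rough sketch, assuming Apache's Common Log Format (request method in field 6, path in field 7, status in field 9) and an /images/ URL prefix; the function name requested_images is made up, and neither the log format nor the prefix is stated anywhere in this thread.

```shell
# requested_images LOGFILE [PREFIX]   (hypothetical helper)
# Prints the distinct file names requested under PREFIX (default /images/)
# that the server answered with status 200 or 304.
requested_images() {
    awk -v prefix="${2:-/images/}" '
        $6 == "\"GET" && ($9 == 200 || $9 == 304) {
            path = $7
            sub(/[?#].*$/, "", path)      # drop any query string
            if (index(path, prefix) == 1)
                print substr(path, length(prefix) + 1)
        }' "$1" | sort -u
}
```

The log only covers pages that were actually visited while it was being written, so the whole site has to be crawled or clicked through first. Percent-encoded file names are not decoded here.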

Jonesy




msg:4427922
 9:30 pm on Mar 11, 2012 (gmt 0)

Linux:
I generally use a web site layout à la:
...../documents/
...../documents/images/

On my workstation (using the command line) I change into the .../images/ sub-directory and run:

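...../images$ for IMG in * ; do echo "$IMG" ; grep -l -F -- "$IMG" ../* ; done

(That's a lower-case "ell" for the first grep switch. The -F switch makes grep treat the file name as a literal string rather than a regular expression, and the quotes keep names containing spaces intact.)
Any image name that is not followed by lines of file names from grep is not referenced in any document: html, php, css, etc.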

If you wish, you could get more elaborate with the -r|-R (recursive) switch for grep.

Of course it works only for static web pages....

HTH,
Jonesy
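
The recursive variant mentioned above could be sketched as follows. It assumes a grep that supports --exclude-dir (GNU grep does, as does the grep shipped with recent BSD and macOS systems); the function name unreferenced_images is made up for illustration. As with the one-liner, it only sees names written literally in the files, so it works only for static references.

```shell
# unreferenced_images IMAGES_DIR DOC_ROOT   (hypothetical helper)
# Prints the name of every file in IMAGES_DIR that is not mentioned
# in any file below DOC_ROOT.
unreferenced_images() {
    for img in "$1"/*; do
        [ -f "$img" ] || continue
        name=$(basename "$img")
        # -r recurse, -q report via exit status only, -F literal string;
        # the images folder itself is skipped so binary files are not searched
        grep -rqF --exclude-dir="$(basename "$1")" -- "$name" "$2" ||
            printf '%s\n' "$name"
    done
}
```

With the layout above, unreferenced_images documents/images documents lists the candidates for removal.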

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved