
Image Crawler to remove unused images in a website

   
12:35 pm on Mar 5, 2012 (gmt 0)

5+ Year Member



Hello people!

I have a folder (images) on my website which contains, in a rather unstructured way, all the images used on the site. Over time this folder has grown in size and many of the files in it are no longer used. Now it's time to do some cleaning, and I need to choose the best strategy to remove all the unused files while preserving the used ones.

The best idea that came to my mind is to write a program that crawls the whole website, following all the links, and writes to a file every request made to the images folder. Then I would delete all the files that are not in that list.
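For illustration, this is roughly what I have in mind, sketched with wget and coreutils instead of custom code; the URL and directory paths below are just placeholders for my setup:

# mirror the site over HTTP, also pulling in page requisites (images, CSS, JS)
wget --mirror --page-requisites --no-parent -P /tmp/crawl http://localhost/

# list images that sit on disk but were never fetched during the crawl
comm -23 <(ls /var/www/mysite/images | sort) <(ls /tmp/crawl/localhost/images 2>/dev/null | sort) > maybe-unused.txt

As far as I know, newer wget versions also follow url() references in CSS, but I would double-check that before trusting the result.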

Before reinventing the wheel I would like to know whether this idea makes sense and whether there are already libraries that perform part of this task. Any suggestion is welcome!
1:32 pm on Mar 5, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Xenu LinkSleuth will do this, but only if you let it also have FTP access to scan the server filesystem.
2:32 pm on Mar 5, 2012 (gmt 0)

5+ Year Member



Thanks for the interesting reply! This sounds like a really cool piece of software. I have a question though, in case you are an expert user: I just read the software's specs and I wonder whether it can see that an image is required via the CSS background-image: url(myUrl) property. It doesn't look like it checks the GET requests for files.

If it does, would you be so kind as to give me some advice on how to configure it for my purpose? Besides, my site only runs locally for now, so I guess no FTP is needed.
2:56 pm on Mar 5, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



After Xenu scans the website via HTTP (the site therefore needs to be running on an HTTP server such as Apache), it asks for the FTP credentials so it can look in all the folders and find any files that were not accessed during the HTTP scan - those are the unused files.

I have no idea if Xenu looks for files mentioned in style sheets. I have never considered that possibility. I would hope that it does. It is quite easy to test whether it does or not.
7:51 am on Mar 6, 2012 (gmt 0)

5+ Year Member



Hmm... Seems it won't work for JS and CSS:

"Please be careful with removing files when listed in an orphan report. Especially navbar mouseover images will be seen as “orphans”, because Xenu cannot find links to it. "
(from [integralworld.net...])

Xenu is cool but doesn't seem to be the perfect solution in this case. I would need something that checks the GET requests made to the server. That way all CSS, JS and AJAX requests would be covered.
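For instance, if I crawl or click through the whole site while the server is logging, something along these lines might already do it (a rough sketch assuming Apache's combined log format and placeholder paths):

# every filename that was requested from /images/ during the logging period
awk '$7 ~ "^/images/" { sub("^/images/", "", $7); sub(/\?.*/, "", $7); print $7 }' /var/log/apache2/access.log | sort -u > requested.txt

# everything actually present in the folder
ls /var/www/mysite/images | sort > present.txt

# candidates for deletion: on disk but never requested
comm -23 present.txt requested.txt > unused-candidates.txt

But that still depends on every page (and every AJAX call) actually being hit, so a crawler would have to drive it.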

Any further suggestion?
9:30 pm on Mar 11, 2012 (gmt 0)

5+ Year Member



linux:
I generally use a web site layout along the lines of:
...../documents/
...../documents/images/

On my workstation (using the command line) I enter the .../images/ sub-directory
and:

...../images$ for IMG in * ; do echo "$IMG" ; grep -l "$IMG" ../* ; done

(That's a lower-case "ell" for the grep switch.)
Any image listed without any following lines of files found by grep is a file that does not appear in any document: HTML, PHP, CSS, etc.

If you wish, you could get more elaborate with the -r|-R (recursive) switch for grep.
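For example (just a sketch; --exclude-dir needs GNU grep, and I'm assuming the images directory sits directly below the document root):

...../images$ for IMG in * ; do grep -rq --exclude-dir=images -- "$IMG" ../ || echo "unused: $IMG" ; done

That prints only the orphan candidates instead of every image name.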

Of course it works only for static web pages....

HTH,
Jonesy
 
