Find URLs on web server

How to find URLs that need updating...

4:43 pm on Nov 5, 2002 (gmt 0)

10+ Year Member

I inherited a fairly large university site and need a way to find every page on the site that links to a given URL. (When a URL changes, I find myself just guessing which pages might link to it.)

We use the Google University search and I have tried using that, but it does not seem to be effective for this purpose.

I have tried searching the web for a solution, but no luck.

Thank you to anyone who can help!

5:11 pm on Nov 5, 2002 (gmt 0)

If you have linux, try this:

[I assume your URL is '/somepath/myurl.html', and your web server root path is '/home/httpdocs']

rgrep -x 'html' -rl '/somepath/myurl.html' /home/httpdocs > yourresultfile.txt

rgrep -x 'htm' -rl '/somepath/myurl.html' /home/httpdocs >> yourresultfile.txt

lazy way.. :)


7:17 pm on Nov 5, 2002 (gmt 0)

I meant to mention - we have a UNIX server... (I even tried the linux version of the command, but it didn't like "rgrep").

Unfortunately I am not the server admin - I know enough unix to make my job easier, but that is it...

Thanks again.

8:05 pm on Nov 5, 2002 (gmt 0)

man grep
man rgrep

(myself, I don't know what rgrep is...)

If all else fails, you can copy the whole site to your local hard drive and use Windoze' search capabilities to find the old urls, then just update them back in the source.

9:02 pm on Nov 5, 2002 (gmt 0)

If you don't have rgrep installed, the command above won't work with plain grep, because grep lacks rgrep's '-x' option [-> search only files with this extension].

Of course, you can use 'grep' instead of 'rgrep' and omit the '-x' arg, but then you'll pay a big CPU, RAM and disk I/O cost, because grep will scan ALL the files in the given directory.

This is another, correct, way, with grep:

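# collect every .html and .htm file under the web root
find /home/httpdocs -name '*.html' > tmp.txt
find /home/httpdocs -name '*.htm' >> tmp.txt
# -l lists matching files; -r is dropped because the files are named explicitly
grep -l '/somepath/myurl.html' `cat tmp.txt` > yourresultfile.txt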

Note that you don't need root privileges for doing this.
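
The backtick approach can trip over file names that contain spaces, and on a large site the expanded list may exceed the shell's command-line limit. A minimal sketch of a variant that avoids the temporary file, assuming a POSIX-conformant find and grep (older UNIX systems may need fgrep or a different grep binary for fixed-string search). The function name find_links is made up for illustration; the URL and root path are the same example values used above.

```shell
#!/bin/sh
# List every .html/.htm file under a document root that mentions a given URL.
# find_links is a hypothetical helper name, not something from this thread.
find_links() {
    url=$1    # link text to look for, e.g. /somepath/myurl.html
    root=$2   # web server root, e.g. /home/httpdocs
    # -F treats the URL as a fixed string, so its dots are not regex wildcards.
    # -l prints only the names of files that contain a match.
    # "-exec ... {} +" lets find batch the file names itself, so names with
    # spaces survive and the command line is never too long.
    find "$root" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -F -l -- "$url" {} +
}

# Example (paths are the assumed ones from the posts above):
#   find_links '/somepath/myurl.html' /home/httpdocs > yourresultfile.txt
```

Like the commands above, this only reads files, so it should not need root privileges as long as the pages are readable by your account.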


11:19 pm on Nov 5, 2002 (gmt 0)

Try a
grep /the/link/you/want/to/find `find . -iname '*html'`
to search all *.html pages in the current directory and below.
Works for me ...

8:15 pm on Nov 8, 2002 (gmt 0)

Thank you for the responses - I did try some of those, and they didn't quite work... today a kind soul at the college wrote a shell script that does what I need it to do (thank goodness we have a computer science program...)
Thanks again!
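
The script itself was not posted in the thread. Purely as an illustration of what a small script for this job could look like (this is not the one written at the college), here is a sketch that reports each matching file together with the line number of the link. The function name report_links and the default root /home/httpdocs are assumptions borrowed from the earlier examples, and a POSIX-conformant find and grep are assumed.

```shell
#!/bin/sh
# Hypothetical sketch: report every line in the site's HTML files that
# mentions a URL. Not the script mentioned in the post above.
report_links() {
    url=$1                      # URL text to search for
    root=${2:-/home/httpdocs}   # document root; default is an assumed path
    # grep -n prefixes each hit with its line number. Passing /dev/null as an
    # extra file makes grep print the file name even when find hands it
    # only a single file.
    find "$root" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -F -n -- "$url" /dev/null {} +
}

# Example: report_links '/somepath/myurl.html' /home/httpdocs
```

Each output line has the form file:line:text, which makes it easy to open the page at the right spot and update the link by hand.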

