I am using wget to check for bad links on my site. We have a lot of links on our site that our servers redirect to other sites (e.g. a link to an affiliate that we want to track in our logs). They are dynamic links that look static; that is, they end with a special suffix like *redir.html.
I am having a hard time preventing wget from following these links. Here's an example:
wget --recursive --delete-after --no-directories --no-host-directories --reject="*redir.html" [mysite.com...]
There are two problems. First, wget doesn't seem to be honoring the --reject= option (which should prevent it from following all links matching this pattern, or so I believe). Second, regardless of whether host spanning is on or off, or what I have in --exclude-domains, it still follows this redirect.
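To be clear about what I expect the pattern to do, here is a rough shell sketch. It is only an illustration using the shell's own globbing, not how wget actually implements --reject; the function name and the mysite.example URLs are made up for the example.

```shell
#!/bin/sh
# Illustration only: mimics a suffix-style reject pattern with shell globbing.
# This is NOT wget's implementation; the URLs below are hypothetical.
matches_reject() {
    # Strip everything up to the last slash to get the file name part.
    name=${1##*/}
    case "$name" in
        *redir.html) return 0 ;;   # name matches the pattern: should be skipped
        *)           return 1 ;;   # anything else: should be fetched
    esac
}

for url in \
    "http://mysite.example/out/partner-redir.html" \
    "http://mysite.example/about.html"
do
    if matches_reject "$url"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done
```

In other words, I would expect any link whose file name ends in redir.html to be skipped, and everything else to be checked as usual.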
I know this is not the greatest place to post this, but I appreciate any help I can get :-)
The argument is that --recursive doesn't make any sense if you are excluding some .html files. I don't entirely agree, but I see the logic.
I didn't know that wget would follow a redirect recursively into a new site (without specifying --span-hosts). I'd call that a bug.
I've never seen that problem... Looking at a bunch of wget scripts, I notice that I always use --no-parent. It doesn't make much sense, but perhaps --no-parent causes wget to skip the link because it's not below the starting point?
It's worth a try.