There are a couple of good web-based resources that contain the information I need, but when they update, there is no way of telling WHICH dates have updated.
What I need (and I could swear that I've used before!) is a piece of software that monitors the URL - say, checks it every 24 hours - and tells me which parts of the data have changed.
If I could define the sections that I want monitored, that would be handy, as I don't want to be notified when a banner rotates or when new news is posted.
Does anyone have any ideas? (Oh, and free would be best as I'm broke :))
But you should be aware that there are issues with this....
There is a whole group of webmasters who communicate here who are trying to limit crawls by "nuisance" spiders and such.
A lot of people doing things similar to what you suggest can become a kind of denial of service attack on web sites. That is why, for example, Google goes to such great lengths to stop automatic ranking checkers.
I would suggest the following:
1. Clearly identify your spider in the user agent field. Please include your email address, and, if possible, the address of a web site that explains your purpose.
2. Read, parse and obey robots.txt.
3. If you read multiple pages on a site, do so slowly, with at least 30 or 40 seconds between each request, so that you spread out your load and don't overwhelm their resources.
4. Don't, under any circumstances, try to be sneaky and fake your user agent to make your spider look like a browser. We, and others, look for these kinds of spiders and ban them from our sites. This is not just paranoia on our part; it causes real harm. For example, we charge some of our clients on a per-visitor basis. If you don't identify yourself, your requests unfairly inflate the visitor count for our clients and reduce our conversion rates.
If you do all this, then webmasters will view your bot as "friendly", and will accommodate and welcome it.
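For what it's worth, the four points above could be sketched roughly like this in Python, using only the standard library. The bot name, info page URL and email address in USER_AGENT are made-up placeholders, and the function names are my own, so treat this as a starting point under those assumptions rather than a finished spider.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

# Hypothetical identity: replace with your own bot name, info page and email (point 1).
USER_AGENT = "PageWatchBot/0.1 (+http://example.com/bot.html; you@example.com)"
DELAY_SECONDS = 30  # pause between requests to the same site (point 3)


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url (point 2)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url):
    """Build a request that identifies the bot honestly (points 1 and 4)."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_text(url):
    """Download one URL and return its body as text."""
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_politely(urls):
    """Fetch several pages from ONE site, obeying robots.txt and pausing in between."""
    try:
        robots_txt = fetch_text(urljoin(urls[0], "/robots.txt"))
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
        robots_txt = ""  # assumption: a missing robots.txt means no stated restrictions
    pages = {}
    for url in urls:
        if not allowed_by_robots(robots_txt, url):
            continue  # the site asked robots to stay away from this URL
        if pages:
            time.sleep(DELAY_SECONDS)  # slow down after the first page
        pages[url] = fetch_text(url)
    return pages
```

Calling fetch_politely(["http://example.com/data.html"]) would check robots.txt first and then fetch the page only if it is allowed. The robots.txt check is kept separate from the download so it can be tried out on a pasted robots.txt without touching anyone's server.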
The script will download one page from a site (specified in the script) and store a copy of it on my server (non-viewable by visitors to my site).
A week later, it downloads another copy and compares it to the original, highlighting any changes.
There is no spidering involved and no DoS risk. It'll just look like I've browsed there and read the page.
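In case it helps anyone wanting to do the same, here is a rough Python sketch of the store-and-compare part. It is not the actual script described above; the function names, the snapshot file and the start/end markers are all hypothetical, and how the page text gets downloaded is left out. The optional extract_section step is one simple way to watch only part of a page, so that rotating banners or news items outside the markers are ignored.

```python
import difflib
from pathlib import Path


def extract_section(text, start_marker, end_marker):
    """Return the text between two markers, or the whole text if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return text
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return text
    return text[start + len(start_marker):end]


def diff_against_snapshot(new_text, snapshot_path):
    """Compare new_text with the stored copy, then store new_text as the new copy.

    Returns a list of unified-diff lines. The list is empty if nothing changed
    or if there was no earlier copy to compare against.
    """
    path = Path(snapshot_path)
    changes = []
    if path.exists():
        old_text = path.read_text(encoding="utf-8")
        changes = list(difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(),
            fromfile="previous", tofile="current", lineterm=""))
    path.write_text(new_text, encoding="utf-8")
    return changes
```

Run once a week from a scheduler, the first call just saves the page. Later calls return lines starting with "-" for text that disappeared and "+" for text that was added, which can be mailed or printed as the list of changes.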