Is there already such a tool developed? Has anyone attempted anything and seen whether it's better to use an XML system rather than MySQL?
The reason the database is involved at all is that there will be some querying and comparisons, but I want static HTML for performance and ease of getting spidered.
thanks
Would you have to write a script which checks the last update time of each page (querying the database, etc.), compares it to the creation time of the static page, and if necessary re-runs wget on each page which needs updating?
Does that sound along the right lines?
thanks
<added> Sorry, I got my knickers in a twist - instead of re-running wget on each page, you'd have to generate the HTML page again from a script... hmm, I'm confused.. ;-)
I just realised we don't need to do that, because we generate all of our pages from one script anyway. It will be a simple matter to change the content management system to generate a static page and to upload it automatically every time something is changed. This is brilliant! It means I can have all the code and DB queries on a test server, and serve plain vanilla HTML on the live server.
(sorry about the hijacking, byronM ;-)
I've just been reading up on wget and it has so many options, it sounds like it could do all you need.
As long as you have some way (any CMS) of producing pages, wget can then rename them, rewrite the links accordingly, and mirror sites by just checking last-modified dates...
I am sure dmorison can give you a clearer idea, as I've just started looking into it, but it sounds really good.
(e.g. I'm gonna get it running periodically through my dynamic pages, looking for updates and mirroring those to the live server)
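For what it's worth, a typical mirroring command looks something like the one below - the hostname is just a placeholder, and older wget builds spell --adjust-extension as --html-extension:

# mirror the dynamic site, only re-fetching pages whose last-modified
# date has changed, rewriting links and keeping .html extensions
wget --mirror --convert-links --adjust-extension --no-parent http://dynamic.example.com/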
good luck
In a folder, I create two directories:
dynamic.website.com
static.website.com
You then build your PHP/database application, using files with a .html extension in the directory dynamic.website.com.
To serve the website, you point Apache at a soft link as the web root, rather than an actual directory. This soft link can point at either dynamic.website.com or static.website.com.
So, to serve the dynamic version of your site:
/var/www -> dynamic.website.com
Now, you can use the mirror functionality of wget to retrieve the entire website into the directory static.website.com. Once you have built the static version, you can then swap your soft link over:
/var/www -> static.website.com
Study the man page for wget to learn how to use the mirror functionality. The option you want to look out for is the one that keeps it onsite - you don't want to go off retrieving sites that you link to!
I know this isn't a step-by-step how-to, but it should point you in the right direction.
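To make it a bit more concrete, the commands would be along these lines - /srv/sites is just standing in for wherever the two directories actually live:

# point the web root at the dynamic tree to begin with
ln -s /srv/sites/dynamic.website.com /var/www

# mirror the dynamic site into the static tree
cd /srv/sites
wget --mirror --convert-links --adjust-extension --no-parent \
     --no-host-directories --directory-prefix=static.website.com \
     http://dynamic.website.com/

# once the mirror is complete, swap the link over to the static tree
ln -sfn /srv/sites/static.website.com /var/www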
How do you go about updating the content when it changes?
The last time I built such a site, I actually went another step and could access both the static and dynamic versions using different URLs (this is not hard to set up using Apache's dynamic virtual host support).
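One way to get that effect, if the directories are named after the hostnames as above, is mod_vhost_alias - something like this in httpd.conf (the /srv/sites path is again just an example):

# serve each hostname from the directory of the same name
UseCanonicalName Off
VirtualDocumentRoot /srv/sites/%0

With that in place, requests for dynamic.website.com are served out of /srv/sites/dynamic.website.com and requests for static.website.com out of /srv/sites/static.website.com.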
The dynamic site had various people doing stuff, and their actions may cause an update of one or more tables.
A cron job running every night then looked directly at the last modified time of the mysql database table files, which on a simple mysql setup are in files called:
/var/lib/mysql/[database_name]/[table_name].MYI
The cron job (a Perl hack) then knew which portions of the site had to be rebuilt if a given table had been modified. The script then called wget to retrieve that portion of the site into the static web root.
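That script was a Perl hack, but a rough shell sketch of the same idea (the table name, stamp file, paths and URL are all made up for illustration) would be:

#!/bin/sh
# run nightly from cron, e.g.  0 3 * * * /usr/local/bin/rebuild-static.sh
TABLE=/var/lib/mysql/mydb/articles.MYI
STAMP=/var/tmp/articles.stamp

# rebuild the articles section only if its table has changed since the
# last run (-nt is also true on the first run, when the stamp is missing)
if [ "$TABLE" -nt "$STAMP" ]; then
    wget --mirror --convert-links --adjust-extension --no-parent \
         --no-host-directories --directory-prefix=/srv/sites/static.website.com \
         http://dynamic.website.com/articles/
    touch "$STAMP"
fi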
Thanks for explaining - I was thinking along the right lines, but the trick with the MySQL last-update location is a nice one!
As I said above, I wanted to get wget running on a cron job to check all the last-modified times, but with our larger site wget would be running the whole time - I much prefer your suggestion of telling it which bits to update.
cheers
You could have the dynamic site above the web root, or have it visible and rely on a robots.txt file to keep bots out, or you could put a meta robots noindex tag on every page of only the dynamic site, or use .htaccess to keep them out.
Make sure that the dynamic website can't be spidered by Google otherwise those pages might start turning up in the SERPs (and the static site delisted).
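If the dynamic copy does stay web-accessible on its own hostname, the robots.txt block is just the usual blanket one, and the per-page alternative is the standard meta tag:

# robots.txt at the root of the dynamic hostname only
User-agent: *
Disallow: /

<meta name="robots" content="noindex,nofollow">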
Good point - in the scenario described above, the dynamic site was protected by HTTP authentication, so it could never have been crawled.
Note, however, that in the basic setup (using a soft link to point at either the dynamic or the static site, so that only one version is ever visible to the web) this isn't a problem.