Forum Moderators: coopster
The news page only shows headlines for the past seven days. I want to make the past news headlines available in a numbered page fashion (similar to how WebmasterWorld accomplishes this) but a few issues/thoughts come to mind.
First, some of the news articles may be taken down after a period of time, so I need a way to check the links for 404 errors (and whatever other errors I should be checking for). Second, I want to add a search feature strictly for the news.
To check if the articles still exist should I do checks periodically and remove the links or should I do a check at the time the db is queried for older headlines and remove it then?
For the search, I'm worried that it will only have limited copy to search on (I'm only storing the first two paragraphs at most), so the searcher may not get accurate results. I'm wondering if I should build a keyword parser, run it against the full article at the time I'm adding it to the db, and store the keywords it finds in the record as well. If I do this, then my search engine can search the keywords too. Granted, this is not the most reliable approach, as I can't know for sure what the user will search on, though I can probably guess 80% of the terms they'd use.
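A rough sketch of the kind of keyword parser I have in mind (the stopword list and the cut-off of 10 keywords are just placeholders):

```php
<?php
// Sketch only: pull the most frequent non-stopword terms out of an
// article body so they can be stored alongside the db record.
function extract_keywords($text, array $stopwords, $limit = 10) {
    // Lowercase and split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $freq = array();
    foreach ($words as $w) {
        if (strlen($w) < 3 || in_array($w, $stopwords)) {
            continue; // skip stopwords and very short tokens
        }
        $freq[$w] = isset($freq[$w]) ? $freq[$w] + 1 : 1;
    }
    arsort($freq);                    // most frequent first
    return array_slice(array_keys($freq), 0, $limit);
}

$stopwords = array('the', 'and', 'for', 'that', 'with', 'was');
$keywords = extract_keywords(
    'The server crashed and the server logs show that the disk was full.',
    $stopwords
);
// $keywords starts with 'server' (the only word appearing twice)
?>
```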
Another option here is to use the Google API and roll my own custom search.
One last thing to keep in mind. At some point in the near future I will generate an RSS/RDF newsheadline feed for syndication with this db. What problems do you foresee, if any, given what I've told you?
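For reference, the feed side could be as simple as something like this (build_rss() and the field names are just guesses at whatever the schema ends up being):

```php
<?php
// Hypothetical sketch: build an RSS 2.0 feed from the same headline
// rows the news page uses. $row['title'] and $row['url'] are assumed
// column names, not anything final.
function build_rss(array $items, $siteTitle, $siteUrl) {
    $xml  = "<?xml version=\"1.0\"?>\n<rss version=\"2.0\">\n<channel>\n";
    $xml .= "<title>" . htmlspecialchars($siteTitle) . "</title>\n";
    $xml .= "<link>" . htmlspecialchars($siteUrl) . "</link>\n";
    foreach ($items as $row) {
        $xml .= "<item>\n";
        $xml .= "<title>" . htmlspecialchars($row['title']) . "</title>\n";
        $xml .= "<link>" . htmlspecialchars($row['url']) . "</link>\n";
        $xml .= "</item>\n";
    }
    return $xml . "</channel>\n</rss>\n";
}

$feed = build_rss(
    array(array('title' => 'Example headline', 'url' => 'http://example.com/news/1')),
    'My News', 'http://example.com/'
);
?>
```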
It's all PHP.
Your thoughts?
Guessing here, but do all the news articles stem from the one news script? i.e. an article that doesn't exist won't be a 404, just a script without an article to return.
I guess it's a matter of choice for you. Where do you want people to go when they visit a dud news page? You can always redirect to the news home page or something similar.
>> To check if the articles still exist should I do checks periodically and remove the links or should I do a check at the time the db is queried for older headlines and remove it then?
I'd do everything when you actually delete the old article. When the article doesn't exist anymore, what other parts of your news site will be affected?
Links that are embedded in other articles might be hard to weed out, or time-consuming to find; otherwise you'd just be updating the DB wherever the old article is referenced. When the data is normalized in a DB there shouldn't be a problem updating it.
It might take a little more space, but you could always add another table with an articleid->articleid relationship showing links from one article to another, and where each link is, to sort out the 404s. It could get very complex, but you could make sure you have 0 dead links from the beginning ;)
Knowing the DB structure always helps, though; if you can pinpoint where a bad link could spring up in the db structure, it'd be easier to deal with.
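A toy version of that link table, with PHP arrays standing in for the DB (the helper names are invented for illustration):

```php
<?php
// Sketch of the articleid->articleid idea: track which articles embed
// links to which, so deleting one article tells you exactly which
// others need their links cleaned up. In the real site this would be
// a DB table; arrays stand in here.
$links = array();                       // from_id => array of to_ids

function add_link(array &$links, $from, $to) {
    $links[$from][] = $to;
}

function articles_linking_to(array $links, $target) {
    $result = array();
    foreach ($links as $from => $targets) {
        if (in_array($target, $targets)) {
            $result[] = $from;          // this article embeds a link to $target
        }
    }
    return $result;
}

add_link($links, 1, 3);   // article 1 links to article 3
add_link($links, 2, 3);   // article 2 links to article 3
add_link($links, 2, 5);
// Deleting article 3 means articles 1 and 2 need their links updated:
$affected = articles_linking_to($links, 3);  // array(1, 2)
?>
```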
>>search
Stripping out stopwords, you can store articles in a couple of kilobytes; if the news articles are categorized, I'm sure that would help lots more.
You could put all the unique words into a dictionary that points to the different articles. If you're just counting word frequency, you could slip the frequencies in there too, perhaps in order.
FDSE (a Perl search script) only uses word frequency to weight a doc; I guess you could get fancy and add other factors in :)
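A bare-bones sketch of that dictionary idea in PHP (function names made up, word frequency used as the only weight, FDSE-style):

```php
<?php
// Inverted index sketch: each unique word maps to the articles it
// appears in, with a per-article frequency count for weighting.
function index_article(array &$index, $articleId, $text) {
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $w) {
        if (!isset($index[$w][$articleId])) {
            $index[$w][$articleId] = 0;
        }
        $index[$w][$articleId]++;       // word frequency per article
    }
}

function search_index(array $index, $term) {
    $term = strtolower($term);
    if (!isset($index[$term])) {
        return array();
    }
    $hits = $index[$term];
    arsort($hits);                      // highest frequency first
    return array_keys($hits);           // article ids, best match first
}

$index = array();
index_article($index, 1, 'Apache released a security patch');
index_article($index, 2, 'Patch day: patch your Apache servers');
$results = search_index($index, 'patch');
// $results is array(2, 1): article 2 mentions "patch" twice
?>
```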
Almost diving into your project here; interesting problems to sort out. The 404 URLs one is definitely doable, and I'd be inclined to use a cut-and-paste search script if time is short!
>> Guessing here, but do all the news articles stem from the one news script? i.e. an article that doesn't exist won't be a 404, just a script without an article to return.
It's a single script that coughs up the headlines. It could check for the 404 at the time of delivery to the client, but I'm leaning towards building a script that I can execute manually to check all of the news stories. If I'm doing it manually, I'm guessing the only impacts I have to be careful of are server load and bandwidth.
>> Where do you want people to go when they visit a dud news page?
I don't want them to even see the link if the link to that page is dead.
Re: search/FDSE: yeah, I'm familiar with that SE and have it installed on the old website. I've toyed with it, though I'm not familiar enough with Perl to modify it. I may have to write my own in PHP.
<?php
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$page = curl_exec($ch);
curl_close($ch);
// Pull the status code out of the response line, e.g. "HTTP/1.1 200 OK"
preg_match('/^HTTP\/\d\.\d (\d{3})/', $page, $matches);
$statuscode = $matches[1];
if ($statuscode == 200)
{
// Page is fine
}
else
{
// Do something else (dead link, moved, timed out, ...)
}
?>
//added
If you add another option, CURLOPT_NOBODY, you can do just a HEAD request to save a little time/energy when doing the 404 checking.
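Something along these lines, assuming the cURL extension is available (check_link() is just an invented wrapper name):

```php
<?php
// HEAD-only variant of the script above: CURLOPT_NOBODY suppresses the
// body, and curl_getinfo() gives the status code without any regex work.
function check_link($url, $timeout = 15) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // HEAD request, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status;                              // 0 if the request never completed
}

// if (check_link($url) != 200) { /* flag or drop the headline */ }
?>
```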
When you're grabbing news, it would be ideal to know where their templates start and stop, so you can grab just bytes X to Y of a page instead of the whole thing ;)
Can't see the problem changing all that much, apart from the fact that you need cURL to check whether the link is still there.