Forum Moderators: coopster


News Headlines db

opinions needed on implementation/structure

         

lorax

8:57 pm on May 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've built a news headlines db. 1-3 times a day I search the web for on-topic headlines and add them to the db. I have the URI, title, author, organization, date published, date added, and a field for the first one or two paragraphs. On average I'm adding about 8-16 headlines a day.

The news page only shows headlines for the past seven days. I want to make past news headlines available as numbered pages (similar to how WebmasterWorld does it), but a few issues/thoughts come to mind.

First, some of the news articles may be taken down after a period of time, so I need a way to check the links for 404 errors (and whatever other errors I should be checking for). Second, I want to add a search feature strictly for the news.

To check whether the articles still exist, should I do checks periodically and remove the links, or should I do a check at the time the db is queried for older headlines and remove them then?

For the search, I'm worried that it will only have limited copy to search on (I'm only using the first two paragraphs at most), and thus the searcher may not get accurate results. So I'm wondering if I should build a keyword parser, run it against the full article at the time I'm adding it to the db, and have the keywords it finds added to the record as well. If I do this, then my search engine can search the keywords too. Granted, this is not the most reliable approach, as I don't know for sure what the user will search on, though I can probably guess 80% of the terms they'd use.
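Something along these lines is what I have in mind for the parser (rough, untested sketch; the stopword list is just a short placeholder):

<?php
// Rough sketch of the keyword parser: strip stopwords from the full article
// text and keep the most frequent remaining words to store with the record.
// The stopword list here is just a short placeholder.
function extract_keywords($text, $max = 20)
{
    $stopwords = array('the', 'and', 'for', 'that', 'with', 'this', 'from',
                       'have', 'are', 'was', 'were', 'will', 'been', 'has');

    $text  = strtolower(strip_tags($text));
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $freq = array();
    foreach ($words as $word)
    {
        if (strlen($word) < 3 || in_array($word, $stopwords))
        {
            continue;
        }
        $freq[$word] = isset($freq[$word]) ? $freq[$word] + 1 : 1;
    }

    arsort($freq);                                   // most frequent first
    return array_slice(array_keys($freq), 0, $max);  // top $max words
}
?>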

Another option here is to use the Google API and roll my own custom search.

One last thing to keep in mind: at some point in the near future I will generate an RSS/RDF news headline feed for syndication from this db. What problems do you foresee, if any, given what I've told you?
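For the feed I'm picturing something roughly like this (untested sketch; the field names are just placeholders for whatever the db returns):

<?php
// Rough sketch of building an RSS 2.0 feed from the headlines table.
// $headlines is an array of rows already pulled from the db; the field
// names (title, uri, date_published, summary) are placeholders.
function build_rss($headlines)
{
    $rss  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $rss .= "<rss version=\"2.0\">\n<channel>\n";
    $rss .= "<title>News Headlines</title>\n";
    $rss .= "<link>http://www.example.com/news/</link>\n";
    $rss .= "<description>On-topic news headlines</description>\n";

    foreach ($headlines as $item)
    {
        $rss .= "<item>\n";
        $rss .= '<title>' . htmlspecialchars($item['title']) . "</title>\n";
        $rss .= '<link>' . htmlspecialchars($item['uri']) . "</link>\n";
        $rss .= '<pubDate>' . date('r', $item['date_published']) . "</pubDate>\n";
        $rss .= '<description>' . htmlspecialchars($item['summary']) . "</description>\n";
        $rss .= "</item>\n";
    }

    $rss .= "</channel>\n</rss>\n";
    return $rss;
}
?>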

It's all PHP.

Your thoughts?

brotherhood of LAN

8:16 pm on May 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>I've built a news headlines db
>>some of the news articles may be taken down after a period of time. So I need a way to check the link for 404 errors.

Guessing here, but do all the news articles stem from the one news script? i.e. an article that doesn't exist won't be a 404, just a script with no article to return.

I guess it's a matter of choice for you. Where do you want people to go when they visit a dud news page? You can always redirect to the news home page or something similar.

>>To check whether the articles still exist, should I do checks periodically and remove the links, or should I do a check at the time the db is queried for older headlines and remove them then?

I'd do everything when you actually delete the old article. When the article doesn't exist anymore, what other parts of your news site will be affected?

Links that are embedded in other articles might be hard to weed out, or time consuming to find; otherwise you'd just be updating the DB wherever the old article is referenced. When the data is normalized in the DB, there shouldn't be a problem updating it.

It might take a little more space, but you could always add another table with an articleid->articleid relationship showing links from one article to another, and where the link sits, to sort out the 404s. It could get very complex, but you could make sure you have 0 dead links from the beginning ;)
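Something like this, roughly (table/column names made up):

<?php
// Hypothetical link table: one row per link from one stored article to
// another, so a dead article can be traced back to the pages that link to it.
$sql = "CREATE TABLE article_links (
            from_article INT NOT NULL,   -- article containing the link
            to_article   INT NOT NULL,   -- article being linked to
            PRIMARY KEY (from_article, to_article)
        )";
// run $sql through whatever db layer you're using
?>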

Knowing the DB structure always helps, though; if you can pinpoint where a bad link could spring up in the db structure, it'd make it easier to deal with.

>>search
Stripping out stopwords, you can store articles in a couple of kilobytes; if the news articles are categorized, I'm sure that would help lots more.

You could put all the unique words into a dictionary that points to the different articles. If you're just counting word frequency, you could slip the frequencies in there too, perhaps in order.
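Rough idea of the dictionary, untested (in practice it would live in a table, but in-memory shows the shape of it):

<?php
// Sketch of the dictionary idea: an inverted index mapping each unique word
// to the articles it appears in, with a frequency count per article.
function index_article(&$index, $articleId, $text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower(strip_tags($text)),
                        -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
    {
        if (!isset($index[$word][$articleId]))
        {
            $index[$word][$articleId] = 0;
        }
        $index[$word][$articleId]++;
    }
}

function search_index($index, $term)
{
    if (!isset($index[$term]))
    {
        return array();
    }
    $hits = $index[$term];
    arsort($hits);       // highest frequency first
    return $hits;        // articleId => frequency
}
?>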

FDSE (a Perl search script) only uses word frequency to weight a doc; I guess you could get fancy and add other factors in :)

Almost diving into your project here, interesting problems to sort... the 404 URLs check is definitely possible, and I'd be inclined to use a cut-and-paste search script if time is short!

lorax

8:29 pm on May 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Aloha BOL,
I should have been clearer as I think you may have misunderstood me - sorry. The articles are not on my website. They exist on other websites.

>>Guessing here, but do all the news articles stem from the one news script? i.e. an article that doesn't exist won't be a 404, just a script with no article to return.

It's a single script that coughs up the headlines. It could check for the 404 at the time of delivery to the client, but I'm leaning towards building a script that I can execute manually to check all of the news stories. If I'm doing it manually, I'm guessing the only impacts I have to be careful of are server load and bandwidth.

>> Where do you want people to go when they visit a dud news page?

I don't want them to even see the link if the link to that page is dead.
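I'm thinking the checker would just flag dead URIs and the headline query would filter them out, something like (untested; table/column names are placeholders):

<?php
// Headline query that skips anything the checker has flagged as dead.
$sql = "SELECT uri, title, date_published
        FROM   headlines
        WHERE  dead = 0
        AND    date_published >= DATE_SUB(NOW(), INTERVAL 7 DAY)
        ORDER  BY date_published DESC";
// the checker script would do: UPDATE headlines SET dead = 1 WHERE id = ...
?>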

Re: search/FDSE - yeah, I'm familiar with that SE and have it installed on the old website. I've toyed with it, though I'm not familiar enough with Perl to modify it. May have to write my own in PHP.

brotherhood of LAN

8:43 pm on May 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah, hear you loud and clear now. This might be a time saver, if the code is OK with j_k ;)


<?php
// $url holds the headline URI to check
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 1);          // include the headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$page = curl_exec($ch);
curl_close($ch);

// pull the status code out of the first response line
preg_match('/^HTTP\/\d\.\d (\d{3})/', $page, $matches);
$statuscode = $matches[1];

if ($statuscode == 200)
{
// Page is fine
}
else
{
// Do something else
}
?>

//added
If you add another option, CURLOPT_NOBODY, you can just do a HEAD request to save a little time/energy when doing the 404 checking.
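Something like this for a manual batch run with the HEAD request (untested sketch; $urls would come from your db, and the flag/remove step is up to you):

<?php
// HEAD-only version of the check above, looped for a manual batch run.
// $urls is assumed to be an array of headline id => URI pulled from the db.
function check_status($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // HEAD request, skip the body
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $headers = curl_exec($ch);
    curl_close($ch);

    if (preg_match('/^HTTP\/\d\.\d (\d{3})/', $headers, $m))
    {
        return (int) $m[1];
    }
    return 0;                                    // no response / timed out
}

foreach ($urls as $id => $url)
{
    if (check_status($url) != 200)
    {
        // flag or remove headline $id here
    }
    sleep(1);                                    // go easy on the remote servers
}
?>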

lorax

1:23 am on May 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gosh - almost didn't find the thread. Thanks. I take it that code is using the cURL library - yes? I've been wondering about that route.

brotherhood of LAN

12:01 pm on May 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yep, cURL is quite good for dealing with domains outside your site. You can get the headers of a document and decide what to do with it (i.e. if it has a bad status code) before you read the body of the page.

When you are grabbing news, it could be ideal to know where each site's template starts and stops so you can just grab bytes X-Y of a page instead of the whole thing ;)
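e.g. something like this with CURLOPT_RANGE (untested; the byte offsets are made up, and it only works if the remote server honours Range requests):

<?php
// Grab only part of a page instead of the whole thing
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RANGE, '2000-12000');   // bytes 2000 to 12000 only
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$fragment = curl_exec($ch);
curl_close($ch);
?>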

Can't see the problem changing all that much, apart from the fact you need cURL to check if the link is still there.