Data mining script to mine data from Yahoo Finance

I want to write a data mining script in PHP that fills a multidimensional array with each stock's name, symbol and dividend yield. The script should then sort the stocks by dividend yield in descending order.
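As a rough sketch of the array and the sort described above: each stock can be an associative array, and `usort` with a comparison callback orders them by yield, highest first. The names, symbols and yields below are made up for illustration, and the key names (`name`, `symbol`, `yield`) are just one possible choice.

```php
<?php
// Hypothetical sample data; none of these values come from Yahoo Finance.
$stocks = array(
    array('name' => 'Example Corp', 'symbol' => 'EXMP', 'yield' => 2.5),
    array('name' => 'Sample Inc',   'symbol' => 'SMPL', 'yield' => 4.1),
    array('name' => 'Demo Ltd',     'symbol' => 'DEMO', 'yield' => 0.0),
);

// Sort descending by dividend yield.
// The callback must return an integer, so compare the floats instead of subtracting them.
usort($stocks, function ($a, $b) {
    if ($a['yield'] == $b['yield']) {
        return 0;
    }
    return ($a['yield'] < $b['yield']) ? 1 : -1;
});

// Print one line per stock, highest yield first.
foreach ($stocks as $stock) {
    printf("%-6s %-14s %5.2f%%\n", $stock['symbol'], $stock['name'], $stock['yield']);
}
```

Keeping the yield as a float rather than a string like "4.1%" means the comparison is numeric and needs no parsing inside the callback.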

I know this shouldn't be that difficult. I've mined data with a program before, but unfortunately I can't find that script, so I have to start from scratch. It has also been some years since I wrote it, which makes this even harder.

I want to mine data from Yahoo Finance. For consistency's sake I will use the NASDAQ as the example: [finance.yahoo.com...] ...

Steps:
Step 1: Fill a multidimensional array with the stock symbol, name and URL (of the company's page) of each company. There are currently 50 pages, so a while loop must be used to open and mine each page independently.

Step 2: Use a for loop to open each company's page, extract the dividend yield and save it in the array (or should I use a new array that stores the company name and symbol again, plus the dividend yield?).

Step 3: Use a for loop to print the data for all stocks, with the companies with the highest dividend yield at the top and those with the lowest at the bottom.
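For step 2, one option is to add the yield to the same array instead of building a second one, so name, symbol and yield stay together. The sketch below assumes that; `extractDividendYield` is a hypothetical helper, and the HTML snippets are invented stand-ins for fetched company pages, since the real Yahoo Finance markup is not shown here and the pattern would need adjusting to it.

```php
<?php
// Hypothetical helper: return the first percentage that follows the word
// "Yield" in the page text, or null if there is none (for example "N/A").
function extractDividendYield($html)
{
    // Drop the tags and collapse whitespace so the pattern sees plain text.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    if (preg_match('/Yield[^%]*?(\d+(?:\.\d+)?)\s*%/i', $text, $m)) {
        return (float) $m[1];
    }
    return null;
}

// Assumed result of step 1. The 'html' key stands in for the page that would
// be fetched from each company's URL; the markup is made up.
$companies = array(
    array('symbol' => 'EXMP', 'name' => 'Example Corp',
          'html' => '<td>Dividend Yield</td><td>2.10%</td>'),
    array('symbol' => 'DEMO', 'name' => 'Demo Ltd',
          'html' => '<td>Dividend Yield</td><td>N/A</td>'),
);

// Step 2: store the yield next to the symbol and name in the same array.
foreach ($companies as $i => $company) {
    $companies[$i]['yield'] = extractDividendYield($company['html']);
}

print_r($companies);
```

Returning null for a missing yield keeps "no dividend data" distinct from a real yield of 0, which matters once the list is sorted.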

I don't quite know where to start. I have read some articles about data mining with PHP and will try to build this program through trial and error. I have a week of vacation, so I should have finished it by the end of that (I don't intend to spend all my waking hours on it, though). I'll post updates on my findings regularly. All tips are welcome!

<?php
$url = "http://finance.yahoo.com/q/cp?s=^IXIC&c=";

# create all URLs
$urls = array();
for ($page = 0; $page < 2; $page++) {
    $urls[$page] = $url . $page;
}

# open all URLs with a curl_multi request (goes a lot faster than individual curl requests)
$pages = getPages($urls);

# walk through the array of returned HTML pages
$arrMinedData = array();
foreach ($pages as $html) {
    # do your preg_match_all stuff here....
    # be warned, the next preg match stuff is very ugly... should really be switched with your version.....
    preg_match_all('|<td class="yfnc_tabledata1"(.*)>(.*)</td>|Uim', $html, $matches);

    # with the above preg_match, we need the data from $matches[2]
    # with array_chunk we can bind together the <TD>'s that belong to the same <TR> (five cells per row)
    $matches = array_chunk($matches[2], 5);
    foreach ($matches as $i => $arr) {
        # walk through all 'rows'
        foreach ($arr as $num => $data) {
            # strip out unwanted HTML
            $matches[$i][$num] = trim(strip_tags($data));
        }
        # and add the info to the mining array....
        $arrMinedData[] = $matches[$i];
    }
}

# here you go.... all data in 1 nice array....
echo "<pre>";
print_r($arrMinedData);
echo "</pre>";

# fetch all URLs in parallel and return the HTML of each page, keyed like $arrUrls
function getPages($arrUrls)
{
    $mh = curl_multi_init();
    $threads = null;
    $c = array();
    foreach ($arrUrls as $page => $url) {
        $c[$page] = curl_init($url);
        curl_setopt($c[$page], CURLOPT_TIMEOUT, 600);
        curl_setopt($c[$page], CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $c[$page]);
    }

    $t1 = time();
    do {
        $n = curl_multi_exec($mh, $threads);
        if ($threads > 0) {
            # wait for activity (at most 1 second) instead of spinning the CPU
            curl_multi_select($mh, 1.0);
        }
        if (time() > $t1 + 2) {
            echo "keep-alive" . "<br/>";
            $t1 = time();
        }
    } while ($threads > 0);

    $arrData = array();
    foreach ($arrUrls as $page => $url) {
        curl_multi_remove_handle($mh, $c[$page]);
        $html = curl_multi_getcontent($c[$page]);
        $arrData[$page] = $html;
        curl_close($c[$page]);
    }
    curl_multi_close($mh);

    return $arrData;
}
?>
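The rows collected in `$arrMinedData` are plain numeric arrays of five cells each. A possible next step, sketched below, is to turn them into named records before looking up dividend yields. This assumes the first two cells of each row are the symbol and the name, which should be checked against the real table; the sample rows are made up.

```php
<?php
// Sample rows in the shape produced by array_chunk(..., 5); values are invented.
$arrMinedData = array(
    array('EXMP', 'Example Corp', '10.00', '0.10', '1,000'),
    array('DEMO', 'Demo Ltd',     '20.00', '0.20', '2,000'),
    array(''),  // an incomplete row, as a stray regex match might produce
);

// Build one record per company, keyed by symbol so later lookups are easy.
// Assumption: cell 0 is the symbol and cell 1 is the company name.
$companies = array();
foreach ($arrMinedData as $row) {
    if (count($row) < 2 || $row[0] === '') {
        continue; // skip rows that do not have both a symbol and a name
    }
    $companies[$row[0]] = array(
        'symbol' => $row[0],
        'name'   => $row[1],
    );
}

print_r($companies);
```

Keying by symbol also removes duplicates if the same company shows up on more than one page.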


stockmaster

stockmaster

lostdreamer
