Data mining script to mine data from Yahoo Finance

Data mining script, which extracts and compares Dividend Yields of stocks

         

stockmaster

8:37 pm on Jul 22, 2011 (gmt 0)

10+ Year Member



I want to write a data mining script in PHP that fills a multidimensional array with each stock's name, its ticker symbol and its dividend yield. The program should then sort the stocks in descending order of dividend yield.

I know this shouldn't be that difficult. I've mined data with a program before, but unfortunately I cannot find the script, so I have to start from scratch. It has also been some years since I wrote it, which makes it even harder.

I want to mine data from Yahoo Finance. For consistency's sake I will use the NASDAQ as an example: [finance.yahoo.com...] ...

Steps:
Step 1: Fill a multidimensional array with the stock market symbol, name and URL (of the company's page) of each company. There are currently 50 pages, so a while loop must be used to open each page and mine it independently.

Step 2: Use a for loop to open each company's page, extract the dividend yield and save the data in the array (or should I use a new array to save the company name and symbol again, plus the dividend yield?).

Step 3: Use a for loop to print all the data of the different stocks, with the companies with the highest dividend yield at the top and those with the lowest dividend yield at the bottom.
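
The sorting in step 3 could look roughly like the sketch below. It assumes each stock is stored as an associative array with the keys 'symbol', 'name' and 'yield' (a float, in percent). Those key names, the sample rows and the function name compareByYieldDesc are made up for illustration and do not come from Yahoo Finance.

```php
<?php
// Hypothetical sample rows: symbol, name and dividend yield in percent.
$stocks = array(
    array('symbol' => 'AAA', 'name' => 'Example A', 'yield' => 1.5),
    array('symbol' => 'BBB', 'name' => 'Example B', 'yield' => 4.2),
    array('symbol' => 'CCC', 'name' => 'Example C', 'yield' => 0.0),
);

// Comparison callback for usort: a negative return value puts $a first,
// so the row with the higher yield must return -1.
function compareByYieldDesc($a, $b) {
    if ($a['yield'] == $b['yield']) {
        return 0;
    }
    return ($a['yield'] > $b['yield']) ? -1 : 1;
}

usort($stocks, 'compareByYieldDesc');

// Print the sorted rows, highest yield at the top.
foreach ($stocks as $stock) {
    printf("%-5s %-12s %.2f%%\n", $stock['symbol'], $stock['name'], $stock['yield']);
}
```

usort reorders the outer array in place and leaves each inner row intact, so there is no need for a second array just for sorting.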

I don't quite know where to start. I have read some articles about data mining with PHP and will try to build this program through trial and error. I have a week of vacation, so I should have finished it by the end of that week (I don't intend to spend all my waking hours on it, though). I'll post my findings regularly. All tips are welcome!

stockmaster

7:42 pm on Jul 24, 2011 (gmt 0)

10+ Year Member



I've run into problems. The stocks are listed in a table spread over 50+ different pages. I can get all the stocks of the first page into an array with preg_match_all. However, I do not know how to reuse that array for the other 49+ pages without overwriting the data of the first page when I use the same array for pages two, three, four, etc.

Please help me. I'm a PHP novice and I really need this program. What are good tutorials for building such a program, and how should I structure the script with functions, if-else statements, for/while loops and built-in PHP functions? (Do I need preg_match_all for this task, or another function?)
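
One common way around the overwriting problem described above is to keep two arrays: a per-page array that is allowed to be overwritten on every pass, and one result array that is created before the loop and only ever appended to. The sketch below assumes a hypothetical helper minePage() that stands in for the download and preg_match_all step; here it only returns placeholder rows.

```php
<?php
// Hypothetical stand-in for "download one page and run preg_match_all on it".
// It returns two placeholder rows per page so the loop can be tried offline.
function minePage($pageNumber) {
    return array(
        array('symbol' => "P{$pageNumber}A", 'name' => "Page $pageNumber stock A"),
        array('symbol' => "P{$pageNumber}B", 'name' => "Page $pageNumber stock B"),
    );
}

$allStocks = array();               // created once, before the loop
for ($page = 0; $page < 3; $page++) {
    $rows = minePage($page);        // overwritten on every pass, which is fine
    foreach ($rows as $row) {
        $allStocks[] = $row;        // [] appends, so earlier rows are kept
    }
}

echo count($allStocks), " rows collected\n";
```

The inner foreach could also be written as $allStocks = array_merge($allStocks, $rows); either way the rows from page one are still there after page two has been processed.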

lostdreamer

10:40 am on Jul 27, 2011 (gmt 0)

10+ Year Member



Normally I don't really do this, but I was a bit bored, so I made you the complete script... for informational purposes, of course :)

I suggest replacing the preg_match_all line with something stricter. I used a loose pattern because it was the easiest way to test, and you already have a working preg_match_all of your own.
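
One possible stricter approach, as a rough sketch only: parse the page with PHP's DOM extension and select the cells with an XPath query instead of a regular expression. The class name yfnc_tabledata1 is the one used in the script; the HTML fragment is an invented stand-in for a downloaded page, and the real page layout may differ.

```php
<?php
// Hypothetical HTML fragment standing in for one downloaded page.
$html = '<table>'
      . '<tr><td class="yfnc_tabledata1"><b><a href="/q?s=AAA">AAA</a></b></td>'
      . '<td class="yfnc_tabledata1">Example A Corp</td></tr>'
      . '<tr><td class="yfnc_tabledata1"><b><a href="/q?s=BBB">BBB</a></b></td>'
      . '<td class="yfnc_tabledata1">Example B Inc</td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = array();
// Select every table row that contains at least one cell with the class.
foreach ($xpath->query('//tr[td[@class="yfnc_tabledata1"]]') as $tr) {
    $cells = array();
    // Query relative to the current row, so cells stay grouped per row.
    foreach ($xpath->query('td[@class="yfnc_tabledata1"]', $tr) as $td) {
        $cells[] = trim($td->textContent);   // text only, nested tags dropped
    }
    $rows[] = $cells;
}

print_r($rows);
```

Because the cells are collected row by row, this version does not depend on every row having exactly five cells.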

I also used "curl_multi_exec" ( [php.net...] ).
This runs all requests simultaneously, so you only have to wait for the slowest page load, not the sum of all page loads :)

Let me know if you need help implementing it.


Regards,
LostDreamer

ps. on line 6 it says:
for($page=0;$page<2;$page++)

This was for testing, change this to
for($page=0;$page<53;$page++)
to go over all pages.

pps. In case the 'code' tag here strips the indenting, here is a pastebin link to the script with indenting.
[pastebin.com...]


<?php
$url = "http://finance.yahoo.com/q/cp?s=^IXIC&c=";

# create all URLs
$urls = array();
for($page=0;$page<2;$page++)
    $urls[$page] = $url . $page;

# open all URLs with a curl_multi request (goes a lot faster than individual curl requests)
$pages = getPages($urls);

# walk through the array of returned HTML pages
$arrMinedData = array();
foreach($pages as $html) {
    # do your preg_match_all stuff here....
    # be warned, the next preg match stuff is very ugly... should really be switched with your version.....
    preg_match_all('|<td class="yfnc_tabledata1"(.*)>(.*)</td>|Uim', $html, $matches);
    # with the above preg_match, we need the data from $matches[2]

    # with array_chunk we can bind together the <TD>'s that belong to the same <TR> (five cells per row)
    $matches = array_chunk($matches[2], 5);
    foreach($matches as $i => $arr) {
        # walk through all 'rows'
        foreach($arr as $num => $data) {
            # strip out unwanted HTML
            $matches[$i][$num] = trim(strip_tags($data));
        }
        # and add the info to the mining array....
        $arrMinedData[] = $matches[$i];
    }
}
# here you go.... all data in 1 nice array....
echo "<pre>";
print_r($arrMinedData);
echo "</pre>";

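function getPages($arrUrls) {
    $mh = curl_multi_init();
    $c = array();      # one curl handle per URL, keyed like $arrUrls
    $threads = null;   # number of transfers still running
    foreach ($arrUrls as $page => $url) {
        $c[$page] = curl_init($url);
        curl_setopt($c[$page], CURLOPT_TIMEOUT, 600);
        curl_setopt($c[$page], CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $c[$page]);
    }
    $t1 = time();
    do {
        curl_multi_exec($mh, $threads);
        # wait up to 1 second for network activity instead of spinning the CPU
        if ($threads > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
        if (time() > $t1 + 2) {
            # print something every 2 seconds while downloads are still running
            echo "keep-alive" . "<br/>";
            $t1 = time();
        }
    } while ($threads > 0);
    $arrData = array();
    foreach ($arrUrls as $page => $url) {
        # read the body before detaching the handle
        $arrData[$page] = curl_multi_getcontent($c[$page]);
        curl_multi_remove_handle($mh, $c[$page]);
        curl_close($c[$page]);
    }
    curl_multi_close($mh);
    return $arrData;
}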
?>