html Simpledom - finding the correct $var[number]

         

I Will Make It

9:26 pm on Feb 28, 2014 (gmt 0)

10+ Year Member



When searching through source code using the simpledom library, I understand that you can use either

$var = $html->find('#divid')
or
$var = $html->find('div[class=divclass]')
or just
$var = $html->find('div')

If the site you are scraping doesn't have any elements with a class or ID, the only option is to use e.g. 'div'.

And since $html->find() returns an array, you can get the content you want with $var[number].

How can I easily know which number to place in $var[12], except by counting manually through all the divs on the site?

Please correct me if I have misunderstood something. I'm very new to programming.

lucy24

10:11 pm on Feb 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the site you are scraping

This could, perhaps, have been worded more felicitously. I prefer "unauthorized mirror".

I Will Make It

12:43 am on Mar 1, 2014 (gmt 0)

10+ Year Member



Thank you, that was helpful.

brotherhood of LAN

12:47 am on Mar 1, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scraping is brittle by nature. If it so happens that the data you want is in the 12th <div>, then access that item. If you're not comfortable with that, try using an XPath query or fingerprint text that you 'expect' within the <div> to qualify your choice.

There is always a frame of reference to access a particular element, but be sure to provide some error reporting in case the HTML structure changes.

Some sites adopt unique classes and IDs per page load, making it much harder to pinpoint the element you want.
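
To illustrate the XPath and fingerprint idea together with some error reporting, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath (not simpledom). The sample markup and the "Price:" fingerprint are made up for the example; the real page will look different.

```php
<?php
// Hypothetical sample markup; the supplier's real page will differ.
$html = '<div>Header</div>
<div>Menu</div>
<div>Price: 199 NOK</div>
<div>Footer</div>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select every <div> whose text contains the fingerprint (an assumed label).
$fingerprint = 'Price:';
$nodes = $xpath->query('//div[contains(., "' . $fingerprint . '")]');

if ($nodes === false || $nodes->length === 0) {
    // Report loudly when the expected text is gone, e.g. after a redesign.
    error_log("No <div> containing '$fingerprint' found; has the page layout changed?");
    $price = null;
} else {
    // Take the first match and strip surrounding whitespace.
    $price = trim($nodes->item(0)->textContent);
}

echo $price, "\n";
```

Note that contains(., ...) also matches any outer <div> that wraps the one you want, so on a nested page the first hit may be a wrapper. Tightening the expression is usually needed there.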

I Will Make It

12:59 am on Mar 1, 2014 (gmt 0)

10+ Year Member



Thank you!

I guess I didn't explain my problem very well. I'm not a native English speaker.

What I'm trying to do:
I have a webshop, and I have a supplier who doesn't provide an order-csv. So I have two options:
1. Add several hundred products to my webshop manually
2. Build a scraper to get information about the products to put in a csv for importing to my webshop.

My problem:
Let's say there are 200 divs on the product page I'm scraping.

And let's say it's div number 189 I need to get info from.

Is there an easy way to get that number, other than opening the source code in Firefox, pressing Ctrl+F, typing "div" in the search field, and then pressing "next" 189 times while counting the divs I pass until I reach the one I'm after?

I'm thinking about writing something that outputs the number of the div on the left and the content of the div on the right.

0 -> div content
1 -> div content
2 -> div content
...
...
...
189 -> div content (what I'm looking for)

This way I could get a better view of which div number I should use.

brotherhood of LAN

1:11 am on Mar 1, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, here's an example using PHP's DOM Document [ca3.php.net] (I haven't used simpledom, but there's probably very similar syntax/functions)


<?php

// Sample markup: four <div>s, one of which holds text we know is there
$htmlstring = '<div>test</div>
<div>testing</div>
<div>sometextinsideiknowisthere</div>
<div>another false</div>';

$dom = new DOMDocument;
@$dom->loadHTML($htmlstring); // @ hides warnings about imperfect HTML
$divs = $dom->getElementsByTagName('div');
$content = '';

foreach ($divs as $increment => $div) {
    // Look for the fingerprint text inside this <div>
    if (preg_match("'sometextinsideiknowisthere'ims", $div->nodeValue)) {
        echo "hit at increment $increment\n"; // 2 in this case, counting starts from 0
        $content = $dom->saveHTML($div);
        break; // stop at the first hit
    }
}

echo $content; // has "<div>sometextinsideiknowisthere</div>"

?>

I Will Make It

1:22 am on Mar 1, 2014 (gmt 0)

10+ Year Member



Thank you.

However, since there are several hundred products and I'm looking for the price, description, title and so on for each of them, the div contents may vary a lot.

Therefore I think I still have to count the divs, and use the array number for the divs I'm scraping.

brotherhood of LAN

1:27 am on Mar 1, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The main thing to consider here is that you can use a loop to get all the <div> tags. Regardless of how many you want to match, they're all there with their child elements.

Decide on your 'hook' for selecting the <div>s you're interested in.

Grab whichever info you need. Save it somewhere. Repeat this step however many times.

RE: specifically the increment number, you can see how to access that in my code example.
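
The steps above (loop, hook, grab, save) can be sketched roughly like this with PHP's DOMDocument. The markup, the "Title:" and "Price:" labels, and the field names are all assumptions made for the example, and php://output only stands in for a real CSV file.

```php
<?php
// Hypothetical product page fragment with class-less <div>s.
$html = '<div>Site menu</div>
<div>Title: Blue Widget</div>
<div>Price: 199</div>
<div>Footer</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Hooks: a label we expect at the start of each interesting <div> (assumed).
$hooks = ['title' => 'Title:', 'price' => 'Price:'];
$row = ['title' => '', 'price' => ''];

foreach ($dom->getElementsByTagName('div') as $div) {
    $text = trim($div->textContent);
    foreach ($hooks as $field => $label) {
        if (strpos($text, $label) === 0) {
            // Keep only what follows the label.
            $row[$field] = trim(substr($text, strlen($label)));
        }
    }
}

// Save the row as one CSV line; swap php://output for a file path to keep it.
$out = fopen('php://output', 'w');
fputcsv($out, array_values($row), ',', '"', '\\');
fclose($out);
```

For several hundred products you would wrap this in an outer loop over the product pages and write one row per page to the same file handle.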

I Will Make It

9:38 pm on Mar 1, 2014 (gmt 0)

10+ Year Member



Yes! Finally - I got it.
However, the array is too large to print/echo.
Like you said, there seem to be many child elements.
How can I loop through just the content of the divs, and nothing else?

I get an INSANELY long string; this is just 0.2% of it:

Array ( [0] => simple_html_dom_node Object ( [nodetype] => 1 [tag] => div [attr] => Array ( [id] => body_wrapper ) [children] => Array ( [0] => simple_html_dom_node Object ( [nodetype] => 1 [tag] => div [attr] => Array ( [class] => mainWrapper ) [children] => Array ( [0] => simple_html_dom_node Object ( [nodetype] => 2 [tag] => comment [attr] => Array ( ) [children] (...)

I would like to see something like this:

Array 0 -> div content
Array 1 -> div content
Array 2 -> div content
and so on.. :)

I really appreciate the help!
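
One way to get a compact listing like that is to print only each div's text instead of dumping the whole node objects, which is what makes the output so long. This is a sketch using PHP's built-in DOMDocument rather than simpledom; as far as I know simpledom nodes have a comparable plaintext property, but treat that as an assumption. The sample markup and the 40-character cut-off are made up for the example.

```php
<?php
// Hypothetical markup with a nested <div>, loosely modelled on the dump above.
$html = '<div id="body_wrapper"><div class="mainWrapper">Hello <b>world</b></div></div>
<div>Second block with a fairly long text that gets cut off in the listing</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$lines = [];
foreach ($dom->getElementsByTagName('div') as $index => $div) {
    // textContent is just the text, without tags or node objects.
    $text = preg_replace('/\s+/', ' ', trim($div->textContent));
    // Cut long content so each div stays on one line
    // (substr counts bytes; multibyte text would need mb_substr).
    if (strlen($text) > 40) {
        $text = substr($text, 0, 40) . '...';
    }
    $lines[] = "$index -> $text";
}

echo implode("\n", $lines), "\n";
```

Wrapper divs show the text of everything nested inside them, so the same content can appear under more than one number. The innermost match is normally the one to use.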

incrediBILL

1:31 am on Mar 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a supplier who doesn't provide an order-csv


Once you've done this, give the file to your supplier along with a link to a copy of OpenOffice so they can maintain it themselves. If they're willing to play ball, you'll save yourself a lot of trouble in the future.