

html Simpledom - finding the correct $var[number]

     
9:26 pm on Feb 28, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Jan 31, 2006
posts: 305
votes: 0


When searching through source code with the simpledom library, I understand that you can use either

$var = $html->find('#divid')
or
$var = $html->find('div[class=divclass]')
or just
$var = $html->find('div')

If the site you are scraping doesn't have any elements with a class or ID, the only option is to use e.g. 'div'.

And since find() returns an array, you can get the content you want with $var[number].

How can I easily find out which number to use in $var[12], except by manually counting through all the divs on the site?

Please correct me if I have misunderstood something. I'm very new to programming.
10:11 pm on Feb 28, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15955
votes: 898


If the site you are scraping

This could, perhaps, have been worded more felicitously. I prefer "unauthorized mirror".
12:43 am on Mar 1, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Jan 31, 2006
posts: 305
votes: 0


Thank you, that was helpful.
12:47 am on Mar 1, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:5050
votes: 61


Scraping is brittle by nature. If the data you want happens to be in the 12th <div>, then access that item. If you're not comfortable with that, try an XPath expression, or look for 'fingerprint' text that you expect inside the <div>, to qualify your choice.

There is always a frame of reference to access a particular element, but be sure to provide some error reporting in case the HTML structure changes.

Some sites generate unique classes and IDs on every page load, making it much harder to pinpoint the element you want.
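
A minimal sketch of the XPath-plus-fingerprint idea with error reporting. It uses PHP's built-in DOMDocument and DOMXPath rather than simpledom, and the markup, the fingerprint text "Price:" and the helper name findDivByText are all made up for illustration.

```php
<?php

// Hypothetical markup standing in for a product page.
$html = '<div>Menu</div>
<div>Price: 199</div>
<div>Footer</div>';

/**
 * Return the first <div> that directly contains text matching $needle.
 * Throws if nothing matches, so a change in the page layout is noticed
 * immediately instead of silently producing empty data.
 */
function findDivByText(DOMDocument $dom, $needle)
{
    $xpath = new DOMXPath($dom);
    // Assumption: $needle contains no double quote, because it is pasted
    // straight into the XPath expression.
    $nodes = $xpath->query('//div[text()[contains(., "' . $needle . '")]]');
    if ($nodes === false || $nodes->length === 0) {
        throw new RuntimeException("No <div> containing '$needle' found; has the page layout changed?");
    }
    return $nodes->item(0);
}

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ hides warnings about imperfect HTML

$priceDiv = findDivByText($dom, 'Price:');
echo trim($priceDiv->textContent), "\n";
```

The text() test only looks at text sitting directly inside a <div>, so wrapper <div>s that merely contain the target further down are not matched.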
12:59 am on Mar 1, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Jan 31, 2006
posts: 305
votes: 0


Thank you!

I guess I didn't explain my problem very well. I'm not a native English speaker.

What I'm trying to do:
I have a webshop, and I have a supplier who doesn't provide an order CSV. So I have two options:
1. Add several hundred products to my webshop manually
2. Build a scraper to collect the product information and put it in a CSV for import into my webshop.

My problem:
Let's say there are 200 divs on the product page I'm scraping.

And let's say it's div number 189 I need to get info from.

Is there an easy way to get that number, other than opening the source code in Firefox, pressing Ctrl+F, typing "div" in the search field, and then pressing "next" 189 times while counting the divs I pass until I reach the one I'm after?

I'm thinking about writing something that outputs the number of each div on the left and the content of that div on the right.

0 -> div content
1 -> div content
2 -> div content
...
...
...
189 -> div content (what I'm looking for)

This way I would have a better view of which div number I should use.
1:11 am on Mar 1, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:5050
votes: 61


Yes, here's an example using PHP's DOMDocument [ca3.php.net] (I haven't used simpledom, but it probably has very similar syntax and functions).


<?php

// Sample markup: four <div>s, one of which holds the text we are looking for.
$htmlstring = '<div>test</div>
<div>testing</div>
<div>sometextinsideiknowisthere</div>
<div>another false</div>';

$dom = new DOMDocument;
@$dom->loadHTML($htmlstring); // @ suppresses warnings about imperfect HTML
$divs = $dom->getElementsByTagName('div');

$content = '';
foreach ($divs as $increment => $div) {
    // Match on 'fingerprint' text expected inside the target <div>.
    if (preg_match("'sometextinsideiknowisthere'ims", $div->nodeValue)) {
        echo "hit at increment $increment\n"; // 2 in this case, counting starts from 0
        $content = $dom->saveHTML($div);
        break;
    }
}
echo $content; // has "<div>sometextinsideiknowisthere</div>"

?>
1:22 am on Mar 1, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Jan 31, 2006
posts: 305
votes: 0


Thank you.

Although, since there are several hundred products and I'm looking for the price, description, title and so on for each of them, the div contents may vary a lot.

Therefore I think I still have to count the divs and use the array index of the divs I'm scraping.
1:27 am on Mar 1, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:5050
votes: 61


The main thing to consider here is that you can use a loop to get all the <div> tags. Regardless of how many you want to match, they're all there, along with their child elements.

Decide on your 'hook' for selecting the <div>s you're interested in.

Grab whichever info you need. Save it somewhere. Repeat this step however many times.

RE: the increment number specifically, you can see how to access it in my code example.
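
The loop / hook / grab / save steps above could be sketched roughly as follows. This is only an illustration: the two page strings, the label texts "Price:" and "Description:" used as hooks, and the CSV column names are all assumptions, and real pages would be fetched from the supplier's site instead of being hard-coded.

```php
<?php

// Hypothetical product pages, hard-coded to keep the sketch self-contained.
$pages = array(
    '<h1>Red Widget</h1><div>Menu</div><div>Price: 199</div><div>Description: A red widget.</div>',
    '<h1>Blue Widget</h1><div>Price: 249</div><div>Sale!</div><div>Description: A blue widget.</div>',
);

// The hooks: label text expected at the start of the <div>s we care about.
$hooks = array('price' => 'Price:', 'description' => 'Description:');

$rows = array();
foreach ($pages as $html) {
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // @ hides warnings about imperfect HTML

    $h1 = $dom->getElementsByTagName('h1')->item(0);
    $row = array(
        'title'       => $h1 ? trim($h1->textContent) : '',
        'price'       => '',
        'description' => '',
    );

    // Loop over every <div> and keep the ones that match a hook.
    foreach ($dom->getElementsByTagName('div') as $div) {
        $text = trim($div->textContent);
        foreach ($hooks as $field => $label) {
            if (strpos($text, $label) === 0) {
                // Keep only what follows the label.
                $row[$field] = trim(substr($text, strlen($label)));
            }
        }
    }
    $rows[] = $row;
}

// Save: write the collected rows as CSV (to standard output here).
// Delimiter, enclosure and escape are spelled out to avoid relying on defaults.
$out = fopen('php://output', 'w');
fputcsv($out, array('title', 'price', 'description'), ',', '"', '\\');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

Note that the position of the price <div> differs between the two sample pages, which is why the sketch matches on label text instead of a fixed index.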
9:38 pm on Mar 1, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Jan 31, 2006
posts: 305
votes: 0


Yes! Finally - I got it.
However, the array is too large to print/echo.
Like you said, there seem to be many child elements.
How can I loop through just the content of the divs, and nothing else?

I get an INSANELY long string - this is just 0.2% of it:

Array ( [0] => simple_html_dom_node Object ( [nodetype] => 1 [tag] => div [attr] => Array ( [id] => body_wrapper ) [children] => Array ( [0] => simple_html_dom_node Object ( [nodetype] => 1 [tag] => div [attr] => Array ( [class] => mainWrapper ) [children] => Array ( [0] => simple_html_dom_node Object ( [nodetype] => 2 [tag] => comment [attr] => Array ( ) [children] (...)

I would like to see something like this:

Array 0 -> div content
Array 1 -> div content
Array 2 -> div content
and so on.. :)

I really appreciate the help!
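
One way to get that kind of listing is to loop over the divs and print only each one's text, instead of dumping the whole array of objects. The sketch below uses PHP's built-in DOMDocument and made-up markup; with simpledom the same loop over the result of find('div'), echoing each element's plaintext property, should presumably give a similar list. The 60-character cut-off is an arbitrary choice.

```php
<?php

// Hypothetical markup; note the nested wrapper <div>s, as on most real pages.
$html = '<div id="body_wrapper"><div class="mainWrapper"><div>Red Widget</div><div>Price: 199</div></div></div>';

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ hides warnings about imperfect HTML

$lines = array();
foreach ($dom->getElementsByTagName('div') as $index => $div) {
    // Collapse whitespace and cut long text so every <div> fits on one line.
    $text = trim(preg_replace('/\s+/', ' ', $div->textContent));
    if (strlen($text) > 60) {
        $text = substr($text, 0, 60) . '...';
    }
    $lines[] = "$index -> $text";
}

echo implode("\n", $lines), "\n";
```

Wrapper divs repeat the text of everything inside them, so the same content shows up at several indexes; the innermost match is usually the one to use. The indexes printed here only carry over to simpledom if both parsers see the same divs in the same order, which is worth checking on a real page.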
1:31 am on Mar 3, 2014 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


I have a supplier who doesn't provide an order CSV


Once you've done this, give your supplier the file and a link to a copy of OpenOffice so they can maintain it themselves. If they're willing to play ball, you'll save yourself a lot of trouble in the future.
 
