Need a crawler/data extraction script

Forum Moderators: phranque

harveycarpenter

11:44 am on Jan 2, 2007 (gmt 0)

Hello,

I've been searching for a while for a script that will go to a specified URL and take certain text from that page:

- Go to URL
- Based on user-supplied HTML tags (e.g. a new item for every table), split the text up
- Add each piece of text to a database

So basically, I need a script that can detect a tag such as "<TABLE>", or even a text marker such as "Products-", take the text from that point and then stop when it gets to "</TABLE>".

This should be fairly simple to achieve with a GET request, a few loops and some MySQL inserts, but I was wondering if anybody's done anything like this before so I know where to start.

Cheers

PS: Sorry if this is in the wrong forum. I could've been more specific and put it in the *nix, Apache or PHP forum but I didn't want to rule any options out.
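
For reference, the steps in the post above (fetch the page, split it on a chosen tag, store each piece) could be sketched roughly like this. It is in Python rather than PHP, and SQLite stands in for MySQL so the sketch runs without a database server. The names fetch, split_by_tag, store_chunks and the items table are made up for illustration and are not taken from any script mentioned in this thread.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page with an HTTP GET and decode it to text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TagTextSplitter(HTMLParser):
    """Collect the text inside each outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.depth = 0           # > 0 while inside the target tag
        self.chunks = []         # one string per outermost target element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Closing tag of the outermost element: emit one chunk.
                # Text of nested elements of the same tag is folded in.
                self.chunks.append(" ".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self._buffer.append(text)


def split_by_tag(html, tag):
    """Return the text of every <tag>...</tag> block found in html."""
    parser = TagTextSplitter(tag)
    parser.feed(html)
    parser.close()
    return parser.chunks


def store_chunks(conn, chunks):
    """Insert each chunk as a row (assumed schema: items(id, body))."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, body TEXT)"
    )
    conn.executemany("INSERT INTO items (body) VALUES (?)", [(c,) for c in chunks])
    conn.commit()
```

Typical use would be store_chunks(conn, split_by_tag(fetch(url), "table")), with url being whatever page is to be read. A plain-text start marker such as "Products-" is not handled here; that would need a string search rather than tag matching.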

Corey Bryant

5:14 pm on Jan 2, 2007 (gmt 0)

It sounds like you might need to look for a "tear" component. We used ASPTear for something like this.

-Corey

Easy_Coder

5:23 pm on Jan 2, 2007 (gmt 0)

The mshtml object is perfect for this. You can bind raw HTML to a document object model and then walk it.
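
The same idea (load the raw HTML into a tree, then walk it) does not depend on MSHTML. As a rough, non-Windows illustration, here is a Python sketch that builds a very small tree with the standard library's html.parser and walks it. This is not the MSHTML API. Node, TreeBuilder, parse, find_all and text_of are hypothetical names, and badly broken real-world markup would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Turn a stream of tags into a simple tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Yield every descendant element named tag, depth first."""
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                yield child
            yield from find_all(child, tag)


def text_of(node):
    """Concatenate all text found beneath node."""
    parts = []
    for child in node.children:
        parts.append(child if isinstance(child, str) else text_of(child))
    return " ".join(p for p in parts if p)
```

Walking the tree then looks like: for table in find_all(parse(html), "table"): print(text_of(table)).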

harveycarpenter

5:32 pm on Jan 2, 2007 (gmt 0)

Thanks a lot for the replies. I'm unfamiliar with ASP and won't have a Windows server, so a PHP option would probably be more appropriate - in fact, that's where I should've posted this - sorry!

Thanks

rocknbil

7:55 pm on Jan 2, 2007 (gmt 0)

The Perl LWP module will also do this, but it needs some tweaking to work right. Just a heads up: if this is not your site, this is known as "page scraping" and is frowned upon by most site owners, because it uses bandwidth for requests that do not come from actual visitors. You're better off seeking out an RSS feed as the source for your scrape - it's easier to implement too.
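
If the site does offer an RSS feed, reading it is mostly a matter of parsing XML. A minimal Python sketch, assuming a plain RSS 2.0 feed with no namespaces (parse_rss_items and the sample feed are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, only here so the sketch can be run as-is.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example products</title>
    <item>
      <title>Widget</title>
      <link>http://example.com/widget</link>
      <description>A small widget.</description>
    </item>
    <item>
      <title>Gadget</title>
      <link>http://example.com/gadget</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Feeds in RSS 1.0 (RDF) or Atom use XML namespaces, so the plain "item" lookup above would not match them without adjustment.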

harveycarpenter

10:45 pm on Jan 2, 2007 (gmt 0)

Thanks for the replies.

I've been reading about the Perl module - it looks quite complicated but I'll keep reading :)

About the screen scraping - I'm aware of the possible implications, but I'll only need to scrape a page or two out of hundreds, and it's essentially an affiliate site, so it will be sending business their way. Thanks for mentioning that.

Anyone know a PHP solution?

BTW I've found the same problem listed here - but I don't yet have a subscription:

I'll post if I get subscribed or an update.

[edited by: trillianjedi at 10:41 am (utc) on Jan. 3, 2007]
[edit reason] TOS [/edit]

stardoc

10:55 pm on Jan 2, 2007 (gmt 0)

harveycarpenter, welcome to WebmasterWorld!

Two things about the link you posted. First, you are not allowed to post such links in this forum. Second, you can see the solution on the page you referred to if you scroll down a bit more. :-)

harveycarpenter

12:00 am on Jan 3, 2007 (gmt 0)

Thanks stardoc. My time to edit that post seems to have ended, so I can't remove the links. Hopefully they'll remain, as they do help solve the problem and I'm obviously not affiliated with <snip>...

By the way, only the first one has the solution shown - I'll try that out.

[edited by: physics at 5:24 am (utc) on Jan. 4, 2007]
[edit reason] Snipped domain [/edit]