Forum Moderators: mack
I am a newbie at spider programming, though I've spent the past year parsing external web pages for info crawling with
regexps, PHP, custom DBs...
I want to extend the services offered by our company with
a web spider. I've been taking a look at Snoopy and similar classes, and it looks interesting, but I also need the keyword indexing side: strategies, options, phrases, word density, etc...
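On the word-density point, here's a minimal sketch (in Python rather than PHP, just for illustration) of how you might count keyword frequencies once you've stripped a page down to plain text. The tokenizing regexp is an assumption; a real indexer would also handle stopwords and stemming:

```python
from collections import Counter
import re

def word_density(text, top_n=10):
    """Return (word, count, density) for the top_n words in a text."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    total = len(words)
    counts = Counter(words)
    return [(w, c, c / total) for w, c in counts.most_common(top_n)]

sample = "spider spider crawl index spider crawl"
for word, count, density in word_density(sample, top_n=3):
    print(word, count, round(density, 2))
```

Feed it the text extracted from each crawled page and you have the raw numbers that density/keyword strategies are built on.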
(I hate these kinds of messages from lazy people saying
"Does anyone know how I can do what I should be searching for instead of requesting?", so I won't do that again)
So: can anyone point me to the first steps on this stuff?
----
Additional questions:
1. Can I spoof or accept a cookie set by JavaScript and resend it when spidering cookie-protected pages?
2. Any ideas on how to spider PDF files, apart from fetching the file, converting it with pdf2html, and parsing the result?
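On question 1: a spider never executes JavaScript, so a JS-set cookie never arrives via a `Set-Cookie` header; the usual workaround is to replicate the value the script would compute and send it by hand. A sketch using Python's standard library (the cookie name, value, and URL are hypothetical placeholders):

```python
import urllib.request

# A cookie set by document.cookie in JavaScript is invisible to the spider,
# so we hand-craft the Cookie header ourselves and resend it on each request.
# "session_id=abc123" and the URL are hypothetical placeholders.
req = urllib.request.Request(
    "http://example.com/protected-page",
    headers={"Cookie": "session_id=abc123"},
)

# response = urllib.request.urlopen(req)  # uncomment to actually fetch
```

If the server first hands out the cookie via a normal `Set-Cookie` response header (and only the *reading* happens in JS), a standard cookie jar will capture and resend it automatically.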
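On question 2: an alternative to the pdf2html route is `pdftotext` from poppler-utils, which dumps the PDF straight to plain text so there's no HTML to parse afterwards. A sketch, assuming `pdftotext` is installed on the box:

```python
import subprocess

# poppler-utils command; "-layout" tries to preserve column layout (assumption:
# pdftotext is installed and on PATH).
PDFTOTEXT = ["pdftotext", "-layout"]

def pdf_to_text(pdf_path):
    """Dump a PDF to plain text via pdftotext; "-" sends output to stdout."""
    cmd = PDFTOTEXT + [pdf_path, "-"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# usage (needs a real PDF and poppler installed):
# text = pdf_to_text("whitepaper.pdf")
```

The extracted text can then go through the same keyword/density pipeline as ordinary HTML pages.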
P.S.: I'll compile the related info for a future post.