WWW::Search::Yahoo

Does it work for you?

         

Hanu

10:42 pm on May 2, 2004 (gmt 0)

10+ Year Member



I want to monitor my main search phrases in Yahoo and Google. For that purpose I wrote a little Perl script based on the latest version (as of May 1st) of WWW::Search and its submodules WWW::Search::Yahoo and WWW::Search::Google. The Google module works as expected. But the WWW::Search::Yahoo module returns exactly one match for any query. The match's URL points back to search.yahoo.com and has the original query string in it. The WWW::Search::Yahoo::DE module returns the first 20 results only, and some of the matches have URLs pointing back to Yahoo with the real URL of the matching page encoded as a query parameter. Does anyone here use WWW::Search and know what I'm doing wrong?
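For matches that point back to the search engine with the real URL tucked into a query parameter, a minimal sketch of unwrapping them could look like the following. It uses core Perl only. The thread does not say which parameter carries the target, so the code makes no assumption about the name and simply returns the first parameter value that decodes to an http(s) URL. The helper names and the sample link are hypothetical.

```perl
use strict;
use warnings;

# Decode '+' and %XX escapes in a query-string value (core Perl only).
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Return the first query parameter value that itself looks like an
# http(s) URL, or the original URL unchanged if none does.
# No parameter name is assumed, since the thread does not give one.
sub unwrap_redirect {
    my ($url) = @_;
    my ($query) = $url =~ /\?([^#]*)/ or return $url;
    for my $pair (split /[&;]/, $query) {
        my (undef, $value) = split /=/, $pair, 2;
        next unless defined $value;
        my $decoded = percent_decode($value);
        return $decoded if $decoded =~ m{^https?://}i;
    }
    return $url;
}

# Hypothetical redirect link, for illustration only.
print unwrap_redirect(
    'http://search.example.com/r?q=perl&u=http%3A%2F%2Fwww.example.org%2Fpage'
), "\n";
```

URLs without a matching parameter pass through untouched, so the function can be applied to every result without checking first.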

Or which solution do you use to monitor your rankings?

SeanW

1:14 pm on May 3, 2004 (gmt 0)

10+ Year Member



I'm in the same boat. I wrote some code to monitor various keyphrases over different engines and store the results in an RRD. Recently, Yahoo stopped working, and there have been no updates to the code. I've sent an email to the maintainer to ask if there is an update planned; otherwise I'll probably crack out the debugger and see what's up.
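For readers who have not seen this kind of monitor, here is a rough sketch of the idea, not the code described in this post. `rank_of` and `fetch_urls` are hypothetical helper names. `fetch_urls` relies on the WWW::Search calls `new`, `maximum_to_retrieve`, `native_query`, `escape_query` and `next_result`, and needs the CPAN modules plus network access, so the demonstration at the bottom runs on made-up URLs instead. Storing the rank in an RRD is left out.

```perl
use strict;
use warnings;

# Return the 1-based position of the first URL whose host is $host
# (or a subdomain of it), or nothing if it is not among the results.
sub rank_of {
    my ($host, @urls) = @_;
    for my $i (0 .. $#urls) {
        my ($h) = $urls[$i] =~ m{^https?://([^/:?#]+)}i or next;
        return $i + 1 if lc($h) eq lc($host) or $h =~ /\.\Q$host\E$/i;
    }
    return;
}

# Collect result URLs for one phrase. Needs WWW::Search and the engine
# backend (e.g. WWW::Search::Yahoo) from CPAN; loaded only when called.
sub fetch_urls {
    my ($engine, $phrase, $max) = @_;
    require WWW::Search;
    my $search = WWW::Search->new($engine);
    $search->maximum_to_retrieve($max);
    $search->native_query(WWW::Search::escape_query($phrase));
    my @urls;
    while (my $result = $search->next_result) {
        push @urls, $result->url;
    }
    return @urls;
}

# Offline demonstration with made-up URLs; for a live query use
# something like fetch_urls('Yahoo', 'my key phrase', 50) instead.
my @urls = ('http://www.example.com/a', 'http://shop.example.org/b');
my $rank = rank_of('example.org', @urls);
print defined $rank ? "rank $rank\n" : "not found\n";
```

Keeping the ranking logic separate from the fetching means it keeps working if the backend has to be swapped for a hand-written scraper.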

Sean

VectorJ

5:14 pm on May 3, 2004 (gmt 0)

10+ Year Member



Until the Yahoo search module is updated, you can use WWW::Mechanize to fetch Yahoo results pages and scrape them using HTML::Parser. Certainly a lot more trouble than using the WWW::Search::Yahoo module, but if it's critical that you get this information, there are alternatives.

Hanu

10:57 am on May 4, 2004 (gmt 0)

10+ Year Member



Sean, the funny thing is that there was an update to WWW::Search and WWW::Search::Yahoo on May 1st. The changelog says something about 'overhaul for new layout'. I wonder if Yahoo is cloaking the result pages based on User-Agent. On the other hand, Yahoo::DE works - why wouldn't yahoo.de do cloaking too? Or maybe they both cloak, but in different ways: yahoo.com returns only one match and yahoo.de twenty.

You can actually feed WWW::Search a custom LWP::UserAgent subclass. I'm thinking about writing one myself, faking a 'Mozilla ...' or so.

VectorJ, I will look into that. A quick hack of that sort will probably take me a day or so.

Thanks, guys.

PLITGames

11:29 am on May 4, 2004 (gmt 0)

10+ Year Member



That is actually quite a good idea; I wonder why I have never done this.

Is there anywhere I can download the source of a Perl program for this, or should I just write one myself?

Peter

SeanW

11:49 am on May 4, 2004 (gmt 0)

10+ Year Member



Hanu:

My mirror must be behind; I didn't know about the May 1 update until just now.

Yahoo.pm mentions cloaking and anti-robot measures. You could pass it an LWP subclass, or just hack the source ;) (there is commented-out code in Yahoo.pm to change the UA)

Sean

SeanW

1:02 pm on May 4, 2004 (gmt 0)

10+ Year Member



I just reinstalled WWW::Search and WWW::Search::Yahoo from CPAN and it works again!

Sean

Hanu

1:51 pm on May 4, 2004 (gmt 0)

10+ Year Member



Still doesn't work for me. I tried various things: uncommented the user-agent lines as you suggested, tried gui_query() instead of native_query(), and called need_to_delay(1). This is quite irritating. Now I feel like they are challenging me ;-). I might just write my own little script as VectorJ suggested, and publish it here. Then what?!

SeanW

2:45 pm on May 4, 2004 (gmt 0)

10+ Year Member



Hanu:

Not sure why it's not working for you... I did a "force install WWW::Search::Yahoo" from CPAN to get it fully updated, even though it claimed I was already current.

Internally, it uses LWP and HTML::TreeBuilder rather than HTML::TokeParser. I'd suggest learning TreeBuilder in addition to TokeParser; it makes some tasks much easier. I wrote some stuff to scrape CJ that I'm pretty sure would have been 10 times longer if it were done through scanning tokens.
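To give a feel for the TreeBuilder style, here is a small sketch that pulls links out of a page by tag and attribute with `look_down`. The markup and the `result` class are invented for illustration; real result pages are structured differently and change often. `result_links` is a hypothetical helper name, and HTML::TreeBuilder comes from the HTML-Tree distribution on CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up results markup; real result pages will be structured differently.
my $html = <<'HTML';
<html><body>
<ol>
  <li><a class="result" href="http://www.example.com/">Example one</a></li>
  <li><a class="result" href="http://www.example.org/">Example two</a></li>
</ol>
<a href="/help">Help</a>
</body></html>
HTML

# Return [url, link text] pairs for every <a class="result"> in the page.
sub result_links {
    my ($content) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($content);
    my @links = map { [ $_->attr('href'), $_->as_text ] }
        $tree->look_down(_tag => 'a', class => 'result');
    $tree->delete;    # free the parse tree explicitly
    return @links;
}

printf "%s  %s\n", @$_ for result_links($html);
```

The selection is a single `look_down` call, whereas a token scanner would need to track open tags and collect text by hand.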

Sean

Hanu

3:40 pm on May 4, 2004 (gmt 0)

10+ Year Member



Now it works like a charm. There must have been versioning discrepancies on CPAN or something. I checked the module version numbers about ten times and they were all identical to those of the sources on CPAN.

Half joking: What's really cool in our field is that for every command there seems to be a switch (or prefix, in this case) that makes the command do what you mean. I say 'install' and it doesn't install. I say 'force install' and it does install. I say 'rm' and it doesn't remove the file, I say 'rm -f' and it does. I suggest doing the reverse: for every command there should be a switch or prefix that turns the command into one that doesn't do what it should, e.g.

maybe install

or

rm --whatever

Wouldn't that save some time?

Hannes