Pull pagerank automagically?

How can a script determine a page's PageRank?

         

ggrot

5:05 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone determined a way to get the PageRank for a site (as in the value shown in the Google toolbar) via a script, with no person staring at the bar? This would be useful for evaluating rankings, IMHO. If you know of a simple way to do it (algorithmically), please let us know.

littleman

5:30 pm on Jul 9, 2001 (gmt 0)



If anyone figures this out, please let me know!

theperlyking

5:35 pm on Jul 9, 2001 (gmt 0)

10+ Year Member



You could do it by pretending to be the Google bar. It transfers XML data, so you can just read that with a Perl script, etc.

What would you be using it to do?

agerhart

5:37 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



you can sign me up for a copy of the script!

littleman

5:44 pm on Jul 9, 2001 (gmt 0)



Perlyking, have you been able to tap into the protocol?

ggrot

5:50 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's exactly what I was thinking, perlyking, but can you be more specific? What connection needs to be made, and what values and format need to be sent to Google?

theperlyking

5:53 pm on Jul 9, 2001 (gmt 0)

10+ Year Member



I did a while ago; in fact I'm trying (and failing) to find my post about it.

The format seems to have changed since I last looked a few months ago, and there is more interesting stuff in there.

ggrot

6:00 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As for my uses, I'd just like to be able to export a whole bunch of page variables to some type of basic database and analyze statistics, just for kicks and grins. PageRank just happens to be one of those variables.

ggrot

6:44 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Almost got it figured out. I did a packet sniff (had to remember how that worked) and captured the Google bar requesting the following XML document for yahoo.com:

[google.com...]

I can access this XML document directly within IE5. I assume other browsers support XML too. It contains lots of cool info in plain text, including the PageRank. The only problem is that the ch variable seems to be some type of redundant encryption of the URL. In other words, you have to know the correct ch to get the XML document. It might also be tied to your specific IP, so I'm not sure that you will be able to access my page. Anybody know how to generate the ch?

littleman

7:04 pm on Jul 9, 2001 (gmt 0)



Yeah, it does not allow you to alter the 'ch' or the URL in question, but it does allow me to view it with a non-MSIE browser. So I guess it would be possible to make a utility to spot-check known URLs, but not to make new requests?

Doofus

7:33 pm on Jul 9, 2001 (gmt 0)



If you do a search for this:

[google.com...]

in which "www.mydomain.org" is the site whose PageRank you want, you will get back this from Google (all angle brackets were changed to braces):

{?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?}
{!DOCTYPE GSP (View Source for full doctype...)}
- {GSP VER="3.1"}
{TM}0.159834{/TM}
{Q}info:http://www.mydomain.org/nsearch.html{/Q}
- {RES SN="1" EN="1"}
{M}0{/M}
- {R N="1" L="1"}
{U}http://www.mydomain.org/nsearch.html{/U}
{T}MyDomain Name Search{/T}
{RK}6{/RK}
{S}MyDomain name search. If you can't spell somebody's name, use{br} your best guess for their last name only: Last name only: {b}...{/b}{/S}
- {HAS}
{L TAG="link:" /}
{C SZ="3k" TAG="cache:" /}
{RT TAG="related:" /}
{/HAS}
{/R}
{/RES}
{/GSP}

The PageRank is between the {RK} and {/RK} -- in this case, it's a 6.

You can see that the title and the first sentence on the page also come back. If the page is in the ODP directory (not the case in this example) this info also comes back from Google, with the category that it is in.
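For anyone who wants to pull those fields out with a script, here is a minimal sketch in Python using the standard library. It assumes the response really is shaped like the sample above, with the braces turned back into angle brackets; the XML declaration and DOCTYPE lines are left out and the snippet text is shortened. The function name parse_result is only an illustration.

```python
import xml.etree.ElementTree as ET

# The sample response from the post, with braces restored to angle brackets.
# Declaration/DOCTYPE omitted and the <S> snippet shortened for brevity.
SAMPLE = """<GSP VER="3.1">
<TM>0.159834</TM>
<Q>info:http://www.mydomain.org/nsearch.html</Q>
<RES SN="1" EN="1">
<M>0</M>
<R N="1" L="1">
<U>http://www.mydomain.org/nsearch.html</U>
<T>MyDomain Name Search</T>
<RK>6</RK>
<S>MyDomain name search.</S>
<HAS>
<L TAG="link:" />
<C SZ="3k" TAG="cache:" />
<RT TAG="related:" />
</HAS>
</R>
</RES>
</GSP>"""


def parse_result(xml_text):
    """Return (url, title, pagerank) for the first <R> result, or None."""
    root = ET.fromstring(xml_text)
    result = root.find(".//R")
    if result is None:
        return None
    rank = result.findtext("RK")
    return (
        result.findtext("U"),
        result.findtext("T"),
        int(rank) if rank is not None else None,
    )


print(parse_result(SAMPLE))
```

This only covers reading a response you already have; it does nothing about the ch value needed to request one.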

However, there's a catch that makes it more complex. You need the "ch=0123456789" in the query string. It appears to be a ten-digit checksum based on the domain name you are requesting, apparently generated within the toolbar code. If the number does not match that domain, you get a "not authorized" message in Explorer instead of the above information.

Writing a script would require knowing how this 10-digit checksum is generated. You'd have to collect a bunch of domains and checksums, and try to see if there's a pattern. It might be a simple checksum, or it might even be some sort of one-way hash.
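The "collect pairs and look for a pattern" idea can be sketched as a small test harness. This is only an illustration: the candidate functions (CRC32, Adler-32, MD5 reduced to ten digits) are guesses picked because they are common, not anything known about the toolbar, and the demo pairs are synthetic, generated with CRC32 just so the harness has something to find. Real pairs would have to come from sniffed toolbar requests.

```python
import hashlib
import zlib


def _ten_digits(value):
    """Format an integer as a zero-padded ten-digit string."""
    return f"{value % 10**10:010d}"


# Candidate checksum functions. These are guesses, not the toolbar's method.
CANDIDATES = {
    "crc32": lambda url: _ten_digits(zlib.crc32(url.encode("ascii"))),
    "adler32": lambda url: _ten_digits(zlib.adler32(url.encode("ascii"))),
    "md5": lambda url: _ten_digits(
        int(hashlib.md5(url.encode("ascii")).hexdigest(), 16)
    ),
}


def matching_candidates(pairs):
    """Return the names of candidates that reproduce every (url, ch) pair."""
    return [
        name
        for name, func in CANDIDATES.items()
        if all(func(url) == ch for url, ch in pairs)
    ]


# Synthetic pairs built with CRC32 purely to exercise the harness.
demo_urls = ["http://www.example.com/a.html", "http://www.example.com/b.html"]
demo_pairs = [(url, CANDIDATES["crc32"](url)) for url in demo_urls]

print(matching_candidates(demo_pairs))
```

If the real function is a keyed or one-way hash, a harness like this will simply return an empty list no matter how many pairs you feed it.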

I don't think it's worth the effort for a single-digit PageRank.

theperlyking

8:02 pm on Jul 9, 2001 (gmt 0)

10+ Year Member



The checksum varies by page requested and not just by domain, i.e. domain.com/a.html has a different checksum from domain.com/b.html. Of course, for PR this is probably not a problem.

As Doofus says, the calculations necessary make this a tricky task. While it's definitely cool to see the data being received, there's probably limited scope for automating it :(

I suspect that as soon as the checksum algo was decoded it could be changed anyway. Since the Google bar is self-updating, it would be a moving target.

ggrot

8:17 pm on Jul 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Unless someone knows how to reverse engineer the software itself. I looked at a few of the checksums (and came to the same conclusion). They appear to be pretty diverse. For example:
[yahoo.com...] gives a checksum of 14282204401 and
[yahoo.com...] gives a checksum of 13409342805.

Anyway, I'm sure there is some method of stepping through the software with a debugger or whatnot to determine the checksum method, but just hacking it out by looking at patterns seems unlikely to work. Oh well.

ettore

9:12 pm on Jul 9, 2001 (gmt 0)

10+ Year Member



Well, if anyone does determine the checksum method, please let me know too :)

JamJar

4:38 pm on Apr 29, 2002 (gmt 0)

10+ Year Member



I know it isn't possible to produce the checksum automatically, but how can you find it for each individual URL?

That way, if you're running, say, 10 sites, it's worth putting in the work to get each checksum from your toolbar once, so you can fetch the PR automatically later.

Anyone know?

ROLAND_F

9:35 pm on Apr 29, 2002 (gmt 0)

10+ Year Member



My God, you are still trying to understand the checksum function?

Does nobody here do a little bit of assembly?

I learned everything I needed to reverse engineer the toolbar and rip out the checksum function in an afternoon. It's so easy. In fact there is no real protection in this toolbar you know.

If you are afraid of assembly, you can put a proxy between your MSIE and the web and script MSIE to issue queries.

Your script requests a URL via MSIE.
The toolbar detects the "openurl" event and launches a request to Google's backend through your proxy.

You can do everything you want thanks to your proxy: either store the URL with the valid checksum for later retrieval, or grab the response data...
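The "store the URL with the valid checksum" step might look like the sketch below, assuming the proxy has already logged the toolbar's request URLs. The ch parameter is the one discussed in this thread; the q parameter name and the toolbar-backend.example host are assumptions made for illustration, since the real request URL is not shown in full here.

```python
from urllib.parse import parse_qs, urlsplit


def harvest(logged_request_urls):
    """Map each requested page to the checksum the toolbar sent for it."""
    table = {}
    for request in logged_request_urls:
        params = parse_qs(urlsplit(request).query)
        query = params.get("q", [""])[0]      # "q" is an assumed parameter name
        checksum = params.get("ch", [""])[0]  # "ch" as described in the thread
        if query.startswith("info:") and checksum:
            table[query[len("info:"):]] = checksum
    return table


# One hypothetical log line; host and parameter layout are placeholders.
log = [
    "http://toolbar-backend.example/search"
    "?ch=0123456789&q=info:http://www.mydomain.org/nsearch.html"
]

print(harvest(log))
```

A table like this only helps for pages the toolbar has already visited, and only for as long as the checksums stay valid.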

Hmm, I remember that there are now a few more things to figure out, such as the encoding of the timestamps. Look at it for 10 seconds and you will see that it's a kind of uuencoding.

And now what?

You build a tool to cheat at Google, Google will do its best to catch you, and the cycle continues ad nauseam...

And one day, to avoid bankruptcy, Google will let people pay to have their results at the top of the SERPs. The guy with more dollars wins once again.

What a nice world.

Take a break, take a deep breath, relax, and go design a nice and usable website with interesting content. Play by the rules and you will have traffic!

JamJar

8:27 am on Apr 30, 2002 (gmt 0)

10+ Year Member



Playing by the rules involves putting your content online in a usable fashion, then going off to build links from relevant websites to relevant areas within your site.

While building the links (which obviously takes a hell of a long time), it is useful to check your PR to see how it is building, and to what level you are at compared to your competitors.

It is boring enough building links, let alone checking every page on your website for the PR. I just wanted a little time saver in what is (when we're honest) a rather monotonous marketing position!

(Apologies if you thought I was trying to cheat it somehow)

chris_f

8:37 am on Apr 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm just getting a forbidden message when I change the domain information. How do I go about changing the links you have posted to my site?

Chris_R

9:58 am on Apr 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



People who are gamers could do it with a macro as well. You have a text file with the pages you want to check, and the script runs through it and "reads" the green pixels.
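A rough sketch of the pixel-reading part, under some loud assumptions: that a row of (r, g, b) pixels across the toolbar's bar has already been captured somehow, that the green fill is proportional to the rank on a 0-10 scale, and that the colour thresholds below are good enough to tell green from background. None of that is confirmed here, and the screen capture itself is left out.

```python
def estimate_rank(pixel_row, max_rank=10):
    """Estimate the toolbar rank from one row of (r, g, b) pixels across the bar.

    Assumes the green portion of the bar is proportional to the rank.
    """
    if not pixel_row:
        raise ValueError("empty pixel row")
    # Thresholds are guesses for "clearly green"; tune them for a real capture.
    green = sum(1 for r, g, b in pixel_row if g > 150 and r < 100 and b < 100)
    return round(max_rank * green / len(pixel_row))


# Synthetic row: 30 green pixels followed by 20 white ones.
row = [(0, 200, 0)] * 30 + [(255, 255, 255)] * 20
print(estimate_rank(row))
```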

I know this isn't what people had in mind, but anything else will probably get you a letter from Google's attorneys (if you don't stop when they ask you nicely).

Gee maybe if business slows down - I could open "Chris_R's PR Checking Service". For only $19.99 a month I will check 100 pages for you - give you their PR and reverse link count on a nice excel spreadsheet.

chris_f

5:49 pm on Jul 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone figured out how to calculate the checksum (ch) yet?

Chris.

Markus

7:28 pm on Jul 2, 2002 (gmt 0)

10+ Year Member



I don't want to run automated queries, but I don't like using IE with the toolbar either. So I've used a proxy and written the checksums down...

... until I saw that IE caches the XML files.
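If the cached XML files are sitting on disk, a script can sweep the cache folder for them. A minimal sketch, assuming the files contain the same U and RK elements shown earlier in the thread; where the cache folder actually lives is not stated here, so the demo builds a throwaway directory instead of pointing at a real one.

```python
import os
import re
import tempfile

RANK = re.compile(r"<RK>(\d+)</RK>")
PAGE = re.compile(r"<U>([^<]+)</U>")


def scan_cache(cache_dir):
    """Return {page_url: pagerank} for every file that has both <U> and <RK>."""
    found = {}
    for folder, _subdirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(folder, name)
            try:
                # latin-1 never fails to decode, so binary files are just skipped
                # by the regex checks below.
                with open(path, encoding="latin-1") as handle:
                    text = handle.read()
            except OSError:
                continue
            rank = RANK.search(text)
            page = PAGE.search(text)
            if rank and page:
                found[page.group(1)] = int(rank.group(1))
    return found


# Demo against a temporary directory standing in for the browser cache.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "cached.xml"), "w", encoding="latin-1") as fh:
        fh.write("<GSP><R><U>http://www.mydomain.org/nsearch.html</U>"
                 "<RK>6</RK></R></GSP>")
    print(scan_cache(demo_dir))
```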

"In fact there is no real protection in this toolbar you know."

:)

Doofus

9:43 pm on Jul 3, 2002 (gmt 0)



The checksum algorithm was changed by Google sometime in May 2002. It was consistent from December 2001 (or earlier) to May 2002, but then it changed.

It's not too surprising that the algo was changed. What's more surprising is that Google cleverly does not return an error message for PageRank queries coming in that use the obsolete checksum. Instead of an error message, you get bogus PageRank values. These values are typically plus or minus two complete digits on the 0-10 scale. Sites that were a 7 might be a 9. One site that was an 8 became a 10.

This is the famous Google sense of humor at work.

Since the toolbar is self-updating, the checksum algo can be made a moving target. Anyone who goes to all the trouble of decompiling and analyzing the algo still has to keep checking against the latest toolbar in Explorer to make sure the PR values coming back are not bogus due to a change on Google's end. Whatever clever program anyone writes after cracking the checksum algo will not be self-updating from Google, I presume.
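One way to do that ongoing check is to keep a few control pages whose PR you read by hand from the current toolbar, and compare them with what the script gets back. A minimal sketch: fetch_rank stands for whatever hypothetical function your script uses to get a value, and the control URLs and the fake "+2" fetcher are made up for the demo.

```python
def stale_controls(control_pages, fetch_rank, tolerance=0):
    """Compare script results with hand-read toolbar values.

    control_pages maps url -> PR read manually from the toolbar.
    Returns {url: (expected, got)} for every page that disagrees.
    """
    mismatches = {}
    for url, expected in control_pages.items():
        got = fetch_rank(url)
        if abs(got - expected) > tolerance:
            mismatches[url] = (expected, got)
    return mismatches


# Hypothetical control pages with PR values read by hand.
controls = {"http://a.example/": 7, "http://b.example/": 8}


def bogus_fetch(url):
    """Stand-in fetcher that is off by two, like the bogus values described."""
    return controls[url] + 2


print(stale_controls(controls, bogus_fetch))
```

Any non-empty result is a hint that the checksum method the script relies on has gone out of date.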

None of us likes using Explorer with the Google toolbar. But Google makes the rules, and Google finds ways to make us play by their rules.