Forum Moderators: coopster


Wikipedia URLs

         

dashrockstone

3:08 am on Feb 24, 2011 (gmt 0)

10+ Year Member



Hi, I need some assistance with a little issue that has been driving me crazy for a couple of days now.

I am creating a "six degrees of separation" game that involves links from one page to another on Wikipedia. The script I'm writing is to verify that the links actually exist in the pages that the contestants post as their answers. The problem is, if there are capital letters in the Wikipedia URL, it doesn't work unless the contestant posts the exact matching case.

It's a simple copy-the-URL-from-the-address-bar, paste-it-in-an-input-box, click-a-button-to-verify scenario. But not all browsers behave the same: some don't include the capital letters in the URL in the address bar, and the capital letters seem to be required for the script to find the page.

I've actually run the script against other URLs that aren't Wikipedia, mixing the case of the letters, and the case seems to make no difference at all.

Here's what I have:


<?php
//check whether the url exists or not and validate it
function check_url($url)
{
    $check = @fopen($url, "r"); // open the url with fopen
    if ($check) {
        $status = true;
    } else {
        $status = false;
    }
    return $status;
}

//the following url works perfectly, notice the capital letters.
$url = "http://en.wikipedia.org/wiki/George_Washington";

//however the following url comes back false, even though the
//resulting link is perfectly "clickable" and leads directly
//to the page that my script says doesn't exist.
//$url = "http://en.wikipedia.org/wiki/george_washington";

if (check_url($url)) {
    echo "<a href=\"$url\">$url</a> is a <b>valid</b> URL";
} else {
    echo "<a href=\"$url\">$url</a> is an <b>invalid</b> URL";
}
?>



Obviously there's much more to the game than this; once it's able to open the file, there's other code to check that the anchor exists, etc. This code, however, is sufficient to show what's happening.

I'm sure it's something simple that I'm overlooking, but after two days, I have given up.

Any help would be appreciated.

Thanks.


P.S. - I can't use any add-on classes or anything that would require installing any packages (CAKE, PEAR), client says no to "extra stuff".

rocknbil

5:22 pm on Feb 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How good are you at regexps, or str_replace? The obvious solution would be to inspect the input and capitalize every first letter. This will take some examination of the Wiki's methods; they may not capitalize articles and such (a, the, is, etc.).
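A quick sketch of that idea (an illustration only, not a complete solution — as discussed below, Wikipedia's actual capitalization rules are not this simple):

```php
<?php
// Sketch: uppercase the first letter of each underscore-separated word
// in a Wikipedia-style page title. Articles like "of" or "the" are often
// lowercase in real titles, so this is only a first approximation.
function guess_title_case($title)
{
    return implode('_', array_map('ucfirst', explode('_', $title)));
}

echo guess_title_case('george_washington'); // George_Washington
?>
```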

dashrockstone

5:49 pm on Feb 24, 2011 (gmt 0)

10+ Year Member



I've actually considered that as an option; however, when I started looking into how Wikipedia titles its articles, there seems to be no rhyme or reason as to how they do it. In some cases it's every first letter after an underscore; in others it's "willy-nilly".

An interesting one I ran into last night was the page for Joe Penna, aka Mystery Guitar Man. In the url in the address bar his last name is not capitalized, however, the script fails to find the page unless it is sent as "Joe_Penna", which I found even more odd.

That said, I don't think regexp is going to work unless I figure out a way to try about a hundred different possibilities within a few seconds.

coopster

7:10 pm on Feb 24, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Welcome to WebmasterWorld, dashrockstone.

My first suspicion was that Wikipedia is using a 301 redirect based on a capitalization mapping. I was correct, as I verified with Live HTTP Headers:
http://en.wikipedia.org/wiki/george_washington 
GET /wiki/george_washington HTTP/1.1
Host: en.wikipedia.org
.
.
.
HTTP/1.0 301 Moved Permanently
Date: Thu, 24 Feb 2011 16:55:40 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Vary: Accept-Encoding,Cookie
Last-Modified: Thu, 24 Feb 2011 16:55:40 GMT
Location: http://en.wikipedia.org/wiki/George_washington
.
.
.

So if you can follow the redirect first using cURL, you should be able to retrieve the canonical URL.
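A minimal sketch of that approach, assuming the php-curl extension is available (the function name here is made up for illustration; the CURLOPT_* constants are standard cURL options):

```php
<?php
// Sketch: follow redirects with cURL and return the final (canonical) URL.
// Assumes the php-curl extension; network access is needed when called.
function resolve_canonical_url($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // safety limit
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request, body not needed
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return false;
    }
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $final;
}

// e.g. resolve_canonical_url('http://en.wikipedia.org/wiki/george_washington')
// should return the redirected, properly capitalized URL (or false on failure).
?>
```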

dashrockstone

8:23 pm on Feb 24, 2011 (gmt 0)

10+ Year Member



Thanks, coopster. I'll run it past the boss and see if he's willing to install another library; his initial reaction to these things is usually filled with disdain.

I've never used cURL myself, so it will be something new for me, but I'm always willing to give it a try.


Thanks again.

astupidname

12:47 pm on Feb 25, 2011 (gmt 0)

10+ Year Member



...(edit)never mind...

astupidname

1:54 pm on Feb 25, 2011 (gmt 0)

10+ Year Member



Instead of fopen() try using file_get_contents($url) :
<?php

//$url = "http://en.wikipedia.org/wiki/George_Washington"; //correct character case url
$url="http://en.wikipedia.org/wiki/george_washington";

$f = @file_get_contents($url);
if ($f !== false) {
echo $f;
}

?>


dashrockstone

7:55 pm on Feb 26, 2011 (gmt 0)

10+ Year Member



"Instead of fopen() try using file_get_contents($url) : "

Actually, I did try that; same result. The problem is, the URL without the proper capitalization doesn't exist.

It works with the cURL library: find the redirect name, then read the file with the proper capitalization and find the item being sought. Now I just have to convince the client that he has to install the cURL library. :)

Thanks anyway though.

astupidname

3:48 am on Feb 27, 2011 (gmt 0)

10+ Year Member



Actually, since posting that I tried a few more times, and it does work with file_get_contents; it does access the correct George_Washington page when trying george_washington first, but only some of the time. It's very strange that it only works on occasion. Actually, it's strange it works at all; I assume that file_get_contents will follow 301s if received. But I had found earlier that Wikipedia requires some headers, such as User-Agent, which I'm not convinced the filesystem functions send. It seems Wikipedia may be set up to detect script access by checking the headers sent in a request: they issue a 403 if they are not convinced you are for real or can't find a matching page, and a 301 if they think you are for real and have a page which matches in a different character case. A little spoofing via the context parameter for some of the filesystem functions may be better:

<?php

//$url = 'http://en.wikipedia.org/wiki/George_Washington'; //correct character case url
$url = 'http://en.wikipedia.org/wiki/george_washington';

$headers = array(
    'User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)',
    'Accept: text/plain,text/html;q=0.9,*/*;q=0.8',
    'Accept-Language: en-us,en;q=0.5',
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive: 115',
    'Connection: keep-alive',
    'Cache-Control: max-age=0'
);

$opts = array(
    'http' => array(
        'header' => implode("\r\n", $headers)
    )
);

$context = stream_context_create($opts);

$f = @file_get_contents($url, false, $context);
//$f = @file_get_contents($url);

if ($f !== false) {
    echo $f;
} else {
    echo '<pre>';
    print_r($http_response_header);
    echo '</pre>';
}

?>

penders

10:56 am on Feb 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...I assume that file_get_contents will follow 301's if received.


Yes, I think so, although I believe that's a feature of the URL open wrappers, not just file_get_contents(). But then it should work with fopen() as well, shouldn't it? Is there a limit to how many times it redirects (as you can set with cURL)?

See last comment on the Wrappers page...
[uk2.php.net...]

HTTP Wrappers
[uk2.php.net...]

astupidname

1:18 pm on Feb 27, 2011 (gmt 0)

10+ Year Member



Yes, I think so. Although I believe a feature of URL open wrappers, not just file_get_contents()? But then it should work with fopen() as well? Or should it?

Yeah, it does follow 301 redirects; I just confirmed this (see the example below). I suspect it would work with fopen as well (if you utilize the context parameter as I have done with file_get_contents), but I haven't bothered to check. We don't need the handle to the file in this case, just the contents, please. :)
Incidentally, using the context parameter you could actually set it to not follow redirects, by setting the 'http' context option 'follow_location' to false.
See http context options [us2.php.net]
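A minimal sketch of those two documented 'http' context options (the values chosen here are only examples):

```php
<?php
// Sketch: build an HTTP stream context that caps the redirect chain and
// could disable following redirects entirely. 'follow_location' and
// 'max_redirects' are documented 'http' context options; the values
// here are just examples.
$opts = array(
    'http' => array(
        'follow_location' => 1, // set to 0 to stop following redirects
        'max_redirects'   => 5, // default is 20
    )
);
$context = stream_context_create($opts);

// Inspect what the context actually holds:
$set = stream_context_get_options($context);
echo $set['http']['max_redirects']; // 5
?>
```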

Is there a limit to how many times it redirects (as you can set with cURL)?

In the link I posted above it shows the 'http' context options have a 'max_redirects' parameter which defaults to 20 and is changeable.

Here is an example which, using the context parameter, gives you a sort of "play by play" of what's going on. The odd thing is it will tell you the filesize is 408939, but the final "Made some progress" message says "downloaded 817878 so far", which is exactly double the file size. I don't understand that, but oh well...
Also, if you do not pass $opts into the call to stream_context_create, so those headers are not sent, then Wikipedia gives you a 403 instead. All of those headers may not be needed (this came from a plug-in for something else I had); probably just the User-Agent is required, but I have not tried without the others. The new example:

<?php

if (!defined('PHP_VERSION_ID')) {
    $version = explode('.', PHP_VERSION);
    define('PHP_VERSION_ID', ($version[0] * 10000 + $version[1] * 100 + $version[2]));
}

function stream_notification_callback($notification_code, $severity, $message, $message_code, $bytes_transferred, $bytes_max) {
    switch ($notification_code) {
        case STREAM_NOTIFY_RESOLVE:
        case STREAM_NOTIFY_AUTH_REQUIRED:
        case STREAM_NOTIFY_COMPLETED:
        case STREAM_NOTIFY_FAILURE:
        case STREAM_NOTIFY_AUTH_RESULT:
            var_dump($notification_code, $severity, $message, $message_code, $bytes_transferred, $bytes_max);
            /* Ignore */
            break;
        case STREAM_NOTIFY_REDIRECTED:
            echo "Being redirected to: ", $message;
            break;
        case STREAM_NOTIFY_CONNECT:
            echo "Connected...";
            break;
        case STREAM_NOTIFY_FILE_SIZE_IS:
            echo "Got the filesize: ", $bytes_max;
            break;
        case STREAM_NOTIFY_MIME_TYPE_IS:
            echo "Found the mime-type: ", $message;
            break;
        case STREAM_NOTIFY_PROGRESS:
            echo "Made some progress, downloaded ", $bytes_transferred, " so far";
            break;
    }
    echo "\n";
}

//$url = "http://en.wikipedia.org/wiki/George_Washington"; //correct character case url
$url = "http://en.wikipedia.org/wiki/george_washington";

$headers = array(
    'User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)',
    'Accept: text/plain,text/html;q=0.9,*/*;q=0.8',
    'Accept-Language: en-us,en;q=0.5',
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive: 115',
    'Connection: keep-alive',
    'Cache-Control: max-age=0'
);

$opts = array(
    'http' => array(
        'header' => implode("\r\n", $headers)
    )
);

$context = stream_context_create($opts);

if (PHP_VERSION_ID >= 50200) { //support for the "notification" option callback started with PHP 5.2
    stream_context_set_params($context, array("notification" => "stream_notification_callback"));
} else {
    echo "Not able to track stream progress, php version not >= 5.2\r\n";
}

$f = @file_get_contents($url, false, $context);

if ($f !== false) {
    echo '<p>File size: '.strlen($f).'</p>';
    echo '<pre>'.htmlentities($f).'</pre>';
} else {
    echo '<pre>';
    print_r($http_response_header);
    echo '</pre>';
}

?>

dashrockstone

7:04 am on Mar 1, 2011 (gmt 0)

10+ Year Member



I think my issue when using file_get_contents was due to not setting the "user agent". That's why I was getting told it was forbidden.

Great code, astupidname. Thanks! It works great and does exactly what I was trying to do with very little modification. :)

coopster

3:48 pm on Mar 1, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Odd thing is it will tell you the filesize is 408939 but the final "Made some progress" message will be "downloaded 817878 so far" which would be actually double the file size. I don't understand that, but oh well...


Not for me ... File size: 408854. Can you reproduce that every time? Apache or IIS? Using your browser instead, what happens in your browser with LiveHttpHeaders? At this point I'm just curious what could possibly be the cause ;-)

BTW, both of the following urls return the same resource with a 200 OK status:
http://en.wikipedia.org/wiki/George_washington

http://en.wikipedia.org/wiki/George_Washington

It's just that the first one has an extra <link> element ...
<link rel="canonical" href="/wiki/George_Washington" />
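That canonical path could be pulled out of the response with a quick regex (a brittle sketch; a regex over HTML breaks easily, and the attribute order is assumed to match the markup shown above):

```php
<?php
// Sketch: extract the canonical path from a <link rel="canonical"> element.
// Assumes the attribute order shown above (rel before href); a real HTML
// parser would be more robust.
function extract_canonical($html)
{
    if (preg_match('#<link rel="canonical" href="([^"]+)"#', $html, $m)) {
        return $m[1];
    }
    return false;
}

$html = '<link rel="canonical" href="/wiki/George_Washington" />';
echo extract_canonical($html); // /wiki/George_Washington
?>
```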

Playing around a bit more, this one returns a 404 Not Found:
http://en.wikipedia.org/wiki/GEORGE_washington

dashrockstone

5:03 pm on Mar 1, 2011 (gmt 0)

10+ Year Member



Playing around a bit more, this one returns a 404 Not Found:
"http://en.wikipedia.org/wiki/GEORGE_washington"


If you copy and paste that into your address bar it comes up with "Wikipedia does not have an article with this exact name. Please search for GEORGE washington in Wikipedia to check for alternative titles or spellings."

So apparently Wikipedia didn't think of all caps as a possibility when someone was searching.

I took it even a bit further and used "http://en.wikipedia.org/wiki/GEORGE_WASHINGTON", which came up with "This page has been deleted. The deletion and move log for the page are provided below for reference." when pasted directly into the address bar.

Interesting, to say the least. It doesn't seem like it would be that difficult for a script on their end to catch these as easily as it catches "http://en.wikipedia.org/wiki/GEORGE_washington" and redirect from there.

astupidname

12:06 pm on Mar 2, 2011 (gmt 0)

10+ Year Member



Not for me ... File size: 408854. Can you reproduce that every time?

Yes, every time for me... I'm only testing on a local WAMP server, not on a remote server, so I don't know if that has anything to do with it (doubtful) or not.

@dashrockstone, (you're welcome!).

@coopster, Live HTTP Headers shows me Content-Length: 79787 when I attempt to access george_washington and am 301'd to George_washington.
The script is actually now showing me (Wikipedia must have made some changes to the page; the size is now smaller): "Made some progress, downloaded 710620 so far" as the last message, and "File size: 355310" from the echo '<p>File size: '.strlen($f).'</p>'; line.
So I still don't get it; none of the numbers seem to add up.
Eh, maybe one day I'll upload it to a live server and check it there; I don't really care that much right now.

I don't get why Wikipedia disrespects him with its failure to capitalize his last name. Are they too lazy? (Who am I to speak!) As if they capitalized the first name and said, "Yeah, good enough, I'm too tired now to properly capitalize a president's last name." That extra shift might have broken a pinky. :)