Check if URL Exists

         

dcool86

12:46 am on Sep 12, 2011 (gmt 0)

10+ Year Member



I'm writing a script that needs to check whether a tumblr.com URL is valid. What I've tried so far doesn't seem to work.


$url = "test.tumblr.com";
$file_headers = @get_headers($file);
if($file_headers[0] == 'HTTP/1.1 404 Not Found') {
echo "URL Not Valid";
}
else {
echo "URL Valid";
}

The scripts I have found on Google always report the page as valid.
If I change the URL to

$url = "notavaildurl.tumblr.com";


I need it to say "URL Not Valid". I know there has to be a way, since other sites check for valid tumblrs. Thanks.

g1smd

1:12 am on Sep 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



By testing for "not 404", URLs returning 301, 403, 503, even 500, will be reported as valid.

Surely only those returning "200 OK" (and maybe 301) should be classed as valid.

lucy24

2:06 am on Sep 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For testing purposes, I'd add 304. "Yeah, yeah, it was here ten seconds ago and it hasn't changed."

What about 302?

dcool86

2:27 am on Sep 12, 2011 (gmt 0)

10+ Year Member



I was reading somewhere that tumblr returns something different for an invalid URL than for a valid one.

lucy24

4:18 am on Sep 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, you can easily test that using an extension such as Live Headers. I tried it and, right off the bat, got

GET / HTTP/1.1
Host: www.tumblr.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:6.0) Gecko/20100101 Firefox/6.0
... {et cetera}

HTTP/1.1 302 Found
Date: Mon, 12 Sep 2011 03:59:24 GMT
Server: Apache
P3P: CP="ALL ADM DEV PSAi COM OUR OTRo STP IND ONL"
Location: https://www.tumblr.com/


The 302 seems to be a reference to the http-to-https redirect. Ahem.

https://www.tumblr.com/why/_reasons

GET /why/_reasons HTTP/1.1
Host: www.tumblr.com


This is a little bit interesting, because I certainly didn't click on any such link, and the https page didn't re-redirect me.

Picking a link at random I clicked About, leading to a further 302 as I was redirected from https:// back to http:// (Conclusion: yeah, think you'd better include 302 in your options!)

And then I typed in /garbage on the assumption that they have no such directory, since they're not a big-city Municipal Services site. As one would hope, that led to a 404.

Along the way, there was a colossal number of calls to subdomain assets.tumblr.com; this is apparently where they keep their images, style sheets and so on.

Wonder what the recurring 642 query string in display-related requests is? I keep my browser windows pretty narrow, but not that narrow!

Sorry. What was the question again?

penders

7:06 am on Sep 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



$url = "test.tumblr.com"; 
$file_headers = @get_headers($file);


What is $file?

The URL you pass to get_headers() should be a full URL... "http://..."

print_r($file_headers)
to see what you are actually getting.
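
As a rough illustration of the point about passing a full URL, here is a minimal sketch. The helper names with_scheme() and fetch_headers() are hypothetical (they are not from the original post), and defaulting to http:// is an assumption. get_headers() returns false on failure, so the sketch maps that to an empty array.

```php
<?php
// Hypothetical helper: prepend a scheme when only a bare host name was given.
function with_scheme(string $url, string $scheme = 'http'): string
{
    // Leave the URL alone if it already starts with http:// or https://
    if (preg_match('#^https?://#i', $url) === 1) {
        return $url;
    }
    return $scheme . '://' . $url;
}

// Hypothetical helper: fetch the response headers for a URL.
// Returns an empty array when get_headers() fails (it returns false then).
function fetch_headers(string $url): array
{
    $headers = @get_headers(with_scheme($url));
    return $headers === false ? [] : $headers;
}
```

Calling print_r(fetch_headers('test.tumblr.com')) should then show either the header array or an empty array, rather than nothing at all.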

dcool86

10:28 pm on Sep 13, 2011 (gmt 0)

10+ Year Member



Penders, I copied and pasted that part off the web and forgot to change it to match what I was using. It should be $url.

I can't even get it to print anything out.

penders

11:45 pm on Sep 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



print_r(get_headers('http://webmasterworld.tumblr.com/'));


Array 
(
[0] => HTTP/1.1 200 OK
[1] => P3P: CP="ALL ADM DEV PSAi COM OUR OTRo STP IND ONL"
[2] => X-Tumblr-User: webmasterworld
[3] => Link: ; rel=icon
[4] => Vary: Accept-Encoding
[5] => X-Tumblr-Usec: D=99364
[6] => Content-Type: text/html; charset=UTF-8
[7] => Content-Length: 13378
[8] => Date: Tue, 13 Sep 2011 23:33:29 GMT
[9] => Connection: close
)


print_r(get_headers('http://notavaildurl.tumblr.com/'));


Array 
(
[0] => HTTP/1.1 404 Not Found
[1] => P3P: CP="ALL ADM DEV PSAi COM OUR OTRo STP IND ONL"
[2] => Cache-Control: max-age=300
[3] => Vary: Accept-Encoding
[4] => X-Tumblr-Usec: D=25702
[5] => Content-Type: text/html; charset=UTF-8
[6] => Content-Length: 1870
[7] => Date: Tue, 13 Sep 2011 23:38:22 GMT
[8] => Connection: close
)

dcool86

12:24 am on Sep 14, 2011 (gmt 0)

10+ Year Member



I get it now. My question is: how can I get just the status code (e.g. 200) from array element 0?


Thanks

penders

8:10 am on Sep 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you just want to check for 200, then something similar to your initial code would suffice. If you want to check for multiple codes (as suggested), then perhaps a regex with preg_match()?
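
One possible shape for the preg_match() idea, as a sketch only: the function name status_code() is hypothetical, and the pattern assumes a status line of the form shown earlier in the thread ("HTTP/1.1 200 OK"). It pulls out the three-digit code as an integer, or returns null if the line does not look like a status line.

```php
<?php
// Hypothetical helper: extract the numeric status code from a status line
// such as "HTTP/1.1 200 OK". Returns null if the line does not match.
function status_code(string $statusLine): ?int
{
    // Match "HTTP/<version>", whitespace, then exactly three digits.
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}
```

With the integer in hand, a check for several acceptable codes could be as simple as in_array(status_code($file_headers[0]), [200, 301], true); which codes count as "valid" is a judgment call, as discussed above.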

dcool86

8:06 pm on Sep 14, 2011 (gmt 0)

10+ Year Member



I finally got it to echo out what I want it to.


Thanks.

lucy24

10:04 pm on Sep 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your original code was fine; you just got sidetracked because you were only looking for 404. The simplest division is

OK Response [23]\d\d
Not-OK Response [45]\d\d

One of them goes in the IF-- with appropriate pipes or Regular Expressions or whatever you use-- and the other becomes the ELSE. Which one goes first (the IF branch) may depend mainly on your coding style. Either the one that gets more "hits"-- probably the OK Responses, unless you're doing some hanky-panky with random URLs-- or the one that leads to the more complicated code (only you know this part).
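
The [23]\d\d versus [45]\d\d split could be sketched roughly as follows. The function names are hypothetical. One assumption worth flagging: as far as I know, get_headers() follows redirects by default and lists each response's headers in turn, so this sketch looks at the last status line in the array rather than element 0.

```php
<?php
// Hypothetical helper: find the last "HTTP/..." status line in a
// get_headers()-style array (there may be several if redirects were followed).
function last_status_line(array $headers): ?string
{
    $last = null;
    foreach ($headers as $header) {
        if (is_string($header) && strncmp($header, 'HTTP/', 5) === 0) {
            $last = $header;
        }
    }
    return $last;
}

// Hypothetical helper: true for 2xx/3xx ("OK Response"), false for
// 4xx/5xx, any other code, or when no status line is present.
function is_ok_response(array $headers): bool
{
    $line = last_status_line($headers);
    return $line !== null
        && preg_match('#^HTTP/\S+\s+[23]\d\d\b#', $line) === 1;
}
```

Putting the OK case in the IF, the original script would become something like: if (is_ok_response($file_headers)) { echo "URL Valid"; } else { echo "URL Not Valid"; }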