I am thinking of writing my own cloaking script and have put some pseudocode below. Am I on the right track?
I only wish to cloak for Google.
if (user agent = googlebot) AND (IP = IP from our database)
    {show optimised html}
else
    {show viewer html}
The above code is for cloaking on a single domain, which is risky. But of course it can be adjusted for cloaking on a redirecting domain:
if (user agent = googlebot) AND (IP = IP from our database)
    {show optimised html}
else
    {redirect to primary url}
I was thinking of using the Google IPs listed at
[iplists.com...]
Is this a good choice?
Can any of the experts comment on the {redirect to primary url} part?
I kind of thought the right way was to fetch the page server-side via LWP or cURL. That way your cloaking domain stays in the address bar. If you just redirect to the target page, won't you set off a few red flags for competitors snooping around?
Now let's say, for argument's sake, that I was cloaking for widget.com.
If I fetch widget.com server side won't most of the image and href links be broken because they're relative links?
I would imagine you would need to change all links to absolute...correct?
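For what it's worth, here is a rough Python sketch of the relative-to-absolute rewrite being asked about. It is only an illustration: the function name absolutize_links, the regular expression and the sample markup are my own assumptions, and a regex like this will miss things such as unquoted attributes or URLs inside CSS and scripts.

```python
import re
from urllib.parse import urljoin

# Matches quoted href="..." and src="..." attributes (hypothetical, deliberately simple).
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))\s*=\s*(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)

def absolutize_links(html, base_url):
    """Rewrite relative href/src values so they resolve against base_url."""
    def fix(match):
        # urljoin leaves already-absolute URLs untouched.
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{absolute}{quote}'
    return ATTR_RE.sub(fix, html)

sample = '<a href="/about.html">About</a> <img src="images/logo.gif">'
print(absolutize_links(sample, "http://www.widget.com/products/"))
```

Root-relative links resolve against the host, while plain relative links resolve against the directory of the fetched page, so the base URL you pass in needs to be the URL of the page you actually fetched.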
Does anyone know how regularly search engines change their IP addresses? I guess it's a trade-off between using the free list and risking losing some domains, or paying for a bang up-to-date list.
The iplists one says it was last updated around 3 months ago which makes me slightly wary of using it.
You can always build a spider trap that looks for new IPs for a specific user agent. Have the script email you a warning so you can determine whether it is a new IP for a spider or someone spoofing the user agent.
//waiting to hear some more about the freshness of iplists in 3,2,1... :)
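A minimal sketch of that check in Python, under some assumptions of mine: the function name check_visitor, the "googlebot" token and the sample addresses (taken from the reserved documentation ranges, not real spider IPs) are all hypothetical. The emailing step is left out; the function just returns the warning text for whatever mailer you use.

```python
def check_visitor(user_agent, ip, known_ips, ua_token="googlebot"):
    """Return a warning if the user agent claims to be the spider
    but the IP is not in our list; otherwise return None."""
    if ua_token not in user_agent.lower():
        return None  # not claiming to be the spider, nothing to report
    if ip in known_ips:
        return None  # known spider IP
    return (
        f"Unlisted IP {ip} sent user agent {user_agent!r}: "
        "new spider IP or spoofed user agent?"
    )

# Hypothetical list; in practice this would be loaded from your IP database.
known_ips = {"192.0.2.10", "192.0.2.11"}

warning = check_visitor("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.7", known_ips)
if warning:
    print(warning)  # hand this to your mail routine instead of printing
```

The script only flags the hit. Deciding whether the address really belongs to the spider is still a manual step, as described above.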
If you're going to fetch a remote page...
if (cached version exists AND its age is less than x days)
    show cached page
else
    GET fresh version
    cache it
    display it
end if
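Here is roughly how that pseudocode might look in Python, as a sketch only. The names get_page, fetch_url, cache_path and max_age_days are my own, and the cache is a single file whose modification time stands in for the age of the cached copy.

```python
import os
import time
import urllib.request

def fetch_url(url):
    """GET the remote page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

def get_page(url, cache_path, max_age_days=1.0, fetch=fetch_url):
    """Return the cached copy if it is younger than max_age_days,
    otherwise fetch a fresh copy, cache it and return it."""
    max_age_seconds = max_age_days * 86400
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, "rb") as f:
                return f.read()  # show cached page
    body = fetch(url)            # GET fresh version
    with open(cache_path, "wb") as f:
        f.write(body)            # cache it
    return body                  # caller displays it
```

The fetch argument is only there so the download step can be swapped out, for example to test the cache logic without touching the network.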
On a more general theme, I wonder about these technologies and copyright. In this instance it is fine, because we are connecting to our own material. But what happens if one connects to someone else's material in this way (for instance the BBC website)? In a sense, your server is only acting as a proxy between the viewer and the BBC website. What are the legal issues here? Can anyone post any resources on this? I have come across cases of hotlinking etc. and know that this is legally hot water, but this issue seems more complex to me.
On a second point: RE "This way your cloaking domain stays in the address bar."
When the viewer looks at the cloaking domain source code - can they see the url of the primary domain at all? Is there any way for them to find the url of the primary domain?
If there is a base href tag, then it will have the "primary domain" in it.
Regarding the freshness of iplists.com (which is my site), a closer look will reveal that it was last updated yesterday. IP addresses are added (or removed) as they are discovered. Some of the lists are quite old. They haven't had any new IP addresses added because the spiders continue to use old IP addresses. However, if it will make everybody feel better, I can re-upload the files and make them all appear new :)
Just for the record, the fantomaster IP list is also very good.
About using cURL/LWP, could you not just use the PHP function file_get_contents or file to read the HTML page and print it to the screen?