|Cloaking white papers|
this thread could die with one post...
I was standing in the shower this morning and I decided that I need to "get" cloaking.
Because right now, I just don't get it.
I understand the basic concept, but I don't understand what happens to make it work.
Does anyone have a "white" paper on cloaking? I want something deeper than just the advertisement of how it works. I need technical type information, and I honestly don't know where to start.
Does a white paper on cloaking even exist?
Your browser connects to a website by sending a request.
That request is a few lines of text formatted according to the HTTP protocol.
GET / HTTP/1.0
Host: www.webmasterworld.com
User-Agent: Mozilla/4.71 (Windows 98;US) Opera 3.62 [en]

is a request for the root index page. Those lines are called HTTP request "headers".
When webserver software like Apache receives the request, it looks up in its configuration how the request should be processed. It looks at the host name and determines that the home files for www.webmasterworld.com are actually stored in /home/webmasterworld/. It then looks at the GET portion and sees that it needs "index.html". So, it fetches /home/webmasterworld/index.html and serves it back to the user.
When Apache gets the index.html file from disk, it looks at it and determines what it should do with the file. Sometimes the file includes SSI (server side include) directives; if it finds those, Apache processes them. Other times a request may be for "index.cgi", and Apache "sees" that .cgi files are to be executed as scripts. When Apache executes our script, we now have control over what is sent back to the user.
With that as a backdrop, this is where we start with the real cloaking process.
As part of processing the request, Apache sets "environment" variables that match those "header" values above. So our script can see the "User-Agent" string and "REMOTE_ADDR", which is the user's IP address. With those two bits of information, we know what browser (or bot) the visitor claims to be and what IP address they are coming from (we know who they are at this point).
With that info, we can custom build a page for the user.
If the user is a search engine, we want to give it our best most optimized stuff. If it is a user, we want to give it a pretty page that is tricked out for navigation and usability.
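To make that concrete, here is a minimal Python CGI sketch of the decision a cloaking script makes. Everything in it is illustrative: the IP address and filenames are made up, and the spider list is a stand-in for the constantly maintained list discussed later in this thread.

```python
#!/usr/bin/env python3
# Minimal sketch of a cloaking decision. The spider data here is
# illustrative only, not a real bot list.
import os

SPIDER_IPS = {"216.239.46.1"}            # hypothetical spider address
SPIDER_AGENTS = ("googlebot", "slurp")   # substrings to look for in the UA

def is_spider(ip, user_agent):
    """True if the request looks like it came from a search engine spider."""
    ua = (user_agent or "").lower()
    return ip in SPIDER_IPS or any(name in ua for name in SPIDER_AGENTS)

# Apache exposes the header values to CGI scripts as environment variables.
ip = os.environ.get("REMOTE_ADDR", "")
ua = os.environ.get("HTTP_USER_AGENT", "")

page = "optimized.html" if is_spider(ip, ua) else "pretty.html"

# A real script would open and print the chosen file here.
print("Content-Type: text/html\n")
print("<!-- would serve %s -->" % page)
```

The whole trick is in those two environment variables: everything else is ordinary page serving.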
Why Cloak? There are so many things that go into a well optimized page these days that it is appropriate to protect your investment. Often, when you get a high ranking page under quality keywords, the first thing that happens is your page gets stolen. Often that page is stolen just to put up a duplicate somewhere and reduce your rankings. Yes, that is what happens with many engines: they see the duplicate, think "duplicate", and mysteriously your page disappears in the next update.
On still other engines you will see your competition suddenly bump up against your high ranking page - they stole all your best ideas. Often those ideas are as simple as keyword density or keyword location and frequency on the page - it doesn't have to be a duplicate.
And that's the how and why we cloak.
Also posted here [searchengineworld.com].
>it looks up in its configuration file for how the request should be processed.
That's interesting. From a first read, it sounds like you would cloak by tweaking the config file. If the config file holds the instructions for how the request should be processed, and if the server is looking at the header to determine what should be shown, one would think it would also, at this point, check whether the IP matches that of a spider and then hand out the optimized page.
<thinking on paper>
But then sometimes I read fast and miss something. Maybe the something I am missing is that I don't know how a config file is put together. If I look at how a config file is put together, I should answer my own question...maybe..??
</thinking on paper>
Most scripts complement the configuration file and do not require that you alter it, except for things like changing the file extension for server parsing or enabling CGI. Most of these things are done through .htaccess, unless you run your own server and have access to httpd.conf (on Apache anyway), in which case you have some additional options.
The most basic concept to get down regarding cloaking is this: while a person requests a page on your web server by typing something like www.somedomain.com/path1/path2/page.html into a browser, the webserver actually finds the page by combining the document root specified in the server's virtual host container with the path portion of the URL.
So in the example above, assuming the path to your web root is /www/htdocs/public_html, the server would look for the page in /www/htdocs/public_html/path1/path2/page.html. A script or program could also read and write directly to that page using such a path.
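That mapping is simple enough to sketch in a few lines of Python. The document root below is the one from the post; the normalization step is just an assumed bit of hygiene to keep "../" tricks from escaping the root.

```python
# Sketch of how a server maps a requested URL path onto the filesystem.
import posixpath

DOCUMENT_ROOT = "/www/htdocs/public_html"  # from the virtual host config

def filesystem_path(url_path):
    """Map a requested URL path onto the server's filesystem."""
    # Normalize to collapse any "../" segments, then join to the root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    return DOCUMENT_ROOT + clean

print(filesystem_path("/path1/path2/page.html"))
# -> /www/htdocs/public_html/path1/path2/page.html
```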
The trick is to get the server to fetch the script instead of the page that is requested. Usually this is done via SSI. The script then takes over, does all of its checking, and decides which page to fetch. It communicates with the browser by outputting the appropriate response headers for text/html, then reads or generates the appropriate content for that request and outputs it.
The browser (or spider) never knows the difference, it happily displays whatever is fed to it, in the case of a spider, it happily gobbles up the content fed to it.
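For reference, the SSI hand-off described above can be wired up with a couple of .htaccess lines on Apache 1.3-style servers (this is a sketch; file names are illustrative and your host must permit these directives):

```apache
# .htaccess -- have Apache parse .html files for SSI directives
AddHandler server-parsed .html
Options +Includes
```

The requested page then contains little more than a single directive such as `<!--#include virtual="/cgi-bin/cloak.cgi" -->`, and the script's output becomes the page.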
Please forgive me for making what is probably a stupid observation. It's only out of ignorance and not that I'm a wise-ass or anything...
I recently (couple months ago) discovered the concept of cloaking and was thrilled with the whole idea. I found a couple of suitable scripts with long laundry lists of spiderbot addresses. I created my SEO page and my user-pretty page and submitted the switch (?) page to a few search engines.
Within a couple weeks, several spiderbots visited my site, but not one of them matched anything in the lists that I used. So of course they all visited the user-pretty page.
After further investigation I find that this is a constant war waged by the search engines to come up with new addresses to foil the webmasters out there that are attempting to cloak. And from what I see in the "Search Engine Spider Identification" forum, this is a moment-by-moment moving target.
So I finally asked myself: what is the point? As I understand it, some search engines will punish you if they discover that you're trying to cloak. So to successfully use this technology, you've got to keep a close eye on that other forum and add new spiderbot addresses the moment they become known.
I suppose if I was a billion-dollar company, I could hire some nerd to sit and watch things and keep that file updated, but I'm 50% of my website company and I've simply got too many other hats to wear to put that much attention to this.
Have I totally missed the concept here?
Right on, Pat! You've detailed the downside of cloaking: it's a lot of work! There's no easy way to get top rankings. Doorways are always at risk of being dumped as duplicate pages, and cloaking has its problems, as you so eloquently described.
The key is "top rankings." Either technique is worth the efforts and risks for the returns a top ranking can provide.
I could hire some nerd to sit and watch things and keep that file updated...
LOL, I guess that would be me.
Has anyone gotten cloaked pages dropped from Google lately? Or pages with the "no cache" tags dropped?
>The trick is to get the server to fetch the script instead of the page that is requested.
That's the part I didn't understand! Thank you thank you thank you.
And Pat, you aren't a wise*ss, you are speaking from your soul. I appreciate that in a person, because people cut all the bs and say what they really mean when they speak from their soul. *insert blues riff here*
Cloaking is an incredible amount of work, and because of that, I don't know if I will ever do it.
I want to understand it to know what I am up against. I also think that knowing how to cloak (or rather, how to make cloaked pages that rank well) is a way to learn the algo of a particular search engine. And that is something we all want to know.
>>I could hire some nerd to sit and watch things and keep that file updated.
Yeah, especially with Google changing their IPs with every new crawl schedule. Nowadays, you really need to have a cellphone and laptop on hand at ALL times (even in the bathroom). So when the server detects a new bot version you can receive an instant message, get online, and update the file immediately.
how do you manage all of the sites/pages/IP's, etc??? do you have a doorpage creator making the pages for you (as per your specs)?
There isn't much cutting edge going on. Really it is about logs. I have on-the-fly stats where I can look at requests as they come in; there isn't anything special there, it's just a modified ASX script. I use one script per server and have all the domains write to a single log. I also log all traffic that comes in without a referrer. These get broken down by day and screened against my "known spider IP" list. I used to log all non-Mozillas coming in, but at this time I think it is a waste of space.
I dump data from multiple domains into one log because there is a lot to be learned by watching an IP/bot/UA crawling pattern.
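The screening step described above is straightforward to sketch. The log format here is an assumption (tab-separated IP, referrer, user agent), and the spider entry is illustrative, not real bot data:

```python
# Sketch: pull no-referrer requests out of a combined log and flag
# any that come from IPs not on the known-spider list.

KNOWN_SPIDERS = {"216.239.46.1"}  # illustrative entry, not real bot data

def screen(lines, known=KNOWN_SPIDERS):
    """Yield (ip, user_agent) for no-referrer hits from unknown IPs."""
    for line in lines:
        ip, referrer, user_agent = line.rstrip("\n").split("\t")
        if referrer == "-" and ip not in known:
            yield ip, user_agent

log = [
    "216.239.46.1\t-\tGooglebot/2.1\n",              # known spider: ignored
    "10.1.2.3\t-\tMozilla/4.0 (compatible)\n",       # unknown, no referrer: flagged
    "10.9.8.7\thttp://example.com/\tMozilla/4.0\n",  # has a referrer: ignored
]
print(list(screen(log)))
```

Anything the screen flags is either a surfer with referrer logging disabled or a bot you have not identified yet, which is exactly what you want a human to eyeball.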
do you have a doorpage creator making the pages for you (as per your specs)?
Yeah, it is home-grown stuff.
so what goes on your log, just requests from IPs not already on your list with no referrer? Some bots give a referring URL, especially if they are doing a crawl.
Can you give an example? I have never seen a spider with the referrer field not empty...
Yeah, basically. Well, I do log everything in about three different logs, but the one I keep my eyes on is the no-referrer traffic that does not match up.
Volatilegx, there are plenty of bots that *do* send a referrer, but they aren't SE bots. They are mostly scavengers, looking to harvest something from your website. They send a referrer to blend in and look like a surfer. By the way, an easy way to spot them is to check whether they skip requesting an external file or a frameset page that a real browser would fetch. If you know otherwise, please post your info.
What kind of mods to ASX little?
Really just the log making and rotating, plus keyword stripping; not much.
>The trick is to get the server to fetch the script instead of the page that is requested.
Okay, I guess I'm taking a different approach. All of my normal pages end with a .htm extension. This makes it easy to use the straight cloak script as the "page" I submit to the SEs. I just rename "cloak.cgi", which basically determines spider or visitor and serves the appropriate page, to "whatever.html". Then I edit the .htaccess file to have all .html files executed as CGI scripts.
This seems easier to me than messing with SSI calls, etc. But, I'm just starting this stuff.
Is this method acceptable? Anyone see problems with doing it his way?
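For what it's worth, the .htaccess change described above can be as small as this (assuming your host allows ExecCGI and handler overrides in .htaccess; directive names are standard Apache):

```apache
# .htaccess -- execute .html files as CGI scripts
Options +ExecCGI
AddHandler cgi-script .html
```

With that in place, "whatever.html" runs as a script while the real .htm pages are served normally.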
I do it that way on most of my domains. If you have a domain that is nothing but CGI, it actually makes a lot of sense. In that case there is no reason to throw SSI into the mix, no reason to complicate things.
>>Cloaking is an incredible amount of work, and because of that, I don't know if I will ever do it.
But, what if you just started cloaking for just one search engine, maybe even one that doesn't work so hard at discovering cloaking? That's how I'm going about it. Which leads me to my next question....
Let's take googlebot or slurp as examples. Every time one of these comes to my site it's obvious. I see it. My server log software knows it. Why is it then so hard for a cloaking script database to know it? It seems that if you used a combination of user agent and IP, it would be very easy.
Or are they visiting my site disguised as visitors and I just don't know it???
Is it really that hard?
>>Nowadays, you really need to have a cellphone and laptop on hand at ALL times(even in the bathroom). So when the server detects a new bot version you can receive an instant message, get online, and update the file immediately.
If the server has detected a new bot version, doesn't that mean it detected it crawling your site? Which would mean it's too late to update?
-so many questions!-
Sorry I can't find the log listings giving a referrer for a search engine visit. I'll keep looking... I'm positive it's there somewhere.
>>But, what if you just started cloaking for just one search engine, maybe even one that doesn't work so hard at discovering cloaking? That's how I'm going about it.
This is how I've treated cloaking. I set up a script based on IP detection and a set of directories that contain files for the various search engines. For the last few months, I have been serving the spiders the exact same page as the visitors get, but from their *own* special directories. Now that I'm relatively sure the spiders are going to keep coming back, I'm starting to make small adjustments to the pages.

For example, I just made some changes the other day that drastically changed the keyword density of my Google page, but I was very happy to see that the size of the page only changed a fraction. I'm leaving graphics, tables, everything, in that page. Why not? I see millions of number one ranked pages that have tables, frames, blah, blah, blah... My philosophy is to maintain a no-spam agenda. If a human from Google wants to look at my human pages, they're going to see a page so similar to the one their spider gets served that even if cloaking is detected, spamming will not be.
I haven't been hit by very many new IP's lately and I believe I have a pretty complete list of numbers. If I do miss one, the next time it comes back I'll have some spider food for him. I'm not worrying too much because, like I said, the pages are similar and I'm not worried about getting banned. I'll just have to be patient with the whole system until I'm in a better position to take over the world! :-)
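The per-engine directory scheme described above boils down to a small lookup. The IP prefixes and directory names here are purely illustrative, not real or current spider ranges:

```python
# Sketch: serve each engine's spider from its own directory; everyone
# else gets the human pages. Prefixes and names are illustrative only.
ENGINE_DIRS = {
    "216.239.46.": "google/",   # hypothetical prefix, not a real range
    "202.212.5.":  "inktomi/",  # hypothetical prefix
}

def directory_for(ip):
    """Return the content directory to serve this visitor from."""
    for prefix, directory in ENGINE_DIRS.items():
        if ip.startswith(prefix):
            return directory
    return "human/"

print(directory_for("216.239.46.12"))  # google/
print(directory_for("10.0.0.1"))       # human/
```

Keeping the spider directories nearly identical to the human one, as the poster describes, is what keeps this on the "cloaked but not spamming" side of the line.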