|Best way to recognize googlebot in order to cloak (for redirection)?|
Before you start saying "this is not allowed" and the like, I will tell you that this kind of cloaking is done by massive websites (I'm talking about international, very well known multi-million-dollar companies). I can't post the names here as it's not allowed.
Having a website in different languages for different countries, they do automatic redirection depending on the country you connect from (using your IP address). However, they cloak their pages so as NOT to redirect if the visitor is a search engine or any bot. You can test this easily by changing your user-agent to a search engine's, and you will see that you won't be redirected.
I can't say here which websites I am talking about, as naming them isn't allowed, but if you spend some time researching you will find some yourself.
So back to the question. The only tool I have heard of for detecting bots is BrowserHawk. Do you think that this is the best choice? What is used by the leaders? Is there a way I can test whether they are using it?
I would like to do the same as them, as we are in the same business and at least I need to compete at the same level.
If not BrowserHawk, can you suggest how to do it? Any open-source alternatives? Any custom code?
Any help would be much appreciated.
Most major search engines can easily be detected using full-trip DNS: a reverse lookup on the visiting IP, followed by a forward lookup on the resulting hostname to confirm it maps back to the same IP.
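As a minimal sketch of that full-trip check in Python (the function name is mine, and the accepted crawler domains follow Google's published guidance that its crawlers reverse-resolve to googlebot.com or google.com):

```python
import socket

def is_verified_googlebot(ip):
    """Full-trip (forward-confirmed reverse) DNS check for Googlebot.

    1. Reverse-resolve the IP to a hostname.
    2. Confirm the hostname is in Google's crawler domains.
    3. Forward-resolve that hostname and confirm it maps back to the IP.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)       # reverse lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except OSError:
        return False
    return ip in forward_ips
```

The forward confirmation matters: anyone can configure reverse DNS on their own IP range to claim a googlebot.com hostname, but they cannot make Google's forward DNS point that hostname back at their IP.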
Some useful info here:
Hope that helps!
Yes thanks, it helped to clarify a lot of things.
My first guess with BrowserHawk was completely wrong :).
Anyway, I also read some other sources, and they suggest SpiderSpy by fantomas.
I would have to pay $258 every year, but at least it seems they provide a complete, updated list of IPs against which I can tweak the page (with good intent, of course).
Would you trust this list?
Do you think Google or the other big search engines would be able to get past this list of IPs automatically if they wanted to test for cloaking?
I am not much worried about manual checks: if the techniques used have good intent and the reviewer has common sense, the site shouldn't be banned (though it's guesswork to know how tolerant they are). I'm only talking about automatic methods Google might implement to detect cloaking.
I did some more research, and the three market leaders in my field all use "cloaking". The thing that is bugging me is that I spent the time to understand all I can, the risks, etc., and yet I have seen the market leaders do cloaking using ONLY user-agent strings.
This is quite shocking to me. They present a completely different page to users and to bots, and don't even do a reverse lookup of the IP address.
Do you see any reason why this might be their choice? Do you think that by making it so easy to find out (you don't even have to change your IP) they run less of a risk?
Why are they not paying attention to possible penalties? Maybe having a PR above 6 makes them "untouchable", so they don't have to deal with any of these problems?
That wouldn't be fair...
Do you think I should follow their example?
Found an alternative solution in the end: we decided on using only Google-approved policies, to be on the safe side.
So what are Google-approved policies, after all?
[edited by: incrediBILL at 5:57 am (utc) on Sep. 15, 2008]
[edit reason] no specifics, see TOS [/edit]
Neither of the examples given here is one where the black-hat definition of "cloaking" applies. Serving different language or currency content based on location is not "cloaking with intent to deceive" either visitors or search engines.
Using the user-agent or reverse-DNS lookup method to force a particular language or currency setting for a search engine robot is not maliciously deceptive.
I understand, but here is some simple help for anyone in my position. Instead of using code to detect Googlebot via the user-agent variable, I just set a dummy cookie called testcookie to 1, then checked: if testcookie is 1, redirect to force the country check; otherwise do nothing, so it defaults to the USA, the default country. This way I'm not looking for a string like "googlebot" or checking IP addresses that may change over time; I'm checking that this is a browser client that accepts cookies. Only if it does do I redirect to the country-select page; otherwise I just display the page so Googlebot can index it.

Now to my more pressing point: how do I make Googlebot come quickly? :-) And when it does come, it only gets my main page, not the other pages. Is there a way to make it index pages quicker? Sorry, this post probably belongs elsewhere, but a simple solution to my earlier problem of Google skipping my pages will hopefully help someone in the future!
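The cookie-based approach described above can be sketched as a minimal WSGI handler (a sketch only; the cookie name, redirect target, and page body are illustrative):

```python
from http.cookies import SimpleCookie

def app(environ, start_response):
    """Cookie-based redirect: first visit serves the default (USA) page
    and sets a test cookie; a return visit that presents the cookie is a
    normal browser, so it gets redirected to the country-select page.
    Crawlers that never return cookies always get the indexable page."""
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "testcookie" in cookies and cookies["testcookie"].value == "1":
        # Client returned our cookie: a real browser, so force country check.
        start_response("302 Found", [("Location", "/country-select")])
        return [b""]
    # No cookie yet (first visit, or a bot): serve the default page.
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("Set-Cookie", "testcookie=1; Path=/"),
    ])
    return [b"<html>default USA page</html>"]
```

Note the trade-off: this avoids brittle user-agent and IP matching, but it also redirects every human's second request, and it treats any cookie-less client (not just search engines) as a bot.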