
Cloaking Forum

    
Uncloaking
how to beat cloaking
tomandlis1
msg:676049 - 6:09 pm on Dec 19, 2002 (gmt 0)

Here is a little background: every day I go to a site (call it the target site), copy a table, and drop it into an Excel sheet. Once the info is in the sheet I do some analysis that helps me make some decisions. Specifically, I am getting data about my fantasy football league player roster. This information changes almost daily, and copying and pasting is a real drag.

I wanted to automate this task, so I wrote a program to do it. I guess you could call this program an unsophisticated spider (I am using ServerXMLHTTP from MSXML). However, I can't get at the target site's content with my program because it appears the target site cloaks its content. How can I beat their cloaking, i.e., uncloak them?

Here is what my program does:

1. Post login information (UID & PWD) to the site login page (this works).
2. Parse the headers from the response for cookies (session id cookies, etc.)
3. Echo back the cookies in a 'GET' and get the roster content (on the roster page) from the site

I can't do step 3 without the cookies from step 2, and they don't send the cookies if I use my program (the Set-Cookie header is empty). However, they do send cookies if I am just browsing. I figured this is a sure sign they must be cloaking.
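
To double-check step 2, a minimal test like the following (the URL and form field names are just placeholders for the real ones) dumps the raw response headers so I can see whether Set-Cookie ever reaches my script at all:

<%
'minimal check (placeholder URL and form field names): dump the login
'response headers to see whether Set-Cookie is sent to this client at all
set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
xmlhttp.open "POST", "http://targetsite.com/login", false
xmlhttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
xmlhttp.send "id=myLoginUid&password=myLoginPwd"
response.write "<pre>" & server.HTMLEncode(xmlhttp.getAllResponseHeaders()) & "</pre>"
if InStr(xmlhttp.getAllResponseHeaders(), "Set-Cookie:") = 0 then
    response.write "no Set-Cookie header received"
end if
set xmlhttp = nothing
%>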

I figure I have to send some new or different header information in step 1, but I am not sure what. I thought of some of the following headers, but I don't know what values I should put in them:

Referer,
User-Agent

Do you think this will work? Are there any other headers that I should include and if so what values should I send?

 

tomandlis1
msg:676050 - 8:57 pm on Dec 19, 2002 (gmt 0)

Well, here are the request headers I am going to set. Hopefully this will uncloak the page. Any suggestions? I picked these values by writing out the server vars when a normal browser comes to the page.

.setRequestHeader "Cookie", "x=y"
.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
.setRequestHeader "Referer", "http://targetsite.com/mainpage.htm"
.setRequestHeader "ACCEPT-LANGUAGE", "en-us"
.setRequestHeader "CONTENT-TYPE", ""
.setRequestHeader "CONTENT-LENGTH", ""
.setRequestHeader "ACCEPT-ENCODING", "gzip, deflate"
.setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"

jeremy goodrich
msg:676051 - 9:12 pm on Dec 19, 2002 (gmt 0)

What they might do is parse the headers sent by your application...

If that is the case, you'll need to pick a browser (MSIE 6.0, as in your example) and then you need to know *exactly* what header information it passes when it makes a server request.

And make sure you collect the images too - you don't need to save them, but make the request all the same. Some systems can be configured to react differently if a request claims to come from a standard browser but the image requests aren't there.

Also, you shouldn't send just one referer, but the referer that goes with each page - including for the image calls.

Follow that, and it should work out - I haven't seen a bot (yet) designed properly that couldn't get a page...:)

Let us know if that helps.

And I'll add to that: Welcome to WebmasterWorld!

tomandlis1
msg:676052 - 9:38 pm on Dec 19, 2002 (gmt 0)

Thanks for the welcome. I have been reading this site all day--the people on this site are very nice to each other. That is unusual, but welcome. Glad to be here.

Anyway, in regard to the 'referer', in my example I just put in the fixed referer, i.e., "http://targetsite.com/mainpage.htm", to give you an idea of how I will set it. I will dynamically set it at the time of execution.

I am not sure what you mean by 'collect the images'; would you please elaborate on that point?

Thanks.

jeremy goodrich
msg:676053 - 9:57 pm on Dec 19, 2002 (gmt 0)

Here is a sample get request:

55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /directory/ HTTP/1.1" 200 4854 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/back.gif HTTP/1.1" 200 216 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/blank.gif HTTP/1.1" 200 148 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/text.gif HTTP/1.1" 200 229 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/image2.gif HTTP/1.1" 200 309 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/folder.gif HTTP/1.1" 200 225 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Notice how after the first file is grabbed, which is an HTML file, the browser parses it and automatically follows up with a GET request for each icon?

You can trigger a 'flag' in a server, if they have it set up to catch bots, by not making the GET requests for each image on the page. A 'typical' bot such as a search engine spider will only parse the HTML looking for hrefs and such - if you don't grab the images, you already look like a bot to them.

If you follow the HTML GET request, parse the doc, and issue a GET request for each image, you should be okay... at least, you'll have one less signal that it's a bot and not a browser.

:) Yep, everybody is very nice here - not to mention good at what they do.

Hope that helps.

tomandlis1
msg:676054 - 1:00 pm on Dec 20, 2002 (gmt 0)

Thanks for the idea about following up the initial 'GET' with more GETs for the images. Fortunately I didn't have to go to that extra step.

My simple header spoof was good enough to uncloak them. I bet they were checking the 'Referer' to make sure the GET was referred by an internal page.

Hopefully they will never tighten their cloaking. If they do, I'll know where to come for help. Thanks! :-)

tomandlis1
msg:676055 - 1:08 pm on Dec 20, 2002 (gmt 0)

I have one additional question about the images. If the target site is waiting for the additional image GETs, wouldn't they have to delay their response to the initial GET? That is, if the target site is evaluating your request for its bot-ness by looking for follow-up image requests, aren't they hurting their serving performance by withholding the response to the initial page request?

tomandlis1
msg:676056 - 1:26 pm on Dec 20, 2002 (gmt 0)

OK, sorry to bombard you with questions, but I am really confused now that I've thought about the images thing.

If I follow up the initial request with image requests it must mean that I have already received the response to the initial request. Otherwise, I could not follow up the initial request.

Would the target site be looking to identify and catalog my IP as a bot for future reference if I didn't follow up the initial GET with more GETs?

Liane
msg:676057 - 1:40 pm on Dec 20, 2002 (gmt 0)

"Would the target site be looking to identify and catalog my IP as a bot for future reference if I didn't follow up the initial GET with more GETs?"

Yes, I think that is what jeremy_goodrich was getting at. The idea is to make the request look like a browser rather than a bot. If they think you are a bot, they may ban your IP from doing this again.

tomandlis1
msg:676058 - 3:02 pm on Dec 20, 2002 (gmt 0)

OK, now that I think about it, this is a big issue for me, as I have a fixed IP; if they ban it, I won't be able to spider in the future.

I will send the additional GETs. To do this, I imagine I will have to parse the returned string for <img> tags, subparse for the 'src' attribute, and then do a GET on the resulting src string (and not do anything with the returned content).

I hope the returned content-type (gif, jpg, etc.) doesn't screw up the ServerXMLHTTP component. Oh well, I'll find out soon enough.
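
Roughly, the image-fetching piece would look something like this (the function name, URLs, and the naive relative-path handling are just placeholders; it only fires the requests and throws the image bytes away):

<%
'rough sketch (placeholder names/URLs): pull src attributes out of the
'returned HTML and issue a GET for each image, discarding the bodies
function fetchImages(sHtml, sBaseUrl, sRefererUrl)
    dim re, matches, m, xmlhttp, sSrc
    set re = new RegExp
    re.Pattern = "<img[^>]+src\s*=\s*[""']([^""']+)[""']"
    re.IgnoreCase = true
    re.Global = true
    set matches = re.Execute(sHtml)
    for each m in matches
        sSrc = m.SubMatches(0)
        'naive relative-path handling; a real version needs more care
        if InStr(sSrc, "http") <> 1 then sSrc = sBaseUrl & sSrc
        set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
        xmlhttp.open "GET", sSrc, false
        xmlhttp.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        xmlhttp.setRequestHeader "Referer", sRefererUrl
        xmlhttp.send
        'response body is ignored; the request only exists to look browser-like
        set xmlhttp = nothing
    next
end function
%>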

Thanks again.

[edited by: tomandlis1 at 3:07 pm (utc) on Dec. 20, 2002]

tomandlis1
msg:676059 - 3:05 pm on Dec 20, 2002 (gmt 0)

Any other suggestions for ways to make my spider appear as a regular old browser would be very welcome. I suppose there are good links to threads in this forum regarding this. I'll start looking. Please post any good ones you know here. Thx.

tomandlis1
msg:676060 - 1:08 pm on Dec 31, 2002 (gmt 0)

Now that I think about it, I think getting the images is a waste of time. People frequently turn off images in their browser, so you would be wrong to identify them as spiders just because they aren't requesting images.

Dreamquick
msg:676061 - 2:06 pm on Dec 31, 2002 (gmt 0)

tomandlis1,

There's lots of great advice in this thread, but one thing I haven't seen mentioned is a replay attack - i.e., you record everything the browser sends, from the moment it goes onto the site to the moment it gets the data you require.

From here you could take the lazy solution and just replay this data whenever you need fresh information (if they need session cookies then you might need to get a fresh one each time, but that's not too tricky).

Since it's replaying real data, in theory these requests are indistinguishable from a real user's (obviously there are going to be timing issues with any automated method, but that's not too tricky a thing to get around).

The next level would be to determine how many of those steps aren't actually required to get hold of the data - eventually you reach a point where not sending a particular request results in being served the wrong content.

At that point you have the critical path you require (i.e., the minimal amount of information you need to send to get the maximum data); from here it is simply a case of properly automating the process.

My little bit of advice on automating requests: check the status code each request returns, as nothing looks more suspicious than a "browser" generating a whole slew of 404 errors!
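
For example, with the ServerXMLHTTP object used elsewhere in this thread, that check is just a couple of lines after the send (how you handle a failure is up to you; the Response.End here is only illustrative):

'after each request, check the status before carrying on
xmlhttp.send postvars
if xmlhttp.status <> 200 then
    response.write "request failed: " & xmlhttp.status & " " & xmlhttp.statusText
    'stop here rather than firing off a chain of follow-up requests
    response.end
end if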

- Tony

tomandlis1
msg:676062 - 2:41 pm on Jan 6, 2003 (gmt 0)

Thanks for the tip on 'replay' uncloaking. In that regard, what do you use to record everything that happens in a browser session? Do you use a program or some homegrown code?

Additionally, I am experiencing difficulty with my uncloaking script. It fails intermittently, i.e., sometimes it gets the target data, other times the target site cloaks it. I am pasting the VBScript code below in case that helps.

Regards

Tom

Here is the VBScript code:

<%
s = getTargetData("myLoginUid", "myLoginPwd")
response.write s

function getTargetData(uid, pwd)
    '**********************
    'set target site url, post uid & pwd, and get info back
    '**********************
    'first go to the login page and get session cookies, etc.
    urlx = "http://targetsite.com/login"
    postvars = "id=" & uid & "&password=" & pwd
    set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
    with xmlhttp
        .open "POST", urlx, false
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        .setRequestHeader "Referer", "http://targetsite.com/"
        .setRequestHeader "Accept-Language", "en-us"
        .setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"
        .send postvars
        While .readyState <> 4
            .waitForResponse 1000
        Wend
        strHeaders = .getAllResponseHeaders()
        sReturn = .responseText
    end with
    set xmlhttp = nothing

    nlenHead = len(strHeaders)
    iStart = 1
    '********************************
    'get cookie information from returned headers (strHeaders)
    '********************************
    j = 0
    dim arrCookie()
    do while iStart < nlenHead
        nFoundAt = InStr(iStart, strHeaders, "Set-Cookie:")
        if nFoundAt = 0 then
            exit do
        end if
        redim preserve arrCookie(j)
        nEndOfCookie = InStr(nFoundAt, strHeaders, ";")
        if nEndOfCookie = 0 then
            'no attributes after the value; stop at the end of the header line instead
            nEndOfCookie = InStr(nFoundAt, strHeaders, vbCrLf)
        end if
        if nEndOfCookie = 0 then
            exit do
        end if
        'get only the cookie data (name=value); ignore path, expires, etc.
        arrCookie(j) = Mid(strHeaders, nFoundAt + 12, nEndOfCookie - nFoundAt - 12)
        iStart = nEndOfCookie + 1
        j = j + 1
    loop
    '********************************
    'end get cookie information from returned headers (strHeaders)
    '********************************

    '********************************
    'use cookies to get target data from the target page
    '********************************
    'set the referring page as the page that login sends you to
    sTargetDataPage = "http://targetsite.com/rosters-grid"
    sRef = "http://targetsite.com/home"
    set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
    with xmlhttp
        .open "GET", sTargetDataPage, false
        'the Cookie header has to be set twice due to KB article Q234486
        .setRequestHeader "Cookie", "x=y"
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        .setRequestHeader "Referer", sRef
        .setRequestHeader "Accept-Language", "en-us"
        .setRequestHeader "Content-Type", ""
        .setRequestHeader "Content-Length", ""
        .setRequestHeader "Accept-Encoding", "gzip, deflate"
        .setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"
        i = ubound(arrCookie)
        for j = 0 to i
            'set the outgoing request header with the cookies received above
            .setRequestHeader "Cookie", arrCookie(j)
        next
        .send ""
        While .readyState <> 4
            .waitForResponse 1000
        Wend
        getTargetData = .responseText
    end with
    set xmlhttp = nothing
    '********************************
    'end get target data from the target page
    '********************************
end function
%>
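
One variation I may try (I don't know yet whether this is the cause of the intermittent failures): instead of calling setRequestHeader "Cookie" once per cookie in the loop, join all the captured cookies into a single header value, in case repeated Cookie headers are not being combined the way I expect. Something like this in place of the for loop, inside the second with block (keeping the dummy "Cookie", "x=y" call that is already there):

dim sAllCookies
if j > 0 then
    sAllCookies = Join(arrCookie, "; ")
    'one combined Cookie header instead of one setRequestHeader call per cookie
    .setRequestHeader "Cookie", sAllCookies
end if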
