I wanted to automate this task so I wrote a program to do it. I guess you could call this program an unsophisticated spider (I am using ServerXMLHTTP by MSXML). However, I can’t get at the target site content with my program because it appears the target site cloaks their content. How can I beat their cloaking, i.e., uncloak them?
Here is what my program does:
1. Post login information (UID & PWD) to the site login page (this works).
2. Parse the headers from the response for cookies (session id cookies, etc.)
3. Echo back the cookies in a ‘get’ and get the roster content (on the roster page) from the site
I can't do step 3 without cookies from step 2 and they don’t send the cookies if I use my program (set-cookie header is empty). However, they do send cookies if I am just browsing. I figured this is a sure sign they must be cloaking.
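To make steps 2 and 3 concrete, here is a minimal sketch in Python (not the ServerXMLHTTP/VBScript setup described above) of turning the Set-Cookie values from a login response into a single Cookie request header. The cookie names and values are made up for illustration, and the helper name `cookie_header_from_set_cookie` is hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request header value from Set-Cookie response values."""
    pairs = []
    for raw in set_cookie_values:
        jar = SimpleCookie()
        # load() separates the name=value pair from attributes such as path
        jar.load(raw)
        for name, morsel in jar.items():
            # coded_value is the value exactly as the server sent it
            pairs.append(f"{name}={morsel.coded_value}")
    # All cookies go into ONE header, separated by "; "
    return "; ".join(pairs)


# Hypothetical Set-Cookie values, as a login response might return them.
login_response_cookies = [
    "ASPSESSIONID=ABC123; path=/",
    "member=tom; path=/; HttpOnly",
]
cookie_header = cookie_header_from_set_cookie(login_response_cookies)
print(cookie_header)
```

If the list of Set-Cookie values is empty, the function returns an empty string, which is an easy way to detect that the login response carried no cookies at all.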
I figure I have to send some new or different header information in step 1, but I am not sure what. I thought of the following headers, but I don't know what values I should put in them:
Referer,
User-Agent
Do you think this will work? Are there any other headers that I should include and if so what values should I send?
.setRequestHeader "Cookie", "x=y"
.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
.setRequestHeader "Referer", "http://targetsite.com/mainpage.htm"
.setRequestHeader "ACCEPT-LANGUAGE", "en-us"
.setRequestHeader "CONTENT-TYPE", ""
.setRequestHeader "CONTENT-LENGTH", ""
.setRequestHeader "ACCEPT-ENCODING", "gzip, deflate"
.setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"
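For comparison, the same idea can be sketched in Python with the standard library: build one set of browser-like headers and attach a per-request Referer and an optional Cookie. The header values are copied from the example above rather than captured from a real browser, the target URL is the placeholder used elsewhere in this thread, and the helper name `build_get` is hypothetical. Content-Type and Content-Length are left out because a GET carries no body.

```python
import urllib.request

# Assumed values: taken from the example above, not from a real browser capture.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Accept-Language": "en-us",
    "Accept": "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
              "application/vnd.ms-powerpoint, application/vnd.ms-excel, "
              "application/msword, */*",
}


def build_get(url, referer, cookie_header=None):
    """Return a GET request object carrying browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers["Referer"] = referer  # set per request, not once for the session
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_get(
    "http://targetsite.com/rosters-grid",
    "http://targetsite.com/mainpage.htm",
    "x=y",
)
# Nothing is sent here; this only shows what would go on the wire.
for name, value in req.header_items():
    print(name + ":", value)
```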
If that is the case, you'll need to pick a browser (MSIE 6.0, as in your example) and then find out *exactly* what header information it sends when it makes a server request.
And, make sure you collect the images too - you don't need to save them, but make the request all the same - some systems can be configured to react differently if a person is using a standard browser, but the image requests aren't there.
Also - you shouldn't set just one referer, but the referer that comes with each page - including for the image calls.
Follow that, and it should work out - I haven't seen a bot (yet) designed properly that couldn't get a page...:)
Let us know if that helps.
And I'll add to that: Welcome to WebmasterWorld!
Anyway, in regard to the 'referer', in my example I just put in the fixed referer, i.e., "http://targetsite.com/mainpage.htm", to give you an idea of how I will set it. I will dynamically set it at the time of execution.
I am not sure what you mean by 'collect the images', would you please elaborate on that point.
Thanks.
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /directory/ HTTP/1.1" 200 4854 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/back.gif HTTP/1.1" 200 216 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/blank.gif HTTP/1.1" 200 148 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/text.gif HTTP/1.1" 200 229 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/image2.gif HTTP/1.1" 200 309 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/folder.gif HTTP/1.1" 200 225 "http://www.example-widgets.com/directory/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
Notice how after the first file is grabbed, which is an html file, the browser parses it and automatically follows with GET request for each icon?
You can trigger a 'flag' in a server, if they have it setup to avoid bots, if you don't make the GET requests for each image on the page. A 'typical' bot such as a search engine spider will only parse the HTML looking for hrefs and such - if you don't grab the images, you already look like a bot to them.
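As an illustration of how such a flag could work on the server side, here is a rough Python sketch that reads combined-format access log lines like the ones above and lists the IPs that fetched pages but never an image. This is only a guess at one possible check, not how any particular site does it; the function name, the second IP address, and the list of image suffixes are assumptions.

```python
import re
from collections import defaultdict

# Combined log format: ip, identity, user, [time], "request", status, size,
# "referer", "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # assumed list


def clients_without_image_requests(log_lines):
    """Return the IPs that requested pages but never requested an image."""
    pages = defaultdict(int)
    images = defaultdict(int)
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in combined log format
        path = match["path"].split("?", 1)[0].lower()
        if path.endswith(IMAGE_SUFFIXES):
            images[match["ip"]] += 1
        else:
            pages[match["ip"]] += 1
    return sorted(ip for ip in pages if images[ip] == 0)


AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
sample_log = [
    '55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /directory/ HTTP/1.1" '
    '200 4854 "-" "' + AGENT + '"',
    '55.55.55.55 - - [18/Dec/2002:14:49:24 -0500] "GET /icons/back.gif HTTP/1.1" '
    '200 216 "http://www.example-widgets.com/directory/" "' + AGENT + '"',
    # Hypothetical second client that only asks for the HTML page.
    '66.66.66.66 - - [18/Dec/2002:14:50:01 -0500] "GET /directory/ HTTP/1.1" '
    '200 4854 "-" "' + AGENT + '"',
]
flagged = clients_without_image_requests(sample_log)
print(flagged)
```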
If you follow the html GET request, parse the doc, and issue a GET request for each image, you should be okay...at least, you'll have one less signal that it's a bot - and not a browser.
:) Yep, everybody is very nice here - not to mention good at what they do.
Hope that helps.
My simple header spoof was good enough to uncloak them. I bet they were checking the 'referer' to make sure the 'get' was referred by an internal page.
Hopefully they will never increase their cloaking. If so I'll know where to come to get help. Thanks! :-)
If I follow up the initial request with image requests it must mean that I have already received the response to the initial request. Otherwise, I could not follow up the initial request.
Would the target-site be looking to identify and catalog my IP as a bot for future reference if I didn't follow-up the initial get with more gets?
Yes, I think that is what jeremy_goodrich was getting at. The idea is to make the request look like a browser rather than a bot. If they think you are a bot, they may ban your IP from doing this again.
I will send the additional gets. To do this I imagine I will have to parse the return string for <img> tags, subparse for the 'src' attribute and then do a 'get' on the resultant src string (and not do anything with the returned content).
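The parsing plan just described might look roughly like this in Python (used here only for illustration; the thread's own code is VBScript). The sketch pulls every `src` out of the `<img>` tags, resolves it against the page URL, and pairs it with the Referer that a browser would send, which is the page itself. The class and function names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute image URLs from the <img> tags of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            url = urljoin(self.page_url, src)  # resolve relative paths
            if url not in self.image_urls:     # a browser fetches each image once
                self.image_urls.append(url)


def image_requests(page_url, html):
    """Return (url, headers) pairs for the follow-up image GETs."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    return [(url, {"Referer": page_url}) for url in collector.image_urls]


page = "http://www.example-widgets.com/directory/"
sample_html = (
    '<html><body><img src="/icons/back.gif"><IMG SRC="/icons/blank.gif">'
    '<img src="/icons/back.gif"><img alt="no src"></body></html>'
)
follow_ups = image_requests(page, sample_html)
for url, headers in follow_ups:
    print(url, headers["Referer"])
```

The response bodies of these image requests can simply be discarded; only the fact that the requests were made matters.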
I hope the returned content-type (gif, jpg, etc.) doesn't screw up the ServerXMLHTTP component. Oh well, I'll find out soon enough.
Thanks again.
[edited by: tomandlis1 at 3:07 pm (utc) on Dec. 20, 2002]
There's lots of great advice in this thread but one thing I haven't seen mentioned is a replay attack - ie you record everything the browser sends from the moment it goes onto the site to the moment it gets the data you require.
From here you could take the lazy solution and just replay this data whenever you need fresh information (if they need session cookies then you might need to get a fresh one each time but that's not too tricky).
Since it's replaying real data, in theory these requests are indistinguishable from a real user's (obviously there are going to be timing issues with any automated method, but that's not too tricky a thing to get around).
The next level would be to determine how many of those steps aren't actually required to get ahold of that data - eventually you reach a point where not sending a request results in being served the wrong content.
At this point you have the critical path you require (i.e., the minimal amount of information you need to send to get the maximum data); from here it is simply a case of properly automating this process.
My little bit of advice on automating requests is to check the status code each request returns, as nothing looks more suspicious than a "browser" generating a whole slew of 404 errors!
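A rough Python sketch of the pruning idea follows: start from the full recorded sequence and drop one request at a time, keeping the shorter sequence whenever the replay still returns the target data. The `still_works` callback stands in for a real replay and is faked here so the example runs offline. The function names are hypothetical, `/logo.gif` is invented, and the other paths are the placeholder ones used in this thread.

```python
def minimal_steps(recorded_steps, still_works):
    """Greedily drop recorded requests that are not needed.

    still_works(steps) must replay the given steps and return True when the
    target data still comes back.
    """
    needed = list(recorded_steps)
    i = 0
    while i < len(needed):
        trial = needed[:i] + needed[i + 1:]  # try the sequence without step i
        if still_works(trial):
            needed = trial                   # step i was not required
        else:
            i += 1                           # step i is on the critical path
    return needed


def suspicious_statuses(statuses):
    """Return the status codes that signal an error (4xx and 5xx)."""
    return [code for code in statuses if code >= 400]


recorded = ["GET /", "GET /logo.gif", "POST /login", "GET /home", "GET /rosters-grid"]


def fake_replay(steps):
    # Stand-in for a real replay: pretend the site only needs a login
    # followed, at some point, by the roster request.
    if "POST /login" not in steps or "GET /rosters-grid" not in steps:
        return False
    return steps.index("POST /login") < steps.index("GET /rosters-grid")


critical_path = minimal_steps(recorded, fake_replay)
print(critical_path)
print(suspicious_statuses([200, 302, 404, 200]))
```

A real site may depend on more steps than this fake one does, so each trial replay should be checked against the actual content returned, not just the status code.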
- Tony
Additionally, I am experiencing difficulty with my uncloaking script. It fails intermittently, i.e., sometimes it gets the target data, other times the target site cloaks the target data. I am pasting the vbscript code below if that helps.
Regards
Tom
Here is the vbscript code:
<%
s = getTargetData("myLoginUid", "myLoginPwd")
response.write s

function getTargetData(uid, pwd)
    '**********************
    'post uid & pwd to the login page and collect the session cookies
    '**********************
    urlx = "http://targetsite.com/login"
    'URL-encode the values so characters such as & or = cannot break the form body
    postvars = "id=" & server.URLEncode(uid) & "&password=" & server.URLEncode(pwd)
    set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
    with xmlhttp
        .open "POST", urlx, false
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        .setRequestHeader "Referer", "http://targetsite.com/"
        .setRequestHeader "Accept-Language", "en-us"
        .setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"
        .send postvars
        While .readyState <> 4
            .waitForResponse 1000
        Wend
        strHeaders = .getAllResponseHeaders()
        sReturn = .responseText
    end with
    set xmlhttp = nothing

    '********************************
    'get cookie information from returned headers (strHeaders)
    '********************************
    sCookies = ""
    iStart = 1
    do
        'header names are case-insensitive, hence vbTextCompare
        nFoundAt = InStr(iStart, strHeaders, "Set-Cookie:", vbTextCompare)
        if nFoundAt = 0 then exit do
        'the header line ends at the next CRLF (or at the end of the string)
        nEndOfLine = InStr(nFoundAt, strHeaders, vbCrLf)
        if nEndOfLine = 0 then nEndOfLine = len(strHeaders) + 1
        sCookie = Mid(strHeaders, nFoundAt + 11, nEndOfLine - nFoundAt - 11)
        'keep only the name=value pair; drop path, expires, etc.
        'a cookie with no ";" is kept whole, so the loop always moves forward
        nSemi = InStr(sCookie, ";")
        if nSemi > 0 then sCookie = Left(sCookie, nSemi - 1)
        sCookie = Trim(sCookie)
        if len(sCookie) > 0 then
            if len(sCookies) > 0 then sCookies = sCookies & "; "
            sCookies = sCookies & sCookie
        end if
        iStart = nEndOfLine
    loop
    '********************************
    'end get cookie information from returned headers (strHeaders)
    '********************************

    '********************************
    'use cookies to get target data information from target page
    '********************************
    'set the referring page to the page that login sends you to
    sTargetDataPage = "http://targetsite.com/rosters-grid"
    sRef = "http://targetsite.com/home"
    set xmlhttp = server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
    with xmlhttp
        .open "GET", sTargetDataPage, false
        'we need to set the Cookie header twice due to KB article Q234486
        .setRequestHeader "Cookie", "x=y"
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
        .setRequestHeader "Referer", sRef
        .setRequestHeader "Accept-Language", "en-us"
        .setRequestHeader "Accept-Encoding", "gzip, deflate"
        .setRequestHeader "Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*"
        'a GET has no body, so Content-Type and Content-Length are not set
        'send all cookies received above in one Cookie header, separated by "; "
        if len(sCookies) > 0 then .setRequestHeader "Cookie", sCookies
        .send
        While .readyState <> 4
            .waitForResponse 1000
        Wend
        getTargetData = .responseText
    end with
    set xmlhttp = nothing
    '********************************
    'end get target data information from target page
    '********************************
end function
%>