| 3:44 pm on Oct 26, 2013 (gmt 0)|
Some details would be nice, what's the IP and full user agent of this spider?
There is always a logical explanation for how crawlers find things, without getting out the tinfoil.
| 11:23 pm on Oct 26, 2013 (gmt 0)|
As has often been said: Once on the Web...
Do not test live with your test browser. Do not use testing systems on the web. Mix those and you will be found out.
| 6:44 am on Oct 27, 2013 (gmt 0)|
Bottom line: robots.txt won't stop anything from crawling a site before it's ready to be launched.
You must block access to anything unfinished with password authorization in .htaccess.
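To illustrate why robots.txt is not a lock, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules, the bot name and the example.com URLs are made up for illustration. The check runs on the crawler's side, so a bot that never asks (or ignores the answer) is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a site that is not ready to launch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what a well-behaved crawler would conclude from ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite bot runs this check itself and stays away.
    # Nothing on the server enforces the answer.
    print(is_allowed("PoliteBot/1.0", "http://example.com/cgi-bin/test.cgi"))
    print(is_allowed("PoliteBot/1.0", "http://example.com/index.html"))
```

Since the decision is entirely voluntary on the client side, only server-side authentication actually keeps a request out.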
| 3:25 pm on Oct 27, 2013 (gmt 0)|
I agree with incrediBILL and tangor here. Though we know the NSA is a large problem, it's counter-productive to live in fear. There are many ways for your data to be accessed. Bingbot lately is getting into crevices that others are not, not even Googlebot.
And besides that, the NSA doesn't need to create crawlers. Why reinvent the wheel? They just tap into and use everyone else's systems.
| 3:41 pm on Oct 27, 2013 (gmt 0)|
Hit came from 220.127.116.11 : CompSpyBot/1.0
There are no links anywhere to the CGI. There are no cookies involved to create traces. There are no outlinks to create referrers. My Firefox has both cookie and referrer controls in place. I have been programming since 1989 and doing CGI since 1999.
The only access to this URL is the disassembly of an unsold Android app or access to the cgi-bin files.
I might add that I am neither a conspiracy theorist nor particularly fearful of intrusion. I was hoping someone would say something that would lead me to see another way of accessing the cgi-bin. Ideas anyone?
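One way to narrow this down is to pull every request the bot made out of the access log and look at the order of URLs and the referrers. A rough Python sketch follows. It assumes the Apache "combined" log format, and the sample lines and addresses are made up (documentation range), not taken from my logs.

```python
import re

# Apache "combined" log format is assumed here; adjust the pattern if the
# server logs in a different layout.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def hits_for_agent(lines, agent_substring):
    """Return parsed log entries whose user agent contains agent_substring."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and agent_substring.lower() in match.group("agent").lower():
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        '203.0.113.7 - - [27/Oct/2013:15:41:00 +0000] '
        '"GET /cgi-bin/app.cgi HTTP/1.1" 200 512 "-" "CompSpyBot/1.0"',
        '203.0.113.9 - - [27/Oct/2013:15:42:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    ]
    for hit in hits_for_agent(sample, "CompSpyBot"):
        print(hit["ip"], hit["time"], hit["request"], hit["referrer"])
```

If the bot requested the CGI directly, with no earlier hits and no referrer, it already had the URL from somewhere outside the site.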
| 4:11 pm on Oct 27, 2013 (gmt 0)|
By the way evsiz, welcome to Webmaster World.
|I might add that I am neither a conspiracy theorist nor particularly fearful of intrusion. |
|I have reason to believe that CompSpyBot is, perhaps, a government bot. |
Your opening line in the OP jumped right away to the thought that is preoccupying your mind. Hence we are just saying relax, in a friendly way. Don't be concerned about the term conspiracy theorist around here. It gets used liberally by some who don't understand the difference between a true paranoid theorist and someone who is concerned about developing trends and is not afraid to bring those concerns to the surface. If you read this forum before becoming a member, I'm probably one of the members you saw getting that label applied consistently. It doesn't prevent me from continuing to chip away at the ignorance in a balanced manner.
Apologies if strong wording was used to make you feel the need to get on the defensive. That wasn't the intention.
As for the Spy Bot, that's a term generally used by Meta Search Engines. I'll say again that lately I've seen bingbot finding URLs in one of my CGI folders that is also outside of the public web root. You've probably encountered a similar situation with a different bot.
| 4:39 pm on Oct 27, 2013 (gmt 0)|
Hey, no offense taken, 7^3.
Are you saying that Bing hit a CGI which has no links in or out? Mine is, practically speaking, an inert file to anyone without the Android app or access to a CGI dir listing.
| 5:31 pm on Oct 27, 2013 (gmt 0)|
Same one? compspy.com/spider.html
| 5:49 pm on Oct 27, 2013 (gmt 0)|
Yes, that is the one.
| 6:40 pm on Oct 27, 2013 (gmt 0)|
Gravity is a CIA plot to keep everyone on the ground! (quote: Bill Bailey)
The IP you quote is part of an FDCserver range, which is a frequent source of bad bots (not entirely their fault: they rent out the servers and someone rents one; FDC, like most server farm owners, does not bother to check what kind of activity is occurring).
Presumably you have SOME kind of link to the wiki or you could not access it yourself, nor could any app - at least, not by named URL. A common source of information is DNS, which usually has some kind of entry pointing to your web site (exceptions being sites accessed only by IP). If there is a DNS entry for the wiki, anything can (and will) follow it. If the default web site page (index, default, welcome, whatever) displays a page with any links at all, those links will be followed by bots of all shades: white, black, grey, puce...
If there is a DNS entry then even bing and the nameless one will access the page, robots.txt exclusion or not. The only sense in which they obey robots.txt is by not publicly listing the URL / page.
| 7:21 pm on Oct 27, 2013 (gmt 0)|
As said, link only in unsold app. DNS won't get you there.
| 12:35 am on Oct 28, 2013 (gmt 0)|
Well, there are 172 gTLD websites hosted on that IP's /24 (18.104.22.168), and the WHOIS on the IP indicates a hoster/data centre. FDCserver.net hosts approximately 46,977 gTLD websites.
Is your site on shared or dedicated hosting?
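For anyone unsure what "that IP's /24" means: it is the block of 256 addresses that share the first three octets. A small Python sketch with the standard ipaddress module, using the address quoted earlier in the thread purely as an example (the helper name is mine):

```python
import ipaddress


def containing_slash24(address: str) -> ipaddress.IPv4Network:
    """Return the /24 network that contains the given IPv4 address."""
    # strict=False lets us pass a host address rather than the network base.
    return ipaddress.ip_network(f"{address}/24", strict=False)


if __name__ == "__main__":
    net = containing_slash24("220.127.116.11")
    print(net, net.num_addresses)
```

Counting the sites hosted in that block, as above, still needs an external reverse-IP or WHOIS lookup; this only shows which addresses belong to it.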
| 8:15 pm on Oct 28, 2013 (gmt 0)|
Where is the app itself? Being Android, I assume Google? If so, I wouldn't bet on anything remaining a secret.
| 2:04 pm on Oct 30, 2013 (gmt 0)|
Going right back to the original observation in the OP
|All this seems to add up to the bot having internal access to the contents of RackSpace servers. |
I found this <insert your preferred expression of astonishment here> article:
|As the Snowden revelations proceeded it became apparent how reliant the security services actually are on the commercial services we all use—the Internet service providers, phone companies, and social networks—for help, both official and unofficial. Both in the US and in the UK the cloak of legal secrecy that surrounds this activity is such that no company dares come out openly and discuss its relations with the secret services. It is illegal to do so. For their part, governments on both sides of the Atlantic are terrified that commercial companies will “run for the hills” if consumers learn quite how accommodating they have been with their data. [nybooks.com...] |
That sounds like a potential match that could lead you to the root of your dilemma.
The linked article is quite long, 2 full page lengths, but well worth reading as a whole. Truly astonishing. If you just want to verify the quoted text above it's about 3/4 of the way down page 2.
| 3:46 pm on Oct 30, 2013 (gmt 0)|
An additional heads up. If the site linked above is slow to load, or not responding at all, don't give up. I think there is overwhelming demand on their server right now. The article was brought to my attention yesterday via Twitter so it's possible it's getting lots of attention.
| 7:55 pm on Oct 30, 2013 (gmt 0)|
|As said, link only in unsold app. DNS won't get you there. |
Who was your beta tester? You did have a beta tester, right?
Maybe it is not a leak from your system, but someone else's.