How can bots know the internal structure of a password protected site? - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

How can bots know the internal structure of a password protected site?

lamati

8:35 am on Feb 6, 2015 (gmt 0)

10+ Year Member

Hello,

I run a private wiki (MediaWiki) for a closed group of university students. Anyone who wants to access any page of the wiki first has to enter a login and password. This applies even to the default landing page.

For about two weeks I have seen repeated requests from "Cliqzbot/0.1" for specific pages deep within the wiki's structure, each of which results in a 401 access error.

I worry less about these continuous attempts than about HOW the bot could know these specific page addresses. I was able to trace the pages to specific articles that were accessed by only two users of the wiki. So somehow the browser of one of these two users must have shared its browsing history with the Cliqz search engine, which then triggered the bot to try to access the pages.

How does this work? I consider this a violation of privacy if the browser history is sent to some search engine without the consent of the user.

Any ideas how this all works and what advice I should give to those two users to make sure this problem is terminated?

Thanks for any help

keyplyr

10:57 pm on Feb 6, 2015 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Your user may have a browser tool bar, plugin or add-on that is passing history along.

wilderness

11:10 pm on Feb 6, 2015 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Any ideas how this all works and what advice I should give to those two users to make sure this problem is terminated?

You create a valid TOS which advises users that they are NOT allowed to use any software that crawls your site (links or otherwise).
Then, when a user does use such software, you remove their access (both membership and IP).
Without enforcement of the TOS, you're just blowing smoke.

lucy24

2:17 am on Feb 7, 2015 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I consider this a violation of privacy if the browser history is sent to some search engine without the consent of the user.

How do you know the user didn't consent? Grandparents do not have a monopoly on foolish toolbars and addons.

:: detour to look up robot ::

Huh. I feel neglected. I've never set eyes on this one. Admittedly the element "0.1" suggests that it's brand-new, barely in development-- or that its programmer is an idiot. (The two are, of course, not mutually exclusive.)

:: further lookup ::

Cliqzbot
[cliqz.com...]
A description for this result is not available because of this site's robots.txt - learn more.

So many possible responses, so little time...

In order to ensure, that your domain/web-page will be found and presented nicely in our product, our little Cliqzbot may have visited your site recently.

But, hm, your pages are roboted-out, right? Law-abiding robots would then not even ask for the pages in the first place-- and robots that do ask can happily be banned up front.
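For anyone wondering what "not even asking" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical, since the thread never shows the wiki's actual robots.txt; this only illustrates the check a well-behaved crawler would run before requesting a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out every crawler from the whole site.
# The real wiki's robots.txt is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant bot would run this check first and skip the request if it fails.
allowed = may_fetch(
    ROBOTS_TXT,
    "Cliqzbot",
    "http://example.com/index.php?title=Some_Page",
)
print(allowed)
```

robots.txt is only a request, of course. The login wall (and the 401 it returns) is what actually keeps the content private.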

It seems to be German; at least that's where I'm finding most information. Did they crawl from 81.169 or 85.214?

One German site [en.wetena.com] (no, I have no idea what that "en." is doing in the URL of a German-language page) says helpfully:

Cliqzbot is a bot of unknown function operated by the Munich company "10betterpages GmbH". No description page exists for the bot, so it is also not known, for example, whether the software is supposed to honor the directives in robots.txt.
...
Until the operators of Cliqzbot have documented the nature of their services and the tasks of the bot in a comprehensible and publicly visible way, risks from this software spying on one's own website cannot be ruled out. As long as Cliqzbot poses a potential risk, we consider blanket ("pauschal") blocking to be sensible.

Translation: Block 'em.

:: final detour to dictionary because I've never seen the word pauschal in my life ::

lump sum, all-inclusive, overall; blanket (as an adjective, like "blanket ban")

Yeah. What he said.

keyplyr

2:46 am on Feb 7, 2015 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

@lamati - It would be helpful if you post the entire request string (but using "example" for your site/page.)

lamati

8:30 am on Feb 7, 2015 (gmt 0)

10+ Year Member

I have dug a bit more into the issue, and one of the users did in fact have a Firefox browser plugin installed called

"CLIQZ Beta 0.5.61"

and it is from the same company that runs Cliqzbot, so the plugin was sharing the browser history with the search engine, which in turn caused the bot to request the addresses it received.

I wrote a little review on the plugin at the Mozilla site and received a comment from the developer that "no personal data is being stored" but that they anonymously store the search history as well as the visited sites of the users. They claim that they do this as part of developing a malware detection algorithm.

I asked the user to remove the plugin, and since then I have not seen any access from Cliqzbot, so it does not seem to keep retrying addresses it stored previously.

keyplyr asked for the full string of the access attempts, here is one example from the nginx log file:

85.214.34.101 - - [03/Feb/2015:16:28:06 +0100] "GET /index.php?title=XXXXXXXXX_XXXXX_-_XXXXXXXXXXXXXXX HTTP/1.1" 401 188 "-" "Cliqzbot/0.1 (+http://cliqz.com/company/cliqzbot)"

The requests came from various addresses, but they were all from 81.169 and 85.214 as suggested by lucy24.
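In case it helps anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that picks such requests out of an nginx access log. It assumes nginx's default "combined" log format, and the function names `parse_line` and `is_denied_bot_hit` are made up for this illustration. The sample is the log line quoted above.

```python
import re

# Assumes nginx's default "combined" log format; adjust if log_format differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_denied_bot_hit(entry, agent_token="Cliqzbot"):
    """True if the request came from the named bot and was answered with 401."""
    return agent_token in entry["agent"] and entry["status"] == "401"


sample = (
    '85.214.34.101 - - [03/Feb/2015:16:28:06 +0100] '
    '"GET /index.php?title=XXXXXXXXX_XXXXX_-_XXXXXXXXXXXXXXX HTTP/1.1" '
    '401 188 "-" "Cliqzbot/0.1 (+http://cliqz.com/company/cliqzbot)"'
)

entry = parse_line(sample)
if entry and is_denied_bot_hit(entry):
    # Print the source address, the requested page, and the response status.
    print(entry["ip"], entry["path"], entry["status"])
```

Running the same filter over the whole log and collecting the distinct `path` values gives the list of page addresses the bot has learned, which can then be compared with what individual users visited.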

Thanks for the quick help

trintragula

10:35 am on Feb 7, 2015 (gmt 0)

10+ Year Member

Top Contributors Of The Month

There are a number of toolbars/plugins that do this, presumably on the assumption that not storing your IP is sufficient.

* Comodo have a 'security' product that follows its users around with a robot called CRAZYWEBCRAWLER. I found this out recently and the user involved was not aware that it was happening.

* The UK ISP talktalk have a parental controls bot that follows subscribers around in a similar way. There's been some public controversy about this. The bot travels under a browser UA disguise.

* Pinterest's robot seems to follow people around that way, though I am less sure of how it operates. It may only act when you do something with pinning rather than on every request.

I think many users install these things either deliberately or via bundle-ware with little or no idea of the issues involved.

Cliqz has been around about 3 years I think, and I've seen all of the bots I've mentioned on my site at one time or another. Ironically cliqz seem very keen to assure users of their commitment to their users' privacy.

There are probably others.

Information leakage from a password-protected site is always possible - users can just save a page and re-post stuff elsewhere - but tools like this make it a lot more common.