Search Engine Spider and User Agent Identification Forum
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

above the law? ignores ROBOTS.TXT
lucy24
msg:4573205 - 12:58 am on May 12, 2013 (gmt 0)

Moderators: May really be a "general" question. Your call.

Background:

Their action:
GET robots.txt
GET front page
GET front page again, with a second user-agent
GET page linked from front page

My action:
Deny from aa.bb.128.0/17

They've been coming around for ages, once a month or so, dismissed as "no skin off my nose". Always ask for robots.txt, front page, and all pages linked directly from front page. Distinguishing trait: picking up a second copy of the front page with a different UA, this one mobile. (Belated thought: If they received different content on that front page, would they ask for duplicates of all pages?)

And your point is...?

#1 This is the first time they showed up on my test site. Unlike my real site, this one's robots.txt includes the element
User-Agent: *
Disallow: /
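For the record, Python's standard library ships a parser that applies these rules the way a compliant crawler should; a quick sketch (the bot name is made up) confirms the file above shuts out the entire site:

```python
from urllib import robotparser

# Feed the test site's rules in directly rather than fetching them over HTTP
rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

print(rp.can_fetch("AnyBot/1.0", "/"))           # front page -> False
print(rp.can_fetch("AnyBot/1.0", "/directory/")) # inner pages -> False
```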

On my real site, the disallowed directories are deeper: at least two steps away from the front page. The robot never gets that far. Here the whole SITE is roboted-out, giving the robot the opportunity to ignore any "keep out" signs.

#2 The offending robot belongs to verisign. But wait! Aren't they supposed to be the good guys? Did I miss a chapter?

The question:

Is anyone above the law? Do some automated agents perform a function so important, they're allowed to ignore robots.txt? Maybe not these guys in particular, but someone. For example, a virus checker isn't much use if it dutifully stays out whenever it meets a "Disallow". (This does not prevent me from blocking AV devices if they annoy me or I suspect they're bogus. But still.)

Punch line:

Here's what the link to the inner directory looks like.
.honey {display: none;}
...
<p class="honey">
<a href="/directory/"> ...et cetera

 

keyplyr
msg:4573554 - 3:09 pm on May 13, 2013 (gmt 0)


The offending robot belongs to verisign. But wait! Aren't they supposed to be the good guys?

Whatever made you think that? Their unethical antics and breaches of privacy are legendary.

lucy24
msg:4573586 - 4:15 pm on May 13, 2013 (gmt 0)

Yup, lots of interesting stuff on verisign. I searched these forums before posting. And anyone who follows a link involving the words "honey" and "display: none" may fall solidly into the Darwin Awards category.

But I did say "maybe not these guys in particular" :)

General question still stands.

dstiles
msg:4573674 - 7:55 pm on May 13, 2013 (gmt 0)

Agree with keyplyr - verisign are certainly not above suspicion. As a DNS provider they once tried to send all hits on unassigned IPs to their own advertising service. And in the recent past (maybe even now) they refused any mail from services outside the US unless we registered with them - it was easier to put a warning on web site contact forms! :)

In any case, was it actually from verisign or merely from a verisign-owned IP (server or DSL)?

I have a "honeypot" display:none link on most of my sites and find almost no-one, bot or human, falls in. The big SEs used to for a while but they seemed to learn.

How did they find your test site? One way is to scan DNS for A records. There is no way of blocking this as far as I know, so anyone can find your site if they really want to, especially if they already have an inkling based on the same domain name.

Robots.txt is only ever applicable to serious SEs as a suggestion, and only then sometimes (witness G's web preview). The only real way of stopping unwanted access is by blocking IPs, UAs and other bad headers. If it's a test site then block everything except your own IP: easy enough for those of us with fixed or long-term IPs, annoying for those (eg in UK) whose IP changes daily.

And I don't need to tell you (but I will remind you anyway!): if the bot is not welcome then block it. :)

lucy24
msg:4573725 - 9:12 pm on May 13, 2013 (gmt 0)

How did they find your test site?

Wouldn't it be harder to prevent someone from finding it? My impression was that once a domain name is registered, the robots will come.

On the test site, nothing links from the front page except the honeypot, which exists purely to identify robots that are both stupid and bad. Log wrangling filters out anyone who got a 403, so anything left over will jump up and hit me in the face. It's about equal parts stupid-plus-bad robots and humans whose Bing searches led them to the site name, even though it's roboted-out so there's no snippet. Yes, I could take the "noindex" approach instead, but this way is more fun. The front page says, quote,
Bad news for any passing humans:
This is a test site. You won’t find any entertainment. Sorry. I had to call it something, and the domain name was just sitting there.


Oh, and the site shares an htaccess (mod_authz and mod_setenvif) with my "real" sites, so offending IPs are blocked at the gate.
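The log-wrangling step above, filtering out everyone who drew a 403 so only the leftovers need a second look, might be sketched like this in Python (the combined-log layout and field positions are assumptions; adjust the pattern to your own logs):

```python
import re

# One Apache combined-format line looks roughly like:
# 1.2.3.4 - - [13/May/2013:09:12:00 +0000] "GET /x HTTP/1.1" 403 199 "-" "SomeBot"
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<req>[^"]*)" (?P<status>\d{3})')

def leftovers(log_lines):
    """Drop every request that was answered 403; whatever survives is
    either a legitimate visitor or a robot worth a closer look."""
    keep = []
    for line in log_lines:
        m = LINE.match(line)
        if m and m.group("status") != "403":
            keep.append(line)
    return keep
```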

dstiles
msg:4574150 - 7:35 pm on May 14, 2013 (gmt 0)

The "commercial" domains claimed (erroneously) by the USA as theirs (ie com/net/org) are public from registration. Terribly bad domain name system all round.

Other domains (eg UK) are more difficult to discover as they have to be registered in DNS as well. I've registered UK domains and never received a bot attempt on test sites that were only briefly in DNS. But com is open season.

However, once in DNS it can be found if DNS is scanned.

I really wish there were some way to ONLY get robots by invitation.

One has to view G's approach in the area of malware promoters in a sceptical light:

"Google (enacted) some forceful policy changes recently that prohibit developers from sending users who download apps from Google Play off the marketplace for updates. The Google policy change states that any app downloaded from Google Play may not modify, replace or update its Android Application File (APK) binary code using an update method other than Google’s."

Optimistic to say the least! :)

Dijkgraaf
msg:4574610 - 2:10 am on May 16, 2013 (gmt 0)

Robots.txt There is no law, it's just a guideline ;-)

incrediBILL
msg:4574820 - 10:43 pm on May 16, 2013 (gmt 0)

Robots.txt There is no law, it's just a guideline ;-)


Maybe on YOUR servers.

On my servers any violation of robots.txt gets a 403 Forbidden. Even Googlebot asking for pages it's been told not to access gets a 403, and the script to do it is pretty easy, using robots.txt processing rules readily available in open source. The same PHP functions crawlers use to process a robots.txt file can be reversed and used by the site being crawled: when a bot requests a page, you run its user agent and the requested path through your own robots.txt rules, exactly as a compliant crawler would have, and see whether it's allowed or not. Easy peasy.

Also, just asking for robots.txt and being denied puts you on the list so come back with any user agent you like, the IP has been blocked. Basically the rule for robots.txt is "asked and answered" and the answer is either "pass or fail" for that IP.
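A minimal sketch of that reverse check, here in Python with the stdlib robots.txt parser rather than PHP (class and variable names are illustrative): run each incoming request's user agent and path through the site's own robots.txt rules, and block the IP on the first violation.

```python
from urllib import robotparser

class ReverseRobots:
    """Apply the site's own robots.txt to incoming requests:
    a bot fetching a path the rules disallow gets a 403,
    and its IP goes on the block list for good."""

    def __init__(self, robots_lines):
        self.rp = robotparser.RobotFileParser()
        self.rp.parse(robots_lines)
        self.blocked_ips = set()

    def check(self, ip, user_agent, path):
        if path == "/robots.txt":
            return 200                    # robots.txt itself is always served
        if ip in self.blocked_ips:
            return 403                    # asked and answered: already failed
        if not self.rp.can_fetch(user_agent, path):
            self.blocked_ips.add(ip)      # first violation blocks the IP
            return 403
        return 200

guard = ReverseRobots(["User-agent: *", "Disallow: /private/"])
print(guard.check("1.2.3.4", "AnyBot/1.0", "/"))           # allowed -> 200
print(guard.check("1.2.3.4", "AnyBot/1.0", "/private/x"))  # violation -> 403
print(guard.check("1.2.3.4", "AnyBot/1.0", "/"))           # now blocked -> 403
```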

lucy24
msg:4574838 - 11:55 pm on May 16, 2013 (gmt 0)

You can put anything you like in a script-- but the 403 isn't intrinsic to robots.txt. That's the difference.

just asking for robots.txt and being denied

Do you mean, being denied in their request for robots.txt? Heck, even Ukrainian robots can have my robots.txt if they ask for it. Not that they ever do* (Chinese search engines sometimes do-- for all the good it does them). But if they did ask, I'd give it to them.


* I once met a well-behaved Ukrainian robot. It was memorable :)

incrediBILL
msg:4574917 - 9:13 am on May 17, 2013 (gmt 0)

but the 403 isn't intrinsic to robots.txt. That's the difference.


No 403s for the actual robots.txt request; that part is 100% according to spec. The server answers the robots.txt request with a valid robots.txt file stating that they are DENIED or ALLOWED, with the appropriate content for either scenario.

However, if they are denied, any other requests to the site are also denied.
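Answering the robots.txt request itself with a 200 and valid content either way, while varying that content by requester, could look like this (the allow-list mechanics and names are made up for illustration):

```python
ALLOW_ALL = "User-agent: *\nDisallow:\n"    # an empty Disallow permits everything
DENY_ALL  = "User-agent: *\nDisallow: /\n"  # a blanket Disallow shuts the whole site

def robots_txt_body(ip, allowed_ips):
    """Always answer /robots.txt with a valid file, per spec; a denied
    visitor simply receives a file telling it to stay out entirely."""
    return ALLOW_ALL if ip in allowed_ips else DENY_ALL

print(robots_txt_body("1.2.3.4", {"5.6.7.8"}))  # prints the deny-all file
```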

Basically, don't ask for robots.txt if you aren't a robot, because it could get ugly. I do put a Turing test on the robots.txt 403 page, though, so those denied just for being nosy can still say "YES! I'M A HUMAN!" and get out of jail free.

I also leave the "contact us" page wide open, like the robots.txt file, and it's linked from the 403 page, so if someone is locked out by accident they can drop an email. Every now and then I get an email, not too often, and more often than not I see in my log file that something odd was going on and I ask them to explain themselves.

The email I get contains their IP address and a link to an admin page to pull up all their activity on demand.

Work smart, not hard ;)

[edited by: incrediBILL at 9:18 am (utc) on May 17, 2013]

lucy24
msg:4574918 - 9:17 am on May 17, 2013 (gmt 0)

Hee. Someone once came by my site asking for "humans.txt". (I naturally suspect someone from hereabouts, but can't prove it :)) I felt obliged to add one just on general principles. Can't remember now what it says, but I'm pretty sure the file is still there.

incrediBILL
msg:4574919 - 9:18 am on May 17, 2013 (gmt 0)

humans.txt? That's pretty funny.

One somewhat inebriated night I did a drive-by on a couple of people I know keep an eye on their log files and there was ZERO mention of the wacky user agents and whatnot in the forum.

Very disappointing :)

However, some of the bot list sites published it as expected.

keyplyr
msg:4574929 - 10:41 am on May 17, 2013 (gmt 0)

One somewhat inebriated night I did a drive-by on a couple of people I know keep an eye on their log files and there was ZERO mention of the wacky user agents and whatnot in the forum.

So you're the one!

© Webmaster World 1996-2014 all rights reserved