Nasty nasty Robot: NaverBot

This one just won't listen to reason.

         

Josefu

12:39 pm on Jul 26, 2004 (gmt 0)

10+ Year Member



I read elsewhere in this forum something about this bot's 'perhaps unethical' way of spidering, but here I'd like to point out a few findings. What I find strange is that this bot has been pulling most everything it can from my site for more than a year now, but when I go to the search engine from whence it is supposedly sent, there is not a word there about my site.

Since he was taking up a fair amount of bandwidth to no evident end, I decided to ban him. I updated robots.txt and, lo and behold, in my next database check (I now use a database/cookie/session system to track visitors) there he was again, still sucking away. I then went to my .htaccess and added a new rewrite - and he came back AGAIN using another ID string - and this time with a 'buddy' robot ('PlantyBot') from an adjacent IP. I expanded my rewrite list accordingly and he's gone... for now.

Anyone else have a similar experience with this one?

jdMorgan

1:48 pm on Jul 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Check the IP addresses -- you may find that this is either a "spoof" of Naver, or that someone else is using that robot for their own purposes. (I don't know much about Naver, but these are possibilities, and I have seen Naverbot coming from several "odd" IP addresses that apparently do not belong to Naver.)

Jim

volatilegx

2:54 pm on Jul 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see actual Naverbot IP addresses being used for browsing sometimes, too.

Josefu

7:47 am on Jul 27, 2004 (gmt 0)

10+ Year Member



This li'l bugger, after mods to robots.txt AND .htaccess, managed to suck another twenty pages whilst I slept.

NaverBot-1.0 (NHN Corp. / +82-2-3011-1954 / nhnbot@naver.com)

Here's how I tried to block him:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} naver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NHN.Corp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverBot [NC,OR]
...
RewriteCond %{HTTP_USER_AGENT} Sleipnir [NC]
RewriteRule ^.* - [F,L]

How can he make it past that .htaccess? He did it all the same.

Josefu

2:44 pm on Jul 29, 2004 (gmt 0)

10+ Year Member



He's back yet AGAIN after I added even more rewrite conditions to block him.

Can someone please explain how a Spider can make it past the .htaccess rewrite? I'm totally flummoxed.

wilderness

3:05 pm on Jul 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why not stop dinking around with bot name denials and deny that bot's entire IP ranges instead?
Unless, of course, you're looking for Korean traffic?

RewriteCond %{REMOTE_ADDR} ^61\.(7[89]¦8[0-5])\. [OR]
RewriteCond %{REMOTE_ADDR} ^218\.(14[4-9]¦15[0-9])\. [OR]

Don't forget to correct for how the forum displays the pipe character (each "¦" above should be a solid "|").
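For reference, a complete version of the two conditions above might look like the following. This is only a sketch: the IP patterns are the ones quoted in the post with the pipe characters restored, and the closing RewriteRule line is an assumption borrowed from the rule Josefu posted earlier in the thread. The last RewriteCond carries no [OR], since an [OR] needs another condition after it.

```apache
RewriteEngine On
# Matches 61.78.x.x through 61.85.x.x
RewriteCond %{REMOTE_ADDR} ^61\.(7[89]|8[0-5])\. [OR]
# Matches 218.144.x.x through 218.159.x.x
RewriteCond %{REMOTE_ADDR} ^218\.(14[4-9]|15[0-9])\.
# Answer any request from those ranges with 403 Forbidden
RewriteRule .* - [F]
```

Anyone borrowing this should check the ranges against their own logs first, as everything inside them is denied, not only the bot.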

Josefu

3:19 pm on Jul 29, 2004 (gmt 0)

10+ Year Member



Thank you very much Wilderness.

Since I'm new to Apache rewrites, I was under the impression that name denials were pretty hard-core. I now see differently. By coincidence, I was getting the IP range from the APNIC server just when you posted : )

Thanks again,

Josefu.

wilderness

3:22 pm on Jul 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Josefu,
Here's an old thread that I recently referenced in another thread, which may help you:

A Simple Beginning
[webmasterworld.com...]

Josefu

3:28 pm on Jul 29, 2004 (gmt 0)

10+ Year Member



Thanks again, noted and tabbed : )

I still would be curious (very very) to know how a bot can make it past the name denial. Does it spoof in some way, or rewrite its 'useragent' info once it's in? Not only do I not see the point in a bot doing that...

wilderness

4:10 pm on Jul 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Josefu,
Naver has used many UA's.
They've even used google in their UA.

Use of UA's in denials is a first preference for many. The results can be both good and bad.
In some instances a particular portion of a UA may only be used by a solitary user or small group. In other instances an unidentified bot will travel using a standard UA, which prevents denial on those grounds without taking out many innocents.

I've read that some bots are using a system which circumvents denials and redirects. I have no clue if it's possible or true. Only that I read it. (I don't believe naver fits this exception though, as they are easily denied.)

Josefu

4:16 pm on Jul 29, 2004 (gmt 0)

10+ Year Member



Yes, understood, but naver still made it past my denials. Perhaps they found that 'way' you read about? If you could post a link (if it's permitted here) it would be most helpful. I hope this trickle (bots evading denial) won't become a flood.

jdMorgan

4:45 pm on Jul 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Josefu,

Your code should have worked, but did not. However, in your listing, you show "..." indicating that there is more code that is not shown. Look for your problem there. Assuming that any mod_rewrite code works for you, then what you showed above should have worked, too.

I don't believe in any "magic work-around" for bypassing .htaccess denials. In every case I've ever seen, it has been a coding error or regular-expressions error -- some very, very subtle, such as a missing ")" or "]".

BTW, since you are using unanchored regex patterns, your fourth RewriteCond is redundant, as anything that will match it will have already matched the second RewriteCond. Also, the [L] in [F,L] is redundant too: [F] carries an implied [L], as do [G] and [P]. However, neither of these details would have stopped your code from working.
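Applied to the Naver-related conditions that were actually posted, those two observations give something like the sketch below. The elided "..." conditions and the Sleipnir line are deliberately left out because the full list was never shown, so this is an illustration only and not a replacement for the complete rule set.

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
# Unanchored and case-insensitive, so this also matches "NaverBot"
RewriteCond %{HTTP_USER_AGENT} naver [NC,OR]
# The dot matches any single character, including the space in "NHN Corp."
RewriteCond %{HTTP_USER_AGENT} NHN.Corp [NC]
# [F] returns 403 Forbidden and ends rule processing without an explicit [L]
RewriteRule .* - [F]
```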

Jim

Josefu

5:38 pm on Jul 29, 2004 (gmt 0)

10+ Year Member



Yes, noted. Thank you.

I thought it wise not to anchor the search terms, so as to catch other instances of them in longer (updated bot?) IDs. I hope this is not unwise.

Josefu

7:18 pm on Jul 31, 2004 (gmt 0)

10+ Year Member



Mister NaverBot has yet to return - perhaps last time he visited while I was checking the logs. My .htaccess is blocking him and other automatons of ill intent just fine, thanks to you jdMorgan. Cheers and thanks for all your help : )

Josefu.

Josefu

8:47 am on Aug 18, 2004 (gmt 0)

10+ Year Member



Okay, this is a recap of the above problems I've had. I'm not sure where to put this as it's divided between Spiders, PHP and Apache server configuration. I'm putting it here as a sum-up to the above thread, and to quell the possible 'rumor' I've started by announcing that a 'Bad bot' made it past my .htaccess. I leave it to moderator discretion where to move this if need be. But please leave a link in this thread.

A summary: All of this came about because of a faulty server configuration, namely in the Apache httpd.conf file.

When I first noted that several supposedly banned bots were making it into my site, I went through the code again and found no errors. I then (erroneously) thought that somehow name blocking wasn't as 'hard-core' as blocking IP ranges. In reality they are equally effective; the only difference is the condition being tested, so no more delusions there, either.

jdMorgan came to the rescue with info about the loading/execution order of server modules loaded through the Apache httpd.conf file - what is loaded last executes first. In my case, jdMorgan's theory was that the PHP module was loading after, and thus executing before, the rewrite module, so the server was sending my PHP-generated pages before it had a chance to evaluate the RewriteConds I had set. This turned out to be true. After my hosting service had flipped things around, the bots were effectively stopped at root level.

But then the bots tried getting in through a lower directory, and were making it in.

Again the httpd.conf file was involved, but this time because of a still-fairly-new Apache addition, the RewriteOptions directive (better to look it up than for me to take the space to explain it here). It is set by default to "on" (as far as I know) but can be overridden by another later command. This is still not clear to me, but I know for certain that when I added a 'RewriteOptions inherit' line to all of my lower-directory .htaccess files, the bots were blocked for good. My pleasure now, watching from the ramparts, as the badBots are deflected from their path to my gateway to crash headlong into the stone of my 403-error wall. Picturesque, non? Satisfying, for sure.
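As a rough illustration of that last fix, a lower-directory .htaccess along the following lines would pull in the rewrite conditions and rules from the parent directory. The directive is spelled RewriteOptions inherit in the mod_rewrite documentation. The file location and the assumption that it contains nothing else are hypothetical, chosen only to keep the example short.

```apache
# .htaccess in a subdirectory (hypothetical location)
RewriteEngine On
# Inherit the RewriteCond/RewriteRule set from the parent directory's .htaccess
RewriteOptions inherit
```

Without that line, a subdirectory .htaccess that contains its own mod_rewrite directives does not apply the parent directory's rules, which would explain bots getting in through lower directories.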

My hosting service, because of the above debacle, will be conducting a head-to-toe configuration and security recap on their servers at the end of this month. Many thanks, from me as well as my hosting service, to jdMorgan for his help in both exposing and clearing up this matter. I hope that this thread in some way helps others seeing the same sort of 'breach' in their sites.

Josefu

8:53 am on Aug 18, 2004 (gmt 0)

10+ Year Member



Oh, and for a fun finale: NaverBot, with all his covert techniques and bad manners, after being slammed into the wall at every attempt to enter my site, is now repeatedly requesting nothing but my robots.txt file. But being slammed all the same : )

One last thought: I wonder what they do with all the information they gather, as they certainly don't put it into their search engine.

wilderness

1:41 pm on Aug 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



they certainly don't put it into their search engine

How might one tell what they put in their SE (unless capable of reading Korean or Japanese)?
I found an English link on their corporate page, but not on the search page.

Josefu

3:00 pm on Aug 19, 2004 (gmt 0)

10+ Year Member



Watakushi wa nihongo wakarimasu ("I understand Japanese") : )

I've done quite an extensive search using all my keywords and then some. Perhaps you came upon something I didn't? I will look again...