Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
msnbot-media/1.1 generating a ton of 404s
Nonexistent JPEGs and GIFs
blend27




msg:4627792
 1:10 pm on Dec 4, 2013 (gmt 0)

For a couple of days now msnbot-media has been requesting nonexistent JPEGs and GIFs from the site in question. All requests are coming from the 199.30.20.* and 199.30.16.* ranges. It started around 11/29/2013, with the first request arriving at 11:28:14 PM EST.

The UA that is used is a plain vanilla msnbot-media/1.1 (+http://search.msn.com/msnbot.htm). RDNS points back to msnbot-199-30-16|20-***.search.msn.com.

I have owned this domain since the beginning of 2003. The site is custom built by me, all of it. It is hosted on a dedicated IP, same host and same IP from the beginning. PHP is disabled.

The requested images seem to be parts of WordPress themes or page layouts, and at times just random image names.

So far 468 requests. Examples:

These are requested from the root:

/wp-content/themes/TutsPlaza/_assets/img/article-nav-arrow.png
/ProductImages/derosedesigns/Thumb_DD-350-BRN%20500x220.jpg
/images/top-rgt.jpg
/themes/migration-2/images/buttons/cart_btn_view.gif
/ProductImages/dreamline/plumbingaccessories/Thumb_DLVHD-ACC-D3-AB%200100x300.jpg
/images/bild%20h%209.jpg

What is also interesting is that it will take a valid URL on this site, chop it in half, and then append a nonexistent path to it:

/myDirectory1/myDirectory2/images/bild%20h%20778.jpg


Does not make any sense to me....



Has anyone observed something like this in the past?

 

lucy24




msg:4627893
 9:02 pm on Dec 4, 2013 (gmt 0)

Well, I knew it couldn't be just me [webmasterworld.com] :) I posted my own grumble in General because I didn't think it could be bing-specific, and there was obviously no question about the UA. But maybe bing is having a problem.

The part that's clear is that it isn't just making up URLs as you might do if you're testing for valid 404s. These are some other site's URLs getting spliced onto your own.
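
A minimal sketch of how the "spliced URL" pattern could be checked against a 404 log, assuming you have a set of directories that really exist on your site. The function name `split_spliced` and the `local_dirs` set are illustrative assumptions, not anything the bot or the server provides; the sample paths are the ones quoted earlier in the thread.

```python
def split_spliced(path, local_dirs):
    """Split a requested path into (known local prefix, unknown tail).

    local_dirs is a set of directory paths that really exist on the site,
    e.g. {"/myDirectory1", "/myDirectory1/myDirectory2"} (hypothetical).
    """
    segments = [s for s in path.split("/") if s]
    prefix = ""
    matched = 0
    for seg in segments:
        candidate = prefix + "/" + seg
        if candidate not in local_dirs:
            break  # first segment that does not exist locally
        prefix = candidate
        matched += 1
    # Everything after the last valid local directory is the foreign part.
    tail = "".join("/" + s for s in segments[matched:])
    return prefix, tail


if __name__ == "__main__":
    dirs = {"/myDirectory1", "/myDirectory1/myDirectory2"}
    print(split_spliced("/myDirectory1/myDirectory2/images/bild%20h%20778.jpg", dirs))
    print(split_spliced("/images/top-rgt.jpg", dirs))
```

A non-empty prefix together with a non-empty tail suggests a local URL was cut and another site's path appended; an empty prefix means the whole path is foreign.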

wilderness




msg:4627966
 2:47 am on Dec 5, 2013 (gmt 0)

This thing (IP and all) eats 403's from my sites regardless of what it requests (with the exception of robots.txt)
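
The policy described here (403 for everything except robots.txt) could be sketched roughly as follows. This is only an illustration in Python, not the poster's actual .htaccess rules; the two /24 networks are a direct reading of the 199.30.16.* and 199.30.20.* ranges named in the first post, and the function name `status_for` is an assumption.

```python
import ipaddress

# Assumption: the ranges from the first post, written as /24 networks.
BLOCKED_NETS = [
    ipaddress.ip_network("199.30.16.0/24"),
    ipaddress.ip_network("199.30.20.0/24"),
]


def status_for(ip, path):
    """Return the HTTP status this policy would answer with."""
    if path == "/robots.txt":
        return 200  # robots.txt stays readable for everyone
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETS):
        return 403  # everything else from the listed ranges is refused
    return 200
```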

blend27




msg:4628057
 2:03 pm on Dec 5, 2013 (gmt 0)

Don, I hear you on 199.30 ranges. Preview Bot that comes from there gets 403 here. But...

I just checked another site and boom, the same thing: the bot coming from a different, well-established MS range.

65.55.215.*

Again msnbot-media/1.1 from the root:
/robots.txt
/images/d2/10007/left_search.gif
/images/common/bottombanner13.gif
/theme/4/img/shopping/4/imagens/1.png
/images/d2/10007/left_search.gif
/directoty1/directory2/Ovchinnikov-AIC-rim62b.jpg
/image_site/g3262.png
/directoty1/directory2/directory3/current/img/header/landing-hero-image.jpg
/directoty1/directory2/directory3/current/img/find_dealer.jpg
/directoty1/directory2/directory3/current/img/UnitedStates.gif
/directoty1/directory2/Ovchinnikov-AIC-7win54b.jpg
/directoty1/directory2/Ovchinnikov-AIC-ovc17b.jpg
/sites/all/themes/interviewrussia2/images/sign_ad.png
/directoty1/directory2/directory3/current/img/header/landing-hero-image.jpg
/image_site/g3262.png
/directoty1/directory2/directory3/App_Themes/default/img/buttons/itemPage_view_wishlist.gif
/directoty1/directory2/directory3/images/chuck2.jpg

where /directoty1/directory2/directory3 could contain valid local directories, not in that exact pattern.

And this is just today, in a one-hour span. There are dozens more.

I took over this site in 2008. I still have the original "design" of it, 3 pages in total. All images were stored in the root. All 2 of them: LOGO and blinking EMAIL Us gif. That was it.

I think there is more to this than meets the eye at the moment.

keyplyr




msg:4628183
 9:12 pm on Dec 5, 2013 (gmt 0)

This msnbot/bingbot crap has been going on for at least 2 years, possibly more. I have ranted about it here several times.

dstiles




msg:4628186
 9:16 pm on Dec 5, 2013 (gmt 0)

I can see one explanation, valid from MS' viewpoint; although unlikely, I admit.

It is easy enough to put up one or more domains with links to other sites with large numbers of images and pages. An SE reads the site and tries to follow the links - unsuccessfully.

As I said, unlikely but it's still a possibility. Think of all the bad links reported by that arch-fiend G. :(

lucy24




msg:4628190
 9:35 pm on Dec 5, 2013 (gmt 0)

I took over this site in 2008.

I doubt it pertains to something that existed in the pre-2008 version of your site. My domain was registered in early 2007 and afaik the name simply didn't exist before then. It's a pattern that you see fairly often with inept robots: they've got one script that lists filepaths and another one listing domain names, and they get the two garbled. Sometimes you can even work out what site they think they're crawling! But coming from a major search engine it's unusual.

199.30. is also the range used by the plainclothes bingbot that requests images. (The non-image variant uses two other ranges even when the UA is identical.) In fact they may be connected. I have yet to figure out what Bing Preview really is. If it's hiding in search results it's hiding awfully thoroughly-- which would seem to defeat the purpose of a preview!

blend27




msg:4629459
 12:47 am on Dec 11, 2013 (gmt 0)

I am going the soft road here first via robots.txt

If that does not stop it, I am going to slap a 403 (.htaccess style, rewriting this to a log that I can read after April 1st) on all msnbot-media bot IPs, 199.30. or not, and on all others from $M that programmatically try to fetch images that don't exist on both domains. I don't have time to investigate.

@Lucy. ZQROO-OO-OO plainclothes bots from $M, all 403 here.

@dstiles ("enough to put up one or more domains with links to other sites with large numbers of images and pages.")

yep, but I doubt that these 2 domains are connected in any way. I don't see Gbot trying to fetch those. Someone has to be able to cloak their sites to 2 IP ranges that belong to $M in order to orchestrate these.

wilderness




msg:4629479
 1:35 am on Dec 11, 2013 (gmt 0)

I am going the soft road here first via robots.txt


blend,
take it from an experienced "softie", and don't waste your time with robots.txt and the msn-media bot.

lucy24




msg:4629506
 2:51 am on Dec 11, 2013 (gmt 0)

ZQROO-OO-OO plainclothes bots from $M, all 403 here.

I blocked them for a couple of years. But a month or two back I got curious about what they were up to and lifted the block. What ever happened to bingdude, anyway? I wish the ### thing would just say what it's doing!

not2easy




msg:4629524
 4:52 am on Dec 11, 2013 (gmt 0)

I'm seeing the same thing on domains I've had for many years and built from scratch. The directories and images they are trying to find have never existed on that or any other domain related or linked AFAIK anyway. The queries look like a list they scraped from some robot looking for possible future 'browsing':
/joomla
/wp
/blog
/drupal
- all on one site, along with a ton of .gifs I've never seen or heard of in directories that never existed:
/includes/rj_globnavimages/arrow_red.gif
/includes/rj_globnavimages/areas_pha.gif
/ng/journal/v36/n11s/thumbs/ng1435-F4.gif
/ng/images/arrow_black_prev.gif

No more 404s, they will be seeing the 403s now.

wilderness




msg:4629536
 5:59 am on Dec 11, 2013 (gmt 0)

For those of you who are getting these unknown directory and/or image requests, I have a question:

Are you on shared hosting or your own server?

The last three shared-hosts that I've had allowed open directories above my domain root to serve their own images.
I caught bots crawling these images early on and installed blank index files into the same directories (didn't think that would be possible above my root, however it was).
I tried getting one host to set this by default in their own configuration and they replied that it was not possible.

All I've been getting from msn-media bot for some while is requests for existing images that support pages, despite the images being contained in sub-directories that are off-limits in robots.txt.

Thus they may eat 403's until hell freezes over.

Don

lucy24




msg:4629550
 6:31 am on Dec 11, 2013 (gmt 0)

The last three shared-hosts that I've had allowed open directories above my domain root to serve their own images.
I caught bots crawling these images early on and installed blank index files into the same directories (didn't think that would be possible above my root, however it was).

How does that work, physically? I mean, what URL are your unwelcome visitors requesting?

wilderness




msg:4629555
 6:46 am on Dec 11, 2013 (gmt 0)

lucy,
You may recall a dilemma I had some while back with 403's and other custom errors?

The shared host had a default configuration where an advertisement 403 was delivered by the host server.
Simply adding 403 directives into htaccess did not change this.
I had to:
1) add error documents into htaccess
2) then go into the host's CP and change their directive settings.

As far as an actual answer to your question, I'll need to go back two years and look at backup DVD's; however, generally speaking the URL's were part of my domain, even though the image files did not exist on my domain.

wilderness




msg:4629561
 6:56 am on Dec 11, 2013 (gmt 0)

lucy,
Here's four examples from logs.
I've omitted the leading and trailing data from the lines.

/404images/btn_more_details.gif
/404images/buy.jpg
/404images/big_1.png
/404images/curr_USD.png

wilderness




msg:4629563
 7:06 am on Dec 11, 2013 (gmt 0)

Here's some more (none of these files were ever part of the pages and/or images I created and uploaded; rather, they were served via my shared host and through my domain).

/icons/blank.gif
/icons/folder.gif
/icons/layout.gif

keyplyr




msg:4629568
 7:38 am on Dec 11, 2013 (gmt 0)


Don, I have index.html files in all my image directories and this M$ phenomenon still occurs regularly. Even after I moved my main site to a new (shared) hosting company, it continues. And I have no redirects or dynamic pages.

I also get ill-formed requests for http://example.com/page-name1/page-name2.html

This happens with both msnbot and bingbot but no other SE bots. This has been going on for well over 2 years. At one point I had an ongoing email dialogue with an M$ tech. I sent him daily logs showing over 100 daily 404s from their bots, day after day. I was told they had probably gotten a corrupted index and that it should "wear off" very soon. It hasn't.

wilderness




msg:4629570
 7:44 am on Dec 11, 2013 (gmt 0)

keyplyr,
I was just providing a possibility to explore.
I've no insight into your site(s); only you have that.

I'm just not seeing what you folks are seeing and there must be a reason for that.

Course, I've just about everybody denied including Moses, Allah, Jehovah, The Pope and the Tooth Fairy ;)

phranque




msg:4629607
 9:45 am on Dec 11, 2013 (gmt 0)

wilderness, you must have softlinks in your document root directory that point from /404images/ and /icons/ to their locations outside of the document root.

blend27




msg:4629658
 1:12 pm on Dec 11, 2013 (gmt 0)

Are you on shared hosting or your own server?

Don, both of the sites in question are hosted on shared hosting servers with fewer than 50 sites on them. Both sites are on dedicated IPs (with expensive SSL certificates).

On both servers PHP is simply fully disabled, so /wp-content/themes (WordPress) would never make it there.

The fact that /directoty1/directory2/directory3/current/img/UnitedStates.gif is requested tells me that the bot is having a problem of its own, because /directoty1/directory2/directory3/ is a rewritten virtual URI (via .htaccess) that would never contain an .img extension. That has been tested against all variations of no-good stuff.

I have asked a server tech (I have known the dude for 7 years now) to scan the entire server for the "wp-content" string in folder names; nothing came up.

Also, this is the first rule that is in .htaccess file:

RewriteCond %{HTTP_HOST} nn.nnn.nn.n [NC]
RewriteRule .? - [F]

That says that nothing will ever be hosted on that IP.

The second rule says that if the request is not for the fully qualified domain (unless it is for the non-www version of the root without a query string), don't serve the content; redirect to an error doc with a link to the domain root. The site is built on top of a custom-written ColdFusion framework where everything goes via a root document that is hidden, and unless there is an entry in the config file, the request will not resolve. All those requests are logged in real time.

I watch these sites like a hawk and msnbot-media bot is bugging at this time :(

MJBill




msg:4634125
 7:17 pm on Dec 30, 2013 (gmt 0)

Sorry to be so hammish but I am new to WW and this is my first ever posting on a forum. Here's an example of the many lunatic URLs which msnbot-media/1.1 keeps trying to reach on my site:
/sites/default/files/imagecache/60x60_square/article/thumb/2013/11/22/Jamaat.jpg
None of these directories have ever existed on my site and neither has the jpeg. I did the double DNS look-up and the bot comes back genuine. Could someone give me an idea what is going on please? I've reached the point where I'm considering using robots.txt to block access to all images from all robots. Thank you for any help.
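
For readers unfamiliar with the "double DNS look-up" mentioned above, here is a minimal sketch of the idea: reverse-resolve the IP, check that the hostname sits under the expected domain, then forward-resolve that hostname and confirm it maps back to the same IP. The `.search.msn.com` suffix is taken from the rDNS quoted in the first post; the function names and the injectable `reverse`/`forward` parameters are assumptions made so the logic can be exercised without touching the network.

```python
import socket


def _reverse(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(hostname):
    """Return the IPv4 addresses hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_bot(ip, suffix=".search.msn.com", reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    host = reverse(ip)
    if not host or not host.lower().endswith(suffix):
        return False  # no PTR record, or PTR outside the expected domain
    # The PTR record alone can be faked; the forward lookup must agree.
    return ip in forward(host)
```

Calling `is_verified_bot(ip)` with the defaults performs live DNS queries, so results should be cached rather than computed on every request.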

not2easy




msg:4634133
 7:45 pm on Dec 30, 2013 (gmt 0)

Welcome to WebmasterWorld Forum, MJBill, you can get some insight looking around.
The msnbot and msn-media bots periodically take acid. OK, maybe they don't but they do act like it. I have blocked their ranges because they do not always follow directives in robots.txt - although heaven knows they have ample opportunities, grabbing that file every few seconds. They don't seem to "get" what 404 means and "403" after a few thousand more files may slow them somewhat. Good luck.

lucy24




msg:4634479
 2:28 am on Jan 2, 2014 (gmt 0)

One more contribution to the "What has Bing been smoking?" discussion:

Early in September, msnbot-media abruptly started asking for jpgs in the form
/paintings/thumbs/filename.jpg
I have never had a /paintings/thumbs/ directory. What I do have-- in plenty-- is directories in the form
/paintings/category/thumbs/
for assorted values of /category/. The requested filenames would be valid for this path.

Why did it take me four months to notice? Because I don't pay much attention to image files; I only discovered it because I've just moved sites so I'm paying close attention to all redirects.

Pause here for thanks to the person who recommended TextWrangler to me, because I was able to take a closer look:

-- three major Bing ranges are involved, randomly switching among 65.55, 131.253 and 199.30-- but never 157.5n.

-- in those four months, requests for nonexistent files have been far more numerous than earlier requests for correctly named forms of the same file. Generally 4-6 per file, total. Is that how many 404s it takes them to decide an image doesn't exist?

-- Round 1, running from 11 September to 8 October, requested the entire contents of /paintings/foobar/thumbs/, without the /foobar/ component. They never asked for the newest file, dating from last May; all others were created in 2005 and have had their present address since 2011. Each request was immediately preceded by a www-redirect.

-- Round 2, beginning on 26 November and continuing to the present, involves the entire contents of /paintings/juju/thumbs/ and /paintings/mojo/thumbs/ in random order-- again without the central directory name-- skipping one file from /juju/ that I just added a few weeks ago.

On the other hand they mysteriously omitted one file from /mojo/ that has been around since 2011. To make up for it, they repeatedly asked for a file which-- according to timestamps on my originals-- did not exist until several days after it was first requested. Now, I happen to know that for some unknown period of time my hosts' logs had a glitch causing them to be off by 22 minutes. But three days I would have noticed.

-- The most recent batch came in after I moved sites. Each of these took an instant redirect, just like a human, sticking with the same IP.

-- Prior to 11 September I cannot find any msn/bingbot redirects or 404s that would shed any light on this behavior.

Conclusion: Huh.
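
The kind of log tally described in this post could be sketched along these lines. It assumes the common Apache "combined" log format and groups 404s by the first two octets of the IP, matching how the ranges are named above (65.55, 131.253, 199.30); the function name and the regular expression are illustrative assumptions and will need adjusting for other log layouts.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_404s_by_range(lines, ua_fragment="msnbot-media"):
    """Count 404 responses per leading two octets for a given user agent."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        if m.group("status") != "404" or ua_fragment not in m.group("ua"):
            continue
        prefix = ".".join(m.group("ip").split(".")[:2])
        counts[prefix] += 1
    return counts
```

Passing an open log file works directly, for example `count_404s_by_range(open("access.log"))`, where the file name is hypothetical.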

not2easy




msg:4634495
 5:44 am on Jan 2, 2014 (gmt 0)

I would need to do a lot more checking to be certain, but it looks like the "nothing from 157.5n" holds true for the 404s I see also. After looking at a few hundred 404s in half a dozen log extracts, I pulled a random one from last January and there were a few requests in it for things that had never existed, not even complete filenames. One entire request was for "GET /no" and another for "GET /sitemap ". Maybe they got distracted by something shiny in the middle of the request. Those are rare oddities.

I am not seeing any 404s from files that had existed or do exist in another folder - or confused paths. I see completely unrelated image and folder names that I have never seen or used. None of the file structure exists on any of my sites and they don't request all images from one structure, they make up a new string of directories that don't exist and then invent an image filename at the end.

some actual examples:
/files/assets/custom.regular_image_thumb/crop/666496823.jpg
/newforums/images/misc/bookmarksite_digg.gif
/jpeg/co/co01c01zj.jpg
/scj/top/tt/example.com.jpg
(example.com replaces a not very nice .com domain name)

While these domains I'm checking logs for are on a shared server, they are hosted within my reseller range and there are no subdomains. Each domain has a unique IP.

I was not seeing this until the last half of 2013, but on some domains there was a ridiculous number of requests, including repeated requests for nonexistent files that had already returned many 404 errors. On those domains I blocked them by UA. 403 does not tell them anything yet. The UA is also Disallowed in robots.txt. I wish I could find anything helpful. Searching around I see that others complained about this problem 4 - 5 years ago but I don't see any explanation.

MJBill




msg:4634597
 2:06 pm on Jan 2, 2014 (gmt 0)

Thank you Lucy24 and not2easy. This might be a little off-thread, but I had the following idea for blocking rogue bots and brute-force attackers. Being a newcomer though, and never having seen it posted by the experts, I figured there is probably something stupid about it which I have missed. My thought was to set a 1-second crawl delay in robots.txt (and webmasters etc), and to block (permanently) anything that hit the site faster than that for more than a few seconds. If it's just the crawl-delay part of it that's stupid (good search results are vital to me), I could check instead for genuine bot CIDRs or even do Google's horrid double look-up, but both of those seem a bit messy. Or is the whole concept a non-starter? Your advice would be much appreciated.
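
A minimal sketch of the rate check proposed here, assuming a sliding window per IP: a client is flagged once it makes more than `max_hits` requests inside `window` seconds. The class name `RateWatch` and the default thresholds are assumptions for illustration; a real deployment would also need a whitelist for verified search-engine crawlers, since they do not necessarily honor a crawl delay.

```python
from collections import defaultdict, deque


class RateWatch:
    """Flag clients that exceed max_hits requests within window seconds."""

    def __init__(self, window=5.0, max_hits=5):
        self.window = window
        self.max_hits = max_hits
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, ts):
        """Record a request at time ts (seconds); return True if ip is over the limit."""
        q = self.hits[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits
```

With the defaults, roughly one request per second stays under the limit, while a sustained burst of several requests per second is flagged within a few seconds.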

lucy24




msg:4634676
 8:25 pm on Jan 2, 2014 (gmt 0)

The only thing in robots.txt that you can absolutely demand everyone follow is "Disallow". Everything else is an optional extra. I don't know about bing, but the googlebot says outright that it disregards "Crawl-Delay". You have to set it in wmt instead.

blend27




msg:4638328
 1:57 pm on Jan 18, 2014 (gmt 0)

So for now I ended up with :

User-agent: msnbot-media
Disallow: /

and it seems to stop.

I still see the traffic from Bing image search that has several hundred images from one site in question.

Now if only they could figure out how to interpret HTTP status codes. As for the nonsense from 131.253.24.* plainclothes bots, they don't seem to get the 403 concept.
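
The two-line robots.txt rule quoted in this post can be sanity-checked locally with Python's standard `urllib.robotparser`. This only shows how a standards-following parser reads the rule; it says nothing about whether a given bot actually obeys it. The `bingbot/2.0` token and the sample path are used purely as a contrast case.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# The media bot is shut out of everything...
print(parser.can_fetch("msnbot-media/1.1", "/images/top-rgt.jpg"))
# ...while agents with no matching record remain unrestricted.
print(parser.can_fetch("bingbot/2.0", "/images/top-rgt.jpg"))
```

Because the record names only msnbot-media, other crawlers are unaffected, which fits the observation that Bing image search traffic continued.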

lucy24




msg:4638376
 9:09 pm on Jan 18, 2014 (gmt 0)

131.253.24.* plainclothes bots, they don't seem to get the 403 concept.

They also crawl from 65.55.21x (without images, like 131.253) and 199.30.24-25 (with images). A few months ago I unblocked them out of sheer curiosity to see what they're up to. I remain none the wiser. The most plausible suggestion is that they're testing javascript (which in fact they do). But why they can't do this with the ordinary bingbot UA and ask for robots.txt first must remain a mystery.

tangor




msg:4638380
 9:41 pm on Jan 18, 2014 (gmt 0)

I can't tell if this is a "collection of common URIs to test" or "screwed up beyond belief" searches. In too many cases (and G and Y do this as well, just not as aggressively) it looks like the SE is testing the site for the underpinnings.

All of my sites are hand made. No wp, Joomla, etc. CMS style systems, yet all are being tested for those. What can you do? 403 is my response.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved