Forum Moderators: open

Which stupid bot is asking for /foldername/google-analytics.com/ga.js'?

g1smd

7:51 am on Jun 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am working on a site that doesn't have access to raw server logs.

There's only data from Awstats and Google Analytics to go by.

Awstats is showing lots of 404 errors generated by something asking for /google-analytics.com/ga.js' or for /foldername/google-analytics.com/ga.js' on the site.

It is very obvious that this is a bot that is reading through the GA Javascript and incorrectly extracting a URL from it:

document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"))

The closing single quote pretty much nails it. The site uses a base tag on every page.

Bots on the site in that 24 hour period include:
Gigabot/3.0_(http://www.gigablast.com/spider.html) as well as bots from Yahoo, Inktomi, and Live/MSN.
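
For anyone who wants to see the failure mode concretely, here is a minimal sketch of the kind of mis-parse I suspect. The regex and the example.com base URL are my assumptions, not anything recovered from the actual bot; it just shows how scanning the raw snippet text for URL-shaped tokens, then resolving the result against a base URL ending in "/", produces exactly these 404 paths:

```javascript
// The GA loader line exactly as it appears in the page source.
const snippet =
  "document.write(unescape(\"%3Cscript src='\" + gaJsHost + " +
  "\"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"))";

// A crude "host/path" pattern of the sort a sloppy crawler might run
// over raw JavaScript; note it swallows the trailing single quote
// along with the path.
const urlish = /[\w-]+(?:\.[\w-]+)+\/[^\s"%]+/;
const extracted = snippet.match(urlish)[0];
// extracted: "google-analytics.com/ga.js'"

// With <base href="http://www.example.com/foldername/"> in effect, the
// token resolves as a relative URL under that folder.
const resolved = new URL(extracted, "http://www.example.com/foldername/");
// resolved.pathname: "/foldername/google-analytics.com/ga.js'"
```

If the bot's heuristic is anything like this, a site would get one such 404 per folder whose index page carries the snippet and a trailing-slash base tag, which matches what Awstats is showing.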

Samizdata

9:58 am on Jun 23, 2008 (gmt 0)

Perhaps you are seeing the same problem identified here:

[webmasterworld.com...]

...

g1smd

2:17 pm on Jun 23, 2008 (gmt 0)

Hmm. I had already been dealing with the ;1813) user-agent for several days before this particular 404 error appeared in the site stats.

However, I note the newer ^User user-agent now also being used, which I was not doing anything special about until this morning.

Short answer: I don't know.

Longer answer: if this is a common bot, then many other people should be seeing the same thing in their log files. Specifically, anyone who has the GA JS code on their site, whose section URLs are folders ending in a trailing "/", and who also has a base tag in the page header defining a base URL that ends in "/".

So, who else is seeing this?

wilderness

2:26 pm on Jun 23, 2008 (gmt 0)

So, who else is seeing this?

g1smd,
You ever see a dog chasing its own tail?
I have that same eerie feeling as these issues develop ;)

I've been seeing the following for some days now:

"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

However, I've long had a denial in effect for this "begins with" pattern, and I'm not about to make any changes, regardless of who uses the term. (The same applies for "crawler" or "spider" in other UAs.)

Don

g1smd

2:43 pm on Jun 23, 2008 (gmt 0)

Yeah, there are many issues surrounding the AVG 8 LinkScanner, but this might not be one of them.

Is anyone else seeing lots of 404 errors generated by something asking for /google-analytics.com/ga.js' or for /foldername/google-analytics.com/ga.js' on their site?

If so, what is it that is requesting those URLs?

Samizdata

2:51 pm on Jun 23, 2008 (gmt 0)

You ever see a dog chasing its own tail?

Nice analogy.

Purely a guess, but it looks like AVG is trying to fix one of the many LinkScanner problems and succeeding only in breaking something else - pretty much par for the course, I'd say.

who else is seeing this?

I can only say that I am seeing far fewer hits from LinkScanner lately - apparently because AVG users dislike it even more than we do and are uninstalling it as fast as they can find out how.

...

incrediBILL

3:40 pm on Jun 23, 2008 (gmt 0)

There's only data from Awstats and Google Analytics to go by.

Bots don't tend to use JavaScript, so if you're seeing this in GA it's probably something else.

pageoneresults

3:53 pm on Jun 23, 2008 (gmt 0)

Is anyone else seeing lots of 404 errors generated by something asking for /google-analytics.com/ga.js' or for /foldername/google-analytics.com/ga.js' on their site?

I've been making mention of this in various topics here and there. It just goes in one ear and out the other. :)

If so, what is it that is requesting those URLs?

I've had my lead programmer on the horn multiple times investigating these failed GA calls. I think they only occur with the new GA code, though I can't verify that, as I use only the new code now.

There are two JS calls for the GA scripts. The second one fails at times, causing a 404. When that happens, the GA code gets appended to the URI of the page where the GA code failed. If I remember correctly, before the new GA code this would cause a page load delay. So, with the new GA code came some sort of timeout feature that causes the second script to fail if it takes too long to load. Does that make sense?
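
For readers who don't have it in front of them, the "new" GA code of that era is two separate inline blocks, which is what I mean by two JS calls (the UA-xxxxxx-x account number below is a placeholder). The second block depends entirely on the first having loaded ga.js:

```javascript
// Block 1: chooses the http vs https host and document.write()s the
// external ga.js script tag (the quoted string here is what gets
// mis-extracted into the 404s).
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));

// Block 2: runs only once ga.js has loaded and defined _gat; this is
// the call that can fail if the first load is slow or blocked.
var pageTracker = _gat._getTracker("UA-xxxxxx-x");
pageTracker._trackPageview();
```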

During our research, we came across a Chinese translation service that had their "translated" pages indexed. One of our sites was strategically framed, and they were doing some real funky stuff with URI appendage behind the scenes. Lo and behold, our GA script is sitting there too. I have to wonder what that does to the analytics. I don't dig that deep, yet!

But, back to the topic at hand. Here are just a few of the UAs on these failed GA calls...

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; MSDigitalLocker; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; InfoPath.1)

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

Opera/9.50 (Windows NT 5.1; U; en)

There is no consistency in the UAs. Is that the AVG Scanner doing all of that?

OT, we are getting ready to do what a few others are doing around here. For our US-based clients who ship only within the US, most everything outside is going to get blocked from accessing the site. We're freakin' sick of it. We're tired of finding our long hours of work regurgitated on an MFA website somewhere. That's it! If you are outside the US in one of our blocked countries, send us an email and we'll put you on the whitelist!

g1smd

5:19 pm on Jun 23, 2008 (gmt 0)

What turned people on to this being AVG LinkScanner?

I was assuming that it was a bot. I have only seen these errors start in the last 48 hours, and they are on a site that has had full content for several months and a "coming soon" page for many months before that.

The site has both Awstats and Google Analytics, and no access to the raw log files. The error is showing only in Awstats. The error is caused by the agent parsing the Google Analytics JavaScript code on the page (parsing it, not running it) and incorrectly extracting a duff URL out of it.

OK on the variety of user-agents. Nothing concrete there to go on.

I have catered for it. The requests get redirected to somewhere else, with a message in an appended query string that spells out what has happened.

As for copied pages skewing the data, I use domain filtering in Google Analytics to show direct accesses to the site in one profile, and accesses via all other sites in another profile. That alternative profile therefore shows people viewing the site in the Google, Yahoo, Ask, or Live cache, or via any other site that has copied the content.

Samizdata

5:40 pm on Jun 23, 2008 (gmt 0)

What turned people on to this being AVG LinkScanner?

I merely suggested you might be having the same problem as the post I linked to above:

I have been having a problem with what I call search bots not finding a file on my website

The post was about repeated 404s for a JavaScript file and it identified a LinkScanner UA.

As you have no access to the logs, it's hard to be certain, but pageoneresults seems to have a better handle on your problem, and as you have catered for it, all is presumably well again.

...

g1smd

8:48 pm on Jun 23, 2008 (gmt 0)

*** The second one is failing at times and causing a 404. When that happens, the GA code gets appended to the URI of the page where the GA code failed. ***

Interesting, but I am wondering why this has (so far) only happened for index pages ("/" or "/folder/") on this site. There is one such error for each folder and sub-folder on the site, including the root.

pageoneresults

9:14 pm on Jun 23, 2008 (gmt 0)

Is there a rewrite in place? All of ours that exhibit this behavior have rewrites in place. But that doesn't say anything, because every single site we have uses rewrite functionality. :(

g1smd

9:46 pm on Jun 23, 2008 (gmt 0)

Errr. There are a LOT of potential redirects in the site, in order to fix any and all known canonical problems that could arise.

However, if you use the normal site navigation you will never hit a redirect when clicking anything internal to the site.

There are only a very few rewrites, and none of those are involved with any of the URLs mentioned here in this thread.

pageoneresults

9:55 pm on Jun 23, 2008 (gmt 0)

It sure would be nice if Google engineers would step up to the plate on this one and provide some possible explanations. That would be way cool. I'm sure there are millions of others getting these and they don't even know it. And, when that 404 occurs for the second GA call, exactly what data is being lost?

Is this a fail-safe feature to prevent page load delay? If it is, what is causing the timeouts? Looking at the small percentage of these, I'd chalk it up to a "given" with third-party calls. It's not always going to connect, but it does most of the time. On one high-traffic site, those 404s represent a small percentage.

Is it also possible that these particular visitors have some sort of security setting (AVG) that is causing this? It's confusing for me because I don't see anything remotely related to AVG in our cases. At first I thought we had a rewrite issue, but we don't. It's that second script appending itself to the URI of the page where the script failed...

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>

Is there something wrong with that GA script that would cause these types of 404 appendages? If part of it fails, why does google-analytics.com/ga.js get appended to the page URI? I'm wondering if that is some sort of indicator that the script failed and Google is letting us know. I'm just guessing now. I'd like to solve it. That is one more statistic I'd like to have my hands on too!

g1smd

10:27 pm on Jun 23, 2008 (gmt 0)

Hang on though. The appendage is:

/google-analytics.com/ga.js'

Notice the single quote mark at the very end. The quote mark matches the one in the code.

Is that what you get too?

pageoneresults

12:56 am on Jun 24, 2008 (gmt 0)

I just got a final confirmation on this; they (my team) are tired of me pinging them about it...

Again, there is nothing wrong with that code as far as I can see. If you are still talking about the ga.js 404 error, it's either a bot that tries to parse it incorrectly and appends it to your current path, or just the browser executing the 2nd JS before the 1st JS has finished loading.

g1smd

6:01 am on Jun 24, 2008 (gmt 0)

My guess is also that it is a bot that parses the JS code and incorrectly extracts a URL from that.

The big clue is the single trailing quote on the end of the requested URL; hence the need to tie those requests back to the UA and IP associated with them.
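
As a sanity check on the "browser ran the second script too early" theory, here is a small sketch (the account number is a placeholder): if the tracker block executes before ga.js has defined _gat, the result is a thrown ReferenceError, not an HTTP request, so early execution alone can't explain a 404 with a trailing quote.

```javascript
// Simulate the second GA block running before ga.js has loaded: _gat
// does not exist yet in this environment.
let failed = false;
try {
  var pageTracker = _gat._getTracker("UA-xxxxxx-x"); // placeholder ID
  pageTracker._trackPageview();
} catch (e) {
  // Reading the undeclared _gat throws; no request is ever issued.
  failed = e instanceof ReferenceError;
}
// failed is true: early execution produces an error, not a 404.
```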

Timmay

1:22 pm on Jul 1, 2008 (gmt 0)

10+ Year Member



Not sure if anybody is still having this problem... but what I did to fix it was to modify the first JS call and just use the static location of the script.

<script src="https://ssl.google-analytics.com/ga.js" type="text/javascript"></script>

Not the ideal solution, but I don't get 404s anymore.

g1smd

10:58 pm on Jul 1, 2008 (gmt 0)

I only had those for a few days. I then redirected those requests elsewhere, and they therefore no longer appear in the site stats.

pageoneresults

11:06 pm on Jul 1, 2008 (gmt 0)

Welcome to WebmasterWorld Timmay!

Not the ideal solution, but I don't get 404s anymore.

Welll I'lll beee! < Remember Gomer Pyle?

I wonder why Google went to all the trouble of splitting that URI into sections like it does with the new GA script. Any JS gurus around who can explain the pros and cons of what Timmay has done? I'd like to know, because I'd like to eliminate those 404s, and if that one-liner does it, I'm off to make some changes this afternoon.
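
Not a JS guru's answer, but the trade-off is easy to state: the split exists only to pick http://www. versus https://ssl. per page, and Timmay's static tag hard-codes the SSL form for everyone. A sketch of what the stock snippet computes (the gaSrc function name is mine, for illustration):

```javascript
// The host switch from the stock GA snippet, factored into a function.
function gaSrc(protocol) {
  var host = (protocol === "https:") ? "https://ssl." : "http://www.";
  return host + "google-analytics.com/ga.js";
}

// The stock snippet effectively picks the host per page:
//   gaSrc("http:")  -> "http://www.google-analytics.com/ga.js"
//   gaSrc("https:") -> "https://ssl.google-analytics.com/ga.js"
```

So the one-liner costs you the plain-http host on non-secure pages (every visitor fetches ga.js over SSL), but in exchange the page source contains a plain literal URL with nothing for a parser to mis-extract. Keeping the switch without document.write() would mean creating the script element via the DOM and setting its src from something like gaSrc(document.location.protocol).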