Location of Search Engine Spiders

Forum Moderators: DixonJones

Are they all from the US of A?

ulleskelf

9:24 am on Jul 12, 2006 (gmt 0)

The figures we get from our logs relating to geographic locations seem to suggest Americans are surfing our site with images turned off!

Amongst visitors from the UK, Canada and Australia, between 42% and 49% of the requests on our server are for GIF files, 26% to 30% are for JPEGs and 16% to 18% are for HTML pages.

From the USA, the figures are 34% GIFs, 17% JPEGs and 34% HTML, suggesting that something in the US is loading our pages but not images.

Would I be correct in assuming that search engines, all or most of which use US-identified IP addresses, are visiting our site and loading only HTML pages, not images, and so skewing our figures for US visitors?
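As a rough check on that hypothesis, the percentages above can be turned into an estimate of how much US traffic is HTML fetched without images. This is only a back-of-envelope sketch: it uses the midpoints of the quoted ranges, and it assumes that human visitors in the US request pages and images in the same ratio as visitors from the UK, Canada and Australia, and that whatever is skipping the images requests only HTML.

```python
# Midpoints of the ranges quoted for UK, Canada and Australia visitors.
baseline_images = 45.5 + 28.0   # GIF + JPEG share, in percent
baseline_html = 17.0            # HTML share, in percent

# Shares quoted for US visitors.
us_images = 34.0 + 17.0         # GIF + JPEG share, in percent
us_html = 34.0                  # HTML share, in percent

# Assumption: a human visitor requests HTML and images in the baseline ratio.
html_per_image = baseline_html / baseline_images
expected_us_html = us_images * html_per_image

# Whatever is left over is HTML fetched without the matching images.
excess_us_html = us_html - expected_us_html

print(f"expected human HTML share: {expected_us_html:.1f}%")  # 11.8%
print(f"HTML share without images: {excess_us_html:.1f}%")    # 22.2%
```

On those assumptions, roughly 22 of every 100 US requests would be page loads with no image requests behind them.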

If I wanted an as-true-as-you-can-get figure for where our actual human visitors are coming from, instead of excluding the spiders from our stats, would another option be looking at the IP addresses of those who load GIFs and JPEGs?
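A minimal sketch of that idea, assuming the server writes Common or Combined Log Format lines (the thread does not say which format is in use). The function name `image_loading_ips` and the sample lines are made up for illustration; the IP addresses come from the reserved documentation ranges.

```python
import re

# Matches the start of a Common/Combined Log Format line (an assumption
# about the log layout): client IP, then the quoted request line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg")

def image_loading_ips(log_lines):
    """Return the set of client IPs that requested at least one image."""
    ips = set()
    for line in log_lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip, path = match.groups()
        # Drop any query string before checking the extension.
        if path.split("?", 1)[0].lower().endswith(IMAGE_EXTENSIONS):
            ips.add(ip)
    return ips

sample = [
    '203.0.113.5 - - [12/Jul/2006:09:24:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.5 - - [12/Jul/2006:09:24:01 +0000] "GET /logo.gif HTTP/1.1" 200 2048',
    '198.51.100.7 - - [12/Jul/2006:09:25:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
]
print(image_loading_ips(sample))  # {'203.0.113.5'}
```

The resulting set of IPs is what would then be fed into the geographic lookup, in place of the full list of visitors.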

gregbo

11:45 pm on Jul 12, 2006 (gmt 0)

It might be a good estimate provided you don't have a lot of fraudulent traffic.

oxbaker

3:05 am on Jul 13, 2006 (gmt 0)

Is your site accessed by thin clients, phones, or anything like that? If it's not, you can probably treat the image download numbers as more relevant, but a combination of all the methods discussed is the best approach.

hth,
mcm

TXGodzilla

5:24 am on Jul 13, 2006 (gmt 0)

It sounds like you are learning how to deal with cache-control issues for your site. Try adjusting the caching for your static pages & images, then review the traffic stats. You should also try checking reports for individual browser types to see whether the mix of text and image requests changes drastically with the browser type.
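If your stats package doesn't break requests down that way, a per-browser report can be sketched along these lines. It assumes you have already pulled (user agent, path) pairs out of the log; the names `classify` and `request_mix_by_agent` and the agent strings are invented for the example.

```python
from collections import Counter, defaultdict

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg")
PAGE_EXTENSIONS = (".html", ".htm")

def classify(path):
    """Bucket a request path as 'image', 'html' or 'other'."""
    clean = path.split("?", 1)[0].lower()
    if clean.endswith(IMAGE_EXTENSIONS):
        return "image"
    if clean.endswith(PAGE_EXTENSIONS) or clean.endswith("/"):
        return "html"
    return "other"

def request_mix_by_agent(requests):
    """Map each user agent to a Counter of request kinds.

    `requests` is an iterable of (user_agent, path) pairs.
    """
    mix = defaultdict(Counter)
    for agent, path in requests:
        mix[agent][classify(path)] += 1
    return mix

requests = [
    ("ExampleBrowser/1.0", "/index.html"),
    ("ExampleBrowser/1.0", "/logo.gif"),
    ("ExampleBrowser/1.0", "/photo.jpg"),
    ("ExampleBot/2.0", "/index.html"),
    ("ExampleBot/2.0", "/about.html"),
]
for agent, counts in request_mix_by_agent(requests).items():
    total = sum(counts.values())
    shares = {kind: round(100 * n / total) for kind, n in counts.items()}
    print(agent, shares)
```

An agent whose share of image requests is at or near zero is either a crawler or a client with images switched off or heavily cached, which is why the caching settings are worth checking first.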

There are some who already know the answers but you really need to see the test results to comprehend what is happening.

Trying to sort out bots is nearly impossible. Most exploit seekers will disguise themselves as a current browser & OS type. I noticed that a particularly large bot farm also has a few bots that don't announce themselves; I always suspected those were the more advanced bots checking for cloaking and SEO exploits.

Start sampling some of the IP addresses. When you start finding "browsers" viewing your site from the Rackspace & Level3 colo facilities, you'll realize that a lot of the odd traffic is just bot activity.
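Once you have looked up which address blocks belong to hosting providers, checking the sampled IPs against them can be scripted. This is a sketch only: the networks below are placeholders from the reserved documentation ranges, not real Rackspace or Level3 allocations, which you would have to get from whois lookups. The function names are invented for the example.

```python
import ipaddress

# Placeholder networks from the reserved documentation ranges. Real
# hosting-provider ranges would have to come from whois lookups.
HOSTING_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_hosting_ip(ip_string):
    """Return True if the address falls inside a listed hosting network."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in network for network in HOSTING_NETWORKS)

def split_by_origin(ip_strings):
    """Split sampled IPs into (hosting, other) lists, keeping input order."""
    hosted, other = [], []
    for ip_string in ip_strings:
        (hosted if is_hosting_ip(ip_string) else other).append(ip_string)
    return hosted, other

sampled = ["192.0.2.10", "203.0.113.5", "198.51.100.77"]
hosted, other = split_by_origin(sampled)
print("hosting:", hosted)  # ['192.0.2.10', '198.51.100.77']
print("other:", other)     # ['203.0.113.5']
```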

You can create "honeypot" robots.txt entries & links in some of your pages. Bots will follow anything on the page, and bad bots will even follow what they are politely asked to ignore, whereas people will only click on visible links.
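A rough sketch of how the honeypot could feed back into the stats, assuming a made-up trap path of /trap/ and (IP, path) pairs already extracted from the log. The robots.txt lines in the comment are an example of the kind of entry meant, not a recommendation for any particular site.

```python
# Hypothetical robots.txt entry asking all crawlers to stay out:
#   User-agent: *
#   Disallow: /trap/
# A link to /trap/ would sit in the page markup but be hidden from
# human visitors, so only link-following bots should ever request it.

TRAP_PREFIX = "/trap/"

def flag_trap_visitors(requests):
    """Return the set of client IPs that requested anything under the trap."""
    return {ip for ip, path in requests if path.startswith(TRAP_PREFIX)}

def drop_flagged(requests, flagged):
    """Return the requests that did not come from a flagged IP."""
    return [(ip, path) for ip, path in requests if ip not in flagged]

requests = [
    ("203.0.113.5", "/index.html"),
    ("203.0.113.5", "/logo.gif"),
    ("198.51.100.7", "/index.html"),
    ("198.51.100.7", "/trap/page.html"),
]
flagged = flag_trap_visitors(requests)
print(flagged)                          # {'198.51.100.7'}
print(drop_flagged(requests, flagged))  # only the 203.0.113.5 requests remain
```

Note that a well-behaved spider that honours robots.txt never hits the trap, so this only catches the bots that ignore it; the polite ones still have to be filtered out by other means.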