Forum Moderators: open


Learning more about Googlebot's behaviour on my site.

Which googlebot came to my site? Where did it go? Did it index anything?

         

crowthercm

5:20 pm on Sep 26, 2003 (gmt 0)

10+ Year Member



Hello,

I'm still in my rookie year of SEO and this is the first time I've encountered a need to learn more about googlebot and his/her behaviour. In the past, when I created a site, I would just link to it from all the pages I own (~3500 non-dynamic pages ranging in pagerank from 3-7). I don't run a linkfarm, I just build sites all within a niche so the new sites I create are purposefully on topic. Google almost immediately picked up and indexed any new sites so I've never thought much about it, in fact I haven't yet considered using a robots.txt either.

I've started another thread here that is similar to this one, and it gave me some really good suggestions. However, I think the point I'm after in this thread is significantly off topic from the last, so I wanted to start a fresh one.

I currently use 3 different types of software to measure my site's statistics: (1) Awstats, (2) Webalizer, and (3) Analog. Since my site's release 3 weeks ago, I've used these traffic measures to determine that Googlebot (I do not know which IP) has visited my site. Beyond that I know very little. Namely, all Awstats tells me is this:
Googlebot (Google) 70 476.01 KB 25 Sep 2003 - 16:01

I also know that various Googlebot crawler hosts have visited my site over the last few weeks:
crawler8.googlebot.com
crawler9.googlebot.com
crawler14.googlebot.com
etc.

What I do not know, but have been asked about, is: what are these bots looking at? When they reach my page, are they indexing pages or are they leaving? I understand that the deep Googlebot seems to have been merged with freshbot, so I can't check that way anymore.

How can I learn more about Googlebot's behaviour while it is on my site? This will give me a better ability to troubleshoot using some of the suggestions I've read here on WebmasterWorld.

Thanks,
Chris

dirkz

7:51 pm on Sep 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In order to track Googlebot's behavior you must have access to your access logs. Every line in them records a "hit", that is, a request to your server from an IP, like:

64.68.82.167 - - [26/Sep/2003:09:49:42 +1000] "GET /robots.txt HTTP/1.0" 404 288 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

This way you can track it with your own eyes.
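If you'd rather not eyeball whole lines, awk can pull out just the interesting fields. A minimal sketch, assuming the combined log format shown above ("access_log" is a placeholder for your own file name):

```shell
# Print the visitor IP, requested path, and status code of every hit.
# Field positions ($1, $7, $9) assume the combined log format above;
# adjust them if your server's LogFormat differs.
awk '{ print $1, $7, $9 }' access_log
```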

Ask your provider whether you have access to the original log files.

crowthercm

8:22 pm on Sep 26, 2003 (gmt 0)

10+ Year Member



Very cool, thanks (: It comes as a bizarre file type (***.com). I'm able to open it in Word fine, but is there a program that handles log files better? As it stands, it's rather inconvenient to work my way through the various lines.

Scary stuff too, some of the requests actually carry user login IDs.

claus

8:28 pm on Sep 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you can open it in Word, you will also be able to open it in Excel, I think. That might be better ;)

Otherwise, there are programs made exclusively for analyzing log files (for when they get larger than what a spreadsheet can handle).

crowthercm

8:46 pm on Sep 26, 2003 (gmt 0)

10+ Year Member



Some very neat stuff in this; I was able to get the gist of it just by using the Word search feature. Really quite impressive software: Google hit the site originally on the 14th, then over the next 5-6 days visited, as far as I can tell, the whole site. I see why people want to use robots.txt as well. Somehow the bot found some files that it shouldn't have, i.e. files with no links to them anywhere on the site!

I can see why a separate program would be needed, and fast; even with hardly any traffic to the site I'd already like to have one. Can you recommend some software tailored specifically to handle log files? Either something that parses the file directly on the server, or alternatively something that opens a raw access file on my home machine. I'll take a quick peek on Google to see what's available as well.

Claus, yes, I got it open in Excel btw, but it's a bit of a mess :P

dirkz

8:33 am on Sep 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To be honest, I use command line tools such as cat and grep (available on every *nix box, and via Cygwin for Windows too), e.g.
cat access_log | grep "Googlebot"

If you have lots of subdomains (virtual hosts) in your logs, it helps to grep for them too:
cat access_log | grep "subdomain" | grep "Googlebot"

For the overview I use webalizer.

Though a tool would be nice (free, of course :) that could track individual IPs and trace their path through your site by mouse click [based on log files]. That way you wouldn't have to do the grepping.
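Until such a tool turns up, the path-tracing part can be approximated with the same command line tools. A sketch, assuming the combined log format; "access_log" and the IP are placeholders for your own file and the visitor of interest:

```shell
# Trace one visitor: list the timestamp and path of every request
# a given IP made, in the order they appear in the log.
IP="64.68.82.167"
grep "^$IP " access_log | awk '{ print $4, $7 }'
```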

bwelford

10:51 am on Sep 27, 2003 (gmt 0)

10+ Year Member



I highly recommend using Excel for analyzing basic log files. If you import the log file as a text file, you get a long column of entries. Choose the option to convert text to columns and specify that space is the delimiter. You will then find that each parameter comes into a separate column, so you can sort by file read, by name of visitor, by time, or by any combination.

It's easier to do than to describe. I think you'll like it.
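One wrinkle with space as the delimiter: the timestamp and the quoted request/user-agent fields contain spaces themselves, so they get scattered across several columns. A small awk pass can pre-split each line into tab-separated columns first. A sketch, assuming the combined log format (file names are placeholders):

```shell
# Convert combined-format log lines into tab-separated columns
# (IP, timestamp, request, status + bytes, referrer, user agent)
# so the spreadsheet import needs only a single tab delimiter.
awk -F'"' '{
    split($1, head, " ")      # head: IP, ident, user, [time, zone]
    gsub(/^ +| +$/, "", $3)   # trim spaces around "status bytes"
    print head[1] "\t" head[4] " " head[5] "\t" $2 "\t" $3 "\t" $4 "\t" $6
}' access_log > access_log.tsv
```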

Barry Welford

chiyo

11:32 am on Sep 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep, as long as your log files are not too long, using Excel or another spreadsheet is a great way to get a feel for your visits by filtering, sorting, etc.

Just remember to define what you mean by "pageview" and filter out the jpgs, gifs, non-content scripts, js, ico, etc., and it becomes much more manageable.
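That filtering can be done on the raw log as well. A sketch with grep, assuming a few common non-content extensions ("access_log" is a placeholder; extend the pattern for whatever else your site serves):

```shell
# Rough pageview count: drop image, stylesheet, script, and icon
# requests, then count the remaining hits.
grep -i -v -E '\.(jpg|jpeg|gif|png|js|css|ico)' access_log | wc -l
```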

Also, Analog is extremely flexible if you take the time to read the manual. You can, for example, create separate reports that just count visits from Googlebot variants.

claus

2:04 pm on Sep 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Excel
bwelford posted the way to get the log files sorted in Excel. Excel will handle about 65,000 hits (= lines in the log file), so if your log files are generated daily, this might do for some time. If you have, say, five hits per pageview on average, that's equal to 13,000 pageviews.

Personally, I would recommend you look into the Excel feature "Pivot Tables" - not just for log files, but for all kinds of Excel-based analysis. It's a very valuable tool, and it can collect and aggregate data from more than one sheet (giving you more than 65,000 lines to work with).

If/when your files get longer, you will have to do one of the following:

  1. split them into shorter parts
  2. use samples instead of the whole file (e.g. every 10th line)
  3. use other tools
Here's a suggestion to use Access when the file gets longer:

(a) Importing raw logs into Excel
[webmasterworld.com...]
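Point 2 above (sampling) is itself a one-liner on a *nix box. A sketch, with "access_log" as a placeholder file name:

```shell
# Keep every 10th hit so the sample fits in a spreadsheet;
# NR is awk's running line number.
awk 'NR % 10 == 0' access_log > access_log.sample
```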


Log software
You will find good advice on specific software in the threads of the forum "Tracking and logging": [webmasterworld.com...]

Once in a while a (usually long) thread is started with suggestions on software, and between those a few good ones also show up. Here are 10 relevant ones I could find within the last 20 or so pages (longer threads in bold):

(1) Poll: What web stats service do you use?
[webmasterworld.com...]

(2) Web statistics
[webmasterworld.com...]

(3) Reading Access Logs
[webmasterworld.com...]

(4) How to open a 500MB log file?
[webmasterworld.com...]

(5) Log Analysis vs. Outsourced (3rd Party)
[webmasterworld.com...]

(6) Log Analysis
[webmasterworld.com...]

(7) What is the best choice today regarding log analysis software?
[webmasterworld.com...]

(8) Free Log Analyzer/Stats Program
[webmasterworld.com...]

(9) Best text editor for large log files
[webmasterworld.com...]

(10) A satisfactory stats package
[webmasterworld.com...]

Link #1 gets into a discussion around pages 7-8, and on page 8, in post #113, I've explained my own take on log files. Essentially, they are needed to study spider/bot behavior, but they are not very good for studying human behavior.


I hope this will give you some answers. There's a lot of reading to do at least ;)

/claus

crowthercm

5:01 pm on Sep 27, 2003 (gmt 0)

10+ Year Member



Wow! Thank you all for the posts, and especially Claus for taking the time to go into such detail. I will be out of town early this week, but this page has already been bookmarked, and printed!

Many thanks again, it's very helpful and appreciated.

Cheers,
Chris