How Do Pages Get Indexed Without a Googlebot Visit?

Forum Moderators: Robert Charlton & goodroi

studee

10:50 am on Oct 18, 2008 (gmt 0)

Hi!

This might sound funny, or maybe I am less informed on this, but I want to know: my 2-month-old website was indexed very quickly, and I have about 23,000 pages showing in the index. But lately I am seeing very little Googlebot activity, 1-3 pages per crawl day.

Today I wrote an article, and after 2 hours I could find it via Google, but my log is not showing any Googlebot activity. Can anyone tell me how Googlebot indexes pages? Also, I am pretty concerned (should I be?) about the low Googlebot activity. Earlier it used to index 300+ pages each day, then it dropped to alternate days, and now it's 1-3 pages every other day (I write 5 articles per day). Is this a matter of concern? I am not really investing in backlinks or anything, but my content is heavily interlinked within the website.

Finally, what's the average Googlebot activity like? Can you guys share yours?

Receptional Andy

6:47 pm on Oct 18, 2008 (gmt 0)

How are you recording googlebot visits? It sounds to me like your tracking might not be accurate. Google visits the content it puts in the index ;)

There isn't really average googlebot activity - it depends how good your incoming links are and how frequently your content updates.

studee

2:12 am on Oct 19, 2008 (gmt 0)

Hi Andy!

I use a CMS with a built-in tracking module that tracks all bot activity. Earlier its log used to show lots of Googlebot activity; lately the figures are as I mentioned. Google WMT is also showing less data being downloaded from my site, and the same goes for the server log. Still, my page showed up in the index within 2 hours (maybe earlier; I only checked after 2 hours yesterday). I am just curious about this.

Also, I write 5 articles per day on average, so content is updated every 5-8 hours.

tedster

3:05 am on Oct 19, 2008 (gmt 0)

Here's one possibility. Since Google migrated to their Big Daddy infrastructure, every one of Google's crawlers shares a common crawl cache to help reduce the amount of crawling that hits any server. Not every Google crawler has the string "googlebot" in its user-agent.

For example, your pages can end up in the main Google index if they were crawled by the AdSense bot. The user agent for the AdSense bot is Mediapartners-Google. So software that only looks for "googlebot" will miss it.
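As a rough sketch of that point, a tracker could match several Google user-agent tokens instead of only "googlebot". The token list and the function name `google_crawler_token` are assumptions made for illustration; the list is not exhaustive, and Google's own crawler documentation is the place to check the current strings.

```python
# Hypothetical token list: substrings that Google crawlers have used in
# their user-agent strings. Not exhaustive; treat it as an assumption.
GOOGLE_UA_TOKENS = (
    "googlebot",             # main web crawler (also matches Googlebot-Image)
    "mediapartners-google",  # AdSense crawler
    "feedfetcher-google",    # feed fetcher
    "adsbot-google",         # ads landing page checker
)


def google_crawler_token(user_agent):
    """Return the first matching Google token, or None if nothing matches."""
    ua = user_agent.lower()
    for token in GOOGLE_UA_TOKENS:
        if token in ua:
            return token
    return None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mediapartners-Google",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0",
    ]
    for ua in samples:
        print(google_crawler_token(ua), "<-", ua)
```

A user-agent string can be spoofed, so a match here only says what the visitor claims to be.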

Reno

12:31 pm on Oct 19, 2008 (gmt 0)

Not every Google crawler has the string "googlebot" in its user-agent.

Am assuming however that even if "googlebot" is not specified, the bot is at least identified as being related to google (as in your example)? In other words, the bot will never be anonymous?

the_nerd

6:35 pm on Oct 19, 2008 (gmt 0)

In other words, the bot will never be anonymous

that would make it too easy for cloakers ....

Receptional Andy

6:41 pm on Oct 19, 2008 (gmt 0)

In other words, the bot will never be anonymous
that would make it too easy for cloakers ....

But cloaking checks are just that: a test - not a general crawling/indexing process.

I still suspect a problem with the tracking mechanism - it's always best to verify these things via a server log file. There are also things like the feedfetcher bot that might not appear in a script-based tracker.
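For anyone who wants to check the raw server log directly, here is a minimal sketch that counts hits per user agent for anything mentioning "google". It assumes the Apache-style "combined" log format; the pattern, the function name `count_google_agents`, and the sample lines (which use documentation IP addresses) are all illustrative, so adjust them to whatever your server actually writes.

```python
import re
from collections import Counter

# Assumes the "combined" access log format:
# ip ident user [time] "request" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_google_agents(lines):
    """Count hits per user-agent string for agents that mention 'google'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "google" in match.group("agent").lower():
            counts[match.group("agent")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; in practice, iterate over the open log file.
    sample = [
        '192.0.2.10 - - [18/Oct/2008:10:50:00 +0000] "GET /article-1 HTTP/1.1" '
        '200 5120 "-" "Mediapartners-Google"',
        '192.0.2.11 - - [18/Oct/2008:10:51:00 +0000] "GET /feed HTTP/1.1" '
        '200 2048 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"',
        '192.0.2.12 - - [18/Oct/2008:10:52:00 +0000] "GET / HTTP/1.1" '
        '200 1024 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
    ]
    for agent, hits in count_google_agents(sample).most_common():
        print(hits, agent)
```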

studee

3:59 am on Oct 21, 2008 (gmt 0)

Okay guys, Googlebot is back with a bang on my site and has been indexing 300+ pages/day for the last 2 days. The same tracker is showing 300+ pages of activity, which means the tracker is working right. Still, the pages that got indexed without a Googlebot visit amaze me. But as long as pages are coming up in the search results, no complaints.

But if pages can get indexed without a bot visit, what's the point of link SEO and backlinks? I have 23,000+ pages showing in the index with very good positions, with just 16 backlinks to my site.

If the same trend continues for 1 more month, I don't think many backlinks are required at all.

Megaclinium

7:27 am on Dec 5, 2008 (gmt 0)

Are you identifying bots from the raw logs by behaviour?

I filter out robots based on a lookup table built from accumulated history. First I check the complete IP address, which I parse out (it's usually at the beginning of the raw record). If all 4 octets are not specifically in the spiders IP lookup table, I then check the first 3 octets, and then the first 2.

This makes it easy to record large blocks as bots. I was going to use the exact ranges but that would require a begin and end IP for the bot which would complicate the lookup function.
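A minimal sketch of that lookup order (full address, then first 3 octets, then first 2), assuming the table is just a set of strings. The table contents and the name `is_known_spider` are hypothetical, and the addresses are documentation examples, not real crawler IPs.

```python
# Hypothetical lookup table: full IPs plus 3- or 2-octet prefixes that
# have been recorded as bots. These are example addresses only.
SPIDER_PREFIXES = {
    "192.0.2.15",   # one specific address
    "198.51.100",   # every address starting 198.51.100.
    "203.0",        # every address starting 203.0.
}


def is_known_spider(ip, table=SPIDER_PREFIXES):
    """Check the full IP first, then the first 3 octets, then the first 2."""
    octets = ip.split(".")
    if len(octets) != 4:
        return False  # not a dotted IPv4 address
    for length in (4, 3, 2):
        if ".".join(octets[:length]) in table:
            return True
    return False


if __name__ == "__main__":
    for ip in ("192.0.2.15", "192.0.2.16", "198.51.100.7", "203.0.113.9"):
        print(ip, is_known_spider(ip))
```

Joining whole octets before the lookup avoids the trap of plain string prefixes, where "203.0" would also match "203.01.x.x" style strings or "203.0" inside "203.05".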

Whatever is left in 'users' I look at, and many are immediately obviously robots: hitting just text and not the media embedded in pages, things like that (also hitting robot traps, hitting multiple pages from unrelated areas of my site, trying to get jpegs / other media without the referrer being one of my web pages...)

If they ARE bots then I just add them immediately to the spiders IP lookup for future logs, and I can rerun logs from the raw log db table if I really want to (though I rarely do). If an IP address catches your eye, you can go to the raw db table, which is also indexed by IP, and extract its previous records from before you had identified it as a robot, as far back as you keep them, to a display, text file or spreadsheet really easily, and see what behaviour that IP had.

I actually download the raw logs via a script hourly and write them to an access database table, and at the same time write out the NEW access records (the raw logs don't reset hourly) based on an IP + date unique key (usually the first chars in the raw record). The recreated users-only logs are named with date + time.
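Here is a small sketch of the "only keep the NEW records" step, using a unique key on IP + timestamp. It swaps in Python's built-in SQLite in place of the database described above, and the table name, column names and function names are made up for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the raw log table with an IP + timestamp unique key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_log ("
        " ip TEXT NOT NULL, stamp TEXT NOT NULL, line TEXT NOT NULL,"
        " PRIMARY KEY (ip, stamp))"
    )
    return conn


def store_new_records(conn, records):
    """Insert (ip, stamp, line) tuples; return only those not seen before."""
    new_rows = []
    for ip, stamp, line in records:
        cur = conn.execute(
            "INSERT OR IGNORE INTO raw_log (ip, stamp, line) VALUES (?, ?, ?)",
            (ip, stamp, line),
        )
        if cur.rowcount == 1:  # 0 means the key already existed
            new_rows.append((ip, stamp, line))
    conn.commit()
    return new_rows


if __name__ == "__main__":
    conn = open_store()
    hour_1 = [("192.0.2.10", "18/Oct/2008:10:50:00", "GET /a")]
    # The next download still contains the old record plus one new one.
    hour_2 = hour_1 + [("192.0.2.11", "18/Oct/2008:11:05:00", "GET /b")]
    print(len(store_new_records(conn, hour_1)))
    print(len(store_new_records(conn, hour_2)))
```

One caveat with this key: two requests from the same IP in the same second collapse into one row, so a real version might add the request path to the key.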

This way you know from the hourly logs how many real users and bots you had (in separate files). I think you might be surprised how many users are actually bots, but anyway, you should be able to quickly id who accessed that page and so what must be a bot.

I also have the script insert a summary line for each download showing total (new) users, bot hits, and any specials such as banned hits or 404 hit counts, to quickly id problems such as huge 'user' hit counts. You give the data a quick scan and sometimes it is obviously a scraper.

I ended up doing this because programs like awstats only show what they think is a bot from the UA or robots.txt access, which misses huge chunks of bots, and there's no way for you to add a table of IPs you know are bots to it.

I do use the raw table as a staging area: periodically I read all the raw accesses for a time period, parse out all the fields in the raw logs, then write them to a db table so you can do more useful searches, reporting, and analytics of many types if you want.

coaster01

10:17 pm on Dec 5, 2008 (gmt 0)

Google has got several different bots that register as different user agents; perhaps the OP's problem is that his tracking is only counting one of them. I've got AdSense, and Google's AdSense bot is on every single time I'm on, (and I assume it is with others as well), and I would suppose the AdSense bot would also be indexing content while it was there, no?

jimbeetle

10:59 pm on Dec 5, 2008 (gmt 0)

I would suppose the AdSense bot would also be indexing content while it was there, no?

As Ted explained above, Google has been using a shared cache since Spring 2006. In the simplest form any page crawled by any Google bot is stored in the cache, then any other bot can first check the cache before sending out a page request.
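A toy model of that idea, purely to show why a page could be indexed while the log shows no "googlebot" request. It says nothing about how Google actually implements its cache; the class name, method names and the fake fetch function are all invented for the example.

```python
class SharedCrawlCache:
    """Toy cache shared by several crawlers: the first one to ask fetches."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that really requests the page
        self._pages = {}      # url -> page content
        self.fetch_log = []   # (bot_name, url) for requests that hit the server

    def get(self, bot_name, url):
        """Return the page, requesting it from the server only on a miss."""
        if url not in self._pages:
            self.fetch_log.append((bot_name, url))
            self._pages[url] = self._fetch(url)
        return self._pages[url]


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a real HTTP request.
        return "<html>content of " + url + "</html>"

    cache = SharedCrawlCache(fake_fetch)
    cache.get("Mediapartners-Google", "/new-article")  # hits the server
    cache.get("Googlebot", "/new-article")             # served from the cache
    # The server only ever saw the AdSense crawler's request.
    print(cache.fetch_log)
```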

coaster01

1:37 am on Dec 6, 2008 (gmt 0)

OK, makes sense. So if a particular stats program isn't tracking a particular user agent identified as Google, then a site's pages could very well be indexed by Google without anything showing up in the stats identified as Google. Might just be listed as "unidentified crawler"

I see now Ted addressed that (above) but I just didn't understand what he meant by "common cache". I get it now, thanks. ;-)