Forum Moderators: Robert Charlton & goodroi
This might sound funny, or maybe I am just less informed on this, but I want to know: my 2-month-old website was indexed very quickly, and I have about 23,000 pages showing in the index. Lately, though, I am seeing very little Googlebot activity: 1-3 pages per crawl day.
Today I wrote an article, and after 2 hours I could find it via Google search, but my log is not showing any Googlebot activity. Can anyone tell me how Googlebot indexes pages? I am also pretty concerned (should I be?) about the low Googlebot activity. Earlier it used to index 300+ pages each day, then it dropped to alternate days, and now it is 1-3 pages every other day (I write 5 articles per day). Is this a matter of concern? I am really not investing in backlinks or anything, but my content is heavily interlinked within the website.
Finally, what's the average Googlebot activity like? Can you guys share yours?
There isn't really an average Googlebot activity - it depends on how good your incoming links are and how frequently your content updates.
I use a CMS with a built-in tracking module that tracks all bot activity. Earlier its log used to show lots of Googlebot activity; lately the figures are as I mentioned. Google WMT is also showing less data being downloaded from my site, and the same goes for the server log. Still, my pages show up in the index within 2 hours (maybe earlier; I just checked after 2 hours yesterday). I am just curious about this.
Also, I write 5 articles a day on average, so content is updated every 5-8 hours.
For example, your pages can end up in the main Google index if they were crawled by the Adsense bot. The user agent for the Adsense bot is Mediapartners-Google. So software that only looks for "googlebot" will miss it.
In other words, the bot will never be anonymous; that would make it too easy for cloakers ....
But cloaking checks are just that, a test - not a general crawling/indexing process.
I still suspect a problem with the tracking mechanism - it's always best to verify these things via a server log file. There are also things like the feedfetcher bot that might not appear in a script-based tracker.
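As a rough way to do that check against a raw server log, here is a minimal Python sketch that counts hits per Google user-agent token instead of looking only for "googlebot". It assumes the log is in the common "combined" format, where the user agent is the last quoted field on each line; the function name and the token list are illustrative, and the list is not a complete catalogue of Google's crawlers.

```python
from collections import Counter

# User-agent substrings to look for, matched case-insensitively.
# These are the crawlers named in this thread; the list is an assumption
# and can be extended with any other tokens you care about.
GOOGLE_UA_TOKENS = ("Googlebot", "Mediapartners-Google", "FeedFetcher-Google")


def count_google_crawlers(log_lines):
    """Count hits per Google user-agent token in combined-format log lines."""
    counts = Counter()
    for line in log_lines:
        # Assumption: the user agent is the last double-quoted field,
        # so splitting twice from the right isolates it.
        parts = line.rstrip("\n").rsplit('"', 2)
        if len(parts) < 3:
            continue  # line does not look like combined log format
        user_agent = parts[1].lower()
        for token in GOOGLE_UA_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts
```

To use it, pass an open log file (for example `count_google_crawlers(open("access.log"))`, with whatever path your server writes to). A user-agent string can be faked, so this only tells you what the visitor claimed to be.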
But if pages can get indexed without a bot visit, what's the point of link SEO and backlinks? I have 23,000+ pages showing in the index with very good positions, with just 16 backlinks to my site.
If the same trend continues for 1 more month, I don't think many backlinks are required at all.
I filter out robots based on a lookup table built from accumulated history. First I look up the complete IP address, which I parse out of the raw record (it is usually at the beginning). Then, if all 4 octets are not specifically in the spiders IP lookup table, I try the first 3 octets and then the first 2.
This makes it easy to record large blocks as bots. I was going to use exact ranges, but that would require a begin and end IP for each bot, which would complicate the lookup function.
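The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the poster's actual code: the function name, the table contents, and the choice of a plain set are all assumptions, and the addresses are documentation-only example ranges.

```python
def is_known_bot(ip, bot_table):
    """Return True if the full IP, its first 3 octets, or its first 2 octets
    appear in bot_table (a set of dotted strings such as "198.51.100")."""
    octets = ip.split(".")
    if len(octets) != 4:
        return False  # not a dotted IPv4 address
    # Most specific match first: 4 octets, then 3, then 2.
    for n in (4, 3, 2):
        if ".".join(octets[:n]) in bot_table:
            return True
    return False


# Hypothetical table: one exact address, one 3-octet block, one 2-octet block.
spider_ips = {"192.0.2.10", "198.51.100", "203.0"}
print(is_known_bot("198.51.100.77", spider_ips))  # matched by the 3-octet entry
```

Because whole octets are compared, an entry such as "203.0" covers 203.0.x.x but not 203.01.x.x or 203.1.x.x, and adding a newly spotted bot is just one more string in the set.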
Whatever is left in 'users' I look at, and many are immediately obviously robots: hitting just text and not the media embedded in pages, things like that (also hitting robot traps, hitting multiple pages from unrelated areas of my site, trying to get JPEGs or other media without the referrer being one of my web pages...).
If they ARE bots, I just add them immediately to the spiders IP lookup for future logs, and I can rerun logs from the raw log db table if I really want to (though I rarely do). If something catches your eye, such as an IP address, you can go to the raw db table, which is also indexed by IP, and extract that IP's previous records from before you had identified it as a robot, as far back as you keep them. It is really easy to send them to a display, text file, or spreadsheet and see what behaviour that IP had.
I actually download the raw logs via a script hourly and write them to an access database table, and at the same time write out the NEW access records (the raw logs don't reset hourly) based on an IP + date unique key (usually the first characters in the raw record). The recreated users-only logs are named with date + time.
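One way to pick out only the NEW records from a log that does not reset between downloads is to let the database enforce the unique key. The sketch below uses Python's built-in sqlite3 module as a stand-in for whatever database is actually in use; the table name, column names, and function names are made up for illustration, and it assumes common-log-format lines with the IP first and the timestamp in square brackets.

```python
import sqlite3


def parse_key(line):
    """Extract (ip, timestamp) from a common-log-format line."""
    ip = line.split(" ", 1)[0]
    ts = line[line.index("[") + 1 : line.index("]")]
    return ip, ts


def load_new_records(conn, log_lines):
    """Store lines in the raw table and return only those not seen before."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_log ("
        "ip TEXT, ts TEXT, line TEXT, PRIMARY KEY (ip, ts))"
    )
    new_lines = []
    for line in log_lines:
        try:
            ip, ts = parse_key(line)
        except ValueError:
            continue  # no [timestamp] found; skip malformed line
        before = conn.total_changes
        # OR IGNORE silently skips rows whose (ip, ts) key already exists.
        conn.execute(
            "INSERT OR IGNORE INTO raw_log VALUES (?, ?, ?)", (ip, ts, line)
        )
        if conn.total_changes > before:
            new_lines.append(line)
    conn.commit()
    return new_lines
```

Note that a key of IP plus a one-second timestamp treats several hits from the same IP within the same second as one record, so a real setup might want to add the request path to the key.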
This way you know from the hourly logs how many real users and bots there were (in separate files). I think you might be surprised how many 'users' are actually bots, but anyway, you should be able to quickly identify who accessed that page, and so what must be a bot.
I also have the script insert a summary line for each download showing total (new) users, bot hits, and any specials such as banned hits or 404 hit counts, to quickly identify problems. With huge 'user' hit counts, for example, a quick scan of the data sometimes shows it is obviously a scraper.
I ended up doing this because programs like AWStats only show what they think is a bot from the UA or robots.txt access, which misses huge chunks of bots, and there's no way for you to add a table of IPs you know are bots to it.
I also use the raw table as a staging area: periodically I read all the raw accesses for a time period, parse out all the fields in the raw logs, and write them to a db table so you can do more useful searches, reporting, and analytics of many types if you want.
I would suppose the AdSense bot would also be indexing content while it was there, no?
I see now Ted addressed that (above) but I just didn't understand what he meant by "common cache". I get it now, thanks. ;-)