
Search Engine Spider and User Agent Identification Forum

    
Googlebot / image bot causing probs?
sometimes hitting 404 pg, sometimes skips
Megaclinium - 7:00 am on Jul 9, 2008 (gmt 0)

Some discussions here mention 66.249.67.xx with a UA of Googlebot-Image/1.0, possibly acting as a prefetch accelerator?

Anyway, I have leech protection turned on, so any bot can access some file types such as .jpg, but only if the referring page is on my domain.

This causes it sometimes to get 404 errors, as it is not going through the web page but requesting the .jpg directly (the request returns a 302).

My question is: sometimes it gets the 404 error within a second, in an additional record right after it in the log.

At other times it doesn't generate the 404 error (but it still gets the 302 with a small byte count, so Googlebot didn't get its image).

Any idea why?
The ones it seems not to get a 404 on are the larger images in the sept directory, and it gets the 404 for the smaller images in the thumbnail directory. I banned it for a while on the general irritation principle of making me get up and look at 404 errors. Is there a reason for this bot, and will banning it keep me out of Google's indexes?

 

wilderness - 1:15 pm on Jul 9, 2008 (gmt 0)

It seems that your "leech turned on" is generating a redirect to a replacement file (via a 302).
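For context, a control panel's "leech protect" option typically generates a mod_rewrite rule along these lines. This is only a rough sketch, not Megaclinium's actual configuration; the domain and the substitute image are placeholders.

# Rough sketch of a CP-generated "leech protect" rule (placeholders throughout)
RewriteEngine On
# Don't re-block the substitute image itself
RewriteCond %{REQUEST_URI} !/hotlink-blocked\.gif$
# Only requests referred from your own pages pass; example.com is a placeholder
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
# Everything else (including requests with no referrer, such as bots)
# is redirected - 302 by default - to a substitute image
RewriteRule \.(jpe?g|gif|png)$ http://www.example.com/hotlink-blocked.gif [R,L]

Whether a 404 then shows up in the log would depend on whether the bot bothers to follow that redirect and whether the target actually exists, which would be consistent with the sometimes-yes, sometimes-no pattern Megaclinium describes.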

This is really not an effective presentation for the major SEs.
In fact, presenting alternative images to any visitor (i.e., a leech) as opposed to a straightforward access-forbidden (403) is a bad practice.

You need to modify the CP-created htaccess to an effective resolution for both standard visitors and the major SEs.
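A minimal sketch of that kind of change, assuming the host allows manual .htaccess editing and mod_rewrite is available; the domain is a placeholder, and the Googlebot-Image exemption is only one possible policy, not something wilderness specified.

RewriteEngine On
# Only requests referred from your own pages pass; example.com is a placeholder
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
# Optionally let Googlebot-Image through so your images can stay indexed
# (an assumption for illustration, not part of the advice above)
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
# Answer everything else with a straightforward 403 instead of a redirect
RewriteRule \.(jpe?g|gif|png)$ - [F,L]

With [F] the blocked request gets a single clean 403 entry in the log, rather than a 302 plus a possible follow-up request.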

Megaclinium - 12:32 am on Jul 10, 2008 (gmt 0)

Thanks, Wilderness

I don't think I have access to the .htaccess file. I'm on a shared server; does that make a difference? It seems from discussions that some hosting services do allow this. I got an all-you-can-eat deal at a low yearly price and I took them up on it :) Though now I am tempted to switch, as I've seen some good ideas in these discussions, if that's what it takes.

And why would it sometimes create a 404 and other times not?

I don't want people linking directly to my images, since I have web pages for them, and it helps me ID scraper bots too if they try to grab JPEGs without being referred from my pages. Not too many are good enough to follow links and get the files that way.

wilderness - 12:47 pm on Jul 10, 2008 (gmt 0)

I don't think I have access to the .htaccess file. I'm on a shared server. Does that make a difference? It seems from discussions that some hosting services do allow this.

You really need to poke around on your host/server site and see what's available.
Most likely, if the CP (Control Panel) offers the "leech" feature, then it also offers manual editing of your htaccess.
In my tenth year of using Apache shared hosting, I've never seen a provider that did not allow htaccess editing.
It could be different if you're using an IIS server.

In addition, and in another thread, you noted that raw access logs were not available to you?
I would poke around until you locate them.

phred - 1:52 pm on Jul 10, 2008 (gmt 0)

In addition, and in another thread, you noted that raw access logs were not available to you?
I would poke around until you locate them.

I'm in the same boat - no raw logs. I've had a pretty good poke around with both FTP and the CP file manager without any luck. Can you give any additional hints? The hosting company I'm using is also a big name in domains...

Phred

wilderness - 2:09 pm on Jul 10, 2008 (gmt 0)

Phred,
In March I switched the hosting on one of my domains to "El Cheapo" ;)

Although the initial transition and the corrections required in my htaccess file were a nightmare, the host, despite some occasional live problems, is OK for me.

This host uses a service for support, and although the support response is impressive, the language barriers create some real issues on each and every support request.
Each tech requires different terminology.

This host and others offer multiple-domain hosting at a cheap price. I've never seen the other hosts' CPs; however, this one does not offer FTP access to raw logs (which is unheard of for me).

In addition, the CP REQUIRES Java turned on for full functionality.

The Raw logs are located under CP:
1) Live Stats (pulldown)
2) Access & Error Logs

The host assured me, prior to the expiration of my 30-day money-back guarantee, that they offer "rotating logs"; however, unless they rotate on a monthly basis, I haven't seen the daily rotating logs I had previously grown accustomed to.
Were I to let my daily logs accumulate for a month, checking any access would be a nightmare (as would the bandwidth use).
Thus, I manually delete and restart the logs every few days, losing a few hours' accumulation in the process.

Megaclinium - 7:17 pm on Jul 10, 2008 (gmt 0)

Thanks; I've tried to find the .htaccess file, and there is nothing under that name in any of the root directories (of the shared server).

I DO have access to the raw logs.
I've had the ISP do really stupid stuff like lose the monthly archived raw access logs and then tell me that I should use the canned control-panel stats. These include robots, which I want to exclude when looking for visitors. Analog is totally useless, as it lists so many URLs (I have a humongous site) that it crashes the browser on most months. Webalizer works but is limited. They just added AWStats, and it is a bit better at excluding robots, but it is only based on the UA having a web page URL in it, not on behavior.

And now they end up resetting the raw logs DAILY!? I really should ask them about this. It used to be monthly.

However, I have a script going in the background that grabs the raw log hourly and separates out the bots I know about (via a lookup table of either the full IP address or, if that is not found, the first three octets of the IP address).

Then the raw entries are written as-is (skipping duplicate writes) to either the visitors or the bots database table.

From there it is simply a query to look up visitors, and if a new bot appears I just add it to the lookup table so it doesn't show up as a visitor again. (I put myself in the bots table as a bot named 'me', or 'me2' for my different address at work, so that when we visit our own site it doesn't show up in the visitor counts.)
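Something like the following PHP gives the flavor of that approach. It is only a sketch under assumptions: the file names, the lookup list, and writing to flat files instead of database tables are all placeholders, not Megaclinium's actual script.

<?php
// Known bots: either an exact IP or the first three octets (a /24 prefix).
// These entries are examples only.
$knownBots = array('66.249.67.10', '66.249.67');

$log      = fopen('access_log', 'r');      // raw log pulled from the host
$bots     = fopen('bots.log', 'a');        // stand-in for the bots table
$visitors = fopen('visitors.log', 'a');    // stand-in for the visitors table

while (($line = fgets($log)) !== false) {
    // Common/combined log format puts the client IP first on each line.
    $ip     = strtok($line, ' ');
    $octets = explode('.', $ip);
    $prefix = implode('.', array_slice($octets, 0, 3));

    if (in_array($ip, $knownBots) || in_array($prefix, $knownBots)) {
        fwrite($bots, $line);
    } else {
        fwrite($visitors, $line);
    }
}

fclose($log);
fclose($bots);
fclose($visitors);
?>

Skipping duplicate writes and the 'me'/'me2' entries would just be more rows in the lookup list plus a check against lines already stored.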

Megaclinium - 9:11 pm on Jul 10, 2008 (gmt 0)

Something I didn't realize: being a bot, it is up to the bot to determine whether it actually wants to request the 404 page, the same way it can skip requesting graphics and just grab the HTML. It appears that sometimes it wants the 404 page and sometimes it doesn't.

wilderness - 10:10 pm on Jul 10, 2008 (gmt 0)

1) If the page doesn't exist, the bot has no choice.

2) With properly implemented custom error pages (server-side), the bot has no choice either.
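For point 2, custom error pages are normally wired up server-side with ErrorDocument directives in .htaccess, something like the sketch below; the paths are placeholders.

# Serve a custom page whenever the server answers 404 or 403
ErrorDocument 404 /errors/not-found.html
ErrorDocument 403 /errors/forbidden.html

The status code is returned on the original request; the error page body is simply whatever the server sends back with it, so the bot cannot opt out of the 404 itself.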

g1smd - 10:19 pm on Jul 10, 2008 (gmt 0)

If you use PHP, you can get that to write a log of pages requested for you.
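As an illustration of that suggestion, a few lines like these near the top of each PHP page will keep your own request log; the log path is a placeholder and this is only one way to do it.

<?php
// Append one line per request: time, client IP, URI, user-agent, referrer.
$entry = sprintf("%s %s \"%s\" \"%s\" \"%s\"\n",
    date('d/M/Y:H:i:s O'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['REQUEST_URI'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
    isset($_SERVER['HTTP_REFERER'])    ? $_SERVER['HTTP_REFERER']    : '-');
file_put_contents('/path/to/requests.log', $entry, FILE_APPEND | LOCK_EX);
?>

Since this only runs inside PHP pages, image requests still won't show up in it; it complements rather than replaces the raw access log.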

Google does use some of the 66.249.64.0 - 66.249.95.255 range for its own bots.

A bot will not have a referrer, because it does not jump from page to page following links. It spiders links from an internal list of what to spider, with the link references collected hours to days earlier on a previous run.

wilderness - 10:28 pm on Jul 10, 2008 (gmt 0)

(of shared server)

Could you possibly expand on this, for clarification that we are on the same channel?

1) Are you utilizing paid web hosting which functions under a shared server hosting plan [en.wikipedia.org]?

2) Or perhaps some alternative/misconfigured plan offered to you via a friend, or perhaps a free host?

I've never seen a shared hosting server which did not offer use of htaccess, although I recall one of my hosts was without a file in place when my account was new. Creating one was no real issue; however, I'm more prone to believe that you personally are simply unable to locate the file.

Rotating logs are generally dumped into accumulating ZIP files.

The Windows version of Analog offers a config file for customized stats (the shared host's stats are not specifically customized for you or your sites, but rather for the server's sites as a group), and as a result there is simply no need to tear your daily logs apart from their original format, throwing a monkey wrench (so to speak) into the mix.
However, "to each his own".

My effort in providing these explanations is in the hope that your experiences at WebmasterWorld, and with the others who may assist you, will require less communication in the future.

g1smd - 11:26 pm on Jul 10, 2008 (gmt 0)

Most FTP applications are set up to "hide filenames that begin with a dot" by default.

Make sure you cancel that option in your FTP program, before you look round your site.

Err, that was meant for a thread talking about finding the .htaccess file on the server.

incrediBILL - 11:52 pm on Jul 10, 2008 (gmt 0)

A bot will not have a referrer

Legit bots don't, but fakers trying to slip by simplistic anti-scraper rules do include a referrer. Typically it's stupid and just uses your index page as the referrer on every hit, even for pages that aren't linked from the index page, but some really do follow the flow of the site.

wilderness - 12:19 am on Jul 11, 2008 (gmt 0)

Most FTP applications are set up to "hide filenames that begin with a dot" by default.

g1smd,
My "El Cheapo" host explained to me that under their CP file format, FTP access to log was not possible, without offering open http access to anybody to same logs.

Had I never seen FTP logs made available by a host above my own root folder, I might have accepted the explanation. Instead, I just shrugged and proclaimed to myself, "CHEAP" ;)

Don
