Forum Moderators: Robert Charlton & goodroi

Google Spidering / Indexing Questions


doughayman

11:28 am on May 14, 2008 (gmt 0)

10+ Year Member



To all,

Sorry if these are dumb or repetitive questions, but I was unable to answer them to my satisfaction by looking at old posts.

Please don't ask why (there are reasons), but I am running an "ancient" webserver - O'Reilly & Associates WebSite V1.1h (copyright 1997, and no longer supported). It speaks HTTP/1.0 only (not 1.1). It appears, from my weblogs, that certain files on my server get spidered at least once a week, even though they have not changed. And when this occurs, my rankings tank for various keywords for this page, for a day or 2, until the "new" page gets indexed properly, at which time my rankings seem to return. This is repeatable, and I'm seeing it over and over again.

My questions are as follows:

1) When Googlebot visits my site, and accesses a file on my server, and the return status is "200", does this imply that Googlebot thinks that this file has changed, since its last fetch?

OR

Does this imply only that Google has fetched it, and that Google will subsequently determine, on its end, whether the file has changed since its last fetch?

2) I have read about Last-Modified and Expires headers on WebmasterWorld, but given that my webserver only speaks HTTP/1.0, it does not support these features. Aside from moving to a different webserver, can anyone suggest a mechanism I could employ that would keep Google (or any other search engine) from retrieving files that have not changed since the last Googlebot fetch? Unfortunately, there are no webserver configuration options that support this.

3) One other possible anomaly - I have a WebSite V1.1h logging feature enabled that generates extended log records, which is useful. However, I noticed that this log format timestamps in GMT, whereas my server machine's file timestamps are in local Eastern time. Could this 4-hour time differential between GMT and my local time possibly be causing Google to think that a file has changed when in fact it has not? I do have other logging options that timestamp webserver accesses in local time, which would put things in sync with my local server time.

I apologize in advance if these sound like ancient or stupid questions, but I'm just trying to pick the board's brains!

Thanks in advance !

Doug

[edited by: engine at 11:59 am (utc) on May 14, 2008]
[edit reason] formatting [/edit]

tedster

7:52 pm on May 14, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) The 200 status means only that your server found the requested resource and served it to googlebot. It implies nothing about the freshness of the file.

2) I'd say just make sure that your server puts an accurate "last modified" stamp on your files.

3) It's certainly worth the experiment to set your timestamp to GMT.

As you well know, the problem here is a server-specific technical problem and not really a Google question - and we don't have a forum here for your particular legacy server. Even if the server is no longer supported, I assume you have access to some documentation - and that's probably your best resource.

From this HTTP 1.0 documentation at the W3C [w3.org], the 304 response should have been available to servers since 1996, even under HTTP 1.0. However, that doesn't mean your server actually implements it.
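
If you want to test that directly, a minimal sketch along these lines might help - the URL is a placeholder, and Python is assumed only as a convenient client to run from any machine that can reach the server:

# Minimal sketch: check whether a server honours conditional GETs with a
# 304 Not Modified response. The URL below is a placeholder.
import urllib.error
import urllib.request

URL = "http://www.example.com/Page.htm"  # hypothetical page

# First request: note the Last-Modified header, if the server sends one.
with urllib.request.urlopen(URL) as resp:
    last_modified = resp.headers.get("Last-Modified")
    print("Status:", resp.status, "| Last-Modified:", last_modified)

# Second request: ask for the page only if it changed since that date.
if last_modified:
    req = urllib.request.Request(URL, headers={"If-Modified-Since": last_modified})
    try:
        with urllib.request.urlopen(req) as resp:
            print("Server re-sent the full page, status:", resp.status)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("Server answered 304 Not Modified - conditional GETs work.")
        else:
            raise
else:
    print("No Last-Modified header - crawlers have nothing to condition on.")

If the second request comes back 304, the server already supports conditional requests and a crawler sending If-Modified-Since can skip unchanged files; if it re-sends the full page, the server simply ignores the condition.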

Receptional Andy

8:00 pm on May 14, 2008 (gmt 0)



certain files on my server get spidered at least once a week, even though they have not changed

Google has to spider your file to see if it has changed, since HTTP headers revealing modification dates are not widely implemented. It looks like Google expects your file to change more frequently than it actually does.
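
As a quick way to see what a crawler actually has to work with, a short sketch like this lists the cache-related headers the server sends - the URL is a placeholder, and Python is just one convenient way to check (any header-inspection tool would do):

# Sketch: list the cache-related response headers a server sends,
# so you can see what a crawler has to work with. URL is a placeholder.
import urllib.request

URL = "http://www.example.com/Page.htm"  # hypothetical page

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    for name in ("Last-Modified", "Expires", "ETag", "Cache-Control"):
        print(name + ":", resp.headers.get(name, "(not sent)"))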

certain files on my server get spidered at least once a week, even though they have not changed. And when this occurs, my rankings tank for various keywords for this page

An interesting effect, but can you clarify how you are measuring this? It's clear that Google wants updated results as frequently as possible, but to downgrade a page solely because it hasn't been updated would have to be considered a bug.

Do the keywords you target suggest a topic that would be more relevant if the results were newer?

doughayman

11:55 pm on May 14, 2008 (gmt 0)

10+ Year Member



Thanks for the responses, Tedster and Andy.

Andy, my traffic to a certain page (let's call it my money page, on the affected site) drastically decreases for 1-2 days, and the drop coincides with that file being spidered by Google. Establishing this has been a laborious effort of measuring traffic to my site and mapping it to weblog activity.

No, the keywords I target don't suggest a topic that would be more relevant, if results were fresher. That is not an issue here.

Receptional Andy

12:04 am on May 15, 2008 (gmt 0)



a laborious effort of measuring traffic to my site, in concert with mapping it to weblog activity

I don't mean to add to your labour, but what kind of sample size is involved? Spidering and ranking are related, but most often there is no cause and effect relationship.

doughayman

12:32 am on May 15, 2008 (gmt 0)

10+ Year Member



Andy, I eyeballed this for a period of several months, and then did a formal analysis over the same period (2-3 months). I correlated the declines in traffic directly to the 1-2 days after my "money page" was spidered. Without exception, my hypothesis was (unfortunately) confirmed. It is still occurring as we speak.

Tedster, as an experiment I have changed the log files to use my local server time, although I am not holding out much hope for this.

Receptional Andy

12:40 am on May 15, 2008 (gmt 0)



How willing are you to experiment with this page? Or is it an instant result you're after?

Your server software is clearly a major restriction, but there are a few things you might try, depending on the balance between spidering and performance. Personally, I think there's a lot to learn from fringe cases like this.

doughayman

12:54 am on May 15, 2008 (gmt 0)

10+ Year Member



LOL, I am always willing to experiment, although radical changes may be outside my limits. I make my living predominantly from my affiliate marketing business, and I can never put myself at total risk. If you have any suggestions that you would like to share, Andy, I am all ears.

Thanks,

Doug

Receptional Andy

1:00 am on May 15, 2008 (gmt 0)



To throw it back to you a little bit, what theory would you like to test? For instance, if you think there is a causal relationship between spidering and ranking, you have direct influence over spidering behaviour. That's an easy one to look at statistically.
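
For what it's worth, a rough sketch of that kind of check against the raw access log might look like the following - the log file name, log format and page path are all assumptions, so adjust them to whatever your server actually writes:

# Sketch: line up Googlebot fetches of the money page with daily traffic to it,
# to see whether drops really follow spider visits. Log format, file name and
# page path are assumptions - adjust to the real logs.
import re
from collections import Counter
from datetime import datetime, timedelta

ACCESS_LOG = "access.log"   # hypothetical log file in common/combined format
MONEY_PAGE = "/Page.htm"    # hypothetical page

googlebot_days = set()
daily_hits = Counter()

with open(ACCESS_LOG) as fh:
    for line in fh:
        m = re.search(r"\[(\d{2}/\w{3}/\d{4})", line)
        if not m or MONEY_PAGE not in line:
            continue
        day = datetime.strptime(m.group(1), "%d/%b/%Y").date()
        daily_hits[day] += 1
        if "Googlebot" in line:
            googlebot_days.add(day)

# Compare average traffic on the 1-2 days after a Googlebot fetch with all other days.
after_spider, other_days = [], []
for day, hits in sorted(daily_hits.items()):
    recently_spidered = any((day - timedelta(days=d)) in googlebot_days for d in (1, 2))
    (after_spider if recently_spidered else other_days).append(hits)

if after_spider and other_days:
    print("Avg hits 1-2 days after a Googlebot fetch:", sum(after_spider) / len(after_spider))
    print("Avg hits on all other days:               ", sum(other_days) / len(other_days))

If the "after a Googlebot fetch" average is consistently lower than the other days across a few months of data, that's at least a measurable pattern to work from rather than an impression.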

doughayman

1:07 am on May 15, 2008 (gmt 0)

10+ Year Member



Are you talking about an outright block (perhaps a robots.txt Disallow)? I am afraid of the potential longer-term repercussions of doing that.

doughayman

1:19 am on May 23, 2008 (gmt 0)

10+ Year Member



Going back to my original problem (1st post of this thread), what are the ramifications of my doing a:

Disallow: /

OR

Disallow: /Page.htm

as a mechanism for preventing Google (or any other spider) from spidering my site (or that particular page)?

Can I use this strategy to shut down the indexing of my site, and in turn open it back up when I actually make changes that I want spidered?

What are the ramifications? Will disallowing the spidering of my site (or designated files) cause any of the following to occur:

a) Reduced interest from Google in my site, and less frequent automatic spidering visits?

b) In the case where I am preventing the spidering of a file, will Google consider de-indexing that file?

c) Will Google consider my site (or the excluded pages) to be of less importance, which may result in decreased rankings for a given keyword phrase in relation to this site (or the excluded pages)?

As stated earlier, this problem ALWAYS occurs whenever these pages get spidered by Googlebot.

doughayman

9:00 pm on May 26, 2008 (gmt 0)

10+ Year Member



All I know is that I've gone from brown hair to all grey in the last 2 months... there's got to be a better way!

tedster

9:43 pm on May 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Doug, have you scrutinized Google's Help pages about robots.txt [google.com]? I'd look there first for the "official" word.

Can I use this strategy to shut down the indexing of my site, and in turn, open it back up, when I actually make changes that I want spidered ?

Yes, but I'd say it's not a good idea to go back and forth a lot. Better to develop changes in a test environment and then make them live when you are satisfied with what you've got.

A robots.txt disallow rule can result in those disallowed URLs being dropped from the Google index, or they may be shown as "URL-only". If googlebot can't spider a page, then that page may still be ranked, but only according to backlink influences. The content of that page can no longer be scored for ranking purposes.
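
If you do experiment with a disallow rule, one way to sanity-check exactly what it blocks for Googlebot before relying on it is the standard-library robots.txt parser - the site URL and paths below are placeholders:

# Sketch: check what a given robots.txt actually blocks for Googlebot.
# The site URL and paths are placeholders.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # hypothetical site
rp.read()

for path in ("/", "/Page.htm", "/some-other-page.htm"):
    allowed = rp.can_fetch("Googlebot", "http://www.example.com" + path)
    print("Googlebot may fetch", path + ":", allowed)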

my rankings tank for various keywords for this page, for a day or 2, until the "new" page gets indexed properly, at which time my rankings seem to return.

A robots.txt disallow is likely to cause a much longer drop than one or two days - probably a drop for as long as spidering is disallowed.

I still can't easily see how spidering a page would always result in a short-term rankings drop - but Google can be a very complex critter. It may not be direct cause and effect, but just a related phenomenon. Did you change the timestamp to GMT?

A suggestion - you might want to study the SERPs that are there when your traffic tanks. Look for content copied from your site, for one thing. Also get VERY familiar with urls that do have stable rankings, their backlinking, content, coding - all of it. You never know what might jump out. Another idea - have you checked your page for any major html problems? How about checking any feedback from Webmaster Tools?

doughayman

11:01 pm on May 26, 2008 (gmt 0)

10+ Year Member



Ted,

As always, thank you for your input, wisdom, and patience.

Yes, I am intimate with the robots.txt doc. I concur that it might not be the best idea to use robots.txt to regulate my spidering via a "faucet" approach. I'm afraid of the long-term repercussions of using robots.txt to disable access to my site, as you suggest.

Yes, I changed my weblog timestamps - they are now posted in local Eastern time, which matches my server's system clock and file timestamps (formerly I was logging in GMT). This change seems to have had no effect at all.

Very little, if any, content has been copied from my site, although I'm not sure Copyscape gives you the full result set of copied pages any more. I think they now give a teaser, in the hope that you will subscribe to the full service. Even if I find copied segments of my site elsewhere, my past experience is that reporting these pirates rarely achieves the desired result.

Looking at other sites that rank well is probably a good idea, and maybe something will click as a result.

My site is pretty clean (and always has been) - changes have been minimal, and are always content changes in nature. I don't think major HTML problems are the issue (but who knows?). Also, I'm very clean in the eyes of Webmaster Tools. The only thing it "complains" about is that a couple of my internal pages have short meta description tags. No big deal there, and ironically, the pages it complains about in that regard are the internal pages that have PR - the ones it doesn't complain about have "grey-bar" PR, which seems to be symptomatic of many members' sites these days, as I've read on other threads.

Regards,

Doug