I always have strange questions:
Example: www.sample.com/testpage.htm
It has already been crawled and cached by the search engines,
and they come back regularly to check for updates.
Here is the situation:
testpage.htm is regenerated regularly from 3 data sources. One day, 1 data source has a temporary problem, but I still generate the page from the other two sources to serve normal visitors.
Here is the question: when a search engine robot comes, if I do nothing, it will crawl the page and update its copy. So I want to know what I can do to tell the search engine that this page is temporarily half-baked and that it should not update its database.
Any ideas? Thanks
If I were doing this, I would add a couple of <meta> tags to the <head> section of the incomplete page, and just leave them out when the page is complete and I don't mind clients' caches being updated:
<meta name="robots" content="noarchive" />
<meta http-equiv="Pragma" content="no-cache" />
If I could make /testpage.htm dynamic, by using a custom error document in IIS or AddHandler in Apache, I would use server-level signaling (HTTP response headers) to tell browsers not to cache the material.
How dynamic is your process for assembling /testpage.htm?
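For what it's worth, here is a rough Python sketch of what that server-level signaling could look like once the page is dynamic. The function name cache_headers and the page_complete flag are made up for illustration. X-Robots-Tag is the HTTP header form of the robots meta tag; whether a particular search engine honors it is something to check in that engine's own documentation.

```python
from typing import Dict


def cache_headers(page_complete: bool) -> Dict[str, str]:
    """Return extra HTTP response headers for the page (hypothetical helper)."""
    if page_complete:
        # Complete page: no special signaling needed.
        return {}
    return {
        # Ask HTTP/1.1 caches to revalidate before reusing this response.
        "Cache-Control": "no-cache",
        # Older HTTP/1.0 equivalent, same intent as the Pragma meta tag.
        "Pragma": "no-cache",
        # Header form of <meta name="robots" content="noarchive" />.
        "X-Robots-Tag": "noarchive",
    }
```

The returned dictionary would be merged into whatever response headers your script already sends, so the incomplete page carries the signals and the complete page does not.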
By the way, if the page is dynamic, what HTTP status code would you recommend? 304 Not Modified if the request includes If-Modified-Since (HTTP_IF_MODIFIED_SINCE), or 503 Service Unavailable with Retry-After? Or some other idea?
Thanks
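To make the two options in that question concrete, here is a minimal Python sketch of the decision as asked, not a recommendation. The function name choose_status, the page_complete flag, and the one-hour Retry-After default are assumptions for illustration. It is meant for crawler requests only, since sending a 503 to normal visitors would hide the page from them; how you recognize a crawler is not shown.

```python
from typing import Dict, Optional, Tuple


def choose_status(
    page_complete: bool,
    if_modified_since: Optional[str] = None,
    retry_after_seconds: int = 3600,  # assumed value, tune to your rebuild interval
) -> Tuple[int, Dict[str, str]]:
    """Pick an HTTP status and extra headers for a crawler request."""
    if page_complete:
        # Normal case: serve the freshly generated page.
        return 200, {}
    if if_modified_since:
        # Conditional request while the page is half-baked:
        # tell the crawler its existing copy is still current.
        return 304, {}
    # Unconditional request while half-baked: ask the crawler to come back later.
    return 503, {"Retry-After": str(retry_after_seconds)}
```

With these assumptions, choose_status(False, "Sat, 01 Jan 2022 00:00:00 GMT") gives (304, {}) and choose_status(False) gives (503, {"Retry-After": "3600"}).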
To get it deleted from their database you can mark your page as 'noindex' in the 'robots' meta tag, but you still have to go to Google Webmaster Tools and request manual removal of the page from their index anyway. That removal has a tendency to last 6 months: even if you request removal, wait until the page is removed and then request re-inclusion immediately, it was 6 months the last time I heard from anyone. It's slightly more substantial to disallow the page in robots.txt, but you may still have to request removal if it is already indexed.
I think you should put <meta name="robots" content="noarchive" /> in the <head> section permanently; it is pointless letting the SEs cache this constantly changing page anyway. That said, using this directive on the incomplete page alone doesn't seem unreasonable to me.
If you are running such a page and you've found that an older page is losing positions in the SERPs for keywords it has always done well for, it might be worth your while to mark the constantly changing page 'noindex' as well (maybe even disallow it in robots.txt). It depends on whether you particularly want or need it indexed by them.
What is the change frequency of this page, Eric?
Thinking a little more about it, why can't you just use 'the last known usable update' from each of your three sources and never have an incomplete or broken page at all?
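A minimal Python sketch of that 'last known usable update' idea, assuming each data source can be wrapped in a function that either returns its data or raises an exception when the source is down. The class name LastKnownGood and the loader callables are hypothetical, and the cache here lives only in memory; a real setup would probably persist it to disk or a database.

```python
from typing import Callable, Dict, Optional


class LastKnownGood:
    """Remember the most recent successful result from each data source."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    def fetch(self, name: str, loader: Callable[[], str]) -> Optional[str]:
        try:
            data = loader()
        except Exception:
            # Source is down: fall back to the previous good result, if any.
            return self._cache.get(name)
        # Source is healthy: remember this result for next time.
        self._cache[name] = data
        return data
```

The page generator would call fetch once per source on every rebuild. As long as a source has succeeded at least once, the page always gets three usable pieces, so the robot never sees a half-baked version.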