Forum Moderators: Robert Charlton & goodroi


Once hijacked/penalized site now getting many deep crawls

How long til titles/descriptions appear?


crobb305

9:08 pm on Apr 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, my once penalized/hijacked website has started getting hit hard by Gbot in the past week and a half. In 10 days, I have had deep crawls (100% of pages spidered) on 4 separate days, including last night. Prior to my reinclusion request 2 weeks ago, only robots.txt and the index page were spidered for months.

My pages are still listed as URL only, so does anyone know what the deep crawls indicate, and when titles/descriptions may appear in the SERPs?

Thanks!

Panacea

11:37 am on May 2, 2005 (gmt 0)



crobb305,

did you use the re-inclusion form that GoogleGuy said to use: [google.com...]

TheET

12:28 pm on May 2, 2005 (gmt 0)

10+ Year Member



crobb305, I am in the same position as you. No titles and descriptions etc. after the 302s and the non-www problem.

From your earlier posts I think I submitted a reinclusion request on the same day as you, and I started getting deep crawled 1 day after the reinclusion request (previously only 1 or 2 pages had been crawled over the previous 30 days). This was on the 22nd, then the 23rd and so on (the last was on the 30th).

However, no titles and descriptions have come back for that site, and there are still many pages missing even though they have been crawled at least 3 times.

Another site of mine lost all titles and descriptions around the same time as that site. The index page came back after being re-written (2 weeks later) but then disappeared again after the reinclusion request, which is a bit of a surprise, as some titles and descriptions of sub-pages came back 2 days after that reinclusion request and ranked (not as well as before, but they're all there). Further happenings (after this) included my home page showing at number 10 (and number 5 on 1 datacenter) for the site's name (which is keyword-1-2-widget), but it showed URL only. The index page is still missing for that site, but I have tried re-writing 1 or 2 paragraphs again.

I did have a similar experience a month earlier with the first site I mentioned. The rankings started to decline 3 weeks before a major update for no reason, so I changed the content drastically on the home page (it had been stale for ages). Then, after the next update, my rankings shot back to above my previous position. I took it at the time that the update had cleaned out the problem and that the changes had been beneficial.

This time around I didn't make changes until 1 week later, when the site had vanished to URL only. (Too late.)

From what I can sum up from my experience, and what others have said, there seems to be no order to it. So I don't think anyone can give a set time as to when and where Rome will rise again.

I was thinking that maybe a major update will be more helpful, but who knows.

lufc1955

8:11 pm on May 2, 2005 (gmt 0)

10+ Year Member



Crobb305

I would say that your web site will be reincluded in the index at the next update. That's what should happen after being deep crawled.

Can I ask how long you were banned from Google, and for doing what?

crobb305

9:28 pm on May 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, the "deep crawls" started one day after my reinclusion request. The reason I am so curious about the significance of this is that until 2 months ago I had different stats-tracking software, and I could not see robot activity. So I have a very limited dataset. It is quite possible that Gbot has always "deep crawled" my site prior to the next update and still kept my listings excluded/URL only. From the data I have, I can say that until last week, only robots.txt and the index page were requested by Gbot over a two-month period. Then, suddenly, deep crawls on 4 separate days after the reinclusion request. The Gbot activity has decreased again. No hits in two days.

So, is this behavior normal? Did my reinclusion request get accepted? My site remains indexed with internal pages showing url only, and index page listed with title/desc last cached April 22.

Chris

window

8:37 am on May 3, 2005 (gmt 0)

10+ Year Member



I also submitted a reinclusion request two weeks back, but my site has still not been crawled.

Gbot has not crawled the index pages of my 5 interlinked sites for the last 4 to 5 months, though some of my internal pages have been crawled every 10-12 days for the last 50 days.

What does this mean?

How can I get my index pages crawled? It is very, very important to me.

crobb305

8:02 pm on May 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is Gbot just particularly active right now? Getting crawled again today. :)

C

Dayo_UK

8:05 pm on May 4, 2005 (gmt 0)



Crobb305

Is it the Mozilla Googlebot? If so, read my posts here:

[webmasterworld.com...]

Dayo

AlexK

3:19 am on May 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



crobb305:
It is quite possible that Gbot has always "deep crawled" my site prior to the next update ...
Is Gbot just particularly active right now? Getting crawled again today. :)

I never put the bl**ding-obvious together until I read your posts.

Shortly before *every* major + minor update (at least since Dec 04), G has deep-crawled my site. Then, 24-48 hours later--oh dear--the update became active and the site's position in the G SERPS fell like a stone, as did the hits.

I can trace the recent rise in bot-hits back to about 9 Apr (only an estimate, as my site is on a temp server), but it *really* ramped up on Fri 29 and, Tue/Wed, is of the order of 5,000 pages each day. Blimey.

I can only hope that--this time--my no-tricks site will rise in the SERPS.

walkman

3:54 am on May 5, 2005 (gmt 0)



I missed this crawl completely. I blocked G from one site for about 10 days, and G was looking for the old files. I couldn't 301 them, so I have to wait. Not even my index page is cached now. Slurp indexed them all a day after I let the spiders back in. I was shocked.

AlexK

11:24 am on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



walkman:
Slurp indexed them all...a day after letting the spiders in. I was shocked

Annoyed at not having proper stats (site is on a temp-server, and has been there for a while whilst my hosts twiddle their thumbs) I copied the logs across to my main server and installed AWStats [awstats.org]. Now I'm shocked.

The first (test) report is only for 4+ days (1-6 May), but search bots account for 30% of the pages and 23% of the bandwidth:

Type --- Pages - Hits --- Bandwidth 
people - 32907 - 160301 - 801.41 MB
bots --- 13955 - 14477 -- 235.86 MB

As only 36.8% of the site hits originate from a search-engine, I'm beginning to have my first doubts about their value.
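For anyone wanting to reproduce those figures from their own AWStats totals, a minimal sketch of the arithmetic, using the numbers from the table above (note the quoted "30%" matches the pages column, not raw hits):

```python
# Share of pages/hits/bandwidth taken by search bots,
# using the AWStats totals from the table above.

def share(bots: float, people: float) -> float:
    """Return the bots' percentage of the combined total."""
    return 100.0 * bots / (bots + people)

pages_pct = share(13955, 32907)        # ~29.8% -> the "30%" figure
bandwidth_pct = share(235.86, 801.41)  # ~22.7% -> the "23%" figure
hits_pct = share(14477, 160301)        # ~8.3% of raw hits

print(f"pages: {pages_pct:.1f}%  bandwidth: {bandwidth_pct:.1f}%  hits: {hits_pct:.1f}%")
```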

theBear

11:57 am on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



AlexK,


As only 36.8% of the site hits originate from a search-engine, I'm beginning to have my first doubts about their value.

What would happen if 36.8% of your traffic dried up overnight?

Further, how many of the 53.2% of your visitors started visiting because of a search engine?

We found that a large percentage of paying customers are short-termers from search engines looking for a quick answer to a particular set of requirements.

The long-term repeat visitors, while they provide less revenue, provide the traffic required to keep advertisers happy. Most of these visitors started as search engine referrals.

But if you wish, a simple robots.txt file will cure this issue, and we'll happily welcome your former search-engine-referred visitors.

Just let us know which subject area you are in so we can build some content;).

crobb305,

I noticed that Google has two of your pages showing title and description in its listing. I really can't remember what it showed when I first looked; I thought it was only the home page then.

I've looked at a couple of dozen sites in a number of niches; there was plenty of damage, and there are still folks who haven't figured out what hit them.

dazzlindonna

12:10 pm on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



never mind...not awake yet.

walkman

2:40 pm on May 6, 2005 (gmt 0)



what does the last number on these log lines mean:
HTTP/1.1" 200 8341
HTTP/1.1" 200 8093
HTTP/1.1" 200 8102
HTTP/1.1" 200 8037
Does anyone know?
It's a little too low for the page size, even if it just got the text. Google got my index 4 times last night and I'm wondering.

Dayo_UK

2:48 pm on May 6, 2005 (gmt 0)



Size of the file - (you changed your post ;))

It is just the size of the html document.
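To make that concrete: in the common/combined Apache log format, the field after the status code is the response body size in bytes. A minimal sketch of pulling those two fields out of a log line (the IP, path, and variable names below are made up for illustration):

```python
import re

# The tail of an Apache access-log line looks like: ... HTTP/1.1" 200 8341
# where 200 is the status and 8341 is the response body size in bytes
# (a "-" means no body was sent, e.g. for a 304).

LOG_TAIL = re.compile(r'HTTP/[\d.]+"\s+(\d{3})\s+(\d+|-)')

def status_and_bytes(line: str):
    """Return (status, bytes) from a log line; bytes is None when logged as '-'."""
    m = LOG_TAIL.search(line)
    if not m:
        return None
    status = int(m.group(1))
    size = None if m.group(2) == "-" else int(m.group(2))
    return status, size

example = '66.249.0.1 - - [06/May/2005:02:11:00 +0000] "GET /index.html HTTP/1.1" 200 8341'
print(status_and_bytes(example))  # (200, 8341)
```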

walkman

2:56 pm on May 6, 2005 (gmt 0)



"It is just the size of the html document"

If I save my view source, it's close to 40 KB; if I save (in IE) as text, it's a little over 10 KB. I wonder what G doesn't take.

Dayo_UK

3:03 pm on May 6, 2005 (gmt 0)



Is it Cached?

What does Google show as the Cached size?

& is all of the page displayed in the Cache?

walkman

3:14 pm on May 6, 2005 (gmt 0)



No, not cached. I had banned G for a while as I was changing the script. I looked at a GB simulator, and the text seems within the range. A little surprised, but then I use tables to draw the results. All the text seems to have been taken anyway, so it doesn't matter.

Why would they pull the site 4 times? Gathering the links? Not that I mind :), just curious.

AlexK

6:45 pm on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



theBear:
What would happen if 36.8% of your traffic dryed up over night?

My site has lost more than that twice in the last 6 months (Google specific), hence my involvement in this thread.

... there was plenty of damage and there are still folks who haven't figured out what hit them.
...and you have? Do let us know.

walkman:

Google got my index 4 times last night and I'm wondering

MSNBot took my site's robots.txt file 38 times in 4 days:
    MSNBot : 4720+38
    Googlebot : 3034+6
    AdSense : 2058+4

In addition to this, an Unknown robot (who did not actually take any pages) took the robots.txt file 338 times. What was that all about?

crobb305:

My pages are still listed as url only, so does anyone know what the deep crawls indicate, and when title/desc may appear in serps?

The original question in this thread remains unanswered. In msg 6 of another similar thread [webmasterworld.com] I reported that just 12 of the first 100 SERPs on a site:mysite.com G-search had URL + desc; all the rest were URL-only. That was May 3. Three days and 3,000+ G-sampled pages later, that has altered to 13 in the first 100 pages. It hardly seems worth it.

The fundamental question of this thread remains: "How long til titles/descriptions appear?"

AlexK

7:00 pm on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



walkman:
what does the last number on these log lines mean:
HTTP/1.1" 200 8341
HTTP/1.1" 200 8093
HTTP/1.1" 200 8102
HTTP/1.1" 200 8037

As others have said, it is the size of the page in bytes.

if I save my view source it's close to 40 KB.
That is possible if the page is compressed by the server (gzip/deflate). This page [leknor.com] will let you know, for any specific page on your site, whether it is gzipped (it shows the Content-Length, too), and will show the headers for deflated pages.
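That size gap is easy to demonstrate offline. A minimal sketch (the sample HTML is made up, and zlib is used as a stand-in for the server's gzip/deflate filter): HTML full of repeated table markup, like walkman describes, compresses dramatically, so a ~40 KB source can plausibly be logged as ~8 KB of transferred bytes.

```python
import zlib

# If the server compresses the response, the logged byte count is the
# compressed size, not the size of the source you see in the browser.

# Hypothetical page: repetitive table markup, as in a results listing.
html = "<tr><td class='result'>widget description text</td></tr>\n" * 700
raw_size = len(html.encode())

# Level 6 is zlib's default; servers typically let you tune this.
compressed = zlib.compress(html.encode(), 6)

print(raw_size, len(compressed), f"ratio: {len(compressed) / raw_size:.1%}")
```

Real pages compress less well than this artificial example, but a 4-5x reduction on table-heavy HTML is common.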

why would they pull the site 4 times?
Probably because it has varied in size every single time. I assume that you have dynamic content on the page?

walkman

7:25 pm on May 6, 2005 (gmt 0)



thanks AlexK,
I am gzipped via Apache 2.0* and php.ini with ob_ something. The weird part is that my other sites are too, but the content length is different (when G pulls them).
I might be onto something... will this mean anything bad to Google?

"Cache-Control no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma no-cache"
The other site didn't have this and I got the file size from the page suggested (which is about right scaled to this).
Yes, I also have a dynamic script that randomly pulls about 4-5 featured widgets from a selected pool.

AlexK

7:50 pm on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



theBear:
What would happen if 36.8% of your traffic dryed up over night?
Further how many of 53.2% of your visitors started visiting because of a search engine?

These are good questions, even if your maths is bad :)~ and I have been mulling this over for a while.

My site started on a free web host many years back, and nobody cared who used how much bandwidth, since I did not have to pay the costs. Those days are long gone... Today, I have to pay for each byte downloaded.

With the search-engines it is a simple business proposition: they are allowed to roam at will all over my site because the cost to me of them doing this is offset by the return in visitors from their SERPS. When I then discover that 23% of my bandwidth is due to search-bots it stops me dead in my tracks. When I also discover that 83% of their SERPS for my site are url-only I begin to worry. When I discover that there is no means to find out why or to fix this situation I get really angry.

At this instant the SEs are a benefit for my site, but the equation is sliding rapidly towards the red side of the balance sheet.

AlexK

8:21 pm on May 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



walkman:
I am gzipped ... with ob_ something ... wierd part is that my other sites are too but content length is different

Normally auto-gzip (via php) is fixed to a value of 3 (the level can range from 0-9). My own site uses a modified php class, originally derived from the site that I gave in msg 20, which varies the depth of gzip according to server load.

will this mean anything bad to Google?
My site is 87% url-only, but who the hell knows?

Some of the G-Bots are http 1.0, and therefore will not accept compression (HTTP/1.1 only). My site also provides a link to switch compression off for all pages (for anybody having display problems). I suspect that this is causing a duplicate-penalty.

"Cache-Control no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma no-cache"
Do not quote me on this (check here [salemioche.com] instead), but this will instruct browsers (and proxies?) to re-fetch the page on every visit, which backs up G bringing your index page 4 times.
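A rough reading of those directives (names per the HTTP/1.1 spec; the function and its verdict strings are mine, and this sketch only checks the directives that forbid or restrict caching):

```python
# Interpret a Cache-Control header like walkman's:
#   no-store        - the response may not be stored at all
#   no-cache        - it may be stored, but must be revalidated on every use
#   must-revalidate - once stale, it must be revalidated before reuse

def cache_verdict(header: str) -> str:
    directives = {d.strip().split("=")[0].lower() for d in header.split(",")}
    if "no-store" in directives:
        return "never stored"
    if "no-cache" in directives:
        return "stored but revalidated on every use"
    if "must-revalidate" in directives:
        return "revalidated once stale"
    return "cacheable"

print(cache_verdict("no-store, no-cache, must-revalidate, post-check=0, pre-check=0"))
# -> never stored
```

With no-store present, the rest of the header set is belt-and-braces: nothing gets cached in the first place, so every visit (bot or human) fetches the full page.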

walkman

3:34 am on May 7, 2005 (gmt 0)



one more question:
this is what shows when I test for gzip (which I have):
Set-Cookie : PHPSESSID=e8ac4e5d0********4a1ce1f*******b; path=/
Expires Thu, 19 Nov 1981 08:52:00 GMT

All my pages are linked clean, but the PHPSESSIDs are there for users to save a product for that session... not a cookie fan :).

Once again, none of my links are like /blah.php?ID=#*$!#*$!#*$!x; when I highlight or click I see clean links, and when Slurp indexed the site, the links it got were clean (just /blah.php).
Will this still be a problem for Google?

thanks guys,

theBear

3:57 am on May 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



AlexK,

My maths are fine; my keyboarding is off at times.

That comes from old age and arthritis ;) or is that :(. Some days it is just painful.

Reid

4:07 am on May 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If-Modified-Since:
If Googlebot sees that the current page is unchanged since it was cached, it will stop requesting it. That is why a deep crawl is followed by 'just checking':
if it finds any files with a new date, it requests them.
After being deep crawled, the site should appear in the next update.

How can I crawl my index pages...as it is very very important for me.


try a tool called 'poodle predictor'
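The conditional-GET dance described above boils down to one date comparison on the server side. A minimal sketch (function and variable names are mine; dates are RFC 1123 strings as sent in the headers):

```python
from email.utils import parsedate_to_datetime

# The If-Modified-Since check: the bot sends the date it last cached
# the page; the server answers 304 (Not Modified) if nothing is newer,
# and the bot skips re-downloading the page.

def respond(last_modified: str, if_modified_since=None) -> int:
    """Return the status a server should send: 304 or 200."""
    if if_modified_since is None:
        return 200  # unconditional request: always send the page
    page = parsedate_to_datetime(last_modified)
    cached = parsedate_to_datetime(if_modified_since)
    return 304 if page <= cached else 200

print(respond("Fri, 06 May 2005 12:00:00 GMT", "Sat, 07 May 2005 00:00:00 GMT"))  # 304
print(respond("Fri, 06 May 2005 12:00:00 GMT", None))                             # 200
```

This is also why a page with no Last-Modified header at all (like walkman's, below) never gets the 304 shortcut: the bot has no date to condition on, so it re-fetches the full page every time.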

walkman

4:41 am on May 7, 2005 (gmt 0)



"If Googlebot sees the current page is unchanged since it was cached, it will stop requesting it."

Makes sense. Well, my index, for good or for bad, changes every time it's loaded...

Reid

5:17 am on May 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well my index, for good or for bad, changes every time it's loaded...

Yeah, but does the If-Modified-Since date change every time? That's where dynamic sites can get into crawling problems: the cache never gets updated, because the 'skeleton' of the page never changes.

walkman

5:32 am on May 8, 2005 (gmt 0)



Hi Reid,
this is what I got when I checked for cacheability:
Expires 1224 weeks 2 days ago (Thu, 19 Nov 1981 08:52:00 GMT)
Cache-Control post-check=0, must-revalidate, no-store, no-cache, pre-check=0
Last-Modified -
ETag -
Set-Cookie path=/; phpsessid=--8567f-----f8***fca6a***a379
Content-Length - (actual size: **7*5)
Server Apache

This object has been deliberately marked stale. It doesn't have a validator present. It will be revalidated on every hit, because it has a Cache-Control: no-cache header. It won't be cached at all, because it has a Cache-Control: no-store header. Because of the must-revalidate header, all caches will strictly adhere to any freshness information you set. This object requests that a Cookie be set; this makes it and other pages affected automatically stale; clients must check them upon every request. It doesn't have a Content-Length header present, so it can't be used in a HTTP/1.0 persistent connection.
------------
The Last-Modified line is blank. What else can I do? Google stopped by and picked up about 100 new pages today, including the home page (full size too).

Reid

8:02 pm on May 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That sounds OK, walkman - if it can't be cached, then it shouldn't cause a problem.
I saw a few others that were allowing these stale skeleton pages to get cached - never to be updated again.