Forum Moderators: phranque
Around 25 - 31 Jul 05 our server was attacked by a hacker who breached our security [we had weak passwords at the time]. The hacker demonstrated some familiarity with our industry and the software we use, navigating the system with evident knowledge of it.
We identified the hacker's activity: they used dynamic IP addresses to try to conceal their identity, and over a prolonged period of 5-6 days repeatedly re-entered the server and made alterations consisting of placing robots.txt files into many of the web sites. The hacker stopped when we upgraded security.
We first suspected a problem around 3 Aug 05 when our sites started to be eliminated from the Google and Yahoo caches. Yahoo kindly identified the presence of the robots.txt files to us [upon our seeking help]; Google provided a standard reply. A cache date of 25 Jul 05 is the consistent last-cached date over most of the site, with supplemental indexing of those pages [most but not all].
The matter was referred to the local police and is currently under a prolonged investigation by Interpol.
Until this date Google was caching us every few days. From this date onward all of our websites failed to cache, even when we removed the robots.txt file.
My suspicion is that something triggered Google to stop spidering us or upset the bots. Most of the pages are supplemental now, although we did have some potentially duplicate content on the sites - which is now fixed.
Yahoo mostly returned to normal, although not all pages are performing correctly yet.
Can anyone think of anything that may have been damaged or altered that could have caused the collapse of our caches, and therefore our results? Or are we the victims of supplemental indexing, either coincidentally or because something has triggered the current "suppression"?
We identified the hacker's activity: they used dynamic IP addresses to try to conceal their identity, and over a prolonged period of 5-6 days repeatedly re-entered the server and made alterations consisting of placing robots.txt files into many of the web sites.
Did you review the content of those robots.txt files? Did they contain this?
User-agent: *
Disallow: /
If so, it appears the hacker dropped a bunch of robots.txt files that prevent the sites from being indexed.
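A quick way to confirm what a robots.txt like that actually blocks is Python's standard-library parser. This is just an illustrative sketch - the file content mirrors the suspected hack and the example.com URLs are placeholders, not the poster's real site:

```python
from urllib import robotparser

# Hypothetical robots.txt content matching the suspected hack
hacked = """User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(hacked.splitlines())

# With "Disallow: /" under "User-agent: *", every well-behaved crawler
# is blocked from every URL on the site.
print(rp.can_fetch("Googlebot", "http://example.com/"))       # False
print(rp.can_fetch("Slurp", "http://example.com/page.html"))  # False
```

If the hacked files really contained that directive, every well-behaved bot would have stopped fetching anything on those sites.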
With Google, this exploit would be rather severe, as I'm sure the Google Remove URI Tool would have been used, which requires an entry in the robots.txt file. You'd see results within 24-48 hours: gray bar, pages dropping from the index, etc.
Yikes!
However, did you notice the date when this happened, per the above? It was mid 05, and we never recovered our caching, even though we reversed the robots.txt.
The reason for coming to this forum was my concern that the attacker may have attacked our scripting in a manner that could interfere with our Google results.
What do you think?
As you know, old cache dates are rampant at Google right now and that alone isn't enough of a clue to track down the issue that plagues you. But server logs might give you some significant information.
It probably checked back at your site for a period of time, and when the restrictions were not lifted it put your site into some type of 'Do Not Crawl' list to save itself processor power. I am guessing that Yahoo and MSN do not have such a thing, and that is why they have returned. Again, I am guessing about what is happening with G.
But if I were you I would take steps as if your site was sandboxed by Google and do all the little things that someone does to get their site indexed the first time by G.
The best way I have found to speed this along is by using the Google Sitemaps tool. This is where you create an XML file that follows a standard set by Google (the instructions for creating this file are easy to follow). You then submit this XML file to G and they use it to index your site. Once G has received that file it shouldn't be more than 2 weeks until Gbot comes to visit you. Who knows what will happen then, but you should be OK.
Google Sitemaps
[google.com...]
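For what it's worth, the sitemap file itself is simple to generate. A minimal Python sketch - the URL and date are placeholders, and the 0.84 schema namespace is the version Google documented at the time, so adjust to whatever the current instructions say:

```python
from xml.etree import ElementTree as ET

# Google Sitemaps schema in use at the time (assumption - check current docs)
NS = "http://www.google.com/schemas/sitemap/0.84"

def build_sitemap(urls):
    """Build a minimal sitemap from (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([("http://example.com/", "2006-03-15")])
print(xml)
```

You'd write that output to sitemap.xml in the web root and submit it through the Sitemaps interface.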
Good Luck and if you get any more clues as to what happened or if this works I would love to hear about it.
[gritechnologies.com...]
I'm also curious to hear the answer to tedster's questions re crawling.
FWIW, I helped a site that had a bad robots.txt back in the fall, and G was back crawling the site very quickly, so I'm skeptical that G just disregards sites for an extended period because of a robots.txt issue as long as there are good inbound links... though I suppose it might depend on what instructions were contained in the hacked files. Did the added robots.txt contain the number 410?
do you see googlebot spidering your pages right now?
This is our server activity as of today's date :
Yahoo Slurp 160249+2164 3.21 GB 15 Mar 2006 - 11:55
Googlebot 26456+26 415.44 MB 15 Mar 2006 - 10:26
Unknown robot (identified by 'crawl') 14449+41 39.41 MB 15 Mar 2006 - 11:55
MSNBot 13105+278 282.08 MB 15 Mar 2006 - 11:55
GigaBot 1268+189 28.46 MB 15 Mar 2006 - 11:53
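If you want to double-check those stats against the raw logs rather than the reporting tool, a rough tally by user-agent substring is easy. A sketch - the sample log lines and bot tokens here are made up for illustration, and real combined-format logs and exact UA strings vary:

```python
from collections import Counter

# Assumed UA substrings for each crawler - verify against your own logs
BOTS = {"Googlebot": "Googlebot", "Yahoo Slurp": "Slurp", "MSNBot": "msnbot"}

def tally_bots(lines):
    """Count hits per known crawler by case-insensitive UA substring match."""
    counts = Counter()
    for line in lines:
        for name, token in BOTS.items():
            if token.lower() in line.lower():
                counts[name] += 1
    return counts

sample = [
    '1.2.3.4 - - [15/Mar/2006:10:26:01] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [15/Mar/2006:11:55:02] "GET /a HTTP/1.0" 200 512 "-" "Yahoo! Slurp"',
]
print(tally_bots(sample))  # Googlebot: 1, Yahoo Slurp: 1
```

Running something like this over the logs from around 25 Jul 05 would show exactly when Googlebot's fetch pattern changed.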
We have a lot of pages still showing a 25 Jul 05 cache date, and even more that have not been cached at all.
I only have a copy of the log files from the time of the attack. Would the server administrator have a copy of these?
Re sitemaps
We submitted 1 site to Sitemaps over 4 weeks ago. There is nothing significant in the Sitemaps reports, i.e. it seems to have gone smoothly from the submission and diagnostics point of view, but nothing has changed in terms of appearing on Google.
All we have at the moment is a home page and 22,000 supplemental pages out of a total of around 85,000 pages - but I think this is associated with the BD issues, since 301s were latterly involved.
I have to say, based on the information you have put forth, I am a little mystified. It sucks to say, but I hope there is some underlying code causing this, because if there is, then once you find it and get rid of it the problem should work itself out from there. If it is on Google's end, then the problem may never be relieved, and who knows when it will resolve itself.
I ran "Poodle Predictor" and have previously submitted Google Sitemaps.
What puzzles me is the correlation between the attack date and the last cache. However, it could be a coincidence between this and the BD issues of the last month.
We have just added "Last-Modified" to the headers, which were also frozen in time at 25 Jul 05... so we'll see what happens.
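One thing worth verifying is that the Last-Modified value you now emit is a proper RFC 1123 HTTP-date, since malformed dates tend to be ignored by crawlers. A small sketch using Python's standard library - the timestamp here is arbitrary:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def last_modified_header(dt):
    """Format a datetime as an RFC 1123 HTTP-date for the Last-Modified header."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

stamp = datetime(2006, 3, 15, 11, 55, tzinfo=timezone.utc)
print(last_modified_header(stamp))  # Wed, 15 Mar 2006 11:55:00 GMT
```

You can check what your server actually sends with a HEAD request (e.g. `curl -I`) and compare it against that format.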
However, did you notice the date when this happened, per the above? It was mid 05, and we never recovered our caching, even though we reversed the robots.txt.
What was in the robots.txt files?
All we have at the moment is a home page and 22,000 supplemental pages out of a total of around 85,000 pages - but I think this is associated with the BD issues, since 301s were latterly involved.
Can we assume that a bulk of those pages are dynamically generated? If so, how many "real pages" comprise the site? Have you taken a look at those pages to make sure nothing was changed, added, etc.?
Just clarifying: approx 85,000 dynamic pages have produced only 1 result [home page] and around 22k supplementals...
I'm beginning to think that this is related to BD and that the frozen cache date of around 25 Jul 05 is coincidental, since Yahoo and MSN recovered fully and were stable. We applied 301s across the site from old pages to new ones in Feb 06.
What was in the robots.txt files?
I've had a few Stickies with Whitey concerning the content of those robots.txt files. Unfortunately they cannot tell me for sure what was in them, but they think it was...
User-Agent: *
Disallow: /
Based on the symptoms, my guess is that after the robots.txt hack occurred, requests were submitted to Google to remove the URIs.
Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of your site from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)
So the 180 day mark puts you in 2006 February (approximately). Then you say you implemented 301s in February. So from my perspective, you've added insult to injury. Then we have all this stuff going on with the cache, and it appears that the dates of the cache issue coincide with your robots.txt hack. Not sure if it is coincidental or all part of the issue.
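The 180-day arithmetic is easy to sanity-check. Assuming the removal request went in around 3 Aug 05, when the problem was first noticed (that start date is a guess from the thread, not a confirmed fact):

```python
from datetime import date, timedelta

# Hypothetical timeline: removal request filed when the problem was noticed
removal_request = date(2005, 8, 3)

# Per the quoted Google guideline, removal lasts 180 days
reinclusion = removal_request + timedelta(days=180)
print(reinclusion)  # 2006-01-30
```

That lands in very late January / early February 2006, right when the 301s went in.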
Around 25 - 31 Jul 05 our server was attacked by a hacker who breached our security [ we had weak passwords at the time]
We first suspected a problem around 3 Aug 05 when our sites started to be eliminated from the Google and Yahoo caches.
Yup, I'm more convinced now that your robots.txt file contained the Disallow: / directive. Google typically responds to that URI removal process via robots.txt within 24-48 hours. Ask Brett, he'll confirm this! ;)
Until this date Google was caching us every few days. From this date onward all of our websites failed to cache, even when we removed the robots.txt file.
Read the above excerpt from the Google Guidelines. Your site would not be indexed for a full 180 days after the URI removal request.
On a side note. If I were to ever be hacked, I'd rather them screw with something else (maybe). This is a surefire way to totally wreak havoc on someone's search marketing campaign. That robots.txt file can be friend or foe. In your case, it turned out to be foe of course.
If the hacker was smart enough to do the robots.txt hack, then there is a good chance that they did other things. Maybe robots meta tags (set to noindex, nofollow) were added throughout the site. Maybe erroneous code was injected at strategic areas of the site and is buried deep. Based on your questions and the lack of technical assistance from your end, I can only assume that there are probably some things still floating about that may be causing issues. Only you and your team can figure that out.
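Scanning for injected noindex meta tags can be partly automated. A hedged sketch that flags offending HTML - the sample markup is made up, and a real sweep would also walk the document root and check .htaccess, headers, and scripts:

```python
import re

# Matches a robots meta tag whose content includes "noindex"
META_RE = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\'][^"\']*noindex[^"\']*["\']',
    re.IGNORECASE,
)

def has_noindex(html):
    """Return True if the page carries a robots noindex meta tag."""
    return bool(META_RE.search(html))

clean = "<html><head><title>ok</title></head></html>"
hacked = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(clean), has_noindex(hacked))  # False True
```

Run something like this over every template and static page; one hit is enough to keep a page out of the index.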
Good luck!
[edited by: pageoneresults at 9:32 am (utc) on Mar. 17, 2006]
Yes. 100% without a doubt.
On a side note. If I were to ever be hacked, I'd rather them screw with something else (maybe). This is a surefire way to totally wreak havoc on someone's search marketing campaign. That robots.txt file can be friend or foe. In your case, it turned out to be foe of course.
Now that I give this more thought, this is a 100% foolproof way of destroying a campaign. Just imagine, someone has access to your robots.txt file. They drop the Disallow: / directive in there. Then they submit your URI to Google using the URI removal tool.
You of course are totally clueless that this is going on. The hacker, watching the Google SERPs closely, sees that the robots.txt removal has worked. They come back in, clean up their tracks and wham, you're history.
And then here you are (no one in particular) wondering why your site is not showing up in Google anymore. Wow, that would not be a pleasant experience. You'd be chasing your tail for at least six months trying to figure out what the heck happened. :(
In order to remove a URL from the Google index, we need to first verify your e-mail address. Please enter it below, along with a password.
Was there an email account set up around the same time the hacking started?
The above was only necessary if they used the Google URI Removal Tool. If not, I would assume that a few crawls by Googlebot, Slurp, etc. and the new robots.txt files would do their thing without the use of the tool.
Note: It would not be necessary for the creator of the robots.txt hack to have used the Remove URI Tool. But, if they had access to the server, and found access to email, they may have set up a temporary account, performed the procedure, deleted the account and then went on their merry way.
So the 180 day mark puts you in 2006 February (approximately). Then you say you implemented 301s in February. So from my perspective, you've added insult to injury. Then we have all this stuff going on with the cache, and it appears that the dates of the cache issue coincide with your robots.txt hack.
Now you're starting to firm up my suspicion
Not sure if it is coincidental or all part of the issue.
It could be a coincidence.
But what if the combination of robots.txt and 301s, as reported in the Supplementary Club [webmasterworld.com], was a reason to trigger us being dumped into the supplemental index in a manner that tripped Google's intelligence? Is this a clue to the behaviour that differentiates those who are being restored from those who are not in the Big Daddy update?
Separately, there are still some tidy-up issues - we'll keep you informed as we take time to look through the site.
The others are all "lopped" with some pages appearing, but the majority not. Yep the 301 certainly looks as though it added insult to injury.
But what puzzles me is the coincidence of dates and how it might have exposed a processing issue in delivering sites to the supplemental, not just us.
Remember, only the one site went to the supplemental -the one with 301's.
Is there a guru on another forum who can provide insight into these clues?
Is there a guru on another forum who can provide insight into these clues?
Adding insult to injury again. ;)
I do believe we've provided more than enough clues as to what is going on. It is now up to you to locate the technical expertise to undo what may have been done. It appears that you've taken some steps, but not all. Your server administrator should be the one participating in this topic.
And the interesting thing is that it is the one site to which we applied the 301s in Feb that has been utterly smashed by Google.
Nothing real interesting there. If the 301s were from old pages to new, depending on what was happening with the site at the time, there will be delays in getting the new pages indexed. In your case, you were just coming out of a 180 day no index period and you slapped those 301s in there adding insult to injury.
I do believe we've provided more than enough clues as to what is going on
You certainly have - and thanks.
However, I wasn't referring specifically to us when I spoke of "clues" and "gurus". [We'll do our own investigations on this.] What I mean is that possibly our experience demonstrates a pattern of events that might be connected to "some" of the issues experienced by other webmasters in the BD / Supplementary Club. Maybe it's an irregularity Google hadn't accounted for.
There has to be a logical reason why some sites are coming out of the supplementary index and some are not. - Just a hunch for a better brain than mine to dismiss or look further!
BTW - I checked the static HTML pages against the dynamic ones and they all froze around the same date - so not much difference there.
They don't need an email address on your server. Google will take any email address for the tool registration. They take the fact that there is a robots.txt file on the server to be enough proof that you have had enough access to the server to have put the robots.txt file there.
Our site has started to reindex, which means the 180-day robots.txt suppression is off.
At this stage [1st day] the site in question has lost its "supplemental page" status that occurred through the BD change, and results have returned to the index, although, for example, a previous No. 1 result is now at position 90. We still have 70% of the site to be indexed, but it's looking good.
Where we previously had strong No. 1 positions, and presumably have no other limiting factors, how long will it be before the site recovers its [similar] position?
And what indexing processes does it have to go through on Google?
Have added another 60k of pages.
The ones you mentioned have not yet been updated and still show the old pages and supplementals.
From what you're saying, I think the suggestion is to wait a bit.
My question was really about when & what comes next in the indexing cycle
e.g. backlinks