Forum Moderators: phranque


Can a hacker make your site unspiderable?

server breach robots.txt file alterations

         

Whitey

6:38 am on Mar 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We run a CGI Perl application on a MySQL/Linux platform, with a central database serving dynamic pages to around 15 websites at the time of the following events.

Around 25 - 31 Jul 05 our server was attacked by a hacker who breached our security [we had weak passwords at the time]. The hacker demonstrated some familiarity with our industry and with the software we use, judging by how knowledgeably they navigated the system.

We identified the hacker's activity: they used dynamic IP addresses to try to conceal their identity, and over a prolonged period of 5-6 days repeatedly re-entered the server and made alterations, placing robots.txt files into many of the websites. The hacker stopped when we upgraded security.

We first suspected a problem around 3 Aug 05, when our sites started to be eliminated from the Google and Yahoo caches. Yahoo kindly identified the planted robots.txt files for us [upon our seeking help]; Google provided a standard reply. A cache date of 25 Jul 05 is the consistent date of last caching over most of the site, with supplemental indexing of those pages [most, but not all].

The matter was referred to the local police and is currently under a prolonged investigation by Interpol.

Up to that date Google had been caching us every few days. From that date onward all of our websites failed to be cached, even after we removed the robots.txt files.

My suspicion is that something triggered Google to stop spidering us or upset the bots. Most of the pages are supplemental now, although we did have some potentially duplicate content on the sites - which is now fixed.

Yahoo mostly returned to normal, although not all pages are performing correctly yet.

Can anyone think of anything that may have been damaged or altered that could have caused the collapse of our caching, and therefore our results? Or are we the victims of supplemental indexing, either coincidentally or because something has triggered the current "suppression"?

perl_diver

6:54 pm on Mar 13, 2006 (gmt 0)

10+ Year Member



I think you should be asking in the search engine/Google forum; this really has nothing to do with Perl.

pageoneresults

2:49 am on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We identified the hacker's activity: they used dynamic IP addresses to try to conceal their identity, and over a prolonged period of 5-6 days repeatedly re-entered the server and made alterations, placing robots.txt files into many of the websites.

Did you review the content of those robots.txt files? Did they contain this?

User-agent: *
Disallow: /

If so, it appears the hacker dropped a bunch of robots.txt files that prevented the sites from being indexed.

With Google, this exploit would be rather severe, as I'm sure the Google Remove URI Tool was used, which requires an entry in the robots.txt file. You'd see results within 24-48 hours: gray bar, pages dropping from the index, etc.
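For anyone wanting to see how total that two-line block is, a minimal sketch using Python's standard robots.txt parser; the file contents below are the *suspected* hacked robots.txt, an assumption, since the originals were not kept:

```python
from urllib.robotparser import RobotFileParser

# The suspected hacked robots.txt: a blanket block on all crawlers.
hacked = """User-agent: *
Disallow: /"""

rp = RobotFileParser()
rp.parse(hacked.splitlines())

# Every major crawler of the era is locked out of every URL on the site.
for bot in ("Googlebot", "Slurp", "msnbot"):
    print(bot, rp.can_fetch(bot, "/any/page.html"))  # all False
```

With that file in place, a spider that honors the Robots Exclusion Protocol will not fetch a single page, which is exactly the symptom described above.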

Yikes!

Whitey

4:23 am on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



pageoneresults - Whoever attacked us put robots.txt files onto all of our sites.

However, did you notice the date when this happened, per the above - it was mid-05, and we never recovered our caching, even though we reversed the robots.txt change.

The reason for coming to this forum was my concern that the attacker may have tampered with our scripting in a manner that could interfere with our Google results.

What do you think?

tedster

7:07 am on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To separate the old cache date problem from your server hack, do you see googlebot spidering your pages right now? If someone has injected malicious script onto your server, your logs should show some spidering problems.

As you know, old cache dates are rampant at Google right now and that alone isn't enough of a clue to track down the issue that plagues you. But server logs might give you some significant information.
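As a sketch of the kind of log check tedster describes - the Apache-style log lines here are hypothetical samples, not Whitey's real logs - one could count Googlebot hits like this:

```python
# Count Googlebot requests in (hypothetical) Apache access-log lines.
sample_log = """\
66.249.65.1 - - [15/Mar/2006:10:26:01 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.65.1 - - [15/Mar/2006:10:26:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
10.0.0.5 - - [15/Mar/2006:10:27:00 +0000] "GET /page.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
"""

googlebot_hits = [line for line in sample_log.splitlines() if "Googlebot" in line]
print(len(googlebot_hits))  # 2
```

On a real server you would run the same filter over the live access log; zero Googlebot lines over several days would point to a crawling problem rather than an indexing one.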

Demaestro

4:29 pm on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am guessing that what has happened is that G has outsmarted itself. To be efficient, G probably has something in its algo that won't return to a site that has been cut off via robots.txt after X number of attempts.

It probably checked back at your site for a period of time, and when the restrictions were not lifted it put your site on some type of 'Do Not Crawl' list to save itself processor power. I am guessing that Yahoo and MSN do not have such a thing, and that is why they have returned. Again, I am only guessing about what is happening with G.

But if I were you, I would proceed as if your site had been sandboxed by Google and do all the little things someone does to get a site indexed by G for the first time.

The best way I have found to speed this along is the Google Sitemaps tool. You create an XML file that follows a standard set by Google (the instructions for creating this file are easy to follow), then submit that XML file to G, and they use it to index your site. Once G has received the file it shouldn't be more than 2 weeks until Gbot comes to visit you. Who knows what will happen then, but you should be OK.

Google Sitemaps
[google.com...]
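As a rough illustration of the file Demaestro describes - the URLs are made up, and the namespace here is the modern sitemaps.org one, so substitute whatever schema Google's instructions specify - a minimal sitemap can be generated in a few lines:

```python
from xml.sax.saxutils import escape

# Hypothetical page URLs; substitute the site's real ones.
urls = ["http://www.example.com/", "http://www.example.com/widgets.html"]

entries = "\n".join(
    "  <url><loc>%s</loc></url>" % escape(u) for u in urls
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries + "\n</urlset>"
)
print(sitemap)
```

For a large dynamic site, the same loop would simply walk the database of page URLs instead of a hand-written list.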

Good Luck and if you get any more clues as to what happened or if this works I would love to hear about it.

Demaestro

4:33 pm on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry, I just thought of this now as well. There is a tool called "Poodle Predictor": you go to their site and put in a URL, and it tries to crawl the site the way it thinks Google crawls it, returning results in the same manner Google would. You can pass your site to this tool, and if it can't crawl the site it will tell you why, which may offer more clues as to what is going on.

[gritechnologies.com...]

caveman

4:43 pm on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The sitemaps tool is a good suggestion I think.

I'm also curious to hear the answer to tedster's questions re crawling.

FWIW, I helped a site that had a bad robots.txt back in the fall, and G was back crawling the site very quickly, so I'm skeptical that G just disregards sites for an extended period because of a robots.txt issue, as long as there are good inbound links... though I suppose it might depend on what instructions were contained in the hacked files. Did the added robots.txt contain the number 410?

Whitey

8:53 pm on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Tedster :
do you see googlebot spidering your pages right now?

This is our server activity as of today's date :

Yahoo Slurp 160249+2164 3.21 GB 15 Mar 2006 - 11:55
Googlebot 26456+26 415.44 MB 15 Mar 2006 - 10:26
Unknown robot (identified by 'crawl') 14449+41 39.41 MB 15 Mar 2006 - 11:55
MSNBot 13105+278 282.08 MB 15 Mar 2006 - 11:55
GigaBot 1268+189 28.46 MB 15 Mar 2006 - 11:53

We still have a lot of pages showing a 25 Jul 05 cache date, and even more that have not been cached.

Whitey

9:08 pm on Mar 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did the added robots.txt contain the number 410?

I only have a copy of the log files from the time of the attack. Would the server administrator have a copy of this?

Re sitemaps

We submitted 1 site to Sitemaps over 4 weeks ago. There is nothing significant in the Sitemaps reports, i.e. it seems to have gone smoothly from the submission and diagnostics point of view, but nothing has changed in terms of appearing on Google.

All we have at the moment is a home page and 22,000 supplemental pages out of a total of around 85,000 - but I think this is associated with the BD issues, since 301s were latterly involved.

Demaestro

4:57 am on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did you try the poodle predictor?

I have to say, based on the information you have put forth, I am a little mystified. It sucks to say, but I hope there is some underlying code causing this, because if there is, then once you find it and get rid of it the problem should work itself out from there. If it is on Google's end, the problem may never be relieved, and who knows when it will resolve itself.

Whitey

5:34 am on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Demaestro

I ran "Poodle Predictor" and have previously submitted Google Sitemaps.

What puzzles me is the correlation between the attack date and the last cache date. However, it could be a coincidence between this and the BD issues of the last month.

We have just added "Last-Modified" to the headers, which were also frozen in time at 25 Jul 05... so we'll see what happens.
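For reference, a Last-Modified header value has to be an RFC 1123 HTTP-date. A sketch using Python's standard library, with the 25 Jul 05 cache date from this thread as the example timestamp:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Example: the 25 Jul 05 date mentioned in this thread, as an HTTP-date.
last_mod = datetime(2005, 7, 25, 0, 0, 0, tzinfo=timezone.utc)
header_value = format_datetime(last_mod, usegmt=True)
print("Last-Modified: " + header_value)  # Last-Modified: Mon, 25 Jul 2005 00:00:00 GMT
```

A CGI script would emit that line (with the page's real modification time) before the blank line that ends the response headers.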

pageoneresults

6:29 am on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



However, did you notice the date when this happened per the above - it was mid 05 and we never recovered our caching , even though we reversed the robots.txt

What was in the robots.txt files?

All we have at the moment is a home page and 22,000 supplemental pages out of a total page number of around 85,000 - but i think this is associated with the BD issues since 301's were latterly involved.

Can we assume that the bulk of those pages are dynamically generated? If so, how many "real pages" comprise the site? Have you taken a look at those pages to make sure nothing was changed, added, etc.?

Whitey

6:52 am on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's a good point - to look at the permanent static pages and compare them to those generated dynamically off the DB.

Just clarifying: approximately 85,000 dynamic pages have produced only 1 result [the home page] and around 22k supplementals...

I'm beginning to think that this is related to BD and that the frozen cache date of around 25 Jul 05 is coincidental, since Yahoo and MSN recovered fully and were stable. We applied 301s across the site from old pages to new ones in Feb 06.

kartiksh

9:13 am on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Most of the pages are supplemental now

Sounds familiar. Bigdaddy? [webmasterworld.com...]

pageoneresults

6:02 pm on Mar 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Okay, would you mind if I asked one more time?

What was in the robots.txt files?

pageoneresults

9:14 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What was in the robots.txt files?

I've had a few Stickys with Whitey concerning the content of those robots.txt files. Unfortunately they cannot tell me for sure what was in them but they think it was...

User-Agent: * 
Disallow: /

Based on the symptoms, my guess is that after the robots.txt hack occurred, requests were submitted to Google to remove the URIs.

Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of your site from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)

So the 180-day mark puts you in February 2006 (approximately). Then you say you implemented 301s in February. So from my perspective, you've added insult to injury. Then we have all this stuff going on with cache, and it appears that the dates of the cache issue coincide with your robots.txt hack. Not sure if it is coincidental or all part of the issue.

Around 25 - 31 Jul 05 our server was attacked by a hacker who breached our security [ we had weak passwords at the time]

We first suspected a problem around 3 Aug 05 when our sites started to be eliminated from the Google and Yahoo caches.

Yup, I'm more convinced now that your robots.txt file contained the Disallow: / directive. Google typically responds to that URI removal process via robots.txt within 24-48 hours. Ask Brett, he'll confirm this! ;)

Until this date Google was caching us every few days. From this date onward all of our websites failed to cache, even when we removed the robots.txt file.

Read the above excerpt from the Google Guidelines. Your site would not be indexed for a full 180 days after the URI removal request.
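The 180-day arithmetic is easy to check, using the dates reported earlier in the thread:

```python
from datetime import date, timedelta

# Start of the robots.txt hack window reported earlier in the thread.
hack_start = date(2005, 7, 25)

# A URL-removal request around then would expire roughly 180 days later.
expiry = hack_start + timedelta(days=180)
print(expiry)  # 2006-01-21
```

A removal request filed in early August rather than late July would run to late January or early February 2006, which lines up with the February estimate above.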

On a side note. If I were to ever be hacked, I'd rather them screw with something else (maybe). This is a surefire way to totally wreak havoc on someone's search marketing campaign. That robots.txt file can be friend or foe. In your case, it turned out to be foe of course.

If the hacker was smart enough to do the robots.txt hack, then there is a good chance that they did other things. Maybe robots meta tags (set to noindex, nofollow) throughout the site. Maybe erroneous code was injected at strategic areas of the site and it is buried deep. Based on your questions and the lack of technical assistance from your end, I can only assume that there are probably some things still floating about that may be causing issues. Only you and your team can figure that out.

Good luck!

[edited by: pageoneresults at 9:32 am (utc) on Mar. 17, 2006]

pageoneresults

9:28 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can a hacker make your site unspiderable?

Yes. 100% without a doubt.

On a side note. If I were to ever be hacked, I'd rather them screw with something else (maybe). This is a surefire way to totally wreak havoc on someone's search marketing campaign. That robots.txt file can be friend or foe. In your case, it turned out to be foe of course.

Now that I give this more thought, this is a 100% foolproof way of destroying a campaign. Just imagine, someone has access to your robots.txt file. They drop the Disallow: / directive in there. Then they submit your URI to Google using the URI removal tool.

You of course are totally clueless that this is going on. The hacker, watching the Google SERPs closely, sees that the robots.txt removal has worked. They come back in, clean up their tracks and wham, you're history.

And then here you are (no one in particular), wondering why your site is not showing up in Google anymore. Wow, that would not be a pleasant experience. You'd be chasing your tail for at least six months trying to figure out what the heck happened. :(

pageoneresults

10:10 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One last question Whitey and this is to confirm that the robots.txt hack was used.

In order to remove a URL from the Google index, we need to first verify your e-mail address. Please enter it below, along with a password.

Was there an email account set up around the same time the hacking started?

The above would only be necessary if they used the Google URI Removal Tool. If not, I would assume that a few crawls by Googlebot, Slurp, etc. and the new robots.txt files would do their thing without the use of the tool.

Note: It would not be necessary for the creator of the robots.txt hack to have used the Remove URI Tool. But, if they had access to the server, and found access to email, they may have set up a temporary account, performed the procedure, deleted the account and then went on their merry way.

Whitey

10:32 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So the 180 day mark puts you in 2006 February (approximately). Then you say you implemented 301s in February. So from my persective, you've added insult to injury. Then we have all this stuff going on with cache and it appears that the dates of the cache issue coincide with your robots.txt hack.

Now you're starting to firm up my suspicion

Not sure if it is coincidental or all part of the issue.

It could be a co incidence.

But what if the combination of robots.txt and 301s, as reported in the Supplementary Club [webmasterworld.com], was a reason for us being dumped into the supplementary index in a manner that tripped Google's intelligence? Is this a clue to the behaviour that differentiates those who are being restored from those who are not in the Big Daddy update?

Separately, there are still some tidy-up issues - we'll keep you informed as we take time to look through the site.

Whitey

10:38 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



.... and the interesting thing is that it is the one site to which we applied the 301s in Feb that has been utterly smashed by Google.

The others are all "lopped", with some pages appearing but the majority not. Yep, the 301 certainly looks as though it added insult to injury.

But what puzzles me is the coincidence of dates and how it might have exposed a processing issue in delivering sites to the supplemental, not just us.

Remember, only the one site went supplemental - the one with 301s.

Is there a guru on another forum who can provide insight into these clues?

pageoneresults

10:49 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a guru on another forum who can provide insight into these clues?

Adding insult to injury again. ;)

I do believe we've provided more than enough clues as to what is going on. It is now up to you to locate the technical expertise to undo what may have been done. It appears that you've taken some steps, but not all. Your server administrator should be the one participating in this topic.

pageoneresults

10:52 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



and the interesting thing is, that it is the one site to which we applied the 301's on in Feb that has been utterly smashed by Google.

Nothing really interesting there. If the 301s were from old pages to new, then depending on what was happening with the site at the time, there will be delays in getting the new pages indexed. In your case, you were just coming out of a 180-day no-index period and you slapped those 301s in there, adding insult to injury.

Whitey

11:22 am on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I do believe we've provided more than enough clues as to what is going on

You certainly have - and thanks.

However, I wasn't referring specifically to us when I spoke of "clues" and "gurus" [we'll do our own investigations on this]. What I mean is that possibly our experience demonstrates a pattern of events that might be connected to "some" of the issues experienced by other webmasters in the BD / Supplementary Club. Maybe it's an irregularity Google hadn't accounted for.

There has to be a logical reason why some sites are coming out of the supplementary index and some are not. Just a hunch for a better brain than mine to dismiss or look into further!

btw - I checked the static HTML pages against the dynamic ones, and they all froze around the same date - so not much difference there.

g1smd

11:32 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> But, if they had access to the server, and found access to email, they may have set up a temporary account, performed the procedure, deleted the account and then went on their merry way. <<

They don't need an email address on your server. Google will accept any email address for the tool registration. They take the fact that there is a robots.txt file on the server to be enough proof that you have had enough access to the server to have put the robots.txt file there.

Whitey

7:31 am on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



g1smd - That's an interesting point.

We are still on hold with the 301 redirect problems of BD for the pages we diverted, but with all the other checking done, would you put a reinclusion request in to Google, or wait?

g1smd

8:14 pm on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would wait until a week or so after BigDaddy rolls "for real", as I have a feeling that maybe they need another crawl/index cycle to get things "right".

I also have a suspicion that they will make it even more "wrong" than it is now, when they do.

Whitey

6:54 am on Mar 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good news for those who find themselves in a similar situation [I hope no good webmaster has to go through this]:

Our site has started to reindex, which means the 180-day robots.txt suppression is off.

At this stage [1st day] the site in question has lost its "supplemental page" status from the BD change, and results have returned to the index, although, for example, a previous No. 1 result is now at position 90. We still have 70% of the site left to be indexed, but it's looking good.

Where we previously had strong No. 1 positions, and presuming we have no other limiting factors, how long will it be before the site recovers its [similar] position?

And what indexing processes does it have to go through on Google?

g1smd

10:41 pm on Mar 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I dunno, but that process might be being seen at [64.233.171.104...] and at [64.233.185.104...] as something very different is brewing over there...

Whitey

2:22 am on Mar 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



66.249.93.104
64.233.179.104
72.14.207.104

have added another 60k of pages

the ones you mentioned have not yet been updated and still show the old pages and supplementals.

From what you're saying, I think the suggestion is to wait a bit.

My question was really about when and what comes next in the indexing cycle, e.g. backlinks.

This 36 message thread spans 2 pages.