|Spammers somehow found my clients' customer logfile|
A cautionary tale -- don't let this happen to you
I put a contact form on a client's website, and had it also write senders' email addresses to a logfile. (The client hosts their own email, which is unreliable, and they wanted to be able to see if there were entries in the logfile for addresses from which they never got a message.)
I didn't think the logfile would be visible to the outside world, because it was inside a directory that already had an index.html file. In other words, the directory contained index.html alongside the logfile (log.txt).
My understanding is that if a directory has an index.html file, then any request for the directory itself makes the server return index.html, so the requester can't see the contents of the directory.
That's nice in theory, but somehow spammers got to the logfile. I first noticed this when the special address I used for the client, which I also used to test the contact form, started getting spam. I did a Google search for that address and nothing showed up, so I assumed that the client's computer was infected with spyware which was stealing addresses from their addressbook and sending them to the spammer.
But some months later, one of the client's customers also started getting spam, and the customer Googled that address, and the logfile came up in Google! It also comes up in Yahoo. I think spammers found the logfile before the search engines (since I started getting spam before I could find the logfile in Google), but I can't be sure.
I can't explain how this happened. There are only four ways I know of that a bot (friendly or no) can find a file:
(1) The file is linked to from a page. But I certainly didn't link to the logfile from anywhere, and I don't think the client did, either. I searched for backlinks to the logfile in the SE's and found nothing.
(2) The file is in a directory without a default file like index.html or index.htm. But there is indeed an index.html file, and I don't think it was ever deleted.
(3) The bot guesses at the filename. I think it's a stretch that spambots are going to query every directory they come across for "log.txt", but even if they did, how does that explain how *Google* and *Yahoo* found the logfile? Certainly Google and Yahoo aren't playing guessing games.
(4) Submitting the url directly to a SE. Obviously this didn't happen.
Lessons learned:
1. Unless there is a compelling reason to store email addresses in the webspace, don't. I could have easily written the logfile above the webspace (e.g., to /home/log.txt, instead of to /home/domain.com/directory/log.txt), and in hindsight, I should have.
2. If it's really necessary to store email addresses within the webspace (e.g., a client app to access the data via the web), put it in a secure directory. The directory where the data is stored should always prompt visitors for a username/password in order to see the contents.
3. For added safety, it wouldn't hurt to obfuscate the addresses when writing them. For example, instead of "firstname.lastname@example.org", write "firstname.lastname - example.org".
4. Don't assume that a file is unviewable just because it's in a directory with an index.html file. As my experience showed, that doesn't always work. I don't know why it didn't, but that's beside the point.
How I handled the problem
1. Apologized to the client profusely.
2. Emptied out the contents of the logfile in its old location.
3. Changed the script to start storing the logfile *above* the webspace.
4. Suggested to the client that they notify all users whose email addresses were compromised, explaining the problem, and laying the blame squarely on their web services provider (me).
5. Offered a $1000 guarantee that the new logfile will absolutely not show up in any search engine. Suggested that the client point this out to their customers so they can have some confidence that this was really a one-time screwup and that the client is confident it won't happen again.
6. Canceled the client's most recent invoice.
7. Wrote this post to share my experience, to prevent it from happening to others.
This is especially disheartening because I've spent countless hours fighting and preventing spam for clients and making sure the server and scripts don't get compromised. I haven't used mailto: links in HTML for nearly a decade, always trying new ways to keep addresses out of spambots' hands while making the mailing experience easy for the website visitor -- I even wrote a fairly detailed article years ago on How to Keep Spambots from Stealing Addresses. And now, thanks to my not being careful enough, 357 people are getting more spam. Ugh.
Well, at least I learned my lesson, and I'm writing this in hopes that no one else makes the same mistake.
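To make lessons 1 and 3 above concrete, here is a minimal Python sketch (the original form script's language isn't stated, so this is illustration only). The paths are the ones from the example in lesson 1; the function names `is_inside`, `obfuscate` and `log_sender` are made up for this sketch.

```python
import os

WEB_ROOT = "/home/domain.com"   # web root, as in the example path above
LOG_PATH = "/home/log.txt"      # logfile kept above the web root


def is_inside(directory, path):
    """Return True if path resolves to a location inside directory."""
    directory = os.path.realpath(directory)
    path = os.path.realpath(path)
    return os.path.commonpath([directory, path]) == directory


def obfuscate(address):
    """Replace the '@' so the stored text is not a harvestable address."""
    return address.replace("@", " - ")


def log_sender(address, log_path=LOG_PATH, web_root=WEB_ROOT):
    """Append an obfuscated address, refusing any path under the web root."""
    if is_inside(web_root, log_path):
        raise ValueError("refusing to write the log inside the web root")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(obfuscate(address) + "\n")
```

The check in `log_sender` is only a guard against a misconfigured path; the real protection is that the web server never maps a URL onto a file outside the web root.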
There's a fifth way that a file may be indexed...
If it is viewed in a browser with a search-engine toolbar (e.g. Firefox with Googlebar) then it will be indexed (unless denied by robots.txt or a meta tag). This is also true if any pagerank tool is installed, official or otherwise.
If you want a file to be viewable but don't want it to be indexed, ever, at all, under any circumstances, as a minimum place it in a password protected directory.
in addition to what kaled describes, there are also situations where server access log files become publicly viewable and this would expose any direct requests for resources that would otherwise be obscured or protected.
Also worth noting, some packages like AWStats have a default install that does not lock down the statistics report that gets generated.
Hence hackers or spammers can easily pull up AWStats reports on websites that never lock them down.
Moral of the story, lock the stats directory down by password and/or IP address.
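If the stats pages are served by a script rather than as static files, the IP half of that advice might look something like this in Python. This is only a sketch: the networks listed are documentation-range placeholders and `is_allowed` is a made-up name, not part of AWStats or any server.

```python
import ipaddress

# Placeholder allowlist (documentation address ranges); replace with real ones.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(remote_addr):
    """Return True only if remote_addr parses and falls in an allowed network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparseable input is treated as not allowed
    return any(address in network for network in ALLOWED_NETWORKS)
```

An IP check like this is best combined with a password, since visitor addresses can change or be shared.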
Another item to add to your list: add the following to your .htaccess in that directory.
RewriteRule ^log\.txt$ /not-found.html [R=404,L]
For a web-viewable version for the client, you write a password-protected script that actually opens the file via the system, not a direct web request.
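A rough Python sketch of that idea, assuming the logfile lives outside the web root and the viewer keeps a salted password hash rather than the password itself. The names `hash_password` and `read_log_if_authorized` are hypothetical, and a real script would also need HTTPS and some rate limiting.

```python
import hashlib
import hmac


def hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def read_log_if_authorized(password, stored_hash, salt, log_path):
    """Return the log text if the password matches, otherwise None."""
    candidate = hash_password(password, salt)
    # compare_digest avoids leaking information through comparison timing
    if not hmac.compare_digest(candidate, stored_hash):
        return None
    # The file is read from disk by the script; it has no URL of its own.
    with open(log_path, "r", encoding="utf-8") as log:
        return log.read()
```

Because the script reads the file from disk, there is no URL for a bot or toolbar to discover, only the script's own address, which answers nothing without the password.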
Are outbound referrers a possibility? If the logfile contained any clickable urls then this could easily happen.
Your client clicks on a url, their browser sends the referrer to the webpage, the destination webpage has publicly viewable stats (many do), search engines spider that and follow the link to your client's page, and hey presto your secret log.txt file is available to all and sundry.
On Apache, we simply turn off index listings (Options All -Indexes) to any folder we don't want to be viewed directly.
Nosey-nates will type in your domain/your directory/ and they can see everything, even the index.htm file you put in there.
If you're on Apache, and you don't want files to be viewed in any particular folder (directory), simply add "Options All -Indexes" (without the quotes) to your .htaccess file.
On another possibly related note;
I came across a mail/php script years ago that I've been using without incident on some of our client sites. I wish I could remember where I got this.
I install this into the same directory as my mail form and point the form at it (action=feedback.php), and it's a done deal as far as getting the mail and no spammers.
(copy to notepad and save it as feedback.php)
<?php
// Configuration Settings
$SendFrom = "Feedback <email@example.com>";
$SendTo = "firstname.lastname@example.org";
$SubjectLine = "YourSubject";
$ThanksURL = "http://www.example.com/thanks.htm"; // confirmation page
$Divider = "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~";

// Build Message Body from Web Form Input
$MsgBody = @gethostbyaddr($_SERVER["REMOTE_ADDR"]) . "\n$Divider\n";
foreach ($_POST as $Field => $Value) {
    $MsgBody .= "$Field: $Value\n";
}
$MsgBody .= "$Divider\n" . $_SERVER["HTTP_USER_AGENT"] . "\n";
$MsgBody = htmlspecialchars($MsgBody); // make content safe

// Send E-Mail and Direct Browser to Confirmation Page
// An empty post, or a body containing "cc: ", "href=" or "[url", counts as spam
$Spam = count($_POST) == 0 || stristr($MsgBody, "cc: ") ||
    stristr($MsgBody, "href=") || stristr($MsgBody, "[url");
if (!$Spam) { // assumed: only send when the message does not look like spam
    mail($SendTo, $SubjectLine, $MsgBody, "From: $SendFrom");
}
header("Location: $ThanksURL"); // assumed: redirect to the confirmation page
?>
[edited by: phranque at 11:47 pm (utc) on Nov. 23, 2008]
[edit reason] exemplified urls [/edit]
Unlucky mate. But at least you didn't lose the gig altogether. Nice of you to warn others though!
you were probably just giving an example in your original post, but if your file really was named 'log.txt' then it's no wonder they found it.
bots trawl the web for obviously named stuff like that. you only have to do an inurl:log.txt search in google to turn up thousands of them.
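One way to act on this is to audit your own web root for names a bot could guess. The Python sketch below walks a local copy of the site; the pattern list is only an illustrative guess at "obviously named stuff", and `find_guessable_files` is a made-up helper name.

```python
import fnmatch
import os

# Illustrative patterns only; extend to suit your own site.
GUESSABLE_PATTERNS = ["log.txt", "*.log", "*.bak", "*.sql", "backup*"]


def find_guessable_files(web_root, patterns=GUESSABLE_PATTERNS):
    """Return paths (relative to web_root) whose filenames match a pattern."""
    matches = []
    for folder, _subfolders, filenames in os.walk(web_root):
        for name in filenames:
            if any(fnmatch.fnmatch(name.lower(), pattern) for pattern in patterns):
                full_path = os.path.join(folder, name)
                matches.append(os.path.relpath(full_path, web_root))
    return sorted(matches)
```

Anything it reports is a candidate for moving out of the web root or putting behind a password.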
i once had my stats package page turn up in alexa -- in a section labelled "most popular pages on this site".
and this was just weeks after the site went live.
there is no way they crawled that page from my site. so i can only assume i must have had their toolbar installed, or something like that, which gave them the url i visited. because no one else would have visited it.
Those were good ideas of other ways a file can be found, thanks. For the record, though, the logfile didn't contain any URLs (so it didn't show up in others' referrer logs), neither the client nor I ever accessed the file via the web (so it wasn't the toolbars), and our traffic logs/reports are locked down with passwords.
|Nosey-nates will type in your domain/your directory/ and they can see everything, even the index.htm file you put in there. |
I'm not following. I thought if there's an "index.htm(l)" file in the directory, the server will return only that? How exactly can someone see all the files?
Yes, the file was indeed named "log.txt", but even assuming that spammers try that one on every possible directory, how does that explain how Google and Yahoo found it? Certainly the major search engines aren't trying to guess at logfile names, right?
Options All -Indexes would prevent parsing of the directory by *nosey nates.
Google and Yahoo aren't being bad, they're just writing down what's in the directory after coming to home/directory/, index.htm and all because you don't have Options All -Indexes set.
A parsing utility will come to a directory with a url that looks like this;
and will proceed to collect whatever's there. htm, text, or whatever other file that is in the directory.
Again, a parsing utility will come to a directory that's written like this;
but this time you've got your Options All -Indexes set ... the server will serve up a 403 error to the parsing agent as a result, and yes, this includes Google and Yahoo.
If your link to the directory was written like this;
then the parsing utility would follow that, and follow whatever links you have on the page too, even though you've got your Options All -Indexes set.
I don't think that anyone was really intentionally gunning for your logfile. I think that through today's parsings, it just showed up in an index somewhere, someone found it, and then proceeded to exploit it.
I'm sorry, mcneely, I'm just not following you. What is a "parsing utility"? Can you give an example? Are you saying that there is some software that can get a list of all files in a directory even if that directory has an "index.html" file? If so, how does it do so, since my understanding was that the presence of an "index.html" meant that a request for the directory contents caused the server to return only the "index.html" file alone.
|I'm not following. I thought if there's an "index.htm(l)" file in the directory, the server will return only that? How exactly can someone see all the files? |
The only restriction is that index.html gets loaded if someone attempts to view the contents of the web folder, that's all. But that's not enough. Each file in the folder can still be accessed directly. But yes, he will need to guess the filename, so here is one scenario.
If the contact form you have in place can disclose some info, it is possible for someone to get the filename. The way it could be done is to send a bot to attempt all kinds of invalid values with the form fields. So if one of the attempts brings up an error or warning, eg:
warning cannot write to file /log.txt
...now he knows. He can then access the file directly, or submit the link to a popular page from which search engines index it, and everybody else can see it.
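The usual defence against that kind of leak is to keep the details in a server-side log and show the visitor a generic message. A minimal Python sketch, with `record_sender` and the message texts invented for illustration:

```python
import logging

logger = logging.getLogger("contact_form")

GENERIC_ERROR = "Sorry, your message could not be processed. Please try again later."


def record_sender(address, log_path):
    """Append address to the log; return (ok, message_for_visitor)."""
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(address + "\n")
    except OSError:
        # Full details (including the path) go to the server log only.
        logger.exception("could not write to %s", log_path)
        return False, GENERIC_ERROR
    return True, "Thank you, your message was sent."
```

The visitor-facing text never mentions the filename, so probing the form with bad input reveals nothing about where the log lives.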
The statistics approach that was mentioned earlier in the thread is another possibility. Exposed statistics are extremely useful for attackers.
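To make the directory-versus-file distinction concrete, here is a toy Python model of how a typical server answers the two kinds of request. It is a deliberate simplification rather than Apache's actual code, and `resolve` and its arguments are invented for the sketch.

```python
def resolve(path, files, indexes_enabled=True,
            index_names=("index.html", "index.htm")):
    """Toy model of a request. `files` maps a directory URL (ending in '/')
    to the set of filenames in it. Returns (status, body)."""
    if path.endswith("/"):
        names = files.get(path)
        if names is None:
            return 404, None
        for index in index_names:
            if index in names:
                return 200, index          # index file served, no listing
        if indexes_enabled:
            return 200, sorted(names)      # auto-generated directory listing
        return 403, None                   # listings switched off
    directory, _, name = path.rpartition("/")
    if name in files.get(directory + "/", set()):
        return 200, name                   # direct file requests always work
    return 404, None


site = {"/directory/": {"index.html", "log.txt"}}
print(resolve("/directory/", site))         # (200, 'index.html')
print(resolve("/directory/log.txt", site))  # (200, 'log.txt')
```

In this model neither the index file nor switching off listings affects the second request; only an access rule on the file itself, or moving it out of the web root, does.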
Ok, for the folder issue, one way is to make a simple .htaccess to restrict access to a file or a set of files, eg:
# matches every .txt file in this folder
<Files "*.txt">
Deny from all
</Files>
then there is no direct access for txt files in this example. It also depends on how the application operates. How do you write to the log file? Do you use fopen? Via a socket? etc.