This 57 message thread spans 2 pages.
|Google Forced to Remove Data|
That old GET vs POST problem all over again...
|The Social Security numbers and test scores of 619 students at public schools in Catawba County, N.C., were available online via Google's search engine until Friday, when the company complied with a local court order to delete all information about that county's board of education from its servers. |
According to sources, a student may have stored a username/password in a link in the form of a GET URL that included a non-expiring session ID. The school apparently was unaware that a GET request worked the same as a POST in their software.
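To illustrate (with a made-up URL), here's a minimal Python sketch of why a session ID in a GET URL is so dangerous:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical reconstruction of the kind of link described above: a
# non-expiring session ID carried in the query string of a GET URL.
# Everything in a URL is visible in server logs, browser history, Referer
# headers, and on any page that links to it -- so a spider that finds the
# link effectively "logs in" just by following it.
url = "http://example.edu/grades?sessionid=ABC123&student=42"

query = parse_qs(urlparse(url).query)
print(query["sessionid"][0])  # the credential sits in plain sight: ABC123
```

Credentials submitted in a POST body don't end up in links or logs the same way, which is exactly the distinction the school's software apparently missed.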
The point is, if you have information you want kept secret, don't put it on the Internet. Nobody should be trusted with sensitive information if they don't understand how to protect it. It's like saying you want to store things on your curb and you're mad the trash men keep picking it up. That is what they do. They do it twice a week.
|Would this be the IT guys who got his degree 20 years ago...? |
I've been programming for 25+ years, so by that logic you're saying that I was trained too long ago, you can't teach an old dog new tricks, and us old farts aren't keeping up on technology, which makes ignorance an excuse?
Nope, not buying it.
College just prepares you for the road ahead; it's up to everyone to continue their own ongoing education on a daily basis, and some go back and take additional classes if they can't learn it on their own.
|Exactly when and where should they learn it? |
This isn't that complicated, and I even said where I found it; it wasn't hidden.
The first time I ever went to Google I read everything they had posted in their webmaster help pages about crawling, indexing, and blocking their spider and continue to read updates as they post them to keep current.
This is a logical thing to do, it's not earth shattering or rocket science, anyone can do it and it's in as plain a place to find as anything could ever be. If someone is too lazy or complacent to look then it's THEIR problem as it's easy to find and ignorance is never an excuse.
How about coming to WebmasterWorld?
It's real hard to miss: you can barely search for anything technical on any search engine and miss WebmasterWorld. It's everywhere, and all the information is here unless, once again, you're too lazy to read it once you find WebmasterWorld.
It's like opening a box with a new electronic gizmo in it: you READ the instructions before you use it. If you don't read the instructions and can't use the new gizmo or all its features, it's because of ignorance, which is what happened in this instance.
|no problem with saying that anyone who doesn't have a robots.txt should not be indexed. |
Now you've hit on something good.
If there is no robots.txt file, the search engines should just move along elsewhere, as the webmaster hasn't said opt-in or opt-out and is obviously ignorant of web standards, so should be dismissed out of hand until the webmaster learns the bare basics of web 101.
If the webmaster wants to know why his site isn't being indexed, now he'll be forced to research the problem and discover that a site needs robots.txt to be indexed. Odds are other information about NOARCHIVE, NOINDEX, etc. will also be in the same place unless once again, this is a lazy webmaster and stops reading the minute the basic need has been met.
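For reference, under the opt-out model as it exists today, a minimal robots.txt at the site root looks something like this (the paths here are invented for illustration):

```
# Placed at http://example.com/robots.txt
# Compliant crawlers fetch this file first and skip anything disallowed.
User-agent: *
Disallow: /grades/
Disallow: /admin/
```

Of course, as noted below, this only restrains crawlers that choose to honor it.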
HOWEVER, this still ignores the fact that there are MANY non-legit crawlers out there that:
a) don't look for robots.txt,
b) ignore it if they do read it,
c) run in stealth mode, don't claim to be a bot, and pretend to be a browser.
So basically you would like those that play by the rules to play by new rules, which is OK if you want to change the rules, because the good bots tend to play fairly most of the time.
Still, the rest of the vermin will do whatever anyway, invited or not.
The issue of crawlers is just too big to be cured with one placebo bandage and is a whole new topic on its own.
[edited by: incrediBILL at 11:08 pm (utc) on June 26, 2006]
>> robots.txt is SEO.
Arguably, but it's also basic webmaster knowledge and basic server administration. And having a robots.txt in place wouldn't really have been the solution here... sure, it would probably have meant that Google wouldn't have indexed the data. But that data wasn't available only to search engines; if they found it anyone could have gotten to it.
In fact it's probably fortunate that it was picked up by Google, which led to it being found by relatives of one of the students and to the school being notified. If that hadn't happened, there's no telling how long it might have remained available and who might have come across the unsecured personal information.
The point is that if you're administering a publicly accessible computer, you have -- someone on the staff has to have -- responsibility for taking reasonable steps to secure it. If you don't do that for whatever reason you should expect some criticism.
|Robots.txt is SEO. Bottom of the barrel, easy peasy SEO, but SEO none-the-less. You put a robots.txt on a website in order to tell a search engine robot where not to go. It is there for no other purpose than to Optimize your websites for Search Engines. I.E. SEO |
No, robots.txt is bandwidth control. That is what the standard [robotstxt.org] addresses. It pre-dates almost all search engines. But it exists, and it is the standard. And as I said, if you go to an opt-in model, then who exactly is going to tell all those Webmasters, still living or not, that they need to go back and opt-in their sites? Changing the model now would simply destroy search and the Web as well, because most sites on the Web have no robots.txt, i.e. no current opt-in (or opt-out, for that matter). They would disappear overnight.
Like I said, "it's the law," like it or not. Opt-out or take your server down. Anyone who wants to learn about robots.txt, its purpose and its implementation need only do a search. Anyone who wants to know about password-protection and on-line security need only do a search.
|Google "hacked our website" |
|The schools claim that Google's search engine spider grabbed information they shouldn’t have and posted it on the Interweb. |
lol. As we can all see, it's not the school's fault. It has been clearly stated that Google hacked the website and posted it on the "Interweb."
|A spokesGoogle said that it was impossible for its spider to bypass a password. |
Who wrote this article? The school's IT staff?
|And as I said, if you go to an opt-in model, then who exactly is going to tell all those Webmasters, still living or not, that they need to go back and opt-in their sites? |
The same people telling internet professionals they should opt-out.
IncrediBILL, this is all I have been saying all along. I don't think that the people running sites are faultless, I just think that if we want to protect innocent bystanders from lack of knowledge and lack of sense, the SEs need to step up. It is their responsibility as much as it is the IT guy's. Kick out the sites that don't have robots.txt. Trust me, those that care will have it up right quick and the SEs will be safe from lawsuits.
If it is not too hard for people to put robots.txt on their site to keep out SEs, then it is not too hard to put one up to let them in. I think they won't because too many people just wouldn't care. Did that school care if they got found on a search engine for the school's name? Nope, probably not. But then the SEs would not be able to provide the comprehensive catalog they currently do. Of course, the privacy of a few hundred students would not be threatened either.
Whatever robots.txt was originally, it is now for search engine control. What other bandwidth control is out there concerning robots? The bad ones will just ignore it, and the rest have to do with companies interested in cataloging your site and its information.
Google is not the law. It is a company. It is dangerous to confuse the two.
>> lol. As we can all see, it's not the school's
>> fault. It has been clearly stated that Google
>> hacked the website and posted it on
>> the "Interweb."
Actually, the school district's CTO, who has been the spokesperson on this, said they never used the word "hacked" and don't see it as that. In fact, while the article linked to earlier at theinquirer has the headline Google "hacked our website", it doesn't attribute that quote to anyone.
She also says that they took the court route because they "could not get beyond an operator at Google" when attempting to contact the tech support or legal departments there.
That info is from Danny Sullivan's report today on an email he received from the school's CTO.
|the SEs need to step up. It is their responsibility as much as it is the IT guy's. |
We'll just have to disagree then as it's the guy who runs the server and the site who is responsible to secure it. It's YOUR server, YOU install the firewall, YOU install the anti-virus, YOU install the anti-spam, so it's not a stretch to imagine that YOU also need to worry about robots.txt, .htaccess and all the rest, not the SE's.
|What other bandwidth control is out there concerning robots? |
I block 300+ spiders daily.
The polite ones are politely turned away with robots.txt, minimum bandwidth waste, the rest are met with brute force.
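For the curious, what a "polite" bot does can be sketched with Python's standard-library robots.txt parser (the rules below are made up):

```python
import urllib.robotparser

# What a well-behaved crawler does before fetching a page: parse the site's
# robots.txt and honor its Disallow rules. The impolite ones simply skip
# this step, which is why robots.txt alone can't keep them out.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("AnyBot", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))         # True
```

The brute-force part, blocking by IP or user agent at the server, has to handle everything that never makes that `can_fetch` check.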
There were over 100 unique instances of Nutch alone, which does honor robots.txt, hitting my site in about a month.
And that's just the tip of the iceberg.
Recommended reading of additional forums here:
[webmasterworld.com...] - Search Engine Spider Identification
[webmasterworld.com...] - robots.txt
We're getting off topic now, but there's a ton of "Web 2.0" type internet aggregators out there, and it's just amazing how many new ones keep popping up all the time. We aren't just dealing with search engines any more; it's an epidemic, but that's a whole 'nother thread.
My last word on this topic: since it's a SCHOOL, the right thing for them to do would be to send the IT guy back for a few classes to bring 'em up to speed on Internet security.
>She also says that they took the court route because they "could not get beyond an operator at Google" when attempting to contact the tech support or legal departments there.
Now that makes sense. Surely it's not Google's fault this happened. But it is Google's product, and Google's product can't have Social Security numbers on it. So Google should fix it. Granted, another part of the solution is to educate the webmaster about protecting sensitive data, spiders, etc.
By the way, Google says that they removed the pages in question from the index before receiving the subpoena... apparently while the school staff never succeeded in contacting anyone there, media sources were able to do so after the story broke.
|Actually, the school districts CTO, who has been the spokesperson on this, said they never used the word "hacked" and don't see it as that. In fact, while the article linked to earlier at theinquirer has the headline Google "hacked our website", it doesn't attribute that quote to anyone. |
Point taken. The last thing they need is outside parties making them look even sillier. If that's happening, they should be more mad about that than Google indexing.
I do see the "opt-in" point being made in this thread. Since we live in the world as it is at this moment, you really shouldn't plug a computer into the internet without taking some reasonable steps to protect sensitive data. If a friendly search engine spider can get to it, it doesn't seem like reasonable steps have been taken.
I've never called Google for anything, but I'm not surprised they are slow to respond. No prob with using a court order.
If a search engine is smart enough to return a different set of results for the search terms "jewelry" and "jewellery" based on the country you're searching from, the same search engine should be smart enough to recognize the pattern "999-999-9999".
end of story.
> search engine should be smart enough to
> recognize the pattern of "999-999-9999".
> end of story.
Hardly. If your suggestion is that any page that contains a number fitting that pattern shouldn't be indexed by any search engine, I guess you mean 999-99-9999, which would be the pattern for a SSN. But many other numbers could fit that same pattern; simply trying to filter out any such sequence or any page containing that pattern would certainly not be a perfect solution.
And what if a list of SSN numbers doesn't include the hyphens?
And for that matter, if I choose to include my SSN on -- for example -- a resume that I make available online, that's my decision. Yeah, it may be a foolish decision, but a search engine shouldn't be making it for me.
Of course it is 999-99-9999, JAYC, same as an ISBN number gets converted to a link... You are absolutely right about the resume thingy. I am able to recognize that you are (or not) from... and that your tax-payer ID is... and so on. Liability is covered under the "make me a bigger story" type of thing.
>> ofcause it is 999-99-9999 JAYC, same as ISBN number gets converted to a link...,
Uh... Every ISBN consists of ten digits preceded by the letters ISBN. Don't put 'ISBN' in front of it, and it doesn't get recognized and linked -- because there's no way to tell it from any other ten digit number.
I have an employee database on my PC with the SSN of every employee going back 10 years. No hyphens stored; just 9-digit numbers. You'd expect that if I were to put that list on the web, Google should be able to recognize that those are Social Security numbers?
Don't expect Google to run around trying to determine what's an SSN and what's not; that's just silly. I can make a typo in a phone number, drop one digit, and POOF! my page won't index. What about product part numbers that look like SSNs, or a math problem like "what's 123-55-3333?" and so on and so forth.
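To make the false-positive problem concrete, here's the naive filter as a Python regex (the sample strings are invented, echoing the examples above):

```python
import re

# The naive filter being proposed: flag anything matching ###-##-####.
# It can't tell an SSN from a part number or a math problem, and it
# misses unhyphenated SSNs entirely.
ssn_like = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

print(bool(ssn_like.search("SSN: 123-45-6789")))      # True  -- a real hit
print(bool(ssn_like.search("Part no. 123-45-6789")))  # True  -- false positive
print(bool(ssn_like.search("What's 123-55-3333?")))   # True  -- false positive
print(bool(ssn_like.search("123456789")))             # False -- real SSN missed
```

Any filter loose enough to catch unhyphenated SSNs would also drop every page containing an arbitrary nine-digit number.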
The search engines can't be responsible for what people expose, it's just silly.
However, when someone from that website puts in a removal request, it should happen instantly, no excuses, just yank those pages ASAP.
Especially if "social security" or "SSN" or "SS NUMBA" was in the proximity of the 9-digit number or 11-character string, just to cover all the bases. :)
Oh wait, if they misspell SSN and put SNL instead, would that mean the page in question is in a comedy topic?
Since when was someone allowed to steal my stuff because I left my door unlocked?
So let's say a friend asks you to watch his pile of a million dollars and you agree to keep it safe. You leave it out on the front door step in the middle of the night and it's gone in the morning. Is your friend going to forgive you because it wasn't your fault that someone stole it?
This thread almost makes me cry.
First of all, the tech person/teacher who put the full SSNs online should be sent to jail. So if I am in a class, does that mean every student in that class should see my SSN? Moreover, if someone puts that info online with little or no security, it should be identity fraud by the teacher. Yes, it was not the intent; nevertheless it's just a lame excuse. If I leave a gun at the front door, that does not mean I am not liable.
Opt-in SEs? So for a few morons who don't understand basic security, millions of websites should be penalized?
|First of all, the tech person/teacher who put the full SSN should be sent to jail |
I sure hope you don't become a governor.
|First of all, the tech person/teacher who put the full SSN should be sent to jail |
I tend to agree with the tech person facing some form of action. If it was a teacher then the school should face some form of action instead because data entry and website management is certainly not appropriate use of a teacher's professional time.
On a related subject I am continually amazed at the number of apps which do not differentiate between GET and POST. I've never known what the difficulty is. I certainly differentiate as a matter of course.
From SearchEngineWatch - Danny Sullivan [blog.searchenginewatch.com]
|I was finding pages from what the district said was a password protected area to still be available through Yahoo. |
...some of these pages indeed didn't require a login to view.
I cannot understand why so much of this topic seems to revolve around robots.txt!
As has been pointed out there are far less 'polite' bots around than Google's. There are also, although you wouldn't know it to read WebmasterWorld sometimes, people with browsers out there.
These people with browsers can do exactly what Google did, which is follow a link on a page they've 'read'. This 'following' sends a request for the page to the webserver, which should make a decision, based on how it is set up, whether to give out a copy of that page. Google and other bots are a side issue here. The information should never have been made publicly available to anybody who makes a request; responsibility is squarely down to the techs involved.
(Given some other recent tech stories perhaps the right approach to shifting the blame is to accuse the student who put up the link of subverting the webserver's security ;) )
|So you are saying that a school should invest in a $60K-100K IT guy, heck, a whole staff of IT people when they can barely afford to pay a teacher $30K or. |
Yes, absolutely they should. If anyone - school, bank, government, whatever - decides to make data this sensitive available outside of a secure intranet, then no expense should be spared to make sure someone competent is keeping that data secured.
I believe the school's IT staff - system admins, webmasters, whatever - are 100% responsible for allowing this to happen in the first place.
If you don't want to hire the staff to properly maintain such a system, don't make this information available. Simple.
I surely hope other schools are paying attention to this and taking a good hard look at their own systems.
adamas got it right. The real problem here is that somebody put secure data on a publicly available webpage. Sure, Google may have made it easier to find, but the fact is it was still available. Robots.txt doesn't matter; I have no need to exclude Google and other nice bots from secure areas of my site (via robots.txt) because they are secure!
Also it's not really a GET/POST problem. Any page with an SSN should be password protected and only available via SSL (https). It wouldn't take a full time IT guy or staff, just a competent contractor to get that stuff right.
|Also it's not really a GET/POST problem. |
Um, yes, it is.
SSL does NOT mean secure, it only means encrypted transmission.
Even if the LOGIN was in SSL, Google could've spidered it unless the login form BLOCKED all login requests from a GET; otherwise Google would just be spidering encrypted pages.
SSL means it's hard (not impossible) to sniff those pages in transmission but it's a complete waste of SSL if the login is vulnerable.
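The fix being described amounts to this toy sketch in Python (function, parameter names, and credentials all invented for illustration):

```python
def handle_login(method, params):
    """Refuse to authenticate on anything but POST, so a spider that
    follows a link carrying credentials in the URL gets an error
    instead of a logged-in session. (Illustrative only.)"""
    if method != "POST":
        return 405, "Method Not Allowed: log in via POST only"
    if params.get("user") == "student" and params.get("password") == "secret":
        return 200, "logged in"
    return 401, "bad credentials"

# A spider following a GET link with credentials baked into the URL:
print(handle_login("GET", {"user": "student", "password": "secret"}))   # (405, ...)
# A browser actually submitting the login form:
print(handle_login("POST", {"user": "student", "password": "secret"}))  # (200, 'logged in')
```

The method check is the whole point: SSL encrypts the transport, but only the server can refuse to treat a crawlable GET link as a login.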