Purge Information from domain tools and aboutus.org? - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

Forum Moderators: open


doni

3:33 am on Mar 8, 2009 (gmt 0)

10+ Year Member

Hello everybody

I'm a total noob to this forum, and to be honest I'm not super tech savvy, but I do my best to get by... on to the issue:

Is there any way to remove/purge your info from a site like Domain Tools or aboutus.org?

I've found another post on the site made by IncrediBill which showed me how to block the perpetrators' IP addresses... but I'm not sure if we can take it to the next level and get rid of all their stolen information.

I'm a musician, but I don't make enough money yet to be free from the employment world. The problem is these sites make it damn near impossible for me to have clean Google results for my first+lastname. I am making a general assumption that in a tough job market, employers would choose somebody who does not have any other extra-curricular activities that they work on over somebody who does; especially when it comes to music, because let's face it, all musicians are peace pipe smoking liabilities (totally not true, but I believe this is the stigma...)

Another thought: am I the only one who finds these Domain Tools type websites horribly offensive and a huge breach of privacy laws? I actually feel sick that some rogue company would publish my backend information so they can create more advertising revenue.

Any help would be greatly appreciated. Keep up the good work everyone

[edited by: incrediBILL at 10:23 am (utc) on Mar. 10, 2009]
[edit reason] fixed filter issue [/edit]

incrediBILL

10:27 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

There is nothing private about a domain registration; like buying a house, it's a matter of public record.

The only way to solve the problem is to get a private registration using a proxy service, but if you do that after your domain has been registered previously without a private registration, there's history available for anyone willing to pay to find out what it was previously.

As far as I know the only way to avoid the problem is to cancel the old domain and register a new domain and make it private from the initial registration.

Using a 301 redirect you should be able to point the old domain to the new domain, then discard the old domain after a period of time.

I don't think they track an old domain being 301d to a new domain, so that's probably your only recourse.
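
For anyone unsure what the 301 step looks like in practice, here is a minimal sketch using only the Python standard library. The new domain name, the port, and the helper names are all made up for illustration; on a real server this would more likely be a one-line rule in the web server's own configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical new domain, used only for illustration.
NEW_ORIGIN = "https://new-domain.example"


def redirect_target(path: str, new_origin: str = NEW_ORIGIN) -> str:
    """Map a request path on the old domain to the same path on the new one."""
    if not path.startswith("/"):
        path = "/" + path
    return new_origin.rstrip("/") + path


class PermanentRedirect(BaseHTTPRequestHandler):
    """Answer every request on the old domain with a 301 to the new one."""

    def do_GET(self):
        # 301 marks the move as permanent for browsers and search engines.
        self.send_response(301)
        self.send_header("Location", redirect_target(self.path))
        self.end_headers()

    # HEAD requests get the same status and headers.
    do_HEAD = do_GET


def serve(port: int = 8080) -> None:
    """Run the redirecting server until interrupted (port is an assumption)."""
    HTTPServer(("", port), PermanentRedirect).serve_forever()
```

Calling serve() on the host that still answers for the old domain sends every path, query string included, to the same path on the new domain.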

Rosalind

12:13 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

If you forbid them by IP address this will restrict what they can index, but it does mean that they just use the old information from your index and About page.

Typically Aboutus.org will scrape pages labelled "About" or addressed about.php, about.htm, and so on. So if you want to erase their data more effectively you need to let the IP in, but use a script to cloak it and send it to a page that no-one else will see. A short message about the wickedness of copyright infringement might be appropriate.
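
A rough sketch of the "let it in, but serve it something else" idea is below. It is simplified to key on the user-agent string rather than the IP address, using the AboutUsBot token reported later in this thread; the decoy path and the function name are hypothetical.

```python
# Token taken from the bot's user-agent string as reported in this thread.
BOT_TOKEN = "aboutusbot"

# Hypothetical page that only the bot is ever shown.
DECOY_PAGE = "/notice-for-scrapers.html"


def page_for(user_agent: str, requested_path: str) -> str:
    """Return the path to serve: the decoy for the bot, the real page otherwise."""
    if BOT_TOKEN in (user_agent or "").lower():
        return DECOY_PAGE
    return requested_path
```

A bot that spoofs its user-agent would slip past this, so matching on the IP range as well is the more robust variant.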

Of course if you don't want potential employers to look you up, your best bet is to adopt a stage name that they won't be searching for, and go in for some reputation management SEO for your real name.

dstiles

9:06 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I had (still have!) a problem with aboutus.org. They listed pretty much all of my own and my clients' sites with excerpts claimed to be from site pages. The info was almost entirely inaccurate and out of date, and in several cases was given for the wrong site. Sites that did not even exist were (and still are) listed, some with a mix of old data and new thumbnail.

I signed up for an account and began to remove the information. Within an hour some moron began replacing it. Meanwhile my account had been blocked.

I found a way of complaining (their contact form is not a model of functionality or ease of use). I was told by email to submit my domains to them and they would remove all non-domain information from their database. I submitted something like 300 domains.

I checked a handful of the domains and they seemed to be empty of info. Re-checking now I see that some of my domains now have info again, taken from the site - AND they have stolen my logos!

To be fair, other domains have a note that we did not wish to be listed. Perhaps they ran out of time to complete the removals.

At the time of submitting the domains list I did warn that if their listing, which is commonly high on google, caused my company damage by mis-representation I would consider legal action.

From my own experience, ANYONE can sign up to add or alter information about a domain - potentially with criminally damaging effect without proving ownership. I suspect replacing my data was due to me completely removing the information.

Nor are they above listing domains that have no web site. A message comes up asking you to wait whilst they compile information, then asks if it is a valid domain - you tell me, you suggested it! (Yes, the domain is valid but the robot block was successful.)

These so-called information sites are proliferating and contributing to the inaccuracy of the web, as well as suppressing real sites by being listed high on search engines for stolen content. They are using OUR sites for THEIR gain. Google is partly to blame for this in allowing them to rise high: they should be listing the real sites, not someone's unchecked opinion of them on a scam site. General wikis are bad enough, but when they claim to be this authoritative...

At the moment the web browser is displaying a really annoying and continuous popup on the aboutus pages saying "The google apps api key used on this site was registered for a different web site. You can generate a new key for this site at (google URL)." As if I cared!

Their site is hosted with Spry Hosting as Name Intelligence Inc on the range 66.249.16.0 - 66.249.17.255.

Their bot came in as:

IP: 66.249.16.nnn
UA: Mozilla/5.0 (compatible; AboutUsBot/0.9; +http://www.aboutus.org/AboutUsBot)

Spry has been blocked for some time on my server.
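
For reference, the start/end range quoted above collapses to a single CIDR block, which is the form most firewalls and server configs expect. A minimal sketch with Python's standard ipaddress module follows; the is_blocked helper is just an illustrative name.

```python
import ipaddress

# Range as quoted in the post above.
FIRST = ipaddress.ip_address("66.249.16.0")
LAST = ipaddress.ip_address("66.249.17.255")

# Collapse the start/end pair into the smallest list of CIDR networks.
BLOCKED_NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_blocked(remote_addr: str) -> bool:
    """True if remote_addr falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Here BLOCKED_NETWORKS holds one entry, 66.249.16.0/23, so a single deny rule covers the whole range.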

GaryK

12:11 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

They've been blocked on my sites since 2006 when I had similar problems with them.

Just checked my main money site and they've got outdated information from late 2006. They also list a slew of related domains that I don't own and never have owned including php.net. I wish I owned php.net!

I don't understand how a site like this can get away with listing such badly outdated and outright incorrect information without someone making a legal issue of it. Why haven't the folks at PHP taken issue with AboutUs stating that I own PHP's domain name?

dstiles

3:51 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I suspect it's because very few people who know what they're doing ever bother to go to the site. I only "discovered" it because I was looking for content thieves and this one came up.

Also, how many people can afford to begin a lawsuit, especially outside their own country? Apart from large companies, of course, and then see above. :(

As I said, I blame google for letting them (and other similar scavengers) get away with it.

koan

5:58 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

In your robots.txt

User-agent: AboutUsBot
Disallow: /
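
To sanity-check a rule like this before uploading it, Python's standard urllib.robotparser can evaluate it locally. This is only a sketch: it shows what a compliant parser would conclude, not what any particular bot actually does, and the may_fetch helper is an illustrative name.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the post above.
ROBOTS_TXT = """\
User-agent: AboutUsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent_token: str, path: str = "/") -> bool:
    """Ask the parser whether agent_token is allowed to fetch path."""
    return parser.can_fetch(agent_token, path)
```

With these rules may_fetch("AboutUsBot", "/about.html") comes back False, while an agent the file does not name, such as "Googlebot", is still allowed.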

incrediBILL

7:25 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Robots.txt has no impact on AboutUs.org, there is no bot page, they don't care about robots standards best I can tell.

<update>
I hadn't looked at AboutUs in ages, it appears I'm incorrect!
[aboutus.org...]

Not sure that'll stop the screen shots or Domain Tools though so I have their IP range blocked.
</update>

[edited by: incrediBILL at 9:47 pm (utc) on Mar. 14, 2009]

dstiles

4:40 pm on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

In my robots trap: block IP range and block "aboutus" robot. :)

koan

11:20 pm on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

incrediBILL, they will still create a page about your site but they won't scrape your content:

How do I prevent the bot from gathering info about my site?
Using a robots.txt file, you can choose not to have your future AboutUs.org pages initialized with selected content from your website. This doesn't mean that we won't create a Wiki Page for your website. Our users should still have the opportunity to contribute their own content describing your site, as well as adding their own reviews.

GaryK

11:45 pm on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Great. So people who aren't affiliated with my site can supply information about it. I'm sure that'll work as well as Wikipedia does. Which is to say not well at all.

dstiles

3:55 am on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

"initialized with selected content from your website"

In my experience they will not alter it once it's there, as witness a complete change of purpose for a couple of my sites in the past few years. Which can be very damaging if you buy a domain that previously belonged to a baddie, especially if you've never heard of the scammers.

incrediBILL

5:04 am on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I logged in for my domains a long time ago and removed all the scrapings and logos with:
"Removed for copyright and trademark violations"

koan

6:15 am on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

If you send them a copyright notice they will remove whatever they did scrape. I did so before I added them to my robots.txt. Personally I refuse to be forced to create an account with a site that copies my stuff, just like I don't register with all the myriad blogs copying my content to notify the owner. It's a matter of principle. A DMCA notice is more expedient.

GaryK

9:28 pm on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

That's a good idea. I'll contact my IP attorney in the morning.

keyplyr

4:43 am on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

aboutus.org would not remove my property from their server despite my requests. In fact, they have my property (content, images, registered trademark, logo, et al) tagged to a nefarious domain that is 301ing to my site.

To their (limited) credit, they did post a statement reflecting my protest.

GaryK

4:06 pm on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

If you used DMCA how can they get away with that? If they refuse to remove it isn't their service provider/host required to remove it considering it's rather obvious AboutUs has no rights to your content?

keyplyr

7:15 pm on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

One could say the same for any of the cached pages residing on SE servers. At least the big guys support the noarchive attribute in the robots meta tag, but many others don't. This has been discussed in depth here at WW.

GaryK

8:08 pm on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I've never seen those discussions. Has anyone discussed the possibility of some judge somewhere agreeing to a class-action suit against one of the big players on behalf of some ginormous amount of webmasters, perhaps starting a precedent that could be used in other cases?

Umbra

8:28 pm on Mar 13, 2009 (gmt 0)

10+ Year Member

If you used DMCA how can they get away with that? If they refuse to remove it isn't their service provider/host required to remove it considering it's rather obvious AboutUs has no rights to your content?

According to Section 4 of their Intellectual Property Policy ( [aboutus.org...] ), they do comply with DMCA letters.

In Section 2 of that policy, they argue that some content is "fair use". I don't know if section 2 supersedes section 4?

Any lawyers here?

MarkDilley

9:09 pm on Mar 14, 2009 (gmt 0)

10+ Year Member

Hello Folks,

As part of the community team at AboutUs, I wanted to respond to some of the information above.

* Is there any way to remove/purge your info from a site like aboutus.org?
** Yes, [aboutus.org...]

* I signed up for an account and began to remove the information. Within an hour some moron began replacing it. Meanwhile my account had been blocked.
** What is your account name? I'd like to see who did the blocking and find out why.

* I suspect replacing my data was due to me completely removing the information.
** You are correct, it is our policy to replace pages that are simply blanked. If someone edits the page to remove info, but leaves the headings, we do not replace it.

* I checked a handful of the domains and they seemed to be empty of info. Re-checking now I see that some of my domains now have info again, taken from the site - AND they have stolen my logos!
** Could you please let me know which pages? I'm not aware of us ever having repopulated information from a page that was formerly 'NoBotted'. I'd like to check it out.

* They also list a slew of related domains that I don't own and never have owned including php.net.
** Related Domains have never been intended to imply co-ownership. I'm not entirely sure how our bot decides what's related. Sometimes it appears to be sites you're linking to from your site or being linked to from another site. Other times, I'm not so sure. Our developers are in the process of re-writing the bot and revising the related domains algorithm.

* Robots.txt has no impact on AboutUs.org, there is no bot page, they don't care about robots standards best I can tell.
** We do care and have that addressed here: [aboutus.org...]

* At the moment the web browser is displaying a really annoying and continuous popup on the aboutus pages saying "The google apps api key used on this site was registered for a different web site. You can generate a new key for this site at (google URL)."
** Which URL are you visiting when you see this error? We would like to get this bug fixed.

* Also, how many people can afford to begin a lawsuit, especially outside their own country?
** I understand how you feel; many times it seems as if the threat of a lawsuit is the only way to get a big company to pay attention. We strive to be different from that: community is at the heart of AboutUs, and working to resolve community concerns is what I do. I am happy to talk with you by phone or email, or even here in this public forum.

Best, Mark

[edited by: incrediBILL at 2:43 am (utc) on Mar. 15, 2009]
[edit reason] no signature links tos #13, specifics removed [/edit]

incrediBILL

9:36 pm on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

class-action suit

Nope, we can't discuss legal actions here, short of the use of DMCA for copyright takedowns, per TOS#26:

26. Claims of action, flames, and calls to action against any company or person will be removed.

If you used DMCA how can they get away with that?

DMCA doesn't stop fair use, so small snippets of text can still remain regardless of the DMCA. Be careful when using the DMCA: if you aren't on solid ground with copyright law and you get a site disabled, they can counter-sue for damages.

[edited by: incrediBILL at 9:50 pm (utc) on Mar. 14, 2009]

MarkDilley

9:59 pm on Mar 14, 2009 (gmt 0)

10+ Year Member

I think it is easier for you to just contact me directly. Best, Mark

[edited by: incrediBILL at 4:05 am (utc) on Mar. 15, 2009]

incrediBILL

10:32 pm on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Hi Mark, and welcome to WebmasterWorld. I wasn't referring to your post specifically; I was referring to some issues in other posts above.

I'm glad to see you've added robots.txt options to your site, but does it retroactively remove content like Archive.org does?

What I mean is if I blocked the AboutUsBot today, does it remove the information previously gathered from our sites on the next visit?

[edited by: incrediBILL at 10:33 pm (utc) on Mar. 14, 2009]

dstiles

11:06 pm on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Thanks for responding, Mark.

* NoBot

I originally looked for removal information and could find no link to (eg) that page - or any other such page for that matter. When I tried to fill in the only form I could find it took several submissions with a password I was sure was correct before it was accepted. I was not pleased at the end of that session and even less so when all the information returned! And why should I have to password a complaints form?

From that nobot page:

"We choose not to completely remove pages from our system because AboutUs aims to be a guide to websites, and deleting a page would make us that much more incomplete."

Instead you prefer to have inaccurate, misleading and obsolete information stolen from my web sites. I did NOT give your company permission to hold extracts of my data nor to display my logos or thumbnails of my sites. And do not quote "fair use". I know what that means and you are abusing the principle if for no other reason than you permit others to modify the text to make it say what it was never intended to say but which viewers might think it did.

All of our newer sites include, in the AUP, the phrases:

"Content may not be held on another web server or used in any commercial form without the content owner's express written permission."

and

"You may not harvest or otherwise obtain or use information from this web site for commercial resale or advantage."

I think that about covers your site's abuse of mine. Have you inspected the AUP of ANY site you list?

* account name

<myaccount>- replaced by <employee> who then volunteered to remove my domains. I guess she ran out of patience part way through the list of 370+ domains.

* blanking sites

Some information to that effect would help. However, there is still the implication that ANYONE can change the entry, with drastic consequences for a web site, especially if the web site owner is blissfully unaware of you.

That, of course, assumes that anyone reads your site in the first place. Why anyone should place credence in a wiki that anyone can alter I have no idea. I certainly don't BELIEVE anything I read in a wiki without corroborating it elsewhere.

* Unremoved domains

The person named above has a full list or can copy the list to you upon request.

* related domains

So you are quite happy to associate domains that may be "bad neighbourhood" with my sites. You DEFINITELY need to work on that one. Likewise keeping up to date with obsolete domains.

Has it occurred to anyone there that if a domain returns a 4xx error then it may not exist or has blocked you for scraping? 403, for example, means you are not welcome. TAKE THE HINT! Remove the site.

* robots.txt

There is an entirely false assumption, mostly by botmasters, that a site can easily avoid being scanned by adding a line to this file. In practice the file is absolutely useless except to GUIDE major, well-known SEs (and by implication permit them to hold approved data from the site).

Most bots, if they take any notice of it at all, are usually unknown to site owners. To pursue every bot that lands on a site, test it for compliance and add it to this file is far too time consuming for webmasters. It is far easier to block by IP range or, if the bot is honest, by UA - one word entered in a server-wide file against several lines in every one of several dozen files scattered across the server for each of - oh, let's say, 5,000 bots and counting?

* annoying popup

I have no idea now - it was the page that displayed my site details.

* There is no way I can afford a long conversation with America, either in time or cash. If you really care about us then...

1) don't steal our content without EXPLICIT permission ;

2) don't allow non-owners to modify the information;

3) fix all domains so that they can only be modified IF the web site's header has a specific meta tag tied to a login (as google etc do);

4) when an owner removes ALL information, keep it that way!

5) when you get a 4xx error, dump the site.

YOU may think that "community" is at the heart of aboutus. Why should anyone care? If we want information about a web site there are plenty of ways of finding out and I doubt yours is top of most people's list. Your "service" is of little help to web surfers and can be annoyingly misleading; and it is potentially dangerous to web site owners if anyone finds the incorrect information and BELIEVES it.

Whatever you may think about your "community", at the end of the day the site is a commercial undertaking. It is making money out of US, first from exploiting our registration details and then from theft of our COPYRIGHTED site content.

[edited by: incrediBILL at 2:45 am (utc) on Mar. 15, 2009]
[edit reason] removed specifics tos #13, keep it civil tos #4 [/edit]

incrediBILL

11:59 pm on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

To pursue every bot that lands on a site, test it for compliance and add it to this file is far too time consuming for webmasters.

The problem is it's a robots EXCLUSION protocol, not an INCLUSION protocol

That's why I whitelist my robots.txt file and convert it to an INCLUSION protocol:

# allow these bots
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: Mediapartners-Google*
Disallow:

# block all other bots that ask
User-agent: *
Disallow: /

Then you stop chasing thousands of bots that honor robots.txt and you can review a month's worth of robots.txt requests at your leisure and see if anything is worth letting in.
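
The monthly review could be as simple as tallying which user-agents asked for robots.txt. The sketch below assumes Apache-style "combined" access logs; IIS and other servers log in different formats and would need a different pattern. The sample lines and bot names are invented purely for illustration.

```python
import re
from collections import Counter

# Combined Log Format tail: "METHOD path PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def robots_txt_requesters(log_lines):
    """Count the user-agent strings that requested /robots.txt."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("ua")] += 1
    return counts


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    '192.0.2.10 - - [01/Mar/2009:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [01/Mar/2009:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [02/Mar/2009:11:30:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "OtherBot/2.1"',
    '192.0.2.10 - - [03/Mar/2009:09:15:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
]
```

Running robots_txt_requesters over a month of log lines and reading counts.most_common() gives a short list of candidates to consider adding to the allow group.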

dstiles

2:39 am on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Whitelisting and re-checking is fine for a few sites or if you can automate it, but automating robots.txt on IIS requires a few changes I'm not prepared to make. In any case, every site has unique requirements such as blocking individual files or folders, adding come-ons into traps and so forth. As I said, it's easier to block by IP and/or UA.

I have felt for several years that robots.txt is, like a major part of the internet, rather antiquated (can you say that about something only about 20 years old?). It's all very well google patching and darning but it'll take a major impetus to get robots sorted out, starting with killing botnets and working up.

keyplyr

3:23 am on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Another "community" that seems to think they can copy our web sites is iterasi.net

Aboutus.com uses it as their "archive" link. I found several "saved copies" of my web sites. Sent DMCA notices.

From the iterasi.net home page:

Every day you find web pages you may never see again. Which is fine, unless you actually need that information. Bookmarks don't cut it. They lead you to where that information was - but not the information itself. With iterasi, you can save any web page and return to it anytime, from anywhere, forever.

(emphasis mine)

incrediBILL

4:08 am on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

We just had a recent thread about iterasi for those interested:
[webmasterworld.com...]

dstiles

4:30 am on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I wonder what would happen to either of these (and similar) sites in the event a web site owner was issued with a legal notice to remove content. If s/he were ignorant of these archivers who would be legally liable for continued infringement?

This 68 message thread spans 3 pages: 68
