| 11:33 am on Jul 15, 2010 (gmt 0)|
Hunh? You don't know what's on your own site?
| 11:41 am on Jul 15, 2010 (gmt 0)|
Our content is based on the content of many other websites (domains). It's easy to filter it using a dictionary of #*$!-related keywords, but how do we handle pages in languages other than English? Arabic, Russian, German, etc.
That's why I'm looking for a database like urlblacklist.com but with a complete list of all adult domains.
| 11:46 am on Jul 15, 2010 (gmt 0)|
how do you manage to keep track of a "few million" websites? is that a literal number? you've got to be scraping the content, surely.
| 11:51 am on Jul 15, 2010 (gmt 0)|
Yea, you may not get a lot of help here with that. Some of us spend a lot of time fighting that sort of thing.
| 11:52 am on Jul 15, 2010 (gmt 0)|
The question is not how we manage the websites...
I'm talking about _1_ website that indexes around 120 million other websites (of course, not all of them are indexed so far, just a few million ;) ) and produces content based on their content (e.g. keyword density stats). I need to drop all adult websites from our database and I'm looking for a way to do this.
| 11:56 am on Jul 15, 2010 (gmt 0)|
oh, we thought you were a scraper.
getting a list of websites is probably a waste of time, because new ones would pop up every five minutes. you'd also have no way of knowing whether the list is complete, which would mean you'd still have exactly the same problem.
you'd be better off scanning your own pages for a dictionary of certain words, and then stopping adsense from appearing on those that have them. that way you can still run other ads, but not adsense.
why throw away thousands of pages?
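A minimal sketch of the dictionary scan suggested above, assuming the page text is already extracted. The term list, the function names and the "other_network" label are placeholders for illustration, not anything AdSense provides; the two sample words are the ones mentioned elsewhere in this thread.

```python
import re

# Hypothetical term list; a real one would be much longer.
FLAGGED_TERMS = {"erotic", "nude"}

# Split on anything that is not a letter, so only whole words are compared.
WORD_RE = re.compile(r"[a-z]+")

def is_flagged(page_text, terms=FLAGGED_TERMS):
    """Return True if any flagged term appears as a whole word in the text."""
    words = set(WORD_RE.findall(page_text.lower()))
    return not words.isdisjoint(terms)

def ad_slot(page_text):
    """Pick which ad network to render for a page (labels are placeholders)."""
    return "other_network" if is_flagged(page_text) else "adsense"
```

Matching whole words rather than substrings keeps a page about "denuded hillsides" from being flagged, and the page itself stays online with a different ad.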
| 12:36 pm on Jul 15, 2010 (gmt 0)|
>>oh, we thought you were a scraper.
same difference! someone using my bandwidth to produce keyword density stats - which they then try to make money from by serving them up with adsense ads
| 12:50 pm on Jul 15, 2010 (gmt 0)|
yeah, we are bad.. i know that ;)
| 2:16 pm on Jul 15, 2010 (gmt 0)|
I don't think you're going to find any universal list of adult domains but a partial list is a good start.
I think you're going to have to do your own content filtering and look for adult words on the page and simply disable AdSense on those pages and replace it with something else.
The foreign languages may require translating in order to filter, and sadly the translation tools often deliberately skip over certain words you'll need to filter.
I would start by yanking AdSense until I figured this problem out.
Then deploy it on English sites after filtering out the bad stuff.
FYI, I ran into this same problem with AdSense once, on an art site even, and simple words like "erotic" or "nude" will send AdSense into a tailspin.
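As a first pass before any content filtering, a partial domain list can be checked like this. It's only a sketch: the domains are made-up `.test` names, and the set stands in for whatever list you actually load (e.g. an export from a blacklist service).

```python
from urllib.parse import urlsplit

# Hypothetical partial blacklist; load the real one from a file or database.
BLOCKED_DOMAINS = {"adult-example.test", "another-example.test"}

def host_of(url):
    """Extract the lowercase hostname from a URL or a bare hostname."""
    # Bare hostnames get a '//' prefix so urlsplit treats them as a netloc.
    host = urlsplit(url if "://" in url else "//" + url).hostname
    return (host or "").rstrip(".")

def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the host or any parent domain of it is on the list."""
    labels = host_of(url).split(".")
    # Try www.a.test, then a.test, then test.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))
```

Checking parent domains means `www.adult-example.test` is caught by a single `adult-example.test` entry. Anything that passes still needs the on-page word check, since the list will never be complete.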
| 2:33 pm on Jul 15, 2010 (gmt 0)|
This is what I'm doing right now. I'll probably have to forget about placing AdSense on any non-English page forever ;(
| 3:02 pm on Jul 15, 2010 (gmt 0)|
After thinking about it, I have a solution that I also use but it's very time consuming and maybe impractical for the number of pages you have.
You could try running a screenshot tool and reviewing maybe 500 screenshots per page to spot adult sites.
It's the fastest way I know to scan that many sites without browsing them individually, as the screenshot tool does the browsing for you in the background.
If you make enough from your site, perhaps consider outsourcing the compilation of offensive terms to filter to native speakers of each language.
Additionally, you may want to look for safe site filtering technology in foreign languages, the "net nanny" type of stuff.
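If native speakers do compile term lists, one way to plug them in is a dictionary keyed by language code. This is a rough sketch under some assumptions: the language of each page is already known (detection is not shown), the sample terms are illustrative only, and simple word splitting will miss inflected forms or words with attached prefixes, which matters for languages like Arabic.

```python
import re
import unicodedata

# Hypothetical per-language lists; real ones would come from native speakers.
TERMS_BY_LANG = {
    "en": {"erotic", "nude"},
    "de": {"erotik", "nackt"},
    "ru": {"эротика"},
}

# \w is Unicode-aware for str patterns, so Cyrillic words tokenize too.
TOKEN_RE = re.compile(r"\w+")

def normalize(text):
    """Fold case and compatibility forms so comparisons are consistent."""
    return unicodedata.normalize("NFKC", text).casefold()

def flagged_terms(page_text, lang):
    """Return the flagged terms found, or None if no list exists for lang."""
    terms = TERMS_BY_LANG.get(lang)
    if terms is None:
        return None  # unknown language: do not treat the page as clean
    tokens = set(TOKEN_RE.findall(normalize(page_text)))
    return tokens & {normalize(t) for t in terms}
```

Returning `None` for a language without a list keeps "not checked" separate from "checked and clean", so those pages can stay without AdSense until a list exists.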