Forum Moderators: martinibuster

adsense ad quality logger and analyzer

server-side PHP, help me improve it

         

amznVibe

4:39 pm on Jan 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I present to you a little project I have been working on to automate one of my biggest adsense problems as a publisher with high-traffic websites.

We work hard, getting our filters down just right after studying our logs and using the adsense preview tools. The more sites and volume you have, the bigger a problem it becomes, as new advertisers can pop up daily and skew all your efforts with poorly written, bad-taste or completely off-topic ads. Even worse, you only get 200 static filters, so if you need to remove some domains to add some new ones, you risk forgetting what is still active. Don't forget you also have some liability if ads for illegal products are shown on your sites. It's a mess.

After staring blindly and angrily at the adsense preview tool for the 100th time one week, I said there has to be a way to make our rules more automated. We know what we do and don't want. We just need to know what the inventory is for our pages at any given moment.

The following code will use the adsense preview tool (in test mode, no penalty for impressions given) to generate a log for specified pages on your sites. It will gather all the ad urls, visible and hidden, and the ad text. It timestamps each ad to record when it was "first seen" and only adds to the log if there is a new, unique url. Last but not least, it caches the data returned by the preview tool so we don't get Google upset by constantly pulling down the same data while developing.

Right now it's using a flat file and leaves a lot to be desired. But with some mysql code and good analysis algorithms for badwords (e.g. "free"), ad-age, etc. we can really make Adsense work like it's supposed to. Help me make this dream come true!

[edited by: amznVibe at 4:48 pm (utc) on Jan. 22, 2005]

amznVibe

4:43 pm on Jan 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[1]<?php
$pages=array(
"www.example.com/",
"www.example.com/news/",
"www.example.com/text/",
);
$serverusername="blah"; // username for paths below
$log='/home/'.$serverusername.'/www/inventory.log.txt'; // chmod 777
$cache_dir='/home/'.$serverusername.'/www/XMLcache/'; // chmod 777
$cache_time=1800; // cache and don't query Google more than once a half hour, 0 to turn off
$ads_limit=50; // if returned ads ever go over 20, limit to this number
[/1][1]
$adsenseurl="http://pagead2.googlesyndication.com/pagead/ads?adtest=on&output=xml&client=ca-google&url=";
[/1][1]
putenv('TZ=US/Pacific'); $uid = posix_getpwnam ($serverusername); posix_setuid($uid["uid"]);
[/1][1]
$data=''; if (@$handle=fopen($log, "rb")) {while (!feof($handle)) {$data.=strtolower(fgets($handle, 4096));} fclose ($handle);}
$handle = fopen($log, 'a+b');
foreach ($pages as $page) { echo "source page - ".$page." - ";
if ($xml = XMLfetch($adsenseurl.$page,$ads_limit,$cache_dir,$cache_time)) {$i=0; $unique=0;
foreach ($xml['ads'] as $ad) {$i++;
$realurl=urldecode(preg_replace("/^.*?[?|&]adurl\=(.*?)&.*?$/si","$1",$ad['url']));
if (strpos($data,strtolower($realurl))===false) {
$output=date("Y-m-d")."\t".date("H:i:s")."\t".$ad['visible_url']."\t".$realurl."\t".$ad['LINE1']." ".$ad['LINE2']." ".$ad['LINE3']."\n";
if (!fwrite($handle, $output)) {echo "error writing";}
$data.=strtolower($output); $unique++; // keep $data lowercase so it matches the strtolower() check above
}} echo $i." ads loaded - $unique new ads found...<br>\n";
if ($xml['ad_count'] <= 0) {echo "error: no ads found?<br>\n";}
}
else {echo "error: can't load ads?<br>\n";}
}
fclose($handle);
[/1][1]
function XMLfetch ($XML_url,$ads_limit,$cache_dir,$cache_time) { $timedif=0;
if ($cache_time>0) {$cache_file = $cache_dir.md5($XML_url); $timedif = @(time() - filemtime($cache_file));}
if ($timedif<$cache_time) {$result = unserialize(join('', file($cache_file)));}
else {$result = XMLparse($XML_url,$ads_limit);
if ($cache_time>0) {$serialized = serialize($result); if ($f = @fopen($cache_file, 'wb')) {fwrite ($f, $serialized, strlen($serialized));fclose($f);}}
} return $result;
}
[/1][1]
function XMLparse ($XML_url,$ads_limit) {
$adwrap = "AD"; $adtags = array('LINE1','LINE2','LINE3','url','visible_url');
$XML_content = ''; if (!$fp = @fopen($XML_url, 'rb')) {return False;}
while (!feof($fp)) {$XML_content .= fgets($fp, 4096);} fclose($fp);
preg_match_all("'<".$adwrap."(| .*?)>(.*?)</".$adwrap.">'si", $XML_content, $adlines);
$i = 0; $XML_ads_attr = $adlines[1]; $XML_ads = $adlines[2];
$result['ads'] = array();
foreach($XML_ads as $XML_ad) {
if ($i < $ads_limit || $ads_limit == 0) {
foreach($adtags as $adtag) {
$temp = preg_match2("'$adtag=\"(.*?)\"'si", $XML_ads_attr[$i]);
if ($temp!= '') $result['ads'][$i][$adtag] = strip_tags(unhtmlentities(strip_tags($temp)));
}
foreach($adtags as $adtag) {
$temp = preg_match2("'<$adtag.*?>(.*?)</$adtag>'si", $XML_ad);
if ($temp!= '') $result['ads'][$i][$adtag] = strip_tags(unhtmlentities(strip_tags($temp)));
}
$i++;
}
}
$result['ad_count'] = $i;
return $result;
}
[/1][1]
function unhtmlentities ($string) {
$trans_tbl = get_html_translation_table (HTML_ENTITIES);
$trans_tbl = array_flip ($trans_tbl);
$trans_tbl["&apos;"] = "'";
$ret = strtr ($string, $trans_tbl);
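// the /e modifier was removed in PHP 7; a callback performs the same numeric-entity conversion
return preg_replace_callback('/&#(\d+);/', function ($m) {return chr((int)$m[1]);}, $ret);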
}
[/1][1]
function preg_match2 ($pattern, $subject) {
preg_match($pattern, $subject, $out);
if(isset($out[1])) {return trim($out[1]);} else {return '';}
}
?>
[/1]
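
For anyone who wants to analyze the log afterwards (ad-age and so on), here is a minimal sketch of reading one line back. It assumes the tab-separated layout the logger writes (date, time, visible url, real url, ad text). The function names `parse_log_line` and `ad_age_days` and the array keys are made up for illustration, and the timezone is assumed to be Pacific time because the logger sets TZ=US/Pacific.

```php
<?php
// Hypothetical helper: split one tab-separated log line into named fields.
// The field names are assumptions chosen for this sketch.
function parse_log_line(string $line): ?array
{
    $parts = explode("\t", rtrim($line, "\r\n"), 5);
    if (count($parts) < 5) {
        return null; // malformed or empty line
    }
    list($date, $time, $visibleUrl, $realUrl, $text) = $parts;
    return array(
        'first_seen'  => $date . ' ' . $time,
        'visible_url' => $visibleUrl,
        'real_url'    => $realUrl,
        'text'        => trim($text),
    );
}

// Whole days between an entry's first-seen stamp and $now.
// Both stamps are read as Pacific time (assumed from the logger's TZ setting).
function ad_age_days(array $entry, string $now): int
{
    $tz    = new DateTimeZone('America/Los_Angeles');
    $first = new DateTimeImmutable($entry['first_seen'], $tz);
    $end   = new DateTimeImmutable($now, $tz);
    return (int) $first->diff($end)->days;
}

$entry = parse_log_line("2005-01-22\t16:43:00\twww.example.com\thttp://www.example.com/landing\tFree Widgets Buy now Ships today\n");
echo $entry['visible_url'], " first seen ", ad_age_days($entry, '2005-01-29 16:43:00'), " days ago\n";
```

An age like this could feed a rule such as "flag anything seen for the first time in the last day", but that part is left open here.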

amznVibe

4:54 pm on Jan 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Don't forget the forum messes with the following character ¦
You need to search and replace that with the regular pipe symbol.

There are two problems with the above code right now. Both parts were working before, so I must have broken something. My eyes may be too tired to spot it at the moment.

First, the unhtmlentities function stopped working (stuff like &amp; should become just &), and secondly a dupe will occasionally get by strpos for some reason - I have no idea why. Perhaps someone can spot these issues for me.
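
Two possible directions for the problems above, offered as a sketch and not as a diagnosis. For entities, PHP's built-in `html_entity_decode()` can stand in for a hand-rolled translation table. For duplicates, a "seen" set keyed by the lowercased url keeps the stored form and the checked form identical, which a mixed-case string comparison does not guarantee. The names `decode_entities` and `is_new_url` are invented for this example.

```php
<?php
// Decode named and numeric entities with the built-in function.
// ENT_QUOTES covers &#39; and ENT_HTML5 also covers &apos;.
function decode_entities(string $s): string
{
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical "seen" set: keys are lowercased urls, so the case used when
// storing always matches the case used when checking.
function is_new_url(array &$seen, string $url): bool
{
    $key = strtolower(trim($url));
    if (isset($seen[$key])) {
        return false; // already logged
    }
    $seen[$key] = true;
    return true;
}

$seen = array();
var_dump(decode_entities('Tom &amp; Jerry&#39;s'));
var_dump(is_new_url($seen, 'http://www.Example.com/Page'));
var_dump(is_new_url($seen, 'http://www.example.com/page'));
```

Note that the cache stores already-parsed results, so a change to the decoding step only shows up once a cached file has expired or been deleted.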

jetteroheller

10:07 pm on Jan 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You wrote:
The following code will use the adsense preview tool (in test mode, no penalty for impressions given)

I do not understand what you mean by

"no penalty for impressions given"

What do you mean by this?

amznVibe

2:58 am on Jan 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Very simply, the code uses the same url that Google and other third party preview tools use for their adsense "sandbox" routines. The adwords advertiser is not charged for, and does not have counted against them, the retrieval (impressions) of their ads when done in this manner.

homeblock

2:35 am on Jan 24, 2005 (gmt 0)

10+ Year Member



<snip> error: can't load ads?

[edited by: Jenstar at 3:21 am (utc) on Jan. 24, 2005]
[edit reason] No URLs as per TOS [/edit]

amznVibe

5:04 am on Jan 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That error means the adsense xml is not even being returned properly.
The code is (still) working on my server.

You'll have to give me more information and check carefully for forum character replacement in the code above. Are you using php on apache or microsoft?

amznVibe

7:03 am on Jan 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is some code to add a new url to the adsense URL filter list.
The next step is to add a "bad" keyword list so that when a stopword is found in the ad text, the url is added to the filter list automagically
(e.g. "sex","chat","free")

WARNING: backup your URL filter list before testing this code

<?php
$addfilter="example.com";
$username="adsense.email@address.com";
$password="adsense.password";
$cookie="/home/your.server.username/public_html/path.to/cookiefile";

$url="https://www.google.com/adsense/login.do";
$destination="/adsense/filter-edit?null=Add+%2F+Edit+sites";
$postdata="destination=".urlencode($destination)."&username=".urlencode($username)."&password=".urlencode($password)."&null=Login";

$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,$url);
curl_setopt ($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt ($ch, CURLOPT_FRESH_CONNECT, FALSE);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt ($ch, CURLOPT_TIMEOUT, 20);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt ($ch, CURLOPT_COOKIEFILE, $cookie);
$result = curl_exec ($ch);

$filterlist=preg_split('/\n/simU',preg_replace('/^.*\<textarea.*\>(.*?)(|\n)\<\/textarea\>.*$/sim','$1',$result));
$filterlist[count($filterlist)]=$addfilter;

$url="https://www.google.com/adsense/filter-save.do";
$postdata="null=Save+changes&filterlist=";
foreach ($filterlist as $entry) {$postdata.=urlencode(trim($entry))."%0D%0A";} // trim stray \r, encode each entry for the POST body

curl_setopt ($ch, CURLOPT_URL,$url);
curl_setopt ($ch, CURLOPT_POSTFIELDS, $postdata);
$result=curl_exec ($ch);
echo $result; //for testing only - remove
curl_close($ch);
?>

note: replace the '¦' in the middle $filterlist=preg_split - forum changes the character
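
The "bad" keyword step mentioned in this post could look roughly like the following. This is only a sketch: the function name `find_stopwords` is made up, the word list is the example from the post, and whole-word matching is an assumption (it avoids hits inside longer words).

```php
<?php
// Return the stopwords that occur as whole words in the ad text
// (case-insensitive). The word list is an example, not a recommendation.
function find_stopwords(string $adText, array $stopwords): array
{
    $hits = array();
    foreach ($stopwords as $word) {
        // preg_quote() keeps any special characters in a stopword literal
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        if (preg_match($pattern, $adText) === 1) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$stopwords = array('sex', 'chat', 'free');
print_r(find_stopwords('Free live chat for Essex residents', $stopwords));
```

A non-empty result would be the trigger for adding the ad's url to the filter list with the code above.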

Sanenet

10:05 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Question - don't these types of scripts violate section 5, paragraph (vii) of the Google Adsense ToS?


[..](vii) "crawl", "spider", index or in any non-transitory manner store or cache information obtained from any Search Results and/or Ad(s) or any part, copy or derivative thereof, [..]

incrediBILL

10:29 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"no penalty for impressions given"?

You're messing with the AdWords people's impressions and CTR; that's probably enough to make Google swat you like a fly if you get caught running this TOS violation script. Not to mention Google may interpret this act as someone attempting to collect a list of their advertisers to compete against them.

Not something I'd want to do, my checks are too substantial to lose!

I would be more concerned with making sure my page content was narrowly focused and had all the right things on the page for AdSense to correctly identify the content rather than attacking the advertisers. A quick email to Google describing the problem would allow them to at least let you know if your pages were at issue and why they may be attracting the wrong types of ads.

amznVibe

10:31 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please don't post general opinions if you are not up to speed on adsense's test mode, what it is and how it works.

If you read that TOS part carefully, they are talking about you crawling the ads when displayed on your webpage. That is completely different. I am fetching the ads in test mode directly from Google. This is exactly what their preview tool and other third party tools do. I am not doing anything new for that part.

Again, my php tools do, for the most part, what other tools have already been "okayed" by Google to do. I am not doing anything new against their policy. There have been very popular and widely known adsense sandbox tools in heavy use for a year now.
My tool also lets you cut out a third party so you don't have to worry about entering your page urls into their service.

I would be more concerned with making sure my page content was narrowly focused and had all the right things on the page for AdSense to correctly identify the content rather than attacking the advertisers. A quick email to Google describing the problem would allow them to at least let you know if your pages were at issue and why they may be attracting the wrong types of ads.

You think it's as easy as having the "right content"? So you think your international visitors are seeing what you have targeted? Try installing a click monitoring tool and you'll be very surprised. Leaving it up to Google and unfiltering your ads is an incredibly bad idea. How are you going to email Google "describing the problem" when you don't know what the problem is in the first place, other than low CTR and CPM?

incrediBILL

11:19 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It wasn't a general opinion, I was going off what the agreement stated.

If there is a variance to that, other than the AdSense Preview tool, I was not aware of it.

Can someone kindly direct me to this information stating we are allowed to make our own preview tools or an API description of the interface?

Sanenet

11:24 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[..]By installing and/or using the AdSense Preview Tool, you agree to abide by the Google Toolbar Terms of Use [..] as if such terms applied to the AdSense Preview Tool and Google's advertising services [..]

[google.com...]

[..]But you first need to obtain Google's permission if you want to sell the Google Toolbar or any information, services, or software associated with or derived from it, or if you want to modify, copy (except as provided below), license, or create derivative works from the Google Toolbar.[..]

[toolbar.google.com...]

(Emphasis mine)

Sanenet

11:40 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That said, it seems like a good little tool. It really needs to be checking against a database, in order to help get around the annoying limit of 200 banned urls.

Strikes me that there are 2 possible cats under which banned ads fall - untargeted, and unwanted. So, we need to build up two url lists, one for each of these cats.

An untargeted ad would be a topic that keeps popping up - i.e., you have a website about War honor medals and adverts for the PC game "Medal of Honor" keep popping up. That sort of thing would be fairly easy to build up a wordlist for, and each time the ad arrives, ban it.

Unwanted ads would be, say, adverts for "adult lubricants" on a recipe site about jelly. Again, build up the word list, and ban.

The difference between the two cats would be more the type of ban imposed. (Remember that we have a limit of 200 urls).

On untargeted ads, we can keep track of which pages are attracting more of them, and reoptimise the page. We can then drop the ban after a couple of weeks in the hope that the reoptimisation worked (but keep a flag up in our db, so that if the ads start coming back we ban at once without any intervention). We could also opt to "keep an eye on them" rather than just ban them - so if the ad appeared more than, say, 10 times over the course of a week, we ban it.

The unwanted ads would be banned at once and the URL added to our list for good.
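
The two-category rule described above could be sketched as a single decision function. Everything here is an assumption made for illustration: the function name `ban_decision`, the category labels, the return values, and the default threshold of 10 sightings per week taken from the example figure above.

```php
<?php
// Decide what to do with an ad under the two-category scheme.
// $category: 'unwanted' or 'untargeted' (labels assumed for this sketch).
// $sightingsThisWeek: how many times the ad was logged in the last 7 days.
// $flagged: true if the ad was banned before and the ban was later dropped.
function ban_decision(string $category, int $sightingsThisWeek, bool $flagged, int $threshold = 10): string
{
    if ($category === 'unwanted') {
        return 'ban_permanent';   // banned at once, kept on the list for good
    }
    if ($flagged) {
        return 'ban_temporary';   // it came back after reoptimisation: ban without intervention
    }
    // otherwise keep an eye on it until it shows up too often
    return $sightingsThisWeek > $threshold ? 'ban_temporary' : 'watch';
}

echo ban_decision('unwanted', 1, false), "\n";
echo ban_decision('untargeted', 4, false), "\n";
echo ban_decision('untargeted', 11, false), "\n";
echo ban_decision('untargeted', 1, true), "\n";
```

Temporary bans are the ones that can be rotated out when the 200-url limit gets tight, which is why the sketch keeps them separate from permanent ones.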

What you do need is a manual interface, that displays the ad text, the url, the page it appeared on, and the frequency. Gives us the option to build up the system :) The KW lists would need to be built up over the course of several weeks, but that's a question of starting with the obvious ones for our site, and then adding to them from our experience as time goes by.

Any thoughts?

amznVibe

5:39 am on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My code is not derived directly from their preview/toolbar and I am not selling it ;)
Both clauses covered. Thanks for helping us move forward.

Part of the reason I post code here is because it is wide open to authoritative Google employees - this means instant review. Don't think they haven't seen it yet and wouldn't say anything if they had a problem with it. Consider carefully that 1. I didn't have to post it anywhere 2. I could have posted it far-far away from Google's eyes. This isn't some dark manipulation of Adsense, this is a positive tool to help publishers until Google decides to improve filters and preview ability.

Trust me on this, the day you have a raunchy sex ad showing on your sites for six hours [webmasterworld.com] because you were away, you are going to want to be using this code to protect yourself. Anything showing on your site is your liability.

Moving it to mysql is my next step for much more power in sorting and filtering.
Easy enough, I've just been too busy to hash out the code. Will get to it this week.

MikeNoLastName

6:39 pm on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How about a third cat for traffic stealing, made for adsense, scraper sites bidding under 10 cents :-) Please e me when it can detect and filter these automatically. :)