Seeing hits to ads.txt from non-Google IPs

started about a week ago

SumGuy

1:30 am on Jul 5, 2019 (gmt 0)

I can (or will) post more details about the IPs and user-agents involved, but all of a sudden I'm seeing hits to ads.txt from non-Google IPs. Don't know what's up with that...

(if this should be moved to bots forum, go ahead...)

tangor

6:03 am on Jul 5, 2019 (gmt 0)

You should see requests from a wide variety of sources ... last time I looked it was over 20 (non-g) on one of my sites.

All get 404s since I do not have any IAB style (third party) advertising active.

Ads.txt is for advertisers ... not specifically g itself.

lucy24

6:23 am on Jul 5, 2019 (gmt 0)

Oh, heck, everyone asks for ads.txt, even if they have no earthly reason to believe such a file exists on your site.

:: detour for quick check ::

Yup. Out of the truly ridiculous number of requests I get, no more than 1/8 are from Google. The rest come from what can loosely be described as The Usual Suspects, including but not limited to 34/52/54 and the like. Interestingly, every single request except Google gets a 403, meaning that they have all caused offense in some way.

User-agents range from hilarious, like
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50

to improbable, like
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0

or, most popular of all,
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36

In fact a remarkable number (about 3/4 of the total) claim to be some recent Mac.

And finally a tiny handful profess to be
adstxt.com/1.2
which I tend to doubt, since no two have even similar IPs.

I suppose someone hereabouts can explain what information they hope to glean. (Like asking for robots.txt to learn the names of your roboted-out directories, I guess.)

NickMNS

1:34 pm on Jul 5, 2019 (gmt 0)

Ads.txt is not a Google thing. It was implemented to allow programmatic ad-buyers to verify that the ad-networks selling the ad space in fact had the right to show ads where they claimed, thus it is normal and expected that requests for ads.txt would come from a variety of sources.
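
For anyone who has not looked inside one of these files, here is a rough sketch of how a crawler might read it. It assumes the comma-separated record layout from the IAB Tech Lab spec (ad system domain, seller account ID, DIRECT or RESELLER, optional certification authority ID). The function name, the record type and the sample entries are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    ad_system_domain: str
    seller_account_id: str
    relationship: str                  # "DIRECT" or "RESELLER"
    cert_authority_id: Optional[str]   # fourth field is optional


def parse_ads_txt(text: str) -> List[AdsTxtRecord]:
    """Parse the data records of an ads.txt file (hypothetical helper).

    Comments (anything after '#'), blank lines, and lines with fewer than
    three comma-separated fields (e.g. 'contact=...') are skipped.
    """
    records = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            continue
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue  # not a well-formed data record
        cert_id = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(
            AdsTxtRecord(fields[0].lower(), fields[1], relationship, cert_id)
        )
    return records


# Made-up sample content, not taken from any real site.
SAMPLE = """\
# example entries
adnetwork.example, pub-0000000000000000, DIRECT, abc123
reseller.example, 12345, reseller
contact=ads@example.com
"""

records = parse_ads_txt(SAMPLE)
for record in records:
    print(record)
```

A buyer's crawler would compare records like these against the seller account it is being offered inventory through, which is why the requests come from so many different companies.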

SumGuy

1:51 pm on Jul 5, 2019 (gmt 0)

Hmmm. All I've ever seen (since I started seeing ads.txt requests maybe 2 years ago?) was from google. But over the past 1.5 years I've moved and expanded my IP blocking list to my router, so for anything blocked there I have no way of knowing what it was trying to get. And I also don't have ads.txt so anyone asking would get 404. That's why I thought it was a google thing.

Kendo

1:52 pm on Jul 5, 2019 (gmt 0)

They could be probing for a particular plugin's existence that may be exploitable.

About a month ago I set up live stats. Instead of looking at an overview created from the site logs, I can now refresh a page and see the latest hits... all of them. From that I deduced that the 1400 unique visitors each day consisted mostly of malevolents looking for a means to exploit the web site and/or server.

It was pretty obvious at first... hundreds of hits on PHP pages that did not exist on a Windows server. Hundreds of hits looking for particular WordPress plugins to exploit. Untold hits containing SQL commands and so forth.

So I started blocking the repeat offenders and even whole network blocks. Along the way I compiled a list of IP addresses for the popular search engines and some code that alerted me to new ones. Conclusion = there is a lot of hacking software out there claiming to be a searchbot.

Since I started, the 1400 unique visitors per day has been reduced to 800. I still see a lot of mischief from individual IPs that are randomly assigned. However I did manage to kill off the traffic from SEO spiders... the ones that study your site and then sell the info to your competitors!
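
One common way to tell a real Googlebot from software merely claiming to be one is a reverse DNS lookup followed by a forward lookup that must return the same IP, which is the check Google documents for this purpose. A minimal sketch, assuming the hostname suffixes googlebot.com and google.com and IPv4 only. The function name and the injectable resolver parameters are hypothetical, added so the logic can be tried without network access.

```python
import socket
from typing import Callable, List

# Assumed list of suffixes for genuine Google crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # A lookup: hostname -> list of IPv4 addresses
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if reverse and forward DNS agree on a Google host."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must map back to the IP we started from,
        # otherwise anyone could publish a fake PTR record.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The user-agent string plays no part in the check, since that is the one thing a fake searchbot controls completely.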

NickMNS

1:59 pm on Jul 5, 2019 (gmt 0)

Here is a detailed explanation of Ads.txt, how it works, why use it, and who it applies to.
[iabtechlab.com...]

SumGuy

12:25 am on Jul 7, 2019 (gmt 0)

As of today, looking as far back as Jan 2015, I have a grand total of 507 requests for ads.txt. The first hit coming on 12/21/2017 from IP 198.148.27.17 (Pulsepoint Inc). The next hit happened on 1/4/2018 and was from 66.249.x.x - what I consider to be google's "googlebot" subnet. That marks the start of google's usually once-per-day request for ads.txt.

Of the 507 requests for ads.txt, all but 7 of them came from 66.249. (hence why I thought it was something specific to G). All of G's requests have the same (typical) user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The non-G hits to ads.txt have come from:

198.148.27.17 12/21/17 PulsePoint UA = Ads.txt-Crawler/1.0
54.173.25.142 1/29/18 Amazon AWS no UA
54.88.97.127 3/21/18 Amazon AWS UA = IndustryIndexBot/1.0 (+http://industryindex.com/bot/)
45.79.71.25 4/1/18 members.linode.com UA = Java/1.8.0_161
18.234.171.45 10/3/18 Amazon AWS UA = python-requests/2.4.3 CPython/2.7.9 Linux/4.9.93-41.60.amzn1.x86_64
74.128.145.x 7/1/19 Road Runner Lexington KY (HEAD only) no UA
67.226.210.4 7/3/19 Tremor Video DSP (?) UA = Dispatch/0.13.2

The reason for my relatively low (and G-focused?) requests for ads.txt must be because my site has absolutely zero cross-linking to outside domains for any reason (ie no tracking, FB, ad networks, googletag stuff, adwords, etc). Otherwise I have no idea why...
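
A tally like the one above can be produced straight from the access log. This is only a sketch: it assumes Apache-style combined log format, treats anything starting with 66.249. as Google (the same rule of thumb used in this post), and the function name and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def tally_ads_txt(lines, google_prefix="66.249."):
    """Count /ads.txt requests, split into 'google' and 'other' by IP prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue
        # Ignore any query string when comparing the path.
        if match.group("path").split("?", 1)[0] != "/ads.txt":
            continue
        is_google = match.group("host").startswith(google_prefix)
        counts["google" if is_google else "other"] += 1
    return counts


# Invented sample lines using documentation-range IPs for the non-Google hits.
SAMPLE_LINES = [
    '66.249.66.1 - - [04/Jan/2018:10:00:00 +0000] "GET /ads.txt HTTP/1.1" 404 209 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [04/Jan/2018:10:05:00 +0000] "GET /ads.txt HTTP/1.1" 404 209 "-" '
    '"Ads.txt-Crawler/1.0"',
    '198.51.100.7 - - [04/Jan/2018:10:05:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Ads.txt-Crawler/1.0"',
]

tally = tally_ads_txt(SAMPLE_LINES)
print(dict(tally))
```

Requests dropped at the router never reach the log, so a count like this only covers what the web server actually saw.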

tangor

12:40 am on Jul 7, 2019 (gmt 0)

Chuckles. :)

Looking back at LAST MONTH on a rather obscure hobby site, 108 ads.txt requests and only three were from g.

Go figure.

YMMV ... A LOT ...

tangor

12:47 am on Jul 7, 2019 (gmt 0)

Aside: the hobby site mentioned above has NO ADS and is not commercial in any way (but is linked by many as internationally authoritative for that niche).

lucy24

2:07 am on Jul 7, 2019 (gmt 0)

> absolutely zero cross-linking to outside domains for any reason

Also zero cross-linking from outside domains? At latest count, I see around 900 requests within this calendar year, and that's on teeny weeny sites.

:: digression ::

Oh, will you look at that. On two separate dates in March for my test site, and a single (different) date for my personal site, and further-removed dates on other sites, there is:

80.248.227.abc - - [13/Mar/2019:01:52:01 -0700] "GET /robots.txt HTTP/1.1" 200 309 "-" "CipaCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/example.com)" 
80.248.227.abc - - [13/Mar/2019:01:52:04 -0700] "GET /humans.txt HTTP/1.1" 403 929 "-" "CipaCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/example.com)" 
80.248.227.abc - - [13/Mar/2019:01:52:07 -0700] "GET /ads.txt HTTP/1.1" 403 929 "-" "CipaCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/example.com)" 
80.248.227.abc - - [13/Mar/2019:01:52:11 -0700] "GET / HTTP/1.1" 403 929 "-" "CipaCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/example.com)"

Wasn't I only just talking about bad reasons to request robots.txt? On the test site--the only one they hit twice--they would have met a comprehensive, all-encompassing Disallow.
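
The pattern in those log lines, one client asking for ads.txt alongside other well-known files, can be pulled out of a log mechanically. A rough sketch, assuming combined log format as in the excerpt; the function name is made up, and the sample is the four lines quoted above.

```python
import re
from collections import defaultdict

# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def companions_of_ads_txt(lines):
    """Map (host, agent) to the ordered paths it requested,
    keeping only clients that asked for /ads.txt at some point."""
    paths = defaultdict(list)
    for line in lines:
        match = LOG_RE.match(line)
        if match:
            client = (match.group("host"), match.group("agent"))
            paths[client].append(match.group("path"))
    return {client: p for client, p in paths.items() if "/ads.txt" in p}


AGENT = "CipaCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/example.com)"
SAMPLE_LINES = [
    f'80.248.227.abc - - [13/Mar/2019:01:52:01 -0700] "GET /robots.txt HTTP/1.1" 200 309 "-" "{AGENT}"',
    f'80.248.227.abc - - [13/Mar/2019:01:52:04 -0700] "GET /humans.txt HTTP/1.1" 403 929 "-" "{AGENT}"',
    f'80.248.227.abc - - [13/Mar/2019:01:52:07 -0700] "GET /ads.txt HTTP/1.1" 403 929 "-" "{AGENT}"',
    f'80.248.227.abc - - [13/Mar/2019:01:52:11 -0700] "GET / HTTP/1.1" 403 929 "-" "{AGENT}"',
]

result = companions_of_ads_txt(SAMPLE_LINES)
for (host, agent), requested in result.items():
    print(host, requested)
```

Grouping by IP plus user-agent is a crude stand-in for a session; a crawler that rotates addresses between requests would not be caught this way.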

SumGuy

11:59 am on Jul 7, 2019 (gmt 0)

> Also zero cross-linking from outside domains?

There are a couple of links from wikipedia. I've just done a google advanced search for "mydomain.tld" and I see results from zoominfo, bloomberg (private company information), frasers, a few in books.google.com, google scholar and scientific journals, various corporate directories, about a dozen other companies in related fields, some university lab and personal websites. Our company's domain has had an active website going back to about 1998. Fecebook auto-generated what is I guess a place-holder or dummy page for us a few years ago - we don't use it.

I've probably blocked domaincrawler.com (probably the entire AS network that hosts it) but I don't know why you brought up domaincrawler in connection with my observations related to ads.txt.

lucy24

5:54 pm on Jul 7, 2019 (gmt 0)

> I see results from

Wow, that sounds like a solid collection.

> I don't know why you brought up

Because I was searching for requests for ads.txt and this was one of the very few with a name as opposed to a made-up humanoid UA. This led to noting that they were one of the very few that requested ads.txt as part of a set of requests for other stuff. If nothing else, it suggests that ads.txt is becoming a standard file that robots expect to find. (But humans.txt? Seriously? That one never did become a standard.)

JS_Harris

4:37 am on Aug 18, 2019 (gmt 0)

Standards? Webmasters are placing beacons on their own sites now for fear of being denied revenue. It seems webmasters no longer set the standards.

tangor

10:31 am on Aug 19, 2019 (gmt 0)

Standards? We don't need no steeking standards!

For the humor challenged, that is humor ...

The ads.txt hits are mounting. I see everything lucy24 described, plus a growing number of apparent "humanbot" requests from IPs across the grid. Two of those (Chinese origin) also tried to penetrate my non-existent "wp" install via some 800+ attempts.

Could ads.txt be the next intrusion point? Or just more NOISE on the web?