Bots find new domain in days

robot/scrape activity before there's a site


dstiles

1:39 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I registered a new domain name for a new site 10 days ago. As soon as it was registered I set up DNS and prepared the server with an empty directory and added the domain to IIS. The site does not get uploaded until later today. The domain has not been published ANYWHERE except (obviously) in the registry (TUCOWS) but it's a .COM so vulnerable to scraping.

So far, in order of appearance, the logs register the following activity:

Days 1-3: nothing

Day 4:
ashburn.notadot.com from 198.186.193.* (index page) (aptly hosted by Prescient Software, Inc)

Day 5:
notadot
slurp from 66.228.166.* (2 on robots.txt then 2 on index)

Day 6:
nat-user.futuresoft.com from 208.219.207.* (2 index, half-hour apart) (MCI aka Verizon Business)
notadot
slurp as before

Day 7:
notadot (3 hits, first two an hour apart then 10 hours)

Day 8:
notadot (once)

Day 9:
s203.ofdp.irs.gov from 63.105.37.* (I'm guessing USA's IRS - I resent that, being English!)

Day 10:
googlebot from 66.249.66.* (2 hits on robots.txt then one on index, then three on robots.txt)

Note: notadot, futuresoft and irs had no UA so auto-blocked anyway.

Note: Google returns 163 results for notadot.com: a handful from domain-services sites, the rest from logs. Its web page is under construction. The domain was registered in 2002 through GoDaddy.

Note: googlebot's first group of three hits occurred within the same second, so it could not have actually read robots.txt to see whether it was blocked, even had there been one. Perhaps that's its default: no file, blunder ahead. Yes, I know that's allowed, but it seems a bit swift.

Note: futuresoft.com seems to be a security company, domain registered in 1997.
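
The auto-blocking of requests with no UA mentioned in the notes above could look roughly like the following. This is only a sketch, not the poster's actual IIS setup: the function name has_no_user_agent and the plain-dict representation of request headers are assumptions made for illustration.

```python
def has_no_user_agent(headers):
    """Return True when a request has a missing or blank User-Agent header.

    `headers` is assumed to be a plain dict of header name -> value
    (hypothetical shape; a real server exposes headers in its own way).
    """
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "user-agent":
            return value.strip() == ""
    # No User-Agent header at all.
    return True


# A request like those from notadot, futuresoft and irs in the log above
# would be refused; a request that identifies itself would pass this check.
print(has_no_user_agent({"Host": "example.com"}))
print(has_no_user_agent({"User-Agent": "Googlebot/2.1"}))
```

A check like this only filters clients that send nothing; a scraper that fakes a browser UA would still get through.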

leadegroot

10:33 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, yeah - I always try to have at least a semblance of the site in place before I register the domain. (Not always possible, of course).
Google quite often hits the new registration within minutes, and I like to feed them something on topic from the first feed :)

Lord Majestic

10:41 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All registered domains appear in the TLD's zone file for DNS purposes, so search engines use that to discover new domains quickly. Not every TLD makes its zone file easily accessible, but the .COM zone is easy to obtain.
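
As a rough illustration of how new domains can be spotted from zone data, one could compare two snapshots of a zone file and list the names that appear only in the newer one. The sketch below assumes a simplified format (one record per line, owner name in the first field, ";" starting a comment); the function names are hypothetical and real zone files need a proper parser.

```python
def registered_names(zone_lines):
    """Collect owner names from simplified zone-file lines.

    Assumption: each record line starts with the domain label,
    e.g. "EXAMPLE NS NS1.EXAMPLE.NET." and ";" begins a comment line.
    """
    names = set()
    for line in zone_lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        owner = line.split()[0]
        names.add(owner.rstrip(".").lower())
    return names


def newly_listed(old_lines, new_lines):
    """Return names present in the newer snapshot but not the older one."""
    return sorted(registered_names(new_lines) - registered_names(old_lines))


yesterday = ["EXAMPLE NS NS1.EXAMPLE.NET."]
today = yesterday + ["; added overnight", "NEWSITE NS NS1.HOST.NET."]
print(newly_listed(yesterday, today))
```

Anything that turns up in the difference is a candidate for a first crawl, which would explain visits before a site is ever published.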

dstiles

7:00 pm on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since a lot of domains never have web sites that's a bit presumptuous of google. And yahoo. And definitely notadot and irs!

And it probably leads to innocents having multiple domains, obtained for name protection etc, listed in google for a single site - a lot of people can't do anything about redirects even if they realise it's a problem. How many web sites actually HAVE a robots.txt to ban bots with?

What if I build a site and don't want to tell google? Ok, I can disallow it in robots.txt but I shouldn't have to put up with ANY scanning, even of that, from them if I choose not to tell them.

Sorry, I know the arguments pro/con. Just ranting! :(
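
For reference, the robots.txt disallow mentioned above can be checked with Python's standard urllib.robotparser. This is a minimal sketch: the two-line robots.txt and the example.com URL are assumptions for illustration, and it only shows what a compliant bot would conclude, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out Googlebot and nobody else.
ROBOTS_TXT = [
    "User-agent: Googlebot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

# A compliant Googlebot should stay away from every URL on the site.
googlebot_allowed = parser.can_fetch("Googlebot", "http://example.com/")

# A bot with no matching rule is allowed by default.
otherbot_allowed = parser.can_fetch("OtherBot", "http://example.com/")

print(googlebot_allowed, otherbot_allowed)
```

Note that the bot still has to request robots.txt itself to learn this, which is exactly the scanning being complained about.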

dstiles

7:28 pm on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also, just found amongst some 301'd domains registered at the same time:

fbsd.integratedds.net from 63.147.185.*

It POSTed (its only hit) to the URL http://example.com/_vti_bin/_vti_aut/fp30reg.dll which, according to Google, is an exploitable FrontPage DLL. No help to them, since both vti and fp are blocked.
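
A path-based block of this kind might be sketched as follows. The exact rules on the poster's server are not given, so the fragment list and the name is_blocked_path are assumptions chosen to match the request described above.

```python
# Assumed fragments; the poster only says "vti and fp are blocked".
BLOCKED_FRAGMENTS = ("/_vti_", "fp30reg.dll")


def is_blocked_path(path):
    """Return True when the request path contains a blocked fragment."""
    lowered = path.lower()
    return any(fragment in lowered for fragment in BLOCKED_FRAGMENTS)


# The probe from the log would be refused; an ordinary page would not.
print(is_blocked_path("/_vti_bin/_vti_aut/fp30reg.dll"))
print(is_blocked_path("/index.html"))
```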

incrediBILL

9:05 pm on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"I shouldn't have to put up with ANY scanning, even of that, from them if I choose not to tell them"

That's a misconception because any time you add anything to DNS you've literally told the online world about it.

Some parts of the online world immediately come to see what's up, such as search engines and even port scanning hackers!

The only way to completely avoid alerting the search engines is to have no domain whatsoever and use just a raw IP address, preferably on a port other than 80, to keep it from being scanned by IP crawlers.

[edited by: incrediBILL at 9:07 pm (utc) on Oct. 19, 2008]

dstiles

12:52 am on Oct 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, I know, Bill. :) I just think some of them should mind their own business, that's all. It's a bit like someone barging into your new house instead of waiting to be invited.

Still, keeps us busy, eh? :)

skrenta

3:14 pm on Oct 20, 2008 (gmt 0)

10+ Year Member



The big search engines buy the whois data from domain registrars such as Network Solutions. As soon as you register your domain, it will be in the database of domains fed to Google, Yahoo, etc., and they will try to crawl it.

It's difficult to probe domains via DNS alone; most DNS servers are configured not to answer queries unless you're looking up a specific address.

dstiles

5:19 pm on Oct 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not entirely true about DNS rejection. A lot of spammers seem to manage to get email addresses (or at least they used to; perhaps not so often nowadays). And what registry would sell its database to the likes of notadot, which seems to have no web site and no information about it? Ok, netsol, I grant you... :)