Host Name: c122-108-33-nnn.eburwd9.vic.optusnet.com.au
IP Address: 122.108.33.nnn
Browser: Safari 4.0.3
Javascript: Enabled
Multiple visits, about 5-12 times per day, spread out evenly. According to Statcounter it only hits example.com and no other pages.
This doesn't seem like normal traffic, and it has been visiting for about 2 weeks, mayb
Just relaunched my free-hosted site on a new domain Dec 1st and am paranoid about scrapers... I read somewhere that those sites offering free XML sitemap services feed your site info to scrapers... of course I read that an hour after I used one to generate a sitemap.
[edited by: incrediBILL at 6:08 am (utc) on Jan. 10, 2010]
[edited by: tedster at 6:45 pm (utc) on Jan. 10, 2010]
[edit reason] Obscured IPs [/edit]
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_4_11; en) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9
Mozilla/5.0 (iPod; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20
Safari5531.21.10 CFNetwork/438.14 Darwin/9.8.0 (i386) (MacBookPro4%2C1)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0
-----
2.) Whatever/whoever it is, if it's only hitting the same page over and over, with or without graphics, it could just be a bookmark filed in an auto-open-on-launch tab/folder. Alternatively, the real Safari has a "Top Sites" feature that checks frequently-visited pages automatically. Both scenarios could result in frequent, intermittent, and innocent hits round the clock.
Either way, or if the 'visitor' is up to no good, or if the hits are HEAD and not GET, the hit rate sounds at least semi-automated and/or unnecessary and they're wasting your resources/bandwidth.
So if you'd rather err on the cautious side, you can rewrite* their address to a custom error page containing instructions to touch base with you if need be. Then include your e-mail address, but only as a graphic.
Note: If you don't want to make a graphic, or if you use mod_rewrite and don't want to mess with allowing another file, you can use reCAPTCHA.net's free "Mailhide" and include a link on your error page via a snippet of code. Works great.
-----
3.) I allow sitemap.xml to be accessed only by msnbot and googlebot, and only when they come from .search.msn.com or .googlebot.com. If any other UA tries to access that file, it's denied and I block it. I'm from the more-restrictive camp when it comes to bots and see no need to make it easier to scrape or save via sitemap.xml.
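
For anyone wanting to see the logic of that sitemap.xml rule outside of .htaccess, here is a minimal Python sketch. It assumes the UA tokens are "googlebot" and "msnbot", and it adds a forward lookup to confirm the host name really maps back to the same IP, which the post above does not spell out. The function name `may_fetch_sitemap` and its lookup parameters are hypothetical, made up for this illustration.

```python
import socket

# Host-name suffixes named in the post; the lowercase UA tokens are an assumption.
ALLOWED = {
    "googlebot": ".googlebot.com",
    "msnbot": ".search.msn.com",
}


def may_fetch_sitemap(user_agent, ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if the UA names an allowed bot, the IP's reverse DNS
    name ends with that bot's domain, and the name resolves back to the IP."""
    # Default to real DNS lookups; tests can pass in fake ones.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    ua = user_agent.lower()
    for token, suffix in ALLOWED.items():
        if token not in ua:
            continue
        try:
            host = reverse_lookup(ip).lower()
            if not host.endswith(suffix):
                return False  # claims to be the bot but comes from elsewhere
            return ip in forward_lookup(host)
        except OSError:
            return False  # lookup failed, so treat as unverified
    return False  # any other UA is denied
```

Checking the host name as well as the UA matters because the UA string alone is trivially faked. The two lookup parameters exist only so the check can be tried without touching real DNS.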
-----
*Caveat: Some of the above require .htaccess (and mod_rewrite). Depending on the ISP, free accounts typically have very limited ability to control access and may not offer those features, sorry.
RE "if the hits are HEAD and not GET" What does that mean?
Still in the process of getting as many backlinks updated from my geocities addresses as I can, but once that's done I will drop yahoo and move to hosting with htaccess. SOL for the time being for bot exclusion. I just need to implement anti-scrape strategies for any new pages... Any help in this department without access to htaccess would be GREATLY appreciated.
My current pages are doing OK overall; about a dozen used to be top ten, and a few were number 1. Some are now back in the top 20 in Google. I've had to do about 20 DMCA takedowns in the past month. A lot of damn work protecting my content.
For the sake of reference here are some of the suspect visits:
6th January 2010
01:50:41 AM No referrer example.com/
03:45:02 AM No referrer example.com/
06:26:32 AM No referrer example.com/
05:56:54 PM No referrer example.com/
07:51:18 PM No referrer example.com/
10:12:41 PM No referrer example.com/
11:16:19 PM No referrer example.com/
7th January 2010
01:12:52 AM No referrer example.com/
06:32:08 PM No referrer example.com/
07:39:00 PM No referrer example.com/
08:58:29 PM No referrer example.com/
8th January 2010
01:29:53 AM No referrer example.com/
04:23:59 AM No referrer example.com/
05:44:16 AM No referrer example.com/
06:49:56 AM No referrer example.com/
04:43:37 PM No referrer example.com/
05:49:01 PM No referrer example.com/
07:00:15 PM No referrer example.com/
08:03:59 PM No referrer example.com/
09:49:45 PM No referrer example.com/
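
One rough way to judge whether hits like these are on a timer is to look at the gaps between them. The sketch below is only an illustration: it uses the four hits listed for 7th January, and the helper name `gaps_in_minutes` is made up.

```python
from datetime import datetime

# Hits copied from the 7th January 2010 listing above.
hits = ["01:12:52 AM", "06:32:08 PM", "07:39:00 PM", "08:58:29 PM"]


def gaps_in_minutes(times):
    """Return the gap, in whole minutes, between consecutive hits on one day."""
    parsed = [datetime.strptime(t, "%I:%M:%S %p") for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


print(gaps_in_minutes(hits))  # [1039, 66, 79]
```

Perfectly regular gaps would point to a scheduled job. Uneven gaps like these don't prove anything either way, but they are at least consistent with a browser being opened now and then.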
In a nutshell, GET is the typical browsing person (but can be a bot, good and bad). HEAD is automated, either by robot, add-on, etc. Thing is, with your current hosting set-up, unless you can view your access logs and/or run scripts to retrieve specific info, you're not going to be able to see if hits are HEAD or GET, etc.
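
Once you do have raw access logs, telling HEAD from GET is a matter of reading the request method in each line. A minimal sketch, assuming Apache's "combined" log format; the sample lines, IPs, and the "ExampleChecker" UA are invented for illustration and are not from this thread.

```python
import re
from collections import Counter

# Hypothetical access-log lines (combined format); all values are invented.
SAMPLE_LOG = '''\
192.0.2.10 - - [06/Jan/2010:01:50:41 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"
192.0.2.10 - - [06/Jan/2010:03:45:02 +0000] "HEAD / HTTP/1.1" 200 - "-" "ExampleChecker/1.0"
198.51.100.7 - - [06/Jan/2010:06:26:32 +0000] "GET /page.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
'''

# Captures: client IP, request method, requested path.
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+) [^"]*"')


def methods_by_ip(log_text):
    """Count request methods (GET, HEAD, ...) per client IP."""
    counts = {}
    for line in log_text.splitlines():
        m = REQUEST.match(line)
        if m:
            ip, method, _path = m.groups()
            counts.setdefault(ip, Counter())[method] += 1
    return counts
```

Calling `methods_by_ip(SAMPLE_LOG)` gives one counter per IP, so an address that only ever sends HEAD requests stands out quickly. Stats services that rely on JavaScript tracking, as opposed to server logs, generally can't show this.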
2.) Good bots respect robots.txt, also robots-related meta tags in HTML. Bad bots don't. Absent .htaccess or password-only access -- and a lot of learning on your part:) -- I honestly don't know how you stop anything/anyone intent on doing whatever they want to on your site, sorry. That said, I'm familiar with Linux. If your site is on a Windows box, perhaps someone else will chime in.
3.) Twenty DMCAs in a month is a lot. A LOT. And a LOT of time and work. Thing is, putting your site on a free server without any safeguards is akin to telling the world, "L@@K! Free Stuff! HELP YOURSELF" If you really want to protect things properly, I suggest making paid-hosting arrangements tomorrow, after looking into ISPs where you will have .htaccess capabilities, ditto access and error logs, and ideally, mod_rewrite and FTP. (I require terminal access and Perl but those aren't musts for most non-geeks.)
4.) Learning the particulars of everything from robots.txt to even basic .htaccess takes time and testing, trial and error. (Heck, mastering mod_rewrite is a master class in itself!) And the paid-for hosting world, with your own site residing on its own domain, is a far, faaaar cry from GeoCities' servers.
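
On the robots.txt point in 2.) above, it may help to see what "respecting robots.txt" actually amounts to: a well-behaved bot checks the file before each fetch. This sketch uses Python's standard urllib.robotparser; the robots.txt content, the "BadBot" name, and the `allowed` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths and the "BadBot" name are invented.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("Googlebot", "http://example.com/"))  # True
print(allowed("BadBot", "http://example.com/"))     # False
```

Note that the check runs on the bot's side and is entirely voluntary. A scraper that never asks simply fetches the page anyway, which is why robots.txt alone won't stop bad bots.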
When you're ready and able to implement basic safeguards:
-> Check out the various forums here and read, read, read. And then read some more:) Start with each specific Forum's Library (linked atop each post) because many of the docs include basic DIY info. Or...
-> Hire someone who geeks for a living and can build the basics for you until you get up to speed. Or...
-> Ask your new, paid-for ISP (smiles) if they can start some of the basics for you, and point you to their usage docs.
-> Stick with a major brand host (Yahoo, etc.) where the trade-off is more convenience for less control. Because even if you're a roll-your-own type, there's more to life than geeking. Really:)
Good luck!
And those 20 DMCAs were not all recent "thefts"... I just hadn't searched for infringers in the past several years because the copied pages were on top in the engines and I was neglecting the site. 15 were successful takedowns.
So, lots of stuff to add to list of things to learn. I better come up with a business plan to turn this hobby site into something worthwhile!