Forum Moderators: phranque

Possible Site Scraping

Ongoing for a few days now

azlinda

10:06 pm on May 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When I log in to Google Analytics and watch the real time map, there are two people from Roseville, CA who are sitting on the site for very long periods of time. This has gone on for at least five days. I strongly suspect that they are copying my site. Is there any way to find out what the IP(s) for these two could be?

lucy24

10:28 pm on May 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Five DAYS, you say? Why are you futzing around with GA instead of heading straight for the access logs?

If I remember rightly, GA obfuscates the last segment of each visitor's IP address. So for the time being, block the whole /24 that the visible part identifies (or whatever the equivalent range would be if they're coming from IPv6). On a small site there shouldn't be much collateral damage.
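For anyone who wants to see what "the /24" means in practice, here is a minimal Python sketch using the standard `ipaddress` module. The addresses are documentation-range placeholders, not real visitors, and the helper names `covering_network` and `in_network` are made up for illustration. The /48 default for IPv6 is an assumption based on GA's anonymization reportedly zeroing the last 80 bits; adjust it if your data says otherwise.

```python
import ipaddress

def covering_network(ip_text, v4_prefix=24, v6_prefix=48):
    """Return the network that encloses ip_text at the chosen prefix length.

    The /48 default for IPv6 is an assumption, not something GA guarantees.
    """
    ip = ipaddress.ip_address(ip_text)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False accepts a host address and masks off the host bits
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

def in_network(ip_text, network):
    """True if ip_text falls inside network."""
    return ipaddress.ip_address(ip_text) in network

# Placeholder address from the documentation range (hypothetical visitor)
net = covering_network("203.0.113.57")
print(net)                               # 203.0.113.0/24
print(in_network("203.0.113.200", net))  # True
print(in_network("203.0.114.1", net))    # False
```

The printed network is the range you would put in your block rule, in whatever syntax your server or firewall expects.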

This is all assuming they really are scraping--which honestly should not take that long--rather than that you've got a couple of devoted fans who have just discovered your site, and really like it. It should be easy to tell by looking at what links they follow, and how long they spend on each page.
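One rough way to check the pace of each visitor is to group access-log hits by IP and look at the gap between consecutive requests. The sketch below assumes the common Apache/Nginx "combined" log format; the sample lines, the `STATIC_EXT` list and the function names are all invented for illustration. A median gap of a second or so across many pages looks like a script, while gaps of minutes look more like someone reading.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median

# Assumed "combined" log format; adjust the pattern if your server logs differently
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+'
)
# Hypothetical list of extensions treated as supporting files, not page views
STATIC_EXT = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".svg")

def parse_line(line):
    """Return (ip, timestamp, path) or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), ts, m.group("path")

def pace_by_ip(lines):
    """Map each IP to its page-request count and median seconds between requests."""
    hits = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ip, ts, path = parsed
        if not path.lower().endswith(STATIC_EXT):
            hits[ip].append(ts)
    report = {}
    for ip, times in hits.items():
        times.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        report[ip] = {
            "requests": len(times),
            "median_gap_s": median(gaps) if gaps else None,
        }
    return report

# Invented sample lines using documentation-range IPs
sample = [
    '198.51.100.7 - - [03/May/2021:21:00:00 +0000] "GET /page-1 HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [03/May/2021:21:00:01 +0000] "GET /page-2 HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [03/May/2021:21:00:02 +0000] "GET /page-3 HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [03/May/2021:21:00:00 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [03/May/2021:21:00:00 +0000] "GET /style.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [03/May/2021:21:03:30 +0000] "GET /contact HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
report = pace_by_ip(sample)
for ip, stats in report.items():
    print(ip, stats)
```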

azlinda

11:44 pm on May 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have been in the raw access logs and can't find anything. I guess I don't know what I'm looking for. :) Mine is not a small site. It's 42,000 pages.

lucy24

12:26 am on May 4, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If your access logs don't show human visits corresponding to the times and pages you see in GA, then they're just hitting your analytics and can probably be ignored. (I stress: Probably. If they haven't been to your site, how do they know the URLs?) This, too, should be easy to check: If you block the visible part of the IP, and the hits keep showing up in analytics, then you know they're not really on your site.

Although analytics requests without accompanying human visits are especially associated with unwanted events such as referer spam, they can also occur legitimately, as when a human visits a page and then comes back a few hours later. Their browser won't re-request all files, but analytics will show a fresh request.

NickMNS

1:19 am on May 4, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Mine is not a small site. It's 42,000 pages."

That's small; it could all be scraped in a matter of minutes. That is roughly the number of pages Google crawls on my site on a daily basis.
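As a back-of-the-envelope check on "a matter of minutes", here is the arithmetic in Python. The request rates are assumptions picked for illustration, not measurements of any particular scraper or server.

```python
def scrape_minutes(pages, pages_per_second):
    """Minutes needed to fetch `pages` at a steady rate."""
    return pages / pages_per_second / 60

# Hypothetical sustained fetch rates
for rate in (10, 50, 100):
    print(f"{rate} pages/s -> {scrape_minutes(42_000, rate):.0f} minutes")
# 10 pages/s -> 70 minutes
# 50 pages/s -> 14 minutes
# 100 pages/s -> 7 minutes
```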

"I guess I don't know what I'm looking for"

tl;dr: The trick is to find the time, to the minute, when a specific page was viewed by the "bot/user", then find that specific page view in your logs to get the IP. With the IP you can see everything else that "bot/user" did.

In more detail:
1. In analytics, go to the "behavior" report "site-content / all pages", add a secondary dimension for "city", and filter for the city in question, "Roseville".
2. If there are several pages, start with the one that has the fewest page views. Click the link in that row; it should send you to a new view with only that page.
3. Add the city again as the secondary dimension. Compare the Roseville row with the other rows to find a pattern or signature that is unique to it, for example "number of page views" + "avg time on page".
4. Change the secondary dimension to "time" "hour". Note the hour for the row that matches the signature.
5. Repeat, but change the time dimension to "minute". Find the same row and note the minute.

Now you have a one-minute interval of log entries to check. Depending on the volume of traffic, that will likely be only a few dozen entries.

Go to your logs and find the entries for that one-minute interval. Remember to check the time zone. For example, my GA reports in Eastern time, but I have my logs set to UTC, so I need to make the adjustment. The log entry should be easy to spot; find it, and record the IP and the UA. You can then search your log for that IP and get all the entries. If you don't find a match, go back to GA and start the process over with another page and another one-minute interval. If after several attempts you still come up empty-handed, then it is likely referrer spam or something else as described by @lucy24, above. And yes, it is a real PITA.
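The log side of this can be scripted. The following is only a sketch: it assumes the Apache/Nginx "combined" log format, and the sample lines, page path, user agent and function names are all invented. The GA minute is given as a timezone-aware datetime (here a fixed UTC-4 offset standing in for Eastern daylight time; `zoneinfo.ZoneInfo("America/New_York")` would handle daylight saving for you on Python 3.9+), so the comparison against UTC log timestamps does the adjustment automatically.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed "combined" log format: ip, timestamp, request, status, size, referer, UA
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict with ip, ts (tz-aware), path and ua, or None if no match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return {"ip": m.group("ip"), "ts": ts, "path": m.group("path"), "ua": m.group("ua")}

def hits_in_minute(lines, minute_start, path):
    """Entries for `path` within the minute beginning at tz-aware `minute_start`."""
    minute_end = minute_start + timedelta(minutes=1)
    entries = (parse_line(line) for line in lines)
    return [
        e for e in entries
        if e and e["path"] == path and minute_start <= e["ts"] < minute_end
    ]

def all_hits_from(lines, ip):
    """Every parsed entry from one IP, in log order."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and e["ip"] == ip]

# Invented sample log (UTC timestamps, documentation-range IPs)
sample = [
    '198.51.100.7 - - [03/May/2021:21:41:59 +0000] "GET /widgets/red HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0"',
    '198.51.100.7 - - [03/May/2021:21:42:10 +0000] "GET /widgets/blue HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0"',
    '203.0.113.9 - - [03/May/2021:17:42:10 +0000] "GET /widgets/blue HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [03/May/2021:21:43:05 +0000] "GET /widgets/green HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0"',
]

# Suppose GA showed /widgets/blue viewed at 17:42 Eastern (UTC-4 in May)
eastern_daylight = timezone(timedelta(hours=-4))
ga_minute = datetime(2021, 5, 3, 17, 42, tzinfo=eastern_daylight)

matches = hits_in_minute(sample, ga_minute, "/widgets/blue")
for m in matches:
    print(m["ip"], m["ua"])

trail = all_hits_from(sample, matches[0]["ip"])
print(len(trail), "entries for that IP")
```

Note that the 203.0.113.9 line at 17:42 UTC is the one you would wrongly pick if you forgot the time zone adjustment; the aware comparison skips it.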

azlinda

4:14 pm on Aug 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thank you, NickMNS! I appreciate your taking the time to explain all this.