Slow Scrape

Plus images

         

blend27

6:01 pm on Oct 21, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Notice these:

The Chrome version is 48.
... Native Chrome 48, all the way up to version 78 (I think), does not send the Sec-Fetch-User, Sec-Fetch-Dest, Sec-Fetch-Mode or Sec-Fetch-Site headers, as far as I know...
Tth-Endproxy: blah blah blah
Tth-Logid: blah blah blah

HEADERS

{
  "request": {
    "headers": {
      "Accept-Language": "en-US,en;q=0.5",
      "user-agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.6555.1574 Mobile Safari/537.36",
      "Sec-Fetch-Mode": "navigate",
      "Sec-Fetch-Site": "none",
      "X-Canonical-SSL-URL": "https://www.example.com/example.extention",
      "host": "www.example.com",
      "Tth-Endproxy": "http://rd-gtest:W48W5HKHSMNDSF45COL7A@[2605:.....",
      "X-REWRITE-URL": "/example.extention",
      "XCluster-name": "",
      "Sec-Fetch-User": "?1",
      "Tth-Logid": "",
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
      "Accept-Encoding": "gzip",
      "XSite-name": "EXAMPLE.COM",
      "X-ORIGINAL-URL": "/example.extention",
      "Upgrade-Insecure-Requests": "1",
      "content-length": "0",
      "Sec-Fetch-Dest": "document"
    },
    "xtras": {
      "requestTime": "October, 21 2023 06:24:15",
      "ip": "188.221.128.65",
      "string": "bcdd8041.skybroadband.com",
      "timeTaken": "0",
      "fromSession": false
    },
    "protocol": "HTTP/1.1",
    "method": "GET",
    "content": ""
  }
}


The Big Slow goes after pages where Hi-Def images are displayed. There is no way to reach these pages without knowing where they are linked from. The only place where these image-displaying URIs are listed is in the SERPs on Google and Bing, as far as I know.

I've looked at the set of IPs the requests are originating from: 99% residential IPs, some proxy servers, but ALL have Tth-Endproxy: blah blah blah as part of the headers.

Different versions of Chrome and Safari, different OSes; JS is not executed.
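Since every one of these requests carries the same leaked header, a check on header names alone would catch them. A minimal sketch, assuming the request headers are available as a plain dict; the function name and the idea of treating `Tth-Logid` as a second marker are illustrative assumptions, not something the server does out of the box:

```python
# Header names seen in the dump above. Treating them as proxy-network
# markers is an assumption based on this one scrape.
PROXY_MARKER_HEADERS = {"tth-endproxy", "tth-logid"}


def has_proxy_marker(headers):
    """Return True if any header name is a known proxy marker (case-insensitive)."""
    return any(name.lower() in PROXY_MARKER_HEADERS for name in headers)


sample = {
    "Accept-Language": "en-US,en;q=0.5",
    "Tth-Endproxy": "http://...",  # value elided, as in the post
    "Tth-Logid": "",
}
print(has_proxy_marker(sample))                        # True
print(has_proxy_marker({"host": "www.example.com"}))   # False
```

The check looks only at header names, so it still fires when the value is empty, as `Tth-Logid` is in the dump.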

SumGuy

11:30 pm on Oct 21, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



For what it's worth, the IP is Sky Broadband in the UK, so at first glance it's a residential IP. AbusedIP and Cleantalk both have nothing on it, but Spur says it's part of a call-back proxy network.

blend27

12:51 pm on Oct 22, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is not about the IP (or Sky).

It is a stealth scrape, meaning hundreds of IPs (residential, and I know it; Tor exits; hosting ranges = the works!) are being used to get HiDef images on one of my sites.

I pulled the plug on this behavior 2 days ago. More than 6K requests logged so far, and many more in the time before I caught it.

What interests me is that the pages being requested are not from the SERPs, but proper URIs that only existed in an older sitemap file (page) that has not been published on the site for more than a year or so.

The sitemap page had a noarchive header, and a noindex one as well.

This time around, all requests contain the "Tth-Endproxy": "http://rd-gtest:W48W5HKHSMNDSF45COL7A@[2605:.....", header pair.

99% of the UAs in these requests are outdated (69 and below) versions of the Chrome browser, spread across all OS levels.

It is fun to watch.

And all the HiDef image requests are rewritten to serve a fresh image of one of my random favorite animals; would not do it any other way....

SumGuy

1:19 pm on Oct 22, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Are you seeing any forwarding headers in these requests, like X-Real-IP, X-Client-IP, True-Client-IP, X-Forwarded-For, CF-Connecting-IP, Fastly-Client-IP, X-Azure-SocketIP etc?
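For anyone wanting to check their own logs for these, a rough sketch that pulls out whichever of the forwarding headers listed above are present. It assumes headers arrive as a plain dict; the function name is made up for illustration:

```python
# The forwarding headers named in this post.
FORWARDING_HEADERS = (
    "X-Real-IP",
    "X-Client-IP",
    "True-Client-IP",
    "X-Forwarded-For",
    "CF-Connecting-IP",
    "Fastly-Client-IP",
    "X-Azure-SocketIP",
)


def forwarding_headers(headers):
    """Return the subset of headers whose names match the list, ignoring case."""
    wanted = {name.lower() for name in FORWARDING_HEADERS}
    return {name: value for name, value in headers.items() if name.lower() in wanted}


print(forwarding_headers({"x-forwarded-for": "203.0.113.7", "host": "www.example.com"}))
```

An empty result only means the proxy did not announce itself this way; it says nothing about whether a proxy was involved.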

lucy24

6:14 pm on Oct 22, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm generally very forgiving on version numbers, but Chrome/48 would get them blocked even on my site, with the possible exception of Androids.

blend27

4:06 pm on Oct 23, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@SumGuy - Are you seeing...
Nope,
"request": {
"headers": { ....some headers are added by me }
}
the rest are the ones supplied by the client.


@lucy24

I know I know...

I had a wrong regex filtering/dropping requests: [0-7] in the second position skips versions ending in 8 or 9, so Chrome/48 slipped through. Changed

from:
<!-- Chrome 77 and down are blocked -->
<add input="{HTTP_USER_AGENT}" negate="false" pattern="^.*(Chrome/[1-7][0-7]\.).*"/>

to:
<!-- Chrome 10 through 77 are blocked: [1-6][0-9] covers 10-69, 7[0-7] covers 70-77 -->
<add input="{HTTP_USER_AGENT}" negate="false" pattern="^.*(Chrome/([1-6][0-9]|7[0-7])\.).*" />
<!-- Chrome 9 and down are blocked -->
<add input="{HTTP_USER_AGENT}" negate="false" pattern="^.*(Chrome/[0-9]\.).*" />
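A digit-class pattern like the `[1-7][0-7]` in the "from" rule is easy to get wrong, so it helps to run any candidate against a few version numbers before deploying. A small sketch in Python; the `corrected` pattern is one possible reading of the "77 and down" intent, not necessarily the rule that ended up on the server, and the UA template is a made-up stand-in:

```python
import re

# The "from" rule: the second digit is limited to 0-7, so 18, 19, 28, ... 48 ... are missed.
original = re.compile(r"Chrome/[1-7][0-7]\.")

# One possible correction (assumption about the intent):
# [0-9] covers 0-9, [1-6][0-9] covers 10-69, 7[0-7] covers 70-77.
corrected = re.compile(r"Chrome/(?:[0-9]|[1-6][0-9]|7[0-7])\.")


def ua(version):
    """Build a simplified, hypothetical Chrome UA string for the given major version."""
    return f"Mozilla/5.0 (X11; Linux x86_64) Chrome/{version}.0.0.0 Safari/537.36"


for v in (9, 48, 69, 77, 78, 118):
    print(v, bool(original.search(ua(v))), bool(corrected.search(ua(v))))
```

The loop prints, per version, whether the original and the corrected pattern match; Chrome/48 is the case the original lets through.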

blend27

6:24 pm on Oct 26, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cranking up the 'Auty' Scrape Attempts, this is funny.....

If a Chrome browser UA is used AND the version is LESS than 80 (f it) AND the header structure contains any "Sec-Fetch-" headers = Boom, Captcha!
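The rule above can be sketched in a few lines. This assumes headers are available as a plain dict; `needs_captcha` and the `max_version` parameter are names invented for the illustration, and how the Captcha is actually served is left out:

```python
import re

# Major version from a "Chrome/NN." token in the User-Agent.
CHROME_VERSION = re.compile(r"Chrome/(\d+)\.")


def needs_captcha(headers, max_version=80):
    """True when a Chrome UA below max_version arrives with any Sec-Fetch-* header."""
    lowered = {name.lower(): value for name, value in headers.items()}
    match = CHROME_VERSION.search(lowered.get("user-agent", ""))
    if match is None:
        return False  # not a Chrome-style UA, so this rule does not apply
    old_chrome = int(match.group(1)) < max_version
    has_sec_fetch = any(name.startswith("sec-fetch-") for name in lowered)
    return old_chrome and has_sec_fetch


scraper = {
    "user-agent": "Mozilla/5.0 (Linux; Android 5.0) Chrome/48.0.6555.1574 Mobile Safari/537.36",
    "Sec-Fetch-Site": "none",
}
print(needs_captcha(scraper))  # True
```

With the threshold at 80, genuine Chrome builds from the first few versions that did send these headers would be challenged too, which is the trade-off the "(f it)" accepts. Other Chromium-based browsers also carry a `Chrome/` token in their UA, so they fall under the same rule.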

SumGuy

12:54 pm on Oct 27, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



------------
The Sec-Fetch-Site fetch metadata request header indicates the relationship between a request initiator's origin and the origin of the requested resource.

In other words, this header tells a server whether a request for a resource is coming from the same origin, the same site, a different site, or is a "user initiated" request. The server can then use this information to decide if the request should be allowed.

Sec-Fetch-Site: cross-site
Sec-Fetch-Site: same-origin
Sec-Fetch-Site: same-site
Sec-Fetch-Site: none
------------

Interesting. I see a chart that shows Chrome using it as of version 76. It seems that all browsers are providing this header. Does anyone here block based on Sec-Fetch?

Is the point of Sec-Fetch to tell you that some site is embedding your site's content into theirs (and you wouldn't see that in the referer)?

lucy24

5:28 pm on Oct 27, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: exploration of logged headers (only covering the past year) ::
Is the point of Sec-Fetch to tell you that some site is embedding your site's content into theirs (and you wouldn't see that in the referer)?
If a referer can be faked, why can't any other header?

Firefox sends the Sec-Fetch set too, but it looks as if Safari doesn't. That is, Safari-as-such, not the various webkit-based browsers that include “Safari” in the UA string.

Does “Sec-Fetch-User” ever have any other value than ?1 [sic] and if not, what's the point? Is it held in reserve for future developments?

It may be worth looking more closely at requests that send Sec-Fetch-Site: cross-site but no Referer (about 1/8 of “cross-site” requests).
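A rough sketch of how that subset could be pulled out of logged headers for a closer look, assuming each logged request is a plain dict of headers; the function name is made up for the example:

```python
def cross_site_without_referer(headers):
    """True when Sec-Fetch-Site is cross-site but no non-empty Referer was sent."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    is_cross_site = lowered.get("sec-fetch-site", "").lower() == "cross-site"
    has_referer = bool(lowered.get("referer"))
    return is_cross_site and not has_referer


logged = [
    {"Sec-Fetch-Site": "cross-site", "Referer": "https://www.example.com/"},
    {"Sec-Fetch-Site": "cross-site"},
    {"Sec-Fetch-Site": "none"},
]
suspects = [h for h in logged if cross_site_without_referer(h)]
print(len(suspects))  # 1
```

A linking site can legitimately suppress the Referer with a `no-referrer` policy, so a hit here is a reason to look closer rather than a verdict.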

:: wandering off to investigate, which will take time ::

blend27

10:17 am on Oct 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@SumGuy

Ha, you remember correctly! The whole point is that Chrome versions below 76 did (and do) not send Sec-Fetch- headers; install Chrome 75 and see. So that is why I call it a Slow Scrape. A bunch of less-than-76ers (no pun intended .. [google.com...] .., ;)) are trying to scrape URIs while sending headers that should not be present for those older UAs. On Chrome, that is.

---Does anyone here block based on Sec-Fetch?
Try it below 76; 99.9999% of those requests are fake...

@Lucy24

--if not, what's the point?

I want I want I want, and I am a bad bot!

Cross-site is for human browsers when content is embedded, and we have "Content-Security-Policy" for that.

Providing fake headers is a given: the UA is a header value, and Sec-Fetch is too.