

At what point are "web scrapers" crossing the line?

         

seovshate

5:49 pm on Jan 9, 2024 (gmt 0)



Hello,

Can you define the difference between "SEO" and Online Harassment / Bullying?

At what point are "web scrapers" crossing the line and should be held accountable somewhere, but where?

Let's say you have a website and post articles that nobody will find on the internet.

Now let's say you have a competitor that is scraping your website and immediately rewriting this news and sharing it to 10x as many fans as you have.

Now imagine this competitor also has 5 more websites and will keep posting on them, every hour or two, the content you worked hard to find and write about, to make sure they flood the "rankings".

Imagine this happening every single day, where this competitor is competing against your keyword searches with 5 websites scraping you.

You will try to change your posting time, but it won't matter if you post the article at 6 AM, 1 PM or even 9 PM. It will happen.

I was able to block this competitor's proxies and the hacked machines that allow them to get this content, and it has actually made things worse.

They are now like manually monitoring my pages and there is nothing I can share that isn't reposted by the same person in 1 hour. This includes writing the entire article and making images for it and sharing it..

We both have Adsense as a publisher and I am failing to understand how this is being allowed to happen.

Is there anything that can be done in this situation? Or is this now the state of the internet?

Basically, anybody who creates news or content is being harassed and bullied (or is it "SEO") to compete directly against them. Then throw the content to AI Writing and more.

All of the recent AI etc has made them create even more websites and it basically makes looking for news pointless. But if I don't post it, nobody does.

Can somebody give me direction or advice? How do you stop the "Ultimate Hater" who is obsessed with copying everything you do?

If it was every once in a while, sure. But it's every single day every hour.

I consider it similar to Company A following Company B and waiting until they leave the customer's house. Then knocking on their door and offering to do the same monthly service work for 20% cheaper than the company they currently pay. For every single customer, because it's "cheaper" to attempt to steal customers than to source and find their own customers basically.

If anybody says to "find a lawyer" can you please post the lawyer's information as well? I can't seem to find anybody who takes on these types of cases and would love any type of direction or offer from anybody.

Thank you for listening!

not2easy

5:42 pm on Jan 17, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



When you say "Most of the websites are Wordpress...." does that mean most of your sites? Or most of the sites showing your content after you've posted it?

It is really unhelpful to send so many mixed signals. If you want assistance, please drop the complaints against unknown people and platforms and describe the actual situation you wish to have assistance with. We get it, others are reusing your content. How they might be doing that and how you can prevent it takes an understanding of the situation. It does not help to hear how frequently the problem happens without any understanding of your situation.

Are you experiencing content repurposing from your website? If your website is WordPress, it has RSS feeds integrated so you would need to check your settings for RSS. With the wrong settings, others can display your RSS feed on their sites and that is not a copyright violation because you can control it.

Is your content being shown on another website? For that you can file a DMCA if you're willing to do the documentation and research to do that. This includes efforts to find what method is employed. Your access logs can provide details.
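As a rough illustration of what the access logs can show, here is a minimal Python sketch that counts requests per client IP for a section of the site. It assumes the server writes the common "combined" log format; the sample lines, the IPs, and the `/news/` path prefix are made up for the example and are not from this thread.

```python
import re
from collections import Counter

# Matches the start of a "combined" format access log line (an assumption;
# adjust the pattern if your server logs in a different format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def requests_per_ip(lines, path_prefix="/"):
    """Count requests per client IP for paths starting with path_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(path_prefix):
            counts[match.group("ip")] += 1
    return counts


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    '203.0.113.7 - - [17/Jan/2024:17:40:01 +0000] "GET /news/a HTTP/1.1" 200 512 "-" "curl/8.0"',
    '203.0.113.7 - - [17/Jan/2024:17:40:02 +0000] "GET /news/b HTTP/1.1" 200 640 "-" "curl/8.0"',
    '198.51.100.4 - - [17/Jan/2024:17:41:10 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0"',
]
top = requests_per_ip(sample, path_prefix="/news/").most_common(5)
```

The IPs at the top of that list, together with their user agents and timestamps, are the kind of documentation a DMCA filing or a block rule can be built on.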

People are trying to help but after reading the whole thread I cannot guess what is happening where.

blend27

5:58 pm on Jan 17, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- You have certainly confused me. --

Sometimes it is not2easy

running away..

But then it is clear that OP is all over the pond, reading into comments is(for me) putting it all back together, not illegal on my part.

as not2easy mentioned:

1st. Define the Platform being used on your site(s).
2nd. Go over the links provided in this thread.
3rd. Ask questions based on your understanding of the threads read.

OP obviously has a hole in the SOFTWARE FIREWALL he is using to protect his content at this point. Not the end of the world, at all.

blend27

12:25 am on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...to get a lick of it, launch a new sub-domain, don't publish any content...

look at your logs/headers,

shi.t-t-t-t-t, Jealous Marry, Monster J, local Teddy, Orange Monkey, Orange Monkey wannabe and such coming for your content, alright?

Slice., slice your traffic like it owes you money, slice.

When done, like a good man said.. Slice More!

seovshate

2:10 pm on Jan 18, 2024 (gmt 0)



"You have certainly confused me."

How? What is confusing here? What's the difference between an App or Website with the same scenario?

Basically it's a "Tiered SEO", what's happening is this same company bought 1 domain name for $800,000.

That domain name is the tier 1 website.

These 5 websites are Tier 2 and they are flooding the niche.

Then there is tier 3 and it's about another 20 websites and it's just spamming the next few days after the content by paying indians.

It appears this is "SEO" these days, because like most would say "just focus on content and this or that".

If they ever get "penalized", it seems to be a soft penalty and they wait a few months for the domain and then continue.

Because there is so many domains, it really doesn't affect much.

Anybody these days can throw up a Wordpress (a nice one too) for $200.

It's supposed to be against google's terms, even Adsense terms called Ad Arbitrage. And then there's anti-competitive laws and more.

What part is confusing? Maybe I should just bend over backwards and also send them an email with all of my content. Heck, I can throw it through an AI spinner first too and email it to them so they can copy me easier.

Sorry I'm not trying to be sarcastic.

seovshate

2:18 pm on Jan 18, 2024 (gmt 0)



sorry, i think I have trouble expressing my thoughts on this matter. It has created almost PTSD. Please re-read and help if you can.

not2easy

2:52 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



"What's the difference between an App or Website with the same scenario?"
A website is under your control normally. An App is built and shared. Free or paid, it is done. You can't control what an app does/shares with whom/what. You have complete control over what or who can access your content on a website. Those links I had posted were assuming you were talking about a website. The suggestions others have shared here also assume a website. This is the reason some clarification was requested. Just like you, we are busy people who don't care to spend time trying to help without knowing what it is you are asking for help with.

seovshate

3:00 pm on Jan 18, 2024 (gmt 0)



I think maybe you guys updated your posts, so I'm going to add more text here:

"Are you experiencing content repurposing from your website? If your website is WordPress, it has RSS feeds integrated so you would need to check your settings for RSS. With the wrong settings, others can display your RSS feed on their sites and that is not a copyright violation because you can control it."

My website is custom made in PHP. I have disabled my RSS feeds many years ago when this first started.

"Is your content being shown on another website? For that you can file a DMCA if you're willing to do the documentation and research to do that. This includes efforts to find what method is employed. Your access logs can provide details."

"It's not my content" is what they say, and they say "I'm just posting it first".

But.. how do I post it second and not have to spend hours daily looking for it? The only way I see here is to bot my competitors.. which is "ok" and grey area, AS LONG AS YOU'RE PLAYING BY THE RULES. You can't bot me 24/7 and want to take over the work I am doing.. Add value dude!

"People are trying to help but after reading the whole thread I cannot guess what is happening where."

Thank you! I truly appreciate it!

seovshate

3:01 pm on Jan 18, 2024 (gmt 0)



"But.. how do I post it second and not have to spend hours daily looking for it? " Not only "second", but minutes later...

seovshate

3:11 pm on Jan 18, 2024 (gmt 0)



"A website is under your control normally. An App is built and shared. Free or paid, it is done. You can't control what an app does/shares with whom/what. You have complete control over what or who can access your content on a website. Those links I had posted were assuming you were talking about a website. The suggestions others have shared here also assume a website. This is the reason some clarification was requested. Just like you, we are busy people who don't care to spend time trying to help without knowing what it is you are asking for help with."

Consider my website like Reddit. I write about news and I post a link to the news I am writing about..

seovshate

3:23 pm on Jan 18, 2024 (gmt 0)



"why don't you just do it back to them?"

Because, they don't post anything of their own... they are doing it to the other real webmasters in the niche...

So if I were to "do it back to them", it's really doing it to the webmaster they already stole it from... So by me going "into battle", what about the other people? It shouldn't be like this...

not2easy

4:01 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Whatever your website is like, you can control access. As mentioned, if it is WP you can control the RSS feed.

seovshate

5:24 pm on Jan 18, 2024 (gmt 0)



"Whatever your website is like, you can control access. As mentioned, if it is WP you can control the RSS feed."

I do? They are even like botting site: my domain into the search engines and if I would submit my article before posting it live the same thing would happen..

Over the year I was able to set up a system that would monitor for IPs coming to news posted like this and make a list so that I could monitor more and then ban.
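For anyone following along, that kind of monitoring could look roughly like the Python sketch below: it builds a watch list of IPs that keep showing up among the very first visitors to new articles. This is only a guess at the idea, not my actual system; the function name, the "first N visitors" rule, and the thresholds are all assumptions.

```python
from collections import defaultdict


def early_visitors(hits, first_n=3, min_articles=2):
    """Return IPs that were among the first_n visitors on at least min_articles articles.

    hits is an iterable of (timestamp, article_id, ip) tuples.
    """
    first_seen = defaultdict(list)  # article_id -> distinct IPs in arrival order
    for _, article_id, ip in sorted(hits):
        arrivals = first_seen[article_id]
        if ip not in arrivals and len(arrivals) < first_n:
            arrivals.append(ip)

    # Count on how many articles each IP was an early arrival.
    appearances = defaultdict(int)
    for arrivals in first_seen.values():
        for ip in arrivals:
            appearances[ip] += 1
    return sorted(ip for ip, n in appearances.items() if n >= min_articles)
```

Regular readers will land on this list too, so it is better treated as a list to watch than a list to ban outright.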

Now it's like their bot is still doing everything, in the code it says:

if ($domain == 'mydomain') { $donotrequest = true; }

So they still have the link and maybe even the social media text for this link and are Googling it and attempting to find it that way because if they come to the website they would get blocked.

So basically it sounds to me like you're saying that "I need to fix this" somehow... but the more I did the more it created this "hate" and why I'm posting here..

There's really no more ways for me to lock things down. I mean if somebody creates a Fake Facebook account, goes to your page and subscribes to receive alerts first and even keeps coming to your Facebook page with a mobile IP.. What do you do?

I have set my page to be logged in to see it and that improved things again but then it got incredibly worse..

So like my original posts, I guess this is the "state of the internet" and this is "perfectly legal" and how dare your competitor post news that you don't have, you should bot them as fast as possible and it's just business man. That's what you're saying?

tangor

2:17 am on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Kill the app. Go back to hosting the site yourself, on your own server or discrete host, THEN do all your protections. Your app is a leaky sieve. The FB login merely gives permission to access, which bypasses many access controls you have available at the server.

found this at Stack Overflow:

The technical difference according to two features:
1. Where the "work" is done
2. What is being transferred to/from the server

Web app
1. The "work" is done at the browser (JavaScript)
2. Data is being transferred from/to the server
In comparison: Faster

Website
1. The "work" (most of it) is done at the server
2. Rendered pages (data + UI) are being transferred from the server
In comparison: Easier SEO


Not the answer you want. Not the retribution you wish to serve. But it is a much better way to control access to your material than what you are doing right now. That said, with that kind of investment in the domain name you probably have enough capital to invest in a law firm charged with PROTECTING YOUR COPYRIGHT and seeking damages as well as issuing "cease and desist" orders.

Meanwhile, we hear you loud and clear. What we AREN'T hearing is what technical steps you have taken, the platform(s) involved, or the methods of which the information can be accessed. To take a pest out, you have to know where they are getting in and respond appropriately!

blend27

2:10 pm on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



---- My website is custom made in PHP. I have disabled my RSS feeds many years ago when this first started. ----

There We Are! This is Gonna be Great!

Now lets find something to think about!

@seovslove(<-...life at the end of thy tunnel)

Keep in mind that in any good Restaurant a Cook has almost exact/best recipes mixed up daily, but it is the Server that serves it to public...

HTTP Request comes in and things to check:

What steps are you taking to prevent your content being scraped by robotic behavior, lets star(t) at that?

Programmatically( abbreviated to a train of thought )....

Active Access control: Lets spill some spicy gravy on them beans...
Passive Access control: watchful with one eye!

GO

-------------------------------------------------------------
but before u do...

Now the Spicy, but watch and implement based on YOUR site traffic...

Active Access control. To shorten: -NA- means get the IP, and if YOUR rules are broken -> stick it in a file (map, for a while or forever) that .htaccess (or mod_whatever - ask @Lucy24) has access to, so that subsequent requests are blocked.

Ph\ack robots.txt

IP: aggregate, block split into ranges: server firewall level -> NA
IP Range based HTACCESS and or mod_whatever(again ask @Lucy24) or web.config, Server level before it gets to your HTML -> check -> NA
Proper UA: AND AND AND Newer Browsers only - HTACCESS(regexp) -> check -> NA
Proper URI: anyone looking for anything other than a URI that might be hosted on your website -> check -> NA
Proper Headers for Proper UA -> HTML -> check -> NA
Known "Good" Bot Crawler: IP Range Based & RDNS(Goog, Bing, Slurp(just effing around here;)), and only if they deserve it!
Known Hosting Ranges/Non-Human/Proxies: -> IP based SQL query look up -> check
Split the World: US based only(as u mentioned) -> IP based SQL query look up -> check
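To make that train of thought a bit more concrete, here is a small Python sketch of a few of the checks above run in order (IP range, URI, user agent, headers). It is a simplified illustration, not a drop-in firewall: the network range, the URL pattern, and the user-agent pattern are placeholders, and the reverse-DNS check for known good crawlers and the country/hosting-range lookups are left out.

```python
import ipaddress
import re

# All values below are placeholders for illustration, not recommendations.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_PATHS = re.compile(r"^/(news/[\w-]+|robots\.txt)?$")
BROWSER_UA = re.compile(r"Mozilla/5\.0 .*(Chrome|Firefox|Safari)/\d+")


def check_request(ip, path, headers):
    """Return (allowed, reason). Checks run cheapest-first, like the list above.

    headers is a plain dict, so header names are matched case-sensitively here.
    """
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return False, "ip range"
    if not ALLOWED_PATHS.match(path):
        return False, "unknown uri"
    user_agent = headers.get("User-Agent", "")
    if not BROWSER_UA.search(user_agent):
        return False, "user agent"
    # Real browsers normally send these; many simple scripts do not.
    if "Accept-Language" not in headers or "Accept" not in headers:
        return False, "missing headers"
    return True, "ok"
```

A `False` result is where the -NA- step would kick in: record the IP so the server can refuse it before any HTML is produced.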

...and now we are at: IP VS Bot vs CountryIP vs ISP vs Human.
..and HTML is what Server brings on a silver platter some(one-thing at a time) to consume.

Pizza and a Beer and Swiss Chocolate! just had some as everyone can tell already ;)

seovshate

5:02 pm on Jan 19, 2024 (gmt 0)



"Kill the app. Go back to hosting the site yourself, on your own server or discrete host, THEN do all your protections. Your app is a leaky sieve. The FB login merely gives permission to access, which bypasses many access controls you have available at the server."

My App is a website? I only allow users to Login to their account on my website with the Facebook app... So I am able to see and control all visitors to the website...

"That said, with that kind of investment in the domain name you probably have enough capital to invest in a law firm charged with PROTECTING YOUR COPYRIGHT and seeking damages as well as issuing "cease and desist" orders."


This is kind of what I am looking for.. Anything that can basically get this to stop. I am a single dude who has no family or anything and I don't understand how to do this. When I speak to a lawyer, you can see how many details there are... I don't even know where to begin.

"Meanwhile, we hear you loud and clear. What we AREN'T hearing is what technical steps you have taken, the platform(s) involved, or the methods of which the information can be accessed. To take a pest out, you have to know where they are getting in and respond appropriately!"

I am not sure, it's manually right now... Like it's slowly getting worse and worse..

On Wednesday dude was copying in 1 hour

Thursday 30 minutes

This morning I posted it and he did the cloaking trick into Google to make Google think he actually shared it 4 minutes before me.. But then I go to his page and he didn't even share it.. Because I'm starting to make this public and regular people are seeing how much of a bully he is.

seovshate

5:05 pm on Jan 19, 2024 (gmt 0)



After I post it, when he does this to my posts.. my organic reach on Facebook drops ... it starts going viral immediately because it's awesome content.. then he does his thing and my post gets ghosted.. It goes from 20 clicks in 1 minute to 0 clicks in 10 minutes.. Then he'll share it and then add it to his spam sites and keep sharing it over and over so nobody can share news

seovshate

5:13 pm on Jan 19, 2024 (gmt 0)



What steps are you taking to prevent your content being scraped by robotic behavior, lets star(t) at that?

Programmatically( abbreviated to a train of thought )....

Active Access control: Lets spill some spicy gravy on them beans...
Passive Access control: watchful with one eye!

SO many that i don't even know where to begin.. I'm basically hurting my traffic because I'm doing so many bot detections and methods to slow down what is happening. When I remove the detections etc, the hate is incredible and I'm being booted in 5 minutes flat. Don't matter if I post at 4 AM, 8 AM 12 PM 8 PM etc.

---

I'm not sure what you mean by the next post? I have created my own stats system and I am able to monitor my traffic and see everything that I need to (I believe). including all pages they have gone to, all traffic on the same isp, how fast they are going to new content and so much more.

seovshate

5:23 pm on Jan 19, 2024 (gmt 0)



The "Google Trick", I'm not even sure if it impacts SEO or what.. but basically all you do is change your Wordpress posting time to 30 minutes earlier..

It looks like Google doesn't show the time when they first discovered the article, but when the webmaster says in the meta publish time when the post was made..

I wish Google bot would do something like: if (I went to the page it says it was on and it wasn't really there but says it was) { do stuff }
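Something like that check can at least be done on my side for documentation. A minimal Python sketch, assuming I fetch the competitor's URL myself and note the last moment the article was NOT there yet; the function name and the 2 minute tolerance are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def looks_backdated(claimed_published, last_absent_check, tolerance=timedelta(minutes=2)):
    """True if a page claims a publish time earlier than a moment when
    we fetched the URL ourselves and the article was not there yet."""
    return claimed_published + tolerance < last_absent_check


# Hypothetical timestamps: we checked at 14:00 UTC and the article was absent,
# yet the page later claims it was published at 13:30 UTC.
checked = datetime(2024, 1, 19, 14, 0, tzinfo=timezone.utc)
claimed = datetime(2024, 1, 19, 13, 30, tzinfo=timezone.utc)
backdated = looks_backdated(claimed, checked)
```

This does not change what a search engine displays, it only produces a dated record that the claimed time cannot be right.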

seovshate

5:31 pm on Jan 19, 2024 (gmt 0)



And I know what you might be thinking: "Maybe they just found it.".

You would have to be seeing it happen every single day for years now...

Here's the rundown on what I did and why it's only 4 minutes before me:

I posted the article and didn't share it or add it publicly to my website. I submitted it to Google and waited about 15 or 20 minutes. I then shared it on Facebook and less than 10 minutes later this happened.

So he really lied about posting it about 40 minutes or so ahead of when he really posted it. His site has 10x more traffic than mine and I searched before posting, and after, and it was only me. Then he pops up out of nowhere.

It's like this all of the time.

seovshate

5:38 pm on Jan 19, 2024 (gmt 0)



@blend27 I re-read your post a bunch of times, IDK what you mean lol.

But it almost appears to me like there are now "profiles", like there are many profiles that have been created to blend in as regular users, making it nearly impossible to see. The best thing I can see right now is the type of usernames he made that all seem to be coming quickly now to all of the content.

Usernames like

YODAWG
Brandon Banks
NotMe

etc.

seovshate

5:42 pm on Jan 19, 2024 (gmt 0)



Hey, at least i'm not getting "ddosed" again like probably 50 times after what appears to them to be getting mad... It's like a huge fraud operation or something, but then there's the guy who spent that crazy amount of money...

Maybe he's just freaking out because on the outside "it looks sooo easy", but I assure you, it is not. That's how I got sucked in, LOL.

seovshate

5:53 pm on Jan 19, 2024 (gmt 0)



There's nothing special about me, I don't do anything great. I'm just trying to live and eat. It takes an incredible amount of work though. I probably could find a way easier job, but I've done this so long that I am scared I guess. I also know that others can be the same way who have done this a long time, but this guy is different. They messaged me to brag about their new Tesla and brand new home. Basically the better they do, the more they justify what they are doing. Like "I'm [insert keyword] king.com! and [insert keyword] queen.com! ya!" There is no reason for them to keep acting this way. Plz don't take anything I say the wrong way, I have developed extreme anxiety about this situation and know that sometimes I do not respond to things correctly regarding it. The best I can do is push hard and try to ignore this crap, but like .... ? LOL.

londrum

7:15 pm on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



if they're reposting that much, so quickly, and for so long, then it must be automated. which probably means you can block it.

i would start by looking into blocking frames (i'm still not sure whether you're talking about your site being framed, or scraped -- if it's being framed then it should be very easy to fix). if it's scraping then maybe i'd pay for cloudflare and start using all their blocking features.
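for the framing case, the fix is a pair of response headers. a minimal python sketch of building them follows; the helper name and the example origin are made up, the two header names are the standard ones browsers understand.

```python
def anti_framing_headers(allowed_origins=()):
    """Build response headers that tell browsers not to render the page in a frame.

    allowed_origins is an optional tuple of origins that may still frame the page.
    """
    if allowed_origins:
        ancestors = "'self' " + " ".join(allowed_origins)
        return {"Content-Security-Policy": f"frame-ancestors {ancestors}"}
    return {
        "Content-Security-Policy": "frame-ancestors 'none'",
        # Older header, kept for browsers that predate frame-ancestors.
        "X-Frame-Options": "DENY",
    }
```

this only stops other sites from embedding your pages in a frame. it does nothing against a scraper that downloads the html and republishes the text.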

One question… are the words they are posting identical to the words you are posting? (…ignoring any words that might be added around the article)

it's difficult not to take these attacks personally, but you're probably just up against computers who don’t know you from adam, and do this to thousands of other websites every day

seovshate

9:00 pm on Jan 19, 2024 (gmt 0)



"if they're reposting that much, so quickly, and for so long, then it must be automated. which probably means you can block it."

It was and usually is, but right now it is not.. or if it is, it's through "sessions" and different users with separate cookie caches that I'm not able to detect as of yet..

I'm looking into "fingerprinting" but services are expensive and using a Canvas fingerprint seems to maybe be my best bet. Maybe I can see a new pattern and be able to block it, and hopefully all users allow canvas so that I can do: if (canvas image == broken) { block; }
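Roughly what I have in mind on the server side, as a Python sketch: hash a few values the browser reports and look for one fingerprint shared by several accounts. The browser-side script that produces the canvas value is not shown, and the function names, the chosen fields, and the threshold are all assumptions.

```python
import hashlib
from collections import defaultdict


def fingerprint_id(canvas_data, user_agent, timezone_name):
    """Hash a few client-reported values into a short identifier."""
    raw = "|".join([canvas_data, user_agent, timezone_name])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def shared_fingerprints(sessions, min_accounts=3):
    """Map fingerprint -> sorted account names, for fingerprints used by
    at least min_accounts different accounts.

    sessions is an iterable of (account, fingerprint) pairs.
    """
    accounts = defaultdict(set)
    for account, fingerprint in sessions:
        accounts[fingerprint].add(account)
    return {
        fingerprint: sorted(names)
        for fingerprint, names in accounts.items()
        if len(names) >= min_accounts
    }
```

Different people on identical hardware and browsers can produce the same canvas output, so a shared fingerprint is a hint to look closer, not proof.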

"i would start by looking into blocking frames (i'm still not sure whether you're talking about your site being framed, or scraped -- if it's being framed then it should be very easy to fix). if it's scraping then maybe i'd pay for cloudflare and start using all their blocking features."

I am blocking iframes... I do use Cloudflare and their blocking features, like tons of them...

"One question… are the words they are posting identical to the words you are posting? (…ignoring any words that might be added around the article)"

Sometimes yes, I think they are spinning it and basically making a slightly better version. Without blocking all of the VPNs and Cloud traffic, what's happening (or what I think is happening) is my text is being added to tons of their penalized websites and it's getting mine removed from the rankings.
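To put a number on "identical or spun", one option is to compare overlapping word sequences between my article and theirs. A minimal Python sketch using word shingles and Jaccard similarity; the function names and the shingle size of 3 are my own choices for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of word shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A straight copy scores close to 1.0, while heavily spun text scores much lower even when it is obviously derived, so a low score does not clear them.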

Like, if they aren't able to copy it immediately, the other tactic seems to be this penalty that gets my page completely removed. There are other pages still there (because others found it) but when they post it (even the next day) my page disappears...

"it's difficult not to take these attacks personally, but you're probably just up against computers who don’t know you from adam, and do this to thousands of other websites every day"

Ya I get it, easier said than done.. 4 minutes man! 4 minutes! How is this "SEO" and why can't I do something about it? :(

seovshate

9:07 pm on Jan 19, 2024 (gmt 0)



"it's difficult not to take these attacks personally, but you're probably just up against computers who don’t know you from adam, and do this to thousands of other websites every day"

I was typing something else, but as you say this, isn't that even worse? So there are thousands of others feeling like I am because of 1 person and/or company? That seems toxic to humankind.... so shouldn't we disallow this, or is it back to my original question:

Is this "SEO" or is this "hate" or what is it? Please help me webmasters.

seovshate

9:48 pm on Jan 19, 2024 (gmt 0)



"it's difficult not to take these attacks personally, but you're probably just up against computers who don’t know you from adam, and do this to thousands of other websites every day"

Also, so what do I do? What would you do? Imagine you have already been in this "battle" for years and now they appear butt hurt because you have literally won time and time again.. and now they just straight hate on you like "what are you going to do"?

I'm sorry if this word seems weird, but what else do you call it? What do I do here?

I have an enemy and he's not hiding it for quite a while now...

I don't think Google needs many versions of the same thing for them to work...

There's only 1 reason to do this "extra work" ....

Happy Friday! I can't wait to post here in a few hours and it happen again! WOO!

seovshate

10:11 pm on Jan 19, 2024 (gmt 0)



"i would start by looking into blocking frames (i'm still not sure whether you're talking about your site being framed, or scraped -- if it's being framed then it should be very easy to fix). if it's scraping then maybe i'd pay for cloudflare and start using all their blocking features."

I've always had a paid Cloudflare account.. it didn't even stop the DDOS because it was over 900,000 requests every time in a few hours...

It wasn't the website, nor images but the pixel inside of the image that verifies when a user opens the newsletter.

So it wasn't too page heavy, but the sheer number of requests, and serving the image after it saves the data, crippled the network..

They did this every single time my rankings went up and usually within a few hours.. they are monitoring something very heavily...

I have caches set up for things now so that when it overloads it stops processing certain things and appears to act normal. It has stopped the ddos's...

Like I would literally find a new IP and then ban them and then less than 1 hour later I got ddosed.. about 20 times? Ya.. it's this same exact guy.

seovshate

10:12 pm on Jan 19, 2024 (gmt 0)



The pixel is just one instance, it was small thing after thing and I like had to cache everything and more.

seovshate

10:19 pm on Jan 19, 2024 (gmt 0)



When the DDOS's happen and the really illegal things.. they always do it so that it's anonymous for sure and then if you were to track it down, they can say it was an accident..

For instance, when it's attacking a page.. it will send all of the requests to the same exact page with the same query parameter and everything. It won't matter because it's too many requests, and after a while (I forget how long) Cloudflare appears to shift your website onto an "attack server" so that regular users aren't affected. It slows down your TTFB for a few days and more.

But the only reason for the same page and stuff I believe is to say "oops, It got stuck in a loop. I didn't mean to send 900,000 requests".
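Since the flood always hits one exact URL, that pattern is something that can be detected. A rough Python sketch of a sliding-window counter per (path, query) pair; the class name and the limits are made up, and it keeps everything in memory, so it is only an illustration of the idea.

```python
from collections import defaultdict, deque


class RepeatedRequestDetector:
    """Flags a (path, query) pair that is requested too often within a time window."""

    def __init__(self, max_hits=100, window_seconds=60):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # (path, query) -> recent timestamps

    def is_flood(self, path, query, now):
        """Record one request at time `now` (seconds) and report whether the
        same URL has exceeded max_hits inside the window."""
        timestamps = self._hits[(path, query)]
        timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        return len(timestamps) > self.max_hits
```

Once a URL is flagged, the response can be served from cache or refused outright instead of running the normal page code.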

seovshate

11:55 pm on Jan 19, 2024 (gmt 0)



What I do for pages is on the top of the scripts.. I tried everything but what seems to work is this:

If CPU Usage > 90 then
headers(Cache page for 5 minutes);

Even when the requests keep coming, because of cache cloudflare is better at handling it (maybe because it's not hitting the home server) and you don't get shifted to that server and their attack does nothing.
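Written out as a Python sketch (my site is PHP, so this is only to show the logic): when the machine is busy, send a header that lets the page be cached for 5 minutes. The function names are made up, and load average per CPU is used here as a stand-in for "CPU usage", which is an assumption on my part.

```python
import os


def cache_headers(load_per_cpu, threshold=0.9, max_age_seconds=300):
    """Return a cacheable Cache-Control header when load is high, else no-cache."""
    if load_per_cpu > threshold:
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    return {"Cache-Control": "no-cache"}


def current_load_per_cpu():
    """1-minute load average divided by CPU count (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)
```

Usage would be something like `cache_headers(current_load_per_cpu())` at the top of the page script. Whether a CDN actually caches HTML on the strength of this header depends on how its caching rules are configured.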