WP scraped/fake sites gaming Google serps?

RedBar

8:45 pm on Nov 25, 2021 (gmt 0)

System: The following 12 messages were cut out of thread at: https://www.webmasterworld.com/google/5052789.htm [webmasterworld.com] by not2easy - 11:01 am on Nov 27, 2021 (atl -4)

Fake sites --- Here's a sort of possibly crazy question?

Those who have sites copied, are they WordPress?

Are the fake sites WordPress?

I appreciate that the majority of sites are WP, just seeing if any dots can be connected as to why G is not able to distinguish the fakes and especially so with all the WP bloat code?

Markedd

9:22 am on Nov 26, 2021 (gmt 0)

@RedBar In my case, yes, they all are built on the Wordpress platform.

RedBar

2:48 pm on Nov 26, 2021 (gmt 0)

@Markedd - Have you ever considered a different platform / template?

Traffic for Thursday was 106%, most unexpected with US traffic lower than average but still 28.3% of overall PVs.

Today is slow so far.

Markedd

5:32 pm on Nov 26, 2021 (gmt 0)

@RedBar I never saw a reason to. Do you think Google has trouble distinguishing between Wordpress websites?

RedBar

6:07 pm on Nov 26, 2021 (gmt 0)

@Markedd - I have absolutely no idea other than WP is so code bloaty that to distinguish one WP site from another similar WP site x millions must be extremely time-consuming and difficult if not almost impossible.

Question, do you host your WP site or do you use WP / whoever?

Markedd

8:07 am on Nov 27, 2021 (gmt 0)

@RedBar I use a third-party VPS and an extremely lightweight theme, so the only bloat could be the WP core code itself.

RedBar

1:46 pm on Nov 27, 2021 (gmt 0)

@Markedd - Even the most lightweight of WP designs are very bloaty in comparison to custom built code. WP has some great designs and plug-ins and is the de facto standard for many bloggers etc. I even recommend this type of solution, especially for journalists. A top journo friend of mine could not believe it when I told him to go down the WP route, but he's really happy he listened to my advice.

However, with all that bloat there must be complications and conflicts, and yes, WP does do updates and bug fixes. This makes me wonder whether anyone who really knows WP code also knows how to manipulate it to their advantage when copying / scraping someone else's site?

What does anyone else think?

I wouldn't know where to start creating WP code. However, insofar as my sites are concerned, I can look at my html5 code in a text pad and immediately see if / where there is an error.

Am I barking up the wrong tree when it comes to scraping and ranking?

not2easy

3:35 pm on Nov 27, 2021 (gmt 0)

Sorry for the thread split, but I wanted to post and did not want to derail the monthly discussion for it. The thing about WP is that it is easier to copy using a browser than a bot because the "pages" don't physically exist except as unformatted table entries in a .sql file. Headers and titles may be in different tables so unless you can hack into it and gain access to the .sql you can only get parts and pieces. If you can get to the .sql you have the text content and assorted settings.

I am pretty sure that there are scripted bots that could scrape the source code, but that only includes URLs to the resources such as images. WP is just a framework and the contents are stored in .sql tables and images stored in separate folders. Resources such as .js and .css are in other folders. It is simple enough to protect it with permissions, careful plugin selection and regular management of updates. Not that it can't be done, but hard to believe it is widespread wholesale scraping.

Lazy iframing can borrow the contents but it is easily noticed by bots and owners. If a 'fake' site creator visits the leading sites and scrapes/copies their text and repurposes that, it might help for short term gain, but scaling it for volume would be messy and temporary.

NickMNS

4:54 pm on Nov 27, 2021 (gmt 0)

I am pretty sure that there are scripted bots that could scrape the source code,

First off, we did not define what we mean by "source code". On the web, code falls into two categories: client side code (typically JS) and server side code (PHP, Node.js, Python, Perl...).

Client side:
No bot is needed to access the client side code, because for it to be able to run, it must be downloaded to the client's device and thus the client already has all the code.

Server side:
A typical web "crawler" bot sends requests to the server, and the server returns an HTTP response with the data. Whatever work was done on the server will have been completed before the response is returned to the bot. The bot has no access to the server itself and thus cannot access the server side code. Now, there are plenty of bots that attempt to gain access to the server, but this is somewhat different, and I would be tempted to say it has more to do with server security than with the choice of a particular code stack, as weak security can impact any server running any code. But WP, due to its prevalence and "black box" nature, is particularly susceptible to this type of attack. Many people running WP have little to no experience as webmasters and have little interest in learning. They just want to publish their blog, and that is fine. But ignorance is not bliss: just because you are unaware of the existence of a particular risk does not mean that you will not be impacted by it.
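
To illustrate the request/response point, here is a minimal Python sketch. It is only a toy: the handler, the TEMPLATE string and the fetch_once() helper are made-up names for this example and have nothing to do with how WP itself is written. A tiny local server fills in a template, and the "bot" only ever receives the finished HTML, never the server side code that produced it.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical server side "source": this string never leaves the server.
TEMPLATE = "<h1>{title}</h1>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # All server side work happens here, before anything is sent back.
        body = TEMPLATE.format(title="Hello").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet: suppress the default request logging.
        pass


def fetch_once():
    """Start a throwaway local server, request one page, return the body."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/")
        body = conn.getresponse().read().decode("utf-8")
        conn.close()
        return body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    # The client sees the rendered markup only, not the template or handler.
    print(fetch_once())
```

The same holds for a crawler hitting a real site: what comes back is the output of the server side code, whatever stack produced it.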

The thing about WP is that it is easier to copy using a browser than a bot because the "pages" don't physically exist except as unformatted table entries in a .sql file.

This is basically true of any website, not just WP. This of course assumes that the nature of the data is relatively linear and easily consumed on a per-page basis. There may be times where getting "unformatted" data is preferable. But that is a whole separate discussion.

The other point raised in the thread is that "code bloat" somehow leads to obfuscation of the true content. That simply is not the case. To be clear, I'm referring here to "code bloat" in terms of HTML/CSS on a page, as server side code is completely irrelevant to the stealing of content. It is incredibly simple to parse content from HTML, and bloated code doesn't change that. You can test this yourself: open dev tools in your browser, select the "body" tag, then in the console type:

$0.innerText

All and only the text that appears on the page will be printed to the console, regardless of the depth of the DOM tree or the number of useless CSS classes and ids that appear in the code.

Code bloat should be avoided because it can slow the page speed down, which in turn can have many negative effects on SEO, but code bloat will not prevent Googlebot or any other bot from capturing the content of the page.
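
For anyone who wants to try the same idea outside the browser, here is a rough Python sketch using only the standard library. It is not what Googlebot or any particular scraper does, and it is cruder than innerText (it ignores CSS visibility, for instance). The class name, the visible_text() helper and both sample strings are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags, attributes, scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside <script> or <style>.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample pages: same text, very different amounts of markup.
LEAN = "<p>Hello world</p>"
BLOATED = (
    '<div id="wrap" class="a b c"><div class="row"><div class="col x-12">'
    "<style>.a{color:red}</style>"
    '<div class="inner t1"><p class="entry">Hello world</p></div>'
    "</div></div></div>"
)

if __name__ == "__main__":
    print(visible_text(LEAN))
    print(visible_text(BLOATED))
```

Both calls give back the same text, which is the point: the extra wrappers and class names cost bytes and time, but they do not hide anything from a parser.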

RedBar

6:32 pm on Nov 27, 2021 (gmt 0)

@not2easy - I actually have a programme which can copy any website in its entirety UNLESS the site's host denies me from doing so, I can assure you this is very rare.

not2easy

7:51 pm on Nov 27, 2021 (gmt 0)

@NickMNS - WP is PHP, which runs server side. The pages are created on the fly from the WP platform framework, but the pages do not exist on the server in the form they are viewed. The request calls the parts together.

Code bloat in WP is their effort to be everything for everyone so unless you are competent enough to figure out how to disable unwanted unused 'goodies' they load unused and unwanted. I'm sure you could deal with it but many starting out have absolutely no idea and might rely on some plugin to deal with it.

The source code I referred to is the html code that exists only in the browser after WP assembles the parts.
----
@RedBar, I have seen some of these programs. But having a basket of parts does not include the content. With proper setup of the WP database, your program would run into trouble getting to the .sql where the content is stored.

JesterMagic

3:02 pm on Nov 28, 2021 (gmt 0)

One of my non WP sites gets copied all the time by bots and then gets put on what are usually WP sites. The bot just copies the article itself and ignores the rest. I guess it helps that I have all articles in an article HTML section.

In this type of situation Google should be able to tell the difference between what is real and what is fake, because these sites are copying content from somewhere else.

Has nothing really to do with WP.

These people just use WP because it is easy to spin up a site and there are a lot of plugins etc allowing webmasters to do all sorts of things.
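
On the point about the article HTML section making the copying easy, here is a minimal Python sketch of how a bot could keep only the article and drop the rest. This is a guess at the general approach, not the code any real scraper uses. ArticleExtractor, article_text() and the sample PAGE are made up for illustration.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Keeps text found inside <article> elements and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many <article> elements we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def article_text(html):
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page: navigation and footer surround the article.
PAGE = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>My title</h1><p>First paragraph.</p></article>"
    "<footer>Copyright notice</footer></body></html>"
)

if __name__ == "__main__":
    print(article_text(PAGE))
```

The navigation and footer text never make it into the result, so the platform the original site runs on makes no difference to the scraper.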