
Google Search Exploit using Indexing API and Scraped Content

Negative SEO...got your attention? :c)

         

OldFaces

4:53 am on Dec 21, 2023 (gmt 0)

10+ Year Member, Top Contributors of the Month



TLDR:
- crawl and scrape content-rich websites
- take the content and host it on hijacked and/or numerous owned domains
- use the Indexing API to get 40%+ indexed
- capture traffic for phishing and/or ad impressions

Sure, Google won't rank these websites high in the SERPs, and they *should* ultimately be removed (well...not always [webmasterworld.com...]), but what are the potential crawling & indexing impacts on the content-rich websites that are scraped?


SEO Hypothetical: If you were to crawl and scrape a content-rich website, host that content on your own unique domains, and immediately turn around and use the Indexing API (you know, the one "for job postings and events" that everyone is abusing to sell mass-indexing services), your chances of being indexed are high and you *could* be the first to present (or re-present) this content to googlebot.
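
For a sense of how low the barrier is: the publish call itself is a few lines. A minimal sketch of the documented "legit" usage, assuming a service account with the indexing scope (the credentials file and URL below are placeholders, not anything from our setup):

```python
# Minimal Indexing API submission sketch (documented usage).
# Assumes google-api-python-client and google-auth are installed and a
# service account is set up; file name and URL are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/indexing"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # placeholder credentials file

service = build("indexing", "v3", credentials=creds)

# Notify Google that a URL was added/updated. Officially this is only
# supported for JobPosting and BroadcastEvent pages -- hence the abuse.
response = service.urlNotifications().publish(
    body={"url": "https://example.com/some-page", "type": "URL_UPDATED"}
).execute()
print(response)
```

Multiply that by thousands of domains and a pool of API keys and you have the mass-indexing services being sold on BHW.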

You could render this scraped content across thousands of hijacked or compromised websites with unique domains and, using various Indexing API keys (or hiring a service, of which there are tons on BHW), submit these pages for immediate crawling.

The impacts of this could be significant for obvious reasons. You'd likely get these pages indexed (even the lower-end indexing services on BHW seem to get 40%+ indexed) and could potentially capture traffic to either generate banner revenue or try to install malware.

Now, we all know Google is decent at ultimately finding and removing these spam websites, or at the very least burying them so far down the SERPs they'll never be found.

BUT...
    what does this do to the original content-rich website whose content was scraped?



That's the tricky bit. We don't know. Is it possible that there is a component of Google Search's Spam Detection & Prevention algos that looks for patterns in spammer behavior to prevent future crawls and/or indexation? Perhaps it looks for content patterns? Matching character percentages, etc.?
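
To make the hypothesis concrete, here's a toy version of the kind of fingerprinting I mean: compare two pages by overlap of word shingles. This is NOT Google's algorithm, obviously, just the classic duplicate-detection idea the hypothetical rests on:

```python
# Toy content-fingerprinting sketch: Jaccard similarity of word 5-gram
# shingles. Illustrative only -- not Google's actual spam detection.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "boilerplate text painted on our two primary page types ..."
scraped = "boilerplate text painted on our two primary page types ..."
print(f"similarity: {jaccard(original, scraped):.2f}")  # 1.00 = identical
```

If something like this fires on the same fingerprint across thousands of spam domains, the open question is whether the fingerprint itself, rather than just the domains hosting it, ends up carrying the flag.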

After all, for Search, crawling is costly and you want to limit having to render all these spam websites. Sometimes you can't just ignore the hijacked domains, because 95% of the content on the website is fine; it's just these injected pages. Sure, maybe Search could effectively blackball certain IPs or domains, but not all.

So what is Search to do to prevent the cost of crawling these hijacked websites with scraped content? Look for any other patterns of behavior...and that's where I wonder if there is a negative consequence for content-rich websites that find their scraped content on other domains.

I don't think big domains would be at risk...they have plenty of authority and links pointing to their domains to far outweigh any internal Spam Flags that arise in Search. But what about those medium-sized and smaller websites, where a triggered red flag might be just enough to tip the scales? How would this Spam Flag impact their crawling and/or ultimate indexing?


Backstory:
Beginning in late July we began receiving MASS targeted crawler requests from various data centers in China, then Singapore, Eastern European countries, AWS locations in the US, and now MSFT data centers in Seattle. Approximately 3k unique IP addresses, each attempting anywhere from 8k up to 200k page requests per IP per day, have been blacklisted since July. This targeted crawl continues today, with dozens if not hundreds of new IPs appearing every day.
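
For anyone curious how we surface these, the flagging itself is simple log work. A rough sketch of the idea (the log path and threshold are ours, not universal, and real access logs need more careful parsing):

```python
# Flag candidate IPs for blacklisting: count requests per IP in a
# common/combined-format access log, flag anything over a daily threshold.
from collections import Counter

THRESHOLD = 8_000  # requests/day -- the low end of what we observed per IP

counts = Counter()
with open("access.log") as f:       # placeholder path to one day's log
    for line in f:
        ip = line.split(" ", 1)[0]  # IP is the first field in this format
        counts[ip] += 1

for ip, n in counts.most_common():
    if n < THRESHOLD:
        break
    print(f"{ip}\t{n} requests -> blacklist candidate")
```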

So obviously this is a targeted attack, and not just our fair share of the 'noise' that exists on the web. If it's targeted, it must mean that the entity doing it gets SOME kind of value. What value...who cares? Let's stick to the topic...

Performing an "exact match" Google search for various boilerplate terminology we have painted on two primary page types (e.g. pages contained in domain/dir1/ and domain/dir2/), and setting All Filters > Tools to "Past 24 hours" or "Past week", always shows a significant number of 3rd-party hijacked websites hosting our content.
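
If you want to reproduce that kind of check on your own content, the query is just an exact-match string plus a date filter. A small sketch (the boilerplate string is a placeholder, and tbs=qdr:* is a long-standing but undocumented Google URL parameter: d = day, w = week, m = month):

```python
# Build an exact-match, time-filtered Google search URL.
from urllib.parse import urlencode

boilerplate = "a blob of text that appears on our pages"  # placeholder
params = {"q": f'"{boilerplate}"', "tbs": "qdr:w"}        # past week
print("https://www.google.com/search?" + urlencode(params))
```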

Now I'm an old timer and know not to link to anything...but I so want to share a query you can do with "a blob of text that appears on our pages" with the filter set to "Past Month" to see the results. You'll just have to trust me or private message me...but if you do this query you'll see there are pages of 3rd-party websites that scraped and hosted our content and are still indexed. There are TWO pages from our actual domain at the top of the results, followed by pages of results from 3rd-party unique domains with our scraped content.

Yes, these pages will not rank well, and yes, these results will eventually go away (I think). But is it possible that the Search algo, in its ultimate quest to prevent crawling spam websites in the future, uses the keyword content or pattern of content etc. as identifiers of spam, which then get negatively applied to those pages on our domain?

You see, we have a problem where gbot is ignoring our PRIMARY content (pages in domain/dir1 and domain/dir2). Like, big time. Gbot will crawl maybe 10k pages of our primary content, ignore the strong internal URLs, and (I kid you not) spend a 400k-request crawl budget pinging internal API URLs that are only found and compiled via AJAX on these pages. lol So it's not a problem with Google's crawl budget, but a problem with what Google finds to be unique/valuable content on our domain.
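
For what it's worth, the standard lever for that particular waste would be a robots.txt disallow on those endpoints, assuming they live under a predictable path (the directory below is hypothetical, not our actual structure):

```
# Hypothetical robots.txt addition -- /internal-api/ stands in for
# wherever those AJAX-compiled endpoints actually live.
User-agent: Googlebot
Disallow: /internal-api/
```

That would free up the budget, but it doesn't explain why gbot prefers those URLs over our primary content in the first place.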

So back to the hypothetical: gbot learns that discovered content fitting xyz patterns, regardless of domain, triggers a red flag that says 'this content is not worth crawling regardless of the domain'. So if gbot discovers content that matches this pattern on our domain, it treats us, the original content creator, just like any of these other hijacked websites and learns to stop prioritizing crawling and indexing content found in those domain/dir1 and domain/dir2 directories.

YES, our domain has brand authority. YES, we are crawled by gbot, but at a fraction of what was historically the case. Our two primary page types (those that are being scraped) are no longer being crawled as they should be, and the small percentage of these two types of pages that are indexed fluctuates in and out of the index. Most never get crawled to begin with.

Obviously, this is getting into very specific granularity, into things we, and likely most of the search team at Google, couldn't answer. But it's fascinating to think about.

This also offers some interesting observations for when people talk about "Negative SEO". Some adamantly say Negative SEO doesn't really exist because it does nothing to SERPs, but could there be consequences for crawling and indexing...?


Ask Bard about this hypothetical. Just say "Do you have any thoughts about this hypothetical?" and copy/paste the above. ChatGPT has a different take, but I (likely incorrectly) tend to work first with Bard when it comes to SEO.

I agree with how Bard ends its response with "Need for Ongoing Research".