
Is Google penalizing .php scrapers at last?

Phony .php links "cached" 31 DEC 1969!

7:44 am on Jan 28, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


I just did a G-search for site:www.mysite.net.

Up came most or all of my 135 pages. FIVE of the results were on OTHER sites!
Checking the source code on those, all 5 had phony .php links to my pages.
They even had mouseover directives to display MY urls, masking their obvious intent.

I clicked on the Google Cached version of each one.
Each and every one was "cached" as of 31 December 1969.

All my legitimate pages are cached December 2004 or January 2005.

Questions:

1) Does this mean that Google has fixed the phony redirect issue?
2) If NOT, just what do the 1969 cache dates indicate?
3) Does this mean that the scrapers no longer gain PR, placement or other credit for my content?
4) Does this mean that any PR etc. is now passed through to my pages?
5) Why do ANY pages from scraper.com show on a query for site:www.mysite.net at all?

Very curious. - Larry

7:12 pm on Jan 28, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 11, 2004
posts:147
votes: 0


Were they listed as "supplemental results"?

9:48 pm on Jan 28, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


I called up site:www.mysite.net on Google.
These 5 scrapers were right in line with my regular pages.
How do I determine if they are "supplemental results"?

Only the cache dates and, of course, the wrong domains gave them away. -Larry

6:28 am on Jan 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


Hello Siteseo: NOW I see what you mean, and YES.

I did an inurl search inurl:www.mysite.net.

The 5 scraping sites were way, way back at the very end of the 135 listings, all of mine coming first.
Each and every one of those had 'Supplemental Result' clearly indicated.
I missed that earlier. Each one had the 1969 cache date as well, a one-to-one correspondence.

Again:

1) Does this mean that those 5 pages are flagged as duplicate content?
2) Does this indicate some other possible penalty upon the scrapers?
3) Do the scrapers still get PR or other benefit from my work? -or-
4) Is PR etc. possibly passed thru to my rightful original pages?
5) Are "supplemental results" always given the bogus 1969 cache date?

Sorry for all the questions. I'm trying to make sense of all this. -Larry

7:29 am on Jan 29, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:July 3, 2004
posts:74
votes: 0


I see this same 31 Dec 1969 23:59:59 GMT on the cache of an old 301-redirected site of mine, now binned since I finished the domain change. The site hasn't existed for about a month.

I like the humor of the 31 Dec 1969 23:59:59 GMT date, and I like that there is at least one problem from Google's pile that someone there seems to be solving in their core business. :-)

7:40 am on Jan 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


Can anyone answer my numbered questions, please?
(see above) - Larry

1:57 pm on Jan 29, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"I clicked on the Google Cached version of each one.
Each and every one was 'cached' as of 31 December 1969."

I think as long as they're there you'll still be hurt, regardless of the cache date.

2:57 pm on Jan 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


Thanks Walkman: any input is appreciated. I hope your suspicion is wrong of course.

Anyone else? I can't be the only one asking these questions. - Larry

11:35 pm on Jan 29, 2005 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts:2952
votes: 35


As a programmer, I know that 31 Dec 1969, 23:59:59 is what you get when a date is coded in Linux and C as the value -1 (one second before 1 January 1970, the beginning of computer time). I have used this value many times to mark a deleted or special database record, and I think Google's programmers did the same to mark pages/caches for deletion.
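
To see it for yourself, here is a tiny PHP sketch (assuming nothing beyond a stock PHP install):

  <?php
  // Unix time counts seconds from 1 Jan 1970 00:00:00 GMT,
  // so the timestamp -1 is one second before the epoch:
  echo gmdate('d M Y H:i:s', -1); // prints: 31 Dec 1969 23:59:59
  ?>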

I have noticed the same thing for about a week now. I have a set of keywords that gives me a few hundred results, and I check them almost every day to see how my site is performing. For about a week I have seen many scraper sites jumping up and down the list, and many have their cache marked 1969. Old-fashioned bookmark pages from 1997 and 1998 are sometimes marked with this date too. So I guess they have a filter which compares the amount of unique content on a page with its number of links.

The strange thing is that the order of the list seems to change every day. I even have the impression that some pages jump from the state "to be deleted" back to "accepted", but I may be wrong. Maybe these are just newly indexed pages from scraper sites that were not fully indexed yet.

I don't know 100% what is going on, but I think Google has finally implemented a massive filter that will wipe out many sites in the next large update, and they are still fine-tuning the parameters. About 50% of the pages for my keyword set are marked 1969. So if my assumption is correct, 50% of the pages for these keywords will be deleted soon. If that percentage holds for the whole Google index, this is the largest clean-up ever.

Does anyone have figures on the percentage of 1969-marked caches for their own set of keywords?

12:05 am on Jan 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 28, 2002
posts:3468
votes: 18


lammert, I think you are on the right track, but the clean-up has to come soon, because the SERPs are really getting worse. There are a lot of sites which have been hijacked, or hit by the ordinary Google redirect bug, that need a full fresh spidering to get Google up to date again.

12:25 am on Jan 30, 2005 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts:2952
votes: 35


zeus, I agree that the clean-up has to come soon. I see about 10 new scraper pages every day in my keyword search, which is a growth of about 2% per day in the total number of pages with these keywords. And Google will have to keep the filter in place after the clean-up, to reject new scraper sites.

After all, it is so easy to create such a money-making site:

  1. Grab content from sites and SERPs with a bot
  2. Put the content in a database
  3. Write a PHP script which outputs the content as a normal-looking HTML page (a sketch follows below)
  4. Add AdSense to it
  5. Submit the site to the search engines
  6. Wait until the AdSense check arrives.
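
A sketch of step 3, just to show how little work it is (the database, table, and column names here are all hypothetical):

  <?php
  // Step 3 sketch: pull the stolen content out of the database and
  // print it as a normal-looking HTML page. All names are made up.
  $db  = new mysqli('localhost', 'user', 'password', 'scraped');
  $row = $db->query('SELECT title, body FROM pages LIMIT 1')->fetch_assoc();
  echo '<html><head><title>' . htmlspecialchars($row['title']) . '</title></head><body>';
  echo '<h1>' . htmlspecialchars($row['title']) . '</h1>';
  echo $row['body']; // the scraped HTML goes out verbatim
  echo '</body></html>';
  ?>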

Why didn't I do it myself?

  • Am I too stupid to do it? No :(
  • Am I too honest to do it? Hopefully :)

1:41 am on Jan 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 28, 2002
posts:3468
votes: 18


Lammert, that's my weak side too: I'm also too honest for that kind of stuff. About the filter, I do have some trouble with it. I see far too many omitted results; a lot of good, distinct content is being filtered out of sites. I hope they fix that as well.

1:56 am on Jan 30, 2005 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts:2952
votes: 35


I see the same at the moment. In my test, 85% of the scraper pages are marked for deletion, but so are many forum pages with interesting content.

One good thing is that the filter catches almost all illegal copies of my content, so it fortunately recognizes which content is original and which was created later.

As a programmer I really would like to know what kind of tool they built and how they tune it, but I guess GoogleGuy won't post his knowledge about this issue to this thread ;)

2:48 am on Jan 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:684
votes: 2


OK, but how do you go after the scraper site? After reading this, I went and did a check (normally I do it once a month or so) and discovered that my site had been scraped in its entirety (well over 100 pages of static HTML, plus a forum).

The scraped site is fully self-supporting: all the links have been changed to point within the scraped site. They've kept 100% of the content (the cheeky buggers didn't even delete the copyright notice on the pages), with one simple addition, a link at the bottom of every page to a "mother" site, which is just a scumwad directory site. (My own site is notably absent from their directory.)

I'm based in Canada, using a Canadian hosting service. The dirtbag site is registered through GoDaddy, and all the contact information goes to a post office box in Scottsdale, Arizona.

How do I get these guys to clear my site out of their caches? I put a lot of work into redesigning it for good placement in the SERPs, and it's been working like a charm. I don't want to lose that work to a duplication penalty, and I'm very concerned about copyright issues as well.

If this is gonna cost me $$ for lawyers, my site will fold. Period.

3:20 am on Jan 30, 2005 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts:2952
votes: 35


Grelmar, you actually might be lucky that they registered through GoDaddy. There have been posts here saying that GoDaddy takes back domains used for spam sites more quickly than other registrars do. If the domain is gone, their pages will be removed from Google, Yahoo, MSN and the others.

So maybe your first step should be to contact GoDaddy and file an official complaint. If you have enough proof that the copyright belongs to you, they must take action under the DMCA.

An official DMCA complaint must contain the following information:

  1. an electronic or physical signature of the person authorized to act on behalf of the owner of the copyright or other intellectual property interest;
  2. a description of the copyrighted work or other intellectual property that you claim has been infringed;
  3. a description of where the material that you claim is infringing is located on the site;
  4. your address, telephone number, and email address;
  5. a statement by you that you have a good faith belief that the disputed use is not authorized by the copyright or intellectual property owner, its agent, or the law;
  6. a statement by you, made under penalty of perjury, that the above information in your Notice is accurate and that you are the copyright or intellectual property owner or authorized to act on the copyright or intellectual property owner's behalf.

There is also a special forum dealing with copyright at [webmasterworld.com...]

4:02 am on Jan 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:684
votes: 2


Thanks for the advice. I didn't mean to hijack the thread; it just seemed somewhat related.

I'll take the "notify GoDaddy" route. I'm also in luck because, after poking through the site in question, I discovered that they scraped another, much bigger and more commercially oriented site whose owner I happen to know. I'm letting him know so we can both lodge complaints.

8:06 am on Jan 31, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


Hello Lammert & Grelmar:

Do you suppose there is some consensus that the 1969 cache date marks a real or future penalty?

I can say this: NO site in the first five pages for my keywords has a 1969 cache date.
I take this to mean that the 1969 pages are WAY down in the SERPs, a penalty in itself.

If so, then duplicate penalties on the original or rightful pages seem less likely.

It wouldn't be too hard to decide which page was bogus.
The phony ones all seem to have complicated, arcane .php redirects,
while the genuine sites tend to have straight HTML links, internal and outbound.

Here's a great way for Google to improve their rankings:
what better vote for an original site than for some scraper to scrape it!
All they need to do is pass PR thru the phony .php redirects, multiplied by 2 or 3. - Larry

10:00 am on Jan 31, 2005 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts:2952
votes: 35


I have some reason to believe that 1969-marked pages will get the final penalty in the future, not now.
  1. First of all, many of the scraper pages marked 1969 still show a decent PR in the Google toolbar, although that is not a 100% indication of the PR inside Google's own databases.
  2. Most of the 1969-marked pages that use AdSense show normal advertisements, not PSAs. So they have not been booted from AdSense.
  3. Most important, I think, is that I see pages whose date jumps between 1969 and 2004. So 1969 is not a death penalty, just a mark. Unfortunately, one of the pages that jumped back from 1969 to 2004 violates the copyright of one of my pages :( but I have taken other effective measures against that site ;)

1969-marked pages are penalized in some way. For instance, almost all 1969 pages are in the last 50% of the SERPs for my keyword set. Although PR, AdSense etc. may be normal for these pages, no ordinary Google searcher will ever find them. So they may get normal treatment, but no visitors.

Interestingly, there are two recent threads on WW showing effects that might be related to this 1969 <-> 2004 marking action. One is in the AdSense forum [webmasterworld.com], where people are seeing earnings less stable than in the past. The second is in the Google News forum [webmasterworld.com], about rapidly changing SERPs.

10:27 am on Jan 31, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:684
votes: 2


Offhand, I would say that it is indicative of a pending penalty.

Why? Because in my case (at least), it looks as if the scraper has already been penalized, with a PR0 on the "scraped" pages, and the "mother" domain has a much lower PR than I'm used to seeing on directory sites.

It's hard to say for sure, though. I'm playing whack-a-mole trying to keep track of where my scraped content is residing, because this site seems to have a way of floating the content between a number of subdomains (not really a major technological breakthrough). I just wonder why they would bother. If they keep moving the content around, how is G-Bot supposed to assign any PR value to it?

As for keyword issues... I confess I'm at a loss as to the advantage of the technique. Again, if the content keeps moving around, how does that benefit the scraper? A few days in the SERPs with my keywords, then the content moves, and what, they replace it with an AdSense clickthrough page? Sure, if the entire process is automated, but you'd have to be moving huge chunks of content around for it to be viable even then.

Or is my brain just seizing up at 3 a.m.?

10:31 am on Jan 31, 2005 (gmt 0)

Full Member

10+ Year Member

joined:Jan 10, 2005
posts:236
votes: 0



Are these links really all phony? I run some PHP software on my site, and it has a "link tracking" system that works via a redirect. The idea of the link-tracking system is that it updates a counter on my site every time someone clicks a link, so I can see how much traffic I am sending to the places I link to.

There is certainly no malicious intent in doing this, but I have been reading recently that, due to some Google bug, Google handles it badly.

I think you should at least be aware that a lot of the people running these PHP link scripts do it with no ill will and no idea that it might hurt your Google ranking. They do it because they want to count how many clicks they send you; that's it.

Of course, since this Google bug exists, there are no doubt some nasties out there who use it to cheat... I think it should be easy to tell the honest people from the cheats, though. Honest links will still redirect straight to you. The cheats will intercept the click and send the traffic back to their own page...
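
For what it's worth, the honest version is usually nothing more than this (a sketch; the script name and link table are made up):

  <?php
  // out.php?id=42 -- look up the target, count the click, send the visitor on.
  $links = array(42 => 'http://www.example.com/'); // hypothetical link table
  $id = (int) $_GET['id'];
  if (isset($links[$id])) {
      // ...increment a click counter for $id here (database or flat file)...
      header('Location: ' . $links[$id]); // straight out to the real URL
      exit;
  }
  ?>

The honest script sends the visitor straight out; a cheat would point that Location header back at his own page instead.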

Don't just assume the guy with the PHP links is up to no good, though; this is really Google's bug to fix.

11:16 am on Jan 31, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


martingale, that's a very important point. I know of many people who do the same thing with their directory software, and it is never done with any ill intent. I agree with you that Google somehow needs to recognize that these are separate sites and not part of the target site, but I'm sure they are already looking into an automated way to do that.

6:17 pm on Jan 31, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:684
votes: 2


In my case, I know it's malicious. They provide no links to my site, and have copied the entire content of my site.

They went and inserted an extra link to their "mother" site at the bottom of every page of mine they stole.

I just can't see an "honest" reason for doing this.

6:52 pm on Jan 31, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 11, 2004
posts:147
votes: 0


I posted this in another thread as well, but perhaps it bears repeating here. Larryhatch, I originally asked you if the scraper sites were listed as "supplemental results," to which you responded in the affirmative. This jibes with a theory of mine that duplicate-content sites get relegated to the supplemental index. I believe the higher-PR site ranks for the terms and the lesser-PR pages go supplemental. In the future, this should mean that scrapers never have a chance, as they go straight into the supplemental index.
I asked Big G if duplicate content pushes a site into the supplemental results, and this was their response:

"...supplemental sites are part of Google's auxiliary index. We're able to place fewer restraints on sites that we crawl for this auxiliary or supplemental index than sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index.

The index in which a site is included is completely automated; there's no way you can select or change the index in which your site appears. Please be assured that the index in which a site is included does not affect its PageRank."

5:45 am on Feb 1, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


Thanks, siteseo: that is what I was hoping to hear.
The Google response doesn't tell me much, though.

The scraped pages are so low in the SERPs that they aren't much of a concern any more. - Larry

7:30 am on Feb 1, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"I believe the higher-PR site ranks for the terms and the lesser-PR pages go supplemental"

I had an issue on a site of mine with domain-com vs. www.domain-com, and the opposite happened. I think the problem is that Google is penalizing BOTH pages, the supplemental and the "original". That's what I have noticed. Maybe if you have a PR8 or 20,000 backlinks you might overcome it, but normal sites are hit hard.

7:52 am on Feb 1, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


I (very cautiously, being new at this) put up an .htaccess file to 301 redirect from
mysite.net to www.mysite.net.
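
For anyone else new at this, the rule I used looks something like the following (mod_rewrite must be available; domain kept generic, of course):

  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^mysite\.net$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.net/$1 [R=301,L]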

End of problem. I think my rankings improved slightly, but there could be other reasons. -Larry

3:12 pm on Feb 1, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 23, 2002
posts:110
votes: 0


Supplemental Results always go to the end of the SERPs, so don't worry about the scraper's ranking; it's buried as low as it can be.

As for me, however, I found a bunch of Supplemental Result pages of my own. I couldn't find anyone with data similar to my site's, yet lots of my pages went Supplemental. Don't know why...

 
