
Google SEO News and Discussion Forum

Google Products Used for Negative SEO
turbocharged




msg:4567897
 11:52 am on Apr 25, 2013 (gmt 0)

A website I am working on had previously suffered in the SERPs for its homepage. Duplicate content, created externally, resulted in hundreds of copies of their homepage and internal pages. Most of the copies reside on Google-owned properties (Google Apps and Blogspot). To combat the problem, we did a complete homepage rewrite: new images, lots of new text, and new functionality were added. We also modified the .htaccess to prevent this from happening again (or so we thought).

A single-character mistake in the .htaccess file left the homepage, and only the homepage, open to Appspot. Within two days, four Appspot URLs were indexed and our client's homepage was in the "omitted results."

I am here to say that Google's preference for their own brands is harming the internet in more ways than just limiting consumer choice. Webmasters and SEO professionals like me (I'm a webmaster) are spending countless hours defending themselves against Google's products. I discovered this when I did a search and found many complaints from others who have been proxy hijacked via Appspot.

Since I am not an htaccess pro, or an SEO pro by any means, we submitted a request for additional funding from our client to bring an SEO pro on for a limited consulting basis. Other proxies that have cached content, not owned by Google, appear to be placed in the omitted results appropriately. But we mostly develop webpages, and have no idea how many other Google products/techniques are being used as a negative SEO weapon.

Yesterday I sent out 4 DMCA notices to Google on behalf of our client. Today I will work on assembling the rest of the list, which reaches well beyond one hundred different domains, and hand it off to another person in my department to file DMCA notices. Once I get client approval for the SEO pro, he/she will get the list for review.

A good portion of the wasted time/money could be saved if Google used noindex on Appspot proxy pages. Does anyone have any idea why they would legitimately not do so? At the present time, I estimate this problem is going to cost our client $1,500. In addition to submitting the DMCA notices, we will have to rewrite some pages that probably won't meet Google's threshold for removal. Multiply this by thousands or tens of thousands of small businesses, and you have a lot of financial damage occurring. What a nightmare!
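
For anyone who wants to check whether a given URL (a proxied copy, for example) actually carries a noindex signal, a minimal Python sketch follows. The appspot hostname in the example is a placeholder, not a real proxy, and the meta-tag check is deliberately rough.

# Minimal sketch: check whether a URL carries a noindex directive, either in the
# X-Robots-Tag response header or in a <meta name="robots"> tag in the HTML.
# The appspot hostname below is a placeholder, not a real proxy.
import re
import urllib.request

def has_noindex(url):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read(200_000).decode("utf-8", errors="replace")
    if "noindex" in header.lower():
        return True
    # Very rough check; only matches <meta name="robots" content="..."> in that order.
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        body, re.I)
    return bool(meta and "noindex" in meta.group(1).lower())

if __name__ == "__main__":
    print(has_noindex("http://example-proxy.appspot.com/example.com"))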

 

rish3




msg:4570946
 2:40 am on May 6, 2013 (gmt 0)

When I sent the original DMCA notices to Google, and referenced that my client's homepage was indexed in Google on a multitude of Appspot subdomains, they had absolutely no idea what I was talking about.

This is exactly what bothered me. Google, by the nature of what they do, should understand the situation *more* than anyone else. They don't. Leveraging Appspot proxies as part of a negative SEO campaign is very common. They don't seem to know, or perhaps don't care.

My earlier rant, about Google probably understanding it if it were something they cared about, plays off that point. I didn't think the Google home page was a good example because it doesn't even have a complete sentence on it. But if, for example, the AdSense blog were being proxied, and the proxied copy started ranking for terms... I bet they would not only DMCA the content, they would shut down that particular Appspot app and ban the owner.

rish3




msg:4570947
 2:51 am on May 6, 2013 (gmt 0)

Unless you take the time to learn enough about what you're doing to know you could go by IP Range too.


Yes, I understand that. I was highlighting the difference between:

a) blocking the AppSpot user-agent, which prevents ALL AppSpot apps from scraping my site (see the sketch after this post)

b) other proxies, on platforms I'm currently unaware of, that may employ this technique later.

I'm not asking you for technical help. I'm pointing out that it's frustrating that Google gives anything that pretends to be a proxy a pass on the rules.

And, if they get smarter, and copy the content from the Google cache, wayback machine, etc, what then? So long as the front page says "I'm a proxy"...they are immune.
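
For what it's worth, the user-agent block described in (a) can also be done at the application level rather than in .htaccess or other server config. Here is a minimal Python (WSGI) sketch; the "AppEngine-Google" substring is the token App Engine's URL fetcher has historically sent, but verify the exact string against your own access logs before relying on it.

# Rough sketch of a user-agent block equivalent to the .htaccess rule discussed
# in this thread, written as WSGI middleware. Not a drop-in for Apache; illustration only.
BLOCKED_UA_SUBSTRINGS = ("AppEngine-Google",)  # verify against your own access logs

class BlockProxyFetchers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token.lower() in ua.lower() for token in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)

# Usage (hypothetical): application = BlockProxyFetchers(application)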

TheOptimizationIdiot




msg:4570948
 2:59 am on May 6, 2013 (gmt 0)

Oh, wow. I keep trying to get out of this thread!

I'm pointing out that it's frustrating that Google gives anything that pretends to be a proxy a pass on the rules.

They do not. You do not understand the rules online, so you don't know proxies aren't "getting a pass" on them from anyone, including Google.

And, if they get smarter, and copy the content from the Google cache, wayback machine, etc, what then?

You think they haven't already thought of that? Really? Wow! Try your favorite search engine and see why so many recommend using "noarchive" on pages. You're years behind the game, really; scrapers using those sources for data was "old knowledge" 5 or 6 years ago.

turbocharged




msg:4570949
 3:15 am on May 6, 2013 (gmt 0)

This is exactly what bothered me. Google, by the nature of what they do, should understand the situation *more* than anyone else. They don't. Leveraging Appspot proxies as part of a negative SEO campaign is very common. They don't seem to know, or perhaps don't care.

@rish3
In my original DMCA notice I hit on a few points. First, my client's homepage was in the omitted results. What was indexed was a multitude of Appspot subdomains (around 140 of them). Some of the App IDs were named "FU-Company" or the like, which clearly showed intent to harm my client. I also questioned why my client's logo, text, etc. were cached in Google on many of these Appspot subdomains. Regardless of these points, the DMCA was denied. I followed up on each of the DMCA denials stating that Google is storing the cache on these Appspot subdomains, and I would like these removed ASAP. I'm still waiting on their response to the follow-ups.

As noted previously, the caches for the Appspot rips are dropping out of Google. But when you search for a snippet of my client's homepage, his site is still in the omitted results while the Appspot subdomains are still riding high. Most of the Appspot subdomains now have no cache and no meta description, but they are still driving the client site into the omitted results.

I don't think for a minute that Google does not care about what their Appspot proxies are doing. Somehow they are benefiting from this. And the excuse that Appspot proxies are "just a proxy" is absolute rubbish. Google would not be indexing over a hundred of these Appspot proxies pointing to my client's homepage, and caching them, if they were ordinary proxies. What makes them extraordinary? Google owns them, of course. At the very least, less traffic for my client's site, because he is in the omitted results, could mean more traffic to a Google-owned property. That may help pad the quarterly traffic stats they present to investors. Considering how many webmasters are complaining about this, the traffic Google-owned properties get because of stolen content appearing on Appspot proxies could be quite significant at scale.

TheOptimizationIdiot




msg:4570952
 3:36 am on May 6, 2013 (gmt 0)

I followed up on each of the DMCA denials stating that Google is storing the cache on these Appspot subdomains, and I would like these removed ASAP. I'm still waiting on their response to the follow-ups.

Prediction: They will deny them and tell you AppSpot is not storing anything, because you said on AppSpot subdomains rather than of AppSpot subdomains in Google's index. Let me know if I'm wrong and they do anything other than deny you, please. I'd like to know if Google's DMCA people can figure out that when you say something is stored on one site you really mean a "cache of what it displays" is stored on another. TIA

diberry




msg:4570992
 5:24 am on May 6, 2013 (gmt 0)

I'd like to know if Google's DMCA people can figure out that when you say something is stored on one site you really mean a "cache of what it displays" is stored on another. TIA


They can if they have a human look at it. Or at least that's still what I think was going on when I submitted a complaint about an obvious PDF ripoff of my page and it got rejected. Fortunately, I know what PDFs are and how to tell that's what people were posting. I explained that more clearly, and Google suddenly understood and took action.

I think what's happening is that the DMCA team is really bots - only if you appeal their decision or state the case so a bot can understand it do you get the action you were hoping for. That really sucks for less techy webmasters who don't know much about caching, PDFs, how bots think, etc.

A simple solution would be to let us submit screengrabs of the copied material. Surely a computer can look at it graphically and detect the probability that there's a match. If the probability seems good, it gets passed on to a human reviewer.

If people are upset that Google isn't doing more about dupe content, I think they have every right to be. (And Google has every right not to do more.) So while maybe we don't always know the best way to communicate the problem to the DMCA group, it's not as if Google can't automate the process of figuring out what we mean.

tedster




msg:4571084
 11:49 am on May 6, 2013 (gmt 0)

It appears that is what you are trying to do to this thread - turn it into an English lesson with the proper use of verbs, nouns, etc.

That's the way technology is... it doesn't understand "what you mean" it only understands exactly what you say.

This issue is really pretty simple in my mind. A DMCA violation includes PUBLISHING a copy of something. That copy is preserved on a server. A scraper does that.. a thief does that.. but not a proxy server, at least not by definition.

Can indexing of proxy server results create duplicate content in the SERPs? Yes, it always has. Google has struggled with it for years. They're better but not perfect. The best defense, in my mind, is the forward-reverse googlebot check.

That step can also be complex to understand well - as one of our first big threads about it clearly showed a few years back: [webmasterworld.com...]
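
For reference, the forward-reverse check mentioned above can be scripted in a few lines. A sketch, assuming you have the claimed Googlebot IP from your access log: the reverse-resolved hostname should end in googlebot.com or google.com and should resolve back to the same IP.

# Sketch of the forward-reverse (double) DNS lookup used to verify Googlebot.
# 1) reverse-resolve the claimed IP, 2) check the hostname's domain,
# 3) forward-resolve that hostname and confirm it maps back to the same IP.
import socket

def is_real_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

# Example: is_real_googlebot("66.249.66.1")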

rish3




msg:4571279
 9:08 pm on May 6, 2013 (gmt 0)

They do not. You do not understand the rules online, so you don't know proxies aren't "getting a pass" on them from anyone, including Google.


Alright, then, I'll outline why I feel that way:

Two different websites were presenting near verbatim copies of my content to Google's crawler, and those pages got indexed in the Google cache.

Neither was monetizing the content. One happened to call itself a proxy. The DMCA request for the one that didn't call itself a proxy was granted. The DMCA request for the one that did call itself a proxy was denied... with specific wording from Google that it was denied because the site "was a proxy".

Lastly, again, I disagree that these "web based proxies" should be considered anything other than a normal website from Google's perspective. The RFCs are talking about an entirely different kind of proxy.

TheOptimizationIdiot




msg:4571285
 9:22 pm on May 6, 2013 (gmt 0)

Lastly, again, I disagree that these "web based proxies" should be considered anything other than a normal website from Google's perspective.

Oh, I absolutely agree, but I don't make the rules on the legality of proxy serving, and I don't get to decide whether Google goes with "original discovery" rather than whatever BS they use now, which they seem to think works but, judging from the complaints here over the years, really doesn't.

To do what I do and be successful at it, there are a number of times when I just have to throw out what I think about the way things should be done, realize the situation is what it is and I can't change it, and accept that my job is to deal with it and find a way to overcome it. (IOW: complaining would get me nowhere, and knowing the rules helps me think "that's a stupid rule (or way to do things), but it is what it is, so how do I deal with and overcome it?".)

And I think Google's wrong to allow a duplicate to outrank an original discovery. But it's not my search engine, so I don't get to decide.

I had one case where the original had been known to Google for 3+ years before a duplicate was posted. The page generated a significant amount of traffic for years before and after the duplicate was posted. One day the forum duplicate (posted and discovered 3+ years after the original, which continued to outrank it for years) floated to the top for some "only Google knows" reason that makes no sense to me. I don't think it's right. I don't like it. But my job is to STFU and deal with it to get the right page back ranking.

The RFCs are talking about an entirely different kind of proxy.

This is where I keep talking about knowledge level. I'm not trying to sound harsh, but the reality is that if you read the following with an open mind, you'll see the RFC describes what AppSpot's proxies are. It's not Google who has it wrong. I'm sorry. I don't necessarily like it, but that's reality.

Emphasis Added
An intermediary program which acts as both a server and a client for the purpose of making requests on behalf of other clients. Requests are serviced internally or by passing them on, with possible translation, to other servers. A proxy MUST implement both the client and server requirements of this specification. A "transparent proxy" is a proxy that does not modify the request or response beyond what is required for proxy authentication and identification. A "non-transparent proxy" is a proxy that modifies the request or response in order to provide some added service to the user agent, such as group annotation services, media type transformation, protocol reduction, or anonymity filtering. Except where either transparent or non-transparent behavior is explicitly stated, the HTTP proxy requirements apply to both types of proxies.

[faqs.org...]

AppSpot's non-transparent proxy: acts as both a server and a client; makes requests on behalf of others; requests are passed on (they don't cache); they add a service to the user-agent and modify the request by requesting the page themselves rather than forwarding the original requesting user-agent to the host server.

They're a "real" proxy by definition according RFC2616.

rish3




msg:4571313
 10:12 pm on May 6, 2013 (gmt 0)

This is where I keep talking about knowledge level. I'm not trying to sound harsh, but the reality is that if you read the following with an open mind, you'll see the RFC is not talking about a totally different type of proxy at all.


I get your point, I do. But these "proxies" don't follow the requirements of the RFCs.

For example...

If a proxy receives a host name which is not a fully qualified domain name, it MAY add its domain to the host name it received. If a proxy receives a fully qualified domain name, the proxy MUST NOT change the hostname.


The best analogy I can give is that a go-kart might appear to be a motor vehicle. That doesn't make it one, and doesn't allow it to inherit, for example, the right-of-way laws afforded to one.

You keep hinting that I don't know what I'm talking about. Consider that perhaps we are just misunderstanding each other.

TheOptimizationIdiot




msg:4571317
 10:21 pm on May 6, 2013 (gmt 0)

If a proxy receives a host name which is not a fully qualified domain name, it MAY add its domain to the host name it received. If a proxy receives a fully qualified domain name, the proxy MUST NOT change the hostname.

Why do you think they're changing the hostname?

When the proxy makes a request to a remote location on behalf of the visitor to the proxy, the visitor sees the response to the request made on their behalf from the location (including hostname) on the remote server; unless it's a redirect, in which case they see the response from the location redirected to.

So, when you request http://www.google.com/ via the AppSpot proxy, that's what's returned to you via the proxy. They're not changing the location (hostname) requested by the visitor and showing their visitors something other than what's hosted on the location of the site they're proxy serving, are they?

When you request http://google.com/ via the AppSpot proxy, the request is redirected to www.google.com and that's the information shown to the requesting AppSpot visitor. Or am I seeing something different, not noticing something?

If they are doing something different please give an example.

rish3




msg:4571322
 10:39 pm on May 6, 2013 (gmt 0)

But how do you think they're changing the hostname?


This is the crux of where we disagree. I feel like the RFC is talking about real proxies that implement non-optional requirements, like:

- inserting Via: headers (sec 14.45)
- deleting hop-by-hop headers such as Connection: (sec. 14.10).
- Adding warning headers since they've changed the entity-body (sec 14.46)

In other words, I don't think they had hacks like the AppSpot proxy in mind.

Personally, I believe they were basing the RFC on proxies that intercepted the requests either by being "in-between" the web browser and the server, or by deliberate configuration in the browser (settings->proxy).

So, getting back to the original question: in those scenarios, "changing the url" is what they are doing to every URL in the body of the page... they are changing it to point to their own domain. It's a neat little hack that I feel is a go-kart.

TheOptimizationIdiot




msg:4571330
 11:27 pm on May 6, 2013 (gmt 0)

Okay, I'm looking at everything you're saying:
All Emphasis Added

- inserting Via: headers (sec 14.45)

HTTP/1.0 200 OK
x-xss-protection: 1; mode=block
via: HTTP/1.1 GWA
p3p: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
content-type: text/html; charset=ISO-8859-1
x-frame-options: SAMEORIGIN
cache-control: max-age=3600
Vary: Accept-Encoding
Date: Mon, 06 May 2013 22:46:49 GMT
Server: Google Frontend

[source: WebmasterWorld control panel server header]



- deleting hop-by-hop headers such as Connection: (sec. 14.10).

13.5.1 End-to-end and Hop-by-hop Headers

Hop-by-hop headers, which are meaningful only for a single
transport-level connection, and are not stored by caches or
forwarded by proxies.


The following HTTP/1.1 headers are hop-by-hop headers:

- Connection
- Keep-Alive
- Proxy-Authenticate
- Proxy-Authorization
- TE
- Trailers
- Transfer-Encoding
- Upgrade



- Adding warning headers since they've changed the entity-body (sec 14.46)

14.46 Warning

The Warning general-header field is used to carry additional information about the status or transformation of a message which might not be reflected in the message. This information is typically used to warn about a possible lack of semantic transparency from caching operations or transformations applied to the entity body of the message.

The links are the only thing I can find modified by the proxy, and it's easily argued that a non-transparent proxy modifying the links to keep requests running through the proxy is not only necessary, it's obvious, and it would defeat the purpose not to do so. It's reflected in the message, and it's also semantically transparent to users.

Since it's obviously reflected in the message and transparent to users, they are not required to set the specific header when they serve the body of information as they receive it, except to ensure they continue making subsequent requests on behalf of their user when a link is clicked.

IOW: They don't change the "original information" of the message; they simply make sure the subsequent requests are made via their service, and warning someone they are doing that would not really be necessary for a "non-transparent" proxy, since it's not only necessary to continue proxy serving, it's obviously reflected in the message presented to users.

IMO, they would be "farther off" as a "non-transparent" proxy if they did not ensure subsequent requests went through their service than they are for not warning people that they do.

So, getting back to the original question: in those scenarios, "changing the url" is what they are doing to every URL in the body of the page... they are changing it to point to their own domain. It's a neat little hack that I feel is a go-kart.

It's not really a "neat little hack" imo. To function as a non-transparent proxy at the request of the user, they have to, and it's not only understood by the user, it's obvious in the message body, so there's no "trick" or "hack" involved.

A "trick", imo, would be if a person thought they were "surfing via proxy" and then they clicked a link and requested the URL on the site being proxied themselves and landed on the site they thought they were surfing via proxy without a level of anonymity they should be able to expect.

rish3




msg:4571344
 12:14 am on May 7, 2013 (gmt 0)

- inserting Via: headers (sec 14.45)

I can't tell what your example is. I suspect it's an example where an AppSpot proxy passes on an existing Via: header. The AppSpot proxies will pass one on if it exists, but they don't inject one as they should.

Hop-by-hop headers, which are meaningful only for a single transport-level connection, and are not stored by caches or forwarded by proxies.


Correct...so a proxy that's compliant doesn't forward one if it exists..in other words, it deletes it. The AppSpot proxy doesn't comply in that respect.

The links are the only thing I can find modified by the proxy

The appspot proxies blindly change many things that have the hostname in them. <link rel="canonical" />, for example.

A "trick", imo, would be if a person thought they were "surfing via proxy" and then they clicked a link and requested the URL on the site being proxied themselves and landed on the site they thought they were surfing via proxy


That only happens with these types of proxies. Again, I think a "real proxy" is one that is either

- An obstacle between the web browser and the internet at large...all packets funnel between it and the outside world.
- Specifically configured via options->proxy in a web browser, which basically creates the same situation.

All that said, I think this is an issue where two people can respectfully disagree. You think that the RFC had these type of proxies in mind, I don't.

TheOptimizationIdiot




msg:4571345
 12:17 am on May 7, 2013 (gmt 0)

Honestly, rish3, I wish there were some things they would change, but in all the years I've looked, they don't make many mistakes, and they may "play in the grey" a bit, but they're usually not totally, flat out wrong when it comes to standards and protocol.

Anything there's a question about, I've found, is "subject to interpretation". I really wish I could publish original content and know that if it's discovered on my site first by gBot I'm never going to have to worry about it being replaced by another site. I wish things were that way, and that easy, and I think that's how they should be, but it's not reality.

TheOptimizationIdiot




msg:4571351
 12:33 am on May 7, 2013 (gmt 0)

I'm sorry you're misunderstanding things.

You're right, we'll have to agree to disagree. But if you really feel like testing whether your interpretation of AppSpot proxies v. RFC 2616 is correct rather than mine, then please file a suit against Google, because if you're right and I'm wrong (and they're wrong), then you have grounds to file against them and you'll win; but if not, then you'll get your a** handed to you.

Correct...so a proxy that's compliant doesn't forward one if it exists..in other words, it deletes it. The AppSpot proxy doesn't comply in that respect.

Well, I'm not sure what server header check you're using, but I've checked more than one page on more than one site and the AppSpot proxy mentioned earlier in this thread is compliant.

Have you actually tried it? And if so, where?

I suspect it's an example where an AppSpot proxy passes on an existing Via: header. The AppSpot proxies will pass one on if it exists, but they don't inject one as it should.

Well, rather than suspecting, you should try it out for yourself. It's really simple. Go to your control panel, click Server Headers (on the bottom left) then enter sites until you find one that doesn't have a via header. After you do, enter that site via the AppSpot Proxy I cited earlier in this thread and see if you get the via header. I've gotten the same result from more than one site, which is the added via header.

BTW: Please, be honest if you do, because it's obvious to me you haven't even taken the time to check.



If anyone wants to see the truth, please, go to:
control panel > server headers and enter:


http://www.huffingtonpost.com/
HTTP/1.0 200 OK
Server: Apache
Content-Type: text/html; charset=utf-8
P3P: CP='NO P3P'
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=15
Date: Tue, 07 May 2013 00:41:17 GMT
Connection: close

http://my-home-proxie.appspot.com/huffingtonpost.com

HTTP/1.0 200 OK
via: HTTP/1.1 GWA
vary: Accept-Encoding
p3p: CP='NO P3P'
content-type: text/html; charset=utf-8
cache-control: max-age=3600
Date: Tue, 07 May 2013 00:39:09 GMT
Server: Google Frontend

Once again, emphasis added above.
GL to you rish3.

I don't see a connection header from the proxy server above, but I do see one from huffingtonpost.com and I do see a via from the proxy. Also, please don't try to say I edited my findings, because it's way too easy for people to check and know I didn't.

/EndThreadForMe
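
To reproduce the header comparison above outside the WebmasterWorld control panel, a short Python sketch follows. The proxy URL is the one cited above and may no longer be live, so substitute your own test URLs.

# Sketch: fetch the response headers for the origin and for the proxied copy and
# print the headers under discussion (Via, Connection, Server).
import urllib.request

URLS = [
    "http://www.huffingtonpost.com/",
    "http://my-home-proxie.appspot.com/huffingtonpost.com",
]

for url in URLS:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        print(url)
        for name in ("Via", "Connection", "Server"):
            print(f"  {name}: {resp.headers.get(name)}")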

rish3




msg:4571354
 12:49 am on May 7, 2013 (gmt 0)

Honestly, rish3, I wish there were some things they would change, but in all the years I've looked, they don't make many mistakes, and they may "play in the grey" a bit, but they're usually not totally, flat out wrong when it comes to standards and protocol.


It's not Appspot's core code I'm questioning. It's a popular pre-packaged Python "proxy" script running on Appspot. And it's not an RFC-compliant proxy.

Well, rather than suspecting, you should try it out for yourself.

Just did. On the Via header, you are correct. Apologies. It seems to be something Appspot is doing though...nothing in the source for the popular python proxy that would account for it. My "suspecting" was based on looking at the source.

Don't gloat too much though...you had earlier stated that it didn't munge the <link rel=canonical .. /> bit, based on an educated assumption, just like me :)

rish3




msg:4571358
 1:00 am on May 7, 2013 (gmt 0)

Then WTF are you F'ing B*tching about AppSpot for in the First Place?


You sure you didn't make that assumption, or perhaps are confusing me with someone else in this thread?

This isn't the first post where I've referenced the "popular python script" in this thread.

Also, perhaps I could have been more specific, but my references to "appspot proxies" referred to these user-deployed scripts. I thought that was obvious since that's the only kind of proxy on the appspot domain.

rish3




msg:4571359
 1:01 am on May 7, 2013 (gmt 0)

I was complaining about Google, but specifically about Google's team that handles DMCA requests....

tedster




msg:4571388
 2:59 am on May 7, 2013 (gmt 0)

We're not here to have arguments but rather discussions, folks. This thread has been cleaned up of the arguments and locked.
