Forum Moderators: not2easy
I suppose the only way I can stop this is to put a noarchive tag on the page, but then Google can't archive it either, and I do want Google to.
Anyone know how to set up a tag just for that one domain?
<td class="invData"><a href="store_view.asp?sid=31&id=642">Name of My Product</a> x 1
<br />
<span class="txtSmall">Entirety of the text from my page on that product<br /></span><br />
</td>
<td class="invData"><img style="width: 75px; height: 75px; border: 1px solid #C0C0C0;" alt="Name of my product pic" src="images/obj/642.jpg" /><br /></td>
The page from which the text is taken is showing as a supplemental result in Google, so based on a thread in another forum, I looked for the text and found it on this page about an rpg. Thing is, the text doesn't actually show on the page. I don't know if I am getting a duplicate content penalty on this page because of this or what.
<!-- HOST: -->
<TABLE width="100%" border="1" bgcolor="#FFFFFF">
<TR>
<TD><TABLE width="100%" border="0">
<TR>
<TD> <P><FONT color="#000000" size="2" face="Arial, Helvetica, sans-serif">This
is the</FONT><FONT size="2" face="Arial, Helvetica, sans-serif">
<A href="http://www.THEIR DOMAIN.com"><FONT color="#0000FF">THEIR DOMAIN NAME</FONT></A>
<FONT color="#000000">cache of </FONT>
<FONT color="#008000">http://MYDOMAIN.com/</FONT>
<BR>
<FONT color="#000000">The</FONT> <A href="http://www.THEIRDOMAIN.com"><FONT color="#0000FF">THEIR NAME</FONT></A>
<FONT color="#000000">cache is a snapshot that we took when this
page was indexed.</FONT><BR>
<FONT color="#000000">The page may have changed since that time.
Click here for the </FONT><A href="http://MY DOMAIN.com/"><FONT color="#0000FF">current
page</FONT></A><FONT color="#000000">.</FONT></FONT></P></TD>
<TD align="right" valign="top"> <DIV align="right"><A href="javascript:history.go(-1)" target="_self"><FONT size="2" face="Arial, Helvetica, sans-serif" color="#0000FF">Return
To Your Search Results</FONT></A> </DIV></TD>
</TR>
<TR>
<TD colspan="2"><CENTER>
<FONT color="#666666" size="2" face="Arial, Helvetica, sans-serif">THEIR NAME
is not affiliated with the authors of this page nor responsible
for its content.</FONT></CENTER></TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE><BASE HREF="http://wwwMY DOMAIN.com/"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head>
- - - -Meta tags deleted for Posting to WW - - -
<base href="http://www.MY DOMAIN.com">
<script type="text/javascript">
if(top!= self) top.location.href = self.location.href
</script>
</head>
Everything below the Doctype was on my page originally. Everything above it is what they put on there. Notice I have a pop-out-of-frames script on there, but this is not a frame capture. Their page in the browser window looks just like Google's cache of my site, only it has their domain data in it.
It's been a long day. Yesterday I found 5 sites that had taken gobs of text from my site. Today I found 3 more. Most of them are commercial sites that are essentially competitors. I have a copyright notice on every single page, and as I have been updating the site, I have even moved the notice closer to the top of the screen so no one can say they didn't see it. Yet competing businesses still take whole pages of my text like it was nothing. I can't afford to pay copyscape to look automatically for my 450+ pages, and it takes a lot of time to look for this stuff manually. It really has me disgusted.
In a nutshell, a site is hotlinking your content, and not just the pictures. It may be serving little or none of that content from its own server, acting only as a pass-through.
This is called IP delivery: the content of your site is read from your site and presented to the s/e bots and the surfers as the other site's content, in real time.
A simple 4 or 5 statement routine does the heavy lifting: it opens a connection from the thief's site to yours and reads the page the thief points it at.
At this point the thief can pretty much choose what to do: render the page as is, make a copy on their server for later use, or replace any part of the page from your site prior to delivering it.
The most common form just delivers the page as stolen. This usually results in the home page being duplicated.
They can feed the bots just specific pages if they choose.
The next form is that of a modified proxy script or system. Normally proxies do not allow the bots in. Remove the blocks and you have a piece of software that can actually cause the bots to duplicate entire sites as being part of the proxy server's page set.
This code is also fairly straightforward.
There is also a form that adds some ad displays to what you see but not what the bot sees.
There is also at least one site using a Windows exploit to cause trouble.
Now if you put this behind a rewrite ruleset then you won't know what code was doing the IP delivery.
To determine if IP delivery of your site is taking place you must be on the constant lookout for duplicate content from your site.
Some currently known scripts that normally might be nothing to worry about, but can be or have been easily modified, are:
nph-proxy.cgi
nph-proxy.pl
trace.php
tracker2.php
go.php
get.php
These scripts can be hidden behind rewrite rules or embedded as SSI.
Frequently these scripts will be teamed up with various data mining scripts powered by various s/e results or DMOZ dumps or directly accessing DMOZ.
To determine whether this is taking place, paste the link into a header checker: it will show a return code of 200 but not your URL. When the link is clicked in a browser window, your page is served and the URL in the browser shows the other site's URL, not yours. This could also be a copy stored on the thief's server, which is yet another reason to have something like the date or date and time somewhere in your pages.
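The header check described above can be roughed out in Python. This is only a sketch under assumptions: the function names, the example hostnames, and the idea of searching the body for a unique "fingerprint" sentence from your own page are all made up for illustration, and none of it comes from a particular header-checker tool.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following 3xx responses so the raw status is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_without_redirects(url, timeout=10):
    """Return (status, body_text) for url without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx/4xx/5xx responses are raised as errors.
        return err.code, ""


def looks_like_pass_through(status, suspect_url, my_host, body, fingerprint):
    """True when another host answers 200 and the body contains my text.

    `fingerprint` is a hypothetical unique sentence (or date stamp)
    that appears only on my own page.
    """
    suspect_host = (urlparse(suspect_url).hostname or "").lower()
    served_elsewhere = suspect_host != my_host.lower()
    return status == 200 and served_elsewhere and fingerprint in body
```

In use, you would pass the result of `fetch_without_redirects(suspect_url)` into `looks_like_pass_through`. A True result only says the page deserves a closer look; it cannot tell a live pass-through from a stored copy.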
If it is a true live IP delivery system, you should be able to see it in your logs as it happens.
Run tail -f logfile in a shell session if you are on *nix servers. Windows folks, you are on your own.
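If watching the log scroll by is too noisy, a few lines of Python can tally who is pulling a given page. This is a rough sketch that assumes your server writes the Apache common or combined log format. The name `hits_by_client` is invented for this example.

```python
import re
from collections import Counter

# Start of an Apache common/combined log line (assumed format):
# client IP, identd, user, [timestamp], "METHOD path protocol"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*"')


def hits_by_client(lines, path):
    """Count requests for `path` per client IP across the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(3) == path:
            counts[match.group(1)] += 1
    return counts
```

Feed it an open log file, for example `hits_by_client(open("access.log"), "/index.html")`, load the suspect link in your browser, and then see which address fetched your page at that moment.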
Methods of dealing with these folks range from simple email begging to full legal action.
Detection doesn't always prevent the damage.
Usually the sites doing this are housed at rather large hosting providers.
I'd recommend as a preventive measure that the IP address ranges of the large hosting providers be nailed via firewall rules. Some holes may be needed for various hosting company monitoring links when setting up the blocks for the hosting provider you host with. Your hosting folks can answer those questions.
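Before committing ranges to firewall rules, it can help to test the logic on paper. The sketch below uses Python's standard ipaddress module. The ranges are reserved documentation blocks standing in for a real provider's ranges, and the single "hole" for a monitoring address is likewise hypothetical.

```python
import ipaddress

# Hypothetical ranges: 192.0.2.0/24 and 198.51.100.0/24 are reserved
# documentation blocks, used here in place of a hosting provider's real ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical exception for a hosting company's monitoring link.
ALLOWED_HOLES = [ipaddress.ip_network("192.0.2.53/32")]


def is_blocked(client_ip):
    """True when client_ip falls in a blocked range and not in a hole."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in ALLOWED_HOLES):
        return False
    return any(addr in net for net in BLOCKED_RANGES)
```

Checking the holes first means the monitoring address stays reachable even though it sits inside a blocked range.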
Another step would be to fail any access that doesn't provide a user agent.
If they can't get to the content they can't steal the use of it or get Google to trip whatever Google trips when it chokes on duplicate content.
One of the sites that has copied our home page is hosted by the same folks as we are.
Another site I'm aware of is at an institution of higher education.
Keep an eye out, there's critters out there that don't care about any rules.
I have a question re discerning what method is being used concerning what you said below:
Some currently known scripts that normally might be nothing to worry about but can or have been easily modified are:
nph-proxy.cgi
nph-proxy.pl
trace.php
tracker2.php
go.php
get.php
These scripts can be hidden behind rewrite rules or embedded as SSI.
Are these discernible on the affected page itself or elsewhere? i.e., how do we determine it is an IP delivery?
Also, I haven't been able to bring up an htaccess file on another site to see what they are doing. Is this possible?
Thanks for your input
Via a rewrite ruleset if you are on an Apache server.
It goes in your .htaccess file.
Are you keeping this rewrite ruleset a secret or just assuming that all of us are proficient mod_rewrite coders?
Here is the code. Please remember that all code can have unintended consequences, especially code that blanket denies access.
The following rewrite ruleset denies access to all agents that fail to send a user agent id.
This will not protect against folks who provide a false user agent value.
# Return 403 Forbidden to any request that sends an empty User-Agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
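For anyone not on Apache, the same idea as the ruleset above can be sketched as a small piece of WSGI middleware in Python. This is an illustration only: `deny_blank_user_agent` is a made-up name, and like the rewrite rule it does nothing against a faked user agent.

```python
def deny_blank_user_agent(app):
    """Wrap a WSGI app so requests with no User-Agent header get a 403."""

    def middleware(environ, start_response):
        # A missing header and an empty header are treated the same way.
        if not environ.get("HTTP_USER_AGENT", ""):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        return app(environ, start_response)

    return middleware
```

You would wrap your existing application object once, for example `application = deny_blank_user_agent(application)`.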
Lorel,
Detecting IP delivery is a pain. First, I may decide to grab a page from your site and render the page using a noindex, etc. robots tag. I might do this and never cause you a bit of trouble with the S/Es, simply because the normal S/Es will honor (we hope) the robots tag.
A number of framing systems likewise will not cause you problems with the S/Es, as the S/E if it follows the rules is instructed to not index the content.
Then along comes theEvilOne (a very nasty critter) who does the same thing to pages on my site but changes the robots metatag.
You'll find his copy of your pages in the S/E indexes but you'll never find (if he has done his coding correctly) his site accessing your site.
You'll find mine if you really look hard, but you'll have a lot of trouble connecting the dots.
What has happened is that folks have built a rather large economy on top of a very open and easy to use framework, but one driven by a system that has very large built-in pitfalls.
The members of this Forum that keep harping on finding and building traffic via means other than S/E serps have it pegged correctly.
I have an update on what I believe must be an example of IP delivery hijacking.
I found hijackers using a 302 redirect on their website to mine, which steals my PR (a 302 tells the search engine that what used to be attributed to "my" site is now the property of the hijacking site).
However, they also have some other method going too (which I assume is IP delivery) because I found this link in all my new client's link searches on yahoo directory, i.e., I search for links to my client's sites and this site shows up with a link to MY site, not my client's site. So they are picking up the text on my page and that (including links to all my new clients) is being attributed to them.
I wrote them on Oct 4 and asked them to remove my link, and it's still up there (they host their own site and the same email is used for both, so I can't contact the host). They have Google AdSense ads on every page of their directory, including the pages with 302 redirects, which is strongly against Google's AdSense policy. I just reported them to Google AdSense! Ha! That will hit them in the pocketbook.
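For anyone who wants to check a suspect link for the 302 behaviour described in this post, the test itself is simple once you have the status code and Location header from a header checker. A rough Python sketch follows. The function name and hostnames are invented for the example, and it says nothing about how a search engine will actually treat the redirect.

```python
from urllib.parse import urlparse


def redirects_to_my_site(status, location, my_host):
    """True when a response is a 302 whose Location header points at my_host."""
    if status != 302 or not location:
        return False
    target = (urlparse(location).hostname or "").lower()
    return target == my_host.lower()
```

For example, `redirects_to_my_site(302, "http://www.mysite.example/", "www.mysite.example")` flags the link, while a 301 or a 302 to some other host does not.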
It doesn't block agents that send an agent id.
Seems your KEYWORD etc. system isn't a conforming agent, so it got blocked.
The ole meta refresh can cause problems if Google doesn't have a clue as to how to handle it.
I haven't heard from the controlled experiment that has been under way for some time now.