Just found a new method of hijacking

Forum Moderators: not2easy

using basehref

Lorel

9:55 pm on Sep 22, 2005 (gmt 0)

I just discovered a new method of hijacking a web site. They don't need to capture your site in a frame. They use a base href tag with your URL in it.

To check, search for a string of text from your website in quotes, then look through the source code of the results for a base href tag.

Has anyone else found this?

martzy

1:11 am on Sep 24, 2005 (gmt 0)

I don't see how someone can hijack your site using base href.
If you use "base href", at least you can find out who is hijacking your site by searching for your own URL.

Fill me in if I missed something on the hijacking aspects.

Lorel

1:47 am on Sep 24, 2005 (gmt 0)

I searched for unique text on my page and this is how I found it. It appears in the supposed cache of a second-rate "search" engine (similar to what you see in Google's cache). I checked the source code: they put a base href tag with my own URL in it, and ALL the code for my page is below it, including the head tags, etc., where I have my own base href tag installed. I also checked Google's cache of my page and it does the same thing: a base href tag sits above the captured code for my web page as well as inside it.

I suppose the only way I can stop this is to put a noarchive tag on the page, but then Google can't archive it either which I do want.

Anyone know how to set up a tag just for that one domain?

Dijkgraaf

3:30 am on Sep 29, 2005 (gmt 0)

Google adds the base href to cached pages so that included content such as images and JavaScript loads from your site; that way you don't get broken images and JavaScript errors when viewing cached pages. Is that what you are referring to?
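
To illustrate the point about relative paths, here is a minimal Python sketch using the standard library's `urljoin`. The cache URL, site name, and image path are all made up for the example; it only shows how a relative path resolves with and without a base href.

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
CACHE_PAGE = "http://cache.example.net/view?id=123"
BASE_HREF = "http://www.mysite.com/"

def resolve(relative_url, base_href=None, page_url=CACHE_PAGE):
    """Resolve a relative URL the way a browser would: against the
    <base href> if one is present, otherwise against the page's own URL."""
    return urljoin(base_href or page_url, relative_url)

# Without a base tag, the image is requested from the cache host.
print(resolve("images/logo.jpg"))
# With the base tag, the same image is requested from the original site.
print(resolve("images/logo.jpg", BASE_HREF))
```

The second request is the one that shows up in the original site's logs.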

Lorel

4:25 pm on Oct 2, 2005 (gmt 0)

Yes, that is probably what is happening. These appear to be scraper directories using the same techniques as Google for caching pages. This may be a legit method of caching pages but makes it very hard to find copyright infringements.

Dijkgraaf

10:51 pm on Oct 2, 2005 (gmt 0)

Well, actually no: it makes them a lot easier to find if you have access to your logs, as you will see requests for images with the referer being a page outside of your domain.
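
A rough sketch of that log check in Python follows. It assumes the server writes Apache "combined" format logs; the host names and the two sample lines are invented for illustration, and the image extensions are just a guess at what a site might serve.

```python
import re
from urllib.parse import urlparse

# Picks the request path and the referer out of a "combined" format log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)
IMAGE_EXT = (".jpg", ".jpeg", ".gif", ".png")

def external_image_referers(lines, my_hosts):
    """Yield (referer, path) for image requests referred by another host."""
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        path = m.group("path").split("?", 1)[0].lower()
        if not path.endswith(IMAGE_EXT):
            continue
        host = urlparse(m.group("referer")).hostname  # None for "-" or empty
        if host and host not in my_hosts:
            yield m.group("referer"), m.group("path")

# Made-up sample lines, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [02/Oct/2005:22:51:00 +0000] "GET /images/logo.jpg HTTP/1.1" '
    '200 5120 "http://cache.example.net/view?id=123" "Mozilla/4.0"',
    '192.0.2.11 - - [02/Oct/2005:22:51:05 +0000] "GET /images/logo.jpg HTTP/1.1" '
    '200 5120 "http://www.mysite.com/index.html" "Mozilla/4.0"',
]

for referer, path in external_image_referers(SAMPLE, {"www.mysite.com", "mysite.com"}):
    print(referer, "->", path)
```

Requests with no referer at all are skipped here, so a copy that strips the referer would not be caught by this check.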

HRoth

12:39 pm on Oct 3, 2005 (gmt 0)

Is this what you are talking about:

<td class="invData"><a href="store_view.asp?sid=31&id=642">Name of My Product</a> x 1
<br />
<span class="txtSmall">Entirety of the text from my page on that product<br /></span><br />
</td>
<td class="invData"><img style="width: 75px; height: 75px; border: 1px solid #C0C0C0;" alt="Name of my product pic" src="images/obj/642.jpg" /><br /></td>

The page from which the text is taken is showing as a supplemental result in Google, so based on a thread in another forum, I looked for the text and found it on this page about an rpg. Thing is, the text doesn't actually show on the page. I don't know if I am getting a duplicate content penalty on this page because of this or what.

sparticus

11:59 pm on Oct 5, 2005 (gmt 0)

Or, are you referring to something like this:

...Full html code of caching page with notice saying that this is cached...
</body>
</html>
<base href="http://www.mysite.com">
<HTML><HEAD>
<TITLE>My Title</TITLE>
... Rest of my code here, a complete copy of my page

Lorel

2:46 pm on Oct 6, 2005 (gmt 0)

Here is what appears on the source code of the site that captured my page without using a frame (notice the Host data is OUTSIDE OF ANY BODY TAGS)--I changed all domain data.

<TABLE width="100%" border="1" bgcolor="#FFFFFF">
<TR>
<TD><TABLE width="100%" border="0">
<TR>
<TD> <P><FONT color="#000000" size="2" face="Arial, Helvetica, sans-serif">This
is the</FONT><FONT size="2" face="Arial, Helvetica, sans-serif">
<A href="http://www.THEIR DOMAIN.com"><FONT color="#0000FF">THEIR DOMAIN NAME</FONT></A>
<FONT color="#000000">cache of </FONT>
<FONT color="#008000">http://MYDOMAIN.com/</FONT>
<BR>
<FONT color="#000000">The</FONT> <A href="http://www.THEIRDOMAIN.com"><FONT color="#0000FF">THEIR NAME</FONT></A>
<FONT color="#000000">cache is a snapshot that we took when this
page was indexed.</FONT><BR>
<FONT color="#000000">The page may have changed since that time.
Click here for the </FONT><A href="http://MY DOMAIN.com/"><FONT color="#0000FF">current
page</FONT></A><FONT color="#000000">.</FONT></FONT></P></TD>
<TD align="right" valign="top"> <DIV align="right"><A href="javascript:history.go(-1)" target="_self"><FONT size="2" face="Arial, Helvetica, sans-serif" color="#0000FF">Return
To Your Search Results</FONT></A> </DIV></TD>
</TR>
<TR>
<TD colspan="2"><CENTER>
<FONT color="#666666" size="2" face="Arial, Helvetica, sans-serif">THEIR NAME
is not affiliated with the authors of this page nor responsible
for its content.</FONT></CENTER></TD>
</TR>
</TABLE>

</TD>
</TR>
</TABLE><BASE HREF="http://wwwMY DOMAIN.com/"><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head>
- - - -Meta tags deleted for Posting to WW - - -
<base href="http://www.MY DOMAIN.com">
<script type="text/javascript">
if(top!= self) top.location.href = self.location.href
</script>
</head>

Everything below the Doctype was on my page originally. Everything above it was what they put on there. Notice I have a break-out-of-frames script on there, but this is not a frame capture. Their page in the browser window looks just like Google's cache of my site, only it has their domain data in it.

HRoth

5:58 pm on Oct 6, 2005 (gmt 0)

I think that what I encountered is different from what others have seen. The page that had duplicated my entire product page displayed the text as a 1 pixel image. That's why it was not visibly displayed, but it was still in the source code and could still be seen by search engines and by copyscape. I guess a lot of webmasters checking their stuff on copyscape would visit the page, not see their text displayed, and move on. Just because I am a jerk, I looked in the source code. I noticed that same page had five other pages from other sites similarly non-displayed, and it is a huge site with many, many pages. They removed my info from the page source, claiming it was an image. And you would not believe all the "we honor copyright" stuff up on that site, supposedly put together by artists. What a bunch of jackals.

Lorel

6:57 pm on Oct 6, 2005 (gmt 0)

HRoth

How can they use a 1 pixel image of your site that gets picked up by the search engines? Search engines can't read images. Am I missing something?

HRoth

9:17 pm on Oct 6, 2005 (gmt 0)

Sorry. I misinterpreted the HTML that I posted. I saw the 1 pixel and thought it was somehow related to the text. The text from my page was in the source code of the other site's page but not displayed on the page. I tried downloading the jpg that the source code says is there with the text, and when I did, I got an image 1 pixel big. I don't understand what <span> does in HTML, but my understanding is that search engines can read text in the source code that is not displayed on the page.

It's been a long day. Yesterday I found 5 sites that had taken gobs of text from my site. Today I found 3 more. Most of them are commercial sites that are essentially competitors. I have a copyright notice on every single page, and as I have been updating the site, I have even moved the notice closer to the top of the screen so no one can say they didn't see it. Yet competing businesses still take whole pages of my text like it was nothing. I can't afford to pay copyscape to look automatically for my 450+ pages, and it takes a lot of time to look for this stuff manually. It really has me disgusted.

sparticus

10:43 pm on Oct 6, 2005 (gmt 0)

Yes, but it's on the internet, so you're allowed to copy it. It's not like a real book or anything! ;)

theBear

12:31 am on Oct 7, 2005 (gmt 0)

This is part of what is going on.

In a nutshell, what is happening is that a site is hotlinking your content. It is not just the pictures: the site may serve some or none of the content from its own server, acting only as a pass-through.

This is called IP delivery: the content of your site is read from your site and presented to the s/e bots and the surfers as the other site's content in real time.

A simple 4 or 5 statement routine does the heavy lifting: it opens a connection from the thief's site to yours and reads the page the thief points it at.

At this point the thief can pretty much choose what to do: render the page as is, make a copy on their server for later use, or replace any part of the page from your site prior to delivering it.

The most common form just delivers the page as stolen. This usually results in the home page being duplicated.

They can feed the bots just specific pages if they choose.

The next form is that of a modified proxy script or system. Normally proxies do not allow the bots in; remove the blocks and you have a piece of software that can actually cause the bots to duplicate entire sites as being part of the proxy server's page set.

This code is also fairly straightforward.

There is also a form that adds some ad displays to what you see but not what the bot sees.

There is also at least one site using a windows exploit to cause trouble.

Now if you put this behind a rewrite ruleset then you won't know what code was doing the IP delivery.

To determine if IP delivery of your site is taking place you must be on the constant lookout for duplicate content from your site.

Some currently known scripts that normally might be nothing to worry about but can be or have been easily modified are:

nph-proxy.cgi
nph-proxy.pl
trace.php
tracker2.php
go.php
get.php

These scripts can be hidden behind rewrite rules or embedded as SSI.

Frequently these scripts will be teamed up with various data mining scripts powered by various s/e results or DMOZ dumps or directly accessing DMOZ.

To determine whether this is taking place, paste the link into a header checker: it will show a return code of 200 but not your URL. When the link is clicked in a browser window, your page is served and the URL in the browser shows the other site's URL, not yours. This could also be a copy stored on the thief's server, which is yet another reason to have something like the date or date and time somewhere in your pages.
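
The header check described above could be sketched in Python along these lines. This is only an illustration under some assumptions: it handles plain http URLs only, the function names are my own, and the labels returned by `interpret` are informal descriptions, not anything a header checker tool actually prints.

```python
import http.client
from urllib.parse import urlparse

def fetch_status(url):
    """Send a HEAD request and return (status code, Location header or None)."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def interpret(status, location, my_host):
    """Classify a header-check result for a link that displays your page."""
    if status in (301, 302, 303, 307) and location:
        if urlparse(location).hostname == my_host:
            return "redirects to your site"
        return "redirects elsewhere"
    if status == 200:
        # No redirect: the other site is serving the content itself,
        # either fetched live from you or from a stored copy.
        return "served from the other site (live fetch or stored copy)"
    return "other"
```

Usage would be something like `interpret(*fetch_status(suspect_link), "www.mysite.com")`, where `suspect_link` is the link found on the other site.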

If it is a true live IP delivery system, you should be able to see it in your logs as it happens.

Run tail -f logfile in a shell session if you are on *nix servers; Windows folks, you are on your own.

Methods of dealing with these folks range from simple email begging to full legal action.

Detection doesn't always prevent the damage.

Usually the sites doing this are housed at rather large hosting providers.

I'd recommend as a preventive measure that the IP address ranges of the large hosting providers be nailed via firewall rules. Some holes may be needed for various hosting company monitoring links when setting up the blocks for the hosting provider you host with. Your hosting folks can answer those questions.

Another step would be to fail any access that doesn't provide a user agent.

If they can't get to the content they can't steal the use of it or get Google to trip whatever Google trips when it chokes on duplicate content.

One of the sites that has copied our home page is hosted by the same folks as we are.

Another site I'm aware of is at an institution of higher education.

Keep an eye out, there's critters out there that don't care about any rules.

sparticus

12:54 am on Oct 7, 2005 (gmt 0)

"Another step would be to fail any access that doesn't provide a user agent."

How do you do that?

theBear

1:12 am on Oct 7, 2005 (gmt 0)

Via a rewrite ruleset if you are on an Apache server.

It goes in your .htaccess file.

There is an equivalent method for windows as well.

Lorel

4:11 pm on Oct 7, 2005 (gmt 0)

Thanks The Bear for the very detailed explanation of what is going on. There is a lot of info to digest.

I have a question re discerning what method is being used concerning what you said below:

Some currently known scripts that normally might be nothing to worry about but can be or have been easily modified are:
nph-proxy.cgi
nph-proxy.pl
trace.php
tracker2.php
go.php
get.php
These scripts can be hidden behind rewrite rules or embedded as SSI.

Are these discernable on the affected page itself or elsewhere? i.e., how do we determine it is an IP delivery?

Also, I haven't been able to bring up an htaccess file on another site to see what they are doing. Is this possible?

Thanks for your input

andrea99

4:21 pm on Oct 7, 2005 (gmt 0)

Via a rewrite ruleset if you are on an Apache server.
It goes in your .htaccess file.

Are you keeping this rewrite ruleset a secret or just assuming that all of us are proficient mod_rewrite coders?

theBear

10:23 pm on Oct 7, 2005 (gmt 0)

andrea99,

Here is the code. Please remember that all code can have unintended consequences, especially code that blanket-denies access.

The following rewrite ruleset denies access to all agents that fail to send a user agent id.

This will not protect against folks who provide a false user agent value.


# Return 403 Forbidden when the User-Agent header is missing or empty
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} =""
RewriteRule .* - [F]
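
For anyone not on Apache, the same idea can be sketched at the application level. The following is a minimal Python WSGI middleware, offered only as an assumption about how one might do it outside of mod_rewrite; `require_user_agent` and `hello_app` are hypothetical names, and like the ruleset it does nothing against a faked user agent.

```python
def require_user_agent(app):
    """WSGI middleware: answer 403 when the User-Agent header is missing or empty."""
    def middleware(environ, start_response):
        if not environ.get("HTTP_USER_AGENT", ""):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # A user agent was sent, so hand the request to the real application.
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

app = require_user_agent(hello_app)
```

As with the ruleset, check your logs afterwards to see what it turned away.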

Lorel,

Detecting IP delivery is a pain. First, I may decide to grab a page from your site and render the page using a noindex, etc. robots tag. I might do this and never cause you a bit of trouble with the S/Es, simply because the normal S/Es will honor (we hope) the robots tag.

A number of framing systems likewise will not cause you problems with the S/Es, as the S/E if it follows the rules is instructed to not index the content.

Then along comes theEvilOne (a very nasty critter) who does the same thing to pages on my site but changes the robots metatag.

You'll find his copy of your pages in the S/E indexes but you'll never find (if he has done his coding correctly) his site accessing your site.

You'll find mine if you really look hard, but you'll have a lot of trouble connecting the dots.

What has happened is that folks have built a rather large economy on top of a very open and easy to use framework but driven by a system that has very large builtin pitfalls.

The members of this Forum that keep harping on finding and building traffic via means other than S/E serps have it pegged correctly.

Lorel

10:38 pm on Oct 7, 2005 (gmt 0)

The Bear,

Sounds like this makes a base href and a break-out-of-frames script obsolete.

Oh well, there is still Google's Spam Report. They are hiring people to analyze the reports and this IP delivery sure seems to qualify for "Sneaky Redirects".

Thanks for your input.

Lorel

9:50 pm on Oct 12, 2005 (gmt 0)

I tried the code to disallow anyone without a user agent, and it blocked the program I use to check keyword density. Afraid it might also be blocking other visitors that I do want on my site, I took it back off.

I have an update on what I believe must be an example of IP delivery hijacking.

I found hijackers using a 302 redirect on their website to mine which steals my PR (a 302 tells the search engine that what used to be attributed to "my" site is now the property of the hijacking site), i.e., they are stealing my PR.

However, they also have some other method going too (which I assume is IP delivery) because I found this link in all my new client's link searches on yahoo directory, i.e., I search for links to my client's sites and this site shows up with a link to MY site, not my client's site. So they are picking up the text on my page and that (including links to all my new clients) is being attributed to them.

I wrote them on Oct 4 and asked them to remove my link and it's still up there (they host their own site and same email used for both so I can't contact the host). They have Google adsense ads on every page on their directory including the pages with 302 redirects which is strongly against Google's AdSense policy. I just reported them to Google adsense! Ha! that will hit them in the pocketbook.

theBear

2:05 am on Oct 13, 2005 (gmt 0)

That ruleset will block all agents that fail to provide a user agent. I warned you that it could have side effects, you have to go through your logs and determine what it blocked that you didn't want blocked.

It doesn't block agents that send an agent id.

Seems your KEYWORD etc. system isn't a conforming agent, so it got blocked.

The ole meta refresh can cause problems if Google doesn't have a clue as to how to handle them.

I haven't heard from the controlled experiment that has been under way for some time now.

g1smd

1:39 am on Nov 7, 2005 (gmt 0)

Google adds a <base> tag in the code on their cache copy, so that images and relative links still work.

Look at an original copy of the page direct from the site, and see if that has any <base> tag present.
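
One way to run that comparison is a small script that lists every base tag in a page's source. This is a sketch using Python's standard `html.parser`; the class and function names are my own, and the sample markup in the usage note is invented.

```python
from html.parser import HTMLParser

class BaseTagFinder(HTMLParser):
    """Collect the href of every <base> tag, wherever it appears in the markup."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <BASE HREF=...> matches too.
        if tag == "base":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def find_base_hrefs(html):
    """Return the href values of all <base> tags in the given HTML source."""
    finder = BaseTagFinder()
    finder.feed(html)
    return finder.hrefs
```

Run it on the source of the cached copy and on the source fetched direct from the site. If the cached copy reports one more base href than the original, the extra one was added by whoever made the copy.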

nfinland

8:54 pm on Nov 15, 2005 (gmt 0)

P.S.

If I put
<base href="http://www.mysite.com/">

Should I also put

on my detail page?

g1smd

9:17 pm on Nov 15, 2005 (gmt 0)

If your navigation uses relative links then you need the full page URL on every page.

If you use absolute linking... http://www.domain.com/other.page.html or just /folder.name/other.page.html with a leading / on the URL, then the base tag only needs the base domain in it.
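
A quick way to see the difference is to resolve each kind of link against both kinds of base. This is a small Python sketch with `urljoin`; the page `http://www.domain.com/folder.name/page.html` is a hypothetical example, as are the variable names.

```python
from urllib.parse import urljoin

# Hypothetical page living at http://www.domain.com/folder.name/page.html
DOMAIN_BASE = "http://www.domain.com/"
PAGE_BASE = "http://www.domain.com/folder.name/page.html"

links = {
    "document-relative": "other.page.html",
    "root-relative": "/folder.name/other.page.html",
    "absolute": "http://www.domain.com/folder.name/other.page.html",
}

def resolve_all(base_href):
    """Resolve every sample link against the given base href."""
    return {kind: urljoin(base_href, href) for kind, href in links.items()}

for label, base in (("domain only", DOMAIN_BASE), ("full page URL", PAGE_BASE)):
    print(label)
    for kind, url in resolve_all(base).items():
        print("  ", kind, "->", url)
```

Only the document-relative link changes: with the domain-only base it points at the site root instead of the folder, which is why relative navigation needs the full page URL in the base tag.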