Home / Forums Index / WebmasterWorld / Webmaster General
Forum Library, Charter, Moderators: phranque & physics

Webmaster General Forum

    
How do I see the SOURCE CODE of redirecting page?
Apparent scraper redirects to my site.
Larryhat




msg:338556
 12:01 am on Oct 18, 2004 (gmt 0)

Hello: A large site is composed entirely of content from other legitimate sites. Much of that is direct links to the related pages ON those sites, including mine, with links much like this:

www.scraper.org/websites/site1234.htm

When I click on that link, I go directly to MY page via some kind of redirection. I want to see the source code of "/site1234.htm" itself to see if there are counters, commercial stuff, dirty tricks or whatever on that page. I don't need to see the source code for my own page of course, but that's all I have gotten so far.

Is there a way to show that source? I'm using Firefox 0.8. Do I need some kind of crawler? I've never tried anything like that. Any help much appreciated.

Thanks in advance - Larry

 

netguy




msg:338557
 12:07 am on Oct 18, 2004 (gmt 0)

If it's a fast redirect, and you have the URL, you should be able to grab the source with:
view-source:http://www.domain.com/websites/site1234.htm

Steve

<added> I'm not sure about Firefox, but it should work in IE</added>

Larryhat




msg:338558
 1:01 am on Oct 18, 2004 (gmt 0)

Hello Netguy:

Thanks for the fast and helpful response.

This is really weird. I did as you suggested in both Firefox and IE. In BOTH, I only got the source code for my own page!

For a moment, I thought the scraper had copied my entire page, down to the last character. To test for that, I just made a small change (removed a <!-- remark --> from the bottom of MY page) and sent it up to the host.

I repeated the tests, and scraper's version ~/websites/site1234.htm made the same change!

Clearly this is an instant redirection, but how on Earth does "view-source" get redirected? It's not supposed to execute the code, just display it as is.

I'm at a loss here. Can anyone suggest what the perp is actually doing, and how?
Highly interesting. This guy has done this with most of the best legitimate sites in my field. Clearly he is sending traffic our way, but any page rank presumably stays with www.scraper.org.

As a side note: I wrote to G about my worries of penalties for duplicate content on one page. The matching scraped page has since vanished from the G SERPS. That's only one of course, this guy must have hundreds more.

- Larry

encyclo




msg:338559
 1:11 am on Oct 18, 2004 (gmt 0)

Try pasting the site1234.htm URL into the Server headers checker [webmasterworld.com] and see what you get. The site1234.htm file might be a script issuing a 301 or 302 redirect after logging the click, so there would be no source code to see.

If you're not sure, post the results here (after removing the specifics, obviously!) and we'll take a look.
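For anyone who would rather run the check locally than use an online tool, here is a minimal Python sketch of the same idea. It makes one request and does not follow the redirect, so whatever the redirecting URL itself sends back (status line, headers, and any small body) is what you get. The function names `fetch_raw_response` and `describe` are made up for this example, and the User-Agent string is arbitrary.

```python
import http.client
from urllib.parse import urlsplit


def fetch_raw_response(url, timeout=10):
    """Request `url` once, WITHOUT following redirects.

    Returns (status, reason, headers, body) as sent by the server at `url`.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client never follows a Location header on its own.
        conn.request("GET", path, headers={"User-Agent": "header-check-sketch"})
        resp = conn.getresponse()
        body = resp.read()
        return resp.status, resp.reason, resp.getheaders(), body
    finally:
        conn.close()


def describe(url):
    """Format the response roughly the way a header checker would show it."""
    status, reason, headers, body = fetch_raw_response(url)
    lines = [f"{status} {reason}"]
    lines += [f"{name}: {value}" for name, value in headers]
    lines.append("")
    lines.append(body.decode("latin-1"))
    return "\n".join(lines)
```

Calling `print(describe("http://www.scraper.org/websites/site1234.htm"))` would show a 301 or 302 with a Location header if the page is a redirect script, or a 200 with real HTML if it is an ordinary page.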

Larryhat




msg:338560
 1:24 am on Oct 18, 2004 (gmt 0)
Hello Encyclo: Good idea!

I did as directed and the header check returned this:

HTTP/1.1 302 Object moved
Server: Microsoft-IIS/5.0
Date: Mon, 18 Oct 2004 01:15:52 GMT
X-Powered-By: ASP.NET
Connection: keep-alive
Location: http://www.larry's-site.net/MAPSMENU.html
Connection: Keep-Alive
Content-Length: 121
Content-Type: text/html
Set-Cookie: ASPSESSIONIDQQDDADRS=XXXXXXXDDOLELJEIMC; path=/
Cache-control: private

- - -

I guess that's what's called a 302 redirect?

IF so (or even if not) would this pass any PR to my page? If PR passes, I don't have an awful lot to complain about, just a fear of duplicate content.

What about the cookie that is set? I don't use cookies at all. What benefit might the scraper expect from setting a cookie? Does a private "cache-control" mean anything relevant? This looks sophisticated, and I'd like a handle on what the perp is actually doing.

Thanks VERY much - Larry

encyclo




msg:338561
 1:41 am on Oct 18, 2004 (gmt 0)

302 is not good.

This thread [webmasterworld.com] (for example, posts 11 & 12 and a whole load more) is just one of the numerous long threads talking about the problems with 302 redirects. Of course, that doesn't mean that the directory is intentionally attempting to do something bad (heaven knows, there are a huge number of server admins who don't have a clue about doing proper redirects), but it still leaves you in a potentially difficult situation.

A straight link is by far the best, and a 301 permanent redirect comes next. 302s and meta refreshes are generally not good news.

The cookie or the cache control would have no influence on this issue - it looks like a pretty standard IIS header.

Larryhat




msg:338562
 3:03 am on Oct 18, 2004 (gmt 0)

Thanks again Encyclo.

Ooof! that's a long thread, and I went thru it again with trepidation. It looks like G has made SOME progress with 302 redirect problems. In my particular case, I just surfed G and Y, and guess what? My page/site comes in on top for the relevant search phrase, and scraper is nowhere to be seen!

That wasn't the case a few weeks ago, before I complained to Y and G. I didn't lose PR or serps positions luckily, others clearly did. Now I have to check for OTHER pages of mine which probably got scraped as well. If other redirects all fell out of the serps, then problem is solved for now.

Best wishes and thanks again - Larry

plumsauce




msg:338563
 8:41 am on Oct 18, 2004 (gmt 0)


"... there are a huge number of server admins who don't have a clue about doing proper redirects), but it still leaves you in a potentially difficult situation."

since a 302 is an entirely legitimate redirection notification, i would tend to argue that it is the search engines that are getting this wrong.

agreed, this has been argued at length in other threads.

Larryhat




msg:338564
 9:01 am on Oct 18, 2004 (gmt 0)

Hello Plumsauce:

I won't argue your points which are valid.

However, the Scraper has NO RIGHT to redirect my pages one way or another. He never asked permission for anything at all. He took one of my images (a mathematical graph), converted it from a .gif to a .jpg, shrunk it down some, and uses that as one of his uncredited displays as if it were his. His content pages are all scraped from legitimate sites. At bottom of those is a "Fair Use" disclaimer which is a model of weasel worded self-serving nonsense. To mellow the scraping a bit, he gives some links to the rightful writers / webmasters, but those all look like the same 302 type.

Legitimate? Yes, if one page on a website redirects to another .. to another site, with permission. But, nothing like this can possibly be described as white-hat.

It is up to the search engines to search and 'repair' these practices, I agree.

Best - Larry

webdude




msg:338565
 4:06 pm on Oct 18, 2004 (gmt 0)

plumsauce,

"since a 302 is an entirely legitimate redirection notification, i would tend to argue that it is the search engines that are getting this wrong.

agreed, this has been argued at length in other threads."

I agree with you totally on this. It should be a legitimate way of redirecting. It is the bots that are broke. As far as I know, the only fix is to try to get the 302 removed. It takes a while for the SERPs to straighten out after this has happened though.

plumsauce




msg:338566
 7:28 pm on Oct 18, 2004 (gmt 0)

Hi Larry,

I don't want to make you feel bad. I was just clarifying a fairly narrow technical point. It still has an undeniably negative effect for you personally no matter which end is broken.

I have had some success using DMCA provisions where the host is in the US. This usually entails a detailed request for whatever it is that I want addressed to the webmaster pointing out the DMCA implications with a cc to the hosting company. Wait a week, then write directly to the hosting company referencing the earlier notice. You either get what you want, or the site gets dropped. I usually am looking for a text link back to the article as copyright attribution. Strangely enough it is about these types of issues.

BTW, an absolutely failsafe method of seeing the source code is using a packet sniffer, provided that compression is not in use. This works extremely well over a modem line because there is very little noise in the form of background traffic. Alternatively, use wget to grab the elements including any javascript files that might be doing the dirty work.
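Once the raw page source is in hand (from a sniffer, wget, or anything else), the things worth looking for are a meta refresh and any external script files. A rough Python sketch of that scan follows. The class name `RedirectHintScanner` and the helper `scan` are invented for this example, and it only catches the two obvious mechanisms; an inline script that sets the location would still have to be read by eye.

```python
from html.parser import HTMLParser


class RedirectHintScanner(HTMLParser):
    """Collect things in a page that can move a visitor elsewhere."""

    def __init__(self):
        super().__init__()
        self.meta_refresh = []   # content values of <meta http-equiv="refresh">
        self.script_srcs = []    # external script URLs worth fetching next

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("http-equiv", "").lower() == "refresh":
            self.meta_refresh.append(attrs.get("content", ""))
        elif tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"])


def scan(html_text):
    """Return (meta_refresh_values, script_src_urls) found in html_text."""
    scanner = RedirectHintScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.meta_refresh, scanner.script_srcs
```

If both lists come back empty and the page still sends you elsewhere, the redirect is almost certainly happening in the HTTP headers rather than in the page.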

Plumsauce

drbrain




msg:338567
 7:59 pm on Oct 18, 2004 (gmt 0)

302 Found

The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

Since a 302 response is temporary (unless a Cache-Control or Expires header exists) the search engine may not assume that A (the URL that responds 302) is at B (the URL in the Location header).

Even with a Cache-Control or Expires header, the search engine may only temporarily consider that A is at B.

In Google's case, I would not pass PR or what-have-you across a 302, since the server said "this is only temporary, watch this space, there may be a 200, 301, or 410 here at some future time."
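To make the A versus B distinction concrete, here is a small Python sketch of the bookkeeping a crawler could do if it read the spec literally. This is a hypothetical policy written only to illustrate the paragraph above; the function name `url_to_keep` is made up, and nothing here describes what Google or any other engine actually does.

```python
def url_to_keep(requested_url, status, location=None):
    """Pick which URL a crawler keeps on record after one fetch.

    Hypothetical policy based on a literal reading of 301 vs 302.
    """
    if status == 301 and location:
        # Permanent: the resource now lives at B, so forget A.
        return location
    if status == 302 and location:
        # Temporary: keep using A for future requests, as the spec says.
        return requested_url
    # 200, 410, or a redirect without a Location: nothing to swap.
    return requested_url
```

Under this reading, a 302 from A to B leaves A on record while the content that gets fetched is B's, which is the combination the 302 threads complain about.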

WebGuerrilla




msg:338568
 8:02 pm on Oct 18, 2004 (gmt 0)

Larry, I'm not really clear what the problem is here. There are thousands of directories on the web who use logging scripts that use redirects to forward visitors to the final site.

Using this type of setup is preferred because it allows you to see what your visitors are actually clicking on. Of course, there are sites that will try and use a redirect to gain some type of competitive advantage, but from what you've said, it doesn't sound like this is what the site in question is doing, so I wouldn't worry about it much.
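For readers who have not seen one, a logging redirect of this kind can be sketched in a few lines of Python. This is only an illustration of the general technique, not the directory's actual script (which appears to be ASP on IIS, judging by the headers posted earlier). The `TARGETS` table, the `CLICK_LOG` list, and the target URL are placeholders invented for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from listing pages to the sites they point at.
TARGETS = {"/websites/site1234.htm": "http://www.example.net/MAPSMENU.html"}
CLICK_LOG = []  # stand-in for a real log file or database


class ClickLogger(BaseHTTPRequestHandler):
    def do_GET(self):
        target = TARGETS.get(self.path)
        if target is None:
            self.send_error(404)
            return
        # Record the click first, then bounce the visitor onward.
        CLICK_LOG.append((self.path, self.client_address[0]))
        self.send_response(302, "Object moved")
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def make_server(port=0):
    """Build the server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ClickLogger)
```

Running `make_server(8000).serve_forever()` and visiting `/websites/site1234.htm` sends the browser straight to the target, so there is no page source to view at the redirecting URL, only headers.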

Larryhat




msg:338569
 12:15 am on Oct 19, 2004 (gmt 0)

Hello WebG: Maybe I didn't make myself clear.

My site is non-commercial for years now, as I slowly build up a non-obsolete version of some specialized software of mine. Meanwhile, I use the old version to build up a web presence. 20 years hard work.

Now here comes this scraper site, using 302 redirects and scraped content from all over the place, for the sake of his banner ads revenue. He's done so much of this that he outranks legitimate sites in the SERPs.

He is using other people's hard unpaid work to raise his own ratings. This looks like a perversion of the proper use of 302 redirects, at least to me. I'm not even addressing the mass of wholly scraped content here, I just mention that to illustrate the methods and intent of the perp. It looks professionally done BTW.

Best - Larry

idoc




msg:338570
 4:08 am on Oct 19, 2004 (gmt 0)

"20 years hard work. Now here comes this scraper site, using 302 redirects"

Very probably no accident and likely very intentional on their part. I remember a post about a Dutch SEO company doing this back around the first of the year or before. There was a link to an article whereby the firm described how they siphon traffic from target sites and the method was described as "ingenious" in the article... if I remember the wording right. I believe a good deal of black hat SEO now relies on this bot exploit. The redirecting sites should not be temporarily redirecting a page on their site to another site they do not control, and the bots should count the redirect as any other incoming link and pass the appropriate p.r. transfer due the redirected-to site from the link.

Larryhat




msg:338571
 4:25 am on Oct 19, 2004 (gmt 0)

Hello Idoc: You said it better than I could. Now when are G and Y going to do something about it? Their silence is alarming.

OH the Dutch sites! I had a couple of those deep-linking my maps. That, I can fix on this end. I fully agree that passing PR thru to 302 target pages would go a LONG way toward solving this. Legitimate sites would have some incentive to work on genuinely 'temporary' pages, and the perps would lose benefit from the scams. Rightful sites would have far less risk of being penalized or banned for duplicate content.

Offhand (and I'm far from expert on this) it looks like a win-win situation .. just pass thru the PR. Am I missing something important here? Nothing new if so.

Best wishes - Larry


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved