
Apache Web Server Forum

    
Blocking #*$! from scraping my site
thirteen




msg:4350928
 7:01 pm on Aug 12, 2011 (gmt 0)

Is there a way to block #*$!.org from scraping my site?
They have a "#*$! Scraper" webpage for scraping websites.
[#*$!.org...]

I would like to block all traffic originating from #*$!.org.

 

thirteen




msg:4350929
 7:05 pm on Aug 12, 2011 (gmt 0)

I guess this is a taboo subject. Can't mention the website that is a scraper website.

wilderness




msg:4350943
 7:41 pm on Aug 12, 2011 (gmt 0)

I guess this is a taboo subject. Can't mention the website that is a scraper website.


Expecting an answer within four minutes is hardly rational.

It's certainly NOT; however, you need to select the correct forum [webmasterworld.com] and provide a full log line, obscuring the last numbers of the full IP address.

lucy24




msg:4350975
 9:11 pm on Aug 12, 2011 (gmt 0)

Expecting an answer within four minutes is hardly rational.

I think he was just reacting to what happened to his link once it got posted.

Can't mention the website that is a scraper website.

Nothing personal: you can't mention any website unless it's a Recognized Authority. If you give any full url other than
http://www.example.com
it will be auto-converted to a clickable link. When you're asking questions about how to word your htaccess, this is not what you want.

Anyway, blocking someone from visiting at all is much easier than blocking them only from scraping. If they work from a predictable IP range, a simple "Deny from..." in your htaccess will shut them out. If they're faking their address, a rewrite using either the User Agent or the Referer -- whichever is appropriate -- can achieve the same thing.

In this case, your Recognized Authority is Apache [httpd.apache.org]. Bookmark the page ;)
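
To make the two approaches above concrete, here is a minimal .htaccess sketch. It assumes Apache 2.2-style access directives (on 2.4 they need mod_access_compat) and that mod_rewrite is available. The IP range, the "ExampleBot" user agent, and example.com are all placeholders, not the real site's details.

```apache
# Approach 1: shut out a predictable IP range (192.0.2.0/24 is a placeholder)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24

# Approach 2: refuse requests by User-Agent or Referer (both values are placeholders)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com [NC]
# "-" means no substitution; [F] answers with 403 Forbidden
RewriteRule ^ - [F]
```

Either approach alone is enough; which one fits depends on what is actually constant in the unwanted requests.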

thirteen




msg:4351036
 2:31 am on Aug 13, 2011 (gmt 0)

Thanks lucy24, you are correct about my post.

This site is not an authority. Their name rhymes with Google but starts with Scr. They go around scraping a lot of websites, mine included. They even have a page with the webpage title "Scraper" in it.

I have used .htaccess to block Russia but I do not know what block of IP addresses to use against this site. They are not contiguous blocks. I suspect other people also have this problem with this scraper site and wanted to know if they had a solution.

Unfortunately, my post got obfuscated with the #*$! mask by an editorial program so people don't know what site I am asking about.

wilderness




msg:4351038
 2:36 am on Aug 13, 2011 (gmt 0)

I have used .htaccess to block Russia but I do not know what block of IP addresses to use against this site.


As I previously explained, the SSID forum allows posting of a full log line; however, it does require obfuscating the last octet of the IP address.

Nobody will be able to assist you without that information.

As an example:

207.46.199.zzz - - [12/Aug/2011:06:59:45 -0600] "GET /robots.txt HTTP/1.1" 200 4179 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
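
For anyone unsure what a "full log line" is, the example above has the shape of Apache's stock "combined" access log format. A sketch of the server-level directives that produce it follows, assuming you control the main server config. These lines are not valid in .htaccess, and the log path is a placeholder; on shared hosting the host sets this up and you only switch logging on in the control panel.

```apache
# Standard "combined" format: client IP, identity, user, timestamp,
# request line, status, bytes sent, Referer header, User-Agent header
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

# Write the access log in that format (path is a placeholder)
CustomLog logs/access_log combined
```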

wilderness




msg:4351039
 2:45 am on Aug 13, 2011 (gmt 0)

If you give any full url other than
http://www.example.com
it will be auto-converted to a clickable link.


The keyword here is "clickable" (i.e., a functioning link).
This is easily circumvented by inserting a space to break the link, so the URL displays as plain text for anyone assisting.

http:// www.webmasterworld.com

thirteen




msg:4351042
 3:06 am on Aug 13, 2011 (gmt 0)

I hope the traffic log information gets through and is not masked:

Location: Littleton, Colorado, United States
IP Address: Comcast Cable (67.176.121.175)
Referring URL: www.#*$!.org/cgi-bin/nbbw.cgi

Location: Arlington, Massachusetts, United States
IP Address: Psinet (38.97.75.202)
Referring URL: www.#*$!.org/cgi-bin/nbbw.cgi

wilderness




msg:4351051
 3:19 am on Aug 13, 2011 (gmt 0)

This is not a "log" as I requested and provided an example of, rather something you retrieved from a stats program.

RAW VISITOR LOG?

These IP ranges may be useful:

# Entire Comcast Class B's (matches 67.160.x.x through 67.191.x.x)
RewriteCond %{REMOTE_ADDR} ^67\.1([678][0-9]|9[01])\.

# Your Colorado range (matches 67.176.0.x through 67.176.127.x)
RewriteCond %{REMOTE_ADDR} ^67\.176\.([0-9]|[1-9][0-9]|1[01][0-9]|12[0-7])\.

You may also add additional conditions based upon User-Agent, referer and even headers to reduce the chances of innocents being denied.
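
A RewriteCond does nothing until a RewriteRule follows it, and consecutive conditions are ANDed unless flagged [OR]. A sketch of one complete ruleset, combining the narrower range above with a referer condition so that fewer innocents are caught. It assumes mod_rewrite is enabled, and example.com is a placeholder for the referring site.

```apache
RewriteEngine On
# Client address in 67.176.0.x through 67.176.127.x ...
RewriteCond %{REMOTE_ADDR} ^67\.176\.([0-9]|[1-9][0-9]|1[01][0-9]|12[0-7])\.
# ... AND (the default when no [OR] flag is given) a referer on the placeholder domain
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com/ [NC]
# No substitution; respond with 403 Forbidden
RewriteRule ^ - [F]
```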

thirteen




msg:4351052
 3:27 am on Aug 13, 2011 (gmt 0)

You may also add additional conditions based upon User-Agent, referer and even headers to reduce the chances of innocents being denied.



I don't want to deny a whole class of Comcast and Psinet IP Addresses.

Is it possible to deny just by the "Referral"? I know the website name and can use it as the criteria to start the denial.

wilderness




msg:4351054
 3:33 am on Aug 13, 2011 (gmt 0)

for the third time.

RAW VISITOR LOG?

lucy24




msg:4351055
 3:43 am on Aug 13, 2011 (gmt 0)

[Overlapping]

You can deny by referer (sic) but it isn't what you want here. The referer isn't the visitor itself, it's who sent them to you. That is: if you click on a link to go somewhere, the "referer" is the page where you clicked the link. You probably want the user agent.

:: shuffling papers ::

Can't imagine what's taking g### so long, but earlier today at the end of a thread* I posted an idiot-level explanation [webmasterworld.com] (sorry) of what you can expect to see in your raw logs. Almost all of that information is also available to your htaccess; you just need to find the part that is unique to the site. As wilderness said, probably a combination. Say, IP address within a certain range, and user-agent that includes certain attributes.


* Someone explained to me how to link to a specific post, but now I can't find the explanation. The long post timestamped 1:18.

thirteen




msg:4351056
 3:48 am on Aug 13, 2011 (gmt 0)

Unfortunately, I don't have that information. The raw visitor logs are not archived.

thirteen




msg:4351058
 3:56 am on Aug 13, 2011 (gmt 0)

lucy,

In this particular case, I think I want to deny the referring site and not the visitor. This site is acting like a portal. The visitors go to this scraper site and click on a link to my page.

Instead of sending the visitors to my site, the scraper site goes to my site, scrapes the page the visitor wants to see, and serves it up to the visitor.

I want to block any traffic coming from that scraper site so the visitor will have to come to my site directly if they want to see my content. I want to keep the visitor and lose the scraper site.

I can identify when they come scraping. In my log, I will see an entry like this:
Number of Entries:1
Entry Page Time:Aug 11 2011 10:45:44 AM
Visit Length:0 seconds
Browser:Firefox 5.0
OS:Win7
Resolution:1366x768
Total Visits:1
Location:Arlington, Massachusetts, United States
IP Address:Psinet (38.97.75.202) [Label IP Address]
Referring URL: www.#*$!.org/cgi-bin/nbbw.cgi

The only constant is the Referring URL line. It's always the same URL. So when I see www.#*$!.org/cgi-bin/nbbw.cgi on that line, I know my page has been scraped.

wilderness




msg:4351068
 4:50 am on Aug 13, 2011 (gmt 0)

This will make the pests at PSI (Performance Systems International) hiccup for a few minutes and they'll come at you with something stronger and perhaps more often.

SetEnvIf Referer [google.com]

Referer-based solutions are less than 100% accurate and may be easily defeated, even if the visitor is not aware that the referer is the reason for denial.

FWIW, raw visitor logs are essential.
If your host doesn't provide them, find another host immediately.
If you've simply failed to turn them on in CP, do so and begin understanding them.

thirteen




msg:4351071
 5:14 am on Aug 13, 2011 (gmt 0)

Thx, I will try the
setenvif referer ^http://(www\.)?blockeddomain\.com getout

[webmasterworld.com...]

lucy24




msg:4351089
 7:23 am on Aug 13, 2011 (gmt 0)

Heh. If the thread weren't seven years old, I would stop by and make sure the OP had in fact included the line

Deny from env=getout

since nobody had the nerve to ask :)
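
For completeness, a sketch of the two halves together. The SetEnvIf line only sets a flag; nothing is blocked until an access directive acts on it. This assumes Apache 2.2-style Order/Allow/Deny (on 2.4 these need mod_access_compat), and example.com stands in for the blocked domain.

```apache
# Flag any request whose Referer header starts with the placeholder domain
SetEnvIfNoCase Referer "^https?://(www\.)?example\.com" getout

# Allow everyone, then deny requests carrying the flag
Order Allow,Deny
Allow from all
Deny from env=getout
```

The variable name "getout" is arbitrary; it only has to match between the two directives.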

wilderness




msg:4351151
 3:09 pm on Aug 13, 2011 (gmt 0)

Deny from env=getout

since nobody had the nerve to ask


The Apache Forum was fairly new in 2004. Jim established it after the SSID forum (in which he was a heavy and longtime participant) was taken offline due to controversy.
In the pre-Apache Forum days, the SSID forum was unmoderated and very heavily used. When somebody spotted a bot or harvester in their logs, most everybody had lines added within moments to defeat it.

These brief and incomplete answers were part of the practiced learning process for newcomers (the same reason I didn't reply about the incorrect case in "setenvif referer").

Most of these present noobs don't even take the time to read the forum charter or library (which holds many of their answers). They are not looking to learn the process, rather they wish exclusive copy and paste solutions.

Leosghost




msg:4351168
 3:40 pm on Aug 13, 2011 (gmt 0)

@the OP ..btw.. s_c_r_o_o_g_l_e.org do not scrape websites ..the page you see is not a scraper for scraping websites ( so worrying about them scraping yours is unnecessary )..they scrape google results and present them to searchers without ads on the "serps" ..and they "scrub" off Googles tracking cookie so google dont know who searched what ..

The site's organiser used to be quite active here .."scarecrow" ..I think he is still a member ..haven't seen him post in nigh on a year ( they have scraped various google "feeds" down the years ..google sometimes shut off certain feeds that they were using ) ..They may well send you visitors in response to a search ..but if the visitor has javascript enabled they will see your adsense if you are running it ..just like they would if they came direct via google..

I've no connection with them ..I am just setting the record straight about what they do.

FWIW, raw visitor logs are essential.
If your host doesn't provide them, find another host immediately.

Agreed :) how can anyone hope to run a secure site without access to raw logs

Most of these present noobs don't even take the time to read the forum charter or library (which holds many of their answers). They are not looking to learn the process, rather they wish exclusive copy and paste solutions.


Again agreed..and not only in the apache forum these days..:(

wilderness




msg:4351173
 4:07 pm on Aug 13, 2011 (gmt 0)

Most of these present noobs don't even take the time to read the forum charter or library (which holds many of their answers). They are not looking to learn the process, rather they wish exclusive copy and paste solutions.


Again agreed..and not only in the apache forum these days..


I'm likely making an assumption; however, I'm more inclined to believe that Jim's absence is in fact the cause of this deterioration in etiquette.

lucy24




msg:4351263
 9:33 pm on Aug 13, 2011 (gmt 0)

The raw visitor logs are not archived.

Does the site say so explicitly, or is it a default you can change in your prefs? Mine defaults to three days; I changed it to fifteen to be safe, though I currently pick them up about once a day in spite of recent code change forcing me to use command-line ssh instead of gui ftp.

jdMorgan is about due for a post saying "The reports of my demise are exaggerated" because I know I've seen him a few times and I haven't been around all that long ;)

thirteen




msg:4351313
 3:49 am on Aug 14, 2011 (gmt 0)

Does the site say so explicitly, or is it a default you can change in your prefs?



My web host lets me archive the logs but I never changed the default setting. The default is set for 24 hours.

I use a third party vendor for analyzing traffic, so I didn't pay attention to the raw traffic log.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved