Forum Moderators: coopster


Anyway to scrape the translation results of my site?

Babelfish, Google Translate, etc

         

StoutFiles

6:41 pm on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Right now I'm trying something like...

<?php
$context = array('http' => array('header' => 'Range: bytes=1024-'));
$xcontext = stream_context_create($context);
$test = file_get_contents("http://babelfish.yahoo.com/_my_site_translation_stuff", FALSE, $xcontext);
echo $test;
?>

However, this doesn't work. Any ideas on how this could be possible?

andrewsmd

7:18 pm on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It may be working; try echo(htmlentities($test)); I'm not sure what you're trying to do, but if your file_get_contents call is working, that echo should show the page source. Maybe that's what you're trying to do? The more you explain, the more help you'll get.
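To illustrate the suggestion above: if the fetch succeeded, echoing the result directly makes the browser render it as a page (which can look blank if the HTML is incomplete), while htmlentities() escapes the markup so the raw source becomes visible. A minimal sketch, using a stand-in string rather than a real fetch:

```php
<?php
// Stand-in for the HTML that file_get_contents would return.
$test = '<html><body>Bonjour</body></html>';

// Echoing $test directly would render it as HTML; htmlentities()
// escapes the tags so you see the literal source text instead.
echo htmlentities($test);
?>
```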

StoutFiles

12:21 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's say I want to translate Google's home page to French.

<?php
$context = array('http' => array('header' => 'Range: bytes=1024-'));
$xcontext = stream_context_create($context);
$test = file_get_contents("http://babelfish.yahoo.com/translate_url?doit=done&tt=url&intl=1&fr=bf-home&trurl=http%3A%2F%2Fgoogle.com&lp=en_fr&btnTrUrl=Translate", FALSE, $xcontext);
echo $test;
?>

Because of the way Babelfish and other translators seem to work, file_get_contents doesn't work here; it just returns a 403 error.

Note that this scraper works for regular pages, just not for a page that has just been translated by another service. That's what I'm trying to do.
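One thing worth trying for the 403: some sites reject requests that arrive with PHP's default user agent. Whether that is what Babelfish is doing here is an assumption, but sending a browser-like User-Agent header through the stream context is a cheap test. A sketch (the fetch itself is commented out so the snippet stands alone):

```php
<?php
// Sketch, not a guaranteed fix: send a User-Agent header with the
// request, since some servers return 403 to PHP's default agent.
// The agent string below is an arbitrary example.
$opts = array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\n",
    ),
);
$ctx = stream_context_create($opts);

// Uncomment to try the actual fetch with the custom header:
// $html = file_get_contents('http://babelfish.yahoo.com/translate_url?doit=done&tt=url&intl=1&fr=bf-home&trurl=http%3A%2F%2Fgoogle.com&lp=en_fr&btnTrUrl=Translate', false, $ctx);
?>
```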

StoutFiles

3:11 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Edit: never mind

andrewsmd

3:25 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think you can read remote pages with file_get_contents; that's for files you can access locally. However, you could use curl. Something like this:

$url = "http://www.example.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);
echo($html);

rocknbil

6:17 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It appears from the file_get_contents() documentation that you can indeed get a remote file, but I agree, curl would be the better option. It's better because you can POST data to a location, as if you'd gone to the site and submitted the form manually, then slurp up the results.
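To flesh out the point about POSTing with curl: a minimal sketch is below. The endpoint URL is a made-up placeholder, and the field names (trurl, lp) are just borrowed from the query string earlier in this thread, not a verified form for any real translator. The exec call is commented out so the snippet doesn't hit the network on its own.

```php
<?php
// Sketch of submitting a form via POST with curl.
// The URL is hypothetical; trurl/lp mirror the fields seen in the
// Babelfish query string above but are not a confirmed form layout.
$ch = curl_init('http://www.example.com/translate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'trurl' => 'http://google.com',
    'lp'    => 'en_fr',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it

// Uncomment to actually submit the form and capture the response:
// $html = curl_exec($ch);
curl_close($ch);
?>
```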

However, the 403 error you are getting may be unavoidable and it may still be there using curl. For those unaware, 403 is forbidden [w3.org], and they *probably* have tools in place to stop you from scraping results unless you are actually on the site.

Essentially, whether you know it or not, you are stealing content from another site, and we all know where that leads . . .

If this turns out to be the case, just link to the site: "Translate this page" and you're done.

StoutFiles

6:22 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I figured out how to use file_get_contents() to scrape a translation; however, for Babelfish there is a catch: if it sees that example.com is trying to scrape the translation of a page on example.com, it redirects you to the frames page.

londrum

6:28 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



just do it locally: run the script on your home computer rather than on your website.