Forum Moderators: coopster


Anyway to scrape the translation results of my site?

Babelfish, Google Translate, etc

         

StoutFiles

6:41 pm on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Right now I'm trying something like...

<?php
$context = array('http' => array('header' => 'Range: bytes=1024-'));
$xcontext = stream_context_create($context);
$test = file_get_contents("http://babelfish.yahoo.com/_my_site_translation_stuff", FALSE, $xcontext);
echo $test;
?>

However, this doesn't work. Any ideas on how this could be possible?

andrewsmd

7:18 pm on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It may be working; try echo(htmlentities($test)); I'm not sure what you're trying to do, but if your file_get_contents call is working, that echo should show the page source. Maybe that's what you're trying to do? The more you explain, the more help you'll get.
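To illustrate the suggestion above: if the fetch succeeded, echoing the result directly makes the browser render it as a page (which can look blank if the HTML is incomplete), while htmlentities() escapes the markup so the raw source becomes visible. A minimal sketch, using a stand-in string rather than a real fetch:

```php
<?php
// Stand-in for the HTML that file_get_contents would return.
$test = '<html><body>Bonjour</body></html>';

// Echoing $test directly would render it as HTML; htmlentities()
// escapes the tags so you see the literal source text instead.
echo htmlentities($test);
?>
```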

StoutFiles

12:21 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's say I want to translate Google's home page to French.

<?php
$context = array('http' => array('header' => 'Range: bytes=1024-'));
$xcontext = stream_context_create($context);
$test = file_get_contents("http://babelfish.yahoo.com/translate_url?doit=done&tt=url&intl=1&fr=bf-home&trurl=http%3A%2F%2Fgoogle.com&lp=en_fr&btnTrUrl=Translate", FALSE, $xcontext);
echo $test;
?>

Because of the way Babelfish and other translators seem to work, file_get_contents doesn't work here; it just returns a 403 error.

Note that this scraper works for regular pages, just not for a page that has just been translated by another service. That's what I'm trying to do.
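One thing worth trying for the 403: some sites reject requests that arrive with PHP's default user agent. Whether that is what Babelfish is doing here is an assumption, but sending a browser-like User-Agent header through the stream context is a cheap test. A sketch (the fetch itself is commented out so the snippet stands alone):

```php
<?php
// Sketch, not a guaranteed fix: send a User-Agent header with the
// request, since some servers return 403 to PHP's default agent.
// The agent string below is an arbitrary example.
$opts = array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\n",
    ),
);
$ctx = stream_context_create($opts);

// Uncomment to try the actual fetch with the custom header:
// $html = file_get_contents('http://babelfish.yahoo.com/translate_url?doit=done&tt=url&intl=1&fr=bf-home&trurl=http%3A%2F%2Fgoogle.com&lp=en_fr&btnTrUrl=Translate', false, $ctx);
?>
```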

StoutFiles

3:11 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Edit: never mind

andrewsmd

3:25 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think you can read remote pages with file_get_contents; that's for files you can access locally. However, you could use curl. Something like this:

$url = "http://www.example.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);
echo($html);

rocknbil

6:17 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It appears from the file_get_contents() documentation that you can indeed get a remote file, but I agree, curl would be the better option. It's better because you can POST data to a location, as if you'd gone to the site and submitted the form manually, then slurp up the results.
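To flesh out the point about POSTing with curl: a minimal sketch is below. The endpoint URL is a made-up placeholder, and the field names (trurl, lp) are just borrowed from the query string earlier in this thread, not a verified form for any real translator. The exec call is commented out so the snippet doesn't hit the network on its own.

```php
<?php
// Sketch of submitting a form via POST with curl.
// The URL is hypothetical; trurl/lp mirror the fields seen in the
// Babelfish query string above but are not a confirmed form layout.
$ch = curl_init('http://www.example.com/translate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'trurl' => 'http://google.com',
    'lp'    => 'en_fr',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it

// Uncomment to actually submit the form and capture the response:
// $html = curl_exec($ch);
curl_close($ch);
?>
```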

However, the 403 error you are getting may be unavoidable and it may still be there using curl. For those unaware, 403 is forbidden [w3.org], and they *probably* have tools in place to stop you from scraping results unless you are actually on the site.

Essentially, whether you know it or not, you are stealing content from another site, and we all know where that leads . . .

If this turns out to be the case, just link to the site: "Translate this page" and you're done.

StoutFiles

6:22 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I figured out how to use file_get_contents() to scrape a translation; however, for Babelfish there is a catch: if it sees that example.com is trying to scrape the translation of a page on example.com, it redirects you to the frames page.

londrum

6:28 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



just do it locally: run the script on your home computer rather than on your website.