
Forum Moderators: open

Pass data from Chrome extension

php, jquery, mysql, cross-browsing, etc. issues

     
3:28 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


Not sure which forum is the most appropriate for this.
A PHP application on localhost launches a web scraper 2-5 times a day to obtain data from a few internet sites. The data goes into MySQL.
There are a couple of sites where the bot is not suitable. As a first attempt, I've created a Chrome extension which can get the data from those problem sites using a jQuery AJAX request. But it runs outside localhost, you know.
Is there any way to put that data into MySQL, or simply save it to a plain file in the end? Any other methods to obtain data from JS-oriented sites (i.e. how to emulate browser behavior)? Thanks in advance.
3:40 pm on Nov 9, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2354
votes: 625


Use Selenium. Also, I recommend that you use Python instead of PHP to take the data, parse it, and feed it to your DB.
3:52 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


Thank you, but I'm not sure a new programming language is a wise price to pay for a few sites. I'd prefer to visit them as a real user, lol ;)
3:57 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


btw, I looked through the Python description and suspect the result of scraping my problem sites would be the same as with PHP or Perl. I guess the right way is something like JSON, etc.
4:27 pm on Nov 9, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2044
votes: 340


Did you try adding localhost to the permissions section of the extension's manifest.json file?
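For reference, granting an extension access to localhost looks roughly like this in manifest.json. A minimal sketch, assuming a manifest v2 extension (current at the time); the name and the second host pattern are placeholders:

```json
{
  "name": "Scraper helper",
  "version": "1.0",
  "manifest_version": 2,
  "permissions": [
    "http://localhost/*",
    "https://*.example.com/*"
  ]
}
```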
4:56 pm on Nov 9, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2354
votes: 625


What exactly is the issue - what is blocking you?
Just to clarify, Selenium is used to control the browser: it allows you to programmatically (using JS) interact with the browser and gives you access to the data (HTML/XML/JSON/CSV, etc.).

First you will need to write a crawler/scraper script that tells Selenium what to do; this can be done in PHP, Python, or any other scripting language.
Once you have "collected" the data with Selenium, you need to parse it into a usable format. Typically one collects a mix of formats (HTML, JSON, etc.), strips away all the HTML, and preserves only the text elements and data. Finally, once the data is parsed, you save it to your DB to be used in the future. Certainly one can use PHP for parsing, but I doubt it is well suited for that task. Python has several packages designed specifically for it (e.g. Beautiful Soup). If you're lucky, the data you collect may already be in some usable format such as JSON or CSV, but that typically isn't the case, as web pages are intended to be readable and those formats are not.

I understand that Python may be new to you, but it is a very easy language to learn: it is easily readable and far less cryptic than PHP. The time spent trying to write a script in PHP to do something it is not really intended to do is likely to match the time spent learning how to do it in Python. Here is an example of a Python script I wrote to scrape paginated content off the web. This script handles only the first part, that is, directing Selenium and collecting the data. Parsing is handled afterwards and is very specific to the content collected.


import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.common import exceptions as scx
from selenium.webdriver.chrome import service

webdriver_service = service.Service('path_to/Selenium_drivers/chromedriver_win32/chromedriver.exe')
webdriver_service.start()

def paginated_scrapper(url):
    """Scrape paginated results when the pages are the standard
    page 1, 2, 3...
    url: string representing the base url to crawl
    Returns a dict of BeautifulSoup objects where the key is an int
    representing the page number."""
    driver = webdriver.Remote(webdriver_service.service_url,
                              webdriver.DesiredCapabilities.CHROME)
    driver.get(url)
    time.sleep(5)
    data = {1: bs(driver.page_source, 'html.parser')}
    for i in range(2, 1000):
        try:
            # Find the link for the next page number and click it via JS.
            element = driver.find_element_by_link_text(str(i))
            driver.execute_script("arguments[0].click();", element)
            time.sleep(3)
            data[i] = bs(driver.page_source, 'html.parser')
        except scx.NoSuchElementException:
            print('Ending on page:', i)
            break
        except Exception as e:
            print('error occurred on page', i, ':', e)
            break
    driver.quit()
    return data
5:01 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


>>Did you try adding localhost to the permissions section of the extension's manifest.json file?
That's not the problem. I can get the data using the extension, but I only see it in an HTML file which the extension uses for output. I need the output to be passed to the DB.
5:04 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


>>What exactly is the issue, what is blocking you?.......

Well, Nick, thank you for the time you spend here. Maybe I should study the Selenium docs.
5:13 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


PHP Simple HTML DOM Parser is not bad, btw. I guess jQuery has more functionality for building a parser here, but PSHP is already a ready-to-use product. It parses ~95% of the resources I'm tracking. The problem is the remaining 5% ;) And I'm sure this percentage will rise in the future, so the problem needs to be solved now.
5:32 pm on Nov 9, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2044
votes: 340


I need the output be passed to db.

POST it to a PHP script that connects to the database?
6:03 pm on Nov 9, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


>>POST it to a PHP script that connects to the database?
HOW?
Sorry, maybe you don't understand me.
I can see the result in the extension window, but HOW do I direct that data to the "PHP script that connects to the database"? I'm able to copy-paste, of course, but that's hardly a solution ;)
7:52 pm on Nov 9, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2044
votes: 340


If you use a GET request to fetch the data (with jQuery), you can also pass that data along to another address using a POST request [api.jquery.com]. You have a PHP application that scrapes the other web pages, so you could use that same script, or a separate one, to accept the data from the POST request and store it in your local database.
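For illustration, the extension side could look something like this - a minimal sketch in plain JavaScript (fetch instead of jQuery), where `save.php` is an assumed name for the PHP endpoint that writes to MySQL:

```javascript
// Sketch of the extension side: serialize the scraped values and
// POST them to a local PHP endpoint. "save.php" is an assumed name;
// any script that reads $_POST and does the INSERT will work.
function toFormBody(data) {
  // Build an application/x-www-form-urlencoded body from a flat object.
  return Object.entries(data)
    .map(([k, v]) => encodeURIComponent(k) + '=' + encodeURIComponent(v))
    .join('&');
}

function postToLocalhost(data) {
  return fetch('http://localhost/save.php', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: toFormBody(data),
  });
}

// Example usage inside the extension, after scraping:
// postToLocalhost({ title: document.title, price: '19.99' });
```

With jQuery the same thing is simply `$.post('http://localhost/save.php', data)`. Either way, the extension needs permission for localhost in its manifest, and the PHP script on the other end just reads `$_POST` and stores the values.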

If that's unclear, or not what you're looking for, I'll repeat NickMNS's question:
What exactly is the issue, what is blocking you?
6:11 am on Nov 10, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


"The issue" mentioned is that some websites do not respond to the bot request like other sites. Browsers do not care (they have the functionality bots have not) and as users we can not distinguish the difference, but if you try to save the content of the web page in a text file from the bot program, you'll see a usual html page structure and data in most cases and a strange set of characters in the case i'm speaking about.
9:15 am on Nov 10, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2044
votes: 340


Yes, I get that, but what's the exact issue you face in trying to store that data? If you can fetch the data from a Chrome extension you built, I assume you can also POST that data. And since you have a PHP-based web scraper, I also assume you can write a PHP script that would accept that POST data and store it in a database. So what's the problem?
9:31 am on Nov 10, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


>> If you can fetch the data from a Chrome extension you built,
That's exactly the problem I wrote about.
10:22 am on Nov 10, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2044
votes: 340


Maybe post some code from your Chrome extension, to stop us from going in circles here :-)
4:33 pm on Nov 10, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2003
posts: 378
votes: 5


"Via Extension Road" is a very long and complicated way in any case. Actually I've tried this after having bumped into CORS policy barrier trying jquery.ajax from my program for 2 days. Now I've rolled back to this and seems I've dig up a solution (because obtained the 1-st successful result after trying JSONProxy).
Sorry for the trouble, time wasted, etc, and thank you. #*$! happens u know.
 
