Forum Moderators: coopster


Crawler advice needed

How to build a php site crawler

         

Tourex

11:18 am on Jan 3, 2017 (gmt 0)

10+ Year Member



I am currently rebuilding/upgrading a large non-profit information website and keen to add a site search facility.
One problem is that most pages of the site use Bootstrap tabs. All the search facilities I've looked at will provide links to the relevant page, but not with the necessary tab opened. My main menu uses a ?t=x parameter to ensure that the page loads with the correct tab open.

All things considered, I feel it would be best for me to build and host my own mysql index of the pages. However, I need to build a crawler that can work through a list of specified links and then grab just the content for a specific page tab.

Can anyone please offer some guidance, and/or links to code that will help me build this crawler? Thanks in anticipation.

keyplyr

11:31 am on Jan 3, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would look around at github.com.

mack

2:26 pm on Jan 3, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It really depends on how large your site is and how you serve your content. For example, if you store your content in a database and present it via a dynamic site, you already have the data; you just need to figure out the best way to make it searchable. If you need to crawl, then rolling your own system may well be the way to go.

I have done some experimental crawling over the past year using "simplehtmldom" as my document parser. In my case I wasn't limiting the crawler to any specific domain, so you won't need to do quite as much as I did. For example, you can skip robots.txt handling: because you own the site, you control what is and isn't crawled.

Here are the basics of how I did it. The bot starts on the homepage. It extracts the HTML and stores it in the database. It also extracts all the links and stores them in a database table called todo. It then moves on to the next page from the todo list and repeats the process. Effectively I was working through the to-do list, adding a new record to a pages table for each URL. If the record already existed, however, I didn't re-write it; I simply awarded it a point. The number of points a page had went towards ranking (a very simplified approach to PR).
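The loop above can be sketched in a few lines of PHP. This is only a minimal sketch: plain arrays stand in for the todo and pages database tables so the logic is visible, and the $fetchLinks callback stands in for a real fetch-and-parse step (for instance simplehtmldom's file_get_html($url)->find('a')).

```php
<?php
// Minimal sketch of the todo-list crawl loop described above.
// Arrays stand in for the "todo" and "pages" database tables;
// $fetchLinks stands in for a real fetch + simplehtmldom parse.

function crawl(array $seedUrls, callable $fetchLinks, int $maxPages = 100): array
{
    $todo  = $seedUrls;  // URLs still to visit
    $pages = [];         // url => points

    while ($todo && count($pages) < $maxPages) {
        $url = array_shift($todo);

        if (isset($pages[$url])) {
            $pages[$url]++;      // already crawled: award a point, don't re-fetch
            continue;
        }
        $pages[$url] = 1;        // new page record

        foreach ($fetchLinks($url) as $link) {
            $todo[] = $link;     // queue every extracted link
        }
    }
    return $pages;
}

// Tiny in-memory "site" to exercise the loop:
$graph = [
    '/'        => ['/about', '/contact'],
    '/about'   => ['/contact'],
    '/contact' => [],
];
$pages = crawl(['/'], fn($url) => $graph[$url] ?? []);
// '/contact' is linked twice, so it accumulates more points than '/about'
```

In production the crawl() internals would be INSERT/UPDATE queries against the todo and pages tables rather than array operations.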

Regarding your tabs: this is something I haven't dealt with, but you could embed some custom markup and have the bot look for it to determine which tab to open. I'm not entirely sure this is the best way of achieving it, but it may work.

Mack.

Tourex

2:33 pm on Jan 3, 2017 (gmt 0)

10+ Year Member



Thanks Mack - I'll certainly have a look at simplehtmldom. I am serving a mixture of static pages and dynamic content. The dynamic content is not an issue - it is the static pages that are giving me the headache, because of the tabs.

phparion

5:21 pm on Jan 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would like to add my two cents.

@mack: have you tried storing the so-called "to-do" list of URLs for the crawler's next run in an array rather than putting the load on the database? I would try it and compare the performance: store the to-do list in an array, run array_unique() to eliminate the duplicate URLs (usort() only sorts, it doesn't de-duplicate), and only then deal with the database. It should be faster.
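For the in-memory de-duplication step, a quick sketch of the two usual approaches; using URLs as array keys is often the faster option inside a crawl loop, since the hash lookup is O(1):

```php
<?php
// Two ways to de-duplicate an in-memory to-do list of URLs.

$todo = ['/a', '/b', '/a', '/c', '/b'];

// 1) array_unique() preserves first occurrences and drops repeats.
$unique = array_values(array_unique($todo));

// 2) Using URLs as array keys de-duplicates as you insert,
//    with an O(1) "seen already?" check - handy in a crawl loop.
$seen = [];
foreach ($todo as $url) {
    $seen[$url] = true;
}
$uniqueKeys = array_keys($seen);
```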

@tourex: if it is your own website, you already have the database - just add a search feature. Instead of writing a crawler, which can be tricky, time-consuming and hard on your host/server resources, try converting the static pages to dynamic ones. It will help you in the long run.

However, if writing a crawler is inevitable, then start with PHP's cURL library. I have done some wonders with it :) ... you will need strong regex skills as well.
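A hedged sketch of that approach: the fetch uses only the standard curl_* functions, while link extraction here uses DOMDocument, which tends to survive real-world HTML better than a regex. The function names are my own, not from any library.

```php
<?php
// Sketch: fetch a page with PHP's cURL extension, then pull out links.
// fetchPage() and extractLinks() are illustrative names, not library calls.

function fetchPage(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);                         // false on failure
    curl_close($ch);
    return $body;
}

function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // suppress warnings from imperfect markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($href = $a->getAttribute('href')) {
            $links[] = $href;
        }
    }
    return $links;
}

// Parsing works on any HTML string, live or canned:
$links = extractLinks('<p><a href="/one">1</a><a href="/two">2</a></p>');
```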

Tourex

5:39 pm on Jan 9, 2017 (gmt 0)

10+ Year Member



Thanks phparion

My site is a visitor information site for a particular destination. Around 30% of the content is static editorial pages of information. We then have hotel and accommodation listings, which are stored in a MySQL database. On top of that, we have a phpBB forum that ideally I would like to include in the search facility. I doubt whether I have the skills and experience to write a crawler. The static pages don't change very often, so on balance, I think my best bet might be to copy them to a database and then use PHP to search the different database tables. That has a slight advantage in that it enables me to rank the results better, with 'editorial' at the top, 'listings' second and any relevant forum posts last.

Food for thought and thanks for your input. Any other thoughts will still be appreciated though.
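The ranked search described above (editorial first, listings second, forum posts last) can be done in a single UNION query. A hedged sketch, with invented table and column names; the same query shape works in MySQL, while an in-memory SQLite database keeps the example self-contained:

```php
<?php
// Sketch of a ranked cross-table search in one UNION query.
// Table and column names are invented for illustration.

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec("CREATE TABLE editorial (title TEXT, body TEXT)");
$db->exec("CREATE TABLE listings  (title TEXT, body TEXT)");
$db->exec("CREATE TABLE posts     (title TEXT, body TEXT)");

$db->exec("INSERT INTO editorial VALUES ('Beach guide',    'best beach walks')");
$db->exec("INSERT INTO listings  VALUES ('Beach Hotel',    'rooms near the beach')");
$db->exec("INSERT INTO posts     VALUES ('beach parking?', 'where to park at the beach')");

// src_rank forces editorial first, listings second, forum posts last
$sql = "SELECT title, 1 AS src_rank FROM editorial WHERE body LIKE ?
        UNION ALL
        SELECT title, 2 AS src_rank FROM listings  WHERE body LIKE ?
        UNION ALL
        SELECT title, 3 AS src_rank FROM posts     WHERE body LIKE ?
        ORDER BY src_rank";
$stmt = $db->prepare($sql);
$stmt->execute(['%beach%', '%beach%', '%beach%']);
$titles = $stmt->fetchAll(PDO::FETCH_COLUMN, 0);
```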

phparion

5:47 pm on Jan 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1 - Even if you have multiple applications installed on your server, as long as their databases are on the same server it should not be a problem to write a search feature that queries them all simultaneously. I would prefer to create VIEWs, i.e. write the queries once and save each as a VIEW (a view stores the query, not its results). Then, when searching, query the VIEW rather than the multiple tables and databases.
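A sketch of the VIEW idea, with hypothetical table names matching the site's three content sources; since a view just stores the query, it re-runs against the live tables on every SELECT:

```sql
-- Hypothetical example: a view unifying the searchable text from
-- several tables, so the search code only ever queries one place.
CREATE VIEW search_index AS
    SELECT title, body, 'editorial' AS source FROM editorial
    UNION ALL
    SELECT title, body, 'listing'   AS source FROM listings
    UNION ALL
    SELECT title, body, 'forum'     AS source FROM posts;

-- Search then becomes a single query:
SELECT title, source FROM search_index WHERE body LIKE '%beach%';
```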

2 - Since you have decided to copy the static content to the database, why not just generate those pages from the database? Get rid of the static content - it is a pain in the neck to maintain.

Tourex

6:04 pm on Jan 9, 2017 (gmt 0)

10+ Year Member



1 - I'll have to look into 'VIEWs' - not something I'm familiar with. My experience is relatively limited and I try to keep things simple.

2 - That was something I was thinking of. I'm using Bootstrap for the new site and making very extensive use of tabs to try to keep some semblance of order to some 100,000 words of editorial content. The current 'old' site has the best part of 1,000 page URLs, and by using tabbed pages and accordions I'm expecting to get that down to under 100 pages. It would be a nightmare trying to keep each entire page in a database, and it would be just so easy to break the structure. However, I guess I could structure the database so that the content of each tab was in a separate table row. Food for thought! My tired old brain (I'm a 'senior citizen' or 'old fart', whichever you prefer) is reeling from all the possible options.

phparion

6:20 pm on Jan 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, it is not easy to understand exactly what you want to achieve without having a look at the actual website. But from what I understand, you are trying to use tabs to collect many pages within a single page. That is a simple database task:

a) pages
pageID (PK)
pageTitle
pageLink

b) pageTabs
pageTabID (PK)
pageID (FK)
tabTitle
tabContents

While displaying the data dynamically, queries can be as easy as,

SELECT tabTitle FROM pageTabs pt
JOIN pages p ON pt.pageID = p.pageID
WHERE p.pageID = 10


A simple PHP loop can then display all the tab/CSS buttons etc. Similarly, you can display each tab's contents dynamically.
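That loop might look like the sketch below, using the pageTabs schema above. An in-memory SQLite database keeps it self-contained (swap the DSN for your MySQL connection), and the ?t=x links match the menu scheme mentioned earlier in the thread; the Bootstrap class names are the standard nav-tabs ones.

```php
<?php
// Sketch: render Bootstrap tab buttons from the pageTabs schema above.
// PDO + in-memory SQLite keeps it runnable; swap the DSN for MySQL.

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE pageTabs (pageTabID INTEGER PRIMARY KEY,
           pageID INTEGER, tabTitle TEXT, tabContents TEXT)");
$db->exec("INSERT INTO pageTabs (pageID, tabTitle, tabContents) VALUES
           (10, 'Overview', 'General info'),
           (10, 'Getting there', 'Travel details')");

// Fetch every tab belonging to page 10, in creation order.
$stmt = $db->prepare("SELECT tabTitle, tabContents FROM pageTabs
                      WHERE pageID = ? ORDER BY pageTabID");
$stmt->execute([10]);
$tabs = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Build the nav-tabs list; ?t=x mirrors the menu's tab parameter.
$html = '<ul class="nav nav-tabs">';
foreach ($tabs as $i => $tab) {
    $active = $i === 0 ? ' class="active"' : '';
    $html .= sprintf('<li%s><a href="?t=%d">%s</a></li>',
                     $active, $i, htmlspecialchars($tab['tabTitle']));
}
$html .= '</ul>';
```

A second loop over the same $tabs array would emit each tab's contents pane.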

I hope it helps.

Tourex

9:53 am on Jan 10, 2017 (gmt 0)

10+ Year Member



Hi phparion

Thank you! That's all good information that I will take on board. Your help is really appreciated.