Architecting predictive modeling software

  softwareengineering

I have a predictive model which runs 10,000+ scenarios to predict the outcome. It is a micro-simulator model for an insurance company. The data is almost static, changing only from time to time; it is effectively read-only, and data updates happen outside of this simulator.

Each simulation run accepts several input parameters through a user interface. Upon submission of a run (with all user-defined parameters), the model spawns 10,000 parallel jobs in a cloud environment using 10,000 cores. Each of these jobs reads the entire data set from a SQL database at the beginning and then starts crunching numbers. Each job is CPU-intensive and consumes as much CPU capacity as it can. So far, memory usage is not an issue. A recent code optimization brought the run-time down from over 4 hours to less than 1.5 hours, i.e. the calculation engine takes around 90 minutes to complete the data crunching for a single job on a high-end dedicated physical server (this is just to give you a sense of the run-time). So far end users have no complaints and everything looks good.

The calculation engine is written in unmanaged C++ (to reduce the layers of libraries managing things behind the scenes). The data volume is not that big (around 1,500,000 rows in a table), and the calculation engine reads the data over a single connection at the beginning, which takes less than a minute.

Recently we have felt the need to revisit the architecture to determine whether this distributed design is the best approach, or whether there is a better way to handle this. Management would like to bring the run-time down to within 15 minutes, which seems difficult unless we make some drastic changes to the design. We are exploring options such as GPUs, FPGAs, an InfiniBand backbone, Hadoop, etc.

Does anyone have similar experience or ideas for dealing with this kind of architecture? Could it be done differently from the way we are doing it now?

Management would like to bring the run-time down to within 15 minutes.

Start measuring your code to find out where the “hot spots” are. Hot spots are the places where your code spends the most time executing. Usually, they comprise less than 10 percent of your code. Once you find them, focus your efforts solely on improving the performance of the code in these hot spots.
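For example, one low-tech way to locate hot spots, before reaching for a full profiler such as perf or VTune, is to wrap each suspected phase of the engine in a wall-clock timer and see where the 90 minutes actually go. This is only a minimal sketch; `load_data()` and `run_scenarios()` are hypothetical placeholders for your engine's phases.

```cpp
#include <chrono>
#include <iostream>

// Wall-clock timer for one phase of the engine.
template <typename Fn>
double time_phase(const char* name, Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(end - start).count();
    std::cout << name << ": " << secs << " s\n";
    return secs;
}

int main() {
    // load_data() and run_scenarios() stand in for the real engine phases.
    time_phase("load data from SQL", [] { /* load_data(); */ });
    time_phase("scenario crunching", [] { /* run_scenarios(); */ });
}
```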

Other things to check for:

  1. Database latency and locking. If it’s SQL Server, explore the NOLOCK hint
  2. The time spent sending data over the network
  3. The overhead involved in spinning up 10,000 jobs simultaneously. Maybe use fewer, less granular jobs? (See the sketch after this list.)
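To illustrate the last point: a coarser-grained layout groups scenarios into batches, so each job pays the startup and data-load cost once and then runs many scenarios, rather than 10,000 jobs each re-reading the 1,500,000 rows. This is a minimal sketch under stated assumptions: `run_scenario()`, the commented-out `load_data()`, and the batch size of 100 are hypothetical, and the batch size would have to be traded off against per-job wall time.

```cpp
#include <thread>
#include <vector>

constexpr int kTotalScenarios = 10000;
constexpr int kBatchSize = 100;   // hypothetical: scenarios per job, to be tuned

// Placeholder for one run of the real calculation engine.
void run_scenario(int scenarioId) { (void)scenarioId; }

// One coarse-grained job: load the read-only data once, then run a batch.
void run_batch(int firstScenario, int count) {
    // load_data();  // once per job instead of once per scenario
    for (int i = 0; i < count; ++i)
        run_scenario(firstScenario + i);
}

int main() {
    // Here each "job" is modeled as a thread on one machine; in the cloud
    // setup each batch would instead be one submitted job.
    std::vector<std::thread> workers;
    for (int first = 0; first < kTotalScenarios; first += kBatchSize)
        workers.emplace_back(run_batch, first, kBatchSize);
    for (auto& w : workers) w.join();
}
```

Whether this actually helps depends on how your cloud environment schedules and bills the 10,000 cores; the point is simply to amortize per-job startup and data-load overhead across many scenarios.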
