Anyone who advanced past basic mathematics in school has learned this simple concept: large problems can often be divided into groups of smaller problems.
In computing, this concept is gaining traction on a very large scale. Cloud computing, parallel processing, grid computing and virtualization are all different methods used to spread large computational problems among multiple resources. But let’s move past the hype and talk about how you can best use some of these architectures in your IT environment.
In this column, I’ll talk about four distinct ways to distribute high-performance analytics resources so you get the right amount of computing power exactly where you need it for your analytical problems. I’ll explain each method and describe specific scenarios where you might use it.
Method 1: Shared storage
What it is: The first configuration is basic grid computing. In this scenario, multiple machines are pulling data from a single data source, and each machine is running different pieces of the bigger algorithm or mathematical equation. Essentially, you’re breaking one big problem into multiple pieces and running each of those pieces against the same data source at the same time.
This configuration is used primarily as a way to solve batch window time problems. If your current process doesn’t run as quickly as you need it to, you split it up and run each piece in parallel.
When you use it: Typically, the calculations in these scenarios can be partitioned quite naturally. One approach is to partition the problem by something inherent in the data, such as time, geographic region or product, so there is an obvious way to break the work into smaller pieces. A second partitioning scheme takes advantage of discrete computational steps or subparts of the algorithm (sometimes called threads); those subparts can be calculated in parallel and then brought back together.
Business benefit: As opposed to running all of your calculations as a sequential process, you’re computing things in parallel. Whether you use data partitioning or algorithm partitioning, the main point is that independent computational processes can take place at the same time from a single data source.
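The data-partitioning idea above can be sketched in a few lines of Python. This is a minimal illustration, not a grid framework: the sales records and region names are invented, and threads stand in for the separate machines that would each pull from the shared data source.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records in a single shared data source.
SALES = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 75.0},
    {"region": "east", "amount": 60.0},
    {"region": "west", "amount": 40.0},
]

def region_total(region):
    # Every worker reads the same source but filters to its own partition.
    return region, sum(r["amount"] for r in SALES if r["region"] == region)

def parallel_totals(regions):
    # One worker per partition; partial results are merged once all finish.
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return dict(pool.map(region_total, regions))

print(parallel_totals(["east", "west"]))
```

The same shape works whether you partition by region, time or product: the aggregation function is identical on every worker, and only the filter changes.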
Method 2: Moving compute to partitioned data
What it is: Partitioned data architectures use a single process for distributing the data. Each processor or “worker node” then accesses its partitioned data and performs its computing. In many cases, the data is already partitioned and exists in separate data stores before you apply analytics to it.
When you use it: These architectures are useful for situations where the data is fairly static or can be easily bulk partitioned. Retailers often use this type of architecture for markdown optimization and merchandise planning. They have a lot of data that needs to be stored and associated with appropriate computations, and there’s a high correlation between the analytics needed and the data that persists on that node. Using this setup, retailers have been able to reduce batch processing times for promotion optimization and price markdown problems. In many cases, they have already split the data by division, which is the right granularity for the optimization problem, so they can optimize prices by division.
Business benefit: When there is a business need for data to be partitioned in a certain way, running analytics within that existing partitioning scheme has natural advantages, including speed and accuracy. In other cases, analytics must be moved to where the data is because data governance constraints and the cost of moving the data make centralizing it impractical.
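A small sketch of moving compute to partitioned data, using the retail markdown scenario above. The division names, prices and markdown function are all hypothetical; in a real cluster, each call would execute on the node that already holds that division’s data.

```python
# Data already lives partitioned by division in separate stores (simulated
# here as an in-memory dict); the same calculation is shipped to each
# partition instead of moving the data to a central engine.
DIVISION_STORES = {
    "apparel": [100.0, 80.0, 60.0],   # prices held in one division's store
    "footwear": [50.0, 40.0],         # prices held in another's
}

def markdown(prices, pct):
    # Runs "at" the partition: touches only that division's local data.
    return [round(p * (1 - pct), 2) for p in prices]

def run_on_partitions(pct):
    # Each entry stands in for a remote execution on the owning node.
    return {div: markdown(prices, pct) for div, prices in DIVISION_STORES.items()}

print(run_on_partitions(0.25))
```

Because the data is already split at the right granularity (by division), no partition needs to see any other partition’s prices, which is what makes this layout fast.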
Method 3: Moving data to compute
What it is: This type of distributed processing spreads a combination of threads and processes across multiple machines. Essentially, you break the problem up into a bunch of small pieces along with the data that is needed for each subproblem, and send both the data and the algorithm to different nodes to be processed. In the most complex cases, this involves a “tree” of workers (or nodes), with some workers automatically delegating tasks to subworkers.
When you use it: In general, this configuration is good for large computations on smaller amounts of data, where you distribute all or most of the data to the different worker nodes. In other words, the amount of data being analyzed is relatively small, but the computational tasks involve many parts and subparts. Customers are using this today for complex risk calculations. For example, large international banks can recalculate their entire risk portfolios at very high speeds, with this grid configuration handling hundreds of predictive computations for a pricing portfolio in a very short amount of time. One key factor in determining the value of this setup is whether the amount of data being moved around is manageable for the available network bandwidth.
This architecture distributes work across many machines and uses threads to take full advantage of the hardware on each machine. With a variant of the classic message passing interface (MPI), you can break problems down into smaller subproblems and use threads to further decompose and solve those subproblems.
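The two-level decomposition described above, workers across machines and threads within each worker, can be sketched as follows. The portfolio, positions and pricing function are invented for illustration, and thread pools stand in for both the remote worker nodes and their local threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical portfolio: (instrument kind, notional) pairs.
PORTFOLIO = [("bond", 1000.0), ("bond", 500.0),
             ("equity", 200.0), ("equity", 300.0)]

def price(position):
    # Stand-in for an expensive predictive computation per position.
    kind, notional = position
    return notional * (1.01 if kind == "bond" else 1.05)

def worker(chunk):
    # Each worker further parallelizes its own subproblem with threads.
    with ThreadPoolExecutor() as threads:
        return sum(threads.map(price, chunk))

def revalue(portfolio, n_workers=2):
    # Coordinator: split the problem AND its data, then ship both out.
    size = -(-len(portfolio) // n_workers)  # ceiling division
    chunks = [portfolio[i:i + size] for i in range(0, len(portfolio), size)]
    with ThreadPoolExecutor(n_workers) as workers:  # stands in for remote nodes
        return sum(workers.map(worker, chunks))

print(revalue(PORTFOLIO))
```

Note that each chunk carries its own data with it, which is the defining trait of this method: data moves to the compute, not the other way around.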
Business benefit: Once computations finish well within the required time window, you can start asking what-if questions, even of real-time streaming data.
Method 4: Adding the present to past and future calculations
What it is: Methods 1 through 3 look at historical data and traditional architectures with information stored in the warehouse. In that environment, it often takes months of data cleansing and preparation to get the data ready to analyze. Now, what if you want to make a decision or determine the effect of an action in real time, as a sale is made, for instance, or at a specific step in the manufacturing process? With streaming data architectures, you can look at data in the present and make immediate decisions. The growing flood of data coming from smartphones, online transactions and smart-grid houses will continue to increase the amount of data that you might want to analyze but not keep. Real-time streaming, complex event processing (CEP) and analytics all come together here to let you decide on the fly which data is worth keeping and which data to analyze in real time and then discard.
When you use it: Radio-frequency identification (RFID) offers a good use case for this type of architecture. RFID tags provide a lot of information, but unless the state of the item changes, you don’t need to keep warehousing data about that object every day. You only keep data when the item moves through the door and out of the warehouse.
The same concept applies to a customer who does the same thing over and over. You don’t need to keep storing data for analysis on a regular pattern, but if the customer changes that pattern, you might want to start paying attention. If you can detect that they’re using credit in a different way, for example, you may want to respond. If their phone has been working fine, but you can see that its performance is starting to deteriorate, what can you do to improve performance on the fly? You can only take immediate action if you have a system for analyzing data in real time.
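The keep-only-what-changed idea behind the RFID example can be sketched as a simple stream filter. The tag names and locations are invented; a production system would use a CEP engine rather than a generator, but the decision logic is the same.

```python
# Stream filter: emit a reading only when an item's state actually changes;
# unchanged readings are examined and then discarded.
def state_changes(readings):
    last_seen = {}
    for tag, location in readings:
        if last_seen.get(tag) != location:
            last_seen[tag] = location
            yield (tag, location)   # state changed: worth keeping

# Hypothetical RFID readings arriving over time.
stream = [
    ("pallet-1", "warehouse"),
    ("pallet-1", "warehouse"),   # no change: discard
    ("pallet-1", "dock"),        # moved: keep
    ("pallet-2", "warehouse"),   # first sighting: keep
]
print(list(state_changes(stream)))
```

The same pattern applies to the customer examples: store a compact model of the normal pattern, and persist or act on an event only when it deviates.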
Business benefit: The attention toward in-database analytics fits within this area of high-performance analytics as well. Our strategy for the future will be to put the computing power as close to the data as possible, recognizing that as volumes of data increase, you need to move data management and analytic processes to the right place. Sometimes those processes are connected directly to the incoming devices, where analytics must be applied before you ever store the data.