Integrating Spark into the BI Application - The InetSoft Approach

To avoid the aforementioned problems, the key is to break down the barrier between Spark and the BI tool. Instead of relying on the JDBC/ODBC interface, we completely fused Spark into StyleBI. The following is a high-level view of the integration.

[Figure: Spark integration into the BI application]

All interactions between StyleBI and Spark go through the native API, so Spark becomes a native part of the product. Data inside a cluster can be accessed through Spark SQL using SQL-like queries, or directly as files or through data connectors provided by the data stores. Unlike with a JDBC connection, data is not retrieved from Spark until all processing has finished and the result is ready to be presented to users.

Equally important, data processing jobs normally performed by BI tools are also pushed into the cluster. For example, to join the result of a Spark SQL query with an in-memory table, the in-memory table is pushed into the cluster, and the join is submitted to the InetSoft execution engine embedded in the Spark nodes. Data already residing in the cluster is never moved.
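The idea of shipping the small in-memory table to the data, rather than pulling cluster data out, can be sketched in plain Python. This is only an illustrative model of a broadcast-style join, not InetSoft's execution engine or the Spark API; the partitions, the table contents, and the join_in_partition helper are all hypothetical.

```python
# Illustrative sketch: ship a small in-memory table to each partition
# (a broadcast-style hash join) so the large data never leaves its node.
# All names and data here are hypothetical.

# Large dataset, already split across cluster nodes (modeled as lists).
# Each row is (order_id, customer_id, amount).
partitions = [
    [("o1", "c1", 120.0), ("o2", "c2", 75.5)],   # node A
    [("o3", "c1", 30.0), ("o4", "c3", 12.0)],    # node B
]

# Small in-memory table held by the BI tool: customer_id -> region.
small_table = {"c1": "East", "c2": "West"}


def join_in_partition(rows, lookup):
    """Inner-join one partition's rows against the broadcast lookup table."""
    return [
        (order, customer, amount, lookup[customer])
        for order, customer, amount in rows
        if customer in lookup
    ]


# Each node joins locally; only the joined result is collected.
joined = [
    row
    for part in partitions
    for row in join_in_partition(part, small_table)
]

for row in joined:
    print(row)
```

In this model the only data that travels is the small lookup table (outbound) and the finished join result (inbound), which is the property the paragraph above describes.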

Analytic Query Acceleration

Spark is a general-purpose computing platform. Although it was designed to handle real-time computing needs, it provides no special treatment for analytic queries. A BI tool could use Spark as-is and get decent performance. But BI queries generated by interactive analysis often have distinct patterns, and this knowledge can be used to optimize execution further. As a natively integrated tool, StyleBI is in a unique position to add special logic that improves performance. The acceleration is achieved by three methods:

StyleBI analyzes a visualization and generates candidate queries to materialize. The goal is to preprocess as much as possible, leaving only the parts that must be updated dynamically to be computed at query time.
For data stored in a materialized format, logic responsible for handling interactive queries is pushed down to the data storage layer whenever possible. This significantly reduces the amount of data Spark needs to process.
A specialized columnar data store is created to store the materialized data. It conforms to the standard Hadoop/Spark API so it can be accessed by Spark jobs, but is optimized to handle the types of queries generated from interactive analysis.
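The first two methods can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not the actual StyleBI data store or the Hadoop/Spark API: the ColumnStore class, its scan method, the column names, and the data are hypothetical. It shows a pre-aggregate being materialized once, and an interactive filter being evaluated inside the store so that only matching rows and requested columns are returned.

```python
# Illustrative sketch: a materialized pre-aggregate kept in a columnar
# layout, with filters applied inside the store (pushdown).
# Class, column names, and data are hypothetical.
from collections import defaultdict


class ColumnStore:
    """Stores each column as its own list, one entry per row."""

    def __init__(self, columns):
        self.columns = columns
        self.num_rows = len(next(iter(columns.values())))

    def scan(self, select, where=None):
        """Return only the requested columns for rows passing the filter.

        `where` is a (column, value) equality filter evaluated inside the
        store, so non-matching rows are never handed to the caller.
        """
        if where is None:
            row_ids = range(self.num_rows)
        else:
            col, value = where
            row_ids = [i for i, v in enumerate(self.columns[col]) if v == value]
        return [tuple(self.columns[c][i] for c in select) for i in row_ids]


# Raw detail rows: (region, product, sales).
raw = [
    ("East", "A", 10), ("East", "A", 5), ("East", "B", 7),
    ("West", "A", 3), ("West", "B", 8), ("West", "B", 2),
]

# Materialization step: pre-aggregate sales by (region, product) once.
totals = defaultdict(int)
for region, product, sales in raw:
    totals[(region, product)] += sales

keys = sorted(totals)
store = ColumnStore({
    "region": [k[0] for k in keys],
    "product": [k[1] for k in keys],
    "sales": [totals[k] for k in keys],
})

# Interactive step: the user filters the chart to region == "East".
# Only two columns and the matching rows leave the store.
result = store.scan(select=["product", "sales"], where=("region", "East"))
print(result)  # [('A', 15), ('B', 7)]
```

The expensive aggregation runs once at materialization time. The interactive query then touches four pre-aggregated rows instead of six detail rows, and the gap widens as the detail data grows.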

[Figure: Spark business intelligence integration architecture]

Spark Is More Than Big Data

While Spark gained its fame as a real-time cluster computing platform, it comprises many other components, several of which are of great interest to BI users. Ignoring them would be a disservice to users. Chief among them are Machine Learning (Spark ML) and streaming. We consider both areas core capabilities of any future BI tool and have brought them into the product as part of the native integration with Spark.

Spark ML

Spark ML provides an opportunity to bring Machine Learning to the masses, and we view StyleBI as the bridge from Spark ML to business users. The integration has the following objectives:

Creating and training models should be easy and accessible to people with minimal Machine Learning training. Every opportunity was taken to automate the process.
Once a model is trained, it should be available to business users with no Machine Learning knowledge.
Like Spark queries, Spark ML integration is done at the native level, fully integrated into the product, so it enjoys all the benefits of the cluster.

Spark ML provides built-in algorithms in the following areas:

Classification - prediction of categorical values
Regression - prediction of numeric values
Clustering - automatic grouping of similar items
Recommendation

The first three are available in StyleBI and can be accessed as a regular query without any programming or special knowledge.
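To make "automatic grouping of similar items" concrete, here is a toy one-dimensional k-means written in plain Python. It is only a sketch of the general technique: it is not Spark ML's implementation and not how StyleBI exposes clustering, and the kmeans_1d function, the starting centers, and the data are hypothetical.

```python
# Illustrative sketch: toy 1-D k-means clustering.
# Function name, starting centers, and data are hypothetical.


def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to
    the mean of its group; repeat a fixed number of times."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])

print(centers)  # [2.0, 11.0]
print(groups)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No labels are supplied. The two groups emerge from the data alone, which is what distinguishes clustering from classification and regression in the list above.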

Streaming

Streaming is another area that is revolutionizing BI applications. Not long ago, real-time data processing was considered a niche requirement and often reserved for the most specialized applications. With the advent of Kafka and Spark, streaming has been brought into the mainstream.

However, many challenges remain. Compared with batch processing, streaming must deal with new problems such as back pressure and late-arriving records. In addition, programming is often required to create a streaming application.
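The late-arriving record problem can be shown with a short plain-Python sketch. It models tumbling event-time windows with a watermark, an approach similar in spirit to what stream processors use, but it is not the Spark Streaming API. The window size, the lateness allowance, the count_by_window function, and the events are all hypothetical.

```python
# Illustrative sketch: tumbling event-time windows with a watermark that
# decides whether a late record is still accepted. Window size, lateness
# allowance, and event data are hypothetical.
from collections import defaultdict

WINDOW = 10           # window length in seconds
ALLOWED_LATENESS = 5  # how far behind the newest event a window stays open


def count_by_window(events):
    """Count events per window, dropping records whose window has closed."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, payload in events:  # events are in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW
        window_end = window_start + WINDOW
        if window_end <= watermark:
            # The window this record belongs to is already finalized.
            dropped.append((event_time, payload))
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# (event_time, payload); note that 8 and 3 arrive out of order.
events = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (17, "e"), (3, "f")]
counts, dropped = count_by_window(events)

print(counts)   # {0: 3, 10: 2}
print(dropped)  # [(3, 'f')]
```

The record at time 8 arrives late but is still counted because its window has not yet passed the watermark. The record at time 3 arrives after the watermark has moved beyond its window and is dropped. A batch job never faces this choice, because all records are present before processing starts.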

Spark Streaming provides an abstraction of stream processing that nicely fits into the overall Spark framework. We are actively working on integrating streaming into StyleBI. The goal is to simplify stream processing and make it accessible to regular users without programming skills. To a large degree, building a query against a stream should not be much different from a query against a database.

Summary

Big Data is more than throwing a cluster together and connecting to it through a JDBC driver. It requires an architecture that has all the pieces working closely together. It's our view that the most effective solution requires a complete fusion of the various technologies.

The StyleBI/Spark integration is the first step in achieving this vision. As the innovations flourish, our mission is to create a platform that effortlessly integrates with software in the Hadoop ecosystem, so business users can enjoy all the benefits.

[Figure: Spark machine learning BI architecture]
