The key to avoiding the aforementioned problems is to break down the barrier between Spark and the BI tool. Instead of relying on the JDBC/ODBC interface, we fused Spark completely into Style Intelligence. The following is a high-level view of the integration.
All interactions between Style Intelligence and Spark go through the native API, so Spark becomes a native part of the product. Data inside a cluster can be accessed through Spark SQL using SQL-like queries, directly as files, or through data connectors provided by the data stores. Unlike with JDBC, data is not retrieved from Spark until all processing has finished and the results are ready to be presented to users.
Equally important, data processing jobs normally performed by the BI tool are also pushed into the cluster. For example, to join the result of a Spark SQL query with an in-memory table, the in-memory table is pushed into the cluster, and the join is submitted to the InetSoft execution engine embedded in the Spark nodes. Data already residing in the cluster is never moved.
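To make the data-movement argument concrete, the following is a minimal sketch in plain Python. It is not the InetSoft execution engine and does not use the Spark API. The partition lists, the `managers` table, and both function names are hypothetical, and the "cluster" is simulated with ordinary lists. It only counts how many rows cross the boundary under each strategy.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (region key, sales amount); hypothetical schema


def join_on_client(partitions: List[List[Row]], small: Dict[str, str]):
    """Pull every cluster row to the client, then join there."""
    pulled = [row for part in partitions for row in part]
    joined = [(k, v, small[k]) for k, v in pulled if k in small]
    return joined, len(pulled)  # rows moved out of the cluster


def join_in_cluster(partitions: List[List[Row]], small: Dict[str, str]):
    """Copy the small table to every partition and join in place."""
    joined = []
    for part in partitions:
        local_copy = dict(small)  # stands in for a per-node copy
        joined.extend((k, v, local_copy[k]) for k, v in part if k in local_copy)
    return joined, len(small) * len(partitions)  # rows moved into the cluster


# Two simulated partitions of 1,000 rows each, plus a two-row in-memory table.
partitions = [
    [("east", i) for i in range(1000)],
    [("west", i) for i in range(1000)],
]
managers = {"east": "Ana", "west": "Bo"}

client_rows, moved_out = join_on_client(partitions, managers)
cluster_rows, moved_in = join_in_cluster(partitions, managers)
print(moved_out, moved_in)
```

Both strategies produce the same joined rows in this sketch. The difference is which side travels: the large cluster-resident data or the small in-memory table.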
Spark is a general-purpose computing platform. Although it was designed to handle real-time computing needs, it provides no special treatment for analytic queries. A BI tool could use Spark as-is and get decent performance. But BI queries generated by interactive analysis often have distinct patterns, and this knowledge can be used to optimize execution. As a natively integrated tool, Style Intelligence is in a unique position to add special logic that improves performance further. The acceleration is achieved by three methods:
While Spark gained its fame as a real-time cluster computing platform, it comprises many other components. Many of those are of great interest to BI users, and ignoring them would be a disservice. Chief among them are machine learning (Spark ML) and streaming. We consider both areas core capabilities of any future BI tool and have brought them into the product as part of the native integration with Spark.
Spark ML provides an opportunity to bring Machine Learning to the masses, and we view Style Intelligence as the bridge from Spark ML to business users. The integration has the following objectives:
Spark ML provides built-in algorithms in the following areas:
The first three are available in Style Intelligence and can be accessed as a regular query without any programming or special knowledge.
Streaming is another area that is revolutionizing BI applications. Not long ago, real-time data processing was considered a niche requirement and often reserved for the most specialized applications. With the advent of Kafka and Spark, streaming has been brought into the mainstream.
However, many challenges remain. Compared with batch processing, streaming needs to deal with many new problems such as back pressure and late arriving records. In addition, programming is often required to create a streaming application.
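The late-arriving-record problem can be illustrated with a small sketch of one common technique, an event-time watermark. This is plain Python written for illustration only. It is not how Style Intelligence or Spark Streaming implements it, and the function name, window size, and lateness threshold are all assumptions. Records are counted into fixed windows by event time, and a record is dropped only when its window has already closed relative to the watermark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event time in seconds, payload); hypothetical


def count_by_window(
    events: Iterable[Event], window: int = 10, allowed_lateness: int = 5
) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per tumbling window, in arrival order.

    The watermark trails the largest event time seen so far by
    `allowed_lateness`. An event is dropped if its window ended at or
    before the watermark; otherwise it is counted, even if it is late.
    """
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    max_seen = float("-inf")
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        window_start = (event_time // window) * window
        if window_start + window <= watermark:
            dropped.append((event_time, payload))  # window already closed
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order for the records at t=8 and t=9.
arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (27, "e"), (9, "f")]
counts, dropped = count_by_window(arrivals)
print(counts, dropped)
```

The record at t=8 arrives out of order but is still counted, because its window is within the allowed lateness. The record at t=9 arrives after the watermark has passed its window and is dropped. Back pressure is a separate concern and is not covered by this sketch.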
Spark Streaming provides an abstraction of stream processing that nicely fits into the overall Spark framework. We are actively working on integrating streaming into Style Intelligence. The goal is to simplify stream processing and make it accessible to regular users without programming skills. To a large degree, building a query against a stream should not be much different from a query against a database.
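The idea that a query against a stream should look much like a query against a database can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Spark Streaming API and not the planned Style Intelligence design. The names `total_by_product` and `run_on_stream` and the sample rows are hypothetical. The same aggregation is defined once, then applied either to a whole table or incrementally to small batches of a stream.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[str, int]  # (product, quantity); hypothetical schema


def total_by_product(rows: Iterable[Row], state: Optional[Counter] = None) -> Counter:
    """The 'query': sum quantity per product, optionally continuing prior state."""
    totals = Counter() if state is None else state
    for product, qty in rows:
        totals[product] += qty
    return totals


def run_on_stream(batches: Iterable[List[Row]]) -> Iterator[Dict[str, int]]:
    """Apply the same query to each small batch and emit the running result."""
    state: Counter = Counter()
    for batch in batches:
        state = total_by_product(batch, state)
        yield dict(state)


table = [("pen", 2), ("ink", 1), ("pen", 3), ("ink", 4)]

# Query the data as a static table.
batch_result = dict(total_by_product(table))

# Query the same data arriving as two small batches.
stream_results = list(run_on_stream([table[:2], table[2:]]))
print(batch_result, stream_results)
```

Once every batch has arrived, the running result of the stream matches the result of the table query. The user writes the query once and only the way data is fed to it changes.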
Big Data is more than throwing a cluster together and connecting to it through a JDBC driver. It requires an architecture that has all the pieces working closely together. It's our view that the most effective solution requires a complete fusion of the various technologies.
The Style Intelligence/Spark integration is the first step toward achieving this vision. As innovation continues to flourish, our mission is to create a platform that integrates effortlessly with software in the Hadoop ecosystem, so business users can enjoy all the benefits.
InetSoft Technology Corp.