Analyzing Streaming Data

Below is the continuation of the transcript of a webinar hosted by InetSoft on the topic of "What Machine Learning Means for Company Analytics." The presenter is Abhishek Gupta, Chief Data Scientist at InetSoft.

This ties into what we talked about in terms of deploying these models in production in real time. The earlier points we made about automation apply, I think, in a typical organization, and even in the leading-edge Bay Area internet startups, where there is still this distinction between software engineers and data scientists. What is this distinction?

Software engineers tend to be very good at writing code. They're very disciplined. Their code is very nice. It scales. It's easy to maintain. Data scientists are very good at the analysis of data and making sense of data. Usually, when a data scientist does an analysis and builds a model, they end up having to hand over their model to a software engineer, who then rewrites it for the production environment, according to the standards of that environment.

On the other hand, a software engineer can do some of what a data scientist does, because obviously there are these easy-to-use machine learning libraries that they can use for themselves. It's possible that they do some of that, but I think if you talk to many people who work with data, they will tell you that for many software engineers, their strength is really in writing something that's been well specified.


Data Scientists and Software Engineers

With data scientists, their strength is in data discovery and in learning what the right model to use is. Whenever there's a little bit of ambiguity around the project, it becomes harder for a software engineer to replicate what a data scientist can bring to the table.

With streaming data, on the other hand, there is a distinction between processing and simple analysis. At massive scale, even simple things like counting the top items become difficult. Then there are companies out there now who are able to do massive-scale analytics of that sort, and also anomaly detection and correlations at massive scale. They tend to be the leading-edge companies. Then there is online learning. I think there are companies doing that, but it's still not a common thing to do.
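To make the "top items" point concrete: exact counts need memory proportional to the number of distinct items, which is what breaks down at massive scale. One common workaround is an approximate heavy-hitters summary. The sketch below uses the Misra-Gries algorithm, which is chosen here for illustration and is not something named in the webinar; the function name `misra_gries` and the sample stream are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Approximate heavy hitters using at most k - 1 counters.

    Any item that occurs more than n / k times in a stream of length n
    is guaranteed to survive in the result. Stored counts are lower
    bounds: each underestimates the true count by at most n / k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of item ids; "a" makes up half of the events.
sample_stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b", "a", "e"]
candidates = misra_gries(sample_stream, k=3)
```

Memory stays fixed at `k - 1` counters no matter how many distinct items appear. If exact counts matter, a second pass over stored data can verify the surviving candidates.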

Our next topic was, I believe, the third topic in the agenda. It's this idea that in a large organization, there's just going to be a lot of different business intelligence software floating around. That's just the reality of today. I mean, there are all these great free tools. These things exist, and they're high quality, and people use them for a good reason. I've been involved in work on how a clustering procedure could help users choose the number of clusters, and we've done basic research on variable selection for support vector machines and for regression. I think it's clear to everyone that algorithms are really a commodity now.

Organizations, especially large organizations, are just going to have different software around. The question is how to stop fighting over it ("I like this one, and you like that one") and how to use all of these things productively in a way that benefits your organization and solves business problems. Of course, on the Open Source side of things there's just so much flexibility, with so many different algorithms to choose from, being developed at universities.


Driving Open Source Contributions

Now we see large companies contributing to different Open Source libraries and driving the direction of Open Source packages. For anyone thinking about this from a managerial perspective, I think one interesting thing to consider is how you get involved in an Open Source project and help steer it in a way that benefits your company. I mean, this is something that's possible to do.

As organizations mature in their use of streaming analytics, one of the first challenges they encounter is the need to balance raw throughput with analytical depth. High‑velocity data pipelines often prioritize speed over context, but meaningful insights require enrichment, correlation, and filtering. Extending a streaming analytics strategy means designing pipelines that can perform lightweight transformations in real time while offloading heavier computations to micro‑batch or asynchronous layers. This hybrid approach preserves responsiveness without sacrificing analytical rigor.
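A minimal sketch of the hybrid approach described above, assuming a hypothetical event shape with `device_id` and `reading` fields. The per-event step (`enrich`) stays cheap, and the heavier step (`heavy_aggregate`, here just a per-region mean standing in for real work) runs once per micro-batch. All names are illustrative and do not come from any particular streaming platform.

```python
from typing import Any, Dict, Iterable, Iterator, List

Event = Dict[str, Any]


def enrich(event: Event, region_lookup: Dict[str, str]) -> Event:
    """Lightweight per-event transform: attach a region from an in-memory lookup."""
    enriched = dict(event)
    enriched["region"] = region_lookup.get(event["device_id"], "unknown")
    return enriched


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group a stream into fixed-size batches; the last batch may be smaller."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def heavy_aggregate(batch: List[Event]) -> Dict[str, float]:
    """Stand-in for a heavier computation: mean reading per region."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for event in batch:
        region = event["region"]
        totals[region] = totals.get(region, 0.0) + event["reading"]
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}


# Hypothetical input data.
lookup = {"d1": "east", "d2": "west"}
raw_events = [
    {"device_id": "d1", "reading": 10.0},
    {"device_id": "d2", "reading": 20.0},
    {"device_id": "d1", "reading": 30.0},
    {"device_id": "d3", "reading": 5.0},
    {"device_id": "d2", "reading": 40.0},
]

enriched_stream = (enrich(e, lookup) for e in raw_events)
batch_results = [heavy_aggregate(b) for b in micro_batches(enriched_stream, batch_size=2)]
```

In a real deployment the batch boundary would more likely be time-based than count-based, and the heavy step would run asynchronously so that it never blocks ingestion.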

Another important evolution is the shift from simple event monitoring to continuous intelligence. Early streaming dashboards typically focus on counts, thresholds, and anomaly alerts. As teams gain confidence, they begin layering in predictive scoring, pattern detection, and contextual recommendations. These capabilities allow organizations to move from reacting to events to anticipating them. For example, instead of merely flagging a spike in sensor readings, a continuous intelligence system can identify the likely cause and suggest corrective actions before failures occur.
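The threshold-and-anomaly stage that early dashboards start from can be sketched as a rolling z-score check. This is one simple technique among many, picked for illustration; the class name `RollingZScore`, the window size, and the threshold of 3 standard deviations are assumptions, not values from the webinar.

```python
import math
from collections import deque
from typing import Deque


class RollingZScore:
    """Flag a reading that is far from the mean of the preceding window."""

    def __init__(self, window: int = 50, threshold: float = 3.0) -> None:
        self.values: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x: float) -> bool:
        """Return True if x is anomalous versus the window, then record x."""
        is_anomaly = False
        n = len(self.values)
        if n >= 2:
            mean = sum(self.values) / n
            # Sample variance over the window seen so far.
            variance = sum((v - mean) ** 2 for v in self.values) / (n - 1)
            std = math.sqrt(variance)
            if std > 0 and abs(x - mean) / std > self.threshold:
                is_anomaly = True
        self.values.append(x)
        return is_anomaly


detector = RollingZScore(window=50, threshold=3.0)
normal_flags = [detector.update(v) for v in [10, 11, 10, 9, 10, 11, 10, 9]]
spike_flag = detector.update(50)
```

A check like this only reports that a spike happened. The move toward continuous intelligence described above would add context on top, such as correlating the spike with other signals to suggest a likely cause.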

Scalability also becomes a central concern as data volumes grow. What begins as a manageable stream can quickly expand into millions of events per second, especially in IoT, finance, and e‑commerce environments. A robust streaming analytics architecture must support horizontal scaling, partitioning, and fault tolerance. This ensures that even during traffic surges—such as product launches or seasonal peaks—dashboards remain responsive and alerts fire without delay. Cloud‑native platforms and containerized deployments make this level of elasticity far more achievable.
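Partitioning is what lets a stream scale horizontally: events are routed by key so that each worker handles a disjoint slice, and all events for the same key land on the same worker. Below is a minimal sketch assuming a stable hash taken modulo the partition count. The function name `partition_for` and the key format are hypothetical, and real platforms ship their own partitioners.

```python
import hashlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here purely for its stability.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(keys: List[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group keys by the partition they would be sent to."""
    assignment: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    for key in keys:
        assignment[partition_for(key, num_partitions)].append(key)
    return assignment


# Hypothetical device keys spread over four partitions.
device_keys = [f"device-{i}" for i in range(20)]
assignment = route(device_keys, num_partitions=4)
```

One caveat of the modulo scheme is that changing the partition count remaps most keys. Consistent hashing is a common alternative when partitions are added or removed often.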

Operationalizing machine learning within streaming pipelines is another frontier. Traditional batch‑trained models often degrade when applied to real‑time data because patterns shift rapidly. To address this, organizations are adopting online learning techniques, model retraining schedules, and drift‑detection mechanisms. Embedding these capabilities into the streaming platform ensures that predictions remain accurate and trustworthy. It also reduces the friction between data science and engineering teams by providing a shared environment for deployment and monitoring.
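As a rough illustration of a drift-detection mechanism, the sketch below compares a model's recent prediction error against a baseline captured right after deployment. It is a deliberately simple windowed heuristic, not a named statistical test, and the class name `ErrorDriftMonitor`, the window sizes, and the tolerance are all assumptions made for this example.

```python
from collections import deque
from typing import Deque, List


class ErrorDriftMonitor:
    """Signal drift when recent mean error exceeds the baseline mean by a tolerance."""

    def __init__(self, baseline_size: int = 100, recent_size: int = 50,
                 tolerance: float = 0.1) -> None:
        self.baseline: List[float] = []
        self.baseline_size = baseline_size
        self.recent: Deque[float] = deque(maxlen=recent_size)
        self.recent_size = recent_size
        self.tolerance = tolerance

    def update(self, error: float) -> bool:
        """Record one prediction error; return True if drift is suspected."""
        # Phase 1: the first errors after deployment form the baseline.
        if len(self.baseline) < self.baseline_size:
            self.baseline.append(error)
            return False
        # Phase 2: keep a sliding window of the most recent errors.
        self.recent.append(error)
        if len(self.recent) < self.recent_size:
            return False
        baseline_mean = sum(self.baseline) / len(self.baseline)
        recent_mean = sum(self.recent) / len(self.recent)
        return recent_mean - baseline_mean > self.tolerance


monitor = ErrorDriftMonitor(baseline_size=4, recent_size=2, tolerance=0.1)
stable_flags = [monitor.update(0.1) for _ in range(6)]
drift_flag = monitor.update(0.5)
```

A drift signal like this would typically trigger the retraining schedule mentioned above, or switch the pipeline to a model that is updated online.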

Finally, governance and observability must evolve alongside the streaming architecture. Real‑time systems introduce new risks: misfiring alerts, runaway compute costs, and silent data corruption can all occur if pipelines are not monitored closely. Extending the platform with lineage tracking, audit logs, performance dashboards, and automated quality checks helps maintain reliability at scale. When governance is integrated directly into the streaming workflow, organizations can innovate rapidly while maintaining confidence in the accuracy and stability of their real‑time analytics.
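An automated quality check of the kind mentioned above can be as small as a validation function applied to each event before it reaches a dashboard. This sketch assumes a hypothetical event schema (`device_id`, `timestamp`, `reading`) and an arbitrary valid range for readings; both are placeholders to adapt to the actual data.

```python
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]

REQUIRED_FIELDS = ("device_id", "timestamp", "reading")  # assumed schema
READING_RANGE = (-50.0, 150.0)  # assumed valid range, for illustration only


def check_event(event: Event) -> List[str]:
    """Return a list of problems found; an empty list means the event passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if "reading" in event:
        reading = event["reading"]
        if isinstance(reading, bool) or not isinstance(reading, (int, float)):
            problems.append("reading is not numeric")
        elif not (READING_RANGE[0] <= reading <= READING_RANGE[1]):
            # NaN also lands here because every comparison with NaN is False.
            problems.append("reading out of range")
    return problems


def split_by_quality(events: List[Event]) -> Tuple[List[Event], List[Tuple[Event, List[str]]]]:
    """Separate clean events from quarantined ones, keeping the reasons for auditing."""
    clean: List[Event] = []
    quarantined: List[Tuple[Event, List[str]]] = []
    for event in events:
        problems = check_event(event)
        if problems:
            quarantined.append((event, problems))
        else:
            clean.append(event)
    return clean, quarantined


# Hypothetical incoming events.
incoming = [
    {"device_id": "d1", "timestamp": 1700000000, "reading": 21.5},
    {"device_id": "d2", "timestamp": 1700000001, "reading": 999.0},
    {"device_id": "d3", "reading": "n/a"},
]
clean_events, quarantined_events = split_by_quality(incoming)
```

Keeping the rejected events together with their reasons, instead of dropping them silently, gives the audit trail that makes silent data corruption visible.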
