InetSoft Webinar: Buzz Around Big Data

This is the transcript of a Webinar hosted by InetSoft on the topic of "Buzz Around Big Data." The speaker is Abhishek Gupta, product manager at InetSoft.

There is no doubt about it. There is a lot of buzz around Big Data. The definition of Big Data is not exactly clear. It can mean different things to different people. One definition is that Big Data is essentially the tools and technologies that make managing or getting value from data at extreme scale affordable or economical. And that seems like a very simple definition.

It seems like a very simple definition, but I think it's the key to the way we are telling clients they need to think about Big Data. Extreme scale makes sense, except that what counts as extreme scale keeps changing. Big Data is not two terabytes or two petabytes. There is no fixed demarcation.

It's whatever is not affordable for you today but can become affordable with new techniques and technologies. It's the frontier of data management, and it's about the techniques and technologies. There is no single magic technology box where you dump in your data, turn the crank, and out come valuable insights.

A lot of people think Hadoop is Big Data. A lot of people equate Big Data with Hadoop. Hadoop is part of the story, but it's not the whole thing. In fact, the way I have been explaining it to clients is that there is kind of a 2x2 matrix. One dimension is latency: high latency is batch, and low latency is real time.


The other dimension is structure, from highly structured (a single schema) to lightly structured or unstructured. So if you think about that 2x2 matrix, each quadrant has its own set of technologies. Hadoop, of course, is more in the high latency, unstructured space. In the low latency, structured space, for instance, there are things like in-memory technologies being used for analytics to support more real-time question-and-answer workloads. That's also a piece of it.
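The 2x2 matrix described above can be sketched as a simple lookup. Hadoop and in-memory analytics are the quadrants the speaker names; the other two entries are illustrative assumptions, not from the talk:

```python
# Sketch of the 2x2 matrix: (latency, structure) -> example technologies.
# Hadoop and in-memory analytics come from the talk; the other two
# quadrant entries are assumptions for illustration.
big_data_matrix = {
    ("high latency", "unstructured"): ["Hadoop"],
    ("high latency", "structured"): ["MPP data warehouses"],    # assumption
    ("low latency", "structured"): ["in-memory analytics"],
    ("low latency", "unstructured"): ["stream processing"],     # assumption
}

for (latency, structure), techs in big_data_matrix.items():
    print(f"{latency} / {structure} -> {', '.join(techs)}")
```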

To be clear, when I say unstructured I mean things like text feeds from social media, and by structured I mean things like relational databases. One of the things people associate with Hadoop is unstructured data. But the truth is, what we see is that Hadoop is a distributed file system, so you put unstructured files in it, but most of those files have some structure in them, like a web log.

So even though it's a file and called unstructured, it's not like we see people using Hadoop for email or free-text processing. It's more for files that have enough structure to allow parsing and doing analytics on them in a very scalable way.
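The point about web logs can be made concrete with a small sketch: a line in a "unstructured" log file actually has enough regularity to parse into fields. The Common Log Format and the regex here are assumptions for illustration, not something the speaker specified:

```python
import re

# Parse one line of an Apache-style Common Log Format web log (an
# assumed format for illustration) into structured fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = '127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_line(sample)
print(record["path"], record["status"])
```

At scale, a function like `parse_line` is exactly the kind of per-record logic that a map step would run over billions of log lines in parallel.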

Where is this extreme frontier now, if Big Data is on the extreme? You would expect examples in life sciences, with DNA sequence data. And then there is telecom, where you are storing call data records in the billions. So first off, when we talk about Big Data technology, the way I have been explaining it to clients is that, generally speaking, it means massively parallel processing for huge workloads. It offers you a flexible analytic model, which includes late schema binding or no schema, schema-less kinds of things. You don't have to have just one schema.

Some of the columnar databases give you the capability to late-bind a schema to data you have captured in a columnar format. That gives you more flexibility. And the last thing is that they tend to be linearly scalable, so you can buy as much as you need, and when your data needs grow, you can buy some more.
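Late schema binding can be illustrated with a toy sketch: data is captured as raw columns first, and a schema (names and types) is bound only at query time, so the same captured data can serve different schemas. All names and values here are hypothetical:

```python
# Toy sketch of late schema binding over columnar data. The raw columns
# are captured as strings with no schema; a schema is applied at query
# time. Column names and values are made up for illustration.
raw_columns = {
    "c0": ["2024-01-01", "2024-01-02"],
    "c1": ["19.99", "7.50"],
}

def bind_schema(columns, schema):
    """Apply a list of (name, source_column, type) entries to raw columns."""
    return {name: [cast(v) for v in columns[src]] for name, src, cast in schema}

# Bind a schema only when querying; a different schema could be bound
# later to the same raw capture.
sales = bind_schema(raw_columns, [("day", "c0", str), ("amount", "c1", float)])
print(sales["amount"])  # [19.99, 7.5]
```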


I mean, there is no one perfect technology. When you start to see pieces of these things, and there is a business case for investing in those technologies, then you've got a Big Data problem. Typically we see folks using Big Data technologies at petabyte scale in terms of volume, but there are high velocity cases, too. For instance, a hospital connected all the medical equipment that was monitoring premature babies. I think it was 96 million data points a day that they were collecting.

And they used a streaming technology to persist that data only for a very short window, and then they ran it through a filter, and that filter does the intelligence. So they weren't even really storing it or using Hadoop, but the streaming technology they were using was massively parallel, with a flexible analytic model, and very scalable.
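The pattern described here, keeping readings only for a short window and letting a filter do the intelligence, can be sketched minimally. The window size, threshold, and alert rule are made-up assumptions, not the hospital's actual logic:

```python
from collections import deque

WINDOW = 5  # keep only the last 5 readings; nothing is stored long-term

def monitor(stream, threshold=100):
    """Yield (reading, window_average) whenever the short-window average
    crosses the threshold. Old readings fall off the window automatically."""
    window = deque(maxlen=WINDOW)
    for reading in stream:
        window.append(reading)
        avg = sum(window) / len(window)
        if avg > threshold:
            yield (reading, avg)

# Hypothetical sensor readings for illustration.
readings = [90, 95, 98, 120, 130, 140, 99, 85]
alerts = list(monitor(readings))
```

The key property mirrors the use case: the stream is never persisted beyond the window, yet the filter can still flag meaningful patterns in real time.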

That's a Big Data use case, and it didn't have anything to do with data storage. It was more about high velocity, the number of data points per period of time. When you get up into hundreds of millions of data points within a couple of hours or a day, we see the Big Data technologies beginning to take over, but there is no one dividing line.