InetSoft Webinar: Big Data Doesn’t Have to Have Data Quality Problems

This is the continuation of the transcript of a Webinar hosted by InetSoft on the topic of "What is Big Data and What isn't?" The speaker is Abhishek Gupta, Product Manager at InetSoft.

There is this idea out there that, you know, Big Data, because it’s unstructured or because some of the use cases are social media data, inherently has a quality problem, right? That it’s unimaginable that anything that doesn’t go through a traditional ETL process could be utilized, right? Which is just untrue, right? So, there are a couple of reasons why this is untrue.

The first is that oftentimes we load quality data into the system, right? It may be Hadoop data, right? It may be about long field data, right? It may be data that we keep around for a long time, but it’s data that’s deemed to be the official copy, right? And we loaded it in because we wanted it as part of the reference, right? So, you can load the quality data in, right? It’s not like you can only go through trash cans and put stuff like that in these Big Data platforms, right? So, you can start with quality data.

Number two is you can use your data quality tools against these environments, right? So, for example, one thing that we’re talking about this coming year, and it’s something that I’ve gotten a chance to work on in quite a few customer projects now, is that our information profiling, data quality, and MDM work against the Hive construct, so you can basically use your MDM notions against data that’s stored in these Big Data environments, right?
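To make the profiling idea a little more concrete, here is a minimal sketch in Python. It assumes the rows have already been pulled out of a Hive table into plain dictionaries. The `profile_column` helper, the field names, and the sample rows are all hypothetical and are not part of any InetSoft product or Hive API. It only illustrates the kind of completeness and duplicate checks a profiling tool might run.

```python
from collections import Counter


def profile_column(rows, column):
    """Return basic quality metrics for one column of a list of dict rows.

    Hypothetical helper for illustration only.
    """
    values = [row.get(column) for row in rows]
    total = len(values)
    # Treat None and empty or whitespace-only strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "total": total,
        "missing": total - len(present),
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(counts),
        # Number of extra occurrences beyond the first of each value.
        "duplicates": sum(c - 1 for c in counts.values()),
    }


# Made-up sample rows standing in for the result of a query against Hive.
rows = [
    {"customer_id": "C001", "email": "a@example.com"},
    {"customer_id": "C002", "email": ""},
    {"customer_id": "C002", "email": "b@example.com"},
    {"customer_id": "C003", "email": None},
]

id_profile = profile_column(rows, "customer_id")
email_profile = profile_column(rows, "email")
print(id_profile)
print(email_profile)
```

In this made-up sample, the repeated `customer_id` shows up as a duplicate and the blank and null emails pull completeness down, which is the kind of signal you would feed into a data quality or MDM rule.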

The other reason is that there is bad information out there. And the third reason, which is kind of interesting and does get into the volume and velocity side of this, is that oftentimes we’re loading data directly from the generating source, right? So, it doesn’t mean that it’s always 100% right, but it darn well better be good quality data or you’ve got bigger issues, right?

They try to tell a story that, “Oh, this is for stuff that you are not going to put into an Oracle database anyway, right? And it’s just bad information, right?” So, you know, hopefully this won’t come up again, and I'm sure it will, but the answer is no, it’s for any data source, structured, unstructured, and semi-structured, that you can think of. It really doesn’t matter how it gets stored.

So if you’re taking telematics directly off the car, that’s presumed to be the right data. It’s presumed to be good data. It doesn’t mean it’s perfect, right? But it also doesn’t mean that it’s inherently flawed, right? If you’re taking data off a Big Cloud over in Europe, it’s presumed to be good data, right? If you’re reading data directly off your web servers, presumably you trust that what they’re telling you they’re doing is in fact what they’re doing.

Now, that doesn’t mean that the data is inherently perfect. It doesn’t mean that you don’t need to deal with glitches, and obviously those things can and do happen. But the data source itself is the master source, or at least, certainly, oftentimes the source of record, right? If the data coming off that is bad, well, Big Data quality is not your biggest issue, because your underlying source system is sick, right?

So, again, I think it’s just that people either are not thinking this stuff through or, in this one vendor’s case, they are trying to throw FUD out there and distract from new ways to impact new things.

Data coming in from lots of different types of sources and machine learning -- this has become a really super hot topic. One thing I've been hearing is machine learning prevents human biases. I feel this is not true.

This one is a related one. Machine learning is something you do in real time. It really kind of irks me now. I’m sure, given my soft, fuzzy demeanor, it’s really hard for people to imagine, but this came directly out of a Director of Marketing’s mouth, from someone who frankly should know better and from a company that is trying to spread the myth themselves.

That’s an example of a Big Data company not actually having any Big Data experience, right? We started at some beginning point, too. It happened to be 5, 6 years ago, right? You have to have some idea of what the hell you’re talking about when you go make declarative statements, right? And this declarative statement was that the use of machine learning prevents human bias, which is really an inane thing to say, right? Because it shows that you don’t understand human bias. It also shows you don’t understand machine learning.
