This is the continuation of the transcript of a Webinar hosted by InetSoft on the topic of "The Newest Buzz Word in BI: Big Data." The speaker is Abhishek Gupta, product manager at InetSoft.
We hit on a lot of the big companies in this Big Data space, but what other companies should we be keeping an eye out for now and in the future? It’s important to understand the history of Hadoop, which is that it actually came out of Yahoo. And so anyone who’s split off from Yahoo to focus on Hadoop has a lot of credibility because they’ve probably been engaged in the project heavily.
Both Cloudera and Hortonworks have that kind of lineage. Hortonworks really is almost like a spin-off from Yahoo, so they maybe have a stronger claim there, but on the other hand, Cloudera is really the predominant company, I would say, in the space, and their distribution of Hadoop is the most widely deployed.
So those two are big. As for other players that are important to keep an eye on, there is one company called Hadapt, so it’s like the word adapt with an H at the beginning. They actually combine Hadoop and massively parallel processing, the technology I was discussing before with respect to the SQL parallel data warehouse edition. They combine the two into their own product.
There is also a company called MapR, M-A-P-R, with the R capitalized, and for the longest time I considered them an also-ran, but just in the month of June they became much more important than I, at least, had thought, because they announced that their distribution of Hadoop is now available on Amazon Web Services as an alternative to Amazon’s own distribution.
They also announced with Google that Google’s new Compute Engine, which is their infrastructure-as-a-service cloud, will also offer the MapR distribution of Hadoop, and putting Hadoop in the cloud is a nice way of doing things.
Maybe we can talk about that in a little bit, but by and large, the thing to remember is that if you can just go to a web browser and provision the whole cluster, that can certainly be a lot easier than building out the cluster yourself with physical machines in your data centers, especially if you only need it for a discrete amount of time.
IBM has its own distribution as well, although at this time, they’re offering Cloudera’s distribution on top of their own so I think that’s going to have a lot of popularity, and we’ve already talked about Microsoft working in concert with Hortonworks, so there is that.
A lot of the BI companies are morphing into Big Data companies, too, so keep an eye on them. A big one there is Tableau. They have a big data visualization product of the same name, and that really kind of started as BI data visualization, and it still is a bit of a trick connecting through Hive to get to Hadoop. They’re also a Hadoop tool, and they’re certified as working with Cloudera’s distribution of Hadoop.
Where do I see Big Data going in the near future, and out to a couple of years from now, maybe five years? It’s always fun to do some crystal ball stuff. I mentioned before that there is this thing called Hive, which creates a SQL abstraction over Hadoop, and that’s how a BI stack talks to Hadoop, but I think I also mentioned that’s how other visualization tools talk to Hadoop.
Actually, if you look at the whole field out there, that’s how pretty much all the tools that started life as BI tools now have a Big Data story. It’s through Hive. And I think that’s fine as far as it goes, but I do think that that’s a little bit of a stop-gap measure, in that in effect all these tools that were designed to work on relational data sources (InetSoft’s has branched out into other data source types like XML feeds and spreadsheets, for instance)
are interacting with Hadoop by abstracting it as a relational data source, and that’s really not what it is. The native way to talk to Hadoop is by writing MapReduce jobs in Java code, and I don’t know if that’s especially efficient. So we have an inefficient but very directly controlled approach of using Java. We’ve got that rather bland approach of writing SQL queries that get compiled into Java, which may or may not be efficient once it’s compiled. I believe we’re going to see some middle ground.
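To make the contrast concrete, here is a minimal sketch of the map, shuffle, and reduce steps that a SQL query such as SELECT region, SUM(amount) ... GROUP BY region roughly corresponds to. It is written in plain Python and runs in memory rather than on a Hadoop cluster; the sales records and the function names are made up for illustration and are not part of Hadoop or Hive.

```python
from collections import defaultdict

def map_phase(records):
    # Emit one (key, value) pair per input record.
    for region, amount in records:
        yield region, amount

def shuffle(pairs):
    # Group all values that share a key, as the framework would
    # do between the map and reduce steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's list of values into a single total.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input data, standing in for rows in a table.
sales = [("east", 10), ("west", 5), ("east", 7)]
totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)  # {'east': 17, 'west': 5}
```

The SQL version hides these three steps behind one statement, which is the convenience, and also the loss of direct control, described above.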
I believe that Java is not going to be the only language in which MapReduce jobs can be written, that things will be more pluggable with Hadoop, and that a lot of the tools that we’re using will be able to talk to Hadoop more directly and more efficiently.
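One form this pluggability can take is Hadoop’s Streaming utility, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key and value lines to standard output. The sketch below shows a word count in that style, assuming Python as the language; the function names and the run_locally helper are hypothetical, and the helper only imitates the sort that Hadoop performs between the two steps so the logic can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word, the line format a
    # streaming-style job passes between steps.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # Input arrives sorted by key, so equal words are adjacent
    # and can be summed one group at a time.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_locally(lines):
    # Hypothetical helper: sorting stands in for the shuffle step.
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["Big Data big", "data"])
print(result)  # ['big\t2', 'data\t2']
```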
I also think what that means is that the skill set of working with Hadoop is going to become more prevalent, really because it has to, and because people want to follow the money, and you’ll have some insight from doing Hadoop analysis. I think these are going to be lucrative opportunities for people who get these skill sets, and for all of that to be based just on Java programming, I think, is probably less than reasonable. So I think we’re going to see Hadoop opening up a lot, and I think we’re going to see a lot more people with the competency to work with it.