This is the transcript of a Webinar hosted by InetSoft on the topic of "The Newest Buzz Word in BI: Big Data." The speaker is Abhishek Gupta, Product Manager at InetSoft.
We’re going to be talking about the newest buzz word in BI: Big Data. You know we all have been through Big Data, and we probably all have a certain idea what it is. You can define it in your own world and define it as how you see it.
The reason a lot of people don’t know what it is is that, honestly, there are different definitions out there. Two people will talk about Big Data and believe they’re on the same page when they aren’t. I saw a survey of small and medium businesses who were asked about Big Data, and it turned out there were about three or four different definitions prevalent out there.
The way I like to define it, as a kind of baseline, is that Big Data is the science and the practice of working with data that, in some way, shape or form, is just too big for traditional transactional databases to work with in an efficient way. Now, you know, people will go beyond that and name an amount; into the hundreds of terabytes, or into the petabyte range, is one definition.
That definition is going to be fluid, right? You know two years from now that threshold will probably have increased, because even transactional databases will be able to take on those volumes. So it’s always going to move, but I would say, you know, if it’s too big for a transactional reporting system, then it’s Big Data.
And if it’s coming in fast and furious through some kind of streaming data source that’s a good sign. You know if you’re dealing with things that are not stored in relational data sources, maybe they are log files. Maybe it’s data coming from sensors. That kind of thing is a good omen that it’s Big Data as well. And you know if that sounds a little inconsistent, it’s because it is. There are really different uses in this whole field.
Another Big Data scenario is one that incorporates multiple data sources. If you’re familiar with the concept of data warehousing, if you’ve done some work in the field of BI, you know BI and Big Data actually have a tie in. Think about pulling data from all kinds of systems, but just think about some of those systems and reporting tools not being relational databases but being something a little less traditional than that.
We’ll probably talk about BI a little bit later in the Webinar, but let’s start out with what’s started the whole Big Data craze. What product came out that tipped the scale? The underlying causes for Big Data’s popularity and feasibility honestly are that processing and storage have become far cheaper in the last several years than they ever were.
We know that things always tend to get cheaper and more sophisticated in technology. That’s not new. But it’s really gotten to the point now where a lot of the data that we used to throw away because it just wasn’t practical to keep, it is now easily kept. Storage is cheap enough where you can pretty much keep everything, and if you can keep everything then you have a lot more data and detail that you can analyze.
It turns out that analysis is quite valuable in lots of different settings. Also what happened was Google was working on some technology which by the way they kept to themselves, but they did share the underlying thoughts and engineering underneath it.
They had something called MapReduce because what they were doing, and still are doing, is crawling the Web. With the huge amounts of data that involves, relational databases just didn’t cut it for them. So they created MapReduce. They created their own file system, and to make a long story short, there is this open source project called Hadoop. H-A-D-O-O-P. Hadoop is, in fact, the open source implementation of Google’s MapReduce and the Google File System, although in Hadoop, it’s known as the Hadoop Distributed File System.
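To make the MapReduce idea concrete, here is a minimal in-memory sketch of a word count, the classic example. This is not the Hadoop API; the function names and the single-machine "shuffle" step are my own simplification of what the framework does across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in a document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Between the phases, the framework groups all values by key;
    # here we just collect them into a dictionary of lists.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce step: combine all counts for one word into a total.
    return (key, sum(values))

documents = ["big data is big", "data is data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 3, 'is': 2}
```

The point of the pattern is that the map and reduce functions are stateless, so the framework can run them in parallel on thousands of machines and only the shuffle needs coordination.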
A distributed file system is a method of storing and accessing files across multiple servers in a network, making them appear as though they are part of a single unified system. Instead of relying on one central machine to hold and manage all data, the system distributes files across several nodes, each contributing storage capacity and processing power. This architecture allows users and applications to access and manipulate data without needing to know where the files physically reside. By spreading data out across nodes, a distributed file system can handle much larger volumes of information and deliver better performance than a single-server approach.
Compared to the centralized file systems that came before, distributed file systems offer significant advantages in scalability, reliability, and fault tolerance. A single-server setup is limited by the capacity and speed of that one machine, creating bottlenecks and single points of failure. Distributed systems solve these issues by replicating data across multiple nodes, so if one server fails, the files remain available from another location. This not only improves uptime but also enables organizations to grow seamlessly as data demands increase. As a result, distributed file systems have become foundational to modern computing environments, supporting everything from cloud storage to big data analytics.
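As a toy illustration of the placement and replication just described, the sketch below hashes a block identifier to choose which nodes hold its copies. The node names and replication factor are invented for the example, and real HDFS uses a central NameNode to track placement rather than pure hashing; this only shows the general idea that copies land on several distinct machines.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION = 3  # keep three copies so one failed node loses nothing

def place_block(block_id, nodes=NODES, replicas=REPLICATION):
    # Hash the block id to pick a starting node, then take the next
    # replicas-1 nodes around the ring for the extra copies.
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

placement = place_block("file.csv/block-0")
# Three distinct nodes hold copies of this block; if any one of them
# goes down, the block is still readable from the other two.
```

Because a client can recompute the placement from the block id alone, any node can find the data without asking one central machine where every file lives.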