Who reigns supreme in the world of data management? The data warehouse, of course. Over the past two decades, while other systems and software have evolved through iteration after iteration, changed beyond recognition, or been abandoned entirely for newer models, the stalwart data warehouse has stood tall. She may have had the occasional discreet facelift, and she may have inspired some less impressive imitations, but nothing has threatened her reign for long.
Until now. Ever since Hadoop appeared on the scene, there have been mutterings that this shiny new star is snapping up some of the best data management roles, roles that, until a few years ago, would have gone to data warehousing without question.
But is it really time for data warehousing to retire? Is Hadoop even trying to fill her shoes? And who else is waiting in the wings?
Let's take a closer look at these supposed competitors.
What's behind the enduring appeal of data warehousing?
Simply put, data warehousing means aggregating data from disparate sources into a central repository for reporting and analysis. It has remained a practical solution for so long because, as the data is aggregated, it passes through an extract, transform, and load (ETL) process that harmonizes it into a "single version of the truth," smoothing out inconsistencies and reformatting the data to fit predefined schemas.
The result is a complete, reliable, and consistent source of data that can be queried by business intelligence software.
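To make the ETL step concrete, here's a minimal sketch in Python. The source systems, field names, and target schema are all hypothetical; real warehouses use dedicated ETL tooling, but the harmonization looks broadly like this:

```python
# Minimal ETL sketch: harmonize two hypothetical sources into one schema.
from datetime import datetime

# Extract: records as they arrive from two (hypothetical) source systems.
crm_rows = [{"customer": "Acme", "signup": "2023-01-15", "revenue_usd": "1200"}]
erp_rows = [{"client_name": "Acme", "created": "15/01/2023", "rev": 1200.0}]

def transform_crm(row):
    # Transform: reshape into the warehouse's predefined schema.
    return {
        "customer_name": row["customer"].strip().title(),
        "signup_date": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
        "revenue": float(row["revenue_usd"]),
    }

def transform_erp(row):
    # Same target schema, different source format.
    return {
        "customer_name": row["client_name"].strip().title(),
        "signup_date": datetime.strptime(row["created"], "%d/%m/%Y").date(),
        "revenue": float(row["rev"]),
    }

# Load: after transformation, both sources agree on one version of the truth.
warehouse = [transform_crm(r) for r in crm_rows] + [transform_erp(r) for r in erp_rows]
print(warehouse)
```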
What exactly is Hadoop?
It's an open source programming framework for users who need to work with massive data sets. Using a distributed storage system, it gives users a way to store, clean and process large amounts of data.
To move data at high speed, the Hadoop Distributed File System (HDFS) spreads data across thousands of commodity hardware nodes. Even if many nodes fail, the system stays up and running, which means a low risk of data loss, a real fear for organizations that run very complex analysis on large amounts of data.
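That resilience comes from replication: HDFS splits files into blocks and stores each block on several nodes (three by default). The toy Python sketch below simulates the idea; the cluster size, node names, and block names are all invented for illustration:

```python
import random

# Simulate HDFS-style block replication across commodity nodes.
NODES = [f"node-{i}" for i in range(10)]   # hypothetical ten-node cluster
REPLICATION = 3                            # HDFS's default replication factor

def place_blocks(blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes."""
    return {b: random.sample(nodes, replication) for b in blocks}

placement = place_blocks([f"block-{i}" for i in range(8)], NODES, REPLICATION)

# Knock out a few nodes, then check every block is still readable somewhere.
failed = {"node-2", "node-5", "node-7"}
surviving = {b: [n for n in holders if n not in failed]
             for b, holders in placement.items()}
lost = [b for b, holders in surviving.items() if not holders]
print("blocks lost:", lost or "none - every block survives on another node")
```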
It's no wonder Hadoop is turning heads in an industry looking for a reliable way to run big data processing tasks.
Additionally, it's open source, and that's a huge draw. It's endlessly scalable and endlessly customizable: the scope for adding custom applications, queries, and methods is practically limitless, and your data mining can grow in sophistication along with the volume and complexity of your data.
Where is it better than data warehousing?
Big data is getting bigger, and many large data warehouses turn to customized multiprocessor appliances to cope with skyrocketing storage demands. But only the largest organizations can afford to pay for them.
Hadoop, meanwhile, has the flexibility to handle snowballing volumes of data. Users can combine it with a warehouse-style query layer built on top, whether that's SQL-on-Hadoop software such as Presto or Hive, which works in a similar way, or a NoSQL store such as HBase.
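As a sketch of what that query layer looks like in practice, here's an example using the PyHive client to run standard SQL against a Hive table whose files live in HDFS. The host, table, and column names are assumptions, and it presupposes a running HiveServer2:

```python
# Sketch: querying HDFS-resident data through Hive's SQL layer.
# Assumes a reachable HiveServer2 and a hypothetical `page_views` table.
from pyhive import hive  # pip install pyhive

conn = hive.connect(host="hive.example.internal", port=10000, username="analyst")
cursor = conn.cursor()

# Standard SQL, even though the underlying files live in HDFS.
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date >= '2024-01-01'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")
for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()
```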
But that doesn't mean Hadoop is going to displace relational databases or data warehouses. In fact, as we'll see shortly, it works best as a supporting act, not a replacement.
So are they competitors?
Not at all. Simply put, they don't play the same role.
Data professionals tend to see Hadoop as a complement to their existing data warehouse architecture, and one that can save them a lot of cash. By migrating chunks of data to Hadoop, they can take pressure off their relational databases, making the data warehouse platform cheaper to run and easier to scale without sacrificing performance.
In this way, Hadoop reduces the total cost of data warehousing rather than replacing any part of it.
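A common version of this pattern is offloading cold, rarely queried data out of the expensive relational tier. In the sketch below, SQLite stands in for the warehouse and the archive file is destined for bulk-loading into Hadoop; every table and column name is hypothetical:

```python
# Sketch: archive cold warehouse rows to a file bound for Hadoop storage.
# SQLite stands in for the warehouse; all names here are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "2019-03-01", 250.0),   # cold: old enough to offload
    (2, "2024-06-10", 410.0),   # hot: stays in the warehouse
])

CUTOFF = "2023-01-01"

# Extract cold rows, write them out for bulk-loading into HDFS,
# then delete them to shrink the expensive relational tier.
cold = conn.execute("SELECT * FROM sales WHERE sale_date < ?", (CUTOFF,)).fetchall()
with open("sales_archive.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "sale_date", "amount"])
    writer.writerows(cold)

conn.execute("DELETE FROM sales WHERE sale_date < ?", (CUTOFF,))
conn.commit()
print(f"offloaded {len(cold)} cold rows; the warehouse keeps only hot data")
```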
How does it make data warehouses perform better?
Data warehouses are expensive to build, expensive to run, and expensive to grow. As the amount of data collected grows, storage requirements and expenses grow exponentially.
In addition, these huge collections of data mean users can't access the full scope of the data warehouse every time they run a query; the hardware simply can't handle it. Instead, organizations carve out analytic datasets (data marts) that give individual departments access to just the slice of the warehouse they need.
It's an imperfect system. Not only does it limit the scope of the analysis users can perform on the data, it's also a ticking time bomb.
As more and more data pours into the warehouse, each dataset can become so bloated that it's difficult to use. You can take the pressure off the hardware by restricting access, but that means giving departments narrower and narrower options for analyzing the data. For serious business intelligence, that's not good enough.
Hadoop doesn't suffer from these setbacks. The barrier to entry is low: it's open source, and the investment is incremental. You can build it up over time, expanding to hold ever more data without spending a fortune to match.
For companies that are new to the data game and don't have an existing investment in a mainframe or Unix-based data warehouse, this scalable, incremental framework is very appealing. But Hadoop is a framework, not a complete solution. It's great at handling huge data sets, but it was never intended as a replacement for the data warehouse.
So are Hadoop and data warehouses the ultimate BI dream team?
Whoa, hold on a second. Using Hadoop with a data warehouse solves the data storage problem. But storing data is only one element of BI.
Broadly speaking, a functional, usable BI system should consist of five components:
1. Somewhere to store several kinds of data.
2. Tools for segmenting this data, e.g., by geography, operation, or other business need.
3. Tools for preparing the data for analytics.
4. An ETL data engine to help you process this data fast.
5. A way to display all of this data on the front end (usually a dashboard of some kind).
Even in the best-case scenario, Hadoop and the data warehouse working together handle only the first of these components. And innovations in BI technology that deliver all five components at once are quickly relegating this dream team to the B-list.
So who's out to steal the limelight?
As we've seen, data warehousing and Hadoop make a successful double act. But to run fast, high-performance analysis on data from multiple sources, you don't actually need either of them. Now we're witnessing the rise of a new star.
The overall "single stack" solution eliminates the need for a relational database, links directly to the source data wherever it comes from, and performs ELT functions in the field. The best job is to create a metadata (abstraction) layer that can be used to query data in any number of tables, drawn from any source in any format.
This approach solves the problems that usually accompany huge data sets with smart, disk-sparing techniques such as columnar databases and in-memory processing. Processing is kept simple by loading only the data that's actually being used into the computer's main memory, rather than hogging RAM with data nobody has asked for. This means you get full, unrestricted access to all your data without needing a computer the size of the Hollywood Hills to process it.
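As a toy illustration of why a columnar layout saves work, here's a self-contained Python sketch. It isn't a real engine; it just contrasts row-oriented storage, where a query drags every field of every row along, with reading a single contiguous column:

```python
# Toy contrast: row-oriented vs column-oriented access to the same data.
rows = [
    {"region": "EMEA", "units": 120, "revenue": 9600.0},
    {"region": "APAC", "units": 85,  "revenue": 7300.0},
    {"region": "AMER", "units": 140, "revenue": 11900.0},
]

# Row store: rows sit together, so summing one column means scanning
# whole records (on disk, you'd read every field of every row).
total_row_store = sum(r["revenue"] for r in rows)

# Column store: each column lives contiguously, so a query that needs
# only `revenue` touches just that one array and ignores the rest.
columns = {
    "region":  [r["region"] for r in rows],
    "units":   [r["units"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store
print("revenue total:", total_column_store)
```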
An all-singing, all-dancing superstar
Even better, a full single-stack BI system removes the need for additional layers of software just to make the data intelligible to non-technical users.
As we've seen, data warehouses and Hadoop fall short because they are strictly "back-end" solutions: they handle only the storage end of the data stack.
In order for your front-end users to access the data, you still need to introduce and integrate a variety of applications that allow business teams to extract and visualize the insights they need.
While Hadoop is open source, it's not "free." Getting it to do what you want, and integrating it with your data warehouse, your data preparation and processing tools, and your front-end dashboard interface, requires either a significant investment of in-house resources or bringing in a third party to manage it all. Plus, of course, you still need to invest in the hardware it runs on.
With a decent single-stack alternative, you can query the source data, process it rapidly with an ETL data engine, and generate fresh reports and dashboards, all in a single step. It's this innovation that challenges the future of data warehousing, Hadoop or no Hadoop.
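For a feel of that one-step flow, here's a hedged sketch using pandas as a stand-in for the stack. The sources and column names are invented, and a real single-stack product would connect live to the source systems rather than to inline sample data:

```python
# Sketch: source-to-dashboard in one pass, pandas standing in for the stack.
import pandas as pd

# "Connect" to two hypothetical sources (inline here for self-containment).
crm = pd.DataFrame({"customer": ["Acme", "Globex"], "region": ["EMEA", "AMER"]})
orders = pd.DataFrame({"customer": ["Acme", "Acme", "Globex"],
                       "amount": [1200.0, 800.0, 2500.0]})

# Transform in memory (the ELT step) and aggregate for the dashboard,
# with no intermediate relational warehouse in between.
report = (orders.merge(crm, on="customer")
                .groupby("region", as_index=False)["amount"].sum()
                .rename(columns={"amount": "total_revenue"}))
print(report)
```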
So, yes, maybe it's time for this national treasure to step back and let the next generation of data technology take over. Not because Hadoop has stolen her crown, but because single-stack technology is making separate data storage solutions redundant for BI.