"If you really want to get to the bottom of what's happening in your organization's business, you're going to need a lot of very detailed data feeds." The Director of Research at the Data Warehouse Institute (TDWI), Philip? Lutham wrote in one of his latest TDWI Big Data Analytics reports. "If you really want to see something you've never seen before, this helps you mine data that has never been analyzed by business intelligence."
This is the raison d'être of big data analytics, yet neither half of the term is new on its own. The concept of big data goes back at least to the early 21st century, when storage and CPU technology were being overwhelmed by swelling data volumes and IT faced a crisis of data scalability. Nor is advanced analytics applied to large and disparate data sets, such as data mining, unprecedented. What makes the emergence of big data analytics epochal is the combination of the two: it signals the end of the data scalability crisis, says Lutham.
This makes practical sense for organizations, which already perform data mining, data analytics, and in some cases reporting on the data they collect. It is also why hands-on practices such as data sampling have long been seen as a pragmatic necessity for businesses.
"You can't put your entire data set into a data mining program. You have to select the data you need and you have to make sure it's the right data, because if you don't put in the right data, your techniques may not work." Data Warehouse Institute researcher Mark? Madsen told attendees at the Predictive Analytics Workshop.
"You can put a very small percentage of the data you collect into mining...sampling of probabilistic events." He continued, "but the decomposition will be so rare that it becomes a very rare event, making it very difficult to turn into a sample."
Ideally, you want to identify all those "rare" events that are anomalies, such as fraud, customer churn, and potential supply chain disruptions. They're the high-value stuff that's hidden in your undifferentiated data, and they're hard to find.
IBM, Microsoft, Oracle, and Teradata, along with most of the other big-name BI and data warehousing (DW) vendors, have begun selling products that integrate with Hadoop. Some are even trumpeting their implementation of the ubiquitous MapReduce algorithm.
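For readers unfamiliar with the pattern these vendors are adopting, here is a minimal single-machine Python sketch of the MapReduce idea. The transaction data and the flagging rule are hypothetical, and the three phases stand in for what Hadoop would distribute across a cluster:

```python
from collections import defaultdict

# Hypothetical input: (customer_id, transaction_amount) pairs.
transactions = [
    ("alice", 25.0), ("bob", 9100.0), ("alice", 40.0),
    ("carol", 15.0), ("bob", 8800.0), ("carol", 12.0),
]

# Map phase: emit one (key, value) pair per record; here we flag
# transactions over an arbitrary threshold as potential anomalies.
def map_phase(record):
    customer, amount = record
    return (customer, 1 if amount > 5000 else 0)

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for key, value in (map_phase(r) for r in transactions):
    grouped[key].append(value)

# Reduce phase: aggregate each key's values; here, count the
# flagged transactions per customer.
flagged_counts = {key: sum(values) for key, values in grouped.items()}

print(flagged_counts)  # {'alice': 0, 'bob': 2, 'carol': 0}
```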
These vendors aren't just talking about big data; they're talking about big data combined with advanced analytics such as data mining, statistical analysis, and predictive analytics. In other words, they are talking about big data analytics.
According to TDWI research, however, big data analytics has not yet been embraced by the mainstream. In a recent TDWI survey, just over one-third (34%) of respondents said that their organizations practice some form of advanced analytics in conjunction with big data, and in most cases they rely on very simple methods, such as data sampling.
In fact, if organizations aren't thinking about phasing out sampling and other so-called best-practice "artifacts" of the past, they're really missing the boat, said Dave Inbar, senior director of big data products at data integration specialist Pervasive Software.
"If you continue to use data sampling, you can actually work with all the data, but the science of the data is inherently weakened," he said. He said. "In the world of Hadoop, there's no reason why you can't use commodity hardware, really smart software. In the past, there may have been reasons for us to use sampled data, and there may have been reasons for economic cost considerations, or reasons for the technology not being up to snuff. But today, none of those reasons exist. Data sampling used to be the best practice solution in the past, but I think its time has passed."
"The needle-in-a-haystack problem doesn't lend itself to sampling, so you're over-emphasizing the training set in that way, which can lead to problems." Madsen, who is in charge of information management consulting, noted, "Ultimately, it's easier to run the entire dataset than it is to follow statistical algorithms closely and worry about samples. Technology can deal with issues with the data as distribution challenges arise and can access statistical methods."