In a recent article, big data expert Bernard Marr analyzed the similarities and differences between Spark and Hadoop.
Hadoop and Spark are both big data frameworks, and both provide tools for common big data tasks, but they do not perform exactly the same tasks and are not mutually exclusive.
While Spark is purported to be up to 100 times faster than Hadoop in certain situations, it does not have a distributed storage system of its own.
Distributed storage is the foundation of many big data projects today. It can store petabyte-sized datasets across the hard disks of a virtually unlimited number of ordinary computers, and it scales well: capacity grows simply by adding hard disks as the dataset grows.
Spark therefore needs third-party distributed storage, which is why many big data projects install Spark on top of Hadoop, so that Spark's advanced analytics applications can use data stored in HDFS.
The real advantage of Spark over Hadoop is speed. Spark performs most of its operations in memory, whereas Hadoop's MapReduce system writes all data back to the physical storage medium after each operation to ensure full recovery in the event of a problem. Spark's Resilient Distributed Datasets (RDDs) make the same kind of recovery possible.
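The recovery idea can be illustrated with a toy model: instead of writing every intermediate result to disk, an in-memory dataset remembers how it was derived, so a lost partition can be recomputed from its parent. The sketch below is plain Python under that assumption. The `ToyDataset` class and its methods are hypothetical names invented for illustration and are not Spark's API.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy in-memory partitioned dataset that remembers its lineage.

    Hypothetical illustration only; this is not Spark's API.
    """

    def __init__(
        self,
        source_partitions: Optional[List[List[int]]] = None,
        parent: Optional["ToyDataset"] = None,
        fn: Optional[Callable[[int], int]] = None,
    ) -> None:
        self._source = source_partitions  # stands in for durable storage
        self._parent = parent             # lineage: the dataset we derive from
        self._fn = fn                     # lineage: how each element is derived
        if source_partitions is not None:
            self.num_partitions = len(source_partitions)
        else:
            self.num_partitions = parent.num_partitions
        # In-memory results; None means "not yet computed" or "lost".
        self._cache: List[Optional[List[int]]] = [None] * self.num_partitions
        self.computations = 0  # how many times a partition was (re)built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        """Derive a new dataset; nothing is computed until it is needed."""
        return ToyDataset(parent=self, fn=fn)

    def partition(self, i: int) -> List[int]:
        """Return partition i, rebuilding it from lineage if it is missing."""
        if self._cache[i] is None:
            if self._parent is None:
                data = list(self._source[i])  # re-read the base data
            else:
                data = [self._fn(x) for x in self._parent.partition(i)]
            self._cache[i] = data
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate a failure that wipes one in-memory partition."""
        self._cache[i] = None

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyDataset(source_partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
doubled = squared.map(lambda x: x * 2)

first = doubled.collect()   # computes both partitions in memory
doubled.lose_partition(1)   # simulate losing one partition
second = doubled.collect()  # only the lost partition is rebuilt
```

In this toy run, only the lost partition of `doubled` is recomputed, and it is rebuilt from the still-cached parent rather than from the base data. A real engine has to handle far more (scheduling, shuffles, node failures), so this should be read as an intuition aid rather than a description of Spark's internals.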
In addition, Spark trumps Hadoop when it comes to advanced data processing such as real-time stream processing and machine learning.
This, along with its speed advantage, is, in Marr's opinion, the real reason for Spark's increasing popularity.
Real-time processing means that data can be fed to analytical applications the moment it is captured, and feedback can be obtained immediately.
This kind of processing is increasingly used in a wide variety of big data applications, such as the recommendation engines used by retailers and the performance monitoring of industrial machinery in manufacturing.
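As a rough illustration of the machinery-monitoring case, the sketch below handles each sensor reading as it arrives and raises an alert as soon as a rolling average crosses a limit. It is plain Python, not Spark's streaming API. The `monitor` function, the window size, the threshold, and the sample temperatures are all assumptions made up for this example.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def monitor(
    readings: Iterable[float],
    window: int = 3,
    threshold: float = 80.0,
) -> Iterator[Tuple[int, float, bool]]:
    """Yield (index, rolling average, alert flag) for each reading as it arrives.

    Hypothetical helper for illustration; the parameters are arbitrary.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for i, value in enumerate(readings):
        recent.append(value)
        average = sum(recent) / len(recent)
        # The result is available immediately, before later readings exist.
        yield i, average, average > threshold


# Made-up temperature readings from a single machine.
temperatures = [70.0, 75.0, 90.0, 95.0, 60.0]
alerts = [i for i, _, is_alert in monitor(temperatures) if is_alert]
```

Because `monitor` is a generator, it also works on an unbounded source of readings: each result is produced without waiting for the full dataset, which is the essential difference from batch processing.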
Spark's speed and streaming data processing capabilities are also well suited to machine learning algorithms, which can learn and improve on their own until they find the ideal solution to a problem.
This technology is at the heart of state-of-the-art manufacturing systems (for example, predicting when a part is going to break) and driverless cars.
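To see why speed matters for algorithms that keep improving until they settle on an answer, consider a toy iterative fit. The sketch below is plain Python, not MLlib or any Spark API. The `fit_slope` function, its learning rate, and the sample points are assumptions chosen for illustration. Each pass re-reads the whole dataset, which is the access pattern that benefits from keeping data in memory.

```python
from typing import List, Tuple


def fit_slope(
    points: List[Tuple[float, float]],
    learning_rate: float = 0.01,
    tolerance: float = 1e-9,
    max_passes: int = 10_000,
) -> Tuple[float, int]:
    """Fit y = w * x by gradient descent; return (w, passes over the data).

    Hypothetical helper for illustration; the defaults are arbitrary.
    """
    w = 0.0
    for passes in range(1, max_passes + 1):
        # One full scan of the dataset per iteration (mean squared error gradient).
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        new_w = w - learning_rate * gradient
        if abs(new_w - w) < tolerance:  # stop once the estimate stops changing
            return new_w, passes
        w = new_w
    return w, max_passes


# Made-up data lying exactly on y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
slope, passes = fit_slope(points)
```

Even this tiny problem needs well over a hundred scans of the same three points before the estimate stops changing. If every scan had to go through disk, as between MapReduce operations, the I/O cost would be paid on each pass.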
Spark has its own machine learning library, MLlib, whereas Hadoop systems need to rely on third-party machine learning libraries such as Apache Mahout.
In fact, while there is some functional overlap between Spark and Hadoop, neither is a commercial product, so there is no real competition between them. Companies that profit from providing technical support for these kinds of free systems tend to offer both.
Cloudera, for example, offers both Spark and Hadoop services, and will provide advice on what is best for the customer's needs.
Marr argues that while Spark is growing rapidly, it is still in its infancy, with an underdeveloped security and technical support infrastructure. In his view, the rise in activity in the open-source community suggests that business users are looking for innovative uses for the data they have already stored.