They can handle very large volumes of data
They run on clusters of inexpensive PC servers. Such clusters are easy and cheap to expand, avoiding the complexity and cost of sharding.
They break through performance bottlenecks
NoSQL proponents argue that NoSQL architectures eliminate the time spent converting Web or Java application data into an SQL-friendly format, making execution faster.
"SQL isn't for all program code," and for data that is heavy with repetitive operations, SQL is worth the money. But when the database structure is very simple, SQL may not be very useful.
No unnecessary features
NoSQL proponents acknowledge that relational databases offer an unparalleled set of features and rock-solid data integrity, but they point out that a given organization's actual needs may be far more modest.
Bootstrap (community) support
Because NoSQL projects are open source, they lack formal vendor support. Like most open-source projects, they must turn to the community for support.
Benefits:
Easy to scale
There are many different kinds of NoSQL databases, but their one common feature is that they strip out the relational characteristics of relational databases. With no relationships between data items, scaling becomes very easy, and this implicitly brings scalability at the architectural level as well.
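To see why relation-free data scales so easily, consider a toy sketch of key-based placement. The node names and the hash-modulo scheme below are purely illustrative (real systems typically use consistent hashing to limit data movement when nodes are added):

```python
# Illustrative sketch: when every record is addressed by a single key
# and no record references another, the node that owns a record is a
# pure function of its key.

import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(key: str) -> str:
    """Route a record to a node based only on its key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each key can be stored and served independently, so adding capacity
# is, conceptually, just growing the NODES list.
print(node_for("user:1001"), node_for("user:1002"))
```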
Large data volumes, high performance
NoSQL databases offer very high read and write performance, especially at large data volumes. This stems from their non-relational nature and simple structure. MySQL, by contrast, typically relies on its Query Cache, which is invalidated every time a table is updated; this coarse-grained cache performs poorly in the frequently interactive Web 2.0 applications that NoSQL targets. NoSQL caches at the record level, a fine-grained cache, so NoSQL performs much better at this level.
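The difference in cache granularity can be shown with a toy sketch. This illustrates the idea only; it is not MySQL's or any NoSQL engine's actual implementation:

```python
# Illustrative contrast: a coarse, table-level cache that is flushed on
# any write, versus a fine, record-level cache that only evicts the
# record that changed.

table_cache = {}   # query string -> result; cleared on ANY table write
record_cache = {}  # record id    -> record; only the touched id evicted

def write_row_table_cached(row_id, row):
    table_cache.clear()  # one UPDATE invalidates every cached query
    # ... persist row ...

def write_row_record_cached(row_id, row):
    record_cache[row_id] = row  # only this record's entry is refreshed;
    # ... persist row ...        # all other cached records stay warm
```

Under frequent writes, the first scheme keeps throwing away work that the second scheme gets to reuse.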
Flexible data model
NoSQL does not require you to define fields for your data in advance; you can store custom data formats at any time. In a relational database, by contrast, adding or deleting fields is a huge hassle, and on very large tables adding a field is a nightmare. This is especially true in the Web 2.0 era of large data volumes.
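A small contrast makes the point. In the sketch below, SQLite stands in for a relational table and plain dicts for a schemaless store; the table and field names are invented for the example:

```python
# Relational side: every new attribute is a schema migration. On a huge
# production table, the ALTER TABLE below is the "nightmare" described
# above. Schemaless side: each record simply carries whatever fields it
# needs, no migration required.

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")  # migration

docs = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob", "nickname": "bobby", "tags": ["beta"]},
]
print(json.dumps(docs[1]))  # new fields appear with no schema change
```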
Highly available
NoSQL makes it easy to build highly available architectures without sacrificing much performance. Cassandra and HBase, for example, can achieve high availability through their replication models.
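As a sketch of that replication model in practice, the snippet below uses the DataStax Python driver to create a Cassandra keyspace whose rows are copied to three nodes. The contact point, keyspace name, and replication factor are illustrative:

```python
# Minimal sketch, assuming a Cassandra cluster reachable at 127.0.0.1.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Every row in this keyspace is stored on 3 nodes, so the cluster keeps
# serving reads and writes if one replica goes down.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
```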
Major applications:
Apache HBase
This big data management platform is built on the design of Google's powerful BigTable engine. An open-source, distributed database written in Java, HBase was originally designed for the Hadoop platform, and this powerful data management tool has since been adopted by Facebook to manage the enormous data of its messaging platform.
Apache Storm
A distributed, real-time computation system for processing high-velocity, large data streams, Storm adds reliable real-time data processing to Apache Hadoop, along with low-latency dashboards, security alerts, and operational improvements that help organizations capture business opportunities and grow new business more efficiently.
Apache Spark
This technology uses in-memory computing: starting from multi-iteration batch processing, it allows data to be loaded into memory and queried repeatedly, and it also incorporates multiple computing paradigms such as data warehousing, stream processing, and graph computation. Spark is implemented in Scala, is built on top of HDFS, works well with Hadoop, and can run up to 100 times faster than MapReduce.
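The "load into memory, query repeatedly" pattern is easy to see in PySpark. In this minimal sketch the HDFS path is a placeholder:

```python
# Minimal PySpark sketch of in-memory reuse across queries.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.json("hdfs:///logs/events.json")  # hypothetical dataset
df.cache()  # keep the dataset in memory across queries

# Both aggregations reuse the cached data instead of re-reading HDFS,
# which is where the large speedups over disk-based MapReduce come from.
df.groupBy("status").count().show()
print(df.filter(df["status"] == 500).count())
```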
Apache Hadoop
This technology is quickly becoming one of the standards for big data management. When used to manage large datasets, Hadoop shows excellent performance for complex distributed applications; the platform's flexibility lets it run on commodity hardware, and it can easily integrate structured, semi-structured, and even unstructured datasets.
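To make Hadoop's division of labor concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets any language supply the map and reduce steps while Hadoop handles distribution and shuffling. The jar path and input/output directories in the comment are placeholders:

```python
# wordcount.py -- run with something like (paths illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -input /data/in -output /data/out

import sys

def map_phase():
    # Emit one (word, 1) pair per word; Hadoop sorts these by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Input arrives grouped by key, so a running total per word works.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```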
Apache Drill
How big is your dataset? It hardly matters: Drill can handle it with ease. With support for HBase, Cassandra, and MongoDB, Drill provides an interactive analytics platform that sustains large-scale data throughput while producing results quickly.
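As an illustration, Drill exposes a REST endpoint that accepts SQL over these stores. The sketch below posts a query to a hypothetical local Drill instance; the mongo.logs.events path is an invented example:

```python
# Minimal sketch against Drill's REST API (default port 8047).

import requests

resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "SELECT status, COUNT(*) AS n "
                 "FROM mongo.logs.events GROUP BY status",
    },
)
print(resp.json()["rows"])  # results come back as JSON rows
```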
Apache Sqoop
Perhaps your data is locked away in legacy systems right now; Sqoop can help with that. The platform uses concurrent connections to move data easily from relational database systems into Hadoop, with customizable data types and metadata propagation through mappings. In fact, you can also import data (such as new data) directly into HDFS, Hive, and HBase.
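Sqoop itself is driven from the command line; purely to keep this article's examples in one language, the sketch below shells out to a hypothetical sqoop import. The JDBC URL, credentials, table, and target directory are all placeholders:

```python
# Thin wrapper around a standard `sqoop import` invocation.

import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://legacy-db:3306/sales",  # legacy source
    "--username", "etl",
    "--table", "orders",
    "--target-dir", "/warehouse/orders",  # lands in HDFS
    "--num-mappers", "4",  # the concurrent connections described above
], check=True)
```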
Apache Giraph
This is a powerful graph-processing platform with great scalability and usability. The technology has been adopted by Facebook. Giraph runs in a Hadoop environment and can be deployed directly into an existing Hadoop system, giving you powerful distributed graph processing while still leveraging your existing big data infrastructure.
Cloudera Impala
Impala can likewise be deployed on your existing Hadoop cluster to serve queries there. Where MapReduce is powerful for batch processing, Impala excels at real-time SQL queries, so you can quickly understand the data on your big data platform through efficient SQL.
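As a sketch of what such an interactive query can look like from client code, the snippet below uses the impyla Python client against a hypothetical Impala host. The host name and the clicks table are assumptions; 21050 is Impala's usual client port:

```python
# Minimal sketch using the impyla DB-API client.

from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cur = conn.cursor()
cur.execute(
    "SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```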
Gephi
Gephi can be used to correlate and quantify information, and by building powerful visualizations you can draw new insights from your data. Gephi already supports several chart types and can handle large networks with millions of nodes. It has an active user community and offers a large number of plug-ins that integrate seamlessly with existing systems; it can also visualize and analyze complex IT connections, individual nodes in a distributed system, data flows, and other information.
MongoDB
This solid platform has earned the respect of many organizations for its excellent performance in big data management. MongoDB was originally created by employees of DoubleClick, and the technology is now widely used for big data management. MongoDB is an open-source NoSQL database that stores and processes data in JSON-like documents. The New York Times, Craigslist, and numerous other organizations have adopted MongoDB to help them manage large datasets. (Couchbase Server is also worth considering as an alternative.)
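As a small illustration of the JSON-style storage described above, here is a minimal pymongo sketch. The connection URL and the newsroom/articles names are invented for the example:

```python
# Minimal sketch: MongoDB stores documents as BSON, a binary
# JSON-style format, with no schema declared up front.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client.newsroom.articles  # hypothetical db and collection

articles.insert_one({
    "title": "Big data at the Times",
    "tags": ["nosql", "mongodb"],
    "views": 12034,
})
print(articles.find_one({"tags": "nosql"})["title"])
```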
Top 10 companies:
Amazon Web Services
Forrester calls AWS a "cloud juggernaut," and when it comes to big data in the cloud, Amazon is the name. The company's Hadoop product is EMR (Elastic MapReduce), which, AWS explains, uses Hadoop technology to provide big data management services. It is not pure open-source Hadoop, however; it has been modified and is used exclusively on the AWS cloud.
Forrester says EMR has a good market outlook. Many companies build customer-facing services on EMR, and a few apply it to data querying, modeling, integration, and management. And AWS is still innovating: Forrester expects that in the future EMR will be able to scale and resize automatically based on workload needs. Amazon plans to offer more robust EMR integration with its other products and services, including its Redshift data warehouse, the newly announced Kinesis real-time processing engine, and planned NoSQL database and business intelligence tools. AWS does not, however, have its own Hadoop distribution.
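For readers curious what driving EMR looks like in practice, here is a minimal sketch using AWS's boto3 SDK to launch a small cluster. The region, release label, instance types, and the default IAM role names are illustrative assumptions, not recommendations:

```python
# Minimal sketch: provision a short-lived EMR cluster with boto3.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="article-demo",
    ReleaseLabel="emr-6.15.0",  # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when idle
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```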
Cloudera
Cloudera has its own Hadoop distribution, which draws on many technologies from the Apache Hadoop open-source project while taking a significant step beyond them. Cloudera has developed a number of features for its Hadoop distribution, including Cloudera Manager for management and monitoring, and an SQL engine called Impala, among others. Cloudera's Hadoop distribution is based on open-source Hadoop, but it is not a purely open-source product: when Cloudera's customers need features that Hadoop lacks, Cloudera's engineers implement them or find a partner who has the technology. Forrester says: "Cloudera's approach to innovation stays true to core Hadoop, but its ability to innovate rapidly and proactively meet customer needs is what sets it apart from other vendors." Cloudera's platform now has more than 200 paying customers, some of whom, with Cloudera's technical support, effectively manage petabytes of data across more than 1,000 nodes.
Hortonworks
Like Cloudera, Hortonworks is a pure-play Hadoop company. Unlike Cloudera, Hortonworks firmly believes that open-source Hadoop is more powerful than any vendor's proprietary distribution. Hortonworks' goal is to build a Hadoop ecosystem and a community of Hadoop users to advance the open-source project. The Hortonworks platform is closely tied to open-source Hadoop, which company executives say benefits users by protecting them from vendor lock-in (if Hortonworks customers want to leave the platform, they can easily move to other open-source platforms). This is not to say that Hortonworks relies exclusively on existing open-source Hadoop technology; rather, the company contributes all of its developments back to the open-source community, such as Ambari, a tool Hortonworks developed to fill gaps in cluster management. Hortonworks' solution has won support from Teradata, Microsoft, Red Hat, SAP, and other vendors.
IBM
When organizations consider a big IT project, IBM is the first name many people think of. IBM is one of the major players in the Hadoop project; Forrester says IBM has more than 100 Hadoop deployments, and many of its customers hold petabytes of data. IBM has deep experience in grid computing, global data centers, enterprise big data project implementation, and many other areas. "IBM plans to continue integrating SPSS analytics, high-performance computing, BI tools, data management and modeling, workload management for high-performance computing, and many other technologies."
Intel
Like AWS, Intel continues to improve and optimize Hadoop to run on its own hardware; specifically, it makes Hadoop run well on its Xeon chips, helping users break through some of the limitations of the Hadoop system and letting software and hardware work better together, and Intel's Hadoop distribution has benefited from this. Forrester points out that Intel launched this product only recently, so the company still has much room to improve, and both Intel and Microsoft are considered promising contenders in the Hadoop market.
MapR Technologies
MapR's Hadoop distribution may be the best so far, though many people have never heard of it. In Forrester's survey of Hadoop users, MapR received the highest ratings, with its distribution earning top marks for both architecture and data-processing capability. MapR has built a distinctive set of features into its Hadoop distribution, including a network file system (NFS) interface, disaster recovery, and high-availability features. Forrester says MapR lacks the name recognition in the Hadoop market that Cloudera and Hortonworks enjoy, and that MapR will need to step up its partnerships and marketing if it is to become a truly major player.
Microsoft
Microsoft has always kept a low profile when it comes to open-source software, but with big data it has had to make Windows compatible with Hadoop as well, and it is actively investing in open-source projects to promote the Hadoop ecosystem more broadly. We can see the results in Microsoft's public-cloud Windows Azure HDInsight offering. Microsoft's Hadoop service is based on the Hortonworks distribution and is customized for Azure.
Microsoft has a number of other projects as well, including one called PolyBase, which lets SQL Server queries reach into data stored in Hadoop. Forrester says: "Microsoft has a presence in the database, data warehousing, cloud, OLAP, BI, spreadsheet (including PowerPivot), collaboration, and development tools markets, and Microsoft has a huge user base, but it still has a long way to go to become an industry leader in Hadoop."
Pivotal Software
Pivotal was formed when EMC and VMware spun off parts of their big data businesses. It has been working hard to build a high-performance Hadoop distribution, adding a number of new tools on top of the open-source Hadoop base, including an SQL engine called HAWQ and Hadoop applications aimed specifically at big data problems. Forrester says the strength of the Pivotal Hadoop platform is that it integrates many technologies from Pivotal, EMC, and VMware; in effect, Pivotal's real strength is having those two major companies behind it. So far, Pivotal has fewer than 100 users, most of them small and medium-sized customers.
Teradata
For Teradata, Hadoop is both a threat and an opportunity. Data management, especially SQL and relational databases, is Teradata's specialty, so the rise of NoSQL platforms like Hadoop could have threatened it. Instead, Teradata embraced Hadoop: through a partnership with Hortonworks, Teradata integrated SQL technology into the Hadoop platform, making it easy for Teradata's customers to use the data stored in Teradata's data warehouse.
AMPLab
It is by transforming data into information that we make sense of the world, and that is exactly what AMPLab works on. AMPLab spans machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition, striving to improve techniques for sifting information, including information within opaque datasets. Besides Spark, the open-source distributed SQL query engine Shark also originated at AMPLab; Shark offers very high query efficiency along with good compatibility and scalability. Recent developments have brought computer science into a whole new era, and AMPLab envisions flexible solutions that combine resources and technologies such as big data, cloud computing, and communications to tackle challenges that keep growing more complex.