In big data, the most valuable and hardest-to-replace asset is the data itself; everything revolves around the data.
HDFS is the earliest big data storage system and holds these valuable data assets. Any new algorithm or framework that wants to be widely adopted must support HDFS in order to reach the data already stored in it. So the more big data technology develops and the more new technologies appear, the more support HDFS gains and the harder it becomes to move away from it. HDFS may not be the best big data storage technology, but it is still the most important one.
How does HDFS enable high-speed, reliable storage and access to big data?
The Hadoop Distributed File System (HDFS) was designed to manage thousands of servers and tens of thousands of disks as a single storage system, presenting petabytes of capacity to applications and letting them store large-scale file data as if they were using an ordinary file system.
Files are stored in multiple copies:
Disadvantages: storage consumption grows in proportion to the number of copies.
Advantages: if one copy is damaged or lost, the data can still be read from another copy.
How does HDFS achieve mass storage and high-speed access?
RAID stripes data across multiple disks and reads and writes them concurrently, which increases storage capacity, speeds up access, and, through redundant checksum (parity) data, improves reliability so that no data is lost even if a disk fails. Extending this RAID design concept from a single machine to an entire distributed server cluster yields a distributed file system; this is the core principle behind the Hadoop Distributed File System.
Like RAID, which stripes files across multiple disks and parallelizes reads and writes, HDFS splits data into blocks stored redundantly across a large distributed server cluster and reads and writes them in parallel. Because HDFS can be deployed on a large server cluster, the disks of all the servers in the cluster are available to HDFS, so the total HDFS storage space can reach petabytes.
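As a rough illustration of this striping principle only (not HDFS's actual placement logic), the sketch below splits a hypothetical 1 GB file into 128 MB blocks and assigns the blocks round-robin to a made-up list of DataNodes:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: shows how a large file divides into fixed-size blocks
// that can be spread across many servers and read in parallel.
public class BlockDistributionSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, a common default

    public static void main(String[] args) {
        long fileSize = 1L * 1024 * 1024 * 1024;        // a 1 GB file (hypothetical)
        List<String> dataNodes = Arrays.asList("dn-1", "dn-2", "dn-3", "dn-4"); // hypothetical nodes

        long blockCount = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        for (long blockId = 0; blockId < blockCount; blockId++) {
            // Round-robin assignment, just to illustrate distribution across servers
            String node = dataNodes.get((int) (blockId % dataNodes.size()));
            System.out.println("block " + blockId + " -> " + node);
        }
    }
}
```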
HDFS uses a master-slave architecture. An HDFS cluster has one NameNode (NN for short) that acts as the master server.
HDFS exposes a file system namespace that lets users store data in files the same way we normally use the file system in an operating system, without caring how the data is stored underneath. At the bottom layer, a file is split into one or more data blocks, and these data blocks are stored on a set of DataNodes; in CDH the default block size is 128 MB. The NameNode executes file system namespace operations such as opening, closing, and renaming files, and it also determines the mapping of data blocks to DataNodes.
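A minimal sketch of what "using HDFS like an ordinary file system" looks like from the client side, assuming the Hadoop client libraries are on the classpath and a cluster is reachable at the hypothetical address hdfs://namenode:8020 (the path /user/demo/hello.txt is also made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://namenode:8020 is a hypothetical cluster address
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/demo/hello.txt");
        // Write: the client sees an ordinary file; HDFS splits it into blocks behind the scenes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // A namespace operation handled by the NameNode: rename
        fs.rename(file, new Path("/user/demo/hello-renamed.txt"));
        fs.close();
    }
}
```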
HDFS is designed to run on ordinary, inexpensive machines, typically running Linux. A typical HDFS cluster deployment has one dedicated machine that runs only the NameNode, while each of the other machines in the cluster runs a DataNode instance. Running multiple DataNodes on a single machine is possible but not recommended.
The DataNodes are responsible for storing, reading, and writing file data. HDFS splits file data into blocks, and each DataNode stores a portion of those blocks, so files are distributed across the entire HDFS server cluster.
Application clients (Clients) can access these blocks in parallel, which allows HDFS to deliver data access at the scale of the whole server cluster and greatly improves access speed.
There are many DataNode servers in an HDFS cluster, typically hundreds to thousands, each equipped with several disks, and the entire cluster has a storage capacity of several petabytes to hundreds of petabytes.
The NameNode is responsible for managing the metadata of the entire distributed file system, i.e., file path names, block IDs, and block storage locations.
This information plays the same role as the file allocation table (FAT) in an operating system's file system.
HDFS replicates each Block into multiple copies (3 by default) and stores the copies of the same Block on different servers, and even in different racks, to ensure high availability. When a disk is damaged, a DataNode server goes down, or even a switch fails, making the blocks stored there inaccessible, the client reads one of the remaining replicas of the Block instead.
In HDFS, a file is split into one or more data blocks. By default, there are three replicas of each data block, each stored on a different machine, and each replica has its own unique number:
The number of replica backups for the file /users/sameerp/data/part-0 is set to 2, and the BlockIDs of the stored blocks are 1 and 3, respectively:
When any one of the servers mentioned above goes down, at least one replica of each data block still exists, so access to the file /users/sameerp/data/part-0 is not affected.
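As a hedged illustration of the per-file replication factor used in the example above (assuming a configured Hadoop client; the cluster address comes from the default configuration), a sketch that sets the replication factor of /users/sameerp/data/part-0 to 2 and lists which DataNodes hold each of its blocks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/users/sameerp/data/part-0");

        // Lower the replication factor of this file to 2 (the cluster default is usually 3)
        fs.setReplication(file, (short) 2);

        // Inspect which DataNodes hold each block of the file
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
    }
}
```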
Like RAID, data is divided into blocks and stored on different servers to achieve high-capacity data storage, and the data in different parts can be read/written in parallel to achieve high-speed access to the data.
Replica placement: the process by which the NameNode chooses which DataNodes will store a block's replicas; the placement strategy is a trade-off between reliability and read/write bandwidth.
The default placement policy, as described in Hadoop: The Definitive Guide:
- The first replica is placed on the node where the client is running (or on a randomly chosen, not overly busy node if the client is outside the cluster).
- The second replica is placed on a node in a different rack.
- The third replica is placed in the same rack as the second, but on a different node.
- Any further replicas are placed on randomly chosen nodes, while avoiding putting too many replicas in the same rack.
The first of Google's big data "troika" of papers describes GFS (Google File System), and the first product of Hadoop is HDFS; distributed file storage is the foundation of distributed computing.
Over the years, various computing frameworks, algorithms, and application scenarios have evolved, but the king of big data storage is still HDFS.
Disk media is affected by its environment and by aging, so the data stored on it can become corrupted.
HDFS computes and stores a checksum for the data blocks stored on each DataNode. When data is read, the checksum of the read data is recalculated; if it does not match, an exception is thrown, and the application catches the exception and reads the replica on another DataNode instead.
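A simplified illustration of the verify-on-read idea, not HDFS's actual implementation (HDFS by default checksums data in small chunks of 512 bytes):

```java
import java.util.zip.CRC32;

// Simplified checksum-verification sketch: compute a checksum at write time,
// recompute at read time, and fall back to another replica on mismatch.
public class ChecksumSketch {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "block contents".getBytes();
        long stored = checksum(block);      // computed and stored at write time

        // ... later, at read time ...
        long recomputed = checksum(block);
        if (recomputed != stored) {
            // In HDFS the read would fail with a checksum error and the client
            // would switch to a replica on another DataNode.
            throw new IllegalStateException("checksum mismatch, read another replica");
        }
    }
}
```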
When a DataNode detects that a local disk is damaged, it reports all the BlockIDs stored on that disk to the NameNode. The NameNode checks which other DataNodes hold replicas of those blocks and notifies them to copy the blocks to other servers, so that the number of replicas of each block is restored to the required level.
DataNodes communicate with the NameNode via heartbeats. If a DataNode has not sent a heartbeat within the timeout period, the NameNode assumes it has gone down, immediately looks up which data blocks were stored on that DataNode and which other servers still hold replicas of them, and notifies those servers to copy the blocks to additional servers. This keeps the number of replicas of each block in HDFS at the level set by the user, so no data is lost even if another server fails.
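A much-simplified sketch of this heartbeat-based failure detection (not the NameNode's real code; real HDFS sends heartbeats every 3 seconds by default and marks a DataNode dead after roughly 10 minutes of silence):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified heartbeat-monitor sketch: record the last heartbeat per DataNode and
// treat nodes whose heartbeat has expired as dead, triggering re-replication.
public class HeartbeatMonitorSketch {
    private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed-dead threshold (~10 minutes)
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode reports in
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodically scan for nodes whose heartbeat has expired
    void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, ts) -> {
            if (now - ts > TIMEOUT_MS) {
                // In HDFS, the NameNode would now schedule re-replication of every
                // block this DataNode was holding, using the surviving replicas.
                System.out.println(node + " considered dead, re-replicate its blocks");
            }
        });
    }
}
```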
The NameNode is the core of HDFS: it records the HDFS file allocation table information, and all file paths and data block storage information are kept on the NameNode. If the data recorded on the NameNode is lost, all the data stored on the DataNodes across the cluster becomes useless.
Therefore, high-availability fault tolerance for the NameNode is essential. The NameNode uses a master-slave hot-standby approach to provide a highly available service:
Two NameNode servers are deployed in the cluster:
The two servers decide which one is the master through a ZooKeeper election, essentially by competing for a znode lock resource. DataNodes send heartbeats to both NameNodes at the same time, but only the master NameNode may return control messages to the DataNodes.
During normal operation, file system metadata is synchronized between the master and slave NameNode through a shared storage system (shared edits). When the master NameNode goes down, the slave NameNode is promoted to master via ZooKeeper, ensuring that the HDFS cluster's metadata, i.e., the file allocation table information, remains complete and consistent.
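A hedged sketch of the znode-lock idea behind this active/standby election (in real HDFS HA this is handled by the ZKFailoverController rather than the NameNode itself; the ZooKeeper address and lock path below are made up for illustration):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: both NameNode candidates try to create the same ephemeral znode;
// the one that succeeds acts as active, the other stays standby and watches the lock.
public class ActiveElectionSketch {
    public static void main(String[] args) throws Exception {
        // "zk-1:2181" and "/hdfs-ha/active-lock" are hypothetical values for this sketch
        ZooKeeper zk = new ZooKeeper("zk-1:2181", 15000, event -> { });
        try {
            zk.create("/hdfs-ha/active-lock",
                    "namenode-1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);           // released automatically if this process dies
            System.out.println("won the lock: act as the active NameNode");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("lock already held: stay standby and wait for it to disappear");
            zk.exists("/hdfs-ha/active-lock", true); // set a watch so failover can be retried
        }
    }
}
```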
For a software system, mediocre performance may be acceptable to users, and a poor experience may even be tolerated. But if availability is poor and the system frequently fails, that is a real problem; and if important data is lost, it becomes a serious crisis for the team.
And a distributed system can fail in a great many places: memory, CPUs, motherboards, and disks can break, servers can go down, networks can be interrupted, and data centers can lose power, any of which can make the software system unavailable or even cause permanent data loss.
So when designing a distributed system, software engineers must keep availability firmly in mind and think about how to keep the whole system usable under every possible failure scenario.
## 6 Strategies for ensuring system availability
Any program and any piece of data must have at least one backup: a program must be deployed on at least two servers, and data must be backed up to at least one other server. In addition, larger Internet companies build multiple data centers that back each other up, with user requests potentially routed to any of them, the so-called multi-site active-active setup, so that applications stay highly available even through major regional failures and natural disasters.
When the program or data you want to access is unavailable, the access request needs to be transferred to the server holding the backup program or data; this is failover. The tricky part of failover is correctly identifying the failure. In a scenario where a master and a slave server manage the same data, as with the NameNode, if the slave mistakenly concludes that the master is down and takes over cluster management, both servers will issue commands to the DataNodes at the same time and throw the cluster into chaos, the so-called "split-brain" situation. This is why ZooKeeper is introduced to elect the master server in this scenario; how ZooKeeper works will be analyzed later.
When a flood of user requests or data-processing requests arrives, limited computing resources may be unable to handle them all, which can exhaust resources and crash the system. In this case, some requests can be rejected, i.e., rate limiting, or some functionality can be switched off to reduce resource consumption, i.e., degradation. Rate limiting is something an Internet application must always have ready, because you cannot predict when traffic beyond your capacity will suddenly arrive; you prepare in advance and switch it on as soon as a traffic spike hits. Degradation is usually prepared for predictable scenarios, such as an e-commerce "Double 11" promotion: to keep core functions such as ordering running normally during the event, the system can be degraded by shutting down less important functions, such as product reviews.
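As a minimal, framework-agnostic illustration of rate limiting (the capacity of 100 concurrent requests is an arbitrary assumption), a semaphore can cap how many requests are processed at once and reject the rest instead of letting them exhaust resources:

```java
import java.util.concurrent.Semaphore;

// Minimal flow-limiting sketch: at most MAX_CONCURRENT requests are served at once;
// everything beyond that is rejected immediately to protect the system.
public class RateLimitSketch {
    private static final int MAX_CONCURRENT = 100;   // hypothetical capacity
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

    String handle(Runnable request) {
        if (!permits.tryAcquire()) {
            return "rejected: over capacity";         // shed the excess load
        }
        try {
            request.run();                            // do the real work
            return "ok";
        } finally {
            permits.release();
        }
    }
}
```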
To summarize how HDFS achieves high-capacity, high-speed, and reliable storage and access to data on a large distributed server cluster:
1. File data is split into blocks that can be stored on any DataNode in the cluster, so a file stored in HDFS can be very large; in theory a single file can occupy all the disks of the entire HDFS server cluster, which provides high-capacity storage.
2. HDFS is typically accessed through MapReduce programs, which read input data split by split, usually one data block per split, and assign each split its own compute process. Many processes can therefore read the blocks of one HDFS file concurrently, which provides high-speed data access. We will discuss the MapReduce process in detail later in the column.
3. The data blocks stored on DataNodes are replicated, so every block has multiple backups in the cluster, which guarantees data reliability; in addition, a series of fault-tolerance measures make the main components of HDFS highly available, which in turn keeps the data and the entire system highly available.