WordCount is the simplest program and the one that best reflects the MapReduce idea; it can be called the "Hello World" of MapReduce. The complete code of the program can be found in the "src/examples" directory of the Hadoop installation package.
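As a rough sketch of running it (the example jar name and its location vary by Hadoop release, and the /input2 and /output paths here are assumptions), the bundled example can be launched from the Hadoop installation directory:
# run the bundled WordCount example against a directory already in HDFS
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input2 /output
# print the aggregated word counts
bin/hadoop fs -cat /output/part-r-00000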
Big Data Mystery: To Graduate or Not to Graduate
Think about what your reasons are for going to graduate school. Maybe you'll find the answer for yourself....... Yu Heping
RAID5 data problem
No. As with any new disk in disk management, you would have to repartition and format it before it can be recognized, but then the data written in RAID mode is gone, because the data written to a single disk of a RAID set is fragmented, and even the file system on that disk is fragmented and incomplete. If you happen to be lucky and this disk holds a relatively complete copy of the file system metadata, you may be able to see the files on another computer, but those files normally will not open and will definitely report errors. You might be lucky enough to find a text file that opens, but it will likely hold less than 4 KB of content: only files smaller than the RAID stripe size divided by the number of disks in the array (and this also depends on the file system cluster size) can survive intact. In short, you basically cannot read this disk on another computer. All of this is just to give you a clearer picture of what is going on.
Big Data tells you whether you want to be a civil servant
It is normal not to finish all the questions. The administrative aptitude test lasts 120 minutes and, excluding the time needed to fill in the answer sheet, that averages out to only a little over 50 seconds per question. As candidates, we should first play to our strengths: do well on the questions we can do, know how to do, and are good at, and make sure we get those right.
What is RAID5 and the principle of RAID5 data recovery
This problem is more complex. Server hard disks have a more complex structure. Simply put, RAID5 requires a combination of at least three hard disks of the same type and the same capacity. If the server fails, you need to note down the position of each hard disk in the array so that this information can be used when recovering the data later.
If the server has failed with problems on at least two hard disks, do not operate on it; protect the scene and find professional data recovery personnel to restore the data. In general the data can be recovered; Xi'an Military King Data, a professional data recovery organization, is recommended. If an entire row of the server's disks has gone bad, the data cannot be recovered; if only a whole column is bad, the hope of recovery is great.
Does big data necessarily need Hadoop?
Yes. There is no technology that can replace Hadoop.
You can view files on HDFS with the HDFS shell commands, or with Hadoop's built-in web manager. In addition, from Hadoop 0.23 onwards, Hadoop provides a set of REST-style interfaces for browsing and manipulating the data on HDFS over HTTP.
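For example (a minimal sketch: the node1 host name and the 50070 port are the ones used in the installation below, and the REST call assumes the WebHDFS interface is enabled):
# list a directory with the HDFS shell
bin/hadoop fs -ls /input2
# the same listing over the REST-style (WebHDFS) interface
curl -i "http://node1:50070/webhdfs/v1/input2?op=LISTSTATUS"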
Big Data tells you whether to get a driver's license while in college
On the question of whether college graduates need a driver's license: in general, very few workplaces require one when you are looking for a job, so getting a driver's license is not strictly necessary. However, most people, especially boys, will certainly want to drive their own car someday. The best time to get a driver's license is therefore during the college years; you can basically get one in a month or two over the summer and winter breaks. If you don't get it during your school years, you probably won't have as much time once you start working. So if you can get a driver's license in college, try to get one, and if you can't, don't push yourself too hard.
Of course, there is also the matter of earning enough extracurricular credits before graduating. If you are short on extracurricular credits, a driver's license can count for 2 of them toward graduation, which is another reason it is better to get one during your college years.
Big Data: Getting Started with Hadoop
I: What is Big Data
1. Big data is data that cannot be captured, managed, and processed with conventional software within a certain period of time. In short, the amount of data is so large that it cannot be handled by conventional tools such as relational databases and data warehouses. What order of magnitude does "big" mean here? For example, Alibaba processes up to 20 PB (that is, 20,971,520 GB) of data every day.
2. Characteristics of big data:
(1.) Huge volume. According to the current development trend, the volume of big data has reached the petabyte level or even the EB level.
(2.) Diverse data types. Big data is mainly unstructured data, such as web logs, audio, video, pictures, geographic location information, transaction data, social data, and so on.
(3.) Low value density. Valuable data accounts for only a small portion of the total. For example, in a long video recording, often only a few seconds of footage are actually valuable.
(4.) Fast generation and fast processing requirements. This is the most significant difference between big data and traditional data mining.
3. Processing systems that can handle big data:
Hadoop (open source)
Spark (open source)
Storm (open source)
MongoDB (open source)
IBM PureData (commercial)
Oracle Exadata (commercial)
SAP HANA (commercial)
Teradata AsterData (commercial)
EMC GreenPlum (commercial)
HP Vertica (commercial)
Note: Here we will only introduce Hadoop.
II: Hadoop Architecture
1. The origin of Hadoop:
Hadoop originated from three papers published by Google between 2003 and 2004 on GFS (the Google File System), MapReduce, and BigTable; it was created by Doug Cutting. Hadoop is now a top-level project of the Apache Foundation.
"Hadoop" is a made-up name, taken from the name Doug Cutting's child gave to a yellow toy elephant.
2. The core of Hadoop:
(1.) HDFS and MapReduce are the two cores of Hadoop. HDFS provides the underlying support for distributed storage, enabling high-speed parallel reads and writes as well as large-capacity storage expansion.
(2.) MapReduce provides the programming support for distributed task processing, ensuring that data is partitioned and processed at high speed.
3. Hadoop subprojects:
(1.) HDFS: distributed file system, the cornerstone of the entire Hadoop system.
(2.) MapReduce/YARN: the parallel programming model. YARN is the second-generation MapReduce framework; starting with Hadoop 0.23.0, MapReduce was refactored and is usually called MapReduce V2, while the old MapReduce is called MapReduce V1.
(3.) Hive: a data warehouse built on Hadoop that provides an SQL-like language for querying data stored in Hadoop.
(4.) HBase: short for Hadoop Database, Hadoop's distributed, column-oriented database, derived from Google's BigTable paper; it is mainly used for random-access, real-time reading and writing of big data.
(5.) ZooKeeper: a coordination service designed for distributed applications. It mainly provides synchronization, configuration management, grouping, and naming services to users, relieving distributed applications of the coordination tasks they would otherwise have to undertake.
There are many other projects that are not explained here.
III: Installation of the Hadoop environment
User creation:
(1.) Create a Hadoop user group, enter the command:
groupadd hadoop
(2.) Create the hduser user and add it to the hadoop group, enter the command:
useradd -g hadoop hduser
(3.) Set the password for hduser, enter the command:
passwd hduser
Enter the password twice when prompted
(4.) Add permissions for the hduser user, enter the command:
#modify permissions
chmod 777 /etc/sudoers
#Edit sudoers
gedit /etc/sudoers
#Restore the default permissions
chmod 440 /etc/sudoers
First modify the permissions of the sudoers file, then add hduser to sudoers: in the text editor window, find the line "root ALL=(ALL) ALL" and add the line "hduser ALL=(ALL) ALL" immediately after it, as shown below. Remember to restore the default permissions when you are done, otherwise the system will not allow you to use the sudo command.
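After the edit, the relevant lines of /etc/sudoers should look like this:
root ALL=(ALL) ALL
hduser ALL=(ALL) ALL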
(5.) Reboot the VM after the setup, and enter the command:
sudo reboot
After reboot, switch to the hduser user to log in
Installing the JDK
(1.) Download the jdk-7u67-linux-x64.rpm, and go to the directory where you downloaded it.
(2.) Run the install command:
sudo rpm -ivh jdk-7u67-linux-x64.rpm
When finished, check the installation path by listing the files of the installed package, enter the command:
rpm -ql $(rpm -qa | grep jdk)
Remember the installation path (by default /usr/java/jdk1.7.0_67).
(3.) Configure environment variables, enter the command:
sudo gedit /etc/profile
Open the profile file and add the following at the bottom of the file
export JAVA_HOME=/usr/java/jdk1.7.0_67
export CLASSPATH=$JAVA_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH
Save and close the file, then enter the command to make the environment variables take effect:
source /etc/profile
(4.) Verify the JDK by typing the command:
java -version
If the correct version appears, the installation is successful.
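The output should report the installed version, roughly along these lines (the exact wording depends on the JDK build):
java version "1.7.0_67"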
Configure local SSH password-free login:
(1.) Use ssh-keygen to generate private and public key files, enter the command:
ssh-keygen -t rsa
(2.) The private key stays on the local machine, and the public key is sent to the other hosts (which are now localhost). Enter the command:
ssh-copy-id localhost
(3.) Log in using the key. Enter the command:
ssh localhost
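If the login no longer asks for a password, the setup worked. As an optional sanity check, the generated key files can be listed:
# expect id_rsa, id_rsa.pub and authorized_keys
ls ~/.ssh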
Configure password-free SSH login to the other hosts
(1.) Clone the virtual machine twice. Select the virtual machine in the left column of VMware, right-click it, and in the pop-up shortcut menu choose the Manage - Clone command. Select "Create a full clone" as the clone type and click the "Next" button until the clone is finished.
(2.) Boot into each of the three virtual machines and use ifconfig to query each host's IP address.
(3.) Modify the hostname and hosts files for each host.
Step 1: Modify the hostname, enter the command on each host:
sudo gedit /etc/sysconfig/network
Step 2: Modify the hosts file:
sudo gedit /etc/hosts
Step 3: Note the IP address of each of the three VMs:
node1 VM: 192.168.1.130
node2 VM: 192.168.1.131
node3 VM: 192.168.1.132
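Accordingly, the /etc/hosts file on each host should contain entries mapping these addresses to the hostnames, for example:
192.168.1.130 node1
192.168.1.131 node2
192.168.1.132 node3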
(4.) Since a key pair has already been generated on node1, you now only need to distribute node1's public key to node2 and node3. Enter the commands:
ssh-copy-id node2
ssh-copy-id node3
This will publish the public key of node1 to node2 and node3.
(5.) To test the SSH, enter the command on node1:
ssh node2
#exit login
exit
ssh node3
exit
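As a quick overall check (a sketch that assumes the hostnames above resolve from node1), each slave can be reached in turn without a password prompt:
# print each remote hostname over SSH
for h in node2 node3; do ssh $h hostname; done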
IV: Hadoop fully distributed installation
1. Hadoop has three modes of operation:
(1.) Standalone mode: no configuration is required; Hadoop runs as a single, non-distributed Java process.
(2.) Pseudo-distributed mode: the cluster has only one node, which is both the Master (master node, master server) and a Slave (slave node, slave server); on this single node, different Java processes simulate the various node types of a distributed cluster.
(3.) Fully distributed mode: the Master and Slave roles run on different machines; different systems divide the nodes in different ways.
2. Installation of Hadoop
(1.) Get the Hadoop tarball hadoop-2.6.0.tar.gz. After downloading it, copy it to node1 (for example through a VMware Tools shared folder, or with a tool such as Xftp), then on node1 extract the tarball to the /home/hduser directory. Enter the commands:
# enter the HOME directory, i.e. "/home/hduser"
cd ~
tar -zxvf hadoop-2.6.0.tar.gz
(2.) Rename the extracted directory to hadoop by typing the command:
mv hadoop-2.6.0 hadoop
(3.) Configure Hadoop environment variables by typing the command:
sudo gedit /etc/profile
Add the following script to the profile:
# hadoop
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
Save and close the file, and finally enter the command to make the configuration take effect:
source /etc/profile
Note: node2 and node3 should both be configured in the same way.
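To confirm that the PATH change has taken effect, the Hadoop command can be run from any directory; the reported version should be 2.6.0:
hadoop version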
3. Configure Hadoop
(1.) The hadoop-env.sh file is used to specify the JDK path. Enter the command:
[hduser@node1 ~]$ cd ~/hadoop/etc/hadoop
[hduser@node1 hadoop]$ gedit hadoop-env.sh
Then add the following line to specify the JDK path:
export JAVA_HOME=/usr/java/jdk1.7.0_67
(2.) Confirm that the specified JDK path has taken effect; the line should read:
export JAVA_HOME=/usr/java/jdk1.7.0_67
(4.) core-site.xml: This file is the Hadoop global configuration. Open it and add configuration properties to the <configuration> element as follows:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hduser/hadoop/tmp</value>
</property>
</configuration>
Two commonly used configuration properties are given here. fs.defaultFS indicates the default path prefix used when a client connects to HDFS, and 9000 is the port on which HDFS works. If hadoop.tmp.dir is not specified, the data will be saved to the system's default temporary directory, /tmp.
(5.) hdfs-site.xml: This file is the HDFS configuration. Open it and add configuration properties to the <configuration> element.
(6.) mapred-site.xml: This file is the MapReduce configuration. It can be copied from the template file mapred-site.xml.template; open it and add the configuration to the <configuration> element.
(7.) yarn-site.xml: If mapred-site.xml configures the use of the YARN framework, then the YARN framework uses the configuration in this file. Open it and add configuration properties to the <configuration> element.
(8.) Copy the configured files to node2 and node3. Enter the following commands:
scp -r /home/hduser/hadoop/etc/hadoop/ hduser@node2:/home/hduser/hadoop/etc/
scp -r /home/hduser/hadoop/etc/hadoop/ hduser@node3:/home/hduser/hadoop/etc/
4. Verification:
The following verifies that Hadoop is configured correctly.
(1.) Format the NameNode on the Master host (node1). Enter the commands:
[hduser@node1 ~]$ cd ~/hadoop
[hduser@node1 hadoop]$ bin/hdfs namenode -format
(2.) Turn off the system firewall on node1, node2 and node3 and restart the VMs. Enter the commands:
service iptables stop
sudo chkconfig iptables off
reboot
(3.) Enter the following to start HDFS:
[hduser@node1 ~]$ cd ~/hadoop
(4.) Start everything:
[hduser@node1 hadoop]$ sbin/start-all.sh
(5.) Check the cluster status:
[hduser@node1 hadoop]$ bin/hdfs dfsadmin -report
(6.) Check the status of HDFS in your browser at: http://node1:50070
(7.) Stop Hadoop. Enter the following command:
[hduser@node1 hadoop]$ sbin/stop-all.sh
V: Hadoop-related shell operations
(1.) Create file1.txt and file2.txt in the /home/hduser/file directory of the operating system; they can be created with the graphical interface.
file1.txt content:
Hello World hi HADOOP
file2.txt content:
Hello World hi CHINA
(2.) After starting HDFS, create the directory /input2:
[hduser@node1 hadoop]$ bin/hadoop fs -mkdir /input2
(3.) Save file1.txt and file2.txt to HDFS:
[hduser@node1 hadoop]$ bin/hadoop fs -put ~/file/file*.txt /input2/
(4.) List the files in /input2:
[hduser@node1 hadoop]$ bin/hadoop fs -ls /input2
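A couple of other commonly used HDFS shell operations, shown here as a short sketch (the commands reuse the /input2 directory created above; the local destination path is an assumption):
# print the contents of an uploaded file
[hduser@node1 hadoop]$ bin/hadoop fs -cat /input2/file1.txt
# copy a file from HDFS back to the local file system
[hduser@node1 hadoop]$ bin/hadoop fs -get /input2/file2.txt ~/file2.copy.txt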
How much does RAID5 data recovery cost?
I don't know exactly. I have been to the AIT Data Recovery Center before and spent less than 2,000 dollars on it; it seems to depend on what the problem is. I thought the problem was quite complex, so being able to fix it for 2,000 was quite a surprise.