1. Big Data Application Model
Cloud Computing is an Internet-based computing paradigm that evolved from Parallel Computing, Distributed Computing, and Grid Computing, and that integrates technologies such as network storage, virtualization, and load balancing. It transfers tasks that would otherwise be performed by personal computers and private data centers to large computing centers with professional storage and computing capabilities, so that computing resources such as software and hardware can be fully shared [2]. Enterprises and individuals no longer need to spend large sums purchasing infrastructure, nor devote effort to installing, configuring, and maintaining hardware and software; the corresponding services are provided by the Cloud Service Provider (CSP). Users simply pay for the leased computing resources on a timed or metered basis. Because CSPs possess massive data storage capacity and computing resources, they are regarded as the best choice for outsourcing information services [3]. Big data applications are therefore often combined with cloud computing.
Hadoop is currently the most widely known implementation of big data technology. It is an open-source implementation of MapReduce [4] and GFS (Google File System) from Google's cloud computing stack. Hadoop provides a computational framework whose core technologies are HDFS (Hadoop Distributed File System) and MapReduce: HDFS provides a high-throughput distributed file system, while MapReduce is a distributed processing model for large data sets. Together they give big data a reliable shared storage and analysis system [5-6].
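As a concrete illustration of the MapReduce processing model, the following is a minimal sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API; the class name and the input/output paths are illustrative, not taken from the text.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Canonical word-count job: the map phase emits (word, 1) pairs from text
// read off HDFS; the reduce phase sums the counts for each distinct word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // sum all counts for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths on HDFS (illustrative).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map phase runs in parallel over HDFS blocks and the reduce phase aggregates per key, which is what lets this model scale across a cluster to large data sets.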
Although some organizations build their own clusters to run Hadoop, many others choose to run Hadoop, or to offer Hadoop as a service, in clouds built on leased hardware. Examples include Cloudera, which offers Hadoop on public or private clouds, and Amazon's Elastic MapReduce cloud service [7]. Combining cloud computing with Hadoop to process big data has thus become a trend.
2. Big Data Security Risk Analysis
As the scope of big data applications widens, the need for data security becomes increasingly urgent.
Since cloud computing is characterized by outsourcing data to cloud service providers, this service model transfers ownership of the data to the CSP, and the user loses direct control over the physical resources [8]. Big data stored in the cloud usually exists in clear text, and the CSP has underlying control over it, so a malicious CSP may steal a user's data without the user's knowledge. The cloud computing platform may also be attacked, causing its security mechanisms to fail or to fall under illegal control, allowing unauthorized parties to read the data. All of this threatens the security of big data.
Hadoop was not originally designed with security in mind. Starting with Hadoop 1.0.0 and Cloudera CDH3, Hadoop added Kerberos authentication and ACL-based access control [9]. Even with these additions, the security mechanism remains weak. Kerberos authentication is applied only among clients, the Key Distribution Center (KDC), and servers; it provides only machine-level authentication and does not authenticate the Hadoop application platform itself [10]. The ACL-based access control policy, in turn, is configured through attributes in hadoop-policy.xml after ACLs are enabled; it comprises nine attributes that restrict the access of users and group members to resources in Hadoop and the communication between nodes. However, this mechanism relies entirely on the administrator's configuration [11], and such traditional access control lists can be tampered with on the server side without being easily detected. Moreover, the granularity of ACL-based access control is too coarse to protect user privacy fields in a fine-grained way during the MapReduce process, and the lists must be changed frequently for different users and applications, which is cumbersome and hard to maintain. Hadoop's own security mechanism is therefore far from complete.
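As a point of reference for the configuration just described, the excerpt below sketches what the service-level ACL attributes in hadoop-policy.xml look like. The property names follow the Hadoop 1.x service-level authorization scheme (only three of the nine attributes are shown, and the exact set varies by version), the user and group names are invented for illustration, and enforcement additionally requires hadoop.security.authorization to be set to true in core-site.xml.

```xml
<?xml version="1.0"?>
<!-- Excerpt of hadoop-policy.xml: service-level authorization ACLs.
     Each value has the form "user1,user2 group1,group2" (users, then a
     space, then groups); a single "*" grants access to everyone.
     Property names follow Hadoop 1.x; users/groups are illustrative. -->
<configuration>
  <property>
    <name>security.client.protocol.acl</name>
    <value>alice,bob analysts</value>  <!-- who may talk to HDFS as a client -->
  </property>
  <property>
    <name>security.job.submission.protocol.acl</name>
    <value>alice analysts</value>      <!-- who may submit MapReduce jobs -->
  </property>
  <property>
    <name>security.datanode.protocol.acl</name>
    <value>*</value>                   <!-- DataNode-to-NameNode communication -->
  </property>
</configuration>
```

Note that these values live in a plain server-side file under the administrator's control, which is precisely why the weaknesses described above arise: the lists can be altered on the server side without easy detection.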
2.1 Security Risks of CSPs and Users in Different Application Modes
There are multiple application modes for Hadoop in cloud computing. When Hadoop is built in a private cloud, the enterprise deploys Hadoop itself; the platform is used by employees of the enterprise's various departments, and outsiders cannot access or use these resources. In this case, the CSP is whoever creates and manages Hadoop, and the IaaS-level and PaaS-level CSPs are the same entity. When Hadoop is applied on a public cloud platform, the CSP has two levels: the IaaS-level CSP provides the infrastructure, while the PaaS-level CSP is responsible for building and managing Hadoop. In this case, the two levels of CSP are often different entities.