Why Cloudera created the Hadoop security component Sentry

1. The security system of big data

To understand why, we have to start with the four layers of a big data platform's security system: perimeter security, data security, access security, and access behavior monitoring, as shown in the following figure.

Perimeter security refers to traditional network security techniques, such as firewalls and login authentication.

Data security, in a narrow sense, includes the encryption and decryption of user data, which can be subdivided into storage encryption and transmission encryption. It also includes the desensitization (masking) of user data, which can be regarded as "lightweight" encryption. For example, if someone's birthday is "2014-12-12", the desensitized value might be "2014-x-x". The outline of the data is still there, but the exact value can no longer be pinpointed. The higher the degree of desensitization, the less legible the data. The example above could also be desensitized to "x-x-x", which hides the information completely.
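The birthday example can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `desensitize_date` and its `keep_fields` parameter are assumptions made for this sketch, not part of any Hadoop component.

```python
def desensitize_date(value: str, keep_fields: int = 1, mask: str = "x") -> str:
    """Mask every dash-separated field after the first `keep_fields` fields.

    Hypothetical helper for illustration only.
    """
    parts = value.split("-")
    return "-".join(
        part if index < keep_fields else mask
        for index, part in enumerate(parts)
    )


# Keep the year, hide month and day.
print(desensitize_date("2014-12-12"))                 # 2014-x-x
# Higher degree of desensitization: hide everything.
print(desensitize_date("2014-12-12", keep_fields=0))  # x-x-x
```

Lowering `keep_fields` raises the degree of desensitization, matching the trade-off described above between protection and legibility.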

Access security is mainly the management of user authorization. The read, write, and execute permissions for users and groups in Linux/Unix systems are the classic model, and HDFS has extended this concept into a more complete ACL system. In addition, as big data applications become more widespread and more deeply embedded, the need for differentiated access permissions on the data inside a file has become increasingly important.

Access behavior monitoring refers to recording users' access to the system, such as which files they view and which SQL queries they run. On the one hand, it enables real-time alerts so that illegal or dangerous access can be dealt with quickly; on the other hand, it supports after-the-fact investigation and evidence collection, in which the specific purpose of a data access is analyzed and located from records covering a long period of time.

Of these four layers, the third has the most direct relationship with the business layer above it: multi-tenancy of applications and permission-based access control both depend directly on the technical implementation of this layer.

2. HDFS authorization system

In the third layer, the Hadoop ecosystem has long followed the Linux/Unix authorization model, which divides file access into read and write permissions (HDFS has no concept of executable files) and divides permission holders into three categories: the owner, the group, and others. This model limits permissions to those three categories. If you try to add a new "group" whose users have permissions different from those of the owner, the group, or others, the existing Linux/Unix authorization model cannot solve the problem elegantly.

To illustrate, suppose there is a sales department whose manager, manager, has the right to modify sales_data; members of the sales department have the right to view sales_data; and no one outside the sales department has any access to it. The authorization for sales_data is then as follows:

-rw-r-----   3 manager sales          0 2015-01-25 18:51 sales_data

Then the sales department expands and gains two more sales managers, manager1 and manager2, who are also allowed to modify the sales data. In this case, manager1 and manager2 can only share a new account, manager_account, which is then enabled to modify sales_data by means of setuid. This makes managing permissions on the same data complex and hard to maintain.
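The limitation can be made concrete with a minimal Python sketch of the owner/group/other check. It is a simplified model written for this article, not HDFS code; the `FileStatus` class, the `allowed` function, and the user name `outsider` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    """Simplified, hypothetical stand-in for a file's permission metadata."""
    owner: str
    group: str
    mode: str  # nine permission characters, e.g. "rw-r-----"


def allowed(status: FileStatus, user: str, user_groups: set, action: str) -> bool:
    """Check 'r' or 'w' access using only the owner/group/other classes."""
    if action not in ("r", "w"):
        raise ValueError("action must be 'r' or 'w'")
    if user == status.owner:
        bits = status.mode[0:3]
    elif status.group in user_groups:
        bits = status.mode[3:6]
    else:
        bits = status.mode[6:9]
    return action in bits


sales_data = FileStatus(owner="manager", group="sales", mode="rw-r-----")

print(allowed(sales_data, "manager", {"sales"}, "w"))   # True
print(allowed(sales_data, "manager1", {"sales"}, "w"))  # False
print(allowed(sales_data, "manager1", {"sales"}, "r"))  # True
print(allowed(sales_data, "outsider", {"hr"}, "r"))     # False
```

Because manager1 falls into the group class, the only way to give manager1 write access in this model is to give it to every member of sales, which is exactly what the department does not want.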

Because of these problems, support for HDFS ACLs (Access Control Lists) was added in Hadoop 2.4.0, and this feature solves the problem described above well. However, as Hadoop is used more widely in the enterprise, more and more business scenarios require access control that is no longer limited to the file level but constrains the data inside a file in finer detail: which data can be read and written, which can only be read, and which cannot be accessed at all. For SQL-based big data engines, access control is needed not only at the table level but also at the row and column level.
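As a rough sketch of how ACL entries help with the sales_data case, the Python model below adds named-user and named-group entries on top of the base permission bits. It is a simplification made for illustration: the `AclFile` class and `acl_allowed` function are hypothetical, and the ACL mask entry that real POSIX-style ACLs use is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class AclFile:
    """Hypothetical model of a file with base bits plus named ACL entries."""
    owner: str
    group: str
    mode: str                                         # e.g. "rw-r-----"
    named_users: dict = field(default_factory=dict)   # user  -> "rw-"
    named_groups: dict = field(default_factory=dict)  # group -> "r--"


def acl_allowed(f: AclFile, user: str, user_groups: set, action: str) -> bool:
    """Evaluate owner, then named users, then groups, then others (no mask)."""
    if user == f.owner:
        return action in f.mode[0:3]
    if user in f.named_users:
        return action in f.named_users[user]
    group_perms = []
    if f.group in user_groups:
        group_perms.append(f.mode[3:6])
    group_perms += [perm for g, perm in f.named_groups.items() if g in user_groups]
    if group_perms:
        return any(action in perm for perm in group_perms)
    return action in f.mode[6:9]


sales_data = AclFile(
    owner="manager",
    group="sales",
    mode="rw-r-----",
    named_users={"manager1": "rw-", "manager2": "rw-"},
)

print(acl_allowed(sales_data, "manager1", {"sales"}, "w"))  # True
print(acl_allowed(sales_data, "clerk", {"sales"}, "w"))     # False
print(acl_allowed(sales_data, "clerk", {"sales"}, "r"))     # True
print(acl_allowed(sales_data, "outsider", {"hr"}, "r"))     # False
```

The two new managers get write access through their own entries, with no shared account needed. On a real cluster such entries are managed with the `hdfs dfs -setfacl` and `hdfs dfs -getfacl` commands rather than with code like this.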

3. Authorization on Hiveserver2

Hive was one of the earliest engines to bring the high-level query language SQL to the Hadoop platform. The early Hive server process was called Hiveserver1; it supported neither handling multiple connections in parallel nor access authorization control, and both issues were later solved in Hiveserver2. Hiveserver2 can use grant/revoke statements to restrict user access to databases, tables, and views, and row and column permissions are controlled by creating views. However, Hiveserver2's authorization system was found to have a problem: any authenticated user could grant themselves access to any resource. In other words, Hiveserver2 did not provide a secure authorization system; its authorization system was designed as a safeguard to prevent ordinary users from making mistakes, not to protect sensitive data. That said, this is largely the view put forward by certain companies. In fact, Hiveserver2's own security system is being gradually improved, and the problems above are being fixed quickly.

But Hive is not the only engine that needs authorization management. Other query engines also urgently need such techniques to refine and standardize how applications access data. Much of the functionality of fine-grained authorization management is common across engines, so an independently implemented authorization management tool is essential.

4. Sentry provides secure authorization management

Against this background, some developers at Cloudera took the existing authorization management model in Hiveserver2, extended it, and refined many of its details to produce a valuable authorization management tool called Sentry. The following figure compares Sentry with the authorization management model in Hiveserver2:

Many of the basic models and design ideas in Sentry are derived from Hiveserver2, but Sentry strengthens the concept of RBAC (role-based access control) on top of them. In Sentry, all permissions can only be granted to roles, and only when a role is attached to a user group do the users in that group obtain the corresponding permissions. The mapping permissions → roles → groups → users is particularly clear in Sentry, and it shows how a permission ends up being held by a user. The steps from permissions to roles and from roles to user groups are all performed with grant/revoke SQL statements. The step from a user group to a user is handled by Hadoop's own user-to-group mapping. Hadoop provides two kinds of mapping: one maps Linux/Unix users on the local server to their groups, and the other looks up a user's groups via LDAP. The latter is better suited to large systems because it can be configured centrally and is easy to modify.
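The chain from permissions to users can be sketched as a small Python model. It is not Sentry's implementation: the dictionaries, the function names, and the sample role, group, and user names are all assumptions chosen for illustration, and the `user_groups` dictionary merely stands in for Hadoop's user-to-group mapping.

```python
# Hypothetical in-memory model of permissions -> roles -> groups -> users.
role_privileges = {}   # role  -> set of (data object, action)
group_roles = {}       # group -> set of roles
user_groups = {        # stand-in for Hadoop's user-to-group mapping
    "alice": {"analysts"},
    "bob": {"interns"},
}


def grant_privilege(role, data_object, action):
    """Privileges can only be granted to roles, never directly to users."""
    role_privileges.setdefault(role, set()).add((data_object, action))


def grant_role(role, group):
    """Attach a role to a user group."""
    group_roles.setdefault(group, set()).add(role)


def revoke_role(role, group):
    """Detach a role from a user group."""
    group_roles.get(group, set()).discard(role)


def privileges_of(user):
    """Resolve a user's effective privileges through groups and roles."""
    privileges = set()
    for group in user_groups.get(user, set()):
        for role in group_roles.get(group, set()):
            privileges |= role_privileges.get(role, set())
    return privileges


grant_privilege("sales_reader", "sales.sales_data", "SELECT")
grant_role("sales_reader", "analysts")

print(privileges_of("alice"))  # {('sales.sales_data', 'SELECT')}
print(privileges_of("bob"))    # set()

revoke_role("sales_reader", "analysts")
print(privileges_of("alice"))  # set()
```

Because a user never holds a privilege directly, revoking one role from one group removes the privilege from every member of that group in a single step.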

Sentry extends the data objects supported in Hiveserver2 from database/table/view to server, URI, and column granularity. Column-level control can also be achieved with views, but with many users and a large number of tables the view approach makes view naming extremely complex. Moreover, a query that a user originally wrote against the base table can no longer be used directly, because the view name may be completely different from the table name.

Currently, the authorization levels supported by Sentry 1.4 are limited to SELECT, INSERT, and ALL, but subsequent releases will be able to support the same authorization levels as Hiveserver2. Although Sentry is derived from the authorization management model in Hiveserver2, it is not limited to managing Hive; it is also intended to manage other query engines such as Impala and Solr. The architecture of Sentry is shown in the following diagram:

There are three important components in Sentry's architecture: the Binding, the Policy Engine, and the Policy Provider.

The Binding implements authorization for the different query engines: Sentry inserts its own hook functions into the compilation and execution stages of each SQL engine. These hooks play two main roles. The first is to act as a filter, letting through only SQL queries that hold access rights to the data objects involved. The second is to take over authorization: once Sentry is in use, grant/revoke management is handled entirely by Sentry, the execution of grant/revoke statements is implemented entirely in Sentry, and the authorization information of all engines is stored in a single database configured by Sentry. In this way, the permissions of all engines are managed centrally.

The Policy Engine determines whether an incoming permission request matches a saved permission description, and the Policy Provider is responsible for reading the previously configured access permissions from a file or database. These two components act as a shared module, which can later be used to serve other query engines.
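How the three components could fit together is sketched below in Python. This is a toy model under stated assumptions, not Sentry's actual classes or API: every class and function name is hypothetical, policies live in memory rather than in a file or database, and the rule that a privilege on a parent object (for example a table) also covers its children (for example its columns) is an assumption made for this sketch.

```python
class InMemoryPolicyProvider:
    """Policy Provider stand-in: returns the stored privileges for groups."""

    def __init__(self, policies):
        # group -> list of (resource path tuple, action)
        self._policies = policies

    def privileges_for(self, groups):
        return [p for group in groups for p in self._policies.get(group, [])]


class PolicyEngine:
    """Policy Engine stand-in: matches a request against stored privileges."""

    def __init__(self, provider):
        self._provider = provider

    def is_allowed(self, groups, resource, action):
        for granted_resource, granted_action in self._provider.privileges_for(groups):
            # Assumption: a grant on a parent path covers everything below it.
            covers = resource[:len(granted_resource)] == granted_resource
            if covers and granted_action in ("ALL", action):
                return True
        return False


def binding_hook(engine, groups, required_accesses):
    """Binding stand-in: a filter an SQL engine would call before running a query."""
    for resource, action in required_accesses:
        if not engine.is_allowed(groups, resource, action):
            raise PermissionError(f"{action} on {'.'.join(resource)} denied")


provider = InMemoryPolicyProvider({
    "analysts": [(("server1", "sales", "sales_data"), "SELECT")],
})
engine = PolicyEngine(provider)
column = ("server1", "sales", "sales_data", "amount")

binding_hook(engine, {"analysts"}, [(column, "SELECT")])  # passes silently
try:
    binding_hook(engine, {"analysts"}, [(column, "INSERT")])
except PermissionError as error:
    print(error)  # INSERT on server1.sales.sales_data.amount denied
```

Only `binding_hook` would need to change from one query engine to the next; the provider and the engine know nothing about SQL, which is what lets them be shared.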

5. Summary

Everyone is working on fine-grained access control for big data platforms, but the platform vendors are still dominated by Cloudera and Hortonworks. Cloudera promotes Sentry as its core authorization system, while Hortonworks relies on the open-source community on the one hand and on its acquisition, XA Secure, on the other. However the two companies' influence on the big data platform market changes in the future, fine-grained access authorization on big data platforms is worth learning.
