HBASE 1.0
Predecessor: BigTable

BigTable: Google's distributed storage system, built on top of GFS (Google File System)

HBase (open-source implementation of BigTable): highly reliable, high-performance, column-oriented, and scalable

Stores structured and semi-structured data

Benefits:

Horizontal scalability is particularly good

Dependencies:

File storage system: HDFS

Massive data processing: MapReduce

Coordination and management service: ZooKeeper

Satisfies the need for real-time computation over large volumes of data

Comparison with RDBMS:

Data types: an RDBMS uses the relational data model and offers many built-in data types; HBase uses a simple data model and stores every value as an uninterpreted array of bytes

Data manipulation: an RDBMS offers rich operations (insert, update, delete, query, multi-table joins); HBase offers only simple operations (put, get, scan, delete) and has no joins

Storage model: an RDBMS stores data row by row; HBase stores data by column family

Indexing: an RDBMS can build complex indexes over multiple columns; HBase indexes only the row key

Data maintenance: an RDBMS update overwrites the old value; HBase keeps the old versions of the data

Scalability: an RDBMS relies mainly on vertical scaling (a bigger machine), which is limited; HBase scales horizontally by adding region servers

HBase access interfaces:

Java API

HBase shell

Thrift gateway

REST gateway

SQL interfaces: Pig (SQL-like scripts); Hive (accessing HBase with HiveQL)

HBase data types:

Column qualifiers

Each value is an uninterpreted array of bytes

A row has a row key and multiple columns

A table is made up of one or more column families

HBase data model:

Column families support dynamic extension; updated data keeps its old versions rather than overwriting them (HDFS only supports appending data, not in-place updates)
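
A rough illustration of dynamic extension, assuming the HBase 1.x Java client libraries: a new column family can be added to an existing table at runtime through the Admin API. The table name "t1" and the new family name "F2" are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class AddColumnFamilyDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableName name = TableName.valueOf("t1");            // placeholder table name
                admin.disableTable(name);                            // schema changes are done on a disabled table
                admin.addColumn(name, new HColumnDescriptor("F2"));  // add a new column family on the fly
                admin.enableTable(name);
            }
        }
    }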

Base Elements:

Row key: rowkey

Column family

Column qualifier

Cell (each cell carries a timestamp that identifies the version of the data)

Coordinate concept:

Four-dimensional positioning: (row key, column family, column qualifier, timestamp) locates a single cell
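
A minimal sketch of this four-dimensional addressing, assuming the HBase 1.x Java client: the combination (row key, column family, column qualifier, timestamp) pins down exactly one cell version. The names "t1", "R1", "F1", "C1" and the timestamp value are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellCoordinateDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("t1"))) {
                Get get = new Get(Bytes.toBytes("R1"));                  // dimension 1: row key
                get.addColumn(Bytes.toBytes("F1"), Bytes.toBytes("C1")); // dimensions 2 and 3: family + qualifier
                get.setTimeStamp(1500000000000L);                        // dimension 4: an exact version timestamp
                Result r = table.get(get);
                byte[] value = r.getValue(Bytes.toBytes("F1"), Bytes.toBytes("C1"));
                System.out.println(Bytes.toString(value));               // the single cell at that coordinate
            }
        }
    }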

Tables are sparse: empty cells consume no storage

HBASE: column-oriented storage

RDBMS: row-oriented storage, suited to transactional operations on complete records; analysis is harder because it requires full table scans

4.3 Principles of HBASE Implementation

4.3.1 Library Functions, Master Servers, and Region Servers

Master Servers:

Maintains and manages partition information

Maintains a list of region servers

Confirms which region servers are currently working

Responsible for assigning regions to region servers and for load balancing

Handles table-level operations such as creating and deleting tables

Region servers:

Clients do not rely on the Master to obtain region location information

Store and manage the user data (regions) assigned to them

A region server manages roughly 10-1000 regions. Within a region there is one Store per column family; each Store keeps that column family's data in StoreFiles (HFile format). All regions on a region server share a single HLog.

Write data process: the write is routed to a region server, written to the MemStore write cache, and recorded in the HLog (write-ahead log); the write is acknowledged only after it has been written to the HLog

Read data flow: the region server first checks the MemStore read cache (which holds the most recent data), then falls back to the StoreFiles on disk

Cache flush: the MemStore contents are periodically flushed to a new StoreFile, and a flush marker is written to the HLog

Each flush produces a new StoreFile, so each Store accumulates multiple StoreFiles

Each region server has its own HLog. On startup it checks the HLog to see whether new writes arrived after the last cache flush; if so, it replays them and flushes them to a new StoreFile, deletes the old HLog, and then starts serving requests.

StoreFile compaction: when the number of StoreFiles in a Store reaches a threshold, they are merged. When a StoreFile grows beyond a size threshold, the region is split.

4.4 How HLog works

ZooKeeper monitors the region servers. When one fails, the Master handles the failure: it recovers from the failed server's HLog, splits the HLog entries by region, and assigns each region together with its portion of the HLog to a new region server, which replays the log.

An HBase table is divided into multiple regions (each roughly 1 GB-2 GB, depending on server capability)

A single region is never split across multiple region servers

Locating a region:

.META. table: maps each region (region ID) to the region server that holds it, i.e. the metadata for user regions

-ROOT- table: records the locations of the .META. regions; it is never split, so it has only one region

Three-tier addressing:

ZooKeeper file -> -ROOT- table -> multiple .META. regions -> multiple user data tables

The client caches the three-tier addressing results and calls the HBase access interface through the cache; only when the cache becomes stale does it resolve the addresses again

ZooKeeper also elects the Master and ensures that only one Master is active at a time

4.5 HBase application scenarios

Performance optimization:

1) Keep time-adjacent data close together in storage: embed a timestamp in the row key and use Long.MAX_VALUE - timestamp so the newest data sorts first (see the sketch after this list)

2) Improve read/write performance: call HColumnDescriptor.setInMemory(true) when creating the table so that the column family is kept in the in-memory block cache

3) Save storage space: limit the maximum number of versions; to keep only the latest version of the data, set MaxVersions to 1

4) Set the TimeToLive (TTL) parameter so that expired data is cleaned up automatically
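
A minimal sketch of points 2)-4), assuming the HBase 1.x client API, together with the reversed-timestamp row key from point 1). The table name "events", the column family "d", and the 7-day TTL are placeholder values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTunedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                HColumnDescriptor cf = new HColumnDescriptor("d");
                cf.setInMemory(true);            // 2) keep this family in the in-memory block cache tier
                cf.setMaxVersions(1);            // 3) keep only the latest version to save space
                cf.setTimeToLive(7 * 24 * 3600); // 4) expire cells automatically after 7 days
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("events"));
                table.addFamily(cf);
                admin.createTable(table);
            }
            // 1) Reversed-timestamp row key: newer data sorts before older data.
            long now = System.currentTimeMillis();
            byte[] rowKey = Bytes.toBytes(Long.MAX_VALUE - now);
        }
    }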

Monitoring HBase performance:

Master-status (queried from a web browser)

Ganglia

OpenTSDB

Ambari

SQL queries over HBase:

1) Hive integrated with HBase

2) Phoenix

HBase secondary indexes

By default, only the row key is indexed

HBase row access methods:

1) Access by a single row key

2) Range access between a start row key and an end row key

3) Full table scan

Example secondary-index solutions:

Hindex, HBase + Redis, Solr + HBase

Mechanisms for secondary indexes:

HBase coprocessors:

Endpoint: analogous to a stored procedure

Observer: analogous to a trigger

An Observer watches data-insertion operations and synchronously writes to an index table, which provides indexing over the master table's columns (see the sketch below)

The result is two HBase tables: the master (data) table and the index table
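
A minimal sketch of the Observer mechanism described above, assuming the HBase 1.x coprocessor API: a RegionObserver intercepts each Put on the master table and mirrors the indexed column's value into an index table. The names "index_table", "info", and "name" are hypothetical.

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexObserver extends BaseRegionObserver {
        private static final byte[] CF  = Bytes.toBytes("info"); // hypothetical column family
        private static final byte[] COL = Bytes.toBytes("name"); // hypothetical indexed column

        @Override
        public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                           Put put, WALEdit edit, Durability durability) throws IOException {
            List<Cell> cells = put.getFamilyCellMap().get(CF);
            if (cells == null) {
                return; // this Put does not touch the indexed family
            }
            for (Cell cell : cells) {
                if (!Bytes.equals(CellUtil.cloneQualifier(cell), COL)) {
                    continue; // not the indexed column
                }
                // Index row key = indexed column value; the cell stores the master-table row key.
                Put indexPut = new Put(CellUtil.cloneValue(cell));
                indexPut.addColumn(CF, Bytes.toBytes("rowkey"), put.getRow());
                try (Table indexTable =
                         ctx.getEnvironment().getTable(TableName.valueOf("index_table"))) {
                    indexTable.put(indexPut);
                }
            }
        }
    }

A read then looks up the value in the index table to get the master-table row key, and fetches the full row from the master table with a second get.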

4.6 Shell commands for HBase

Three deployment modes: standalone, pseudo-distributed, fully distributed (the distributed modes store data on HDFS)

Create a table:

create 't1', 'F1', 'F2', 'F3'

List tables:

list

Data is added one cell (one column of one row) at a time:

put 't1', 'R1', 'F1:C1', '1,2,3'

scan 't1', {COLUMNS => 'F1:C1'}

get 't1', 'R1'

Delete a table:

disable 't1'
drop 't1'

4.7 Java API + HBase
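
A minimal sketch of the HBase 1.x Java client API, reusing the table 't1' and column family 'F1' from the shell examples above: open a connection, write one cell, and read it back.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseJavaApiDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("t1"))) {
                // Write one cell: row R1, column F1:C1
                Put put = new Put(Bytes.toBytes("R1"));
                put.addColumn(Bytes.toBytes("F1"), Bytes.toBytes("C1"), Bytes.toBytes("1,2,3"));
                table.put(put);

                // Read the cell back
                Get get = new Get(Bytes.toBytes("R1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("F1"), Bytes.toBytes("C1"));
                System.out.println(Bytes.toString(value));
            }
        }
    }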