Web search:
Google's distributed storage system Bigtable relies on GFS
HBase (open-source implementation of Bigtable): highly reliable, high-performance, column-oriented, scalable
Stores structured and semi-structured data
Benefits:
Horizontal scalability is particularly good
Dependencies:
File storage system: HDFS
Massive data processing: MapReduce
Coordination service: ZooKeeper
Satisfies: real-time processing (random reads and writes) of large volumes of data
Data types:
RDBMS: relational data model, multiple data types
HBase: every value is stored as uninterpreted bytes
Data manipulation:
Storage model:
Indexing:
Data maintenance:
Scalability:
Vertical scaling:
Horizontal scaling:
HBase's access interfaces:
Java API
Shell
Thrift gateway
REST gateway
SQL-like interfaces: Pig (write SQL-like scripts in Pig Latin); Hive (access HBase with HiveQL)
HBase's data types:
Column qualifiers
Each value is an uninterpreted byte array
A row has one row key and any number of columns
A table consists of one or more column families
HBase data model:
Column families support dynamic extension (new columns can be added at any time); updates keep the old versions instead of overwriting them (HDFS can only append data)
Base Elements:
Row key: rowkey
Column family
Column qualifier
Cell (timestamp concept, corresponding to the version of the data)
Coordinate concept:
Four-dimensional positioning: row key, column family, column qualifier, timestamp
Sparse tables
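The four-dimensional coordinate and the sparse-table idea can be sketched in a few lines of Python. This is a toy model, not HBase code: the class name SparseTable and its methods are hypothetical, chosen only for illustration.

```python
class SparseTable:
    """Toy model: a cell is addressed by
    (row key, column family, column qualifier, timestamp)."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.cells = {}                # only cells that exist are stored (sparse)

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Values are uninterpreted bytes; an older version is kept, not overwritten.
        self.cells[(row, family, qualifier, ts)] = bytes(value)

    def get(self, row, family, qualifier, ts=None):
        # Collect every stored version of this column, keyed by timestamp.
        versions = {key[3]: value for key, value in self.cells.items()
                    if key[:3] == (row, family, qualifier)}
        if not versions:
            return None                # an empty cell costs no storage
        if ts is None:
            ts = max(versions)         # default to the newest version
        return versions.get(ts)


table = SparseTable(["info"])
table.put("row1", "info", "name", b"alice", ts=1)
table.put("row1", "info", "name", b"alicia", ts=2)
latest = table.get("row1", "info", "name")
first = table.get("row1", "info", "name", ts=1)
```

Reading without a timestamp returns the newest version, while passing ts=1 still reaches the older one, which is the versioning behavior the cell concept describes.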
HBase: column-oriented storage
RDBMS: row-oriented storage; suits transactional operations that read or write complete records, but is poorly suited to analysis, which often requires a full table scan
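A small sketch of the difference, using made-up records (the field names and values are illustrative only). Row-oriented layout keeps each record together; column-oriented layout keeps each column together, so an analysis over one column touches only that column.

```python
records = [
    {"id": 1, "name": "ann", "age": 30},
    {"id": 2, "name": "bob", "age": 40},
    {"id": 3, "name": "cat", "age": 50},
]

# Row-oriented: one complete record per entry (good for whole-record transactions).
row_store = [tuple(record.values()) for record in records]

# Column-oriented: one list per column (good for analysis over a single column).
column_store = {column: [record[column] for record in records]
                for column in records[0]}

# Analysis on the row store has to walk every complete record...
avg_age_rows = sum(row[2] for row in row_store) / len(row_store)
# ...while the column store reads only the "age" column.
avg_age_columns = sum(column_store["age"]) / len(column_store["age"])
```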
4.3 Principles of HBase Implementation
4.3.1 Library Functions, Master Servers, and Region Servers
Master Servers:
Partition Information
Maintains and manages partition information
Maintains a list of region servers
Confirms which region servers are currently working
Responsible for assigning regions to region servers and balancing load among them
Handles creating, deleting, modifying, and querying tables
Region servers:
The client does not rely on the Master to get location information
Storage and management of user data
Region server: hosts 10-1000 regions. Each region has one Store per column family; each Store has a MemStore cache and multiple StoreFiles (stored as HFiles). All regions on a region server share one HLog
Write path: region server - write the log (HLog) - write the MemStore cache. The write is acknowledged only after the HLog entry has been written
Read path: region server - read the MemStore cache first (latest data) - if not found, read the StoreFiles
Cache flush: the MemStore contents are periodically flushed to a StoreFile, and a marker is written to the HLog to record the flush
Each flush generates a new StoreFile, so each Store contains multiple StoreFiles
Each region server has its own HLog. On startup, the region server checks the HLog for writes made after the last cache flush. If it finds any, it replays them into the MemStore and flushes them to a new StoreFile, then deletes the old HLog and starts to provide services
StoreFile merging: when the number of StoreFiles reaches a threshold, they are merged into one. When a single StoreFile exceeds the size threshold, it triggers a split of the region
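The write, read, flush, and merge steps can be sketched as a toy model. This is not HBase code: the class names, the tiny thresholds, and the ("flushed",) log marker are hypothetical, and the sketch assumes the log entry is recorded before the MemStore update. Region splitting is not modeled.

```python
class Store:
    """One column family of one region: a MemStore plus immutable StoreFiles."""

    def __init__(self, flush_threshold=2, merge_threshold=3):
        self.memstore = {}
        self.storefiles = []                  # oldest first
        self.flush_threshold = flush_threshold
        self.merge_threshold = merge_threshold

    def flush(self):
        # Each flush writes the MemStore out as one new, sorted StoreFile.
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        if len(self.storefiles) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Merge all StoreFiles into one; newer files win on duplicate keys.
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        self.storefiles = [dict(sorted(merged.items()))]


class RegionServer:
    def __init__(self):
        self.hlog = []                        # one log shared by the server
        self.store = Store()

    def put(self, key, value):
        self.hlog.append(("put", key, value))  # log first (assumed order)
        self.store.memstore[key] = value       # then the in-memory cache
        if len(self.store.memstore) >= self.store.flush_threshold:
            self.store.flush()
            self.hlog.append(("flushed",))     # marker recording the flush

    def get(self, key):
        if key in self.store.memstore:         # latest data lives in the cache
            return self.store.memstore[key]
        for storefile in reversed(self.store.storefiles):  # newest file first
            if key in storefile:
                return storefile[key]
        return None


server = RegionServer()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10),
                   ("d", 4), ("e", 5), ("f", 6)]:
    server.put(key, value)
```

With these thresholds, the seven writes cause three flushes; the third flush reaches the merge threshold, so the three StoreFiles collapse into one, and the last write stays in the MemStore.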
4.4 How HLog works
ZooKeeper monitors the region servers and notifies the master when one fails. The master handles the failure by recovering from the failed server's HLog: it splits the HLog by region, then reassigns each region, together with its part of the HLog, to a new region server, which replays the log entries
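The recovery step (split the shared log by region, then hand each region and its log entries to a live server) can be sketched as follows. The entry layout, the function names, and the round-robin assignment policy are assumptions made for illustration, not HBase's actual behavior.

```python
from collections import defaultdict


def split_hlog(hlog):
    """Group the failed server's log entries by region, preserving order."""
    per_region = defaultdict(list)
    for entry in hlog:
        per_region[entry["region"]].append(entry)
    return dict(per_region)


def reassign(hlog, live_servers):
    """Assign each region and its log slice to a live server (round-robin,
    a hypothetical policy chosen only to keep the sketch short)."""
    plan = {}
    for i, (region, entries) in enumerate(sorted(split_hlog(hlog).items())):
        plan[region] = (live_servers[i % len(live_servers)], entries)
    return plan


# Entries from different regions are interleaved in the shared log.
failed_server_hlog = [
    {"region": "r1", "op": "put a"},
    {"region": "r2", "op": "put b"},
    {"region": "r1", "op": "put c"},
]
plan = reassign(failed_server_hlog, ["rs2", "rs3"])
```

The new server for each region would then replay its slice of the log in order before serving requests.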
An HBase table is divided into multiple regions (each about 1-2 GB, depending on server performance)
A single region is never split across different region servers
Locating a region:
.META. table: maps each region ID to a server ID (stores region metadata)
-ROOT- table: has only one region
Three-tier addressing:
ZooKeeper file - -ROOT- table - multiple .META. table regions - multiple user table regions
The client caches the results of three-tier addressing and uses the cached locations when it calls the HBase access interface. When a cached location becomes invalid, the client performs the addressing again
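The three-tier lookup with a client-side cache can be sketched like this. All table contents, server names, and the Client class are hypothetical; for brevity the cache is keyed by row key, whereas a real client would cache region ranges.

```python
import bisect


def find_range(index, key):
    """index is a sorted list of (start_key, target); return the target
    whose key range contains key."""
    starts = [start for start, _ in index]
    return index[bisect.bisect_right(starts, key) - 1][1]


ZOOKEEPER = {"root": "ROOT"}                     # tier 1: where the root table is
TABLES = {
    "ROOT": [("", "META1"), ("m", "META2")],     # tier 2: root -> meta regions
    "META1": [("", "rs1"), ("g", "rs2")],        # tier 3: meta -> region servers
    "META2": [("m", "rs3"), ("t", "rs1")],
}


class Client:
    def __init__(self):
        self.cache = {}
        self.lookups = 0                         # counts full three-tier lookups

    def locate(self, row_key, refresh=False):
        if refresh:                              # cached location went stale
            self.cache.pop(row_key, None)
        if row_key not in self.cache:
            self.lookups += 1
            root = ZOOKEEPER["root"]
            meta = find_range(TABLES[root], row_key)
            self.cache[row_key] = find_range(TABLES[meta], row_key)
        return self.cache[row_key]


client = Client()
first = client.locate("h")
second = client.locate("h")                      # served from the cache
third = client.locate("h", refresh=True)         # forces re-addressing
```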
ZooKeeper elects the master server and makes sure there is only one active master
4.5 HBase application scenarios
Performance optimization:
1) Store data that is close in time together: include the timestamp in the row key, and use Long.MAX_VALUE - timestamp so that the newest rows sort first
2) Improve read/write performance: call HColumnDescriptor.setInMemory(true) when creating a table, so that the column family's data is kept in the region server's in-memory cache
3) Save storage space: set the maximum number of versions; to keep only the latest version of the data, set the max-version parameter to 1
4) Set the time-to-live (TTL) parameter: expired data is deleted automatically
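Item 1 can be illustrated with a short sketch. The key layout (an entity ID, a separator, and an 8-byte big-endian reversed timestamp) and the function name are assumptions for illustration; LONG_MAX mirrors Java's Long.MAX_VALUE.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def make_row_key(entity_id: str, ts_millis: int) -> bytes:
    """Build a row key whose byte order puts the newest timestamp first."""
    reverse_ts = LONG_MAX - ts_millis
    # Fixed-width big-endian bytes sort the same way as the numbers do.
    return entity_id.encode() + b"|" + reverse_ts.to_bytes(8, "big")


# Row keys are stored in sorted byte order, so sort them the same way here.
keys = sorted(make_row_key("sensor1", ts) for ts in (1000, 3000, 2000))
newest_first = [LONG_MAX - int.from_bytes(key[-8:], "big") for key in keys]
```

Because the reversed timestamp shrinks as time advances, a scan that starts at the entity's prefix meets the most recent rows first.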
Monitoring HBase performance:
master-status (queried from a web browser)
Ganglia
OpenTSDB
Ambari
Querying HBase with SQL:
1) Hive integrated with HBase
2) Phoenix
HBase secondary indexes
By default, only the row key is indexed
HBase row access:
1) Access a single row by its row key
2) Access a range of rows by specifying start and end row keys
3) Full table scan
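The three access patterns all follow from rows being kept sorted by row key. A minimal sketch, with a hypothetical SortedTable class; the range here includes the start key and excludes the stop key.

```python
import bisect


class SortedTable:
    """Toy table whose rows are kept sorted by row key."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)

    def get(self, row_key):
        """1) Single row access by row key."""
        return self.rows.get(row_key)

    def scan(self, start=None, stop=None):
        """2) Range access with start (inclusive) and stop (exclusive);
        3) with no bounds this is a full table scan."""
        lo = 0 if start is None else bisect.bisect_left(self.keys, start)
        hi = len(self.keys) if stop is None else bisect.bisect_left(self.keys, stop)
        return [(key, self.rows[key]) for key in self.keys[lo:hi]]


table = SortedTable({"r3": "c", "r1": "a", "r2": "b", "r4": "d"})
single = table.get("r2")
interval = table.scan("r2", "r4")
everything = table.scan()
```

Any query that cannot be phrased as a row key or a row key range falls back to the full scan, which is the motivation for secondary indexes.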
Sample secondary index solutions:
Hindex, HBase + Redis, Solr + HBase
Mechanisms for secondary indexes:
HBase coprocessors
Endpoint - analogous to a stored procedure
Observer - analogous to a trigger
An Observer monitors data insertions and synchronously writes to the index table, which builds the index on the chosen tables and columns
HBase main table - index table
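The observer idea (a hook that fires on every insert and keeps an index table in step with the main table) can be sketched in plain Python. This is a conceptual model, not the HBase coprocessor API: the class IndexedTable and the hook name _post_put are hypothetical.

```python
class IndexedTable:
    """Main table plus an index table maintained by a trigger-like hook."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.main = {}     # row key -> {column: value}
        self.index = {}    # column value -> set of row keys

    def _post_put(self, row_key, old_row, new_row):
        """Observer-style hook: runs after every put and updates the index."""
        old = old_row.get(self.indexed_column)
        new = new_row.get(self.indexed_column)
        if old == new:
            return
        if old is not None:                    # remove the outdated index entry
            self.index[old].discard(row_key)
            if not self.index[old]:
                del self.index[old]
        if new is not None:                    # add the new index entry
            self.index.setdefault(new, set()).add(row_key)

    def put(self, row_key, columns):
        old_row = dict(self.main.get(row_key, {}))
        new_row = {**old_row, **columns}
        self.main[row_key] = new_row
        self._post_put(row_key, old_row, new_row)

    def find_by_value(self, value):
        """Look up row keys through the index instead of a full table scan."""
        return sorted(self.index.get(value, ()))


users = IndexedTable(indexed_column="city")
users.put("u1", {"city": "Paris"})
users.put("u2", {"city": "Paris"})
users.put("u1", {"city": "Rome"})
```

After the third put, the hook has moved u1 from the Paris entry to a Rome entry, so queries by city never need to scan the main table.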
4.6 Shell commands for HBase
Three deployment modes: standalone, pseudo-distributed, distributed
HDFS
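Create a table (table name, then column families F1, F2, F3):
create 'table', 'F1', 'F2', 'F3'
List tables:
list
Add data (only 1 column of 1 row at a time):
put 'table', 'R1', 'F1:C1', '1,2,3'
Scan a table, restricted to one column:
scan 'table', {COLUMNS => 'F1:C1'}
Get a row:
get 'table', 'R1'
Delete a table (disable it first, then drop it):
disable 'table'
drop 'table'
4.7 Java API + HBase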