How many slices is reasonable to allocate to an Elasticsearch cluster

Elasticsearch is a very versatile platform that supports a wide variety of use cases and provides great flexibility in organizing data and replication strategies. However, this flexibility can sometimes make it difficult to determine early on how best to organize data into indexes and shards, especially for those unfamiliar with the Elastic Stack. Suboptimal choices will not necessarily cause problems at launch, but they can lead to performance issues as the amount of data grows. The more data the cluster holds, the harder the problem is to correct, since fixing it may require reindexing large amounts of data.

So when we run into performance issues, they can often be traced back to how the data was indexed and to the number of shards in the cluster. That raises two questions: how many shards should we have, and how big should each shard be?

Let's start with the building blocks of a cluster:

Node: a single Elasticsearch instance. Typically, a node runs in its own container or virtual machine.

Index: in ES, an index is a collection of documents.

Shard: because ES is a distributed search engine, an index is usually split into pieces that are distributed across different nodes; these pieces are called shards. ES automatically manages and organizes shards, rebalancing them when necessary, so you normally don't have to worry about the details of how shards are handled.

Replica: by default (in versions prior to 7.0), ES creates 5 primary shards for an index, and one replica for each. This means each index has 5 primary shards, and each primary shard has a corresponding replica. For a distributed search engine, the allocation of shards and replicas is at the heart of the design for high availability and fast search response. Both primaries and replicas can serve query requests; the only difference is that only primaries handle indexing requests. While replicas matter for search performance, users can add or remove them at any time: extra replicas give more capacity, higher throughput, and better failure recovery.

Consider a cluster with two nodes and the default sharding configuration. ES distributes the five primary shards across the two nodes and places each replica on a different node than its primary. For example, node1 might hold primary shards 1, 2, and 3 of a given index plus replicas of shards 4 and 5, while node2 holds primary shards 4 and 5 plus replicas of shards 1, 2, and 3.

When data is written to a shard, it is periodically flushed to immutable Lucene segments on disk, where it becomes searchable. As the number of segments grows, they are periodically combined into larger segments; this process is called merging. Because all segments are immutable, disk usage typically fluctuates during indexing: new, merged segments must be written before the segments they replace can be deleted. Merging can be very resource intensive, especially in terms of disk I/O.

A shard is the unit by which an Elasticsearch cluster distributes data. The speed at which Elasticsearch can move shards when rebalancing data, for example after a failure, depends on the size and number of shards as well as on network and disk performance.

Note 1: avoid very large shards, as they can impair the cluster's ability to recover from failures. There is no hard limit on shard size, but in practice many scenarios cap shards at around 50GB.

Note 2: once an index has been created in an Elasticsearch cluster, understand that its number of primary shards cannot be changed while the index is live (the replica count, by contrast, can be adjusted at any time). If you later find you need a different number of primary shards, you must create a new index and reindex the data into it (reindexing is time consuming, but it at least avoids downtime).
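The create-and-reindex workflow described above can be sketched with the REST API. The index names, shard count, and a cluster reachable on the default port are all illustrative assumptions:

```
PUT /logs_v2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

POST /_reindex
{
  "source": { "index": "logs_v1" },
  "dest":   { "index": "logs_v2" }
}
```

Once the reindex completes, an alias can be switched from the old index to the new one so that clients see no downtime.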

Configuring primary shards is similar to partitioning a hard disk: to change the partition layout of a disk that already holds data, you must back up the data, create the new partitions, and then write the data back onto them.

Note 3: whenever possible, use time-based indexes to manage data retention, grouping data into indexes by retention period. Time-based indexes also make it easy to change the number of primary shards and replicas over time, because the settings can simply be changed for the next index to be generated.

For each Elasticsearch index, information about its mappings and state is stored in the cluster state, which is kept in memory for quick access. As a result, having a large number of indexes in a cluster leads to a larger cluster state, especially if the mappings are large. Updates can then become slow, because all cluster-state updates go through a single thread to guarantee consistency before the changes are distributed across the cluster.

Each shard has data that must be kept in memory and uses heap space. This includes data structures holding information at the shard level, but also at the segment level to define where the data resides on disk. The size of these data structures is not fixed and varies with the use case. An important characteristic of this per-shard overhead, however, is that it is not strictly proportional to shard size: larger shards carry less overhead per unit of data than smaller shards, and the difference can be significant. To store as much data as possible on each node, it becomes important to manage heap usage and minimize this overhead. The more heap space a node has, the more data and shards it can handle.

As a result, indexes and shards are not free from the cluster's point of view: each index and each shard carries some level of resource overhead.

Every shard you allocate comes at an additional cost:

Note 1: smaller shards result in smaller segments, which increases overhead. Aim to keep the average shard size between a few GB and a few tens of GB. For time-based data, shards between 20GB and 40GB are common.

Note 2: since the overhead of each shard depends on the number and size of its segments, forcing smaller segments to merge into larger ones via the force merge operation reduces overhead and improves query performance. Ideally, this should be done once no more data is being written to the index. Note that force merge is an expensive operation in terms of performance and overhead, so it should be run during off-peak hours.
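A force merge on a read-only index might look like the following; the index name is illustrative, and merging down to a single segment is the common choice for indexes that will receive no further writes:

```
POST /logs-2023.09/_forcemerge?max_num_segments=1
```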

Note 3: the number of shards a node can hold is proportional to the amount of heap memory available, but Elasticsearch does not enforce a fixed limit. A good rule of thumb is to keep the number of shards per node below 20 to 25 per GB of configured heap. A node with 30GB of heap should therefore hold at most 600-750 shards, and staying well below that limit is even better. This usually helps keep the cluster healthy.
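The rule of thumb above is simple arithmetic; a minimal sketch (the function name is our own, not an Elasticsearch API):

```python
# Rule-of-thumb shard budget per node: 20-25 shards per GB of heap,
# as described above. Purely illustrative arithmetic.
def shard_budget(heap_gb: float) -> tuple[int, int]:
    """Return the (conservative, upper-bound) shard count for one node."""
    return int(heap_gb * 20), int(heap_gb * 25)

print(shard_budget(30))  # 30GB heap -> at most 600-750 shards
```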

Note 4: if you are concerned about rapid data growth, it is advisable to follow this limit: Elasticsearch recommends a maximum JVM heap of 30-32GB, so cap the shard size at 30GB and then make a reasonable estimate of the number of shards. For example, if the data may reach 200GB, allocate at most 7 to 8 shards.
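The estimate works out as a simple ceiling division; a sketch with illustrative numbers:

```python
import math

# Estimate primary shard count from projected data volume, using the
# 30GB-per-shard cap described above.
def estimate_shards(total_gb: float, max_shard_gb: float = 30) -> int:
    return math.ceil(total_gb / max_shard_gb)

print(estimate_shards(200))  # 200GB of data -> 7 shards of <=30GB each
```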

Note 5: suppose you need date-based indexes but rarely search the indexed data. These indexes may grow to number in the hundreds or thousands, yet each holds only 1GB of data or less. For such scenarios, the recommendation is to allocate just one shard per index. If you use the ES default (5 shards) and generate a daily index with Logstash, after 6 months you will have around 890 shards. Beyond that, your cluster will struggle unless you provide more nodes (say, 15 or more). Consider that most Logstash users search infrequently, often less than once a minute, so a more economical setup is recommended here. Since search performance is not a priority in this scenario, you don't need many replicas: a single replica is sufficient for data redundancy. The proportion of data constantly loaded into memory, however, remains correspondingly high. If each index needs only one shard, such a Logstash deployment can be maintained on a 3-node cluster for up to 6 months. You should use at least 4GB of heap per node, but 8GB is recommended, as it is significantly faster and copes better with large data volumes.
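A single-shard, single-replica layout like the one described can be applied automatically to each day's index with an index template. A sketch using the legacy `_template` API (the template name and index pattern are illustrative):

```
PUT /_template/logstash_single_shard
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

Every new index whose name matches the pattern then picks up these settings at creation time, with no per-index configuration.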

In Elasticsearch, each query is executed in a single thread per shard. However, multiple shards can be processed in parallel, and multiple queries and aggregations can run against the same shard concurrently.

This means that, caching aside, the minimum query latency depends on the data, the type of query, and the shard size. Querying a large number of small shards makes the per-shard processing faster, but more tasks must be queued and processed in sequence, so it is not necessarily faster than querying a smaller number of larger shards. Having many small shards also reduces query throughput when there are multiple concurrent queries.

The best way to determine the maximum shard size from a query-performance perspective is to benchmark with realistic data and queries. Always benchmark with the queries and load that the node will face in production, since optimizing for an individual query can produce misleading results.

When using time-based indexes, each index is typically associated with a fixed time period. Daily indexes are very common and are typically used for data with short retention periods or large daily volumes; they allow retention to be managed at fine granularity and make it easy to adjust for day-to-day changes in volume. Data with longer retention periods, especially where daily volumes do not justify daily indexes, is often indexed weekly or monthly to keep shard sizes up. This reduces the number of indexes and shards the cluster must hold over time.

Note: if you use time-based indexes covering a fixed period, you need to tune the period covered by each index to the data's retention period and expected volume in order to hit the target shard size. In other words, to arrive at the desired shard size, decide between daily, weekly, or monthly indexes based on how long the data will be kept and how much of it you expect.
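The decision described above can be sketched as a small helper. The thresholds and the assumption of one primary shard per index are illustrative, not Elasticsearch defaults:

```python
# Pick the smallest index period whose accumulated data roughly fills
# one target-sized shard, assuming one primary shard per index.
def pick_period(daily_gb: float, target_shard_gb: float = 30) -> str:
    for period, days in (("daily", 1), ("weekly", 7), ("monthly", 30)):
        if daily_gb * days >= target_shard_gb:
            return period
    return "monthly"  # low-volume data: accumulate for at least a month

print(pick_period(40))  # 40GB/day fills a shard every day -> "daily"
print(pick_period(5))   # 5GB/day -> ~35GB per week -> "weekly"
print(pick_period(1))   # 1GB/day needs a month -> "monthly"
```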

Time-based indexing with fixed intervals works well when the amount of data is reasonably predictable and changes slowly. If the volume changes rapidly, it is difficult to maintain a uniform target shard size. To better handle this type of scenario, the Rollover and Shrink APIs were introduced.
