How to handle 430 million records per day in SQLServer
Project Background

This was a project for a data center, and it was ridiculously difficult. It really made me feel that business is a battlefield and I was just one of the pawns: too many tactics, too much maneuvering among the higher-ups, too many inside stories. When I have time I'll write a separate post about the project itself.

The project was for environmental monitoring. We'll tentatively call the monitored equipment "collection devices" and the attributes of a collection device "monitoring indicators". The requirements: the system must support no fewer than 100,000 monitoring indicators, each monitoring indicator updates at most every 20 seconds, and storage latency must not exceed 120 seconds. A simple calculation then gives a rather idealized target: 300,000 records per minute, 18 million per hour, that is, 432 million records per day. In reality the data volume runs about 5% higher than that. (Most of it is actually junk data that could be reduced by compression, but some people just wanted to mess with us, so what can you do?)
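Spelled out, the arithmetic behind those figures is just the indicator count times the update rate; a quick sanity check, nothing more:

-- 100,000 indicators, each producing a value at most every 20 seconds
SELECT 100000 * (60 / 20)           AS records_per_minute,  -- 300,000
       100000 * (60 / 20) * 60      AS records_per_hour,    -- 18,000,000
       100000 * (60 / 20) * 60 * 24 AS records_per_day;     -- 432,000,000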

Those are the figures the project required, and I suspect plenty of students with real big-data experience will turn up their noses: is that all? Well, I had also read a lot about big-data processing without ever having done it, and from the outside it all looked straightforward: distributed this, read-write separation that, easy to solve. But the problem is not that simple. As I said above, this was a very nasty project, a typical case of vicious competition in the industry.

There were no additional servers. Besides hosting the database, this one server also had to run the centralized collector (data parsing, alarms, and storage) and serve a northbound SNMP interface of 300,000 points; before the program was optimized, CPU usage sat above 80% all year round. Because the project required dual-machine hot standby, and to save time and avoid unnecessary trouble, we put the related services together so we could take full advantage of the HA features (an externally purchased HA system).

The data-correctness requirements were absurdly strict: from the bottom-level collection system all the way up to the top-level monitoring system, not a single record could be lost or wrong. Our system architecture is shown below; as you can see, the database pressure is enormous, especially on the LevelA node:

Hardware configuration: CPU: Intel Xeon E5-2609 (4-core, 2.40 GHz, 10 MB cache, 6.4 GT/s); Memory: 4 GB (2x2 GB) DDR3 RDIMM, 1333 MHz, ECC; Hard disk: 500 GB 7200 RPM 3.5'' SATA3, RAID 5.

The database was SQLServer 2012 Standard Edition, genuine software supplied by HP, which lacks many of the powerful features of Enterprise Edition.

Writing bottlenecks

The first roadblock we hit was that, with the existing program, SQLServer simply could not keep up with this volume of data. What exactly was the situation?

Our storage structure

Generally, to store a large amount of historical data, we physically split tables by time; otherwise, with millions of records per day, a single table grows to hundreds of millions of records within a year. So our original table structure looked like this:

CREATE TABLE [dbo].[His20140822](
    [No] [bigint] IDENTITY(1,1) NOT NULL,
    [Dtime] [datetime] NOT NULL,
    [MgrObjId] [varchar](36) NOT NULL,
    [Id] [varchar](50) NOT NULL,
    [Value] [varchar](50) NOT NULL,
    CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED
    (
        [No] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

No is the unique identifier, MgrObjId is the collection device Id (a GUID), Id is the monitoring indicator Id (varchar(50)), Dtime is the record time, and Value is the recorded value. The collection device Id and monitoring indicator Id were indexed for fast lookup.
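The original post doesn't show that index definition, but based on the description it would have looked something like this (the index name, and whether it was one composite index or two separate single-column ones, are my assumptions):

CREATE NONCLUSTERED INDEX Idx_His20140822
    ON dbo.His20140822 (MgrObjId, Id);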

Batch Write

Writes used SqlBulkCopy at the time. Yes, that one, the class that supposedly writes millions of records in seconds:

// requires: using System.Data; using System.Data.SqlClient;
public static int BatchInsert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
{
    using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
    {
        BulkCopyTimeout = 300,
        NotifyAfter = dt.Rows.Count,
        BatchSize = batchSize,
        DestinationTableName = desTable
    })
    {
        // map every DataTable column onto the destination table by name
        foreach (DataColumn column in dt.Columns)
            sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
        sbc.WriteToServer(dt);
    }
    return dt.Rows.Count;
}

What is the problem?

The architecture above was fine at 40 million records per day. But when adjusted to the configuration described above, the centralized monitoring program ran out of memory. Analysis showed that too much incoming data was being held in memory without being written to the database in time; in the end more data was produced than consumed, memory overflowed, and the program could no longer work.

Where exactly was the bottleneck?

Is it because of the RAID disks? Is it the data structure? Is it the hardware? Is it the SQLServer version? Is it the lack of partitioned tables? Or was it the program?

We only had a week. Fail to fix it within that week and the project supervisor would tell us to pack up and leave. So there was the "feat" of working 48 hours straight, and the phone calls everywhere begging people to lend a hand.

But what you really need at a time like this is to stay calm and collected. The SQLServer version? The hardware? Neither could be changed at that point. The RAID array? Probably not the cause either. So what the hell was it? Really, just fucking calm down.

You probably can't appreciate the tension in that room; in fact, after all this time, it's hard for me to put myself back into that situation. But I can say this: maybe we have all kinds of methods now, or as outsiders we have more room to think, but when a project presses you to the point of giving up, your thinking at that moment, constrained by the on-site environment, can go badly astray. You might think fast, or your thinking might stall. Under that kind of pressure some colleagues made even more low-level mistakes, their thinking completely scrambled and their efficiency even lower: 36 hours without closing their eyes, or only squinting for two or three hours on the construction site (mud everywhere on rainy days, mud and dust everywhere when dry), then carrying on like that for a week straight! Or longer!

A lot of people offered ideas that seemed to work and yet didn't seem to work. Wait, why "seem to work and seem not to work"? I vaguely caught a glimpse of a direction. What was it? Right: validation. We were guessing in a live environment; just because there was no problem before doesn't mean there's no problem under pressure. Analyzing such a small feature inside a large system has too many confounding effects, so we should decompose it. Yes, unit testing: testing individual methods. We needed to validate each function and each individual step to find exactly where the time was going.

Step-by-step testing to validate system bottlenecks

Modifying BulkCopy's parameters. The first thing that came to mind was tweaking SqlBulkCopy's various parameters, BulkCopyTimeout, BatchSize, testing and adjusting them constantly. The results always fluctuated within a range that made no real difference. It may have affected some CPU counters, but it fell far short of my expectations: the write speed still hovered around 10,000~20,000 records per 5 seconds, far from the requirement of writing 200,000 records within 20 seconds.

Storing by collection device. Yes, the structure above is one record per value per indicator; isn't that too wasteful? So would one record per collection device + collection time be feasible? The question is how to deal with different collection devices having different attributes. Here a colleague showed his talent: the monitoring indicators and their values can be stored in XML format. Wow, it can be done like that? As for querying, it can be done in the FOR XML style.

So we got this structure: No, MgrObjId, Dtime, XMLData.
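A minimal sketch of what such a table and a query against it might look like; the table name, the XML shape, and the .value() lookup are illustrative assumptions, not the project's actual schema:

CREATE TABLE dbo.HisXml20140822 (
    [No]     bigint IDENTITY(1,1) NOT NULL,
    MgrObjId varchar(36) NOT NULL,  -- collection device Id
    Dtime    datetime    NOT NULL,  -- collection time
    XMLData  xml         NOT NULL   -- all indicator Ids and values for this device at this time
);

-- pull one indicator's value back out of the XML column
SELECT Dtime,
       XMLData.value('(/Props/Prop[@Id="someIndicatorId"]/@Value)[1]', 'varchar(50)') AS Value
FROM dbo.HisXml20140822
WHERE MgrObjId = '';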

Validation of the results: slightly better than the original structure, but not by much.

Partitioning the data table. I hadn't learned this skill yet at the time; the articles I read online made it look quite complex, and with time running out I didn't dare try it.

Stopping other programs. I knew this wasn't actually an option, because the software and hardware architecture couldn't be modified for the time being, but I wanted to verify whether those programs were a contributing factor. It turned out the improvement was indeed noticeable, but still not enough.

Is it a SQLServer bottleneck? Out of ideas, could this be SQLServer's own bottleneck? I looked up information online; it might be an IO bottleneck. Damn, what else was there to do: upgrade the server, replace the database? But would the project side pay for that?

Wait, there seemed to be one more thing: indexes. Right, indexes! The presence of indexes affects insert and update performance.

Removing the indexes

Yes, queries would definitely be slower after removing the indexes, but I had to verify first whether removing them would speed up writes. So I decisively removed the indexes on the MgrObjId and Id fields.
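Dropping them is a one-line statement per index; a sketch, assuming the composite index name from the earlier sketch:

DROP INDEX Idx_His20140822 ON dbo.His20140822;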

Run it, and a miracle appeared: each batch of 100,000 records could be written in full within 7~9 seconds, which met the system requirement.

But how do we handle queries?

A single table takes in more than 400 million records a day, which is impossible to query without an index. What to do? I went back to our old approach of physically splitting tables. Originally we split tables by day; now we split them by hour. That gives 24 tables, each holding only about 18 million records.

Then query the history of one attribute over an hour or several hours. The result: slow! Slow! Slow! Querying more than ten million records without an index is simply unthinkable. What else could be done?

Keep splitting. It occurred to me that we could also split tables by the underlying collector, because collection devices differ between collectors, and when we query a historical curve we only look at the curve of a single indicator, so the data can be scattered across different tables.

No sooner said than done. Splitting by the 10 embedded collectors and by 24 hours produced 240 tables per day (history table names look like His_001_2014112615), and at last the problem of writing more than 400 million records a day while supporting simple queries was solved!!!
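A rough sketch of how such an hourly, per-collector table (with no indexes, so writes stay fast) might be created on the fly; the dynamic SQL and naming follow the His_001_2014112615 pattern but are my assumptions, not the project's actual code:

-- table for collector 001, current hour, e.g. His_001_2014112615
DECLARE @table sysname = 'His_001_' + FORMAT(GETDATE(), 'yyyyMMddHH');
DECLARE @sql nvarchar(max) = N'
CREATE TABLE dbo.' + QUOTENAME(@table) + N' (
    [No]     bigint IDENTITY(1,1) NOT NULL,
    Dtime    datetime    NOT NULL,
    MgrObjId varchar(36) NOT NULL,
    Id       varchar(50) NOT NULL,
    Value    varchar(50) NOT NULL
);';  -- deliberately no index here
EXEC sp_executesql @sql;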

Query optimization

With the above problem solved, half the difficulties of this project were behind us, and for a while the project supervisor was too embarrassed to come over and find fault. I don't know what kind of tactical arrangement that was.

A long while later, now nearly the end of the year, trouble came again: the idea being to drag you down so you can't take on other projects at year end.

This time the requirement was this: the above was simulated with 100,000 monitoring indicators, but in the actual deployment only about 50,000 devices were online. That obviously couldn't meet the tender requirements and couldn't pass acceptance. So what to do? These clever people reasoned: since the monitoring indicators were halved, we'll halve the time too. That is, with the current 50,000 devices, data had to be stored within 10 seconds. Holy shit, by that logic, if we had only 500 monitoring indicators, should data be stored within 0.1 seconds? Don't you care about the feelings of those monitored devices?

But what can you do when people want to play you? Take it. With the time cut to 10 seconds, the problem appeared. Careful analysis of the logic above shows that the tables were split by collector; now there were fewer collectors, but the volume per collector increased. What happened? Writes could still be supported, but each table's record count approached 4 million, and for some collection devices with many monitoring indicators it approached 6 million. How do you get past that?

So the technical staff got together to discuss countermeasures.

How to optimize the query without indexing?

A colleague suggested that the order of the where clause conditions would affect the query, because the conditions are filtered in order: the first condition sifts out part of the data, and the next condition then filters what remains. It sounded plausible, but doesn't the SQLServer query analyzer optimize that automatically? Forgive me for being a noob, I was just going by feel, and it felt like it should be optimized automatically, just like the VS compiler.

In any case, let the facts speak:

The colleague modified the client, and the test feedback was that there was a big improvement. I checked the code:

Did it really make that big a difference? Wait a minute, did he forget to clear the cache, creating an illusion? So I asked my colleague to execute the following SQL to get more information:

-- before optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
select Dtime,Value from dbo.his20140825 WHERE Dtime>='' AND Dtime<='' AND MgrObjId='' AND Id=''
SET STATISTICS IO OFF

-- after optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
select Dtime,Value from dbo.his20140825 WHERE MgrObjId='' AND Id='' AND Dtime>='' AND Dtime<=''
SET STATISTICS IO OFF

The result is as follows:

The version before "optimization" was actually slightly better?

A closer look at the IO data shows that the read-ahead reads are the same, meaning the records being queried are identical; the physical reads and scan counts are also identical throughout. Only the logical reads differ slightly, which should be down to the number of cache hits. In other words, without an index, the order of the where clause conditions has no obvious effect on query performance.

So the only way left was through indexes.

Attempts to build an index

Building an index is not a simple matter; it requires understanding some basics, and along the way I took plenty of wrong turns before finally getting the index right.

The following experiments were validated against a table with the following total record count:

Building an index on a single field. This idea was mainly influenced by how I had built the in-memory data structure: Dictionary<MgrObjId, Dictionary<Id, Property>>. I thought that by indexing MgrObjId first and then Id, SQLServer queries would be faster.
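A sketch of that first attempt (the table follows the his20141008 naming used in the later examples; the index names are illustrative):

-- one index per lookup level of the in-memory dictionary
CREATE NONCLUSTERED INDEX Idx_MgrObjId ON dbo.his20141008 (MgrObjId);
CREATE NONCLUSTERED INDEX Idx_Id       ON dbo.his20141008 (Id);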

Building the index on MgrObjId first took 5 minutes 25 seconds and produced an index of 550 MB. The result was just like the estimated plan above: it didn't help at all, and was actually slower.

Building an index on multiple fields. OK, since the above didn't work, what about a composite index on multiple conditions?

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id,Dtime)

The result: the query was indeed about twice as fast:

Wait, is this all the benefit an index brings? 7 minutes 25 seconds and 1.1 GB of space in exchange for this? Something had to be wrong, so I started digging around and consulted some books on the subject, and eventually made further progress.

Correct indexing

First of all, we need to understand a few key points of indexing:

After indexing, filtering first on the indexed field with the fewest repeated values gives the best results. For our table, if a clustered index on No is created, it is best to put No first in the where clause, followed by Id, then MgrObjId, and finally the time; a time index is best left out altogether if the table is an hourly table.

The order of the where clause conditions determines whether the query analyzer uses the index. For example, if an index is created on MgrObjId and Id, then where MgrObjId='' and Id='' and Dtime='' will use an index seek, whereas where Dtime='' and MgrObjId='' and Id='' will not necessarily use one.

Put the non-indexed result columns into the included columns. Since our conditions are MgrObjId, Id, and Dtime, the returned results only need to contain Dtime and Value; putting Dtime and Value in the included columns means the returned index rows already contain those values, so there is no need to look up the physical table, which gives optimal speed.

Following the principles above, we created this index:

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id) INCLUDE(Value,Dtime)

It took a little over 6 minutes to build, and the index size is 903 MB.

Let's take a look at the estimated plan:

You can see that the index is used fully here, with no extra lookups. And the actual execution result: less than 1 second. Astonishingly, it filters the result out of 11 million records in under a second! Awesome!

How do you apply an index?

Now that writing is solved and reading is solved, how do we combine them? We can index data that is more than an hour old and leave the current hour's data unindexed. In other words, don't create the index when the table is created; build it afterwards!
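In practice this can be a small job that runs just after each hour and indexes the table that has just closed; a sketch, using the same illustrative naming as above:

-- build the covering index on the previous hour's table
DECLARE @table sysname = 'His_001_' + FORMAT(DATEADD(HOUR, -1, GETDATE()), 'yyyyMMddHH');
DECLARE @sql nvarchar(max) = N'
CREATE NONCLUSTERED INDEX Idx_' + @table + N'
    ON dbo.' + QUOTENAME(@table) + N' (MgrObjId, Id) INCLUDE (Value, Dtime);';
EXEC sp_executesql @sql;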

How else can we optimize?

You can try read-write separation: write to two libraries, one real-time and one read-only. Data from the last hour is queried from the real-time library; data older than an hour is queried from the read-only library. The read-only library is loaded on a schedule and then indexed; data older than a week is analyzed, processed, and then stored. That way, whatever time range is queried, it can be handled correctly: the real-time library for the last hour, the read-only library for an hour to a week, and the report library for anything older than a week.

If you don't need to physically split the tables, you can simply rebuild the indexes in the read-only library at regular intervals.
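As a rough sketch of that hourly transfer-and-index step (the RealTimeDB and ReadOnlyDB database names are purely illustrative):

-- copy a closed hour from the real-time library into the read-only library
SELECT Dtime, MgrObjId, Id, Value
INTO ReadOnlyDB.dbo.His_001_2014112615
FROM RealTimeDB.dbo.His_001_2014112615;

-- then index it in the read-only library only
CREATE NONCLUSTERED INDEX Idx_His_001_2014112615
    ON ReadOnlyDB.dbo.His_001_2014112615 (MgrObjId, Id) INCLUDE (Value, Dtime);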

Summary

Handling hundreds of millions of records (historical data) in SQLServer can be done as follows:

Remove all indexes on a table

Insert with SqlBulkCopy

Subdivide or partition a table to reduce the total amount of data per table

Build the indexes only after a table has been completely written

Specify the index fields correctly

Put the fields that need to be returned in the included columns (the returned index already contains everything needed)

Queries return only the fields needed
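
Putting the checklist together, a minimal end-to-end sketch might look like this (names are illustrative; step 2 happens from the application via SqlBulkCopy rather than in SQL):

-- 1. create the hourly table with no indexes, so writes stay fast
CREATE TABLE dbo.His_001_2014112615 (
    [No]     bigint IDENTITY(1,1) NOT NULL,
    Dtime    datetime    NOT NULL,
    MgrObjId varchar(36) NOT NULL,
    Id       varchar(50) NOT NULL,
    Value    varchar(50) NOT NULL
);

-- 2. the collector bulk-inserts into it with SqlBulkCopy during that hour

-- 3. once the hour has closed, build the covering index
CREATE NONCLUSTERED INDEX Idx_His_001_2014112615
    ON dbo.His_001_2014112615 (MgrObjId, Id) INCLUDE (Value, Dtime);

-- 4. queries ask only for the fields they need
SELECT Dtime, Value
FROM dbo.His_001_2014112615
WHERE MgrObjId = '' AND Id = '' AND Dtime >= '' AND Dtime <= '';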
