Demand background:
As businesses grow and the requirements for operation and maintenance efficiency and quality rise, the demand for automated operation and maintenance systems is increasing as well.
At present, many of the medium and large enterprise customers I serve still run their operation and maintenance in a primitive "slash and burn" state.
The "knife" and "fire" here are the remote clients that operation and maintenance personnel use, such as Xshell and Windows Remote Desktop.
This working model has many limitations.
For example, installing and initializing servers, databases, and middleware, deploying application software, and releasing and monitoring services are all done by hand.
Operation and maintenance personnel have to log in to the servers and manage and maintain them one by one.
With dozens or hundreds of machines, this becomes exhausting.
The author has operated and maintained more than 4,000 servers with a team of just over 20 people. Think about it: can that work be done by hand?
In addition, manual operation depends too heavily on each person following the right execution order and steps. A moment of carelessness can lead to a production accident, and even a double check before a change cannot guarantee that nothing goes wrong.
As the saying goes, if you walk by the river often enough, your shoes will get wet.
At this point, operation and maintenance personnel began to explore scripts and batch management tools.
This approach does improve efficiency and quality, but it is not a general solution.
The first problem is that scripts are not standardized.
Every person has their own style of solving problems, and the differences between people are large, so version management of scripts written by different people is a challenge.
The second problem is handing scripts over. A company's staff is not static: people join and people leave. When someone resigns and hands over their work, their scripts often cannot be passed on and reused by the rest of the team.
Therefore, building an automated operation and maintenance system has become the only option.
So how do we build an automated operation and maintenance system? This article covers three aspects:
The first is why we should build an automated operation and maintenance system.
The second describes, based on the author's experience, how the system is designed, how it runs, and how it handles problems.
The third is the author's thoughts on problems encountered while automating operation and maintenance, followed by a summary.
This article focuses on automated operation and maintenance systems for databases.
The core content is as follows:
1. Reasons for building an automated operation and maintenance system
Why build an automated operation and maintenance system?
The answer lies in the challenges encountered during operation and maintenance.
The first is the demand for changes.
It shows up in three ways:
First, the number of changes is large. We currently serve 30,000 companies, which is a large volume.
Second, there are many types of changes. Different customers have different needs, including but not limited to capacity expansion, performance optimization, fault handling, DG switchover and migration, and RAC setup.
Third, changes are risky. Some are high-risk operations, and handling them through automation is safer.
The second is the operation and maintenance environment: there are many servers and many database types. Our customers are free to choose which database to use, and each choice brings a different environment.
The third is the human factor.
When building an automated operation and maintenance system, one of the more important considerations is the human factor.
Operation and maintenance personnel differ in ability, technical level, and even in their working habits and tools.
That is exactly why we must create a standardized automated operation and maintenance system to improve work efficiency.
2. How to build an automated operation and maintenance system
Let’s take a look at how each module is designed and works.
1. Automated installation system
Installing databases is one of the more tedious tasks, and there is a lot of it to do.
There are many systems to install, but few people and even less time, so automated installation saves time and effort. The whole automation process uses a common framework, mainly for installing Oracle and MySQL on Linux.
Basic security settings are applied before delivery to users, which improves security to some extent and removes some manual work.
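The idea of one common framework shared by the Oracle and MySQL installers can be sketched as follows. This is only an illustration under stated assumptions, not the author's implementation: the InstallPlan class, the step names, and every script path under /opt/autoinstall are hypothetical, and the plan is rendered as a dry run rather than executed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A step receives the target host name and returns the shell command to run.
Step = Callable[[str], str]


@dataclass
class InstallPlan:
    """An ordered list of named steps; the same structure serves every database type."""
    db_type: str
    steps: List[Tuple[str, Step]] = field(default_factory=list)

    def add(self, name: str, step: Step) -> "InstallPlan":
        self.steps.append((name, step))
        return self

    def render(self, host: str) -> List[str]:
        # Dry run: return the commands as text instead of executing them.
        return [f"[{self.db_type}] {name}: {step(host)}" for name, step in self.steps]


def build_plan(db_type: str) -> InstallPlan:
    """Assemble prepare -> install -> harden; only the install step differs per database."""
    # All script paths below are hypothetical placeholders.
    installers = {
        "oracle": lambda host: "/opt/autoinstall/install_oracle.sh",
        "mysql": lambda host: "/opt/autoinstall/install_mysql.sh",
    }
    if db_type not in installers:
        raise ValueError(f"unsupported database type: {db_type}")
    plan = InstallPlan(db_type)
    plan.add("prepare", lambda host: f"/opt/autoinstall/prepare_os.sh {host}")
    plan.add("install", installers[db_type])
    # Basic security settings are applied before the server is delivered.
    plan.add("harden", lambda host: "/opt/autoinstall/basic_security.sh")
    return plan


if __name__ == "__main__":
    for line in build_plan("mysql").render("db01"):
        print(line)
```

Keeping the preparation and hardening steps shared means that supporting another database type only requires a new entry in the installers table.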
2. Automated operation and maintenance platform
After the database has been installed automatically on a server, the server is taken over by the automated operation and maintenance platform.
The automated operation and maintenance platform is the working platform for operation and maintenance personnel. It mainly addresses the security, efficiency, and speed problems that come with managing machines in large numbers.
The following factors should be considered during design. One is to build the operation interface of the whole operation and maintenance system on a bastion host architecture.
Operation and maintenance engineers can log in to the management system anytime, anywhere to do their work, which is more convenient, and instructions are issued to the managed machines through SecureCRT.
Leverage existing protocols and tools.
A defining feature of this platform is that all systems are managed over SSH rather than through self-developed agents, which also reflects our view of automated operation and maintenance.
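Agentless management over SSH can be illustrated with a small batch runner. This is a minimal sketch under assumptions, not the platform described here: it assumes the OpenSSH client is installed and key-based login is already set up, and the "ops" user name, the timeouts, and the worker count are hypothetical choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def ssh_command(host: str, remote_cmd: str, user: str = "ops") -> List[str]:
    """Build the argument list for a non-interactive call to the OpenSSH client."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        remote_cmd,
    ]


def run_over_ssh(argv: List[str]) -> Tuple[int, str]:
    """Run one ssh command and return (exit code, combined output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 255, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def run_on_all(
    hosts: List[str],
    remote_cmd: str,
    runner: Callable[[List[str]], Tuple[int, str]] = run_over_ssh,
    workers: int = 10,
) -> Dict[str, Tuple[int, str]]:
    """Run the same command on every host in parallel, keyed by host name."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: runner(ssh_command(h, remote_cmd)), hosts)
        return dict(zip(hosts, results))


if __name__ == "__main__":
    # Hypothetical host names; replace with real inventory data.
    for host, (code, output) in run_on_all(["db01", "db02"], "uptime").items():
        print(host, code, output.strip())
```

Because the runner is passed in as a parameter, the same function can be exercised without any network access by supplying a fake runner.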
3. Automated inspection system
Since we have many customer systems and a lot of business, how do we design a system to inspect them while they run?
We have adopted two methods: a self-developed central control system and a third-party management platform. Let's first look at the self-developed central control system:
Use one dedicated server to inspect the other database nodes. The scripts can be written in shell or Python.
Set the interval at which the nodes are traversed. If a fault occurs, operation and maintenance personnel are notified promptly by phone call or text message.
The second method is to host all database nodes on a third-party monitoring platform.
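For the self-developed central control approach, the traverse-and-notify loop might look like the sketch below. It is an assumption-laden illustration rather than the author's script: the check and notify callables are hypothetical hooks that would wrap a real database probe and a real phone or text message gateway, and the 300-second default interval is arbitrary.

```python
import time
from typing import Callable, List, Optional


def inspect_once(
    nodes: List[str],
    check: Callable[[str], bool],
    notify: Callable[[str, str], None],
) -> List[str]:
    """Check every node once, notify for each failure, and return the failed nodes."""
    failed = []
    for node in nodes:
        try:
            healthy = check(node)
        except Exception as exc:
            # A probe that crashes is treated as a failed node.
            healthy = False
            detail = f"check raised {exc!r}"
        else:
            detail = "check reported unhealthy"
        if not healthy:
            failed.append(node)
            notify(node, detail)
    return failed


def inspect_forever(
    nodes: List[str],
    check: Callable[[str], bool],
    notify: Callable[[str, str], None],
    interval_seconds: float = 300,
    rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Traverse all nodes at a fixed interval; rounds=None means run until stopped."""
    done = 0
    while rounds is None or done < rounds:
        inspect_once(nodes, check, notify)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)


if __name__ == "__main__":
    # Hypothetical probe and notifier for demonstration only.
    inspect_forever(
        ["db01", "db02"],
        check=lambda node: node != "db02",
        notify=lambda node, detail: print(f"ALERT {node}: {detail}"),
        interval_seconds=1,
        rounds=2,
    )
```

Injecting the check, notify, and sleep functions keeps the loop independent of any particular database driver or messaging service.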
4. Automated performance analysis system
No system runs stably forever, and performance problems are inevitable, so a performance analysis system is a top priority.
I will cover this in a separate article.
5. Automated monitoring and early warning system
Customer systems usually run 24/7, which requires monitoring with early warning.
An early warning monitoring system plus on-duty personnel is the standard configuration.
The early warning monitoring system is built the same way as the inspection system, but it collects different indicators.
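One way to turn collected indicators into early warnings is a simple threshold table, sketched below. The indicator names and threshold values are hypothetical examples and are not taken from the system described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one indicator (higher values are worse)."""
    metric: str
    warning: float
    critical: float


def evaluate(sample: Dict[str, float], thresholds: List[Threshold]) -> Dict[str, str]:
    """Return an alert level for every indicator that needs attention."""
    levels = {}
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            # An indicator that was not collected is itself worth an alert.
            levels[t.metric] = "missing"
        elif value >= t.critical:
            levels[t.metric] = "critical"
        elif value >= t.warning:
            levels[t.metric] = "warning"
    return levels


if __name__ == "__main__":
    # Hypothetical indicators and limits.
    thresholds = [
        Threshold("tablespace_used_pct", warning=80, critical=90),
        Threshold("active_sessions", warning=200, critical=400),
    ]
    print(evaluate({"tablespace_used_pct": 85.0, "active_sessions": 450.0}, thresholds))
```

Indicators within their limits are left out of the result, so an empty dictionary means nothing needs to be sent to the on-duty personnel.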
6. Automated backup system
Three data centers across two locations + DG + NBU
3. Thoughts on building an automated operation and maintenance system
The author sums up the goals of building an automated operation and maintenance system in four keywords.
The first is completeness. The system must be able to cover all operation and maintenance requirements.
The second is simplicity: the system should be simple and easy to use, and the learning cost for operation and maintenance personnel should not be high. The more complex and harder to use a system is, the less likely its capabilities and efficiency are to be fully used.
The third is efficiency, especially when batch processing or performing specific tasks.
The fourth is security. If an operation and maintenance system is not secure, it may be quickly taken over by hackers.
Summary
The author is also gradually shifting focus from database architecture, optimization, and fault handling to automated operation and maintenance systems.
Looking back, I think there are three points worth sharing for your reference.
The first is to proceed step by step:
Focus on the current problem and handle it well, and the problems that follow will be easier to solve.
If the system designed at the start is very large and rich in functions, it tends to get out of control. If the initial goal is targeted at a few specific problems, it is much easier to move forward.
In the automated operation and maintenance system the author helped build, our initial goal was a basic platform for batch change operations, and we first moved some of the work that has to be done repeatedly onto that platform.
We then enriched the platform's functions and improved efficiency according to operation and maintenance needs, and finally connected the surrounding systems to one another to form a complete automated operation and maintenance system.
The second is to consider scalability:
When designing a system, you may not need to plan for every function or design detail, but you must consider whether the system can still cope when the number of servers grows significantly.
The third is practicality:
If the system is inconvenient to use, operation and maintenance personnel will give up on it at the first try, and then how can it be promoted?
How to build an automated database operation and maintenance system