Congratulations on choosing to learn Linux. You may be about to start a career working with it, so before you set off, let me walk you through what you need to know about Linux and Linux Ops. Source: WeChat official account "ma brother linux ops".
Linux has become the main operating system for today's mid- to high-end servers, and its efficiency, ease of customization, and wide range of applications make it hard to replace. Linux runs on a wide variety of hardware, including mobile phones, tablets, routers, video game consoles, desktop computers, mainframes, and supercomputers. With the rapid growth of Linux in the Chinese market, the domestic shortage of Linux talent has become increasingly apparent, and Linux roles are now among the most actively recruited.
To begin with, Linux is a very big topic, and it is impossible to master all of it. In theory, if you understood all of Linux you could do any of this work. In practice, I prefer to ask what kind of work you want to do, and then decide which part of Linux you need to learn.
Based on my own experience, here are the common areas of Linux work and the jobs they correspond to.
1) Linux applications. Strictly speaking this is not Linux itself but applications that run on Linux, such as web, network, and IT services. Careers include system development, back-end development, server performance optimization, and operations and maintenance;
2) Linux customization. This mostly involves the user-space packages of a Linux distribution, with some kernel work, and mainly covers customization and services for commercial Linux distributions. Red Hat is one example, along with many other foreign companies; on domestic recruiting sites most of these openings are support roles.
3) Linux kernel development. This is mainly kernel driver development, and almost all of it is programming work. Employers are mainly chip companies and the product companies that use their chips: the former include Intel and Marvell, the latter ZTE and Huawei.
4) Android derivatives. Because Android, and the gradually emerging Tizen, use the Linux kernel, the same reasoning as in 3) applies. Mobile chip companies and handset makers are therefore also employers of Linux developers, for example Qualcomm and TI;
One, The main job content of Linux operation and maintenance
Among these jobs, Linux operation and maintenance has the highest demand for people and the highest salaries, so this article focuses on it as a career. The content was jointly written by MaGuo education, an organization specializing in Linux operation and maintenance training and career development, and by enthusiasts.
Internet Linux operation and maintenance is service-centered, with stability, security, and efficiency as its three basic goals, ensuring that the company's Internet business can provide users with high-quality service 7 × 24 hours. Its responsibilities cover the whole product life cycle: design, release, operation and maintenance, change and upgrade, and finally going offline.
These life-cycle responsibilities are important and extensive, but an operation and maintenance engineer's duties do not end there. Engineers also need to summarize the problems they encounter, distill the relevant technical directions, and develop tools and platforms that support and optimize the business and improve operation and maintenance efficiency. The related technical work mainly includes:
Service monitoring technology: development and application of monitoring platforms, and ensuring that service monitoring is accurate, real-time, and comprehensive
Service fault management: design of fault response plans, automated execution of those plans, and feeding fault summaries back into product/system design to improve stability
Service capacity management: measuring service capacity, and planning the construction, expansion, and migration of the server rooms that host the service
Service performance optimization: improving service performance and response speed from every direction, including network, operating system, application, and client optimization, to improve the user experience
Service global traffic scheduling: taking in the traffic bound for the service and distributing it among server rooms according to their capacity and service status
Service task scheduling: triggering the service's timed and non-timed tasks and monitoring their status
Service security: including access security of the service, anti-attack, privilege control, etc.
Data transmission technology: research, development, and application of transmission technologies such as p2p, as well as solving the problem of transferring large volumes of data over long distances
Service automated release and deployment: research and development of deployment platforms/tools, and using them to release services safely and efficiently
Service cluster management: including service server management, large-scale cluster management, etc.
Service cost optimization: to reduce the resources used for service operation as much as possible, and to reduce the cost of service operation
Database management (DBA): designing, developing, and managing high-performance database clusters to make database services more stable, more efficient, and easier to manage
Platform development: the development and management of docker and other platforms, and service access technology
Distributed storage platform development, optimization and access
In short, the technical scope of operation and maintenance covers all work related to service quality, efficiency, cost, and security, together with the technologies, components, tools, and platforms involved. Doing well in each technical direction and building the corresponding components, tools, and platforms helps operation and maintenance fulfill its duties and has a key impact on the development of the business.
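As an illustration of the global traffic scheduling item above, here is a minimal Python sketch that splits incoming traffic among server rooms in proportion to capacity and skips rooms whose service status is unhealthy. The function name split_traffic, the room names, and the capacity/healthy fields are all assumptions made for this example; a real scheduler would also consider latency, cost, and gradual traffic shifts.

```python
def split_traffic(total_qps, rooms):
    """Split total_qps across healthy rooms in proportion to their capacity.

    rooms maps a room name to {"capacity": number, "healthy": bool}.
    Both field names are hypothetical, chosen only for this sketch.
    """
    # Keep only rooms that are healthy and have some capacity to offer.
    usable = {
        name: room["capacity"]
        for name, room in rooms.items()
        if room["healthy"] and room["capacity"] > 0
    }
    total_capacity = sum(usable.values())
    if total_capacity == 0:
        raise RuntimeError("no healthy server room available")
    # Each room receives a share equal to its fraction of usable capacity.
    return {
        name: total_qps * capacity / total_capacity
        for name, capacity in usable.items()
    }
```

For example, with rooms of capacity 600 and 400 both healthy and a third room marked unhealthy, a total of 1000 QPS is divided 600 / 400 and the unhealthy room receives nothing.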
Two, Classification of Linux operation and maintenance work
Operation and maintenance work spans many directions. As a business grows in scale, and the more mature an Internet company becomes, the more finely its operation and maintenance positions are divided. Many large Internet companies had only system operation and maintenance at first, and subdivided the work gradually as scale and service-quality requirements grew. In general, the work classification of the operation and maintenance team (see Figure 1-1) and the corresponding responsibilities are as follows.
Figure 1-1 Job Classification of Operation and Maintenance Team
2.1-Application Operation and Maintenance (SRE): responsible for changes to online services, service status monitoring, service disaster recovery and data backup, as well as routine troubleshooting and emergency response for services. Detailed duties: design review, service management, resource management, routine inspection, contingency plan management, and data backup.
2.2-System Operation and Maintenance (SYS): responsible for IDC, network, CDN and basic service construction (LVS, NTP, DNS); responsible for asset management, server selection, delivery and maintenance, with the following duties: IDC data center construction, network construction, LVS load balancing and SNAT construction, CDN planning and construction, server selection, delivery and maintenance, kernel selection and OS related maintenance work, asset management, and basic service construction.
2.3-Database Operation and Maintenance (DBA): responsible for data storage scheme design, database table design, index design and SQL optimization, as well as database changes, monitoring, backup, and high availability design. Detailed duties: design review, capacity planning, data backup and disaster recovery, database monitoring, database security, database high availability and performance optimization, and automation system construction. Operation and maintenance research and development covers the operation and maintenance platform, monitoring system, and automated deployment system.
2.4-Operation and Maintenance Security (SEC): responsible for security hardening of the network, systems, and business. It carries out routine security scanning, penetration testing, development of security tools and systems, and emergency response to security incidents. Detailed duties: building the security system, security training, risk assessment, security construction, security compliance, and emergency response.
Three, Software and skills used daily in Linux operation and maintenance
The platforms and tools that operation and maintenance engineers use include:
Web servers: apache, tomcat, nginx, lig...; system tools: ...stat, top, tcpdump, last
Operation and maintenance is built on technology, and uses technology to ensure that products deliver higher-quality service. The responsibilities of the work and its position in the business mean that operation and maintenance engineers need broad knowledge and deep technical skills:
Solid basic knowledge of computers, including computer system architecture, operating system, network technology, etc.;
Broad general knowledge: an understanding of operating systems, networks, security, storage, CDN, DB, etc., and of the principles behind them;
Programming skills: everything from small operation and maintenance tools to large-scale operation and maintenance systems/platforms requires good programming skills;
Data analysis skills: the ability to organize and analyze the system's operating data, identify problems from it, and find the direction in which to solve them;
Rich system knowledge, including system tools, typical system architectures, and the selection of common platforms;
The ability to use tools and platforms together effectively;
Four, The development process of Linux operation and maintenance work
Early operation and maintenance teams were small and mainly handled data center construction, basic network construction, server procurement, and server installation and delivery. Very little of the work involved changing, monitoring, or managing online services. At this stage the team played more of an infrastructure role: providing a simple, usable network and system environment was enough.
As business products mature, service-quality requirements rise. The operation and maintenance team then takes on some server monitoring, and also becomes responsible for layer 4/7 components such as LVS and Nginx that are independent of business logic. Service changes at this stage are mostly manual, machine by machine, or done with a few simple batch scripts. Monitoring focuses on server status and resource utilization, and monitoring of application status is rare; most monitoring relies on open source systems such as Nagios and Cacti.
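To make "monitoring server status and resource utilization" concrete, here is a minimal Python sketch using only the standard library. It is not how Nagios or Cacti work internally; the metric names, the threshold values, and the functions collect_metrics and find_alerts are assumptions chosen for illustration. Note that os.getloadavg is available on Unix-like systems only.

```python
import os
import shutil

# Hypothetical alert thresholds; sensible values depend on the service.
THRESHOLDS = {"load_per_cpu": 1.0, "disk_used_ratio": 0.9}


def collect_metrics(path="/"):
    """Collect two basic host metrics with the standard library only."""
    load1, _, _ = os.getloadavg()      # 1-minute load average (Unix only)
    cpus = os.cpu_count() or 1
    usage = shutil.disk_usage(path)    # named tuple: total, used, free (bytes)
    return {
        "load_per_cpu": load1 / cpus,
        "disk_used_ratio": usage.used / usage.total,
    }


def find_alerts(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )
```

A cron job could call find_alerts(collect_metrics()) every minute and send a notification whenever the returned list is not empty.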
As the scale and complexity of the business continue to increase, the operations team gradually splits into application operation and maintenance and system operation and maintenance. Application O&M starts to take over the online business and gradually takes on sorting out service monitoring, data backup, and service changes. As their understanding of the services deepens, application O&M engineers become able to make some simple optimizations. At the same time, to cope with the large number of daily service changes, they begin writing various operation and maintenance tools that make batch changes to specific services easy. As the business grows, infrastructure failures caused by insufficient capacity planning or weak risk tolerance also increase, forcing operation and maintenance personnel to put more effort into multi-data-center disaster recovery and contingency plan management.
Once the business reaches a certain scale, open source monitoring systems can no longer meet its needs in performance or functionality. With a large number of service changes and complex service relationships, the earlier approach of manual records and tool-based changes falls short in both efficiency and accuracy. Security incidents large and small also appear, forcing us to invest more energy in security defense. Gradually the operation and maintenance team takes shape around the five work categories mentioned above, each requiring specialized talent. At this point, system operation and maintenance focuses on building and maintaining infrastructure, providing a stable and efficient network environment, and delivering servers and other resources to application operation and maintenance engineers. Application operation and maintenance focuses on service running status and efficiency. Database O&M is a refinement of application O&M that concentrates on automation, performance optimization, and security defense in the database field. O&M R&D and O&M security provide platforms and tools that further improve the efficiency of O&M engineers and make business services run more stably, efficiently, and safely.
We divide the O&M development process into four stages, as shown in Figure 1-2.
Figure 1-2 O&M development process
Manual management stage: business traffic is low, servers are few, and system complexity is low. For daily management operations, engineers mostly log in to servers one by one and work by hand. Everyone works in their own way, and the necessary operating standards and process mechanisms are missing; business directory layouts, for example, vary widely.
Tools batch operation stage: as server counts and system complexity grow, fully manual operation can no longer keep up with the rapid development of the business. Operations personnel therefore gradually start using batch operation tools, and different scripts appear for different types of operations. However, each team has its own tools, which must be adjusted every time operational requirements change. This is mainly because environments and operations are not sufficiently standardized, which limits how much can be handled programmatically. Efficiency improves somewhat at this point, but bottlenecks soon reappear. The quality of operations does not improve much, and batch execution can even cause problems on a larger scale. We begin to establish a large number of process rules, such as a review mechanism, bringing a single server online and observing it for 10 minutes before continuing, and observing for at least 20 minutes after an upgrade completes. These rules still rely mainly on people to supervise and enforce them; in practice they are often not followed properly, and they end up reducing work efficiency instead.
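The "one server first, observe, then continue" rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: staged_rollout, deploy, is_healthy, and batch_size are hypothetical names, and the deploy and health-check functions are supplied by the caller rather than tied to any real tool. The 600-second default mirrors the 10-minute observation window from the text.

```python
import time


def staged_rollout(servers, deploy, is_healthy,
                   observe_seconds=600, batch_size=5, sleep=time.sleep):
    """Deploy to a single server first, observe, then continue in batches.

    Returns (deployed_servers, failed_server). failed_server is None when
    every batch stayed healthy through its observation window.
    """
    if not servers:
        return [], None
    # The first batch is a single server; the rest are fixed-size batches.
    batches = [servers[:1]] + [
        servers[i:i + batch_size] for i in range(1, len(servers), batch_size)
    ]
    deployed = []
    for batch in batches:
        for server in batch:
            deploy(server)
            deployed.append(server)
        sleep(observe_seconds)  # observation window before checking health
        for server in batch:
            if not is_healthy(server):
                # Stop immediately so the problem does not spread further.
                return deployed, server
    return deployed, None
```

Passing sleep as a parameter keeps the sketch testable: a test can pass a no-op function instead of actually waiting ten minutes.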
Platform management stage: at this stage the requirements for operation and maintenance efficiency and for the operational error rate are higher, so we decide to build an operation and maintenance platform that carries the standards and processes, freeing up manpower and improving quality. Service change actions are abstracted into unified standards for operation methods, service directory layout, service running mode, and so on. For example, a program's start-stop interface must include start, stop, reload, etc. The platform also constrains the operation process, such as the 10-minute observation after bringing one server online mentioned above: it enforces a pause checkpoint, and after the first server has been operated on, the operation and maintenance personnel must fill in the corresponding check items before the subsequent deployment actions can continue.
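The two ideas in this stage, a uniform start/stop/reload interface and a pause checkpoint that blocks deployment until check items are filled in, can be sketched in Python like this. The class names ServiceControl, Checkpoint, and DemoService are hypothetical and exist only for this example; the article does not describe how any particular platform implements them.

```python
from abc import ABC, abstractmethod


class ServiceControl(ABC):
    """Uniform control interface every managed service must implement."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def reload(self): ...


class Checkpoint:
    """Pause point: deployment may continue only when all items are confirmed."""

    def __init__(self, items):
        self._pending = set(items)

    def confirm(self, item):
        # Confirming an unknown or already confirmed item is a no-op.
        self._pending.discard(item)

    def passed(self):
        return not self._pending


class DemoService(ServiceControl):
    """Hypothetical service that only tracks its own state."""

    def __init__(self):
        self.state = "stopped"

    def start(self):
        self.state = "running"

    def stop(self):
        self.state = "stopped"

    def reload(self):
        # A reload leaves the service running; a stopped service cannot reload.
        if self.state != "running":
            raise RuntimeError("cannot reload a stopped service")
```

Because ServiceControl is an abstract base class, a service that forgets to implement one of the three methods cannot even be instantiated, which is one simple way for a platform to enforce the standard.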
System self-scheduling stage: with more services, more complex dependencies between services, and a proliferation of operation and maintenance platforms, simply turning batch operations into platform operations is no longer enough, and service changes need a higher level of abstraction. Each server is abstracted into a container, and the scheduling system deploys each service to a suitable server according to resource usage, automatically linking with the surrounding operation and maintenance systems such as monitoring, logging, and backup. Through the self-scheduling system, capacity scales dynamically with how the service is running, and common service failures are handled automatically. The work of operations staff also moves forward into the product design phase, where they help R&D staff modify the service so that it can plug into the self-scheduling system.
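Placing a service on a suitable server according to resource usage can be illustrated with a very small greedy scheduler in Python. This is a sketch under simplifying assumptions, not the algorithm of any real scheduling system: the function schedule and the fields cpu, mem, cpu_free, and mem_free are hypothetical, and only two resource types are considered.

```python
def schedule(service, servers):
    """Place a service on the fitting server with the most free CPU.

    service: {"cpu": number, "mem": number} resource request.
    servers: dict of server name -> {"cpu_free": number, "mem_free": number}.
    Reserves the resources on the chosen server and returns its name,
    or returns None when no server can fit the request.
    """
    candidates = [
        name
        for name, free in servers.items()
        if free["cpu_free"] >= service["cpu"] and free["mem_free"] >= service["mem"]
    ]
    if not candidates:
        return None
    # Greedy choice: prefer the server with the most free CPU, then memory.
    chosen = max(
        candidates,
        key=lambda n: (servers[n]["cpu_free"], servers[n]["mem_free"]),
    )
    servers[chosen]["cpu_free"] -= service["cpu"]
    servers[chosen]["mem_free"] -= service["mem"]
    return chosen
```

Choosing the server with the most free resources spreads load evenly; a scheduler that wanted to pack services tightly and free up whole machines would pick the fitting server with the least free resources instead.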
In the whole development process of operation and maintenance, we hope to automate all the work, reduce the duplication of human work, reduce the cost of knowledge transfer, so that our operation and maintenance delivery is more efficient and safer, so that the product operation is more stable. For the handling of faults, we also hope to change from after-the-fact processing to early detection, and from manual processing to automatic system disaster recovery.
Five, Cutting-edge skills Linux Ops must master in 2018
This is only the tip of the iceberg of the profound changes taking place in the world of technology. So the question arises: as a traditional Ops engineer, how should you transform yourself?
Here's a little advice: roughly, you need to learn these four parts:
Automation Ops (Ansible, Puppet, Saltstack, etc.)
DevOps (Docker, K8s, Jenkins, Jira, etc.)
Cloud service technologies (virtualization, OpenStack, AWS and AliCloud various product service architectures, etc.)
Python