# linux-insides


## Table of Contents

* Introduction
* Booting
  * From bootloader to kernel
  * First steps in the kernel setup code
  * Video mode initialization and transition to protected mode
  * Transition to 64-bit mode
  * Kernel decompression
  * Kernel load address randomization
* Initialization
  * First steps in the kernel
  * Early interrupts handler
  * Last preparations before the kernel entry point
  * Kernel entry point
  * Continue architecture-specific boot-time initializations
  * Architecture-specific initializations, again...
  * End of the architecture-specific initializations, almost...
  * Scheduler initialization
  * RCU initialization
  * End of initialization
* Interrupts
  * Introduction
  * Start to dive into interrupts
  * Interrupt handlers
  * Initialization of non-early interrupt gates
  * Implementation of some exception handlers
  * Handling Non-Maskable interrupts
  * Dive into external hardware interrupts
  * Initialization of external hardware interrupts structures
  * Softirq, Tasklets and Workqueues
  * Last part
* System calls
  * Introduction to system calls
  * How the Linux kernel handles a system call
  * vsyscall and vDSO
  * How the Linux kernel runs a program
  * Implementation of the open system call
  * Limits on resources in Linux
* Timers and time management
  * Introduction
  * Clocksource framework
  * The tick broadcast framework and dyntick
  * Introduction to timers
  * Clockevents framework
  * x86 related clock sources
  * Time related system calls
* Synchronization primitives
  * Introduction to spinlocks
  * Queued spinlocks
  * Semaphores
  * Mutex
  * Reader/Writer semaphores
  * SeqLock
  * RCU
  * Lockdep
* Memory management
  * Memblock
  * Fixmaps and ioremap
  * kmemcheck
* Cgroups
  * Introduction to Control Groups
* SMP
* Concepts
  * Per-CPU variables
  * Cpumasks
  * The initcall mechanism
  * Notification Chains
* Data Structures in the Linux Kernel
  * Doubly linked list
  * Radix tree
  * Bit arrays
* Theory
  * Paging
  * Elf64
  * Inline assembly
  * CPUID
  * MSR
* Initial ram disk
  * initrd
* Misc
  * Linux kernel development
  * How the kernel is compiled
  * Linkers
  * Program startup process in userspace
  * Write and Submit your first Linux kernel Patch
  * Data types in the kernel
* KernelStructures
  * IDT
* Useful links
* Contributors

## Introduction

**linux-insides** is a book-in-progress about the Linux kernel and its insides.

The goal is simple: to share my modest knowledge about the insides of the Linux kernel and to help people who are interested in Linux kernel internals and other low-level subject matter.

Feel free to go through the book. Start here.

**Questions/Suggestions**: Feel free to ask any questions or make suggestions by pinging me on twitter @0xAX, opening an issue, or just dropping me an email.

**Support**: If you like linux-insides, you can support me with:

### On other languages

* Brazilian Portuguese
* Chinese
* Russian
* Spanish
* Turkish

### Contributions

Feel free to create issues or pull requests if you have any problems. Please read CONTRIBUTING.md before pushing any changes.

### Author

@0xAX

### LICENSE

Licensed BY-NC-SA Creative Commons.

## Booting

### Kernel Boot Process

This chapter describes the Linux kernel boot process. Here you will see a series of posts which describe the full cycle of the kernel loading process:

* From the bootloader to kernel - describes all stages from turning on the computer to running the first instruction of the kernel.
* First steps in the kernel setup code - describes the first steps in the kernel setup code.
  You will see heap initialization and queries of different parameters like EDD, IST, etc.
* Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and the transition to protected mode.
* Transition to 64-bit mode - describes the preparation for the transition into 64-bit mode and the details of the transition.
* Kernel decompression - describes the preparation before kernel decompression and the details of direct decompression.
* Kernel load address randomization - describes the randomization of the Linux kernel load address.

## Kernel booting process. Part 1.

### From the bootloader to the kernel

If you have been reading my previous blog posts, then you can see that, for some time now, I have been getting involved with low-level programming. I have written some posts about assembly programming for x86_64 Linux and, at the same time, I have also started to dive into the Linux kernel source code.

I have a great interest in understanding how low-level things work: how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level, and many other things. So, I have decided to write yet another series of posts about the Linux kernel for the x86_64 architecture.

Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions or remarks, ping me on Twitter 0xAX, drop me an email, or just create an issue.
I appreciate it. All posts will also be accessible in the github repo and, if you find something wrong with my English or the post content, feel free to send a pull request.

Note that this isn't official documentation, just learning and sharing knowledge.

### Required knowledge

* Understanding C code
* Understanding assembly code (AT&T syntax)

Anyway, if you are just starting to learn such tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction, and now we can start to dive into the Linux kernel and low-level stuff.

I started writing this book at the time of the 3.18 Linux kernel, and many things might have changed since then. If there are changes, I will update the posts accordingly.

### The Magical Power Button, What happens next?

Although this is a series of posts about the Linux kernel, we will not be starting directly from the kernel code - at least not in this paragraph. As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the power supply device. After receiving the signal, the power supply provides the proper amount of electricity to the computer. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.

The 80386 CPU and later define the following predefined data in CPU registers after the computer resets:

```
IP          0xfff0
CS selector 0xf000
CS base     0xffff0000
```

The processor starts working in real mode. Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 CPU all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it can work with a megabyte of address space.
But it only has 16-bit registers, which have a maximum value of `2^16 - 1`, or `0xffff` (64 kilobytes), so only the range `0-0xFFFF` can be addressed directly.

Memory segmentation is used to make use of all of the address space available. All memory is divided into small, fixed-size segments of 65536 bytes (64 KB). Since we cannot address memory above 64 KB with 16-bit registers, an alternate method was devised.

An address consists of two parts: a segment selector, which has a base address, and an offset from this base address. In real mode, the associated base address of a segment selector is `Segment Selector * 16`. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset to it:

```
PhysicalAddress = Segment Selector * 16 + Offset
```

For example, if `CS:IP` is `0x2000:0x0010`, then the corresponding physical address will be:

```python
>>> hex((0x2000 << 4) + 0x10)
'0x20010'
```

At reset, however, the `CS` register has a special state: its selector is `0xf000` and its base is `0xffff0000`, so the starting address is formed by adding the base address to the value in the instruction pointer:

```python
>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'
```

We get `0xfffffff0`, which is 16 bytes below 4GB. This point is called the Reset vector. This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump (`jmp`) instruction that usually points to the BIOS entry point. For example, if we look in the coreboot source code (`src/cpu/x86/16bit/reset16.inc`), we will see:

```assembly
    .section ".reset", "ax", %progbits
    .code16
.globl    _start
_start:
    .byte  0xe9
    .int   _start16bit - ( . + 2 )
    ...
```

Here we can see the `jmp` instruction opcode, which is `0xe9`, and its destination address at `_start16bit - ( . + 2)`. We can also see that the `reset` section is 16 bytes and that it is compiled to start from the address `0xfffffff0` (`src/cpu/x86/16bit/reset16.ld`):

```
SECTIONS {
    /* Trigger an error if I have an unuseable start address */
    _bogus = ASSERT(_start16bit >= 0xffff0000, "_start16bit too low. Please report.");
    _ROMTOP = 0xfffffff0;
    . = _ROMTOP;
    .reset . : {
        *(.reset);
        . = 15;
        BYTE(0x00);
    }
}
```

Now the BIOS starts; after initializing and checking the hardware, the BIOS needs to find a bootable device.
A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector are `0x55` and `0xaa`, which designates to the BIOS that this device is bootable.

For example:

```assembly
;
; Note: this example is written in Intel Assembly syntax
;
[BITS 16]

boot:
    mov al, '!'
    mov ah, 0x0e
    mov bh, 0x00
    mov bl, 0x07

    int 0x10
    jmp $

times 510-($-$$) db 0

db 0x55
db 0xaa
```

Build and run this with:

```
nasm -f bin boot.nasm && qemu-system-x86_64 boot
```

This will instruct QEMU to use the `boot` binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00` and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.

You will see:

*(QEMU window showing the `!` symbol)*

In this example, we can see that the code will be executed in 16-bit real mode and will start at `0x7c00` in memory. After starting, it calls the `0x10` interrupt, which just prints the `!` symbol; it fills the remaining 510 bytes with zeros and finishes with the two magic bytes `0x55` and `0xaa`.

You can see a binary dump of this using the `objdump` utility:

```
nasm -f bin boot.nasm
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
```

A real-world boot sector has code for continuing the boot process and a partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, the BIOS hands over control to the bootloader.

NOTE: As explained above, the CPU is in real mode; in real mode, calculating the physical address in memory is done as follows:

```
PhysicalAddress = Segment Selector * 16 + Offset
```

just as explained above.
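The formula is easy to experiment with; here is a tiny Python helper in the spirit of the interpreter snippets used in this chapter (the `phys` name is mine, not something from the kernel):

```python
# Real-mode address translation:
#     PhysicalAddress = Segment Selector * 16 + Offset
# Multiplying by 16 is the same as shifting left by 4 bits.
def phys(segment: int, offset: int) -> int:
    return (segment << 4) + offset

assert phys(0x2000, 0x0010) == 0x20010  # the CS:IP example above
assert phys(0x07c0, 0x0000) == 0x7c00   # classic boot sector segment
```

Note that many different `segment:offset` pairs map to the same physical address, e.g. `0x07c0:0x0000` and `0x0000:0x7c00` both give `0x7c00`.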
We have only 16-bit general purpose registers; the maximum value of a 16-bit register is `0xffff`, so if we take the largest values, the result will be:

```python
>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'
```

where `0x10ffef` is equal to `1MB + 64KB - 16b`. An 8086 processor (which was the first processor with real mode), in contrast, has a 20-bit address line. Since `2^20 = 1048576` is 1MB, this means that the actual available memory is 1MB.

In general, real mode's memory map is as follows:

```
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7777 - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
```

In the beginning of this post, I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is in the coreboot documentation:

```
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
```

At the start of execution, the BIOS is not in RAM, but in ROM.

### Bootloader

There are a number of bootloaders that can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.

Continuing from before, now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector, in the unused space before the first partition.
The diskboot.img code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes the `grub_main` function.

The `grub_main` function initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, the `grub_main` function moves grub to normal mode. The `grub_normal_execute` function (from the `grub-core/normal/main.c` source code file) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, the `grub_menu_execute_entry` function runs, executing the grub `boot` command and booting the selected operating system.

As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the `0x01f1` offset from the kernel setup code. You may look at the boot linker script to confirm the value of this offset. The kernel header arch/x86/boot/header.S starts from:

```assembly
    .globl hdr
hdr:
setup_sects: .byte 0
root_flags:  .word ROOT_RDONLY
syssize:     .long 0
ram_size:    .word 0
vid_mode:    .word SVGA_MODE
root_dev:    .word 0
boot_flag:   .word 0xAA55
```

The bootloader must fill this and the rest of the headers (which are only marked as being type `write` in the Linux boot protocol, such as in this example) with values which it has either received from the command line or calculated during boot.
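A bootloader can locate these fields by their fixed offsets inside the kernel image. Here is a minimal sketch in Python, using offsets from the x86 boot protocol (`0x1f1` for `setup_sects`, `0x1fe` for `boot_flag`, `0x202` for the `HdrS` magic); the parser itself is mine, not kernel or GRUB code:

```python
import struct

def parse_setup_header(image: bytes) -> int:
    """Check the boot_flag and HdrS magic, return setup_sects."""
    setup_sects = image[0x1f1]                       # setup code size in 512-byte sectors
    boot_flag, = struct.unpack_from("<H", image, 0x1fe)  # must be 0xAA55
    magic = image[0x202:0x206]                       # must be b"HdrS"
    if boot_flag != 0xAA55 or magic != b"HdrS":
        raise ValueError("not a bzImage-style kernel")
    return setup_sects

# A fake 1 KiB "image" carrying just the fields we check:
img = bytearray(1024)
img[0x1f1] = 4                      # four setup sectors
img[0x1fe:0x200] = b"\x55\xaa"      # boot_flag, little-endian 0xAA55
img[0x202:0x206] = b"HdrS"
assert parse_setup_header(bytes(img)) == 4
```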
(We will not go over full descriptions and explanations for all fields of the kernel setup header now, but we shall do so when we discuss how the kernel uses them; you can find a description of all fields in the boot protocol.)

As we can see in the kernel boot protocol, the memory will be mapped as follows after loading the kernel:

```
         | Protected-mode kernel  |
100000   +------------------------+
         | I/O memory hole        |
0A0000   +------------------------+
         | Reserved for BIOS      | Leave as much as possible unused
         ~                        ~
         | Command line           | (Can also be below the X+10000 mark)
X+10000  +------------------------+
         | Stack/heap             | For use by the kernel real-mode code.
X+08000  +------------------------+
         | Kernel setup           | The kernel real-mode code.
         | Kernel boot sector     | The kernel legacy boot sector.
X        +------------------------+
         | Boot loader            |
001000   +------------------------+
         | Reserved for MBR/BIOS  |
000800   +------------------------+
         | Typically used by MBR  |
000600   +------------------------+
         | BIOS use only          |
000000   +------------------------+
```

where `X` is the address at which the kernel boot sector is loaded. When GRUB 2 transfers control to the kernel setup code, it computes the setup segment from the kernel's real-mode load address (shifted right by 4) and loads the segment registers like this:

```C
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.cs = segment + 0x20;
```

This means that segment registers will have the following values after kernel setup starts:

```
gs = fs = es = ds = ss = 0x10000
cs = 0x10200
```

In my case, the kernel is loaded at the `0x10000` address.

After the jump to `start_of_setup`, the kernel needs to do the following:

* Make sure that all segment register values are equal
* Set up a correct stack, if needed
* Set up bss
* Jump to the C code in main.c

Let's look at the implementation.

### Aligning the Segment Registers

First of all, the kernel ensures that the `ds` and `es` segment registers point to the same address. Next, it clears the direction flag using the `cld` instruction:

```assembly
    movw    %ds, %ax
    movw    %ax, %es
    cld
```

As I wrote earlier, grub2 loads the kernel setup code at address `0x10000` by default, and `cs` at `0x10200`, because execution doesn't start from the start of the file, but from the jump here:

```assembly
_start:
    .byte  0xeb
    .byte  start_of_setup-1f
```

which is at a 512 byte offset from 4d 5a.
We also need to align `cs` from `0x10200` to `0x10000`, as well as all other segment registers. After that, we set up the stack:

```assembly
    pushw   %ds
    pushw   $6f
    lretw
```

which pushes the value of `ds` to the stack, followed by the address of the `6` label, and executes the `lretw` instruction. When the `lretw` instruction is called, it loads the address of label `6` into the instruction pointer register and loads `cs` with the value of `ds`. Afterward, `ds` and `cs` will have the same values.

### Stack Setup

Almost all of the setup code is in preparation for the C language environment in real mode. The next step is checking the `ss` register value and making a correct stack if `ss` is wrong:

```assembly
    movw    %ss, %dx
    cmpw    %ax, %dx
    movw    %sp, %dx
    je      2f
```

This can lead to 3 different scenarios:

* `ss` has a valid value `0x1000` (as do all the other segment registers besides `cs`)
* `ss` is invalid and the `CAN_USE_HEAP` flag is set (see below)
* `ss` is invalid and the `CAN_USE_HEAP` flag is not set (see below)

Let's look at all three of these scenarios in turn:

* `ss` has a correct address (`0x1000`). In this case, we go to label `2`:

```assembly
2:  andw    $~3, %dx
    jnz     3f
    movw    $0xfffc, %dx
3:  movw    %ax, %ss
    movzwl  %dx, %esp
    sti
```

Here we set the alignment of `dx` (which contains the value of `sp` as given by the bootloader) to 4 bytes and check whether or not it is zero. If it is zero, we put `0xfffc` (the last 4-byte-aligned address in a 64 KB segment) in `dx`. If it is not zero, we continue to use the value of `sp` given by the bootloader (0xf7f4 in my case). After this, we put the value of `ax`, which stores the correct segment address of `0x1000`, into `ss`. We now have a correct stack.

* In the second scenario, `ss` != `ds`. First, we put the value of `_end` (the address of the end of the setup code) into `dx` and check the `loadflags` header field using the `testb` instruction to see whether we can use the heap.
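The `sp` handling from the first scenario above can be modeled in a few lines of Python (the `setup_sp` name is mine; the constants mirror the listing):

```python
# Model of the dx handling in the listing above: clear the two low
# bits for 4-byte alignment; if the result is zero, fall back to
# 0xfffc, the highest 4-byte-aligned offset in the 64 KB segment.
def setup_sp(dx: int) -> int:
    dx &= ~3              # andw $~3, %dx
    if dx == 0:
        dx = 0xfffc       # movw $0xfffc, %dx
    return dx

assert setup_sp(0xf7f4) == 0xf7f4   # already aligned, kept as-is
assert setup_sp(0xf7f6) == 0xf7f4   # rounded down to 4 bytes
assert setup_sp(0x0000) == 0xfffc   # zero: use the top of the segment
```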
`loadflags` is a bitmask header field which is defined as:

```C
#define LOADED_HIGH     (1<<0)
#define QUIET_FLAG      (1<<5)
#define KEEP_SEGMENTS   (1<<6)
#define CAN_USE_HEAP    (1<<7)
```

If the `CAN_USE_HEAP` bit is set, the stack is placed at the end of the heap; otherwise a minimal stack is made at the end of the setup code. Once the stack is ready, the setup code zeroes the bss section and jumps to the C code in main.c.

Later, in the C code, BIOS calls go through a `biosregs` structure. The `initregs` function prepares it, filling the segment register fields with the current values:

```C
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();
```

Let's look at the implementation of memset:

```assembly
GLOBAL(memset)
    pushw   %di
    movw    %ax, %di
    movzbl  %dl, %eax
    imull   $0x01010101,%eax
    pushw   %cx
    shrw    $2, %cx
    rep; stosl
    popw    %cx
    andw    $3, %cx
    rep; stosb
    popw    %di
    retl
ENDPROC(memset)
```

As you can read above, it uses the same calling conventions as the `memcpy` function, which means that the function gets its parameters from the `ax`, `dx` and `cx` registers.

The implementation of `memset` is similar to that of memcpy. It saves the value of the `di` register on the stack and puts the value of `ax`, which stores the address of the `biosregs` structure, into `di`. Next is the `movzbl` instruction, which copies the value of `dl` into the low byte of the `eax` register; the remaining high bytes of `eax` will be filled with zeros.

The next instruction multiplies `eax` by `0x01010101`. It needs to because `memset` will copy 4 bytes at the same time. For example, if we need to fill a structure whose size is 4 bytes with the value `0x7`, `eax` will contain `0x00000007`. So if we multiply `eax` by `0x01010101`, we will get `0x07070707`, and now we can copy these 4 bytes into the structure. `memset` uses the `rep; stosl` instruction to copy `eax` into `es:di`. The rest of the `memset` function does almost the same thing as `memcpy`.

After the `biosregs` structure is filled with `memset`, `bios_putchar` calls the `0x10` interrupt, which prints a character.
Afterwards it checks if the serial port was initialized and, if it was, writes the character there too, using `serial_putchar` and `inb/outb` instructions.

### Heap initialization

After the stack and bss section have been prepared in header.S (see the previous part), the kernel needs to initialize the heap with the `init_heap` function.

First of all, `init_heap` checks the `CAN_USE_HEAP` flag from the `loadflags` field in the kernel setup header and calculates the end of the stack if this flag was set:

```C
    char *stack_end;

    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));
```

or in other words `stack_end = esp - STACK_SIZE`.

Then there is the `heap_end` calculation:

```C
        heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
```

which means `heap_end` is `heap_end_ptr` + `0x200` (512). The last check is whether `heap_end` is greater than `stack_end`; if it is, then `stack_end` is assigned to `heap_end` to make them equal.

Now the heap is initialized and we can use it via the `GET_HEAP` method. We will see what it is used for, how to use it and how it is implemented in the next posts.

### CPU validation

The next step, as we can see, is cpu validation through the `validate_cpu` function from arch/x86/boot/cpu.c.

It calls the `check_cpu` function, passing the cpu level and the required cpu level to it, and checks that the kernel launches on the right cpu level:

```C
check_cpu(&cpu_level, &req_level, &err_flags);
if (cpu_level < req_level) {
    ...
    return -1;
}
```

`check_cpu` checks the CPU's flags, the presence of long mode in the case of x86_64 (64-bit) CPUs, checks the processor's vendor and makes preparations for certain vendors, like turning off SSE+SSE2 for AMD if they are missing, etc.

### Memory detection

The next step is memory detection through the `detect_memory` function. `detect_memory` basically provides a map of available RAM to the CPU. It uses different programming interfaces for memory detection, like `0xe820`, `0xe801` and `0x88`.
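Before looking at the real code, the continuation-driven style of the `0xe820` interface can be sketched with a toy model; everything below (the fake table, the function names) is invented for illustration:

```python
# Toy model of a continuation-driven firmware query like 0xe820:
# each "BIOS call" returns one memory-map entry plus a continuation
# value in ebx; ebx == 0 after the last entry. The table below is
# made up for the sketch, not real firmware data.
FAKE_E820 = [
    (0x0000000000000000, 0x0009fc00, "usable"),
    (0x000000000009fc00, 0x00000400, "reserved"),
    (0x0000000000100000, 0x3fee0000, "usable"),
]

def bios_e820(ebx: int):
    entry = FAKE_E820[ebx]
    next_ebx = ebx + 1 if ebx + 1 < len(FAKE_E820) else 0
    return entry, next_ebx          # oreg.ebx, fed back into ireg.ebx

entries, ebx = [], 0
while True:
    entry, ebx = bios_e820(ebx)
    entries.append(entry)
    if ebx == 0:                    # no continuation: table finished
        break

assert len(entries) == 3
assert entries[0][2] == "usable"
```

The real loop in the kernel works the same way: call the interrupt, copy the returned `ebx` back into the input registers, repeat.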
We will only see the implementation of the `0xE820` interface here.

Let's look at the implementation of the `detect_memory_e820` function from the arch/x86/boot/memory.c source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills the registers with special values for the `0xe820` call:

```C
    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;
```

* `ax` contains the number of the function (0xe820 in our case)
* `cx` contains the size of the buffer which will contain data about the memory
* `edx` must contain the `SMAP` magic number
* `es:di` must contain the address of the buffer which will contain memory data
* `ebx` has to be zero.

Next is a loop where data about the memory will be collected. It starts with a call to the `0x15` BIOS interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call, `ebx` must contain the value returned previously:

```C
    intcall(0x15, &ireg, &oreg);
    ireg.ebx = oreg.ebx;
```

Ultimately, this function collects data from the address allocation table and writes this data into the `e820_entry` array:

* start of memory segment
* size of memory segment
* type of memory segment (whether the particular segment is usable or reserved)

You can see the result of this in the `dmesg` output, something like:

```
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
```

Next, we may see a call to the `set_bios_mode` function.
As we may see, this function is implemented only for the x86_64 mode:

```C
static void set_bios_mode(void)
{
#ifdef CONFIG_X86_64
    struct biosregs ireg;

    initregs(&ireg);
    ireg.ax = 0xec00;
    ireg.bx = 2;
    intcall(0x15, &ireg, NULL);
#endif
}
```

The `set_bios_mode` function executes the `0x15` BIOS interrupt to tell the BIOS that long mode (if `bx == 2`) will be used.

### Keyboard initialization

The next step is the initialization of the keyboard with a call to the `keyboard_init()` function. At first, `keyboard_init` initializes registers using the `initregs` function. It then calls the `0x16` interrupt to query the status of the keyboard:

```C
    initregs(&ireg);
    ireg.ah = 0x02;     /* Get keyboard status */
    intcall(0x16, &ireg, &oreg);
    boot_params.kbd_status = oreg.al;
```

After this it calls `0x16` again to set the repeat rate and delay:

```C
    ireg.ax = 0x0305;   /* Set keyboard repeat rate */
    intcall(0x16, &ireg, NULL);
```

### Querying

The next couple of steps are queries for different parameters. We will not dive into details about these queries, but we will get back to them in later parts. Let's take a short look at these functions:

The first step is getting Intel SpeedStep information by calling the `query_ist` function. It checks the CPU level and, if it is correct, calls `0x15` to get the info and saves the result to `boot_params`.

Next, the query_apm_bios function gets Advanced Power Management information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interrupt too, but with `ah` = `0x53` to check `APM` installation. After `0x15` finishes executing, the `query_apm_bios` function checks the `PM` signature (it must be `0x504d`), the carry flag (it must be 0 if `APM` is supported) and the value of the `cx` register (if it's 0x02, the protected mode interface is supported).

Next, it calls `0x15` again, but with `ax = 0x5304` to disconnect the `APM` interface, and then connects the 32-bit protected mode interface.
In the end, it fills `boot_params.apm_bios_info` with values obtained from the BIOS.

Note that `query_apm_bios` will be executed only if the `CONFIG_APM` or `CONFIG_APM_MODULE` compile time flag was set in the configuration file:

```C
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif
```

The last is the `query_edd` function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look at how `query_edd` is implemented.

First of all, it reads the edd option from the kernel's command line, and if it was set to `off`, then `query_edd` just returns.

If EDD is enabled, `query_edd` goes over the BIOS-supported hard disks and queries EDD information in the following loop:

```C
for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
    if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
        memcpy(edp, &ei, sizeof ei);
        edp++;
        boot_params.eddbuf_entries++;
    }
    ...
}
```

where `0x80` is the first hard drive and the value of the `EDD_MBR_SIG_MAX` macro is 16. It collects data into an array of edd_info structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and, if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where the EDD information will be stored.

### Conclusion

This is the end of the second part about the insides of the Linux kernel. In the next part, we will see video mode setting and the rest of the preparations before the transition to protected mode, and the transition into it.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience.
If you find any mistakes, please send me a PR to linux-insides.

### Links

* Protected mode
* Protected mode
* Long mode
* Nice explanation of CPU Modes with code
* How to Use Expand Down Segments on Intel 386 and Later CPUs
* earlyprintk documentation
* Kernel Parameters
* Serial console
* Intel SpeedStep
* APM
* EDD specification
* TLDP documentation for Linux Boot Process (old)
* Previous Part

## Kernel booting process. Part 3.

### Video mode initialization and transition to protected mode

This is the third part of the `Kernel booting process` series. In the previous part, we stopped right before the call to the `set_video` routine from main.c. In this part, we will look at:

* video mode initialization in the kernel setup code,
* the preparations made before switching into protected mode,
* the transition to protected mode

NOTE: If you don't know anything about protected mode, you can find some information about it in the previous part. Also, there are a couple of links which can help you.

As I wrote above, we will start from the `set_video` function which is defined in the arch/x86/boot/video.c source code file. We can see that it starts by first getting the video mode from the `boot_params.hdr` structure:

```C
u16 mode = boot_params.hdr.vid_mode;
```

which we filled in the `copy_boot_params` function (you can read about it in the previous post). `vid_mode` is an obligatory field which is filled by the bootloader.
You can find information about it in the kernel boot protocol:

```
Offset/Size:    0x1fa/2
Protocol:       ALL
Name:           vid_mode
Meaning:        Video mode control
```

As we can read from the linux kernel boot protocol:

```
vga=<mode>
    <mode> here is either an integer (in C notation, either
    decimal, octal, or hexadecimal) or one of the strings
    "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
    (meaning 0xFFFD). This value should be entered into the
    vid_mode field, as it is used by the kernel before the command
    line is parsed.
```

So we can add the `vga` option to the grub (or another bootloader's) configuration file and it will pass this option to the kernel command line. This option can have different values, as mentioned in the description. For example, it can be the integer number `0xFFFD` or `ask`. If you pass `ask` to `vga`, you will see a menu which asks you to select a video mode. We will look at its implementation, but before diving into the implementation we have to look at some other things.

### Kernel data types

Earlier we saw definitions of different data types like `u16` etc. in the kernel setup code. Let's look at a couple of data types provided by the kernel:

| Type | char | short | int | long | u8 | u16 | u32 | u64 |
|------|------|-------|-----|------|----|-----|-----|-----|
| Size |  1   |   2   |  4  |  8   |  1 |  2  |  4  |  8  |

If you read the source code of the kernel, you'll see these very often, and so it will be good to remember them.

### Heap API

After we get `vid_mode` from `boot_params.hdr` in the `set_video` function, we can see the call to the `RESET_HEAP` function. `RESET_HEAP` is a macro which is defined in boot.h. If you have read the second part, you will remember that we initialized the heap with the `init_heap` function. We have a couple of utility functions for managing the heap which are defined in boot.h.
They are:

```C
#define RESET_HEAP() ((void *)( HEAP = _end ))
```

As we saw just above, it resets the heap by setting the `HEAP` variable to `_end`, where `_end` is just `extern char _end[];`.

Next is the `GET_HEAP` macro:

```C
#define GET_HEAP(type, n) \
    ((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
```

for heap allocation. It calls the internal function `__get_heap` with 3 parameters:

* the size of the datatype to be allocated for
* `__alignof__(type)` specifies how variables of this type are to be aligned
* `n` specifies how many items to allocate

The implementation of `__get_heap` is:

```C
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
    char *tmp;

    HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
    tmp = HEAP;
    HEAP += s*n;
    return tmp;
}
```

and we will further see its usage, something like:

```C
saved.data = GET_HEAP(u16, saved.x * saved.y);
```

Let's try to understand how `__get_heap` works. We can see here that `HEAP` (which is equal to `_end` after `RESET_HEAP()`) is assigned the address of the aligned memory according to the `a` parameter. After this we save the memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of the allocated block and return `tmp`, which is the start address of the allocated memory.

And the last function is:

```C
static inline bool heap_free(size_t n)
{
    return (int)(heap_end - HEAP) >= (int)n;
}
```

which subtracts the value of the `HEAP` pointer from `heap_end` (we calculated it in the previous part) and returns 1 if there is enough memory available for `n`.

That's all. Now we have a simple API for the heap and can set up the video mode.

### Set up video mode

Now we can move directly to video mode initialization. We stopped at the `RESET_HEAP()` call in the `set_video` function. Next is the call to `store_mode_params`, which stores video mode parameters in the `boot_params.screen_info` structure, which is defined in include/uapi/linux/screen_info.h.

If we look at the `store_mode_params` function, we can see that it starts with a call to the `store_cursor_position` function.
As you can understand from the function name, it gets information about the cursor and stores it.

First of all, `store_cursor_position` initializes two variables which have type `biosregs` with `AH = 0x3`, and calls the `0x10` BIOS interruption. After the interruption is successfully executed, it returns row and column in the `DL` and `DH` registers. Row and column will be stored in the `orig_x` and `orig_y` fields of the `boot_params.screen_info` structure.

After `store_cursor_position` is executed, the `store_video_mode` function will be called. It just gets the current video mode and stores it in `boot_params.screen_info.orig_video_mode`.

After this, `store_mode_params` checks the current video mode and sets the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are for video memory:

```
0xB000:0x0000 	32 Kb 	Monochrome Text Video Memory
0xB800:0x0000 	32 Kb 	Color Text Video Memory
```

So we set the `video_segment` variable to `0xb000` if the current video mode is MDA, HGC, or VGA in monochrome mode, and to `0xb800` if the current video mode is in color mode. After setting up the address of the video segment, the font size needs to be stored in `boot_params.screen_info.orig_video_points` with:

```C
set_fs(0);
font_size = rdfs16(0x485);
boot_params.screen_info.orig_video_points = font_size;
```

First of all, we put 0 in the `FS` register with the `set_fs` function. We already saw functions like `set_fs` in the previous part. They are all defined in boot.h. Next, we read the value which is located at address `0x485` (this memory location is used to get the font size) and save the font size in `boot_params.screen_info.orig_video_points`.

```C
x = rdfs16(0x44a);
y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;
```

Next, we get the number of columns from address `0x44a` and the number of rows from address `0x484` and store them in `boot_params.screen_info.orig_video_cols` and `boot_params.screen_info.orig_video_lines`.
After this, the execution of `store_mode_params` is finished.

Next we can see the `save_screen` function which just saves the contents of the screen to the heap. This function collects all the data which we got in the previous functions (like the rows and columns, and so on) and stores it in the `saved_screen` structure, which is defined as:

```C
static struct saved_screen {
	int x, y;
	int curx, cury;
	u16 *data;
} saved;
```

It then checks whether the heap has free space for it with:

```C
if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
	return;
```

and allocates space in the heap if it is enough and stores `saved_screen` in it.

The next call is `probe_cards(0)` from arch/x86/boot/video-mode.c. It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting part, we can see the loop:

```C
for (card = video_cards; card < video_cards_end; card++) {
	/* collecting the number of modes here */
}
```

Every video mode defines a `card_info` structure with values filled in depending on the video mode (for example, for `vga` it is the `video_vga` structure with its `set_mode` function). For the `card_info` structure of `vga`, `video_vga.set_mode` is `vga_set_mode`, which checks the vga mode and calls the respective function:

```C
static int vga_set_mode(struct mode_info *mode)
{
	vga_set_basic_mode();

	force_x = mode->x;
	force_y = mode->y;

	switch (mode->mode) {
	case VIDEO_80x25:
		break;
	case VIDEO_8POINT:
		vga_set_8font();
		break;
	case VIDEO_80x43:
		vga_set_80x43();
		break;
	case VIDEO_80x28:
		vga_set_14font();
		break;
	case VIDEO_80x30:
		vga_set_80x30();
		break;
	case VIDEO_80x34:
		vga_set_80x34();
		break;
	case VIDEO_80x60:
		vga_set_80x60();
		break;
	}

	return 0;
}
```

Every function which sets up a video mode just calls the `0x10` BIOS interrupt with a certain value in the `AH` register.

After we have set the video mode, we pass it to `boot_params.hdr.vid_mode`.

Next, `vesa_store_edid` is called. This function simply stores the EDID (Extended Display Identification Data) information for kernel use.
After this, `store_mode_params` is called again. Lastly, if `do_restore` is set, the screen is restored to an earlier state.

Having done this, the video mode setup is complete and now we can switch to protected mode.

## Last preparation before transition into protected mode

We can see the last function call - `go_to_protected_mode` - in main.c. As the comment says: `Do the last things and invoke protected mode`, so let's see what these last things are and switch into protected mode.

The `go_to_protected_mode` function is defined in arch/x86/boot/pm.c. It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what it does and how it works.

First is the call to the `realmode_switch_hook` function in `go_to_protected_mode`. This function invokes the real mode switch hook if it is present and disables NMI. Hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the boot protocol (see ADVANCED BOOT LOADER HOOKS).

The `realmode_switch` hook presents a pointer to the 16-bit real mode far subroutine which disables non-maskable interrupts. After the `realmode_switch` hook (it isn't present for me) is checked, Non-Maskable Interrupts (NMI) are disabled:

```C
asm volatile("cli");
outb(0x80, 0x70);	/* Disable NMI */
io_delay();
```

At first, there is an inline assembly statement with a `cli` instruction which clears the interrupt flag (`IF`). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).

An interrupt is a signal to the CPU which is emitted by hardware or software. After getting such a signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission.
They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now, but we will be discussing them in the coming posts.

Let's get back to the code. We can see in the second line that we are writing the byte `0x80` (the disabled bit) to `0x70` (the CMOS Address register). After that, a call to the `io_delay` function occurs. `io_delay` causes a small delay and looks like:

```C
static inline void io_delay(void)
{
	const u16 DELAY_PORT = 0x80;
	asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}
```

Outputting any byte to the port `0x80` should delay exactly 1 microsecond. So we can write any value (the value from `AL` in our case) to the `0x80` port. After this delay, the `realmode_switch_hook` function has finished execution and we can move to the next function.

The next function is `enable_a20`, which enables the A20 line. This function is defined in arch/x86/boot/a20.c and it tries to enable the A20 gate with different methods. The first is the `a20_test_short` function which checks if A20 is already enabled or not with the `a20_test` function:

```C
static int a20_test(int loops)
{
	int ok = 0;
	int saved, ctr;

	set_fs(0x0000);
	set_gs(0xffff);

	saved = ctr = rdfs32(A20_TEST_ADDR);

	while (loops--) {
		wrfs32(++ctr, A20_TEST_ADDR);
		io_delay();	/* Serialize and make delay constant */
		ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr;
		if (ok)
			break;
	}

	wrfs32(saved, A20_TEST_ADDR);
	return ok;
}
```

First of all, we put `0x0000` in the `FS` register and `0xffff` in the `GS` register. Next, we read the value at the address `A20_TEST_ADDR` (it is `0x200`) and put this value into the variables `saved` and `ctr`.
Next, we write an updated `ctr` value into `fs:A20_TEST_ADDR` (or `fs:0x200`) with the `wrfs32` function, then delay for 1 ms, and then read the value at the address `A20_TEST_ADDR+0x10` through the `GS` register. In the case when the `a20` line is disabled, the addresses will be overlapped; in the other case, if the result is not zero, the `a20` line is already enabled.

If A20 is disabled, we try to enable it with a different method which you can find in a20.c. For example, it can be done with a call to the `0x15` BIOS interrupt with `AX=0x2401` (the boot protocol-defined BIOS function for activating the A20 gate).

If the `enable_a20` function finished with a failure, we print an error message and call the function `die`. You can remember it from the first source code file where we started - arch/x86/boot/header.S:

```assembly
die:
	hlt
	jmp	die
	.size	die, .-die
```

After the A20 gate is successfully enabled, the `reset_coprocessor` function is called:

```C
outb(0, 0xf0);
outb(0, 0xf1);
```

This function clears the Math Coprocessor by writing `0` to `0xf0` and then resets it by writing `0` to `0xf1`.

After this, the `mask_all_interrupts` function is called:

```C
outb(0xff, 0xa1);       /* Mask all interrupts on the secondary PIC */
outb(0xfb, 0x21);       /* Mask all but cascade on the primary PIC */
```

This masks all interrupts on the secondary PIC (Programmable Interrupt Controller) and all interrupts except IRQ2 on the primary PIC (IRQ2 is the cascade line that connects the secondary PIC).

And after all of these preparations, we can see the actual transition into protected mode.

## Set up the Interrupt Descriptor Table

Now we set up the Interrupt Descriptor Table (IDT) in the `setup_idt` function:

```C
static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}
```

which sets up the Interrupt Descriptor Table (which describes interrupt handlers, etc.).
For now, the IDT is not installed (we will see it later); here we just load the IDT with the `lidtl` instruction. `null_idt` contains the address and size of the IDT, but for now they are just zero. `null_idt` is a `gdt_ptr` structure, defined as:

```C
struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));
```

where we can see the 16-bit length (`len`) of the IDT and the 32-bit pointer to it (more details about the IDT and interruptions will be seen in the next posts). `__attribute__((packed))` means that the size of `gdt_ptr` is the minimum required size. So the size of `gdt_ptr` will be 6 bytes here, or 48 bits. (Next we will load the pointer to the `gdt_ptr` into the `GDTR` register and you might remember from the previous post that it is 48 bits in size).

## Set up the Global Descriptor Table

Next is the setup of the Global Descriptor Table (GDT). We can see the `setup_gdt` function which sets up the GDT (you can read about it in the post Kernel booting process. Part 2.). There is a definition of the `boot_gdt` array in this function, which contains the definition of the three segments:

```C
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
	[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
	[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
```

for code, data and TSS (Task State Segment). We will not use the task state segment for now; it was added there to make Intel VT happy, as we can see in the comment line (if you're interested you can find the commit which describes it - here). Let's look at `boot_gdt`. First of all, note that it has the `__attribute__((aligned(16)))` attribute.
It means that this structure will be aligned by 16 bytes. Let's look at a simple example:

```C
#include <stdio.h>

struct aligned {
	int a;
} __attribute__((aligned(16)));

struct nonaligned {
	int b;
};

int main(void)
{
	struct aligned    a;
	struct nonaligned na;

	printf("Not aligned - %zu \n", sizeof(na));
	printf("Aligned - %zu \n", sizeof(a));

	return 0;
}
```

Technically a structure which contains one `int` field must be 4 bytes in size, but an `aligned` structure will need 16 bytes to store in memory:

```
$ gcc test.c -o test && ./test
Not aligned - 4
Aligned - 16
```

`GDT_ENTRY_BOOT_CS` has index 2 here, `GDT_ENTRY_BOOT_DS` is `GDT_ENTRY_BOOT_CS + 1` and so on. It starts from 2 because the first is a mandatory null descriptor (index 0) and the second is not used (index 1).

`GDT_ENTRY` is a macro which takes flags, base and limit and builds a GDT entry. For example, let's look at the code segment entry. `GDT_ENTRY` takes the following values:

* base - 0
* limit - 0xfffff
* flags - 0xc09b

What does this mean? The segment's base address is 0, and the limit (size of the segment) is `0xfffff` (since the granularity bit is set, the limit is counted in 4 KB pages, so the segment covers the full 4 GB). Let's look at the flags. The flags value is `0xc09b` and it will be:

```
1100 0000 1001 1011
```

in binary. Let's try to understand what every bit means.
We will go through all bits from left to right:

* 1 - (G) granularity bit
* 1 - (D) if 0 - 16-bit segment; if 1 - 32-bit segment
* 0 - (L) executed in 64-bit mode if 1
* 0 - (AVL) available for use by system software
* 0000 - 4 bits, length bits 19:16 of the limit in the descriptor
* 1 - (P) segment presence in memory
* 00 - (DPL) - privilege level, 0 is the highest privilege
* 1 - (S) code or data segment, not a system segment
* 101 - segment type execute/read
* 1 - accessed bit

You can read more about every bit in the previous post or in the Intel® 64 and IA-32 Architectures Software Developer's Manuals 3A.

After this we get the length of the GDT with:

```C
gdt.len = sizeof(boot_gdt)-1;
```

We get the size of `boot_gdt` and subtract 1 (the last valid address in the GDT).

Next we get a pointer to the GDT with:

```C
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
```

Here we just take the address of `boot_gdt` and add to it the address of the data segment left-shifted by 4 bits (remember we're in real mode now).

Lastly we execute the `lgdtl` instruction to load the GDT into the GDTR register:

```C
asm volatile("lgdtl %0" : : "m" (gdt));
```

## Actual transition into protected mode

This is the end of the `go_to_protected_mode` function. We loaded the IDT and GDT, disabled interrupts and now can switch the CPU into protected mode. The last step is calling the `protected_mode_jump` function with two parameters:

```C
protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));
```

which is defined in arch/x86/boot/pmjump.S.

It takes two parameters:

* the address of the protected mode entry point
* the address of `boot_params`

Let's look inside `protected_mode_jump`. The first parameter will be passed in the `eax` register and the second one in the `edx` register.
As I wrote above, you can find it in arch/x86/boot/pmjump.S.

First of all, we put the address of `boot_params` in the `esi` register and the address of the code segment register `cs` (`0x1000`) in `bx`. After this, we shift `bx` by 4 bits, add it to the memory location labeled `2` (which will then be `bx << 4 + in_pm32`, the physical address to jump to after transitioning to 32-bit mode) and jump to label `1`. Next, we put the data segment and the task state segment in the `cx` and `di` registers with:

```assembly
	movw	$__BOOT_DS, %cx
	movw	$__BOOT_TSS, %di
```

As you can read above, `GDT_ENTRY_BOOT_CS` has index 2 and every GDT entry is 8 bytes, so `CS` will be `2 * 8 = 16`, `__BOOT_DS` is 24, etc.

Next, we set the `PE` (Protection Enable) bit in the `CR0` control register:

```assembly
	movl	%cr0, %edx
	orb	$X86_CR0_PE, %dl
	movl	%edx, %cr0
```

and make a long jump to protected mode:

```assembly
	.byte	0x66, 0xea
2:	.long	in_pm32
	.word	__BOOT_CS
```

where:

* `0x66` is the operand-size prefix which allows us to mix 16-bit and 32-bit code
* `0xea` - is the jump opcode
* `in_pm32` is the segment offset
* `__BOOT_CS` is the code segment we want to jump to.

After this we are finally in protected mode:

```assembly
.code32
.section ".text32","ax"
```

Let's look at the first steps taken in protected mode. First of all we set up the data segments with:

```assembly
	movl	%ecx, %ds
	movl	%ecx, %es
	movl	%ecx, %fs
	movl	%ecx, %gs
	movl	%ecx, %ss
```

If you paid attention, you can remember that we saved `$__BOOT_DS` in the `cx` register. Now we fill it into all segment registers besides `cs` (`cs` is already `__BOOT_CS`).

And we set up a valid stack for debugging purposes:

```assembly
	addl	%ebx, %esp
```

The last step before the jump into the 32-bit entry point is to clear the general purpose registers:

```assembly
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ebp, %ebp
	xorl	%edi, %edi
```

And jump to the 32-bit entry point in the end:

```assembly
	jmpl	*%eax
```

Remember that `eax` contains the address of the 32-bit entry (we passed it as the first parameter into `protected_mode_jump`).

That's all. We're in protected mode and stop at its entry point.
We will see what happens next in the next part.

## Conclusion

This is the end of the third part about linux kernel insides. In the next part, we will look at the first steps we take in protected mode and the transition into long mode.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR with corrections at linux-insides.

## Links

* VGA
* VESA BIOS Extensions
* Data structure alignment
* Non-maskable interrupt
* A20
* GCC designated inits
* GCC type attributes
* Previous part

# Kernel booting process. Part 4.

## Transition to 64-bit mode

This is the fourth part of the `Kernel booting process`, where we will see the first steps in protected mode, like checking that the CPU supports long mode and SSE, paging, initialization of the page tables, and at the end we will discuss the transition to long mode.

NOTE: there will be much assembly code in this part, so if you are not familiar with that, you might want to consult a book about it.

In the previous part we stopped at the jump to the `32-bit` entry point in arch/x86/boot/pmjump.S:

```assembly
jmpl	*%eax
```

You will recall that the `eax` register contains the address of the 32-bit entry point.
We can read about this in the linux kernel x86 boot protocol:

```
When using bzImage, the protected-mode kernel was relocated to 0x100000
```

Let's make sure that this is true by looking at the register values at the 32-bit entry point:

```
eax            0x100000	1048576
ecx            0x0	    0
edx            0x0	    0
ebx            0x0	    0
esp            0x1ff5c	0x1ff5c
ebp            0x0	    0x0
esi            0x14470	83056
edi            0x0	    0
eip            0x100000	0x100000
eflags         0x46	    [ PF ZF ]
cs             0x10	16
ss             0x18	24
ds             0x18	24
es             0x18	24
fs             0x18	24
gs             0x18	24
```

We can see here that the `cs` register contains `0x10` (as you may remember from the previous part, this is the second index in the `Global Descriptor Table`), the `eip` register contains `0x100000` and the base address of all segments including the code segment is zero. So we can get the physical address: it will be `0:0x100000` or just `0x100000`, as specified by the boot protocol. Now let's start with the `32-bit` entry point.

## 32-bit entry point

We can find the definition of the `32-bit` entry point in the arch/x86/boot/compressed/head_64.S assembly source code file:

```assembly
	__HEAD
	.code32
ENTRY(startup_32)
....
....
....
ENDPROC(startup_32)
```

First of all, why is the directory named `compressed`? Actually, `bzimage` is a gzipped `vmlinux + header + kernel setup code`. We saw the kernel setup code in all of the previous parts. So, the main goal of `head_64.S` is to prepare for entering long mode, enter it, and then decompress the kernel. We will see all of the steps up to kernel decompression in this part.

You may find two files in the arch/x86/boot/compressed directory:

* head_32.S
* head_64.S

but we will consider only the `head_64.S` source code file because, as you may remember, this book is only `x86_64` related. Let's look at arch/x86/boot/compressed/Makefile. We can find the following make target here:

```Makefile
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	$(obj)/string.o $(obj)/cmdline.o \
	$(obj)/piggy.o $(obj)/cpuflags.o
```

Take a look at `$(obj)/head_$(BITS).o`. This means that we will select which file to link based on what `$(BITS)` is set to: either `head_32.o` or `head_64.o`.
The `$(BITS)` variable is defined elsewhere in arch/x86/Makefile based on the kernel configuration:

```Makefile
ifeq ($(CONFIG_X86_32),y)
        BITS := 32
        ...
        ...
else
        BITS := 64
        ...
        ...
endif
```

Now we know where to start, so let's do it.

## Reload the segments if needed

As indicated above, we start in the arch/x86/boot/compressed/head_64.S assembly source code file. First we see the definition of the special section attribute before the `startup_32` definition:

```assembly
	__HEAD
	.code32
ENTRY(startup_32)
```

`__HEAD` is a macro which is defined in the include/linux/init.h header file and expands to the definition of the following section:

```C
#define __HEAD		.section	".head.text","ax"
```

with `.head.text` name and `ax` flags. In our case, these flags show us that this section is executable or, in other words, contains code. We can find the definition of this section in the arch/x86/boot/compressed/vmlinux.lds.S linker script:

```
SECTIONS
{
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
	...
	...
	...
}
```

If you are not familiar with the syntax of the `GNU LD` linker scripting language, you can find more information in its documentation. In short, the `.` symbol is a special variable of the linker - the location counter. The value assigned to it is an offset relative to the offset of the segment. In our case, we assign zero to the location counter. This means that our code is linked to run from offset `0` in memory. Moreover, we can find this information in the comments:

```
Be careful parts of head_64.S assume startup_32 is at address 0.
```

Ok, now we know where we are, and now is the best time to look inside the `startup_32` function.

In the beginning of the `startup_32` function, we can see the `cld` instruction which clears the `DF` bit in the flags register. When the direction flag is clear, all string operations like `stos`, `scas` and others will increment the index registers `esi` or `edi`.
We need to clear the direction flag because later we will use string operations for clearing space for page tables, etc.

After we have cleared the `DF` bit, the next step is the check of the `KEEP_SEGMENTS` flag from the `loadflags` kernel setup header field. If you remember, we already saw `loadflags` in the very first part of this book. There we checked the `CAN_USE_HEAP` flag to get the ability to use the heap. Now we need to check the `KEEP_SEGMENTS` flag. This flag is described in the linux boot protocol documentation:

```
Bit 6 (write): KEEP_SEGMENTS
  Protocol: 2.07+
  - If 0, reload the segment registers in the 32bit entry point.
  - If 1, do not reload the segment registers in the 32bit entry point.
    Assume that %cs %ds %ss %es are all set to flat segments with
    a base of 0 (or the equivalent for their environment).
```

So, if the `KEEP_SEGMENTS` bit is not set in the `loadflags`, we need to set the `ds`, `ss` and `es` segment registers to the index of the data segment with base `0`. That we do:

```assembly
	testb $KEEP_SEGMENTS, BP_loadflags(%esi)
	jnz 1f

	cli
	movl	$(__BOOT_DS), %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
```

Remember that `__BOOT_DS` is `0x18` (the index of the data segment in the Global Descriptor Table). If `KEEP_SEGMENTS` is set, we jump to the nearest `1f` label, or we update the segment registers with `__BOOT_DS` if it is not set. It is pretty easy, but here is one interesting moment. If you've read the previous part, you may remember that we already updated these segment registers right after we switched to protected mode in arch/x86/boot/pmjump.S. So why do we need to care about the values of the segment registers again? The answer is easy. The Linux kernel also has a 32-bit boot protocol and if a bootloader uses it to load the Linux kernel, all code before `startup_32` will be missed.
In this case, `startup_32` will be the first entry point of the Linux kernel right after the bootloader and there are no guarantees that the segment registers will be in a known state.

After we have checked the `KEEP_SEGMENTS` flag and put the correct values into the segment registers, the next step is to calculate the difference between where we were loaded and where we were compiled to run. Remember that the linker script contains the definition `. = 0` at the start of the `.head.text` section. This means that the code in this section is compiled to run from address `0`. We can see this in the `objdump` output:

```
arch/x86/boot/compressed/vmlinux:     file format elf64-x86-64

Disassembly of section .head.text:

0000000000000000 <startup_32>:
   0:   fc                      cld
   1:   f6 86 11 02 00 00 40    testb  $0x40,0x211(%rsi)
```

The `objdump` util tells us that the address of `startup_32` is `0`, but actually it's not so. Our current goal is to find out where we actually are. It is pretty simple to do in long mode, because it supports `rip` relative addressing, but currently we are in protected mode. We will use a common pattern to find the address of `startup_32`. We need to define a label, make a call to it and pop the top of the stack to a register:

```assembly
call label
label: pop %reg
```

After this, the `%reg` register will contain the address of the label. Let's look at the similar code which finds the address of `startup_32` in the Linux kernel:

```assembly
	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:	popl	%ebp
	subl	$1b, %ebp
```

As you remember from the previous part, the `esi` register contains the address of the boot_params structure which was filled before we moved to protected mode. The `boot_params` structure contains a special field `scratch` with offset `0x1e4`. This four-byte field will be a temporary stack for the `call` instruction. We take the address of the `scratch` field plus `4` bytes and put it in the `esp` register. We add `4` bytes to the base of the `BP_scratch` field because, as just described, it will be a temporary stack, and the stack grows from top to bottom in the `x86_64` architecture.
So our stack pointer will point to the top of the stack. Next, we can see the pattern that I've described above. We make a call to the `1f` label and put the address of this label in the `ebp` register, because we have the return address on the top of the stack after the `call` instruction is executed. So, for now we have the address of the `1f` label, and now it is easy to get the address of `startup_32`. We just need to subtract the address of the label from the address which we got from the stack:

```
startup_32 (0x0)     +-----------------------+
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
1f (0x0 + 1f offset) +-----------------------+ %ebp - real physical address
                     |                       |
                     |                       |
                     +-----------------------+
```

`startup_32` is linked to run at address `0x0` and this means that `1f` has the address `0x0 + offset to 1f`, approximately `0x21` bytes. The `ebp` register contains the real physical address of the `1f` label. So, if we subtract `1f` from `ebp` we will get the real physical address of `startup_32`. The Linux kernel boot protocol describes that the base of the protected mode kernel is `0x100000`. We can verify this with gdb. Let's start the debugger and put a breakpoint just after the `popl`, at `0x100022`. If this is correct, we will see `0x100021` (the address of `1f`) in the `ebp` register:

```
$ gdb
(gdb)$ target remote :1234
Remote debugging using :1234
0x0000fff0 in ?? ()
(gdb)$ br *0x100022
Breakpoint 1 at 0x100022
(gdb)$ c
Continuing.

Breakpoint 1, 0x00100022 in ?? ()
(gdb)$ i r
eax            0x18	    24
ecx            0x0	    0
edx            0x0	    0
ebx            0x0	    0
esp            0x144a8	0x144a8
ebp            0x100021	0x100021
esi            0x142c0	0x142c0
edi            0x0	    0
eip            0x100022	0x100022
eflags         0x46	    [ PF ZF ]
cs             0x10	16
ss             0x18	24
ds             0x18	24
es             0x18	24
fs             0x18	24
gs             0x18	24
```

If we execute the next instruction, `subl $1b, %ebp`, we will see:

```
(gdb) nexti
...
ebp            0x100000	0x100000
...
```

Ok, that's true. The address of `startup_32` is `0x100000`. After we know the address of the `startup_32` label, we can prepare for the transition to long mode.
Our next goal is to set up the stack and verify that the CPU supports long mode and SSE.

## Stack setup and CPU verification

We could not set up the stack until we knew the address of the `startup_32` label. We can imagine the stack as an array, and the stack pointer register `esp` must point to the end of this array. Of course, we can define an array in our code, but we need to know its actual address to configure the stack pointer in a correct way. Let's look at the code:

```assembly
	movl	$boot_stack_end, %eax
	addl	%ebp, %eax
	movl	%eax, %esp
```

The `boot_stack_end` label is defined in the same arch/x86/boot/compressed/head_64.S assembly source code file and located in the .bss section:

```assembly
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```

First of all, we put the address of `boot_stack_end` into the `eax` register, so the `eax` register contains the address of `boot_stack_end` where it was linked, which is `0x0 + boot_stack_end`. To get the real address of `boot_stack_end`, we need to add the real address of `startup_32`. As you remember, we have found this address above and put it into the `ebp` register. In the end, the `eax` register will contain the real address of `boot_stack_end` and we just need to put it into the stack pointer.

After we have set up the stack, the next step is CPU verification. As we are going to execute the transition to `long mode`, we need to check that the CPU supports `long mode` and `SSE`. We will do it with a call to the `verify_cpu` function:

```assembly
	call	verify_cpu
	testl	%eax, %eax
	jnz	no_longmode
```

This function is defined in the arch/x86/kernel/verify_cpu.S assembly file and just contains a couple of calls to the cpuid instruction. This instruction is used for getting information about the processor.
In our case, it checks `long mode` and `SSE` support and returns `0` on success or `1` on failure in the `eax` register.

If the value of `eax` is not zero, we jump to the `no_longmode` label, which just stops the CPU with a `hlt` instruction while no hardware interrupt can happen:

```assembly
no_longmode:
1:
	hlt
	jmp     1b
```

If the value of the `eax` register is zero, everything is ok and we are able to continue.

## Calculate the relocation address

The next step is calculating the relocation address for decompression if needed. First, we need to know what it means for a kernel to be `relocatable`. We already know that the base address of the 32-bit entry point of the Linux kernel is `0x100000`, but that is a 32-bit entry point. The default base address of the Linux kernel is determined by the value of the `CONFIG_PHYSICAL_START` kernel configuration option. Its default value is `0x1000000` or `16 MB`. The main problem here is that if the Linux kernel crashes, a kernel developer must have a `rescue kernel` for kdump which is configured to load from a different address. The Linux kernel provides a special configuration option to solve this problem: `CONFIG_RELOCATABLE`. As we can read in the documentation of the Linux kernel:

```
This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.

Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.
```

In simple terms, this means that the Linux kernel with the same configuration can be booted from different addresses. Technically, this is done by compiling the decompressor as position independent code. If we look at arch/x86/boot/compressed/Makefile, we will see that the decompressor is indeed compiled with the `-fPIC` flag:

```Makefile
KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
```

When we are using position-independent code, an address is obtained by adding the address field of the instruction to the value of the program counter.
We can load code which uses such addressing from any address. That's why we had to get the real physical address of `startup_32`. Now let's get back to the Linux kernel code. Our current goal is to calculate an address where we can relocate the kernel for decompression. The calculation of this address depends on the `CONFIG_RELOCATABLE` kernel configuration option. Let's look at the code:

```assembly
#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jge	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
```

Remember that the value of the `ebp` register is the physical address of the `startup_32` label. If the `CONFIG_RELOCATABLE` kernel configuration option is enabled during kernel configuration, we put this address in the `ebx` register, align it to a multiple of `2MB` and compare it with the `LOAD_PHYSICAL_ADDR` value. The `LOAD_PHYSICAL_ADDR` macro is defined in the arch/x86/include/asm/boot.h header file and it looks like this:

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```

As we can see, it just expands to the `CONFIG_PHYSICAL_START` value aligned up to `CONFIG_PHYSICAL_ALIGN`, which represents the physical address where the kernel is to be loaded. After the comparison of `LOAD_PHYSICAL_ADDR` and the value of the `ebx` register, we add the offset from `startup_32` where we will decompress the compressed kernel image. If the `CONFIG_RELOCATABLE` option is not enabled during kernel configuration, we just put the default address where to load the kernel and add `z_extract_offset` to it.

After all of these calculations, `ebp` will contain the address where we loaded the kernel and `ebx` will be set to the address to which the kernel will be moved after decompression.
But that is not the end. The compressed kernel image should be moved to the end of the decompression buffer to simplify the calculation of where the kernel will be located later. For this:

```assembly
1:
	movl	BP_init_size(%esi), %eax
	subl	$_end, %eax
	addl	%eax, %ebx
```

we put the value of `boot_params.hdr.init_size` (the kernel setup header value `BP_init_size`) into the `eax` register. `init_size` contains the larger of the sizes of the compressed and uncompressed vmlinux. Next, we subtract the address of the `_end` symbol from this value and add the result of the subtraction to the `ebx` register, which stores the base address for kernel decompression.

## Preparation before entering long mode

When we have the base address to which we will relocate the compressed kernel image, we need to do one last step before we can transition to 64-bit mode. First, we need to update the Global Descriptor Table with 64-bit segments, because a relocatable kernel may be run at any address below 512G:

```assembly
	addl	%ebp, gdt+2(%ebp)
	lgdt	gdt(%ebp)
```

Here we adjust the base address of the Global Descriptor Table to the address where we were actually loaded, and load the `Global Descriptor Table` with the `lgdt` instruction.

To understand the magic with the `gdt` offsets, we need to look at the definition of the `Global Descriptor Table`. We can find its definition in the same source code file:

```assembly
	.data
gdt64:
	.word	gdt_end - gdt
	.long	0
	.word	0
	.quad	0
gdt:
	.word	gdt_end - gdt
	.long	gdt
	.word	0
	.quad	0x00cf9a000000ffff	/* __KERNEL32_CS */
	.quad	0x00af9a000000ffff	/* __KERNEL_CS */
	.quad	0x00cf92000000ffff	/* __KERNEL_DS */
	.quad	0x0080890000000000	/* TS descriptor */
	.quad   0x0000000000000000	/* TS continued */
gdt_end:
```

We can see that it is located in the `.data` section and contains five descriptors: the first is a `32-bit` descriptor for the kernel code segment, then a `64-bit` kernel code segment, a kernel data segment and two task descriptors.

We already loaded the `Global Descriptor Table` in the previous part, and now we're doing almost the same here, but we set descriptors with `CS.L = 1` and `CS.D = 0` for execution in `64-bit` mode. As we can see, the definition of the `gdt` starts with two bytes: `gdt_end - gdt`, which
The next four bytes contain the base address of the `gdt`.

After we have loaded the `Global Descriptor Table` with the `lgdt` instruction, we must enable PAE mode by putting the value of the `cr4` register into `eax`, setting bit 5 in it and loading it back into `cr4`:

```assembly
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4
```

Now we are almost finished with all preparations before we can move into 64-bit mode. The last step is to build page tables, but before that, here is some information about long mode.

Long mode

Long mode is the native mode for x86_64 processors. First, let's look at some differences between `x86_64` and `x86`.

The 64-bit mode provides features such as:

- 8 new general purpose registers from `r8` to `r15`, and all general purpose registers are 64-bit now;
- A 64-bit instruction pointer - `RIP`;
- A new operating mode - Long mode;
- 64-bit addresses and operands;
- RIP-relative addressing (we will see an example of it in the next parts).

Long mode is an extension of the legacy protected mode. It consists of two sub-modes:

- 64-bit mode;
- compatibility mode.

To switch into 64-bit mode we need to do the following things:

- Enable PAE;
- Build page tables and load the address of the top level page table into the `cr3` register;
- Enable `EFER.LME`;
- Enable paging.

We already enabled PAE by setting the PAE bit in the `cr4` control register. Our next goal is to build the structure for paging. We will see this in the next paragraph.

Early page table initialization

So, we already know that before we can move into 64-bit mode, we need to build page tables, so let's look at the building of the early 4G boot page tables.

NOTE: I will not describe the theory of virtual memory here. If you need to know more about it, see the links at the end of this part.

The Linux kernel uses 4-level paging, and we generally build 6 page tables:

- One `PML4` or `Page Map Level 4` table with one entry;
- One `PDP` or `Page Directory Pointer` table with four entries;
- Four Page Directory tables with a total of 2048 entries.

Let's look at the implementation of this. First of all, we clear the buffer for the page tables in memory.
Every table is 4096 bytes, so we need to clear a 24 kilobyte buffer:

```assembly
	leal	pgtable(%ebx), %edi
	xorl	%eax, %eax
	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
	rep	stosl
```

We put the address of `pgtable` plus `ebx` (remember that `ebx` contains the address to which we relocate the kernel for decompression) in the `edi` register, clear the `eax` register and set the `ecx` register to `6144`, or `BOOT_INIT_PGT_SIZE/4`. The `rep stosl` instruction will write the value of `eax` to `edi`, increase the value of the `edi` register by `4` and decrease the value of the `ecx` register by `1`. This operation will be repeated while the value of the `ecx` register is greater than zero. That's why we put `6144` in `ecx`.

`pgtable` is defined at the end of the arch/x86/boot/compressed/head_64.S assembly file:

```assembly
	.section ".pgtable","a",@nobits
	.balign 4096
pgtable:
	.fill BOOT_PGT_SIZE, 1, 0
```

As we can see, it is located in the `.pgtable` section and its size depends on the `CONFIG_X86_VERBOSE_BOOTUP` kernel configuration option:

```C
# ifdef CONFIG_RANDOMIZE_BASE
#  ifdef CONFIG_X86_VERBOSE_BOOTUP
#   define BOOT_PGT_SIZE	(19*4096)
#  else /* !CONFIG_X86_VERBOSE_BOOTUP */
#   define BOOT_PGT_SIZE	(17*4096)
#  endif
# else /* !CONFIG_RANDOMIZE_BASE */
#  define BOOT_PGT_SIZE		BOOT_INIT_PGT_SIZE
# endif
```

After we have got the buffer for the `pgtable` structure, we can start to build the top level page table - `PML4` - with:

```assembly
	leal	pgtable + 0(%ebx), %edi
	leal	0x1007 (%edi), %eax
	movl	%eax, 0(%edi)
```

Here again, we put the address of `pgtable` relative to `ebx`, or in other words relative to the address of `startup_32`, in the `edi` register. Next, we put this address with an offset of `0x1007` in the `eax` register. `0x1007` is `4096` bytes, the size of the `PML4`, plus `7`. The `7` here represents the flags of the `PML4` entry. In our case, these flags are `PRESENT+RW+USER`.
In the end, we just write the address of the first `Page Directory Pointer` entry to the `PML4` table.

In the next step we will build four `Page Directory` entries in the `Page Directory Pointer` table with the same `PRESENT+RW+USER` flags:

```assembly
	leal	pgtable + 0x1000(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:	movl	%eax, 0x00(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

We put the base address of the page directory pointer table, which is at offset `0x1000` from `pgtable`, in the `edi` register and the address of the first page directory pointer entry in the `eax` register. We put `4` in the `ecx` register; it will be a counter in the following loop. Then we write the address of the first page directory pointer table entry to the `edi` register. After this, `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the addresses of the following page directory pointer entries, where each entry is `8` bytes, and write their addresses to `eax`. The last step of building the paging structure is building the `2048` page table entries with 2-MByte pages:

```assembly
	leal	pgtable + 0x2000(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:	movl	%eax, 0(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

Here we do almost the same as in the previous example: all entries will be with flags `$0x00000183` - `PRESENT + WRITE + MBZ`. In the end, we will have `2048` entries with 2-MByte pages, or:

```python
>>> 2048 * 0x00200000
4294967296
```

a `4G` page table. We have just finished building our early page table structure which maps `4` gigabytes of memory, and now we can put the address of the high-level page table - `PML4` - in the `cr3` control register:

```assembly
	leal	pgtable(%ebx), %eax
	movl	%eax, %cr3
```

That's all. All preparations are finished and now we can see the transition to long mode.

Transition to the 64-bit mode

First of all we need to set the `EFER.LME` flag in the MSR located at `0xC0000080`:

```assembly
	movl	$MSR_EFER, %ecx
	rdmsr
	btsl	$_EFER_LME, %eax
	wrmsr
```

Here we put the `MSR_EFER` flag (which is defined in arch/x86/include/uapi/asm/msr-index.h) in the `ecx` register and call the `rdmsr` instruction, which reads the MSR selected by the `ecx` value. After `rdmsr` executes, the resulting data will be in `edx:eax`.
We set the `EFER_LME` bit with the `btsl` instruction and write the data from `eax` back to the `MSR` register with the `wrmsr` instruction.

In the next step, we push the address of the kernel code segment to the stack (we defined it in the GDT) and put the address of the `startup_64` routine in `eax`:

```assembly
	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax
```

After this we push this address to the stack and enable paging by setting the `PG` and `PE` bits in the `cr0` register:

```assembly
	pushl	%eax
	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
	movl	%eax, %cr0
```

and execute the:

```assembly
lret
```

instruction. Remember that we pushed the address of the `startup_64` function to the stack in the previous step; after the `lret` instruction, the CPU extracts that address and jumps there.

After all of these steps we're finally in 64-bit mode:

```assembly
	.code64
	.org 0x200
ENTRY(startup_64)
....
....
....
```

That's all!

Conclusion

This is the end of the fourth part of the Linux kernel booting process. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part, we will see the kernel decompression and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

- Protected mode
- Intel® 64 and IA-32 Architectures Software Developer's Manual 3A
- GNU linker
- SSE
- Paging
- Model specific register
- .fill instruction
- Previous part
- Paging on osdev.org
- Paging Systems
- x86 Paging Tutorial

Kernel booting process. Part 5.

Kernel decompression

This is the fifth part of the `Kernel booting process` series. We saw the transition to 64-bit mode in the previous part and we will continue from that point in this part. We will see the last steps before we jump to the kernel code: preparation for kernel decompression, relocation and the kernel decompression itself. So... let's dive into the kernel code again.

Preparation before kernel decompression

We stopped right before the jump to the 64-bit entry point - `startup_64` - which is located in the arch/x86/boot/compressed/head_64.S source code file.
We already saw the jump to `startup_64` in `startup_32`:

```assembly
	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax
	...
	...
	...
	pushl	%eax
	...
	...
	...
	lret
```

in the previous part. Since we have loaded the new `Global Descriptor Table` and there was a CPU transition to another mode (64-bit mode in our case), we can see the setup of the data segments:

```assembly
	.code64
	.org 0x200
ENTRY(startup_64)
	xorl	%eax, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
	movl	%eax, %fs
	movl	%eax, %gs
```

at the beginning of `startup_64`. All segment registers besides the `cs` register are now reset, as we have entered long mode.

The next step is the computation of the difference between where the kernel was compiled and where it was loaded:

```assembly
#ifdef CONFIG_RELOCATABLE
	leaq	startup_32(%rip), %rbp
	movl	BP_kernel_alignment(%rsi), %eax
	decl	%eax
	addq	%rax, %rbp
	notq	%rax
	andq	%rax, %rbp
	cmpq	$LOAD_PHYSICAL_ADDR, %rbp
	jge	1f
#endif
	movq	$LOAD_PHYSICAL_ADDR, %rbp
1:
	movl	BP_init_size(%rsi), %ebx
	subl	$_end, %ebx
	addq	%rbp, %rbx
```

`rbp` contains the decompressed kernel start address, and after this code executes the `rbx` register will contain the address to which the kernel code is relocated for decompression. We already saw code like this in `startup_32` (you can read about it in the previous part - Calculate relocation address), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol, and `startup_32` will just not be executed in that case.

In the next step we can see the setup of the stack pointer, the resetting of the flags register, and the setup of the `GDT` again, because in the case of the 64-bit protocol the 32-bit code segment can be omitted by the bootloader:

```assembly
	leaq	boot_stack_end(%rbx), %rsp

	leaq	gdt(%rip), %rax
	movq	%rax, gdt64+2(%rip)
	lgdt	gdt64(%rip)

	pushq	$0
	popfq
```

If you look at the Linux kernel source code after the `lgdt gdt64(%rip)` instruction, you will see that there is some additional code. This code builds the trampoline to enable 5-level paging if needed.
We will consider only 4-level paging in this book, so this code will be omitted.

As you can see above, the `rbx` register contains the start address of the kernel decompressor code, and we just put this address with the `boot_stack_end` offset in the `rsp` register, which represents a pointer to the top of the stack. After this step, the stack will be correct. You can find the definition of `boot_stack_end` at the end of the arch/x86/boot/compressed/head_64.S assembly source code file:

```assembly
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```

It is located at the end of the `.bss` section, right before `.pgtable`. If you look into the arch/x86/boot/compressed/vmlinux.lds.S linker script, you will find the definitions of `.bss` and `.pgtable` there.

As we have set up the stack, now we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before the details, let's look at this assembly code:

```assembly
	pushq	%rsi
	leaq	(_bss-8)(%rip), %rsi
	leaq	(_bss-8)(%rbx), %rdi
	movq	$_bss, %rcx
	shrq	$3, %rcx
	std
	rep	movsq
	cld
	popq	%rsi
```

First of all we push `rsi` to the stack. We need to preserve the value of `rsi`, because this register now stores a pointer to `boot_params`, the real mode structure that contains booting-related data (you must remember this structure, we filled it at the start of the kernel setup). At the end of this code we will restore the pointer to `boot_params` into `rsi` again.

The next two `leaq` instructions calculate the effective addresses of `rip` and `rbx` with the `_bss - 8` offset and put them in `rsi` and `rdi`. Why do we calculate these addresses? Actually, the compressed kernel image is located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at the linker script - arch/x86/boot/compressed/vmlinux.lds.S:

```
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
```
```
	.rodata..compressed : {
		*(.rodata..compressed)
	}
	.text :	{
		_text = .; 	/* Text */
		*(.text)
		*(.text.*)
		_etext = . ;
	}
```

Note that the `.head.text` section contains `startup_32`. You may remember it from the previous part:

```assembly
	__HEAD
	.code32
ENTRY(startup_32)
...
...
...
```

The `.text` section contains the decompression code:

```assembly
	.text
relocated:
...
...
...
/*
 * Do the decompression, and jump to the new kernel..
 */
...
```

And `.rodata..compressed` contains the compressed kernel image. So `rsi` will contain the absolute address of `_bss - 8`, and `rdi` will contain the relocation-relative address of `_bss - 8`. As we store these addresses in registers, we put the address of `_bss` in the `rcx` register. As you can see in the `vmlinux.lds.S` linker script, it's located at the end of all sections with the setup/kernel code. Now we can start to copy data from `rsi` to `rdi`, `8` bytes at a time, with the `movsq` instruction.

Note that there is an `std` instruction before the data copying: it sets the `DF` flag, which means that `rsi` and `rdi` will be decremented. In other words, we will copy the bytes backwards. At the end, we clear the `DF` flag with the `cld` instruction, and restore the `boot_params` structure to `rsi`.

Now we have the address of the `.text` section after relocation, and we can jump to it:

```assembly
	leaq	relocated(%rbx), %rax
	jmp	*%rax
```

Last preparation before kernel decompression

In the previous paragraph we saw that the `.text` section starts with the `relocated` label. The first thing it does is clearing the `bss` section with:

```assembly
	xorl	%eax, %eax
	leaq    _bss(%rip), %rdi
	leaq    _ebss(%rip), %rcx
	subq	%rdi, %rcx
	shrq	$3, %rcx
	rep	stosq
```

We need to initialize the `.bss` section, because we'll soon jump to C code.
Here we just clear `eax`, put the addresses of `_bss` and `_ebss` in `rdi` and `rcx`, and fill it with zeros with the `rep stosq` instruction.

At the end, we can see the call to the `extract_kernel` function:

```assembly
	pushq	%rsi
	movq	%rsi, %rdi
	leaq	boot_heap(%rip), %rsi
	leaq	input_data(%rip), %rdx
	movl	$z_input_len, %ecx
	movq	%rbp, %r8
	movq	$z_output_len, %r9
	call	extract_kernel
	popq	%rsi
```

Again we set `rdi` to a pointer to the `boot_params` structure and preserve it on the stack. At the same time we set `rsi` to point to the area which should be used for kernel uncompression. The last step is the preparation of the `extract_kernel` parameters and the call of this function, which will uncompress the kernel. The `extract_kernel` function is defined in the arch/x86/boot/compressed/misc.c source code file and takes six arguments:

- `rmode` - a pointer to the boot_params structure which is filled by the bootloader or during early kernel initialization;
- `heap` - a pointer to `boot_heap` which represents the start address of the early boot heap;
- `input_data` - a pointer to the start of the compressed kernel, or in other words a pointer to arch/x86/boot/compressed/vmlinux.bin.bz2;
- `input_len` - the size of the compressed kernel;
- `output` - the start address of the future decompressed kernel;
- `output_len` - the size of the decompressed kernel.

All arguments will be passed through registers according to the System V Application Binary Interface. We've finished all the preparation and can now look at the kernel decompression.

Kernel decompression

As we saw in the previous paragraph, the `extract_kernel` function is defined in the arch/x86/boot/compressed/misc.c source code file and takes six arguments. This function starts with the video/console initialization that we already saw in the previous parts.
We need to do this again because we don't know if we started in real mode or a bootloader was used, or whether the bootloader used the 32-bit or 64-bit boot protocol.

After the first initialization steps, we store pointers to the start of the free memory and to the end of it:

```C
free_mem_ptr     = heap;
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
```

where `heap` is the second parameter of the `extract_kernel` function, which we got in arch/x86/boot/compressed/head_64.S:

```assembly
	leaq	boot_heap(%rip), %rsi
```

As you saw above, `boot_heap` is defined as:

```assembly
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
```

where `BOOT_HEAP_SIZE` is a macro which expands to `0x10000` (`0x400000` in the case of a `bzip2` kernel) and represents the size of the heap.

After the heap pointers initialization, the next step is the call of the `choose_random_location` function from the arch/x86/boot/compressed/kaslr.c source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` a location where to decompress the compressed kernel image, but the Linux kernel supports kASLR, which allows decompression of the kernel into a random address, for security reasons.

We will not consider the randomization of the Linux kernel load address in this part, but will do it in the next part.

Now let's get back to misc.c.
After getting the address for the kernel image, there need to be some checks to be sure that the retrieved random address is correctly aligned and the address is not wrong:

```C
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
	error("Destination physical address inappropriately aligned");
if (virt_addr & (MIN_KERNEL_ALIGN - 1))
	error("Destination virtual address inappropriately aligned");
if (heap > 0x3fffffffffffUL)
	error("Destination address too large");
if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
	error("Destination virtual address is beyond the kernel mapping area");
if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
	error("Destination address does not match LOAD_PHYSICAL_ADDR");
if (virt_addr != LOAD_PHYSICAL_ADDR)
	error("Destination virtual address changed when not relocatable");
```

After all these checks we will see the familiar message:

```
Decompressing Linux...
```

and call the `__decompress` function:

```C
__decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error);
```

which will decompress the kernel. The implementation of the `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:

```C
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif

#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif

#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif

#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif
```

After the kernel is decompressed, the last two functions are `parse_elf` and `handle_relocations`. The main point of these functions is to move the uncompressed kernel image to the correct memory place. The fact is that the decompression will decompress in-place, and we still need to move the kernel to the correct address.
As we already know, the kernel image is an ELF executable, so the main goal of the `parse_elf` function is to move the loadable segments to the correct address. We can see the loadable segments in the output of the `readelf` program:

```
readelf -l vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000893000 0x0000000000893000  R E    200000
  LOAD           0x0000000000a93000 0xffffffff81893000 0x0000000001893000
                 0x000000000016d000 0x000000000016d000  RW     200000
  LOAD           0x0000000000c00000 0x0000000000000000 0x0000000001a00000
                 0x00000000000152d8 0x00000000000152d8  RW     200000
  LOAD           0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
                 0x0000000000138000 0x000000000029b000  RWE    200000
```

The goal of the `parse_elf` function is to load these segments to the `output` address we got from the `choose_random_location` function. This function starts by checking the ELF signature:

```C
Elf64_Ehdr ehdr;
Elf64_Phdr *phdrs, *phdr;

memcpy(&ehdr, output, sizeof(ehdr));

if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
    ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
    ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
    ehdr.e_ident[EI_MAG3] != ELFMAG3) {
	error("Kernel is not a valid ELF file");
	return;
}
```

and if it's not valid, it prints an error message and halts.
If we got a valid ELF file, we go through all the program headers from the given ELF file and copy all loadable segments, with correct 2 megabyte aligned addresses, to the output buffer:

```C
	for (i = 0; i < ehdr.e_phnum; i++) {
		phdr = &phdrs[i];

		switch (phdr->p_type) {
		case PT_LOAD:
#ifdef CONFIG_X86_64
			if ((phdr->p_align % 0x200000) != 0)
				error("Alignment of LOAD segment isn't multiple of 2MB");
#endif
#ifdef CONFIG_RELOCATABLE
			dest = output;
			dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
			dest = (void *)(phdr->p_paddr);
#endif
			memmove(dest, output + phdr->p_offset, phdr->p_filesz);
			break;
		default:
			break;
		}
	}
```

That's all. From this moment, all loadable segments are in the correct place.

The next step after the `parse_elf` function is the call of the `handle_relocations` function. The implementation of this function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option: if it is enabled, this function adjusts addresses in the kernel image, and it is called only if the `CONFIG_RANDOMIZE_BASE` configuration option was enabled during kernel configuration. The implementation of the `handle_relocations` function is easy enough. This function subtracts the value of `LOAD_PHYSICAL_ADDR` from the value of the base load address of the kernel, and thus we obtain the difference between where the kernel was linked to load and where it was actually loaded. After this we can perform the kernel relocation, as we know the actual address where the kernel was loaded, the address where it was linked to run and the relocation table which is at the end of the kernel image.

After the kernel is relocated, we return back from `extract_kernel` to arch/x86/boot/compressed/head_64.S. The address of the kernel will be in the `rax` register and we jump to it:

```assembly
	jmp	*%rax
```

That's all. Now we are in the kernel!

Conclusion

This is the end of the fifth part about the Linux kernel booting process.
We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will describe more advanced details about the Linux kernel booting process, like load address randomization and so on.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

- address space layout randomization
- initrd
- long mode
- bzip2
- RdRand instruction
- Time Stamp Counter
- Programmable Interval Timers
- Previous part

Kernel booting process. Part 6.

Introduction

This is the sixth part of the `Kernel booting process` series. In the previous part we saw the end of the kernel boot process. But we skipped some important advanced parts.

As you may remember, the entry point of the Linux kernel is the `start_kernel` function from the main.c source code file, which starts to execute at the `LOAD_PHYSICAL_ADDR` address. This address depends on the `CONFIG_PHYSICAL_START` kernel configuration option, which is `0x1000000` by default:

```
config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	---help---
	  This gives the physical address where the kernel is loaded.
	  ...
	  ...
	  ...
```

This value may be changed during kernel configuration, but the load address can also be selected as a random value. For this purpose the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled during kernel configuration.

In this case the physical address at which the Linux kernel image will be decompressed and loaded will be randomized.
This part considers the case when this option is enabled and the load address of the kernel image is randomized for security reasons.

Initialization of page tables

Before the kernel decompressor can start to find a random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If the bootloader used the 16-bit or 32-bit boot protocol, we already have page tables. But in any case, we may need new pages on demand if the kernel decompressor selects a memory range outside of them. That's why we need to build new identity mapped page tables.

Yes, the building of identity mapped page tables is one of the first steps during the randomization of the load address. But before we consider it, let's try to remember how we came to this point.

In the previous part, we saw the transition to long mode and the jump to the kernel decompressor entry point - the `extract_kernel` function. The randomization stuff starts here with the call of the:

```C
void choose_random_location(unsigned long input,
			    unsigned long input_size,
			    unsigned long *output,
			    unsigned long output_size,
			    unsigned long *virt_addr)
{}
```

function. As you may see, this function takes the following five parameters:

- `input`;
- `input_size`;
- `output`;
- `output_size`;
- `virt_addr`.

Let's try to understand what these parameters are. The first, `input`, came from the parameters of the `extract_kernel` function from the arch/x86/boot/compressed/misc.c source code file:

```C
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
				          unsigned char *input_data,
				          unsigned long input_len,
				          unsigned char *output,
				          unsigned long output_len)
{
  ...
  ...
  ...
  choose_random_location((unsigned long)input_data, input_len,
                         (unsigned long *)&output,
                         max(output_len, kernel_total_size),
                         &virt_addr);
  ...
  ...
  ...
}
```

This parameter is passed from the assembler code:

```assembly
	leaq	input_data(%rip), %rdx
```

in arch/x86/boot/compressed/head_64.S. `input_data` is generated by the little `mkpiggy` program.
If you have compiled the Linux kernel source code under your hands, you may find the file generated by this program at linux/arch/x86/boot/compressed/piggy.S. In my case this file looks like:

```assembly
.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:
```

As you may see, it contains four global symbols. The first two, `z_input_len` and `z_output_len`, are the sizes of the compressed and uncompressed `vmlinux.bin.gz`. The third is our `input_data`, and as you may see it points to the Linux kernel image in raw binary format (all debugging symbols, comments and relocation information are stripped). And the last, `input_data_end`, points to the end of the compressed Linux image.

So, our first parameter of the `choose_random_location` function is the pointer to the compressed kernel image that is embedded into the `piggy.o` object file.

The second parameter of the `choose_random_location` function is the `z_input_len` that we have just seen.

The third and fourth parameters of the `choose_random_location` function are the address where to place the decompressed kernel image and the length of the decompressed kernel image, respectively. The address where to put the decompressed kernel came from arch/x86/boot/compressed/head_64.S and it is the address of `startup_32` aligned to a 2 megabyte boundary. The size of the decompressed kernel came from the same `piggy.S` and it is `z_output_len`.

The last parameter of the `choose_random_location` function is the virtual address of the kernel load address. As we may see, by default it coincides with the default physical load address:

```C
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
```

which depends on the kernel configuration:

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```

Now, as we have considered the parameters of the `choose_random_location` function, let's look at the implementation of it.
This function starts by checking the `nokaslr` option in the kernel command line:

```C
if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
}
```

If the option was given, we exit from the `choose_random_location` function and the kernel load address will not be randomized. The related command line options can be found in the kernel documentation:

```
kaslr/nokaslr [X86]

Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
```

Let's assume that we didn't pass `nokaslr` to the kernel command line and the `CONFIG_RANDOMIZE_BASE` kernel configuration option is enabled. In this case we add the `kASLR` flag to the kernel load flags:

```C
boot_params->hdr.loadflags |= KASLR_FLAG;
```

and the next step is the call of the:

```C
initialize_identity_maps();
```

function, which is defined in the arch/x86/boot/compressed/kaslr_64.c source code file. This function starts with the initialization of `mapping_info`, an instance of the `x86_mapping_info` structure:

```C
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;
```

The `x86_mapping_info` structure is defined in the arch/x86/include/asm/init.h header file and looks like:

```C
struct x86_mapping_info {
	void *(*alloc_pgt_page)(void *);
	void *context;
	unsigned long page_flag;
	unsigned long offset;
	bool direct_gbpages;
	unsigned long kernpg_flag;
};
```

This structure provides information about memory mappings. As you may remember from the previous part, we already set up the initial page tables from 0 up to `4G`. For now we may need to access memory above `4G` to load the kernel at a random position. So, the `initialize_identity_maps` function executes the initialization of a memory region for possibly needed new page tables.
First of all, let's try to look at the definition of the `x86_mapping_info` structure.

`alloc_pgt_page` is a callback function that will be called to allocate space for a page table entry. The `context` field is, in our case, an instance of the `alloc_pgt_data` structure which will be used to track allocated page tables. The `page_flag` and `kernpg_flag` fields are page flags. The first represents flags for `PMD` or `PUD` entries. The second, `kernpg_flag`, represents flags for kernel pages which can be overridden later. The `direct_gbpages` field represents support for huge pages, and the last field, `offset`, represents the offset between the kernel virtual addresses and physical addresses up to the `PMD` level.

The `alloc_pgt_page` callback just validates that there is space for a new page and allocates the new page:

```C
entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;
```

in the buffer from the:

```C
struct alloc_pgt_data {
	unsigned char *pgt_buf;
	unsigned long pgt_buf_size;
	unsigned long pgt_buf_offset;
};
```

structure, and returns the address of the new page. The last goal of the `initialize_identity_maps` function is to initialize `pgt_buf_size` and `pgt_buf_offset`. As we are only in the initialization phase, the `initialize_identity_maps` function sets `pgt_buf_offset` to zero:

```C
pgt_data.pgt_buf_offset = 0;
```

and `pgt_data.pgt_buf_size` will be set to `77824` or `69632`, depending on which boot protocol was used by the bootloader (64-bit or 32-bit). The same is true for `pgt_data.pgt_buf`. If the bootloader loaded the kernel at `startup_32`, `pgt_data.pgt_buf` will point to the end of the page table which was already initialized in arch/x86/boot/compressed/head_64.S:

```C
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
```

where `_pgtable` points to the beginning of this page table.
Otherwise, if the bootloader used the 64-bit boot protocol and loaded the kernel at `startup_64`, the early page tables should be built by the bootloader itself, and `_pgtable` will just be overwritten:

```C
pgt_data.pgt_buf = _pgtable;
```

As the buffer for new page tables is initialized, we may return back to the `choose_random_location` function.

Avoid reserved memory ranges

After the stuff related to identity page tables is initialized, we may start to choose a random location where to put the decompressed kernel image. But as you may guess, we can't choose just any address. There are some reserved addresses in memory ranges. Such addresses are occupied by important things, like the initrd, the kernel command line and so on. The

```C
mem_avoid_init(input, input_size, *output);
```

function will help us to do this. All non-safe memory regions will be collected in the:

```C
struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

static struct mem_vector mem_avoid[MEM_AVOID_MAX];
```

array, where `MEM_AVOID_MAX` is from the `mem_avoid_index` enum which represents different types of reserved memory regions:

```C
enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};
```

Both are defined in the arch/x86/boot/compressed/kaslr.c source code file.

Let's look at the implementation of the `mem_avoid_init` function. The main goal of this function is to store information about the reserved memory regions described by the `mem_avoid_index` enum in the `mem_avoid` array, and to create new pages for such regions in our new identity mapped buffer.
Numerous parts of the `mem_avoid_init` function are similar, so let's take a look at just one of them:

```C
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
		 mem_avoid[MEM_AVOID_ZO_RANGE].size);
```

At the beginning, the `mem_avoid_init` function tries to avoid the memory region that is used for the current kernel decompression. We fill an entry from the `mem_avoid` array with the start and size of such a region and call the `add_identity_map` function, which should build identity mapped pages for this region. The `add_identity_map` function is defined in the arch/x86/boot/compressed/kaslr_64.c source code file and looks like:

```C
void add_identity_map(unsigned long start, unsigned long size)
{
	unsigned long end = start + size;

	start = round_down(start, PMD_SIZE);
	end = round_up(end, PMD_SIZE);
	if (start >= end)
		return;

	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
				  start, end);
}
```

As you may see, it aligns the memory region to a 2 megabytes boundary and checks the given start and end addresses. In the end it just calls the `kernel_ident_mapping_init` function from the arch/x86/mm/ident_map.c source code file and passes the `mapping_info` instance that was initialized above, the address of the top level page table and the addresses of the memory region for which a new identity mapping should be built.

The `kernel_ident_mapping_init` function sets default flags for new pages if they were not given:

```C
if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;
```

and starts to build new 2-megabyte page entries (because of the `PSE` bit in `mapping_info.page_flag`) related to the given addresses: `PGD -> P4D -> PUD -> PMD` in the case of five-level page tables, or `PGD -> PUD -> PMD` in the case of four-level page tables:

```C
for (; addr < end; addr = next) {
	p4d_t *p4d;

	next = (addr & PGDIR_MASK) + PGDIR_SIZE;
	if (next > end)
		next = end;

	p4d = (p4d_t *)info->alloc_pgt_page(info->context);
	result = ident_p4d_init(info, p4d, addr, next);
	...
	...
	...
}
```

First of all, here we find the next entry of the `Page Global Directory` for the given address and if it is
greater than the `end` of the given memory region, we set it to `end`. After this we allocate a new page with our `x86_mapping_info` callback that we already considered above and call the `ident_p4d_init` function. The `ident_p4d_init` function will do the same, but for the lower level page directories (`p4d` -> `pud` -> `pmd`).

That's all. New page entries related to reserved addresses are in our page tables. This is not the end of the `mem_avoid_init` function, but the other parts are similar. They just build pages for the initrd, the kernel command line and so on.

Now we may return back to the `choose_random_location` function.

Physical address randomization

After the reserved memory regions have been stored in the `mem_avoid` array and identity mapping pages have been built for them, we select the minimal available address from which to choose a random memory region to decompress the kernel:

```C
min_addr = min(*output, 512UL << 20);
```

As you may see, it should be smaller than `512` megabytes. This `512` megabyte value was selected just to avoid unknown things in lower memory.

The next step is to select random physical and virtual addresses to load the kernel. The first is the physical address:

```C
random_addr = find_random_phys_addr(min_addr, output_size);
```

The `find_random_phys_addr` function is defined in the same source code file:

```C
static unsigned long find_random_phys_addr(unsigned long minimum,
                                           unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
}
```

The main goal of the `process_efi_entries` function is to find all suitable memory ranges in fully accessible memory to load the kernel. If the kernel was compiled and run on a system without EFI support, we continue to search for such memory regions in the e820 regions.
All found memory regions will be stored in the

```C
struct slot_area {
	unsigned long addr;
	int num;
};

#define MAX_SLOT_AREA 100

static struct slot_area slot_areas[MAX_SLOT_AREA];
```

array. The kernel decompressor will select a random index into this array, and that will be the random place where the kernel will be decompressed. This selection is executed by the `slots_fetch_random` function. The main goal of the `slots_fetch_random` function is to select a random memory range from the `slot_areas` array via the `kaslr_get_random_long` function:

```C
slot = kaslr_get_random_long("Physical") % slot_max;
```

The `kaslr_get_random_long` function is defined in the arch/x86/lib/kaslr.c source code file and it just returns a random number. Note that the random number will be obtained in different ways depending on kernel configuration and system capabilities (a random number based on the time stamp counter, rdrand and so on).

That's all: from this point a random memory range is selected.

Virtual address randomization

After a random memory region has been selected by the kernel decompressor, new identity mapped pages will be built for this region on demand:

```C
random_addr = find_random_phys_addr(min_addr, output_size);

if (*output != random_addr) {
	add_identity_map(random_addr, output_size);
	*output = random_addr;
}
```

From this time `output` will store the base address of the memory region where the kernel will be decompressed. But at this moment, as you may remember, we have only randomized the physical address. The virtual address should be randomized too in the case of the x86_64 architecture:

```C
if (IS_ENABLED(CONFIG_X86_64))
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);

*virt_addr = random_addr;
```

As you may see, in the case of a non-`x86_64` architecture, the randomized virtual address will coincide with the randomized physical address.
The `find_random_virt_addr` function calculates the number of virtual memory ranges that may hold the kernel image and calls the `kaslr_get_random_long` function that we already saw in the previous case, when we tried to find a random physical address.

From this moment we have both a randomized base physical (`*output`) and virtual (`*virt_addr`) address for the decompressed kernel.

That's all.

Conclusion

This is the end of the sixth and last part about the Linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.

If you have any questions or suggestions write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* Address space layout randomization
* Linux kernel boot protocol
* long mode
* initrd
* Enumerated type
* four-level page tables
* five-level page tables
* EFI
* e820
* time stamp counter
* rdrand
* x86_64
* Previous part

Initialization

Kernel initialization process

You will find here a couple of posts which describe the full cycle of kernel initialization, from its first step after the kernel has been decompressed to the start of the first process run by the kernel itself.

Note that there will not be a description of all the kernel initialization steps. Here will be only the generic kernel part, without interrupt handling, ACPI, and many other parts.
All the parts which I have missed will be described in other chapters.

* First steps after kernel decompression - describes the first steps in the kernel.
* Early interrupt and exception handling - describes early interrupts initialization and the early page fault handler.
* Last preparations before the kernel entry point - describes the last preparations before the call of `start_kernel`.
* Kernel entry point - describes the first steps in the kernel generic code.
* Continue architecture-specific boot-time initializations - describes architecture-specific initialization.
* Architecture-specific initializations, again... - describes the continuation of the architecture-specific initialization process.
* The End of the architecture-specific initializations, almost... - describes the end of the `setup_arch` related stuff.
* Scheduler initialization - describes preparation before scheduler initialization and the initialization itself.
* RCU initialization - describes the initialization of the RCU.
* End of the initialization - the last part about Linux kernel initialization.

Kernel initialization. Part 1.

First steps in the kernel code

The previous post was the last part of the Linux kernel booting process chapter, and now we are starting to dive into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code are executed. From now on we are in the kernel, and all parts of this chapter will be devoted to the initialization process of the kernel before it launches the process with pid `1` - the `init` process. There are many things to do before the kernel starts the first `init` process. Hopefully we will see all of the preparations before the kernel starts in this big chapter. We will start from the kernel entry point, which is located in arch/x86/kernel/head_64.S, and will move further and further.
We will see the first preparations like early page table initialization, switching to a new descriptor in kernel space and many many more, before we see the `start_kernel` function from init/main.c being called.

In the last part of the previous chapter we stopped at the `jmp` instruction from the arch/x86/boot/compressed/head_64.S assembly source code file:

```assembly
jmp	*%rax
```

At this moment the `rax` register contains the address of the Linux kernel entry point that was obtained as a result of the call of the `decompress_kernel` function from the arch/x86/boot/compressed/misc.c source code file. So, our last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after the start.

First steps in the kernel

Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into the `rax` register and just jumped there. As we already know, the entry point of the decompressed kernel image starts in the arch/x86/kernel/head_64.S assembly source code file, and at the beginning of it we can see the following definitions:

```assembly
	.text
	__HEAD
	.code64
	.globl startup_64
startup_64:
	...
	...
	...
```

We can see the definition of the `startup_64` routine that is defined in the `__HEAD` section, which is just a macro that expands to the definition of the executable `.head.text` section:

```C
#define __HEAD		.section	".head.text","ax"
```

We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:

```
.text : AT(ADDR(.text) - LOAD_OFFSET) {
	_text = .;
	...
	...
	...
} :text = 0x9090
```

Besides the definition of the `.text` section, we can understand the default virtual and physical addresses from the linker script. Note that the address of `_text` is the location counter which is defined as:

```
. = __START_KERNEL;
```

for x86_64.
The definition of the `__START_KERNEL` macro is located in the arch/x86/include/asm/page_types.h header file and is represented by the sum of the base virtual address of the kernel mapping and the physical start:

```C
#define __START_KERNEL		(__START_KERNEL_map + __PHYSICAL_START)

#define __PHYSICAL_START	ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
```

Or in other words:

* Base physical address of the Linux kernel - `0x1000000`;
* Base virtual address of the Linux kernel - `0xffffffff81000000`.

Now we know the default physical and virtual addresses of the `startup_64` routine, but to know the actual addresses we must calculate them with the following code:

```assembly
	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp
```

Yes, it is defined as `0x1000000`, but it may be different, for example if kASLR is enabled. So our current goal is to calculate the delta between `0x1000000` and where we are actually loaded. Here we just put the `rip-relative` address of `_text` into the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that the compiled virtual address of `_text` is `0xffffffff81000000` and its physical address is `0x1000000`. The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so in the second line of the assembly code we get the following expression:

```
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
```

So, after the calculation, `rbp` will contain `0`, which represents the difference between the address where we were actually loaded and where the code was compiled. In our case `zero` means that the Linux kernel was loaded at the default address and kASLR was disabled.

After we got the address of `startup_64`, we need to check that this address is correctly aligned. We will do it with the following code:

```assembly
	testl	$~PMD_PAGE_MASK, %ebp
	jnz	bad_address
```

Here we just compare the low part of the `rbp` register with the complemented value of `PMD_PAGE_MASK`.
`PMD_PAGE_MASK` indicates the mask for a `Page middle directory` (read about it in the Paging part) and is defined as:

```C
#define PMD_PAGE_MASK		(~(PMD_PAGE_SIZE-1))
```

where the `PMD_PAGE_SIZE` macro is defined as:

```C
#define PMD_PAGE_SIZE		(_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT		21
```

As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. If the address is not correctly aligned, we jump to the `bad_address` label.

After this we correct the physical addresses stored in the initial page table entries by adding the delta from `rbp` to each entry which refers to another page table, so that they point to the right places regardless of where the kernel was actually loaded. The relations between these early page tables are:

```
early_level4_pgt[511]  -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[507] -> level1_fixmap_pgt
```

Note that we didn't fix up the base address of the `early_level4_pgt` and some of the other page table directories, because we will see this during the building/filling of the structures for these page tables. As we have corrected the base addresses of the page tables, we can start to build them.

Identity mapping setup

Now we can see the setup of identity mapping of the early page tables. In `Identity Mapped Paging`, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. First of all we get the `rip-relative` addresses of `_text` and of the `early_level4_pgt` and put them into the `rdi` and `rbx` registers:

```assembly
	leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx
```

After this we store the address of `_text` in `rax` and get the index of the page global directory entry which stores the `_text` address by shifting the `_text` address right by `PGDIR_SHIFT`:

```assembly
	movq	%rdi, %rax
	shrq	$PGDIR_SHIFT, %rax
```

where `PGDIR_SHIFT` is `39`. `PGDIR_SHIFT` indicates the position of the page global directory bits in a virtual address. There are macros for all types of page directories:

```C
#define PGDIR_SHIFT	39
#define PUD_SHIFT	30
#define PMD_SHIFT	21
```

After this we put the address of the first entry of the `early_dynamic_pgts` page table, combined with the `_KERNPG_TABLE` access rights (see above), into the `rdx` register and fill two entries of the `early_level4_pgt`:

```assembly
	leaq	(4096 + _KERNPG_TABLE)(%rbx), %rdx
	movq	%rdx, 0(%rbx,%rax,8)
	movq	%rdx, 8(%rbx,%rax,8)
```

The `rdx` register contains the address of the first entry of the `early_dynamic_pgts`, and `%rax * 8` here is the index of the page global directory entry occupied by the `_text` address.
So here we fill two entries of the `early_level4_pgt` with the address of two entries of the `early_dynamic_pgts` which are related to `_text`. The `early_dynamic_pgts` is an array of arrays:

```C
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
```

which stores temporary page tables for the early kernel; we will not move them to the `init_level4_pgt`.

After this we add `4096` (the size of the `early_level4_pgt`) to `rdx` (so it now contains the address of the first entry of the `early_dynamic_pgts`) and put `rdi` (it now contains the physical address of `_text`) into `rax`. Now we shift the address of `_text` right by `PUD_SHIFT` to get the index of the entry of the page upper directory which contains this address, and clear the high bits to get only the `pud` related part:

```assembly
	addq	$4096, %rdx
	movq	%rdi, %rax
	shrq	$PUD_SHIFT, %rax
	andl	$(PTRS_PER_PUD-1), %eax
```

As we have the index of the page upper directory, we write two addresses of the second entry of the `early_dynamic_pgts` array to this temporary page directory:

```assembly
	movq	%rdx, 4096(%rbx,%rax,8)
	incl	%eax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)
```

In the next step we do the same operation for the last page table directory, but filling not two entries but all entries, to cover the full size of the kernel.

After our early page table directories are filled, we put the physical address of the `early_level4_pgt` into the `rax` register and jump to the label `1`:

```assembly
	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f
```

That's all for now.
Our early paging is prepared, and we just need to finish the last preparations before we jump into C code and the kernel entry point.

Last preparation before jump at the kernel entry point

After we jump to the label `1` we enable `PAE` and `PGE` (Paging Global Extension), put the content of `phys_base` (see above) into the `rax` register and fill the `cr3` register with it:

```assembly
1:
	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
	movq	%rcx, %cr4

	addq	phys_base(%rip), %rax
	movq	%rax, %cr3
```

In the next step we check whether the CPU supports the NX bit:

```assembly
	movl	$0x80000001, %eax
	cpuid
	movl	%edx,%edi
```

We put the `0x80000001` value into `eax` and execute the `cpuid` instruction to get the extended processor info and feature bits. The result will be in the `edx` register, which we put into `edi`.

Now we put `0xc0000080` or `MSR_EFER` into `ecx` and call the `rdmsr` instruction to read this model specific register:

```assembly
	movl	$MSR_EFER, %ecx
	rdmsr
```

The result will be in `edx:eax`. The general layout of `EFER` is the following:

```
63                                                                              32
 --------------------------------------------------------------------------------
|                                                                               |
|                                  Reserved MBZ                                 |
|                                                                               |
 --------------------------------------------------------------------------------
31                            16  15   14    13    12   11  10   9   8   7-1  0
 --------------------------------------------------------------------------------
|                              |   |     |       |    |   |   |   |   |     |   |
|        Reserved MBZ          |TCE|FFXSR| LMSLE |SVME|NXE|LMA|MBZ|LME| RAZ |SCE|
|                              |   |     |       |    |   |   |   |   |     |   |
 --------------------------------------------------------------------------------
```

We will not see all the fields in detail here, but we will learn about this and other `MSRs` in a special part about them. As we read `EFER` into `edx:eax`, we check the `_EFER_SCE` or zero bit, which is the `SCE` (System Call Extensions) bit, with the `btsl` instruction and set it to one. By setting the `SCE` bit we enable the `SYSCALL` and `SYSRET` instructions. In the next step we check the 20th bit in `edi`; remember that this register stores the result of `cpuid` (see above).
If the `20` bit is set, the `NX` bit is supported, and we write the flags to the model specific register:

```assembly
	btsl	$_EFER_SCE, %eax	/* Enable System Call */
	btl	$20,%edi		/* No Execute supported? */
	jnc     1f
	btsl	$_EFER_NX, %eax
	btsq	$_PAGE_BIT_NX, early_pmd_flags(%rip)
1:	wrmsr
```

If the `NX` bit is supported, we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the `NX` bit is set, we set some bits in the `cr0` control register, namely:

* `X86_CR0_PE` - system is in protected mode;
* `X86_CR0_MP` - controls the interaction of the WAIT/FWAIT instructions with the TS flag in CR0;
* `X86_CR0_ET` - on the 386, it allowed specifying whether the external math coprocessor was an 80287 or 80387;
* `X86_CR0_NE` - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;
* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when privilege level is 0;
* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
* `X86_CR0_PG` - enable paging.

by executing the following assembly code:

```assembly
#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
			 X86_CR0_PG)

	movl	$CR0_STATE, %eax
	movq	%rax, %cr0
```

We already know that to run any code, and even more so C code from assembly, we need to set up a stack. As always, we do it by setting the stack pointer to a correct place in memory and resetting the flags register after this:

```assembly
	movq initial_stack(%rip), %rsp
	pushq $0
	popfq
```

The most interesting thing here is `initial_stack`. This symbol is defined in the same source code file and looks like:

```assembly
GLOBAL(initial_stack)
	.quad	init_thread_union+THREAD_SIZE-8
```

The `GLOBAL` macro is already familiar to us.
It is defined in the arch/x86/include/asm/linkage.h header file and expands to a `global` symbol definition:

```C
#define GLOBAL(name)	\
	.globl name;	\
	name:
```

The `THREAD_SIZE` macro is defined in the arch/x86/include/asm/page_64_types.h header file and depends on the value of `KASAN_STACK_ORDER`:

```C
#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)
```

We consider the case when kasan is disabled and `PAGE_SIZE` is `4096` bytes. So `THREAD_SIZE` will expand to `16` kilobytes and represents the size of the stack of a thread. Why a `thread`? You may already know that each process may have parent processes and child processes. Actually, a parent process and a child process differ in stack: a new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by a union with the `thread_info` structure.

Earlier this union looked like:

```C
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
```

but since the Linux kernel `4.9-rc1` release, the `thread_info` structure was moved into the `task_struct` structure which represents a thread. So, for now, `thread_union` looks like:

```C
union thread_union {
#ifndef CONFIG_THREAD_INFO_IN_TASK
	struct thread_info thread_info;
#endif
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
```

where the `CONFIG_THREAD_INFO_IN_TASK` kernel configuration option is enabled for the `x86_64` architecture. So, as we consider only the `x86_64` architecture in this book, an instance of `thread_union` will contain only the stack, and the `thread_info` structure will be placed in the `task_struct`. The `init_thread_union` looks like:

```C
union thread_union init_thread_union __init_task_data = {
#ifndef CONFIG_THREAD_INFO_IN_TASK
	INIT_THREAD_INFO(init_task)
#endif
};
```

which represents just the thread stack.
Now we may understand this expression:

```assembly
GLOBAL(initial_stack)
	.quad	init_thread_union+THREAD_SIZE-8
```

The `initial_stack` symbol points to the start of the `thread_union.stack` array + `THREAD_SIZE` (which is 16 kilobytes) - 8 bytes. We need to subtract these `8` bytes at the top of the stack to guard against illegal access to the next page of memory.

After the early boot stack is set, we update the Global Descriptor Table with the `lgdt` instruction:

```assembly
	lgdt	early_gdt_descr(%rip)
```

where `early_gdt_descr` is defined as:

```assembly
early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)
```

We need to reload the `Global Descriptor Table` because right now the kernel works in the low userspace addresses, but soon the kernel will work in its own space. Now let's look at the definition of `early_gdt_descr`. The Global Descriptor Table contains `32` entries:

```C
#define GDT_ENTRIES 32
```

for kernel code, data, thread local storage segments and so on; it's simple. Now let's look at the definition of `early_gdt_descr_base`. First of all, `gdt_page` is defined as:

```C
struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));
```

in the arch/x86/include/asm/desc.h. It contains one field `gdt` which is an array of `desc_struct` structures, defined as:

```C
struct desc_struct {
	union {
		struct {
			unsigned int a;
			unsigned int b;
		};
		struct {
			u16 limit0;
			u16 base0;
			unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
			unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
		};
	};
} __attribute__((packed));
```

and presents the familiar `GDT` descriptor. Also we can note that the `gdt_page` structure is aligned to `PAGE_SIZE`, which is `4096` bytes. It means that `gdt` will occupy one page.
Now let's try to understand what `INIT_PER_CPU_VAR` is. `INIT_PER_CPU_VAR` is a macro which is defined in the arch/x86/include/asm/percpu.h and just concatenates `init_per_cpu__` with the given parameter:

```C
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
```

After the `INIT_PER_CPU_VAR` macro is expanded, we will have `init_per_cpu__gdt_page`. We can see it in the linker script:

```
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
```

As we get `init_per_cpu__gdt_page` from `INIT_PER_CPU_VAR` and the `INIT_PER_CPU` macro from the linker script is expanded, we get the offset from `__per_cpu_load`. After these calculations, we will have the correct base address of the new GDT.

Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name: when we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of the variable, and so on. So every core on a multiprocessor will have its own `GDT` table, and every entry in the table will represent a memory segment which can be accessed from the thread which ran on that core. You can read about `per-CPU` variables in detail in the Theory/per-cpu post.

As we have loaded the new Global Descriptor Table, we reload the segments as we do every time:

```assembly
	xorl %eax,%eax
	movl %eax,%ds
	movl %eax,%ss
	movl %eax,%es
	movl %eax,%fs
	movl %eax,%gs
```

After all of these steps we set up the `gs` register so that it points to the `irqstack`, a special stack where interrupts will be handled:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
```

where `MSR_GS_BASE` is:

```C
#define MSR_GS_BASE		0xc0000101
```

We need to put `MSR_GS_BASE` into the `ecx` register and load the data from `eax` and `edx` (which point to `initial_gs`) with the `wrmsr` instruction.
We don't use the `cs`, `ds`, `es` and `ss` segment registers for addressing in 64-bit mode, but the `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw in real mode for `cs`), and this part contains a descriptor which is mapped to Model Specific Registers. So we can see above that `0xc0000101` is the `gs.base` MSR address. When a system call or interrupt occurs, there is no kernel stack at the entry point, so the value of `MSR_GS_BASE` will store the address of the interrupt stack.

In the next step we put the address of the real mode bootparam structure into the `rsi` register (remember, `rsi` has held the pointer to this structure since the start) and jump to the C code:

```assembly
	movq	initial_code(%rip), %rax
	pushq	$__KERNEL_CS	# set correct cs
	pushq	%rax		# target address in negative space
	lretq
```

Here we put the address of `initial_code` into `rax` and push the fake address `__KERNEL_CS` and the address of `initial_code` onto the stack. After this we can see the `lretq` instruction, which means that the return address will be extracted from the stack (now it is the address of `initial_code`) and jumped to. `initial_code` is defined in the same source code file and looks like:

```assembly
	.balign	8
	GLOBAL(initial_code)
	.quad	x86_64_start_kernel
	...
	...
	...
```

As we can see, `initial_code` contains the address of `x86_64_start_kernel`, which is defined in arch/x86/kernel/head64.c and looks like this:

```C
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
{
	...
	...
	...
}
```

It has one argument, `real_mode_data` (remember that we passed the address of the real mode data in the `rdi` register previously).

This is the first C code in the kernel!

Next to start_kernel

We need to see the last preparations before we can see the "kernel entry point" - the `start_kernel` function from init/main.c.

First of all we can see some checks in the `x86_64_start_kernel` function:

```C
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
```

There are checks for different things, like that the virtual addresses of the modules space are not lower than the base address of the kernel text, `__START_KERNEL_map`, that the kernel text with modules is not less than the image of the kernel, and so on. `BUILD_BUG_ON` is a macro which looks like:

```C
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
```

Let's try to understand how this trick works. Let's take for example the first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!(condition)` is the same as `condition != 0`. It means that if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in `!!(condition)`, or zero if not. After `2*!!(condition)` we will get `2` or `0`. In the end of the calculations we can get two different behaviors:

* We will have a compilation error, because we try to get the size of a char array with a negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map`);
* No compilation errors.

That's all. It's an interesting C trick for getting a compile error which depends on some constants.

In the next step we can see a call of the `cr4_init_shadow` function, which stores a shadow copy of `cr4` per CPU. Context switches can change bits in `cr4`, so we need to store a copy of `cr4` for each CPU. And after this we can see a call of the `reset_early_page_tables` function, where we reset all of the page global directory entries and write a new pointer to the PGT in `cr3`:

```C
	for (i = 0; i < PTRS_PER_PGD-1; i++)
		early_level4_pgt[i].pgd = 0;

	next_early_pgt = 0;

	write_cr3(__pa_nodebug(early_level4_pgt));
```

Soon we will build new page tables. Here we can see that we go through all the Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make them zero.
After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write the physical address of `early_level4_pgt` to `cr3`. `__pa_nodebug` is a macro which expands to:

```C
((unsigned long)(x) - __START_KERNEL_map + phys_base)
```

After this we clear the early `bss` section from `__bss_start` to `__bss_stop`, and the next step will be the setup of the early `IDT` handlers, but that's a big concept, so we will see it in the next part.

Conclusion

This is the end of the first part about Linux kernel initialization.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and a lot more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.

Links

* Model Specific Register
* Paging
* Previous part - kernel load address randomization
* NX
* ASLR

Kernel initialization. Part 2.

Early interrupt and exception handling

In the previous part we stopped before setting up the early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic paging structure for early boot, and our current goal is to finish early preparation before the main kernel code starts to work.

We already started to do this preparation in the previous first part of this chapter. We continue in this part and will learn more about interrupt and exception handling.

Remember that we stopped before the following loop:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);
```

from the arch/x86/kernel/head64.c source code file. But before we start to sort out this code, we need to know about interrupts and handlers.

Some theory

An interrupt is an event caused by software or hardware to the CPU. For example, a user pressed a key on the keyboard.
On an interrupt, the CPU stops the current task and transfers control to a special routine called an interrupt handler. An interrupt handler handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:

* Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
* Hardware interrupts - when a hardware event happens, for example a button is pressed on a keyboard;
* Exceptions - interrupts generated by the CPU when the CPU detects an error, for example division by zero or accessing a memory page which is not in RAM.

Every interrupt and exception is assigned a unique number called the `vector number`. A vector number can be any number from `0` to `255`. It is common practice to use the first `32` vector numbers for exceptions, while vector numbers from `32` to `255` are used for user-defined interrupts. We can see this in the code above - `NUM_EXCEPTION_VECTORS`, which is defined as:

```C
#define NUM_EXCEPTION_VECTORS 32
```

The CPU uses the vector number as an index into the `Interrupt Descriptor Table` (we will see a description of it soon). The CPU catches interrupts from the APIC or through its pins.
The following table shows the `0-31` exceptions:

| Vector | Mnemonic | Description          | Type  | Error Code | Source                                |
|--------|----------|----------------------|-------|------------|---------------------------------------|
| 0      | #DE      | Divide Error         | Fault | NO         | DIV and IDIV                          |
| 1      | #DB      | Reserved             | F/T   | NO         |                                       |
| 2      | ---      | NMI                  | INT   | NO         | external NMI                          |
| 3      | #BP      | Breakpoint           | Trap  | NO         | INT 3                                 |
| 4      | #OF      | Overflow             | Trap  | NO         | INTO instruction                      |
| 5      | #BR      | Bound Range Exceeded | Fault | NO         | BOUND instruction                     |
| 6      | #UD      | Invalid Opcode       | Fault | NO         | UD2 instruction                       |
| 7      | #NM      | Device Not Available | Fault | NO         | Floating point or [F]WAIT             |
| 8      | #DF      | Double Fault         | Abort | YES        | An instruction which can generate NMI |
| 9      | ---      | Reserved             | Fault | NO         |                                       |
| 10     | #TS      | Invalid TSS          | Fault | YES        | Task switch or TSS access             |
| 11     | #NP      | Segment Not Present  | Fault | NO         | Accessing segment register            |
| 12     | #SS      | Stack-Segment Fault  | Fault | YES        | Stack operations                      |
| 13     | #GP      | General Protection   | Fault | YES        | Memory reference                      |
| 14     | #PF      | Page fault           | Fault | YES        | Memory reference                      |
| 15     | ---      | Reserved             |       | NO         |                                       |
| 16     | #MF      | x87 FPU fp error     | Fault | NO         | Floating point or [F]Wait             |
| 17     | #AC      | Alignment Check      | Fault | YES        | Data reference                        |
| 18     | #MC      | Machine Check        | Abort | NO         |                                       |
| 19     | #XM      | SIMD fp exception    | Fault | NO         | SSE[2,3] instructions                 |
| 20     | #VE      | Virtualization exc.  | Fault | NO         | EPT violations                        |
| 21-31  | ---      | Reserved             | INT   | NO         | External interrupts                   |

To react to an interrupt, the CPU uses a special structure - the Interrupt Descriptor Table or IDT. The IDT is an array of 8-byte descriptors, like the Global Descriptor Table, but IDT entries are called `gates`. The CPU multiplies the vector number by 8 to find the index of the IDT entry. But in 64-bit mode the IDT is an array of 16-byte descriptors, and the CPU multiplies the vector number by 16 to find the index of the entry in the IDT.
We remember from the previous part that the CPU uses the special `GDTR` register to locate the Global Descriptor Table. Similarly, the CPU uses the `IDTR` register for the Interrupt Descriptor Table and the `lidt` instruction to load the base address of the table into this register.

A 64-bit mode IDT entry has the following structure:

```
127                                                                             96
 --------------------------------------------------------------------------------
|                                                                                |
|                                   Reserved                                     |
|                                                                                |
 --------------------------------------------------------------------------------
95                                                                              64
 --------------------------------------------------------------------------------
|                                                                                |
|                                 Offset 63..32                                  |
|                                                                                |
 --------------------------------------------------------------------------------
63                               48 47      46  44   42        39        34    32
 --------------------------------------------------------------------------------
|                                  |       | D |   |           |         |      |
|          Offset 31..16           |   P   | P | 0 |   Type    | 0 0 0 0 | IST  |
|                                  |       | L |   |           |         |      |
 --------------------------------------------------------------------------------
31                                16 15                                         0
 --------------------------------------------------------------------------------
|                                  |                                             |
|         Segment Selector         |               Offset 15..0                  |
|                                  |                                             |
 --------------------------------------------------------------------------------
```

Where:

* `Offset` - the offset to the entry point of an interrupt handler;
* `DPL` - Descriptor Privilege Level;
* `P` - Segment Present flag;
* `Segment selector` - a code segment selector in the GDT or LDT;
* `IST` - provides the ability to switch to a new stack for interrupt handling.

The last field, `Type`, describes the type of the `IDT` entry. There are three different kinds of gates for interrupts:

* Task descriptor
* Interrupt descriptor
* Trap descriptor

Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these two types is how the CPU handles the `IF` flag. If an interrupt handler was accessed through an interrupt gate, the CPU clears the `IF` flag to prevent other interrupts while the current interrupt handler executes. After the current interrupt handler finishes, the CPU restores the `IF` flag with the `iret` instruction. The other bits in the interrupt gate are reserved and must be 0.
Now let's look at how the CPU handles interrupts:

* the CPU saves the flags register, `CS`, and the instruction pointer on the stack;
* if the interrupt causes an error code (like `#PF` for example), the CPU saves the error code on the stack after the instruction pointer;
* after the interrupt handler has executed, the `iret` instruction is used to return from it.

Now let's get back to the code.

Fill and load IDT

We stopped at the following point:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);
```

Here the `set_intr_gate` macro is defined as:

```C
#define set_intr_gate(n, addr) \
do { \
	BUG_ON((unsigned)n > 0xFF); \
	_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, __KERNEL_CS); \
	_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr, \
			0, 0, __KERNEL_CS); \
} while (0)
```

First of all it checks with the `BUG_ON` macro that the passed interrupt number is not greater than `255`. We need this check because we can have only `256` interrupts. After this, it calls the `_set_gate` function which writes the address of an interrupt gate to the `IDT`:

```C
static inline void _set_gate(int gate, unsigned type, void *addr,
                             unsigned dpl, unsigned ist, unsigned seg)
{
	gate_desc s;

	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
	write_idt_entry(idt_table, gate, &s);
	write_trace_idt_entry(gate, &s);
}
```

At the start of the `_set_gate` function we can see the call of the `pack_gate` function which fills the `gate_desc` structure with the given values:

```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
	gate->offset_low	= PTR_LOW(func);
	gate->segment		= __KERNEL_CS;
	gate->ist		= ist;
	gate->p			= 1;
	gate->dpl		= dpl;
	gate->zero0		= 0;
	gate->zero1		= 0;
	gate->type		= type;
	gate->offset_middle	= PTR_MIDDLE(func);
	gate->offset_high	= PTR_HIGH(func);
}
```

As I mentioned above, we fill the gate descriptor in this function. We fill the three parts of the interrupt handler address with the address which we got in the main loop (the address of the interrupt handler entry point).
We use the three following macros to split the address into three parts:

```C
#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
```

With the first macro, `PTR_LOW`, we get the first `2` bytes of the address, with the second, `PTR_MIDDLE`, the second `2` bytes, and with the third, `PTR_HIGH`, the last `4` bytes of the address. Next we set up the segment selector for the interrupt handler - it will be our kernel code segment `__KERNEL_CS`. In the next step we fill the `Interrupt Stack Table` and the `Descriptor Privilege Level` (the highest privilege level) with zeros. And we set the `GATE_INTERRUPT` type at the end.

Now that we have a filled IDT entry, we can call the `native_write_idt_entry` function which just copies the filled entry to the `IDT`:

```C
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
	memcpy(&idt[entry], gate, sizeof(*gate));
}
```

After the main loop has finished, we will have a filled `idt_table` array of `gate_desc` structures and we can load the `Interrupt Descriptor Table` with the call of:

```C
load_idt((const struct desc_ptr *)&idt_descr);
```

where `idt_descr` is:

```C
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
```

and `load_idt` just executes the `lidt` instruction:

```C
asm volatile("lidt %0"::"m" (*dtr));
```

You may note that there are calls of `_trace_*` functions in `_set_gate` and elsewhere. These functions fill `IDT` gates in the same manner as `_set_gate`, but with one difference: they use the `trace_idt_table` instead of the `idt_table`, for tracepoints (we will cover this theme in another part).

Okay, now we have a filled and loaded `Interrupt Descriptor Table`, and we know how the CPU acts during an interrupt. So now it is time to deal with the interrupt handlers.

Early interrupt handlers

As you can read above, we filled the `IDT` with the addresses of `early_idt_handler_array`.
We can find it in the arch/x86/kernel/head_64.S assembly file:

```assembly
	.globl early_idt_handler_array
early_idt_handler_array:
	i = 0
	.rept NUM_EXCEPTION_VECTORS
	.if ((EXCEPTION_ERRCODE_MASK >> i) & 1) == 0
	pushq $0
	.endif
	pushq $i
	jmp early_idt_handler_common
	i = i + 1
	.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
	.endr
```

We can see here the generation of the interrupt handlers for the first `32` exceptions. We check whether an exception provides an error code: if it does, we do nothing; if it does not, we push zero onto the stack so that the stack stays uniform. After that we push the exception number onto the stack and jump to `early_idt_handler_common` which is the generic interrupt handler for now. As we may see above, every nine bytes of the `early_idt_handler_array` consist of an optional push of an error code, a push of the vector number and a jump instruction. We can see it in the output of the `objdump` util:

```
$ objdump -D vmlinux
...
...
...
ffffffff81fe5000 <early_idt_handler_array>:
ffffffff81fe5000:       6a 00                   pushq  $0x0
ffffffff81fe5002:       6a 00                   pushq  $0x0
ffffffff81fe5004:       e9 17 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
...
...
...
```

As I wrote above, the CPU pushes the flags register, `CS` and `RIP` on the stack. So before `early_idt_handler_common` is executed, the stack will contain the following data:

```
|--------------------|
| %rflags            |
| %cs                |
| %rip               |
| rsp --> error code |
|--------------------|
```

Now let's look at the `early_idt_handler_common` implementation. It is located in the same arch/x86/kernel/head_64.S assembly file, and first of all we can see a check for `NMI`. We don't need to handle it, so we just ignore it in `early_idt_handler_common`:

```assembly
	cmpl $2,(%rsp)
	je .Lis_nmi
```

where `.Lis_nmi`:

```assembly
.Lis_nmi:
	addq $16,%rsp
	INTERRUPT_RETURN
```

drops the error code and vector number from the stack and calls `INTERRUPT_RETURN`, which just expands to the `iretq` instruction.
As we have checked, the vector number is not an `NMI`. Next we check `early_recursion_flag` to prevent recursion in `early_idt_handler_common`, and if it is correct we save the general purpose registers on the stack:

```assembly
	pushq %rax
	pushq %rcx
	pushq %rdx
	pushq %rsi
	pushq %rdi
	pushq %r8
	pushq %r9
	pushq %r10
	pushq %r11
```

We need to do this to prevent wrong register values when we return from the interrupt handler. After this we check the segment selector on the stack:

```assembly
	cmpl $__KERNEL_CS,96(%rsp)
	jne 11f
```

It must be equal to the kernel code segment, and if it is not we jump to label `11`, which prints a `PANIC` message and makes a stack dump.

After the code segment is checked, we check the vector number, and if it is `14`, i.e. `#PF` or Page Fault, we put the value from the `cr2` register into the `rdi` register and call `early_make_pgtable` (we'll see it soon):

```assembly
	cmpl $14,72(%rsp)
	jnz 10f
	GET_CR2_INTO(%rdi)
	call early_make_pgtable
	andl %eax,%eax
	jz 20f
```

If the vector number is not `#PF`, we restore the general purpose registers from the stack:

```assembly
	popq %r11
	popq %r10
	popq %r9
	popq %r8
	popq %rdi
	popq %rsi
	popq %rdx
	popq %rcx
	popq %rax
```

and exit from the handler with `iret`.

That is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it only handles the Page Fault for now. We will see handlers for the other interrupts later, but for now let's look at the page fault handler.

Page fault handling

In the previous paragraph we saw the first early interrupt handler which checks the interrupt number for the page fault and calls `early_make_pgtable` to build new page tables if it is `#PF`. We need a `#PF` handler at this step because there are plans to add the ability to load the kernel above `4G` and to make the `boot_params` structure accessible above 4G.

You can find the implementation of `early_make_pgtable` in arch/x86/kernel/head64.c; it takes one parameter - the address from the `cr2` register which caused the Page Fault.
Let's look at it:

```C
int __init early_make_pgtable(unsigned long address)
{
	unsigned long physaddr = address - __PAGE_OFFSET;
	unsigned long i;
	pgdval_t pgd, *pgd_p;
	pudval_t pud, *pud_p;
	pmdval_t pmd, *pmd_p;
	...
	...
	...
}
```

It starts with the definition of some variables which have `*val_t` types. All of these types are just:

```C
typedef unsigned long   pgdval_t;
```

We will also operate with the `*_t` (not val) types, for example `pgd_t` and so on. All of these types are defined in arch/x86/include/asm/pgtable_types.h and represent structures like this:

```C
typedef struct { pgdval_t pgd; } pgd_t;
```

For example:

```C
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
```

Here `early_level4_pgt` presents the early top-level page table directory, which consists of an array of `pgd_t` types, and `pgd` points to the low-level page entries.

After we have made the check that the address is not invalid, we get the address of the Page Global Directory entry which contains the `#PF` address and put its value into the `pgd` variable:

```C
pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
```

In the next step we check `pgd`. If it contains a correct page global directory entry, we get the physical address of the page global directory entry and put it into `pud_p`:

```C
pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
```

where `PTE_PFN_MASK` is a macro:

```C
#define PTE_PFN_MASK            ((pteval_t)PHYSICAL_PAGE_MASK)
```

which expands to:

```C
(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
```

i.e. it masks out the page-offset bits and keeps only the physical frame bits. If `pgd` does not contain a valid entry, we check that `next_early_pgt` has not reached `EARLY_DYNAMIC_PAGE_TABLES` (the number of preallocated dynamic early page tables); if it has, we reset the early page tables and start again. Otherwise we take the next dynamic page table as a new page upper directory, zero it, and point the page global directory entry at it:

```C
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
	reset_early_page_tables();
	goto again;
}

pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
for (i = 0; i < PTRS_PER_PUD; i++)
	pud_p[i] = 0;
*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
```

After this we fix up the address of the page upper directory with:

```C
pud_p += pud_index(address);
pud = *pud_p;
```

In the next step we do the same actions as before, but with the page middle directory. In the end we fix the address of the page middle directory entry which maps the kernel text+data virtual addresses:

```C
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;
```
After the page fault handler has finished its work, `early_level4_pgt` contains entries which point to valid addresses.

Conclusion

This is the end of the second part about Linux kernel insides. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all the steps before the kernel entry point - the `start_kernel` function.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* GNU assembly .rept
* APIC
* NMI
* Page table
* Interrupt handler
* Page Fault
* Previous part

Kernel initialization. Part 3.

Last preparations before the kernel entry point

This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and we will continue to dive into the Linux kernel initialization process in the current part. Our next point is the 'kernel entry point' - the `start_kernel` function from the init/main.c source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.

boot_params again

In the previous part we stopped at setting the Interrupt Descriptor Table and loading it into the `IDTR` register. At the next step we can see a call of the `copy_bootdata` function:

```C
copy_bootdata(__va(real_mode_data));
```

This function takes one argument - the virtual address of `real_mode_data`. Remember that we passed the address of the `boot_params` structure from arch/x86/include/uapi/asm/bootparam.h to the `x86_64_start_kernel` function as the first argument in arch/x86/kernel/head_64.S:

```assembly
	/* rsi is pointer to real mode structure with interesting info.
	   pass it to C */
	movq	%rsi, %rdi
```

Now let's look at the `__va` macro.
This macro is defined in arch/x86/include/asm/page.h:

```C
#define __va(x)                 ((void *)((unsigned long)(x)+PAGE_OFFSET))
```

where `PAGE_OFFSET` is `__PAGE_OFFSET`, which is `0xffff880000000000` - the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mode_data` to `boot_params`, which is declared in arch/x86/kernel/setup.h:

```C
extern struct boot_params boot_params;
```

Let's look at the `copy_bootdata` implementation:

```C
static void __init copy_bootdata(char *real_mode_data)
{
	char * command_line;
	unsigned long cmd_line_ptr;

	memcpy(&boot_params, real_mode_data, sizeof boot_params);
	sanitize_boot_params(&boot_params);
	cmd_line_ptr = get_cmd_line_ptr();
	if (cmd_line_ptr) {
		command_line = __va(cmd_line_ptr);
		memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
	}
}
```

First of all, note that this function is declared with the `__init` prefix. It means that this function will be used only during initialization, and the memory it occupies will then be freed.

We can see the declaration of two variables for the kernel command line and the copying of `real_mode_data` to `boot_params` with the `memcpy` function. The next call, `sanitize_boot_params`, fills some fields of the `boot_params` structure, like `ext_ramdisk_image` and so on, for bootloaders which fail to initialize unknown fields to zero. After this we get the address of the command line with the call of the `get_cmd_line_ptr` function:

```C
unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
return cmd_line_ptr;
```

which gets the 64-bit address of the command line from the kernel boot header and returns it.
In the last step we check `cmd_line_ptr`, get its virtual address and copy it to `boot_command_line`, which is just an array of bytes:

```C
extern char __initdata boot_command_line[];
```

After this we have the kernel command line copied and the `boot_params` structure filled. In the next step we can see the call of the `load_ucode_bsp` function which loads processor microcode, but we will not cover it here.

After the microcode is loaded we can see the check of `console_loglevel` and the `early_printk` function which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent the patch - commit - and you will see it in the mainline soon. So you can skip this code.

Move on init pages

In the next step, as we have copied the `boot_params` structure, we need to move from the early page tables to the page tables for the initialization process. We already set early page tables for switchover (you can read about it in the previous part), dropped all of them in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only the kernel high mapping.
After this we call:

```C
clear_page(init_level4_pgt);
```

and pass `init_level4_pgt`, which is also defined in the arch/x86/kernel/head_64.S file and looks like:

```assembly
NEXT_PAGE(init_level4_pgt)
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_START_KERNEL*8, 0
	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
```

which maps the first 2 gigabytes and the 512 megabytes for the kernel code, data and bss. The `clear_page` function is defined in arch/x86/lib/clear_page_64.S; let's look at this function:

```assembly
ENTRY(clear_page)
	CFI_STARTPROC
	xorl %eax,%eax
	movl $4096/64,%ecx
	.p2align 4
.Lloop:
	decl	%ecx
#define PUT(x) movq %rax,x*8(%rdi)
	movq %rax,(%rdi)
	PUT(1)
	PUT(2)
	PUT(3)
	PUT(4)
	PUT(5)
	PUT(6)
	PUT(7)
	leaq 64(%rdi),%rdi
	jnz	.Lloop
	nop
	ret
	CFI_ENDPROC
.Lclear_page_end:
ENDPROC(clear_page)
```

As you can understand from the function name, it clears, or fills with zeros, page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` macros, which expand to GNU assembly directives:

```C
#define CFI_STARTPROC           .cfi_startproc
#define CFI_ENDPROC             .cfi_endproc
```

and are used for debugging. After the `CFI_STARTPROC` macro we zero out the `eax` register and put 64 into `ecx` (it will be a counter). Next we can see the loop which starts with the `.Lloop` label; it starts with the `ecx` decrement. After that we store the zero from the `rax` register at the address in `rdi`, which now contains the base address of `init_level4_pgt`, and do the same seven more times, each time moving the `rdi` offset by 8. After this the first 64 bytes of `init_level4_pgt` are filled with zeros. In the next step we put the address of `init_level4_pgt` with a 64-byte offset into `rdi` again and repeat all operations until `ecx` reaches zero.
In the end we will have `init_level4_pgt` filled with zeros.

As we now have `init_level4_pgt` filled with zeros, we set its last entry to the kernel high mapping with:

```C
init_level4_pgt[511] = early_level4_pgt[511];
```

Remember that we dropped all the `early_level4_pgt` entries in the `reset_early_page_table` function and kept only the kernel high mapping there.

The last step in the `x86_64_start_kernel` function is the call of:

```C
x86_64_start_reservations(real_mode_data);
```

with `real_mode_data` as argument. The `x86_64_start_reservations` function is defined in the same source code file as the `x86_64_start_kernel` function and looks like:

```C
void __init x86_64_start_reservations(char *real_mode_data)
{
	if (!boot_params.hdr.version)
		copy_bootdata(__va(real_mode_data));

	reserve_ebda_region();

	start_kernel();
}
```

You can see that it is the last function before we are in the kernel entry point - the `start_kernel` function. Let's look at what it does and how it works.

Last step before kernel entry point

First of all we can see in the `x86_64_start_reservations` function the check of `boot_params.hdr.version`:

```C
if (!boot_params.hdr.version)
	copy_bootdata(__va(real_mode_data));
```

and if it is zero we call the `copy_bootdata` function again with the virtual address of `real_mode_data` (we read about its implementation above).

In the next step we can see the call of the `reserve_ebda_region` function which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.

Let's look at the `reserve_ebda_region` function. It starts by checking whether paravirtualization is enabled or not:

```C
if (paravirt_enabled())
	return;
```

We exit from the `reserve_ebda_region` function if paravirtualization is enabled, because in that case there is no extended BIOS data area.
In the next step we need to get the end of the low memory:

```C
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
```

We get the virtual address of the BIOS low memory size in kilobytes and convert it to bytes by shifting it left by 10 (in other words, multiplying by 1024). After this we need to get the address of the extended BIOS data area with:

```C
ebda_addr = get_bios_ebda();
```

where the `get_bios_ebda` function is defined in arch/x86/include/asm/bios_ebda.h and looks like:

```C
static inline unsigned int get_bios_ebda(void)
{
	unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
	address <<= 4;
	return address;
}
```

Let's try to understand how it works. Here we convert the physical address `0x40E` to a virtual one, where `0x0040:0x000e` is the segment which contains the base address of the extended BIOS data area. Don't worry that we use the `phys_to_virt` function for converting a physical address to a virtual address instead of the `__va` macro used previously - they are nearly the same:

```C
static inline void *phys_to_virt(phys_addr_t address)
{
	return __va(address);
}
```

with only one difference: `phys_to_virt` takes an argument of type `phys_addr_t`, which depends on:
After thisebda_addrvariables contains the base address of theextended BIOS data area.In the next step we check that address of the extended BIOS data area and low memory isnot less thanINSANE_CUTOFFmacroif (ebda_addr < INSANE_CUTOFF)ebda_addr = LOWMEM_CAP;if (lowmem regions[0].size == 0) {WARN_ON(type->cnt != 1 || type->total_size);type->regions[0].base = base;type->regions[0].size = size;type->regions[0].flags = flags;memblock_set_region_node(&type->regions[0], nid);type->total_size = size;return 0;}After we filled our region we can see the call of thememblock_set_region_nodefunction withtwo parameters:address of the filled memory region;NUMA node id.where our regions represented by thememblock_regionstructure:struct memblock_region {phys_addr_t base;phys_addr_t size;unsigned long flags;#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAPint nid;#endif};NUMA node id depends onMAX_NUMNODESmacro which is defined in theinclude/linux/numa.h:#define MAX_NUMNODESwhereNODES_SHIFT(1 stack)141Kernel entry pointFrom the Linux kernelv4.9-rc1and stack pointer resides inkernel. This depends onenabled by default forrelease,task_structthread_infostructure which represents a thread in the LinuxCONFIG_THREAD_INFO_IN_TASKx86_64structure may contains only flagskernel configuration option which is. 
You can be sure of this if you look at the `THREAD_INFO_IN_TASK` entry in the init/Kconfig configuration file:

```
config THREAD_INFO_IN_TASK
	bool
	help
	  Select this to move thread_info off the stack into task_struct.  To
	  make this work, an arch will need to remove all thread_info fields
	  except flags and fix any runtime bugs.

	  One subtle change that will be needed is to use try_get_task_stack()
	  and put_task_stack() in save_thread_stack_tsk() and get_wchan().
```

and at arch/x86/Kconfig:

```
config X86
	def_bool y
	...
	...
	...
	select THREAD_INFO_IN_TASK
	...
	...
	...
```

So, in this way we may just get the end of a thread stack from the given `task_struct` structure:

```C
#ifdef CONFIG_THREAD_INFO_IN_TASK
static inline unsigned long *end_of_stack(const struct task_struct *task)
{
	return task->stack;
}
#endif
```

As we got the end of the init process stack, we write `STACK_END_MAGIC` there. After this `canary` is set, we can check it like this:

```C
if (*end_of_stack(task) != STACK_END_MAGIC) {
	//
	// handle stack overflow here
	//
}
```

The next function after `set_task_stack_end_magic` is `smp_setup_processor_id`. It has an empty body for `x86_64`:

```C
void __init __weak smp_setup_processor_id(void)
{
}
```

as it is not implemented for all architectures, only for some such as s390 and arm64.

The next function in `start_kernel` is `debug_objects_early_init`. The implementation of this function is almost the same as that of `lockdep_init`, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which exist for debugging purposes in this chapter.

After the `debug_objects_early_init` function we can see the call of the `boot_init_stack_canary` function which fills `task_struct->canary` with the canary value for the `-fstack-protector` gcc feature. This function depends on the `CONFIG_CC_STACKPROTECTOR` configuration option: if this option is disabled, `boot_init_stack_canary` does nothing, otherwise it generates random numbers based on the random pool and the TSC:
This function depends on theconfiguration option and if this option is disabled,CONFIG_CC_STACKPROTECTORboot_init_stack_canarydoes nothing,otherwise it generates random numbers based on random pool and the TSC:get_random_bytes(&canary, sizeof(canary));tsc = __native_read_tsc();canary += tsc + (tsc 8) & 255, val & 255);}157Continue architecture-specific boot-time initializationswheredev_tstrangeold_is a kernel data type to present major/minor number pair. But what's theprefix? For historical reasons, there are two ways of managing the major andminor numbers of a device. In the first way major and minor numbers occupied 2 bytes. Youcan see it in the previous code: 8 bit for major number and 8 bit for minor number. But thereis a problem: only 256 major numbers and 256 minor numbers are possible. So 16-bitinteger was replaced by 32-bit integer where 12 bits reserved for major number and 20 bitsfor minor. You can see this in theimplementation:new_decode_devstatic inline dev_t new_decode_dev(u32 dev){unsigned major = (dev & 0xfff00) >> 8;unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);return MKDEV(major, minor);}After calculation we will get20 bits forminor0xfffor 12 bits formajor. So in the end of execution of theminor numbers for the root device inROOT_DEVif it is0xffffffffold_decode_devand0xffffforwe will get major and.Memory map setupThe next point is the setup of the memory map with the call of thesetup_memory_mapfunction. But before this we setup different parameters as information about a screen(current row and column, video page and etc... 
(you can read about it in the Video modeinitialization and transition to protected mode)), Extended display identification data, videomode, bootloader_type and etc...:screen_info = boot_params.screen_info;edid_info = boot_params.edid_info;saved_video_mode = boot_params.hdr.vid_mode;bootloader_type = boot_params.hdr.type_of_loader;if ((bootloader_type >> 4) == 0xe) {bootloader_type &= 0xf;bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;}bootloader_version= bootloader_type & 0xf;bootloader_version |= boot_params.hdr.ext_loader_ver << 4;All of these parameters we got during boot time and stored in theboot_paramsstructure.After this we need to setup the end of the I/O memory. As you know one of the mainpurposes of the kernel is resource management. And one of the resource is memory. As wealready know there are two ways to communicate with devices are I/O ports and devicememory. All information about registered resources are available through:158Continue architecture-specific boot-time initializations/proc/ioports - provides a list of currently registered port regions used for input or outputcommunication with a device;/proc/iomem - provides current map of the system's memory for each physical device.At the moment we are interested in/proc/iomem:cat /proc/iomem00000000-00000fff : reserved00001000-0009d7ff : System RAM0009d800-0009ffff : reserved000a0000-000bffff : PCI Bus 0000:00000c0000-000cffff : Video ROM000d0000-000d3fff : PCI Bus 0000:00000d4000-000d7fff : PCI Bus 0000:00000d8000-000dbfff : PCI Bus 0000:00000dc000-000dffff : PCI Bus 0000:00000e0000-000fffff : reserved000e0000-000e3fff : PCI Bus 0000:00000e4000-000e7fff : PCI Bus 0000:00000f0000-000fffff : System ROMAs you can see range of addresses are shown in hexadecimal notation with its owner. Linuxkernel provides API for managing any resources in a general way. 
Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure is `resource`:

```C
struct resource {
	resource_size_t start;
	resource_size_t end;
	const char *name;
	unsigned long flags;
	struct resource *parent, *sibling, *child;
};
```

It presents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from `start` to `end` (`resource_size_t` is `phys_addr_t`, or `u64` for `x86_64`) which a resource covers, the `name` of a resource (you see these names in the /proc/iomem output) and the `flags` of a resource (all resource flags are defined in include/linux/ioport.h). The last are three pointers to the `resource` structure. These pointers enable a tree-like structure:

```
+-------------+      +-------------+
|             |      |             |
|    parent   |------|   sibling   |
|             |      |             |
+-------------+      +-------------+
       |
       |
+-------------+
|             |
|    child    |
|             |
+-------------+
```

Every subset of resources has a root range resource. For `iomem` it is `iomem_resource`, which is defined as:

```C
struct resource iomem_resource = {
	.name   = "PCI mem",
	.start  = 0,
	.end    = -1,
	.flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
```

`iomem_resource` defines the root address range for io memory with the `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as its flags. As I wrote above, our current point is to set up the end of `iomem`. We will do it with:

```C
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
```

Here we shift `1` by `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is a `cpuinfo_x86` structure which we filled during the execution of `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it presents the maximum number of bits in a physical address on this system. Note also that `iomem_resource` is passed to the `EXPORT_SYMBOL` macro.
This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.

After we set the end address of the root `iomem` resource address range, as I wrote above, the next step will be the setup of the memory map. It is produced with the call of the `setup_memory_map` function:

```C
void __init setup_memory_map(void)
{
	char *who;

	who = x86_init.resources.memory_setup();
	memcpy(&e820_saved, &e820, sizeof(struct e820map));
	printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
	e820_print_map(who);
}
```

First of all we see here the call of `x86_init.resources.memory_setup`. `x86_init` is an `x86_init_ops` structure which presents platform-specific setup functions for resource initialization, pci initialization and so on. The initialization of `x86_init` is in arch/x86/kernel/x86_init.c. I will not give the full description here because it is very long, but only the one part which interests us for now:

```C
struct x86_init_ops x86_init __initdata = {
	.resources = {
		.probe_roms		= probe_roms,
		.reserve_resources	= reserve_standard_io_resources,
		.memory_setup		= default_machine_specific_memory_setup,
	},
	...
	...
	...
}
```

As we can see, the `memory_setup` field here is `default_machine_specific_memory_setup`, where we get the number of the e820 entries which we collected at boot time, sanitize the BIOS e820 map and fill the `e820map` structure with the memory regions. Once all regions are collected, they are all printed with printk.
You can find this output if you execute the `dmesg` command; you will see something like this:

```
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
```

### Copying of the BIOS Enhanced Disk Device information

The next two steps are parsing of `setup_data` with the `parse_setup_data` function and copying the BIOS EDD to a safe place. `setup_data` is a field of the kernel boot header; as we can read from the `x86` boot protocol:

```
Field name:     setup_data
Type:           write (special)
Offset/size:    0x250/8
Protocol:       2.09+

  The 64-bit physical pointer to NULL terminated single linked list of
  struct setup_data. This is used to define a more extensible boot
  parameters passing mechanism.
```

It is used for storing setup information of different types, such as a device tree blob, EFI setup data and so on.
In the second step we copy BIOS EDD information from the `boot_params` structure that we collected in arch/x86/boot/edd.c to the `edd` structure:

```C
static inline void __init copy_edd(void)
{
        memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
               sizeof(edd.mbr_signature));
        memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
        edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
        edd.edd_info_nr = boot_params.eddbuf_entries;
}
```

### Memory descriptor initialization

The next step is initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is presented with a special data structure called the `memory descriptor`. In the Linux kernel source code the memory descriptor is presented with the `mm_struct` structure. `mm_struct` contains many different fields related to the process address space, such as the start/end addresses of the kernel code/data, start/end of the `brk`, number of memory areas, list of memory areas and so on. This structure is defined in include/linux/mm_types.h. As every process has its own memory descriptor, the `task_struct` structure contains it in the `mm` and `active_mm` fields, and our first `init` process has it too. You may remember that we saw part of the initialization of the init `task_struct` with the `INIT_TASK` macro in the previous part:

```C
#define INIT_TASK(tsk)              \
{                                   \
    ...                             \
    .mm         = NULL,             \
    .active_mm  = &init_mm,         \
    ...                             \
}
```

`mm` points to the process address space and `active_mm` points to the active address space if the process has no address space of its own, as is the case for kernel threads (more about this you can read in the documentation).
Now we fill the memory descriptor of the initial process:

```C
        init_mm.start_code = (unsigned long) _text;
        init_mm.end_code = (unsigned long) _etext;
        init_mm.end_data = (unsigned long) _edata;
        init_mm.brk = _brk_end;
```

with the kernel's text, data and brk. `init_mm` is the memory descriptor of the initial process, defined as:

```C
struct mm_struct init_mm = {
        .mm_rb          = RB_ROOT,
        .pgd            = swapper_pg_dir,
        .mm_users       = ATOMIC_INIT(2),
        .mm_count       = ATOMIC_INIT(1),
        .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
        .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
        .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
        INIT_MM_CONTEXT(init_mm)
};
```

where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is the number of address space users, `mm_count` is the primary usage counter and `mmap_sem` is the memory area semaphore. After we set up the memory descriptor of the initial process, the next step is initialization of the Intel Memory Protection Extensions with `mpx_mm_init`. The step after this is initialization of the code/data/bss resources with:

```C
        code_resource.start = __pa_symbol(_text);
        code_resource.end = __pa_symbol(_etext)-1;
        data_resource.start = __pa_symbol(_etext);
        data_resource.end = __pa_symbol(_edata)-1;
        bss_resource.start = __pa_symbol(__bss_start);
        bss_resource.end = __pa_symbol(__bss_stop)-1;
```

We already know a little about the `resource` structure (read above). Here we fill the code/data/bss resources with their physical addresses.
You can see them in the `/proc/iomem` output:

```
00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss
```

All of these structures are defined in arch/x86/kernel/setup.c and look like a typical resource initialization:

```C
static struct resource code_resource = {
        .name   = "Kernel code",
        .start  = 0,
        .end    = 0,
        .flags  = IORESOURCE_BUSY | IORESOURCE_MEM
};
```

The last step which we will cover in this part is the `NX` configuration. The `NX-bit` or no-execute bit is bit 63 in a page directory entry which controls the ability to execute code from all physical pages mapped by that entry. This bit can only be used when the no-execute page-protection mechanism is enabled by setting `EFER.NXE` to 1. In the `x86_configure_nx` function we check that the CPU has support of the `NX-bit` and that it was not disabled. Depending on the check, we fill `__supported_pte_mask`:

```C
void x86_configure_nx(void)
{
        if (cpu_has_nx && !disable_nx)
                __supported_pte_mask |= _PAGE_NX;
        else
                __supported_pte_mask &= ~_PAGE_NX;
}
```

### Conclusion

This is the end of the fifth part about the Linux kernel initialization process. In this part we continued to dive into the `setup_arch` function which performs initialization of architecture-specific stuff. It was a long part, but we have not finished with it. As I already wrote, `setup_arch` is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, `ioremap` and so on. Don't worry if they are unclear to you; there is a special part about these concepts - Linux kernel memory management Part 2.
In the next part we will continue with the initialization of the architecture-specific stuff and will see the parsing of the early kernel parameters, the early dump of the PCI devices, `Desktop Management Interface` scanning and many many more.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

### Links

* mm vs active_mm
* e820
* Supervisor mode access prevention
* Kernel stacks
* TSS
* IDT
* Memory mapped I/O
* CFI directives
* PDF. dwarf4 specification
* Call stack
* Previous part

# Kernel initialization. Part 6.

## Architecture-specific initialization, again...

In the previous part we saw architecture-specific (`x86_64` in our case) initialization stuff from arch/x86/kernel/setup.c and finished on the `x86_configure_nx` function which sets the `_PAGE_NX` flag depending on support of the NX bit. As I wrote before, `setup_arch` and `start_kernel` are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in init/main.c and, as you can understand from its name, parses the kernel command line and sets up different services depending on the given parameters (all kernel command line parameters can be found in Documentation/kernel-parameters.txt). You may remember how we set up `earlyprintk` in the earliest part. At that early stage we looked for kernel parameters and their values with the `cmdline_find_option` function and the `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from arch/x86/boot/cmdline.c. Now we're in the generic kernel part which does not depend on the architecture, and here we use another approach.
If you are reading Linux kernel source code, you have probably already noticed calls like this:

```C
early_param("gbpages", parse_direct_gbpages_on);
```

The `early_param` macro takes two parameters:

* the command line parameter name;
* the function which will be called if the given parameter is passed.

It is defined in include/linux/init.h as:

```C
#define early_param(str, fn) \
        __setup_param(str, fn, fn, 1)
```

As you can see, the `early_param` macro just makes a call of the `__setup_param` macro:

```C
#define __setup_param(str, unique_id, fn, early)                \
        static const char __setup_str_##unique_id[] __initconst \
                __aligned(1) = str;                             \
        static struct obs_kernel_param __setup_##unique_id      \
                __used __section(.init.setup)                   \
                __attribute__((aligned((sizeof(long)))))        \
                = { __setup_str_##unique_id, fn, early }
```

This macro defines a `__setup_str_*` variable (where `*` depends on the given function name) and assigns the given command line parameter name to it. In the next lines we can see the definition of the `__setup_*` variable whose type is `obs_kernel_param`, and its initialization. The `obs_kernel_param` structure is defined as:

```C
struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
};
```

and contains three fields:

* the name of the kernel parameter;
* the function which sets something up depending on the parameter;
* a field determining whether the parameter is early (1) or not (0).

Note that the `__setup_param` macro defines its variables with the `__section(.init.setup)` attribute. It means that all `__setup_str_*` will be placed in the `.init.setup` section; moreover, as we can see in include/asm-generic/vmlinux.lds.h, they will be placed between `__setup_start` and `__setup_end`:
```C
#define INIT_SETUP(initsetup_align)                \
                . = ALIGN(initsetup_align);        \
                VMLINUX_SYMBOL(__setup_start) = .; \
                *(.init.setup)                     \
                VMLINUX_SYMBOL(__setup_end) = .;
```

Now that we know how parameters are defined, let's get back to the `parse_early_param` implementation:

```C
void __init parse_early_param(void)
{
        static int done __initdata;
        static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

        if (done)
                return;

        /* All fall through to do_early_param. */
        strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
        parse_early_options(tmp_cmdline);
        done = 1;
}
```

The `parse_early_param` function defines two static variables. The first, `done`, checks whether `parse_early_param` was already called, and the second is temporary storage for the kernel command line. After this we copy `boot_command_line` to the temporary command line which we just defined and call the `parse_early_options` function from the same source code file. `parse_early_options` calls the `parse_args` function from kernel/params.c, which parses the given command line and calls the `do_early_param` function. `do_early_param` goes from `__setup_start` to `__setup_end` and calls the function from an `obs_kernel_param` if the parameter is early. After this, all services which depend on early command line parameters have been set up, and the next call after `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set the `NX-bit` with `x86_configure_nx`. The `x86_report_nx` function from arch/x86/mm/setup_nx.c just prints information about `NX`. Note that we do not call `x86_report_nx` right after `x86_configure_nx`, but after the call of `parse_early_param`. The answer is simple: we call it after `parse_early_param` because the kernel supports the `noexec` parameter:

```
noexec          [X86]
                On X86-32 available only on PAE configured kernels.
                noexec=on: enable non-executable mappings (default)
                noexec=off: disable non-executable mappings
```

We can see its effect at boot time:

```
[    0.000000] NX (Execute Disable) protection: active
```

After this we can see the call of the `memblock_x86_reserve_range_setup_data()` function.
This function is defined in the same arch/x86/kernel/setup.c source code file; it remaps memory for the `setup_data` and reserves a memory block for it (more about `setup_data` you can read in the previous part; about `ioremap` and `memblock` you can read in the Linux kernel memory management chapter).

In the next step we can see the following conditional statement:

```C
        if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
                disable_apic = 1;
#endif
                setup_clear_cpu_cap(X86_FEATURE_APIC);
        }
```

The `acpi_mps_check` function from arch/x86/kernel/acpi/boot.c depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options:

```C
int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
        /* mptable code is not built-in */
        if (acpi_disabled || acpi_noirq) {
                printk(KERN_WARNING "MPS support code is not built-in.\n"
                       "Using acpi=off or acpi=noirq or pci=noacpi "
                       "may have problem\n");
                return 1;
        }
#endif
        return 0;
}
```

It checks the built-in `MPS` or MultiProcessor Specification table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not, `acpi_mps_check` prints a warning message if one of the `acpi=off`, `acpi=noirq` or `pci=noacpi` command line options was passed to the kernel. If `acpi_mps_check` returns `1`, it means that we disable the local APIC and clear the `X86_FEATURE_APIC` bit in the capabilities of the current CPU with the `setup_clear_cpu_cap` macro (more about CPU masks you can read in the CPU masks part).

### Early PCI dump

In the next step we make a dump of the PCI devices with the following code:

```C
#ifdef CONFIG_PCI
        if (pci_early_dump_regs)
                early_dump_pci_devices();
#endif
```

The `pci_early_dump_regs` variable is defined in arch/x86/pci/common.c and its value depends on the kernel command line parameter `pci=earlydump`. We can find the definition of this parameter in drivers/pci/pci.c:

```C
early_param("pci", pci_setup);
```

The `pci_setup` function gets the string after `pci=` and analyzes it.
This function calls `pcibios_setup`, which is defined as `__weak` in drivers/pci/pci.c, and every architecture defines its own function which overrides the `__weak` analog. For example, the `x86_64` architecture-dependent version is in arch/x86/pci/common.c:

```C
char *__init pcibios_setup(char *str) {
        ...
        ...
        } else if (!strcmp(str, "earlydump")) {
                pci_early_dump_regs = 1;
                return NULL;
        }
        ...
        ...
}
```

So, if the `CONFIG_PCI` option is set and we passed the `pci=earlydump` option to the kernel command line, the next function which will be called is `early_dump_pci_devices` from arch/x86/pci/early.c. This function checks the `noearly` pci parameter with:

```C
        if (!early_pci_allowed())
                return;
```

and returns if it was passed. Each PCI domain can host up to `256` buses and each bus hosts up to `32` devices. So, we go in a loop:

```C
        for (bus = 0; bus < 256; bus++) {
                for (slot = 0; slot < 32; slot++) {
                        for (func = 0; func < 8; func++) {
                                ...
                                ...
                                ...
                        }
                }
        }
```

and read the PCI config with the `read_pci_config` function. That's all. We will not go deep into the `PCI` details here, but will see more details in the special `Drivers/PCI` part.

### Finish with memory parsing

After `early_dump_pci_devices`, there are a couple of functions related to available memory and `e820` which we collected in the First steps in the kernel setup part:

```C
        /* update the e820_saved too */
        e820_reserve_setup_data();
        finish_e820_parsing();
        ...
        ...
        e820_add_kernel_range();
        trim_bios_range(void);
        max_pfn = e820_end_of_ram_pfn();
        early_reserve_e820_mpc_new();
```

Let's look at them. As you can see, the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range`, which adds new regions to the `e820map` with the given type, `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing`, which sanitizes the `e820map` with the `sanitize_e820_map` function. Besides these two functions we can see a couple of functions related to `e820`.
You can see them in the listing above. The `e820_add_kernel_range` function takes the physical addresses of the kernel start and end:

```C
        u64 start = __pa_symbol(_text);
        u64 size = __pa_symbol(_end) - start;
```

checks that `.text`, `.data` and `.bss` are marked as `E820RAM` in the `e820map` and prints a warning message if not. The next function, `trim_bios_range`, updates the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes the map again with a call of `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has a unique number - the `Page frame number` - and `e820_end_of_ram_pfn` returns the maximum one with the call of `e820_end_pfn`:

```C
unsigned long __init e820_end_of_ram_pfn(void)
{
        return e820_end_pfn(MAX_ARCH_PFN);
}
```

where `e820_end_pfn` takes the maximum page frame number on the certain architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In `e820_end_pfn` we go through all the `e820` slots and check that an `e820` entry has `E820_RAM` or `E820_PRAM` type, because we calculate page frame numbers only for these types, then get the base and end page frame numbers for the current `e820` entry and make some checks for these addresses:

```C
        for (i = 0; i < e820.nr_map; i++) {
                struct e820entry *ei = &e820.map[i];
                unsigned long start_pfn;
                unsigned long end_pfn;

                if (ei->type != E820_RAM && ei->type != E820_PRAM)
                        continue;

                start_pfn = ei->addr >> PAGE_SHIFT;
                end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

                if (start_pfn >= limit_pfn)
                        continue;
                if (end_pfn > limit_pfn) {
                        last_pfn = limit_pfn;
                        break;
                }
                if (end_pfn > last_pfn)
                        last_pfn = end_pfn;
        }

        if (last_pfn > max_arch_pfn)
                last_pfn = max_arch_pfn;

        printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
               last_pfn, max_arch_pfn);
        return last_pfn;
```

After this we check that `last_pfn`, which we got in the loop, is not greater than the maximum page frame number for the certain architecture (`x86_64` in our case), print information about the last page frame number and return it.
We can see the `last_pfn` in the `dmesg` output:

```
...
[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
...
```

After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn`, which is the biggest page frame number in `low memory`, i.e. below the first 4 gigabytes. If more than 4 gigabytes of RAM are installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function, which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit; otherwise `max_low_pfn` will be the same as `max_pfn`:

```C
        if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
                max_low_pfn = e820_end_of_low_ram_pfn();
        else
                max_low_pfn = max_pfn;

        high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
```

Next we calculate `high_memory` (the upper bound on direct-mapped memory) with the `__va` macro, which returns a virtual address for the given physical one.

### DMI scanning

The next step after the manipulations with the different memory regions and `e820` slots is collecting information about the computer. We will get all the information with the Desktop Management Interface and the following functions:

```C
        dmi_scan_machine();
        dmi_memdev_walk();
```
Let's look at the second approach. The `dmi_scan_machine` function remaps the `0x10000` bytes of memory starting at `0xF0000` with `dmi_early_remap`, which just expands to `early_ioremap`:

```C
void __init dmi_scan_machine(void)
{
        char __iomem *p, *q;
        char buf[32];
        ...
        ...
        p = dmi_early_remap(0xF0000, 0x10000);
        if (p == NULL)
                goto error;

        memset(buf, 0, 16);
        for (q = p; q < p + 0x10000; q += 16) {
                ...
        }
```

and iterates over this area in 16-byte steps, comparing each chunk against the `_SM_` anchor string which marks the SMBIOS entry point; when it is found, the DMI information becomes available. After the DMI scanning, a related step is the search for the MultiProcessor Specification floating pointer structure with `smp_scan_config`. It checks the `SMP_MAGIC_IDENT` signature, the length, the checksum and that the `specification` version is `1` or `4` in the loop:

```C
        while (length > 0) {
                if ((*bp == SMP_MAGIC_IDENT) &&
                    (mpf->length == 1) &&
                    !mpf_checksum((unsigned char *)bp, 16) &&
                    ((mpf->specification == 1)
                     || (mpf->specification == 4))) {
                        mem = virt_to_phys(mpf);
                        memblock_reserve(mem, sizeof(*mpf));
                        if (mpf->physptr)
                                smp_reserve_memory(mpf);
                }
        }
```

It reserves the given memory block with `memblock_reserve` if the search is successful, and reserves the physical address of the multiprocessor configuration table. You can find documentation about this in the MultiProcessor Specification; more details will be in the special part about `SMP`.

### Additional early memory initialization routines

In the next step of `setup_arch` we can see the call of the `early_alloc_pgt_buf` function which allocates the page table buffer for the early stage. The page table buffer will be placed in the `brk` area. Let's look at its implementation:

```C
void  __init early_alloc_pgt_buf(void)
{
        unsigned long tables = INIT_PGT_BUF_SIZE;
        phys_addr_t base;

        base = __pa(extend_brk(tables, PAGE_SIZE));

        pgt_buf_start = base >> PAGE_SHIFT;
        pgt_buf_end = pgt_buf_start;
        pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}
```

First of all it gets the size of the page table buffer; it will be `INIT_PGT_BUF_SIZE`, which is `(6 * PAGE_SIZE)` in the current Linux kernel 4.0. As we got the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area.
As we can see in the Linux kernel linker script, the `brk` section is in memory right after the BSS:

```
        . = ALIGN(PAGE_SIZE);
        .brk : AT(ADDR(.brk) - LOAD_OFFSET) {
                __brk_base = .;
                . += 64 * 1024;         /* 64k alignment slop space */
                *(.brk_reservation)     /* areas brk users have reserved */
                __brk_limit = .;
        }
```

Or we can find it with the `readelf` util. After we got the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the top of the page table buffer. In the next step, as we got the page table buffer, we reserve the memory block for the brk area with the `reserve_brk` function:

```C
static void __init reserve_brk(void)
{
        if (_brk_end > _brk_start)
                memblock_reserve(__pa_symbol(_brk_start),
                                 _brk_end - _brk_start);

        _brk_start = 0;
}
```

Note that at the end of `reserve_brk` we set `_brk_start` to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for the `brk` is unmapping out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remember that the kernel mapping is `__START_KERNEL_map` to `_end - _text`, and `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. In the start of `cleanup_highmap` we define these parameters:

```C
        unsigned long vaddr = __START_KERNEL_map;
        unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
        pmd_t *pmd = level2_kernel_pgt;
        pmd_t *last_pmd = pmd + PTRS_PER_PMD;
```

Now, as we defined the start and end of the kernel mapping, we go in a loop through all the kernel page middle directory entries and clean entries which are not between `_text` and `end`:

```C
        for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
                if (pmd_none(*pmd))
                        continue;
                if (vaddr < (unsigned long) _text || vaddr > end)
                        set_pmd(pmd, __pmd(0));
        }
```

After this we set the limit for `memblock` allocation with the `memblock_set_current_limit` function (read more about `memblock` in the Linux kernel memory management Part 2); it will be `ISA_END_ADDRESS` or `0x100000`. And we fill `memblock` with information according to `e820` with the `memblock_x86_fill` function.
You can see the result of this function at kernel initialization time:

```
MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]    [0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]    [0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]    [0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]  [0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]  [0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]  [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
```

The remaining functions after `memblock_x86_fill` are: `early_reserve_e820_mpc_new`, which allocates additional slots in the `e820map` for the MultiProcessor Specification table; `reserve_real_mode`, which reserves low memory from `0x0` to 1 megabyte for the trampoline to real mode (for rebooting, etc.); `trim_platform_memory_ranges`, which trims certain memory regions starting from `0x20050000`, `0x20110000`, etc. (these regions must be excluded because Sandy Bridge has problems with them); and `trim_low_memory_range`, which reserves the first 4 kilobyte page in `memblock`. After this, the `init_mem_mapping` function reconstructs the direct memory mapping and sets up the direct mapping of the physical memory at `PAGE_OFFSET`, `early_trap_pf_init` sets up the `#PF` handler (we will look at it in the chapter about interrupts), and the `setup_real_mode` function sets up the trampoline to the real mode code.

That's all. You may note that this part does not cover all the functions which are in `setup_arch` (like `early_gart_iommu_check`, mtrr initialization, etc.). As I already wrote many times, `setup_arch` is big, and the Linux kernel is big. That's why I can't cover every line in the Linux kernel. I don't think that we missed anything important, but you may say: each line of code is important. Yes, it's true, but I skipped them anyway, because I think that it is not realistic to cover the full Linux kernel.
Anyway, we will often return to ideas that we have already seen, and if something is unfamiliar, we will cover it then.

### Conclusion

It is the end of the sixth part about the Linux kernel initialization process. In this part we continued to dive into the `setup_arch` function, and again it was a long part, but we are not finished with it. Yes, `setup_arch` is big; I hope that the next part will be the last part about this function.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

### Links

* MultiProcessor Specification
* NX bit
* Documentation/kernel-parameters.txt
* APIC
* CPU masks
* Linux kernel memory management
* PCI
* e820
* System Management BIOS
* EFI
* SMP
* BSS
* SMBIOS specification
* Previous part

# Kernel initialization. Part 7.

## The End of the architecture-specific initialization, almost...

This is the seventh part of the Linux kernel initialization process which covers the insides of the `setup_arch` function from arch/x86/kernel/setup.c. As you know from the previous parts, the `setup_arch` function does some architecture-specific (in our case `x86_64`) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the Desktop Management Interface, early dump of the PCI devices and many many more. If you have read the previous part, you may remember that we finished it at the `setup_real_mode` function. In the next step, as we set the limit of the memblock to all mapped pages, we can see the call of the `setup_log_buf` function from kernel/printk/printk.c.

The `setup_log_buf` function sets up the kernel cyclic buffer, whose length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of `CONFIG_LOG_BUF_SHIFT`, it can be between `12` and `21`.
In the insides, the buffer is defined as an array of chars whose length is:

```C
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
```

The next function after `setup_log_buf` is `reserve_initrd`, which reserves a memory block for the initrd (the initial ramdisk). Before the reservation, it checks the size of the initrd against the amount of directly mapped memory:

```C
        mapped_size = memblock_mem_size(max_pfn_mapped);
        if (ramdisk_size >= (mapped_size>>1))
                panic("initrd too large to handle, "
                      "disabling initrd (%lld needed, %lld available)\n",
                      ramdisk_size, mapped_size>>1);
```

You can see here that we call the `memblock_mem_size` function and pass `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest direct-mapped page frame number. If you do not remember what a `page frame number` is, the explanation is simple: the first `12` bits of a virtual address represent the offset within a physical page or page frame. If we right-shift a virtual address by `12`, we discard the offset part and get the `Page Frame Number`. In `memblock_mem_size` we go through all the memblock `mem` (not reserved) regions and calculate the size of the mapped pages, returning it in the `mapped_size` variable (see the code above). As we got the amount of directly mapped memory, we check that the size of the `initrd` is not greater than half of the mapped pages. If it is greater, we just call `panic`, which halts the system and prints the famous Kernel panic message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output:

```
[    0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]
```

and relocate the `initrd` to the direct mapping area with the `relocate_initrd` function. In the start of the `relocate_initrd` function we try to find a free area with the `memblock_find_in_range` function:

```C
        relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped),
                                                   area_size, PAGE_SIZE);

        if (!relocated_ramdisk)
                panic("Cannot find place for new RAMDISK of size %lld\n",
                      ramdisk_size);
```

The `memblock_find_in_range` function tries to find a free area in a given range, in our case from `0` to the maximum mapped physical address, and the size must equal the aligned size of the `initrd`. If we didn't find an area with the given size, we call `panic` again.
If all is good, we relocate the RAM disk to the bottom of the directly mapped memory in the next step.

At the end of the `reserve_initrd` function, we free the memblock memory which was occupied by the ramdisk with the call of:

```C
        memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
```

After we relocated the `initrd` ramdisk image, the next function is `vsmp_init` from arch/x86/kernel/vsmp_64.c. This function initializes support for `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover non-`x86_64`-related initialization parts (for example `ACPI`, etc.), so we will skip the implementation of this for now and come back to it in the part which covers techniques of parallel computing.

The next function is `io_delay_init` from arch/x86/kernel/io_delay.c. This function allows overriding the default I/O delay port `0x80`. We already saw I/O delay in the "Last preparation before transition into protected mode" part; now let's look at the `io_delay_init` implementation:

```C
void __init io_delay_init(void)
{
        if (!io_delay_override)
                dmi_check_system(io_delay_0xed_port_dmi_table);
}
```

This function checks the `io_delay_override` variable and overrides the I/O delay port if `io_delay_override` is not set. We can set the `io_delay_override` variable by passing the `io_delay` option to the kernel command line. As we can read from Documentation/kernel-parameters.txt, the `io_delay` option is:

```
io_delay=       [X86] I/O delay method
        0x80
                Standard port 0x80 based delay
        0xed
                Alternate port 0xed based delay (needed on some systems)
        udelay
                Simple two microseconds delay
        none
                No delay
```

We can see the `io_delay` command line parameter setup with the `early_param` macro in arch/x86/kernel/io_delay.c:

```C
early_param("io_delay", io_delay_param);
```

More about `early_param` you can read in the previous part.
So the `io_delay_param` function, which sets the `io_delay_override` variable, will be called from the `do_early_param` function. The `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depending on it:

```C
static int __init io_delay_param(char *s)
{
        if (!s)
                return -EINVAL;

        if (!strcmp(s, "0x80"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
        else if (!strcmp(s, "0xed"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
        else if (!strcmp(s, "udelay"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
        else if (!strcmp(s, "none"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
        else
                return -EINVAL;

        io_delay_override = 1;
        return 0;
}
```

The next functions after `io_delay_init` are `acpi_boot_table_init`, `early_acpi_boot_init` and `initmem_init`, but as I wrote above, we will not cover ACPI-related stuff in this "Linux Kernel initialization process" chapter.

### Allocate area for DMA

In the next step we need to allocate an area for Direct Memory Access with the `dma_contiguous_reserve` function, which is defined in drivers/base/dma-contiguous.c. `DMA` is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter, `max_pfn_mapped << PAGE_SHIFT`, to the `dma_contiguous_reserve` function; as you can understand from this expression, it is the limit of the reserved memory.

A little further on, we store the current value of the `CR4` control register in `mmu_cr4_features` and update the copy used by the real-mode trampoline:

```C
        mmu_cr4_features = __read_cr4();
        if (trampoline_cr4_features)
                *trampoline_cr4_features = mmu_cr4_features;
```

The next function which you can see is `map_vsyscall` from arch/x86/kernel/vsyscall_64.c. This function maps the memory space for vsyscalls and depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually, the `vsyscall` is a special segment which provides fast access to certain system calls like `getcpu`, etc. Let's look at the implementation of this function:
```C
void __init map_vsyscall(void)
{
	extern char __vsyscall_page;
	unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);

	if (vsyscall_mode != NONE)
		__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
			     vsyscall_mode == NATIVE
			     ? PAGE_KERNEL_VSYSCALL
			     : PAGE_KERNEL_VVAR);

	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
		     (unsigned long)VSYSCALL_ADDR);
}
```

In the beginning of `map_vsyscall` we can see the definition of two variables. The first is the extern variable `__vsyscall_page`. As an extern variable, it is defined somewhere in another source code file. Actually we can see the definition of `__vsyscall_page` in arch/x86/kernel/vsyscall_emu_64.S. The `__vsyscall_page` symbol points to the aligned calls of the vsyscalls, such as `gettimeofday`, etc.:

```assembly
	.globl __vsyscall_page
	.balign PAGE_SIZE, 0xcc
	.type __vsyscall_page, @object
__vsyscall_page:
	mov $__NR_gettimeofday, %rax
	syscall
	ret

	.balign 1024, 0xcc
	mov $__NR_time, %rax
	syscall
	ret
	...
	...
	...
```

The second variable is `physaddr_vsyscall`, which just stores the physical address of the `__vsyscall_page` symbol. In the next step we check the `vsyscall_mode` variable, and if it is not equal to `NONE` (it is `EMULATE` by default):

```C
static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
```

And after this check we can see the call of the `__set_fixmap` macro, which calls the `native_set_fixmap` function with the same parameters:

```C
void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
	__native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
}

void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
{
	unsigned long address = __fix_to_virt(idx);

	if (idx >= __end_of_fixed_addresses) {
		BUG();
		return;
	}
	set_pte_vaddr(address, pte);
	fixmaps_set++;
}
```

Here we can see that `native_set_fixmap` makes a `Page Table Entry` value from the given physical address (the physical address of the `__vsyscall_page` symbol in our case) and calls the internal function `__native_set_fixmap`. The internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that the given index is not greater than the end of the fix-mapped addresses.
After this we set the page table entry with the call of `set_pte_vaddr` and increase the count of fix-mapped addresses. And at the end of `map_vsyscall` we check with `BUILD_BUG_ON` that the virtual address of the `VSYSCALL_PAGE` (which is the first index in `fixed_addresses`) is equal to `VSYSCALL_ADDR`, which is `-10UL << 20` or `ffffffffff600000`.

As you can see in init/main.c, we pass the memory descriptor of the init process to the `mm_init_cpumask` function, and depending on the `CONFIG_CPUMASK_OFFSTACK` configuration option, we clear the TLB switch `cpumask`:

```C
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
	cpumask_clear(mm->cpu_vm_mask_var);
}
```

In the next step we can see the call of the following function:

```C
setup_command_line(command_line);
```

This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers because one buffer is used for future reference and access to the command line and one is for parameter parsing. We will allocate space for the following buffers:

* `saved_command_line` - will contain the boot command line;
* `initcall_command_line` - will contain the boot command line; will be used in `do_initcall_level`;
* `static_command_line` - will contain the command line for parameter parsing.

We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid`, which allocates a boot memory block with `memblock_reserve` if slab is not available, or uses `kzalloc_node` (more about it will be in the Linux memory management chapter).
The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (the physical address of the `(PAGE_OFFSET + 0x1000000)` value) as the minimum address of the memory region and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of `memblock.current_limit`) as the maximum address of the memory region.

Let's look at the implementation of `setup_command_line`:

```C
static void __init setup_command_line(char *command_line)
{
	saved_command_line =
		memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
	initcall_command_line =
		memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
	static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
	strcpy(saved_command_line, boot_command_line);
	strcpy(static_command_line, command_line);
}
```

Here we can see that we allocate space for three buffers which will contain the kernel command line for different purposes (read above). And as we have allocated space, we store the `boot_command_line` in `saved_command_line` and the `command_line` (the kernel command line from `setup_arch`) in `static_command_line`.

The next function after `setup_command_line` is `setup_nr_cpu_ids`. This function sets `nr_cpu_ids` (the number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter which describes the cpumasks concept). Let's look at its implementation:

```C
void __init setup_nr_cpu_ids(void)
{
	nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}
```

Here `nr_cpu_ids` represents the number of CPUs and `NR_CPUS` represents the maximum number of CPUs which we can set at configuration time. Actually we need to call this function because `NR_CPUS` can be greater than the actual number of CPUs in your computer.
Here we can see that we call the `find_last_bit` function and pass two parameters to it:

* `cpu_possible_mask` bits;
* maximum number of CPUs.

In `setup_arch` we can find the call of the `prefill_possible_map` function which calculates and writes the actual number of CPUs to the `cpu_possible_mask`. The `find_last_bit` function takes an address and a maximum size to search and returns the bit number of the last set bit. We passed the `cpu_possible_mask` bits and the maximum number of CPUs. First of all the `find_last_bit` function splits the given `unsigned long` address into words:

```C
words = size / BITS_PER_LONG;
```

where `BITS_PER_LONG` is `64` on `x86_64`. As we got the number of words in the given size of the search data, we need to check whether the given size contains a partial word with the following check:

```C
if (size & (BITS_PER_LONG-1)) {
	tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
		- (size & (BITS_PER_LONG-1)))));
	if (tmp)
		goto found;
}
```

If it contains a partial word, we mask the last word and check it. If the last word is not zero, it means that the current word contains at least one set bit and we go to the `found` label:

```C
found:
	return words * BITS_PER_LONG + __fls(tmp);
```

Here you can see the `__fls` function which returns the last set bit in a given word with the help of the `bsr` instruction:

```C
static inline unsigned long __fls(unsigned long word)
{
	asm("bsr %1,%0"
	    : "=r" (word)
	    : "rm" (word));
	return word;
}
```

The `bsr` instruction scans the given operand for its most significant set bit. If the last word is not partial, we go through all the words in the given address and try to find a set bit:

```C
while (words) {
	tmp = addr[--words];
	if (tmp) {
found:
		return words * BITS_PER_LONG + __fls(tmp);
	}
}
```

Here we put the last word into the `tmp` variable and check whether `tmp` contains at least one set bit. If a set bit is found, we return the number of this bit.
If none of the words contains a set bit, we just return the given size:

```C
return size;
```

After this `nr_cpu_ids` will contain the correct number of available CPUs.

That's all.

Conclusion

It is the end of the seventh part about the Linux kernel initialization process. In this part we finally finished with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn generic kernel code from `start_kernel` and will continue our way to the first `init` process.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* Desktop Management Interface
* x86_64
* initrd
* Kernel panic
* Documentation/kernel-parameters.txt
* ACPI
* Direct memory access
* NUMA
* Control register
* vsyscalls
* SMP
* jiffy
* Previous part

Kernel initialization. Part 8.

Scheduler initialization

This is the eighth part of the Linux kernel initialization process chapter and we stopped at the `setup_nr_cpu_ids` function in the previous part.

The main point of this part is scheduler initialization. But before we start to learn the initialization process of the scheduler, we need to do some stuff. The next step in init/main.c is the `setup_per_cpu_areas` function. This function sets up memory areas for the `percpu` variables; more about it you can read in the special part about Per-CPU variables. After the `percpu` areas are up and running, the next step is the `smp_prepare_boot_cpu` function.

This function does some preparations for symmetric multiprocessing.
Since this function is architecture specific, it is located in the arch/x86/include/asm/smp.h Linux kernel header file. Let's look at the definition of this function:

```C
static inline void smp_prepare_boot_cpu(void)
{
	smp_ops.smp_prepare_boot_cpu();
}
```

We may see here that it just calls the `smp_prepare_boot_cpu` callback of the `smp_ops` structure. If we look at the definition of the instance of this structure in the arch/x86/kernel/smp.c source code file, we will see that `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function:

```C
struct smp_ops smp_ops = {
	...
	...
	.smp_prepare_boot_cpu = native_smp_prepare_boot_cpu,
	...
	...
};
EXPORT_SYMBOL_GPL(smp_ops);
```

The `native_smp_prepare_boot_cpu` function looks like:

```C
void __init native_smp_prepare_boot_cpu(void)
{
	int me = smp_processor_id();
	switch_to_new_gdt(me);
	cpumask_set_cpu(me, cpu_callout_mask);
	per_cpu(cpu_state, me) = CPU_ONLINE;
}
```

and executes the following things: first of all it gets the `id` of the current CPU (which is the bootstrap processor and whose `id` is zero at this moment) with the `smp_processor_id` function. I will not explain how `smp_processor_id` works, because we already saw it in the Kernel entry point part. After we've got the processor `id` number, we reload the Global Descriptor Table for the given CPU with the `switch_to_new_gdt` function:

```C
void switch_to_new_gdt(int cpu)
{
	struct desc_ptr gdt_descr;

	gdt_descr.address = (long)get_cpu_gdt_table(cpu);
	gdt_descr.size = GDT_SIZE - 1;
	load_gdt(&gdt_descr);
	load_percpu_segment(cpu);
}
```

The `gdt_descr` variable represents a pointer to the `GDT` descriptor here (we already saw the definition of the `desc_ptr` structure in the Early interrupt and exception handling part). We get the address and the size of the `GDT` descriptor for the CPU with the given `id`.
The `GDT_SIZE` is:

```C
#define GDT_SIZE (GDT_ENTRIES * 8)
```

and the address of the descriptor we will get with `get_cpu_gdt_table`:

```C
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
	return per_cpu(gdt_page, cpu).gdt;
}
```

The `get_cpu_gdt_table` uses the `per_cpu` macro for getting the value of the `gdt_page` percpu variable for the given CPU number (the bootstrap processor with `id` - 0 in our case).

You may ask the following question: so, if we can access the `gdt_page` percpu variable, where was it defined? Actually we already saw it in this book. If you have read the first part of this chapter, you can remember that we saw the definition of `gdt_page` in arch/x86/kernel/head_64.S:

```assembly
early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)
```

and if we look at the linker file we can see that it is located after the `__per_cpu_load` symbol:

```C
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
```

and `gdt_page` is filled in arch/x86/kernel/cpu/common.c:

```C
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
	[GDT_ENTRY_KERNEL32_CS]		= GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER32_CS]	= GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
	...
	...
```

More about `percpu` variables you can read in the Per-CPU variables part.
As we got the address and size of the `GDT` descriptor, we reload the `GDT` with `load_gdt`, which just executes the `lgdt` instruction, and load the `percpu` segment with the following function:

```C
void load_percpu_segment(int cpu)
{
	loadsegment(gs, 0);
	wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
	load_stack_canary_segment();
}
```

The base address of the `percpu` area must be in the `gs` register (or the `fs` register for `x86`), so we use the `loadsegment` macro and pass `gs`. In the next step we write the base address of the IRQ stack and set up the stack canary (this is only for `x86_32`). After we load the new `GDT`, we fill the `cpu_callout_mask` bitmap with the current CPU and set the CPU state as online by setting the `cpu_state` percpu variable for the current processor to `CPU_ONLINE`:

```C
cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;
```

So, what is `cpu_callout_mask`... As we have initialized the bootstrap processor (the processor which boots first on `x86`), the other processors in a multiprocessor system are known as secondary processors. The Linux kernel uses the following two bitmasks:

* `cpu_callout_mask`
* `cpu_callin_mask`

After the bootstrap processor is initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff beforehand, but they check the `cpu_callout_mask` for the bit assigned to them by the bootstrap processor. Only after the bootstrap processor has filled the `cpu_callout_mask` with a given secondary processor will that processor continue the rest of its initialization. After a certain processor finishes its initialization process, it sets its bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, it repeats the same procedure for the initialization of one of the remaining secondary processors. In short, it works as I described, but we will see more details in the chapter about `SMP`.

That's all.
We did all the `SMP` boot preparation.

Build zonelists

In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. What zones are and what this order means, we will understand soon. For a start, let's see how the Linux kernel considers physical memory. Physical memory is split into banks which are called nodes. If you have no hardware support for `NUMA`, you will see only one node:

```
$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0
```

Every `node` is represented by `struct pglist_data` in the Linux kernel. Each node is divided into a number of special blocks which are called zones. Every zone is represented by the `zone struct` in the Linux kernel and has one of the types:

* `ZONE_DMA` - 0-16M;
* `ZONE_DMA32` - used for 32 bit devices that can only do DMA to areas below 4G;
* `ZONE_NORMAL` - all RAM from 4GB on `x86_64`;
* `ZONE_HIGHMEM` - absent on `x86_64`;
* `ZONE_MOVABLE` - zone which contains movable pages.

which are represented by the `zone_type` enum. We can get information about the zones with:

```
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
        ...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
        ...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
        ...
```

As I wrote above, all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure is defined in include/linux/mmzone.h. The `build_all_zonelists` function from mm/page_alloc.c constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY` or `MOVABLE`) which specifies the zones/nodes to visit when the selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part.
The rest of the stuff before scheduler initialization

Before we start to dive into the Linux kernel scheduler initialization process, we must do a couple of things. The first thing is the `page_alloc_init` function from mm/page_alloc.c. This function looks pretty easy:

```C
void __init page_alloc_init(void)
{
	int ret;

	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
					"mm/page_alloc:dead", NULL,
					page_alloc_cpu_dead);
	WARN_ON(ret < 0);
}
```

It sets up the `page_alloc_cpu_dead` callback for the `CPUHP_PAGE_ALLOC_DEAD` CPU hotplug state.

Several memory manager initializations follow: `mem_init` releases all bootmem, `kmem_cache_init` initializes the kernel cache, `percpu_init_late` replaces `percpu` chunks with those allocated by `slub`, `pgtable_init` initializes the `page->ptl` kernel cache and `vmalloc_init` initializes `vmalloc`. Please NOTE that we will not dive into the details about all of these functions and concepts, but we will see all of them in the Linux kernel memory manager chapter.

That's all. Now we can look at the scheduler.

Scheduler initialization

And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again, as I already did many times, you will not see the full explanation of the scheduler here; there will be a special separate chapter about this. Here the first initial scheduler mechanisms will be described. So let's start.

Our current point is the `sched_init` function from the kernel/sched/core.c kernel source code file, and as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the `sched_init` function we can see the following call:

```C
sched_clock_init();
```

The `sched_clock_init` is a pretty easy function and as we may see it just sets the `sched_clock_running` variable:

```C
void sched_clock_init(void)
{
	sched_clock_running = 1;
}
```

that will be used later.
The next step is the initialization of the array of `waitqueues`:

```C
for (i = 0; i < WAIT_TABLE_SIZE; i++)
	init_waitqueue_head(bit_wait_table + i);
```

where `bit_wait_table` is defined as:

```C
#define WAIT_TABLE_BITS 8
#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
```

The `bit_wait_table` is an array of wait queues that will be used for wait/wake up of processes depending on the value of a designated bit. The next step after initialization of the `waitqueues` array is calculating the size of memory to allocate for the `root_task_group`. As we may see, this size depends on the two following kernel configuration options:

* `CONFIG_FAIR_GROUP_SCHED`;
* `CONFIG_RT_GROUP_SCHED`.

```C
#ifdef CONFIG_FAIR_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
```

Both of these options provide two different planning models. As we can read in the documentation, the current scheduler - `CFS` or `Completely Fair Scheduler` - uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` processor time, where `n` is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run and they are called `scheduling policies`.

The `Completely Fair Scheduler` supports the following `normal` or, in other words, `non-real-time` scheduling policies:

* `SCHED_NORMAL`;
* `SCHED_BATCH`;
* `SCHED_IDLE`.

`SCHED_NORMAL` is used for most normal applications; the amount of cpu each process consumes is mostly determined by the nice value. `SCHED_BATCH` is used for 100% non-interactive tasks and `SCHED_IDLE` runs tasks only when the processor has no other task to run besides this task.

The `real-time` policies `SCHED_FIFO` and `SCHED_RR` are also supported for time-critical applications. If you've read something about the Linux kernel scheduler, you may know that it is modular.
It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core knowing too much about them.

Now let's get back to our code and look at the two configuration options: `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The smallest unit which the scheduler operates on is an individual task or thread. But a process is not the only type of entity the scheduler may operate on. Both of these options provide support for group scheduling. The first option provides support for group scheduling with the `completely fair` scheduler policies and the second with the `real-time` policies respectively.

In simple words, group scheduling is a feature that allows us to schedule a set of tasks as if it were a single task. For example, if you create a group with two tasks in it, then from the kernel's perspective this group is just like one normal task. After a group is scheduled, the scheduler will pick a task from this group and it will be scheduled inside the group. So, such a mechanism allows us to build hierarchies and manage their resources. Although the minimal unit of scheduling is a process, the Linux kernel scheduler does not use the `task_struct` structure under the hood. There is a special `sched_entity` structure that is used by the Linux kernel scheduler as the scheduling unit.

So, the current goal is to calculate the space to allocate for the `sched_entity(ies)` of the root task group and we do it twice with:

```C
#ifdef CONFIG_FAIR_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
```

The first is for the case when scheduling of task groups is enabled with the `completely fair` scheduler and the second is for the same purpose in the case of the `real-time` scheduler.
So here we calculate a size which is equal to the size of a pointer multiplied by the number of CPUs in the system and multiplied by `2`. We need to multiply by `2` as we will need to allocate space for two things:

* scheduler entity structure;
* `runqueue`.

After we have calculated the size, we allocate the space with the `kzalloc` function and set the pointers of `sched_entity` and `runqueues` there:

```C
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
	root_task_group.se = (struct sched_entity **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);

	root_task_group.cfs_rq = (struct cfs_rq **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	root_task_group.rt_se = (struct sched_rt_entity **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);

	root_task_group.rt_rq = (struct rt_rq **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);
#endif
```

As I already mentioned, the Linux group scheduling mechanism allows us to specify a hierarchy. The root of such hierarchies is the `root_task_group` task group structure. This structure contains many fields, but we are interested in `se`, `rt_se`, `cfs_rq` and `rt_rq` at this moment.

The first two are instances of the `sched_entity` structure. It is defined in the include/linux/sched.h kernel header file and used by the scheduler as a unit of scheduling:

```C
struct task_group {
	...
	...
	struct sched_entity **se;
	struct cfs_rq **cfs_rq;
	...
	...
}
```

The `cfs_rq` and `rt_rq` represent `run queues`. A `run queue` is a special `per-cpu` structure that is used by the Linux kernel scheduler to store `active` threads or, in other words, the set of threads which potentially will be picked up by the scheduler to run.

The space is allocated and the next step is to initialize the `bandwidth` of CPU for `real-time` and `deadline` tasks:

```C
init_rt_bandwidth(&def_rt_bandwidth,
		  global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
		  global_rt_period(), global_rt_runtime());
```

All groups have to be able to rely on the amount of CPU time.
The two following structures, `def_rt_bandwidth` and `def_dl_bandwidth`, represent the default values of bandwidths for `real-time` and `deadline` tasks. We will not look at the definition of these structures as it is not so important for now, but we are interested in the two following values:

* `sched_rt_period_us`;
* `sched_rt_runtime_us`.

The first represents a period and the second represents the quantum that is allocated for `real-time` tasks during `sched_rt_period_us`. You may see the global values of these parameters in:

```
$ cat /proc/sys/kernel/sched_rt_period_us
1000000

$ cat /proc/sys/kernel/sched_rt_runtime_us
950000
```

The values related to a group can be configured in `<cgroup>/cpu.rt_period_us` and `<cgroup>/cpu.rt_runtime_us`. Since no filesystem is mounted yet, `def_rt_bandwidth` and `def_dl_bandwidth` will be initialized with the default values, which will be returned by the `global_rt_period` and `global_rt_runtime` functions.

That's all with the bandwidths of `real-time` and `deadline` tasks. In the next step, depending on whether SMP is enabled, we make initialization of the `root domain`:

```C
#ifdef CONFIG_SMP
	init_defrootdomain();
#endif
```

The real-time scheduler requires global resources to make a scheduling decision. But unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of `root domains` was introduced for improving scalability and avoiding such bottlenecks. Instead of walking over all `run queues`, the scheduler gets information from the `root_domain` structure about the CPUs where a `real-time` task may be pushed to or pulled from.
This structure is defined in the kernel/sched/sched.h kernel header file and just keeps track of the CPUs that can be used to push or pull a process.

After the `root domain` initialization, we make initialization of the `bandwidth` for the `real-time` tasks of the `root task group` as we did above:

```C
#ifdef CONFIG_RT_GROUP_SCHED
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			  global_rt_period(), global_rt_runtime());
#endif
```

with the same default values.

In the next step, depending on the `CONFIG_CGROUP_SCHED` kernel configuration option, we allocate a `slab` cache for `task_group(s)` and initialize the `siblings` and `children` lists of the root task group. As we can read in the documentation, the `CONFIG_CGROUP_SCHED` option:

> This option allows you to create arbitrary task groups using the "cgroup" pseudo filesystem and control the cpu bandwidth allocated to each such task group.

As we finish with the lists initialization, we can see the call of the `autogroup_init` function:

```C
#ifdef CONFIG_CGROUP_SCHED
	list_add(&root_task_group.list, &task_groups);
	INIT_LIST_HEAD(&root_task_group.children);
	INIT_LIST_HEAD(&root_task_group.siblings);
	autogroup_init(&init_task);
#endif
```

which initializes automatic process group scheduling. The `autogroup` feature is about the automatic creation and population of a new task group during the creation of a new session via the setsid call.

After this we go through all `possible` CPUs (you can remember that `possible` CPUs are stored in the `cpu_possible_mask` bitmap of CPUs that can ever be available in the system) and initialize a `runqueue` for each `possible` cpu:

```C
for_each_possible_cpu(i) {
	struct rq *rq;
	...
	...
```

The `rq` structure in the Linux kernel is defined in kernel/sched/sched.h. As I already mentioned above, a `run queue` is a fundamental data structure in the scheduling process. The scheduler uses it to determine who will be run next.
As you may see, this structure has many different fields and we will not cover all of them here, but we will look at them when they are directly used.

After initialization of the `per-cpu` run queues with default values, we need to set up the `load weight` of the first task in the system:

```C
set_load_weight(&init_task);
```

First of all let's try to understand what the `load weight` of a process is. If you look at the definition of the `sched_entity` structure, you will see that it starts with the `load` field:

```C
struct sched_entity {
	struct load_weight	load;
	...
	...
}
```

represented by the `load_weight` structure, which just contains two fields that represent the actual load weight of a scheduler entity and its invariant value:

```C
struct load_weight {
	unsigned long	weight;
	u32		inv_weight;
};
```

You may already know that each process in the system has a `priority`. A higher priority allows it to get more time to run. The `load weight` of a process is a relation between the priority of this process and the timeslices of this process. Each process has the three following fields related to priority:

```C
struct task_struct {
	...
	...
	int	prio;
	int	static_prio;
	int	normal_prio;
	...
	...
}
```

The first one is the `dynamic priority`, which can change during the lifetime of a process based on its static priority and the interactivity of the process. The `static_prio` contains the initial priority, most likely well-known to you as the `nice value`. This value is not changed by the kernel if the user does not change it.
The last one is the `normal_prio`, which is based on the value of `static_prio` too, but also depends on the scheduling policy of a process.

So the main goal of the `set_load_weight` function is to initialize the `load_weight` fields for the `init` task:

```C
static void set_load_weight(struct task_struct *p)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight *load = &p->se.load;

	if (idle_policy(p->policy)) {
		load->weight = scale_load(WEIGHT_IDLEPRIO);
		load->inv_weight = WMULT_IDLEPRIO;
		return;
	}

	load->weight = scale_load(sched_prio_to_weight[prio]);
	load->inv_weight = sched_prio_to_wmult[prio];
}
```

As you may see, we calculate the initial `prio` from the initial value of the `static_prio` of the `init` task and use it as an index into the `sched_prio_to_weight` and `sched_prio_to_wmult` arrays to set the `weight` and `inv_weight` values. These two arrays contain a `load weight` depending on the priority value. In the case when a process is an `idle` process, we set the minimal load weight.

At this moment we have come to the end of the initialization process of the Linux kernel scheduler. The last steps are: to make the current process (it will be the first `init` process) `idle`, so that it will be run when a cpu has no other process to run; to calculate the next time period of the next calculation of CPU load; and the initialization of the `fair` class:

```C
__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
#endif
}
```

Here we register a soft irq that will call the `run_rebalance_domains` handler. After the `SCHED_SOFTIRQ` is triggered, `run_rebalance_domains` will be called to rebalance the run queue on the current CPU.

The last two steps of the `sched_init` function are the initialization of scheduler statistics and setting the `scheduler_running` variable:

```C
scheduler_running = 1;
```

That's all. The Linux kernel scheduler is initialized. Of course, we have skipped many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu, etc.) work in the Linux kernel, but we took a short look at the scheduler initialization process. We will look at all the other details in the separate part which will be fully dedicated to the scheduler.
Conclusion

It is the end of the eighth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the scheduler. We will continue in the next part to dive into the Linux kernel initialization process and will see the initialization of the `RCU` and much other initialization stuff.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* CPU masks
* high-resolution kernel timer
* spinlock
* Run queue
* Linux kernel memory manager
* slub
* virtual file system
* Linux kernel hotplug documentation
* IRQ
* Global Descriptor Table
* Per-CPU variables
* SMP
* RCU
* CFS Scheduler documentation
* Real-Time group scheduling
* Previous part

Kernel initialization. Part 9.

RCU initialization

This is the ninth part of the Linux kernel initialization process and in the previous part we stopped at the scheduler initialization. In this part we will continue to dive into the Linux kernel initialization process and the main purpose of this part will be to learn about the initialization of the `RCU`. We can see that the next step in init/main.c after `sched_init` is the call of `preempt_disable`. There are two macros:

* `preempt_disable`
* `preempt_enable`

for preemption disabling and enabling. First of all let's try to understand what `preempt` is in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to preempt the current task to run a task with higher priority. Here we need to disable preemption because we will have only the one `init` process for the early boot time and we don't need it to be stopped before we call the `cpu_idle` function.
The `preempt_disable` macro is defined in include/linux/preempt.h and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This macro is implemented as:

```C
#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)
```

and if `CONFIG_PREEMPT_COUNT` is not set, just:

```C
#define preempt_disable()	barrier()
```

Let's look at it. First of all we can see one difference between these macro implementations. The `preempt_disable` with `CONFIG_PREEMPT_COUNT` set contains the call of `preempt_count_inc`. There is a special `percpu` variable which stores the number of held locks and `preempt_disable` calls:

```C
DECLARE_PER_CPU(int, __preempt_count);
```

In the first implementation of `preempt_disable` we increment this `__preempt_count`. There is an API for returning the value of the `__preempt_count`; it is the `preempt_count` function. As we called `preempt_disable`, first of all we increment the preemption counter with the `preempt_count_inc` macro, which expands to:

```C
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val)	__preempt_count_add(val)
```

where `__preempt_count_add` calls the `raw_cpu_add_4` macro, which adds `1` to the given `percpu` variable (`__preempt_count` in our case; more about `precpu` variables you can read in the part about Per-CPU variables). Ok, we increased `__preempt_count` and the next step is the `barrier` macro, which we can see in both macros. The `barrier` macro inserts an optimization barrier. On processors with the `x86_64` architecture, independent memory access operations can be performed in any order. That's why we need a way to point the compiler and the processor to the required order of operations. This mechanism is a memory barrier. Let's consider a simple example:

```C
preempt_disable();
foo();
preempt_enable();
```

The compiler could rearrange it as:

```C
preempt_disable();
preempt_enable();
foo();
```

In this case the non-preemptible function `foo` can be preempted. As we put the `barrier` macro in the `preempt_disable` and `preempt_enable` macros, it prevents the compiler from swapping `preempt_count_inc` with other statements.
More about barriers you can read here and here.

In the next step we can see the following statement:

```C
if (WARN(!irqs_disabled(),
         "Interrupts were enabled *very* early, fixing it\n"))
        local_irq_disable();
```

which checks the IRQs state and disables them (with the `cli` instruction for `x86_64`) if they are enabled.

That's all. Preemption is disabled and we can go ahead.

Initialization of the integer ID management
-------------------------------------------

In the next step we can see the call of the `idr_init_cache` function which is defined in lib/idr.c. The `idr` library is used in various places in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id.

Let's look at the implementation of the `idr_init_cache` function:

```C
void __init idr_init_cache(void)
{
        idr_layer_cache = kmem_cache_create("idr_layer_cache",
                                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
```

Here we can see the call of `kmem_cache_create`. We already called the `kmem_cache_init` in init/main.c. This function creates generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the Linux kernel memory management chapter). In our case, as we are using the `kmem_cache_t` which will be used by the slab allocator, `kmem_cache_create` creates it. As you can see we pass five parameters to `kmem_cache_create`:

* name of the cache;
* size of the object to store in cache;
* offset of the first object in the page;
* flags;
* constructor for the objects.

and it will create a `kmem_cache` for the integer `IDs`. Integer `IDs` is a commonly used pattern to map a set of integer IDs to a set of pointers. We can see the usage of the integer IDs in the i2c drivers subsystem.
For example drivers/i2c/i2c-core.c, which represents the core of the `i2c` subsystem, defines the `ID` for the `i2c` adapter with the `DEFINE_IDR` macro:

```C
static DEFINE_IDR(i2c_adapter_idr);
```

and then uses it for the declaration of the `i2c` adapter:

```C
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
        int     id;
        ...
        ...
        ...
        id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
        ...
        ...
        ...
}
```

and the `i2c_adapter_idr` presents a dynamically calculated bus number.

More about integer ID management you can read here.

RCU initialization
------------------

The next step is `RCU` initialization with the `rcu_init` function and its implementation depends on two kernel configuration options:

* `CONFIG_TINY_RCU`
* `CONFIG_TREE_RCU`

In the first case `rcu_init` will be in kernel/rcu/tiny.c and in the second case it will be defined in kernel/rcu/tree.c. We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general.

`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days the linux kernel has no single global lock, but provides different mechanisms including lock-free data structures, percpu data structures and other. One of these mechanisms is the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. At the same time all other users of the data structure use the old version of it. Next, we need to choose a safe moment when the original version of the data structure will have no users and update it with the modified copy.

Of course this description of the `RCU` is very simplified.
To understand some details about the `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` execute in a critical section. Every time a data reader gets to the critical section, it calls `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in a state which is called a `quiescent state`. The moment when every thread is in the `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal` - it atomically removes the element from the data structure, but does not release the physical memory. After this the thread-writer announces a grace period and waits until it is finished. During this time, the removed element may still be used by pre-existing thread-readers. After the `grace period` is finished, the second step of the element removal will be started: it just removes the element from the physical memory.

There are a couple of implementations of the `RCU`. The old `RCU` implementation is called `classic`, the new implementation is called `tree` RCU. As you may already understand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another is the `tiny` RCU, enabled by `CONFIG_TINY_RCU`, which depends on `CONFIG_SMP=n`.
We will see more details about the `RCU` in general in the separate chapter about synchronization primitives, but now let's look at the `rcu_init` implementation from kernel/rcu/tree.c:

```C
void __init rcu_init(void)
{
        int cpu;

        rcu_bootup_announce();
        rcu_init_geometry();
        rcu_init_one(&rcu_bh_state, &rcu_bh_data);
        rcu_init_one(&rcu_sched_state, &rcu_sched_data);
        __rcu_init_preempt();
        open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

        /*
         * We don't need protection against CPU-hotplug here because
         * this is called early in boot, before either interrupts
         * or the scheduler are operational.
         */
        cpu_notifier(rcu_cpu_notify, 0);
        pm_notifier(rcu_pm_notify, 0);
        for_each_online_cpu(cpu)
                rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

        rcu_early_boot_tests();
}
```

In the beginning of the `rcu_init` function we define the `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple:

```C
static void __init rcu_bootup_announce(void)
{
        pr_info("Hierarchical RCU implementation.\n");
        rcu_bootup_announce_oddness();
}
```

It just prints information about the `RCU` with the `pr_info` function, and `rcu_bootup_announce_oddness` which uses `pr_info` too, for printing different information about the current `RCU` configuration which depends on different kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT`, etc. In the next step, we can see the call of the `rcu_init_geometry` function. This function is defined in the same source code file and computes the node tree geometry depending on the amount of CPUs. Actually `RCU` provides scalability with extremely low internal RCU lock contention. What if a data structure will be read from the different CPUs? The `RCU` API provides the `rcu_state` structure which presents the RCU global state including the node hierarchy. The hierarchy is presented by the:

```C
struct rcu_node node[NUM_RCU_NODES];
```

array of structures.
As we can read in the comment above this definition:

> The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is determined by the number of CPUs and by CONFIG_RCU_FANOUT. Small systems will have a "hierarchy" consisting of a single rcu_node.

The `rcu_node` structure is defined in kernel/rcu/tree.h and contains information about the current grace period, whether the grace period is completed or not, CPUs or groups that need to switch in order for the current grace period to proceed, etc. Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures are embedded into a linear array in the `rcu_state` structure and represented as a tree with the root as the first element which covers all CPUs. As you can see the number of the rcu nodes is determined by `NUM_RCU_NODES` which depends on the number of available CPUs:

```C
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
```

where the level values depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option.
For example, in the simplest case, one `rcu_node` will cover two CPUs on a machine with eight CPUs:

```
+-------------------------------------------------------------------------+
|                                rcu_state                                |
|                              +----------+                               |
|                              |   root   |                               |
|                              | rcu_node |                               |
|                              +----+-----+                               |
|                                   |                                     |
|                +------------------+------------------+                  |
|                |                                     |                  |
|          +-----v----+                          +-----v----+             |
|          | rcu_node |                          | rcu_node |             |
|          +-----+----+                          +-----+----+             |
|                |                                     |                  |
|        +-------+------+                      +-------+------+           |
|        |              |                      |              |           |
|  +-----v----+   +-----v----+          +-----v----+   +-----v----+       |
|  | rcu_node |   | rcu_node |          | rcu_node |   | rcu_node |       |
|  +-----+----+   +-----+----+          +-----+----+   +-----+----+       |
+--------|--------------|---------------------|--------------|-----------+
         |              |                     |              |
+--------v----+  +------v------+       +------v------+  +----v--------+
| CPU1 | CPU2 |  | CPU3 | CPU4 |       | CPU5 | CPU6 |  | CPU7 | CPU8 |
+-------------+  +-------------+       +-------------+  +-------------+
```

So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures. We start to do it with the calculation of the `jiffies` till the first and next `force-quiescent-state` (read above about it):

```C
d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
        jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
        jiffies_till_next_fqs = d;
```

where:

```C
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV     256
```

As we calculated these jiffies, we check that the previously defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are equal to `ULONG_MAX` (their default values) and set them equal to the calculated value.
As we did not touch these variables before, they are equal to `ULONG_MAX`:

```C
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
```

In the next step of `rcu_init_geometry`, we check that `rcu_fanout_leaf` didn't change (it has the same value as `CONFIG_RCU_FANOUT_LEAF` in compile-time) and that the number of possible CPUs is equal to `NR_CPUS`; in this case we just return:

```C
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
        return;
```

After this we need to compute the number of nodes that an `rcu_node` tree can handle with the given number of levels:

```C
rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
        rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
```

And in the last step we calculate the number of rcu_nodes at each level of the tree in a loop.

As we calculated the geometry of the `rcu_node` tree, we need to go back to the `rcu_init` function and the next step is to initialize two `rcu_state` structures with the `rcu_init_one` function:

```C
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
```

The `rcu_init_one` function takes two arguments:

* Global `RCU` state;
* Per-CPU data for `RCU`.

Both variables are defined in kernel/rcu/tree.h with their `percpu` data:

```C
extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
```

About these states you can read here. As I wrote above we need to initialize the `rcu_state` structures and the `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of `__rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same as the previous functions - initialization of the `rcu_preempt_state` structure, which has the `rcu_state` type, with the `rcu_init_one` function.

After this, in `rcu_init`, we can see the call of the:

```C
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
```

function. This function registers a handler of the pending interrupt, or `softirq`.
Pending interrupt orsupposes that part of actions can be delayed for later execution when the systemis less loaded. Pending interrupts is represented by the following structure:struct softirq_action{void(*action)(struct softirq_action *);};which is defined in the include/linux/interrupt.h and contains only one field - handler of aninterrupt. You can check aboutsoftirqsin the your system with the:222RCU initializationcat /proc/softirqsCPU6CPU0CPU1CPU2CPU3CPU4CPU520010213777910811013957310764710740811497211270401133422113293930764513615253559687792016374420000006602916113024210235075950917057535675323826276510302368260219255812906806282979690156839069385CPU7HI:00TIMER:9653NET_TX:00NET_RX:292303BLOCK:282855BLOCK_IOPOLL:00TASKLET:67080SCHED:927969914HRTIMER:248246RCU:3304The998665663473open_softirqfunction takes two parameters:index of the interrupt;interrupt handler.and adds interrupt handler to the array of the pending interrupts:void open_softirq(int nr, void (*action)(struct softirq_action *)){softirq_vec[nr].action = action;}In our case the interrupt handler is kernel/rcu/tree.c and does thesoftirqinterrupt for theRCURCUrcu_process_callbackswhich is defined in thecore processing for the current CPU. After we registered, we can see the following code:cpu_notifier(rcu_cpu_notify, 0);pm_notifier(rcu_pm_notify, 0);for_each_online_cpu(cpu)rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);223RCU initializationHere we can see registration of thecpunotifier which needs in systems which supportsCPU hotplug and we will not dive into details about this theme. The last function in thercu_initis thercu_early_boot_tests:void rcu_early_boot_tests(void){pr_info("Running RCU self tests\n");if (rcu_self_test)early_boot_test_call_rcu();if (rcu_self_test_bh)early_boot_test_call_rcu_bh();if (rcu_self_test_sched)early_boot_test_call_rcu_sched();}which runs self tests for theRCU.That's all. We saw initialization process of theabout theRCURCUsubsystem. 
As I wrote above, more about the `RCU` will be in the separate chapter about synchronization primitives.

Rest of the initialization process
----------------------------------

Ok, we already passed the main theme of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work at initialization time, but we will not dive into deep details around these functions for different reasons. Some reasons not to dive into details are the following:

* They are not very important for the generic kernel initialization process and depend on the different kernel configuration;
* They have the character of debugging and are not important for now;
* We will see many of this stuff in the separate parts/chapters.

After we initialized the `RCU`, the next step which you can see in init/main.c is the `trace_init` function. As you can understand from its name, this function initializes the tracing subsystem. You can read more about the linux kernel trace system here.

After `trace_init`, we can see the call of `radix_tree_init`. If you are familiar with the different data structures, you can understand from the name of this function that it initializes the kernel implementation of the Radix tree. This function is defined in lib/radix-tree.c and you can read more about it in the part about Radix tree.

In the next step we can see the functions which are related to the interrupts handling subsystem, they are:

* `early_irq_init`
* `init_IRQ`
* `softirq_init`

We will see explanations about these functions and their implementation in the special part about interrupts and exceptions handling. After this come many different functions (like `init_timers`, `hrtimers_init`, `time_init`, etc.) which are related to different timing and timers stuff.
We will see more about these functions in the chapter about timers.

The next couple of functions are related to the perf events - `perf_event_init` (there will be a separate chapter about perf), and initialization of the `profiling` with `profile_init`. After this we enable `irq` with the call of:

```C
local_irq_enable();
```

which expands to the `sti` instruction, and make post initialization of the SLAB with the call of the `kmem_cache_init_late` function (as I wrote above we will know about the `SLAB` in the Linux memory management chapter).

After the post initialization of the `SLAB`, the next point is the initialization of the console with the `console_init` function from drivers/tty/tty_io.c.

After the console initialization, we can see the `lockdep_info` function which prints information about the Lock dependency validator. After this, we can see the initialization of the dynamic allocation of the `debug objects` with `debug_objects_mem_init`, kernel memory leak `detector` initialization with `kmemleak_init`, `percpu` pageset setup with `setup_per_cpu_pageset`, setup of the NUMA policy with `numa_policy_init`, setting time for the scheduler with `sched_clock_init`, `pidmap` initialization with the call of `pidmap_init` for the initial `PID` namespace, cache creation with `anon_vma_init` for the private virtual memory areas and early initialization of the `ACPI` with `acpi_early_init`.

This is the end of the ninth part of the linux kernel initialization process and here we saw the initialization of the RCU. In the last paragraph of this part (`Rest of the initialization process`) we went through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about this stuff or you know and do not understand anything about it. As I already wrote many times, we will see details of implementations in other parts or other chapters.

Conclusion
----------

It is the end of the ninth part about the linux kernel initialization process.
In this part, we looked at the initialization process of the `RCU` subsystem. In the next part we will continue to dive into the linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same init/main.c source code file and will see the start of the first process.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links
-----

* lock-free data structures
* kmemleak
* ACPI
* IRQs
* RCU
* RCU documentation
* integer ID management
* Documentation/memory-barriers.txt
* Runtime locking correctness validator
* Per-CPU variables
* Linux kernel memory management
* slab
* i2c
* Previous part

Kernel initialization. Part 10.
===============================

End of the linux kernel initialization process
----------------------------------------------

This is the tenth part of the chapter about the linux kernel initialization process and in the previous part we saw the initialization of the `RCU` and stopped on the call of the `acpi_early_init` function. This part will be the last part of the Kernel initialization process chapter, so let's finish it.

After the call of the `acpi_early_init` function from init/main.c, we can see the following code:

```C
#ifdef CONFIG_X86_ESPFIX64
        init_espfix_bsp();
#endif
```

Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking of bits `31:16` of the `esp` register during returning to a 16-bit stack.
First of all `init_espfix_bsp` installs the `espfix` page upper directory into the kernel page directory:

```C
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
```

where `ESPFIX_BASE_ADDR` is:

```C
#define PGDIR_SHIFT       39
#define ESPFIX_PGD_ENTRY  _AC(-2, UL)
#define ESPFIX_BASE_ADDR  (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
```

The next function is `thread_info_cache_init`, defined in kernel/fork.c, which allocates a cache for `thread_info` if its size is less than `PAGE_SIZE`:

```C
...
...
...
void thread_info_cache_init(void)
{
        thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                              THREAD_SIZE, 0, NULL);
        BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif
```

As we already know, `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` and `THREAD_SIZE` depends on it for `x86_64`. The next function, `fork_init`, among other things sets resource limits for the `init` task:

```C
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
                init_task.signal->rlim[RLIMIT_NPROC];
```

As we know the `init_task` is an instance of the `task_struct` structure, so it contains the `signal` field which represents the signal handler. It has the type `struct signal_struct`. On the first two lines we can see the setting of the current and maximum limit of the `RLIMIT_NPROC` resource. Every process has an associated set of resource limits. These limits specify the amount of resources which the current process can use. Here `rlim` is the resource control limit and is presented by the:

```C
struct rlimit {
        __kernel_ulong_t        rlim_cur;
        __kernel_ulong_t        rlim_max;
};
```

structure from include/uapi/linux/resource.h. In our case the resources are `RLIMIT_NPROC`, which is the maximum number of processes that a user can own, and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see them in:

```
$ cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max processes             63815                63815                processes
Max pending signals       63815                63815                signals
...
```
Initialization of the caches
----------------------------

The next function after `fork_init` is the `proc_caches_init` from kernel/fork.c. This function allocates caches for the memory descriptors (or the `mm_struct` structure). At the beginning of `proc_caches_init` we can see the allocation of the different `SLAB` caches with the call of `kmem_cache_create`:

* `sighand_cachep` - manages information about installed signal handlers;
* `signal_cachep` - manages information about process signal descriptors;
* `files_cachep` - manages information about opened files;
* `fs_cachep` - manages filesystem information.

After this we allocate a `SLAB` cache for the `mm_struct` structures:

```C
mm_cachep = kmem_cache_create("mm_struct",
                         sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
```

After this we allocate a `SLAB` cache for the important `vm_area_struct` which is used by the kernel to manage virtual memory space:

```C
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
```

Note, that we use the `KMEM_CACHE` macro here instead of `kmem_cache_create`. This macro is defined in include/linux/slab.h and just expands to the `kmem_cache_create` call:

```C
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
                sizeof(struct __struct), __alignof__(struct __struct),\
                (__flags), NULL)
```

The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look at the `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` to the size of the given structure, but `kmem_cache_create` uses a given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes the virtual memory area `SLAB` and the second function initializes the `SLAB` for namespaces.

The next function after `proc_caches_init` is `buffer_init`. This function is defined in the fs/buffer.c source code file and allocates a cache for `buffer_head`. The `buffer_head` is a special structure which is defined in include/linux/buffer_head.h and used for managing buffers. In the start of the `buffer_init` function we allocate a cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did in the previous functions.
And calculate the maximum size of the buffers in memory with:

```C
nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
```

which will be equal to `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on `x86_64`).

The next function after `buffer_init` is `vfs_caches_init`. This function allocates `SLAB` caches and hashtables for different VFS caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel initialization process which initialized caches for `dcache` (or directory-cache) and the inode cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about VFS will be described in the separate part. After this we can see the `signals_init` function. This function is defined in kernel/signal.c and allocates a cache for the `sigqueue` structures which represent queues of real-time signals. The next function is `page_writeback_init`. This function initializes the ratio for dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after having been loaded into memory.

Creation of the root for the procfs
-----------------------------------

After all of these preparations we need to create the root for the proc filesystem. We will do it with the call of the `proc_root_init` function from fs/proc/root.c. At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with the:

```C
err = register_filesystem(&proc_fs_type);
if (err)
        return;
```

As I wrote above we will not dive into details about VFS and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in our system, we call the `proc_self_init` function from fs/proc/self.c and this function allocates an `inode` number for the `self` (the `/proc/self` directory refers to the process accessing the `/proc` filesystem).
The next step after `proc_self_init` is `proc_setup_thread_self` which setups the `/proc/thread-self` directory which contains information about the current thread. After this we create the `/proc/self/mounts` symlink which will contain mount points, with the call of the:

```C
proc_symlink("mounts", NULL, "self/mounts");
```

and a couple of directories depending on the different configuration options:

```C
#ifdef CONFIG_SYSVIPC
        proc_mkdir("sysvipc", NULL);
#endif
        proc_mkdir("fs", NULL);
        proc_mkdir("driver", NULL);
        proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
        proc_mkdir("openprom", NULL);
#endif
        proc_mkdir("bus", NULL);
        ...
        ...
        ...
        if (!proc_mkdir("tty", NULL))
                return;
        proc_mkdir("tty/ldisc", NULL);
        ...
        ...
        ...
```

In the end of the `proc_root_init` we call the `proc_sys_init` function which creates the `/proc/sys` directory and initializes the Sysctl.

It is the end of the `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I skipped them, because they are not important for the generic kernel initialization stuff and depend only on different kernel configurations. They are `taskstats_init_early` which exports per-task statistics to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` which initialize different security stuff, `check_bugs` - fixes some architecture-dependent bugs, the `ftrace_init` function executes initialization of the ftrace, `cgroup_init` makes initialization of the rest of the cgroup subsystem, etc. Many of these parts and subsystems will be described in the other chapters.

That's all. Finally we have passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. At the end of `start_kernel` we can see the last call - of the `rest_init` function. Let's go ahead.

First steps after the start_kernel
----------------------------------

The `rest_init` function is defined in the same source code file as `start_kernel`, and this file is init/main.c.
In the beginning of the `rest_init` function, we can see the call of the two following functions:

```C
rcu_scheduler_starting();
smpboot_thread_init();
```

The first `rcu_scheduler_starting` makes the `RCU` scheduler active and the second `smpboot_thread_init` registers the `smpboot_thread_notifier` CPU notifier (more about it you can read in the CPU hotplug documentation). After this we can see the following calls:

```C
kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
```

Here the `kernel_thread` function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the `kernel_thread` function takes three arguments:

* Function which will be executed in a new thread;
* Parameter for the function;
* Flags.

We will not dive into details about the `kernel_thread` implementation (we will see it in the chapter which describes the scheduler; just need to say that `kernel_thread` invokes clone). Now we only need to know that we create two new kernel threads with the `kernel_thread` function: parent and child of the thread will use shared information about the filesystem and they will start to execute the `kernel_init` and `kthreadd` functions. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads: with `PID = 1` for the `init` process and `PID = 2` for `kthreadd`. We already know what `init` is. Let's look at `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel to create another kernel thread. We can see it in the output of the `ps` util:

```
$ ps -ef | grep kthreadd
root         2     0  0 Jan11 ?        00:00:00 [kthreadd]
```

Let's postpone `kernel_init` and `kthreadd` for now and go ahead in the `rest_init`.
In the next step, after we have created two new kernel threads, we can see the following code:

```C
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
```

The first `rcu_read_lock` function marks the beginning of an RCU read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns a pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread` function). In the next step we call the `complete` function:

```C
complete(&kthreadd_done);
```

and pass the address of the `kthreadd_done`. The `kthreadd_done` is defined as:

```C
static __initdata DECLARE_COMPLETION(kthreadd_done);
```

where the `DECLARE_COMPLETION` macro is defined as:

```C
#define DECLARE_COMPLETION(work) \
        struct completion work = COMPLETION_INITIALIZER(work)
```

and expands to the definition of the `completion` structure. This structure is defined in include/linux/completion.h and presents the `completions` concept. Completions are a code synchronization mechanism which provides a race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is the definition of the `completion` structure, and we did it with the `DECLARE_COMPLETION`. The second is the call of `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait until another thread calls `complete`. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of `kernel_init_freeable`:

```C
wait_for_completion(&kthreadd_done);
```

And the last step is to call the `complete` function as we saw it above.
After this the `kernel_init` function will not be executed until the `kthreadd` setup is finished. After `kthreadd` was set up, we can see the three following functions in the `rest_init`:

```C
init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);
```

The first `init_idle_bootup_task` function from kernel/sched/core.c sets the Scheduling class for the current process (`idle` class in our case):

```C
void init_idle_bootup_task(struct task_struct *idle)
{
        idle->sched_class = &idle_sched_class;
}
```

where the `idle` class is a low-priority task class whose tasks can be run only when the processor doesn't have anything to run besides these tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` is defined in kernel/sched/idle.c and calls `cpu_idle_loop` from the same file. The `cpu_idle_loop` function works as a process with `PID = 0` and works in the background. The main purpose of the `cpu_idle_loop` is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with the `idle` scheduling class (we just set the `current` task to the `idle` class with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work but just checks if there is an active task to switch to:

```C
static void cpu_idle_loop(void)
{
        ...
        ...
        ...
        while (1) {
                while (!need_resched()) {
                        ...
                        ...
                        ...
                }
                ...
        }
}
```

More about it will be in the chapter about the scheduler. So for this moment the `start_kernel` calls the `rest_init` function which spawns an `init` (`kernel_init` function) process and becomes the `idle` process itself. Now it is time to look at `kernel_init`. Execution of the `kernel_init` function starts from the call of the `kernel_init_freeable` function. The `kernel_init_freeable` function first of all waits for the completion of the `kthreadd` setup.
I already wrote about it above:

```C
wait_for_completion(&kthreadd_done);
```

After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that the system is already running, set allowed cpus/mems to all CPUs and NUMA nodes with the `set_mems_allowed` function, allow the `init` process to run on any CPU with `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of `smp_prepare_cpus`, call early initcalls with `do_pre_smp_initcalls`, initialize `SMP` with `smp_init`, initialize the scheduler with `sched_init_smp` and initialize the lockup detector with the call of `lockup_detector_init`.

After this we can see the call of the `do_basic_setup` function. Before we call the `do_basic_setup` function, our kernel is already initialized for this moment. As the comment says:

> Now we can finally start doing some real work..

The `do_basic_setup` will reinitialize cpuset to the active CPUs, initialize the `khelper` - which is a kernel thread used for making calls out to userspace from within the kernel, initialize tmpfs, initialize the `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early calls of the `initcalls`. We can see the opening of `dev/console` and the dup of the file descriptors from `0` to `2` after `do_basic_setup`:

```C
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        ...
```

...

Interrupts and Interrupt Handling. Part 1.
==========================================

Introduction
------------

```C
BUG_ON((unsigned)n > 0xFF);
```

You can find this check within the Linux kernel source code related to interrupt setup (e.g. `set_intr_gate`, `void set_system_intr_gate` in arch/x86/include/asm/desc.h). The first `32` vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - Early interrupt and exception handling. Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These
These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.

Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes:

* External or hardware generated interrupts
* Software-generated interrupts

The first - external interrupts - are received through the `Local APIC` or pins on the processor which are connected to the `Local APIC`. The second - software-generated interrupts - are caused by an exceptional condition in the processor itself (sometimes using special architecture-specific instructions). A common example of an exceptional condition is division by zero. Another example is exiting a program with the `syscall` instruction.

As mentioned earlier, an interrupt can occur at any time for a reason which the code and CPU have no control over. On the other hand, exceptions are `synchronous` with program execution and can be classified into 3 categories:

* `Faults`
* `Traps`
* `Aborts`

A `fault` is an exception reported before the execution of a "faulty" instruction (which can then be corrected). If corrected, it allows the interrupted program to be resumed.

Next, a `trap` is an exception which is reported immediately following the execution of the `trap` instruction. Traps also allow the interrupted program to be continued just as a `fault` does.

Finally, an `abort` is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed.

Also we already know from the previous part that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`.
We can find the kernel's wrappers around these instructions in the Linux kernel source code:

```C
static inline void native_irq_disable(void)
{
	asm volatile("cli": : :"memory");
}
```

and

```C
static inline void native_irq_enable(void)
{
	asm volatile("sti": : :"memory");
}
```

These two instructions modify the `IF` flag bit within the `RFLAGS` register. The `sti` instruction sets the `IF` flag and the `cli` instruction clears this flag. Non-maskable interrupts are always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts.

If multiple exceptions or interrupts occur at the same time, the processor handles them in order of their predefined priorities. We can see the priorities from the highest to the lowest in the following table:

```
+----------+---------------------------------------------+
| Priority | Description                                 |
+----------+---------------------------------------------+
| 1        | Hardware Reset and Machine Checks           |
|          |  - RESET                                    |
|          |  - Machine Check                            |
+----------+---------------------------------------------+
| 2        | Trap on Task Switch                         |
|          |  - T flag in TSS is set                     |
+----------+---------------------------------------------+
| 3        | External Hardware Interventions             |
|          |  - FLUSH                                    |
|          |  - STOPCLK                                  |
|          |  - SMI                                      |
|          |  - INIT                                     |
+----------+---------------------------------------------+
| 4        | Traps on the Previous Instruction           |
|          |  - Breakpoints                              |
|          |  - Debug Trap Exceptions                    |
+----------+---------------------------------------------+
| 5        | Nonmaskable Interrupts                      |
+----------+---------------------------------------------+
| 6        | Maskable Hardware Interrupts                |
+----------+---------------------------------------------+
| 7        | Code Breakpoint Fault                       |
+----------+---------------------------------------------+
| 8        | Faults from Fetching Next Instruction       |
|          |  - Code-Segment Limit Violation             |
|          |  - Code Page Fault                          |
+----------+---------------------------------------------+
| 9        | Faults from Decoding the Next Instruction   |
|          |  - Instruction length > 15 bytes            |
|          |  - Invalid Opcode                           |
|          |  - Coprocessor Not Available                |
+----------+---------------------------------------------+
| 10       | Faults on Executing an Instruction          |
|          |  - Overflow                                 |
|          |  - Bound error                              |
|          |  - Invalid TSS                              |
|          |  - Segment Not Present                      |
|          |  - Stack fault                              |
|          |  - General Protection                       |
|          |  - Data Page Fault                          |
|          |  - Alignment Check                          |
|          |  - x87 FPU Floating-point exception         |
|          |  - SIMD floating-point exception            |
|          |  - Virtualization exception                 |
+----------+---------------------------------------------+
```

Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupt and exception handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the Kernel booting process, but of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates:

* Interrupt gates
* Task gates
* Trap gates

in the `x86` architecture. Only long mode interrupt gates and trap gates can be referenced in the `x86_64` architecture. Like the `Global Descriptor Table`, the `Interrupt Descriptor Table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the Kernel booting process that the `Global Descriptor Table` must contain a `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate as its first element; it is not mandatory. For example, you may remember that we loaded the Interrupt Descriptor Table with `NULL` gates only in the earlier part while transitioning into protected mode:

```C
/*
 * Set up the IDT
 */
static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}
```

from arch/x86/boot/pm.c. The `Interrupt Descriptor Table` can be located anywhere in the linear address space and its base address must be aligned on an 8-byte boundary on `x86` or a 16-byte boundary on `x86_64`.
The base address of the `IDT` is stored in a special register - `IDTR`. There are two instructions on `x86`-compatible processors to modify the `IDTR` register:

* `LIDT`
* `SIDT`

The first instruction, `LIDT`, is used to load the base address of the `IDT`, i.e., the specified operand, into the `IDTR`. The second instruction, `SIDT`, is used to read and store the contents of the `IDTR` into the specified operand. The `IDTR` register is 48 bits on `x86` and contains the following information:

```
+-----------------------------------+----------------------+
|                                   |                      |
|      Base address of the IDT      |   Limit of the IDT   |
|                                   |                      |
+-----------------------------------+----------------------+
47                               16 15                    0
```

Looking at the implementation of `setup_idt`, we have prepared a `null_idt` and loaded it to the `IDTR` register with the `lidt` instruction. Note that `null_idt` has the `gdt_ptr` type which is defined as:

```C
struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));
```

Here we can see the definition of the structure with two fields of 2 bytes and 4 bytes each (a total of 48 bits), as in the diagram. Now let's look at the structure of the `IDT` entries. The `IDT` on `x86_64` is an array of 16-byte entries which are called gates.
They have the following structure:

```
127                                                                            96
+------------------------------------------------------------------------------+
|                                                                              |
|                                  Reserved                                    |
|                                                                              |
+------------------------------------------------------------------------------+
95                                                                             64
+------------------------------------------------------------------------------+
|                                                                              |
|                               Offset 63..32                                  |
|                                                                              |
+------------------------------------------------------------------------------+
63                              48 47 46 44      39         34    32
+------------------------------------------------------------------------------+
|                                |   | D |   |        |   |       |            |
|         Offset 31..16          | P | P | 0 |  Type  | 0 | 0 0 0 | 0 0 | IST  |
|                                |   | L |   |        |   |       |            |
+------------------------------------------------------------------------------+
31                              16 15                                          0
+------------------------------------------------------------------------------+
|                                |                                             |
|        Segment Selector        |                Offset 15..0                 |
|                                |                                             |
+------------------------------------------------------------------------------+
```

To form an index into the `IDT`, the processor scales the exception or interrupt vector number by sixteen. The processor handles the occurrence of exceptions and interrupts just like it handles calls of a procedure when it sees the `call` instruction: it uses the unique number or `vector number` of the interrupt or the exception as the index to find the necessary `Interrupt Descriptor Table` entry. Now let's take a closer look at an `IDT` entry.

As we can see, an `IDT` entry on the diagram consists of the following fields:

* `0-15` bits - the first part of the offset of the interrupt handler's entry point within its code segment;
* `16-31` bits - the selector of the code segment which contains the entry point of the interrupt handler;
* `IST` - a new special mechanism in `x86_64`; we will see it later;
* `DPL` - Descriptor Privilege Level;
* `P` - Segment Present flag;
* `48-63` bits - the second part of the handler's offset;
* `64-95` bits - the third part of the handler's offset;
* `96-127` bits - the last bits are reserved by the CPU.

And the `Type` field describes the type of the `IDT` entry.
There are three different kinds of handlers for interrupts:

* Interrupt gate
* Trap gate
* Task gate

The `IST` or `Interrupt Stack Table` is a new mechanism in `x86_64`. It is used as an alternative to the legacy stack-switch mechanism. Previously the `x86` architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The `IST` is a modified version of the `x86` stack switching mode. This mechanism unconditionally switches stacks when it is enabled and can be enabled for any interrupt in the `IDT` entry related to a certain interrupt (we will soon see it). From this we can understand that the `IST` is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The `IST` mechanism provides up to seven `IST` pointers in the Task State Segment or `TSS`, which is the special structure which contains information about a process. The `TSS` is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the `IDT`.

The `Interrupt Descriptor Table` is represented by an array of the `gate_desc` structures:

```C
extern gate_desc idt_table[];
```

where `gate_desc` is:

```C
#ifdef CONFIG_X86_64
...
typedef struct gate_struct64 gate_desc;
...
#endif
```

and `gate_struct64` is defined as:

```C
struct gate_struct64 {
	u16 offset_low;
	u16 segment;
	unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
	u16 offset_middle;
	u32 offset_high;
	u32 zero1;
} __attribute__((packed));
```

Each active thread has a large stack in the Linux kernel for the `x86_64` architecture. The stack size is defined as `THREAD_SIZE` and is equal to:

```C
#define PAGE_SHIFT      12
#define PAGE_SIZE       (_AC(1,UL) << PAGE_SHIFT)
...
#define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
```

The `PAGE_SIZE` is `4096` bytes and the `THREAD_SIZE_ORDER` depends on `KASAN_STACK_ORDER`.
As we can see, the `KASAN_STACK_ORDER` depends on the `CONFIG_KASAN` kernel configuration parameter and is defined as:

```C
#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif
```

`KASan` is a runtime memory debugger. Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` bytes if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth part of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:

```C
#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
```

or `16384` bytes. The per-cpu interrupt stack is represented by the `irq_stack_union` union in the Linux kernel for `x86_64`:

```C
union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};
```

The first field, `irq_stack`, is a 16 kilobytes array. Also you can see that `irq_stack_union` contains a structure with two fields:

* `gs_base` - on `x86_64`, the `gs` register always points to the bottom of the `irq_stack_union`; it is shared by the per-cpu area and the stack canary (more about `per-cpu` variables you can read in the special part). All per-cpu symbols are zero based and `gs` points to the base of the per-cpu area.
You already know that the segmented memory model is abolished in long mode, but we can set the base address for the two segment registers - `fs` and `gs` - with the Model specific registers, and these registers can still be used as address registers. If you remember the first part of the Linux kernel initialization process, you can remember that we have set the `gs` register:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
```

where `initial_gs` points to the `irq_stack_union`:

```assembly
GLOBAL(initial_gs)
	.quad	INIT_PER_CPU_VAR(irq_stack_union)
```

* `stack_canary` - the stack canary for the interrupt stack is a `stack protector` value used to verify that the stack hasn't been overwritten. Note that `gs_base` is a 40-byte array: `GCC` requires that the stack canary be at a fixed offset from the base of `gs`, and its value must be `40` for `x86_64` and `20` for `x86`.

The `irq_stack_union` is the first datum in the `per-cpu` area; we can see it in the `System.map`:

```
0000000000000000 D __per_cpu_start
0000000000000000 D irq_stack_union
0000000000004000 d exception_stacks
0000000000009000 D gdt_page
...
```

We can see its definition in the code:

```C
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
```

Now, it's time to look at the initialization of the `irq_stack_union`. Besides the `irq_stack_union` definition, we can see the definitions of the following per-cpu variables in arch/x86/include/asm/processor.h:

```C
DECLARE_PER_CPU(char *, irq_stack_ptr);
DECLARE_PER_CPU(unsigned int, irq_count);
```

The first is `irq_stack_ptr`. From the variable's name, it is obvious that this is a pointer to the top of the stack. The second, `irq_count`, is used to check if a CPU is already on an interrupt stack or not.
Initialization of the `irq_stack_ptr` is located in the `setup_per_cpu_areas` function in arch/x86/kernel/setup_percpu.c:

```C
void __init setup_per_cpu_areas(void)
{
	...
#ifdef CONFIG_X86_64
	for_each_possible_cpu(cpu) {
		...
		per_cpu(irq_stack_ptr, cpu) =
			per_cpu(irq_stack_union.irq_stack, cpu) +
			IRQ_STACK_SIZE - 64;
		...
	}
#endif
	...
}
```

Here we go over all the CPUs one-by-one and set `irq_stack_ptr` equal to the top of the interrupt stack minus `64`. Why `64`? TODO. The implementation of the `load_percpu_segment` function from the arch/x86/kernel/cpu/common.c source code file is the following:

```C
void load_percpu_segment(int cpu)
{
	...
	loadsegment(gs, 0);
	wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
}
```

and as we already know, the `gs` register points to the bottom of the interrupt stack:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr

GLOBAL(initial_gs)
	.quad	INIT_PER_CPU_VAR(irq_stack_union)
```

Here we can see the `wrmsr` instruction which loads the data from the `edx:eax` registers into the model specific register pointed to by the `ecx` register. In our case the model specific register is `MSR_GS_BASE` which contains the base address of the memory segment pointed to by the `gs` register. `edx:eax` points to the address of `initial_gs` which is the base address of our `irq_stack_union`.

We already know that `x86_64` has a feature called `Interrupt Stack Table` or `IST`, and this feature provides the ability to switch to a new stack for events like a non-maskable interrupt, double fault, etc. There can be up to seven `IST` entries per-cpu. Some of them are:

* `DOUBLEFAULT_STACK`
* `NMI_STACK`
* `DEBUG_STACK`
* `MCE_STACK`

or

```C
#define DOUBLEFAULT_STACK 1
#define NMI_STACK 2
#define DEBUG_STACK 3
#define MCE_STACK 4
```

All interrupt-gate descriptors which switch to a new stack with the `IST` are initialized with the `set_intr_gate_ist` function.
For example:

```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
...
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
```

where `&nmi` and `&double_fault` are addresses of the entry points of the given interrupt handlers:

```C
asmlinkage void nmi(void);
asmlinkage void double_fault(void);
```

defined in arch/x86/kernel/entry_64.S:

```assembly
idtentry double_fault do_double_fault has_error_code=1 paranoid=2
...
ENTRY(nmi)
...
END(nmi)
```

When an interrupt or an exception occurs, the new `ss` selector is forced to `NULL` and the `ss` selector's `rpl` field is set to the new `cpl`. The old `ss`, `rsp`, register flags, `cs` and `rip` are pushed onto the new stack. In 64-bit mode, the size of interrupt stack-frame pushes is fixed at 8 bytes, so we will get the following stack:

```
+---------------+
|               |
|      SS       | 40
|      RSP      | 32
|     RFLAGS    | 24
|      CS       | 16
|      RIP      | 8
|  Error code   | 0
|               |
+---------------+
```

If the `IST` field in the interrupt gate is not `0`, we read the `IST` pointer into `rsp`. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we push a dummy error code instead. We need to do this to ensure stack consistency. Next, we load the segment-selector field from the gate descriptor into the `CS` register and must verify that the target code segment is a 64-bit mode code segment by checking bit `21`, i.e. the `L` bit, in the `Global Descriptor Table` descriptor. Finally we load the offset field from the gate descriptor into `rip`, which will be the entry point of the interrupt handler. After this the interrupt handler begins to execute, and when the interrupt handler finishes its execution, it must return control to the interrupted process with the `iret` instruction. The `iret` instruction unconditionally pops the stack pointer (`ss:rsp`) to restore the stack of the interrupted process and does not depend on the `cpl` change.

That's all.

Conclusion

It is the end of the first part of `Interrupts and Interrupt Handling` in the Linux kernel.
We covered some theory and the first steps of initialization of things related to interrupts and exceptions. In the next part we will continue to dive into the more practical aspects of interrupts and interrupt handling.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* PIC
* Advanced Programmable Interrupt Controller
* protected mode
* long mode
* kernel stacks
* Task State Segment
* segmented memory model
* Model specific registers
* Stack canary
* Previous chapter

Interrupts and Interrupt Handling. Part 2.

Start to dive into interrupt and exceptions handling in the Linux kernel

We saw some theory about interrupt and exception handling in the previous part and, as I already wrote there, in this part we will start to dive into interrupts and exceptions in the Linux kernel source code. The previous part mostly described theoretical aspects; in this part we will start to dive directly into the Linux kernel source code. We will do it as we did in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest code lines as we saw, for example, in the Linux kernel booting process chapter, but we will start from the earliest code which is related to the interrupts and exceptions. In this part we will try to go through all the interrupt- and exception-related stuff which we can find in the Linux kernel source code.

If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code which is related to the interrupts is located in the arch/x86/boot/pm.c source code file and represents the first setup of the Interrupt Descriptor Table.
It occurs right before the transition into protected mode in the `go_to_protected_mode` function, by the call of `setup_idt`:

```C
void go_to_protected_mode(void)
{
	...
	setup_idt();
	...
}
```

The `setup_idt` function is defined in the same source code file as the `go_to_protected_mode` function and just loads the address of the `NULL` interrupt descriptor table:

```C
static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}
```

where `gdt_ptr` represents a special 48-bit `GDTR` register which must contain the base address of the `Global Descriptor Table`:

```C
struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));
```

Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but `IDTR`, since we set an `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it existed in the Linux kernel source code, it would be the same as `gdt_ptr` but with a different name. So, as you can understand, there is no sense in having two similar structures which differ only in name. You can note here that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`.

After the setup of the Interrupt Descriptor Table, the Global Descriptor Table and other stuff, we jump into protected mode in arch/x86/boot/pmjump.S.
You can read more about it in the part which describes the transition to protected mode.

We already know from the earliest parts that the entry to protected mode is located in `boot_params.hdr.code32_start`, and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` at the end of arch/x86/boot/pm.c:

```C
protected_mode_jump(boot_params.hdr.code32_start,
		    (u32)&boot_params + (ds() << 4));
```

The `protected_mode_jump` is defined in arch/x86/boot/pmjump.S and gets these two parameters in the `ax` and `dx` registers using one of the 8086 calling conventions:

```assembly
GLOBAL(protected_mode_jump)
	...
2:	.byte	0x66, 0xea	# ljmpl opcode
	.long	in_pm32		# offset
	.word	__BOOT_CS	# segment
	...
ENDPROC(protected_mode_jump)
```

where `in_pm32` contains a jump to the 32-bit entry point:

```assembly
GLOBAL(in_pm32)
	...
	jmpl	*%eax // %eax contains address of the startup_32
	...
ENDPROC(in_pm32)
```

As you can remember, the 32-bit entry point is in the arch/x86/boot/compressed/head_64.S assembly file, although it contains `_64` in its name. We can see two similar files in the `arch/x86/boot/compressed` directory:

* arch/x86/boot/compressed/head_32.S
* arch/x86/boot/compressed/head_64.S

But the 32-bit mode entry point is in the second file in our case. The first file is not even compiled for `x86_64`. Let's look at the arch/x86/boot/compressed/Makefile:

```
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	...
```

We can see here that `head_*` depends on the `$(BITS)` variable which depends on the architecture. You can find it in arch/x86/Makefile:

```
ifeq ($(CONFIG_X86_32),y)
	...
	BITS := 32
else
	BITS := 64
	...
endif
```

Now that we have jumped to `startup_32` from arch/x86/boot/compressed/head_64.S, we will not find anything related to interrupt handling here. The `startup_32` contains code that makes preparations before the transition into long mode and jumps directly into it.
The `long mode` entry is located in `startup_64` and it makes preparations before the kernel decompression that occurs in the `decompress_kernel` function from arch/x86/boot/compressed/misc.c. After the kernel is decompressed, we jump to the `startup_64` from arch/x86/kernel/head_64.S. In this `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked the NX bit, set up the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to set up the `gs` register with the following code:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
```

We already saw this code in the previous part. First of all pay attention to the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the model specific register specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE`, which is declared in arch/x86/include/uapi/asm/msr-index.h and looks like:

```C
#define MSR_GS_BASE		0xc0000101
```

From this we can understand that `MSR_GS_BASE` defines the number of the model specific register. Since the `cs`, `ds`, `es`, and `ss` segment registers are not used in 64-bit mode, their fields are ignored. But we can access memory via the `fs` and `gs` registers. The model specific register provides a `back door` to the hidden parts of these segment registers and allows using a 64-bit base address for the segment register addressed by `fs` or `gs`. So `MSR_GS_BASE` is the hidden part and it is mapped to the `GS.base` field. Let's look at `initial_gs`:

```assembly
GLOBAL(initial_gs)
	.quad	INIT_PER_CPU_VAR(irq_stack_union)
```

We pass the `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates the `init_per_cpu__` prefix with the given symbol. In our case we will get the `init_per_cpu__irq_stack_union` symbol. Let's look at the linker script.
There we can see the following definition:

```
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);
```

It tells us that the address of `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `irq_stack_union` and `__per_cpu_load` are and what they mean. The first, `irq_stack_union`, is defined in arch/x86/include/asm/processor.h with the `DECLARE_INIT_PER_CPU` macro which expands to a call of the `init_per_cpu_var` macro:

```C
DECLARE_INIT_PER_CPU(irq_stack_union);

#define DECLARE_INIT_PER_CPU(var) \
	extern typeof(per_cpu_var(var)) init_per_cpu_var(var)

#define init_per_cpu_var(var)	init_per_cpu__##var
```

If we expand all macros we will get the same `init_per_cpu__irq_stack_union` as we got after expanding the `INIT_PER_CPU` macro, but you can note that it is not just a symbol, but a variable. Let's look at the `typeof(per_cpu_var(var))` expression. Our `var` is `irq_stack_union` and the `per_cpu_var` macro is defined in arch/x86/include/asm/percpu.h:

```C
#define PER_CPU_VAR(var)	%__percpu_seg:var
```

where `__percpu_seg` is:

```C
#ifdef CONFIG_X86_64
#define __percpu_seg gs
#endif
```

So, we are accessing `gs:irq_stack_union` and getting its type. Ok, we defined the first variable and know its address; now let's look at the second symbol, `__per_cpu_load`. There are a couple of `per-cpu` variables which are located after this symbol. The `__per_cpu_load` is defined in include/asm-generic/sections.h:

```C
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
```

and presents the base address of the `per-cpu` variables data area. So, we know the addresses of `irq_stack_union` and `__per_cpu_load`, and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`.
And we can see it in the System.map:

```
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
```

Now we know about `initial_gs`, so let's look at the code again:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
```

Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of `initial_gs` into the `edx:eax` pair and executed the `wrmsr` instruction for filling the `gs` register with the base address of `init_per_cpu__irq_stack_union`, which will be at the bottom of the interrupt stack. After this we will jump to the C code - the `x86_64_start_kernel` function from arch/x86/kernel/head64.c. In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupt handler entries, or `early_idt_handlers`. You can remember it if you have read the part about Early interrupt and exception handling, and can remember the following code:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handlers[i]);

load_idt((const struct desc_ptr *)&idt_descr);
```

but I wrote the `Early interrupt and exception handling` part when the Linux kernel version was `3.18`. As of this writing, the actual version of the Linux kernel is `4.1.0-rc6+`, and `Andy Lutomirski` sent a patch which changes the behaviour of the `early_idt_handlers`. NOTE: while I was writing this part, the patch already landed in the Linux kernel source code. Let's look at it. Now the same part looks like:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);

load_idt((const struct desc_ptr *)&idt_descr);
```

As you can see, it has only one difference - the name of the array of the interrupt handler entry points.
Now it is `early_idt_handler_array`:

```C
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
```

where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as:

```C
#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9
```

So, the `early_idt_handler_array` is an array of the interrupt handler entry points and contains one entry point every nine bytes. You can remember that the previous `early_idt_handlers` was defined in arch/x86/kernel/head_64.S. The `early_idt_handler_array` is defined in the same source code file too:

```assembly
ENTRY(early_idt_handler_array)
...
ENDPROC(early_idt_handler_common)
```

It fills `early_idt_handler_array` with the `.rept NUM_EXCEPTION_VECTORS` directive and contains the entry of the `early_make_pgtable` interrupt handler (more about its implementation you can read in the part about Early interrupt and exception handling). For now we have come to the end of the `x86_64` architecture-specific code, and the next part is the generic kernel code. Of course you already know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.

Setting stack canary for the interrupt stack

The next stop after arch/x86/kernel/head_64.S is the biggest `start_kernel` function from init/main.c. If you've read the previous chapter about the Linux kernel initialization process, you must remember it. This function does all the initialization stuff before the kernel launches the first `init` process with pid `1`. The first thing that is related to the interrupt and exception handling is the call of the `boot_init_stack_canary` function. This function sets the canary value to protect the interrupt stack from overflow. We already saw some details about the implementation of `boot_init_stack_canary` in the previous part, and now let's take a closer look at it.
You can find the implementation of this function in arch/x86/include/asm/stackprotector.h and it depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything:

```C
#ifdef CONFIG_CC_STACKPROTECTOR
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif
```

If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check that the `stack_canary` of the `irq_stack_union` per-cpu interrupt stack has an offset equal to forty bytes from its base:

```C
#ifdef CONFIG_X86_64
	BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
```

As we can read in the previous part, the `irq_stack_union` is represented by the following union:

```C
union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};
```

which is defined in arch/x86/include/asm/processor.h. We know that a union in the C programming language is a data structure in which all fields share one area of memory. We can see here that the structure's first field, `gs_base`, is 40 bytes in size and represents the bottom of `irq_stack`. So, our check with the `BUILD_BUG_ON` macro should end successfully. (You can read the first part about the Linux kernel initialization process if you are interested in the `BUILD_BUG_ON` macro.)

After this we calculate a new `canary` value based on a random number and the Time Stamp Counter:

```C
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
```

and write it to the `stack_canary` field of the `irq_stack_union` with the `this_cpu_write` macro. A little later, the `early_trap_init` function sets the first interrupt gates in the `IDT` with the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions. For example:

```C
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
	BUG_ON((unsigned)n > 0xFF);
	_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}
```

First of all we can see the check that `n`, which is the vector number of the interrupt, is not greater than `0xFF` or 255. We need to check it because we remember from the previous part that the vector number of an interrupt must be between `0` and `255`.
In the next step we can see the call of the `_set_gate` function, which sets the given interrupt gate in the `IDT` table:

```C
static inline void _set_gate(int gate, unsigned type, void *addr,
	                     unsigned dpl, unsigned ist, unsigned seg)
{
	gate_desc s;

	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
	write_idt_entry(idt_table, gate, &s);
	write_trace_idt_entry(gate, &s);
}
```

Here we start with the `pack_gate` function, which takes a clean `IDT` entry represented by the `gate_desc` structure and fills it with the base address, the Interrupt Stack Table index, the privilege level, the type of the interrupt, which can be one of the following values:

* `GATE_INTERRUPT`
* `GATE_TRAP`
* `GATE_CALL`
* `GATE_TASK`

and sets the present bit for the given `IDT` entry:

```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
			     unsigned dpl, unsigned ist, unsigned seg)
{
	gate->offset_low	= PTR_LOW(func);
	gate->segment		= __KERNEL_CS;
	gate->ist		= ist;
	gate->p			= 1;
	gate->dpl		= dpl;
	gate->zero0		= 0;
	gate->zero1		= 0;
	gate->type		= type;
	gate->offset_middle	= PTR_MIDDLE(func);
	gate->offset_high	= PTR_HIGH(func);
}
```

After this we write the just-filled interrupt gate to the `IDT` with the `write_idt_entry` macro, which expands to `native_write_idt_entry` and simply copies the interrupt gate into the `idt_table` at the given index:

```C
#define write_idt_entry(dt, entry, g)	native_write_idt_entry(dt, entry, g)

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
	memcpy(&idt[entry], gate, sizeof(*gate));
}
```

where `idt_table` is just an array of `gate_desc`:

```C
extern gate_desc idt_table[];
```

That's all. The second function, `set_system_intr_gate_ist`, has only one difference from `set_intr_gate_ist`:

```C
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
	BUG_ON((unsigned)n > 0xFF);
	_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}
```

Do you see it? Look at the fourth parameter of `_set_gate`: in `set_intr_gate_ist` it was `0x0`, and here it is `0x3`.
We know that this fourth parameter of `_set_gate` represents the `DPL`, or privilege level, where `0` is the highest privilege level and `3` is the lowest. Now we know how the `_set_gate`, `set_intr_gate_ist`, `set_system_intr_gate_ist` and `set_intr_gate` functions work, and we can return to the `early_trap_init` function. Let's look at it again:

```C
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
```

We set two `IDT` entries for the `#DB` interrupt and `int3`. These functions take the same set of parameters:

* vector number of an interrupt;
* address of an interrupt handler;
* interrupt stack table index.

That's all. You will learn more about interrupts and handlers in the next parts.

Conclusion

This is the end of the second part about interrupts and interrupt handling in the Linux kernel. We covered some theory in the previous part and started to dive into interrupt and exception handling in this part, starting from the earliest parts of the Linux kernel source code related to interrupts. In the next part we will continue to dive into this interesting theme and will learn more about the interrupt handling process.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

* IDT
* Protected mode
* List of x86 calling conventions
* 8086
* Long mode
* NX
* Extended Feature Enable Register
* Model-specific register
* Process identifier
* lockdep
* irqflags tracing
* IF
* Stack canary
* Union type
* this_cpu_* operations
* vector number
* Interrupt Stack Table
* Privilege level
* Previous part

Interrupts and Interrupt Handling.
Part 3.

Exception Handling

This is the third part of the chapter about interrupt and exception handling in the Linux kernel, and in the previous part we stopped at the `setup_arch` function from the arch/x86/kernel/setup.c source code file.

We already know that this function performs initialization of architecture-specific things; in our case `setup_arch` does x86_64-related initialization. `setup_arch` is a big function, and in the previous part we stopped at the setting of two exception handlers for the two following exceptions:

* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
* `#BP` - breakpoint exception, caused by the `int 3` instruction.

These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the kgdb.

As you may remember, we set these exception handlers in the `early_trap_init` function:

```C
void __init early_trap_init(void)
{
	set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
	set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
	load_idt(&idt_descr);
}
```

from arch/x86/kernel/traps.c. We already saw the implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part, and now we will look at the implementation of these two exception handlers.

Debug and Breakpoint exceptions

Ok, we set up exception handlers in the `early_trap_init` function for the `#DB` and `#BP` exceptions, and now it is time to consider their implementations. But before we do this, first of all let's look at the details of these exceptions.

The first exception, `#DB` or `debug`, occurs when a debug event happens, for example an attempt to change the contents of a debug register.
Debug registers are special registers that have been present in `x86` processors starting with the Intel 80386, and as you can understand from the name of this CPU extension, the main purpose of these registers is debugging.

These registers allow setting breakpoints on code and reading or writing data to trace it. Debug registers may be accessed only in privileged mode, and an attempt to read or write them when executing at any other privilege level causes a general protection fault exception. That's why we used `set_intr_gate_ist` for the `#DB` exception, and not `set_system_intr_gate_ist`.

The vector number of the `#DB` exception is `1` (we pass it as `X86_TRAP_DB`) and, as we may read in the specification, this exception has no error code:

```
+------------------------------------------------------+
|Vector|Mnemonic|Description         |Type |Error Code |
+------------------------------------------------------+
|1     | #DB    |Reserved            |F/T  |NO         |
+------------------------------------------------------+
```

The second exception is `#BP` or `breakpoint`, which occurs when the processor executes the int 3 instruction. Unlike the `#DB` exception, the `#BP` exception may occur in userspace. We can add it anywhere in our code; for example, let's look at this simple program:

```C
// breakpoint.c
#include <stdio.h>
#include <unistd.h>

int main() {
    int i;
    while (i < 6) {
	    printf("i equal to: %d\n", i);
	    __asm__("int3");
	    ++i;
    }
}
```

If we compile this program and run it under gdb, we will stop at each `int3` with a `SIGTRAP` and can continue past it:

```
(gdb) run
...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01  add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01  add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2
...
...
...
```

From this moment we know a little about these two exceptions, and we can move on to the consideration of their handlers.

Preparation before an exception handler

As you may have noted before, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions take the addresses of exception handlers as their second parameter.
In our case our two exception handlers will be:

* `debug`;
* `int3`.

You will not find these functions in the C code; all you will find in the kernel's `*.c`/`*.h` files is only the declaration of these functions, located in the arch/x86/include/asm/traps.h kernel header file:

```C
asmlinkage void debug(void);
```

and

```C
asmlinkage void int3(void);
```

You may note the `asmlinkage` directive in the declarations of these functions. This directive is a special specifier of gcc. For `C` functions which are called from assembly, we need an explicit declaration of the function calling convention. In our case, if a function is declared with the `asmlinkage` descriptor, then `gcc` compiles the function to retrieve its parameters from the stack.

So, both handlers are defined in the arch/x86/entry/entry_64.S assembly source code file with the `idtentry` macro:

```assembly
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```

and

```assembly
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```

Each exception handler may consist of two parts. The first part is the generic part, the same for all exception handlers: the handler should save the general purpose registers on the stack, switch to the kernel stack if the exception came from userspace, and transfer control to the second part. The second part does the work specific to the particular exception. For example, the page fault exception handler should find the virtual page for the given address, the invalid opcode exception handler should send a `SIGILL` signal, and so on.

As we just saw, an exception handler starts from the definition of the `idtentry` macro from the arch/x86/entry/entry_64.S assembly source code file, so let's look at the implementation of this macro.
As we may see, the `idtentry` macro takes five arguments:

* `sym` - defines a global symbol with `.globl name` which will be the entry point of the exception handler;
* `do_sym` - symbol name which represents the secondary entry point of the exception handler;
* `has_error_code` - information about the existence of an error code for the exception.

The last two parameters are optional:

* `paranoid` - shows how we need to check which mode we came from (we will see the explanation in detail later);
* `shift_ist` - shows whether the exception runs on an `Interrupt Stack Table` stack.

The definition of the `idtentry` macro looks like:

```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
```

Before we consider the internals of the `idtentry` macro, we should know the state of the stack when an exception occurs. As we may read in the Intel® 64 and IA-32 Architectures Software Developer's Manual 3A, the state of the stack when an exception occurs is the following:

```
    +------------+
+40 | %SS        |
+32 | %RSP       |
+24 | %RFLAGS    |
+16 | %CS        |
 +8 | %RIP       |
  0 | ERROR CODE | <-- %RSP
    +------------+
```

Now we may start to consider the implementation of the `idtentry` macro. Both the `#DB` and `#BP` exception handlers are defined as:

```assembly
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```

Looking at these definitions, we know that the compiler will generate two routines with the names `debug` and `int3`, and both of these exception handlers will call the `do_debug` and `do_int3` secondary handlers after some preparation. The third parameter defines the existence of an error code, and as we can see, neither of our exceptions has one. As the diagram above shows, the processor pushes an error code on the stack only if an exception provides one. In our case, the `debug` and `int3` exceptions do not have error codes. This may bring some difficulties, because the stack will look different for exceptions which provide an error code and for those which do not.
That's why the implementation of the `idtentry` macro starts by putting a fake error code on the stack if the exception does not provide one:

```assembly
.ifeq \has_error_code
    pushq	$-1
.endif
```

But it is not only a fake error code: `-1` also represents an invalid system call number, so that the system call restart logic will not be triggered.

The last two parameters of the `idtentry` macro, `shift_ist` and `paranoid`, allow us to know whether the exception handler runs on a stack from the `Interrupt Stack Table` or not. You may already know that each kernel thread in the system has its own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of them is the exception stack. The x86_64 architecture provides a special feature called the `Interrupt Stack Table`, which allows switching to a new stack for designated events such as atomic exceptions like a `double fault`, etc. So the `shift_ist` parameter tells us whether we need to switch to an `IST` stack for the exception handler or not.

The second parameter, `paranoid`, defines the method which helps us to know whether we came to the exception handler from userspace or not. The easiest way to determine this is via the `CPL` or `Current Privilege Level` in the `CS` segment register. If it is equal to `3`, we came from userspace; if zero, we came from kernel space:

```assembly
testl $3,CS(%rsp)
jnz userspace
...
...
...
// we are from the kernel space
```

But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:

> if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, which might have triggered right after a normal entry wrote CS to the stack but before we executed SWAPGS, then the only safe way to check for GS is the slower method: the RDMSR.

In other words, for example, an `NMI` could happen inside the critical section of the swapgs instruction. In this case we should check the value of the `MSR_GS_BASE` model specific register, which stores a pointer to the start of the per-cpu area.
So, to check whether we came from userspace or not, we should check the value of the `MSR_GS_BASE` model specific register: if it is negative, we came from kernel space, otherwise we came from userspace:

```assembly
movl	$MSR_GS_BASE,%ecx
rdmsr
testl	%edx,%edx
js	1f
```

In the first two lines of code we read the value of the `MSR_GS_BASE` model specific register into the `edx:eax` pair. We can't set a negative value for `gs` from userspace. On the other hand, we know that the direct mapping of physical memory starts at the `0xffff880000000000` virtual address, so `MSR_GS_BASE` will contain an address between `0xffff880000000000` and `0xffffc7ffffffffff`, which is kernel space. After the `rdmsr` instruction is executed, the smallest possible value in the `%edx` register will be `0xffff8800`, which is `-30720` as a signed 4-byte value. That's why for the per-cpu area the `%edx` register will contain a negative value.

After we have pushed the fake error code on the stack, we should allocate space for the general purpose registers with the `ALLOC_PT_GPREGS_ON_STACK` macro, which is defined in the arch/x86/entry/calling.h header file.
The `ALLOC_PT_GPREGS_ON_STACK` macro just allocates 15*8 bytes of space on the stack to preserve the general purpose registers:

```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
	addq	$-(15*8+\addskip), %rsp
.endm
```

So the stack will look like this after the execution of `ALLOC_PT_GPREGS_ON_STACK`:

```
     +------------+
+160 | %SS        |
+152 | %RSP       |
+144 | %RFLAGS    |
+136 | %CS        |
+128 | %RIP       |
+120 | ERROR CODE |
     |------------|
+112 |            |
+104 |            |
 +96 |            |
 +88 |            |
 +80 |            |
 +72 |            |
 +64 |            |
 +56 |            |
 +48 |            |
 +40 |            |
 +32 |            |
 +24 |            |
 +16 |            |
  +8 |            |
  +0 |            | <- %RSP
     +------------+
```

After we have allocated space for the general purpose registers, we do some checks to understand whether the exception came from userspace or not, and if yes, we should either move back to the interrupted process's stack or stay on the exception stack:

```assembly
.if \paranoid
	.if \paranoid == 1
		testb	$3, CS(%rsp)
		jnz	1f
	.endif
	call	paranoid_entry
.else
	call	error_entry
.endif
```

Let's consider each of these three cases in turn.

An exception occurred in userspace

First let's consider the case when an exception has `paranoid=1`, like our `debug` and `int3` exceptions. In this case we check the selector in the `CS` segment register and jump to the `1f` label if we came from userspace; otherwise `paranoid_entry` is called.

Let's consider the first case, when we came from userspace to an exception handler. As described above, we should jump to the `1` label.
The `1` label starts with the call of the `error_entry` routine, which saves all general purpose registers in the previously allocated area on the stack:

```assembly
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
```

Both of these macros are defined in the arch/x86/entry/calling.h header file and just move the values of the general purpose registers to certain places on the stack, for example:

```assembly
.macro SAVE_EXTRA_REGS offset=0
	movq %r15, 0*8+\offset(%rsp)
	movq %r14, 1*8+\offset(%rsp)
	movq %r13, 2*8+\offset(%rsp)
	movq %r12, 3*8+\offset(%rsp)
	movq %rbp, 4*8+\offset(%rsp)
	movq %rbx, 5*8+\offset(%rsp)
.endm
```

After the execution of `SAVE_C_REGS` and `SAVE_EXTRA_REGS` the stack will look like this:

```
     +------------+
+160 | %SS        |
+152 | %RSP       |
+144 | %RFLAGS    |
+136 | %CS        |
+128 | %RIP       |
+120 | ERROR CODE |
     |------------|
+112 | %RDI       |
+104 | %RSI       |
 +96 | %RDX       |
 +88 | %RCX       |
 +80 | %RAX       |
 +72 | %R8        |
 +64 | %R9        |
 +56 | %R10       |
 +48 | %R11       |
 +40 | %RBX       |
 +32 | %RBP       |
 +24 | %R12       |
 +16 | %R13       |
  +8 | %R14       |
  +0 | %R15       | <- %RSP
     +------------+
```

As we came from userspace, the exception handler will run in the real process context. After we get the stack pointer from `sync_regs`, we switch stacks:

```assembly
movq	%rax, %rsp
```

The last two steps before an exception handler calls the secondary handler are:

1. Passing a pointer to the `pt_regs` structure, which contains the preserved general purpose registers, in the `%rdi` register:

```assembly
movq	%rsp, %rdi
```

as it will be passed as the first parameter of the secondary exception handler.
2. Passing the error code in the `%rsi` register, as it will be the second argument of the exception handler, and setting it to `-1` on the stack for the same purpose as before: to prevent the restart of a system call:

```assembly
.if \has_error_code
	movq	ORIG_RAX(%rsp), %rsi
	movq	$-1, ORIG_RAX(%rsp)
.else
	xorl	%esi, %esi
.endif
```

Additionally, you may see that we zero the `%esi` register when an exception does not provide an error code.

In the end we just call the secondary exception handler:

```assembly
call	\do_sym
```

which:

```C
dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
```

will be for the `debug` exception, and:

```C
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
```

will be for the `int 3` exception. In this part we will not look at the implementations of the secondary handlers, because they are very specific, but we will see some of them in one of the next parts.

We have just considered the first case, when an exception occurred in userspace. Let's consider the last two.

An exception with paranoid > 0 occurred in kernelspace

In this case an exception occurred in kernelspace and the `idtentry` macro is defined with `paranoid=1` for this exception. This value of `paranoid` means that we should use the slower way we saw at the beginning of this part to check whether we really came from kernelspace or not. The `paranoid_entry` routine allows us to know this:

```assembly
ENTRY(paranoid_entry)
	cld
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	movl	$1, %ebx
	movl	$MSR_GS_BASE, %ecx
	rdmsr
	testl	%edx, %edx
	js	1f
	SWAPGS
	xorl	%ebx, %ebx
1:	ret
END(paranoid_entry)
```

As you may see, this function does the same thing we covered before: we use the second (slow) method to get information about the previous state of the interrupted task.
As `paranoid_entry` has checked this and executed `SWAPGS` in the case when we came from userspace, we should do the same things we did before: put a pointer to the structure which holds the general purpose registers into `%rdi` (it will be the first parameter of the secondary handler), and put the error code, if the exception provides one, into `%rsi` (the second parameter of the secondary handler):

```assembly
movq	%rsp, %rdi

.if \has_error_code
	movq	ORIG_RAX(%rsp), %rsi
	movq	$-1, ORIG_RAX(%rsp)
.else
	xorl	%esi, %esi
.endif
```

The last step before the secondary handler of the exception is called is the adjustment of the new `IST` stack frame:

```assembly
.if \shift_ist != -1
	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
```

You may remember that we passed `shift_ist` as an argument of the `idtentry` macro. Here we check its value and, if it is not equal to `-1`, we get the pointer to the stack from the `Interrupt Stack Table` by the `shift_ist` index and set it up.

At the end of this second path we just call the secondary exception handler as we did before:

```assembly
call	\do_sym
```

The last case is similar to both previous ones, but the exception occurred with `paranoid=0` and we may use the fast method of determining where we came from.

Exit from an exception handler

After the secondary handler finishes its work, we return to the `idtentry` macro, and the next step is the jump to the `error_exit` routine:

```assembly
jmp	error_exit
```

The `error_exit` function is defined in the same arch/x86/entry/entry_64.S assembly source code file. The main goal of this function is to know where we came from (userspace or kernelspace), execute `SWAPGS` depending on this, restore the registers to their previous state, and execute the `iret` instruction to transfer control to the interrupted task.

That's all.

Conclusion

This is the end of the third part about interrupts and interrupt handling in the Linux kernel.
We saw the initialization of the Interrupt descriptor table with the `#DB` and `#BP` gates in the previous part, and in this part we started to dive into the preparation that happens before control is transferred to an exception handler, and into the implementation of some interrupt handlers. In the next part we will continue to dive into this theme, will go further into the `setup_arch` function and will try to understand the interrupt handling related stuff.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

* Debug registers
* Intel 80386
* INT 3
* gcc
* TSS
* GNU assembly .error directive
* dwarf2
* CFI directives
* IRQ
* system call
* swapgs
* SIGTRAP
* Per-CPU variables
* kgdb
* ACPI
* Previous part

Interrupts and Interrupt Handling. Part 4.

Initialization of non-early interrupt gates

This is the fourth part about interrupt and exception handling in the Linux kernel, and in the previous part we saw the first early `#DB` and `#BP` exception handlers from arch/x86/kernel/traps.c. We stopped right after the `early_trap_init` function that is called from the `setup_arch` function defined in arch/x86/kernel/setup.c. In this part we will continue to dive into interrupt and exception handling in the Linux kernel for `x86_64`, continuing from the place where we left off in the last part. The first thing related to interrupt and exception handling here is the setup of the `#PF` or page fault handler with the `early_trap_pf_init` function. Let's start with it.

Early page fault handler

The `early_trap_pf_init` function is defined in arch/x86/kernel/traps.c. It uses the `set_intr_gate` macro that fills the Interrupt Descriptor Table with the given entry:

```C
void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
	set_intr_gate(X86_TRAP_PF, page_fault);
#endif
}
```

This macro is defined in arch/x86/include/asm/desc.h.
We already saw macros like this in the previous part: `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that the given vector number is not greater than `255` (the maximum vector number) and calls the `_set_gate` function, just as `set_system_intr_gate` and `set_intr_gate_ist` did:

```C
#define set_intr_gate(n, addr)						\
	do {								\
		BUG_ON((unsigned)n > 0xFF);				\
		_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,	\
			  __KERNEL_CS);					\
		_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
				0, 0, __KERNEL_CS);			\
	} while (0)
```

The `set_intr_gate` macro takes two parameters:

* vector number of an interrupt;
* address of an interrupt handler.

In our case they are:

* `X86_TRAP_PF` - `14`;
* `page_fault` - the interrupt handler entry point.

`X86_TRAP_PF` is an element of the enum which is defined in arch/x86/include/asm/traps.h:

```C
enum {
	...
	...
	...
	X86_TRAP_PF,		/* 14, Page Fault */
	...
	...
	...
}
```

When `early_trap_pf_init` is called, `set_intr_gate` will be expanded to the call of `_set_gate`, which fills the `IDT` with the handler for the page fault. Now let's look at the implementation of the `page_fault` handler. The `page_fault` handler is defined in the arch/x86/entry/entry_64.S assembly source code file, as all exception handlers are. Let's look at it:

```assembly
trace_idtentry page_fault do_page_fault has_error_code=1
```

We saw in the previous part how the `#DB` and `#BP` handlers are defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro is defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:

```assembly
#ifdef CONFIG_TRACING
.macro trace_idtentry sym do_sym has_error_code:req
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#else
.macro trace_idtentry sym do_sym has_error_code:req
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#endif
```

We will not dive into exception tracing now.
If `CONFIG_TRACING` is not set, we can see that the `trace_idtentry` macro just expands to the normal `idtentry`. We already saw the implementation of the `idtentry` macro in the previous part, so let's start with the `page_fault` exception handler.

As we can see in the `idtentry` definition, the handler of `page_fault` is the `do_page_fault` function, which is defined in arch/x86/mm/fault.c and, like all exception handlers, takes two arguments:

* `regs` - the `pt_regs` structure that holds the state of the interrupted process;
* `error_code` - the error code of the page fault exception.

Let's look inside this function. First of all we read the content of the cr2 control register:

```C
dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = read_cr2();
	...
	...
	...
}
```

This register contains the linear address which caused the `page fault`. In the next step we make a call of the `exception_enter` function from include/linux/context_tracking.h. `exception_enter` and `exception_exit` are functions of the context tracking subsystem in the Linux kernel, used by RCU to remove its dependency on the timer tick while a processor runs in userspace. In almost every exception handler we will see similar code:

```C
enum ctx_state prev_state;
prev_state = exception_enter();
...
...
... // exception handler here
...
...
exception_exit(prev_state);
```

The `exception_enter` function checks that `context tracking` is enabled with the `context_tracking_is_enabled` function and, if it is enabled, gets the previous context with `this_cpu_read` (more about `this_cpu_*` operations can be read in the Documentation). After this it calls the `context_tracking_user_exit` function, which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:

```C
static inline enum ctx_state exception_enter(void)
{
	enum ctx_state prev_ctx;

	if (!context_tracking_is_enabled())
		return 0;

	prev_ctx = this_cpu_read(context_tracking.state);
	context_tracking_user_exit();

	return prev_ctx;
}
```

The state can be one of:

```C
enum ctx_state {
	IN_KERNEL = 0,
	IN_USER,
} state;
```

And in the end we return the previous context. Between `exception_enter` and `exception_exit` we call the actual page fault handler:

```C
__do_page_fault(regs, error_code, address);
```

`__do_page_fault` is defined in the same source code file as `do_page_fault` - arch/x86/mm/fault.c. At the beginning of `__do_page_fault` we check the state of the kmemcheck checker. `kmemcheck` detects and warns about some uses of uninitialized memory. We need to check it because the page fault can be caused by kmemcheck:

```C
if (kmemcheck_active(regs))
	kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);
```

After this we can see the call of `prefetchw`, which executes the instruction of the same name (available with the `3DNow!` CPU feature) to fetch the cache line of `mm->mmap_sem` in exclusive state. The main purpose of prefetching is to hide the latency of a memory access.
In the next step we check that the page fault did not happen in kernel space, with the following condition:

```C
if (unlikely(fault_in_kernel_space(address))) {
	...
	...
	...
}
```

where `fault_in_kernel_space` is:

```C
static int fault_in_kernel_space(unsigned long address)
{
	return address >= TASK_SIZE_MAX;
}
```

The `TASK_SIZE_MAX` macro expands to:

```C
#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
```

or `0x00007ffffffff000`.

Later in this part, inside the `cpu_init` function, the per-processor `Interrupt Stack Table` entries are filled in the `Task State Segment`:

```C
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
	estacks += exception_stack_sizes[v];
	oist->ist[v] = t->x86_tss.ist[v] = (unsigned long)estacks;
	if (v == DEBUG_STACK-1)
		per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
}
```

As we have filled the `Task State Segment` with the `Interrupt Stack Tables`, we can set the `TSS` descriptor for the current processor and load it with:

```C
set_tss_desc(cpu, t);
load_TR_desc();
```

where the `set_tss_desc` macro from arch/x86/include/asm/desc.h writes the given descriptor to the `Global Descriptor Table` of the given processor:

```C
#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)

static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
{
	struct desc_struct *d = get_cpu_gdt_table(cpu);
	tss_desc tss;

	set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
			      IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
			      sizeof(unsigned long) - 1);
	write_gdt_entry(d, entry, &tss, DESC_TSS);
}
```

and the `load_TR_desc` macro expands to the `ltr` or `Load Task Register` instruction:

```C
#define load_TR_desc()	native_load_tr_desc()

static inline void native_load_tr_desc(void)
{
	asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
```

At the end of the `trap_init` function we can see the following code:

```C
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
...
...
...
#ifdef CONFIG_X86_64
	memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
	set_nmi_gate(X86_TRAP_DB, &debug);
	set_nmi_gate(X86_TRAP_BP, &int3);
#endif
```

Here we copy `idt_table` to `nmi_idt_table` and set up the exception handlers for the `#DB` or `Debug exception` and the `#BP` or `Breakpoint exception` again. You may remember that we already set these interrupt gates in the previous part, so why do we need to set them up again? We set them up again because when we initialized them earlier in the `early_trap_init` function, the `Task State Segment` was not ready yet; now it is ready after the call of the `cpu_init` function.

That's all. Soon we will consider all the handlers of these interrupts/exceptions.

Conclusion

This is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Task State Segment in this part and the initialization of different interrupt handlers, such as the `Divide Error`, the `Page Fault` exception, etc. You may note that we saw only initialization stuff; we will dive into the details of the handlers for these exceptions in the next parts.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

* page fault
* Interrupt Descriptor Table
* Tracing
* cr2
* RCU
* this_cpu_* operations
* kmemcheck
* prefetchw
* 3DNow
* CPU caches
* VFS
* Linux kernel memory management
* Fix-Mapped Addresses and ioremap
* Extended Industry Standard Architecture
* INT instruction
* INTO
* BOUND
* opcode
* control register
* x87 FPU
* MCE exception
* SIMD
* cpumasks and bitmaps
* NX
* Task State Segment
* Previous part

Interrupts and Interrupt Handling. Part 5.

Implementation of exception handlers

This is the fifth part about interrupt and exception handling in the Linux kernel, and in the previous part we stopped at the setting of interrupt gates in the Interrupt descriptor Table.
We did this in the `trap_init` function from the arch/x86/kernel/traps.c source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed is located in the arch/x86/entry/entry_64.S assembly file and occurs in the idtentry macro that defines the exception entry points:

```assembly
idtentry divide_error			do_divide_error			has_error_code=0
idtentry overflow			do_overflow			has_error_code=0
idtentry invalid_op			do_invalid_op			has_error_code=0
idtentry bounds				do_bounds			has_error_code=0
idtentry device_not_available		do_device_not_available		has_error_code=0
idtentry coprocessor_segment_overrun	do_coprocessor_segment_overrun	has_error_code=0
idtentry invalid_TSS			do_invalid_TSS			has_error_code=1
idtentry segment_not_present		do_segment_not_present		has_error_code=1
idtentry spurious_interrupt_bug		do_spurious_interrupt_bug	has_error_code=0
idtentry coprocessor_error		do_coprocessor_error		has_error_code=0
idtentry alignment_check		do_alignment_check		has_error_code=1
idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
```

The `idtentry` macro performs the following preparation before an actual exception handler (`do_divide_error` for `divide_error`, `do_overflow` for `overflow`, etc.) gets control. In other words, the `idtentry` macro allocates space for the registers (the `pt_regs` structure) on the stack, pushes a dummy error code for stack consistency if the interrupt/exception has no error code, checks the segment selector in the `cs` segment register, and switches depending on the previous state (userspace or kernelspace).
After all ofthese preparations it makes a call of an actual interrupt/exception handler:.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1ENTRY(\sym).........call\do_sym.........END(\sym).endmAfter an exception handler will finish its work, theidtentrymacro restores stack andgeneral purpose registers of an interrupted task and executes iret instruction:ENTRY(paranoid_exit).........RESTORE_EXTRA_REGSRESTORE_C_REGSREMOVE_PT_GPREGS_FROM_STACK 8INTERRUPT_RETURNEND(paranoid_exit)whereINTERRUPT_RETURNis:#define INTERRUPT_RETURNjmp native_iret...ENTRY(native_iret).global native_irq_return_iretnative_irq_return_iret:iretqMore about theidtentrymacro you can read in the third part of thehttps://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html chapter. Ok,now we saw the preparation before an exception handler will be executed and now time tolook on the handlers. First of all let's look on the following handlers:divide_erroroverflowinvalid_op299Implementation of some exception handlerscoprocessor_segment_overruninvalid_TSSsegment_not_presentstack_segmentalignment_checkAll these handlers defined in the arch/x86/kernel/traps.c source code file with theDO_ERRORmacro:DO_ERROR(X86_TRAP_DE,SIGFPE,DO_ERROR(X86_TRAP_OF,SIGSEGV, "overflow",overflow)DO_ERROR(X86_TRAP_UD,SIGILL,invalid_op)DO_ERROR(X86_TRAP_OLD_MF, SIGFPE,"divide error",divide_error)"invalid opcode","coprocessor segment overrun", coprocessor_segment_overrun)DO_ERROR(X86_TRAP_TS,SIGSEGV, "invalid TSS",DO_ERROR(X86_TRAP_NP,SIGBUS,"segment not present",segment_not_present)DO_ERROR(X86_TRAP_SS,SIGBUS,"stack segment",stack_segment)DO_ERROR(X86_TRAP_AC,SIGBUS,"alignment check",alignment_check)As we can see theDO_ERRORinvalid_TSS)macro takes 4 parameters:Vector number of an interrupt;Signal number which will be sent to the interrupted process;String which describes an exception;Exception handler entry point.This macro defined in the same source code file and expands to the function 
with the `do_handler` name:

```C
#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
        do_error_trap(regs, error_code, str, trapnr, signr);            \
}
```

Note the `##` tokens. This is a special feature - GCC macro concatenation - which concatenates two given tokens. For example, the first `DO_ERROR` in our example will expand to:

```C
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code) \
{
	...
}
```

We can see that all functions which are generated by the `DO_ERROR` macro just make a call of the `do_error_trap` function from the arch/x86/kernel/traps.c. Let's look at the implementation of the `do_error_trap` function.

## Trap handlers

The `do_error_trap` function starts and ends with the two following functions:

```C
enum ctx_state prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
```

from the include/linux/context_tracking.h. Context tracking is the Linux kernel subsystem which provides kernel boundary probes to keep track of the transitions between level contexts with two basic initial contexts: `user` or `kernel`. The `exception_enter` function checks that context tracking is enabled. After this, if it is enabled, `exception_enter` reads the previous context and compares it with `CONTEXT_KERNEL`. If the previous context is `user`, we call the `context_tracking_exit` function from the kernel/context_tracking.c which informs the context tracking subsystem that a processor is exiting user mode and entering kernel mode:

```C
if (!context_tracking_is_enabled())
        return 0;

prev_ctx = this_cpu_read(context_tracking.state);
if (prev_ctx != CONTEXT_KERNEL)
        context_tracking_exit(prev_ctx);

return prev_ctx;
```

If the previous context is not `user`, we just return it.
The `prev_ctx` has the `enum ctx_state` type which is defined in the include/linux/context_tracking_state.h and looks as:

```C
enum ctx_state {
	CONTEXT_KERNEL = 0,
	CONTEXT_USER,
	CONTEXT_GUEST,
} state;
```

The second function is `exception_exit`, defined in the same include/linux/context_tracking.h file. It checks that context tracking is enabled and calls the `context_tracking_enter` function if the previous context was `user`:

```C
static inline void exception_exit(enum ctx_state prev_ctx)
{
	if (context_tracking_is_enabled()) {
		if (prev_ctx != CONTEXT_KERNEL)
			context_tracking_enter(prev_ctx);
	}
}
```

The `context_tracking_enter` function informs the context tracking subsystem that a processor is going to enter user mode from kernel mode. We can see the following code between the `exception_enter` and `exception_exit`:

```C
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
		NOTIFY_STOP) {
	conditional_sti(regs);
	do_trap(trapnr, signr, str, regs, error_code,
		fill_trap_info(regs, signr, trapnr, &info));
}
```

First of all it calls the `notify_die` function which is defined in the kernel/notifier.c. To get notified of a kernel panic, kernel oops, Non-Maskable Interrupt or other events the caller needs to insert itself into the `notify_die` chain, and the `notify_die` function does it. The Linux kernel has a special mechanism that allows the kernel to ask to be notified when something happens, and this mechanism is called `notifiers` or `notifier chains`. This mechanism is used for example for the `USB` hotplug events (look at the drivers/usb/core/notify.c), for memory hotplug (look at the include/linux/memory.h, the `hotplug_memory_notifier` macro and etc...), system reboots and etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special `notifier_block` structure and passes it to the `notifier_chain_register` function. An event can be sent with the call of the `notifier_call_chain` function.
First of all the `notify_die` function fills the `die_args` structure with the trap number, trap string, registers and other values:

```C
struct die_args args = {
       .regs   = regs,
       .str    = str,
       .err    = err,
       .trapnr = trap,
       .signr  = sig,
};
```

and returns the result of the `atomic_notifier_call_chain` function with the `die_chain`:

```C
static ATOMIC_NOTIFIER_HEAD(die_chain);
return atomic_notifier_call_chain(&die_chain, val, &args);
```

which just expands to the `atomic_notifier_head` structure that contains a lock and a `notifier_block`:

```C
struct atomic_notifier_head {
	spinlock_t lock;
	struct notifier_block __rcu *head;
};
```

The `atomic_notifier_call_chain` function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If the `notify_die` in the `do_error_trap` function does not return `NOTIFY_STOP`, we execute `conditional_sti` from the arch/x86/kernel/traps.c, which checks the value of the interrupt flag and enables interrupts depending on it:

```C
static inline void conditional_sti(struct pt_regs *regs)
{
	if (regs->flags & X86_EFLAGS_IF)
		local_irq_enable();
}
```

more about the `local_irq_enable` macro you can read in the second part of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defines the `tsk` variable which has the `task_struct` type and represents the current interrupted process. After the definition of `tsk`, we can see the call of the `do_trap_no_signal` function:

```C
struct task_struct *tsk = current;

if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
	return;
```

The `do_trap_no_signal` function makes two checks:

* Did we come from the Virtual 8086 mode;
* Did we come from the kernelspace.

```C
if (v8086_mode(regs)) {
	...
}

if (!user_mode(regs)) {
	...
}

return -1;
```

We will not consider the first case because long mode does not support Virtual 8086 mode.
In the second case we invoke the `fixup_exception` function which will try to recover the fault, and call `die` if we can't:

```C
if (!fixup_exception(regs)) {
	tsk->thread.error_code = error_code;
	tsk->thread.trap_nr = trapnr;
	die(str, regs, error_code);
}
```

The `die` function is defined in the arch/x86/kernel/dumpstack.c source code file; it prints useful information about the stack, registers and kernel modules and causes a kernel oops. If we came from userspace the `do_trap_no_signal` function will return `-1` and the execution of the `do_trap` function will continue. If we passed through the `do_trap_no_signal` function and did not exit from `do_trap` after it, it means that the previous context was `user`. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode and etc. When an exception occurs the Linux kernel sends a signal to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the `do_trap` function we need to send a signal with the given number (`SIGFPE` for the divide error, `SIGSEGV` for the overflow exception and etc...). First of all we save the error code and vector number in the current interrupted process by filling `thread.error_code` and `thread.trap_nr`:

```C
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;
```

After this we make a check whether we need to print information about unhandled signals for the interrupted process. We check that the `show_unhandled_signals` variable is set, that the `unhandled_signal` function from the kernel/signal.c reports an unhandled signal, and the printk rate limit:

```C
#ifdef CONFIG_X86_64
	if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
	    printk_ratelimit()) {
		pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
			tsk->comm, tsk->pid, str,
			regs->ip, regs->sp, error_code);
		print_vma_addr(" in ", regs->ip);
		pr_cont("\n");
	}
#endif
```

And send the given signal to the interrupted process:

```C
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
```

This is the end of the `do_trap`.
We just saw the generic implementation for eight different exceptions which are defined with the `DO_ERROR` macro. Now let's look at other exception handlers.

## Double fault

The next exception is `#DF` or `Double fault`. This exception occurs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:

```C
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
```

Note that this exception runs on the `DOUBLEFAULT_STACK` Interrupt Stack Table which has index - `1`:

```C
#define DOUBLEFAULT_STACK 1
```

The `double_fault` is the handler for this exception and is defined in the arch/x86/kernel/traps.c. The `double_fault` handler starts with the definition of two variables: a string that describes the exception and the interrupted process, as other exception handlers:

```C
static const char str[] = "double fault";
struct task_struct *tsk = current;
```

The handler of the double fault exception is split into two parts. The first part is the check of whether the fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment.
The `espfix` feature solves this problem. So if the `non-IST` fault is on the espfix64 stack we modify the stack to make it look like a `General Protection Fault`:

```C
struct pt_regs *normal_regs = task_pt_regs(current);

memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
normal_regs->orig_ax = 0;

regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;

return;
```

In the second case we do almost the same that we did in the previous exception handlers. The first is the call of the `ist_enter` function that discards the previous context, `user` in our case:

```C
ist_enter(regs);
```

And after this we fill the interrupted process with the vector number of the `Double fault` exception and the error code as we did in the previous handlers:

```C
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;
```

Next we print useful information about the double fault (PID number, registers content):

```C
#ifdef CONFIG_DOUBLEFAULT
	df_debug(regs, error_code);
#endif
```

And die:

```C
for (;;)
	die(str, regs, error_code);
```

That's all.

## Device not available exception handler

The next exception is the `#NM` or `Device not available`. The `Device not available` exception can occur depending on these things:

* The processor executed an x87 FPU floating-point instruction while the EM flag in control register `cr0` was set;
* The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set;
* The processor executed an x87 FPU, MMX or SSE instruction while the `TS` flag in control register `cr0` was set and the `EM` flag is clear.

The handler of the `Device not available` exception is the `do_device_not_available` function and it is defined in the arch/x86/kernel/traps.c source code file too.
It starts and ends with the getting of the previous context, as the other traps which we saw in the beginning of this part:

```C
enum ctx_state prev_state;
prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
```

In the next step we check that the `FPU` is not eager:

```C
BUG_ON(use_eager_fpu());
```

When we switch into a task or interrupt we may avoid loading the `FPU` state. If a task will use it, we catch the `Device not Available exception`. If we load the `FPU` state during task switching, the `FPU` is eager. In the next step we check the `EM` flag in the `cr0` control register, which can show us whether the x87 floating point unit is present (flag clear) or not (flag set):

```C
#ifdef CONFIG_MATH_EMULATION
	if (read_cr0() & X86_CR0_EM) {
		struct math_emu_info info = { };

		conditional_sti(regs);

		info.regs = regs;
		math_emulate(&info);
		exception_exit(prev_state);
		return;
	}
#endif
```

If the `x87` floating point unit is not present, we enable interrupts with the `conditional_sti`, fill the `math_emu_info` (defined in the arch/x86/include/asm/math_emu.h) structure with the registers of the interrupted task and call the `math_emulate` function from the arch/x86/math-emu/fpu_entry.c. As you can understand from the function's name, it emulates the `X87 FPU` unit (more about the `x87` we will know in a special chapter). In the other way, if the `X86_CR0_EM` flag is clear, which means that the `x87 FPU` unit is present, we call the `fpu__restore` function from the arch/x86/kernel/fpu/core.c which copies the `FPU` registers from the `fpustate` to the live hardware registers. After this `FPU` instructions can be used:

```C
fpu__restore(&current->thread.fpu);
```

## General protection fault exception handler

The next exception is the `#GP` or `General protection fault`. This exception occurs when the processor detected one of a class of protection violations called `general-protection violations`.
It can be:

* Exceeding the segment limit when accessing the `cs`, `ds`, `es`, `fs` or `gs` segments;
* Loading the `ss`, `ds`, `es`, `fs` or `gs` register with a segment selector for a system segment;
* Violating any of the privilege rules;
* and other...

The exception handler for this exception is the `do_general_protection` from the arch/x86/kernel/traps.c. The `do_general_protection` function starts and ends as the other exception handlers with the getting of the previous context:

```C
prev_state = exception_enter();
...
exception_exit(prev_state);
```

After this we enable interrupts if they were disabled and check that we came from the Virtual 8086 mode:

```C
conditional_sti(regs);

if (v8086_mode(regs)) {
	local_irq_enable();
	handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
	goto exit;
}
```

As long mode does not support this mode, we will not consider exception handling for this case. In the next step we check that the previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception we fill the interrupted process with the vector number and error code of the exception and add it to the `notify_die` chain:

```C
if (!user_mode(regs)) {
	if (fixup_exception(regs))
		goto exit;

	tsk->thread.error_code = error_code;
	tsk->thread.trap_nr = X86_TRAP_GP;
	if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
		       X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
		die("general protection fault", regs, error_code);
	goto exit;
}
```

If we can fix the exception we go to the `exit` label which exits from the exception state:

```C
exit:
	exception_exit(prev_state);
```

If we came from user mode we send the `SIGSEGV` signal to the interrupted process from user mode as we did in the `do_trap` function:

```C
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
		printk_ratelimit()) {
	pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
		tsk->comm, task_pid_nr(tsk),
		regs->ip, regs->sp, error_code);
	print_vma_addr(" in ", regs->ip);
	pr_cont("\n");
}

force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
```

That's all.

## Conclusion

It is the end of
the fifth part of the Interrupts and Interrupt Handling chapter and we saw the implementation of some exception handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for Non-Maskable Interrupts, the handling of the math coprocessor and SIMD coprocessor exceptions and many many more.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* Interrupt descriptor Table
* iret instruction
* GCC macro Concatenation
* kernel panic
* kernel oops
* Non-Maskable Interrupt
* hotplug
* interrupt flag
* long mode
* signal
* printk
* coprocessor
* SIMD
* Interrupt Stack Table
* PID
* x87 FPU
* control register
* MMX
* Previous part

# Interrupts and Interrupt Handling. Part 6.

## Non-maskable interrupt handler

It is the sixth part of the Interrupts and Interrupt Handling in the Linux kernel chapter and in the previous part we saw the implementation of some exception handlers, like the General Protection Fault exception, the divide exception, the invalid opcode exception and etc. As I wrote in the previous part, we will see the implementations of the rest of the exceptions in this part. We will see the implementation of the following handlers:

* Non-Maskable interrupt;
* BOUND Range Exceeded Exception;
* Coprocessor exception;
* SIMD coprocessor exception.

in this part. So, let's start.

## Non-Maskable interrupt handling

A Non-Maskable interrupt is a hardware interrupt that cannot be ignored by standard masking techniques.
In a general way, a non-maskable interrupt can be generated in either of two ways:

* External hardware asserts the non-maskable interrupt pin on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with the delivery mode `NMI`.

When the processor receives an `NMI` from one of these sources, the processor handles it immediately by calling the `NMI` handler pointed to by the interrupt vector which has number `2` (see the table in the first part). We already filled the Interrupt Descriptor Table with the vector number, the address of the `nmi` interrupt handler and the `NMI_STACK` Interrupt Stack Table entry:

```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
```

in the `trap_init` function which is defined in the arch/x86/kernel/traps.c source code file. In the previous parts we saw that the entry points of all interrupt handlers are defined with the:

```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
	...
	...
	...
END(\sym)
.endm
```

macro from the arch/x86/entry/entry_64.S assembly source code file. But the handler of the `Non-Maskable` interrupts is not defined with this macro. It has its own entry point:

```assembly
ENTRY(nmi)
	...
	...
	...
END(nmi)
```

in the same arch/x86/entry/entry_64.S assembly file. Let's dive into it and try to understand how the `Non-Maskable` interrupt handler works. The `nmi` handler starts with the call of the:

```assembly
PARAVIRT_ADJUST_EXCEPTION_FRAME
```

macro, but we will not dive into details about it in this part, because this macro is related to the Paravirtualization stuff which we will see in another chapter. After this we save the content of the `rdx` register on the stack:

```assembly
pushq	%rdx
```

And check whether `cs` was the kernel segment when the non-maskable interrupt occurred:

```assembly
cmpl	$__KERNEL_CS, 16(%rsp)
jne	first_nmi
```

The `__KERNEL_CS` macro is defined in the arch/x86/include/asm/segment.h and represents the second descriptor in the Global Descriptor Table:

```C
#define GDT_ENTRY_KERNEL_CS	2
#define __KERNEL_CS	(GDT_ENTRY_KERNEL_CS*8)
```

more about the `GDT` you can read in the second part of the Linux kernel booting process chapter.
If `cs` is not the kernel segment, it means that it is not a nested `NMI` and we jump to the `first_nmi` label. Let's consider this case. First of all we put the address of the current stack pointer to `rdx` and push `1` to the stack in the `first_nmi` label:

```assembly
first_nmi:
	movq	(%rsp), %rdx
	pushq	$1
```

Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On x86_64, like on other architectures, the CPU will not execute another `NMI` until the first `NMI` is completed. An `NMI` interrupt finishes with the iret instruction like other interrupts and exceptions do. An `NMI` handler can trigger a page fault or a breakpoint or another exception which uses the `iret` instruction too. If this happens while in `NMI` context, the CPU will leave the `NMI` context and a new `NMI` may come in. The `iret` used to return from those exceptions will re-enable `NMIs` and we will get nested non-maskable interrupts. The problem is that the `NMI` handler will not return to the state that it was in when the exception triggered, but instead it will return to a state that allows new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first `NMI` handler is complete, the new `NMI` will write all over the preempted `NMI`'s stack. We can have nested `NMIs` where the next `NMI` is using the top of the stack of the previous `NMI`. It means that we cannot execute it, because a nested non-maskable interrupt will corrupt the stack of the previous non-maskable interrupt. That's why we have allocated space on the stack for a temporary variable. We will check this variable: it is set when an `NMI` is executing and cleared if it is not nested. We push `1` here to the previously allocated space on the stack to denote that a `non-maskable` interrupt is currently executing. Remember that when an `NMI` or another exception occurs we have the following stack frame:

```
+------------------------+
|         SS             |
|         RSP            |
|        RFLAGS          |
|         CS             |
|         RIP            |
+------------------------+
```

and also an error code if an exception has it.
So, after all of these manipulations our stack frame will look like this:

```
+------------------------+
|         SS             |
|         RSP            |
|        RFLAGS          |
|         CS             |
|         RIP            |
|         RDX            |
|         1              |
+------------------------+
```

In the next step we allocate yet another `40` bytes on the stack:

```assembly
subq	$(5*8), %rsp
```

and push the copy of the original stack frame after the allocated space:

```assembly
.rept 5
pushq	11*8(%rsp)
.endr
```

with the .rept assembly directive. We need a copy of the original stack frame. Generally we need two copies of the interrupt stack. The first is the `copied` stack frame and the second is the `saved` stack frame. The `copied` stack frame is modified by any nested `NMIs` to let the first `NMI` know that we triggered a second `NMI` and we should repeat the first `NMI` handler. The `saved` stack frame is used to fix up the `copied` stack frame that a nested `NMI` may change. Now we push the original stack frame to the `saved` frame which is located after the just allocated `40` bytes (the `copied` stack frame). Ok, we have made the first copy of the original stack frame, and now it is time to make the second copy:

```assembly
addq	$(10*8), %rsp

.rept 5
pushq	-6*8(%rsp)
.endr

subq	$(5*8), %rsp
```

After all of these manipulations our stack frame will be like this:

```
+-------------------------+
| original SS             |
| original Return RSP     |
| original RFLAGS         |
| original CS             |
| original RIP            |
+-------------------------+
| temp storage for rdx    |
+-------------------------+
| NMI executing variable  |
+-------------------------+
| copied SS               |
| copied Return RSP       |
| copied RFLAGS           |
| copied CS               |
| copied RIP              |
+-------------------------+
| Saved SS                |
| Saved Return RSP        |
| Saved RFLAGS            |
| Saved CS                |
| Saved RIP               |
+-------------------------+
```

After this we push a dummy error code on the stack as we already did in the previous exception handlers and allocate space for the general purpose registers on the stack:

```assembly
pushq	$-1
ALLOC_PT_GPREGS_ON_STACK
```

We already saw the implementation of the `ALLOC_PT_GPREGS_ON_STACK` macro in the third part of the interrupts chapter.
This macro is defined in the arch/x86/entry/calling.h and allocates yet another `120` bytes on the stack for the general purpose registers, from `rdi` to `r15`:

```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
	addq	$-(15*8+\addskip), %rsp
.endm
```

After the space allocation for the general registers we can see the call of `paranoid_entry`:

```assembly
call	paranoid_entry
```

We can remember this label from the previous parts. It pushes the general purpose registers on the stack, reads the `MSR_GS_BASE` Model Specific register and checks its value. If the value of `MSR_GS_BASE` is negative, we came from kernel mode and just return from `paranoid_entry`; in the other way it means that we came from user mode and need to execute the `swapgs` instruction which will change the user `gs` with the kernel `gs`:

```assembly
ENTRY(paranoid_entry)
	cld
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	movl	$1, %ebx
	movl	$MSR_GS_BASE, %ecx
	rdmsr
	testl	%edx, %edx
	js	1f
	SWAPGS
	xorl	%ebx, %ebx
1:	ret
END(paranoid_entry)
```

Note that after the `swapgs` instruction we zero the `ebx` register. Next time we will check the content of this register: if we executed the `swapgs` instruction then `ebx` must contain `0`, and `1` in the other way. In the next step we store the value of the `cr2` control register to the `r12` register, because the `NMI` handler can cause a `page fault` and corrupt the value of this control register:

```assembly
movq	%cr2, %r12
```

Now it is time to call the actual `NMI` handler. We push the address of the `pt_regs` to `rdi`, the error code to `rsi` and call the `do_nmi` handler:

```assembly
movq	%rsp, %rdi
movq	$-1, %rsi
call	do_nmi
```

We will be back to the `do_nmi` a little later in this part, but now let's look at what occurs after the `do_nmi` finishes its execution.
After thecr2to thehandler:register, because we can got page fault duringoriginalpt_regsswapgsif it containsmacro just expands to the1instruction andor jump to theswapgs1ebxif wenmi_restoreinstruction. In thelabel we restore general purpose registers, clear allocated space on the stackfor this registers, clear our temporary variable and exit from the interrupt handler with theINTERRUPT_RETURNmacro:316Handling Non-Maskable interruptsmovq%cr2, %rcxcmpq%rcx, %r12je1fmovq%r12, %cr21:testljnz%ebx, %ebxnmi_restorenmi_swapgs:SWAPGS_UNSAFE_STACKnmi_restore:RESTORE_EXTRA_REGSRESTORE_C_REGS/* Pop the extra iret frame at once */REMOVE_PT_GPREGS_FROM_STACK 6*8/* Clear the NMI executing stack variable */movq$0, 5*8(%rsp)INTERRUPT_RETURNwhereiretINTERRUPT_RETURNis defined in the arch/x86/include/irqflags.h and just expands to theinstruction. That's all.Now let's consider case when anotherNMIinterrupt occurred when previousNMIinterruptdidn't finish its execution. You can remember from the beginning of this part that we've madea check that we came from userspace and jump on thecmpljnein this case:first_nmi$__KERNEL_CS, 16(%rsp)first_nmiNote that in this case it is firstNMIevery time, because if the firstNMIcatched page fault,breakpoint or another exception it will be executed in the kernel mode. If we didn't comefrom userspace, first of all we test our temporary variable:cmplje$1, -8(%rsp)nested_nmiand if it is set to1we jump to theIn the case of nestedNMIsnested_nmilabel. If it is notwe check that we are above theignore it, in other way we check that we above thannested_nmi_out1, we test therepeat_nmiend_repeat_nmiISTstack.. In this case weand jump on thelabel.Now let's look on theexception handler. This function defined in thedo_nmiarch/x86/kernel/nmi.c source code file and takes two parameters:address of thept_regs;error code.317Handling Non-Maskable interruptsas all exception handlers. Thedo_nmifunction and ends with the call of thenmi_nesting_preprocess. 
The `nmi_nesting_preprocess` function checks that we likely do not work with the debug stack, and if we are on the debug stack it sets the `update_debug_stack` per-cpu variable to `1` and calls the `debug_stack_set_zero` function from the arch/x86/kernel/cpu/common.c. This function increases the `debug_stack_use_ctr` per-cpu variable and loads a new `Interrupt Descriptor Table`:

```C
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
	if (unlikely(is_debug_stack(regs->sp))) {
		debug_stack_set_zero();
		this_cpu_write(update_debug_stack, 1);
	}
}
```

The `nmi_nesting_postprocess` function checks the `update_debug_stack` per-cpu variable which we set in the `nmi_nesting_preprocess` function and resets the debug stack, or in other words it loads the original `Interrupt Descriptor Table`. After the call of the `nmi_nesting_preprocess` function, we can see the call of `nmi_enter` in the `do_nmi`. The `nmi_enter` increases the `lockdep_recursion` field of the interrupted process, updates the preempt counter and informs the RCU subsystem about the `NMI`. There is also an `nmi_exit` function that does the same stuff as `nmi_enter`, but vice-versa. After `nmi_enter` we increase `__nmi_count` in the `irq_stat` structure and call the `default_do_nmi` function.
First of all, in the `default_do_nmi` we check the address of the previous nmi and update the address of the last nmi to the actual one:

```C
if (regs->ip == __this_cpu_read(last_nmi_rip))
	b2b = true;
else
	__this_cpu_write(swallow_nmi, false);

__this_cpu_write(last_nmi_rip, regs->ip);
```

After this, first of all we need to handle CPU-specific `NMIs`:

```C
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
```

And then non-specific `NMIs`, depending on their reason:

```C
reason = x86_platform.get_nmi_reason();

if (reason & NMI_REASON_MASK) {
	if (reason & NMI_REASON_SERR)
		pci_serr_error(reason, regs);
	else if (reason & NMI_REASON_IOCHK)
		io_check_error(reason, regs);

	__this_cpu_add(nmi_stats.external, 1);
	return;
}
```

That's all.

## Range Exceeded Exception

The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines if the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` occurs. The handler of the `#BR` exception is the `do_bounds` function that is defined in the arch/x86/kernel/traps.c. The `do_bounds` handler starts with the call of the `exception_enter` function and ends with the call of the `exception_exit`:

```C
prev_state = exception_enter();

if (notify_die(DIE_TRAP, "bounds", regs, error_code,
	       X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
	goto exit;
...
...
...
exception_exit(prev_state);
return;
```

After we have gotten the state of the previous context, we add the exception to the `notify_die` chain, and if it returns `NOTIFY_STOP` we return from the exception. More about notify chains and the `context tracking` functions you can read in the previous part.
In the next step we enable interrupts if they were disabled with the `conditional_sti` function, which checks the `IF` flag and calls `local_irq_enable` depending on its value:

```C
conditional_sti(regs);

if (!user_mode(regs))
	die("bounds", regs, error_code);
```

and check whether we came from user mode; if we didn't, we call the `die` function. After this we check whether MPX is enabled or not, and if this feature is disabled we jump to the `exit_trap` label:

```C
if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
	goto exit_trap;
}
```

where we execute the `do_trap` function (more about it you can find in the previous part):

```C
exit_trap:
	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
	exception_exit(prev_state);
```

If the `MPX` feature is enabled we check the `BNDSTATUS` with the `get_xsave_field_ptr` function, and if it is zero, it means that `MPX` was not responsible for this exception:

```C
bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
	goto exit_trap;
```

After all of this, there is still only one way left in which `MPX` is responsible for this exception. We will not dive into the details about Intel Memory Protection Extensions in this part, but will see it in another chapter.

## Coprocessor exception and SIMD exception

The next two exceptions are the x87 FPU Floating-Point Error exception or `#MF` and the SIMD Floating-Point Exception or `#XF`. The first exception occurs when the `x87 FPU` has detected a floating point error, for example divide by zero, numeric overflow and etc. The second exception occurs when the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. It can be the same as for the `x87 FPU`. The handlers for these exceptions are `do_coprocessor_error` and `do_simd_coprocessor_error`; they are defined in the arch/x86/kernel/traps.c and are very similar to each other. They both make a call of the `math_error` function from the same source code file but pass a different vector number.
The `do_coprocessor_error` passes the `X86_TRAP_MF` vector number to the `math_error`:

```C
dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_MF);
	exception_exit(prev_state);
}
```

and `do_simd_coprocessor_error` passes `X86_TRAP_XF` to the `math_error` function:

```C
dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_XF);
	exception_exit(prev_state);
}
```

First of all the `math_error` function defines the current interrupted task, the address of its fpu and a string which describes the exception, adds it to the `notify_die` chain and returns from the exception handler if the chain returns `NOTIFY_STOP`:

```C
struct task_struct *task = current;
struct fpu *fpu = &task->thread.fpu;
siginfo_t info;
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
						"simd exception";

if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
	return;
```

After this we check that we are in kernel mode, and if yes we will try to fix the exception with the `fixup_exception` function.
If we cannot fix it, we fill the task with the exception's error code and vector number and die:

```C
if (!user_mode(regs)) {
	if (!fixup_exception(regs)) {
		task->thread.error_code = error_code;
		task->thread.trap_nr = trapnr;
		die(str, regs, error_code);
	}
	return;
}
```

If we came from user mode, we save the `fpu` state, fill the task structure with the vector number of the exception and the `siginfo_t` with the signal number, `errno`, the address where the exception occurred and the signal code:

```C
fpu__save(fpu);

task->thread.trap_nr	= trapnr;
task->thread.error_code = error_code;
info.si_signo		= SIGFPE;
info.si_errno		= 0;
info.si_addr		= (void __user *)uprobe_get_trap_addr(regs);
info.si_code = fpu__exception_code(fpu, trapnr);
```

After this we check the signal code and if it is zero we return:

```C
if (!info.si_code)
	return;
```

Or send the `SIGFPE` signal in the end:

```C
force_sig_info(SIGFPE, &info, task);
```

That's all.

## Conclusion

It is the end of the sixth part of the Interrupts and Interrupt Handling chapter and we saw the implementation of some exception handlers in this part, like the `non-maskable` interrupt, the SIMD and x87 FPU floating point exceptions. Finally we have finished with the `trap_init` function in this part and will go ahead in the next part. Our next point is the external interrupts and the `early_irq_init` function from the init/main.c.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* General Protection Fault
* opcode
* Non-Maskable
* BOUND instruction
* CPU socket
* Interrupt Descriptor Table
* Interrupt Stack Table
* Paravirtualization
* .rept
* SIMD
* Coprocessor
* x86_64
* iret
* page fault
* breakpoint
* Global Descriptor Table
* stack frame
* Model Specific register
* percpu
* RCU
* MPX
* x87 FPU
* Previous part

# Interrupts and Interrupt Handling.
Interrupts and Interrupt Handling. Part 7.
================================================================================

Introduction to external interrupts
--------------------------------------------------------------------------------

This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel chapter, and in the previous part we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you can remember, in the previous part we finished with the `trap_init` function from arch/x86/kernel/traps.c, and the next step is the call of the `early_irq_init` function from init/main.c.

Interrupts are signals that are sent across an `IRQ`, or `Interrupt Request Line`, by hardware or software. External hardware interrupts allow devices like the keyboard, the mouse and so on to indicate that they need the attention of the processor. Once the processor receives the `Interrupt Request`, it will temporarily stop execution of the running program and invoke a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler, or `Interrupt Service Routine` (`ISR`), as we will call it from this part on. The `ISR` can be found in the Interrupt Vector table, which is located at a fixed address in memory. After the interrupt is handled, the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a Unix signal to the interrupted process. That's why the kernel can handle an exception quickly. Unfortunately we can not use this approach for external hardware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process.
External interrupt handling depends on the type of the interrupt:

* `I/O` interrupts;
* Timer interrupts;
* Interprocessor interrupts.

I will try to describe all types of interrupts in this book.

Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For example, in the PCI bus architecture several devices may share the same `IRQ` line. In the simplest case the Linux kernel must do the following things when an `I/O` interrupt occurs:

* Save the value of the `IRQ` and the registers' contents on the kernel stack;
* Send an acknowledgment to the hardware controller which is servicing the `IRQ` line;
* Execute the interrupt service routine (we will call it `ISR`) which is associated with the device;
* Restore the registers and return from the interrupt.

Ok, we know a little theory, so now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in kernel/irq/irqdesc.c. This function makes the early initialization of the `irq_desc` structure. The `irq_desc` structure is the foundation of the interrupt management code in the Linux kernel. An array of this structure, which has the same name - `irq_desc`, keeps track of every interrupt request source in the Linux kernel. This structure is defined in include/linux/irqdesc.h and, as you can note, it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option, which enables support for sparse irqs.
The `irq_desc` structure contains many different fields:

* `irq_common_data` - per irq and chip data passed down to chip functions;
* `status_use_accessors` - contains the status of the interrupt source, which is a combination of values from the `enum` in include/linux/irq.h and different macros which are defined in the same source code file;
* `kstat_irqs` - irq stats per-cpu;
* `handle_irq` - highlevel irq-events handler;
* `action` - identifies the interrupt service routines to be invoked when the IRQ occurs;
* `irq_count` - counter of interrupt occurrences on the IRQ line;
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
* `last_unhandled` - aging timer for the unhandled count;
* `irqs_unhandled` - count of the unhandled interrupts;
* `lock` - a spin lock used to serialize the accesses to the `IRQ` descriptor;
* `pending_mask` - pending rebalanced interrupts;
* `owner` - an owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules, and this field is needed to take a refcount on the module which provides the interrupts;

and so on.

Of course these are not all the fields of the `irq_desc` structure, because it is too long to describe each of them here, but we will see them all soon. Now let's start to dive into the implementation of the `early_irq_init` function.

Early external interrupts initialization
--------------------------------------------------------------------------------

Now, let's look at the implementation of the `early_irq_init` function. Note that its implementation depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. First we consider the implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` kernel configuration option is not set.
The function starts with the declaration of the following variables: the `irq` descriptors counter, a loop counter and a memory node descriptor:

```C
int __init early_irq_init(void)
{
	int count, i, node = first_online_node;
	struct irq_desc *desc;
	...
	...
	...
}
```

The `node` is an online NUMA node, and it depends on the `MAX_NUMNODES` value, which in turn depends on the `CONFIG_NODES_SHIFT` kernel configuration parameter:

```C
#define MAX_NUMNODES    (1 << NODES_SHIFT)
...
...
...
#if MAX_NUMNODES > 1
  #define first_online_node	first_node(node_states[N_ONLINE])
#else
  #define first_online_node	0
#endif
```

The `node_states` is an enum which is defined in include/linux/nodemask.h and represents the set of states of a node. In our case we are searching for an online node, and `first_online_node` will be `0` if `MAX_NUMNODES` is one or zero. If `MAX_NUMNODES` is greater than one, the `first_node` macro expands to a call of the `__first_node` function, which returns the minimal or first online node:

```C
#define first_node(src) __first_node(&(src))

static inline int __first_node(const nodemask_t *srcp)
{
	return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES));
}
```

More about this will be in another chapter about `NUMA`. The next step after the declaration of these local variables is the call of:

```C
init_irq_default_affinity();
```

The `init_irq_default_affinity` function is defined in the same source code file and, depending on the `CONFIG_SMP` kernel configuration option, allocates a given cpumask structure (in our case the `irq_default_affinity`):

```C
#if defined(CONFIG_SMP)
cpumask_var_t irq_default_affinity;

static void __init init_irq_default_affinity(void)
{
	alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
	cpumask_setall(irq_default_affinity);
}
#else
static void __init init_irq_default_affinity(void)
{
}
#endif
```

We know that when hardware, such as a disk controller or a keyboard, needs attention from the processor, it raises an interrupt.
The interrupt tells the processor that something has happened and that the processor should interrupt the current process and handle the incoming event. In order to prevent multiple devices from sending the same interrupts, the `IRQ` system was established, in which each device in a computer system is assigned its own special IRQ so that its interrupts are unique. The Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you to control how your system will respond to various hardware events (that's why it has a non-trivial implementation only if the `CONFIG_SMP` kernel configuration option is set). After we have allocated the `irq_default_affinity` cpumask, we can see a `printk` output:

```C
printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
```

which prints `NR_IRQS`:

```
~$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352
```

The `NR_IRQS` is the maximum number of `irq` descriptors, or in other words the maximum number of interrupts. Its value depends on the state of the `CONFIG_X86_IO_APIC` kernel configuration option.
If `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses the old `PIC` chip, `NR_IRQS` is:

```C
#define NR_IRQS_LEGACY                    16

#ifdef CONFIG_X86_IO_APIC
...
...
...
#else
# define NR_IRQS			NR_IRQS_LEGACY
#endif
```

Otherwise, when the `CONFIG_X86_IO_APIC` kernel configuration option is set, `NR_IRQS` depends on the number of processors and on the number of interrupt vectors:

```C
#define CPU_VECTOR_LIMIT		(64 * NR_CPUS)
#define NR_VECTORS			 256
#define IO_APIC_VECTOR_LIMIT		( 32 * MAX_IO_APICS )
#define MAX_IO_APICS			128

# define NR_IRQS					\
	(CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ?	\
		(NR_VECTORS + CPU_VECTOR_LIMIT)  :	\
		(NR_VECTORS + IO_APIC_VECTOR_LIMIT))
...
...
...
```

We remember from the previous parts that we can set the number of processors during the Linux kernel configuration process with the `CONFIG_NR_CPUS` configuration option. In the first case (`CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT`), `NR_IRQS` will be `NR_VECTORS + CPU_VECTOR_LIMIT`, and in the second case it will be `NR_VECTORS + IO_APIC_VECTOR_LIMIT`. In my case `NR_CPUS` is `8`, so the `CPU_VECTOR_LIMIT` is `512`, which is less than the `IO_APIC_VECTOR_LIMIT` of `4096`, and `NR_IRQS` is `4352`.

Next we see the declaration of the `irq_desc` array:

```C
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
	[0 ... NR_IRQS-1] = {
		.handle_irq	= handle_bad_irq,
		.depth		= 1,
		.lock		= __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
	}
};
```

The `irq_desc` is an array of `irq` descriptors. It has three already initialized fields:

* `handle_irq` - as I already wrote above, this field is the highlevel irq-event handler. In our case it is initialized with the `handle_bad_irq` function, which is defined in the kernel/irq/handle.c source code file and handles spurious and unhandled irqs;
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
* `lock` - a spin lock used to serialize the accesses to the `IRQ` descriptor.

As we have calculated the count of the interrupts and initialized our `irq_desc` array, we start to fill the descriptors in a loop:

```C
for (i = 0; i < count; i++) {
	desc[i].kstat_irqs = alloc_percpu(unsigned int);
	alloc_masks(&desc[i], GFP_KERNEL, node);
	raw_spin_lock_init(&desc[i].lock);
	lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
	desc_set_defaults(i, &desc[i], node, NULL);
}
```

We go through all the descriptors: we allocate the per-cpu irq statistics and the cpumasks, initialize the spinlock of each descriptor and call the `desc_set_defaults` function, which fills the rest of the fields. First of all it fills the interrupt number, the irq chip, the platform-specific per-chip private data for the chip methods, the per-IRQ data for the `irq_chip` methods and the `MSI` descriptor of the given interrupt descriptor:

```C
	desc->irq_data.irq = irq;
	desc->irq_data.chip = &no_irq_chip;
	desc->irq_data.chip_data = NULL;
	desc->irq_data.handler_data = NULL;
	desc->irq_data.msi_desc = NULL;
	...
	...
	...
```

The `irq_data.chip` structure provides a general `API`, like `irq_set_chip`, `irq_set_irq_type` and so on, for the irq controller drivers.
You can find it in the kernel/irq/chip.c source code file.

After this we set the status of the accessor for the given descriptor and set the disabled state of the interrupts:

```C
	...
	...
	...
	irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS);
	irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED);
	...
	...
	...
```

In the next step we set the high level interrupt handler to `handle_bad_irq`, which handles spurious and unhandled irqs (as the hardware stuff is not initialized yet, we set this handler), set `irq_desc.depth` to `1`, which means that the `IRQ` is disabled, and reset the counters of unhandled interrupts and of interrupts in general:

```C
	...
	...
	...
	desc->handle_irq = handle_bad_irq;
	desc->depth = 1;
	desc->irq_count = 0;
	desc->irqs_unhandled = 0;
	desc->name = NULL;
	desc->owner = owner;
	...
	...
	...
```

After this we go through all the possible processors with the `for_each_possible_cpu` helper and set the `kstat_irqs` to zero for the given interrupt descriptor:

```C
	for_each_possible_cpu(cpu)
		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
```

and call the `desc_smp_init` function from kernel/irq/irqdesc.c, which initializes the `NUMA` node of the given interrupt descriptor, sets the default `SMP` affinity and clears the `pending_mask` of the given interrupt descriptor depending on the value of the `CONFIG_GENERIC_PENDING_IRQ` kernel configuration option:

```C
static void desc_smp_init(struct irq_desc *desc, int node)
{
	desc->irq_data.node = node;
	cpumask_copy(desc->irq_data.affinity, irq_default_affinity);
#ifdef CONFIG_GENERIC_PENDING_IRQ
	cpumask_clear(desc->pending_mask);
#endif
}
```

At the end of the `early_irq_init` function we return the return value of the `arch_early_irq_init` function:

```C
return arch_early_irq_init();
```

This function is defined in arch/x86/kernel/apic/vector.c and contains only one call, of the `arch_early_ioapic_init` function from arch/x86/kernel/apic/io_apic.c. As we can understand from its name, the `arch_early_ioapic_init` function makes the early initialization of the I/O APIC.
First of all it checks the number of legacy interrupts with the call of the `nr_legacy_irqs` function. If we have no legacy interrupts with the Intel 8259 programmable interrupt controller, we set `io_apic_irqs` to `0xffffffffffffffff`:

```C
if (!nr_legacy_irqs())
	io_apic_irqs = ~0UL;
```

After this we go through all the `I/O APICs` and allocate space for their registers with the call of `alloc_ioapic_saved_registers`:

```C
for_each_ioapic(i)
	alloc_ioapic_saved_registers(i);
```

And at the end of the `arch_early_ioapic_init` function we go through all the legacy irqs (from `IRQ0` to `IRQ15`) in a loop and allocate space for the `irq_cfg`, which represents the configuration of an irq on the given `NUMA` node:

```C
for (i = 0; i < nr_legacy_irqs(); i++) {
	cfg = alloc_irq_and_cfg_at(i, node);
	cfg->vector = IRQ0_VECTOR + i;
	cpumask_setall(cfg->domain);
}
```

That's all.

Sparse IRQs
--------------------------------------------------------------------------------

We already saw at the beginning of this part that the implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw the implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set; now let's look at its implementation when this option is set. The implementation of this function is very similar, but a little different. We can see the same definition of variables and the same call of `init_irq_default_affinity` at the beginning of the `early_irq_init` function:

```C
#ifdef CONFIG_SPARSE_IRQ
int __init early_irq_init(void)
{
	int i, initcnt, node = first_online_node;
	struct irq_desc *desc;

	init_irq_default_affinity();
	...
	...
	...
}
#else
...
...
...
```

But after this we can see the following call:

```C
initcnt = arch_probe_nr_irqs();
```

The `arch_probe_nr_irqs` function is defined in arch/x86/kernel/apic/vector.c; it calculates the count of pre-allocated irqs and updates `nr_irqs` with this number. But stop. Why are there pre-allocated irqs? There is an alternative form of interrupts called `MSI` - Message Signaled Interrupts - available in PCI.
Instead of being assigned a fixed number of interrupt requests, the device is allowed to write a message to a particular address of RAM which is, in fact, mapped to the Local APIC. `MSI` permits a device to allocate `1`, `2`, `4`, `8`, `16` or `32` interrupts, and `MSI-X` permits a device to allocate up to `2048` interrupts. So now we know that irqs can be pre-allocated. More about `MSI` will be in a following part, but for now let's look at the `arch_probe_nr_irqs` function. We can see a check which caps `nr_irqs` at the total amount of interrupt vectors available across all processors in the system, followed by the calculation of `nr`, which represents the number of pre-allocated irqs:

```C
int nr_irqs = NR_IRQS;

if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
	nr_irqs = NR_VECTORS * nr_cpu_ids;

nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
```

Take a look at the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQs` start. This offset is called the `GSI` base or `Global System Interrupt` base, and the `gsi_top` variable represents the top of it. We get the `Global System Interrupt` base from the MultiProcessor Configuration Table (you can remember that we parsed this table in the sixth part of the Linux kernel initialization process chapter).

After this we update `nr` depending on the value of `gsi_top`:

```C
#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
if (gsi_top <= NR_IRQS_LEGACY)
	nr += 8 * nr_cpu_ids;
else
	nr += gsi_top * 16;
#endif
```

Then we update `nr_irqs` if `nr` is less than it, and return the number of legacy irqs:

```C
if (nr < nr_irqs)
	nr_irqs = nr;

return nr_legacy_irqs();
```

The next couple of checks after the `arch_probe_nr_irqs` call make sure that neither `nr_irqs` nor `initcnt` is greater than the maximum allowed count of interrupts:

```C
if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
	nr_irqs = IRQ_BITMAP_BITS;

if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
	initcnt = IRQ_BITMAP_BITS;
```

where `IRQ_BITMAP_BITS` is equal to `NR_IRQS` if `CONFIG_SPARSE_IRQ` is not set, and to `NR_IRQS + 8196` otherwise.
In the next step we go over all the interrupt descriptors which need to be allocated in a loop, allocate space for each descriptor and insert it into the `irq_desc_tree` radix tree:

```C
for (i = 0; i < initcnt; i++) {
	desc = alloc_desc(i, node, NULL);
	set_bit(i, allocated_irqs);
	irq_insert_desc(i, desc);
}
```

At the end of the `early_irq_init` function we return the value of the call of the `arch_early_irq_init` function, as we did in the previous variant when the `CONFIG_SPARSE_IRQ` option was not set:

```C
return arch_early_irq_init();
```

That's all.

Conclusion
--------------------------------------------------------------------------------

It is the end of the seventh part of the Interrupts and Interrupt Handling chapter, in which we started to dive into external hardware interrupts. We saw the early initialization of the `irq_desc` structure, which represents the descriptor of an external interrupt and contains information about it, like the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts and so on. In the next part we will continue to research external interrupts.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links
--------------------------------------------------------------------------------

* IRQ
* numa
* Enum type
* cpumask
* percpu
* spinlock
* critical section
* Lock validator
* MSI
* I/O APIC
* Local APIC
* Intel 8259
* PIC
* MultiProcessor Configuration Table
* radix tree
* dmesg

Interrupts and Interrupt Handling. Part 8.
================================================================================

Non-early initialization of the IRQs
--------------------------------------------------------------------------------

This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel chapter, and in the previous part we started to dive into the external hardware interrupts. We looked at the implementation of the `early_irq_init` function from the kernel/irq/irqdesc.c source code file and saw the initialization of the `irq_desc` structure in this function.
Remember that the `irq_desc` structure (defined in include/linux/irqdesc.h) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.

Right after the call of the `early_irq_init` function in init/main.c we can see the call of the `init_IRQ` function. This function is architecture-specific and is defined in arch/x86/kernel/irqinit.c. The `init_IRQ` function starts with the initialization of the `vector_irq` percpu variable that is defined in the same arch/x86/kernel/irqinit.c source code file:

```C
...
DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
	[0 ... NR_VECTORS - 1] = -1,
};
...
```

and represents a `percpu` array of interrupt vector numbers. The `vector_irq_t` is defined in arch/x86/include/asm/hw_irq.h and expands to:

```C
typedef int vector_irq_t[NR_VECTORS];
```

where `NR_VECTORS` is the count of vector numbers and, as you can remember from the first part of this chapter, it is `256` for x86_64:

```C
#define NR_VECTORS                       256
```

So, at the start of the `init_IRQ` function we fill the `vector_irq` percpu array with the vector numbers of the `legacy` interrupts:

```C
void __init init_IRQ(void)
{
	int i;

	for (i = 0; i < nr_legacy_irqs(); i++)
		per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
	...
	...
	...
}
```

The `nr_legacy_irqs` function returns the number of legacy interrupts from the `legacy_pic` structure:

```C
static inline int nr_legacy_irqs(void)
{
	return legacy_pic->nr_legacy_irqs;
}
```

This structure is defined in the same header file and represents a non-modern programmable interrupt controller:

```C
struct legacy_pic {
	int nr_legacy_irqs;
	struct irq_chip *chip;
	void (*mask)(unsigned int irq);
	void (*unmask)(unsigned int irq);
	void (*mask_all)(void);
	void (*restore_mask)(void);
	void (*init)(int auto_eoi);
	int (*irq_pending)(unsigned int irq);
	void (*make_irq)(unsigned int irq);
};
```

The actual default maximum number of legacy interrupts is represented by the `NR_IRQS_LEGACY` macro from arch/x86/include/asm/irq_vectors.h:

```C
#define NR_IRQS_LEGACY                    16
```

In the loop we access the `vector_irq` per-cpu array with the `per_cpu` macro by the `IRQ0_VECTOR + i` index and write the legacy vector number there. The `IRQ0_VECTOR` macro is defined in the arch/x86/include/asm/irq_vectors.h header file and expands to `0x30`:

```C
#define FIRST_EXTERNAL_VECTOR		0x20

#define IRQ0_VECTOR			((FIRST_EXTERNAL_VECTOR + 16) & ~15)
```

Why `0x30` here? You can remember from the first part of this chapter that the first 32 vector numbers, from `0` to `31`, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the `ISA` interrupts. So it means that we fill the `vector_irq` starting from `IRQ0_VECTOR`, which is equal to `FIRST_EXTERNAL_VECTOR + 16`, or `(0x20 + 16) & ~15 = 0x30`.

At the end of the `init_IRQ` function we can see the call of the following function:

```C
x86_init.irqs.intr_init();
```

from the arch/x86/kernel/x86_init.c source code file. If you have read the chapter about the Linux kernel initialization process, you can remember the `x86_init` structure. This structure contains a couple of fields which point to functions related to the platform setup (`x86_64` in our case), for example `resources` - related to the memory resources, `mpparse` - related to the parsing of the MultiProcessor Configuration Table, and so on. As we can see, `x86_init` also contains the `irqs` field, which itself contains the three following fields:

```C
struct x86_init_ops x86_init __initdata
{
	...
	...
	...
	.irqs = {
		.pre_vector_init	= init_ISA_irqs,
		.intr_init		= native_init_IRQ,
		.trap_init		= x86_init_noop,
	},
	...
	...
	...
}
```

Now we are interested in `native_init_IRQ`. As we can note, the name of the `native_init_IRQ` function contains the `native_` prefix, which means that this function is architecture-specific. It is defined in arch/x86/kernel/irqinit.c and executes the general initialization of the Local APIC and the initialization of the `ISA` irqs.
Let's look at the implementation of the `native_init_IRQ` function and try to understand what occurs there. The `native_init_IRQ` function starts with the execution of the following function:

```C
x86_init.irqs.pre_vector_init();
```

As we can see above, `pre_vector_init` points to the `init_ISA_irqs` function, which is defined in the same source code file and, as we can understand from the function's name, makes the initialization of the `ISA` related interrupts. The `init_ISA_irqs` function starts with the definition of the `chip` variable, which has the `irq_chip` type:

```C
void __init init_ISA_irqs(void)
{
	struct irq_chip *chip = legacy_pic->chip;
	...
	...
	...
```

The `irq_chip` structure is defined in the include/linux/irq.h header file and represents a hardware interrupt chip descriptor. It contains:

* `name` - the name of a device. It is used in `/proc/interrupts`:

```
$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:         16          0          0          0          0          0          0          0   IO-APIC    2-edge      timer
  1:          2          0          0          0          0          0          0          0   IO-APIC    1-edge      i8042
  8:          1          0          0          0          0          0          0          0   IO-APIC    8-edge      rtc0
```

look at the column right after the counters;

* `(*irq_mask)(struct irq_data *data)` - mask an interrupt source;
* `(*irq_ack)(struct irq_data *data)` - start of a new interrupt;
* `(*irq_startup)(struct irq_data *data)` - start up the interrupt;
* `(*irq_shutdown)(struct irq_data *data)` - shut down the interrupt;

and so on.

Note that the `irq_data` structure represents the set of per irq chip data passed down to the chip functions. It contains `irq` - the interrupt number, `hwirq` - the hardware interrupt number, local to the interrupt domain, `mask` - the precomputed bitmask for accessing the chip registers, the `chip` for low level interrupt hardware access, and so on.

After this, depending on the `CONFIG_X86_64` and `CONFIG_X86_LOCAL_APIC` kernel configuration options, we call the `init_bsp_APIC` function from arch/x86/kernel/apic/apic.c:

```C
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
	init_bsp_APIC();
#endif
```

This function makes the initialization of the `APIC` of the `bootstrap processor` (the processor which starts first).
It starts with a check: if we found an SMP config (read more about it in the sixth part of the Linux kernel initialization process chapter) or the processor has no `APIC`, we simply return from this function:

```C
	if (smp_found_config || !cpu_has_apic)
		return;
```

In the next step we call the `clear_local_APIC` function from the same source code file, which shuts down the local `APIC` (more about it will be in the chapter about the `Advanced Programmable Interrupt Controller`), and enable the `APIC` of the first processor by setting the `APIC_SPIV_APIC_ENABLED` bit in an `unsigned int value`:

```C
	value = apic_read(APIC_SPIV);
	value &= ~APIC_VECTOR_MASK;
	value |= APIC_SPIV_APIC_ENABLED;
```

and writing it with the help of the `apic_write` function:

```C
	apic_write(APIC_SPIV, value);
```

After we have enabled the `APIC` of the bootstrap processor, we return to the `init_ISA_irqs` function, and in the next step we initialize the legacy `Programmable Interrupt Controller` and set the legacy chip and handler for each legacy irq:

```C
	legacy_pic->init(0);

	for (i = 0; i < nr_legacy_irqs(); i++)
		irq_set_chip_and_handler(i, chip, handle_level_irq);
```

After this we return to the `native_init_IRQ` function, where the next step is the call of the `apic_intr_init` function, which allocates special interrupt gates used by the SMP architecture for inter-processor interrupts. The `alloc_intr_gate` macro is used there for the interrupt descriptor allocation:

```C
#define alloc_intr_gate(n, addr)			\
	do {						\
		alloc_system_vector(n);			\
		set_intr_gate(n, addr);			\
	} while (0)
```

As we can see, first of all it expands to the call of the `alloc_system_vector` function, which checks the given vector number in the `used_vectors` bitmap and, if it is not set there, sets it. After this we test whether `first_system_vector` is greater than the given interrupt vector number, and if it is greater we assign the given vector number to it:

```C
if (!test_bit(vector, used_vectors)) {
	set_bit(vector, used_vectors);
	if (first_system_vector > vector)
		first_system_vector = vector;
} else {
	BUG();
}
```

We already saw the `set_bit` macro; now let's look at `test_bit` and `first_system_vector`. The `test_bit` macro is defined in arch/x86/include/asm/bitops.h and looks like this:

```C
#define test_bit(nr, addr)			\
	(__builtin_constant_p((nr))		\
	 ? constant_test_bit((nr), (addr))	\
	 : variable_test_bit((nr), (addr)))
```

We can see a ternary operator here. The macro makes a test with the gcc built-in function `__builtin_constant_p`, which tests whether the given vector number (`nr`) is known at compile time. If you're feeling a misunderstanding of `__builtin_constant_p`, we can make a simple test:

```C
#include <stdio.h>

#define PREDEFINED_VAL 1

int main() {
	int i = 5;
	printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i));
	printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL));
	printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100));
	return 0;
}
```

and look at the result:

```
$ gcc test.c -o test
$ ./test
__builtin_constant_p(i) is 0
__builtin_constant_p(PREDEFINED_VAL) is 1
__builtin_constant_p(100) is 1
```

Now I think it must be clear to you. Let's get back to the `test_bit` macro. If `__builtin_constant_p` returns non-zero, we call the `constant_test_bit` function:

```C
static inline int constant_test_bit(int nr, const void *addr)
{
	const u32 *p = (const u32 *)addr;
	return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0;
}
```

and `variable_test_bit` otherwise:

```C
static inline int variable_test_bit(int nr, const void *addr)
{
	u8 v;
	const u32 *p = (const u32 *)addr;

	asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr));
	return v;
}
```

What's the difference between these two functions, and why do we need two different functions for the same purpose? As you can already guess, the main purpose is optimization. If we write a simple example with these functions:

```C
#define CONST 25

int main() {
	int nr = 24;
	variable_test_bit(nr, (int*)0x10000000);
	constant_test_bit(CONST, (int*)0x10000000);
	return 0;
}
```

and look at the assembly output of our example, we will see the following assembly code:

```assembly
pushq	%rbp
movq	%rsp, %rbp

movl	$268435456, %esi
movl	$25, %edi
call	constant_test_bit
```

for `constant_test_bit`, and:

```assembly
pushq	%rbp
movq	%rsp, %rbp

subq	$16, %rsp
movl	$24, -4(%rbp)
movl	-4(%rbp), %eax
movl	$268435456, %esi
movl	%eax, %edi
call	variable_test_bit
```

for `variable_test_bit`. These two code listings start with the same part: first of all we save the base of the current stack frame in the `%rbp` register.
But after this the code for the two examples is different. In the first example we put `$25` (our predefined constant) into the `edi` register (the first parameter of `constant_test_bit`) and `$268435456` (i.e. `0x10000000`) into the `esi` register (the second parameter) and call `constant_test_bit`. We put the function parameters into the `edi` and `esi` registers because, as we are learning the Linux kernel for the `x86_64` architecture, we use the `System V AMD64 ABI` calling convention. All is pretty simple. When we use a predefined constant, the compiler can just substitute its value. Now let's look at the second part. As you can see here, the compiler cannot substitute the value of the `nr` variable. In this case the compiler must calculate its offset in the program's stack frame. We subtract `16` from the `rsp` register to allocate stack space for the local variables and put `$24` (the value of the `nr` variable) at `rbp` with offset `-4`. Our stack frame will be like this:

```
             <- stack grows

	          %[rbp]
                 |
+----------+ +---------+ +---------+ +--------+
|          | |         | |  return | |        |
|    nr    |-|         |-|         |-|  argc  |
|          | |         | | address | |        |
+----------+ +---------+ +---------+ +--------+
                                        |
                                     %[rsp]
```

After this we put this value into `eax`, so the `eax` register now contains the value of `nr`. In the end we do the same as in the first example: we put `$268435456` (the second parameter of the `variable_test_bit` function) into `esi` and the value of `eax` (the value of `nr`, the first parameter) into the `edi` register.

The next step after the `apic_intr_init` function finishes its work is the setting of the interrupt gates from `FIRST_EXTERNAL_VECTOR`, or `0x20`, up to `first_system_vector`:

```C
i = FIRST_EXTERNAL_VECTOR;

#ifndef CONFIG_X86_LOCAL_APIC
#define first_system_vector NR_VECTORS
#endif

for_each_clear_bit_from(i, used_vectors, first_system_vector) {
	set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR));
}
```

But as we are using the `for_each_clear_bit_from` helper, we set only the not yet initialized interrupt gates.
After this we use the same `for_each_clear_bit_from` helper to fill the non-filled interrupt gates in the interrupt table with the `spurious_interrupt` handler:

```C
#ifdef CONFIG_X86_LOCAL_APIC
	for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
		set_intr_gate(i, spurious_interrupt);
#endif
```

where the `spurious_interrupt` function represents the interrupt handler for a `spurious` interrupt. Here `used_vectors` is the bitmap that contains the already initialized interrupt gates. We already filled the first `32` interrupt vectors in it in the `trap_init` function:

```C
	for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
		set_bit(i, used_vectors);
```

Softirq, Tasklets and Workqueues
================================================================================

Deferred interrupts are processed by the per-cpu `ksoftirqd` kernel threads; the `wakeup_softirqd` function wakes the `ksoftirqd` kernel thread of the local processor:

```C
static void wakeup_softirqd(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}
```
When this function will finish to handle aninterrupt, it calls theexiting_irqexpands to the call of thefunction from the arch/x86/include/asm/apic.h thatirq_exitcurrent context and calls thefunction. Theinvoke_softirqirq_exitchecks deferred interrupts,function:if (!in_interrupt() && local_softirq_pending())invoke_softirq();that executes the__do_softirqtoo. So what do we have in summary. Eachthrough the following stages: Registration of aActivation of asoftirqsoftirqwith theby marking it as deferred with thesoftirqopen_softirqraise_softirqgoesfunction.function. After357Softirq, Tasklets and Workqueuesthis, all markedsoftirqswill be r in the next time the Linux kernel schedules a round ofexecutions of deferrable functions. And execution of the deferred functions that have thesame type.As I already wrote, thesoftirqsare statically allocated and it is a problem for a kernelmodule that can be loaded. The second concept that built on top oftaskletssoftirq-- thesoftirq, you noticesolves this problem.TaskletsIf you read the source code of the Linux kernel that is related to thethat it is used very rarely. The preferable way to implement deferrable functions aretasklets. As I already wrote above theand generally on top of twoTASKLET_SOFTIRQHI_SOFTIRQsoftirqstaskletsare built on top of thesoftirqconcept:;.In short words,taskletsunlike, tasklets that have the same type cannot be run on multiple processors at asoftirqsaresoftirqsthat can be allocated and initialized at runtime andtime. Ok, now we know a little bit about thesoftirqs, of course previous text does notcover all aspects about this, but now we can directly look on the code and to know moreabout thesoftirqsstep by step on practice and to know aboutto the implementation of thesoftirq_inittasklets. Let's return backfunction that we talked about in the beginning ofthis part. 
This function is defined in the kernel/softirq.c source code file; let's look at its implementation:

```C
void __init softirq_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		per_cpu(tasklet_vec, cpu).tail =
			&per_cpu(tasklet_vec, cpu).head;
		per_cpu(tasklet_hi_vec, cpu).tail =
			&per_cpu(tasklet_hi_vec, cpu).head;
	}

	open_softirq(TASKLET_SOFTIRQ, tasklet_action);
	open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}
```

We can see the definition of the integer `cpu` variable at the beginning of the `softirq_init` function. Next we use it as a parameter for the `for_each_possible_cpu` macro that goes through all possible processors in the system. If `possible processor` is new terminology for you, you can read more about it in the CPU masks chapter. In short words, `possible cpus` is the set of processors that can be plugged in at any time during the life of the system. All `possible processors` are stored in the `cpu_possible_bits` bitmap; you can find its definition in kernel/cpu.c:

```C
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
...
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
```

Ok, we defined the integer `cpu` variable and go through all possible processors with the `for_each_possible_cpu` macro, initializing the two following per-cpu variables:

* `tasklet_vec`;
* `tasklet_hi_vec`.

These two `per-cpu` variables are defined in the same source code file as the `softirq_init` function and represent two `tasklet_head` structures:

```C
static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
```

Where the `tasklet_head` structure represents a list of `Tasklets` and contains two fields, head and tail:

```C
struct tasklet_head {
	struct tasklet_struct *head;
	struct tasklet_struct **tail;
};
```

The `tasklet_struct` structure is defined in include/linux/interrupt.h and represents the `Tasklet`. Previously we did not see this word in this book. Let's try to understand what a `tasklet` is.
Actually, the tasklet is one of the mechanisms to handle deferred interrupts. Let's look at the implementation of the `tasklet_struct` structure:

```C
struct tasklet_struct
{
	struct tasklet_struct *next;
	unsigned long state;
	atomic_t count;
	void (*func)(unsigned long);
	unsigned long data;
};
```

As we can see this structure contains five fields, they are:

* `next` - next tasklet in the scheduling queue;
* `state` - state of the tasklet;
* `count` - represents whether the tasklet is currently active (enabled) or not;
* `func` - main callback of the tasklet;
* `data` - parameter of the callback.

In our case, we initialize only two arrays of tasklets in the `softirq_init` function: the `tasklet_vec` and the `tasklet_hi_vec`. Tasklets and high-priority tasklets are stored in the `tasklet_vec` and `tasklet_hi_vec` arrays, respectively. So, we have initialized these arrays and now we can see the two calls of the `open_softirq` function that is defined in the kernel/softirq.c source code file:

```C
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
```

at the end of the `softirq_init` function. The main purpose of the `open_softirq` function is the initialization of a `softirq`; in our case they are: `tasklet_action` and `tasklet_hi_action`. The `tasklet_action` function is associated with the `TASKLET_SOFTIRQ` softirq and the `tasklet_hi_action` function is associated with the `HI_SOFTIRQ` softirq. The Linux kernel provides an API for manipulating `tasklets`.
First of all, there is the `tasklet_init` function that takes a `tasklet_struct`, a function and a parameter for it, and initializes the given `tasklet_struct` with the given data:

```C
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
	t->next = NULL;
	t->state = 0;
	atomic_set(&t->count, 0);
	t->func = func;
	t->data = data;
}
```

There are additional methods to initialize a tasklet statically with the two following macros:

```C
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
```

The Linux kernel provides the three following functions to mark a tasklet as ready to run:

```C
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);
```

The first function schedules a tasklet with normal priority, the second with high priority and the third out of turn. The implementations of all three functions are similar, so we will consider only the first -- `tasklet_schedule`. Let's look at its implementation:

```C
static inline void tasklet_schedule(struct tasklet_struct *t)
{
	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
		__tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
	unsigned long flags;

	local_irq_save(flags);
	t->next = NULL;
	*__this_cpu_read(tasklet_vec.tail) = t;
	__this_cpu_write(tasklet_vec.tail, &(t->next));
	raise_softirq_irqoff(TASKLET_SOFTIRQ);
	local_irq_restore(flags);
}
```

As we can see, it checks and sets the state of the given tasklet to `TASKLET_STATE_SCHED` and executes `__tasklet_schedule` with the given tasklet. `__tasklet_schedule` looks very similar to the `raise_softirq` function that we saw above. It saves the interrupt flag and disables interrupts at the beginning. After this, it appends the new tasklet to the per-cpu `tasklet_vec` list and calls the `raise_softirq_irqoff` function that we saw above.
When the Linux kernel scheduler decides to run deferred functions, the `tasklet_action` function will be called for deferred functions which are associated with the `TASKLET_SOFTIRQ` and `tasklet_hi_action` for deferred functions which are associated with the `HI_SOFTIRQ`. These functions are very similar and there is only one difference between them -- `tasklet_action` uses `tasklet_vec` and `tasklet_hi_action` uses `tasklet_hi_vec`. Let's look at the implementation of the `tasklet_action` function:

```C
static void tasklet_action(struct softirq_action *a)
{
	struct tasklet_struct *list;

	local_irq_disable();
	list = __this_cpu_read(tasklet_vec.head);
	__this_cpu_write(tasklet_vec.head, NULL);
	__this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
	local_irq_enable();

	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {
			...
			t->func(t->data);
			tasklet_unlock(t);
			...
		}
		...
	}
}
```

In the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second part of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and reset this per-cpu list to an empty state, because all the detached tasklets are about to be executed. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet that updates the state of the given tasklet to `TASKLET_STATE_RUN`:

```C
static inline int tasklet_trylock(struct tasklet_struct *t)
{
	return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
```

If this operation was successful we execute the tasklet's action (it was set in `tasklet_init`) and call the `tasklet_unlock` function that clears the tasklet's `TASKLET_STATE_RUN` state.

In general, that's all about the `tasklets` concept.
Of course this does not cover all of `tasklets`, but I think that it is a good point from which you can continue to learn this concept. The `tasklets` are a widely used concept in the Linux kernel, but as I wrote in the beginning of this part there is a third mechanism for deferred functions -- `workqueue`. In the next paragraph we will see what it is.

Workqueues

The `workqueue` is another concept for handling deferred functions. It is similar to `tasklets` with some differences. Workqueue functions run in the context of a kernel process, but `tasklet` functions run in the software interrupt context. This means that `workqueue` functions must not be atomic, unlike `tasklet` functions. Tasklets always run on the processor from which they were originally submitted. Workqueues work in the same way, but only by default. The `workqueue` concept is represented by the:

```C
struct worker_pool {
	spinlock_t		lock;
	int			cpu;
	int			node;
	int			id;
	unsigned int		flags;

	struct list_head	worklist;
	int			nr_workers;
	...
```

structure that is defined in the kernel/workqueue.c source code file in the Linux kernel. I will not reproduce the full source code of this structure here, because it has quite a lot of fields, but we will consider some of them.

In its most basic form, the work queue subsystem is an interface for creating kernel threads to handle work that is queued from elsewhere. All of these kernel threads are called `worker threads`. The work queue is maintained by the `work_struct` that is defined in include/linux/workqueue.h. Let's look at this structure:

```C
struct work_struct {
	atomic_long_t data;
	struct list_head entry;
	work_func_t func;
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};
```

Here are the two things that we are interested in: `func` -- the function that will be scheduled by the `workqueue`, and `data` -- the parameter of this function.
The Linux kernel provides special per-cpu threads that are called `kworker`:

```
systemd-cgls -k | grep kworker
├─    5 [kworker/0:0H]
├─   15 [kworker/1:0H]
├─   20 [kworker/2:0H]
├─   25 [kworker/3:0H]
├─   30 [kworker/4:0H]
...
```

These threads can be used to schedule the deferred functions of the workqueues (as `ksoftirqd` for `softirqs`). Besides this we can create a new separate worker thread for a `workqueue`. The Linux kernel provides the following macro for the creation of a workqueue:

```C
#define DECLARE_WORK(n, f) \
	struct work_struct n = __WORK_INITIALIZER(n, f)
```

for static creation. It takes two parameters: the name of the workqueue and the workqueue function. For creation of a workqueue at runtime, we can use the:

```C
#define INIT_WORK(_work, _func)		\
	__INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)			\
	do {							\
		__init_work((_work), _onstack);			\
		(_work)->data = (atomic_long_t) WORK_DATA_INIT();\
		INIT_LIST_HEAD(&(_work)->entry);		\
		(_work)->func = (_func);			\
	} while (0)
```

macro that takes the `work_struct` structure that has to be created and the function to be scheduled in this workqueue. After a `work` was created with one of these macros, we need to put it to the `workqueue`. We can do it with the help of the `queue_work` or the `queue_delayed_work` functions:

```C
static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
```

The `queue_work` function just calls the `queue_work_on` function that queues work on a specific processor. Note that in our case we pass the `WORK_CPU_UNBOUND` to the `queue_work_on` function. It is a part of the `enum` that is defined in include/linux/workqueue.h and represents a workqueue which is not bound to any specific processor.
The `queue_work_on` function tests and sets the `WORK_STRUCT_PENDING_BIT` bit of the given `work` and executes the `__queue_work` function with the `workqueue` for the given processor and the given `work`:

```C
bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work)
{
	bool ret = false;
	...
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_work(cpu, wq, work);
		ret = true;
	}
	...
	return ret;
}
```

The `__queue_work` function gets the `work pool`. Yes, the `work pool`, not the `workqueue`. Actually, all `works` are placed not in the `workqueue`, but in the `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field which is a list of `worker_pools`. When we create a `workqueue`, it creates a `pool_workqueue` for each processor. Each `pool_workqueue` is associated with a `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them the `workqueue` interacts with the `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with `raw_smp_processor_id` (you can find information about this macro in the fourth part of the Linux kernel initialization process chapter), get the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` to the given `workqueue`:

```C
static void __queue_work(int cpu, struct workqueue_struct *wq,
                         struct work_struct *work)
{
	...
	if (req_cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();

	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
	...
	insert_work(pwq, work, worklist, work_flags);
}
```

As we can create `works` and `workqueues`, we need to know when they are executed. As I already wrote, all `works` are executed by a kernel thread. When this kernel thread is scheduled, it starts to execute `works` from the given `workqueue`. Each worker thread executes a loop inside the `worker_thread` function.
This thread does many different things and part of these things are similar to what we saw before in this part. As it starts executing, it removes all `work_struct` items, or `works`, from its `workqueue`.

That's all.

Conclusion

It is the end of the ninth part of the Interrupts and Interrupt Handling chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw the initialization of the `IRQs` and the main `irq_desc` structure. In this part we saw three concepts: the `softirq`, `tasklet` and `workqueue` that are used for the deferred functions.

The next part will be the last part of the `Interrupts and Interrupt Handling` chapter and we will look at a real hardware driver and will try to learn how it works with the interrupts subsystem.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* initcall
* IF
* eflags
* CPU masks
* per-cpu
* Workqueue
* Previous part

Interrupts and Interrupt Handling. Part 10.

Last part

This is the tenth part of the chapter about interrupts and interrupt handling in the Linux kernel and in the previous part we saw a little about deferred interrupts and related concepts like `softirq`, `tasklet` and `workqueue`. In this part we will continue to dive into this theme and now it's time to look at a real hardware driver.

Let's consider the serial driver of the StrongARM SA-110/21285 Evaluation Board as an example and look at how this driver requests an IRQ line, what happens when an interrupt is triggered, etc. The source code of this driver is placed in the drivers/tty/serial/21285.c source code file. Ok, we have the source code, let's start.

Initialization of a kernel module

We will start to consider this driver as we usually did with all the new concepts that we saw in this book. We will start to consider it from its initialization.
As you may already know, the Linux kernel provides two macros for initialization and finalization of a driver or a kernel module:

* `module_init`;
* `module_exit`.

And we can find usage of these macros in our driver source code:

```C
module_init(serial21285_init);
module_exit(serial21285_exit);
```

Most device drivers can be compiled as a loadable kernel module, or alternatively they can be statically linked into the Linux kernel. In the first case initialization of a device driver will be produced via the `module_init` and `module_exit` macros that are defined in include/linux/init.h:

```C
#define module_init(initfn)					\
	static inline initcall_t __inittest(void)		\
	{ return initfn; }					\
	int init_module(void) __attribute__((alias(#initfn)));

#define module_exit(exitfn)					\
	static inline exitcall_t __exittest(void)		\
	{ return exitfn; }					\
	void cleanup_module(void) __attribute__((alias(#exitfn)));
```

and will be called by the initcall functions:

* `early_initcall`
* `pure_initcall`
* `core_initcall`
* `postcore_initcall`
* `arch_initcall`
* `subsys_initcall`
* `fs_initcall`
* `rootfs_initcall`
* `device_initcall`
* `late_initcall`

that are called in the `do_initcalls` from init/main.c. Otherwise, if a device driver is statically linked into the Linux kernel, the implementation of these macros will be the following:

```C
#define module_init(x)	__initcall(x);
#define module_exit(x)	__exitcall(x);
```

In this way the implementation of module loading is placed in the kernel/module.c source code file and initialization occurs in the `do_init_module` function. We will not dive into the details about loadable modules in this chapter, but will see them in a special chapter that will describe Linux kernel modules. Ok, the `module_init` macro takes one parameter - `serial21285_init` in our case. As we can understand from the function's name, this function does stuff related to the driver initialization.
Let's look at it:

```C
static int __init serial21285_init(void)
{
	int ret;

	printk(KERN_INFO "Serial: 21285 driver\n");

	serial21285_setup_ports();

	ret = uart_register_driver(&serial21285_reg);
	if (ret == 0)
		uart_add_one_port(&serial21285_reg, &serial21285_port);

	return ret;
}
```

As we can see, first of all it prints information about the driver to the kernel buffer and then calls the `serial21285_setup_ports` function. This function sets up the base uart clock of the `serial21285_port` device:

```C
unsigned int mem_fclk_21285 = 50000000;

static void serial21285_setup_ports(void)
{
	serial21285_port.uartclk = mem_fclk_21285 / 4;
}
```

Here the `serial21285_reg` is the structure that describes the `uart` driver:

```C
static struct uart_driver serial21285_reg = {
	.owner		= THIS_MODULE,
	.driver_name	= "ttyFB",
	.dev_name	= "ttyFB",
	.major		= SERIAL_21285_MAJOR,
	.minor		= SERIAL_21285_MINOR,
	.nr		= 1,
	.cons		= SERIAL_21285_CONSOLE,
};
```

If the driver registered successfully we attach the driver-defined port `serial21285_port` structure with the `uart_add_one_port` function from the drivers/tty/serial/serial_core.c source code file and return from the `serial21285_init` function:

```C
if (ret == 0)
	uart_add_one_port(&serial21285_reg, &serial21285_port);

return ret;
```

That's all. Our driver is initialized. When a `uart` port is opened with the call of the `uart_open` function from drivers/tty/serial/serial_core.c, it will call the `uart_startup` function to start up the serial port. This function will call the `startup` function that is part of the `uart_ops` structure. Each `uart` driver has a definition of this structure; in our case it is:

```C
static struct uart_ops serial21285_ops = {
	...
	.startup	= serial21285_startup,
	...
}
```

As we can see, the `.startup` field references the `serial21285_startup` function.
The implementation of this function is very interesting for us, because it is related to the interrupts and interrupt handling.

Requesting irq line

Let's look at the implementation of the `serial21285_startup` function:

```C
static int serial21285_startup(struct uart_port *port)
{
	int ret;

	tx_enabled(port) = 1;
	rx_enabled(port) = 1;

	ret = request_irq(IRQ_CONRX, serial21285_rx_chars, 0,
			  serial21285_name, port);
	if (ret == 0) {
		ret = request_irq(IRQ_CONTX, serial21285_tx_chars, 0,
				  serial21285_name, port);
		if (ret)
			free_irq(IRQ_CONRX, port);
	}

	return ret;
}
```

First of all, about `TX` and `RX`. A serial bus of a device consists of just two wires: one for sending data and another for receiving. As such, serial devices should have two serial pins: the receiver - `RX`, and the transmitter - `TX`. With the call of the first two macros, `tx_enabled` and `rx_enabled`, we enable these wires. The following part of this function is of the greatest interest for us. Note the `request_irq` function calls. This function registers an interrupt handler and enables a given interrupt line. Let's look at the implementation of this function and get into the details. This function is defined in the include/linux/interrupt.h header file and looks like:

```C
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
	    const char *name, void *dev)
{
	return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}
```

As we can see, the `request_irq` function takes five parameters:

* `irq` - the interrupt number being requested;
* `handler` - the pointer to the interrupt handler;
* `flags` - the bitmask options;
* `name` - the name of the owner of an interrupt;
* `dev` - the pointer used for shared interrupt lines.

Now let's look at the calls of the `request_irq` function in our example. As we can see, the first parameter is `IRQ_CONRX`. We know that it is the number of the interrupt, but what is `CONRX`? This macro is defined in the arch/arm/mach-footbridge/include/mach/irqs.h header file.
There we can find the full list of interrupts that the `21285` board can generate. Note that in the second call of the `request_irq` function we pass the `IRQ_CONTX` interrupt number. Both of these interrupts will handle the `RX` and `TX` events in our driver. The implementation of these macros is easy:

```C
#define IRQ_CONRX		_DC21285_IRQ(0)
#define IRQ_CONTX		_DC21285_IRQ(1)
...
#define _DC21285_IRQ(x)		(16 + (x))
```

The ISA IRQs on this board are from `0` to `15`, so our interrupts will have the first two numbers after them: `16` and `17`. The second parameters for the two calls of the `request_irq` function are `serial21285_rx_chars` and `serial21285_tx_chars`. These functions will be called when an `RX` or `TX` interrupt occurs. We will not dive into the details of these functions in this part, because this chapter covers interrupts and interrupt handling and not devices and drivers. The next parameter is `flags` and, as we can see, it is zero in both calls of the `request_irq` function. All acceptable flags are defined as `IRQF_*` macros in include/linux/interrupt.h. Some of them:

* `IRQF_SHARED` - allows sharing the irq among several devices;
* `IRQF_PERCPU` - an interrupt is per cpu;
* `IRQF_NO_THREAD` - an interrupt cannot be threaded;
* `IRQF_NOBALANCING` - excludes this interrupt from irq balancing;
* `IRQF_IRQPOLL` - an interrupt is used for polling;

and so on.

In our case we pass `0`, so it will be `IRQF_TRIGGER_NONE`. This flag means that it does not imply any kind of edge or level triggered interrupt behaviour. To the fourth parameter (`name`), we pass `serial21285_name` that is defined as:

```C
static const char serial21285_name[] = "Footbridge UART";
```

and will be displayed in the output of `/proc/interrupts`. And in the last parameter we pass the pointer to our main `uart_port` structure. Now we know a little about the `request_irq` function and its parameters; let's look at its implementation. As we can see above, the `request_irq` function just makes a call of the `request_threaded_irq` function inside. The `request_threaded_irq` function is defined in the kernel/irq/manage.c source code file and allocates a given interrupt line.
If we look at this function, it starts with the definition of the `irqaction` and the `irq_desc`:

```C
int request_threaded_irq(unsigned int irq, irq_handler_t handler,
			 irq_handler_t thread_fn, unsigned long irqflags,
			 const char *devname, void *dev_id)
{
	struct irqaction *action;
	struct irq_desc *desc;
	int retval;
	...
}
```

We already saw the `irqaction` and the `irq_desc` structures in this chapter. The first structure represents a per-interrupt action descriptor and contains pointers to the interrupt handler, name of the device, interrupt number, etc. The second structure represents a descriptor of an interrupt and contains a pointer to the `irqaction`, interrupt flags, etc. Note that the `request_threaded_irq` function is called by `request_irq` with the additional parameter `irq_handler_t thread_fn`. If this parameter is not `NULL`, an `irq` thread will be created and the given `irq` handler will be executed in this thread. In the next step we need to make the following checks:

```C
if (((irqflags & IRQF_SHARED) && !dev_id) ||
    (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
    ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
	return -EINVAL;
```

First of all we check that a real `dev_id` is passed for a shared interrupt and that `IRQF_COND_SUSPEND` only makes sense for shared interrupts. Otherwise we exit from this function with the `-EINVAL` error.
After this we convert the given `irq` number to the `irq` descriptor with the help of the `irq_to_desc` function that is defined in the kernel/irq/irqdesc.c source code file and exit from this function with the `-EINVAL` error if it was not successful:

```C
desc = irq_to_desc(irq);
if (!desc)
	return -EINVAL;
```

The `irq_to_desc` function checks that the given `irq` number is less than the maximum number of IRQs and returns the irq descriptor, where the `irq` number is an offset into the `irq_desc` array:

```C
struct irq_desc *irq_to_desc(unsigned int irq)
{
	return (irq < NR_IRQS) ? irq_desc + irq : NULL;
}
```

After this we allocate an `irqaction` and initialize it with the values of the interrupt handler, interrupt flags, device name, etc.:

```C
action->handler = handler;
action->thread_fn = thread_fn;
action->flags = irqflags;
action->name = devname;
action->dev_id = dev_id;
```

At the end of the `request_threaded_irq` function we call the `__setup_irq` function from kernel/irq/manage.c that registers the given `irqaction`; on failure we release the memory for the `irqaction` and return:

```C
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, action);
chip_bus_sync_unlock(desc);

if (retval)
	kfree(action);

return retval;
```

Note that the call of the `__setup_irq` function is placed between the `chip_bus_lock` and the `chip_bus_sync_unlock` functions. These functions lock/unlock access to slow bus (like i2c) chips. Now let's look at the implementation of the `__setup_irq` function. In the beginning of the `__setup_irq` function we can see a couple of different checks. First of all we check that the given interrupt descriptor is not `NULL`, the `irqchip` is not `NULL` and that the given interrupt descriptor module owner is not `NULL`. After this we check whether the interrupt is nested into another interrupt thread or not, and if it is nested we replace the `irq_default_primary_handler` with the `irq_nested_primary_handler`.

In the next step we create an irq handler thread with the `kthread_create` function, if the given interrupt is not nested and the `thread_fn` is not `NULL`:

```C
if (new->thread_fn && !nested) {
	struct task_struct *t;

	t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
	...
}
```

And we fill the rest of the given interrupt descriptor fields at the end.
So, our interrupt request lines `16` and `17` are registered and the `serial21285_rx_chars` and `serial21285_tx_chars` functions will be invoked when the interrupt controller gets an event related to these interrupts. Now let's look at what happens when an interrupt occurs.

Prepare to handle an interrupt

In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and the registration of the `irqaction` structure for the given interrupt. We already know that when an interrupt event occurs, the interrupt controller notifies the processor about this event and the processor tries to find the appropriate interrupt gate for this interrupt. If you have read the eighth part of this chapter, you may remember the `native_init_IRQ` function. This function makes initialization of the local APIC. The following part of this function is the most interesting part for us right now:

```C
for_each_clear_bit_from(i, used_vectors, first_system_vector) {
	set_intr_gate(i, irq_entries_start +
		      8 * (i - FIRST_EXTERNAL_VECTOR));
}
```

Here we iterate over all the cleared bits of the `used_vectors` bitmap up to `first_system_vector` that is:

```C
int first_system_vector = FIRST_SYSTEM_VECTOR; // 0xef
```

and set interrupt gates with the `i` vector number and the `irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR)` start address. Only one thing is unclear here - `irq_entries_start`. This symbol is defined in the arch/x86/entry/entry_64.S assembly file and provides the `irq` entries. Let's look at it:

```assembly
	.align 8
ENTRY(irq_entries_start)
    vector=FIRST_EXTERNAL_VECTOR
    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
	pushq	$(~vector+0x80)
    vector=vector+1
	jmp	common_interrupt
	.align	8
    .endr
END(irq_entries_start)
```

Here we can see the GNU assembler `.rept` instruction which repeats the sequence of lines that are before `.endr` - `FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR` times. As we already know, the `FIRST_SYSTEM_VECTOR` is `0xef`, and the `FIRST_EXTERNAL_VECTOR` is equal to `0x20`. So, it will work:

```python
>>> 0xef - 0x20
207
```

times.
In the body of the `.rept` instruction we push entry stubs on the stack (note that we use negative numbers for the interrupt vector numbers, because positive numbers are already reserved to identify system calls), increase the `vector` variable and jump to the `common_interrupt` label. In `common_interrupt` we adjust the vector number on the stack and execute the `interrupt` macro with the `do_IRQ` parameter:

```assembly
common_interrupt:
	addq	$-0x80, (%rsp)
	interrupt do_IRQ
```

The `interrupt` macro is defined in the same source code file; it saves the general purpose registers on the stack, switches from the userspace `gs` to the kernel one with the `SWAPGS` assembler instruction if needed, increases the per-cpu `irq_count` variable that shows that we are in an interrupt, and calls the `do_IRQ` function. This function is defined in the arch/x86/kernel/irq.c source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - the `pt_regs` structure that stores values of the userspace registers:

```C
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	unsigned vector = ~regs->orig_ax;
	unsigned irq;

	irq_enter();
	exit_idle();
	...
}
```

At the beginning of this function we can see the call of the `set_irq_regs` function that returns the saved per-cpu irq register pointer, and the calls of the `irq_enter` and `exit_idle` functions. The first function, `irq_enter`, enters an interrupt context by updating the `__preempt_count` variable, and the second function, `exit_idle`, checks that the current process is `idle` with pid `0` and notifies the `idle_notifier` with `IDLE_END` for the current cpu.

In the next step we read the `irq` and call the `handle_irq` function:

```C
irq = __this_cpu_read(vector_irq[vector]);

if (!handle_irq(irq, regs)) {
	...
}
...
```

The `handle_irq` function is defined in the arch/x86/kernel/irq_64.c source code file, checks the given interrupt descriptor and calls `generic_handle_irq_desc`:

```C
desc = irq_to_desc(irq);
if (unlikely(!desc))
	return false;

generic_handle_irq_desc(irq, desc);
```

Where the `generic_handle_irq_desc` calls the
interrupt handler:

```C
static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
	desc->handle_irq(irq, desc);
}
```

But stop... What is `handle_irq` and why do we call our interrupt handler from the interrupt descriptor when we know that `irqaction` points to the actual interrupt handler? Actually the `irq_desc->handle_irq` is a high-level API for calling the interrupt handler routine. It is set up during the initialization of the device tree and the APIC initialization. The kernel selects the correct function and call chain of the `irq->action(s)` there. In this way, the `serial21285_rx_chars` or the `serial21285_tx_chars` function will be executed after an interrupt occurs.

At the end of the `do_IRQ` function we call the `irq_exit` function that will exit from the interrupt context, then `set_irq_regs` with the old userspace registers, and return:

```C
irq_exit();
set_irq_regs(old_regs);
return 1;
```

We already know that when an `IRQ` finishes its work, deferred interrupts will be executed if they exist.

Exit from interrupt

Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function is finished, we return back to the assembler code in arch/x86/entry/entry_64.S, to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction, and decrease the value of the `irq_count` per-cpu variable. Remember, this variable had the value `1` when we were in interrupt context:

```assembly
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl	PER_CPU_VAR(irq_count)
```

In the last step we check the previous context (user or kernel), restore it in the correct way and exit from the interrupt with the `INTERRUPT_RETURN` macro:

```C
#define INTERRUPT_RETURN	jmp native_iret
```

and

```assembly
ENTRY(native_iret)
.global native_irq_return_iret
native_irq_return_iret:
	iretq
```

That's all.

Conclusion

It is the end of the tenth part of the Interrupts and Interrupt Handling chapter and, as you read in the beginning of this part, it is the last part of this chapter. This chapter started with an explanation of the theory of interrupts: we learned what an interrupt is and what kinds of interrupts exist, then we saw exceptions and the handling of this kind of interrupt, then deferred interrupts, and finally we looked at hardware interrupts and their handling in this part. Of course, this part and even this chapter do not cover all aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was a big part, I don't know about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that there is a book somewhere that covers it. We have missed many parts and aspects of interrupts and interrupt handling, but I think this will be a good point from which to dive into the kernel code related to interrupts and their handling.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience.
If you find any mistakes, please send me a PR to linux-insides.

## Links

* Serial driver documentation
* StrongARM SA-110/21285 Evaluation Board
* IRQ
* module
* initcall
* uart
* ISA
* memory management
* i2c
* APIC
* GNU assembler
* Processor register
* per-cpu
* pid
* device tree
* system calls
* Previous part

# System calls

This chapter describes the `system call` concept in the Linux kernel.

* Introduction to system call concept - this part is an introduction to the `system call` concept in the Linux kernel.
* How the Linux kernel handles a system call - this part describes how the Linux kernel handles a system call from a userspace application.
* vsyscall and vDSO - the third part describes the `vsyscall` and `vDSO` concepts.
* How the Linux kernel runs a program - this part describes the startup process of a program.
* Implementation of the open system call - this part describes the implementation of the open system call.
* Limits on resources in Linux - this part describes the implementation of the getrlimit/setrlimit system calls.

# System calls in the Linux kernel. Part 1.

## Introduction

This post opens up a new chapter in the linux-insides book, and as you may understand from the title, this chapter will be devoted to the System call concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous chapter we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts, because the most common way to implement system calls is as software interrupts. We will see many different aspects related to the system call concept. For example, we will learn what happens when a system call occurs from userspace, we will see the implementation of a couple of system call handlers in the Linux kernel, the VDSO and vsyscall concepts, and many many more.

Before we dive into the Linux system call implementation, it is good to know some theory about system calls. Let's do it in the following paragraph.

## System call. What is it?

A system call is just a userspace request for a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start listening for connections on a socket, delete or create a directory, or even finish its work, it uses a system call. In other words, a system call is just a kernel space C function that user space programs call to handle some request.

The Linux kernel provides a set of these functions and each architecture provides its own set. For example: the x86_64 provides 322 system calls and the x86 provides 358 different system calls. Ok, a system call is just a function. Let's look at a simple `Hello world` example written in the assembly programming language:

```assembly
.data

msg:
    .ascii "Hello, world!\n"
    len = . - msg

.text
    .global _start

_start:
    movq  $1, %rax
    movq  $1, %rdi
    movq  $msg, %rsi
    movq  $len, %rdx
    syscall

    movq  $60, %rax
    xorq  %rdi, %rdi
    syscall
```

We can compile the above with the following commands:

```
$ gcc -c test.S
$ ld -o test test.o
```

and run it as follows:

```
./test
Hello, world!
```

Ok, what do we see here? This simple code represents the `Hello world` assembly program for the Linux x86_64 architecture. We can see two sections here:

* `.data`
* `.text`

The first section - `.data` - stores the initialized data of our program (the `Hello world` string and its length in our case). The second section - `.text` - contains the code of our program. We can split the code of our program into two parts: the first part is everything before the first `syscall` instruction and the second part is between the first and second `syscall` instructions. First of all, what does the `syscall` instruction do in our code and in general? As we can read in the 64-ia-32-architectures-software-developer-vol-2b-manual:

> SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX).
> (The WRMSR instruction ensures that the IA32_LSTAR MSR always contains a canonical address.)
>
> ...
>
> SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, the descriptor caches are loaded with fixed values. It is the responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this correspondence.

To summarize, the `syscall` instruction jumps to the address stored in the `MSR_LSTAR` model specific register (Long system target address register). The kernel is responsible for providing its own custom function for handling syscalls as well as writing the address of this handler function to the `MSR_LSTAR` register upon system startup. The custom function is `entry_SYSCALL_64`, which is defined in arch/x86/entry/entry_64.S. The address of this syscall handling function is written to the `MSR_LSTAR` register during startup in arch/x86/kernel/cpu/common.c:

```C
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```

So, the `syscall` instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose registers. As you can see in the system call table, each system call has a unique number. In our example the first system call is `write`, which writes data to the given file. Let's look in the system call table and try to find the `write` system call. As we can see, the `write` system call has number `1`. We pass the number of this system call through the `rax` register in our example. The next general purpose registers, `%rdi`, `%rsi`, and `%rdx`, take the three parameters of the `write` syscall. In our case, they are:

* File descriptor (`1` is stdout in our case)
* Pointer to our string
* Size of data

Yes, you heard right. Parameters for a system call.
As I already wrote above, a system call is just a C function in kernel space. In our case the first system call is write. This system call is defined in the fs/read_write.c source code file and looks like:

```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	...
	...
	...
}
```

Or in other words:

```C
ssize_t write(int fd, const void *buf, size_t nbytes);
```

Don't worry about the `SYSCALL_DEFINE3` macro for now, we'll come back to it.

The second part of our example is the same, but we call another system call. In this case we call the exit system call. This system call gets only one parameter:

* Return value

and handles the way our program exits. We can pass the name of our program to the strace util and we will see our system calls:

```
$ strace test
execve("./test", ["./test"], [/* 62 vars */]) = 0
write(1, "Hello, world!\n", 14Hello, world!
)             = 14
_exit(0)                                = ?
+++ exited with 0 +++
```

In the first line of the `strace` output, we can see the execve system call that executes our program, and the second and third are the system calls that we have used in our program: `write` and `exit`. Note that we pass parameters through the general purpose registers in our example. The order of the registers is not accidental: it is defined by the following agreement - the x86-64 calling conventions. This and other agreements for the `x86_64` architecture are explained in the special document - System V Application Binary Interface. PDF. In a general way, the argument(s) of a function are placed either in registers or pushed on the stack. The right order is:

* `rdi`
* `rsi`
* `rdx`
* `rcx`
* `r8`
* `r9`
If a function has more than six arguments, theremaining parameters will be placed on the stack.We do not use system calls in our code directly, but our program uses them when we wantto print something, check access to a file or just write or read something to it.For example:#include int main(int argc, char **argv){FILE *fp;char buff[255];fp = fopen("test.txt", "r");fgets(buff, 255, fp);printf("%s\n", buff);fclose(fp);return 0;}There are noopenand,readfclosefopen,write,fgets, and,printfcloseare defined in theC, andfclosesystem calls in the Linux kernel, butinstead. I think you know thatfopen,fgets,printf,standard library. Actually, these functions are justwrappers for the system calls. We do not call system calls directly in our code, but insteaduse these wrapper functions from the standard library. The main reason of this is simple: asystem call must be performed quickly, very quickly. As a system call must be quick, it mustbe small. The standard library takes responsibility to perform system calls with the correctparameters and makes different checks before it will call the given system call. Let's compileour program with the following command:$gcc test.c -o testand examine it with the ltrace util:387Introduction to system calls$ ltrace ./test__libc_start_main([ "./test" ] fopen("test.txt", "r")= 0x602010fgets("Hello World!\n", 255, 0x602010)= 0x7ffd2745e700puts("Hello World!\n"Hello World!)= 14fclose(0x602010)= 0+++ exited (status 0) +++Theltraceutil displays a set of userspace calls of a program. Thethe given text file, thefunction reads file content to thefgetsfunction prints the buffer tostdout, and thefclosebuffopenfunction opensbuffer, theputsfunction closes the file given by the filedescriptor. And as I already wrote, all of these functions call an appropriate system call. Forexample,to theputsltracecalls thewritesystem call inside, we can see it if we will add-Soptionprogram:write@SYS(1, "Hello World!\n\n", 14) = 14Yes, system calls are ubiquitous. 
Each program needs to open/write/read files and network connections, allocate memory, and do many other things that can be provided only by the kernel. The proc file system contains a special file in the format `/proc/pid/syscall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, pid 1 is systemd for me:

```
$ sudo cat /proc/1/comm
systemd
$ sudo cat /proc/1/syscall
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
```

The system call with number `232` is epoll_wait, which waits for an I/O event on an epoll file descriptor. Or, for example, the `emacs` editor where I'm writing this part:

```
$ ps ax | grep emacs
2093 ?        Sl     2:40  emacs
$ sudo cat /proc/2093/comm
emacs
$ sudo cat /proc/2093/syscall
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
```

The system call with number `270` is sys_pselect6, which allows `emacs` to monitor multiple file descriptors.

Now we know a little about system calls: what they are and why we need them. So let's look at the `write` system call that our program used.

## Implementation of the write system call

Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call is defined in the fs/read_write.c source code file and looks like this:

```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
```

First of all, the `SYSCALL_DEFINE3` macro is defined in the include/linux/syscalls.h header file and expands to the definition of a `sys_name(...)` function. Let's look at this macro:

```C
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)				\
	SYSCALL_METADATA(sname, x, __VA_ARGS__)			\
	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
```

As we can see, the `SYSCALL_DEFINE3` macro takes a `name` parameter which will represent the name of a system call, plus a variadic number of parameters. This macro just expands to the `SYSCALL_DEFINEx` macro, which takes the number of parameters of the given system call and the `_##name` stub for the future name of the system call (you can read more about token concatenation with `##` in the documentation of gcc). Next we can see the `SYSCALL_DEFINEx` macro. This macro expands to the two following macros:

* `SYSCALL_METADATA`;
* `__SYSCALL_DEFINEx`.

Implementation of the first macro, `SYSCALL_METADATA`, depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As we can understand from the name of this option, it allows the tracer to catch the syscall entry and exit events. If this kernel configuration option is enabled, the `SYSCALL_METADATA` macro executes initialization of the `syscall_metadata` structure, which is defined in the include/trace/syscall.h header file and contains different useful fields such as the name of a system call, its number in the system call table, the number of its parameters, a list of parameter types, and so on:

```C
#define SYSCALL_METADATA(sname, nb, ...)			\
	...							\
	...							\
	...							\
	struct syscall_metadata __used				\
	  __syscall_meta_##sname = {				\
		.name		= "sys"#sname,			\
		.syscall_nr	= -1,				\
		.nb_args	= nb,				\
		.types		= nb ? types_##sname : NULL,	\
		.args		= nb ? args_##sname : NULL,	\
		.enter_event	= &event_enter_##sname,		\
		.exit_event	= &event_exit_##sname,		\
		.enter_fields	= LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
	};							\
	static struct syscall_metadata __used			\
	  __attribute__((section("__syscalls_metadata")))	\
	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
```

If the `CONFIG_FTRACE_SYSCALLS` kernel option is not enabled during kernel configuration, the `SYSCALL_METADATA` macro expands to an empty string:

```C
#define SYSCALL_METADATA(sname, nb, ...)
```

The second macro, `__SYSCALL_DEFINEx`, expands to the definition of the five following functions:

```C
#define __SYSCALL_DEFINEx(x, name, ...)					\
	asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))	\
		__attribute__((alias(__stringify(SyS##name))));		\
									\
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));	\
									\
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));	\
									\
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))	\
	{								\
		long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));	\
		__MAP(x,__SC_TEST,__VA_ARGS__);				\
		__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));	\
		return ret;						\
	}								\
									\
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```

The first, `sys##name`, is the definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines the system type and the parameter name of each input parameter, because the macro definition is unable to determine the parameter types, and the `__MAP` macro applies the `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions that are generated by the `__SYSCALL_DEFINEx` macro are needed to protect from the CVE-2009-0029 and we will not dive into details about this here. Ok, as a result of the `SYSCALL_DEFINE3` macro, we will have:

```C
asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
```

Now we know a little about the system call's definition and we can go back to the implementation of the `write` system call.
Let's look at the implementation of this system call again:

```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
```

As we already know and can see from the code, it takes three arguments:

* `fd` - file descriptor;
* `buf` - buffer to write;
* `count` - length of buffer to write.

and writes data from a buffer declared by the user to a given device or file. Note that the second parameter, `buf`, is defined with the `__user` attribute. The main purpose of this attribute is to allow checking of the Linux kernel code with the sparse util. It is defined in the include/linux/compiler.h header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about the useful meta-information related to our `sys_write` system call; let's try to understand how this system call is implemented. As we can see, it starts with the definition of the `f` variable that has the `struct fd` type, which represents a file descriptor in the Linux kernel, and we put the result of the call of the `fdget_pos` function there. The `fdget_pos` function is defined in the same source code file and just expands to a call of the `__to_fd` function:

```C
static inline struct fd fdget_pos(int fd)
{
	return __to_fd(__fdget_pos(fd));
}
```

The main purpose of `fdget_pos` is to convert the given file descriptor, which is just a number, to the `fd` structure. Through a long chain of function calls, the `fdget_pos` function gets the file descriptor table of the current process, `current->files`, and tries to find the corresponding file descriptor number there. Once we have the `fd` structure for the given file descriptor number, we check it and return if it does not exist.
We get the current position in the file with a call to the `file_pos_read` function, which just returns the `f_pos` field of our file:

```C
static inline loff_t file_pos_read(struct file *file)
{
	return file->f_pos;
}
```

and then we call the `vfs_write` function. The `vfs_write` function is defined in the fs/read_write.c source code file and does the work for us: it writes the given buffer to the given file starting from the given position. We will not dive into the details of the `vfs_write` function, because this function is only weakly related to the `system call` concept; it mostly belongs to the Virtual file system concept, which we will see in another chapter. After `vfs_write` has finished its work, we check the result and, if it finished successfully, we change the position in the file with the `file_pos_write` function:

```C
if (ret >= 0)
	file_pos_write(f.file, pos);
```

which just updates `f_pos` with the given position in the given file:

```C
static inline void file_pos_write(struct file *file, loff_t pos)
{
	file->f_pos = pos;
}
```

At the end of our `write` system call handler, we can see the call of the following function:

```C
fdput_pos(f);
```

which unlocks the `f_pos_lock` mutex that protects the file position during concurrent writes from threads that share the file descriptor.

That's all.

We have seen the partial implementation of one system call provided by the Linux kernel. Of course, we have missed some parts in the implementation of the `write` system call because, as I mentioned above, in this chapter we will only see system call related stuff and nothing related to other subsystems, such as the Virtual file system.

## Conclusion

This concludes the first part covering system call concepts in the Linux kernel.
We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email, or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

## Links

* system call
* vdso
* vsyscall
* general purpose registers
* socket
* C programming language
* x86
* x86_64
* x86-64 calling conventions
* System V Application Binary Interface. PDF
* GCC
* Intel manual. PDF
* system call table
* GCC macro documentation
* file descriptor
* stdout
* strace
* standard library
* wrapper functions
* ltrace
* sparse
* proc file system
* Virtual file system
* systemd
* epoll
* Previous chapter

# System calls in the Linux kernel. Part 2.

## How does the Linux kernel handle a system call

The previous part was the first part of the chapter that describes the system call concepts in the Linux kernel. In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the write system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving on to the Linux kernel code.

A user application does not make a system call directly. We did not write our `Hello world!` program like:

```C
int main(int argc, char **argv)
{
	...
	...
	...
	sys_write(fd1, buf, strlen(buf));
	...
	...
}
```

We can use something similar with the help of the C standard library and it will look something like this:

```C
#include <unistd.h>

int main(int argc, char **argv)
{
	...
	...
	...
	write(fd1, buf, strlen(buf));
	...
	...
}
```

But anyway, `write` is not a direct system call and not a kernel function.
An application must fill the general purpose registers with the correct values in the correct order and use the `syscall` instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the `syscall` instruction is met by the processor.

## Initialization of the system call table

From the previous part we know that the system call concept is very similar to an interrupt. Furthermore, system calls are implemented as software interrupts. So, when the processor handles a `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or, in other words, the kernel C functions that will react to an exception) are placed in the kernel code. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array in the Linux kernel, which is defined in the arch/x86/entry/syscall_64.c source code file. Let's look at its implementation:

```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
```

As we can see, the `sys_call_table` is an array of `__NR_syscall_max + 1` size, where the `__NR_syscall_max` macro represents the maximum number of system calls for the given architecture. This book is about the x86_64 architecture, so in our case `__NR_syscall_max` is `322`, and this is the correct number at the time of writing (the current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by Kbuild during kernel compilation - include/generated/asm-offsets.h:

```C
#define __NR_syscall_max 322
```

There will be the same number of system calls in arch/x86/entry/syscalls/syscall_64.tbl for the `x86_64`.
There are two important topics here: the type of the `sys_call_table` array and the initialization of the elements in this array. First of all, the type. The `sys_call_ptr_t` type represents a pointer to a system call handler. It is defined as a typedef for a function pointer that returns nothing and takes no arguments:

```C
typedef void (*sys_call_ptr_t)(void);
```

The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of the `sys_call_table` array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize the storage of the pointers to the system call handlers; they are populated later on. The implementation of `sys_ni_syscall` is pretty easy, it just returns -errno, or `-ENOSYS` in our case:

```C
asmlinkage long sys_ni_syscall(void)
{
	return -ENOSYS;
}
```

The `-ENOSYS` error tells us that:

```
ENOSYS          Function not implemented (POSIX.1)
```

Also a note on `...` in the initialization of the `sys_call_table`: we can do it with a GCC compiler extension called Designated Initializers. This extension allows us to initialize elements in non-fixed order. As you can see, we include the `asm/syscalls_64.h` header at the end of the array.
This header file is generated by a special script, arch/x86/entry/syscalls/syscalltbl.sh, from the syscall table. The `asm/syscalls_64.h` contains definitions of the following macros:

```C
__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...
```

The `__SYSCALL_COMMON` macro is defined in the same source code file and expands to the `__SYSCALL_64` macro, which in turn expands to an array element initialization:

```C
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
```

So, after this, our `sys_call_table` takes the following form:

```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
	[0] = sys_read,
	[1] = sys_write,
	[2] = sys_open,
	...
	...
	...
};
```

After this, all elements that point to non-implemented system calls contain the address of the `sys_ni_syscall` function, which just returns `-ENOSYS` as we saw above, and the other elements point to the `sys_syscall_name` functions.

At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the chapter about interrupts and interrupt handling: when the Linux kernel gets control to handle an interrupt, it has to do some preparation, like saving the user space registers, switching to a new stack, and many more tasks, before it calls an interrupt handler. The same is true of system call handling.
The preparation for handling a system call comes first, but before the Linux kernel starts these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.

## Initialization of the system call entry

When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - 64-ia-32-architectures-software-developer-vol-2b-manual:

> SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR

This means that we need to put the system call entry into the `IA32_LSTAR` model specific register. This operation takes place during the Linux kernel initialization process. If you have read the fourth part of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the arch/x86/kernel/setup.c source code file and executes the initialization of the `non-early` exception handlers, like divide error, coprocessor error, etc. Besides the initialization of the `non-early` exception handlers, this function calls the `cpu_init` function from the arch/x86/kernel/cpu/common.c source code file, which, besides the initialization of the `per-cpu` state, calls the `syscall_init` function from the same source code file.

This function performs the initialization of the system call entry point. It does not take parameters and first of all it fills two model specific registers:

```C
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 |
		 ((u64)__KERNEL_CS)<<32);
```

...

# vsyscall and vDSO

...

```C
#ifdef CONFIG_X86_VSYSCALL_EMULATION
	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
```

It is equal to `511`. The second argument of `__set_fixmap` is the physical address of the page that has to be mapped, and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. They will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE`, and `PAGE_KERNEL_VVAR` otherwise. Both macros (the `PAGE_KERNEL_VSYSCALL` and the `PAGE_KERNEL_VVAR`) expand to the following flags:

```C
#define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR		(__PAGE_KERNEL_RO | _PAGE_USER)
```

which represent the access rights to the `vsyscall` page. Both flags have the same `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The second flag depends on the value of the `vsyscall_mode` variable. The first flag, `__PAGE_KERNEL_VSYSCALL`, is set when `vsyscall_mode` is `NATIVE`. This means virtual system calls will be native `syscall` instructions. Otherwise the vsyscall page will have `PAGE_KERNEL_VVAR` and virtual system calls will be turned into traps and emulated reasonably. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function:

```C
static int __init vsyscall_setup(char *str)
{
	if (str) {
		if (!strcmp("emulate", str))
			vsyscall_mode = EMULATE;
		else if (!strcmp("native", str))
			vsyscall_mode = NATIVE;
		else if (!strcmp("none", str))
			vsyscall_mode = NONE;
		else
			return -EINVAL;

		return 0;
	}

	return -EINVAL;
}
```

which is called during early kernel parameter parsing:

```C
early_param("vsyscall", vsyscall_setup);
```

You can read more about the `early_param` macro in the sixth part of the chapter that describes the process of the initialization of the Linux kernel.

At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the BUILD_BUG_ON macro:
The result of the all the above is the following: If wevsyscall=nativehandled as nativeparameter to the kernel command line, virtual system calls will besyscallinstructions in the arch/x86/entry/vsyscall/vsyscall_emu_64.S.The glibc knows addresses of the virtual system call handlers. Note that virtual system callhandlers are aligned by(or10240x400) bytes:__vsyscall_page:mov__NR_gettimeofday, %raxsyscallret.balign 1024, 0xccmov __NR_time, %raxsyscallret.balign 1024, 0xccmov__NR_getcpu, %raxsyscallretAnd the start address of thevsyscallpage is theffffffffff600000every time. So, theglibc knows the addresses of the all virtual system call handlers. You can find definition ofthese addresses in thesource code:glibc#define VSYSCALL_ADDR_vgettimeofday0xffffffffff600000#define VSYSCALL_ADDR_vtime0xffffffffff600400#define VSYSCALL_ADDR_vgetcpu0xffffffffff600800All virtual system call requests will fall into theVSYSCALL_ADDR_vsyscall_name__vsyscall_page+offset, put the number of a virtual system call to thegeneral purpose register and the native for the x86_64In the second case, if we passvsyscall=emulatesyscallraxinstruction will be executed.parameter to the kernel command line, anattempt to perform virtual system call handler will cause a page fault exception. Of course,remember, theThevsyscallpage hasfunction is thedo_page_fault__PAGE_KERNEL_VVAR#PFaccess rights that forbid execution.or page fault handler. It tries to understand thereason of the last page fault. And one of the reason can be situation when virtual system callcalled andvsyscallemulate_vsyscallmode isemulate. 
In this case, the `vsyscall` will be handled by the `emulate_vsyscall` function, which is defined in the arch/x86/entry/vsyscall/vsyscall_64.c source code file.

The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints an error and sends a segmentation fault if the number is invalid, and then dispatches to the real system call handler:

```C
...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);

if (vsyscall_nr < 0) {
	...
}

switch (vsyscall_nr) {
case 0:
	ret = sys_gettimeofday(
		(struct timeval __user *)regs->di,
		(struct timezone __user *)regs->si);
	break;
...
...
...
}
```

In the end we put the result of `sys_gettimeofday` or another virtual system call handler into the `ax` general purpose register, as we did with normal system calls, then restore the instruction pointer register and add `8` bytes to the stack pointer register; this operation emulates the `ret` instruction:

```C
	regs->ax = ret;

do_ret:
	regs->ip = caller;
	regs->sp += 8;
	return true;
```

That's all. Now let's look at the modern concept - `vDSO`.

## Introduction to vDSO

As I already wrote above, `vsyscall` is an obsolete concept and has been replaced by the `vDSO`, or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vsyscall` is static in memory and has the same address every time, while the `vDSO` maps memory pages into each process in a shared object form. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications that dynamically link to glibc will use the `vDSO` automatically. For example:

```
~$ ldd /bin/uname
	linux-vdso.so.1 (0x00007ffe014b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
	/lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
```

Or:

```
~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0  [vdso]
```

Here we can see that the uname util was linked with three libraries:

* `linux-vdso.so.1`;
* `libc.so.6`;
* `ld-linux-x86-64.so.2`.

The first provides the `vDSO` functionality, the second is the `C` standard library, and the third is the program interpreter (you can read more about this in the part that describes linkers). So, the `vDSO` solves the limitations of the `vsyscall`.
Implementation of the `vDSO` is similar to that of the `vsyscall`. Initialization of the `vDSO` occurs in the `init_vdso` function that is defined in the arch/x86/entry/vdso/vma.c source code file. This function starts with the initialization of the `vDSO` images for 32-bit and 64-bit, depending on the `CONFIG_X86_X32_ABI` kernel configuration option:

```C
static int __init init_vdso(void)
{
	init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
	init_vdso_image(&vdso_image_x32);
#endif
```

Both calls initialize the `vdso_image` structure. This structure is defined in two generated source code files: arch/x86/entry/vdso/vdso-image-64.c and arch/x86/entry/vdso/vdso-image-x32.c. These source code files are generated by the vdso2c program from different source code files and represent the different approaches to calling a system call, like `int 0x80`, `sysenter` and so on. The full set of images depends on the kernel configuration.

For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:

```C
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
```

But for the `x32` ABI - `vdso_image_x32`:

```C
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
```

If our kernel is configured for the `x86` architecture, or for `x86_64` with compatibility mode, we will have the ability to call a system call with the `int 0x80` interrupt; with the `syscall` instruction if compatibility mode is enabled; or with the `sysenter` instruction otherwise:

```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
	extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
	extern const struct vdso_image vdso_image_32_syscall;
#endif
	extern const struct vdso_image vdso_image_32_sysenter;
#endif
```

As we can understand from its name, the `vdso_image` structure represents the image of the `vDSO` for a certain mode of system call entry.
This structure contains information about the size in bytes of the `vDSO` area, which is always a multiple of `PAGE_SIZE` (`4096` bytes), a pointer to the text mapping, start and end addresses of the `alternatives` (sets of instructions with better alternatives for certain types of processors), and so on. For example, `vdso_image_64` looks like this:

```C
const struct vdso_image vdso_image_64 = {
	.data = raw_data,
	.size = 8192,
	.text_mapping = {
		.name = "[vdso]",
		.pages = pages,
	},
	.alt = 3145,
	.alt_len = 26,
	.sym_vvar_start = -8192,
	.sym_vvar_page = -8192,
	.sym_hpet_page = -4096,
};
```

Where `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupy `2` pages:

```C
static struct page *pages[2];
```

or 8 Kilobytes.

The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. First of all, this function calculates the number of pages and initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts the given address to a `page` structure:

```C
void __init init_vdso_image(const struct vdso_image *image)
{
	int i;
	int npages = (image->size) / PAGE_SIZE;

	for (i = 0; i < npages; i++)
		image->text_mapping.pages[i] =
			virt_to_page(image->data + i*PAGE_SIZE);
	...
}
```

The `init_vdso` function is passed to the `subsys_initcall` macro, which adds the given function to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the init/main.c source code file:

```C
subsys_initcall(init_vdso);
```

Ok, we just saw the initialization of the `vDSO` and the initialization of the `page` structures that are related to the memory pages containing the `vDSO` system calls. But where are their pages mapped to? Actually, they are mapped by the kernel when it loads a binary into memory.
The Linux kernel calls the `arch_setup_additional_pages` function from the arch/x86/entry/vdso/vma.c source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:

```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
	if (!vdso64_enabled)
		return 0;

	return map_vdso(&vdso_image_64, true);
}
```

The `map_vdso` function is defined in the same source code file and maps the pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main differences between the `vsyscall` and the `vDSO` concepts are that the `vsyscall` has the static address `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` loads dynamically and implements four system calls:

- `__vdso_clock_gettime`;
- `__vdso_getcpu`;
- `__vdso_gettimeofday`;
- `__vdso_time`.

That's all.

Conclusion

This is the end of the third part about the system calls concept in the Linux kernel. In the previous part we discussed the implementation of the preparations from the Linux kernel side before a system call is handled, and the implementation of the exit process from a system call handler. In this part we continued to dive into the stuff related to the system call concept and learned about two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.

After all of these three parts, we know almost everything related to system calls: we know what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.

The next part will be the last part in this chapter, and we will see what occurs when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue. Please note that English is not my first language and I am really sorry for any inconvenience.
If you found any mistakes, please send me a PR to linux-insides.

Links

- x86_64 memory map
- x86_64
- context switching
- ABI
- virtual address
- Segmentation
- enum
- fix-mapped addresses
- glibc
- BUILD_BUG_ON
- Processor register
- Page fault
- segmentation fault
- instruction pointer
- stack pointer
- uname
- Linkers
- Previous part

System calls in the Linux kernel. Part 4.

How does the Linux kernel run a program

This is the fourth part of the chapter that describes system calls in the Linux kernel, and as I wrote in the conclusion of the previous part, this part will be the last in this chapter. In the previous part we stopped at two new concepts:

- `vsyscall`;
- `vDSO`;

that are related and very similar to the system call concept.

As you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.

How do we launch our programs?

There are many different ways to launch an application from a user's perspective. For example, we can run a program from the shell or double-click on the application icon. It does not matter: the Linux kernel handles application launch regardless of how we launch the application.

In this part we will consider the case when we launch an application from the shell. As you know, the standard way to launch an application from a shell is the following: we launch a terminal emulator application, write the name of the program and, optionally, pass arguments to it.

Let's consider what occurs when we launch an application from the shell: what the shell does when we write a program name, what the Linux kernel does, and so on. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why we will mostly see Linux kernel internals in this part.
We will not consider in detail what the shell does, and we will not consider complex cases, for example subshells, etc.

My default shell is bash, so I will consider how the bash shell launches a program. So let's start. The `bash` shell, as well as any program written in the C programming language, starts from the `main` function. If you look at the source code of the `bash` shell, you will find the `main` function in the shell.c source code file. This function does many different things before the main thread loop of `bash` starts to work. For example, this function:

- checks for and tries to open `/dev/tty`;
- checks whether the shell is running in debug mode;
- parses command line arguments;
- reads the shell environment;
- loads `.bashrc`, `.profile` and other configuration files;
- and many many more.

After all of these operations we can see the call of the `reader_loop` function. This function is defined in the eval.c source code file and represents the main thread loop; in other words, it reads and executes commands. When the `reader_loop` function has made all its checks and read the given program name and arguments, it calls the `execute_command` function from the execute_cmd.c source code file. The `execute_command` function, through the chain of function calls:

```
execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve
```

makes different checks, like whether we need to start a `subshell`, whether it was a builtin `bash` function, etc. As I already wrote above, we will not consider all the details about things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:

```C
execve (command, args, env);
```

The `execve` system call has the following signature:

```C
int execve(const char *filename, char *const argv[], char *const envp[]);
```

and executes a program given its filename, with the given arguments and environment variables.
This system call is the first one executed in our case; for example:

```
$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0

$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0

$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
```

So, a user application (`bash` in our case) calls the system call, and as we already know, the next step is the Linux kernel.

execve system call

We saw the preparations before a system call is handled by the kernel, and what happens after a system call handler finishes its work, in the second part of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the fs/exec.c source code file and, as we already know, it takes three arguments:

```C
SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}
```

The implementation of `execve` is pretty simple here: as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:

- initializes two pointers to userspace data with the given arguments and environment variables;
- returns the result of `do_execveat_common`.

We can see its implementation:

```C
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
```

The `do_execveat_common` function does the main work - it executes a new program. This function takes a similar set of arguments, but as you can see, it takes five arguments instead of three. The first argument is the file descriptor that represents the directory containing our application; in our case `AT_FDCWD` means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags.
In our case we passed `0` to `do_execveat_common`; we will see where this matters in a later step.

First of all, the `do_execveat_common` function checks the `filename` pointer and returns if it is an error. After this we check the flags of the current process to ensure the limit on running processes is not exceeded:

```C
if (IS_ERR(filename))
	return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
	retval = -EAGAIN;
	goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;
```

If these two checks were successful, we unset the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent failure of `execve`. You can see that in the next step we call the `unshare_files` function, which is defined in kernel/fork.c and unshares the files of the current task, and check the result of this function:

```C
retval = unshare_files(&displaced);
if (retval)
	goto out_ret;
```

We need to call this function to eliminate a potential leak of the execve'd binary's file descriptor. In the next step we start the preparation of the `bprm`, which is represented by the `struct linux_binprm` structure (defined in the include/linux/binfmts.h header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries.
For example, it contains the `vma` field, which has type `vm_area_struct` and represents a single memory area over a contiguous interval in a given address space where our application will be loaded; the `mm` field, which is the memory descriptor of the binary; a pointer to the top of memory; and many other fields.

First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:

```C
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
	goto out_files;
```

After this we start to prepare the `binprm` credentials with a call of the `prepare_bprm_creds` function:

```C
retval = prepare_bprm_creds(bprm);
if (retval)
	goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;
```

Initialization of the `binprm` credentials is, in other words, the initialization of the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the real uid of the task, the real guid of the task, and the `uid` and `guid` for the virtual file system operations. In the next step, as we have prepared the `bprm` credentials, we check that we can now safely execute a program with a call of the `check_unsafe_exec` function, and set the current process to the `in_execve` state.

After all of these operations we call the `do_open_execat` function, which checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in the `flags`), searches for and opens the executable file on disk, checks that we are not loading a binary from a `noexec` mount point (we need to avoid executing binaries from filesystems that do not contain executable binaries, like proc or sysfs), initializes the `file` structure and returns a pointer to this structure.
Next we can see the call of `sched_exec` after this:

```C
file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
	goto out_unmark;

sched_exec();
```

The `sched_exec` function is used to determine the least loaded processor that can execute the new program, and to migrate the current process to it.

After this we need to check the file descriptor of the given executable binary: we check whether the name of our binary file starts with the `/` symbol, or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, or in other words whether the file descriptor is `AT_FDCWD` (read above about this).

If one of these checks is successful, we set the binary parameter filename:

```C
bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
	bprm->filename = filename->name;
}
```

Otherwise, if the filename is empty, we set the binary parameter filename to `/dev/fd/%d` or `/dev/fd/%d/%s`, depending on the filename of the given executable binary, which means that we will execute the file to which the file descriptor refers:

```C
} else {
	if (filename->name[0] == '\0')
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
	else
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
				    fd, filename->name);
	if (!pathbuf) {
		retval = -ENOMEM;
		goto out_unmark;
	}

	bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;
```

Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter depending on the binary format of the program. You can read above that we have already prepared the `cred` for the `linux_binprm`; the next step is the initialization of the other fields of the `linux_binprm`.
First of all we call the `bprm_mm_init` function and pass the `bprm` to it:

```C
retval = bprm_mm_init(bprm);
if (retval)
	goto out_unmark;
```

The `bprm_mm_init` function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor; in other words, the `bprm_mm_init` function initializes the `mm_struct` structure. This structure is defined in the include/linux/mm_types.h header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function, because we do not yet know many important things related to the Linux kernel memory manager; we just need to know that this function initializes the `mm_struct` and populates it with a temporary stack `vm_area_struct`.

After this we calculate the count of the command line arguments which were passed to our executable binary and the count of the environment variables, and set them to `bprm->argc` and `bprm->envc` respectively:

```C
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
	goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
	goto out;
```

As you can see, we do these operations with the help of the `count` function, which is defined in the same source code file and calculates the count of strings in the given array. The `MAX_ARG_STRINGS` macro is defined in the include/uapi/linux/binfmts.h header file and, as we can understand from the macro's name, represents the maximum number of strings that can be passed to the `execve` system call. The value of `MAX_ARG_STRINGS`:

```C
#define MAX_ARG_STRINGS 0x7FFFFFFF
```

After we have calculated the number of command line arguments and environment variables, we call the `prepare_binprm` function. We already called a function with a similar name before this moment: `prepare_bprm_creds`, and we remember that that function initializes the `cred` structure in the `linux_binprm`.
Now the `prepare_binprm` function:

```C
retval = prepare_binprm(bprm);
if (retval < 0)
	goto out;

retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
	goto out;

bprm->exec = bprm->p;
retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
	goto out;

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
	goto out;
```

The top of the stack will contain the program filename, and we store this filename in the `exec` field of the `linux_binprm` structure.

Now that we have a filled `linux_binprm` structure, we call the `exec_binprm` function:

```C
retval = exec_binprm(bprm);
if (retval < 0)
	goto out;
```

The `exec_binprm` function stores the `pid` of the current task and its `pid` as seen from the parent's pid namespace:

```C
old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();
```

and calls the:

```C
search_binary_handler(bprm);
```

function. This function goes through the list of handlers for different binary formats. Currently the Linux kernel supports the following binary formats:

- `binfmt_script` - support for interpreted scripts that start with the `#!` line;
- `binfmt_misc` - support for different binary formats, according to the runtime configuration of the Linux kernel;
- `binfmt_elf` - support for the elf format;
- `binfmt_aout` - support for the a.out format;
- `binfmt_flat` - support for the flat format;
- `binfmt_elf_fdpic` - support for elf FDPIC binaries;
- `binfmt_em86` - support for Intel elf binaries running on Alpha machines.

So, the `search_binary_handler` tries to call the `load_binary` function and passes the `linux_binprm` to it.
If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:

```C
int search_binary_handler(struct linux_binprm *bprm)
{
	...
	...
	list_for_each_entry(fmt, &formats, lh) {
		retval = fmt->load_binary(bprm);
		if (retval < 0 && !bprm->mm) {
			force_sigsegv(SIGSEGV, current);
			return retval;
		}
	}

	return retval;
```

Where the `load_binary`, for example for elf, checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_binprm` buffer (remember that we read the first `128` bytes from the executable binary file), and exits if it is not an `elf` binary:

```C
static int load_elf_binary(struct linux_binprm *bprm)
{
	...
	...
	loc->elf_ex = *((struct elfhdr *)bprm->buf);

	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;
```

If the given executable file is in `elf` format, `load_elf_binary` continues to execute. `load_elf_binary` does many different things to prepare the executable file for execution. For example, it checks the architecture and type of the executable file:

```C
if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
	goto out;

if (!elf_check_arch(&loc->elf_ex))
	goto out;
```

and exits if the architecture is wrong or the executable file is neither executable nor shared. It tries to load the `program header table`:

```C
elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
	goto out;
```

which describes segments. It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes linkers, it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory.
It maps the bss and the brk sections and does many other things to prepare the executable file for execution.

At the end of the execution of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:

```C
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;
```

These arguments are:

- the set of registers for the new task;
- the address of the entry point of the new task;
- the address of the top of the stack for the new task.

As we can understand from the function's name, it should start a new thread, but it does not actually do so. The `start_thread` function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:

```C
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}
```

As we can see, the `start_thread` function just makes a call to the `start_thread_common` function, which does all the work for us:

```C
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);
	regs->ip	= new_ip;
	regs->sp	= new_sp;
	regs->cs	= _cs;
	regs->ss	= _ss;
	regs->flags	= X86_EFLAGS_IF;
	force_iret();
}
```

The `start_thread_common` function fills the `fs` segment register with zero, and `es` and `ds` with the value of the data segment register. After this we set new values for the instruction pointer, the `cs` segment, and so on. At the end of `start_thread_common` is the `force_iret` macro, which forces the system call return to go via the `iret` instruction. Ok, we have prepared the new thread to run in userspace, and now we can return from `exec_binprm`; we are in `do_execveat_common` again. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before, and return. After we return from the `execve` system call handler, execution of our program will start.
We can do this because all the context related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the process; instead, the code, data and other segments of the caller process are simply overwritten by the segments of the new program. The exit from our application will be implemented through the `exit` system call.

That's all. From this point our program will be executed.

Conclusion

This is the end of the fourth part about the system calls concept in the Linux kernel. We have seen almost everything related to the `system call` concept in these four parts. We started from understanding the `system call` concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts that are similar to the `system call` - the `vsyscall` and the `vDSO` - and finally we saw how the Linux kernel runs a user program.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue. Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes, please send me a PR to linux-insides.

Links

- System call
- shell
- bash
- entry point
- C
- environment variables
- file descriptor
- real uid
- virtual file system
- procfs
- sysfs
- inode
- pid
- namespace
- #!
- elf
- a.out
- flat
- Alpha
- FDPIC
- segments
- Linkers
- Processor register
- instruction pointer
- Previous part

Implementation of the open system call

How does the `open` system call work

Introduction

This is the fifth part of the chapter that describes the system calls mechanism in the Linux kernel. Previous parts of this chapter described this mechanism in general. Now I will try to describe the implementation of different system calls in the Linux kernel.
Previous parts from this chapter, and parts from other chapters of the book, describe mostly deep parts of the Linux kernel that are barely visible or fully invisible from userspace. But the Linux kernel code is not only about itself. The vast majority of the Linux kernel code provides abilities to our code. Thanks to the Linux kernel, our programs can read from and write to files without knowing anything about sectors, tracks and other parts of disk structures; we can send data over the network without building encapsulated network packets by hand; and so on.

I don't know about you, but it is interesting to me not only how an operating system works, but also how my software interacts with it. As you may know, our programs interact with the kernel through a special mechanism which is called the system call. So, I've decided to write a series of parts which will describe the implementation and behavior of the system calls we use every day, like `read`, `write`, `open`, `close`, `dup` and so on.

I have decided to start with the description of the `open` system call. If you have written at least one `C` program, you should know that before we are able to read/write or perform other manipulations with a file, we need to open it with the `open` function:

```C
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
	int fd = open("test", O_RDONLY);

	if (fd < 0) {
		perror("Opening of the file failed");
	}
	else {
		printf("file successfully opened\n");
	}

	close(fd);
	return 0;
}
```

In the Linux kernel, the `open` system call handler ends up in the `do_sys_open` function, which is defined in the fs/open.c source code file:

```C
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_flags op;
	int fd = build_open_flags(flags, mode, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(flags);
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fsnotify_open(f);
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}
```

Let's try to understand how `do_sys_open` works, step by step.

open(2) flags
Thesystem call takes set offlagsas second argument that controlas third argument that specifies permission the permissions of a filedo_sys_openfunction starts from the call of thebuild_open_flagsfunction which does some checks that set of the given flags is valid and handles differentconditions of flags and mode.Let's look at the implementation of thebuild_open_flags. This function is defined in thesame kernel file and takes three arguments:flags - flags that control opening of a file;mode - permissions for newly created file;The last argument -opis represented with theopen_flagsstructure:436Implementation of the open system callstruct open_flags {int open_flag;umode_t mode;int acc_mode;int intent;int lookup_flags;};which is defined in the fs/internal.h header file and as we may see it holds information aboutflags and access mode for internal kernel purposes. As you already may guess the maingoal of thebuild_open_flagsImplementation of thefunction is to fill an instance of this structure.build_open_flagsfunction starts from the definition of local variablesand one of them is:int acc_mode = ACC_MODE(flags);This local variable represents access mode and its initial value will be equal to the value ofexpandedACC_MODEmacro. This macro is defined in the include/linux/fs.h and looks prettyinteresting:#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])#define O_ACCMODEThe00000003"\004\002\006\006"is an array of four chars:"\004\002\006\006" == {'\004', '\002', '\006', '\006'}So, theACC_MODEmacro just expands to the accession to this array byindex. As we just saw, theO_ACCMODEis00000003the two least significant bits which are represents. 
By applyingread,write[(x) & O_ACCMODE]x & O_ACCMODEwe will takeoraccessread/writemodes:#define O_RDONLY00000000#define O_WRONLY00000001#define O_RDWR00000002After getting value from the array by the calculated index, theaccess mode mask of a file which will holdMAY_WRITE,ACC_MODEMAY_READwill be expanded toand other information.We may see following condition after we have calculated initial access mode:437Implementation of the open system callif (flags & (O_CREAT | __O_TMPFILE))op->mode = (mode & S_IALLUGO) | S_IFREG;elseop->mode = 0;Here we reset permissions inopen_flagsinstance if a opened file wasn't temporary andwasn't open for creation. This is because:if neither O_CREAT nor O_TMPFILE is specified, then mode is ignored.In other case ifO_CREATorO_TMPFILEwere passed we canonicalize it to a regular filebecause a directory should be created with the opendir system call.At the next step we check that a file is not tried to be opened via fanotify and without theO_CLOEXECflag:flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;We do this to not leak a file descriptor. By default, the new file descriptor is set to remainopen across anexecvesystem call, but theopensystem call supportsO_CLOEXECflag thatcan be used to change this default behaviour. So we do this to prevent leaking of a filedescriptor when one thread opens a file to setO_CLOEXECflag and in the same time thesecond process does a fork) + execve) and as you may remember that child will have copiesof the parent's set of open file descriptors.At the next step we check that if our flags containsO_SYNCflag, we applyO_DSYNCflag too:if (flags & __O_SYNC)flags |= O_DSYNC;TheO_SYNCflag guarantees that the any write call will not return before all data has beentransferred to the disk. Thewait for any metadata (likeO_DSYNCin a case ofO_DSYNCatime__O_SYNC,is likemtimeO_SYNCexcept that there is no requirement toand etc.) changes will be written. 
We apply `O_DSYNC` in the `__O_SYNC` case because `O_SYNC` is implemented as `__O_SYNC|O_DSYNC` in the Linux kernel.

After this we must make sure that if a user wants to create a temporary file, the flags contain `O_TMPFILE_MASK`, or in other words contain `O_CREAT` or `O_TMPFILE` or both, and that the file is writeable:

```C
if (flags & __O_TMPFILE) {
	if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
		return -EINVAL;
	if (!(acc_mode & MAY_WRITE))
		return -EINVAL;
} else if (flags & O_PATH) {
	flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
	acc_mode = 0;
}
```

as it is written in the manual page:

> O_TMPFILE must be specified with one of O_RDWR or O_WRONLY

If we didn't pass `O_TMPFILE` for the creation of a temporary file, we check the `O_PATH` flag in the next condition. The `O_PATH` flag allows us to obtain a file descriptor that may be used for two purposes:

- to indicate a location in the filesystem tree;
- to perform operations that act purely at the file descriptor level.

So, in this case the file itself is not opened, but operations like `dup`, `fcntl` and others can be used, while all file content related operations, like `read`, `write` and others, are not permitted; only the `O_DIRECTORY | O_NOFOLLOW | O_PATH` flags can be used. We have finished with the flags at this point in `build_open_flags`, and we may fill the `open_flag` field of our `open_flags` structure with them:

```C
op->open_flag = flags;
```

Now we have a filled `open_flag` field, which represents the flags that will control the opening of a file, and `mode`, which will represent the `umask` of a new file if we open a file for creation.
There are still fields of the `open_flags` structure left to fill. The next one is `acc_mode`, which represents the access mode to the opened file. We already filled the `acc_mode` local variable with its initial value at the beginning of `build_open_flags`, and now we check the last two flags related to the access mode:

```C
if (flags & O_TRUNC)
	acc_mode |= MAY_WRITE;
if (flags & O_APPEND)
	acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
```

These flags are `O_TRUNC`, which truncates the opened file to length `0` if it existed before we opened it, and `O_APPEND`, which allows us to open the file in `append mode`, so that the opened file is appended to during writes rather than overwritten.

The next field of the `open_flags` structure is `intent`. It allows us to know about our intention, or in other words, what we really want to do with the file: open it, create it, rename it or something else. So we set it to zero if our flags contain the `O_PATH` flag, as we can't do anything related to the file content with this flag:

```C
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
```

or just to the `LOOKUP_OPEN` intention otherwise. Additionally, we set the `LOOKUP_CREATE` intention if we want to create a new file, and `LOOKUP_EXCL` if the `O_EXCL` flag was given, to make sure that the file did not exist before:

```C
if (flags & O_CREAT) {
	op->intent |= LOOKUP_CREATE;
	if (flags & O_EXCL)
		op->intent |= LOOKUP_EXCL;
}
```

The last field of the `open_flags` structure is `lookup_flags`:

```C
if (flags & O_DIRECTORY)
	lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
	lookup_flags |= LOOKUP_FOLLOW;
op->lookup_flags = lookup_flags;

return 0;
```

We fill it with `LOOKUP_DIRECTORY` if we want to open a directory, and with `LOOKUP_FOLLOW` if `O_NOFOLLOW` was not passed, i.e. if we do want to follow (resolve) symlinks. That's all. This is the end of the `build_open_flags` function.
The `open_flags` structure is now filled with the modes and flags for opening a file, and we can return back to `do_sys_open`.

### Actual opening of a file

At the next step, after the `build_open_flags` function has finished and we have formed the flags and modes for our file, we should get the `filename` structure with the help of the `getname` function, from the name of the file which was passed to the `open` system call:

```C
tmp = getname(filename);
if (IS_ERR(tmp))
	return PTR_ERR(tmp);
```

The `getname` function is defined in the fs/namei.c source code file and looks like:

```C
struct filename *
getname(const char __user * filename)
{
	return getname_flags(filename, 0, NULL);
}
```

So, it just calls the `getname_flags` function and returns its result. The main goal of the `getname_flags` function is to copy a file path given from userland to kernel space. The `filename` structure is defined in the include/linux/fs.h Linux kernel header file and contains the following fields:

* `name` - pointer to a file path in kernel space;
* `uptr` - original pointer from userland;
* `aname` - filename from audit context;
* `refcnt` - reference counter;
* `iname` - the filename, in the case when it is less than `PATH_MAX` long.

As I already wrote above, the main goal of the `getname_flags` function is to copy the name of a file which was passed to the `open` system call from user space to kernel space with the strncpy_from_user function. The next step after the filename has been copied to kernel space is getting a new non-busy file descriptor:

```C
fd = get_unused_fd_flags(flags);
```

The `get_unused_fd_flags` function takes the table of open files of the current process, the minimum (`0`) and maximum (`RLIMIT_NOFILE`) possible number of a file descriptor in the system and the flags that we have passed to the `open` system call, allocates a file descriptor and marks it busy in the file descriptor table of the current process.
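To get a feeling for what `get_unused_fd_flags` does, here is a toy model of lowest-unused descriptor allocation; a single 64-bit word stands in for the per-process open-files bitmap (the real kernel uses a resizable `fdtable`, and this sketch ignores `O_CLOEXEC` and the `RLIMIT_NOFILE` ceiling):

```c
#include <assert.h>
#include <stdint.h>

/* Find the lowest clear bit in the bitmap, mark it busy and return its
 * index as the new "file descriptor"; -1 means the table is full (the
 * real kernel would return -EMFILE here). */
static int alloc_lowest_fd(uint64_t *bitmap)
{
	for (int fd = 0; fd < 64; fd++) {
		if (!(*bitmap & (1ULL << fd))) {
			*bitmap |= 1ULL << fd;	/* mark it busy */
			return fd;
		}
	}
	return -1;
}
```

Starting with descriptors `0`, `1` and `2` already taken (as stdin/stdout/stderr are in a real process), the first allocation returns `3`, the next `4`, and so on.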
The `get_unused_fd_flags` function sets or clears the `O_CLOEXEC` flag depending on its state in the passed flags.

The last and main step in `do_sys_open` is the `do_filp_open` function:

```C
struct file *f = do_filp_open(dfd, tmp, &op);

if (IS_ERR(f)) {
	put_unused_fd(fd);
	fd = PTR_ERR(f);
} else {
	fsnotify_open(f);
	fd_install(fd, f);
}
```

The main goal of this function is to resolve the given path name into a `file` structure which represents an opened file of a process. If something goes wrong and execution of the `do_filp_open` function fails, we should free the new file descriptor with `put_unused_fd`; otherwise the `file` structure returned by `do_filp_open` will be stored in the file descriptor table of the current process.

Now let's take a short look at the implementation of the `do_filp_open` function. This function is defined in the fs/namei.c Linux kernel source code file and starts with the initialization of the `nameidata` structure. This structure will provide a link to a file inode. Actually this is one of the main points of the `do_filp_open` function - to acquire an `inode` by the filename given to the `open` system call. After the `nameidata` structure is initialized, the `path_openat` function will be called:

```C
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
if (unlikely(filp == ERR_PTR(-ECHILD)))
	filp = path_openat(&nd, op, flags);
if (unlikely(filp == ERR_PTR(-ESTALE)))
	filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
```

Note that it is called three times. Actually, the Linux kernel first tries to open the file in RCU mode. This is the most efficient way to open a file. If this try fails, the kernel enters the normal mode.
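The three-stage fallback above is a reusable pattern: try the cheap RCU walk first and retry in progressively more expensive modes only on specific errors. A minimal sketch, with `attempt` standing in for `path_openat` and hypothetical `X_`-prefixed constants standing in for the kernel's error codes and lookup flags:

```c
#include <assert.h>

enum { X_ECHILD = 10, X_ESTALE = 116 };
#define X_LOOKUP_RCU   0x1
#define X_LOOKUP_REVAL 0x2

/* attempt() returns a file-descriptor-like value >= 0 on success or a
 * negative errno; ctx is opaque caller state. */
typedef int (*attempt_fn)(int flags, void *ctx);

/* The do_filp_open() fallback sequence: RCU-walk first, plain ref-walk
 * on -ECHILD, forced revalidation on -ESTALE. */
static int open_with_fallback(attempt_fn attempt, int flags, void *ctx)
{
	int f = attempt(flags | X_LOOKUP_RCU, ctx);
	if (f == -X_ECHILD)
		f = attempt(flags, ctx);
	if (f == -X_ESTALE)
		f = attempt(flags | X_LOOKUP_REVAL, ctx);
	return f;
}

/* A scripted attempt that fails in RCU mode and succeeds otherwise. */
static int rcu_fails(int flags, void *ctx)
{
	(void)ctx;
	if (flags & X_LOOKUP_RCU)
		return -X_ECHILD;
	return 3;	/* pretend file descriptor */
}

/* A scripted attempt that succeeds on the first (RCU) try. */
static int always_ok(int flags, void *ctx)
{
	(void)flags;
	(void)ctx;
	return 7;
}
```

The common case pays only for the fast path; the slower retries run only when the fast path reports the specific error that makes them necessary.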
The third call is relatively rare; only the NFS file system is likely to use it. The `path_openat` function executes the `path lookup`, or in other words it tries to find a `dentry` (what the Linux kernel uses to keep track of the hierarchy of files in directories) corresponding to a path.

The `path_openat` function starts with a call of the `get_empty_filp()` function, which allocates a new `file` structure with some additional checks, like whether we have exceeded the amount of opened files in the system and so on. After we have got the allocated new `file` structure, we call the `do_tmpfile` or `do_o_path` function in the case that we passed the `O_TMPFILE | O_CREAT` or `O_PATH` flags during the call of the `open` system call. Both these cases are quite specific, so let's consider the quite usual case when we want to open an already existing file and want to read/write from/to it.

In this case the `path_init` function will be called. This function performs some preparatory work before the actual path lookup. This includes the search for the start position of the path traversal and its metadata, like the `inode` of the path, the `dentry inode` and so on. This can be the `root` directory - `/` - or the current directory, as in our case, because we use `AT_CWD` as the starting point (see the call of `do_sys_open` at the beginning of the post).

The next step after `path_init` is the loop which executes `link_path_walk` and `do_last`. The first function executes name resolution, or in other words it starts the process of walking along a given path. It handles everything step by step except the last component of the file path. This handling includes checking of permissions and getting a file component. As a file component is obtained, it is passed to `walk_component`, which updates the current directory entry from the `dcache` or asks the underlying filesystem for it. This repeats until all of the path's components have been handled in this way. After `link_path_walk` has been executed, the `do_last` function will populate a `file` structure based on the result of `link_path_walk`.
As we have reached the last component of the given file path, the `vfs_open` function from `do_last` will be called.

This function is defined in the fs/open.c Linux kernel source code file and its main goal is to call an `open` operation of the underlying filesystem.

That's all for now. We didn't consider the full implementation of the `open` system call. We skipped some parts, like handling of the case when we want to open a file from another filesystem with a different mount point, resolving symlinks and so on, but it should not be so hard to follow this stuff. This stuff is not included in the generic implementation of the `open` system call and depends on the underlying filesystem. If you are interested, you may look up the `file_operations.open` callback function for a certain filesystem.

### Conclusion

This is the end of the fifth part of the implementation of different system calls in the Linux kernel. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email, or just create an issue. In the next part, we will continue to dive into system calls in the Linux kernel and see the implementation of the read system call.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

### Links

* system call
* open
* file descriptor
* proc
* GNU C Library Reference Manual
* IA-64
* x86_64
* opendir
* fanotify
* fork
* execve
* symlink
* audit
* inode
* RCU
* read
* previous part

## Limits on resources in Linux

Each process in the system uses a certain amount of different resources, like files, CPU time, memory and so on.

Such resources are not infinite and we should have an instrument to manage them per process. Sometimes it is useful to know the current limits for a certain resource or to change their values.
In this post we will consider the instruments that allow us to get information about the limits of a process and to increase or decrease such limits.

We will start from the userspace view and then we will look at how this is implemented in the Linux kernel.

There are three main fundamental system calls to manage resource limits for a process:

* `getrlimit`
* `setrlimit`
* `prlimit`

The first two allow a process to read and set limits on a system resource. The last one is an extension of the previous functions: `prlimit` allows setting and reading the resource limits of a process specified by PID. The definitions of these functions look like this.

The `getrlimit` is:

```C
int getrlimit(int resource, struct rlimit *rlim);
```

The `setrlimit` is:

```C
int setrlimit(int resource, const struct rlimit *rlim);
```

And the definition of the `prlimit` is:

```C
int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
            struct rlimit *old_limit);
```

In the first two cases, the functions take two parameters:

* `resource` - represents a resource type (we will see the available types later);
* `rlim` - a combination of a `soft` and a `hard` limit.

There are two types of limits:

* `soft`
* `hard`

The first provides the actual limit for a resource of a process. The second is a ceiling value of a `soft` limit and can be raised only by the superuser. So, a `soft` limit can never exceed the related `hard` limit.

Both of these values are combined in the `rlimit` structure:

```C
struct rlimit {
	rlim_t rlim_cur;
	rlim_t rlim_max;
};
```

The last function looks a little bit more complex and takes `4` arguments. Besides the `resource` argument, it takes:

* `pid` - specifies the ID of the process on which the `prlimit` should be executed;
* `new_limit` - provides the new limit values if it is not `NULL`;
* `old_limit` - the current `soft` and `hard` limits will be placed here if it is not `NULL`.

It is exactly the `prlimit` function that is used by the ulimit util.
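Before diving into the kernel side, here is a short user-space example of the first two system calls: it reads the current `RLIMIT_NOFILE` pair with `getrlimit` and lowers the soft limit by one with `setrlimit` (lowering a soft limit below the hard limit requires no special privileges):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Read the current file-descriptor limit, shrink the soft limit by one
 * and write it back; returns 0 on success, -1 on failure. The resulting
 * pair is copied into *out when out is non-NULL. */
static int shrink_nofile_soft_limit(struct rlimit *out)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (rl.rlim_cur > 1)
		rl.rlim_cur -= 1;	/* soft limit stays below the hard one */
	if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (out)
		*out = rl;
	return 0;
}
```

Raising `rlim_cur` back up again would also succeed for an unprivileged process, as long as it stays at or below `rlim_max`.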
We can verify that `ulimit` uses `prlimit` with the help of the strace util.

For example:

```
~$ strace ulimit -s 2>&1 | grep rl

prlimit64(0, RLIMIT_NPROC, NULL, {rlim_cur=63727, rlim_max=63727}) = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
```

Here we see `prlimit64`, not `prlimit`. The fact is that we see the underlying system call here instead of the library call.

Now let's look at the list of available resources:

| Resource | Description |
|----------|-------------|
| `RLIMIT_CPU` | CPU time limit, given in seconds |
| `RLIMIT_FSIZE` | the maximum size of files that a process may create |
| `RLIMIT_DATA` | the maximum size of the process's data segment |
| `RLIMIT_STACK` | the maximum size of the process stack in bytes |
| `RLIMIT_CORE` | the maximum size of a core file |
| `RLIMIT_RSS` | the number of bytes that can be allocated for a process in RAM |
| `RLIMIT_NPROC` | the maximum number of processes that can be created by a user |
| `RLIMIT_NOFILE` | the maximum number of a file descriptor that can be opened by a process |
| `RLIMIT_MEMLOCK` | the maximum number of bytes of memory that may be locked into RAM by mlock |
| `RLIMIT_AS` | the maximum size of virtual memory in bytes |
| `RLIMIT_LOCKS` | the maximum number of flock and locking related fcntl calls |
| `RLIMIT_SIGPENDING` | the maximum number of signals that may be queued for a user of the calling process |
| `RLIMIT_MSGQUEUE` | the number of bytes that can be allocated for POSIX message queues |
| `RLIMIT_NICE` | the maximum nice value that can be set by a process |
| `RLIMIT_RTPRIO` | the maximum real-time priority value |
| `RLIMIT_RTTIME` | the maximum number of microseconds that a process may be scheduled under a real-time scheduling policy without making a blocking system call |

If you look into the source code of open source projects, you will note that reading or updating a resource limit is a quite widely used operation.

For example: systemd

```C
/* Don't limit the coredump size */
(void) setrlimit(RLIMIT_CORE, &RLIMIT_MAKE_CONST(RLIM_INFINITY));
```

Or haproxy:
```C
getrlimit(RLIMIT_NOFILE, &limit);
```

On the kernel side, all three of these system calls are served by the `do_prlimit` function, which is defined in the kernel/sys.c source code file. It starts with a check that the given resource is valid:

```C
if (resource >= RLIM_NLIMITS)
	return -EINVAL;
```

and in a failure case returns the `-EINVAL` error. After this check passes successfully, and if the new limits were passed as a non-`NULL` value, the two following checks:

```C
if (new_rlim) {
	if (new_rlim->rlim_cur > new_rlim->rlim_max)
		return -EINVAL;
	if (resource == RLIMIT_NOFILE &&
	    new_rlim->rlim_max > sysctl_nr_open)
		return -EPERM;
}
```

verify that the given `soft` limit does not exceed the `hard` limit and, in the case when the given resource is the maximum number of file descriptors, that the hard limit is not greater than the `sysctl_nr_open` value. The value of `sysctl_nr_open` can be found via procfs:

```
~$ cat /proc/sys/fs/nr_open
1048576
```

After all of these checks we lock the `tasklist` to be sure that signal handler related things will not be destroyed while we update the limits for the given resource:

```C
read_lock(&tasklist_lock);
...
...
...
read_unlock(&tasklist_lock);
```

We need to do this because the `prlimit` system call allows us to update the limits of another task by the given pid. While the task list is locked, we take the `rlimit` instance that is responsible for the given resource limit of the given process:

```C
rlim = tsk->signal->rlim + resource;
```

where `tsk->signal->rlim` is just an array of `struct rlimit` that represents certain resources. If `new_rlim` is not `NULL`, we just update its value. If `old_rlim` is not `NULL`, we fill it:

```C
if (old_rlim)
	*old_rlim = *rlim;
```

That's all.

### Conclusion

This is the end of the second part that describes the implementation of the system calls in the Linux kernel. If you have questions or suggestions, ping me on Twitter 0xAX, drop me an email, or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience.
If you find any mistakes please send me a PR to linux-insides.

### Links

* system calls
* PID
* ulimit
* strace
* POSIX message queues

## Timers and time management

This chapter describes timers and time management related concepts in the Linux kernel.

* Introduction - An introduction to the timers in the Linux kernel.
* Introduction to the clocksource framework - Describes the `clocksource` framework in the Linux kernel.
* The tick broadcast framework and dyntick - Describes the tick broadcast framework and the dyntick concept.
* Introduction to timers - Describes timers in the Linux kernel.
* Introduction to the clockevents framework - Describes yet another clock/time management related framework: `clockevents`.
* x86 related clock sources - Describes `x86_64` related clock sources.
* Time related system calls in the Linux kernel - Describes time related system calls.

## Timers and time management in the Linux kernel. Part 1.

### Introduction

This is yet another post that opens a new chapter in the linux-insides book. The previous part described system call concepts, and now it's time to start a new chapter. As one might understand from the title, this chapter will be devoted to `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers (and generally, time management) are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts in the TCP implementation, the kernel knowing the current time, scheduling asynchronous functions, next event interrupt scheduling and many many more.

So, we will start to learn the implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel.
We already did this in the special chapter which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.

Let's start.

### Initialization of non-standard PC hardware clock

After the Linux kernel was decompressed (you can read more about this in the Kernel decompression part) the architecture non-specific code starts to work in the init/main.c source code file. After the initialization of the lock validator, initialization of cgroups and setting the canary value we can see the call of the `setup_arch` function.

As you may remember, this function (defined in arch/x86/kernel/setup.c) prepares/initializes architecture-specific stuff (for example it reserves a place for the bss section, reserves a place for initrd, parses the kernel command line, and many, many other things). Besides this, we can find some time management related functions there.

The first is:

```C
x86_init.timers.wallclock_init();
```

We already saw the `x86_init` structure in the chapter that describes the initialization of the Linux kernel. This structure contains pointers to the default setup functions for the different platforms like Intel MID, Intel CE4100, etc. The `x86_init` structure is defined in arch/x86/kernel/x86_init.c, and as you can see it determines standard PC hardware by default.

As we can see, the `x86_init` structure has the `x86_init_ops` type that provides a set of functions for platform specific setup, like reserving standard resources, platform specific memory setup, initialization of interrupt handlers, etc. This structure looks like:

```C
struct x86_init_ops {
	struct x86_init_resources	resources;
	struct x86_init_mpparse		mpparse;
	struct x86_init_irqs		irqs;
	struct x86_init_oem		oem;
	struct x86_init_paging		paging;
	struct x86_init_timers		timers;
	struct x86_init_iommu		iommu;
	struct x86_init_pci		pci;
};
```

Note the `timers` field that has the `x86_init_timers` type.
We can understand by its name that this field is related to time management and timers. `x86_init_timers` contains four fields which are all pointers to functions:

* `setup_percpu_clockev` - set up the per cpu clock event device for the boot cpu;
* `tsc_pre_init` - platform function called before TSC init;
* `timer_init` - initialize the platform timer;
* `wallclock_init` - initialize the wallclock device.

So, as we already know, in our case the `wallclock_init` executes the initialization of the wallclock device. If we look at the `x86_init` structure, we see that `wallclock_init` points to `x86_init_noop`:

```C
struct x86_init_ops x86_init __initdata = {
	...
	...
	...
	.timers = {
		.wallclock_init		= x86_init_noop,
	},
	...
	...
	...
}
```

Where `x86_init_noop` is just a function that does nothing:

```C
void __cpuinit x86_init_noop(void) { }
```

for the standard PC hardware. Actually, the `wallclock_init` function is used on the Intel MID platform. The initialization of `x86_init.timers.wallclock_init` is located in the arch/x86/platform/intel-mid/intel-mid.c source code file, in the `x86_intel_mid_early_setup` function:

```C
void __init x86_intel_mid_early_setup(void)
{
	...
	...
	...
	x86_init.timers.wallclock_init = intel_mid_rtc_init;
	...
	...
	...
}
```

The implementation of the `intel_mid_rtc_init` function is in the arch/x86/platform/intel-mid/intel_mid_vrtc.c source code file and looks pretty simple.
First of all, this function parses the Simple Firmware Interface M-Real-Time-Clock table to get such a device into the `sfi_mrtc_array` array, and then initializes the `set_time` and `get_time` functions:

```C
void __init intel_mid_rtc_init(void)
{
	unsigned long vrtc_paddr;

	sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);

	vrtc_paddr = sfi_mrtc_array[0].phys_addr;
	if (!sfi_mrtc_num || !vrtc_paddr)
		return;

	vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
								vrtc_paddr);

	x86_platform.get_wallclock = vrtc_get_time;
	x86_platform.set_wallclock = vrtc_set_mmss;
}
```

That's all; after this a device based on `Intel MID` will be able to get the time from the hardware clock. As I already wrote, the standard PC `x86_64` platform uses `x86_init_noop` here and just does nothing during the call of this function. We just saw the initialization of the real time clock for the Intel MID architecture; now it's time to return to the general `x86_64` architecture and look at the time management related stuff there.

### Acquainted with jiffies

If we return to the `setup_arch` function (which is located, as you remember, in the arch/x86/kernel/setup.c source code file), we see the next call of a time management related function:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

Before we look at the implementation of this function, we must know about the jiffy. As we can read on wikipedia:

> Jiffy is an informal term for any unspecified short period of time

This definition is very similar to the `jiffy` in the Linux kernel. There is a global variable with the name `jiffies` which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:

```C
extern unsigned long volatile __jiffy_data jiffies;
```

during the initialization process. This global variable is increased on each timer interrupt.
Besides this, near the `jiffies` variable we can see the definition of a similar variable:

```C
extern u64 jiffies_64;
```

Actually, only one of these variables is in use in the Linux kernel, and which one depends on the processor type. For x86_64 it will be the `u64` variant in use, and for x86 the `unsigned long` one. We see this by looking at the arch/x86/kernel/vmlinux.lds.S linker script:

```
#ifdef CONFIG_X86_32
...
jiffies = jiffies_64;
...
#else
...
jiffies_64 = jiffies;
...
#endif
```

In the case of `x86_32` the `jiffies` will be the lower `32` bits of the `jiffies_64` variable. Schematically, we can imagine it as follows:

```
                         jiffies_64
+-----------------------------------------------------+
|                          |                          |
|                          |    jiffies on x86_32     |
|                          |                          |
+-----------------------------------------------------+
63                         31                         0
```

Now we know a little theory about `jiffies` and can return to our function. There is no architecture-specific implementation of our function - `register_refined_jiffies`. This function is located in the generic kernel code - the kernel/time/jiffies.c source code file. The main point of `register_refined_jiffies` is the registration of the jiffy `clocksource`. Before we look at the implementation of the `register_refined_jiffies` function, we must know what a `clocksource` is. As we can read in the comments:

> The clocksource is hardware abstraction for a free-running counter.

I'm not sure about you, but that description didn't give me a good understanding of the concept. Let's try to understand what it is, but we will not go deep, because this topic will be described in a separate part in much more detail. The main point of the `clocksource` is a timekeeping abstraction, or in very simple words - it provides a time value to the kernel. We already know about the `jiffies` interface that represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and increases on each timer interrupt. The Linux kernel can use `jiffies` for time measurement. So why do we need a separate concept like the `clocksource`?
Actually, different hardware devices provide different clock sources that vary in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.

For example, `x86` has an on-chip 64-bit counter that is called the Time Stamp Counter and its frequency can be equal to the processor frequency. Or, for example, the High Precision Event Timer, which consists of a `64-bit` counter with a frequency of at least `10 MHz`. Two different timers, and they are both for `x86`. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the `clocksource` concept to solve the problem.

The clocksource concept is represented by the `clocksource` structure in the Linux kernel. This structure is defined in the include/linux/clocksource.h header file and contains a couple of fields that describe a time counter. For example, it contains the `name` field which is the name of a counter, the `flags` field that describes different properties of a counter, pointers to the `suspend` and `resume` functions, and many more.

Let's look at the `clocksource` structure for jiffies that is defined in the kernel/time/jiffies.c source code file:

```C
static struct clocksource clocksource_jiffies = {
	.name		= "jiffies",
	.rating		= 1,
	.read		= jiffies_read,
	.mask		= 0xffffffff,
	.mult		= NSEC_PER_JIFFY << JIFFIES_SHIFT,
	.shift		= JIFFIES_SHIFT,
	.max_cycles	= 10,
};
```

The `mask` field here is just a `32`-bit mask:

```python
>>> 0xffffffff
4294967295
```

Note that a clocksource has to express its period in nanoseconds, and the period of a counter is a tiny fractional number of seconds, which the kernel cannot store directly, as it avoids floating point arithmetic:

```python
# 42 nanoseconds
>>> 42 * pow(10, -9)
4.2000000000000006e-08
# 43 nanoseconds
>>> 43 * pow(10, -9)
4.3e-08
```

That is what the next two fields, `mult` and `shift`, are for: they are used to convert the clocksource's period to nanoseconds per cycle with integer arithmetic. When the kernel calls the `clocksource.read` function, this function returns a value in `machine` time units represented with the `cycle_t` data type. To convert this return value to nanoseconds we need these two fields: `mult` and `shift`.
The `clocksource` provides the `clocksource_cyc2ns` function that will do it for us with the following expression:

```C
((u64) cycles * mult) >> shift;
```

As we can see, the `mult` field is equal to:

```C
NSEC_PER_JIFFY << JIFFIES_SHIFT

#define NSEC_PER_JIFFY	((NSEC_PER_SEC+HZ/2)/HZ)
#define NSEC_PER_SEC	1000000000L
```

by default, and the `shift` is:

```C
#if HZ < 34
  #define JIFFIES_SHIFT	6
#elif HZ < 67
  #define JIFFIES_SHIFT	7
#else
  #define JIFFIES_SHIFT	8
#endif
```

The `jiffies` clock source uses the `NSEC_PER_JIFFY` multiplier conversion to specify the nanosecond over cycle ratio. Note that the values of `JIFFIES_SHIFT` and `NSEC_PER_JIFFY` depend on the `HZ` value. `HZ` represents the frequency of the system timer. This macro is defined in include/asm-generic/param.h and depends on the `CONFIG_HZ` kernel configuration option. The value of `HZ` differs for each supported architecture, but for `x86` it's defined like:

```C
#define HZ		CONFIG_HZ
```

Where `CONFIG_HZ` is chosen in the kernel configuration; on `x86` it can be `100`, `250`, `300` or `1000`.
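We can check this conversion in user space. The sketch below assumes `HZ=250` (so `NSEC_PER_JIFFY` is `4000000` and `JIFFIES_SHIFT` is `8`) and applies the `clocksource_cyc2ns` expression; the `X_`-prefixed names are local stand-ins for the kernel macros:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel macros, assuming HZ=250. */
#define X_HZ             250
#define X_NSEC_PER_SEC   1000000000L
#define X_NSEC_PER_JIFFY ((X_NSEC_PER_SEC + X_HZ/2) / X_HZ)
#define X_JIFFIES_SHIFT  8

/* The clocksource_cyc2ns() expression: integer cycles-to-nanoseconds
 * conversion using a precomputed mult/shift pair. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * (uint64_t)mult) >> shift;
}
```

With these values, one cycle (one jiffy) converts to `4000000` nanoseconds, i.e. 4 ms, and `250` cycles convert to exactly one second.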
Now, if you look in the kernel/time/jiffies.csource code file, you will find yet another clock source definition:struct clocksource refined_jiffies;There is one difference betweenjiffiesrefined_jiffiesandclocksource_jiffies: The standardbased clock source is the lowest common denominator clock source which shouldfunction on all systems. As we already know, thejiffiesduring each timer interrupt. This means the that standardglobal variable will be increasedjiffiesbased clock source hasthe same resolution as the timer interrupt frequency. From this we can understand thatstandardusesjiffiesbased clock source may suffer from inaccuracies. TheCLOCK_TICK_RATEas the base ofjiffiesrefined_jiffiesshift.Let's look at the implementation of this function. First of all, we can see that therefined_jiffiesclock source based on theclocksource_jiffiesstructure:int register_refined_jiffies(long cycles_per_second){u64 nsec_per_tick, shift_hz;long cycles_per_tick;refined_jiffies = clocksource_jiffies;refined_jiffies.name = "refined-jiffies";refined_jiffies.rating++;.........460IntroductionHere we can see that we update the name of therefined_jiffiesincrease the rating of this structure. As you remember, the1, so ourrefined_jiffiesrefined_jiffiesclocksource will have rating -torefined-jiffiesclocksource_jiffies2andhas rating -. This means that thewill be the best selection for clock source management code.In the next step we need to calculate number of cycles per one tick:cycles_per_tick = (cycles_per_second + HZ/2)/HZ;Note that we have usedNSEC_PER_SECmultiplier. Here we are using themacro as the base of the standardcycles_per_secondjiffieswhich is the first parameter of theregister_refined_jiffiesfunction. We've passed theregister_refined_jiffiesfunction. 
This macro is defined in the arch/x86/include/asm/timex.h header file and expands to:

```C
#define CLOCK_TICK_RATE		PIT_TICK_RATE
```

where the `PIT_TICK_RATE` macro expands to the frequency of the Intel 8253:

```C
#define PIT_TICK_RATE 1193182ul
```

After this we calculate `shift_hz` for `register_refined_jiffies`, which will store `hz << 8`, or in other words a shifted form of the frequency of the system timer. We shift `cycles_per_second`, the frequency of the programmable interval timer, left by `8` in order to get extra accuracy:

```C
shift_hz = (u64)cycles_per_second << 8;
shift_hz += cycles_per_tick/2;
do_div(shift_hz, cycles_per_tick);
```

In the next step we calculate the number of nanoseconds per one tick by shifting `NSEC_PER_SEC` left by `8` too, as we did with the `shift_hz`, and doing the same calculation as before:

```C
nsec_per_tick = (u64)NSEC_PER_SEC << 8;
nsec_per_tick += (u32)shift_hz/2;
do_div(nsec_per_tick, (u32)shift_hz);
```

and the result goes into the `mult` field of `refined_jiffies`:

```C
refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
```

## Clocksource framework

The `mult` field of a `clocksource` is not `100%` accurate. Instead the number is taken as close as possible to a nanosecond multiplier, and `maxadj` helps to correct this and allows the clocksource API to avoid `mult` values that might overflow when adjusted. The next four fields are pointers to functions:

* `enable` - optional function to enable the clocksource;
* `disable` - optional function to disable the clocksource;
* `suspend` - suspend function for the clocksource;
* `resume` - resume function for the clocksource.

The next field is `max_cycles` and, as we can understand from its name, this field represents the maximum cycle value before a potential overflow. And the last field is `owner`, which represents a reference to the kernel module that is the owner of the clocksource. This is all. We just went through all the standard fields of the `clocksource` structure. But you may have noticed that we missed some fields of the `clocksource` structure. We can divide all of the missed fields into two types: fields of the first type are already known to us.
For example, they are the `name` field that represents the name of a `clocksource`, the `rating` field that helps the Linux kernel to select the best clocksource, etc. The second type are fields which depend on different Linux kernel configuration options. Let's look at these fields.

The first field is `archdata`. This field has the `arch_clocksource_data` type and depends on the `CONFIG_ARCH_CLOCKSOURCE_DATA` kernel configuration option. This field is actual only for the x86 and IA64 architectures at the moment. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the `vDSO` clock mode:

```C
struct arch_clocksource_data {
	int vclock_mode;
};
```

for the `x86` architectures. Where the `vDSO` clock mode can be one of:

```C
#define VCLOCK_NONE	0
#define VCLOCK_TSC	1
#define VCLOCK_HPET	2
#define VCLOCK_PVCLOCK	3
```

The last three fields are `wd_list`, `cs_last` and `wd_last`, which depend on the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. First of all, let's try to understand what a `watchdog` is. In simple words, a watchdog is a timer that is used for detection of computer malfunctions and recovering from them. All three of these fields contain watchdog related data that is used by the `clocksource` framework. If we grep the Linux kernel source code, we will see that only the arch/x86/Kconfig kernel configuration file contains the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. So, why do `x86` and `x86_64` need a watchdog? You may already know that all `x86` processors have a special 64-bit register - the time stamp counter. This register contains the number of cycles since the reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the `watchdog` timer in this part; before that we must learn more about timers.

That's all. From this moment we know all the fields of the `clocksource` structure.
This knowledge will help us to learn the insides of the `clocksource` framework.

### New clock source registration

We saw only one function from the `clocksource` framework in the previous part. This function was `__clocksource_register`. This function is defined in the include/linux/clocksource.h header file and, as we can understand from the function's name, its main point is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just makes a call of the `__clocksource_register_scale` function and returns its result:

```C
static inline int __clocksource_register(struct clocksource *cs)
{
	return __clocksource_register_scale(cs, 1, 0);
}
```

Before we see the implementation of the `__clocksource_register_scale` function, we can note that `clocksource` provides additional APIs for new clock source registration:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
	return __clocksource_register_scale(cs, 1, hz);
}

static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
	return __clocksource_register_scale(cs, 1000, khz);
}
```

All of these functions do the same thing. They return the value of the `__clocksource_register_scale` function, but with different sets of parameters. The `__clocksource_register_scale` function is defined in the kernel/time/clocksource.c source code file. To understand the difference between these functions, let's look at the parameters of the `clocksource_register_khz` function. As we can see, this function takes three parameters:

* `cs` - the clocksource to be installed;
* `scale` - the scale factor of a clock source.
In other words, if we multiply the value of the `scale` parameter by the frequency, we get the frequency of the clocksource in hertz;
* `freq` - clock source frequency divided by scale.

Now let's look at the implementation of the `__clocksource_register_scale` function:

```C
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
    __clocksource_update_freq_scale(cs, scale, freq);
    mutex_lock(&clocksource_mutex);
    clocksource_enqueue(cs);
    clocksource_enqueue_watchdog(cs);
    clocksource_select();
    mutex_unlock(&clocksource_mutex);
    return 0;
}
```

First of all we can see that the `__clocksource_register_scale` function starts with a call of the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as zero, calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the frequency? Actually, it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed the frequency as `0`. We do this only for some clock sources that have self-defined `mult` and `shift` parameters. Look in the previous part and you will see that we already saw the calculation of `mult` and `shift` for `jiffies` there. The `__clocksource_update_freq_scale` function does it for us for other clock sources.

So, at the start of the `__clocksource_update_freq_scale` function we check the value of the `freq` parameter and, if it is not zero, we calculate `mult` and `shift` for the given clock source.
Let's look at the `mult` and `shift` calculation:

```C
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
    u64 sec;

    if (freq) {
        sec = cs->mask;
        do_div(sec, freq);
        do_div(sec, scale);

        if (!sec)
            sec = 1;
        else if (sec > 600 && cs->mask > UINT_MAX)
            sec = 600;

        clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
                               NSEC_PER_SEC / scale, sec * scale);
    }
    ...
    ...
    ...
}
```

Here we can see the calculation of the maximum number of seconds which we can run before the clock source counter overflows. First of all we fill the `sec` variable with the value of the clock source mask. Remember that a clock source's mask represents the maximum number of bits that are valid for the given clock source. After this, we can see two division operations. First we divide our `sec` variable by the clock source frequency and then by the scale factor. The `freq` parameter shows us how many timer interrupts occur in one second. So, we divide the `mask` value, which represents the maximum value of the counter (for example of the `jiffy` counter), by the frequency of the timer and get the maximum number of seconds for the given clock source. The second division operation gives us the maximum number of seconds for the given clock source depending on its scale factor, which can be `1` hertz or `1` kilohertz (10^3 Hz).

After we have got the maximum number of seconds, we check this value and set it to `1` or `600` depending on the result in the next step. These values are the maximum sleeping time for a clocksource in seconds. In the next step we can see a call of `clocks_calc_mult_shift`. The main point of this function is the calculation of the `mult` and `shift` values for the given clock source.
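The overflow-horizon arithmetic above is easy to try in userspace. The following sketch (a hypothetical `max_seconds` helper, not kernel code; `UINT32_MAX` stands in for the kernel's `UINT_MAX` comparison) repeats the two divisions and the clamping:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the horizon calculation from __clocksource_update_freq_scale():
 * how many seconds can elapse before a counter with the given mask wraps. */
static uint64_t max_seconds(uint64_t mask, uint32_t freq, uint32_t scale)
{
    uint64_t sec = mask;

    sec /= freq;   /* counter ticks -> seconds */
    sec /= scale;  /* account for the scale factor */

    if (!sec)
        sec = 1;   /* clamp: at least one second ... */
    else if (sec > 600 && mask > UINT32_MAX)
        sec = 600; /* ... and at most ten minutes for wide counters */

    return sec;
}
```

A 32-bit counter at 1 MHz wraps after 4294 seconds and is not clamped (its mask is not wider than 32 bits), while a 64-bit counter at 3 GHz is clamped to 600 seconds.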
At the end of the `__clocksource_update_freq_scale` function we check that the just calculated `mult` value of the given clock source will not cause an overflow after adjustment, update the `max_idle_ns` and `max_cycles` values of the given clock source with the maximum number of nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:

```C
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
        cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
```

which we can see in the dmesg output:

```
$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
```

After the `__clocksource_update_freq_scale` function finishes its work, we can return to the `__clocksource_register_scale` function, which registers the new clock source. We can see the calls of the following three functions:

```C
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
```

Note that before the first of them is called, we lock the `clocksource_mutex` mutex. The point of the `clocksource_mutex` mutex is to protect the `curr_clocksource` variable, which represents the currently selected `clocksource`, and the `clocksource_list` variable, which represents the list that contains the registered clocksources. Now, let's look at these three functions.
The first one, `clocksource_enqueue`, and the other two are defined in the same source code file. We go through all already registered clocksources, or in other words through all elements of the `clocksource_list`, and try to find the best place for the given `clocksource`:

```C
static void clocksource_enqueue(struct clocksource *cs)
{
    struct list_head *entry = &clocksource_list;
    struct clocksource *tmp;

    list_for_each_entry(tmp, &clocksource_list, list)
        if (tmp->rating >= cs->rating)
            entry = &tmp->list;
    list_add(&cs->list, entry);
}
```

In the end we just insert the new clocksource into the `clocksource_list`. The second function, `clocksource_enqueue_watchdog`, does almost the same as the previous function, but it inserts the new clock source into the `wd_list` depending on the flags of the clock source, and starts a new `watchdog` timer. As I already wrote, we will not consider the `watchdog`-related stuff in this part, but will do it in the next parts.

The last function is `clocksource_select`. As we can understand from the function's name, the main point of this function is to select the best `clocksource` from the registered clocksources. This function consists only of a call to a helper function:

```C
static void clocksource_select(void)
{
    return __clocksource_select(false);
}
```

Note that the `__clocksource_select` function takes one parameter (`false` in our case). This boolean parameter shows how to traverse the `clocksource_list`. In our case we pass `false`, which means that we will go through all entries of the `clocksource_list`. We already know that the `clocksource` with the best rating will be the first in the `clocksource_list` after the call of the `clocksource_enqueue` function, so we can easily get it from this list.
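The rating-ordered insertion of `clocksource_enqueue` can be modelled without the kernel's `list_head` API. Here is a simplified userspace sketch with a singly linked list (the struct and helper names are illustrative, not the kernel's): entries are kept in descending rating order, so the best-rated clock source is always at the head.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct clocksource: only what ordering needs. */
struct cs {
    const char *name;
    int rating;
    struct cs *next;
};

/* Walk past every entry whose rating is >= ours, as the kernel's
 * list_for_each_entry() loop in clocksource_enqueue() does, then link
 * the new entry in at that position. */
static void cs_enqueue(struct cs **head, struct cs *cs)
{
    while (*head && (*head)->rating >= cs->rating)
        head = &(*head)->next;
    cs->next = *head;
    *head = cs;
}
```

Registering clock sources with ratings 1, 300 and 250 in any order leaves the list as 300, 250, 1, so the head is always the best candidate for `clocksource_select`.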
After we have found the clock source with the best rating, we switch to it:

```C
if (curr_clocksource != best && !timekeeping_notify(best)) {
    pr_info("Switched to clocksource %s\n", best->name);
    curr_clocksource = best;
}
```

The result of this operation we can see in the dmesg output:

```
$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc
```

Note that we can see two clock sources in the dmesg output (`hpet` and `tsc` in our case). Yes, there can actually be many different clock sources on particular hardware. So the Linux kernel knows about all registered clock sources and switches to the clock source with a better rating each time after the registration of a new clock source.

If we look at the bottom of the kernel/time/clocksource.c source code file, we will see that it has a sysfs interface. Its main initialization occurs in the `init_clocksource_sysfs` function, which will be called during the device initcalls. Let's look at the implementation of the `init_clocksource_sysfs` function:

```C
static struct bus_type clocksource_subsys = {
    .name = "clocksource",
    .dev_name = "clocksource",
};

static int __init init_clocksource_sysfs(void)
{
    int error = subsys_system_register(&clocksource_subsys, NULL);

    if (!error)
        error = device_register(&device_clocksource);
    if (!error)
        error = device_create_file(&device_clocksource,
                                   &dev_attr_current_clocksource);
    if (!error)
        error = device_create_file(&device_clocksource,
                                   &dev_attr_unbind_clocksource);
    if (!error)
        error = device_create_file(&device_clocksource,
                                   &dev_attr_available_clocksource);
    return error;
}

device_initcall(init_clocksource_sysfs);
```

First of all we can see that it registers a `clocksource` subsystem with the call of the `subsys_system_register` function.
In other words, after the call of this function we will have the following directory:

```
$ pwd
/sys/devices/system/clocksource
```

After this step, we can see the registration of the `device_clocksource` device, which is represented by the following structure:

```C
static struct device device_clocksource = {
    .id  = 0,
    .bus = &clocksource_subsys,
};
```

and the creation of three files:

* `dev_attr_current_clocksource`;
* `dev_attr_unbind_clocksource`;
* `dev_attr_available_clocksource`.

These files provide information about the current clock source in the system and the available clock sources in the system, and an interface which allows us to unbind a clock source.

After the `init_clocksource_sysfs` function has been executed, we are able to find information about the available clock sources in the system:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

Or, for example, information about the current clock source in the system:

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```

In the previous part we saw the API for the registration of the `jiffies` clock source, but didn't dive into the details of the `clocksource` framework. In this part we did, and we saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the `clocksource` framework provides. There are a couple of additional functions, like `clocksource_unregister` for removing a given clock source from the `clocksource_list`, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested, you can find them in kernel/time/clocksource.c.

That's all.

Conclusion

This is the end of the second part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: `jiffies` and `clocksource`.
In this part we saw some examples of `jiffies` usage and learned more details about the `clocksource` concept.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

* x86
* x86_64
* uptime
* Ensoniq Soundscape Elite
* RTC
* interrupts
* IBM PC
* programmable interval timer
* Hz
* nanoseconds
* dmesg
* time stamp counter
* loadable kernel module
* IA64
* watchdog
* clock rate
* mutex
* sysfs
* previous part

Timers and time management in the Linux kernel. Part 3.

The tick broadcast framework and dyntick

This is the third part of the chapter which describes timers and time management related stuff in the Linux kernel. In the previous part we stopped at the `clocksource` framework. We started to consider that framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first part of this chapter, is `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

function, which is defined in the kernel/time/jiffies.c source code file and executes the initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that is defined in the https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c source code file and executes architecture-specific (x86_64 in our case) initialization.
Look at the implementation of `setup_arch` and you will note that the call of `register_refined_jiffies` is the last step before the `setup_arch` function finishes its work.

Many different x86_64-specific things are already configured by the end of `setup_arch` execution. For example, some early interrupt handlers are already able to handle interrupts, memory space is reserved for the initrd, DMI is scanned, the Linux kernel log buffer is already set up - which means that the printk function is able to work - e820 is parsed and the Linux kernel already knows about the available memory, and many, many other architecture-specific things are done (if you are interested, you can read more about the `setup_arch` function and the Linux kernel initialization process in the second chapter of this book).

Now that `setup_arch` has finished its work, we can go back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function, which is defined in the init/main.c source code file. So, we shall return to this function. You can see that there are many different functions called right after the `setup_arch` function inside the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function related to time management in the Linux kernel is:

```C
tick_init();
```

in `start_kernel`. The `tick_init` function is defined in the kernel/time/tick-common.c source code file and does two things:

* Initialization of `tick broadcast` framework related data structures;
* Initialization of `full` tickless mode related data structures.

We have not seen anything related to the `tick broadcast` framework in this book yet, and we do not know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and learn what they are.

The idle process

First of all, let's look at the implementation of the `tick_init` function.
As I already wrote, this function is defined in the kernel/time/tick-common.c source code file and consists of two calls of the following functions:

```C
void __init tick_init(void)
{
    tick_broadcast_init();
    tick_nohz_init();
}
```

As you can understand from the paragraph's title, we are interested only in the `tick_broadcast_init` function for now. This function is defined in the kernel/time/tick-broadcast.c source code file and executes the initialization of the `tick broadcast` framework related data structures. Before we look at the implementation of the `tick_broadcast_init` function and try to understand what this function does, we need to know about the `tick broadcast` framework.

The main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called idle. When the processor has nothing to execute, the Linux kernel launches the `idle` task. We already saw a little about this in the last part of the Linux kernel initialization process. When the Linux kernel finishes all initialization processes in the `start_kernel` function from the init/main.c source code file, it calls the `rest_init` function from the same source code file. The main point of this function is to launch the kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling, and to go to sleep by calling the `cpu_idle_loop` function that is defined in the kernel/sched/idle.c source code file.

The `cpu_idle_loop` function represents an infinite loop which checks the need for rescheduling on each iteration.
After the scheduler finds something to execute, the `idle` process finishes its work and control is moved to the new runnable task with the call of the `schedule_preempt_disabled` function:

```C
static void cpu_idle_loop(void)
{
    while (1) {
        while (!need_resched()) {
            ...
            ...
            ...
            /* the main idle function */
            cpuidle_idle_call();
        }
        ...
        ...
        ...
        schedule_preempt_disabled();
    }
}
```

Of course, we will not consider the full implementation of the `cpu_idle_loop` function and the details of the `idle` state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that a processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the `idle` process if the processor is executing the infinite loop in `cpu_idle_loop`? The answer is system timer interrupts. When an interrupt occurs, the processor stops the `idle` thread and transfers control to the interrupt handler. After the system timer interrupt handler has been handled, `need_resched` returns true and the Linux kernel stops the `idle` process and transfers control to the current runnable task. But handling the system timer interrupts is not effective for power management, because if a processor is in the `idle` state, there is little point in sending it a system timer interrupt.

By default, the `CONFIG_HZ_PERIODIC` kernel configuration option is enabled in the Linux kernel and tells it to handle every interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:

The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the `CONFIG_NO_HZ_IDLE` kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called `dyntick-idle` mode.
But if the kernel does not handle interrupts of a system timer, how can it decide whether the system has nothing to do?

Whenever the idle task is selected to run, the periodic tick is disabled with the call of the `tick_nohz_idle_enter` function, which is defined in the kernel/time/tick-sched.c source code file, and enabled with the call of the `tick_nohz_idle_exit` function. There is a special concept in the Linux kernel called `clock event devices` that are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future, and it is represented by the `clock_event_device` structure in the Linux kernel. We will not dive into the implementation of the `clock_event_device` structure now; we will see it in the next part of this chapter. But there is one interesting moment for us right now.

The second way is to omit scheduling-clock ticks on processors that are either in the `idle` state or have only one runnable task, in other words on busy processors. We can enable this feature with the `CONFIG_NO_HZ_FULL` kernel configuration option, and it allows the number of timer interrupts to be reduced significantly.

Besides the `cpu_idle_loop`, an idle processor can be in a sleeping state. The Linux kernel provides the special `cpuidle` framework. The main point of this framework is to put an idle processor into sleeping states. The name of the set of these states is `C-states`. But how will a processor be woken if its local timer is disabled? The Linux kernel provides the `tick broadcast` framework for this. The main point of this framework is to assign a timer which is not affected by the `C-states`. This timer will wake a sleeping processor.

Now, after some theory, we can return to the implementation of our function. Let's recall that the `tick_init` function just calls the two following functions:

```C
void __init tick_init(void)
{
    tick_broadcast_init();
    tick_nohz_init();
}
```

Let's consider the first function.
The `tick_broadcast_init` function is defined in the kernel/time/tick-broadcast.c source code file and executes the initialization of the `tick broadcast` framework related data structures. Let's look at the implementation of the `tick_broadcast_init` function:

```C
void __init tick_broadcast_init(void)
{
    zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
    zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
    zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}
```

As we can see, the `tick_broadcast_init` function allocates different cpumasks with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function is defined in the lib/cpumask.c source code file and expands to a call of the following function:

```C
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
    return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}
```

Ultimately, the memory space will be allocated for the given cpumask with the certain flags with the help of the `kmalloc_node` function:

```C
*mask = kmalloc_node(cpumask_size(), flags, node);
```

Now let's look at the cpumasks that are initialized in the `tick_broadcast_init` function. As we can see, the `tick_broadcast_init` function initializes six cpumasks, and moreover, the initialization of the last three cpumasks depends on the `CONFIG_TICK_ONESHOT` kernel configuration option.

The first three cpumasks are:

* `tick_broadcast_mask` - the bitmap which represents the list of processors that are in a sleeping mode;
* `tick_broadcast_on` - the bitmap that stores the numbers of processors which are in a periodic broadcast state;
* `tmpmask` - a bitmap for temporary usage.

As we already know, the next three cpumasks depend on the `CONFIG_TICK_ONESHOT` kernel configuration option.
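Conceptually, each of these cpumasks is just a zero-initialized bitmap with one bit per processor, as the `__GFP_ZERO` flag of `zalloc_cpumask_var` guarantees. A minimal userspace stand-in (capped at 64 CPUs here purely for illustration; the kernel's real cpumask API handles arbitrary CPU counts) might look like:

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; 64 bits is enough for this sketch. */
typedef uint64_t cpumask_t;

/* Mark a CPU in the mask, as cpumask_set_cpu() does in the kernel. */
static void cpumask_set_cpu(int cpu, cpumask_t *mask)
{
    *mask |= (cpumask_t)1 << cpu;
}

/* Test whether a CPU is set in the mask. */
static int cpumask_test_cpu(int cpu, const cpumask_t *mask)
{
    return (int)((*mask >> cpu) & 1);
}

/* True when no CPU is set, like cpumask_empty() in the kernel. */
static int cpumask_empty(const cpumask_t *mask)
{
    return *mask == 0;
}
```

For example, `tick_broadcast_mask` starts empty, and a bit is set for each processor that goes to sleep and needs a broadcast wakeup.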
Actually, each clock event device can be in one of two modes:

* `periodic` - clock event devices that support periodic events;
* `oneshot` - clock event devices that are capable of issuing events that happen only once.

The Linux kernel defines two masks for such clock event devices in the include/linux/clockchips.h header file:

```C
#define CLOCK_EVT_FEAT_PERIODIC 0x000001
#define CLOCK_EVT_FEAT_ONESHOT  0x000002
```

So, the last three cpumasks are:

* `tick_broadcast_oneshot_mask` - stores the numbers of processors that must be notified;
* `tick_broadcast_pending_mask` - stores the numbers of processors with a pending broadcast;
* `tick_broadcast_force_mask` - stores the numbers of processors with an enforced broadcast.

We have now initialized the six cpumasks of the `tick broadcast` framework and can proceed to the implementation of the framework itself.

The tick broadcast framework

Hardware may provide several clock source devices. When a processor sleeps and its local timer has stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these `special` clock source devices, which can raise an interrupt at a specified time. We already know that such timers are called `clock events` devices in the Linux kernel. Besides `clock events` devices, each processor in the system has its own local timer, which is programmed to issue an interrupt at the time of the next deferred task. Also, these timers can be programmed to do a periodic job, like updating `jiffies` and so on. These timers are represented by the `tick_device` structure in the Linux kernel. This structure is defined in the kernel/time/tick-sched.h header file and looks like:

```C
struct tick_device {
    struct clock_event_device *evtdev;
    enum tick_device_mode mode;
};
```

Note that the `tick_device` structure contains two fields. The first field, `evtdev`, represents a pointer to the `clock_event_device` structure that is defined in the include/linux/clockchips.h header file and represents the descriptor of a clock event device.
A `clock event` device allows us to register an event that will happen in the future. As I already wrote, we will not consider the `clock_event_device` structure and the related API in this part, but will see it in the next part.

The second field of the `tick_device` structure represents the mode of the `tick_device`. As we already know, the mode can be one of:

```C
enum tick_device_mode {
    TICKDEV_MODE_PERIODIC,
    TICKDEV_MODE_ONESHOT,
};
```

Each `clock events` device in the system registers itself by the call of the `clockevents_register_device` function or the `clockevents_config_and_register` function during the initialization process of the Linux kernel. During the registration of a new `clock events` device, the Linux kernel calls the `tick_check_new_device` function, which is defined in the kernel/time/tick-common.c source code file and checks whether the given `clock events` device should be used by the Linux kernel. After all checks, the `tick_check_new_device` function executes a call of the:

```C
tick_install_broadcast_device(newdev);
```

function, which checks whether the given `clock event` device can be a broadcast device, and installs it if it can. Let's look at the implementation of the `tick_install_broadcast_device` function:

```C
void tick_install_broadcast_device(struct clock_event_device *dev)
{
    struct clock_event_device *cur = tick_broadcast_device.evtdev;

    if (!tick_check_broadcast_device(cur, dev))
        return;

    if (!try_module_get(dev->owner))
        return;

    clockevents_exchange_device(cur, dev);

    if (cur)
        cur->event_handler = clockevents_handle_noop;

    tick_broadcast_device.evtdev = dev;

    if (!cpumask_empty(tick_broadcast_mask))
        tick_broadcast_start_periodic(dev);

    if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
        tick_clock_notify();
}
```

First of all, we get the current `clock event` device from the `tick_broadcast_device`.
The `tick_broadcast_device` is defined in the kernel/time/tick-common.c source code file:

```C
static struct tick_device tick_broadcast_device;
```

and represents the external clock device that keeps track of events for a processor. The first step after we have got the current clock device is the call of the `tick_check_broadcast_device` function, which checks that the given clock events device can be utilized as a broadcast device. The main point of the `tick_check_broadcast_device` function is to check the value of the `features` field of the given `clock events` device. As we can understand from the name of this field, the `features` field contains the clock event device's features. The available values are defined in the include/linux/clockchips.h header file, for example `CLOCK_EVT_FEAT_PERIODIC`, which represents a clock events device that supports periodic events, and so on. The `tick_check_broadcast_device` function checks the `features` flags for `CLOCK_EVT_FEAT_ONESHOT`, `CLOCK_EVT_FEAT_DUMMY` and other flags, and returns `false` if the given clock events device has one of these features. Otherwise the `tick_check_broadcast_device` function compares the `ratings` of the given clock event device and the current clock event device and returns the best one.

After the `tick_check_broadcast_device` function, we can see the call of the `try_module_get` function, which checks the module owner of the clock events device. We need to do this to be sure that the given `clock events` device was correctly initialized.
The next step is the call of the `clockevents_exchange_device` function, which is defined in the kernel/time/clockevents.c source code file and releases the old clock events device, replacing the previous functional handler with a dummy handler.

In the last step of the `tick_install_broadcast_device` function we check that the `tick_broadcast_mask` is not empty and start the given `clock events` device in periodic mode with the call of the `tick_broadcast_start_periodic` function:

```C
if (!cpumask_empty(tick_broadcast_mask))
    tick_broadcast_start_periodic(dev);

if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
    tick_clock_notify();
```

The `tick_broadcast_mask` is filled in the `tick_device_uses_broadcast` function, which checks a `clock events` device during its registration:

```C
int cpu = smp_processor_id();

int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
    ...
    ...
    ...
    if (!tick_device_is_functional(dev)) {
        ...
        cpumask_set_cpu(cpu, tick_broadcast_mask);
        ...
    }
    ...
    ...
    ...
}
```

More about the `smp_processor_id` macro you can read in the fourth part of the Linux kernel initialization process chapter.

The `tick_broadcast_start_periodic` function checks the given `clock event` device and calls the `tick_setup_periodic` function:

```C
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
    if (bc)
        tick_setup_periodic(bc, 1);
}
```

which is defined in the kernel/time/tick-common.c source code file and sets the broadcast handler for the given `clock event` device by the call of the following function:

```C
tick_set_periodic_handler(dev, broadcast);
```

This function checks the second parameter, which represents the broadcast state (`on` or `off`), and sets the broadcast handler depending on its value:

```C
void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
    if (!broadcast)
        dev->event_handler = tick_handle_periodic;
    else
        dev->event_handler = tick_handle_periodic_broadcast;
}
```

When a `clock event` device issues an interrupt, the `dev->event_handler` will be called.
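The dispatch through `event_handler` is plain function-pointer indirection, which can be sketched in userspace (the struct below is a cut-down illustration of the idea, not the real `clock_event_device`, and the counters are only for demonstration):

```c
#include <assert.h>

/* Cut-down stand-in: only the handler pointer matters for this sketch. */
struct clock_event_device {
    void (*event_handler)(struct clock_event_device *dev);
};

static int periodic_calls;
static int broadcast_calls;

static void tick_handle_periodic(struct clock_event_device *dev)
{
    (void)dev;
    periodic_calls++;
}

static void tick_handle_periodic_broadcast(struct clock_event_device *dev)
{
    (void)dev;
    broadcast_calls++;
}

/* Same shape as the kernel function above: pick the handler depending
 * on whether the device is used for broadcast. */
static void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
    if (!broadcast)
        dev->event_handler = tick_handle_periodic;
    else
        dev->event_handler = tick_handle_periodic_broadcast;
}
```

An interrupt handler then only has to call `dev->event_handler(dev)`; which function actually runs was decided once, at setup time.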
For example, let's look at the interrupt handler of the high precision event timer, which is located in the arch/x86/kernel/hpet.c source code file:

```C
static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
    struct hpet_dev *dev = (struct hpet_dev *)data;
    struct clock_event_device *hevt = &dev->evt;

    if (!hevt->event_handler) {
        printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
               dev->num);
        return IRQ_HANDLED;
    }

    hevt->event_handler(hevt);
    return IRQ_HANDLED;
}
```

The `hpet_interrupt_handler` gets the irq-specific data and checks the event handler of the `clock event` device - the one we just set in the `tick_set_periodic_handler` function. So the `tick_handle_periodic_broadcast` function will be called at the end of the high precision event timer interrupt handler.

The `tick_handle_periodic_broadcast` function calls the

```C
bc_local = tick_do_periodic_broadcast();
```

function, which stores the numbers of processors which have asked to be woken up in the temporary cpumask and calls the `tick_do_broadcast` function:

```C
cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);
```

The `tick_do_broadcast` function calls the `broadcast` function of the given clock events device, which sends an IPI interrupt to the set of processors. In the end we can call the event handler of the given `tick_device`:

```C
if (bc_local)
    td->evtdev->event_handler(td->evtdev);
```

which actually represents the interrupt handler of the local timer of a processor. After this the processor will wake up. That is all about the `tick broadcast` framework in the Linux kernel. We have missed some aspects of this framework, for example reprogramming of a `clock event` device, broadcast with the oneshot timer and so on. But the Linux kernel is very big and it is not realistic to cover all aspects of it.
I think it will be interesting for you to dive into it yourself.

If you remember, we started this part with the call of the `tick_init` function. We have just considered the `tick_broadcast_init` function and the related theory, but the `tick_init` function contains another function call, and this function is `tick_nohz_init`. Let's look at the implementation of this function.

Initialization of dyntick related data structures

We already saw some information about the `dyntick` concept in this part, and we know that this concept allows the kernel to disable system timer interrupts in the `idle` state. The `tick_nohz_init` function initializes the different data structures which are related to this concept. This function is defined in the kernel/time/tick-sched.c source code file and starts with a check of the value of the `tick_nohz_full_running` variable, which represents the state of the tick-less mode for the `idle` state and the state when system timer interrupts are disabled while a processor has only one runnable task:

```C
if (!tick_nohz_full_running)
```

```C
...
map->gc.function = gc;
map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
...
```

The function that is pointed to by the `gc` pointer will be called after a timeout which is equal to `map->gc.expires`.

Ok, we will not dive into this example with the netfilter, because this chapter is not about network related stuff. But we saw that timers are widely used in the Linux kernel and learned that they represent a concept which allows functions to be called in the future.

Now let's continue to research the source code of the Linux kernel which is related to the timers and time management stuff, as we did in all previous chapters.

Introduction to dynamic timers in the Linux kernel

As I already wrote, we learned about the `tick broadcast` framework and the `NO_HZ` mode in the previous part. They are initialized in the init/main.c source code file by the call of the `tick_init` function.
If we look at this source code file, we will see that the next time management related function is:

```C
init_timers();
```

This function is defined in the kernel/time/timer.c source code file and contains calls of four functions:

```C
void __init init_timers(void)
{
	init_timer_cpus();
	init_timer_stats();
	timer_register_cpu_notifier();
	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
}
```

Let's look at the implementation of each function. The first function is `init_timer_cpus`, defined in the same source code file, which just calls the `init_timer_cpu` function for each possible processor in the system:

```C
static void __init init_timer_cpus(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		init_timer_cpu(cpu);
}
```

If you do not know or do not remember what a `possible` cpu is, you can read the special part of this book which describes the `cpumask` concept in the Linux kernel. In short words, a `possible` processor is a processor which can be plugged in anytime during the life of the system.

The `init_timer_cpu` function does the main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure is defined in the kernel/time/timer.c source code file and stores data related to a `dynamic` timer for a certain processor. Let's look at the definition of this structure:

```C
struct tvec_base {
	spinlock_t lock;
	struct timer_list *running_timer;
	unsigned long timer_jiffies;
	unsigned long next_timer;
	unsigned long active_timers;
	unsigned long all_timers;
	int cpu;
	bool migration_enabled;
	bool nohz_active;
	struct tvec_root tv1;
	struct tvec tv2;
	struct tvec tv3;
	struct tvec tv4;
	struct tvec tv5;
} ____cacheline_aligned;
```

The `tvec_base` structure contains the following fields: the `lock` for `tvec_base` protection; the `running_timer` field, which points to the currently running timer for the certain processor; the `timer_jiffies` field, which represents the earliest expiration time (it will be used by the Linux kernel to find already expired timers).
The next field - `next_timer` contains the next pending timer for a next timer interrupt in a case when a processor goes to sleep and the `NO_HZ` mode is enabled in the Linux kernel. The `active_timers` field provides accounting of non-deferrable timers, or in other words all timers that will not be stopped while a processor goes to sleep. The `all_timers` field tracks the total number of timers, or `active_timers` + deferrable timers. The `cpu` field represents the number of the processor which owns the timers. The `migration_enabled` and `nohz_active` fields represent the opportunity of timers migration to another processor and the status of the `NO_HZ` mode respectively.

The last five fields of the `tvec_base` structure represent lists of dynamic timers. The first `tv1` field has:

```C
#define TVR_SIZE (1 << TVR_BITS)
```

elements.

The `init_timer_cpu` function initializes the `tvec_base` of the given processor:

```C
static void __init init_timer_cpu(int cpu)
{
	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);

	base->cpu = cpu;
	spin_lock_init(&base->lock);
	base->timer_jiffies = jiffies;
	base->next_timer = base->timer_jiffies;
}
```

The `tvec_bases` represents a per-cpu variable which represents the main data structure for a dynamic timer for a given processor. This `per-cpu` variable is defined in the same source code file:

```C
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
```

First of all we're getting the address of the `tvec_bases` variable for the given processor and as we got it, we are starting to initialize some of the `tvec_base` fields in the `init_timer_cpu` function.
After initialization of the `per-cpu` dynamic timers with the jiffies and the number of a possible processor, we need to initialize the `tstats_lookup_lock` spinlock in the `init_timer_stats` function:

```C
void __init init_timer_stats(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		raw_spin_lock_init(&per_cpu(tstats_lookup_lock, cpu));
}
```

The `tstats_lookup_lock` variable represents a `per-cpu` raw spinlock:

```C
static DEFINE_PER_CPU(raw_spinlock_t, tstats_lookup_lock);
```

which will be used for protection of operations with statistics of timers that can be accessed through the procfs:

```C
static int __init init_tstats_procfs(void)
{
	struct proc_dir_entry *pe;

	pe = proc_create("timer_stats", 0644, NULL, &tstats_fops);
	if (!pe)
		return -ENOMEM;
	return 0;
}
```

For example:

```
$ cat /proc/timer_stats
Timer stats sample period: 3.888770 s
  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  15,     1 swapper          hcd_submit_urb (rh_timer_func)
   4,   959 kedac            schedule_timeout (process_timeout)
   1,     0 swapper          page_writeback_init (wb_timer_fn)
  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
...
...
```

The next step after the initialization of the `tstats_lookup_lock` spinlock is the call of the `timer_register_cpu_notifier` function.
This function depends on the `CONFIG_HOTPLUG_CPU` kernel configuration option which enables support for hotplug processors in the Linux kernel. When a processor will be logically offlined, a notification will be sent to the Linux kernel with the `CPU_DEAD` or the `CPU_DEAD_FROZEN` event by the call of the `cpu_notifier` macro:

```C
#ifdef CONFIG_HOTPLUG_CPU
...
static inline void timer_register_cpu_notifier(void)
{
	cpu_notifier(timer_cpu_notify, 0);
}
...
#else
...
static inline void timer_register_cpu_notifier(void) { }
...
#endif /* CONFIG_HOTPLUG_CPU */
```

In this case the `timer_cpu_notify` will be called, which checks the event type and will call the `migrate_timers` function:

```C
static int timer_cpu_notify(struct notifier_block *self,
			    unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		migrate_timers((long)hcpu);
		break;
	default:
		break;
	}

	return NOTIFY_OK;
}
```

This chapter will not describe `hotplug` related events in the Linux kernel source code, but if you are interested in such things, you can find the implementation of the `migrate_timers` function in the kernel/time/timer.c source code file.

The last step in the `init_timers` function is the call of the:

```C
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
```

function. The `open_softirq` function may be already familiar to you if you have read the ninth part about the interrupts and interrupt handling in the Linux kernel. In short words, the `open_softirq` function is defined in the kernel/softirq.c source code file and executes initialization of the deferred interrupt handler.

In our case the deferred function is the `run_timer_softirq` function that will be called after a hardware interrupt in the `do_IRQ` function which is defined in the arch/x86/kernel/irq.c source code file. The main point of this function is to handle a software dynamic timer.
The Linux kernel does not do this during the hardware timer interrupt handling because it is a time consuming operation.

Let's look at the implementation of the `run_timer_softirq` function:

```C
static void run_timer_softirq(struct softirq_action *h)
{
	struct tvec_base *base = this_cpu_ptr(&tvec_bases);

	if (time_after_eq(jiffies, base->timer_jiffies))
		__run_timers(base);
}
```

At the beginning of the `run_timer_softirq` function we get a `dynamic` timer base for the current processor and compare the current value of the jiffies with the value of the `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the include/linux/jiffies.h header file:

```C
#define time_after_eq(a,b)          \
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)((a) - (b)) >= 0))
```

Recall that the `timer_jiffies` field of the `tvec_base` structure represents the relative time when functions delayed by the given timer will be executed. So we compare these two values and if the current time represented by the `jiffies` is greater than `base->timer_jiffies`, we call the `__run_timers` function that is defined in the same source code file. Let's look at the implementation of this function.

As I just wrote, the `__run_timers` function runs all expired timers for a given processor. This function starts from the acquiring of the `tvec_base`'s lock to protect the `tvec_base` structure:

```C
static inline void __run_timers(struct tvec_base *base)
{
	struct timer_list *timer;

	spin_lock_irq(&base->lock);
	...
	...
	...
	spin_unlock_irq(&base->lock);
}
```

After this it starts the loop while the `timer_jiffies` will not be greater than the `jiffies`:

```C
while (time_after_eq(jiffies, base->timer_jiffies)) {
	...
	...
	...
}
```

We can find many different manipulations in our loop, but the main point is to find expired timers and call the delayed functions.
First of all we need to calculate the `index` of the `base->tv1` list that stores the next timer to be handled with the following expression:

```C
index = base->timer_jiffies & TVR_MASK;
```

where the `TVR_MASK` is a mask for getting the `tvec_root->vec` elements. As we got the index of the next timer which must be handled, we check its value. If the index is zero, we go through all lists in our cascade table `tv2`, `tv3` and so on, and rehash them with the call of the `cascade` function:

```C
if (!index &&
	(!cascade(base, &base->tv2, INDEX(0))) &&
		(!cascade(base, &base->tv3, INDEX(1))) &&
			!cascade(base, &base->tv4, INDEX(2)))
	cascade(base, &base->tv5, INDEX(3));
```

After this we increase the value of the `base->timer_jiffies`:

```C
++base->timer_jiffies;
```

In the last step we execute a corresponding function for each timer from the list in the following loop:

```C
hlist_move_list(base->tv1.vec + index, head);

while (!hlist_empty(head)) {
	...
	...
	...
	timer = hlist_entry(head->first, struct timer_list, entry);
	fn = timer->function;
	data = timer->data;

	spin_unlock(&base->lock);
	call_timer_fn(timer, fn, data);
	spin_lock(&base->lock);
	...
	...
	...
}
```

where the `call_timer_fn` just calls the given function:

```C
static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
			  unsigned long data)
{
	...
	...
	...
	fn(data);
	...
	...
	...
}
```

That's all. From this moment the Linux kernel has the infrastructure for `dynamic timers`. We will not dive deeper into this interesting theme. As I already wrote, `timers` are a widely used concept in the Linux kernel and neither one part nor two parts can cover the understanding of how they are implemented and how they work. But now we know about this concept, why the Linux kernel needs it and some data structures around it.

Now let's look at the usage of `dynamic timers` in the Linux kernel.

## Usage of dynamic timers

As you may already have noted, if the Linux kernel provides a concept, it also provides an API for managing this concept, and the `dynamic timers` concept is no exception here.
To use a timer in the Linux kernel code, we must define a variable with the `timer_list` type. We can initialize our `timer_list` structure in two ways. The first is to use the `init_timer` macro that is defined in the include/linux/timer.h header file:

```C
#define init_timer(timer)    \
	__init_timer((timer), 0)

#define __init_timer(_timer, _flags)   \
	init_timer_key((_timer), (_flags), NULL, NULL)
```

where the `init_timer_key` function just calls the:

```C
do_init_timer(timer, flags, name, key);
```

function which fills the given `timer` with default values. The second way is to use the:

```C
#define TIMER_INITIALIZER(_function, _expires, _data)		\
	__TIMER_INITIALIZER((_function), (_expires), (_data), 0)
```

macro which will initialize the given `timer_list` structure too.

After a `dynamic timer` is initialized we can start this `timer` with the call of the:

```C
void add_timer(struct timer_list *timer);
```

function and stop it with the:

```C
int del_timer(struct timer_list *timer);
```

function.

That's all.

## Conclusion

This is the end of the fourth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with two new concepts: the `tick broadcast` framework and the `NO_HZ` mode. In this part we continued to dive into time management related stuff and got acquainted with a new concept - a `dynamic timer` or software timer. We didn't see the implementation of the `dynamic timers` management code in detail in this part, but we saw the data structures and API around this concept.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us - timers.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience.
If you found any mistakes please send me a PR to linux-insides.

## Links

* IP
* netfilter
* network
* cpumask
* interrupt
* jiffies
* per-cpu
* spinlock
* procfs
* previous part

# Timers and time management in the Linux kernel. Part 5.

## Introduction to the `clockevents` framework

This is the fifth part of the chapter which describes timers and time management related stuff in the Linux kernel. As you might have noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the second part of this chapter. It was the `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.

At first let's refresh your memory and try to remember what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a timeline. As described in the documentation:

> For example issuing the command 'date' on a Linux system will eventually read the clock source to determine exactly what time it is.

The Linux kernel supports many different clock sources. You can find some of them in the drivers/clocksource directory. For example the good old Intel 8253 - a programmable interval timer with a `1193182 Hz` frequency, and yet another one - the ACPI PM timer with a `3579545 Hz` frequency. Besides the drivers/clocksource directory, each architecture may provide its own architecture-specific clock sources. For example the x86 architecture provides the High Precision Event Timer, and powerpc provides access to the processor timer through the `timebase` register.

Each clock source provides a monotonic atomic counter. As I already wrote, the Linux kernel supports a huge set of different clock sources and each clock source has its own parameters like frequency. The main goal of the `clocksource` framework is to provide an API to select the best available clock source in the system, i.e. the clock source with the highest frequency. An additional goal of the `clocksource` framework is to represent the atomic counter provided by a clock source in human units.
At this time, nanoseconds are the favorite choice for the time value units of the given clock source in the Linux kernel.

The `clocksource` framework is represented by the `clocksource` structure which is defined in the include/linux/clocksource.h header file, which contains the `name` of a clock source, the rating of the certain clock source in the system (a clock source with a higher frequency has a higher rating in the system), the `list` of all registered clock sources in the system, the `enable` and `disable` fields to enable and disable a clock source, a pointer to the `read` function which must return an atomic counter of a clock source and so on.

Additionally the `clocksource` structure provides two fields: `mult` and `shift` which are needed for translation of an atomic counter which is provided by a certain clock source to the human units, i.e. nanoseconds. Translation occurs via the following formula:

```
ns ~= (clocksource * mult) >> shift
```

As we already know, besides the `clocksource` structure, the `clocksource` framework provides an API for registration of clock sources with different frequency scale factors:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
```

clock source unregistration:

```C
int clocksource_unregister(struct clocksource *cs)
```

and so on.

In addition to the `clocksource` framework, the Linux kernel provides the `clockevents` framework.
As described in the documentation:

> Clock events are the conceptual reverse of clock sources

The main goal of the `clockevents` framework is to manage clock event devices, or in other words - to manage devices that allow registering an event, in other words an interrupt, that is going to happen at a defined point of time in the future.

Now we know a little about the `clockevents` framework in the Linux kernel, and now it is time to look at its API.

## API of the `clockevents` framework

The main structure which describes a clock event device is the `clock_event_device` structure. This structure is defined in the include/linux/clockchips.h header file and contains a huge set of fields. As well as the `clocksource` structure it has a `name` field which contains a human readable name of a clock event device, for example the local APIC timer:

```C
static struct clock_event_device lapic_clockevent = {
	.name		= "lapic",
	...
	...
	...
}
```

It also holds the addresses of the `event_handler`, `set_next_event` and `next_event` functions for a certain clock event device which are the interrupt handler, the setter of the next event and local storage for the next event respectively. Yet another field of the `clock_event_device` structure is the `features` field. Its value may be one of the following generic features:

```C
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT	0x000002
```

Where the `CLOCK_EVT_FEAT_PERIODIC` represents a device which may be programmed to generate events periodically. The `CLOCK_EVT_FEAT_ONESHOT` represents a device which may generate an event only once. Besides these two features, there are also architecture-specific features. For example x86_64 supports two additional features:

```C
#define CLOCK_EVT_FEAT_C3STOP	0x000008
```

The first, `CLOCK_EVT_FEAT_C3STOP`, means that a clock event device will be stopped in the C3 state.
Additionally the `clock_event_device` structure has `mult` and `shift` fields just like the `clocksource` structure. The `clock_event_device` structure also contains other fields, but we will consider them later.

After we considered part of the `clock_event_device` structure, it is time to look at the `API` of the `clockevents` framework. To work with a clock event device, first of all we need to initialize the `clock_event_device` structure and register a clock events device. The `clockevents` framework provides the following `API` for registration of clock event devices:

```C
void clockevents_register_device(struct clock_event_device *dev)
{
	...
	...
	...
}
```

This function is defined in the kernel/time/clockevents.c source code file and as we may see, the `clockevents_register_device` function takes only one parameter: the address of a `clock_event_device` structure which represents a clock event device.

So, to register a clock event device, at first we need to initialize a `clock_event_device` structure with the parameters of a certain clock event device. Let's take a look at one random clock event device in the Linux kernel source code. We can find one in the drivers/clocksource directory or try to take a look at an architecture-specific clock event device. Let's take for example - the Periodic Interval Timer (PIT) for at91sam926x. You can find its implementation in the drivers/clocksource directory.

First of all let's look at the initialization of the `clock_event_device` structure.
This occurs in the `at91sam926x_pit_common_init` function:

```C
struct pit_data {
	...
	...
	struct clock_event_device	clkevt;
	...
	...
};

static void __init at91sam926x_pit_common_init(struct pit_data *data)
{
	...
	...
	...
	data->clkevt.name = "pit";
	data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
	data->clkevt.shift = 32;
	data->clkevt.mult = div_sc(pit_rate, NSEC_PER_SEC, data->clkevt.shift);
	data->clkevt.rating = 100;
	data->clkevt.cpumask = cpumask_of(0);

	data->clkevt.set_state_shutdown = pit_clkevt_shutdown;
	data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
	data->clkevt.resume = at91sam926x_pit_resume;
	data->clkevt.suspend = at91sam926x_pit_suspend;
	...
}
```

Here we can see that `at91sam926x_pit_common_init` takes one parameter - a pointer to the `pit_data` structure which contains a `clock_event_device` structure which will contain clock event related information of the `at91sam926x` Periodic Interval Timer. At the start we fill the `name` of the timer device and its `features`. In our case we deal with a periodic timer which, as we already know, may be programmed to generate events periodically.

The next two fields, `shift` and `mult`, are familiar to us. They will be used to translate the counter of our timer to nanoseconds. After this we set the rating of the timer to `100`. This means that if there are no timers with a higher rating in the system, this timer will be used for timekeeping. The next field - `cpumask` indicates for which processors in the system the device will work. In our case, the device will work for the first processor. The `cpumask_of` macro is defined in the include/linux/cpumask.h header file and just expands to the call of the:

```C
#define cpumask_of(cpu) (get_cpu_mask(cpu))
```

Where the `get_cpu_mask` returns the cpumask containing just the given `cpu` number. More about the `cpumasks` concept you may read in the CPU masks in the Linux kernel part.
In the last four lines of code we set the callbacks for clock event device suspend/resume, device shutdown and the update of the clock event device state.

After we finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following function:

```C
clockevents_register_device(&data->clkevt);
```

Now we can consider the implementation of the `clockevents_register_device` function. As I already wrote above, this function is defined in the kernel/time/clockevents.c source code file and starts from the initialization of the initial event device state:

```C
clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED);
```

Actually, an event device may be in one of these states:

```C
enum clock_event_state {
	CLOCK_EVT_STATE_DETACHED,
	CLOCK_EVT_STATE_SHUTDOWN,
	CLOCK_EVT_STATE_PERIODIC,
	CLOCK_EVT_STATE_ONESHOT,
	CLOCK_EVT_STATE_ONESHOT_STOPPED,
};
```

Where:

* `CLOCK_EVT_STATE_DETACHED` - a clock event device is not used by the `clockevents` framework. Actually it is the initial state of all clock event devices;
* `CLOCK_EVT_STATE_SHUTDOWN` - a clock event device is powered-off;
* `CLOCK_EVT_STATE_PERIODIC` - a clock event device may be programmed to generate events periodically;
* `CLOCK_EVT_STATE_ONESHOT` - a clock event device may be programmed to generate an event only once;
* `CLOCK_EVT_STATE_ONESHOT_STOPPED` - a clock event device was programmed to generate an event only once and now it is temporarily stopped.

The implementation of the `clockevent_set_state` function is pretty easy:

```C
static inline void clockevent_set_state(struct clock_event_device *dev,
					enum clock_event_state state)
{
	dev->state_use_accessors = state;
}
```

As we can see, it just fills the `state_use_accessors` field of the given `clock_event_device` structure with the given value, which in our case is `CLOCK_EVT_STATE_DETACHED`. All clock event devices have this initial state during registration; the `state_use_accessors` field of the `clock_event_device` structure provides the current state of the clock event device.
After we have set the initial state of the given `clock_event_device` structure, we check that the `cpumask` of the given clock event device is not zero:

```C
if (!dev->cpumask) {
	WARN_ON(num_possible_cpus() > 1);
	dev->cpumask = cpumask_of(smp_processor_id());
}
```

Remember that we have set the `cpumask` of the `at91sam926x` periodic timer to the first processor. If the `cpumask` field is zero, we check the number of possible processors in the system and print a warning message if it is more than one. Additionally we set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth part of the Linux kernel initialization process chapter.

After this check we lock the actual code of the clock event device registration by the call of the following macros:

```C
raw_spin_lock_irqsave(&clockevents_lock, flags);
...
...
...
raw_spin_unlock_irqrestore(&clockevents_lock, flags);
```

Additionally the `raw_spin_lock_irqsave` and the `raw_spin_unlock_irqrestore` macros disable local interrupts, however interrupts on other processors still may occur. We need to do this to prevent a potential deadlock if we are adding a new clock event device to the list of clock event devices and an interrupt occurs from another clock event device.

We can see the following code of clock event device registration between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros:

```C
list_add(&dev->list, &clockevent_devices);
tick_check_new_device(dev);
clockevents_notify_released();
```

First of all we add the given clock event device to the list of clock event devices which is represented by `clockevent_devices`:

```C
static LIST_HEAD(clockevent_devices);
```

At the next step we call the `tick_check_new_device` function which is defined in the kernel/time/tick-common.c source code file and checks whether the newly registered clock event device should be used or not.
The `tick_check_new_device` function gets the current registered tick device which is represented by the `tick_device` structure and compares its rating and features with those of the given `clock_event_device`. Actually the oneshot mode (`CLOCK_EVT_FEAT_ONESHOT`) is preferred:

```C
static bool tick_check_preferred(struct clock_event_device *curdev,
				 struct clock_event_device *newdev)
{
	if (!(newdev->features & CLOCK_EVT_FEAT_ONESHOT)) {
		if (curdev && (curdev->features & CLOCK_EVT_FEAT_ONESHOT))
			return false;
		if (tick_oneshot_mode_active())
			return false;
	}

	return !curdev ||
		newdev->rating > curdev->rating ||
		!cpumask_equal(curdev->cpumask, newdev->cpumask);
}
```

If the newly registered clock event device is more preferred than the old tick device, we exchange the old and new registered devices and install the new device:

```C
clockevents_exchange_device(curdev, newdev);
tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
```

The `clockevents_exchange_device` function releases, or in other words deletes, the old clock event device from the `clockevent_devices` list. The next function - `tick_setup_device`, as we may understand from its name, sets up the new tick device. This function checks the mode of the newly registered clock event device and calls the `tick_setup_periodic` function or the `tick_setup_oneshot` function depending on the tick device mode:

```C
if (td->mode == TICKDEV_MODE_PERIODIC)
	tick_setup_periodic(newdev, 0);
else
	tick_setup_oneshot(newdev, handler, next_event);
```

Both of these functions call the `clockevents_switch_state` function to change the state of the clock event device and the `clockevents_program_event` function to set the next event of the clock event device based on the delta between the maximum and minimum difference of the current time and the time for the next event.
The `tick_setup_periodic`:

```C
clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
clockevents_program_event(dev, next, false);
```

and the `tick_setup_oneshot`:

```C
clockevents_switch_state(newdev, CLOCK_EVT_STATE_ONESHOT);
clockevents_program_event(newdev, next_event, true);
```

The `clockevents_switch_state` function checks that the clock event device is not already in the given state and calls the `__clockevents_switch_state` function from the same source code file:

```C
if (clockevent_get_state(dev) != state) {
	if (__clockevents_switch_state(dev, state))
		return;
```

The `__clockevents_switch_state` function just makes a call of a certain callback depending on the given state:

```C
static int __clockevents_switch_state(struct clock_event_device *dev,
				      enum clock_event_state state)
{
	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
		return 0;

	switch (state) {
	case CLOCK_EVT_STATE_DETACHED:
	case CLOCK_EVT_STATE_SHUTDOWN:
		if (dev->set_state_shutdown)
			return dev->set_state_shutdown(dev);
		return 0;

	case CLOCK_EVT_STATE_PERIODIC:
		if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC))
			return -ENOSYS;
		if (dev->set_state_periodic)
			return dev->set_state_periodic(dev);
		return 0;
	...
	...
	...
```

In our case, for the `at91sam926x` periodic timer, the new state is `CLOCK_EVT_STATE_PERIODIC` and the device has the `CLOCK_EVT_FEAT_PERIODIC` feature:

```C
data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
```

So, the `pit_clkevt_set_periodic` callback will be called.
If we read the documentation of the Periodic Interval Timer (PIT) for at91sam926x, we will see that there is a `Periodic Interval Timer Mode Register` which allows us to control the periodic interval timer. It looks like:

```
31                                                     25     24
+--------------------------------------------------------------+
|                                           | PITIEN |  PITEN  |
+--------------------------------------------------------------+
23                                   19                       16
+--------------------------------------------------------------+
|                                    |          PIV            |
+--------------------------------------------------------------+
15                                                             8
+--------------------------------------------------------------+
|                              PIV                             |
+--------------------------------------------------------------+
7                                                              0
+--------------------------------------------------------------+
|                              PIV                             |
+--------------------------------------------------------------+
```

Where `PIV` or `Periodic Interval Value` defines the value compared with the primary `20-bit` counter of the Periodic Interval Timer. The `PITEN` or `Period Interval Timer Enabled` enables the timer if the bit is `1`, and the `PITIEN` or `Periodic Interval Timer Interrupt Enable` enables the interrupt if the bit is `1`.

So, to set periodic mode, we need to set the `24` and `25` bits in the `Periodic Interval Timer Mode Register`. And we are doing it in the `pit_clkevt_set_periodic` function:

```C
static int pit_clkevt_set_periodic(struct clock_event_device *dev)
{
	struct pit_data *data = clkevt_to_pit_data(dev);
	...
	...
	...
	pit_write(data->base, AT91_PIT_MR,
		  (data->cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN);

	return 0;
}
```

Where `AT91_PIT_MR`, `AT91_PIT_PITEN` and `AT91_PIT_PITIEN` are declared as:

```C
#define AT91_PIT_MR		0x00
#define AT91_PIT_PITIEN		BIT(25)
#define AT91_PIT_PITEN		BIT(24)
```

After the setup of the new clock event device is finished, we can return to the `clockevents_register_device` function. The last function in the `clockevents_register_device` function is:

```C
clockevents_notify_released();
```

This function checks the `clockevents_released` list which contains released clock event devices (remember that they may occur after the call of the `clockevents_exchange_device` function).
If this list is not empty, we go through the clock event devices from the `clockevents_released` list, delete each of them from it and add it back to the `clockevent_devices` list:

```C
static void clockevents_notify_released(void)
{
	struct clock_event_device *dev;

	while (!list_empty(&clockevents_released)) {
		dev = list_entry(clockevents_released.next,
				 struct clock_event_device, list);
		list_del(&dev->list);
		list_add(&dev->list, &clockevent_devices);
		tick_check_new_device(dev);
	}
}
```

That's all. From this moment we have registered a new clock event device. So the usage of the `clockevents` framework is simple and clear. Architectures register their clock event devices with the clockevents core. Users of the clockevents core can get clock event devices for their use. The `clockevents` framework provides notification mechanisms for various clock related management events, like a clock event device being registered or unregistered, a processor being offlined in a system which supports CPU hotplug and so on.

We saw the implementation of only the `clockevents_register_device` function. But generally, the clock event layer API is small. Besides the API for clock event device registration, the `clockevents` framework provides functions to schedule the next event interrupt, a clock event device notification service and support for suspend and resume of clock event devices.

If you want to know more about the `clockevents` API you can start to research the following source code and header files: kernel/time/tick-common.c, kernel/time/clockevents.c and include/linux/clockchips.h.

That's all.

## Conclusion

This is the end of the fifth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `timers` concept.
In this part we continued to learn time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.

## Links

* timekeeping documentation
* Intel 8253
* programmable interval timer
* ACPI pdf
* x86
* High Precision Event Timer
* powerpc
* frequency
* API
* nanoseconds
* interrupt
* interrupt handler
* local APIC
* C3 state
* Periodic Interval Timer (PIT) for at91sam926x
* CPU masks in the Linux kernel
* deadlock
* CPU hotplug
* previous part

# Timers and time management in the Linux kernel. Part 6.

## x86_64 related clock sources

This is the sixth part of the chapter which describes timers and time management related stuff in the Linux kernel. In the previous part we saw the `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe the implementation of x86 architecture related clock sources (more about the `clocksource` concept you can read in the second part of this chapter).

First of all we must know what clock sources may be used on the `x86` architecture. It is easy to learn from sysfs, or from the content of the /sys/devices/system/clocksource/clocksource0/available_clocksource file.
Each `/sys/devices/system/clocksource/clocksourceN` directory provides two special files to achieve this:

* `available_clocksource` - provides information about the available clock sources in the system;
* `current_clocksource` - provides information about the currently used clock source in the system.

So, let's look:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

We can see that there are three registered clock sources in my system:

* `tsc` - Time Stamp Counter;
* `hpet` - High Precision Event Timer;
* `acpi_pm` - ACPI Power Management Timer.

Now let's look at the second file, which shows the best clock source (the clock source which has the best rating in the system):

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```

For me it is the Time Stamp Counter. As we may know from the second part of this chapter, which describes the internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is the clock source with the best (highest) rating, or in other words with the highest frequency.

The frequency of the ACPI power management timer is `3.579545 MHz`. The frequency of the High Precision Event Timer is at least `10 MHz`. And the frequency of the Time Stamp Counter depends on the processor. For example, on older processors the `Time Stamp Counter` was counting internal processor clock cycles, which means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors. Newer processors have an `invariant Time Stamp Counter` that increments at a constant rate in all operational states of the processor. Actually we can get its frequency in the output of `/proc/cpuinfo`.
For example, for the first processor in the system:

```
$ cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
...
```

Although the Intel manual says that the frequency of the `Time Stamp Counter`, while constant, is not necessarily the maximum qualified frequency of the processor, or the frequency given in the brand string, we may see that it will be much higher than the frequency of the `ACPI PM` timer or the `High Precision Event Timer`. And we can see that the clock source with the best rating or highest frequency is the current one in the system.

You can note that besides these three clock sources, we don't see two other clock sources that are familiar to us in the output of `/sys/devices/system/clocksource/clocksource0/available_clocksource`: `jiffy` and `refined_jiffies`. We don't see them because this file lists only high resolution clock sources, or in other words clock sources with the `CLOCK_SOURCE_VALID_FOR_HRES` flag.

As I already wrote above, we will consider all three of these clock sources in this part.
We will consider them in the order of their initialization:

* `hpet`;
* `acpi_pm`;
* `tsc`.

We can make sure that the order is exactly like this in the output of the dmesg util:

```
$ dmesg | grep clocksource
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094369] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.186498] clocksource: Switched to clocksource hpet
[    0.196827] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.413685] tsc: Refined TSC clocksource calibration: 3999.981 MHz
[    1.413688] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x73509721780, max_idle_ns: 881591102108 ns
[    2.413748] clocksource: Switched to clocksource tsc
```

The first clock source is the High Precision Event Timer, so let's start with it.

## High Precision Event Timer

The implementation of the `High Precision Event Timer` for the x86 architecture is located in the arch/x86/kernel/hpet.c source code file. Its initialization starts from the call of the `hpet_enable` function. This function is called during Linux kernel initialization. If we look into the `start_kernel` function from the init/main.c source code file, we will see that after all the architecture-specific stuff is initialized, the early console is disabled and the time management subsystem is already ready, the following function is called:

```C
if (late_time_init)
	late_time_init();
```

which does initialization of the late architecture-specific timers after the early jiffy counter is already initialized. The definition of the `late_time_init` function for the `x86` architecture is located in the arch/x86/kernel/time.c source code file.
It looks pretty easy:

```C
static __init void x86_late_time_init(void)
{
	x86_init.timers.timer_init();
	tsc_init();
}
```

As we may see, it does initialization of the `x86` related timer and initialization of the `Time Stamp Counter`. The second we will see in the next paragraph, but for now let's consider the call of the `x86_init.timers.timer_init` function. The `timer_init` points to the `hpet_time_init` function from the same source code file. We can verify this by looking at the definition of the `x86_init` structure from arch/x86/kernel/x86_init.c:

```C
struct x86_init_ops x86_init __initdata = {
	...
	...
	...
	.timers = {
		.setup_percpu_clockev	= setup_boot_APIC_clock,
		.timer_init		= hpet_time_init,
		.wallclock_init		= x86_init_noop,
	},
	...
	...
	...
};
```

The `hpet_time_init` function does setup of the programmable interval timer if we can not enable the `High Precision Event Timer`, and sets up the default timer IRQ for the enabled timer:

```C
void __init hpet_time_init(void)
{
	if (!hpet_enable())
		setup_pit_timer();
	setup_default_timer_irq();
}
```

First of all the `hpet_enable` function checks whether we can enable the `High Precision Event Timer` in the system by calling the `is_hpet_capable` function, and if we can, we map a virtual address space for it:

```C
int __init hpet_enable(void)
{
	if (!is_hpet_capable())
		return 0;

	hpet_set_mapping();
}
```

The `is_hpet_capable` function checks that we didn't pass `hpet=disable` on the kernel command line and that the `hpet_address` was received from the ACPI HPET table.
The `hpet_set_mapping` function just maps the virtual address space for the timer registers:

```C
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
```

As we can read in the IA-PC HPET (High Precision Event Timers) Specification:

> The timer register space is 1024 bytes

So, the `HPET_MMAP_SIZE` is `1024` bytes too:

```C
#define HPET_MMAP_SIZE		1024
```

After we have mapped the virtual space for the `High Precision Event Timer`, we read the `HPET_ID` register to get the number of timers:

```C
id = hpet_readl(HPET_ID);

last = (id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT;
```

We need to get this number to allocate the correct amount of space for the `General Configuration Register` of the `High Precision Event Timer`:

```C
cfg = hpet_readl(HPET_CFG);

hpet_boot_cfg = kmalloc((last + 2) * sizeof(*hpet_boot_cfg), GFP_KERNEL);
```

After the space is allocated for the configuration register of the `High Precision Event Timer`, we allow the main counter to run, and allow timer interrupts if they are enabled, by setting the `HPET_CFG_ENABLE` bit in the configuration register for all timers. In the end we just register the new clock source by the call of the `hpet_clocksource_register` function:

```C
if (hpet_clocksource_register())
	goto out_nohpet;
```

which just calls the already familiar

```C
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
```

function, where `clocksource_hpet` is the `clocksource` structure with the rating `250` (remember, the rating of the previous `refined_jiffies` clock source was `2`), the name `hpet` and the `read_hpet` callback for reading the atomic counter provided by the `High Precision Event Timer`:

```C
static struct clocksource clocksource_hpet = {
	.name		= "hpet",
	.rating		= 250,
	.read		= read_hpet,
	.mask		= HPET_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
	.resume		= hpet_resume_counter,
	.archdata	= { .vclock_mode = VCLOCK_HPET },
};
```

After the `clocksource_hpet` is registered, we can return to the `hpet_time_init()` function from the arch/x86/kernel/time.c source code file.
We can remember that the last step is the call of

```C
setup_default_timer_irq();
```

in the `hpet_time_init()`. The `setup_default_timer_irq` function checks the existence of `legacy` IRQs, or in other words support for the i8259, and sets up IRQ0 depending on this.

That's all. From this moment the High Precision Event Timer clock source is registered in the Linux kernel `clocksource` framework and may be used from generic kernel code via the `read_hpet`:

```C
static cycle_t read_hpet(struct clocksource *cs)
{
	return (cycle_t)hpet_readl(HPET_COUNTER);
}
```

function which just reads and returns the atomic counter from the `Main Counter Register`.

## ACPI PM timer

The second clock source is the ACPI Power Management Timer. The implementation of this clock source is located in the drivers/clocksource/acpi_pm.c source code file and starts from the call of the `init_acpi_pm_clocksource` function during an `fs` initcall.

If we look at the implementation of the `init_acpi_pm_clocksource` function, we will see that it starts from the check of the value of the `pmtmr_ioport` variable:

```C
static int __init init_acpi_pm_clocksource(void)
{
	...
	...
	...
	if (!pmtmr_ioport)
		return -ENODEV;
	...
	...
	...
```

This `pmtmr_ioport` variable contains the extended address of the `Power Management Timer Control Register Block`. It gets its value in the `acpi_parse_fadt` function which is defined in the arch/x86/kernel/acpi/boot.c source code file.
This function parses the `FADT` or `Fixed ACPI Description Table` ACPI table and tries to get the value of the `X_PM_TMR_BLK` field, which contains the extended address of the `Power Management Timer Control Register Block` represented in the `Generic Address Structure` format:

```C
static int __init acpi_parse_fadt(struct acpi_table_header *table)
{
#ifdef CONFIG_X86_PM_TIMER
	...
	...
	...
	pmtmr_ioport = acpi_gbl_FADT.xpm_timer_block.address;
	...
	...
	...
#endif
	return 0;
}
```

So, if the `CONFIG_X86_PM_TIMER` Linux kernel configuration option is disabled or something goes wrong in the `acpi_parse_fadt` function, we can't access the `Power Management Timer` register and we return from `init_acpi_pm_clocksource`. Otherwise, if the value of the `pmtmr_ioport` variable is not zero, we check the rate of this timer and register this clock source by the call of the

```C
clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
```

function. After the call of `clocksource_register_hz`, the `acpi_pm` clock source will be registered in the `clocksource` framework of the Linux kernel:

```C
static struct clocksource clocksource_acpi_pm = {
	.name		= "acpi_pm",
	.rating		= 200,
	.read		= acpi_pm_read,
	.mask		= (cycle_t)ACPI_PM_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
};
```

with the rating `200` and the `acpi_pm_read` callback to read the atomic counter provided by the `acpi_pm` clock source. The `acpi_pm_read` function just executes the `read_pmtmr` function:

```C
static cycle_t acpi_pm_read(struct clocksource *cs)
{
	return (cycle_t)read_pmtmr();
}
```

which reads the value of the `Power Management Timer` register. This register has the following structure:

```
+--------------------------------+----------------------------------+
|                                |                                  |
|  upper eight bits of a 32-bit  |       running count of the       |
|  power management timer        |      power management timer      |
|                                |                                  |
+--------------------------------+----------------------------------+
31          E_TMR_VAL          24              TMR_VAL              0
```

The address of this register is stored in the `Fixed ACPI Description Table` ACPI table and we already have it in the `pmtmr_ioport`.
So, the implementation of the `read_pmtmr` function is pretty easy:

```C
static inline u32 read_pmtmr(void)
{
	return inl(pmtmr_ioport) & ACPI_PM_MASK;
}
```

We just read the value of the `Power Management Timer` register and mask its lower `24` bits.

That's all. Now we move to the last clock source in this part - the `Time Stamp Counter`.

## Time Stamp Counter

The third and last clock source in this part is the Time Stamp Counter clock source, and its implementation is located in the arch/x86/kernel/tsc.c source code file. We already saw the `x86_late_time_init` function in this part, and the initialization of the Time Stamp Counter starts from this place. This function calls the `tsc_init()` function from the arch/x86/kernel/tsc.c source code file.

At the beginning of the `tsc_init` function we can see a check of whether the processor has support for the `Time Stamp Counter`:

```C
void __init tsc_init(void)
{
	u64 lpj;
	int cpu;

	if (!cpu_has_tsc) {
		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
		return;
	}
	...
	...
	...
```

The `cpu_has_tsc` macro expands to the call of the `cpu_has` macro:

```C
#define cpu_has_tsc		boot_cpu_has(X86_FEATURE_TSC)

#define boot_cpu_has(bit)	cpu_has(&boot_cpu_data, bit)

#define cpu_has(c, bit)							\
	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
	 test_cpu_cap(c, bit))
```

which checks the given bit (the `X86_FEATURE_TSC` in our case) in the `boot_cpu_data` array, which is filled during early Linux kernel initialization.
If the processor has support for the `Time Stamp Counter`, we get the frequency of the `Time Stamp Counter` by the call of the `calibrate_tsc` function from the same source code file, which tries to get the frequency from different sources, such as the Model Specific Register or calibration over the programmable interval timer. After this we initialize the frequency and scale factor for all processors in the system:

```C
tsc_khz = x86_platform.calibrate_tsc();
cpu_khz = tsc_khz;

for_each_possible_cpu(cpu) {
	cyc2ns_init(cpu);
	set_cyc2ns_scale(cpu_khz, cpu);
}
```

because only the first bootstrap processor will call `tsc_init`. After this we check that the `Time Stamp Counter` is not disabled:

```C
if (tsc_disabled > 0)
	return;
...
...
...
check_system_tsc_reliable();
```

and call the `check_system_tsc_reliable` function which sets `tsc_clocksource_reliable` if the bootstrap processor has the `X86_FEATURE_TSC_RELIABLE` feature. Note that we went through the `tsc_init` function, but did not register our clock source. The actual registration of the `Time Stamp Counter` clock source occurs in the:

```C
static int __init init_tsc_clocksource(void)
{
	if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
		return 0;
	...
	...
	...
	if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
		clocksource_register_khz(&clocksource_tsc, tsc_khz);
		return 0;
	}
```

function. This function is called during a `device` initcall.
We do it this way to be sure that the `Time Stamp Counter` clock source will be registered after the High Precision Event Timer clock source.

After all these three clock sources are registered in the `clocksource` framework, the `Time Stamp Counter` clock source will be selected as active, because it has the highest rating among the other clock sources:

```C
static struct clocksource clocksource_tsc = {
	.name		= "tsc",
	.rating		= 300,
	.read		= read_tsc,
	.mask		= CLOCKSOURCE_MASK(64),
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY,
	.archdata	= { .vclock_mode = VCLOCK_TSC },
};
```

That's all.

## Conclusion

This is the end of the sixth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clockevents` framework. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about three different clock sources which are used in the x86 architecture. The next part will be the last part of this chapter and we will see some user space related stuff, i.e. how some time related system calls are implemented in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.**

## Links

* x86
* sysfs
* Time Stamp Counter
* High Precision Event Timer
* ACPI Power Management Timer (PDF)
* frequency
* dmesg
* programmable interval timer
* IRQ
* IA-PC HPET (High Precision Event Timers) Specification
* IRQ0
* i8259
* initcall
* previous part

# Timers and time management in the Linux kernel. Part 7.

## Time related system calls in the Linux kernel

This is the seventh and last part of the chapter which describes timers and time management related stuff in the Linux kernel. In the previous part, we discussed timers in the context of x86_64: the High Precision Event Timer and the Time Stamp Counter.
Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs also need to know the time. In this part, we will consider the implementation of some time management related system calls. These system calls are:

* `clock_gettime`;
* `gettimeofday`;
* `nanosleep`.

We will start from a simple userspace C program and see the whole way from the call of the standard library function to the implementation of certain system calls. As each architecture provides its own implementation of certain system calls, we will consider only x86_64 specific implementations of system calls, as this book is related to this architecture. Additionally, we will not consider the concept of system calls in this part, but only the implementations of these three system calls in the Linux kernel. If you are interested in what a `system call` is, there is a special chapter about this.

So, let's start from the `gettimeofday` system call.

## Implementation of the `gettimeofday` system call

As we can understand from the name `gettimeofday`, this function returns the current time. First of all, let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	char buffer[40];
	struct timeval time;

	gettimeofday(&time, NULL);

	strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
	printf("%s\n", buffer);

	return 0;
}
```

As you can see, here we call the `gettimeofday` function, which takes two parameters. The first parameter is a pointer to the `timeval` structure, which represents an elapsed time:

```C
struct timeval {
	time_t		tv_sec;		/* seconds */
	suseconds_t	tv_usec;	/* microseconds */
};
```

The second parameter of the `gettimeofday` function is a pointer to the `timezone` structure which represents a timezone. In our example, we pass the address of the `timeval time` to the `gettimeofday` function, and the Linux kernel fills the given `timeval` structure and returns it back to us.
Additionally, we format the time with the `strftime` function to get something more human readable than elapsed microseconds. Let's see the result:

```
~$ gcc date.c -o date
~$ ./date
Current date/time: 03-26-2016/16:42:02
```

As you may already know, a userspace application does not call a system call directly from the kernel space. Before the actual system call entry is called, we call a function from the standard library. In my case it is glibc, so I will consider this case. The implementation of the `gettimeofday` function is located in the sysdeps/unix/sysv/linux/x86/gettimeofday.c source code file. As you may already know, the `gettimeofday` is not a usual system call. It is located in a special area which is called `vDSO` (you can read more about it in the part which describes this concept).

The `glibc` implementation of `gettimeofday` tries to resolve the given symbol; in our case this symbol is `__vdso_gettimeofday`, by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fall back to the call of the usual system call:

```C
return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
        ?: (void*) (&__gettimeofday_syscall));
```

The `gettimeofday` entry is located in the arch/x86/entry/vdso/vclock_gettime.c source code file.
As we can see, the `gettimeofday` is a weak alias of the `__vdso_gettimeofday`:

```C
int gettimeofday(struct timeval *, struct timezone *)
	__attribute__((weak, alias("__vdso_gettimeofday")));
```

The `__vdso_gettimeofday` is defined in the same source code file and calls the `do_realtime` function if the given `timeval` is not null:

```C
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
	if (likely(tv != NULL)) {
		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
			return vdso_fallback_gtod(tv, tz);
		tv->tv_usec /= 1000;
	}
	if (unlikely(tz != NULL)) {
		tz->tz_minuteswest = gtod->tz_minuteswest;
		tz->tz_dsttime = gtod->tz_dsttime;
	}

	return 0;
}
```

If the `do_realtime` fails, we fall back to the real system call via the `syscall` instruction, passing the `__NR_gettimeofday` system call number and the given `timeval` and `timezone`:

```C
notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
{
	long ret;

	asm("syscall" : "=a" (ret) :
	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
	return ret;
}
```

The `do_realtime` function gets the time data from the `vsyscall_gtod_data` structure which is defined in the arch/x86/include/asm/vgtod.h header file and contains a mapping of the `timespec` structure and a couple of fields which are related to the current clock source in the system.
This function fills the given `timeval` structure with values from the `vsyscall_gtod_data`, which contains time related data that is updated via timer interrupt.

First of all we try to access the `gtod` or `global time of day` (the `vsyscall_gtod_data` structure) via the call of `gtod_read_begin` and will continue to do so until it is successful:

```C
do {
	seq = gtod_read_begin(gtod);
	mode = gtod->vclock_mode;
	ts->tv_sec = gtod->wall_time_sec;
	ns = gtod->wall_time_snsec;
	ns += vgetsns(&mode);
	ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));

ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
```

As we get access to the `gtod`, we fill the `ts->tv_sec` with the `gtod->wall_time_sec`, which stores the current time in seconds obtained from the real time clock during initialization of the timekeeping subsystem in the Linux kernel, and the same value but in nanoseconds. In the end of this code we just fill the given `timespec` structure with the resulting values.

That's all about the `gettimeofday` system call.
The next system call in our list is the `clock_gettime`.

## Implementation of the clock_gettime system call

The `clock_gettime` function gets the time which is specified by the second parameter. Generally the `clock_gettime` function takes two parameters:

* `clk_id` - clock identifier;
* `timespec` - address of the `timespec` structure which represents elapsed time.

Let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	struct timespec elapsed_from_boot;

	clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);

	printf("%d - seconds elapsed from boot\n", elapsed_from_boot.tv_sec);

	return 0;
}
```

which prints `uptime` information:

```
~$ gcc uptime.c -o uptime
~$ ./uptime
14180 - seconds elapsed from boot
```

We can easily check the result with the help of the uptime util:

```
~$ uptime
... up 3:56 ...
```

The `elapsed_from_boot.tv_sec` represents the elapsed time in seconds, so:

```
>>> 14180 / 60
236
>>> 14180 / 60 / 60
3
>>> 14180 / 60 % 60
56
```

The `clock_id` may be one of the following:

* `CLOCK_REALTIME` - system wide clock which measures real or wall-clock time;
* `CLOCK_REALTIME_COARSE` - faster version of the `CLOCK_REALTIME`;
* `CLOCK_MONOTONIC` - represents monotonic time since some unspecified starting point;
* `CLOCK_MONOTONIC_COARSE` - faster version of the `CLOCK_MONOTONIC`;
* `CLOCK_MONOTONIC_RAW` - the same as the `CLOCK_MONOTONIC` but provides non NTP-adjusted time;
* `CLOCK_BOOTTIME` - the same as the `CLOCK_MONOTONIC` but plus time that the system was suspended;
* `CLOCK_PROCESS_CPUTIME_ID` - per-process time consumed by all threads in the process;
* `CLOCK_THREAD_CPUTIME_ID` - thread-specific clock.

The `clock_gettime` is not a usual syscall either; like the `gettimeofday`, this system call is placed in the `vDSO` area. The entry of this system call is located in the same source code file - arch/x86/entry/vdso/vclock_gettime.c - as for `gettimeofday`.

The implementation of the `clock_gettime` depends on the clock id.
If we have passed the `CLOCK_REALTIME` clock id, the `do_realtime` function will be called:

```C
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
	switch (clock) {
	case CLOCK_REALTIME:
		if (do_realtime(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	...
	...
	...
fallback:
	return vdso_fallback_gettime(clock, ts);
}
```

In other cases, the `do_{name_of_clock_id}` function is called. Implementations of some of them are similar. For example, if we pass the `CLOCK_MONOTONIC` clock id:

```C
...
...
...
case CLOCK_MONOTONIC:
	if (do_monotonic(ts) == VCLOCK_NONE)
		goto fallback;
	break;
...
...
...
```

the `do_monotonic` function will be called, which is very similar to the implementation of the `do_realtime`:

```C
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
	do {
		seq = gtod_read_begin(gtod);
		mode = gtod->vclock_mode;
		ts->tv_sec = gtod->monotonic_time_sec;
		ns = gtod->monotonic_time_snsec;
		ns += vgetsns(&mode);
		ns >>= gtod->shift;
	} while (unlikely(gtod_read_retry(gtod, seq)));

	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;

	return mode;
}
```

We already saw a little about the implementation of this function in the previous paragraph about the `gettimeofday`. There is only one difference here: the `sec` and `nsec` of our `timespec` value will be based on the `gtod->monotonic_time_sec` instead of the `gtod->wall_time_sec`, which maps the value of the `tk->tkr_mono.xtime_nsec`, or the number of nanoseconds elapsed.

That's all.

## Implementation of the nanosleep system call

The last system call in our list is the `nanosleep`. As you can understand from its name, this function provides `sleeping` ability.
Let's look at the following simple example:

```C
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main (void)
{
	struct timespec ts = {5,0};

	printf("sleep five seconds\n");
	nanosleep(&ts, NULL);
	printf("end of sleep\n");

	return 0;
}
```

If we compile and run it, we will see the first line

```
~$ gcc sleep_test.c -o sleep
~$ ./sleep
sleep five seconds
end of sleep
```

and the second line after five seconds.

The `nanosleep` is not located in the `vDSO` area like the `gettimeofday` and the `clock_gettime` functions. So, let's look how the `real` system call which is located in the kernel space will be called by the standard library. The implementation of the `nanosleep` system call will be called with the help of the syscall instruction. Before the execution of the `syscall` instruction, the parameters of the system call must be put in processor registers according to the order which is described in the System V Application Binary Interface, or in other words:

* `rdi` - first parameter;
* `rsi` - second parameter;
* `rdx` - third parameter;
* `r10` - fourth parameter;
* `r8` - fifth parameter;
* `r9` - sixth parameter.

The `nanosleep` system call has two parameters - two pointers to `timespec` structures. The system call suspends the calling thread until the given timeout has elapsed. Additionally it will finish if a signal interrupts its execution. The first parameter is the `timespec` which represents the timeout for the sleep. The second parameter is a pointer to a `timespec` structure too, and it contains the remainder of time if the call of the `nanosleep` was interrupted.

As `nanosleep` has two parameters:

```C
int nanosleep(const struct timespec *req, struct timespec *rem);
```

to call the system call, we need to put the `req` in the `rdi` register and the `rem` parameter in the `rsi` register. The glibc does this job in the `INTERNAL_SYSCALL` macro which is located in the sysdeps/unix/sysv/linux/x86_64/sysdep.h header file.
The `INTERNAL_SYSCALL` macro:

```C
# define INTERNAL_SYSCALL(name, err, nr, args...)			\
  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)
```

takes the name of the system call, a storage for a possible error during execution of the system call, the number of the system call (all `x86_64` system calls you can find in the system calls table) and the arguments of the certain system call. The `INTERNAL_SYSCALL` macro just expands to the call of the `INTERNAL_SYSCALL_NCS` macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the `syscall` instruction and returns the result:

```C
# define INTERNAL_SYSCALL_NCS(name, err, nr, args...)			\
  ({									\
    unsigned long int resultvar;					\
    LOAD_ARGS_##nr (args)						\
    LOAD_REGS_##nr							\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (name) ASM_ARGS_##nr : "memory",				\
      REGISTERS_CLOBBERED_BY_SYSCALL);					\
    (long int) resultvar; })
```

The `LOAD_ARGS_##nr` macro calls the `LOAD_ARGS_N` macro where `N` is the number of arguments of the system call. In our case, it will be the `LOAD_ARGS_2` macro. Ultimately all of these macros will be expanded to the following:

```C
# define LOAD_REGS_TYPES_1(t1, a1)					\
  register t1 _a1 asm ("rdi") = __arg1;					\
  LOAD_REGS_0

# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)				\
  register t2 _a2 asm ("rsi") = __arg2;					\
  LOAD_REGS_TYPES_1(t1, a1)
...
...
...
```

After the `syscall` instruction is executed, a context switch will occur and the kernel will transfer execution to the system call handler. The system call handler for the `nanosleep` system call is located in the kernel/time/hrtimer.c source code file and is defined with the `SYSCALL_DEFINE2` macro helper:

```C
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
		struct timespec __user *, rmtp)
{
	struct timespec tu;

	if (copy_from_user(&tu, rqtp, sizeof(tu)))
		return -EFAULT;

	if (!timespec_valid(&tu))
		return -EINVAL;

	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
```

More about the `SYSCALL_DEFINE2` macro you may read in the chapter about system calls.
If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from the userspace to the kernelspace. In our case we copy the timeout value to sleep into the kernelspace `timespec` structure and check that the given `timespec` is valid by the call of the `timespec_valid` function:

```C
static inline bool timespec_valid(const struct timespec *ts)
{
	if (ts->tv_sec < 0)
		return false;
	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
		return false;
	return true;
}
```

which just checks that the given `timespec` does not represent a date before `1970` and that the nanoseconds do not overflow `1` second. The `nanosleep` handler ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a timer and calls the `do_nanosleep` function. The `do_nanosleep` does the main job for us. This function provides a loop:

```C
do {
	set_current_state(TASK_INTERRUPTIBLE);
	hrtimer_start_expires(&t->timer, mode);

	if (likely(t->task))
		freezable_schedule();
} while (t->task && !signal_pending(current));

__set_current_state(TASK_RUNNING);

return t->task == NULL;
```

which freezes the current task during sleep. After we set the `TASK_INTERRUPTIBLE` flag for the current task, the `hrtimer_start_expires` function starts the given high-resolution timer on the current processor. When the given high resolution timer expires, the task will be running again.

That's all.

## Conclusion

This is the end of the seventh part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw x86_64 specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. We saw important time management related concepts like the `clocksource` and `clockevents` frameworks, the `jiffies` counter and so on, in this chapter. Of course this does not cover all of the time management in the Linux kernel.
Many parts of this mostly related to the schedulingwhich we will see in other chapter.If you have questions or suggestions, feel free to ping me in twitter 0xAX, drop me email orjust create issue.Please note that English is not my first language and I am really sorry for anyinconvenience. If you found any mistakes please send me PR to linux-insides.Linkssystem callC programming languagestandard libraryglibcreal time clock542Time related system callsNTPnanosecondsregisterSystem V Application Binary Interfacecontext switchIntroduction to timers in the Linux kerneluptimesystem calls table for x86_64High Precision Event TimerTime Stamp Counterx86_64previous part543Synchronization primitivesSynchronization primitives in the Linuxkernel.This chapter describes synchronization primitives in the Linux kernel.Introduction to spinlocks - the first part of this chapter describes implementation ofspinlock mechanism in the Linux kernel.Queued spinlocks - the second part describes another type of spinlocks - queuedspinlocks.Semaphores - this part describes implementation ofsemaphoresynchronizationprimitive in the Linux kernel.Mutual exclusion - this part describes -mutexin the Linux kernel.Reader/Writer semaphores - this part describes special type of semaphores reader/writersemaphores.Sequential locks - this part describes sequential locks in the Linux kernel.544Introduction to spinlocksSynchronization primitives in the Linuxkernel. Part 1.IntroductionThis part opens a new chapter in the linux-insides book. Timers and time managementrelated stuff was described in the previous chapter. Now time to go next. As you mayunderstand from the part's title, this chapter will describe synchronization primitives in theLinux kernel.As always, before we will consider something synchronization related, we will try to knowwhat issynchronization primitivein general. 
Actually, a synchronization primitive is a software mechanism which provides the ability for two or more parallel processes or threads to not execute simultaneously on the same segment of code. For example, let's look at the following piece of code:

```C
mutex_lock(&clocksource_mutex);
...
...
...
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
...
...
...
mutex_unlock(&clocksource_mutex);
```

from the kernel/time/clocksource.c source code file. This code is from the `__clocksource_register_scale` function which adds the given clocksource to the clock sources list. This function produces different operations on a list of registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list of registered clocksources - `clocksource_list`. Note that these lines of code are wrapped in two functions: `mutex_lock` and `mutex_unlock`, which take one parameter - the `clocksource_mutex` in our case.

These functions represent locking and unlocking based on the `mutex` synchronization primitive. As `mutex_lock` is executed, it allows us to prevent the situation when two or more threads execute this code while `mutex_unlock` has not yet been executed by the process-owner of the mutex. In other words, we prevent parallel operations on the `clocksource_list`. Why do we need a `mutex` here? What if two parallel processes try to register a clock source?
As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after the clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system):

```C
static void clocksource_enqueue(struct clocksource *cs)
{
	struct list_head *entry = &clocksource_list;
	struct clocksource *tmp;

	list_for_each_entry(tmp, &clocksource_list, list)
		if (tmp->rating >= cs->rating)
			entry = &tmp->list;
	list_add(&cs->list, entry);
}
```

If two parallel processes try to do this simultaneously, both processes may find the same `entry` and a race condition may occur; in other words, the second process which executes `list_add` will overwrite the clock source from the first thread.

Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we go through the previous chapter or other chapters again, or if we look at the Linux kernel source code in general, we will meet many places like this. We will not consider how a `mutex` is implemented in the Linux kernel here. Actually, the Linux kernel provides a set of different synchronization primitives like:

- `mutex`;
- `semaphores`;
- `seqlocks`;
- `atomic operations`;
- etc.

We will start this chapter with the `spinlock`.

## Spinlocks in the Linux kernel.

The `spinlock` is a low-level synchronization mechanism which, in simple words, represents a variable which can be in two states:

- `acquired`;
- `released`.

Each process which wants to acquire a `spinlock` must write the value which represents the `spinlock acquired` state to this variable, and write the `spinlock released` state to it when releasing the lock. If a process tries to execute code which is protected by a `spinlock`, it will spin until the process which holds this lock releases it. All related operations must be atomic to prevent race conditions. The `spinlock` is represented by the `spinlock_t` type in the Linux kernel. If we look at the Linux kernel code, we will see that this type is widely used.
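Returning to the race described above: it can be reproduced and avoided in userspace too. The sketch below is only an illustrative analogue, not kernel code - the `clock_entry` type and the `clock_register`/`clock_list_length`/`clock_top_rating` helpers are invented names. It keeps a singly linked list sorted by descending rating, with a pthread mutex playing the role that `clocksource_mutex` plays in the kernel:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace analogue of the clocksource list: a singly
 * linked list kept sorted by descending rating. */
struct clock_entry {
    int rating;
    struct clock_entry *next;
};

static struct clock_entry *clock_list;
static pthread_mutex_t clock_list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Insert before the first entry with a lower rating, in the spirit of
 * clocksource_enqueue(). Without the mutex, two concurrent callers
 * could pick the same insertion point and one node would be lost. */
void clock_register(struct clock_entry *cs)
{
    pthread_mutex_lock(&clock_list_mutex);
    struct clock_entry **link = &clock_list;
    while (*link && (*link)->rating >= cs->rating)
        link = &(*link)->next;
    cs->next = *link;
    *link = cs;
    pthread_mutex_unlock(&clock_list_mutex);
}

int clock_list_length(void)
{
    int n = 0;
    for (struct clock_entry *e = clock_list; e; e = e->next)
        n++;
    return n;
}

int clock_top_rating(void)
{
    return clock_list ? clock_list->rating : -1;
}
```

Registering ratings 100, 300 and 200 in that order leaves the list as 300, 200, 100: the mutex makes each insertion see a consistent list.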
The `spinlock_t` is defined as:

```C
typedef struct spinlock {
        union {
              struct raw_spinlock rlock;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
                struct {
                        u8 __padding[LOCK_PADSIZE];
                        struct lockdep_map dep_map;
                };
#endif
        };
} spinlock_t;
```

and is located in the include/linux/spinlock_types.h header file. We may see that its implementation depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option. We will skip this for now, because all debugging related stuff will be at the end of this part. So, if the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option is disabled, the `spinlock_t` contains a union with one field - a `raw_spinlock`:

```C
typedef struct spinlock {
        union {
              struct raw_spinlock rlock;
        };
} spinlock_t;
```

The `raw_spinlock` structure is defined in the same header file and represents the implementation of a `normal` spinlock. Let's look at how the `raw_spinlock` structure is defined:

```C
typedef struct raw_spinlock {
        arch_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
        unsigned int break_lock;
#endif
} raw_spinlock_t;
```

where the `arch_spinlock_t` represents the architecture-specific `spinlock` implementation and the `break_lock` field holds the value `1` when, on SMP systems, one processor starts to wait while the lock is held on another processor. This helps prevent excessively long locking. As we consider the x86_64 architecture in this book, the `arch_spinlock_t` is defined in the arch/x86/include/asm/spinlock_types.h header file and looks like:

```C
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm-generic/qspinlock_types.h>
#else
typedef struct arch_spinlock {
        union {
                __ticketpair_t head_tail;
                struct __raw_tickets {
                        __ticket_t head, tail;
                } tickets;
        };
} arch_spinlock_t;
#endif
```

As we may see, the definition of the `arch_spinlock` structure depends on the value of the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option. This configuration option enables support for `spinlocks` with a queue in the Linux kernel.
This special type of `spinlocks`, instead of an `acquired`/`released` value, uses `atomic` operations on a `queue`. If the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled, the `arch_spinlock_t` will be represented by the following structure:

```C
typedef struct qspinlock {
	atomic_t	val;
} arch_spinlock_t;
```

from the include/asm-generic/qspinlock_types.h header file.

We will not stop on these structures for now. Before we consider both the `arch_spinlock` and the `qspinlock`, let's look at the operations on a spinlock. The Linux kernel provides the following main operations on a `spinlock`:

- `spin_lock_init` - produces initialization of the given `spinlock`;
- `spin_lock` - acquires the given `spinlock`;
- `spin_lock_bh` - disables software interrupts and acquires the given `spinlock`;
- `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
- `spin_unlock` - releases the given `spinlock`;
- `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
- `spin_is_locked` - returns the state of the given `spinlock`;
- and etc.

Let's look at the implementation of the `spin_lock_init` macro. As I already wrote, this and other macros are defined in the include/linux/spinlock.h header file and the `spin_lock_init` macro looks like:

```C
#define spin_lock_init(_lock)			\
do {						\
	spinlock_check(_lock);			\
	raw_spin_lock_init(&(_lock)->rlock);	\
} while (0)
```

As we may see, the `spin_lock_init` macro takes a `spinlock` and executes two operations: check the given `spinlock` and execute the `raw_spin_lock_init`. The implementation of the `spinlock_check` is pretty easy - this function just returns the `raw_spinlock_t` of the given `spinlock`, to be sure that we got exactly a `normal` raw spinlock:

```C
static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
	return &lock->rlock;
}
```

The `raw_spin_lock_init` macro:

```C
# define raw_spin_lock_init(lock)			\
do {							\
	*(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock);	\
} while (0)
```

assigns the value of the `__RAW_SPIN_LOCK_UNLOCKED` macro with the given `spinlock` to the given `raw_spinlock_t`.
As we may understand from the name of the `__RAW_SPIN_LOCK_UNLOCKED` macro, this macro does initialization of the given `spinlock` and sets it to the `released` state. This macro is defined in the include/linux/spinlock_types.h header file and expands to the following macros:

```C
#define __RAW_SPIN_LOCK_UNLOCKED(lockname)	\
	(raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)

#define __RAW_SPIN_LOCK_INITIALIZER(lockname)		\
	{						\
		.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
		SPIN_DEBUG_INIT(lockname)		\
		SPIN_DEP_MAP_INIT(lockname)		\
	}
```

As I already wrote above, we will not consider stuff which is related to the debugging of synchronization primitives. In this case we will not consider the `SPIN_DEBUG_INIT` and the `SPIN_DEP_MAP_INIT` macros. So the `__RAW_SPIN_LOCK_UNLOCKED` macro will be expanded to:

```C
*(&(_lock)->rlock) = __ARCH_SPIN_LOCK_UNLOCKED;
```

where the `__ARCH_SPIN_LOCK_UNLOCKED` is:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
```

and:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
```

for the x86_64 architecture, if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled. So, after the expansion of the `spin_lock_init` macro, a given `spinlock` will be initialized and its state will be `unlocked`.

From this moment we know how to initialize a `spinlock`. Now let's consider the API which the Linux kernel provides for the manipulation of `spinlocks`. The first is:

```C
static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}
```

a function which allows us to `acquire` a spinlock.
The `raw_spin_lock` macro is defined in the same header file and expands to the call of the `_raw_spin_lock` function:

```C
#define raw_spin_lock(lock)	_raw_spin_lock(lock)
```

As we may see in the include/linux/spinlock.h header file, the definition of the `_raw_spin_lock` macro depends on the `CONFIG_SMP` kernel configuration parameter:

```C
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
# include <asm/spinlock.h>
#else
# include <linux/spinlock_up.h>
#endif
```

So, if SMP is enabled in the Linux kernel, the `_raw_spin_lock` macro is defined in the arch/x86/include/asm/spinlock.h header file and looks like:

```C
#define _raw_spin_lock(lock) __raw_spin_lock(lock)
```

The `__raw_spin_lock` function looks like:

```C
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
```

As you may see, first of all we disable preemption by the call of the `preempt_disable` macro from include/linux/preempt.h (more about this you may read in the ninth part of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again:

```C
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	...
	...
	...
	preempt_enable();
}
```

We need to do this because, while a process is spinning on a lock, other processes must be prevented from preempting the process which acquired the lock.
The `spin_acquire` macro, through a chain of other macros:

```C
#define spin_acquire(l, s, t, i)                lock_acquire_exclusive(l, s, t, NULL, i)
#define lock_acquire_exclusive(l, s, t, n, i)   lock_acquire(l, s, t, 0, 1, n, i)
```

expands to the call of the `lock_acquire` function:

```C
void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
                  int trylock, int read, int check,
                  struct lockdep_map *nest_lock, unsigned long ip)
{
	unsigned long flags;

	if (unlikely(current->lockdep_recursion))
		return;

	raw_local_irq_save(flags);
	check_flags(flags);

	current->lockdep_recursion = 1;
	trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
	__lock_acquire(lock, subclass, trylock, read, check,
		       irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
	current->lockdep_recursion = 0;
	raw_local_irq_restore(flags);
}
```

As I wrote above, we will not consider stuff here which is related to debugging or tracing. The main point of the `lock_acquire` function is to disable hardware interrupts by the call of the `raw_local_irq_save` macro, because the given spinlock might be acquired with hardware interrupts enabled. In this way the process will not be preempted. Note that at the end of the `lock_acquire` function we enable hardware interrupts again with the help of the `raw_local_irq_restore` macro. As you may already guess, the main work is done in the `__lock_acquire` function which is defined in the kernel/locking/lockdep.c source code file.

The `__lock_acquire` function looks big. We will try to understand what this function does, but not in this part. Actually this function is mostly related to the Linux kernel lock validator, and it is not the topic of this part.
If we return to the definition of the `__raw_spin_lock` function, we will see that it contains the following definition at the end:

```C
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
```

The `LOCK_CONTENDED` macro is defined in the include/linux/lockdep.h header file and just calls the given function with the given `spinlock`:

```C
#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)
```

In our case, the `lock` is the `do_raw_spin_lock` function from the include/linux/spinlock.h header file and the `_lock` is the given `raw_spinlock_t`:

```C
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
}
```

The `__acquire` here is just a sparse related macro and we are not interested in it at this moment. The location of the definition of the `arch_spin_lock` function depends on two things: the first is the architecture of the system and the second is whether we use `queued spinlocks` or not. In our case we consider only the `x86_64` architecture, so the definition of the `arch_spin_lock` is represented as a macro from the include/asm-generic/qspinlock.h header file:

```C
#define arch_spin_lock(l)	queued_spin_lock(l)
```

if we are using `queued spinlocks`. In the other case, the `arch_spin_lock` function is defined in the arch/x86/include/asm/spinlock.h header file. For now we will consider only the `normal spinlock`; information related to `queued spinlocks` we will see later. Let's look again at the definition of the `arch_spinlock` structure, to understand the implementation of the `arch_spin_lock` function:

```C
typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;
		struct __raw_tickets {
			__ticket_t head, tail;
		} tickets;
	};
} arch_spinlock_t;
```

This variant of `spinlock` is called a `ticket spinlock`. As we may see, it consists of two parts. Every time a process wants to hold the `spinlock`, it increments the `tail` by one. If the `tail` is not equal to the `head`, the process will spin until the values of these variables become equal.
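The ticket scheme can be sketched in userspace with C11 atomics, where `atomic_fetch_add` plays the role of the kernel's `xadd` instruction. This is a simplified model under invented names (`ticket_lock`, `ticket_lock_acquire`, `ticket_lock_release`), without the kernel's slowpath and paravirt handling:

```c
#include <stdatomic.h>

/* A toy ticket lock: "tail" is the next ticket to hand out,
 * "head" is the ticket currently being served. */
struct ticket_lock {
    atomic_uint head;
    atomic_uint tail;
};

void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket; this is the xadd step: fetch the old tail
     * and increment it atomically. */
    unsigned int me = atomic_fetch_add(&l->tail, 1);

    /* Spin until our ticket is being served. */
    while (atomic_load(&l->head) != me)
        ; /* the kernel would execute cpu_relax() here */
}

void ticket_lock_release(struct ticket_lock *l)
{
    /* Serve the next ticket, like arch_spin_unlock() bumping head. */
    atomic_fetch_add(&l->head, 1);
}
```

Because tickets are handed out in arrival order, waiters acquire the lock first-come first-served, which is exactly the fairness property the text attributes to ticket spinlocks.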
Let's look at the implementation of the `arch_spin_lock` function:

```C
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };

	inc = xadd(&lock->tickets, inc);
	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			inc.head = READ_ONCE(lock->tickets.head);
			if (__tickets_equal(inc.head, inc.tail))
				goto clear_slowpath;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
clear_slowpath:
	__ticket_check_and_clear_slowpath(lock, inc.head);
out:
	barrier();
}
```

At the beginning of the `arch_spin_lock` function we can see the initialization of the `__raw_tickets` structure with `tail` set to `1`:

```C
#define __TICKET_LOCK_INC	1
```

In the next line we execute the `xadd` operation on `inc` and `lock->tickets`. After this operation `inc` stores the value of the `tickets` of the given `lock`, and `tickets.tail` is increased by `inc`, i.e. by `1`. The `tail` value was increased by `1`, which means that one process started to try to hold the lock. In the next step we check whether `head` and `tail` have the same value. If they are equal, this means that nobody holds the lock and we go to the `out` label. At the end of the `arch_spin_lock` function we may see the `barrier` macro which represents a barrier instruction which guarantees that the compiler will not change the order of operations that access memory (more about memory barriers you can read in the kernel documentation).

If one process holds the lock and a second process starts to execute the `arch_spin_lock` function, the `head` will not be equal to the `tail`, because the `tail` will be greater than the `head` by `1`. So the process will loop, comparing the `head` and the `tail` values at each iteration. If these values are not equal, `cpu_relax`, which is just a NOP instruction, will be called:

```C
#define cpu_relax()	asm volatile("rep; nop")
```

and the next iteration of the loop will be started.
If these values become equal, this means that the process which held the lock has released it and the next process may acquire it.

The `spin_unlock` operation goes through all the same macros/functions as `spin_lock`, of course with the `unlock` prefix. In the end the `arch_spin_unlock` function is called. If we look at the implementation of the `arch_spin_unlock` function, we will see that it increases the `head` of the `lock tickets`:

```C
__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
```

In the combination of `spin_lock` and `spin_unlock` we get a kind of queue where the `head` contains an index number which maps to the currently executing process which holds the lock, and the `tail` contains an index number which maps to the last process which tried to hold the lock:

```
          +-------+       +-------+
          |       |       |       |
   head   |   7   | - - - |   7   |
          |       |       |       |
          +-------+       +-------+
                              |
                          +-------+
                          |       |
                          |   8   |
                          |       |
                          +-------+
                              |
                          +-------+
                          |       |
                   tail   |   9   |
                          |       |
                          +-------+
```

That's all for now. We didn't cover the `spinlock` API in full in this part, but I think the main idea behind this concept must be clear now.

## Conclusion

This concludes the first part covering synchronization primitives in the Linux kernel. In this part, we met the first synchronization primitive provided by the Linux kernel - the `spinlock`. In the next part we will continue to dive into this interesting theme and will see other synchronization related stuff.

If you have questions or suggestions, feel free to ping me on twitter: 0xAX, drop me an email or just create an issue. Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.

## Links

- Concurrent computing
- Synchronization
- Clocksource framework
- Mutex
- Race condition
- Atomic operations
- SMP
- x86_64
- Interrupts
- Preemption
- Linux kernel lock validator
- Sparse
- xadd instruction
- NOP
- Memory barriers
- Previous chapter

# Synchronization primitives in the Linux kernel.
## Part 2. Queued Spinlocks

This is the second part of the chapter which describes synchronization primitives in the Linux kernel. In the first part of this chapter we met the first one - the spinlock. We will continue to learn about this synchronization primitive here. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what this concept represents.

We saw the API of `spinlock` in the previous part:

- `spin_lock_init` - produces initialization of the given `spinlock`;
- `spin_lock` - acquires the given `spinlock`;
- `spin_lock_bh` - disables software interrupts and acquires the given `spinlock`;
- `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
- `spin_unlock` - releases the given `spinlock`;
- `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
- `spin_is_locked` - returns the state of the given `spinlock`;
- and etc.

And we know that all of these macros, which are defined in the include/linux/spinlock.h header file, will be expanded to the call of functions with the `arch_spin_.*` prefix from arch/x86/include/asm/spinlock.h for the x86_64 architecture. If we look at this header file with attention, we will see that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock` and etc.) are defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:

```C
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
#else
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	...
	...
	...
}
...
...
...
#endif
```

This means that the arch/x86/include/asm/qspinlock.h header file provides its own implementation of these functions. Actually they are macros and they are located in another header file. This header file is include/asm-generic/qspinlock.h. If we look into this
If we will look into thisheader file, we will find definition of these macros:#define arch_spin_is_locked(l)queued_spin_is_locked(l)#define arch_spin_is_contended(l)queued_spin_is_contended(l)#define arch_spin_value_unlocked(l)queued_spin_value_unlocked(l)#define arch_spin_lock(l)queued_spin_lock(l)#define arch_spin_trylock(l)queued_spin_trylock(l)#define arch_spin_unlock(l)queued_spin_unlock(l)#define arch_spin_lock_flags(l, f)queued_spin_lock(l)#define arch_spin_unlock_wait(l)queued_spin_unlock_wait(l)Before we will consider how queued spinlocks and their API are implemented, we take alook on theoretical part at first.Introduction to queued spinlocksQueued spinlocks is a locking mechanism in the Linux kernel which is replacement for thestandardspinlocks. At least this is true for the x86_64 architecture. If we will look at thefollowing kernel configuration file - kernel/Kconfig.locks, we will see following configurationentries:config ARCH_USE_QUEUED_SPINLOCKSboolconfig QUEUED_SPINLOCKSdef_bool y if ARCH_USE_QUEUED_SPINLOCKSdepends on SMP558Queued spinlocksThis means that thedefault if theCONFIG_QUEUED_SPINLOCKSARCH_USE_QUEUED_SPINLOCKSARCH_USE_QUEUED_SPINLOCKSkernel configuration option will be enabled byis enabled. We may see that theis enabled by default in thex86_64specific kernel configurationfile - arch/x86/Kconfig:config X86.........select ARCH_USE_QUEUED_SPINLOCKS.........Before we will start to consider what is it queued spinlock concept, let's look on other typesofspinlocks. For the start let's consider howimplementation ofnormalnormalspinlocks is implemented. Usually,spinlock is based on the test and set instruction. Principle of workof this instruction is pretty simple. This instruction writes a value to the memory location andreturns old value from this memory location. Both of these operations are in atomic contexti.e. this instruction is non-interruptible. 
So if the first thread starts to execute this instruction, the second thread will wait until the first processor finishes. A basic lock can be built on top of this mechanism. Schematically it may look like this:

```C
int lock(lock)
{
	while (test_and_set(lock) == 1)
		;
	return 0;
}

int unlock(lock)
{
	lock = 0;
	return lock;
}
```

The first thread will execute `test_and_set`, which will set the `lock` to `1`. When the second thread calls the `lock` function, it will spin in the `while` loop until the first thread calls the `unlock` function and the `lock` becomes equal to `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair: a thread on one processor may have a long waiting time, even if it called `lock` before other threads which are also waiting for the free lock. The second problem is that all threads which want to acquire the lock must execute many `atomic` operations like `test_and_set` on a variable which is in shared memory. This leads to cache invalidation, as the cache of the processor will store `lock=1` while the value of the `lock` in memory may no longer be `1` after a thread releases the lock.

In the previous part we saw the second type of spinlock implementation - the `ticket spinlock`. This approach solves the first problem and can guarantee the order in which threads acquire the lock, but it still has the second problem.

The topic of this part is `queued spinlocks`, and this approach may help to solve both of these problems. `Queued spinlocks` allow each processor to use its own memory location to spin on. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the MCS lock.
Before we look at the implementation of `queued spinlocks` in the Linux kernel, we will try to understand what the `MCS` lock is.

The basic idea of the `MCS` lock is, as I already wrote in the previous paragraph, that a thread spins on a local variable and each processor in the system has its own copy of this variable. In other words, this concept is built on top of the per-cpu variables concept in the Linux kernel.

When the first thread wants to acquire the lock, it registers itself in the `queue` - in other words it is added to the special `queue` - and acquires the lock, because it is free for now. When the second thread wants to acquire the same lock before the first thread releases it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread. From this moment, the second thread waits until the first thread releases its lock and notifies the `next` thread about this event. The first thread is then deleted from the `queue` and the second thread becomes the owner of the lock.

Schematically we can represent it like this.

Empty queue:

```
+---------+
|         |
|  Queue  |
|         |
+---------+
```

First thread tries to acquire a lock:

```
+---------+     +----------------------------+
|         |     |                            |
|  Queue  |---->| First thread acquired lock |
|         |     |                            |
+---------+     +----------------------------+
```

Second thread tries to acquire a lock:

```
+---------+     +--------------------------------------+     +----------------------------+
|         |     |                                      |     |                            |
|  Queue  |---->| Second thread waits for first thread |<----| First thread acquired lock |
|         |     |                                      |     |                            |
+---------+     +--------------------------------------+     +----------------------------+
```

Now that we know the data structures which represent a queued spinlock in the Linux kernel, it is time to look at the implementation of the `main` function from the `queued spinlocks` API:

```C
#define arch_spin_lock(l)		queued_spin_lock(l)
```

Yes, this function is `queued_spin_lock`. As we may understand from the function's name, it allows a thread to acquire a lock.
This function is defined in the include/asm-generic/qspinlock.h header file and its implementation looks like:

```C
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	u32 val;

	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
	if (likely(val == 0))
		return;
	queued_spin_lock_slowpath(lock, val);
}
```

Looks pretty easy, except for the `queued_spin_lock_slowpath` function. We may see that `queued_spin_lock` takes only one parameter. In our case this parameter represents the `queued spinlock` which is to be locked. Let's consider the situation when the `queue` of the lock is empty for now and the first thread wants to acquire the lock. As we may see, the `queued_spin_lock` function starts with the call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes an atomic CMPXCHG instruction which compares the value of the second parameter (zero in our case) with the value of the first parameter (the current state of the given spinlock) and, if they are identical, stores the value of `_Q_LOCKED_VAL` in the memory location which is pointed to by `&lock->val` and returns the initial value from this memory location.

The `atomic_cmpxchg_acquire` macro is defined in the include/linux/atomic.h header file and expands to the call of the `atomic_cmpxchg` function:

```C
#define  atomic_cmpxchg_acquire         atomic_cmpxchg
```

which is architecture specific. We consider the x86_64 architecture, so in our case this header file will be arch/x86/include/asm/atomic.h and the implementation of the `atomic_cmpxchg` function just returns the result of the `cmpxchg` macro:

```C
static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
{
	return cmpxchg(&v->counter, old, new);
}
```

This macro is defined in the arch/x86/include/asm/cmpxchg.h header file and looks like:

```C
#define cmpxchg(ptr, old, new) \
	__cmpxchg(ptr, old, new, sizeof(*(ptr)))

#define __cmpxchg(ptr, old, new, size) \
	__raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
```

As we may see, the `cmpxchg` macro expands to the `__cmpxchg` macro with almost the same set of parameters.
The new additional parameter is the size of the atomic value. The `__cmpxchg` macro adds the `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro, where the `LOCK_PREFIX` is just the LOCK instruction. After all, the `__raw_cmpxchg` does all the job for us:

```C
#define __raw_cmpxchg(ptr, old, new, size, lock)                \
({
    ...
    ...
    ...
    volatile u32 *__ptr = (volatile u32 *)(ptr);                \
    asm volatile(lock "cmpxchgl %2,%1"                          \
                 : "=a" (__ret), "+m" (*__ptr)                  \
                 : "r" (__new), "0" (__old)                     \
                 : "memory");                                   \
    ...
    ...
    ...
})
```

After the `atomic_cmpxchg_acquire` macro is executed, it returns the previous value of the memory location. So far only one thread has tried to acquire the lock, so `val` will be zero and we will return from the `queued_spin_lock` function:

```C
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
	return;
```

From this moment, our first thread holds the lock. Notice that this behavior differs from the behavior described for the `MCS` algorithm: the thread acquired the lock, but we didn't add it to the `queue`. As I already wrote, the implementation of the `queued spinlocks` concept in the Linux kernel is based on the `MCS` algorithm, but at the same time it has some differences like this for optimization purposes.

So the first thread has acquired the lock, and now let's consider the case when the second thread tries to acquire the same lock. The second thread will start from the same `queued_spin_lock` function, but the `lock->val` will contain `1` or `_Q_LOCKED_VAL`, because the first thread already holds the lock. So in this case the `queued_spin_lock_slowpath` function will be called. The `queued_spin_lock_slowpath` function is defined in the kernel/locking/qspinlock.c source code file and starts with the following checks:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	if (pv_enabled())
		goto queue;

	if (virt_spin_lock(lock))
		return;

	...
	...
	...
}
```

which check the state of the `pvqspinlock`. The `pvqspinlock` is a `queued spinlock` in a paravirtualized environment. As this chapter is related only to synchronization primitives in
As this chapter is related only to synchronization primitives inthe Linux kernel, we skip these and other parts which are not directly related to the topic ofthis chapter. After these checks we compare our value which represents lock with the valueof the_Q_PENDING_VALmacro and do nothing while this is true:if (val == _Q_PENDING_VAL) {while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)cpu_relax();}wherecpu_relaxis just NOP instruction. Above, we saw that the lock contains -pendingbit. This bit represents thread which wanted to acquire lock, but it is already acquired by theother thread and in the same timeand thequeuequeueis empty. In this case, thependingbit will be setwill not be touched. This is done for optimization, because there are no needin unnecessary latency which will be caused by the cache invalidation in a touching of ownmcs_spinlockarray.At the next step we enter into the following loop:567Queued spinlocksfor (;;) {if (val & ~_Q_LOCKED_MASK)goto queue;new = _Q_LOCKED_VAL;if (val == new)new |= _Q_PENDING_VAL;old = atomic_cmpxchg_acquire(&lock->val, val, new);if (old == val)break;val = old;}The firstifclause here checks that state of the lock (val) is in locked or pending state.This means that first thread already acquired lock, second thread tried to acquire lock too,but now it is in pending state. In this case we need to start to build queue. We will considerthis situation little later. In our case we are first thread holds lock and the second thread triesto do it too. After this check we create new lock in a locked state and compare it with thestate of the previous lock. As you remember, thewhich after the second thread will call the1. Bothnewandvalvalcontains state of theatomic_cmpxchg_acquire&lock->valmacro will be equal tovalues are equal so we set pending bit in the lock of the secondthread. After this we need to check value of the&lock->valagain, because the first threadmay release lock before this moment. 
If the first thread did not released lock yet, the value oftheoldwill be equal to the value of theval(becauseatomic_cmpxchg_acquirethe value from the memory location which is pointed by thelock->valwill returnand now it is1)and we will exit from the loop. As we exited from this loop, we are waiting for the first threaduntil it will release lock, clear pending bit, acquire lock and return:smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));clear_pending_set_locked(lock);return;Notice that we did not touchqueueyet. We no need in it, because for two threads it justleads to unnecessary latency for memory access. In other case, the first thread may releaseit lock before this moment. In this case the_Q_PENDING_VALand we will start to buildlocal copy of themcs_nodeslock->valqueuewill contain. We start to build_Q_LOCKED_VAL |queueby the getting thearray of the processor which executes thread:node = this_cpu_ptr(&mcs_nodes[0]);idx = node->count++;tail = encode_tail(smp_processor_id(), idx);568Queued spinlocksAdditionally we calculatetailrepresents an entry of thecorrect of themcs_nodesyet andtonextNULLwhich will indicate the tail of themcs_nodesarray, setqueuearray. After this we set thelockedandindexwhichto point to thenodeto zero because this thread didn't acquire lockbecause we don't know anything about otherqueueentries:node += idx;node->locked = 0;node->next = NULL;We already touchcopy of the queue for the processor which executes currentper-cputhread which wants to acquire lock, this means that owner of the lock may released it beforethis moment. 
So we may try to acquire the lock again by calling the `queued_spin_trylock` function:

```C
if (queued_spin_trylock(lock))
	goto release;
```

The `queued_spin_trylock` function is defined in the include/asm-generic/qspinlock.h header file and does the same thing that the `queued_spin_lock` function does:

```C
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	if (!atomic_read(&lock->val) &&
	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
		return 1;
	return 0;
}
```

If the lock was successfully acquired we jump to the `release` label to release a node of the `queue`:

```C
release:
	this_cpu_dec(mcs_nodes[0].count);
```

because we do not need it anymore, as the lock is acquired. If `queued_spin_trylock` was unsuccessful, we update the tail of the queue:

```C
old = xchg_tail(lock, tail);
```

and retrieve the previous tail. The next step is to check whether the `queue` is empty. If it is not, we need to link the previous entry with the new one:

```C
if (old & _Q_TAIL_MASK) {
	prev = decode_tail(old);
	WRITE_ONCE(prev->next, node);

	arch_mcs_spin_lock_contended(&node->locked);
}
```

After the queue entries are linked, we wait until we reach the head of the queue. Once we have reached it, we need to check for a new node which might have been added during this wait:

```C
next = READ_ONCE(node->next);
if (next)
	prefetchw(next);
```

If a new node was added, we prefetch the cache line from the memory pointed to by the next queue entry with the `PREFETCHW` instruction. We preload this pointer now for optimization purposes: we just became the head of the queue, which means that there is an upcoming `MCS` unlock operation and the next entry will be touched.

Yes, from this moment we are at the head of the `queue`. But before we are able to acquire the lock, we need to wait for at least two events: the current owner of the lock must release it, and the second thread with the `pending` bit must acquire the lock too:

```C
smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
```

After both threads release the lock, the head of the `queue` will hold the lock.
In the end we just need to update the tail of the `queue` and remove the current head from it.

That's all.

## Conclusion

This is the end of the second part of the synchronization primitives chapter in the Linux kernel. In the previous part we already met the first synchronization primitive provided by the Linux kernel, the `spinlock`, implemented as a `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism: the `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* spinlock
* interrupt
* interrupt handler
* API
* Test and Set
* MCS
* per-cpu variables
* atomic instruction
* CMPXCHG instruction
* LOCK instruction
* NOP instruction
* PREFETCHW instruction
* x86_64
* Previous part

# Synchronization primitives in the Linux kernel. Part 3.

## Semaphores

This is the third part of the chapter which describes synchronization primitives in the Linux kernel. In the previous part we saw a special type of spinlock, the `queued spinlock`; that part was the last one describing `spinlock`-related stuff, so we need to go ahead.

The next synchronization primitive after `spinlock` which we will see in this part is the `semaphore`. We will start from the theoretical side and learn what a `semaphore` is, and only after this we will see how it is implemented in the Linux kernel, as we did in the previous part.

So, let's start.

## Introduction to the semaphores in the Linux kernel

So, what is a `semaphore`? As you may guess, a `semaphore` is yet another mechanism for the support of thread or process synchronization. The Linux kernel already provides an implementation of one synchronization mechanism, `spinlocks`, so why do we need yet another one? To answer this question we need to know the details of both of these mechanisms.
We are already familiar with `spinlocks`, so let's start from this mechanism. The main idea behind the `spinlock` concept is a lock which will be acquired for a very short time. We can't sleep while a lock is acquired by a process or thread, because other processes are waiting for us. A context switch is not allowed, because preemption is disabled to avoid deadlocks.

In this way, a `semaphore` is a good solution for locks which may be acquired for a long time. On the other hand, this mechanism is not optimal for locks that are acquired for a short time. To understand this, we need to know what a `semaphore` is.

Like the usual synchronization primitives, a `semaphore` is based on a variable. This variable may be incremented or decremented, and its state represents the ability to acquire the lock. Notice that the value of the variable is not limited to `0` and `1`. There are two types of `semaphores`:

* `binary semaphore`;
* `normal semaphore`.

In the first case, the value of the `semaphore` may be only `1` or `0`. In the second case the value of the `semaphore` may be any non-negative number. If the value of the `semaphore` is greater than `1`, it is called a `counting semaphore` and it allows more than `1` process to acquire the lock. This allows us to keep records of available resources, whereas a `spinlock` allows us to hold a lock on only one task. Besides all of this, one more important thing is that a `semaphore` allows us to sleep. Moreover, when a process waits for a lock which is acquired by another process, the scheduler may switch to another process.

## Semaphore API

So, we know a little about `semaphores` from the theoretical side; let's look at their implementation in the Linux kernel. All of the `semaphore` API is located in the include/linux/semaphore.h header file.

We may see that the `semaphore` mechanism is represented by the following structure:

```C
struct semaphore {
	raw_spinlock_t		lock;
	unsigned int		count;
	struct list_head	wait_list;
};
```

in the Linux kernel.
The `semaphore` structure consists of three fields:

* `lock` - a `spinlock` for `semaphore` data protection;
* `count` - the amount of available resources;
* `wait_list` - the list of processes which are waiting to acquire the lock.

Before we consider the `API` of the `semaphore` mechanism in the Linux kernel, we need to know how to initialize a `semaphore`. Actually, the Linux kernel provides two approaches to initialize a given `semaphore` structure:

* statically;
* dynamically.

Let's look at the first approach. We are able to initialize a `semaphore` statically with the `DEFINE_SEMAPHORE` macro:

```C
#define DEFINE_SEMAPHORE(name)	\
	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
```

As we may see, the `DEFINE_SEMAPHORE` macro provides the ability to initialize only a `binary` semaphore. The `DEFINE_SEMAPHORE` macro expands to the definition of the `semaphore` structure which is initialized with the `__SEMAPHORE_INITIALIZER` macro. Let's look at the implementation of this macro:

```C
#define __SEMAPHORE_INITIALIZER(name, n)				\
{									\
	.lock		= __RAW_SPIN_LOCK_UNLOCKED((name).lock),	\
	.count		= n,						\
	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
}
```

The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and initializes the fields of this structure. First of all we initialize the `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro.
As you may remember from the previous parts, the `__RAW_SPIN_LOCK_UNLOCKED` macro is defined in the include/linux/spinlock_types.h header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro, which just expands to the zero or `unlocked` state:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
```

The last two fields of the `semaphore` structure, `count` and `wait_list`, are initialized with the given value, which represents the count of available resources, and an empty list.

The second way to initialize a `semaphore` structure is to pass the `semaphore` and the number of available resources to the `sema_init` function, which is defined in the include/linux/semaphore.h header file:

```C
static inline void sema_init(struct semaphore *sem, int val)
{
	static struct lock_class_key __key;
	*sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
	lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
}
```

Let's consider the implementation of this function. It looks pretty easy and actually does almost the same thing: this function initializes the given `semaphore` with the `__SEMAPHORE_INITIALIZER` macro which we just saw. As I already wrote in the previous parts of this chapter, we will skip the stuff which is related to the lock validator of the Linux kernel.

So, from now on we are able to initialize a `semaphore`; let's look at how to lock and unlock it. The Linux kernel provides the following API to manipulate `semaphores`:

```C
void down(struct semaphore *sem);
void up(struct semaphore *sem);
int  down_interruptible(struct semaphore *sem);
int  down_killable(struct semaphore *sem);
int  down_trylock(struct semaphore *sem);
int  down_timeout(struct semaphore *sem, long jiffies);
```

The first two functions, `down` and `up`, are for acquiring and releasing the given `semaphore`. The `down_interruptible` function tries to acquire a `semaphore`. If this try was successful, the `count` field of the given `semaphore` will be decremented and the lock acquired; otherwise the task will be switched to the blocked state, or in other words the `TASK_INTERRUPTIBLE` flag will be set.
This `TASK_INTERRUPTIBLE` flag means that the process may be returned to a running state by a signal.

The `down_killable` function does the same as the `down_interruptible` function, but sets the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by a kill signal.

The `down_trylock` function is similar to the `spin_trylock` function. It tries to acquire a lock and exits if this operation was unsuccessful; in this case the process which wants to acquire the lock will not wait. The last function, `down_timeout`, also tries to acquire a lock, but it will be interrupted in the waiting state when the given timeout expires. Additionally, you may notice that the timeout is in jiffies.

We just saw the definitions of the `semaphore` API. We will start from the `down` function. This function is defined in the kernel/locking/semaphore.c source code file. Let's look at its implementation:

```C
void down(struct semaphore *sem)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&sem->lock, flags);
	if (likely(sem->count > 0))
		sem->count--;
	else
		__down(sem);
	raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
```

We may see the definition of the `flags` variable at the beginning of the `down` function. This variable will be passed to the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros, which are defined in the include/linux/spinlock.h header file and protect the counter of the given `semaphore`. Actually both of these macros do the same thing as the `spin_lock` and `spin_unlock` macros, but additionally they save/restore the current value of the interrupt flags and disable interrupts.

As you may already guess, the main work is done between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros in the `down` function. We compare the value of the `semaphore` counter with zero and, if it is bigger than zero, we may decrement this counter; this means that we have already acquired the lock. Otherwise the counter is zero: all available resources are exhausted and we need to wait to acquire this lock.
When the counter is zero, the `__down` function is called.

The `__down` function is defined in the same source code file and its implementation looks like this:

```C
static noinline void __sched __down(struct semaphore *sem)
{
	__down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

The `__down` function just calls the `__down_common` function with three parameters:

* `semaphore`;
* `flag` - for the task;
* `timeout` - maximum timeout to wait for the `semaphore`.

Before we consider the implementation of the `__down_common` function, notice that the implementations of `down_interruptible`, `down_killable` and `down_timeout` are based on `__down_common` too:

```C
static noinline int __sched __down_interruptible(struct semaphore *sem)
{
	return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

The `__down_killable`:

```C
static noinline int __sched __down_killable(struct semaphore *sem)
{
	return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT);
}
```

And the `__down_timeout`:

```C
static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
{
	return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
```

Now let's look at the implementation of the `__down_common` function. This function is defined in the kernel/locking/semaphore.c source code file too, and starts with the definition of the following two local variables:

```C
struct task_struct *task = current;
struct semaphore_waiter waiter;
```

The first represents the current task for the local processor which wants to acquire the lock.
`current` is a macro which is defined in the arch/x86/include/asm/current.h header file:

```C
#define current get_current()
```

where the `get_current` function returns the value of the `current_task` per-cpu variable:

```C
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}
```

The second variable, `waiter`, represents an entry of the `semaphore.wait_list` list:

```C
struct semaphore_waiter {
	struct list_head list;
	struct task_struct *task;
	bool up;
};
```

Next, after the definition of these variables, we add the current task to the `wait_list` and fill the `waiter` fields:

```C
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = false;
```

In the next step we join the following infinite loop:

```C
for (;;) {
	if (signal_pending_state(state, task))
		goto interrupted;
	if (unlikely(timeout <= 0))
		goto timed_out;
	__set_task_state(task, state);
	raw_spin_unlock_irq(&sem->lock);
	timeout = schedule_timeout(timeout);
	raw_spin_lock_irq(&sem->lock);
	if (waiter.up)
		return 0;
}
```

In the previous piece of code we set `waiter.up` to `false`, so a task will spin in this loop while `up` is not set to `true`. The loop starts with a check that the current task is in the `pending` state, or in other words that the flags of this task contain the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` flag. As I already wrote above, a task may be interrupted by a signal while waiting for the ability to acquire a lock. The `signal_pending_state` function is defined in the include/linux/sched.h source code file and looks like this:

```C
static inline int signal_pending_state(long state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
```

First we check that the `state` bitmask contains the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` bits; if the bitmask does not contain these bits we exit. At the next step we check whether the given task has a pending signal, and exit if there is none. In the end we just check for the `TASK_INTERRUPTIBLE` bit in the `state` bitmask again, or for the SIGKILL signal.
So, if our task has a pending signal, we will jump to the `interrupted` label:

```C
 interrupted:
	list_del(&waiter.list);
	return -EINTR;
```

where we delete the task from the list of lock waiters and return the `-EINTR` error code. If a task has no pending signal, we check the given timeout; if it is less than or equal to zero, we jump to the `timed_out` label:

```C
 timed_out:
	list_del(&waiter.list);
	return -ETIME;
```

where we do almost the same as in the `interrupted` label, but return the `-ETIME` error code. If the task has no pending signal and the given timeout has not expired yet, the state of the current task is set and it goes to sleep:

```C
	__set_task_state(task, state);
	raw_spin_unlock_irq(&sem->lock);
	timeout = schedule_timeout(timeout);
	raw_spin_lock_irq(&sem->lock);
```

with the help of the `schedule_timeout` function, which is defined in the kernel/time/timer.c source code file. The `schedule_timeout` function makes the current task sleep until the given timeout expires.

That is all about the `__down_common` function. A task which wants to acquire a lock which is already acquired by another task will spin in the infinite loop until it is interrupted by a signal, the given timeout expires, or the task which holds the lock releases it. Now let's look at the implementation of the `up` function.

The `up` function is defined in the same source code file as the `down` function. As we already know, the main purpose of this function is to release a lock. This function looks like this:

```C
void up(struct semaphore *sem)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&sem->lock, flags);
	if (likely(list_empty(&sem->wait_list)))
		sem->count++;
	else
		__up(sem);
	raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(up);
```

It looks almost the same as the `down` function. There are only two differences here. First of all, we increment the counter of the `semaphore` if the list of waiters is empty. Otherwise we call the `__up` function from the same source code file: if the list of waiters is not empty, we need to allow the first task from the list to acquire the lock:

```C
static noinline void __sched __up(struct semaphore *sem)
{
	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
						struct semaphore_waiter, list);
	list_del(&waiter->list);
	waiter->up = true;
	wake_up_process(waiter->task);
}
```

Here we take the first task from the list of waiters, delete it from the list, and set its `waiter->up` to true.
From this point the infinite loop in the `__down_common` function will be stopped. The `wake_up_process` function is called at the end of the `__up` function. As you remember, we called the `schedule_timeout` function in the infinite loop of the `__down_common` function, and `schedule_timeout` makes the current task sleep until the given timeout expires. So, as our process may be sleeping right now, we need to wake it up. That's why we call the `wake_up_process` function from the kernel/sched/core.c source code file.

That's all.

## Conclusion

This is the end of the third part of the synchronization primitives chapter in the Linux kernel. In the two previous parts we already met the first synchronization primitive provided by the Linux kernel, the `spinlock`, implemented as a `ticket spinlock` and used for very short time locks. In this part we saw yet another synchronization primitive, the `semaphore`, which is used for long time locks, as it leads to a context switch. In the next part we will continue to dive into synchronization primitives in the Linux kernel and will see the next synchronization primitive: the mutex.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* spinlocks
* synchronization primitive
* semaphore
* context switch
* preemption
* deadlocks
* scheduler
* Doubly linked list in the Linux kernel
* jiffies
* interrupts
* per-cpu
* bitmask
* SIGKILL
* errno
* API
* mutex
* Previous part

# Synchronization primitives in the Linux kernel. Part 4.

## Introduction

This is the fourth part of the chapter which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering the different types of `spinlock` and the `semaphore` synchronization primitives.
We will continue to learn synchronization primitives in this part and consider yet another one, which is called `mutex` and stands for `MUTual EXclusion`.

As in all previous parts of this book, we will try to consider this synchronization primitive from the theoretical side, and only then will we consider the API provided by the Linux kernel to manipulate `mutexes`.

So, let's start.

## Concept of mutex

We are already familiar with the `semaphore` synchronization primitive from the previous part. It is represented by the:

```C
struct semaphore {
	raw_spinlock_t		lock;
	unsigned int		count;
	struct list_head	wait_list;
};
```

structure, which holds information about the state of a lock and a list of lock waiters. Depending on the value of the `count` field, a `semaphore` can provide access to a resource for more than one process wishing to use this resource. The mutex concept is very similar to the semaphore concept, but it has some differences. The main difference between the `semaphore` and `mutex` synchronization primitives is that a `mutex` has stricter semantics. Unlike a `semaphore`, only one process may hold a `mutex` at one time, and only the `owner` of a `mutex` may release or unlock it. An additional difference is in the implementation of the `lock` API: the `semaphore` synchronization primitive forces rescheduling of the processes which are in the waiters list, while the implementation of the `mutex` `lock` API allows this situation, and the resulting expensive context switches, to be avoided.

The `mutex` synchronization primitive is represented by the following:

```C
struct mutex {
	atomic_t		count;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
	struct task_struct	*owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	struct optimistic_spin_queue osq;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
	void			*magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
};
```

structure in the Linux kernel. This structure is defined in the include/linux/mutex.h header file and contains a set of fields similar to the `semaphore` structure.
The first field of the `mutex` structure is `count`. The value of this field represents the state of a `mutex`. When the value of the `count` field is `1`, a `mutex` is in the `unlocked` state. When the value of the `count` field is `zero`, a `mutex` is in the `locked` state. Additionally, the value of the `count` field may be `negative`; in this case a `mutex` is in the `locked` state and has possible waiters.

The next two fields of the `mutex` structure, `wait_lock` and `wait_list`, are a spinlock for protection of the `wait queue` and the list of waiters which represents the `wait queue` for this lock. As you may notice, here the similarity of the `mutex` and `semaphore` structures ends.

The remaining fields of the `mutex` structure, as we may see, depend on different configuration options of the Linux kernel.

The first of them, `owner`, represents the process which acquired the lock. As we may see, the existence of this field in the `mutex` structure depends on the `CONFIG_DEBUG_MUTEXES` or `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration options. The main point of this field and the next field, `osq`, is the support of `optimistic spinning`, which we will see later. The last two fields, `magic` and `dep_map`, are used only in debugging mode: the `magic` field is for storing `mutex`-related information for debugging, and the second field, `dep_map`, is for the lock validator of the Linux kernel.

Now, after we have considered the `mutex` structure, we may consider how this synchronization primitive works in the Linux kernel. As you may guess, a process which wants to acquire a lock must decrease the value of `mutex->count` if possible, and if a process wants to release a lock, it must increase the same value. That's true. But as you may also guess, it is not so simple in the Linux kernel.
Actually, when a process tries to acquire a `mutex`, there are three possible paths:

* `fastpath`;
* `midpath`;
* `slowpath`.

which may be taken, depending on the current state of the `mutex`. The first path, `fastpath`, is the fastest, as you may understand from its name. Everything is easy in this case: nobody has acquired the `mutex`, so the value of the `count` field may be directly decremented. In the case of unlocking a `mutex`, the algorithm is the same: a process just increments the value of the `count` field of the `mutex` structure. Of course, all of these operations must be atomic.

Yes, this looks pretty easy. But what happens if a process wants to acquire a `mutex` which is already acquired by another process? In this case, control will be transferred to the second path, `midpath`. The `midpath`, or `optimistic spinning`, tries to spin with the already familiar MCS lock while the lock owner is running. This path will be executed only if there are no other processes ready to run that have higher priority. This path is called `optimistic` because the waiting task will not sleep and be rescheduled; this allows an expensive context switch to be avoided.

In the last case, when the `fastpath` and `midpath` may not be executed, the final path, `slowpath`, will be executed. This path acts like a semaphore lock: if the lock cannot be acquired by a process, this process will be added to the `wait queue`, which is represented by the following:

```C
struct mutex_waiter {
	struct list_head	list;
	struct task_struct	*task;
#ifdef CONFIG_DEBUG_MUTEXES
	void			*magic;
#endif
};
```

structure from the include/linux/mutex.h header file, and will sleep. Before we consider the `API` which is provided by the Linux kernel for manipulation of `mutexes`, let's consider the `mutex_waiter` structure.
If you have read the previous part of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the kernel/locking/semaphore.c source code file:

```C
struct semaphore_waiter {
	struct list_head list;
	struct task_struct *task;
	bool up;
};
```

It also contains `list` and `task` fields which represent an entry of the mutex wait queue. The one difference here is that `mutex_waiter` does not contain the `up` field, but does contain the `magic` field, which depends on the `CONFIG_DEBUG_MUTEXES` kernel configuration option and is used to store `mutex`-related information for debugging purposes.

Now we know what a `mutex` is and how it is represented in the Linux kernel, so we may go ahead and start to look at the API which the Linux kernel provides for manipulation of `mutexes`.

## Mutex API

Ok, in the previous paragraph we learned what the `mutex` synchronization primitive is and saw the `mutex` structure which represents a `mutex` in the Linux kernel. Now it's time to consider the API for manipulation of mutexes. Description of the `mutex` API is located in the include/linux/mutex.h header file. As always, before we consider how to acquire and release a `mutex`, we need to know how to initialize it.

There are two approaches to initializing a `mutex`. The first is to do it statically. For this purpose the Linux kernel provides the following:

```C
#define DEFINE_MUTEX(mutexname) \
        struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
```

macro. Let's consider the implementation of this macro. As we may see, the `DEFINE_MUTEX` macro takes a name for the `mutex` and expands to the definition of a new `mutex` structure. Additionally, the new `mutex` structure gets initialized with the `__MUTEX_INITIALIZER` macro.
Let's look at the implementation of the `__MUTEX_INITIALIZER` macro:

```C
#define __MUTEX_INITIALIZER(lockname)					\
{									\
	.count = ATOMIC_INIT(1),					\
	.wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock),		\
	.wait_list = LIST_HEAD_INIT(lockname.wait_list)			\
}
```

This macro is defined in the same header file and, as we may understand, it initializes the fields of the `mutex` structure with their initial values. The `count` field gets initialized with `1`, which represents the `unlocked` state of a mutex. The `wait_lock` spinlock gets initialized to the unlocked state, and the last field, `wait_list`, to an empty doubly linked list.

The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the kernel/locking/mutex.c source code file. Actually, the `__mutex_init` function is rarely called directly; instead, the `mutex_init` macro is used:

```C
# define mutex_init(mutex)				\
do {							\
	static struct lock_class_key __key;		\
							\
	__mutex_init((mutex), #mutex, &__key);		\
} while (0)
```

We may see that the `mutex_init` macro just defines the `lock_class_key` and calls the `__mutex_init` function. Let's look at the implementation of this function:

```C
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
	atomic_set(&lock->count, 1);
	spin_lock_init(&lock->wait_lock);
	INIT_LIST_HEAD(&lock->wait_list);
	mutex_clear_owner(lock);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	osq_lock_init(&lock->osq);
#endif
	debug_mutex_init(lock, name, key);
}
```

As we may see, the `__mutex_init` function takes three arguments:

* `lock` - a mutex itself;
* `name` - name of the mutex for debugging purposes;
* `key` - key for the lock validator.

At the beginning of the `__mutex_init` function, we may see the initialization of the `mutex` state. We set it to the `unlocked` state with the `atomic_set` function, which atomically sets the given variable to the given value. After this we may see the initialization of the `spinlock` (to the unlocked state) which will protect the `wait queue` of the `mutex`, and the initialization of the `wait queue` of the `mutex` itself. After this we initialize the optimistic queue by calling the `osq_lock_init` function from the include/linux/osq_lock.h header file.
This function just sets the tail of the optimistic queue to the unlocked state:

```C
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
{
	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}
```

At the end of the `__mutex_init` function we may see the call of the `debug_mutex_init` function, but, as I already wrote in previous parts of this chapter, we will not consider debugging related stuff in this chapter.

After the `mutex` structure is initialized, we may go ahead and look at the `lock` and `unlock` API of the `mutex` synchronization primitive. The implementations of the `mutex_lock` and `mutex_unlock` functions are located in the kernel/locking/mutex.c source code file. First of all, let's start from the implementation of `mutex_lock`. It looks like this:

```C
void __sched mutex_lock(struct mutex *lock)
{
	might_sleep();
	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
	mutex_set_owner(lock);
}
```

We may see the call of the `might_sleep` macro from the include/linux/kernel.h header file at the beginning of the `mutex_lock` function. The implementation of this macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option: if this option is enabled, this macro just prints a stack trace if it was executed in atomic context. This macro is a helper for debugging purposes; otherwise it does nothing.

After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and, as we consider the x86_64 architecture in this book, the implementation of `__mutex_fastpath_lock` is located in the arch/x86/include/asm/mutex_64.h header file. As we may understand from the name of the `__mutex_fastpath_lock` function, it will try to acquire the lock via the fast path, or in other words it will try to decrement the value of the `count` field of the given mutex.

The implementation of the `__mutex_fastpath_lock` function consists of two parts. The first part is an inline assembly statement.
Let's look at it:

```C
asm_volatile_goto(LOCK_PREFIX "   decl %0\n"
		  "   jns %l[exit]\n"
		  : : "m" (v->counter)
		  : "memory", "cc"
		  : exit);
```

First of all, let's pay attention to `asm_volatile_goto`. This macro is defined in the include/linux/compiler-gcc.h header file and just expands to two inline assembly statements:

```C
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
```

The first assembly statement contains the `goto` specifier, and the second, empty, inline assembly statement is a barrier. Now let's return to our inline assembly statement. As we may see, it starts with the `LOCK_PREFIX` macro, which just expands to the lock instruction:

```C
#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
```

As we already know from the previous parts, this instruction makes the prefixed instruction execute atomically. So, at the first step in our assembly statement we try to decrement the value of the given `mutex->counter`. At the next step the jns instruction will jump to the `exit` label if the value of the decremented `mutex->counter` is not negative. The `exit` label is the second part of the `__mutex_fastpath_lock` function and it just points to the exit from this function:

```C
exit:
	return;
```

For the moment, the implementation of the `__mutex_fastpath_lock` function looks pretty easy. But the value of `mutex->counter` may be negative after the decrement. In this case:

```C
fail_fn(v);
```

will be called after our inline assembly statement. The `fail_fn` is the second parameter of the `__mutex_fastpath_lock` function and represents a pointer to a function which implements the `midpath/slowpath` acquisition of the given lock. In our case the `fail_fn` is the `__mutex_lock_slowpath` function. Before we look at the implementation of the `__mutex_lock_slowpath` function, let's finish with the implementation of the `mutex_lock` function. In the simplest case, the lock will be acquired successfully by a process and `__mutex_fastpath_lock` will finish. In this case, we just call

```C
mutex_set_owner(lock);
```

at the end of `mutex_lock`.
The `mutex_set_owner` function is defined in the kernel/locking/mutex.h header file and just sets the owner of a lock to the current process:

```C
static inline void mutex_set_owner(struct mutex *lock)
{
	lock->owner = current;
}
```

Otherwise, let's consider the situation when a process which wants to acquire a lock is unable to do it, because another process has already acquired the same lock. We already know that the `__mutex_lock_slowpath` function will be called in this case. Let's consider the implementation of this function. It is defined in the kernel/locking/mutex.c source code file and starts by obtaining the proper mutex from the mutex state given by `__mutex_fastpath_lock`, with the `container_of` macro:

```C
__visible void __sched
__mutex_lock_slowpath(atomic_t *lock_count)
{
	struct mutex *lock = container_of(lock_count, struct mutex, count);

	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
			    NULL, _RET_IP_, NULL, 0);
}
```

and calls the `__mutex_lock_common` function with the obtained `mutex`. The `__mutex_lock_common` function starts by disabling preemption until rescheduling:

```C
preempt_disable();
```

After this comes the stage of optimistic spinning. As we already know, this stage depends on the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option; if this option is disabled, we skip this stage and move to the last path, the `slowpath`, of `mutex` acquisition:

```C
if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
	preempt_enable();
	return 0;
}
```

First of all the `mutex_optimistic_spin` function checks that we don't need to reschedule, or in other words that there are no other tasks ready to run that have higher priority. If this check was successful, we need to update the `MCS` lock wait queue with the current spinner.
In this way only one spinner can compete for the mutex at one time:

```C
osq_lock(&lock->osq)
```

At the next step we start to spin in the following loop:

```C
while (true) {
        owner = READ_ONCE(lock->owner);

        if (owner && !mutex_spin_on_owner(lock, owner))
                break;

        if (mutex_try_to_acquire(lock)) {
                lock_acquired(&lock->dep_map, ip);

                mutex_set_owner(lock);
                osq_unlock(&lock->osq);
                return true;
        }
}
```

and try to acquire a lock. First of all we try to take the current owner and, if the owner exists (it may not exist in the case when a process has already released the mutex), we wait for it in the `mutex_spin_on_owner` function until the owner releases the lock. If a new task with higher priority has appeared during the wait for the lock owner, we break the loop and go to sleep. Otherwise, the process may already have released the lock, so we try to acquire it with `mutex_try_to_acquire`. If this operation finished successfully, we set the new owner for the given mutex, remove ourselves from the `MCS` wait queue and exit from the `mutex_optimistic_spin` function. At this state a lock will be acquired by a process, so we enable preemption and exit from the `__mutex_lock_common` function:

```C
if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
        preempt_enable();
        return 0;
}
```

That's all for this case.

In other cases things may not be so successful. For example, a new task may appear while we are spinning in the loop in `mutex_optimistic_spin`, or we may not even get to this loop in the case when there were task(s) with higher priority before it. Or, finally, the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option may be disabled. In this case the `mutex_optimistic_spin` will do nothing:

```C
#ifndef CONFIG_MUTEX_SPIN_ON_OWNER
static bool mutex_optimistic_spin(struct mutex *lock,
                                  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
        return false;
}
#endif
```

In all of these cases, the `__mutex_lock_common` function will act like a `semaphore`.
We try to acquire a lock again because the owner of the lock might already have released it before this time:

```C
if (!mutex_is_locked(lock) &&
    (atomic_xchg_acquire(&lock->count, 0) == 1))
      goto skip_wait;
```

In the failure case the process which wants to acquire the lock will be added to the waiters list:

```C
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;
```

In the successful case we update the owner of the lock, enable preemption and exit from the `__mutex_lock_common` function:

```C
skip_wait:
        mutex_set_owner(lock);
        preempt_enable();
        return 0;
```

In this case a lock will be acquired. If we can't acquire the lock for now, we enter the following loop:

```C
for (;;) {
        if (atomic_read(&lock->count) >= 0 && (atomic_xchg_acquire(&lock->count, -1) == 1))
                break;

        if (unlikely(signal_pending_state(state, task))) {
                ret = -EINTR;
                goto err;
        }

        __set_task_state(task, state);

        schedule_preempt_disabled();
}
```

where we try to acquire the lock again and exit if this operation was successful. Yes, we try to acquire the lock again right after the unsuccessful try before the loop. We need to do it to make sure that we get a wakeup once the lock is unlocked. Besides this, it allows us to acquire the lock after sleeping. Otherwise we check the current process for pending signals and exit if the process was interrupted by a `signal` while waiting for the lock acquisition. If at the end of the loop we didn't acquire the lock, we set the task state to `TASK_UNINTERRUPTIBLE` and go to sleep with a call of the `schedule_preempt_disabled` function.

That's all. We have considered all three possible paths through which a process may pass when it wants to acquire a lock.
Now let's consider how `mutex_unlock` is implemented. When the `mutex_unlock` is called by a process which wants to release a lock, the `__mutex_fastpath_unlock` will be called from the arch/x86/include/asm/mutex_64.h header file:

```C
void __sched mutex_unlock(struct mutex *lock)
{
        __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
}
```

The implementation of the `__mutex_fastpath_unlock` function is very similar to the implementation of the `__mutex_fastpath_lock` function:

```C
static inline void __mutex_fastpath_unlock(atomic_t *v,
                                           void (*fail_fn)(atomic_t *))
{
        asm_volatile_goto(LOCK_PREFIX "   incl %0\n"
                          "   jg %l[exit]\n"
                          : : "m" (v->counter)
                          : "memory", "cc"
                          : exit);
        fail_fn(v);
exit:
        return;
}
```

Actually, there is only one difference. We increment the value of the `mutex->count`, so it will represent the `unlocked` state after this operation. As the `mutex` is released, but we may have something in the `mutex` wait queue, we need to update it. In this case the `fail_fn` function will be called, which is the `__mutex_unlock_slowpath`. The `__mutex_unlock_slowpath` function just gets the correct `mutex` instance from the given `mutex->count` and calls the `__mutex_unlock_common_slowpath` function:

```C
__mutex_unlock_slowpath(atomic_t *lock_count)
{
        struct mutex *lock = container_of(lock_count, struct mutex, count);

        __mutex_unlock_common_slowpath(lock, 1);
}
```

In the `__mutex_unlock_common_slowpath` function we get the first entry from the wait queue, if the wait queue is not empty, and wake up the related process:

```C
if (!list_empty(&lock->wait_list)) {
        struct mutex_waiter *waiter =
                list_entry(lock->wait_list.next, struct mutex_waiter, list);

        wake_up_process(waiter->task);
}
```

After this, a mutex will be released by the previous process and will be acquired by another process from the wait queue.

That's all. We have considered the main `API` for manipulation with `mutexes`: `mutex_lock` and `mutex_unlock`. Besides this the Linux kernel provides the following API:

* `mutex_lock_interruptible`;
* `mutex_lock_killable`;
* `mutex_trylock`.

and the corresponding `unlock`-prefixed versions of these functions.
This part will not describe this `API`, because it is similar to the corresponding `API` of `semaphores`. More about it you may read in the previous part.

That's all.

## Conclusion

This is the end of the fourth part of the synchronization primitives chapter in the Linux kernel. In this part we met a new synchronization primitive which is called `mutex`. From the theoretical side, this synchronization primitive is very similar to a semaphore. Actually, a `mutex` represents a binary semaphore. But its implementation differs from the implementation of a `semaphore` in the Linux kernel. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* Mutex
* Spinlock
* Semaphore
* Synchronization primitives
* API
* Locking mechanism
* Context switches
* lock validator
* Atomic
* MCS lock
* Doubly linked list
* x86_64
* Inline assembly
* Memory barrier
* Lock instruction
* JNS instruction
* preemption
* Unix signals
* Previous part

# Synchronization primitives in the Linux kernel. Part 5.

## Introduction

This is the fifth part of the chapter which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering different types of spinlock, semaphore and mutex synchronization primitives. We will continue to learn synchronization primitives in this part and start to consider a special type of synchronization primitive - the readers–writer lock. The first synchronization primitive of this type will be already familiar to us - the semaphore.
As in all previous parts of this book, before we consider the implementation of `reader/writer semaphores` in the Linux kernel, we will start from the theoretical side and try to understand what the difference is between `reader/writer semaphores` and normal `semaphores`.

So, let's start.

## Reader/Writer semaphore

Actually there are two types of operations that may be performed on data. We may read data and we may make changes to data. Two fundamental operations - `read` and `write`. Usually (but not always), the `read` operation is performed more often than the `write` operation. In this case, it would be logical to lock data in such a way that some processes may read the locked data at the same time, on the condition that no one changes the data. The readers/writer lock allows us to get this kind of lock.

When a process wants to write something into the data, all other `writer` and `reader` processes will be blocked until the process which acquired the lock releases it. When a process reads data, other processes which want to read the same data will not be blocked and will be able to do so. As you may guess, the implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We are already familiar with the semaphore synchronization primitive from the third part of this chapter. From the theoretical side everything looks pretty simple. Let's look at how a `reader/writer semaphore` is represented in the Linux kernel.

The `semaphore` is represented by the:

```C
struct semaphore {
        raw_spinlock_t          lock;
        unsigned int            count;
        struct list_head        wait_list;
};
```

structure. If you look in the include/linux/rwsem.h header file, you will find the definition of the `rw_semaphore` structure which represents a `reader/writer semaphore` in the Linux kernel.
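The admission rules described above can be sketched as a toy model in C. The names here are invented and this is only the semantics of a readers/writer lock, not the kernel's implementation:

```c
#include <stdbool.h>

/* Toy readers/writer admission rules (invented model, not kernel code):
 *   count > 0   -> that many readers hold the lock,
 *   count == -1 -> one writer holds it,
 *   count == 0  -> free. */
struct toy_rwlock {
    int count;
};

static inline bool toy_try_read_lock(struct toy_rwlock *l)
{
    if (l->count < 0)          /* an active writer blocks all readers */
        return false;
    l->count++;                /* readers may share the lock */
    return true;
}

static inline bool toy_try_write_lock(struct toy_rwlock *l)
{
    if (l->count != 0)         /* any reader or writer blocks a writer */
        return false;
    l->count = -1;
    return true;
}
```

Any number of readers may hold the lock together, while a writer needs the lock entirely to itself - exactly the behaviour described in the paragraph above.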
Let's look at the definition of this structure:

```C
#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
#include <linux/rwsem-spinlock.h>
#else
struct rw_semaphore {
        long count;
        struct list_head wait_list;
        raw_spinlock_t wait_lock;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
        struct optimistic_spin_queue osq;
        struct task_struct *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map      dep_map;
#endif
};
```

Before we consider the fields of the `rw_semaphore` structure, we may notice that the declaration of the `rw_semaphore` structure depends on the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration option. This option is disabled for the x86_64 architecture by default. We can be sure of this by looking at the corresponding kernel configuration file. In our case, this configuration file is arch/x86/um/Kconfig:

```
config RWSEM_XCHGADD_ALGORITHM
        def_bool 64BIT

config RWSEM_GENERIC_SPINLOCK
        def_bool !RWSEM_XCHGADD_ALGORITHM
```

So, as this book describes only x86_64 architecture related stuff, we will skip the case when the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration option is enabled and consider the definition of the `rw_semaphore` structure only from the include/linux/rwsem.h header file.

If we take a look at the definition of the `rw_semaphore` structure, we will notice that the first three fields are the same as in the `semaphore` structure. It contains the `count` field which represents the amount of available resources, the `wait_list` field which represents a doubly linked list of processes which are waiting to acquire the lock, and the `wait_lock` spinlock for protection of this list. Notice that the `rw_semaphore.count` field is of `long` type, unlike the same field in the `semaphore` structure.

The `count` field of a `rw_semaphore` structure may have the following values:

* `0x0000000000000000` - the `reader/writer semaphore` is in unlocked state and no one is waiting for the lock;
* `0x000000000000000X` - `X` readers are active or attempting to acquire the lock and no writer is waiting;
* `0xffffffff0000000X` - may represent different cases. The first is - `X` readers are active or attempting to acquire the lock, with waiters for the lock.
The second is - one writer attempting the lock, no waiters for the lock. And the last - one writer is active and no waiters for the lock;
* `0xffffffff00000001` - may represent two different cases. The first is - one reader is active or attempting to acquire the lock and waiters for the lock exist. The second case is - one writer is active or attempting to acquire the lock and no waiters for the lock;
* `0xffffffff00000000` - represents the situation when there are readers or writers queued, but no one is active or in the process of acquiring the lock;
* `0xfffffffe00000001` - a writer is active or attempting to acquire the lock and waiters are in the queue.

So, besides the `count` field, all of these fields are similar to the fields of the `semaphore` structure. The last three fields depend on two configuration options of the Linux kernel: `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two of them may be familiar to us from the declaration of the mutex structure in the previous part. The first, the `osq` field, represents the MCS lock spinner for `optimistic spinning` and the second represents the process which is the current owner of the lock.

The last field of the `rw_semaphore` structure - `dep_map` - is debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter.

That's all. Now we know a little about what a `reader/writer lock` is in general and a `reader/writer semaphore` in particular. Additionally we saw how a `reader/writer semaphore` is represented in the Linux kernel. In this case, we may go ahead and start to look at the API which the Linux kernel provides for manipulation of `reader/writer semaphores`.

## Reader/Writer semaphore API

So, we know a little about the `reader/writer semaphore` from the theoretical side; let's look at its implementation in the Linux kernel.
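Before turning to the kernel API, the `count` encodings listed in the table above can be sanity-checked with a small C model. The constants match the x86_64 definitions quoted later in this part from arch/x86/include/asm/rwsem.h; the helper names are invented here, and an LP64 platform (64-bit `long`) is assumed:

```c
#include <stdint.h>

/* x86_64 rwsem bias constants (as in arch/x86/include/asm/rwsem.h). */
#define RWSEM_ACTIVE_BIAS        0x00000001L
#define RWSEM_ACTIVE_MASK        0xffffffffL
#define RWSEM_WAITING_BIAS       (-RWSEM_ACTIVE_MASK - 1)
#define RWSEM_ACTIVE_READ_BIAS   RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

/* Invented helpers showing how lock operations move the count:
 * a reader adds 1, a writer adds the combined waiting+active bias,
 * so the high word counts "waiting/writer" state and the low word
 * counts active lockers. */
static inline long rwsem_model_reader_lock(long count)
{
    return count + RWSEM_ACTIVE_READ_BIAS;
}

static inline long rwsem_model_writer_lock(long count)
{
    return count + RWSEM_ACTIVE_WRITE_BIAS;
}
```

Adding the writer bias to an unlocked semaphore yields exactly the `0xffffffff00000001` pattern from the table, and the low 32 bits (the active mask) show one active locker.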
All `reader/writer semaphores` related API is located in the include/linux/rwsem.h header file.

As always, before we consider the API of the `reader/writer semaphore` mechanism in the Linux kernel, we need to know how to initialize the `rw_semaphore` structure. As we already saw in previous parts of this chapter, all synchronization primitives may be initialized in two ways:

* statically;
* dynamically.

And the `reader/writer semaphore` is not an exception. First of all, let's take a look at the first approach. We may initialize the `rw_semaphore` structure with the help of the `DECLARE_RWSEM` macro at compile time. This macro is defined in the include/linux/rwsem.h header file and looks like:

```C
#define DECLARE_RWSEM(name) \
        struct rw_semaphore name = __RWSEM_INITIALIZER(name)
```

As we may see, the `DECLARE_RWSEM` macro just expands to the definition of the `rw_semaphore` structure with the given name. Additionally the new `rw_semaphore` structure is initialized with the value of the `__RWSEM_INITIALIZER` macro:

```C
#define __RWSEM_INITIALIZER(name)                              \
{                                                              \
        .count = RWSEM_UNLOCKED_VALUE,                         \
        .wait_list = LIST_HEAD_INIT((name).wait_list),         \
        .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)  \
        __RWSEM_OPT_INIT(name)                                 \
        __RWSEM_DEP_MAP_INIT(name)                             \
}
```

and expands to the initialization of the fields of the `rw_semaphore` structure. First of all we initialize the `count` field of the `rw_semaphore` structure to the `unlocked` state with the `RWSEM_UNLOCKED_VALUE` macro from the arch/x86/include/asm/rwsem.h architecture specific header file:

```C
#define RWSEM_UNLOCKED_VALUE            0x00000000L
```

After this we initialize the list of lock waiters with an empty linked list and the spinlock for protection of this list to the `unlocked` state too. The `__RWSEM_OPT_INIT` macro depends on the `CONFIG_RWSEM_SPIN_ON_OWNER` kernel configuration option and, if this option is enabled, it expands to the initialization of the `osq` and `owner` fields of the `rw_semaphore` structure.
As we already saw above, the `CONFIG_RWSEM_SPIN_ON_OWNER` kernel configuration option is enabled by default for the x86_64 architecture, so let's take a look at the definition of the `__RWSEM_OPT_INIT` macro:

```C
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
#else
#define __RWSEM_OPT_INIT(lockname)
#endif
```

As we may see, the `__RWSEM_OPT_INIT` macro initializes the MCS lock with the `unlocked` state and the initial `owner` of the lock with `NULL`. From this moment, a `rw_semaphore` structure will be initialized at compile time and may be used for data protection.

The second way to initialize a `rw_semaphore` structure is dynamically, using the `init_rwsem` macro from the include/linux/rwsem.h header file. This macro declares an instance of the `lock_class_key`, which is related to the lock validator of the Linux kernel, and expands to the call of the `__init_rwsem` function with the given `reader/writer semaphore`:

```C
#define init_rwsem(sem)                         \
do {                                            \
        static struct lock_class_key __key;     \
                                                \
        __init_rwsem((sem), #sem, &__key);      \
} while (0)
```

If you look for the definition of the `__init_rwsem` function, you will notice that there are a couple of source code files which contain it. As you may guess, sometimes we need to initialize additional fields of the `rw_semaphore` structure, like `osq` and `owner`, but sometimes not. All of this depends on some kernel configuration options. If we look at the kernel/locking/Makefile makefile, we will see the following lines:

```
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
```

As we already know, the Linux kernel for the `x86_64` architecture has the `CONFIG_RWSEM_XCHGADD_ALGORITHM` kernel configuration option enabled by default:

```
config RWSEM_XCHGADD_ALGORITHM
        def_bool 64BIT
```

in the arch/x86/um/Kconfig kernel configuration file.
In this case, the implementation of the `__init_rwsem` function will be located in the kernel/locking/rwsem-xadd.c source code file for us. Let's take a look at this function:

```C
void __init_rwsem(struct rw_semaphore *sem, const char *name,
                  struct lock_class_key *key)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        debug_check_no_locks_freed((void *)sem, sizeof(*sem));
        lockdep_init_map(&sem->dep_map, name, key, 0);
#endif
        sem->count = RWSEM_UNLOCKED_VALUE;
        raw_spin_lock_init(&sem->wait_lock);
        INIT_LIST_HEAD(&sem->wait_list);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
        sem->owner = NULL;
        osq_lock_init(&sem->osq);
#endif
}
```

We may see here almost the same as in the `__RWSEM_INITIALIZER` macro, with the difference that all of this will be executed at runtime.

So, from now on we are able to initialize a `reader/writer semaphore`; let's look at the `lock` and `unlock` API. The Linux kernel provides the following primary API to manipulate `reader/writer semaphores`:

* `void down_read(struct rw_semaphore *sem)` - lock for reading;
* `int down_read_trylock(struct rw_semaphore *sem)` - try lock for reading;
* `void down_write(struct rw_semaphore *sem)` - lock for writing;
* `int down_write_trylock(struct rw_semaphore *sem)` - try lock for writing;
* `void up_read(struct rw_semaphore *sem)` - release a read lock;
* `void up_write(struct rw_semaphore *sem)` - release a write lock;

Let's start as always with the locking. First of all let's consider the implementation of the `down_write` function which tries to acquire the lock for `write`. This function is defined in the kernel/locking/rwsem.c source code file and starts from the call of the `might_sleep` macro from the include/linux/kernel.h header file:

```C
void __sched down_write(struct rw_semaphore *sem)
{
        might_sleep();
        rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

        LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
        rwsem_set_owner(sem);
}
```

We already met the `might_sleep` macro in the previous part.
In short, the implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and, if this option is enabled, this macro just prints a stack trace if it was executed in atomic context. As this macro is mostly for debugging purposes we will skip it and go ahead. Additionally we will skip the next macro in the `down_write` function - `rwsem_acquire` - which is related to the lock validator of the Linux kernel, because this is the topic of another part.

The only two things that remain in the `down_write` function are the call of the `LOCK_CONTENDED` macro, which is defined in the include/linux/lockdep.h header file, and the setting of the owner of the lock with the `rwsem_set_owner` function, which sets the owner to the currently running process:

```C
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
        sem->owner = current;
}
```

As you may already guess, the `LOCK_CONTENDED` macro does all the job for us. Let's look at the implementation of the `LOCK_CONTENDED` macro:

```C
#define LOCK_CONTENDED(_lock, try, lock) \
        lock(_lock)
```

As we may see, it just calls the `lock` function, which is the third parameter of the `LOCK_CONTENDED` macro, with the given `rw_semaphore`. In our case the third parameter of the `LOCK_CONTENDED` macro is the `__down_write` function, which is an architecture specific function located in the arch/x86/include/asm/rwsem.h header file. Let's look at the implementation of the `__down_write` function:

```C
static inline void __down_write(struct rw_semaphore *sem)
{
        __down_write_nested(sem, 0);
}
```

which just executes a call of the `__down_write_nested` function from the same source code file.
Let's take a look at the implementation of the `__down_write_nested` function:

```C
static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
{
        long tmp;

        asm volatile("# beginning down_write\n\t"
                     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
                     "  test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
                     "  jz        1f\n"
                     "  call call_rwsem_down_write_failed\n"
                     "1:\n"
                     "# ending down_write"
                     : "+m" (sem->count), "=d" (tmp)
                     : "a" (sem), "1" (RWSEM_ACTIVE_WRITE_BIAS)
                     : "memory", "cc");
}
```

As for the other synchronization primitives which we saw in this chapter, the `lock/unlock` functions usually consist only of an inline assembly statement. As we may see, in our case the same is true for the `__down_write_nested` function. Let's try to understand what this function does. The first line of our assembly statement is just a comment, let's skip it. The second line contains `LOCK_PREFIX`, which will be expanded to the LOCK instruction as we already know. The next xadd instruction executes `add` and `exchange` operations. In other words, the xadd instruction adds the value of the `RWSEM_ACTIVE_WRITE_BIAS`:

```C
#define RWSEM_ACTIVE_WRITE_BIAS         (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

#define RWSEM_WAITING_BIAS              (-RWSEM_ACTIVE_MASK-1)
#define RWSEM_ACTIVE_BIAS               0x00000001L
```

or `0xffffffff00000001` to the `count` of the given `reader/writer semaphore` and returns the previous value of it. After this we check the active mask in the `rw_semaphore->count`. If it was zero before, this means that there was no writer before, so we have acquired the lock. Otherwise we call the `call_rwsem_down_write_failed` function from the arch/x86/lib/rwsem.S assembly file.
The `call_rwsem_down_write_failed` function just calls the `rwsem_down_write_failed` function from the kernel/locking/rwsem-xadd.c source code file, saving the general purpose registers beforehand:

```assembly
ENTRY(call_rwsem_down_write_failed)
        FRAME_BEGIN
        save_common_regs
        movq %rax,%rdi
        call rwsem_down_write_failed
        restore_common_regs
        FRAME_END
        ret
        ENDPROC(call_rwsem_down_write_failed)
```

The `rwsem_down_write_failed` function starts from the atomic update of the `count` value:

```C
__visible
struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
        count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
        ...
        ...
        ...
}
```

with the `-RWSEM_ACTIVE_WRITE_BIAS` value. The `rwsem_atomic_update` function is defined in the arch/x86/include/asm/rwsem.h header file and implements exchange-and-add logic:

```C
static inline long rwsem_atomic_update(long delta, struct rw_semaphore *sem)
{
        return delta + xadd(&sem->count, delta);
}
```

This function atomically adds the given delta to the `count`; the `xadd` returns the old value of the `count`, so the sum of the given `delta` and that old value is the new value of the `count` field. In our case we undo the write bias from the `count` as we didn't acquire the lock. After this step we try to do `optimistic spinning` by calling the `rwsem_optimistic_spin` function:

```C
if (rwsem_optimistic_spin(sem))
        return sem;
```

We will skip the implementation of the `rwsem_optimistic_spin` function, as it is similar to the `mutex_optimistic_spin` function which we saw in the previous part. In short, we check for the existence of other tasks ready to run that have higher priority in the `rwsem_optimistic_spin` function. If there are no such tasks, the process will be added to the MCS `waitqueue` and start to spin in the loop until the lock can be acquired.
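The `delta + xadd(...)` trick can be mimicked in userspace with C11's `atomic_fetch_add`, which - like the `xadd` instruction - returns the old value. A sketch with an invented name, not the kernel function itself:

```c
#include <stdatomic.h>

/* Userspace model (invented name) of rwsem_atomic_update(): atomically
 * add `delta` and return the *new* value, built from atomic_fetch_add,
 * which - like xadd - returns the *old* value. */
static inline long model_rwsem_atomic_update(atomic_long *count, long delta)
{
    return delta + atomic_fetch_add(count, delta);
}
```

The caller gets the updated counter value in one atomic step, which is exactly what `rwsem_down_write_failed` needs to decide whether it acquired the lock.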
If `optimistic spinning` is disabled, the process will be added to the `wait queue` and marked as waiting for write:

```C
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;

if (list_empty(&sem->wait_list))
        waiting = false;

list_add_tail(&waiter.list, &sem->wait_list);
```

and it starts to wait until it successfully acquires the lock. After we have added the process to the waiters list, which was empty before this moment, we update the value of the `rw_semaphore->count` with the `RWSEM_WAITING_BIAS`:

```C
count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
```

With this we mark in the `rw_semaphore->count` that it is already locked and that one `writer` exists/waits which wants to acquire the lock. Otherwise we try to wake `reader` processes from the `wait queue` that were queued before this `writer` process, if there are no active readers. At the end of the `rwsem_down_write_failed` the `writer` process which didn't acquire the lock will go to sleep in the following loop:

```C
while (true) {
        if (rwsem_try_write_lock(count, sem))
                break;
        raw_spin_unlock_irq(&sem->wait_lock);
        do {
                schedule();
                set_current_state(TASK_UNINTERRUPTIBLE);
        } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
        raw_spin_lock_irq(&sem->wait_lock);
}
```

I will skip the explanation of this loop as we already met similar functionality in the previous part.

That's all. From this moment, our `writer` process will acquire or not acquire the lock depending on the value of the `rw_semaphore->count` field. Now if we look at the implementation of the `down_read` function, which tries to acquire a lock for reading, we will see similar actions to those we saw in the `down_write` function. This function calls different debugging and lock validator related functions/macros:

```C
void __sched down_read(struct rw_semaphore *sem)
{
        might_sleep();
        rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);

        LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
```

and does all the job in the `__down_read` function.
The `__down_read` consists of an inline assembly statement:

```C
static inline void __down_read(struct rw_semaphore *sem)
{
        asm volatile("# beginning down_read\n\t"
                     LOCK_PREFIX _ASM_INC "(%1)\n\t"
                     "  jns        1f\n"
                     "  call call_rwsem_down_read_failed\n"
                     "1:\n\t"
                     "# ending down_read\n\t"
                     : "+m" (sem->count)
                     : "a" (sem)
                     : "memory", "cc");
}
```

which increments the value of the given `rw_semaphore->count` and calls the `call_rwsem_down_read_failed` if this value is negative. Otherwise we jump to the label `1:` and exit. After this the `read` lock will be successfully acquired. Notice that we check the sign of the `count` value as it may be negative, because, as you may remember, the most significant word of the `rw_semaphore->count` contains the negated number of active writers.

Let's consider the case when a process wants to acquire a lock for the `read` operation, but it is already locked. In this case the `call_rwsem_down_read_failed` function from the arch/x86/lib/rwsem.S assembly file will be called. If you look at the implementation of this function, you will notice that it does the same as the `call_rwsem_down_write_failed` function does, except it calls the `rwsem_down_read_failed` function instead of `rwsem_down_write_failed`. Now let's consider the implementation of the `rwsem_down_read_failed` function. It starts by adding the process to the `wait queue` and updating the value of the `rw_semaphore->counter`:

```C
long adjustment = -RWSEM_ACTIVE_READ_BIAS;

waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ;

if (list_empty(&sem->wait_list))
        adjustment += RWSEM_WAITING_BIAS;
list_add_tail(&waiter.list, &sem->wait_list);

count = rwsem_atomic_update(adjustment, sem);
```

Notice that if the `wait queue` was empty before, we clear the `rw_semaphore->counter`; otherwise we undo the `read` bias. At the next step we check that there are no active locks and, if we are first in the `wait queue`, we need to join the currently active `reader` processes. Otherwise we go to sleep until the lock can be acquired.

That's all.
Now we know how `reader` and `writer` processes will behave in different cases during lock acquisition. Now let's take a short look at the `unlock` operations. The `up_read` and `up_write` functions allow us to unlock a `reader` or `writer` lock. First of all let's take a look at the implementation of the `up_write` function which is defined in the kernel/locking/rwsem.c source code file:

```C
void up_write(struct rw_semaphore *sem)
{
        rwsem_release(&sem->dep_map, 1, _RET_IP_);

        rwsem_clear_owner(sem);
        __up_write(sem);
}
```

First of all it calls the `rwsem_release` macro which is related to the lock validator of the Linux kernel, so we will skip it now. At the next line the `rwsem_clear_owner` function, as you may understand from the name of this function, just clears the `owner` field of the given `rw_semaphore`:

```C
static inline void rwsem_clear_owner(struct rw_semaphore *sem)
{
        sem->owner = NULL;
}
```

The `__up_write` function does all the job of unlocking the lock. The `__up_write` is an architecture-specific function, so for our case it will be located in the arch/x86/include/asm/rwsem.h source code file. If we take a look at the implementation of this function, we will see that it does almost the same as the `__down_write` function, but conversely. Instead of adding the `RWSEM_ACTIVE_WRITE_BIAS` to the `count`, we subtract the same value and check the `sign` of the previous value.

If the previous value of the `rw_semaphore->count` is not negative, the writer process released the lock and now it may be acquired by someone else. Otherwise, the `rw_semaphore->count` will contain negative values. This means that there is at least one `writer` in the wait queue. In this case the `call_rwsem_wake` function will be called. This function acts like the similar functions which we already saw above: it stores the general purpose registers on the stack to preserve them and calls the `rwsem_wake` function.

First of all the `rwsem_wake` function checks if a spinner is present. In this case it will just acquire the lock which was just released by the lock owner.
Otherwise there must be someone in the `wait queue` and we need to wake the `writer` process, if it exists at the top of the `wait queue`, or all `reader` processes. The `up_read` function, which releases a `reader` lock, acts in a similar way to `up_write`, but with a little difference. Instead of subtracting `RWSEM_ACTIVE_WRITE_BIAS` from the `rw_semaphore->count`, it subtracts `1` from it, because the less significant word of the `count` contains the number of active locks. After this it checks the `sign` of the `count` like `__up_write` and calls the `rwsem_wake` function if the `count` is negative; otherwise the lock will be successfully released.

That's all. We have considered the API for manipulation with `reader/writer semaphores`: `up_read/up_write` and `down_read/down_write`. We saw that the Linux kernel provides additional API besides these functions, but I will not consider the implementation of those functions in this part because it must be similar to what we have seen here, except for a few subtleties.

## Conclusion

This is the end of the fifth part of the synchronization primitives chapter in the Linux kernel. In this part we met a special type of `semaphore` - the `readers/writer` semaphore - which provides access to data for multiple processes to read or for one process to write. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

## Links

* Synchronization primitives
* Readers/Writer lock
* Spinlocks
* Semaphore
* Mutex
* x86_64 architecture
* Doubly linked list
* MCS lock
* API
* Linux kernel lock validator
* Atomic operations
* Inline assembly
* XADD instruction
* LOCK instruction
* Previous part

# Synchronization primitives in the Linux kernel. Part 6.

## Introduction

This is the sixth part of the chapter which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering different readers-writer lock synchronization primitives. We will continue to learn synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock` or `sequential locks`.

We know from the previous part that a readers-writer lock is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired the lock holds it. So, in a situation when contention is high, a writer process which wants to acquire the lock will wait for it for a long time.

The `seqlock` synchronization primitive can help solve this problem.

As in all previous parts of this book, we will try to consider this synchronization primitive from the theoretical side and only then consider the API provided by the Linux kernel to manipulate `seqlocks`.

So, let's start.

## Sequential lock

So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the spinlock synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally write access must be rare and also should be fast.

Work of this synchronization primitive is based on a sequence of events counter.
Actually, a `sequential lock` allows free access to a resource for readers, but each reader must check for conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of `sequential locks` is simple: each writer which acquires a sequential lock increments this counter and additionally acquires a spinlock. When this writer finishes, it releases the acquired spinlock to give access to other writers and increments the counter of the sequential lock again.

Read-only access works on the following principle: it gets the value of the `sequential lock` counter before it enters the critical section and compares it with the value of the same `sequential lock` counter at the exit of the critical section. If their values are equal, this means that there were no writers during this period. If their values are not equal, this means that a writer has incremented the counter during the critical section. This conflict means that reading of the protected data must be repeated.

That's all. As we may see, the principle of work of `sequential locks` is simple.

```C
unsigned int seq_counter_value;

do {
    seq_counter_value = get_seq_counter_val(&the_lock);
    //
    // do as we want here
    //
} while (__retry__);
```

Actually the Linux kernel does not provide a `get_seq_counter_val()` function. Here it is just a stub. Like `__retry__` too. As I already wrote above, we will see the actual API for this in the next paragraph of this part.

Ok, now we know what a `seqlock` synchronization primitive is and how it is represented in the Linux kernel. In this case, we may go ahead and start to look at the API which the Linux kernel provides for manipulating synchronization primitives of this type.

## Sequential lock API

So, now we know a little about the `sequential lock` synchronization primitive from the theoretical side, let's look at its implementation in the Linux kernel.
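The reader/writer protocol above can be sketched in plain userspace C. The following is a toy model only (`toy_seqlock`, `toy_read_begin` and the other names are invented, not the kernel API), but it shows why a writer increments the counter twice: the counter is odd exactly while an update is in progress, so a reader retries both when it started mid-update and when the counter moved underneath it:

```C
#include <assert.h>

/* Toy model of a sequence counter; all names here are made up. */
struct toy_seqlock {
	unsigned int sequence;
	int data;
};

static unsigned int toy_read_begin(const struct toy_seqlock *sl)
{
	return sl->sequence;
}

/* Retry if the snapshot was taken mid-update (odd counter) or if a
 * writer incremented the counter while we were reading. */
static int toy_read_retry(const struct toy_seqlock *sl, unsigned int start)
{
	return (start & 1) || sl->sequence != start;
}

static void toy_write_begin(struct toy_seqlock *sl) { sl->sequence++; }
static void toy_write_end(struct toy_seqlock *sl)   { sl->sequence++; }

/* Single-threaded scenario: returns 1 if the retry logic behaves as
 * described in the text above. */
static int toy_seqlock_demo(void)
{
	struct toy_seqlock sl = { 0, 0 };
	unsigned int start;
	int ok;

	start = toy_read_begin(&sl);
	ok = !toy_read_retry(&sl, start);	/* no writer: no retry    */

	toy_write_begin(&sl);			/* counter becomes odd    */
	start = toy_read_begin(&sl);
	ok = ok && toy_read_retry(&sl, start);	/* mid-update: must retry */
	sl.data = 42;
	toy_write_end(&sl);			/* counter is even again  */

	start = toy_read_begin(&sl);
	ok = ok && !toy_read_retry(&sl, start) && sl.data == 42;
	return ok;
}
```

A real reader would run the begin/retry pair in a `do/while` loop, exactly like the pseudocode above.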
All of the `sequential locks` API is located in the include/linux/seqlock.h header file. First of all we may see that the `sequential lock` mechanism is represented by the following type:

```C
typedef struct {
	struct seqcount seqcount;
	spinlock_t lock;
} seqlock_t;
```

As we may see, the `seqlock_t` provides two fields. These fields represent a sequential lock counter, the description of which we saw above, and also a spinlock which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type. The `seqcount` is a structure:

```C
typedef struct seqcount {
	unsigned sequence;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} seqcount_t;
```

which holds the counter of a sequential lock and a lock validator related field.

As always in the previous parts of this chapter, before we consider the API of the `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.

We saw in the previous parts that the Linux kernel often provides two approaches to initialize a given synchronization primitive. The same applies to the `seqlock_t` structure. These approaches allow initializing a `seqlock_t` in the two following ways:

* `statically`;
* `dynamically`.

Let's look at the first approach. We are able to initialize a `seqlock_t` statically with the `DEFINE_SEQLOCK` macro:

```C
#define DEFINE_SEQLOCK(x) \
		seqlock_t x = __SEQLOCK_UNLOCKED(x)
```

which is defined in the include/linux/seqlock.h header file. As we may see, the `DEFINE_SEQLOCK` macro takes one argument and expands to the definition and initialization of the `seqlock_t` structure. Initialization occurs with the help of the `__SEQLOCK_UNLOCKED` macro which is defined in the same source code file. Let's look at the implementation of this macro:

```C
#define __SEQLOCK_UNLOCKED(lockname)			\
	{						\
		.seqcount = SEQCNT_ZERO(lockname),	\
		.lock =	__SPIN_LOCK_UNLOCKED(lockname)	\
	}
```

As we may see, the `__SEQLOCK_UNLOCKED` macro executes initialization of the fields of the given `seqlock_t` structure.
The first field, `seqcount`, is initialized with the `SEQCNT_ZERO` macro which expands to:

```C
#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)}
```

So we just initialize the counter of the given sequential lock to zero, and additionally we can see lock validator related initialization which depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option:

```C
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define SEQCOUNT_DEP_MAP_INIT(lockname) \
		.dep_map = { .name = #lockname } \
	...
	...
	...
#else
# define SEQCOUNT_DEP_MAP_INIT(lockname)
	...
	...
	...
#endif
```

As I already wrote in previous parts of this chapter, we will not consider debugging and lock validator related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the include/linux/spinlock_types.h header file. We will not consider the implementation of this macro here as it just initializes a raw spinlock with architecture-specific methods (you may read more about spinlocks in the first parts of this chapter).

We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, but dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same include/linux/seqlock.h header file. Let's look at the implementation of this macro:

```C
#define seqlock_init(x)					\
	do {						\
		seqcount_init(&(x)->seqcount);		\
		spin_lock_init(&(x)->lock);		\
	} while (0)
```

As we may see, the `seqlock_init` macro expands into two macros. The first macro, `seqcount_init`, takes the counter of the given sequential lock and expands to the call of the `__seqcount_init` function:

```C
# define seqcount_init(s)				\
	do {						\
		static struct lock_class_key __key;	\
		__seqcount_init((s), #s, &__key);	\
	} while (0)
```

from the same header file.
This function:

```C
static inline void __seqcount_init(seqcount_t *s, const char *name,
                                   struct lock_class_key *key)
{
	lockdep_init_map(&s->dep_map, name, key, 0);
	s->sequence = 0;
}
```

just initializes the counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the first part of this chapter.

So, now we know how to initialize a `sequential lock`, let's look at how to use it. The Linux kernel provides the following API to manipulate `sequential locks`:

```C
static inline unsigned read_seqbegin(const seqlock_t *sl);
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start);
static inline void write_seqlock(seqlock_t *sl);
static inline void write_sequnlock(seqlock_t *sl);
static inline void write_seqlock_irq(seqlock_t *sl);
static inline void write_sequnlock_irq(seqlock_t *sl);
static inline void read_seqlock_excl(seqlock_t *sl)
static inline void read_sequnlock_excl(seqlock_t *sl)
```

and others. Before we move on to considering the implementation of this API, we must know that there are actually two types of readers. The first type of reader never blocks a writer process; in this case the writer will not wait for readers. The second type of reader can lock; in this case, the locking reader will block the writer, as the writer will have to wait until the reader releases its lock.

First of all let's consider the first type of readers.
The `read_seqbegin` function begins a seq-read critical section. As we may see, this function just returns the value of the `read_seqcount_begin` function:

```C
static inline unsigned read_seqbegin(const seqlock_t *sl)
{
	return read_seqcount_begin(&sl->seqcount);
}
```

In its turn the `read_seqcount_begin` function calls the `raw_read_seqcount_begin` function:

```C
static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
	return raw_read_seqcount_begin(s);
}
```

which just returns the value of the `sequential lock` counter:

```C
static inline unsigned raw_read_seqcount(const seqcount_t *s)
{
	unsigned ret = READ_ONCE(s->sequence);
	smp_rmb();
	return ret;
}
```

After we have the initial value of the given `sequential lock` counter and have done some stuff, we know from the previous paragraph of this part that we need to compare it with the current value of the counter of the same `sequential lock` before we exit the critical section. We can achieve this by calling the `read_seqretry` function. This function takes a `sequential lock` and the start value of the counter, and through a chain of functions:

```C
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
{
	return read_seqcount_retry(&sl->seqcount, start);
}

static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
{
	smp_rmb();
	return __read_seqcount_retry(s, start);
}
```

it calls the `__read_seqcount_retry` function:

```C
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
{
	return unlikely(s->sequence != start);
}
```

which just compares the value of the counter of the given `sequential lock` with the initial value of this counter. Additionally, if the initial value of the counter which was obtained from the `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the value of the data can be in an inconsistent state, so we need to try to read it again.

This is a common pattern in the Linux kernel.
For example, you may remember the `jiffies` concept from the first part of the timers and time management in the Linux kernel chapter. A sequential lock is used to obtain the value of `jiffies` on the x86_64 architecture:

```C
u64 get_jiffies_64(void)
{
	unsigned long seq;
	u64 ret;

	do {
		seq = read_seqbegin(&jiffies_lock);
		ret = jiffies_64;
	} while (read_seqretry(&jiffies_lock, seq));
	return ret;
}
```

Here we just read the value of the counter of the `jiffies_lock` sequential lock and then write the value of the `jiffies_64` system variable to `ret`. As we may see a `do/while` loop here, the body of the loop will be executed at least one time. After the body of the loop is executed, we read and compare the current value of the counter of the `jiffies_lock` with the initial value. If these values are not equal, execution of the loop will be repeated, else `get_jiffies_64` will return its value in `ret`.

We just saw the first type of readers which do not block writers and other readers. Let's consider the second type. It does not update the value of a `sequential lock` counter, but just locks the `spinlock`:

```C
static inline void read_seqlock_excl(seqlock_t *sl)
{
	spin_lock(&sl->lock);
}
```

So, no other reader or writer can access the protected data. When a reader finishes, the lock must be unlocked with the:

```C
static inline void read_sequnlock_excl(seqlock_t *sl)
{
	spin_unlock(&sl->lock);
}
```

function.

Now we know how a `sequential lock` works for readers. Let's consider how a writer acts when it wants to acquire a `sequential lock` to modify data. To acquire a `sequential lock`, a writer should use the `write_seqlock` function. If we look at the implementation of this function:

```C
static inline void write_seqlock(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	write_seqcount_begin(&sl->seqcount);
}
```

we will see that it acquires the `spinlock` to prevent access from other writers and calls the `write_seqcount_begin` function.
This function just increments the value of the `sequential lock` counter:

```C
static inline void raw_write_seqcount_begin(seqcount_t *s)
{
	s->sequence++;
	smp_wmb();
}
```

When a writer process finishes modifying data, the `write_sequnlock` function must be called to release the lock and give access to other writers or readers. Let's consider the implementation of the `write_sequnlock` function. It looks pretty simple:

```C
static inline void write_sequnlock(seqlock_t *sl)
{
	write_seqcount_end(&sl->seqcount);
	spin_unlock(&sl->lock);
}
```

First of all it just calls the `write_seqcount_end` function to increase the value of the counter of the `sequential lock` again:

```C
static inline void raw_write_seqcount_end(seqcount_t *s)
{
	smp_wmb();
	s->sequence++;
}
```

and in the end we just call the `spin_unlock` macro to give access to other readers or writers.

That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full API of this mechanism in this part, but all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in interrupt handlers or softirqs: `write_seqlock_irq` and `write_sequnlock_irq`:

```C
static inline void write_seqlock_irq(seqlock_t *sl)
{
	spin_lock_irq(&sl->lock);
	write_seqcount_begin(&sl->seqcount);
}

static inline void write_sequnlock_irq(seqlock_t *sl)
{
	write_seqcount_end(&sl->seqcount);
	spin_unlock_irq(&sl->lock);
}
```

As we may see, these functions differ only in how they take the spinlock.
They call `spin_lock_irq` and `spin_unlock_irq` instead of `spin_lock` and `spin_unlock`. Or, for example, the `write_seqlock_irqsave` and `write_sequnlock_irqrestore` functions, which are the same but use the `spin_lock_irqsave` and `spin_unlock_irqrestore` macros to be usable in IRQ handlers.

That's all.

## Conclusion

This is the end of the sixth part of the synchronization primitives chapter in the Linux kernel. In this part we met a new synchronization primitive which is called the `sequential lock`. From the theoretical side, this synchronization primitive is very similar to a readers-writer lock, but allows avoiding the `writer starvation` issue.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.**

## Links

* synchronization primitives
* readers-writer lock
* spinlock
* critical section
* lock validator
* debugging
* API
* x86_64
* Timers and time management in the Linux kernel
* interrupt handlers
* softirq
* IRQ
* Previous part

# Linux kernel memory management

This chapter describes memory management in the linux kernel. You will see here a couple of posts which describe different parts of the linux memory management framework:

* Memblock - describes the early `memblock` allocator.
* Fix-Mapped Addresses and ioremap - describes `fix-mapped` addresses and the early `ioremap`.
* kmemcheck - the third part describes the `kmemcheck` tool.

# Linux kernel memory management Part 1.

## Introduction

Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the last preparations before the kernel entry point part we stopped right before the call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process.
You may remember that we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.

## Memblock

Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and running yet. Previously it was called `Logical Memory Block`, but with a patch by Yinghai Lu, it was renamed to `memblock`. The Linux kernel for the `x86_64` architecture uses this method. We already met `memblock` in the last preparations before the kernel entry point part. And now it's time to get acquainted with it more closely. We will see how it is implemented.

We will start to learn `memblock` from the data structures. Definitions of all logical-memory-block-related data structures can be found in the include/linux/memblock.h header file.

The first structure has the same name as this part and it is:

```C
struct memblock {
         bool bottom_up;
         phys_addr_t current_limit;
         struct memblock_type memory;   --> array of memblock_region
         struct memblock_type reserved; --> array of memblock_region
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
         struct memblock_type physmem;
#endif
};
```

This structure contains five fields. The first is `bottom_up` which allows allocating memory in bottom-up mode when it is `true`. The next field is `current_limit`, which describes the limit size of the memory block. The next three fields describe the type of the memory block. It can be: reserved, memory and physical memory (physical memory is available if the `CONFIG_HAVE_MEMBLOCK_PHYS_MAP` configuration option is enabled).
Now we see yet another data structure - `memblock_type`. Let's look at its definition:

```C
struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions;
};
```

This structure provides information about the memory type. It contains fields which describe the number of memory regions inside the current memory block, the size of all memory regions, the size of the allocated array of the memory regions, and a pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is:

```C
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;
#endif
};
```

`memblock_region` provides the base address and size of the memory region as well as a flags field which can have the following values:

```C
enum {
    MEMBLOCK_NONE    = 0x0,	/* No special request */
    MEMBLOCK_HOTPLUG = 0x1,	/* hotpluggable region */
    MEMBLOCK_MIRROR  = 0x2,	/* mirrored region */
    MEMBLOCK_NOMAP   = 0x4,	/* don't add to kernel direct mapping */
};
```

Also `memblock_region` provides an integer field - the numa node selector - if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled.

Schematically we can imagine it as:

```
+---------------------------+   +---------------------------+
|         memblock          |   |                           |
|  _______________________  |   |                           |
| |        memory         | |   |       Array of the        |
| |     memblock_type     |-|-->|      memblock_region      |
| |_______________________| |   |                           |
|                           |   +---------------------------+
|  _______________________  |   +---------------------------+
| |       reserved        | |   |                           |
| |     memblock_type     |-|-->|       Array of the        |
| |_______________________| |   |      memblock_region      |
|                           |   |                           |
+---------------------------+   +---------------------------+
```

These three structures: `memblock`, `memblock_type` and `memblock_region` are the main structures of `Memblock`.
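To see how these three structures hang together, here is a small userspace sketch (the `toy_` names are invented for illustration; this is not the kernel code) of a `memblock_type`-like container whose `cnt` and `total_size` fields stay consistent as regions are appended:

```C
#include <assert.h>

typedef unsigned long long toy_phys_addr_t;	/* stand-in for phys_addr_t */

struct toy_region {
	toy_phys_addr_t base, size;
};

struct toy_type {
	unsigned long cnt;		/* regions currently in use      */
	unsigned long max;		/* capacity of the regions array */
	toy_phys_addr_t total_size;	/* sum of all region sizes       */
	struct toy_region *regions;	/* pointer to the regions array  */
};

/* Append a region, keeping cnt and total_size consistent.
 * Returns 0 on success, -1 if the array is full. */
static int toy_add_region(struct toy_type *t, toy_phys_addr_t base,
			  toy_phys_addr_t size)
{
	if (t->cnt >= t->max)
		return -1;
	t->regions[t->cnt].base = base;
	t->regions[t->cnt].size = size;
	t->cnt++;
	t->total_size += size;
	return 0;
}
```

In the kernel the picture is the same, except that the arrays live in static boot-time storage and can be grown on demand, as we will see below.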
Now we know about them and can look at the Memblock initialization process.

## Memblock initialization

While all the API of the `memblock` is described in the include/linux/memblock.h header file, the implementations of these functions are in the mm/memblock.c source code file. Let's look at the top of the source code file and we will see the initialization of the `memblock` structure:

```C
struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		= 1,
	.memory.max		= INIT_MEMBLOCK_REGIONS,

	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,
	.reserved.max		= INIT_MEMBLOCK_REGIONS,

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,
	.physmem.max		= INIT_PHYSMEM_REGIONS,
#endif

	.bottom_up		= false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};
```

Here we can see the initialization of a `memblock` structure variable which has the same name as the structure type - `memblock`. First of all note the `__initdata_memblock` macro. The definition of this macro looks like:

```C
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
    #define __init_memblock __meminit
    #define __initdata_memblock __meminitdata
#else
    #define __init_memblock
    #define __initdata_memblock
#endif
```

You can see that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put into the `.init` section and will be released after the kernel is booted up.

Next we can see the initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field is initialized by an array of `memblock_region`s:
Here we are interested only in thememblock_typefield iss:static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAPstatic struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;#endifEvery array contains 128 memory regions. We can see it in theINIT_MEMBLOCK_REGIONSmacro definition:#define INIT_MEMBLOCK_REGIONS128Note that all arrays are also defined with thesaw in thememblock__initdata_memblockmacro which we alreadystructure initialization (read above if you've forgotten).The last two fields describe thatbottom_upallocation is disabled and the limit of the currentMemblock is:#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)which is0xffffffffffffffff.On this step the initialization of thememblockstructure has been finished and we can have alook at the Memblock API.624MemblockMemblock APIOk we have finished with the initialization of thememblockstructure and now we can look atthe Memblock API and its implementation. As I said above, the implementation ofis taking place fully in mm/memblock.c. To understand howmemblockworks and how it ismemblockimplemented, let's look at its usage first. There are a couple of places in the linux kernelwhere memblock is used. For example let's takefunction from thememblock_x86_fillarch/x86/kernel/e820.c. This function goes through the memory map provided by the e820and adds memory regions reserved by the kernel to thefunction. Since we have met thememblock_addmemblockwith thememblock_addfunction first, let's start from it.This function takes a physical base address and the size of the memory region asarguments and add them to thememblock. Thememblock_addfunction does not do anythingspecial in its body, but just calls the:memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);function. 
We pass the memory block type - `memory`, the physical base address and size of the memory region, the maximum number of nodes, which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks for the existence of memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memblock_region` with the given values and return (we already saw the implementation of this in the First touch of the linux kernel memory manager framework). If the `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`.

First of all we get the end of the memory region with:

```C
phys_addr_t end = base + memblock_cap_size(base, &size);
```

`memblock_cap_size` adjusts `size` so that `base + size` will not overflow.
Its implementation is pretty easy:

```C
static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
{
	return *size = min(*size, (phys_addr_t)ULLONG_MAX - base);
}
```

`memblock_cap_size` returns the new size which is the smallest value between the given size and `ULLONG_MAX - base`.

After we have the end address of the new memory region, `memblock_add_range` checks for overlap and merge conditions with the memory regions that have been added before. Insertion of the new memory region into the `memblock` consists of two steps:

* Adding the non-overlapping parts of the new memory area as separate regions;
* Merging of all neighboring regions.

We go through all the already stored memory regions, checking for overlap with the new region:

```C
	for (i = 0; i < type->cnt; i++) {
		struct memblock_region *rgn = &type->regions[i];
		phys_addr_t rbase = rgn->base;
		phys_addr_t rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;
		...
		...
		...
	}
```

If the new memory region does not overlap the already stored regions, the first step just inserts it. Before the insertion we check that the regions array has enough room and double it otherwise:

```C
	if (type->cnt + nr_new > type->max)
		if (memblock_double_array(type, obase, size) < 0)
			return -ENOMEM;
	insert = true;
	goto repeat;
```

`memblock_double_array` doubles the size of the given regions array. Then we set `insert` to `true` and go to the `repeat` label. In the second step, starting from the `repeat` label, we go through the same loop and insert the current memory region into the memory block with the `memblock_insert_region` function:

```C
	if (base < end) {
		nr_new++;
		if (insert)
			memblock_insert_region(type, i, base, end - base,
					       nid, flags);
	}
```

The `memblock_insert_region` function gets the place of the new region in the array:

```C
struct memblock_region *rgn = &type->regions[idx];
```

and copies the memory area with `memmove`:

```C
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
```

After this it fills the fields of the new memory region: base, size, etc. and increases the size of the `memblock_type`. At the end of the execution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step.

In the second case the new memory region can overlap already stored regions.
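The clamping that `memblock_cap_size` performs is easy to model in userspace. The sketch below assumes a 64-bit physical address type; `toy_cap_size` is a made-up name mirroring the `min(*size, ULLONG_MAX - base)` expression quoted above:

```C
#include <assert.h>
#include <stdint.h>

/* Clamp size so that base + size cannot wrap past the maximum 64-bit
 * physical address. */
static uint64_t toy_cap_size(uint64_t base, uint64_t size)
{
	uint64_t room = UINT64_MAX - base;	/* bytes left above base */

	return size < room ? size : room;
}
```

With this clamp, `base + toy_cap_size(base, size)` can never overflow, which is exactly why `memblock_add_range` can safely compute the region's end address.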
For example, we already have `region1` in the `memblock`:

```
                  0x1000
+-----------------------+
|                       |
|                       |
|        region1        |
|                       |
|                       |
+-----------------------+
```

And now we want to add `region2` to the `memblock` with the following base address and size:

```
0x100             0x2000
+-----------------------+
|                       |
|                       |
|        region2        |
|                       |
|                       |
+-----------------------+
```

In this case we set the base address of the new memory region as the end address of the overlapped region:

```C
base = min(rend, end);
```

So it will be `0x1000` in our case. And we insert it as we already did in the second step, with:

```C
if (base < end) {
	nr_new++;
	if (insert)
		memblock_insert_region(type, i, base, end - base, nid, flags);
}
```

After the insertion, `memblock_add_range` calls the `memblock_merge_regions` function which, as I said above, merges neighboring compatible regions. It goes through all memory regions of the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` - and checks that these regions have the same flags, belong to the same node and that the end address of the first region is equal to the base address of the second region:

```C
	while (i < type->cnt - 1) {
		struct memblock_region *this = &type->regions[i];
		struct memblock_region *next = &type->regions[i + 1];

		if (this->base + this->size != next->base ||
		    memblock_get_region_node(this) !=
		    memblock_get_region_node(next) ||
		    this->flags != next->flags) {
			BUG_ON(this->base + this->size > next->base);
			i++;
			continue;
		}
```

If the regions are compatible, we update the size of the first region with the size of the next region:

```C
this->size += next->size;
```

As we updated the size of the first memory region with the size of the next memory region, we move all memory regions which are after the `next` memory region one index backwards with the `memmove` function:

```C
memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
```

The `memmove` here moves all regions which are located after the `next` region to the place of the `next` region.
In the end we just decrease the count of the memory regions which belong to the `memblock_type`:

```C
type->cnt--;
```

After this we will get two memory regions merged into one:

```
0                                         0x2000
+------------------------------------------------+
|                                                |
|                                                |
|                     region1                    |
|                                                |
|                                                |
+------------------------------------------------+
```

as we decreased the count of regions in a memblock of the given type, increased the size of the `this` region and shifted all regions which are located after the `next` region to its place.

That's all. This is the whole principle of the work of the `memblock_add_range` function.

There is also the `memblock_reserve` function which does the same as `memblock_add`, but with one difference: it stores the region in `memblock_type.reserved` in the memblock instead of `memblock_type.memory`.

Of course this is not the full API. Memblock provides APIs not only for adding `memory` and `reserved` memory regions, but also:

* memblock_remove - removes a memory region from memblock;
* memblock_find_in_range - finds a free area in the given range;
* memblock_free - releases a memory region in memblock;
* for_each_mem_range - iterates through memblock areas.

and many more....

## Getting info about memory regions

Memblock also provides an API for getting information about allocated memory regions in the `memblock`. It is split into two parts:

* get_allocated_memblock_memory_regions_info - getting info about memory regions;
* get_allocated_memblock_reserved_regions_info - getting info about reserved regions.

Implementation of these functions is easy. Let's look at `get_allocated_memblock_reserved_regions_info` for example:

```C
phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
					phys_addr_t *addr)
{
	if (memblock.reserved.regions == memblock_reserved_init_regions)
		return 0;

	*addr = __pa(memblock.reserved.regions);

	return PAGE_ALIGN(sizeof(struct memblock_region) *
			  memblock.reserved.max);
}
```

First of all this function checks that `memblock` contains reserved memory regions. If `memblock` does not contain reserved memory regions we just return zero.
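The merge step described above can be modeled in a few lines of userspace C. This is only a sketch of the idea with invented names (node checks are omitted): neighboring regions are coalesced when the first one ends exactly where the next begins and their flags match, and the tail of the array is shifted back with `memmove`, just like in the loop quoted from `memblock_merge_regions`:

```C
#include <assert.h>
#include <string.h>

struct toy_mblk_region {
	unsigned long long base, size;
	unsigned long flags;
};

/* Merge neighboring compatible regions in place; returns the new count. */
static unsigned long toy_merge_regions(struct toy_mblk_region *regions,
				       unsigned long cnt)
{
	unsigned long i = 0;

	while (i + 1 < cnt) {
		struct toy_mblk_region *cur = &regions[i];
		struct toy_mblk_region *next = &regions[i + 1];

		if (cur->base + cur->size != next->base ||
		    cur->flags != next->flags) {
			i++;			/* not mergeable, move on */
			continue;
		}

		cur->size += next->size;	/* grow the first region  */
		/* shift everything after next one slot backwards */
		memmove(next, next + 1, (cnt - (i + 2)) * sizeof(*next));
		cnt--;
	}
	return cnt;
}
```

For instance, two regions `[0x0, 0x1000)` and `[0x1000, 0x2000)` with equal flags collapse into one `[0x0, 0x2000)` region, while a region separated by a gap stays untouched.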
Otherwise we write the physical address of the reserved memory regions array to the given address and return the aligned size of the allocated array. Note that the `PAGE_ALIGN` macro is used for alignment. Actually it depends on the size of a page:

```C
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
```

Implementation of the `get_allocated_memblock_memory_regions_info` function is the same. It has only one difference: `memblock_type.memory` is used instead of `memblock_type.reserved`.

## Memblock debugging

There are many calls to `memblock_dbg` in the memblock implementation. If you pass the `memblock=debug` option to the kernel command line, this function will be called. Actually `memblock_dbg` is just a macro which expands to `printk`:

```C
#define memblock_dbg(fmt, ...) \
        if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
```

For example you can see a call of this macro in the `memblock_reserve` function:

```C
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
		     (unsigned long long)base,
		     (unsigned long long)base + size - 1,
		     flags, (void *)_RET_IP_);
```

and the corresponding output will appear in the kernel log.

Memblock also has support in debugfs. If you run the kernel on an architecture other than `X86` you can access:

* /sys/kernel/debug/memblock/memory
* /sys/kernel/debug/memblock/reserved
* /sys/kernel/debug/memblock/physmem

to get a dump of the `memblock` contents.

## Conclusion

This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.**

## Links

* e820
* numa
* debugfs
* First touch of the linux kernel memory manager framework

# Linux kernel memory management Part 2.

## Fix-Mapped Addresses and ioremap

`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`.
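The `PAGE_ALIGN`/`ALIGN` rounding is simple power-of-two arithmetic. Here is a userspace sketch, assuming 4096-byte pages (the `toy_` names are made up; the kernel's macro uses its `PAGE_SIZE` instead):

```C
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* Round addr up to the next multiple of a (a must be a power of two);
 * this is the usual (addr + a - 1) & ~(a - 1) alignment arithmetic. */
static unsigned long toy_align(unsigned long addr, unsigned long a)
{
	return (addr + a - 1) & ~(a - 1);
}

static unsigned long toy_page_align(unsigned long addr)
{
	return toy_align(addr, TOY_PAGE_SIZE);
}
```

So a regions array of, say, 3000 bytes would be reported as one full 4096-byte page.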
Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You may remember that in an earlier part we already set the `level2_fixmap_pgt`:

```assembly
NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0
```

As you can see, `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from arch/x86/include/asm/fixmap.h. For example it contains an entry for `VSYSCALL_PAGE` - if emulation of the legacy vsyscall page is enabled - `FIX_APIC_BASE` for the local apic, etc. In virtual memory the fix-mapped area is placed after the modules area:

```
       +-----------+-----------------+---------------+------------------+
       |           |                 |               |                  |
       |kernel text|      kernel     |               |    vsyscalls     |
       | mapping   |       text      |    Modules    |    fix-mapped    |
       |from phys 0|       data      |               |    addresses     |
       |           |                 |               |                  |
       +-----------+-----------------+---------------+------------------+
__START_KERNEL_map   __START_KERNEL    MODULES_VADDR            0xffffffffffffffff
```

The base virtual address and size of the `fix-mapped` area are presented by the two following macros:

```C
#define FIXADDR_SIZE	(__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START	(FIXADDR_TOP - FIXADDR_SIZE)
```

Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and, as I wrote above, every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines the size of a page. For example, the size of one page can be obtained with the `1 << PAGE_SHIFT` expression.

In our case we need to get the size of the fix-mapped area, not of only one page; that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area.
The `__end_of_permanent_fixed_addresses` is the last index of the `fixed_addresses` enum, or in other words it contains the amount of pages in the fix-mapped area. So if we multiply the value of `__end_of_permanent_fixed_addresses` by the page size, we will get the size of the fix-mapped area. In my case it's a little more than `536` kilobytes. In your case it might be a different number, because the size depends on the amount of fix-mapped addresses, which depends on your kernel's configuration.

The second macro, `FIXADDR_START`, just subtracts the fix-mapped area size from the last address of the fix-mapped area to get its base virtual address. `FIXADDR_TOP` is a rounded up address from the base address of the vsyscall space:

```C
#define FIXADDR_TOP	(round_up(VSYSCALL_ADDR + PAGE_SIZE, 1 << PMD_SHIFT) - PAGE_SIZE)
```

The `fixed_addresses` enum is used as an index to get the virtual address with the `fix_to_virt` function:

```C
static __always_inline unsigned long fix_to_virt(const unsigned int idx)
{
	BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
	return __fix_to_virt(idx);
}
```

First of all it checks that the index given for the `fixed_addresses` enum is not greater than or equal to `__end_of_fixed_addresses` with the `BUILD_BUG_ON` macro, and then returns the result of the `__fix_to_virt` macro:

```C
#define __fix_to_virt(x)	(FIXADDR_TOP - ((x) << PAGE_SHIFT))
```

Here we shift the given fix-mapped area index left by `PAGE_SHIFT` (which determines the size of a page, as I wrote above) and subtract it from `FIXADDR_TOP`, which is the highest address of the fix-mapped area. There is also an inverse helper which takes a virtual address, checks that the address is inside the fix-mapped area (between `FIXADDR_START` and `FIXADDR_TOP`) and returns the index of the corresponding fix-mapped area. It is the `__virt_to_fix` macro:

```C
#define __virt_to_fix(x)	((FIXADDR_TOP - ((x) & PAGE_MASK)) >> PAGE_SHIFT)
```

The `__virt_to_fix` macro clears the first `12` bits in the given virtual address, subtracts it from the last address of the fix-mapped area (`FIXADDR_TOP`) and shifts the result right by `PAGE_SHIFT`, which is `12`. Let me explain how it works.

As in the previous example (the `__fix_to_virt` macro), we start from the top of the fix-mapped area. We also go back from the top to the bottom to search for the index of the fix-mapped area corresponding to the given virtual address. As you may see, first of all we clear the first `12` bits in the given virtual address with the `x & PAGE_MASK` expression. This allows us to get the base address of the page. We need to do this for the case when the given virtual address points somewhere in the beginning/middle or end of a page, but not to its base address.
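The index/address arithmetic of `__fix_to_virt` and `__virt_to_fix` can be checked with a userspace sketch. The top address and all names below are made up for the demonstration; only the arithmetic mirrors the macros quoted above (4 KB pages assumed):

```C
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  (~(((uint64_t)1 << TOY_PAGE_SHIFT) - 1))
/* An arbitrary, made-up top of the fix-mapped area for this demo. */
#define TOY_FIXADDR_TOP UINT64_C(0xffffffffff5ff000)

/* index -> virtual address: count idx pages down from the top */
static uint64_t toy_fix_to_virt(unsigned int idx)
{
	return TOY_FIXADDR_TOP - ((uint64_t)idx << TOY_PAGE_SHIFT);
}

/* virtual address -> index: round down to the page base, then count
 * how many pages it is below the top */
static unsigned int toy_virt_to_fix(uint64_t vaddr)
{
	return (unsigned int)((TOY_FIXADDR_TOP - (vaddr & TOY_PAGE_MASK))
			      >> TOY_PAGE_SHIFT);
}
```

Note how an address pointing into the middle of a page still resolves to the same index, which is exactly what the `x & PAGE_MASK` step buys us.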
At the next step we subtract this from `FIXADDR_TOP`, which gives us the virtual address of the corresponding page in the fix-mapped area. In the end we just divide the value of this address by `PAGE_SHIFT`. This gives us the index of the fix-mapped area corresponding to the given virtual address. It may look hard, but if you go through it step by step, you will see that the `__virt_to_fix` macro is pretty easy.

That's all. For this moment we know a little about fix-mapped addresses, but this is enough to go next.

Fix-mapped addresses are used in different places in the linux kernel. The IDT descriptor is stored there, the Intel Trusted Execution Technology UUID is stored in the fix-mapped area starting from the `FIX_TBOOT_BASE` index, the Xen bootmap and many more... We already saw a little about fix-mapped addresses in the fifth part about linux kernel initialization. We use the fix-mapped area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to fix-mapped addresses.

## ioremap

The Linux kernel provides many different primitives to manage memory. For this moment we will touch I/O memory. Every device is controlled by reading/writing from/to its registers. For example, a driver can turn a device off/on by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from. As we know, for this moment there are two ways to access a device's registers and data buffers:

* through the I/O ports;
* mapping all of the registers to the memory address space.

In the first case every control register of a device has a number of input and output port. A device driver can read from a port and write to it with the two `in` and `out` instructions which we already saw.
If you want to know about currently registered port regions, you can learn about them by accessing `/proc/ioports`:

```
$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vesafb
  03f8-03ff : serial
  04d0-04d1 : pnp 00:06
  0800-087f : pnp 00:01
  0a00-0a0f : pnp 00:04
  0a20-0a2f : pnp 00:04
  0a30-0a3f : pnp 00:04
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
...
```

`/proc/ioports` provides information about which driver uses which address of an I/O port region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from include/linux/ioport.h. Actually `request_region` is a macro which is defined as:

```C
#define request_region(start,n,name)	__request_region(&ioport_resource, (start), (n), (name), 0)
```

As we can see it takes three parameters:

* `start` - begin of the region;
* `n` - length of the region;
* `name` - name of the requester.

`request_region` allocates an I/O port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available, and the `release_region` function is called to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel initialization process and it looks as follows:

```C
struct resource {
	resource_size_t start;
	resource_size_t end;
	const char *name;
	unsigned long flags;
	struct resource *parent, *sibling, *child;
};
```

and contains start and end addresses of the resource, the name, etc. Every `resource` structure contains pointers to the `parent`, `sibling` and `child` resources. As it has a parent and a child, it means that every subset of resources has a root `resource` structure. For example, for I/O ports it is the `ioport_resource` structure:
```C
struct resource ioport_resource = {
	.name	= "PCI IO",
	.start	= 0,
	.end	= IO_SPACE_LIMIT,
	.flags	= IORESOURCE_IO,
};
EXPORT_SYMBOL(ioport_resource);
```

Or for `iomem`, it is the `iomem_resource` structure:

```C
struct resource iomem_resource = {
	.name	= "PCI mem",
	.start	= 0,
	.end	= -1,
	.flags	= IORESOURCE_MEM,
};
```

As I have mentioned before, `request_region` is used to register I/O port regions and this macro is used in many places in the kernel. For example let's look at drivers/char/rtc.c. This source code file provides the Real Time Clock interface in the linux kernel. As every kernel module, the `rtc` module contains a `module_init` definition:

```C
module_init(rtc_init);
```

where `rtc_init` is the `rtc` initialization function. This function is defined in the same rtc.c source code file. In the `rtc_init` function we can see a couple of calls to the `rtc_request_region` functions, which wrap `request_region`, for example:

```C
r = rtc_request_region(RTC_IO_EXTENT);
```

where `rtc_request_region` calls:

```C
r = request_region(RTC_PORT(0), size, "rtc");
```

Here `RTC_IO_EXTENT` is the size of the memory region and it is `0x8`, `"rtc"` is the name of the region and `RTC_PORT` is:

```C
#define RTC_PORT(x)	(0x70 + (x))
```

So with the `request_region(RTC_PORT(0), size, "rtc")` we register a memory region that starts at `0x70` and has a size of `0x8`. Let's look at `/proc/ioports`:

```
~$ sudo cat /proc/ioports | grep rtc
0070-0077 : rtc0
```

So, we got it! Ok, that was it for the I/O ports. The second way to communicate with drivers is through the use of I/O memory. As I have mentioned above, this works by mapping the control registers and the memory of a device to the memory address space. I/O memory is a set of contiguous addresses which are provided by a device to the CPU through a bus. None of the memory-mapped I/O addresses are used by the kernel directly. There is a special `ioremap` function which allows us to convert the physical address on a bus to a kernel virtual address. In other words, `ioremap` maps I/O physical memory regions to make them accessible from the kernel.
The `ioremap` function takes two parameters:

* start of the memory region;
* size of the memory region.

The I/O memory mapping API provides functions to check, request and release memory regions as I/O memory. There are three functions for that:

* `request_mem_region`
* `release_mem_region`
* `check_mem_region`

The registered I/O memory regions can be seen in `/proc/iomem`:

```
~$ sudo cat /proc/iomem
...
be826000-be82cfff : ACPI Non-volatile Storage
be82d000-bf744fff : System RAM
bf745000-bfff4fff : reserved
bfff5000-dc041fff : System RAM
dc042000-dc0d2fff : reserved
dc0d3000-dc138fff : System RAM
dc139000-dc27dfff : ACPI Non-volatile Storage
dc27e000-deffefff : reserved
defff000-deffffff : System RAM
df000000-dfffffff : RAM buffer
e0000000-feafffff : PCI Bus 0000:00
  e0000000-efffffff : PCI Bus 0000:01
    e0000000-efffffff : 0000:01:00.0
  f7c00000-f7cfffff : PCI Bus 0000:06
    f7c00000-f7c0ffff : 0000:06:00.0
    f7c10000-f7c101ff : 0000:06:00.0
      f7c10000-f7c101ff : ahci
  f7d00000-f7dfffff : PCI Bus 0000:03
    f7d00000-f7d3ffff : 0000:03:00.0
      f7d00000-f7d3ffff : alx
...
```

Part of these addresses are from the call of the `e820_reserve_resources` function. We can find a call to this function in arch/x86/kernel/setup.c and the function itself is defined in arch/x86/kernel/e820.c. `e820_reserve_resources` goes through the e820 map and inserts memory regions into the root `iomem` resource structure. All e820 memory regions which are inserted into the `iomem` resource have the following types:

```C
static inline const char *e820_type_to_string(int e820_type)
{
	switch (e820_type) {
	case E820_RESERVED_KERN:
	case E820_RAM:		return "System RAM";
	case E820_ACPI:		return "ACPI Tables";
	case E820_NVS:		return "ACPI Non-volatile Storage";
	case E820_UNUSABLE:	return "Unusable memory";
	default:		return "reserved";
	}
}
```

and we can see them in `/proc/iomem` (read above).

Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth part about linux kernel initialization.
If you have read that part, you can remember the call of the `early_ioremap_init` function from arch/x86/mm/ioremap.c. Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available, and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider the early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on a page middle directory boundary:

```C
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
```

More about `BUILD_BUG_ON` you can read in the first part about Linux Kernel initialization. So the `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the `early_ioremap_setup` function from mm/early_ioremap.c. This function presents generic initialization of the `ioremap`. The `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BITMAP_BEGIN` (top) and end with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:

```C
#define NR_FIX_BTMAPS		64
#define FIX_BTMAPS_SLOTS	8
#define TOTAL_FIX_BTMAPS	(NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
```

and the `early_ioremap_setup`:

```C
void __init early_ioremap_setup(void)
{
	int i;

	for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
		if (WARN_ON(prev_map[i]))
			break;

	for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
		slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}
```

After this, `early_ioremap_init` populates the boot-time page table entries (the `bm_pte` array) for the early fixmap area with the `pmd_populate_kernel` function:

```C
static inline void pmd_populate_kernel(struct mm_struct *mm,
				       pmd_t *pmd, pte_t *pte)
{
	paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
}
```

where `set_pmd` is:

```C
#define set_pmd(pmdp, pmd)	native_set_pmd(pmdp, pmd)
```

and `native_set_pmd` is:

```C
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
	*pmdp = pmd;
}
```

That's all. Early `ioremap` is ready to use. There are a couple of checks in the `early_ioremap_init` function, but they are not so important; anyway, initialization of the `ioremap` is finished.

## Use of early ioremap

As soon as early `ioremap` has been set up successfully, we can use it.
It provides two functions:

* `early_ioremap`
* `early_iounmap`

for mapping/unmapping of I/O physical addresses to virtual addresses. Both functions depend on the `CONFIG_MMU` configuration option. The Memory management unit is a special block of memory management whose main purpose is the translation of physical addresses to virtual addresses. The memory management unit knows the high-level page table address (`pgd`) from the `cr3` control register. If the `CONFIG_MMU` option is set to `n`, `early_ioremap` just returns the given physical address and `early_iounmap` does nothing. If the `CONFIG_MMU` option is set to `y`, `early_ioremap` calls `__early_ioremap` which takes three parameters:

* `phys_addr` - base physical address of the I/O memory region to map on virtual addresses;
* `size` - size of the I/O memory region;
* `prot` - page table entry bits.

First of all in the `__early_ioremap`, we go through all early ioremap fixmap slots and search for the first free one in the `prev_map` array. When we find it we remember its number in the `slot` variable and set up the size:

```C
slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
	if (!prev_map[i]) {
		slot = i;
		break;
	}
}
...
prev_size[slot] = size;
last_addr = phys_addr + size - 1;
```

In the next step we remember the in-page offset of the given physical address, align the address down to the page boundary and recalculate the size so that the mapping covers whole pages. After this we get the number of pages which need to be mapped and the index of the first fixmap slot entry:

```C
nrpages = size >> PAGE_SHIFT;
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
```

Now we can fill the fix-mapped area with the given physical addresses.
On every iteration in the loop, we call the `__early_set_fixmap` function from arch/x86/mm/ioremap.c, increase the given physical address by the page size, which is `4096` bytes, and update the index and the number of pages:

```C
while (nrpages > 0) {
	__early_set_fixmap(idx, phys_addr, prot);
	phys_addr += PAGE_SIZE;
	--idx;
	--nrpages;
}
```

The `__early_set_fixmap` function gets the page table entry (stored in the `bm_pte`, see above) for the given physical address with:

```C
pte = early_ioremap_pte(addr);
```

In the next step of `__early_set_fixmap` we check the given page flags with the `pgprot_val` macro and call `set_pte` or `pte_clear` depending on the flags given:

```C
if (pgprot_val(flags))
	set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
	pte_clear(&init_mm, addr, pte);
```

As you can see above, we passed `FIXMAP_PAGE_IO` as flags to the `__early_ioremap`. `FIXMAP_PAGE_IO` expands to:

```C
(__PAGE_KERNEL_EXEC | _PAGE_NX)
```

flags, so we call the `set_pte` function to set the page table entry, which works in the same manner as `set_pmd` but for PTEs (read above about it). As we have set all `PTEs` in the loop, we can now take a look at the call of the `__flush_tlb_one` function:

```C
__flush_tlb_one(addr);
```

This function is defined in arch/x86/include/asm/tlbflush.h and calls `__flush_tlb_single` or `__flush_tlb` depending on the value of `cpu_has_invlpg`:

```C
static inline void __flush_tlb_one(unsigned long addr)
{
	if (cpu_has_invlpg)
		__flush_tlb_single(addr);
	else
		__flush_tlb();
}
```

The `__flush_tlb_one` function invalidates the given address in the TLB. As you just saw, we updated the paging structure, but the TLB is not informed of the changes; that's why we need to do it manually. There are two ways to do it. The first is to update the `cr3` control register and the `__flush_tlb` function does this:

```C
native_write_cr3(__native_read_cr3());
```

The second method is to use the `invlpg` instruction to invalidate the TLB entry. Let's look at the `__flush_tlb_one` implementation.
As you can see, first of all the function checks `cpu_has_invlpg`, which is defined as:

```C
#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg		1
#else
# define cpu_has_invlpg		(boot_cpu_data.x86 > 3)
#endif
```

If a CPU supports the `invlpg` instruction, we call the `__flush_tlb_single` macro which expands to the call of `__native_flush_tlb_single`:

```C
static inline void __native_flush_tlb_single(unsigned long addr)
{
	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
```

or we call `__flush_tlb` which just updates the `cr3` register as we have seen. After this step execution of the `__early_set_fixmap` function is finished and we can go back to the `__early_ioremap` implementation. When we have set up the fixmap area for the given address, we need to save the base virtual address of the I/O re-mapped area in the `prev_map` array using the `slot` index:

```C
prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
```

and return it.

The second function, `early_iounmap`, unmaps an I/O memory region. This function takes two parameters: base address and size of an I/O region, and generally looks very similar to `early_ioremap`. It also goes through fixmap slots and looks for a slot with the given address. After that, it gets the index of the fixmap slot and calls `__late_clear_fixmap` or `__early_set_fixmap` depending on the `after_paging_init` value. It calls `__early_set_fixmap` with one difference to how `early_ioremap` does it: `early_iounmap` passes `zero` as the physical address. And in the end it sets the address of the I/O memory region to `NULL`:

```C
prev_map[slot] = NULL;
```

That's all about `fixmaps` and `ioremap`. Of course this part does not cover all features of `ioremap`, only early ioremap; there is also normal ioremap. But we need to know more things before we study that in more detail.

So, this is the end!

## Conclusion

This is the end of the second part about linux kernel memory management.
If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.

## Links

* apic
* vsyscall
* Intel Trusted Execution Technology
* Xen
* Real Time Clock
* e820
* Memory management unit
* TLB
* Paging
* Linux kernel memory management Part 1.

# Linux kernel memory management Part 3.

## Introduction to the kmemcheck in the Linux kernel

This is the third part of the chapter which describes memory management in the Linux kernel, and in the previous part of this chapter we met two memory management related concepts:

* Fix-Mapped Addresses;
* ioremap.

The first concept represents a special area in virtual memory, whose corresponding physical mapping is calculated at compile-time. The second concept provides the ability to map input/output related memory to virtual memory.

For example if you look at the output of `/proc/iomem`:

```
$ sudo cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
...
```

you will see a map of the system's memory for each physical device. Here the first column displays the memory registers used by each of the different types of memory. The second column lists the kind of memory located within those registers.
Or for example:

```
$ sudo cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
  03c0-03df : vga+
  03f8-03ff : serial
  04d0-04d1 : pnp 00:06
  0800-087f : pnp 00:01
  0a00-0a0f : pnp 00:04
  0a20-0a2f : pnp 00:04
  0a30-0a3f : pnp 00:04
...
```

can show us lists of currently registered port regions used for input or output communication with a device. None of the memory-mapped I/O addresses are used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only early `ioremap` in the previous part. Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like different types of memory allocators and so on, because otherwise it would be very difficult to understand it.

So, before we move on to the non-early memory management of the Linux kernel, we will see some mechanisms which provide special abilities for debugging, checking for memory leaks, memory control and so on. It will be easier to understand how memory management is arranged in the Linux kernel after learning all of these things.

As you may already guess from the title of this part, we will start to consider memory mechanisms from the kmemcheck. As we always did in other chapters, we will start from the theoretical side and learn what the `kmemcheck` mechanism is in general, and only after this will we see how it is implemented in the Linux kernel.

So let's start. What is `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, `kmemcheck` checks memory. That's true. The main point of the `kmemcheck` mechanism is to check that some kernel code does not access `uninitialized memory`.
Let's take the following simple C program:

```C
#include <stdlib.h>
#include <stdio.h>

struct A {
	int a;
};

int main(int argc, char **argv) {
	struct A *a = malloc(sizeof(struct A));
	printf("a->a = %d\n", a->a);
	return 0;
}
```

Here we allocate memory for the `A` structure and try to print the value of the `a` field. If we compile this program without additional options:

```
gcc test.c -o test
```

the compiler will not show us a warning that the `a` field is not initialized. But if we run this program with the valgrind tool, we will see the following output:

```
~$ valgrind --leak-check=yes ./test
==28469== Memcheck, a memory error detector
==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==28469== Command: ./test
==28469==
==28469== Conditional jump or move depends on uninitialised value(s)
==28469==    at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
==28469==
==28469== Use of uninitialised value of size 8
==28469==    at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
...
```

Actually the `kmemcheck` mechanism does the same for the kernel as what `valgrind` does for userspace programs. It checks uninitialized memory.

To enable this mechanism in the Linux kernel, you need to enable the `CONFIG_KMEMCHECK` kernel configuration option in the:

```
Kernel hacking
  -> Memory Debugging
```

menu of the Linux kernel configuration.

We may not only enable support of the `kmemcheck` mechanism in the Linux kernel, but it also provides some configuration options for us. We will see all of these options in the next paragraph of this part. One last note before we consider how `kmemcheck` checks memory: this mechanism is implemented only for the x86_64 architecture.
You can be sure of this if you look in the arch/x86/Kconfig x86-related kernel configuration file, where you will see the following lines:

```
config X86
    ...
    select HAVE_ARCH_KMEMCHECK
    ...
```

So, there is nothing specific for other architectures.

Ok, so we know that `kmemcheck` provides a mechanism to check the usage of `uninitialized memory` in the Linux kernel and how to enable it. How does it do these checks? When the Linux kernel tries to allocate some memory, i.e. something is called like this:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

or in other words somebody wants to access a page, a page fault exception is generated. This is achieved by the fact that `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to Paging). If a `page fault` exception occurs, the exception handler knows about it, and in the case when `kmemcheck` is enabled it transfers control to it. After `kmemcheck` finishes its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is a little subtlety in this chain. When the first instruction of the interrupted code is executed, `kmemcheck` marks the page as `non-present` again. In this way the next access to memory will be caught again.

We just considered the `kmemcheck` mechanism from the theoretical side. Now let's consider how it is implemented in the Linux kernel.

## Implementation of the kmemcheck mechanism in the Linux kernel

So, now we know what `kmemcheck` is and what it does in the Linux kernel. Time to see its implementation in the Linux kernel. The implementation of `kmemcheck` is split in two parts. The first is the generic part, located in the mm/kmemcheck.c source code file, and the second, x86_64 architecture-specific, part is located in the arch/x86/mm/kmemcheck directory.

Let's start from the initialization of this mechanism.
We already know that to enable the `kmemcheck` mechanism in the Linux kernel, we must enable the `CONFIG_KMEMCHECK` kernel configuration option. But besides this, we need to pass one of the following parameters:

* kmemcheck=0 (disabled)
* kmemcheck=1 (enabled)
* kmemcheck=2 (one-shot mode)

to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches `kmemcheck` into a special mode when it will be turned off after detecting the first use of uninitialized memory. Actually this mode is enabled by default in the Linux kernel.

We know from the seventh part of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in the `do_initcall_level`, `do_early_param` functions. Actually the `kmemcheck` subsystem consists of two stages. The first stage is early. If we look at the mm/kmemcheck.c source code file, we will see the `param_kmemcheck` function which is called during early command line parsing:

```C
static int __init param_kmemcheck(char *str)
{
	int val;
	int ret;

	if (!str)
		return -EINVAL;

	ret = kstrtoint(str, 0, &val);
	if (ret)
		return ret;
	kmemcheck_enabled = val;
	return 0;
}

early_param("kmemcheck", param_kmemcheck);
```

As we already saw, the `param_kmemcheck` may get one of the following values: `0` (disabled), `1` (enabled) or `2` (one-shot). The implementation of the `param_kmemcheck` is pretty simple. We just convert the string value of the `kmemcheck` command line option to an integer representation and set it to the `kmemcheck_enabled` variable.

The second stage will be executed during initialization of the Linux kernel, rather during initialization of early initcalls.
The second stage is represented by the `kmemcheck_init` function:

```C
int __init kmemcheck_init(void)
{
	...
}

early_initcall(kmemcheck_init);
```

The main goal of the `kmemcheck_init` function is to call the `kmemcheck_selftest` function and check its result:

```C
if (!kmemcheck_selftest()) {
	printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
	kmemcheck_enabled = 0;
	return -EINVAL;
}

printk(KERN_INFO "kmemcheck: Initialized\n");
```

and return with `-EINVAL` if this check fails. The `kmemcheck_selftest` function checks the sizes of different memory access related opcodes like `rep movsb`, `movzwq` and so on. If the sizes of the opcodes are equal to the expected sizes, `kmemcheck_selftest` will return `true` and `false` otherwise.

So when somebody calls:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

through a series of different function calls the `kmem_getpages` function will be called. This function is defined in the mm/slab.c source code file and its main goal is to try to allocate pages with the given flags. In the end of this function we can see the following code:

```C
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
	kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);

	if (cachep->ctor)
		kmemcheck_mark_uninitialized_pages(page, nr_pages);
	else
		kmemcheck_mark_unallocated_pages(page, nr_pages);
}
```

So, here we check that if `kmemcheck` is enabled and the `SLAB_NOTRACK` bit is not set in the flags, we set the `non-present` bit for the just allocated page. The `SLAB_NOTRACK` bit tells us not to track uninitialized memory. Additionally, if a cache object has a constructor (details will be considered in next parts), we mark the allocated page as uninitialized, or as unallocated otherwise.
The `kmemcheck_alloc_shadow` function is defined in the mm/kmemcheck.c source code file and does the following things:

```C
void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
{
	struct page *shadow;

	shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);

	for(i = 0; i < pages; ++i)
		page[i].shadow = page_address(&shadow[i]);

	kmemcheck_hide_pages(page, pages);
}
```

First of all it allocates memory space for the shadow bits. If this bit is set in a page, this means that this page is tracked by `kmemcheck`. After we have allocated space for the shadow bit, we fill all allocated pages with this bit. In the end we just call the `kmemcheck_hide_pages` function with the pointer to the allocated page and the number of these pages. `kmemcheck_hide_pages` is an architecture-specific function, so its implementation is located in the arch/x86/mm/kmemcheck/kmemcheck.c source code file. The main goal of this function is to set the `non-present` bit in the given pages. Let's look at the implementation of this function:

```C
void kmemcheck_hide_pages(struct page *p, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; ++i) {
		unsigned long address;
		pte_t *pte;
		unsigned int level;

		address = (unsigned long) page_address(&p[i]);
		pte = lookup_address(address, &level);
		BUG_ON(!pte);
		BUG_ON(level != PG_LEVEL_4K);

		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
		set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN));
		__flush_tlb_one(address);
	}
}
```

Here we go through all pages, get the page table entry for each of them, unset the `present` bit, set the `hidden` bit and flush the translation lookaside buffer. From this point, when some kernel code accesses such a hidden page, a `page fault` exception is generated and the `do_page_fault` handler asks whether it came from `kmemcheck` via the `kmemcheck_active` function:

```C
bool kmemcheck_active(struct pt_regs *regs)
{
	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
	return data->balance > 0;
}
```

The `kmemcheck_context` is a structure which describes the current state of the `kmemcheck` mechanism. It stores uninitialized addresses, the number of such addresses and so on. The `balance` field of this structure represents the current state of `kmemcheck`, or in other words it can tell us whether `kmemcheck` has already shown the hidden pages or not yet. If `data->balance` is greater than zero, the `kmemcheck_hide` function will be called. This means that `kmemcheck` has already set the `present` bit for the given pages and now we need to hide the pages again to cause the next page fault. This function hides the addresses of the pages again by unsetting the `present` bit. This means that one session of `kmemcheck` is already finished and a new page fault occurred. At the first step the `kmemcheck_active` will return false, as `data->balance` is zero, and the `kmemcheck_hide` will not be called.
Next, we may see the following line of code in `do_page_fault`:

```C
if (kmemcheck_fault(regs, address, error_code))
	return;
```

First of all the `kmemcheck_fault` function checks that the fault occurred for the correct reason. At first we check the flags register and check that we are in normal kernel mode:

```C
if (regs->flags & X86_VM_MASK)
	return false;
if (regs->cs != __KERNEL_CS)
	return false;
```

If these checks weren't successful, we return from the `kmemcheck_fault` function, as it was not a `kmemcheck` related page fault. After this we try to look up the `page table entry` related to the faulted address and if we can't find it we return:

```C
pte = kmemcheck_pte_lookup(address);
if (!pte)
	return false;
```

The last two steps of the `kmemcheck_fault` function are to call the `kmemcheck_access` function, which checks access to the given page, and to show the addresses again by setting the present bit in the given page. The `kmemcheck_access` function does all the main job. It checks the current instruction which caused the page fault. If it finds an error, the context of this error will be saved by `kmemcheck` to the ring queue:

```C
static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
```

The `kmemcheck` mechanism declares a special tasklet:

```C
static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
```

which runs the `do_wakeup` function from the arch/x86/mm/kmemcheck/error.c source code file when it is scheduled to run.

The `do_wakeup` function calls the `kmemcheck_error_recall` function, which prints the errors collected by `kmemcheck`. As we already saw, the:

```C
kmemcheck_show(regs);
```

function is called in the end of the `kmemcheck_fault` function.
This function will set the present bit for the given pages again:

```C
if (unlikely(data->balance != 0)) {
	kmemcheck_show_all();
	kmemcheck_error_save_bug(regs);
	data->balance = 0;
	return;
}
```

Where the `kmemcheck_show_all` function calls the `kmemcheck_show_addr` for each address:

```C
static unsigned int kmemcheck_show_all(void)
{
	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
	unsigned int i;
	unsigned int n;

	n = 0;
	for (i = 0; i < data->n_addrs; ++i)
		n += kmemcheck_show_addr(data->addr[i]);

	return n;
}
```

by the call of the `kmemcheck_show_addr`:

```C
int kmemcheck_show_addr(unsigned long address)
{
	pte_t *pte;

	pte = kmemcheck_pte_lookup(address);
	if (!pte)
		return 0;

	set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
	__flush_tlb_one(address);
	return 1;
}
```

In the end of the `kmemcheck_show` function we set the TF flag if it wasn't set:

```C
if (!(regs->flags & X86_EFLAGS_TF))
	data->flags = regs->flags;
```

We need to do this because we need to hide the pages again after the first instruction executed after the page fault is handled. In the case when the `TF` flag is set, the processor will switch into single-step mode after the first instruction is executed. In this case a `debug` exception will occur. From this moment the pages will be hidden again and execution will be continued. As the pages are hidden from this moment, a page fault exception will occur again and `kmemcheck` will continue to check/collect errors and print them from time to time.

That's all.

## Conclusion

This is the end of the third part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see yet another memory debugging related tool - `kmemleak`. Please note that English is not my first language and I am really sorry for any inconvenience.
If you found any mistakes please send me a PR to linux-insides.

## Links

* memory management
* debugging
* memory leaks
* kmemcheck documentation
* valgrind
* Paging
* page fault
* initcalls
* opcode
* translation lookaside buffer
* per-cpu variables
* flags register
* tasklet
* Previous part

# Cgroups

This chapter describes the `control groups` mechanism in the Linux kernel.

# Control Groups

## Introduction

This is the first part of the new chapter of the linux insides book and, as you may guess by the part's name, this part will cover control groups, or the `cgroups` mechanism in the Linux kernel.

`Cgroups` are a special mechanism provided by the Linux kernel which allows us to allocate kinds of `resources` like processor time, number of processes per group, amount of memory per control group, or a combination of such resources, for a process or a set of processes. `Cgroups` are organized hierarchically, and here this mechanism is similar to usual processes, as they are hierarchical too and child `cgroups` inherit a set of certain parameters from their parents. But actually they are not the same. The main difference between `cgroups` and normal processes is that many different hierarchies of control groups may exist simultaneously at one time, while the normal process tree is always single. This was not a casual step, because each control group hierarchy is attached to a set of control group `subsystems`.

One `control group subsystem` represents one kind of resource, like processor time or number of pids, or in other words number of processes, for a `control group`. The Linux kernel provides support for the following twelve `control group` subsystems:
The Linux kernel provides support for the following twelve `control group subsystems`:

* `cpuset` - assigns individual processor(s) and memory nodes to task(s) in a group;
* `cpu` - uses the scheduler to provide cgroup tasks access to the processor resources;
* `cpuacct` - generates reports about processor usage by a group;
* `io` - sets limits on read/write from/to block devices;
* `memory` - sets limits on memory usage by task(s) from a group;
* `devices` - allows access to devices by task(s) from a group;
* `freezer` - allows suspending/resuming task(s) from a group;
* `net_cls` - allows marking of network packets from task(s) from a group;
* `net_prio` - provides a way to dynamically set the priority of network traffic per network interface for a group;
* `perf_event` - provides access to perf events to a group;
* `hugetlb` - activates support for huge pages for a group;
* `pids` - sets a limit on the number of processes in a group.

Each of these control group subsystems depends on a related configuration option. For example, the `cpuset` subsystem should be enabled via the `CONFIG_CPUSETS` kernel configuration option, the `io` subsystem via the `CONFIG_BLK_CGROUP` kernel configuration option, and so on.
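These configuration dependencies can be checked against a kernel build's `.config`. The sketch below is illustrative: the sample config lines are made up for the example, while on a live system the config is usually exposed via `/proc/config.gz` or `/boot/config-$(uname -r)`.

```shell
# Sample .config fragment (illustrative values, not from a real build).
# On a live system: zcat /proc/config.gz  or  cat /boot/config-$(uname -r)
sample_config='CONFIG_CGROUPS=y
CONFIG_CPUSETS=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_DEBUG is not set'

# Keep only options that are built in (=y) and count them.
printf '%s\n' "$sample_config" | grep '=y$'
enabled=$(printf '%s\n' "$sample_config" | grep -c '=y$')
echo "enabled options: $enabled"
```

Options that are disabled appear as `# CONFIG_... is not set` lines, so filtering on the `=y` suffix keeps only what is actually built in.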
All of these kernel configuration options may be found in the `General setup → Control Group support` menu.

You may see the enabled control groups on your computer via the proc filesystem:

```
$ cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	8	1	1
cpu	7	66	1
cpuacct	7	66	1
blkio	11	66	1
memory	9	94	1
devices	6	66	1
freezer	2	1	1
net_cls	4	1	1
perf_event	3	1	1
net_prio	4	1	1
hugetlb	10	1	1
pids	5	69	1
```

or via sysfs:

```
$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x 5 root root  0 Dec  2 22:37 blkio
lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpuacct -> cpu,cpuacct
dr-xr-xr-x 5 root root  0 Dec  2 22:37 cpu,cpuacct
dr-xr-xr-x 2 root root  0 Dec  2 22:37 cpuset
dr-xr-xr-x 5 root root  0 Dec  2 22:37 devices
dr-xr-xr-x 2 root root  0 Dec  2 22:37 freezer
dr-xr-xr-x 2 root root  0 Dec  2 22:37 hugetlb
dr-xr-xr-x 5 root root  0 Dec  2 22:37 memory
lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_cls -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Dec  2 22:37 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_prio -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Dec  2 22:37 perf_event
dr-xr-xr-x 5 root root  0 Dec  2 22:37 pids
dr-xr-xr-x 5 root root  0 Dec  2 22:37 systemd
```

As you already may have guessed, the `control groups` mechanism was not invented only for the needs of the Linux kernel itself, but mostly for userspace needs. To use a `control group`, we should create it first. We may create a `cgroup` in two ways.

The first way is to create a subdirectory in any subsystem under `/sys/fs/cgroup` and add the pid of a task to the `tasks` file, which is created automatically right after we create the subdirectory.

The second way is to create/destroy/manage `cgroups` with the utils from the `libcgroup` library (`libcgroup-tools` in Fedora).

Let's consider a simple example. The following bash script will print a line to `/dev/tty`, the device which represents the control terminal for the current process:

```shell
#!/bin/bash

while :
do
    echo "print line" > /dev/tty
    sleep 5
done
```
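The "first way" of creating a cgroup can be sketched as follows. This is a dry run against a scratch directory: on a real system the root would be `/sys/fs/cgroup/<subsystem>`, the `tasks` file would be created by the kernel rather than by `touch`, writing to it would require root privileges, and the group name `cgroup_test_group` is just an example.

```shell
# Dry-run sketch of creating a cgroup via the cgroup filesystem.
# A temporary directory stands in for /sys/fs/cgroup/devices here,
# so the script can run without root and without touching real cgroups.
cgroup_root=$(mktemp -d)                      # stand-in for /sys/fs/cgroup/devices
mkdir "$cgroup_root/cgroup_test_group"        # on a real system this is the whole step
# the kernel creates the tasks file automatically; we mimic that for the dry run
touch "$cgroup_root/cgroup_test_group/tasks"
# attach the current shell to the group by writing its pid
echo $$ > "$cgroup_root/cgroup_test_group/tasks"
cat "$cgroup_root/cgroup_test_group/tasks"
```

On a real system, once a pid is written to the group's `tasks` file, any limits written to that group's control files apply to the process and to children it forks afterwards.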
