
Getting Started with AT&T Syntax under Linux (GNU as Assembly Syntax)

In all my years of work, I have studied and worked mostly at the level of the C language. Over time I accumulated many doubts about C whose answers are hard to find in books and reference material. Programmers are people who pursue perfection, and even a small black hole in their understanding makes them restless. Not long ago, on the itput forum, I came across the classic book "Computer Systems: A Programmer's Perspective" (hereinafter CS.APP) and read it in the evenings to resolve my doubts. Although the book did not answer all of them directly, it pointed me down a road of no return: it opened the door to assembly.

Assembly language is very close to machine language, and the correspondence between its statements and machine instructions is simple and clear. Opening the door to assembly not only resolves the doubts that high-level languages leave you with, it also lets you better understand how modern computers and operating systems work. More importantly, it gives you confidence: it reduces the fear of standing on top of a shaky tower, and answers Hou Jie's call, "Do not build a high platform on floating sand". The purpose of learning assembly today is quite different from before. As CS.APP puts it, the demands on programmers have changed over time, from being able to write programs directly in assembly to being able to read and understand the code generated by optimizing compilers. Being able to read and understand is precisely my need and my goal.

I had some contact with assembly at university, mainly Microsoft's MASM macro assembler, but at the time my understanding was shallow and my attitude was not serious, so I missed a good learning opportunity. Nowadays I mostly work with GCC on Unix-like platforms, so the natural choice of assembly language is GNU assembly, the same syntax that CS.APP uses. Since my main purpose in learning assembly is "deconstruction", most of what follows takes the form of a comparison between C code and assembly code.

1, Assembly lets you see more

As the level of the language you use rises, the computer in your eyes becomes more and more blurred, and your attention shifts toward the language itself and toward the other end, the "problem domain". For example, through Java you mostly see its virtual machine rather than the real computer; through C you see down to the memory layer; with assembly language you can go all the way down to the register layer and move about freely. The "unique scenery" available to assembly programmers includes:

a) The program counter (%eip) -- a special register that always holds the address of the next instruction to be executed.
b) Integer registers -- eight in total: %eax, %ebx, %ecx, %edx, %esi, %edi, %esp, and %ebp. They can hold integer data, hold addresses, and record program state. In the early days each register had a special purpose. Now that platforms such as Linux use "flat addressing" [1], the special roles of the registers are much less pronounced.
c) Condition flag register -- holds status information about the most recently executed arithmetic instruction, and is used to implement conditional changes in control flow.
d) Floating-point registers -- as the name implies, used to hold floating-point numbers.

Although the registers are less specialized than they used to be, compilers still follow certain conventions when using them; these are discussed later.

2, A first glimpse of assembly

Here is a simple C function:

    void dummy()
    {
        int a = 1234;
        int b = a;
    }

We use gcc with the -S option to convert it into assembly code as follows (some parts omitted):

    movl $1234, -4(%ebp)
    movl -4(%ebp), %eax
    movl %eax, -8(%ebp)

Looking at it again, we still cannot read it; we only notice a few familiar things, such as the %ebp and %eax mentioned above. This is just an introduction, meant to give us an intuitive feel for the "face" of assembly. Let's look a little more closely. At first glance, the assembly statements all look very similar. Indeed, assembly code is a collection of statements of the form "instruction + operands". The set of assembly instructions is fixed and each instruction has its own fixed purpose, while operands can be expressed in several ways.

1) Operand notation

Most assembly instructions have one or more operands, which specify the source and the destination of the operation. A standard instruction has roughly the form "instruction source-operand, destination-operand". The source operand can be an immediate, a value read from a register, or a value read from memory; the destination operand can be either a register or a memory location. Following this classification, there are roughly three kinds of operand:

a) Immediate notation -- for example, the "$1234" in "movl $1234, -4(%ebp)" is an immediate used as an operand. In GNU assembly syntax an immediate is written as "$" followed by an integer. Immediates usually represent constants in the code, such as "$1234" in the example above. Note that an immediate cannot be used as a destination operand.
b) Register notation -- this one is relatively simple: the operand is the contents of the register. In "movl -4(%ebp), %eax" above, %eax is a register operand used as the destination, while in "movl %eax, -8(%ebp)" %eax is a register operand used as the source.
c) Memory reference notation -- the value computed from this operand is a memory address, and the instruction accesses the memory location at that address. In "movl -4(%ebp), %eax" above, "-4(%ebp)" refers to the memory at the address (contents of %ebp) - 4.
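The three operand kinds can be made concrete with a small C model that replays the three movl instructions generated for dummy(). This is only a sketch under stated assumptions: the Machine struct, its 16-byte memory array, and the helper function names are hypothetical and invented for illustration. Here %ebp is modeled as a byte offset into the array rather than a real address.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a tiny stack frame (illustration only). */
typedef struct {
    uint8_t memory[16]; /* stand-in for the stack memory */
    int     ebp;        /* frame base, modeled as an offset into memory[] */
    int32_t eax;        /* models the %eax register */
} Machine;

/* movl $imm, disp(%ebp): immediate source, memory destination.
   The effective address is (contents of ebp) + disp. */
void movl_imm_to_mem(Machine *m, int32_t imm, int disp) {
    memcpy(&m->memory[m->ebp + disp], &imm, sizeof imm);
}

/* movl disp(%ebp), %eax: memory source, register destination. */
void movl_mem_to_eax(Machine *m, int disp) {
    memcpy(&m->eax, &m->memory[m->ebp + disp], sizeof m->eax);
}

/* movl %eax, disp(%ebp): register source, memory destination. */
void movl_eax_to_mem(Machine *m, int disp) {
    memcpy(&m->memory[m->ebp + disp], &m->eax, sizeof m->eax);
}

/* Helper for inspection: read the 4-byte value stored at disp(%ebp). */
int32_t load32(const Machine *m, int disp) {
    int32_t value;
    memcpy(&value, &m->memory[m->ebp + disp], sizeof value);
    return value;
}

/* Replays the three instructions shown for dummy(). */
Machine run_dummy(void) {
    Machine m;
    memset(&m, 0, sizeof m);
    m.ebp = 16;                     /* frame base at the top of memory[] */
    movl_imm_to_mem(&m, 1234, -4);  /* movl $1234, -4(%ebp)  ; a = 1234 */
    movl_mem_to_eax(&m, -4);        /* movl -4(%ebp), %eax   ; eax = a  */
    movl_eax_to_mem(&m, -8);        /* movl %eax, -8(%ebp)   ; b = eax  */
    return m;
}
```

After run_dummy() returns, load32(&m, -4) and load32(&m, -8) both read back 1234, mirroring the variables a and b. Note that the copy from a to b passes through the register because a single mov cannot take two memory operands.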

2) Data transfer instructions

The most commonly used instruction in assembly language, the data transfer instruction, is the first kind of assembly instruction we meet. Its format is "mov source-operand, destination-operand". The mov family supports transfers ranging from a single byte up to a double word: movb transfers one byte, movw transfers two bytes (one word), and movl transfers four bytes (a double word). These are not covered in detail here. In addition, the mov family provides two instructions that extend a byte while moving it: movsbl (sign-extend a byte to a double word) and movzbl (zero-extend a byte to a double word).
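The difference between the two extending moves is easiest to see on a byte whose top bit is set. The following C sketch models what movsbl and movzbl compute. The function names are hypothetical, and the code reproduces the arithmetic only, not the instructions themselves.

```c
#include <stdint.h>

/* Models movsbl: copy a byte into a 32-bit value, filling the upper
   24 bits with copies of the byte's sign bit (sign extension). */
int32_t movsbl_model(uint8_t byte) {
    /* Bytes 0x80..0xFF represent negative values in two's complement. */
    return byte >= 0x80 ? (int32_t)byte - 256 : (int32_t)byte;
}

/* Models movzbl: copy a byte into a 32-bit value, filling the upper
   24 bits with zeros (zero extension). */
uint32_t movzbl_model(uint8_t byte) {
    return (uint32_t)byte;
}
```

For the byte 0xF0, movsbl_model gives -16 (bit pattern 0xFFFFFFF0), while movzbl_model gives 240 (bit pattern 0x000000F0). For bytes below 0x80 the two results are identical.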

============================

As a programming language closely tied to the hardware platform, assembly language plays an important role in fields such as operating systems and embedded development. Because assembly depends on the hardware architecture (the CPU's instruction encoding), assembly languages differ considerably between architectures. This article briefly introduces the AT&T syntax used under Linux (that is, the GNU as assembly syntax) and the basic methods of writing assembly under Linux.

The AT&T syntax originated at AT&T Bell Labs and grew out of the processor opcode syntax used when implementing Unix systems. The main differences between AT&T syntax and Intel syntax are as follows:

a) AT&T prefixes immediates with $, Intel does not. Decimal 2 is written $2 in AT&T syntax and 2 in Intel syntax.
b) AT&T prefixes register names with %, so the eax register is written %eax.
c) AT&T orders operands the opposite way from Intel: source first, destination second. For example, movl %eax, %ebx copies the value in eax to ebx, whereas Intel writes mov ebx, eax.
d) AT&T appends a single character to the mnemonic to indicate the size of the data being operated on (b for byte, w for word, l for double word). For example, movl foo, %eax is equivalent to Intel's mov eax, dword ptr foo.
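GCC's inline assembly uses AT&T syntax by default, so these conventions can be tried from within a C file. The sketch below is a minimal illustration; the function names are hypothetical. It assumes a GCC-compatible compiler targeting x86 or x86-64, and falls back to plain C on any other platform so that the file still compiles.

```c
/* Copies src to the result using a single AT&T-syntax movl.
   Operand order is "source, destination", and the l suffix marks
   a 32-bit (double word) transfer. */
int copy_source_to_destination(int src) {
    int dst;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* %1 is the input register, %0 the output register. */
    __asm__ ("movl %1, %0" : "=r"(dst) : "r"(src));
#else
    dst = src; /* fallback for non-x86 targets (assumption) */
#endif
    return dst;
}

/* Loads the immediate 2: AT&T writes it as $2, Intel would write 2. */
int load_immediate_two(void) {
    int dst;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("movl $2, %0" : "=r"(dst));
#else
    dst = 2; /* fallback for non-x86 targets (assumption) */
#endif
    return dst;
}
```

Compiling this file with gcc -S shows the instructions with concrete register names such as %eax substituted for the %0 and %1 placeholders.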