Arm Architecture
Introduction
About ARM Architecture
- RISC Architecture
- Load / Store Architecture
- Uniform , fixed-length instructions
- Most data-processing instructions control both the ALU and the shifter .
- Auto-increment and auto-decrement addressing modes to optimize loops .
- Conditional execution of almost every instruction
- Endianness ( Bi-Endian )
Registers
- Total 37 registers
- 30 general-purpose registers ( counting the banked copies ; r0 .. r14 are the 15 visible at any one time )
- 1 PC
- Status registers : 1 CPSR ( Current Program Status Register ) and 5 SPSR ( Saved Program Status Register )
- r0..r3 : hold the arguments to a subroutine
- r4..r10 : general-purpose registers
- r11 : Frame Pointer ( fp )
- r12 : Intra-procedure-call scratch register ( ip )
- r13 : Stack Pointer ( sp )
- r14 : Link Register ( lr )
- r15 : Program Counter ( pc )
- CPSR : Status Register
Bits | Field | Meaning |
---|---|---|
31 | N | Negative |
30 | Z | Zero |
29 | C | Carry |
28 | V | Overflow |
27 | Q | Sticky saturation |
26:25 | RES | Reserved |
24 | J | Jazelle state |
23:20 | Reserved | |
19:16 | GE [ 3:0 ] | Greater than or Equal flags |
15:10 | Reserved | |
9 | E | Endianness |
8 | A | Asynchronous abort mask |
7 | I | IRQ mask |
6 | F | FIQ mask |
5 | T | Thumb state |
4:0 | M [ 4:0 ] | Processor mode |
- N : N = 1 if the result is negative ; N = 0 if the result is positive or zero
- Z : Z = 1 if the result is zero ; Z = 0 otherwise
- C :
  - C = 1 if an addition produces a carry , 0 otherwise
  - C = 0 if a subtraction produces a borrow , 1 otherwise
  - For instructions other than addition / subtraction that use a shift operation , C is set to the last bit shifted out of the value by the shifter
- V : For addition and subtraction , V = 1 if signed overflow occurred
- GE : Greater than or Equal flags
- E : Endianness
- T & J : select the current instruction set
J | T | Instruction Set |
---|---|---|
0 | 0 | ARM |
0 | 1 | Thumb |
1 | 0 | Jazelle |
1 | 1 | Reserved |
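The flag rules above can be checked with a small model. The following is only a sketch in Python, not ARM code: the function names `flags_after_add` and `flags_after_sub` are made up for illustration, and registers are modeled as 32-bit unsigned integers.

```python
MASK32 = 0xFFFFFFFF

def flags_after_add(a, b):
    """Return (result, N, Z, C, V) for a 32-bit add that updates the flags."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31                    # bit 31 is the sign bit
    z = int(result == 0)
    c = int(full > MASK32)              # carry out of bit 31
    # Signed overflow: both operands have a sign that the result does not have.
    v = int(((a ^ result) & (b ^ result)) >> 31)
    return result, n, z, c, v

def flags_after_sub(a, b):
    """Return (result, N, Z, C, V) for a 32-bit subtract that updates the flags.

    The subtraction is modeled as a + NOT(b) + 1, which is why C = 1 means
    "no borrow" and C = 0 means "borrow".
    """
    a &= MASK32
    b &= MASK32
    full = a + ((~b) & MASK32) + 1
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)
    # Signed overflow: operands differ in sign and the result's sign differs from a.
    v = int(((a ^ b) & (a ^ result)) >> 31)
    return result, n, z, c, v
```

For example , `flags_after_add(0x7FFFFFFF, 1)` sets N and V ( the signed result wrapped around ) , while `flags_after_sub(3, 5)` leaves C clear because the subtraction needed a borrow .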
Different States
ARM State
- Default state
- r0 .. r15 can be modified directly
- Instruction size : 32 bit ( 4 bytes )
Thumb State
- Instruction size : 16 bit ( 2 bytes ) , or 32 bit ( 4 bytes ) with Thumb-2
- pc can only be modified by specific instructions
- Thumb-2 state
  - Extended Thumb state with 32-bit instructions
Jazelle
- Allows a suitably equipped ARM processor to execute Java bytecode in hardware
PC Relative Addressing
- Used to address constants in the text region
- The CPU fetches two instructions ahead of the one it is executing
+-----------+
| execute | pc - 8
+-----------+
^
|
+-----------+
| decode | pc - 4
+-----------+
^
|
+-----------+
| fetch | pc
+-----------+
Therefore , the pc value that an instruction reads is higher than its own address : while an instruction executes , the next instruction has already been decoded and the one after that has been fetched , so pc holds the address of the instruction two ahead .
- 8 bytes in ARM state : pc = address of current instruction + 8
- 4 bytes in Thumb state : pc = address of current instruction + 4
  - the address is aligned to 4 bytes
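The pc arithmetic above can be written down as a short Python sketch. The helper names `pc_as_read` and `literal_address` are hypothetical , and the model assumes that a pc-relative load rounds the pc value down to a multiple of 4 before adding its offset .

```python
def pc_as_read(instr_addr, thumb=False):
    """Value an instruction sees when it reads pc: its own address + 8 (ARM) or + 4 (Thumb)."""
    return (instr_addr + (4 if thumb else 8)) & 0xFFFFFFFF

def literal_address(instr_addr, offset, thumb=False):
    """Address accessed by a pc-relative load such as `ldr r0, [pc, #offset]`."""
    base = pc_as_read(instr_addr, thumb) & ~3   # align the pc value to 4 bytes
    return (base + offset) & 0xFFFFFFFF
```

With these helpers , an ARM instruction at `0x8000` reads pc as `0x8008` , and a Thumb instruction at `0x8002` that loads from `[pc, #8]` accesses `0x800C` , because `0x8006` is first aligned down to `0x8004` .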
Instruction Set
Instruction Format
[ instruction ] [ condition ] [s] [ destination ] , [ source ] , [ other operands ... ]
- s : update the status register
- Almost every instruction can be made conditional
add r1 , r2 , #2 : r1 = r2 + 2
suble r1 , r2 , #3 : if less than or equal : r1 = r2 - 3
movs r1 , r2 : r1 = r2 , update the status register
Barrel Shifter
- Hardware optimization : the second operand can be shifted inline , which allows an operand to be multiplied or divided by a power of 2 within the same instruction
- LSL : Logical Shift Left
- LSR : Logical Shift Right
mov r7 , r5 , LSL #2 : r7 = r5 << 2
add r0 , r1 , r1 , LSL #1 : r0 = r1 + ( r1 << 1 )
- ROR : Rotate Right ; bits shifted off the right end are pushed back in on the left , and the last bit rotated out becomes the shifter's carry out
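As a rough illustration of the three shift types listed above , here is a Python sketch that models each one on a 32-bit value and also returns the carry out . The function names are invented for this example , and shift amounts are restricted to 1 .. 31 to keep the model simple .

```python
MASK32 = 0xFFFFFFFF

def lsl(value, n):
    """Logical shift left by n (1..31): returns (result, carry_out)."""
    assert 1 <= n <= 31
    value &= MASK32
    carry = (value >> (32 - n)) & 1          # last bit shifted out on the left
    return (value << n) & MASK32, carry

def lsr(value, n):
    """Logical shift right by n (1..31): returns (result, carry_out)."""
    assert 1 <= n <= 31
    value &= MASK32
    carry = (value >> (n - 1)) & 1           # last bit shifted out on the right
    return value >> n, carry

def ror(value, n):
    """Rotate right by n (1..31): returns (result, carry_out)."""
    assert 1 <= n <= 31
    value &= MASK32
    result = ((value >> n) | (value << (32 - n))) & MASK32
    return result, result >> 31              # the last rotated bit lands in bit 31

# add r0, r1, r1, LSL #1 computes r1 + (r1 << 1), i.e. r1 * 3
r1 = 7
r0 = (r1 + lsl(r1, 1)[0]) & MASK32
```

Here `r0` ends up as 21 , which shows how a shifted operand gives a multiplication by 3 in a single instruction .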
Load / Store
Unlike x86 , ARM cannot manipulate memory directly . One needs to load the data into a register , manipulate it , and then store it back to memory .
ldr r2 , [r1] : the value at the address in r1 is loaded into r2
add r2 , #1 : the value is incremented
str r2 , [r1] : the value in r2 is stored at the address in r1
Different Addressing Modes
These instructions have three primary addressing modes , which use a base_register and an offset specified by the instruction .
- Offset Addressing [ Rn , offset ]
The memory address is formed by adding or subtracting an offset to or from the base register .
ldr r2 , [r0, #8] : load the value at r0 + 8
str r2 , [r0, r1] : the value in r2 is stored at r0 + r1
- Pre-indexed Addressing [ Rn , offset ]!
The memory address is formed in the same way as for offset addressing . As a side effect , the memory address is also written back to the base register .
ldr r2 , [r0, #8]! : load the value at r0 + 8 and set r0 = r0 + 8 ( r0 is updated )
str r2 , [r0, r1]! : the value in r2 is stored at r0 + r1 and r0 = r0 + r1 ( r0 is updated )
- Post-indexed Addressing [ Rn ] , offset
The address is the base register value . As a side effect , an offset is added to or subtracted from the base register value and the result is written back to the base register .
ldr r2 , [r0], #8 : load the value at r0 , then set r0 = r0 + 8 ( r0 is updated after the operation )
str r2 , [r0], r1 : the value in r2 is stored at r0 , then r0 = r0 + r1
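The difference between the three modes comes down to which address is accessed and what happens to the base register afterwards . The Python sketch below models exactly that ; the function names and the tiny `memory` dictionary are assumptions made for the example .

```python
def offset_addr(base, offset):
    """[Rn, offset]: access base + offset, base register unchanged."""
    return base + offset, base

def pre_indexed(base, offset):
    """[Rn, offset]!: access base + offset, and write that address back to the base."""
    addr = base + offset
    return addr, addr

def post_indexed(base, offset):
    """[Rn], offset: access base, then write base + offset back to the base."""
    return base, base + offset

# Each helper returns (address accessed, new base register value).
# Walking an array of three words with post-indexed loads, as a loop would:
memory = {0x1000: 10, 0x1004: 20, 0x1008: 30}
r0, total = 0x1000, 0
for _ in range(3):
    addr, r0 = post_indexed(r0, 4)      # ldr r2, [r0], #4
    total += memory[addr]               # add r3, r3, r2
```

After the loop , `total` is 60 and `r0` has advanced to `0x100C` without any separate add instruction , which is why these modes are useful for optimizing loops .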
Load / Store Multiple
ldm and stm can be used to load and store multiple registers .
ldm r0, {r1,r2,r3} : r1 = [r0] , r2 = [r0+4] , r3 = [r0+8]
ldm r0!, {r1,r2,r3} : r1 = [r0] , r2 = [r0+4] , r3 = [r0+8] , r0 = r0 + 12
stm r0, {r1-r3} : [r0] = r1 , [r0+4] = r2 , [r0+8] = r3
stm r0!, {r1-r3} : [r0] = r1 , [r0+4] = r2 , [r0+8] = r3 , r0 = r0 + 12
There are 4 addressing modes , which decide how the address is incremented or decremented .
Mode | Description |
---|---|
IA | Increment After ( default ) |
IB | Increment Before |
DA | Decrement After |
DB | Decrement Before |
push and pop are aliases for stmdb sp! and ldmia sp! .
ldmib r0 , {r1,r2,r3} : r1 = [r0+4] , r2 = [r0+8] , r3 = [r0+12]
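To make the four modes concrete , the following Python sketch computes which addresses a block transfer touches and what the base register becomes with write-back ( the `!` form ) . `block_transfer_addresses` is a hypothetical helper ; it relies on the fact that the lowest-numbered register always goes to the lowest address .

```python
def block_transfer_addresses(base, count, mode="ia"):
    """Return (addresses in ascending register order, base after write-back).

    base  : value of the base register before the transfer
    count : number of registers in the register list
    mode  : "ia", "ib", "da" or "db"
    """
    size = 4 * count
    if mode == "ia":
        start, new_base = base, base + size
    elif mode == "ib":
        start, new_base = base + 4, base + size
    elif mode == "da":
        start, new_base = base - size + 4, base - size
    elif mode == "db":
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [start + 4 * i for i in range(count)], new_base
```

For instance , a push of two registers ( stmdb sp! ) with sp = `0x1000` stores to `0xFF8` and `0xFFC` and leaves sp at `0xFF8` ; the matching pop ( ldmia sp! ) reads the same two addresses and restores sp to `0x1000` .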
Load Immediate value
- ARM has a fixed instruction length of 32 bits
  - Including opcode and operands
- Only 12 bits are left for immediate values
- If bit 25 is set to 1 , the last 12 bits are handled as an immediate
Bits | 31:28 | 27:26 | 25 | 24:21 | 20 | 19:16 | 15:12 | 11:0 |
---|---|---|---|---|---|---|---|---|
Field | Cond | 0 0 | 1 | 0 0 0 0 | S | Rn | Rd | immediate |
- If bit 25 is set to 0 , the last 12 bits are handled as a register 2nd operand
- In order to make it possible to load values bigger than 4095 ( 12 bit ) , the 12-bit field is split
  - a = 8 bit value ( 0 to 255 )
  - b = 4 bit value ( used for rotate right )
  - immediate = a ror ( b << 1 )
Bits | 31:28 | 27:26 | 25 | 24:21 | 20 | 19:16 | 15:12 | 11:8 | 7:0 |
---|---|---|---|---|---|---|---|---|---|
Field | Cond | 0 0 | 1 | 0 0 0 0 | S | Rn | Rd | Rotate | immediate |
Often other methods are used to work around big immediate values
ldr r1 , =0x11223344 : the assembler most likely replaces this with a pc-relative load of the constant
movw r1, #0x3344 : load the value in two steps , r1 = 0x3344
movt r1, #0x1122 : r1 = 0x11223344
mov r2, #0x2e00 : assemble the first part of 0x2ee0
orr r2, #0xe0 : assemble the second part of 0x2ee0
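Whether a constant fits the 8-bit value plus 4-bit rotation scheme can be checked mechanically . The Python sketch below is one way to do it , under the assumption that the first matching rotation is an acceptable encoding ; `encode_immediate` and `decode_immediate` are names invented for this example .

```python
MASK32 = 0xFFFFFFFF

def ror32(value, n):
    """Rotate a 32-bit value right by n bits (0..31)."""
    n %= 32
    return ((value >> n) | (value << (32 - n))) & MASK32

def rol32(value, n):
    """Rotate a 32-bit value left by n bits (0..31)."""
    n %= 32
    return ((value << n) | (value >> (32 - n))) & MASK32

def decode_immediate(rotate, imm8):
    """immediate = a ror (b << 1), with a = imm8 and b = rotate."""
    return ror32(imm8, rotate << 1)

def encode_immediate(value):
    """Return (rotate, imm8) if value is encodable, otherwise None."""
    value &= MASK32
    for rotate in range(16):
        imm8 = rol32(value, rotate << 1)    # undo the rotate-right
        if imm8 <= 0xFF:
            return rotate, imm8
    return None
```

With this model , `encode_immediate(0x2e00)` and `encode_immediate(0xe0)` both succeed , while `encode_immediate(0x2ee0)` and `encode_immediate(0x11223344)` return `None` , which matches the need for the two-instruction sequences shown above .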
Bitwise Instructions
Operation | Assembly | Simplified |
---|---|---|
bitwise AND | and r0, r1, #2 | r0 = r1 & 2 |
bitwise OR | orr r0, r1, r2 | r0 = r1 or r2 |
bitwise XOR | eor r0, r1, r2 | r0 = r1 ^ r2 |
bitwise NOT | mvn r0, r2 | r0 = ~r2 |
Arithmetic
Operation | Assembly | Simplified |
---|---|---|
Add | add r0, r1 , #2 | r0 = r1 + 2 |
Add with carry | adc r0, r1 , r2 | r0 = r1 + r2 + C ( carry flag ) |
Subtract | sub r0, r1 , #2 | r0 = r1 - 2 |
Reverse Sub | rsb r0, r1 , #2 | r0 = 2 - r1 |
Multiply | mul r0, r1 , r2 | r0 = r1 * r2 |
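A typical use of add with carry is adding values wider than one register . The Python sketch below models a 64-bit addition built from two 32-bit halves ; `add64` is a hypothetical helper , and the register assignments in the comments are only one possible choice .

```python
MASK32 = 0xFFFFFFFF

def add64(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values held as (low word, high word) pairs."""
    lo_full = (a_lo & MASK32) + (b_lo & MASK32)   # adds r0, r0, r2 : sets C
    carry = lo_full >> 32                         # C flag after the low-word add
    lo = lo_full & MASK32
    hi = ((a_hi & MASK32) + (b_hi & MASK32) + carry) & MASK32   # adc r1, r1, r3
    return lo, hi
```

For example , `add64(0xFFFFFFFF, 0, 1, 0)` returns `(0, 1)` : the low word wraps to zero and the carry is propagated into the high word .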
Compare
Comparisons produce no results – they just set condition codes. Ordinary instructions will also set condition codes if the “S” bit is set. The “S” bit is implied for comparison instructions.
cmp r0, #42 : compare R0 to 42.
cmn r2, #42 : compare R2 to -42.
tst r11, #1 : test bit zero.
teq r8, r9 : test R8 equals R9.
subs r1, r0, #42 : compare R0 to 42, with result.
Branches
- Jump to a different location in the code
- Functions are called through branches
- bl[x] : branch and link
  - link means the return address is stored in the lr register
branch :
b #0x137 : branch to current address + 0x137
bx r1 : branch to the address in r1
branch and link :
bl #0x137 : branch to current address + 0x137
blx r1 : branch to the address in r1
Branches with ARM / Thumb States
In order to put the CPU in Thumb state , the least significant bit of the branch target has to be set to 1 . If it has not been set , the CPU switches to ARM state .
To jump to Thumb code at 0x40000 , with r1 containing the address ( 0x40000 ) :
add r1, r1, #1 : the least significant bit is set to 1
bx r1 : the CPU will change to Thumb state
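The state selection performed by bx can be summarized in a few lines of Python . This is a simplified sketch : `bx_target` is a made-up name , and the model only looks at bit 0 of the register value .

```python
def bx_target(register_value):
    """Return (address branched to, instruction set state) for `bx <register>`."""
    thumb = bool(register_value & 1)        # bit 0 selects the state
    address = register_value & ~1           # bit 0 is not part of the address
    return address, "Thumb" if thumb else "ARM"
```

So `bx_target(0x40001)` yields `(0x40000, "Thumb")` , while `bx_target(0x40000)` yields `(0x40000, "ARM")` .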
Conditional Execution
- Two-letter suffix appended to the mnemonic
- The condition is tested against the current status register flags
subs r0, r0, #1 : s means that the flag register should be updated
subne r0, r0, #2 : sub not equal , subtract if the zero flag is clear
addeq r1, r1, #2 : add equal , add if the zero flag is set
Cond [31:28] | Suffix | Description | Flags |
---|---|---|---|
0000 | EQ | Equal | Z==1 |
0001 | NE | Not equal | Z==0 |
0010 | CS/HS | Carry set / unsigned higher or same | C==1 |
0011 | CC/LO | Carry clear / unsigned lower | C==0 |
0100 | MI | Minus / negative | N==1 |
0101 | PL | Plus / positive or zero | N==0 |
0110 | VS | Overflow | V==1 |
0111 | VC | No overflow | V==0 |
1000 | HI | Unsigned higher | ( C==1 and Z==0 ) |
1001 | LS | Unsigned lower or same | ( C==0 or Z==1 ) |
1010 | GE | Signed greater than or equal | N==V |
1011 | LT | Signed less than | N!=V |
1100 | GT | Signed greater than | ( Z==0 and N==V ) |
1101 | LE | Signed less than or equal | ( Z==1 or N!=V ) |
1110 | AL | Always ( default ) | any |
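The table can be turned into a small evaluator to see how the same flags answer signed and unsigned questions differently . The Python sketch below is illustrative only ; `cmp_flags` , `CONDITIONS` and `condition_passed` are names chosen for this example .

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by `cmp a, b`, i.e. a - b with the result discarded."""
    a &= MASK32
    b &= MASK32
    full = a + ((~b) & MASK32) + 1
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)                          # 1 means no borrow
    v = int(((a ^ b) & (a ^ result)) >> 31)         # signed overflow
    return n, z, c, v

CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "cs": lambda n, z, c, v: c == 1,
    "cc": lambda n, z, c, v: c == 0,
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,
    "ls": lambda n, z, c, v: c == 0 or z == 1,
    "ge": lambda n, z, c, v: n == v,
    "lt": lambda n, z, c, v: n != v,
    "gt": lambda n, z, c, v: z == 0 and n == v,
    "le": lambda n, z, c, v: z == 1 or n != v,
    "al": lambda n, z, c, v: True,
}
CONDITIONS["hs"] = CONDITIONS["cs"]   # alternative names for the same encodings
CONDITIONS["lo"] = CONDITIONS["cc"]

def condition_passed(suffix, flags):
    """Return True if an instruction with this condition suffix would execute."""
    return CONDITIONS[suffix.lower()](*flags)
```

After `cmp_flags(0xFFFFFFFF, 1)` , both `lt` and `hi` pass : read as signed numbers -1 is less than 1 , but read as unsigned numbers 0xFFFFFFFF is higher than 1 .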
Calling Convention
Calling Function
- The first four arguments are passed in registers ( r0 - r3 )
- Further arguments are passed on the stack
- The return value is stored in r0
- r4 .. r11 are preserved by the subroutine
Calling System Call
- Arguments in r0 .. r5
- Syscall number in r7
- swi / svc #0 to make the system call
Syscall Reference : syscall
Stack Frame
+-----------------+
| Return Addr | <- r11 ( fp )
+-----------------+
| Saved Frame ptr |
+-----------------+
| ... |
| |
| Local Var |
| |
| ... | <- sp
+-----------------+
Function Prologue
- Functions are called through bl and blx
  - The return address is stored in lr / r14
- The link register is saved in the function prologue if the function is not a leaf function
push {fp , lr} : save the caller's frame pointer and the return address
add fp, sp, #4 : fp now points at the saved return address
sub sp, sp, #0x20 : reserve 0x20 bytes for local variables
Function Epilogue
- Preserved registers are restored
- pc is restored in one of two ways
  - restore lr and branch to lr
  - pop pc from the stack
Pop pc from the stack :
sub sp, fp, #4
pop {fp, pc}
Restore lr and branch to lr :
sub sp, fp, #4
pop {fp, lr}
bx lr
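To see how the prologue and the `pop {fp, pc}` epilogue mirror each other , the Python sketch below replays them on a dictionary of registers and a dictionary standing in for stack memory . The helper names and the starting register values are assumptions made for the example .

```python
def prologue(regs, mem, locals_size=0x20):
    """Model: push {fp, lr} ; add fp, sp, #4 ; sub sp, sp, #locals_size."""
    regs["sp"] -= 8
    mem[regs["sp"]] = regs["fp"]          # saved frame pointer (lower address)
    mem[regs["sp"] + 4] = regs["lr"]      # return address (higher address)
    regs["fp"] = regs["sp"] + 4           # fp points at the return address
    regs["sp"] -= locals_size             # room for local variables

def epilogue(regs, mem):
    """Model: sub sp, fp, #4 ; pop {fp, pc}."""
    regs["sp"] = regs["fp"] - 4           # discard the local variables
    regs["fp"] = mem[regs["sp"]]          # restore the caller's frame pointer
    regs["pc"] = mem[regs["sp"] + 4]      # return by loading pc
    regs["sp"] += 8

# Hypothetical register state at function entry.
regs = {"sp": 0x7FFF0000, "fp": 0x7FFF0100, "lr": 0x10450, "pc": 0}
mem = {}
prologue(regs, mem)
frame = dict(regs)                        # snapshot inside the function body
epilogue(regs, mem)
```

Inside the function ( the `frame` snapshot ) , fp points at the saved return address and the caller's frame pointer sits just below it , matching the stack frame diagram ; after the epilogue sp and fp are back at their entry values and pc holds the old lr .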
Reference
- infocenter : Best Source
Setting Up the Lab
- Qemu
sudo apt-get install qemu qemu-user qemu-user-static
- GDB
The default GDB does not know anything about other architectures , but gdb-multiarch adds support for them .
sudo apt-get install gdb-multiarch
- GCC-ARM toolchain for cross-compiling
$ sudo apt-get install gcc-arm-linux-gnueabihf libc6-dev-armhf-cross binfmtc binfmt-support
$ sudo mkdir /etc/qemu-binfmt
$ sudo ln -s /usr/arm-linux-gnueabihf /etc/qemu-binfmt/arm
Now you can compile ARM binaries on your system with
arm-linux-gnueabihf-gcc -o hello hello.c
Now onto debugging ARM binaries with QEMU and GDB .
qemu-arm -g 1337 hello
Now we can connect gdb to port 1337 and debug the program hello
$ gdb-multiarch -q hello
Reading symbols from hello...(no debugging symbols found)...done.
(gdb) set architecture arm
The target architecture is assumed to be arm
(gdb) target remote localhost:1337
Remote debugging using localhost:1337
(gdb)
GEF is an extension for gdb which plays really well with non-x86 debugging : link
The creator of that project has also built many QEMU images for different architectures to play around with . They include an ARM image based on the Raspberry Pi . With it there is no need for remote debugging : since it emulates the whole operating system , you can run the binary directly and debug it inside the QEMU session . link to his blog