Map Rust vector iteration to assembly

Vectors in Rust contain a pointer to a heap vector buffer (buf) and a length (len). The vector buffer is a pointer to a buffer of the type T. The length is the number of elements in the vector. Here is an excerpt from the std::vec::Vec implementation:

pub struct Vec<T, #[unstable(feature = "allocator_api", issue = "32838")] A: Allocator = Global> {
    // Points to the start of the vector data and keeps the capacity information
    buf: RawVec<T, A>,
    // Length of the vector
    len: usize,
}

Sample vector handling code

Let us explore the Rust to assembly mapping of the Vec struct using a simple function, increment_by that increments the elements of a vector by a specified value (num). The sample also includes increment_example that calls the increment_by function.

// Increment each element of a i64 vector by num. A mutable
// reference to the vector is passed in.
pub fn increment_by(num: i64, list: &mut Vec<i64>) {
    // Iterate over the vector and increment each element.
    for item in list {
        // Note that the item needs to be dereferenced 
        *item += num;
    }
}

pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
    let mut list = array.to_vec();
    increment_by(inc_by, &mut list);
    list
}

Assembly code for increment_by

The generated assembly code for the increment_by function is shown below. The generated assembly code seems quite complex. On close examination, it becomes clear that the code is optimized for operating on vectors of wildly different lengths.

Let us look at the assembly code that does the heavy lifting for a vector of a certain length. Once we understand these fragments, we will look at the complete assembly to understand how all other lengths are handled by combining these fragments.

Vector length less than 4

For vectors that are shorter than 4, the generated assembly code loops over each entry in the vector and increments it by the specified value.

.LBB0_11:
        add     qword ptr [rax], rdi    ; Add num to the next entry in the vector
        add     rax, 8                  ; Move to the next entry in the vector
        cmp     rax, rcx                ; Check if we have reached the end of the vector
        jne     .LBB0_11                ; Jump if we have not reached the end of the vector

Vector length is 4

If the vector length is 4, the compiler unrolls the loop and generates the following assembly code. Vector instructions are used to perform two 64-bit operations in parallel. The movq instruction is used to move the vector data from the vector buffer to a xmm register. The addq instruction is used to add the specified value to the vector data. The movq instruction is then used to move the vector data back to the vector buffer.

.LBB0_7:
        movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load two 64-bit numbers from the vector
        movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load two more 64-bit numbers from the vector
        paddq   xmm1, xmm0                              ; Add the increment value to the first two 64-bit number
        paddq   xmm2, xmm0                              ; Add the increment value to the 3rd and 4th of 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the result back in the vector (1st and 2nd number)
        movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the result back in the vector (3rd and 4th number)

Vector length is a multiple of 8

For vectors with lengths that are a multiple of 8, the compiler handles 8 64-bit operations per iteration. The 8 64-bit numbers are copied to the xmm registers using four movdqu instructions. Four addq instructions perform the 8 additions. Finally, the four movdqu instructions copy the results back to the vector buffer.

.LBB0_5:
        movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load the first two 64-bit number
        movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load the third and fourth 64-bit number
        movdqu  xmm3, xmmword ptr [rcx + 8*rsi + 32]    ; Load the fifth and sixth 64-bit number
        movdqu  xmm4, xmmword ptr [rcx + 8*rsi + 48]    ; Load the seventh and eighth 64-bit number
        paddq   xmm1, xmm0      ; Add the increment value to the first two 64-bit number
        paddq   xmm2, xmm0      ; Add the increment value to the third and fourth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the first two 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the third and fourth 64-bit number
        paddq   xmm3, xmm0      ; Add the increment value to the fifth and sixth 64-bit number
        paddq   xmm4, xmm0      ; Add the increment value to the seventh and eighth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 32], xmm3    ; Store the fifth and sixth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 48], xmm4    ; Store the seventh and eighth 64-bit number
        add     rsi, 8       ; Increment the 64-bit numbers index
        add     rax, -2     ; Decrement the count by 2 as 8 64-bit numbers have been processed.
        jne     .LBB0_5     ; Jump if there are still quad 64-bit numbers to process
        test    r10b, 1     ; Check if there are still entries to process
        je      .LBB0_8     ; Jump if less than 4 entries need to be processed.

Handling vectors of arbitrary length

Now that we have analyzed the special cases, let us look at the code below that handles vectors of arbitrary length. The code uses the assembly code blocks we described above. The full code of the function combines the three code blocks to handle vectors of arbitrary length. For example, a vector of length 41 will be handled as.

  1. Iterate until index 31 (32 entries = 4 * 8) with four loops through that process 8 operations per loop (4 sets of vector operations). The .LBB0_5 label shows the code that performs the four loops.
  2. Iterate until index 35 (32 + 4 entries) with the code that processes 4 operations with two vector operations .LBB0_7 label shows this in the assembly of the function.
  3. Iterate until index 38 (36 + 3 entries) with the code that processes one iteration per loop. Refer to .LBB0_11 label in the following code.
; rsi contains the address of the vector to be incremented.
example::increment_by:
        mov     r9, qword ptr [rsi + 16] ; r9 now contains the length of the vector
        test    r9, r9                  ; Check if the vector is empty
        je      .LBB0_12                ; Jump if the vector is empty
        mov     rcx, qword ptr [rsi]    ; rcx now contains the address of the vector data
        lea     rax, [r9 - 1]           ; rax now contains the index of the last entry in the vector.
        movabs  rdx, 2305843009213693951 ; Hex 1FFF_FFFF_FFFF_FFFF
        and     rdx, rax                 ; rdx contains the index of the last entry in the vector with 3 MSB cleared.
        mov     rax, rcx                 ; rax now contains the address of the vector data
        cmp     rdx, 3                   ; Check if the vector has less than 4 entries
        jb      .LBB0_10                 ; Jump if the vector has less than 4 entries
        
        ; == Handle vectors with 4 or more entries == 
        add     rdx, 1          ; Add 1 to the last entry index to get length.
        mov     r8, rdx         ; r8 now contains the number of items to be processed.
        and     r8, -4          ; Mask of the lower 2 bits to get the count of entries that is divisible by 4.
        movq    xmm0, rdi       ; Parameter num, the increment value, is now in xmm0.
        pshufd  xmm0, xmm0, 68  ; xmm0 upper 64 bits and lower 64 bits are set to num to facilitate vector operations.
        lea     rax, [r8 - 4]   ; rax now is 4 less than the number of items to be processed.
        mov     r10, rax        ; r10 is now 4 less than the number of items to be processed.
        shr     r10, 2          ; Divide r10 by 4 to get the index of the last quad 64-bit numbers.
        add     r10, 1          ; Add 1 to get the count of quad 64-bit numbers.
        test    rax, rax        ; Check if any more sets of 4 entries are available
        je      .LBB0_3         ; Jump if no more sets of 4 entries are available
        mov     rax, r10        ; rax now contains the count of quad 64-bit numbers
        and     rax, -2         ; Mask of the lower 2 bits to get the count of quad 64-bit numbers that is divisible by 4.
        xor     esi, esi        ; Set the quad 64-bit numbers index to 0

        ; == Process eight 64-bit additions per iteration ==
.LBB0_5:
        movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load the first two 64-bit number
        movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load the third and fourth 64-bit number
        movdqu  xmm3, xmmword ptr [rcx + 8*rsi + 32]    ; Load the fifth and sixth 64-bit number
        movdqu  xmm4, xmmword ptr [rcx + 8*rsi + 48]    ; Load the seventh and eighth 64-bit number
        paddq   xmm1, xmm0      ; Add the increment value to the first two 64-bit number
        paddq   xmm2, xmm0      ; Add the increment value to the third and fourth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the first two 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the third and fourth 64-bit number
        paddq   xmm3, xmm0      ; Add the increment value to the fifth and sixth 64-bit number
        paddq   xmm4, xmm0      ; Add the increment value to the seventh and eighth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 32], xmm3    ; Store the fifth and sixth 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi + 48], xmm4    ; Store the seventh and eighth 64-bit number
        add     rsi, 8       ; Increment the 64-bit numbers index
        add     rax, -2     ; Decrement the count by 2 as 8 64-bit numbers have been processed.
        jne     .LBB0_5     ; Jump if there are still quad 64-bit numbers to process
        test    r10b, 1     ; Check if there are still entries to process
        je      .LBB0_8     ; Jump if less than 4 entries need to be processed.

        ; == Block processing four 64-bit additions ==
.LBB0_7:
        movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load two 64-bit numbers from the vector
        movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load two more 64-bit numbers from the vector
        paddq   xmm1, xmm0                              ; Add the increment value to the first two 64-bit number
        paddq   xmm2, xmm0                              ; Add the increment value to the 3rd and 4th of 64-bit number
        movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the result back in the vector (1st and 2nd number)
        movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the result back in the vector (3rd and 4th number)
.LBB0_8:
        cmp     rdx, r8                 ; Check if all entries have been processed
        je      .LBB0_12                ; Jump if all entries have been processed
        lea     rax, [rcx + 8*r8]       ; rax now contains the address of the last entry
.LBB0_10:
        lea     rcx, [rcx + 8*r9]       ; rcx now contains the address of the last entry in the vector
                                        ; rax points to the first entry in the vector

        ; == Process one 64-bit addition per iteration ==
.LBB0_11:
        add     qword ptr [rax], rdi    ; Add num to the next entry in the vector
        add     rax, 8                  ; Move to the next entry in the vector
        cmp     rax, rcx                ; Check if we have reached the end of the vector
        jne     .LBB0_11                ; Jump if we have not reached the end of the vector
.LBB0_12:
        ret
.LBB0_3:
        xor     esi, esi                ; esi is now 0
        test    r10b, 1                 ; Check if one set of quad 64-bit numbers is available
        jne     .LBB0_7                 ; Jump if quad 64-bit numbers are available
        jmp     .LBB0_8                 ; Jump if less than 4 64-bit numbers are available

The following diagrams help visualize how the code above connects the three basic blocks we have seen earlier.

Process eight 64-bit additions per iteration (LBB0_5)

Four 64-bit vector elements per iteration

Block processing four 64-bit additions (LBB0_7)

Two 64-bit updates

Process one 64-bit addition per iteration (LBB0_11)

One 64-bit vector element per iteration

Assembly code for increment_example

pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
    let mut list = array.to_vec();
    increment_by(inc_by, &mut list);
    list
}

Looking at the code for increment_example we can see that the generated assembly code is a lot simpler than that for increment_by function. We also see that there is no call to increment_by, the relevant code from the function has been inlined into increment_example. We are seeing a lot of compiler optimizations at work here.

Also, note that the assembly code for increment_example calls __rust_alloc for allocating the vector data on the heap. If the memory allocation fails, the function throws an exception using the ud2 instruction.

example::increment_example:
        push    r15             ; Save r15 on the stack
        push    r14             ; Save r14 on the stack
        push    rbx             ; Save rbx on the stack
        mov     r15, rdx        ; r15 = address of the array
        mov     r14, rsi        ; r14 = inc_by
        mov     rbx, rdi        ; rbx = address of vector on the stack

        ; == Allocate memory for the vector buffer ==
        mov     edi, 32         ; edi = number of 64-bit numbers to process
        mov     esi, 8          ; esi = byte alignment is at 8 bytes
        call    qword ptr [rip + __rust_alloc@GOTPCREL] ; The allocated heap address is in rax
             
        test    rax, rax       ; Check if the allocation failed
        je      .LBB1_1       ; Jump if the allocation failed
        movups  xmm0, xmmword ptr [r15]         ; Load the first two 64-bit numbers from the array
        movups  xmm1, xmmword ptr [r15 + 16]    ; Load the third and fourth 64-bit numbers from the array
        movups  xmmword ptr [rax + 16], xmm1    ; Store the third and fourth 64-bit numbers in the vector
        movups  xmmword ptr [rax], xmm0         ; Store the first two 64-bit numbers in the vector
        add     qword ptr [rax], r14            ; Add the increment value to the first 64-bit number
        mov     qword ptr [rbx], rax            ; Store the address of the vector in the vector buffer
        add     qword ptr [rax + 8], r14        ; Add the increment value to the second 64-bit number
        mov     qword ptr [rbx + 8], 4          ; Store the vector length in the vector capacity field in RawVec.
        add     qword ptr [rax + 16], r14       ; Add the increment value to the third 64-bit number
        mov     qword ptr [rbx + 16], 4         ; Store the vector length in the vector.
        add     qword ptr [rax + 24], r14       ; Add the increment value to the fourth 64-bit number
        mov     rax, rbx                        ; Return the vector in rax
        pop     rbx                             ; Restore rbx
        pop     r14                             ; Restore r14
        pop     r15                             ; Restore r15
        ret                                     ; Return the vector in rax

.LBB1_1:                                        ; Label for the allocation failure
        mov     edi, 32                         ; edi = Size of the attempted allocation
        mov     esi, 8                          ; esi = Alignment is at 8 bytes
        ; == Call the allocation failure handler ==
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2                                     ; Use the ud2 instruction to throw an exception

View in the Compiler Explorer

Click on the Edit on Compiler Explorer to experiment with the generated assembly code. Two interesting experiments are: