Assembly code generated from Rust for parameter passing
Here we explore the performance implications of passing the self parameter by value, by reference, and via smart pointers (Box, Rc and Arc). The generated assembly code will help us understand what happens under the hood.
We will be working with the Complex struct defined below. The code shows how a struct and its associated methods are declared in Rust. Note that, as in Python, the self parameter that refers to the associated object is passed explicitly in the method declaration.
use std::rc::Rc;
use std::sync::Arc;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

impl Complex {
    pub fn magnitude_self_copy(self) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }

    pub fn magnitude_self_reference(&self) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }

    // Passing smart pointers
    pub fn magnitude_self_box(self: Box<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }

    pub fn magnitude_self_rc(self: Rc<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }

    pub fn magnitude_self_arc(self: Arc<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
}
The self parameter in a method can specify the expected ownership model for the object. The following table shows self with the different ownership models used in the methods associated with the Complex struct.
Self type | Implication |
---|---|
self | By default, Rust assumes that a parameter passed by value is moved: ownership of the parameter passes to the called function. In this example, however, the Complex type implements the Copy and Clone traits, so the compiler copies the complete object to the method instead. |
&self | The method is immutably borrowing the object. The method cannot modify the object. |
self : Box<Self> | Box is like unique_ptr in C++. Here the object is allocated on the heap. The method takes complete ownership of the object, so the object ceases to exist after the method returns and its memory is released back to the heap. |
self : Rc<Self> | Here a shared smart pointer is passed to the method. Multiple pointers to this object may be active in the same thread. The method takes ownership of one of these shared pointers. When the method returns, it decrements the shared reference count stored along with the Complex object. If this was the only reference to the object, the object is destroyed and the memory is released to the heap. If the reference count does not reach zero, the object lives on after the method returns. |
self : Arc<Self> | Here a thread-safe Arc smart pointer is passed to the method. The method now owns the Arc smart pointer. When the method returns, the shared reference count saved along with Complex is decremented. If the reference count reaches 0, the object on the heap is deleted. Unlike Rc, the reference count is decremented using an atomic read-modify-write operation. |
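The ownership rules in the table can be observed from the calling side. The following is a minimal sketch that repeats the Complex definition from above so that it compiles on its own; the values 3.0 and 4.0 and the variable names are chosen only for illustration.

```rust
use std::rc::Rc;
use std::sync::Arc;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

impl Complex {
    pub fn magnitude_self_copy(self) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
    pub fn magnitude_self_reference(&self) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
    pub fn magnitude_self_box(self: Box<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
    pub fn magnitude_self_rc(self: Rc<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
    pub fn magnitude_self_arc(self: Arc<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
}

fn main() {
    let c = Complex { real: 3.0, imaginary: 4.0 };

    // self: Complex is Copy, so c is copied and stays usable afterwards.
    assert_eq!(c.magnitude_self_copy(), 5.0);
    // &self: c is only borrowed for the duration of the call.
    assert_eq!(c.magnitude_self_reference(), 5.0);

    // Box<Self>: the Box is moved into the method and freed when it returns.
    let boxed = Box::new(c);
    assert_eq!(boxed.magnitude_self_box(), 5.0);
    // Using boxed after this point would be a compile error (use after move).

    // Rc<Self>: the method consumes one handle; the other keeps the value alive.
    let shared = Rc::new(c);
    let extra = Rc::clone(&shared);
    assert_eq!(Rc::strong_count(&shared), 2);
    assert_eq!(extra.magnitude_self_rc(), 5.0);
    assert_eq!(Rc::strong_count(&shared), 1);

    // Arc<Self>: same ownership behavior, with atomic counter updates.
    let atomic = Arc::new(c);
    assert_eq!(Arc::clone(&atomic).magnitude_self_arc(), 5.0);
    assert_eq!(Arc::strong_count(&atomic), 1);
}
```

The Box, Rc and Arc calls all move a handle into the method, which is why a clone is made first whenever the caller wants to keep using the object.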
Now let’s examine the assembly code generated for each method shown above.
Self is passed by value to the method
pub fn magnitude_self_copy(self) -> f64 {
    (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
}
The assembly code generated for the above function is shown below. One interesting thing to note is that the compiler has optimized the passing of the Complex object by placing the real and imaginary fields in the xmm0 and xmm1 registers, respectively. The method computes the result and returns it via the xmm0 register.
The code generated for calculating the magnitude is annotated in the assembly code below.
; The compiler has optimized the code to pass the real and
; imaginary parts in the xmm0 and xmm1 registers.
example::Complex::magnitude_self_copy:
mulsd xmm0, xmm0 ; Square the real part
mulsd xmm1, xmm1 ; Square the imaginary part
addsd xmm1, xmm0 ; Add the two squared numbers and store the result in xmm1
xorps xmm0, xmm0 ; Clear xmm0. This will zero out the upper bits of the reg.
sqrtsd xmm0, xmm1 ; Perform the square root on the squared sum and store in xmm0
ret ; Return to the caller with the result in xmm0
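Note that the listing contains no call to a power routine: each powf(2.0) has become a single mulsd. In source terms, the optimized function behaves like the sketch below. The free function magnitude_by_multiplication is a hypothetical name introduced only for this comparison; it is not part of the original code.

```rust
#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

impl Complex {
    pub fn magnitude_self_copy(self) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
}

// Hypothetical helper: takes the two fields the way the assembly receives
// them (two f64 values) and squares them with plain multiplications.
pub fn magnitude_by_multiplication(real: f64, imaginary: f64) -> f64 {
    (real * real + imaginary * imaginary).sqrt()
}

fn main() {
    let c = Complex { real: 3.0, imaginary: 4.0 };
    // Both formulations agree for this input.
    assert_eq!(c.magnitude_self_copy(), magnitude_by_multiplication(3.0, 4.0));
    assert_eq!(c.magnitude_self_copy(), 5.0);
}
```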
Self reference &T is passed to the method
pub fn magnitude_self_reference(&self) -> f64 {
    (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
}
A reference to self (&self) is passed in the above function. The generated code looks like the self case covered earlier. The main difference is that the compiler now passes a pointer to the Complex object via the rdi register. As a result, the first two lines of assembly populate the xmm0 and xmm1 registers with the real and imaginary fields from the struct. The rest of the assembly code is identical to the self case.
; The caller will pass the pointer to the Complex struct in the rdi register.
example::Complex::magnitude_self_reference:
movsd xmm0, qword ptr [rdi] ; Fetch the real part of the struct from memory
movsd xmm1, qword ptr [rdi + 8] ; Fetch the imaginary part of the struct from memory
mulsd xmm0, xmm0 ; real^2 -> xmm0
mulsd xmm1, xmm1 ; imaginary^2 -> xmm1
addsd xmm1, xmm0 ; real^2 + imaginary^2 -> xmm1
xorps xmm0, xmm0 ; Clear the complete xmm0 to 0
sqrtsd xmm0, xmm1 ; square root of xmm1 -> xmm0
ret ; return xmm0
Self points to the object on the heap via Box
pub fn magnitude_self_box(self: Box<Self>) -> f64 {
    (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
}
A Box smart pointer to self is passed here. The Box contains a pointer to the Complex object stored on the heap. The following table shows the heap representation.
Byte offset | Field | Field size in bytes |
---|---|---|
0 | Complex value | 16 |
The generated assembly code looks like the &self case. The xmm0 and xmm1 registers are populated from the heap. The major difference is that the heap memory is freed at the end of the method call. This happens because the method owns the Box that points to the Complex on the heap. Once the method exits, the self Box goes out of scope and frees the associated memory (Box in Rust is like unique_ptr in C++).
The assembly code below has been annotated to show the magnitude computation and release of the heap memory.
; The caller passes the address of the Complex object on the heap.
; The address is passed in the rdi register.
example::Complex::magnitude_self_box:
push rax ; Reserve 8 bytes of stack space (the value of rax is not needed). This also keeps rsp 16-byte aligned for the call below.
movsd xmm0, qword ptr [rdi] ; Fetch the real part of the struct from memory
movsd xmm1, qword ptr [rdi + 8] ; Fetch the imaginary part of the struct from memory
mulsd xmm0, xmm0 ; real^2 -> xmm0
mulsd xmm1, xmm1 ; imaginary^2 -> xmm1
addsd xmm1, xmm0 ; real^2 + imaginary^2 -> xmm1
xorps xmm0, xmm0 ; Clear the complete xmm0 to 0
sqrtsd xmm0, xmm1 ; square root of xmm1 -> xmm0
; This method owns the Box. The function is about to return, so
; the Box is going out of scope and is about to be dropped.
; Dropping here means that the heap memory allocated for the Complex object
; can now be freed. Note that the rdi register already points to the memory that
; needs to be freed.
movsd qword ptr [rsp], xmm0 ; Save xmm0 in the stack slot reserved by the push
mov esi, 16 ; Size of the memory to be freed (Complex is 16 bytes)
mov edx, 8 ; The data is 8-byte aligned.
; The parameters to the de-allocation function are:
; rdi : Address of memory to be freed
; esi : Size of memory to be freed.
; edx: Alignment of the memory to be freed.
call qword ptr [rip + __rust_dealloc@GOTPCREL]
movsd xmm0, qword ptr [rsp] ; Restore xmm0 from the stack. This is the return value.
pop rax ; Release the 8 bytes reserved by the push (the value popped into rax is unused)
ret ; Return the result in xmm0
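The size (16) and alignment (8) passed to __rust_dealloc match the layout of Complex. A minimal sketch using std::alloc::Layout shows the same numbers from Rust; it assumes an x86-64 target like the one in the listing, and complex_layout is a hypothetical helper name.

```rust
use std::alloc::Layout;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

// Hypothetical helper: the layout the allocator sees for one Complex.
fn complex_layout() -> Layout {
    Layout::new::<Complex>()
}

fn main() {
    let layout = complex_layout();
    // Two f64 fields: 16 bytes in total, aligned to 8 bytes on x86-64.
    assert_eq!(layout.size(), 16);
    assert_eq!(layout.align(), 8);

    // A Box<Complex> allocates exactly one such block on the heap and
    // releases it when the Box is dropped.
    let boxed = Box::new(Complex { real: 3.0, imaginary: 4.0 });
    println!("real = {}, imaginary = {}", boxed.real, boxed.imaginary);
}
```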
A reference-counted smart pointer Rc to self is passed
pub fn magnitude_self_rc(self: Rc<Self>) -> f64 {
    (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
}
The above method is designed to take ownership of an Rc, a reference-counting smart pointer. The Rc points to the following data on the heap:
Byte offset | Field | Field size in bytes |
---|---|---|
0 | strong reference count | 8 |
8 | weak reference count | 8 |
16 | Complex value | 16 |
When an Rc is created, it starts with the strong reference count set to 1. Cloning an Rc does not copy the pointed-to data; it just increments the reference count. This way multiple shared references may point to the same heap memory. When an Rc is dropped, the reference count is decremented. If the reference count falls to 0, the memory block on the heap is de-allocated.
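The counting behavior can be checked with Rc::strong_count and Rc::ptr_eq from the standard library. In the sketch below, strong_counts is a hypothetical helper that records the count after creation, after a clone, and after dropping the clone.

```rust
use std::rc::Rc;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

// Hypothetical helper: returns the strong count observed at three points.
fn strong_counts() -> [usize; 3] {
    let first = Rc::new(Complex { real: 3.0, imaginary: 4.0 });
    let after_new = Rc::strong_count(&first);

    // Cloning only bumps the counter; both handles point to the same block.
    let second = Rc::clone(&first);
    assert!(Rc::ptr_eq(&first, &second));
    let after_clone = Rc::strong_count(&first);

    // Dropping a handle decrements the counter again.
    drop(second);
    let after_drop = Rc::strong_count(&first);

    // The value is still reachable through the remaining handle.
    assert_eq!(first.real + first.imaginary, 7.0);
    [after_new, after_clone, after_drop]
}

fn main() {
    assert_eq!(strong_counts(), [1, 2, 1]);
}
```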
The generated code starts by populating the xmm0 and xmm1 registers with the real and imaginary parts from the struct. Notice that the offsets for the access are 16 and 24, respectively. This is due to the two 64-bit reference counts stored before the Complex object. Once the values have been loaded, the reference count is decremented in preparation for the method going out of scope. If the reference count hits zero, the object pointed to by the Rc is deleted. If not, the memory block containing the reference counts and the Complex object stays alive, as other Rc smart pointers still point to the same memory block.
Note: We have ignored the weak reference in this discussion.
; The caller passes a heap address in the rdi register that points to:
; Offset 00: Strong reference count
; Offset 08: Weak reference count
; Offset 16: Complex object
example::Complex::magnitude_self_rc:
sub rsp, 24 ; Create a 24-byte space for local variables
movsd xmm0, qword ptr [rdi + 16] ; Fetch the real part of the struct from memory
movsd xmm1, qword ptr [rdi + 24] ; Fetch the imaginary part of the struct from memory
; This method owns the Rc. The Rc will go out of scope at the end of the function.
; Decrease the reference counts in the Rc and check if the object should be freed.
add qword ptr [rdi], -1 ; Decrement the strong reference
jne .LBB3_3 ; If not zero, proceed with the calculation.
add qword ptr [rdi + 8], -1 ; Decrement the weak reference
jne .LBB3_3 ; If not zero, proceed with the calculation.
mov esi, 32 ; Size of the memory to be freed (Complex is 16 bytes)
; plus two 8-byte reference counters.
mov edx, 8 ; The data is 8-byte aligned.
movsd qword ptr [rsp + 16], xmm0 ; Save xmm0 on the stack
movsd qword ptr [rsp + 8], xmm1 ; Save xmm1 on the stack
; The parameters to the de-allocation function are
; rdi: Address of memory to be freed
; esi: Size of memory to be freed.
; edx: Alignment of the memory to be freed.
call qword ptr [rip + __rust_dealloc@GOTPCREL]
movsd xmm1, qword ptr [rsp + 8] ; Restore xmm1 from the stack
movsd xmm0, qword ptr [rsp + 16] ; Restore xmm0 from the stack
.LBB3_3:
mulsd xmm0, xmm0 ; Square the real part
mulsd xmm1, xmm1 ; Square the imaginary part
addsd xmm1, xmm0 ; real^2 + imaginary^2 -> xmm1
xorps xmm0, xmm0 ; Clear the complete xmm0 to 0
sqrtsd xmm0, xmm1 ; Square root of xmm1 -> xmm0
add rsp, 24 ; Free the space saved for local storage
ret ; Return the result in xmm0
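The two decrements and the conditional call in the listing can be read as the control flow sketched below. This is a hand-written model, not the standard library's code: RcCountsModel, DropOutcome and release_rc_model are hypothetical names, and the model assumes (as the assembly suggests) that the weak field starts at 1 on behalf of all strong pointers together.

```rust
/// Hypothetical stand-in for the two counters at offsets 0 and 8.
struct RcCountsModel {
    strong: usize,
    weak: usize,
}

#[derive(Debug, PartialEq)]
enum DropOutcome {
    /// Other strong pointers remain; nothing is freed.
    StillShared,
    /// Last strong pointer is gone, but the weak count is still non-zero.
    BlockKeptForWeak,
    /// Both counters reached zero; the 32-byte block would be freed.
    BlockFreed,
}

fn release_rc_model(counts: &mut RcCountsModel) -> DropOutcome {
    counts.strong -= 1; // add qword ptr [rdi], -1
    if counts.strong != 0 {
        return DropOutcome::StillShared; // jne .LBB3_3
    }
    // Complex has no destructor, so nothing runs for the value itself.
    counts.weak -= 1; // add qword ptr [rdi + 8], -1
    if counts.weak != 0 {
        return DropOutcome::BlockKeptForWeak; // jne .LBB3_3
    }
    DropOutcome::BlockFreed // call __rust_dealloc with size 32, align 8
}

fn main() {
    // Two strong pointers: the first release only decrements the counter.
    let mut shared = RcCountsModel { strong: 2, weak: 1 };
    assert_eq!(release_rc_model(&mut shared), DropOutcome::StillShared);
    // The second release is the last one, so the block would be freed.
    assert_eq!(release_rc_model(&mut shared), DropOutcome::BlockFreed);

    // With an extra weak count, the block outlives the last strong pointer.
    let mut observed = RcCountsModel { strong: 1, weak: 2 };
    assert_eq!(release_rc_model(&mut observed), DropOutcome::BlockKeptForWeak);
}
```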
An atomically reference-counted smart pointer Arc to self is passed
pub fn magnitude_self_arc(self: Arc<Self>) -> f64 {
    (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
}
Arc is a smart pointer that can be shared across threads. This requires reference count increments and decrements to be atomic, so an atomic read-modify-write operation is performed to manage the reference counts.
The Arc smart pointer points to a heap allocation that contains AtomicUsize strong and weak reference counts. The Complex is stored after the two counts (see the following table for the memory representation).
Byte offset | Field | Field size in bytes |
---|---|---|
0 | AtomicUsize strong reference count | 8 |
8 | AtomicUsize weak reference count | 8 |
16 | Complex value | 16 |
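A short sketch of the cross-thread use that Arc is designed for: each worker thread receives its own clone of the Arc and consumes it by calling magnitude_self_arc. The helper magnitudes_on_threads and the worker count are hypothetical and exist only for this illustration.

```rust
use std::sync::Arc;
use std::thread;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

impl Complex {
    pub fn magnitude_self_arc(self: Arc<Self>) -> f64 {
        (self.real.powf(2.0) + self.imaginary.powf(2.0)).sqrt()
    }
}

// Hypothetical helper: runs the method on `workers` threads, each owning a clone.
fn magnitudes_on_threads(shared: &Arc<Complex>, workers: usize) -> Vec<f64> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Atomic increment of the strong count.
            let local = Arc::clone(shared);
            // The method consumes `local`; the atomic decrement runs on the worker.
            thread::spawn(move || local.magnitude_self_arc())
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let shared = Arc::new(Complex { real: 3.0, imaginary: 4.0 });
    let results = magnitudes_on_threads(&shared, 4);
    assert_eq!(results, vec![5.0; 4]);
    // Every clone has been dropped by its worker, so only `shared` remains.
    assert_eq!(Arc::strong_count(&shared), 1);
}
```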
The code generated for Arc is similar to the code generated for Rc. The significant differences from the Rc assembly code are:
- lock sub qword ptr [rdi], 1 is generated for handling the atomic decrement of the reference count.
- The drop check and weak reference count decrement are handled in the alloc::sync::Arc<T>::drop_slow function.
; The caller passes a heap address in the rdi register that points to:
; Offset 00: Strong reference
; Offset 08: Weak reference
; Offset 16: Complex object
example::Complex::magnitude_self_arc:
sub rsp, 24 ; Create 24-byte space for local variables
mov qword ptr [rsp + 16], rdi ; Save rdi on the stack
movsd xmm0, qword ptr [rdi + 16] ; Fetch the real part of the struct from memory
movsd xmm1, qword ptr [rdi + 24] ; Fetch the imaginary part of the struct from memory
; This method owns the Arc. The Arc will go out of scope at the end of the function.
; Arc operates across threads, so the reference count decrement has to be locked
; to perform an atomic read-modify-write operation.
lock sub qword ptr [rdi], 1 ; Lock and perform an atomic decrement
; of the strong reference
jne .LBB4_2 ; If not zero, skip ahead to the computation.
lea rdi, [rsp + 16] ; Load the address where the original rdi is saved
movsd qword ptr [rsp + 8], xmm0 ; Save real part on the stack
movsd qword ptr [rsp], xmm1 ; Save imaginary part on the stack
call alloc::sync::Arc<T>::drop_slow ; Call the drop_slow function for further delete processing.
movsd xmm1, qword ptr [rsp] ; Now restore the imaginary part from the stack
movsd xmm0, qword ptr [rsp + 8] ; Restore the real part from the stack.
.LBB4_2:
mulsd xmm0, xmm0 ; square real part
mulsd xmm1, xmm1 ; square imaginary part
addsd xmm1, xmm0 ; real^2 + imaginary^2 -> xmm1
xorps xmm0, xmm0 ; Clear the complete xmm0 to 0
sqrtsd xmm0, xmm1 ; Square root of xmm1 -> xmm0
add rsp, 24 ; Free the space saved for local storage
ret ; Return the result in xmm0
; This function frees memory if the atomic reference counts have reached 0.
; The function is invoked with rdi pointing to the address where the address of the complete Arc is stored.
alloc::sync::Arc<T>::drop_slow:
mov rdi, qword ptr [rdi] ; Load the Arc block pointer from memory.
cmp rdi, -1 ; Check if Arc block address is set to -1
je .LBB5_2 ; If it is, skip ahead and return.
lock sub qword ptr [rdi + 8], 1 ; Perform an atomic decrement of the weak reference.
jne .LBB5_2 ; If the weak count is not 0, skip ahead and return. Otherwise fall through and free the Arc block.
mov esi, 32 ; Arc block size is 32: 8-strong, 8-weak, 16-Complex
mov edx, 8 ; Alignment is 8 bytes
jmp qword ptr [rip + __rust_dealloc@GOTPCREL] ; Free memory
.LBB5_2:
ret
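The same flow can be modeled in Rust with AtomicUsize::fetch_sub, which is the kind of operation that typically compiles to a lock-prefixed instruction on x86-64. This is a hand-written sketch, not the standard library's code: ArcCountsModel and release_arc_model are hypothetical names, and the memory orderings are an assumption based on the usual pattern for reference counting.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

/// Hypothetical stand-in for the two atomic counters at offsets 0 and 8.
struct ArcCountsModel {
    strong: AtomicUsize,
    weak: AtomicUsize,
}

/// Releases one strong reference.
/// Returns true when the model decides the heap block would be freed.
fn release_arc_model(counts: &ArcCountsModel) -> bool {
    // Fast path: atomic decrement. fetch_sub returns the previous value,
    // so any result other than 1 means other strong references remain.
    if counts.strong.fetch_sub(1, Ordering::Release) != 1 {
        return false;
    }
    // Make writes from other threads visible before tearing down.
    fence(Ordering::Acquire);
    // Slow path (drop_slow in the listing): release the weak count as well
    // and free the block only if that was the last one.
    counts.weak.fetch_sub(1, Ordering::Release) == 1
}

fn main() {
    let counts = ArcCountsModel {
        strong: AtomicUsize::new(2),
        weak: AtomicUsize::new(1),
    };
    // First release: another strong reference is still alive.
    assert!(!release_arc_model(&counts));
    assert_eq!(counts.strong.load(Ordering::Acquire), 1);
    // Second release: last strong and last weak count, so the block goes away.
    assert!(release_arc_model(&counts));
}
```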
Key takeaways
For small types, copying the object might be more efficient than passing a reference
In our analysis, the most efficient code with the least memory overhead was generated for the Complex::magnitude_self_copy method. For small types such as Complex, whose two fields fit in registers, passing by value might be more efficient than passing a reference.
Passing a &T is efficient
In most scenarios, passing a reference will be more efficient than passing by value, as the compiler will not need to copy the entire contents to the called function.
Prefer passing a &T over Rc<T> and Arc<T> when the function just wishes to read from T
From the generated code we see that passing an owned Rc<T> or Arc<T> introduces significant overhead. Prefer passing a reference &T in scenarios where the function does not need to share or take ownership of the object.
The Reddit discussion on the subject defines the following rules for Arc<T>:
- If a function always needs to own its own copy of the Arc, pass Arc<T> directly by value. The caller can decide whether to clone or move an existing Arc into it.
- If the function just very rarely needs to make a copy of the Arc, &Arc<T> can make sense so that you are not forced to do atomic operations in the common case, at the cost of not being able to just move the arc in the uncommon case.
- If the function just wants to read from the T, just pass &T.
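The three rules map onto three function signatures. The sketch below is one possible illustration: the Cache type and the functions store, store_if_empty and magnitude are hypothetical and do not come from the discussion itself.

```rust
use std::sync::Arc;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

// Hypothetical holder that sometimes keeps an Arc for later use.
struct Cache {
    latest: Option<Arc<Complex>>,
}

impl Cache {
    // Rule 1: always keeps its own Arc, so take it by value.
    // The caller chooses between cloning and moving.
    fn store(&mut self, value: Arc<Complex>) {
        self.latest = Some(value);
    }

    // Rule 2: only rarely needs its own Arc, so borrow it and clone on demand.
    // No atomic operation happens when the cache is already filled.
    fn store_if_empty(&mut self, value: &Arc<Complex>) {
        if self.latest.is_none() {
            self.latest = Some(Arc::clone(value));
        }
    }
}

// Rule 3: only reads the value, so a plain reference is enough.
fn magnitude(value: &Complex) -> f64 {
    (value.real * value.real + value.imaginary * value.imaginary).sqrt()
}

fn main() {
    let shared = Arc::new(Complex { real: 3.0, imaginary: 4.0 });
    let mut cache = Cache { latest: None };

    cache.store(Arc::clone(&shared)); // strong count becomes 2
    cache.store_if_empty(&shared); // cache already filled: count unchanged
    assert_eq!(Arc::strong_count(&shared), 2);

    // &Arc<Complex> dereferences to &Complex; the count is untouched.
    assert_eq!(magnitude(&shared), 5.0);
    assert_eq!(Arc::strong_count(&shared), 2);
}
```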
Consider the memory overhead of Rc and Arc
On a 64-bit machine, Rc and Arc add a 16-byte overhead on the heap.
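The 16 bytes follow from the two 8-byte counters stored in front of the value, which is also why the de-allocation calls in the listings pass a size of 32. The real heap blocks used by Rc and Arc are private to the standard library, so the sketch below uses a hypothetical HeapBlockModel struct with the same field order and assumes a 64-bit target.

```rust
use std::mem::size_of;
use std::rc::Rc;
use std::sync::Arc;

#[derive(Copy, Clone)]
pub struct Complex {
    real: f64,
    imaginary: f64,
}

// Hypothetical model of the heap block: two counters followed by the value.
// repr(C) keeps the fields in declaration order for this illustration.
#[repr(C)]
struct HeapBlockModel {
    strong: usize,
    weak: usize,
    value: Complex,
}

// Hypothetical helper: bytes added on the heap on top of the value itself.
fn overhead_bytes() -> usize {
    size_of::<HeapBlockModel>() - size_of::<Complex>()
}

fn main() {
    assert_eq!(size_of::<Complex>(), 16);
    assert_eq!(size_of::<HeapBlockModel>(), 32); // the 32 passed to __rust_dealloc
    assert_eq!(overhead_bytes(), 16);

    // The smart pointers themselves are a single pointer wide.
    assert_eq!(size_of::<Rc<Complex>>(), size_of::<usize>());
    assert_eq!(size_of::<Arc<Complex>>(), size_of::<usize>());
}
```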