In my work as a graduate student, I build distributed data structures, and having a remote pointer abstraction is an essential tool for me to write clean, correct code. In this blog post, I'll motivate why remote pointers are necessary, explain how I built C++ remote pointer objects in my library, explain how to make them work like regular C++ pointers using remote reference objects, and discuss when the abstraction breaks down because a custom pointer simply can't do a regular pointer's job (yet). Hopefully this will be useful for other people interested in developing high-level programming abstractions.
When working on distributed computers or with network hardware, you often read and write to some memory space through a C API. One example of this is MPI's API for one-sided communication, which lets you read and write directly to and from the memory of other nodes in a distributed cluster. Here's a slightly simplified version of what this looks like.
void remote_read(void* dst, int target_node, int offset, int size);
void remote_write(void* src, int target_node, int offset, int size);
Given an offset into a target node's shared memory segment, remote_read will read some bytes from it, and remote_write will write some bytes to it.
These APIs are great because they give us access to important primitives we can use to implement programs that run on clusters of computers. They're also great because these are really fast primitives that directly mirror what's offered by the hardware: remote direct memory access (RDMA). Modern supercomputer networks like Cray Aries and Mellanox EDR offer 1-2 μs read/write latencies, which are made possible by allowing the network interface card (NIC) to directly read and write to RAM so as to avoid waiting for the remote CPU to wake up and answer your network request.
Where these APIs are not so great is from an application programming perspective. Even with the simple APIs written above, it's easy to accidentally overwrite data you didn't mean to, since you don't have separate names for separate objects stored in memory, just one big contiguous buffer. Also, the interface is untyped, so you don't have the esteemed benefit of having a compiler yell at you if you write the wrong type of value to the wrong location; your code will simply be wrong, in mysterious yet catastrophic ways. To make matters worse, the real APIs are more complicated than the simplified versions above, and it's easy to swap two or more parameters by mistake.
Pointers are an important and necessary abstraction layer for building high-level programming tools. While pointers can be complicated to use directly and can introduce lots of bugs, they are fundamental building blocks, and data structures and even C++ references will often use pointers under the hood.
If we assume we have an API like the one listed above, then a unique location in remote memory is specified by two things: (1) a rank, or process ID, and (2) an offset into that rank's shared segment of remotely available memory. We can go ahead and make this a struct.
template <typename T>
struct remote_ptr {
  size_t rank_;
  size_t offset_;
};
At this point, we can already design an API for reading and writing to remote pointers that is safer than our original API.
template <typename T>
T rget(const remote_ptr<T> src) {
  T rv;
  remote_read(&rv, src.rank_, src.offset_, sizeof(T));
  return rv;
}
template <typename T>
void rput(remote_ptr<T> dst, const T& src) {
  // remote_write takes a non-const void*, so cast away const on the source
  remote_write(const_cast<T*>(&src), dst.rank_, dst.offset_, sizeof(T));
}
Block transfers are very similar and are left out for brevity.
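For the curious, here is a minimal sketch of what block variants could look like. The count-taking rget/rput overloads are my own naming, not from any particular library, and the single-process remote_read/remote_write stubs exist only to make the example self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Single-process stub of the C API, for illustration only:
// "remote" memory is just a local buffer indexed by offset.
static char segment[1024];
void remote_read(void* dst, int /*rank*/, int offset, int size) {
  std::memcpy(dst, segment + offset, size);
}
void remote_write(void* src, int /*rank*/, int offset, int size) {
  std::memcpy(segment + offset, src, size);
}

template <typename T>
struct remote_ptr {
  std::size_t rank_;
  std::size_t offset_;
};

// Hypothetical block-transfer variants, reading or writing
// `count` contiguous elements starting at the remote pointer.
template <typename T>
void rget(remote_ptr<T> src, T* dst, std::size_t count) {
  remote_read(dst, src.rank_, src.offset_, sizeof(T) * count);
}

template <typename T>
void rput(remote_ptr<T> dst, const T* src, std::size_t count) {
  remote_write(const_cast<T*>(src), dst.rank_, dst.offset_, sizeof(T) * count);
}
```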
Now, to read and write a value, we can write code like
remote_ptr<int> ptr = ...;
int rval = rget(ptr);
rval++;
rput(ptr, rval);
This is already better than the original API because we are working with typed objects. It's no longer easy to read or write a value of the wrong type, or to write only part of an object.
Pointer arithmetic is an essential feature that allows programmers to manage collections of values in memory—and if we're writing a distributed memory program, presumably we often want to work with large collections of values.
What does incrementing or decrementing a remote pointer mean? Well, the simplest option is that remote pointer arithmetic is analogous to regular pointer arithmetic: p+1 will just point to the next sizeof(T)-aligned memory location after p in the original rank's shared segment.
While this is not the only possible definition of remote pointer arithmetic, it has come to be most accepted recently, with remote pointers in libraries like UPC++, DASH, and BCL using it. However, Unified Parallel C (UPC), which has a long legacy in the high-performance computing (HPC) community, has a more elaborate definition of pointer arithmetic [1].
Implementing pointer arithmetic this way is straightforward and only involves modifying the pointer offset.
template <typename T>
remote_ptr<T> remote_ptr<T>::operator+(std::ptrdiff_t diff) const {
  size_t new_offset = offset_ + sizeof(T)*diff;
  return remote_ptr<T>{rank_, new_offset};
}
This opens up the possibility of accessing arrays of data in distributed memory. We could, for example, have each process in a SPMD program write or read from a different variable in an array pointed to by a remote pointer [2].
void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    rput(ptr + my_rank(), my_rank());
  }
}
Implementing other operators necessary to support the full set of normal pointer arithmetic operations is straightforward.
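As a sketch, the rest of the operator set might look like the following. These are written as free functions over the same two-member struct for brevity; a real implementation would likely make them members, as with operator+ above:

```cpp
#include <cassert>
#include <cstddef>

template <typename T>
struct remote_ptr {
  std::size_t rank_;
  std::size_t offset_;
};

// Offset arithmetic, analogous to regular pointer arithmetic.
template <typename T>
remote_ptr<T> operator+(remote_ptr<T> p, std::ptrdiff_t n) {
  return remote_ptr<T>{p.rank_, p.offset_ + sizeof(T) * n};
}
template <typename T>
remote_ptr<T> operator-(remote_ptr<T> p, std::ptrdiff_t n) {
  return remote_ptr<T>{p.rank_, p.offset_ - sizeof(T) * n};
}

// Distance in elements between two pointers on the same rank.
template <typename T>
std::ptrdiff_t operator-(remote_ptr<T> a, remote_ptr<T> b) {
  return (a.offset_ - b.offset_) / sizeof(T);
}

template <typename T>
remote_ptr<T>& operator++(remote_ptr<T>& p) { p = p + 1; return p; }
template <typename T>
remote_ptr<T>& operator--(remote_ptr<T>& p) { p = p - 1; return p; }

template <typename T>
bool operator==(remote_ptr<T> a, remote_ptr<T> b) {
  return a.rank_ == b.rank_ && a.offset_ == b.offset_;
}
template <typename T>
bool operator!=(remote_ptr<T> a, remote_ptr<T> b) { return !(a == b); }
```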
nullptr Value

For normal pointers, the nullptr value is NULL, which is usually just a #define for 0x0, since that memory location is unlikely to be used. In our remote pointer scheme, we have the option of either picking a specific pointer value to be nullptr, thus making that memory location unusable, or else including a special boolean member to mark whether a pointer is null. While making a memory location unusable is not optimal, adding a single boolean value will, with most compilers, increase the size of the remote pointer object from 128 bits to 192 bits in order to maintain alignment. This is particularly undesirable.
In my library, I chose {0, 0}, that is, offset 0 on rank 0, as the nullptr value. There are, perhaps, other choices of nullptr that would work equally well. In addition, some programming environments, such as UPC, have implemented narrow pointers that fit in only 64 bits, allowing them to be used in atomic compare-and-swap operations. A narrow pointer comes with the tradeoff that either the offset or the rank identifier must fit within 32 bits or less, limiting scalability.
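As a sketch, supporting null checks with the {0, 0} convention might look like this. The nullptr_t constructor and comparison overloads are assumptions about the interface, not taken from any particular library:

```cpp
#include <cassert>
#include <cstddef>

template <typename T>
struct remote_ptr {
  std::size_t rank_ = 0;
  std::size_t offset_ = 0;

  // {0, 0} is reserved as the null pointer value.
  remote_ptr() = default;
  remote_ptr(std::nullptr_t) {}
  remote_ptr(std::size_t rank, std::size_t offset)
      : rank_(rank), offset_(offset) {}

  // True whenever the pointer is not the reserved null value.
  explicit operator bool() const { return rank_ != 0 || offset_ != 0; }

  bool operator==(std::nullptr_t) const { return !static_cast<bool>(*this); }
  bool operator!=(std::nullptr_t) const { return static_cast<bool>(*this); }
};
```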
In languages like Python, the bracket operator serves as syntactic sugar for calls to the methods __setitem__ and __getitem__, depending on whether you are writing or reading the object. In C++, operator[] does not make this distinction: the same method is called whether the returned value will ultimately be read from or written to. To resolve this, C++ data structures return references, which refer to memory held inside the container and can be either written to or read from. An implementation of operator[] for std::vector might look something like this.
T& operator[](size_t idx) {
  return data_[idx];
}
Of the essence here is the fact that we are returning something of type T&, which is a raw C++ reference that can be written to, rather than something of type T, which only represents the value of the original data.
We cannot return a raw C++ reference in our case because we are referencing memory held on another node which is not present in our own virtual address space. We can, however, create our own custom reference objects.
A reference is essentially an object that serves as a wrapper around a pointer, and it has two essential functions: it can be implicitly converted to a value of type T, and it can be assigned a value of type T. So, for a remote reference, we just have to implement an implicit conversion operator, which reads the value, and an assignment operator, which writes the value.
template <typename T>
struct remote_ref {
  remote_ptr<T> ptr_;

  operator T() const {
    return rget(ptr_);
  }

  remote_ref& operator=(const T& value) {
    rput(ptr_, value);
    return *this;
  }
};
This allows us to add some new, powerful features to our remote pointer that allow it to be dereferenced in the same way as regular pointers.
template <typename T>
remote_ref<T> remote_ptr<T>::operator*() {
  return remote_ref<T>{*this};
}

template <typename T>
remote_ref<T> remote_ptr<T>::operator[](ptrdiff_t idx) {
  return remote_ref<T>{*this + idx};
}
We've now filled in all the pieces to use our remote pointers as regular pointers. We can re-write the simple example program from above.
void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    ptr[my_rank()] = my_rank();
  }
}
We can, of course, write more complicated programs using our new pointer API, such as this function to do a parallel tree reduction [3]. Implementations using our remote pointer class are safer and cleaner than what would typically be written using the C API described above.
But what is the cost of using a high-level abstraction like this? With each memory access, we are calling the dereference method, returning an intermediate object that wraps the pointer, then calling the conversion operator or assignment operator on the intermediate object. Will we pay a runtime cost for this?
It turns out, if we design our pointer and reference classes carefully, we
will pay no runtime cost for this abstraction—modern C++ compilers
are able to compile these intermediate objects and method calls away through aggressive
inlining. To judge what the cost of our abstraction is, we can compile
a simple example program and then examine the assembly to see what objects
and methods exist at runtime. For the tree reduction example, compiled with
the remote pointer and reference classes we've described here, modern
compilers render the tree reduction as a few calls to remote_read
and remote_write
[4]. No class methods are
called and no reference objects exist at runtime.
Experienced C++ programmers will recall that the C++ standard template library specifies that STL containers should support custom C++ allocators. Allocators provide a way to allocate memory, and this memory can be referenced using custom pointer types. Does this mean that we can simply create a "remote allocator" and plug it in to store data in remote memory using STL containers?
Unfortunately, no. Presumably for performance reasons, the C++ standard no longer requires support for custom reference types, and most implementations of the C++ standard library do not support them. In GCC's libstdc++, for example, you may use custom pointer types, but you are restricted to regular C++ references, which prevents you from using STL containers in remote memory. Some high-level C++ template libraries, such as Agency, which uses custom pointer and reference types, provide their own implementations of some STL data structures that do support custom reference types. This gives users license to build creative allocator, pointer, and reference types, along with a collection of data structures that automatically work with them.
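To make the allocator idea concrete, here is a rough sketch of what a remote allocator's interface might look like. remote_malloc and remote_free are hypothetical runtime calls, stubbed here with a trivial local bump allocator so the example is self-contained; as noted above, mainstream standard library containers would still reject the custom reference type even with such an allocator:

```cpp
#include <cassert>
#include <cstddef>

template <typename T>
struct remote_ptr {
  std::size_t rank_;
  std::size_t offset_;
};

// Hypothetical runtime allocation calls, stubbed as a bump allocator
// over a single shared segment for illustration only.
static std::size_t bump_offset = 8;  // offset 0 is reserved for nullptr
std::size_t remote_malloc(std::size_t bytes) {
  std::size_t off = bump_offset;
  bump_offset += bytes;
  return off;
}
void remote_free(std::size_t /*offset*/) {}

// Sketch of an allocator whose pointer type is remote_ptr<T>.
template <typename T>
struct remote_allocator {
  using value_type = T;
  using pointer = remote_ptr<T>;  // custom pointer type

  pointer allocate(std::size_t n) {
    return pointer{0, remote_malloc(n * sizeof(T))};
  }
  void deallocate(pointer p, std::size_t) { remote_free(p.offset_); }
};
```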
This blog post has touched on a number of broader issues that remain unresolved. What does it mean to store a value of a simple type like int in remote memory? Can we cleanly support complex types? Can we support simple types at the same time without paying a serialization cost? I plan on writing a series of blog posts that deal with these and other issues that have arisen while developing my distributed data structures library. Please stay tuned!
[1] In UPC, pointers have a phase that determines which rank a pointer will point to when incremented. Phase allows pointers to encapsulate distributed arrays with a variety of distribution patterns. This is very powerful, but can be mystifying to novice users. Although some expert UPC programmers still prefer this technique, a more object-oriented approach that's catching on is to first build a simple remote pointer class, then rely on separate distributed data structures to control data distribution.
[2] Most applications in HPC are written in a SPMD, or "single program, multiple data," style. SPMD APIs offer a my_rank() function or variable, which tells a process running the program its unique rank, or process ID, and can be used to branch off from the main program.
[3] Here is a basic tree reduction written in a SPMD style using our remote pointer class. Code is adapted from code originally written by my colleague Andrew Belt.
template <typename T>
T parallel_sum(remote_ptr<T> a, size_t len) {
  size_t k = len;
  do {
    k = (k + 1) / 2;
    if (my_rank() < k && my_rank() + k < len) {
      // Explicit read-modify-write, since remote_ref does not define +=
      a[my_rank()] = a[my_rank()] + a[my_rank() + k];
    }
    len = k;
    barrier();
  } while (k > 1);
  return a[0];
}
[4] You can see the compiled result of the above code here: https://godbolt.org/z/yDKluF.