This is part of a series of posts on BCL, the distributed data structure library I work on. I assume here that the reader is familiar with remote pointers, which are C++ objects that allow the user to read and write to memory that lives in processes on other nodes in a cluster.
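Besides remote pointers, the code in this post leans on a handful of other runtime primitives. Here is a rough sketch of what I'll assume they do; the names and signatures are illustrative, not BCL's exact API.

template <typename T>
struct remote_ptr;                    // pointer to (possibly) remote memory

template <typename T>
remote_ptr<T> alloc(size_t n);        // allocate n T's in the shared segment

template <typename T>
void delete_(remote_ptr<T> ptr);      // free remotely allocated memory

template <typename T>
void rget(remote_ptr<T> src, T* dst, size_t n);        // remote read

template <typename T>
void rput(remote_ptr<T> dst, const T* src, size_t n);  // remote write

size_t my_rank();                     // the calling process's rank
size_t nprocs();                      // total number of processes
void barrier();                       // global synchronization

template <typename T>
T broadcast(const T& value, size_t root);  // broadcast from the root rank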
This post focuses on the problem of storing objects in remote memory. Specifically, how can we create a container abstraction that can store complex types like std::string or std::vector<T> in distributed memory while also preserving efficiency for simple types like int? I start by discussing why we can't simply byte-copy all objects into distributed memory, then go on to create a container class that can wrap complicated objects so we can store them in distributed memory. I then explore why this container class isn't the best option for simple types and create a partial specialization that allows us to transparently use the container interface for simple types while achieving good performance.
One of the challenges of building distributed data structures is that they need to be able to store values of various types: not just primitive types like int and float, but complex types like those in the C++ STL or types that users define for themselves.
For some types, we can store objects in distributed memory simply by byte-copying them there. Examples include int, other primitive types, and structs containing only primitive types.
// This struct is byte-copyable, or
// `trivially copyable` in STL parlance.
struct foo {
  int a;
  float bar;
  char buf[23];
};
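We can check this property with a standard type trait, which we'll lean on again near the end of the post:

#include <type_traits>

static_assert(std::is_trivially_copyable<foo>::value,
              "foo can safely be byte-copied");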
Let's look at an example where this is not the case. Hash tables are perhaps one of the most widely used distributed data structures, and a canonical use stores strings as keys mapping to integer values.
HashMap<std::string, int> map;
std::string my_string = std::to_string(my_rank());
map.insert({my_string, my_rank()});
What happens when we try to store an STL string in distributed memory? Well, let's look at a representative std::string implementation. Since a string can have variable size, it must carry a pointer, and possibly a length, inside it.
struct string {
  size_t len_;
  char* buf_;
};
We can't byte-copy this string into remote memory because it contains a regular pointer into a process's virtual address space. If another process retrieves this string object and tries to use it, dereferencing the pointer buf_ will lead to memory errors.
We can deal with this problem by introducing a container class. Containers will store values in a serialized form so that the container itself can be byte-copied to another process, then unpacked and read.
To store anything of variable size in distributed memory, we'll need to use a remote pointer, which we can allocate (and reallocate) to fit whatever data size we encounter at runtime [1]. We can take a first stab at sketching out a container for storing variable size objects by separating our abstraction into two parts: (1) a serialization template struct, and (2) a container class. While the serialization struct will provide functions for serializing and deserializing the data to and from a byte-copyable form, the container class will actually store the data in remote memory.
Let's take a first pass at creating a serialization struct for strings. Since strings are variable size, our serialization method will need to return something of variable size. The return value will also need to be something that the container object can understand and store in remote memory. Let's assume we have a standard serial_ptr<U> class that can store an array of values of type U.
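Since serial_ptr isn't a standard library class, here is a minimal sketch of the interface the rest of this post assumes: an owning, contiguous buffer of U values with a nested type alias. (Copying is disabled and only a move constructor is provided, to keep ownership simple.)

template <typename U>
class serial_ptr {
public:
  using type = U;

  explicit serial_ptr(size_t size)
    : data_(new U[size]), size_(size) {}
  ~serial_ptr() { delete[] data_; }

  serial_ptr(serial_ptr&& other) noexcept
    : data_(other.data_), size_(other.size_) {
    other.data_ = nullptr;
    other.size_ = 0;
  }
  serial_ptr(const serial_ptr&) = delete;

  U& operator[](size_t i) { return data_[i]; }
  const U& operator[](size_t i) const { return data_[i]; }

  U* data() { return data_; }
  const U* data() const { return data_; }
  size_t size() const { return size_; }

private:
  U* data_ = nullptr;
  size_t size_ = 0;
};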
We can then create a template struct serializer<T> for serializing a value of type T, along with a specialization for std::string that provides a way to serialize and deserialize strings. (The struct is named serializer rather than serialize because, in C++, a member function can't share the name of its class.)
// The second template parameter will let us specialize
// the serializer based on type traits later on.
template <typename T, typename Enable = void>
struct serializer;

template <>
struct serializer<std::string> {
  serial_ptr<char> serialize(const std::string& str) {
    serial_ptr<char> ptr(str.length());
    for (size_t i = 0; i < str.length(); i++) {
      ptr[i] = str[i];
    }
    return ptr;
  }

  std::string deserialize(const serial_ptr<char>& ptr) {
    return std::string(ptr.data(), ptr.size());
  }
};
The implementation details are not particularly important here (we're just reading from and writing to a string), but the important thing is that this provides an interface that we can then use to create a container class for storing strings in distributed memory: namely, the methods serializer<std::string>::serialize(const std::string&) and serializer<std::string>::deserialize(const serial_ptr<char>&).
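To see the interface in action, a round trip through the serializer looks like this:

serializer<std::string> s;
serial_ptr<char> bytes = s.serialize("hello, world");
std::string str = s.deserialize(bytes);  // "hello, world"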
With this serialization interface in hand, we can proceed to write a container class that will store our variable-size object in distributed memory.
// The Enable parameter is unused here; it will let us
// write a partial specialization for simple types later.
template <typename T, typename Serialize = serializer<T>,
          typename Enable = void>
class container {
public:
  // When we call Serialize::serialize(T), the type we
  // get back (of the form serial_ptr<U>).
  using serial_ptr_type = decltype(
    std::declval<Serialize>().serialize(std::declval<T>())
  );

  // From serial_ptr<U>, the type U.
  using serialized_type = typename serial_ptr_type::type;

  T get() const;
  void set(const T& value);

private:
  remote_ptr<serialized_type> ptr_ = nullptr;
  size_t len_ = 0;
};
So, for a serialization method that serializes something of type T to type serial_ptr<U> (U is referred to by the type alias serialized_type above), we'll store an array of U's in remote memory. We can now define methods to read and write to the container.
The important detail here is that in order to get a value from a container, we have to perform an rget operation to read a variable-size object from a remote pointer, and in order to set a new value in a container, we have to perform an rput to write a variable-size object.
template <typename T, typename Serialize, typename Enable>
T container<T, Serialize, Enable>::get() const {
  if (ptr_ != nullptr) {
    serial_ptr<serialized_type> serialized(len_);
    rget(ptr_, serialized.data(), len_);
    return Serialize{}.deserialize(serialized);
  } else {
    return T();
  }
}

template <typename T, typename Serialize, typename Enable>
void container<T, Serialize, Enable>::set(const T& value) {
  if (ptr_ != nullptr) {
    delete_(ptr_);
  }
  auto serialized = Serialize{}.serialize(value);
  ptr_ = alloc<serialized_type>(serialized.size());
  rput(ptr_, serialized.data(), serialized.size());
  len_ = serialized.size();
}
Now that we have a container class, we can easily create remote arrays that contain complex objects and write programs that manipulate them.
// Create a remotely accessible array on rank 0.
remote_ptr<container<std::string>> ptr = nullptr;
if (my_rank() == 0) {
  ptr = alloc<container<std::string>>(nprocs());
}
ptr = broadcast(ptr, 0);

// Create a container, store a string inside it.
// This involves writing the variable-sized string
// to remote memory.
container<std::string> val;
val.set(std::to_string(my_rank()));

// Store the container inside the array.
ptr[my_rank()] = val;

// Wait until every rank has written its container.
barrier();

// Print out all the strings.
if (my_rank() == 0) {
  for (size_t i = 0; i < nprocs(); i++) {
    container<std::string> val;
    val = ptr[i];
    std::cout << val.get() << std::endl;
  }
}
So now that we have a container class that can store any C++ object for which we can write a serialization struct, what is the cost of the abstraction we've created? Intuitively, we know that using this container class to store or retrieve an object involves two remote communication operations: one to fetch the container object, and a second to fetch the serialized object itself from the remote pointer inside the container. So, if we're storing an int, that's two rgets: one to fetch a pointer, and one to fetch the actual value. That seems inefficient.
More formally, let's assume the common alpha-beta communication cost model, where the cost of reading or writing is \(\alpha + \beta n\). \(\alpha\) is the latency cost of performing a read or write, \(\beta\) is the bandwidth cost of sending a byte \((1 / \text{bandwidth})\), and \(n\) is the number of bytes to send.
With the container, we first have to retrieve the container object itself, at a cost of
$$
\alpha + \beta \epsilon,
$$
where \(\epsilon\) is the size of a container, which on most architectures is 32 bytes (due to alignment). Then we have the cost, in the get or set operation, of actually pulling the data through the pointer inside the container object. We can call this
$$
\alpha + \beta n,
$$
where \(n\) is the number of bytes in the serialized object. This makes the total cost
$$
2\alpha + \beta (n + \epsilon).
$$
For large objects, this is likely to be efficient, since the value of \(n\) will be quite large, meaning that the cost of accessing the object is fundamentally bandwidth bound. However, for small objects, the value of \(n\) will be small, and the cost of the roundtrip latency necessary to access the element, \(2\alpha\), will dominate. This means that in the worst case, we've introduced a 2x overhead with our container abstraction.
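To put rough numbers on this, suppose for illustration that \(\alpha = 1\ \mu\text{s}\) and the network bandwidth is 10 GB/s, so \(\beta = 0.1\) ns per byte. Fetching an 8-byte int through the container costs \(2\alpha + \beta(8 + 32) \approx 2.004\ \mu\text{s}\), while a direct rget of the int would cost \(\alpha + 8\beta \approx 1.001\ \mu\text{s}\): the overhead is almost pure latency, and we pay it on every access. For a 1 MB serialized object, by contrast, the transfer time alone is roughly \(100\ \mu\text{s}\), and the extra \(\alpha + \beta\epsilon \approx 1\ \mu\text{s}\) disappears into the noise.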
In the case of small, fixed-size objects, doubling the latency is a significant cost, so we'd like to write an improved container implementation for these types. Thankfully, C++ allows us to transparently pick a different container implementation for these types by writing a partial specialization for them.
To start with, how would we like to write a serializer for a simple type like, say, int? The serializer is still important here because it signifies to the container that we can store this type in distributed memory. Since no serialization is actually necessary for int, the most straightforward serializer just returns the original value to serialize it, and returns the serialized value to deserialize it.
template <>
struct serializer<int> {
  int serialize(const int& value) {
    return value;
  }

  int deserialize(const int& value) {
    return value;
  }
};
Note that this serializer doesn't return a serial_ptr<U> object, which was the special class we serialized variable-size objects to. Since we're now just storing a regular object of type T in distributed memory, which the user guarantees to us by contract is byte-copyable, we can store the value itself inside the container instead of holding a pointer to the serialized value.
// A partial specialization of container, chosen only when
// the return type of serialization is *not* a serial_ptr.
template <typename T, typename Serialize>
class container<T, Serialize, std::enable_if_t<
    !is_serial_ptr<decltype(
      std::declval<Serialize>().serialize(std::declval<T>())
    )>::value>> {
public:
  // Return type of serialization.
  using serialized_type = decltype(
    std::declval<Serialize>().serialize(std::declval<T>())
  );

  T get() const;
  void set(const T& value);

private:
  serialized_type value_;
};
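The is_serial_ptr trait used in the enable_if above wasn't shown; a minimal version just matches instantiations of serial_ptr. (In real code it would need to be declared before the container specialization that uses it.)

#include <type_traits>

template <typename T>
struct is_serial_ptr : std::false_type {};

template <typename U>
struct is_serial_ptr<serial_ptr<U>> : std::true_type {};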
Now we're just storing a serialized_type value directly inside the container, instead of using a remote pointer. Our get and set methods can directly use the field inside the container.
template <typename T, typename Serialize>
T container<T, Serialize, std::enable_if_t<!is_serial_ptr<
    decltype(std::declval<Serialize>().serialize(std::declval<T>()))
  >::value>>::get() const {
  return Serialize{}.deserialize(value_);
}

template <typename T, typename Serialize>
void container<T, Serialize, std::enable_if_t<!is_serial_ptr<
    decltype(std::declval<Serialize>().serialize(std::declval<T>()))
  >::value>>::set(const T& value) {
  value_ = Serialize{}.serialize(value);
}
To get an idea of how this code actually performs in the wild, let's look
at the assembly we get when we compile an example. Let's start
with the simple string example from above.
Looking at the generated assembly, we can see that reading an std::string object from remote memory is a two-stage process: first we must read the container, then we must read the data pointed to by the container. So there are two calls to remote_read (lines 319 and 333 of the assembly), which is expensive if the objects are small and bound by latency. This is, however, unavoidable for variable-size objects.
Also notable is the fact that we're making an extra copy of the local string object: first copying it into a serial_ptr object, then copying from that object into an std::string. If std::string provided a special constructor that allowed us to move a char pointer into the string object instead of copying the data, we might be able to avoid this with a carefully crafted serialization struct; but without cooperation from the original object, it's not possible to elide this local copy.
Next, let's look at the performance of the same code example with a simple type like int. Looking at the generated assembly here, we see that there is only one remote_read operation taking place inside the loop (line 51 of the assembly); our partial specialization for simple types is evidently being used.
As an added bonus, the compiler is able to elide the extra local copy of the int object, since the deserialization function simply returns the original int instead of reading and writing to a buffer in a loop. The assembled code is almost exactly the same as what we would get by reading and writing raw integers instead of using containers, and it performs the same.
There are a lot of simple types, like int, float, double, uint64_t, etc., that are byte-copyable, and it would be nice to support all of them without having to write a serialization struct for each one. It turns out that the C++ standard library has a formal definition of when types, including user-defined types, are "trivially copyable," and we can use it to write a single template struct that provides a serializer for all trivially copyable types.
template <typename T>
struct serializer<T, std::enable_if_t<
    std::is_trivially_copyable<T>::value>> {
  T serialize(const T& value) {
    return value;
  }

  T deserialize(const T& value) {
    return value;
  }
};
This even allows users to define their own trivially copyable types and automatically use them with our container class.
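For example, a user-defined aggregate of primitive types picks up the generic serializer, and therefore the fast container specialization, with no extra work:

// A user-defined, trivially copyable aggregate.
struct coordinate {
  double x, y, z;
};

static_assert(std::is_trivially_copyable<coordinate>::value,
              "coordinate picks up the generic serializer");

// The value is stored inline in the container, so fetching a
// container<coordinate> from remote memory is a single rget.
container<coordinate> c;
c.set({1.0, 2.0, 3.0});
coordinate v = c.get();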
So we've created a container abstraction that we can use to store any C++ object in distributed memory, given that it is trivially copyable or we can define a serialization struct for it. This kind of mechanism is crucial if we are to develop distributed data structure types, like lists and hash tables, that can store arbitrarily complex C++ objects. There are a few caveats that come up when using this container abstraction that are worth mentioning.
GCC and Clang are smart enough to avoid creating a local copy of a container when the serialization function just returns the object stored inside the container. However, when performing block copies, the loop trips up the compiler, and we end up creating an array of containers, then a second array of the values extracted from them, even though the two arrays are byte-identical. I solve this problem in practice by writing special block copy functions for copying multiple containers stored at a remote pointer. These copy functions can use partial specialization to take a fast path when the container is using an identity serialization method, as sketched below.
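Here's a sketch of that fast path, written with C++17's if constexpr rather than partial specialization, and using a hypothetical pointer_cast helper: when the serializer is an identity map, an array of containers is byte-identical to an array of values, so one bulk transfer suffices.

#include <vector>

// Hypothetical block-read helper: fast path when container<T>
// stores a T inline and is byte-identical to T.
template <typename T, typename Serialize = serializer<T>>
void read_block(remote_ptr<container<T, Serialize>> src,
                T* dst, size_t count) {
  using serialized_type =
    decltype(std::declval<Serialize>().serialize(std::declval<T>()));
  if constexpr (std::is_same_v<serialized_type, T>) {
    // Identity serialization: a single bulk rget straight into the
    // destination buffer. pointer_cast is a hypothetical remote_ptr
    // reinterpreting cast.
    rget(pointer_cast<T>(src), dst, count);
  } else {
    // General case: fetch the containers, then extract each value
    // (one additional rget per element inside get()).
    std::vector<container<T, Serialize>> tmp(count);
    rget(src, tmp.data(), count);
    for (size_t i = 0; i < count; i++) {
      dst[i] = tmp[i].get();
    }
  }
}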
Some data types support in-place operations in distributed memory, for example integer types, which support atomic compare-and-swap, fetch-and-add, and similar operations. A simple solution that allows users to take advantage of these in-place operations is a decay_container method that decays a remote pointer to a container into a remote pointer to a value whenever the container identity-serializes the type it's storing, as is the case with integer types. A more robust method would be to do this through reference types, which could be specialized based on the serialization function, but this is something I'm still exploring.
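Here's a sketch of what decay_container might look like, again assuming a hypothetical pointer_cast helper that reinterprets the pointee type of a remote pointer:

// Decay a remote pointer to a container into a remote pointer to the
// value stored inside it. Only valid when the serializer is an
// identity map, so that container<T> holds a T by value.
template <typename T, typename Serialize = serializer<T>>
remote_ptr<T> decay_container(remote_ptr<container<T, Serialize>> ptr) {
  using serialized_type =
    decltype(std::declval<Serialize>().serialize(std::declval<T>()));
  static_assert(std::is_same<serialized_type, T>::value,
                "decay_container requires an identity serializer");
  return pointer_cast<T>(ptr);  // hypothetical reinterpreting cast
}

A user could then pass decay_container(ptr) directly to atomic operations like fetch-and-add.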
[1] Remote pointers are C++ objects that can reference memory on processes that might live on other nodes in a cluster. They can be used almost identically to regular C++ pointers. I've written a blog post on how to design remote pointer objects for C++.