Finding NA in serialized R values
In a previous post, I explained how to test a C++ function, find_na, used in a database for R values. find_na checks whether NA is present in a previously serialized R value. By not unserializing the value, it avoids unnecessary allocations.
Serialization of R values
Native R serialization functions are defined in serialize.c in the R codebase. Three versions of the serialization format are available; I use the latest one. The serialization starts with a header indicating how atomic values are serialized, including the character encoding, and then recursively traverses the R value.
R_InitOutPStream first creates a stream in which to write the serialized value, and then R_Serialize performs the actual serialization:
WriteBuffer write_buffer(buf);
R_outpstream_st out;
R_InitOutPStream(&out,
                 reinterpret_cast<R_pstream_data_t>(&write_buffer),
                 R_pstream_binary_format,
                 3, // version of the serialization
                 append_byte, append_buf,
                 refhook_write, R_NilValue);
R_Serialize(val, &out);
- R_pstream_data_t is an alias for void*. The callbacks append_byte and append_buf receive the stream and reach this pointer through its data field;
- R_pstream_binary_format indicates that atomic values should be serialized as binary. Other formats are XDR, with R_pstream_xdr_format, and ASCII, with R_pstream_ascii_format;
- refhook_write provides a way to customize the serialization of some values. In sxpdb, we use it to disable serialization for environments.
SEXP Serializer::refhook_write(SEXP val, SEXP data) {
    if (TYPEOF(val) != ENVSXP) {
        return R_NilValue; // it means that R will serialize the value as usual
    }
    // We just do not serialize the environment and return a blank string
    return R_BlankScalarString;
}
- The R_NilValue passed last to R_InitOutPStream is the second argument of refhook_write.
Partially unserializing
The database stores the serialized values, stripped of the header with the version of the serialization and the encoding.
Function unserialize_view takes a serialized value (a buffer of bytes) from the database and returns a sexp_view_t, which holds pre-computed metadata on the serialized value.
const sexp_view_t Serializer::unserialize_view(
    const std::vector<std::byte>& buf);

struct sexp_view_t {
    SEXPTYPE type = ANYSXP;
    const void* data = nullptr;
    size_t length = 0;
    size_t element_size = 0;
};
For vector types, the serialized value then comprises the flags, the length of the vector, and the actual data in the vector.
The flags include the type of the value and whether it has attributes:
int flags = 0;
std::memcpy(&flags, data, sizeof(int));
sexp_view.type = flags & 255;
bool has_attr = flags & (1 << 9);
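To make the bit layout more concrete, here is a small self-contained sketch that decodes the other fields packed into the same integer. The struct decoded_flags and the function decode_flags are hypothetical names, not part of sxpdb. The bit positions (object bit at 8, attribute bit at 9, tag bit at 10, levels from bit 12 up) follow my reading of serialize.c, and the code assumes the binary format, which writes integers in native byte order.

```cpp
#include <cstring>

// Hypothetical holder for the decoded fields; not the sxpdb type.
struct decoded_flags {
    int type;        // SEXPTYPE code, stored in the low 8 bits
    bool is_object;  // bit 8
    bool has_attr;   // bit 9
    bool has_tag;    // bit 10
    int levels;      // gp field, bits 12 and up (holds the encoding for CHARSXP)
};

// Decodes the flags integer found at the start of a serialized value.
// Assumes R_pstream_binary_format, i.e. native byte order.
decoded_flags decode_flags(const unsigned char* data) {
    int flags = 0;
    std::memcpy(&flags, data, sizeof(int));

    decoded_flags d;
    d.type = flags & 0xFF;
    d.is_object = (flags & (1 << 8)) != 0;
    d.has_attr = (flags & (1 << 9)) != 0;
    d.has_tag = (flags & (1 << 10)) != 0;
    d.levels = flags >> 12;
    return d;
}
```

For instance, an integer vector (type code 13) carrying attributes would be written with flags 13 | (1 << 9).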
The function then stores the actual value data in the data field.
For vector values (character, logical, integer, real, complex), length in sexp_view_t stores the number of elements in the vector, and element_size is the size of one element in bytes. For character vectors, element_size is not meaningful: in R, character vectors, of type STRSXP, are actually vectors of strings of type CHARSXP, which have variable length.
Looking for NA
For logical and integer vectors, we can simply look for the magic NA value in the data field, after casting it to a pointer to int, v. NA_LOGICAL and NA_INTEGER are the same value, INT_MIN, so one comparison covers both types:
std::find(v, v + length, NA_LOGICAL) != v + length;
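The same scan can be sketched without the R headers. The function name has_na_int and the constant kNaInt are made up for this example; the one fact it relies on is that R defines NA_INTEGER and NA_LOGICAL as INT_MIN. It reads each element with memcpy rather than through a cast pointer, on the assumption that a buffer coming out of the database may not be aligned for int.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

// R defines both NA_INTEGER and NA_LOGICAL as INT_MIN.
const int kNaInt = INT_MIN;

// Returns true if any of the `length` ints stored at `data` is NA.
// Each element is copied out with memcpy, so `data` need not be aligned.
bool has_na_int(const unsigned char* data, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        int v = 0;
        std::memcpy(&v, data + i * sizeof(int), sizeof(int));
        if (v == kNaInt) {
            return true;
        }
    }
    return false;
}
```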
For real vectors, NA is a more complicated beast. The floating point standard (IEEE 754) already offers a way to represent missing values: NaN. In R, NA is a NaN with a special bit pattern: the lowest word is 1954, the year Ross Ihaka, one of the creators of the R language, was born.
static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}
R's C API provides R_IsNA to simplify the check:
std::find_if(v, v + length, [](double d) -> bool {
    return R_IsNA(d);
}) != v + length;
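A complex vector contains NA if the real or the imaginary part of one of its elements is NA. The check here uses ISNAN, which is true for any NaN, NA included; use R_IsNA on each part to match NA only: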
std::find_if(v, v + length, [](const Rcomplex& c) -> bool {
    return ISNAN(c.r) || ISNAN(c.i);
}) != v + length;
Character vectors are again more complex. Each CHARSXP element of the vector is itself a valid R value, so it is serialized with its flags followed by its length. NA is the element with length -1:
SEXPTYPE type = ANYSXP;
int size = 0;
for (size_t i = 0; i < length; i++) {
    std::memcpy(&type, data, sizeof(int));
    type &= 255; // this would also store the encoding...
    assert(type == CHARSXP);
    data += sizeof(int);
    std::memcpy(&size, data, sizeof(int));
    data += sizeof(int);
    if (size == -1) { // this is NA_STRING
        return true;
    }
    assert(size >= 0);
    // else we jump to the next CHARSXP
    data += size;
}
return false;
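The walk over the CHARSXP elements can be tried out on a hand-built buffer. The sketch below is a variant of the loop above, with two changes that are my own assumptions: it takes the total buffer size and returns false on truncated or malformed input instead of asserting, and it uses unsigned char bytes. The names has_na_string and append_charsxp are hypothetical; append_charsxp exists only to build test input and leaves the encoding bits at zero.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

const int kCharSxp = 9;  // type code of CHARSXP in R

// Returns true if one of the `length` serialized CHARSXP elements starting at
// `data` is NA_STRING. Returns false on truncated or malformed input.
bool has_na_string(const unsigned char* data, std::size_t size,
                   std::size_t length) {
    const unsigned char* end = data + size;
    for (std::size_t i = 0; i < length; ++i) {
        if (static_cast<std::size_t>(end - data) < 2 * sizeof(int)) {
            return false;  // not enough bytes for flags + length
        }
        int flags = 0;
        int n = 0;
        std::memcpy(&flags, data, sizeof(int));
        data += sizeof(int);
        if ((flags & 0xFF) != kCharSxp) {
            return false;  // not a CHARSXP: give up
        }
        std::memcpy(&n, data, sizeof(int));
        data += sizeof(int);
        if (n == -1) {
            return true;  // NA_STRING is written with length -1
        }
        if (n < 0 || static_cast<std::size_t>(end - data) <
                         static_cast<std::size_t>(n)) {
            return false;  // invalid or truncated payload
        }
        data += n;  // skip the characters, jump to the next element
    }
    return false;
}

// Test helper: appends one element; a null pointer stands for NA_STRING.
void append_charsxp(std::vector<unsigned char>& out, const char* s) {
    const int flags = kCharSxp;  // encoding bits left at zero
    const int n = s ? static_cast<int>(std::strlen(s)) : -1;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&flags);
    out.insert(out.end(), p, p + sizeof(int));
    p = reinterpret_cast<const unsigned char*>(&n);
    out.insert(out.end(), p, p + sizeof(int));
    if (s) {
        out.insert(out.end(), s, s + n);
    }
}
```

Note that the scan never copies the characters themselves: it reads two integers per element and skips the rest.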
With that, the search for NA needs only $O(1)$ additional memory. Note that I do not deal with non-vector types, such as environments (not stored in the database) or lists.