Finding NA in serialized R values

In a previous post, I explained how to test a C++ function, find_na, used in a database for R values. find_na checks whether a previously serialized R value contains NA. By not unserializing the value, it avoids unnecessary allocations.

Serialization of R values

Native R serialization functions are defined in serialize.c in the R codebase. Three versions of the serialization are available; I use the latest one. The serialization starts with a header indicating how atomic values are serialized, including the character encoding, and then recursively traverses the R value.

R_InitOutPStream first creates a stream in which to write the serialized value, and then R_Serialize performs the actual serialization:

  WriteBuffer write_buffer(buf);
  R_outpstream_st out;

  R_InitOutPStream(&out,
                   reinterpret_cast<R_pstream_data_t>(&write_buffer),
                   R_pstream_binary_format,
                   3, //version of the serialization
                   append_byte, append_buf,
                   refhook_write, R_NilValue);

  R_Serialize(val, &out);
  • R_pstream_data_t is an alias for void*. It is passed as an argument to callbacks append_byte and append_buf;
  • R_pstream_binary_format indicates that atomic values should be serialized as binary. Other formats include XDR, with R_pstream_xdr_format, and ASCII, with R_pstream_ascii_format;
  • refhook_write provides a way to customize the serialization of some values. In sxpdb, we use it to disable serialization for environments.
SEXP Serializer::refhook_write(SEXP val, SEXP data) {
  if(TYPEOF(val) != ENVSXP) {
    return R_NilValue;//it means that R will serialize the value as usual
  }
  // We just do not serialize the environment and return a blank string
  return R_BlankScalarString;
}
  • The R_NilValue passed last to R_InitOutPStream is the second argument of refhook_write.
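
To make the role of the two output callbacks concrete, here is a minimal sketch of what append_byte, append_buf, and WriteBuffer could look like. It is not the sxpdb implementation. In the real API the callbacks receive an R_outpstream_t, whose data field holds the pointer passed to R_InitOutPStream; here a stand-in struct (fake_outpstream, a hypothetical name) replaces it so that the sketch compiles without the R headers. The members write_byte and write_bytes are also assumptions.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for R's R_outpstream_st (assumption): only the user-data
// pointer matters for this sketch.
struct fake_outpstream {
  void* data;
};

// Hypothetical helper that appends to a caller-owned byte buffer.
class WriteBuffer {
public:
  explicit WriteBuffer(std::vector<std::byte>& buf) : buf_(buf) {}

  void write_byte(int c) { buf_.push_back(static_cast<std::byte>(c)); }

  void write_bytes(const void* src, int n) {
    const auto* p = static_cast<const std::byte*>(src);
    buf_.insert(buf_.end(), p, p + n);
  }

private:
  std::vector<std::byte>& buf_;
};

// Called by the serializer for a single byte.
void append_byte(fake_outpstream* stream, int c) {
  static_cast<WriteBuffer*>(stream->data)->write_byte(c);
}

// Called by the serializer for a run of n bytes.
void append_buf(fake_outpstream* stream, void* src, int n) {
  static_cast<WriteBuffer*>(stream->data)->write_bytes(src, n);
}
```

Both callbacks only recover the WriteBuffer from the opaque pointer and forward to it, which is why R_pstream_data_t can stay a plain void*.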

Partially unserializing

The database stores the serialized values, stripped of the header with the version of the serialization and the encoding.

The function unserialize_view takes a serialized value (a buffer of bytes) from the database and returns a sexp_view_t, which holds precomputed metadata about the serialized value.

const sexp_view_t Serializer::unserialize_view(
                   const std::vector<std::byte>& buf);

struct sexp_view_t {
  SEXPTYPE type = ANYSXP;
  const void* data = nullptr;
  size_t length = 0;
  size_t element_size = 0;
};

For vector types, the serialized value then comprises flags, the length of the vector, and the actual data in the vector:

[Figure: layout of a serialized vector, in order: header, flags, length (1-3 words), data]

The flags include the type and the presence of attributes of the value:

int flags = 0;
std::memcpy(&flags, data, sizeof(int));
sexp_view.type = flags & 255;
bool has_attr = flags & (1 << 9);
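
As a self-contained illustration of the flag word, the sketch below decodes the type and three flag bits from raw bytes. The struct flags_view and the function decode_flags are hypothetical names. The bit positions (8 for "is object", 9 for "has attributes", 10 for "has tag") follow my reading of serialize.c and should be checked against the R version you target.

```cpp
#include <cstring>

// Hypothetical decoded view of the flag word.
struct flags_view {
  int type;        // low 8 bits: the SEXPTYPE
  bool is_object;  // bit 8
  bool has_attr;   // bit 9
  bool has_tag;    // bit 10
};

// Reads one int from `data` (native byte order, as in the binary format)
// and splits it into its components.
inline flags_view decode_flags(const void* data) {
  int flags = 0;
  std::memcpy(&flags, data, sizeof(int));
  return flags_view{
      flags & 0xFF,
      (flags & (1 << 8)) != 0,
      (flags & (1 << 9)) != 0,
      (flags & (1 << 10)) != 0,
  };
}
```

For example, an integer vector (INTSXP is 13) with attributes would carry the flag word 13 | (1 << 9).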

The function then stores a pointer to the actual vector data in the data field.

For vector values (character, logical, integer, real, complex), length in sexp_view_t stores the number of elements in the vector, and element_size stores the size of one element in bytes. For character vectors, element_size is not meaningful: in R, character vectors of type STRSXP are actually vectors of atomic strings, of type CHARSXP, which have variable length.
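
The length field can be read along the following lines. This is a sketch based on my understanding of how serialize.c writes lengths, not the sxpdb code: an ordinary vector stores its length in one 32-bit word, while a long vector stores -1 followed by the high and low 32 bits of the length, so the field takes either one or three words. The names length_view and read_length are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result: the decoded length and how many 32-bit words it used.
struct length_view {
  std::uint64_t length;
  std::size_t words;
};

// `p` points at the length field, right after the flag word.
// Assumes the buffer is well formed (no negative length other than -1).
inline length_view read_length(const unsigned char* p) {
  std::int32_t first = 0;
  std::memcpy(&first, p, sizeof(first));
  if (first != -1) {
    return length_view{static_cast<std::uint64_t>(first), 1};
  }
  // Long vector: the next two words hold the high and low halves.
  std::uint32_t hi = 0;
  std::uint32_t lo = 0;
  std::memcpy(&hi, p + 4, sizeof(hi));
  std::memcpy(&lo, p + 8, sizeof(lo));
  return length_view{(static_cast<std::uint64_t>(hi) << 32) | lo, 3};
}
```

The vector data then starts words * 4 bytes after the beginning of the length field.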

Looking for NA

For logical and integer vectors, we can simply look for the magic NA value in the data field, after casting it to a pointer to int, v:

std::find(v, v + length, NA_LOGICAL) != v + length;
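
The same check can be written without the R headers, which may help when testing outside of R. The sketch below relies on the fact that NA_INTEGER and NA_LOGICAL are both INT_MIN. It copies each element with memcpy instead of casting the pointer, which sidesteps alignment concerns if the data does not start on an int boundary. The function name has_int_na is hypothetical.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

// Returns true if any of the `length` ints stored at `data` equals INT_MIN,
// the value R uses for NA in integer and logical vectors.
inline bool has_int_na(const void* data, std::size_t length) {
  const auto* p = static_cast<const unsigned char*>(data);
  for (std::size_t i = 0; i < length; ++i) {
    int v = 0;
    std::memcpy(&v, p + i * sizeof(int), sizeof(int));
    if (v == INT_MIN) {
      return true;
    }
  }
  return false;
}
```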

For real vectors, NA is a more complicated beast. The floating-point standard (IEEE 754) already provides a way to represent missing values: NaN. In R, NA is a NaN with a special bit pattern: its low word is 1954, the year Ross Ihaka, one of the creators of the R language, was born.

static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}

The C R API provides R_IsNA to simplify the check:

std::find_if(v, v + length, [](double d) -> bool {
  return R_IsNA(d) ;}) != v + length;
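
To show what such a check amounts to, here is a standalone sketch that builds the NA bit pattern and tests for it without the R headers. It mirrors the idea described above (a NaN whose low 32-bit word is 1954) but is not R's own code; make_r_na and is_r_na are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Builds a double with high word 0x7ff00000 and low word 1954 (0x7A2).
inline double make_r_na() {
  const std::uint64_t bits = 0x7FF00000000007A2ULL;
  double d = 0.0;
  std::memcpy(&d, &bits, sizeof(d));
  return d;
}

// True only for NaN values whose low word is 1954; an ordinary NaN fails.
inline bool is_r_na(double d) {
  if (!std::isnan(d)) {
    return false;
  }
  std::uint64_t bits = 0;
  std::memcpy(&bits, &d, sizeof(bits));
  return static_cast<std::uint32_t>(bits & 0xFFFFFFFFu) == 1954u;
}
```

Working on the 64-bit pattern avoids having to know which of the two words comes first in memory, which is what the hw and lw indices handle in R.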

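A complex value counts as NA if its real or imaginary part is NA. Note that ISNAN, used here, is also true for an ordinary NaN, so this check is looser than the R_IsNA check used for reals: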

std::find_if(v, v + length, [](const Rcomplex& c) -> bool {
  return ISNAN(c.r) || ISNAN(c.i);}) != v + length;

Character vectors need more work. Each CHARSXP element of the vector is itself a valid R value, so it is serialized with its flags followed by its length. NA is the element with length -1:

SEXPTYPE type = ANYSXP;
int size = 0;
for(size_t i = 0; i < length; i++) {
  std::memcpy(&type, data, sizeof(int));
  type &= 255;// keep the type; the upper bits also hold the encoding
  assert(type == CHARSXP);
  data += sizeof(int);
  std::memcpy(&size, data, sizeof(int));
  data += sizeof(int);
  if(size == -1) {// this is NA_STRING
    return true;
  }
  assert(size >= 0);
  //else we jump to the next CHARSXP
  data += size;
}
return false;
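
The loop above can be exercised on a hand-built buffer. The sketch below is an assumption-laden test harness, not the sxpdb code: put_charsxp and put_na_charsxp are hypothetical helpers that mimic the layout described here (flag word, length word, then the bytes), and has_na_string adds bounds checks that the original loop does not have. Treating a malformed buffer as "no NA" is a design choice made for the sketch only.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

constexpr int kCharSxp = 9;  // SEXPTYPE of CHARSXP

inline void put_int(std::vector<unsigned char>& out, int v) {
  unsigned char raw[sizeof(int)];
  std::memcpy(raw, &v, sizeof(int));
  out.insert(out.end(), raw, raw + sizeof(int));
}

// Mimics a serialized CHARSXP: flags, length, bytes. Bit 18 is an arbitrary
// upper bit standing in for the encoding levels.
inline void put_charsxp(std::vector<unsigned char>& out, const std::string& s) {
  put_int(out, kCharSxp | (1 << 18));
  put_int(out, static_cast<int>(s.size()));
  out.insert(out.end(), s.begin(), s.end());
}

// NA_STRING: length -1 and no bytes.
inline void put_na_charsxp(std::vector<unsigned char>& out) {
  put_int(out, kCharSxp);
  put_int(out, -1);
}

inline bool has_na_string(const std::vector<unsigned char>& buf,
                          std::size_t n_elems) {
  std::size_t pos = 0;
  for (std::size_t i = 0; i < n_elems; ++i) {
    if (buf.size() - pos < 2 * sizeof(int)) return false;  // truncated
    int flags = 0;
    int size = 0;
    std::memcpy(&flags, buf.data() + pos, sizeof(int));
    std::memcpy(&size, buf.data() + pos + sizeof(int), sizeof(int));
    pos += 2 * sizeof(int);
    if ((flags & 0xFF) != kCharSxp) return false;  // not a CHARSXP
    if (size == -1) return true;                   // NA_STRING
    if (size < 0 || static_cast<std::size_t>(size) > buf.size() - pos) {
      return false;  // malformed length
    }
    pos += static_cast<std::size_t>(size);  // jump to the next CHARSXP
  }
  return false;
}
```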

With that, the search for NA needs only $O(1)$ additional memory, since nothing is unserialized. Note that I do not deal with non-vector types, such as environments (not stored in the database) or lists.

Pierre Donat-Bouillud
Researcher

My research interests include programming languages, fuzzing, and testing.
