Serialization convention
There are several cases in Cloudstate where user-provided values need to be serialized. Given Cloudstate’s dependence on gRPC, protobuf is an obvious serialization format; however, it’s not always convenient. For example, if protobufs were the only supported format, it would not be possible to use strings as keys in an ORMap; instead, users would have to create a protobuf message that wrapped a string and use that.
User function libraries are therefore encouraged to support a variety of primitive types, according to what’s available in their language. JSON is also a popular and convenient serialization format in some situations, and this may be supported too.
User-provided values in Cloudstate are serialized to protobuf Any values. These contain a type_url string field that identifies the protobuf type, along with a value bytes field for the serialized bytes. While there’s nothing technically stopping anything from being put in either of the fields, such as a UTF-8 encoded string or JSON, the bytes are meant to be a valid protobuf, and the type_url is meant to refer to a protobuf type. To support the serialization of primitive and JSON values in a way that is as consistent as possible with the intentions behind Any, Cloudstate has defined a convention for achieving this.
Primitive values
Serialized primitive values have a type_url prefix of p.cloudstate.io. The path of the URL is the protobuf name of the primitive. So, for example, the type_url of a serialized string is p.cloudstate.io/string, while for a serialized boolean, it is p.cloudstate.io/bool.
The value field contains a message that is serialized using the following schema:
syntax = "proto3";
message Primitive {
<type> value = 1;
}
Where <type> is the type of the primitive, e.g. string. Implementations may define messages to represent these, but in most cases this is not necessary: serializing and deserializing the above message can be done manually using the libraries provided by the protobuf support in a given language.
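As an illustration of how little is needed to encode this message by hand, here is a minimal Python sketch for the string primitive. It assumes the standard protobuf wire format (field 1 with a length-delimited wire type gives the tag byte 0x0A) and uses only the standard library. The function names encode_varint and encode_string_primitive are hypothetical and not part of any Cloudstate library.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_primitive(s: str) -> Tuple[str, bytes]:
    """Return the (type_url, value) pair for a string primitive.

    Hypothetical helper: hand-encodes `string value = 1;`.
    """
    data = s.encode("utf-8")
    if not data:
        # proto3 omits fields holding their default value.
        value = b""
    else:
        # 0x0A = (field number 1 << 3) | wire type 2 (length-delimited)
        value = b"\x0a" + encode_varint(len(data)) + data
    return "p.cloudstate.io/string", value


type_url, value = encode_string_primitive("hi")
print(type_url, value)  # p.cloudstate.io/string b'\n\x02hi'
```

The resulting pair would be placed in the type_url and value fields of an Any message.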
Primitive value stability
Since these values are often used as keys in maps and elements in sets, the stability of the encoded form is important. While the protobuf spec does not make any guarantees on serialization stability, protobuf implementations are free to guarantee stability of their serialized form, and so it’s important that the implementation used to encode the above messages is stable. This means that no additional fields may be included in the message, and if the primitive value is the default value for that type (for example, the empty string for string, or zero for any number), the encoded message must be an empty byte string.
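The default-value rule can be sketched for numeric and boolean primitives as follows. This is an illustrative Python sketch that hand-encodes field 1 with the varint wire type (tag byte 0x08), assuming the standard protobuf wire format. The helper names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int64_primitive(n: int) -> bytes:
    """Hand-encode `int64 value = 1;` (hypothetical helper)."""
    if n == 0:
        # Default value: the encoded message must be empty.
        return b""
    # 0x08 = (field number 1 << 3) | wire type 0 (varint).
    # Negative int64 values are encoded as 64-bit two's complement.
    return b"\x08" + encode_varint(n & 0xFFFFFFFFFFFFFFFF)


def encode_bool_primitive(b: bool) -> bytes:
    """Hand-encode `bool value = 1;` (hypothetical helper)."""
    return b"\x08\x01" if b else b""


print(encode_int64_primitive(0))    # b''
print(encode_int64_primitive(300))  # b'\x08\xac\x02'
print(encode_bool_primitive(False)) # b''
```

Because each logical value maps to exactly one byte string, the output can safely be compared byte for byte when used as a map key or set element.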
JSON values
Serialized JSON values have a type_url prefix of json.cloudstate.io, and a path determined using a language-specific mechanism. For example, in statically typed languages it may be a fully qualified class name; in dynamically typed languages it may come from a field in the object, such as type. Typically, the type will be important even in dynamically typed languages when serializing event sourced events, to determine which event handler should be invoked to handle a particular event.
The value field contains a message that is serialized using the following schema:
syntax = "proto3";
message Json {
bytes json = 1;
}
The UTF-8 encoded serialized JSON is placed in the json field of the message above. This encoding is intentionally identical to using a string typed json field instead: both will yield exactly the same bytes for the same JSON. It is also identical to the primitive encoding for string and bytes, and so implementations may simply reuse that.
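A round trip through this encoding might look like the following Python sketch. It hand-encodes the Json message using the standard protobuf wire format and the standard library json module. The function names and the type name com.example.ItemAdded are hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_json_any(type_name: str, obj: Any) -> Tuple[str, bytes]:
    """Return the (type_url, value) pair for a JSON value (hypothetical helper)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # 0x0A = (field number 1 << 3) | wire type 2 (length-delimited),
    # the same framing as the string and bytes primitives.
    value = b"\x0a" + encode_varint(len(payload)) + payload
    return "json.cloudstate.io/" + type_name, value


def decode_json_value(value: bytes) -> Any:
    """Extract and parse the JSON payload from an encoded Json message."""
    if value[:1] != b"\x0a":
        raise ValueError("expected field 1 with a length-delimited wire type")
    length, shift, pos = 0, 0, 1
    while True:  # read the varint length prefix
        byte = value[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return json.loads(value[pos:pos + length].decode("utf-8"))


type_url, value = encode_json_any("com.example.ItemAdded", {"id": "a", "qty": 2})
print(type_url)                  # json.cloudstate.io/com.example.ItemAdded
print(decode_json_value(value))  # {'id': 'a', 'qty': 2}
```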
Serialized JSON value stability
As with primitive values, the stability of the encoded form is important. Libraries are encouraged to use JSON encoders that produce a stable output for logically equal inputs. Cloudstate does not put any requirements on how to achieve stable serialization; however, many languages have libraries that produce a canonical JSON representation of values. In the absence of such a library, the following considerations need to be kept in mind:
- The order of fields must not change for two equivalent objects - lexicographically sorting them is one way to achieve this.
- The whitespace in the representation must be consistent (typically, no whitespace is best).
- The serialization of equal numbers must be consistent (e.g., if a number is equal to 1, it should either encode to 1, or 1.0, but not both).
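The three considerations above can be sketched in Python using only the standard library json module. This is one possible approach under stated assumptions, not a canonical JSON implementation: it sorts keys, strips whitespace, and normalizes integral floats to integers so that 1 and 1.0 both encode as 1. The function names normalize and stable_json are hypothetical.

```python
import json
from typing import Any


def normalize(value: Any) -> Any:
    """Recursively convert floats with integral values to ints."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def stable_json(value: Any) -> bytes:
    """Serialize to UTF-8 JSON with sorted keys and no whitespace."""
    text = json.dumps(
        normalize(value),
        sort_keys=True,          # consistent field order
        separators=(",", ":"),   # no whitespace
        ensure_ascii=False,      # emit non-ASCII characters directly as UTF-8
        allow_nan=False,         # reject NaN and Infinity, which are not valid JSON
    )
    return text.encode("utf-8")


print(stable_json({"b": 1.0, "a": [1, 2.5]}))  # b'{"a":[1,2.5],"b":1}'
```

Two logically equal objects, such as {"a": 1, "b": 2} and {"b": 2.0, "a": 1}, produce the same bytes with this approach.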