Encoding the Same Data in Far Fewer Bytes

Why should I care about this?


Data is held in memory as data structures such as objects, lists, and arrays. To send that data over the network or write it to a file, you need to encode it as a sequence of bytes. Translating the in-memory representation into a sequence of bytes is called encoding, and the reverse transformation is called decoding. Over time, the schema of the data that an application processes or stores may evolve: new fields are added and old ones are removed. The encoding you use must therefore offer both backward compatibility (new code can read data written by old code) and forward compatibility (old code can read data written by new code).

In this article, we look at several encoding formats, see why binary encoding can be a better choice than JSON and XML, and examine how binary encodings support changes to the data schema.

Types of Encoding Formats


There are two types of encoding formats:

  • Text formats
  • Binary formats

Text Formats


Text formats are somewhat human-readable. Common examples are JSON, CSV, and XML. Text formats are easy to use and understand, but they have certain problems:

  • Text formats can be ambiguous about data types. In XML and CSV, you cannot distinguish a number from a string that happens to consist of digits. JSON distinguishes strings from numbers, but not integers from floating-point numbers, and it does not specify a precision. This becomes a problem with large numbers. Integers greater than 2^53 cannot be represented exactly as double-precision floats, which is how JavaScript stores numbers. Twitter runs into this because it identifies each tweet with a 64-bit number. The JSON returned by the Twitter API therefore includes the tweet ID twice, once as a JSON number and once as a decimal string, because JavaScript applications do not always parse the number correctly.
  • CSV has no schema, so it is up to the application to define the meaning of each row and column.
  • Text formats take up more space than binary encodings. One reason is that JSON and XML are typically used without a prescribed schema, so the encoded data has to contain the field names.
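The precision problem from the first bullet can be reproduced in a few lines. This is a minimal sketch: Python itself parses JSON integers exactly, so passing `parse_int=float` to `json.loads` is an assumption used here to imitate a parser that stores every number as a 64-bit float. The ID value and the `id` / `id_str` field names are made up for illustration.

```python
import json

# Hypothetical 64-bit ID just above 2^53 (not a real tweet ID).
tweet_id = 2**53 + 1
payload = json.dumps({"id": tweet_id, "id_str": str(tweet_id)})

# Imitate a parser that stores every JSON number as a double.
parsed = json.loads(payload, parse_int=float)

numeric_id = int(parsed["id"])      # went through a float on the way
string_id = int(parsed["id_str"])   # the decimal string is untouched

print("number survived:", numeric_id == tweet_id)
print("string survived:", string_id == tweet_id)
```

Sending the ID as a decimal string as well sidesteps the problem, because the string is never interpreted as a float.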

Take the following record as an example:

{ "userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"] }

With all whitespace removed, the JSON encoding of this example takes 81 bytes.
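The size is easy to check. The sketch below, which assumes only Python's standard `json` module, serializes the record without optional whitespace and prints the length of the UTF-8 result.

```python
import json

person = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") drops the spaces json.dumps inserts by default.
compact = json.dumps(person, separators=(",", ":"))
size = len(compact.encode("utf-8"))

print(compact)
print(size, "bytes")
```

Most of those bytes are field names and punctuation rather than values, which is the overhead the binary formats below avoid.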

Binary Encoding


For data that is used only within your organization, you can choose a format that is more compact or faster to parse. Although JSON is less verbose than XML, both still take up a lot of space compared with binary formats. In this article, we discuss three binary encoding formats:

  1. Thrift
  2. Protocol Buffers
  3. Avro

All three provide efficient data serialization based on schemas, come with code generation tools, and support many programming languages. All of them support schema evolution, providing both backward and forward compatibility.

Thrift and Protocol Buffers


Thrift was originally developed at Facebook and Protocol Buffers at Google. Both require a schema to encode data. Thrift defines a schema in its own interface definition language (IDL):

struct Person {
  1: string userName,
  2: optional i64 favouriteNumber,
  3: list<string> interests
}

The equivalent schema for Protocol Buffers:

message Person {
  required string user_name = 1;
  optional int64 favourite_number = 2;
  repeated string interests = 3;
}

As you can see, each field has a data type and a tag number (1, 2, and 3). Thrift has two binary encoding formats: BinaryProtocol and CompactProtocol. BinaryProtocol is straightforward, as shown below, and takes 59 bytes to encode the data above.

Figure: Thrift BinaryProtocol encoding

CompactProtocol is semantically equivalent to BinaryProtocol, but packs the same information into just 34 bytes. It saves space by packing the field type and tag number into a single byte and by using variable-length integers.

Figure: Thrift CompactProtocol encoding
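To make the packing concrete, here is a hand-rolled sketch that builds the 34 bytes without the Thrift library. It is an illustration under stated assumptions, not the library's implementation: the type codes (6 for i64, 8 for string, 9 for list) and the layout of the header byte (tag-number delta in the high four bits, type in the low four) are taken from the compact protocol description, and all helper names are invented here.

```python
def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def varint(n):
    """Encode an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def field_header(delta, type_code):
    """Pack the tag-number delta (high 4 bits) and the type (low 4 bits)."""
    return bytes([(delta << 4) | type_code])


def string(s):
    raw = s.encode("utf-8")
    return varint(len(raw)) + raw


# Type codes assumed from the compact protocol description.
T_I64, T_BINARY, T_LIST = 6, 8, 9

interests = ["daydreaming", "hacking"]
encoded = (
    field_header(1, T_BINARY) + string("Martin")      # tag 1
    + field_header(1, T_I64) + varint(zigzag(1337))   # tag 2, delta 1 from tag 1
    + field_header(1, T_LIST)                         # tag 3, delta 1 from tag 2
    + bytes([(len(interests) << 4) | T_BINARY])       # list size and item type
    + b"".join(string(s) for s in interests)
    + b"\x00"                                         # end of struct
)

print(encoded.hex())
print(len(encoded), "bytes")
```

The number 1337 takes two bytes here instead of the eight that a fixed-width 64-bit integer needs.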

Protocol Buffers encodes data in a way similar to Thrift's CompactProtocol; the same data takes 33 bytes.

Figure: Protocol Buffers encoding
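The same record can be encoded by hand in the Protocol Buffers wire format. The sketch below does not use the protobuf library; the helper names are invented for illustration. Each field starts with a key computed as `(field_number << 3) | wire_type`, where wire type 0 is a varint and wire type 2 is a length-delimited value.

```python
def varint(n):
    """Encode an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


VARINT, LENGTH_DELIMITED = 0, 2


def key(field_number, wire_type):
    """Tag number and wire type share one varint."""
    return varint((field_number << 3) | wire_type)


def string_field(field_number, s):
    raw = s.encode("utf-8")
    return key(field_number, LENGTH_DELIMITED) + varint(len(raw)) + raw


encoded = (
    string_field(1, "Martin")
    + key(2, VARINT) + varint(1337)
    # A repeated field is written as one entry per element, same tag each time.
    + b"".join(string_field(3, s) for s in ["daydreaming", "hacking"])
)

print(encoded.hex())
print(len(encoded), "bytes")
```

There is no list header and no end-of-struct marker, which is where the one-byte difference from the Thrift sketch comes from.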

Tag numbers are what make schema evolution possible in Thrift and Protocol Buffers. If old code reads data written with a new schema, it simply ignores fields whose tag numbers it does not recognize. Likewise, new code can read data written with the old schema, treating fields whose tag numbers are missing as null.
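Both directions can be shown with a small decoder for the Protocol Buffers wire format. This is a simplified sketch, not the protobuf library: it handles only varint and length-delimited fields, treats every length-delimited value as a UTF-8 string, and does not collect repeated fields. The `email` field with tag 4 is hypothetical and stands in for a field added in a newer schema.

```python
def read_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode(buf, known_fields):
    """Decode buf, keeping only tags listed in known_fields (tag -> name)."""
    record = {name: None for name in known_fields.values()}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                       # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                     # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_fields:
            record[known_fields[tag]] = value
        # Unknown tag: the value has been consumed, so it is simply dropped.
    return record


OLD_SCHEMA = {1: "user_name", 2: "favourite_number"}
NEW_SCHEMA = {1: "user_name", 2: "favourite_number", 4: "email"}  # tag 4 is hypothetical

old_data = bytes.fromhex("0a06") + b"Martin"
new_data = old_data + bytes.fromhex("10b90a") + bytes.fromhex("2203") + b"a@b"

forward = decode(new_data, OLD_SCHEMA)    # old code, new data
backward = decode(old_data, NEW_SCHEMA)   # new code, old data
print(forward)
print(backward)
```

The wire type tells the old reader how many bytes to skip even though it has no idea what tag 4 means.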

Avro


Avro differs from Protocol Buffers and Thrift, although it also uses a schema to define data. The schema can be written in Avro IDL (a human-readable format):

record Person {
  string userName;
  union { null, long } favouriteNumber;
  array<string> interests;
}

Or in JSON (a more machine-readable format):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favouriteNumber", "type": ["null", "long"]},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}

Note that the fields have no tag numbers. The same data encoded with Avro takes only 32 bytes.

Figure: Avro encoding
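The 32 bytes can be reproduced by hand. The sketch below does not use the Avro library; the helper names are invented, and it follows the Avro binary encoding rules as I understand them: a long is a zigzag varint, a string is its length as a long followed by UTF-8 bytes, a union is the branch index followed by the value, and an array is a block count, the items, and a zero terminator.

```python
def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def varint(n):
    """Encode an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def avro_long(n):
    return varint(zigzag(n))


def avro_string(s):
    raw = s.encode("utf-8")
    return avro_long(len(raw)) + raw


interests = ["daydreaming", "hacking"]
encoded = (
    avro_string("Martin")                       # userName
    + avro_long(1) + avro_long(1337)            # union branch 1 ("long"), then the value
    + avro_long(len(interests))                 # array block with two items
    + b"".join(avro_string(s) for s in interests)
    + avro_long(0)                              # end of array
)

print(encoded.hex())
print(len(encoded), "bytes")
```

Nothing in the output names a field or a type; the values only make sense when read in schema order.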

As the byte sequence above shows, nothing in the encoding identifies the fields (Thrift and Protocol Buffers use tag numbers for this) or their data types. The values are simply concatenated. Does this mean that any schema change will produce incorrect data during decoding? The key idea of Avro is that the writer's schema and the reader's schema do not have to be the same; they only have to be compatible. When data is decoded, the Avro library resolves the differences by looking at both schemas and translating the data from the writer's schema into the reader's schema.

Figure: Resolving differences between the reader's and writer's schemas

You may be wondering how the reader learns the writer's schema. That depends on the context in which the encoding is used.

  1. When transferring a large file with many records, the writer can include its schema once at the beginning of the file.
  2. In a database with individually written records, each record may have been written with a different schema. The simplest solution is to put a version number at the beginning of each record and keep a list of schema versions.
  3. When sending records over a network connection, the reader and writer can negotiate the schema when the connection is set up.
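The second scenario can be sketched as follows. This is a toy illustration, not Avro's format: every value is a string with a single length byte in front, and the `SCHEMAS` table, the `city` field, and both helper functions are made up for the example.

```python
# Hypothetical schema list: version number -> field names in encoding order.
SCHEMAS = {
    1: ["userName"],
    2: ["userName", "city"],
}


def encode(version, values):
    """Prefix the record with its schema version, then append each value."""
    out = bytearray([version])
    for value in values:
        raw = value.encode("utf-8")
        out.append(len(raw))  # single length byte: values up to 255 bytes only
        out += raw
    return bytes(out)


def decode(record):
    """Look up the writer's schema by the version byte, then read in order."""
    fields = SCHEMAS[record[0]]
    pos = 1
    result = {}
    for name in fields:
        length = record[pos]
        result[name] = record[pos + 1:pos + 1 + length].decode("utf-8")
        pos += 1 + length
    return result


rows = [encode(1, ["Martin"]), encode(2, ["Martin", "Berlin"])]
decoded = [decode(row) for row in rows]
print(decoded)
```

One byte per record is enough for the reader to pick the right writer's schema, so records written at different times can sit side by side.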

One of the main advantages of Avro is its support for dynamically generated schemas. Because no tag numbers have to be assigned, you can use a versioning system to keep track of records encoded with different schemas.

Conclusion


In this article, we examined text and binary encoding formats and saw how the same data takes 81 bytes as JSON, 59 bytes with Thrift BinaryProtocol, 34 bytes with Thrift CompactProtocol, 33 bytes with Protocol Buffers, and only 32 bytes with Avro. Binary formats offer clear advantages over JSON when transferring data over a network between internal services.

Resources


To learn more about encoding and designing data-intensive applications, I highly recommend Martin Kleppmann's book, Designing Data-Intensive Applications.
