Reasoning

General Compression Formula

Where possible, compression was attempted by just taking the difference of Vector and Map objects and leaving other objects uncompressed. When more compression is needed, bitsets are used to mask members that stayed the same.

I3FrameObjects

Anything that is derived from an I3FrameObject can store the filename that the base object came from. However, if that object is not actually in the frame and only present inside another object, then the filename field will be set to an empty string.

Map Compression

Maps are compressed by storing the keys of any deletions and the full key-value pairs of any additions/overwrites.

Vector Compression

Vectors are compressed by only writing the difference between the two vectors. The difference is calculated by using the “longest common subsequence” algorithm. Specifically, this one:

"A linear space algorithm for computing maximal common subsequences"
D. S. Hirschberg
http://portal.acm.org/citation.cfm?id=360861

This is roughly the same algorithm that diff uses to compare two files.

One consequence of this is that if there is any change in the element in the container the algorithm will call the entire element an addition and save the entire thing. So for a minor change to all elements in the vector, no space has been saved becase the vector looks completely different.

One solution for this is to use a map instead of a vector for elements that have fixed positions.

There is also the option of a second vector compression scheme based on fixed positions, FixedPositionVector. This is implemented, but not used.

I3Station

The I3Station class was created to help compact I3StationGeo objects.

Because of their vector nature, I3StationGeo is difficult to compress. Therefore, we get rid of the vector and replace it with a fixed object. Each station only has 2 tanks, and many of the values are the same between tanks. A further optimization brings these identical values up one level to the station to prevent duplicate values from being serialized.

Compared to an I3FixedPositionVectorDiff<I3StationGeoDiff,I3StationGeo>, this method saves on average 1000 bytes uncompressed, 650 bytes compressed using gzip. This doesn’t sound like a lot, but it would increase total GCDDiff size by 2x. Much of the savings comes from bringing the values up a level to avoid serializing a value twice, and removing the vector container serialization, especially the size of the vector.

Removing the vector from the dataclass code should be considered in the future.