Streaming Compression/decompression in Haskell via Laziness

by Vanessa McHale | 2024-07-07 14:46

In Haskell, one can present a streaming compression/decompression API with lazy bytestrings.

Transcoding

Suppose we want to transcode some data, say, bzip2 to lzip.

bzip2 and lzip both have a C API (via libbz2 and lzlib respectively), viz.

typedef
   struct {
      char *next_in;
      unsigned int avail_in;
      unsigned int total_in_lo32;
      unsigned int total_in_hi32;

      char *next_out;
      unsigned int avail_out;
      unsigned int total_out_lo32;
      unsigned int total_out_hi32;

      void *state;

      void *(*bzalloc)(void *,int,int);
      void (*bzfree)(void *,void *);
      void *opaque;
   }
   bz_stream;

BZ_EXTERN int BZ_API(BZ2_bzDecompressInit) (
      bz_stream *strm,
      int       verbosity,
      int       small
   );
BZ_EXTERN int BZ_API(BZ2_bzDecompress) (
      bz_stream* strm
   );
BZ_EXTERN int BZ_API(BZ2_bzDecompressEnd) (
      bz_stream *strm
   );

enum LZ_Errno { LZ_ok = 0,         LZ_bad_argument, LZ_mem_error,
                LZ_sequence_error, LZ_header_error, LZ_unexpected_eof,
                LZ_data_error,     LZ_library_error };
const char * LZ_strerror( const enum LZ_Errno lz_errno );
struct LZ_Encoder;
struct LZ_Encoder * LZ_compress_open( const int dictionary_size,
                                      const int match_len_limit,
                                      const unsigned long long member_size );
int LZ_compress_close( struct LZ_Encoder * const encoder );
int LZ_compress_finish( struct LZ_Encoder * const encoder );
int LZ_compress_restart_member( struct LZ_Encoder * const encoder,
                                const unsigned long long member_size );
int LZ_compress_sync_flush( struct LZ_Encoder * const encoder );
int LZ_compress_read( struct LZ_Encoder * const encoder,
                      uint8_t * const buffer, const int size );
int LZ_compress_write( struct LZ_Encoder * const encoder,
                       const uint8_t * const buffer, const int size );
int LZ_compress_write_size( struct LZ_Encoder * const encoder );
enum LZ_Errno LZ_compress_errno( struct LZ_Encoder * const encoder );
int LZ_compress_finished( struct LZ_Encoder * const encoder );
int LZ_compress_member_finished( struct LZ_Encoder * const encoder );
unsigned long long LZ_compress_data_position( struct LZ_Encoder * const encoder );
unsigned long long LZ_compress_member_position( struct LZ_Encoder * const encoder );
unsigned long long LZ_compress_total_in_size( struct LZ_Encoder * const encoder );
unsigned long long LZ_compress_total_out_size( struct LZ_Encoder * const encoder );

One then shuffles data through buffers by hand, calling the decompression and compression functions above. This does not scale: if we also wanted to transcode from bzip2 to lz4, we would have to write separate glue code against the bzip2 and LZ4 APIs, so that for \( n \) compression formats we would have to write on the order of \( n^2 \) glue procedures.

In Haskell, one can write

transcode :: Lazy.ByteString -> Lazy.ByteString
transcode = LZ.compress . BZ2.decompress
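
The composition streams because a lazy bytestring is a lazy list of strict chunks: the outer function demands chunks only as it needs them, and the inner function produces them on demand. Here is a minimal sketch of that behaviour using only the bytestring package. The names stageOne, stageTwo and pipeline are hypothetical stand-ins for a decompressor and a compressor, not real codecs.

```haskell
import qualified Data.ByteString.Lazy as Lazy
import qualified Data.ByteString.Lazy.Char8 as LazyC
import Data.Char (toUpper)

-- Hypothetical stand-in for a decompressor: works chunk by chunk.
stageOne :: Lazy.ByteString -> Lazy.ByteString
stageOne = LazyC.map toUpper

-- Hypothetical stand-in for a compressor: also works chunk by chunk.
stageTwo :: Lazy.ByteString -> Lazy.ByteString
stageTwo = LazyC.filter (/= ' ')

-- Same shape as transcode: plain function composition.
pipeline :: Lazy.ByteString -> Lazy.ByteString
pipeline = stageTwo . stageOne

main :: IO ()
main = do
  -- An endless input: a strict pipeline would never return.
  let endless = Lazy.cycle (LazyC.pack "ab ")
  -- Only as much input is consumed as the 12 output bytes require.
  LazyC.putStrLn (Lazy.take 12 (pipeline endless))
```

The program terminates even though its input is infinite, which is the property that lets the composed compressor and decompressor run in bounded memory.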

To handle all cases while writing a sensible amount of code:

data Codec = BZ2 | Zlib | LZ4 | ...
enc :: Codec -> Lazy.ByteString -> Lazy.ByteString
enc BZ2 = BZ2.encode; enc Zlib = Zlib.encode; enc LZ4 = LZ4.encode; ...
dec :: Codec -> Lazy.ByteString -> Lazy.ByteString
dec BZ2 = BZ2.decode; dec Zlib = Zlib.decode; dec LZ4 = LZ4.decode; ...
transcode :: Codec -> Codec -> Lazy.ByteString -> Lazy.ByteString
transcode from to = enc to . dec from
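
To make the counting concrete, here is a self-contained sketch of the same dispatch. It assumes the zlib package and uses its three formats (gzip, zlib, raw deflate) as stand-ins for the codecs named above; the constructor names and the test data are my own choices for illustration.

```haskell
import qualified Codec.Compression.GZip as GZip
import qualified Codec.Compression.Zlib as Zlib
import qualified Codec.Compression.Zlib.Raw as Raw
import qualified Data.ByteString.Lazy as Lazy
import qualified Data.ByteString.Lazy.Char8 as LazyC

-- Stand-in formats, all provided by the zlib package.
data Codec = GZip | Zlib | Deflate
  deriving (Show, Eq, Enum, Bounded)

enc :: Codec -> Lazy.ByteString -> Lazy.ByteString
enc GZip    = GZip.compress
enc Zlib    = Zlib.compress
enc Deflate = Raw.compress

dec :: Codec -> Lazy.ByteString -> Lazy.ByteString
dec GZip    = GZip.decompress
dec Zlib    = Zlib.decompress
dec Deflate = Raw.decompress

transcode :: Codec -> Codec -> Lazy.ByteString -> Lazy.ByteString
transcode from to = enc to . dec from

main :: IO ()
main = do
  let plain  = LazyC.pack (concat (replicate 1000 "hello, streaming world\n"))
      codecs = [minBound .. maxBound] :: [Codec]
      -- Compress with a, transcode a -> b, decompress with b.
      ok (a, b) = dec b (transcode a b (enc a plain)) == plain
  -- Three encoders and three decoders cover all nine (from, to) pairs.
  print (all ok [(a, b) | a <- codecs, b <- codecs])
```

Adding a format means adding one equation to enc and one to dec, so the amount of code grows with \( 2n \) rather than \( n^2 \).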

Thus streaming transcoding in Haskell is feasible. Interestingly, the standard way to transcode compressed files is via pipes in the shell, e.g.

zstd -dc src.tar.zst | lz4 - src.tar.lz4

which is lazy but via a clumsier mechanism involving system processes.
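
For comparison, the in-process counterpart of such a pipe is a single call to interact on lazy bytestrings. This is a sketch assuming the zlib package, with zlib and gzip standing in for the zstd and lz4 of the shell example; recompress is a name I made up.

```haskell
import qualified Codec.Compression.GZip as GZip
import qualified Codec.Compression.Zlib as Zlib
import qualified Data.ByteString.Lazy as Lazy
import System.IO (hSetBinaryMode, stdin, stdout)

-- zlib in, gzip out: stand-ins for the formats in the shell example.
recompress :: Lazy.ByteString -> Lazy.ByteString
recompress = GZip.compress . Zlib.decompress

main :: IO ()
main = do
  -- Compressed data is binary; avoid newline translation on any platform.
  hSetBinaryMode stdin True
  hSetBinaryMode stdout True
  -- Reads stdin lazily and writes output chunks as they become available.
  Lazy.interact recompress
```

It expects zlib-compressed data on standard input and writes gzip data to standard output, so it can itself sit in a shell pipeline.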

I think this unified interface to streaming is an underrated accomplishment. It has been around since 2006 and wrangles some subtleties of effects in Haskell that no strict language could showcase.
