Quincey Koziol, 02/15/2011 04:16 pm

Compression for Exascale

Compression in the rich man's parallel sense is hard, and HDF5 doesn't support it. The problem is that, because compressed block sizes vary, we don't know ahead of time where data will land in the file. A variable-loss but fixed-size compression scheme (e.g. wavelets) might be better here: it can always hit a target compressed size, but the quality of the compressed result varies. That could be useful for plot files, but obviously not for restart files.
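The "fixed size, variable loss" idea can be sketched in a few lines. This is a hypothetical illustration in Python, not HDF5 code: it keeps only the k largest-magnitude coefficients that fit a byte budget, so the output size is always exactly the budget while reconstruction quality depends on the data.

```python
import struct

def fixed_size_lossy(values, budget_bytes):
    """Illustrative fixed-size lossy compression: keep only the k
    largest-magnitude coefficients, where k is set by the byte budget.
    Each kept coefficient is stored as a (uint32 index, float32 value)
    record, so the output is always exactly budget_bytes long."""
    record_size = 8                       # 4-byte index + 4-byte float
    k = budget_bytes // record_size
    # Indices of the k largest-magnitude values.
    keep = sorted(range(len(values)), key=lambda i: -abs(values[i]))[:k]
    out = b"".join(struct.pack("<If", i, values[i]) for i in sorted(keep))
    return out.ljust(k * record_size, b"\x00")  # pad to the fixed size
```

Whatever the input looks like, the stored size is fixed; what varies is how much of the signal survives, which is exactly the trade-off that makes this acceptable for plot files but not for restart files.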

Alternatively, compression could work in rich man's parallel if HDF5 assumed a target compression ratio of R (set by the application in a property list) and treated every compressed block as 1/R of its original size. Compressed block sizes would then always be predictable, though if the actual compression exceeds R:1, some space savings are sacrificed, because the block is still allocated 1/R of the original size. So what? The real problem is a block that cannot be compressed R:1. Then what? One option is to fail the write and let the application retry with a lower R. Another option is to have two kinds of blocks: those that hit or exceeded the R:1 target, which are always treated as 1/R of the original size, and those that didn't, which are stored at the original size. Either way, block size is predictable and therefore manageable in rich man's parallel.
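The two-kinds-of-blocks scheme makes every block's file offset computable without global coordination. A minimal sketch, in Python rather than HDF5's C and with hypothetical names: each block gets a budget of ceil(size/R) bytes; blocks that compress within the budget occupy exactly the budget (possibly wasting a little space), and blocks that miss it fall back to raw storage at full size.

```python
def plan_block_layout(block_sizes, compressed_sizes, R):
    """Assign a file offset to each block under a target ratio R:1.

    block_sizes:      original size of each block, in bytes
    compressed_sizes: actual compressed size of each block, in bytes
    Returns a list of (offset, stored_size, kind) tuples.
    """
    layout = []
    offset = 0
    for orig, comp in zip(block_sizes, compressed_sizes):
        budget = -(-orig // R)        # ceil(orig / R)
        if comp <= budget:
            stored, kind = budget, "compressed"   # wastes budget - comp bytes
        else:
            stored, kind = orig, "raw"            # missed target: store raw
        layout.append((offset, stored, kind))
        offset += stored
    return layout
```

Because stored sizes depend only on R and the original block sizes plus one bit per block (hit or missed the target), each writer can compute where its data lands, which is the property the variable-size pipeline lacks.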

Adding preprocessing filters to the compression pipeline may give a better chance of achieving the R:1 compression ratio (or may allow R to be increased), at the expense of additional computation. Examples include shuffle, delta, and space-filling-curve filters.

Eliminating the block-level indirection might also be useful. Yes, it's bad for any eventual attempt to subset on read, but if the caller accepts those limitations and costs, we could allow it. The whole dataset would then be a single block, either compressed to the target R:1 (with possible wasted space if the ratio is exceeded) or stored uncompressed.

Exascale may involve precision higher than 64-bit double precision; perhaps 96 or 128 bits will be required. What does this mean for compression of floating-point data compared to single or double precision? Should we expect to do better because there are more exponent bits, or worse because there are more mantissa bits?
