# hyparquet decompressors
[npm](https://www.npmjs.com/package/hyparquet-compressors)
[workflow status](https://github.com/hyparam/hyparquet-compressors/actions)
[MIT license](https://opensource.org/licenses/MIT)

This package exports a `compressors` object intended to be passed into [hyparquet](https://github.com/hyparam/hyparquet).

[Apache Parquet](https://parquet.apache.org) is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets. It supports several compression formats, but most parquet files use snappy compression.

By default, the hyparquet library only supports `uncompressed` and `snappy` compressed files. The `hyparquet-compressors` package extends support to the other parquet compression formats, with the exception of LZO.

The `hyparquet-compressors` package works in both node.js and the browser. It uses js and wasm packages, with no system dependencies.

## Usage
```js
import { parquetRead } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'
await parquetRead({ file, compressors, onComplete: console.log })
```

See the [hyparquet](https://github.com/hyparam/hyparquet) repo for further info.
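
For orientation, the sketch below shows the general shape such a `compressors` object could take: a plain map from parquet codec name to a decompression function. The `(input, outputLength) => Uint8Array` signature, the `exampleCompressors` map, and the `decompressPage` helper are assumptions made for illustration, not a description of hyparquet's internals.

```js
// Hypothetical decompressor map: codec name -> (input, outputLength) => Uint8Array
const exampleCompressors = {
  // stand-in codec for this sketch: returns the input bytes unchanged
  UNCOMPRESSED: (input, outputLength) => input.subarray(0, outputLength),
}

// Hypothetical helper: pick the decompressor for a page and fail loudly when it is missing
function decompressPage(codec, input, outputLength, compressors) {
  const decompress = compressors[codec]
  if (!decompress) throw new Error(`unsupported compression codec: ${codec}`)
  return decompress(input, outputLength)
}

const page = decompressPage('UNCOMPRESSED', new Uint8Array([1, 2, 3]), 3, exampleCompressors)
```

Keeping the codecs in a plain object means a reader only needs to look up a key, and a caller could pass a subset of codecs if bundle size matters.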
# Compression formats

Parquet compression types supported with `hyparquet-compressors`:

- [X] Uncompressed
- [X] Snappy
- [X] Gzip
- [ ] LZO
- [X] Brotli
- [X] LZ4
- [X] ZSTD
- [X] LZ4_RAW

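
Parquet file metadata identifies each codec by a numeric id from the `CompressionCodec` enum in the Parquet Thrift definition. The sketch below maps those ids to the names in the checklist above; the `describeCodec` helper is hypothetical and only illustrates the lookup.

```js
// Codec ids in the order of the CompressionCodec enum in the Parquet Thrift definition
const codecNames = ['UNCOMPRESSED', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4', 'ZSTD', 'LZ4_RAW']

// Mirrors the checklist above: every codec except LZO
const supportedCodecs = new Set(codecNames.filter(name => name !== 'LZO'))

// Hypothetical helper: translate a numeric codec id and report whether it can be decoded
function describeCodec(id) {
  const name = codecNames[id]
  if (name === undefined) throw new Error(`unknown codec id: ${id}`)
  return { name, supported: supportedCodecs.has(name) }
}

const zstd = describeCodec(6)
const lzo = describeCodec(3)
```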
## Snappy
Uses [hysnappy](https://github.com/hyparam/hysnappy) for fast snappy decompression with minimal wasm.
## Gzip
New gzip implementation adapted from [fflate](https://github.com/101arrowz/fflate).
It includes modifications to handle back-to-back gzip streams, which sometimes occur in parquet files and were not supported by fflate.
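
To make the back-to-back case concrete, here is a minimal sketch of a decoder loop that keeps reading gzip members until the input is exhausted. It is not the package's implementation: to stay short it only understands stored (uncompressed) deflate blocks, ignores optional header fields, and skips CRC-32 verification. Both function names are hypothetical.

```js
// Build one gzip member holding `data` in a single stored (uncompressed) deflate block.
// Simplifications: no optional header fields, data shorter than 65536 bytes,
// and a zero CRC-32 because this sketch never verifies it.
function gzipStoredMember(data) {
  const out = new Uint8Array(10 + 5 + data.length + 8)
  out.set([0x1f, 0x8b, 8]) // magic bytes and CM = 8 (deflate); remaining header bytes stay 0
  out[10] = 1 // BFINAL = 1, BTYPE = 00 (stored)
  out[11] = data.length & 0xff // LEN, little endian
  out[12] = data.length >> 8
  out[13] = ~data.length & 0xff // NLEN, one's complement of LEN
  out[14] = (~data.length >> 8) & 0xff
  out.set(data, 15)
  new DataView(out.buffer).setUint32(out.length - 4, data.length, true) // ISIZE
  return out
}

// Decode every gzip member in `bytes`, one after another, and join the results.
function gunzipStoredMembers(bytes) {
  const chunks = []
  let pos = 0
  while (pos < bytes.length) { // keep going while another member follows
    if (bytes[pos] !== 0x1f || bytes[pos + 1] !== 0x8b || bytes[pos + 2] !== 8) {
      throw new Error(`invalid gzip header at offset ${pos}`)
    }
    if (bytes[pos + 3] !== 0) throw new Error('optional header fields are not handled here')
    pos += 10
    let final = false
    while (!final) {
      if (pos + 5 > bytes.length) throw new Error('truncated gzip member')
      if ((bytes[pos] >> 1) & 3) throw new Error('only stored blocks are handled here')
      final = (bytes[pos] & 1) === 1
      const length = bytes[pos + 1] | (bytes[pos + 2] << 8)
      chunks.push(bytes.subarray(pos + 5, pos + 5 + length))
      pos += 5 + length
    }
    pos += 8 // skip the CRC-32 and ISIZE trailer
  }
  const joined = new Uint8Array(chunks.reduce((sum, chunk) => sum + chunk.length, 0))
  let offset = 0
  for (const chunk of chunks) {
    joined.set(chunk, offset)
    offset += chunk.length
  }
  return joined
}

// Two members written back to back
const encoder = new TextEncoder()
const first = gzipStoredMember(encoder.encode('hello '))
const second = gzipStoredMember(encoder.encode('world'))
const stream = new Uint8Array([...first, ...second])
const text = new TextDecoder().decode(gunzipStoredMembers(stream))
```

A decoder that stops after the first member would return only `hello ` here; the outer `while` loop is what picks up the second member.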
## Brotli
Includes a minimal port of [brotli.js](https://github.com/foliojs/brotli.js). The brotli dictionary is stored gzip-compressed and base64-encoded to minimize the distribution bundle size.
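
As a small illustration of the base64 half of that scheme, the sketch below turns a base64 string back into bytes. The `base64ToBytes` helper is hypothetical, and the gunzip step that would follow is omitted.

```js
// Hypothetical helper: decode a base64 string into raw bytes.
// atob is available in browsers and in Node.js 16+.
function base64ToBytes(base64) {
  const binary = atob(base64)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i)
  return bytes
}

// 'H4sI' is the base64 encoding of the gzip header bytes 1f 8b 08
const head = base64ToBytes('H4sI')
```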
## LZ4
The new LZ4 implementation includes support for the legacy hadoop LZ4 frame format used in some old parquet files.
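
The sketch below shows how such framing could be unwrapped. It assumes each hadoop frame is a 4-byte big-endian decompressed size, then a 4-byte big-endian compressed size, then one raw LZ4 block; that layout and both function names are assumptions for illustration, not the package's code.

```js
// Decode one raw LZ4 block (input[start..end)) into `out`, returning the new output position.
function lz4Block(input, start, end, out, outPos) {
  let i = start
  while (i < end) {
    const token = input[i++]
    // literal run: high nibble of the token, extended by extra bytes when it is 15
    let literals = token >> 4
    if (literals === 15) {
      let extra
      do { extra = input[i++]; literals += extra } while (extra === 255)
    }
    out.set(input.subarray(i, i + literals), outPos)
    i += literals
    outPos += literals
    if (i >= end) break // the last sequence carries literals only
    // match: 2-byte little-endian offset, then length from the low nibble plus 4
    const offset = input[i] | (input[i + 1] << 8)
    i += 2
    if (offset === 0 || offset > outPos) throw new Error('invalid lz4 match offset')
    let matchLength = (token & 15) + 4
    if ((token & 15) === 15) {
      let extra
      do { extra = input[i++]; matchLength += extra } while (extra === 255)
    }
    // copy byte by byte so overlapping matches repeat correctly
    for (let k = 0; k < matchLength; k++, outPos++) out[outPos] = out[outPos - offset]
  }
  return outPos
}

// Unwrap the assumed hadoop framing and decode each block in turn.
function decompressHadoopLz4(input, outputLength) {
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength)
  const out = new Uint8Array(outputLength)
  let i = 0
  let outPos = 0
  while (i + 8 <= input.length) {
    const decompressedSize = view.getUint32(i) // big endian
    const compressedSize = view.getUint32(i + 4) // big endian
    i += 8
    const before = outPos
    outPos = lz4Block(input, i, i + compressedSize, out, outPos)
    if (outPos - before !== decompressedSize) throw new Error('lz4 frame size mismatch')
    i += compressedSize
  }
  return out
}

// Hand-built frame: 3 literals 'abc', a 12-byte match at offset 3, then 5 literals 'hello'
const frame = new Uint8Array([
  0, 0, 0, 20, 0, 0, 0, 12,
  0x38, 0x61, 0x62, 0x63, 0x03, 0x00,
  0x50, 0x68, 0x65, 0x6c, 0x6c, 0x6f,
])
const decoded = new TextDecoder().decode(decompressHadoopLz4(frame, 20))
```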
## Zstd
Uses [fzstd](https://github.com/101arrowz/fzstd) for Zstandard decompression.
# Bundle size
| File | Size |
| - | - |
| hyparquet-compressors.min.js | 116.1kb |
| hyparquet-compressors.min.js.gz | 75.2kb |

# References
- https://parquet.apache.org/docs/file-format/data-pages/compression/
- https://en.wikipedia.org/wiki/Brotli
- https://en.wikipedia.org/wiki/Gzip
- https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
- https://en.wikipedia.org/wiki/Snappy_(compression)
- https://en.wikipedia.org/wiki/Zstd
- https://github.com/101arrowz/fflate
- https://github.com/101arrowz/fzstd
- https://github.com/foliojs/brotli.js
- https://github.com/hyparam/hysnappy