# hyparquet decompressors

![hyparquet parakeets](hyparquet-compressors.jpg)

[![npm](https://img.shields.io/npm/v/hyparquet-compressors)](https://www.npmjs.com/package/hyparquet-compressors)
[![minzipped](https://img.shields.io/bundlephobia/minzip/hyparquet-compressors)](https://www.npmjs.com/package/hyparquet-compressors)
[![workflow status](https://github.com/hyparam/hyparquet-compressors/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet-compressors/actions)
[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
![coverage](https://img.shields.io/badge/Coverage-86-darkred)

This package provides decompressors for various compression codecs.
It is designed to be used with [hyparquet](https://github.com/hyparam/hyparquet) in order to provide full support for all parquet compression formats.
## Introduction
[Apache Parquet](https://parquet.apache.org) is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets. It supports a number of different compression formats, but most parquet files use snappy compression.

[Hyparquet](https://github.com/hyparam/hyparquet) is a fast and lightweight parquet reader that is designed to work in both node.js and the browser.

By default, hyparquet only supports `uncompressed` and `snappy` compressed files (the most common parquet compression codecs). The `hyparquet-compressors` package extends support to the remaining parquet compression codecs, with the exception of LZO.

`hyparquet-compressors` works in both node.js and the browser. It uses js and wasm packages and has no system dependencies.
## Hyparquet
To use `hyparquet-compressors` with `hyparquet`, pass the `compressors` object to the `parquetReadObjects` function.

```js
import { parquetReadObjects } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'

// `file` is the parquet file to read (see the hyparquet docs for how to obtain one)
const data = await parquetReadObjects({ file, compressors })
```

See the [hyparquet](https://github.com/hyparam/hyparquet) repo for more info.
## Compression formats
Parquet compression types supported with `hyparquet-compressors`:

- [X] Uncompressed
- [X] Snappy
- [X] Gzip
- [ ] LZO
- [X] Brotli
- [X] LZ4
- [X] ZSTD
- [X] LZ4_RAW
### Snappy
Snappy decompression uses [hysnappy](https://github.com/hyparam/hysnappy), a minimal [WASM](https://en.wikipedia.org/wiki/WebAssembly) module built for fast decompression.
We load the wasm module _synchronously_ from base64 in the js file. This avoids a network request, and greatly simplifies bundling and serving wasm.
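
As a rough illustration of loading wasm synchronously from base64, the sketch below decodes an embedded string and instantiates it without any `fetch` or `await`. The base64 payload here is a tiny hand-assembled module exporting a hypothetical `add` function; it stands in for the real hysnappy binary and is not taken from that project.

```js
// Illustrative only: a tiny wasm module exporting add(a, b),
// standing in for the real (much larger) snappy decompressor binary.
const wasmBase64 = 'AGFzbQEAAAABBwFgAn9/AX8DAgEABwcBA2FkZAAACgkBBwAgACABags='

/** Decode a base64 string into bytes (atob exists in browsers and Node 16+). */
function base64ToBytes(base64) {
  const binary = atob(base64)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i)
  }
  return bytes
}

// Synchronous compile and instantiate: no network request, no promise.
const wasmModule = new WebAssembly.Module(base64ToBytes(wasmBase64))
const wasmInstance = new WebAssembly.Instance(wasmModule)
const add = wasmInstance.exports.add

console.log(add(2, 3)) // 5
```

Because the bytes live inside the js file, a bundler sees only plain JavaScript and needs no special wasm loader or separate asset to serve.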
### Gzip
Gzip decompression uses a new implementation adapted from [fflate](https://github.com/101arrowz/fflate).
It includes modifications to handle back-to-back gzip streams, which sometimes occur in parquet files but are not supported by fflate.

For gzip, the `output` buffer argument is optional:
- If `output` is defined, the decompressor will write to `output` until it is full.
- If `output` is undefined, the decompressor will allocate a new buffer, and expand it as needed to fit the uncompressed gzip data. Importantly, the caller should use the _returned_ buffer.
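
The reason the returned buffer matters is that a typed array is normally a fixed size, so growing it means allocating a larger array and copying. The sketch below shows that general pattern with a hypothetical `appendBytes` helper; it is an assumption about how such growth typically works, not the package's actual code.

```js
/**
 * Hypothetical helper (not part of hyparquet-compressors):
 * write `chunk` at `offset`, growing `output` when it is too small.
 */
function appendBytes(output, offset, chunk) {
  const needed = offset + chunk.length
  if (needed > output.length) {
    // Allocate a larger array and copy over what was written so far
    const grown = new Uint8Array(Math.max(needed, output.length * 2))
    grown.set(output.subarray(0, offset))
    output = grown // the caller's original array is now stale
  }
  output.set(chunk, offset)
  return output
}

const initial = new Uint8Array(2)
const result = appendBytes(initial, 0, Uint8Array.of(1, 2, 3))

console.log(result === initial) // false: a new array was allocated
console.log(result.length) // 4
```

A caller that kept using `initial` here would see only zeros, which is why the returned array should always replace the one passed in.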
### Brotli
Brotli decompression uses a minimal port of [brotli.js](https://github.com/foliojs/brotli.js).
Our implementation uses gzip to pre-compress the brotli dictionary, in order to minimize the bundle size.
### LZ4
The LZ4 decompressor is a new implementation that also supports the legacy hadoop LZ4 frame format used in some old parquet files.
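
For context, the hadoop framing is commonly described as prefixing each LZ4 block with two big-endian 32-bit integers: the decompressed size, then the compressed size. The sketch below reads that header under this assumed layout; `readHadoopLz4Header` is a hypothetical helper for illustration, not a function exported by this package.

```js
/**
 * Hypothetical helper: read the 8-byte block header of the hadoop LZ4 framing.
 * Assumed layout: [decompressed size: u32 BE][compressed size: u32 BE][LZ4 block data]
 */
function readHadoopLz4Header(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  return {
    decompressedSize: view.getUint32(offset, false), // false = big-endian
    compressedSize: view.getUint32(offset + 4, false),
    dataOffset: offset + 8, // where the raw LZ4 block starts
  }
}

// A made-up frame: 16 bytes decompressed, 5 bytes of (fake) compressed data
const frame = Uint8Array.of(0, 0, 0, 16, 0, 0, 0, 5, 1, 2, 3, 4, 5)
const header = readHadoopLz4Header(frame)

console.log(header.decompressedSize, header.compressedSize, header.dataOffset)
```

A decoder could use these sizes as a sanity check: if the header does not match the input length, the data is probably a plain LZ4 block rather than a hadoop frame.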
### Zstd
Uses [fzstd](https://github.com/101arrowz/fzstd) for Zstandard decompression.
## Bundle size
| File | Size |
| --- | --- |
| hyparquet-compressors.min.js | 116.4kb |
| hyparquet-compressors.min.js.gz | 75.2kb |
## References
- https://parquet.apache.org/docs/file-format/data-pages/compression/
- https://en.wikipedia.org/wiki/Brotli
- https://en.wikipedia.org/wiki/Gzip
- https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
- https://en.wikipedia.org/wiki/Snappy_(compression)
- https://en.wikipedia.org/wiki/Zstd
- https://github.com/101arrowz/fflate
- https://github.com/101arrowz/fzstd
- https://github.com/foliojs/brotli.js
- https://github.com/hyparam/hysnappy