# hyparquet decompressors

![hyparquet parakeets](hyparquet-compressors.jpg)

[![npm](https://img.shields.io/npm/v/hyparquet-compressors)](https://www.npmjs.com/package/hyparquet-compressors)
[![minzipped](https://img.shields.io/bundlephobia/minzip/hyparquet-compressors)](https://www.npmjs.com/package/hyparquet-compressors)
[![workflow status](https://github.com/hyparam/hyparquet-compressors/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet-compressors/actions)
[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
![coverage](https://img.shields.io/badge/Coverage-86-darkred)

This package provides decompressors for various compression codecs.
It is designed to be used with [hyparquet](https://github.com/hyparam/hyparquet) in order to provide full support for all parquet compression formats.

## Introduction

[Apache Parquet](https://parquet.apache.org) is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets. It supports a number of different compression formats, but most parquet files use snappy compression.

[Hyparquet](https://github.com/hyparam/hyparquet) is a fast and lightweight parquet reader that is designed to work in both node.js and the browser.

By default, hyparquet only supports `uncompressed` and `snappy` compressed files (the most common parquet compression codecs). The `hyparquet-compressors` package adds support for the other parquet compression codecs listed below, with the exception of LZO.

`hyparquet-compressors` works in both node.js and the browser. It uses only js and wasm packages, with no system dependencies.

## Hyparquet

To use `hyparquet-compressors` with `hyparquet`, simply pass the `compressors` object to the `parquetReadObjects` function.

```js
import { parquetReadObjects } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'

const data = await parquetReadObjects({ file, compressors })
```

See the [hyparquet](https://github.com/hyparam/hyparquet) repo for more info.

## Compression formats

Parquet compression types supported with `hyparquet-compressors`:
 - [X] Uncompressed
 - [X] Snappy
 - [X] Gzip
 - [ ] LZO
 - [X] Brotli
 - [X] LZ4
 - [X] ZSTD
 - [X] LZ4_RAW

### Snappy

Snappy decompression uses [hysnappy](https://github.com/hyparam/hysnappy), which provides fast snappy decompression through a minimal [WASM](https://en.wikipedia.org/wiki/WebAssembly) module.

We load the wasm module _synchronously_ from base64 in the js file. This avoids a network request, and greatly simplifies bundling and serving wasm.
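
As a rough illustration of the base64 approach, the sketch below embeds a tiny hand-written wasm module as a base64 string and instantiates it synchronously. The module (a single `add` function) and the helper name `instantiateSync` are stand-ins for illustration; this is not the hysnappy module or its loader.

```js
// Bytes of a tiny wasm module that exports add(a, b) for two i32 values
// (a stand-in for illustration, not the hysnappy module)
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic number and version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export function 0 as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0, local.get 1, i32.add
])

// Build time: embed the module in the js source as a base64 string
const wasmBase64 = btoa(String.fromCharCode(...wasmBytes))

// Load time: decode the string and instantiate synchronously, with no fetch
// (instantiateSync is a hypothetical helper name)
function instantiateSync(base64) {
  const bytes = Uint8Array.from(atob(base64), c => c.charCodeAt(0))
  const wasmModule = new WebAssembly.Module(bytes)
  return new WebAssembly.Instance(wasmModule)
}

const { add } = instantiateSync(wasmBase64).exports
```

Because the bytes travel inside the js file, a bundler only needs to handle plain JavaScript.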

### Gzip

The gzip implementation is adapted from [fflate](https://github.com/101arrowz/fflate).
It is modified to handle back-to-back gzip streams, which sometimes occur in parquet files but are not supported by fflate.

For gzip, the `output` buffer argument is optional:
 - If `output` is defined, the decompressor will write to `output` until it is full.
 - If `output` is undefined, the decompressor will allocate a new buffer, and expand it as needed to fit the uncompressed gzip data. Importantly, the caller should use the _returned_ buffer.
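
The sketch below illustrates both points using Node's built-in `zlib` rather than this package's gzip code: what back-to-back gzip streams look like, and how an optional `output` argument can behave. The helper name `gunzipInto` is hypothetical, and the example assumes a Node.js environment.

```js
import { gunzipSync, gzipSync } from 'node:zlib'

// Two complete gzip streams concatenated back to back
const twoStreams = Buffer.concat([gzipSync('hello '), gzipSync('world')])

// Hypothetical helper mirroring the optional-output contract described above.
// It relies on node:zlib, not on the gzip implementation in this package.
function gunzipInto(input, output) {
  const bytes = gunzipSync(input) // node:zlib decodes all concatenated streams
  if (output === undefined) {
    // No buffer supplied: return a newly allocated one sized to the data
    return new Uint8Array(bytes)
  }
  // Buffer supplied: write until it is full and drop the rest
  output.set(bytes.subarray(0, output.length))
  return output
}

const all = gunzipInto(twoStreams) // new buffer holding both streams
const fixed = gunzipInto(twoStreams, new Uint8Array(5)) // first 5 bytes only
```

In both cases the caller reads from the returned buffer, which is why the return value matters when `output` is undefined.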

### Brotli

Brotli decompression uses a minimal port of [brotli.js](https://github.com/foliojs/brotli.js).
Our implementation uses gzip to pre-compress the brotli dictionary in order to minimize the bundle size.
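
The idea can be sketched as follows, assuming a Node.js environment and using `node:zlib` in place of this package's own gzip code. The dictionary contents and the helper name `getDictionary` are made up for illustration; the real brotli dictionary is a fixed table defined by the brotli format.

```js
import { gunzipSync, gzipSync } from 'node:zlib'

// Stand-in for a static dictionary: repetitive text that compresses well
const dictionary = Buffer.from('the of and to in is that for it as with '.repeat(200))

// Build time: gzip the dictionary and embed it in the bundle as a base64 string
const embedded = gzipSync(dictionary).toString('base64')

// Run time: inflate once on first use, then reuse the cached bytes
// (getDictionary is a hypothetical helper name)
let cached
function getDictionary() {
  if (!cached) cached = new Uint8Array(gunzipSync(Buffer.from(embedded, 'base64')))
  return cached
}
```

The trade-off is a one-time decompression cost at load in exchange for a smaller file to download.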

### LZ4

The LZ4 implementation includes support for the legacy hadoop LZ4 frame format found in some older parquet files.
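
For orientation, here is a minimal sketch of how hadoop-framed LZ4 data might be decoded. It assumes each frame is a 4-byte big-endian uncompressed size, a 4-byte big-endian compressed size, and then a single raw LZ4 block. That layout and the function names `lz4Block` and `hadoopLz4` are assumptions for illustration, not this package's actual code, and the sketch does no validation of malformed input.

```js
// Decode one raw LZ4 block (a sequence of literal runs and back-references)
function lz4Block(input, outputLength) {
  const output = new Uint8Array(outputLength)
  let i = 0 // read position in input
  let o = 0 // write position in output
  while (i < input.length) {
    const token = input[i++]
    // High nibble: literal length, extended by following bytes while they are 255
    let literals = token >> 4
    if (literals === 15) {
      let b
      do { b = input[i++]; literals += b } while (b === 255)
    }
    output.set(input.subarray(i, i + literals), o)
    i += literals
    o += literals
    if (i >= input.length) break // the last sequence has literals only
    // Two-byte little-endian offset back into the output
    const offset = input[i] | input[i + 1] << 8
    i += 2
    // Low nibble: match length minus 4, extended the same way
    let matchLength = (token & 15) + 4
    if ((token & 15) === 15) {
      let b
      do { b = input[i++]; matchLength += b } while (b === 255)
    }
    // Copy byte by byte, since a match may overlap the bytes being written
    for (const end = o + matchLength; o < end; o++) output[o] = output[o - offset]
  }
  return output
}

// Assumed hadoop framing: [uncompressed size][compressed size][raw LZ4 block]
function hadoopLz4(input, outputLength) {
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength)
  const output = new Uint8Array(outputLength)
  let i = 0
  let o = 0
  while (i < input.length) {
    const uncompressedSize = view.getUint32(i) // DataView reads big-endian by default
    const compressedSize = view.getUint32(i + 4)
    i += 8
    output.set(lz4Block(input.subarray(i, i + compressedSize), uncompressedSize), o)
    i += compressedSize
    o += uncompressedSize
  }
  return output
}

// One frame: 9 bytes uncompressed, 6 bytes compressed.
// Token 0x32 means 3 literals ("abc") followed by a 6-byte match at offset 3.
const frame = new Uint8Array([0, 0, 0, 9, 0, 0, 0, 6, 0x32, 0x61, 0x62, 0x63, 3, 0])
const decoded = new TextDecoder().decode(hadoopLz4(frame, 9))
```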

### Zstd

Uses [fzstd](https://github.com/101arrowz/fzstd) for Zstandard decompression.

## Bundle size

| File | Size |
| --- | --- |
| hyparquet-compressors.min.js | 116.4 KB |
| hyparquet-compressors.min.js.gz | 75.2 KB |

## References

 - https://parquet.apache.org/docs/file-format/data-pages/compression/
 - https://en.wikipedia.org/wiki/Brotli
 - https://en.wikipedia.org/wiki/Gzip
 - https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
 - https://en.wikipedia.org/wiki/Snappy_(compression)
 - https://en.wikipedia.org/wiki/Zstd
 - https://github.com/101arrowz/fflate
 - https://github.com/101arrowz/fzstd
 - https://github.com/foliojs/brotli.js
 - https://github.com/hyparam/hysnappy