Update README with hyparquet-compressors

Kenny Daniel 2024-05-20 05:10:21 -07:00
parent d453313dca
commit 97708544e4
No known key found for this signature in database
GPG Key ID: 90AB653A8CAD7E45

@ -149,9 +149,20 @@ await parquetRead({
The parquet format is a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures.
Hyparquet does not support 100% of all parquet files.
Supported parquet encodings:
- [X] PLAIN
- [X] PLAIN_DICTIONARY
- [X] RLE_DICTIONARY
- [X] RLE
- [X] BIT_PACKED
- [X] DELTA_BINARY_PACKED
- [X] DELTA_BYTE_ARRAY
- [X] DELTA_LENGTH_BYTE_ARRAY
- [X] BYTE_STREAM_SPLIT
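The three delta encodings in this list share one core idea: store a base value plus per-element differences, and reconstruct with a running sum. A minimal sketch of that concept (simplified for illustration only; the real DELTA_BINARY_PACKED encoding additionally bit-packs the deltas in blocks with per-miniblock bit widths):

```javascript
// Illustrative sketch of the idea behind the DELTA_* encodings:
// one base value plus deltas, reconstructed with a prefix sum.
// (Not the actual parquet wire format.)
function deltaDecode(first, deltas) {
  const values = [first]
  for (const d of deltas) {
    values.push(values[values.length - 1] + d)
  }
  return values
}

// deltaDecode(10, [1, -2, 3]) reconstructs [10, 11, 9, 12]
```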
## Compression
Supporting every possible compression codec available in parquet would blow up the size of the hyparquet library. In practice, most parquet files use snappy compression.
Parquet compression types supported by default:
- [X] Uncompressed
@ -163,29 +174,9 @@ Parquet compression types supported by default:
- [ ] ZSTD
- [ ] LZ4_RAW
You can provide custom compression codecs using the `compressors` option.
```js
import { parquetRead } from 'hyparquet'
import { gunzipSync } from 'zlib'
await parquetRead({ file, compressors: {
GZIP: (input, output) => output.set(gunzipSync(input)), // add gzip support
}})
```
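As the gzip example suggests, each entry in `compressors` is a function that receives the compressed bytes and fills a pre-sized output `Uint8Array` with the decompressed bytes. A minimal sketch of that shape, using a hypothetical passthrough codec purely for illustration:

```javascript
// Sketch of the (input, output) decompressor shape used by the
// `compressors` option: input holds the compressed bytes, and
// output is a pre-sized Uint8Array to fill with decompressed data.
// (The passthrough codec is hypothetical, for illustration only.)
const passthrough = (input, output) => output.set(input)

const input = new Uint8Array([1, 2, 3])
const output = new Uint8Array(input.length)
passthrough(input, output)
// output now holds [1, 2, 3]
```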
## hysnappy
The most common compression codec used in parquet is snappy compression.
Hyparquet includes a built-in snappy decompressor written in JavaScript.
@ -199,13 +190,28 @@ To use hysnappy for faster parsing of large parquet files, override the `SNAPPY`
import { parquetRead } from 'hyparquet'
import { snappyUncompressor } from 'hysnappy'
await parquetRead({
file,
compressors: {
SNAPPY: snappyUncompressor(),
},
onComplete: console.log,
})
```
Parsing a [420MB Wikipedia parquet file](https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.en/train-00000-of-00041.parquet) using hysnappy reduces parsing time by over 40% (4.1s to 2.3s).
## hyparquet-compressors
You can include support for ALL parquet compression codecs using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) library.
```js
import { parquetRead } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'
await parquetRead({ file, compressors, onComplete: console.log })
```
## References
- https://github.com/apache/parquet-format