Update README with example for Async and Row/Column filtering

Kenny Daniel 2024-04-11 13:11:30 -07:00
parent d74da081bb
commit dd91122753
2 changed files with 65 additions and 26 deletions

@@ -20,12 +20,12 @@ Hyparquet allows you to read and extract data from Parquet files directly in JavaScript
1. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
2. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
3. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to install and use in any JavaScript project.
-4. **TypeScript support**: The library is written in typed js code and provides TypeScript type definitions out of the box.
+4. **TypeScript support**: The library is written in jsdoc-typed JavaScript and provides TypeScript definitions out of the box.
5. **Flexible data access**: Hyparquet allows you to read specific subsets of data by specifying row and column ranges, giving fine-grained control over what data is fetched and loaded.
## Features
-- Designed to work with huge ML datasets (things like [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata))
+- Designed to work with huge ML datasets (like [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata))
- Can load metadata separately from data
- Data can be filtered by row and column ranges
- Only fetches the data needed
@@ -33,7 +33,7 @@ Hyparquet allows you to read and extract data from Parquet files directly in JavaScript
- Fast data loading for large scale ML applications
- Bring data visualization closer to the user, in the browser
-Why make a new parquet parser in javascript?
+Why make a new parquet parser?
First, existing libraries like [parquetjs](https://github.com/ironSource/parquetjs) are officially "inactive".
Importantly, they do not support the kind of stream processing needed to make a really performant parser in the browser.
And finally, no dependencies means that hyparquet is lean, and easy to package and deploy.
@@ -46,12 +46,6 @@ https://hyparam.github.io/hyparquet/
Demo source: [index.html](index.html)
-## Installation
-```bash
-npm install hyparquet
-```
## Usage
Install the hyparquet package from npm:
@@ -99,11 +93,57 @@ await parquetRead({
})
```
+## Filtering
+To read large parquet files, it is recommended that you filter by row and column.
+Hyparquet is designed to load only the minimal amount of data needed to fulfill a query.
+You can filter rows by number, or columns by name:
+```js
+import { parquetRead } from 'hyparquet'
+await parquetRead({
+  file,
+  columns: ['colA', 'colB'], // include columns colA and colB
+  rowStart: 100,
+  rowEnd: 200,
+  onComplete: data => console.log(data),
+})
+```
## Async
-Hyparquet supports asynchronous fetching of parquet files, over a network.
+Hyparquet supports asynchronous fetching of parquet files over a network.
+You can provide an `AsyncBuffer` which is like a js `ArrayBuffer` but the `slice` method returns `Promise<ArrayBuffer>`.
+```typescript
+interface AsyncBuffer {
+  byteLength: number
+  slice(start: number, end?: number): Promise<ArrayBuffer>
+}
+```
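For illustration, any random-access data source can be adapted to this interface. Below is a minimal sketch of a hypothetical `asyncBufferFromFile` helper (not part of hyparquet's API) that wraps a local file in Node.js:

```js
import { promises as fs } from 'fs'

// Hypothetical helper: expose a local file as an AsyncBuffer (Node.js)
async function asyncBufferFromFile(filename) {
  const { size } = await fs.stat(filename)
  return {
    byteLength: size,
    async slice(start, end = size) {
      const fh = await fs.open(filename)
      try {
        const buffer = Buffer.alloc(end - start)
        // read bytes [start, end) from the file into the buffer
        await fh.read(buffer, 0, end - start, start)
        return buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.length)
      } finally {
        await fh.close()
      }
    },
  }
}
```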
+You can read parquet files asynchronously using HTTP Range requests so that only the necessary byte ranges from a `url` will be fetched:
+```js
+import { parquetRead } from 'hyparquet'
+const url = 'https://...'
+await parquetRead({
+  file: { // AsyncBuffer
+    byteLength, // total file size in bytes (e.g. from a HEAD request)
+    async slice(start, end) {
+      const headers = new Headers()
+      // use an open-ended range when end is undefined, since slice's end is optional
+      const rangeEnd = end === undefined ? '' : end - 1
+      headers.set('Range', `bytes=${start}-${rangeEnd}`)
+      const res = await fetch(url, { headers })
+      if (!res.ok || !res.body) throw new Error('fetch failed')
+      return readableStreamToArrayBuffer(res.body)
+    },
+  },
+  onComplete: data => console.log(data),
+})
+```
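The example above calls a `readableStreamToArrayBuffer` helper that is not defined in the snippet; one minimal way to write it, using the standard `Response` API to buffer the stream:

```js
// Buffer a ReadableStream into a single ArrayBuffer.
// Response can consume the stream and do the buffering for us.
async function readableStreamToArrayBuffer(body) {
  return new Response(body).arrayBuffer()
}
```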
## Supported Parquet Files
The parquet format is known to be a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures.
@@ -112,19 +152,7 @@ Hyparquet does not support 100% of all parquet files.
Supporting every possible compression codec available in parquet would blow up the size of the hyparquet library.
In practice, most parquet files use snappy compression.
-You can extend support for parquet files with other compression codec using the `compressors` option.
-```js
-import { parquetRead } from 'hyparquet'
-import { gunzipSync } from 'zlib'
-parquetRead({ file, compressors: {
-  // add gzip support:
-  GZIP: (input, output) => output.set(gunzipSync(input)),
-}})
-```
-Compression:
+Parquet compression types supported by default:
- [X] Uncompressed
- [X] Snappy
- [ ] GZip
@@ -134,6 +162,17 @@ Compression:
- [ ] ZSTD
- [ ] LZ4_RAW
+You can extend support for other compression codecs using the `compressors` option.
+```js
+import { parquetRead } from 'hyparquet'
+import { gunzipSync } from 'zlib'
+parquetRead({ file, compressors: {
+  GZIP: (input, output) => output.set(gunzipSync(input)), // add gzip support
+}})
+```
## Hysnappy
The most common compression codec used in parquet is snappy compression.
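Hysnappy presumably plugs in through the same `compressors` option shown above; a sketch, assuming hysnappy exports a `snappyUncompressor` factory (check the hysnappy docs for the exact API):

```js
import { parquetRead } from 'hyparquet'
import { snappyUncompressor } from 'hysnappy' // assumed export name

// Replace the default snappy decompressor with the WASM one
parquetRead({ file, compressors: {
  SNAPPY: snappyUncompressor(),
}})
```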
@@ -160,6 +199,8 @@ Parsing a [420mb wikipedia parquet file](https://huggingface.co/datasets/wikimed
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing
- https://github.com/apache/thrift
+- https://github.com/apache/arrow
+- https://github.com/dask/fastparquet
- https://github.com/google/snappy
- https://github.com/ironSource/parquetjs
- https://github.com/zhipeng-jia/snappyjs

@@ -73,8 +73,6 @@ export async function parquetMetadataAsync(asyncBuffer, initialFetchSize = 1 <<
 */
export function parquetMetadata(arrayBuffer) {
  if (!arrayBuffer) throw new Error('parquet arrayBuffer is required')
  // DataView for easier manipulation of the buffer
  const view = new DataView(arrayBuffer)
  // Validate footer magic number "PAR1"
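  // For reference (a sketch, not hyparquet's actual source): the check reads
  // the last 4 bytes as a little-endian uint32 and compares it to "PAR1":
  //   const magic = view.getUint32(view.byteLength - 4, true)
  //   if (magic !== 0x31524150) throw new Error('parquet file invalid (footer != PAR1)')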
@@ -97,7 +95,7 @@ export function parquetMetadata(arrayBuffer) {
  const metadataOffset = metadataLengthOffset - metadataLength
  const { value: metadata } = deserializeTCompactProtocol(view.buffer, view.byteOffset + metadataOffset)
-  // Parse parquet metadata from thrift data
+  // Parse metadata from thrift data
  const version = metadata.field_1
  const schema = metadata.field_2.map((/** @type {any} */ field) => ({
    type: ParquetType[field.field_1],