parquet file parser for javascript

hyparquet

JavaScript parser for Apache Parquet files.

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

Dependency free since 2023!

Features

  • Designed to work with huge ML datasets (things like starcoder)
  • Can load metadata separately from data
  • Data can be filtered by row and column ranges
  • Only fetches the data needed
  • Written in JavaScript, checked with TypeScript
  • Fast data loading for large scale ML applications
  • Bring data visualization closer to the user, in the browser
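Loading metadata separately from data is possible because of how the Parquet format lays out a file: the metadata sits in a footer at the end, followed by a 4-byte little-endian metadata length and the 4-byte magic string `PAR1`. The sketch below locates that footer in a complete file buffer. `readFooterInfo` is a hypothetical helper written for illustration, not a hyparquet export, and it does not decode the metadata itself.

```javascript
// Parquet file tail: [metadata bytes][4-byte little-endian metadata length]["PAR1"]
const MAGIC = 'PAR1'

/**
 * Locate the footer metadata inside a complete parquet file buffer.
 * Hypothetical helper for illustration; not part of hyparquet.
 * @param {ArrayBuffer} arrayBuffer entire file contents
 * @returns {{ metadataLength: number, metadataOffset: number }}
 */
function readFooterInfo(arrayBuffer) {
  // Smallest possible file: leading magic (4) + length (4) + trailing magic (4)
  if (arrayBuffer.byteLength < 12) {
    throw new Error('file too short to be parquet')
  }
  const view = new DataView(arrayBuffer)
  const end = arrayBuffer.byteLength

  // The last 4 bytes must be the ASCII magic "PAR1"
  let magic = ''
  for (let i = end - 4; i < end; i++) {
    magic += String.fromCharCode(view.getUint8(i))
  }
  if (magic !== MAGIC) throw new Error('missing PAR1 magic at end of file')

  // The 4 bytes before the magic hold the metadata length, little-endian
  const metadataLength = view.getUint32(end - 8, true)
  const metadataOffset = end - 8 - metadataLength
  // The metadata cannot start before the leading 4-byte magic
  if (metadataOffset < 4) throw new Error('metadata length exceeds file size')

  return { metadataLength, metadataOffset }
}
```

A reader that knows `metadataOffset` and `metadataLength` can parse the schema and row group layout without touching any column data.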

Why make a new parquet parser in JavaScript? First, existing libraries like parquetjs are officially "inactive". Second, and more importantly, they do not support the kind of stream processing needed to make a really performant parser in the browser. Finally, having no dependencies means that hyparquet is lean and easy to package and deploy.

Demo

Online parquet file reader demo available at:

https://hyparam.github.io/hyparquet/

Demo source: index.html

Installation

npm install hyparquet

Usage

If you're in a Node.js environment, you can load a parquet file like this:

const { parquetMetadata } = await import('hyparquet')
const fs = await import('fs')

const buffer = fs.readFileSync('example.parquet')
const arrayBuffer = new Uint8Array(buffer).buffer
const metadata = parquetMetadata(arrayBuffer)
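The `new Uint8Array(buffer).buffer` step copies the file bytes into a fresh `ArrayBuffer`. This matters because a Node.js `Buffer` can be a view into a larger shared `ArrayBuffer`, so `buffer.buffer` on its own may contain unrelated bytes. An equivalent conversion, sketched here with a hypothetical helper name `toArrayBuffer`, slices the underlying memory by offset and length:

```javascript
/**
 * Copy exactly the bytes of a Node.js Buffer into a standalone ArrayBuffer.
 * Hypothetical helper for illustration; not part of hyparquet.
 * @param {Buffer} buffer
 * @returns {ArrayBuffer}
 */
function toArrayBuffer(buffer) {
  // buffer.buffer may be larger than the Buffer itself, so slice the exact range
  return buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
}

// Small Buffers are often carved out of a shared pool
const small = Buffer.from('PAR1')
const copy = toArrayBuffer(small)
console.log(copy.byteLength) // 4
```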

If you're in a browser environment, the parquet data will usually come from one of two places: a file the user drags and drops onto the page, or a file downloaded from the web.

To load parquet data in the browser from a remote server using fetch:

import { parquetMetadata } from 'hyparquet'

const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
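The example above downloads the whole file. When only part of a large file is needed, such as the footer at the end, an HTTP range request can fetch just those bytes. The following is a minimal sketch, assuming the server supports range requests; `suffixRange` and `fetchTail` are hypothetical helpers and not part of the hyparquet API.

```javascript
/**
 * Build an HTTP Range header value that asks for the last `n` bytes of a resource.
 * Hypothetical helper for illustration.
 * @param {number} n number of trailing bytes to request
 * @returns {string}
 */
function suffixRange(n) {
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error('n must be a positive integer')
  }
  return `bytes=-${n}`
}

/**
 * Fetch only the tail of a remote file.
 * Assumes the server honors range requests with 206 Partial Content.
 * @param {string} url
 * @param {number} n number of trailing bytes to fetch
 * @returns {Promise<ArrayBuffer>}
 */
async function fetchTail(url, n) {
  const res = await fetch(url, { headers: { Range: suffixRange(n) } })
  if (res.status !== 206) {
    throw new Error(`range request not honored: ${res.status}`)
  }
  return res.arrayBuffer()
}
```

Fetching the last 8 bytes first yields the metadata length and the `PAR1` magic, which tells the client how many more trailing bytes to request for the metadata itself.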

To parse parquet files from a user drag-and-drop action, see the example in index.html.

Supported Parquet Files

The parquet format supports a number of different compression and encoding types. Hyparquet does not support 100% of all parquet files, and probably never will: supporting every possible compression type would increase the size of the library, and many of those types are rarely used in practice.

Compression:

  • Uncompressed
  • Snappy
  • GZip
  • LZO
  • Brotli
  • LZ4
  • ZSTD
  • LZ4_RAW

Page Type:

  • Data Page
  • Index Page
  • Dictionary Page
  • Data Page V2

Contributions are welcome!

References