# hyparquet

[npm package](https://www.npmjs.com/package/hyparquet)
[GitHub Actions](https://github.com/hyparam/hyparquet/actions)
[MIT License](https://opensource.org/licenses/MIT)

JavaScript parser for [Apache Parquet](https://parquet.apache.org) files.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

Dependency free since 2023!

## Features
- Designed to work with huge ML datasets (things like [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata))
- Can load metadata separately from data
- Data can be filtered by row and column ranges
- Only fetches the data needed
- Written in JavaScript, checked with TypeScript
- Fast data loading for large scale ML applications
- Bring data visualization closer to the user, in the browser

Why make a new parquet parser in JavaScript?
First, existing libraries like [parquetjs](https://github.com/ironSource/parquetjs) are officially "inactive".
More importantly, they do not support the kind of stream processing needed to make a really performant parser in the browser.
And finally, having no dependencies means that hyparquet is lean and easy to package and deploy.
## Demo
Online parquet file reader demo available at:

https://hyparam.github.io/hyparquet/

Demo source: [index.html](index.html)
## Installation
```bash
npm install hyparquet
```
## Usage
If you're in a Node.js environment, you can load a parquet file as in the following example:
```js
// Top-level await requires an ES module (e.g. a .mjs file)
const { parquetMetadata } = await import('hyparquet')
const fs = await import('fs')
// Read the whole file into a Node.js Buffer
const buffer = fs.readFileSync('example.parquet')
// Copy the bytes into a standalone ArrayBuffer
const arrayBuffer = new Uint8Array(buffer).buffer
const metadata = parquetMetadata(arrayBuffer)
```
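The Node.js example copies the `Buffer` into a new `Uint8Array` before taking `.buffer`. A likely reason is that a Node.js `Buffer` can be a view into a larger shared `ArrayBuffer`, so `buffer.buffer` on its own may contain unrelated bytes. The helper below is a minimal sketch of an equivalent conversion; `toArrayBuffer` is a hypothetical name and not part of hyparquet.

```js
// Hypothetical helper (not part of hyparquet): convert a Node.js Buffer
// into an ArrayBuffer that holds exactly the Buffer's bytes.
function toArrayBuffer(buffer) {
  // A Buffer may be a window into a larger pooled ArrayBuffer,
  // so slice out only the region this Buffer actually covers.
  return buffer.buffer.slice(
    buffer.byteOffset,
    buffer.byteOffset + buffer.byteLength
  )
}
```

The result can be passed to `parquetMetadata` in place of `new Uint8Array(buffer).buffer`.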
If you're in a browser environment, you'll probably get parquet file data from either a drag-and-dropped file from the user, or downloaded from the web.
To load parquet data in the browser from a remote server using `fetch`:
```js
import { parquetMetadata } from 'hyparquet'

// `url` is the address of a parquet file, supplied by the caller
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
```
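The `fetch` example downloads the entire file. Metadata can in principle be located without doing so, because the Parquet format ends every (unencrypted) file with the metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`. The sketch below only computes where the metadata sits given the file size and the last 8 bytes; `parquetFooterRange` is a hypothetical helper and not part of hyparquet's API.

```js
// Hypothetical helper (not part of hyparquet): locate the metadata block.
// File layout at the end: [metadata][uint32 LE metadata length]["PAR1"]
function parquetFooterRange(fileSize, tail) {
  // tail: ArrayBuffer holding exactly the last 8 bytes of the file
  if (tail.byteLength !== 8) {
    throw new Error('expected the last 8 bytes of the file')
  }
  const view = new DataView(tail)
  // Bytes 4..7 must spell "PAR1" (0x50 0x41 0x52 0x31)
  if (view.getUint32(4, true) !== 0x31524150) {
    throw new Error('missing PAR1 magic: not a parquet file')
  }
  const metadataLength = view.getUint32(0, true)
  const start = fileSize - 8 - metadataLength
  // The file also begins with a 4-byte magic, so metadata cannot start before byte 4
  if (start < 4) {
    throw new Error('metadata length exceeds file size')
  }
  // Byte range [start, end) that contains the metadata
  return { start, end: fileSize - 8, metadataLength }
}
```

On a server that supports HTTP range requests, the 8-byte tail could be requested with a `Range: bytes=-8` header. Whether `parquetMetadata` accepts a partial buffer is not stated here, so verify that before relying on it.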
To parse parquet files from a user drag-and-drop action, see the example in [index.html](index.html).
## References
- https://github.com/apache/parquet-format
- https://github.com/dask/fastparquet
- https://github.com/apache/thrift
- https://github.com/google/snappy
- https://github.com/zhipeng-jia/snappyjs