mirror of
https://github.com/asadbek064/hyparquet.git
synced 2026-01-08 12:06:38 +00:00
Update README
parent
fb4fb45485
commit
5d13c9719a
216
README.md
@ -12,96 +12,76 @@ Dependency free since 2023!
|
||||
|
||||
## What is hyparquet?
|
||||
|
||||
Hyparquet is a lightweight, pure JavaScript library for parsing [Apache Parquet](https://parquet.apache.org) files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.
|
||||
**Hyparquet** is a lightweight, dependency-free, pure JavaScript library for parsing [Apache Parquet](https://parquet.apache.org) files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.
|
||||
|
||||
Hyparquet allows you to read and extract data from Parquet files directly in JavaScript environments, both in Node.js and in the browser, without any dependencies. Designed for performance and ease of use, hyparquet is ideal for data engineering, data science, and machine learning applications that require efficient data processing.
|
||||
Hyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.
|
||||
|
||||
## Demo
|
||||
## Parquet Viewer
|
||||
|
||||
Online parquet file reader demo available at:
|
||||
**Try hyparquet online**: Drag and drop your parquet file onto [hyperparam.app](https://hyperparam.app) to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.
|
||||
|
||||
[https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/](https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/)
|
||||
|
||||
[](https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/)
|
||||
|
||||
See the [source code](https://github.com/hyparam/hyperparam-cli/tree/master/apps/hyparquet-demo).
|
||||
[](https://hyperparam.app/)
|
||||
|
||||
## Features
|
||||
|
||||
1. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
|
||||
2. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
|
||||
3. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to install and use in any JavaScript project.
|
||||
4. **TypeScript support**: The library is written in jsdoc-typed JavaScript and provides TypeScript definitions out of the box.
|
||||
5. **Flexible data access**: Hyparquet allows you to read specific subsets of data by specifying row and column ranges, giving fine-grained control over what data is fetched and loaded.
|
||||
1. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
|
||||
2. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
|
||||
3. **TypeScript**: Includes TypeScript definitions.
|
||||
4. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.2kb min.gz!
|
||||
5. **Highly Compliant**: Supports all parquet encodings and compression codecs, and can open more parquet files than any other library.
|
||||
|
||||
## Why hyparquet?
|
||||
|
||||
Why make a new parquet parser?
|
||||
First, existing libraries like [parquetjs](https://github.com/ironSource/parquetjs) are officially "inactive".
|
||||
Importantly, they do not support the kind of stream processing needed to make a really performant parser in the browser.
|
||||
And finally, no dependencies means that hyparquet is lean, and easy to package and deploy.
|
||||
Existing JavaScript-based parquet readers (like [parquetjs](https://github.com/ironSource/parquetjs)) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size.
|
||||
Hyparquet is actively maintained and designed with modern web usage in mind.
|
||||
|
||||
## Usage
|
||||
## Demo
|
||||
|
||||
Install the hyparquet package from npm:
|
||||
Check out a minimal parquet viewer demo that shows how to integrate hyparquet into a React web application using [HighTable](https://github.com/hyparam/hightable).
|
||||
|
||||
```bash
|
||||
npm install hyparquet
|
||||
```
|
||||
- **Live Demo**: [https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/](https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/)
|
||||
- **Source Code**: [https://github.com/hyparam/hyperparam-cli/tree/master/apps/hyparquet-demo](https://github.com/hyparam/hyperparam-cli/tree/master/apps/hyparquet-demo)
|
||||
|
||||
## Reading Data
|
||||
## Quick Start
|
||||
|
||||
### Node.js
|
||||
### Node.js Example
|
||||
|
||||
To read the entire contents of a parquet file in a Node.js environment:
|
||||
To read the contents of a parquet file in a Node.js environment, use `asyncBufferFromFile`:
|
||||
|
||||
```javascript
|
||||
const { asyncBufferFromFile, parquetRead } = await import('hyparquet')
|
||||
|
||||
await parquetRead({
|
||||
file: await asyncBufferFromFile(filename),
|
||||
onComplete: data => console.log(data)
|
||||
})
|
||||
```
|
||||
|
||||
The `hyparquet` package is an ES module and is not packaged as a CommonJS module. That's why you need to use a dynamic import to load the module in Node.js.
|
||||
Note: Hyparquet is published as an ES module, so dynamic `import()` may be required on the command line.
|
||||
|
||||
### Browser
|
||||
### Browser Example
|
||||
|
||||
Hyparquet supports asynchronous fetching of parquet files over a network.
|
||||
In the browser, use `asyncBufferFromUrl` to wrap a URL for reading asynchronously over the network.
|
||||
It is recommended that you filter by row and column to limit fetch size:
|
||||
|
||||
```js
|
||||
const { asyncBufferFromUrl, parquetRead } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')
|
||||
|
||||
const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
|
||||
await parquetRead({
|
||||
file: await asyncBufferFromUrl({url}),
|
||||
columns: ['Breed Name', 'Lifespan'],
|
||||
rowStart: 10,
|
||||
rowEnd: 20,
|
||||
onComplete: data => console.log(data)
|
||||
})
|
||||
```
|
||||
|
||||
Pass the `requestInit` option to authenticate, for example:
|
||||
## Advanced Usage
|
||||
|
||||
```js
|
||||
await parquetRead({
|
||||
file: await asyncBufferFromUrl({url, requestInit: {headers: {Authorization: 'Bearer my_token'}}}),
|
||||
onComplete: data => console.log(data)
|
||||
})
|
||||
```
|
||||
|
||||
## Metadata
|
||||
|
||||
You can read just the metadata, including schema and data statistics, using the `parquetMetadata` function:
|
||||
|
||||
```javascript
|
||||
const { parquetMetadata } = await import('hyparquet')
|
||||
const fs = await import('fs')
|
||||
|
||||
const buffer = fs.readFileSync('example.parquet')
|
||||
const arrayBuffer = new Uint8Array(buffer).buffer
|
||||
const metadata = parquetMetadata(arrayBuffer)
|
||||
```
|
||||
|
||||
If you're in a browser environment, you'll probably get parquet file data from either a drag-and-dropped file from the user, or downloaded from the web.
|
||||
### Reading Metadata
|
||||
|
||||
You can read just the metadata, including schema and data statistics, using the `parquetMetadata` function.
|
||||
To load parquet data in the browser from a remote server using `fetch`:
|
||||
|
||||
```javascript
|
||||
@ -112,28 +92,31 @@ const arrayBuffer = await res.arrayBuffer()
|
||||
const metadata = parquetMetadata(arrayBuffer)
|
||||
```
|
||||
|
||||
To parse parquet files from a user drag-and-drop action, see example in [index.html](index.html).
|
||||
### AsyncBuffer
|
||||
|
||||
## Filtering by Row and Column
|
||||
Hyparquet accepts a `file` argument of type `AsyncBuffer`, which is like a JavaScript `ArrayBuffer` except that its `slice` method returns `Promise<ArrayBuffer>`.
|
||||
|
||||
To read large parquet files, it is recommended that you filter by row and column.
|
||||
Hyparquet is designed to load only the minimal amount of data needed to fulfill a query.
|
||||
You can filter rows by number, or columns by name,
|
||||
and columns will be returned in the same order they were requested:
|
||||
```typescript
|
||||
interface AsyncBuffer {
|
||||
byteLength: number
|
||||
slice(start: number, end?: number): Promise<ArrayBuffer>
|
||||
}
|
||||
```
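
As a minimal sketch of the interface above, an in-memory `ArrayBuffer` can be wrapped so that its `slice` method returns a promise. The helper name `asyncBufferFromArrayBuffer` is hypothetical and is not a hyparquet export; it only illustrates the shape a custom `AsyncBuffer` needs to have.

```javascript
// Hypothetical helper (not a hyparquet export): wraps an in-memory
// ArrayBuffer so that it has the AsyncBuffer shape shown above.
function asyncBufferFromArrayBuffer(arrayBuffer) {
  return {
    byteLength: arrayBuffer.byteLength,
    // ArrayBuffer.prototype.slice is synchronous; declaring the method
    // async wraps its result in a Promise<ArrayBuffer>.
    async slice(start, end) {
      return arrayBuffer.slice(start, end)
    },
  }
}

// Usage: read bytes at offsets 1 and 2 (end is exclusive) from a 5-byte buffer.
const bytes = new Uint8Array([10, 20, 30, 40, 50]).buffer
const file = asyncBufferFromArrayBuffer(bytes)
// The resolved chunk holds the two bytes 20 and 30.
file.slice(1, 3).then(chunk => console.log(new Uint8Array(chunk)))
```

An object of this shape could then be passed as the `file` option, in the same way as the result of `asyncBufferFromUrl` or `asyncBufferFromFile`.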
|
||||
|
||||
```javascript
|
||||
import { parquetRead } from 'hyparquet'
|
||||
You can define your own `AsyncBuffer` to create a virtual file that can be read asynchronously. In most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile`.
|
||||
|
||||
### Authorization
|
||||
|
||||
Pass the `requestInit` option to `asyncBufferFromUrl` to provide authentication information to a remote web server. For example:
|
||||
|
||||
```js
|
||||
await parquetRead({
|
||||
file,
|
||||
columns: ['colB', 'colA'], // include columns colB and colA
|
||||
rowStart: 100,
|
||||
rowEnd: 200,
|
||||
onComplete: data => console.log(data),
|
||||
file: await asyncBufferFromUrl({url, requestInit: {headers: {Authorization: 'Bearer my_token'}}}),
|
||||
onComplete: data => console.log(data)
|
||||
})
|
||||
```
|
||||
|
||||
## Column names
|
||||
### Returned row format
|
||||
|
||||
By default, data returned in the `onComplete` function will be one array of columns per row.
|
||||
If you would like each row to be an object with each key the name of the column, set the option `rowFormat` to `object`.
|
||||
@ -148,100 +131,37 @@ await parquetRead({
|
||||
})
|
||||
```
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### AsyncBuffer
|
||||
|
||||
Hyparquet supports asynchronous fetching of parquet files over a network.
|
||||
You can provide an `AsyncBuffer` which is like a js `ArrayBuffer` but the `slice` method returns `Promise<ArrayBuffer>`.
|
||||
|
||||
```typescript
|
||||
interface AsyncBuffer {
|
||||
byteLength: number
|
||||
slice(start: number, end?: number): Promise<ArrayBuffer>
|
||||
}
|
||||
```
|
||||
|
||||
You can read parquet files asynchronously using HTTP Range requests so that only the necessary byte ranges from a `url` will be fetched:
|
||||
|
||||
```javascript
|
||||
import { parquetRead } from 'hyparquet'
|
||||
|
||||
const url = 'https://hyperparam-public.s3.amazonaws.com/wiki-en-00000-of-00041.parquet'
|
||||
const byteLength = 420296449
|
||||
await parquetRead({
|
||||
file: { // AsyncBuffer
|
||||
byteLength,
|
||||
async slice(start, end) {
|
||||
const headers = new Headers()
|
||||
headers.set('Range', `bytes=${start}-${end - 1}`)
|
||||
const res = await fetch(url, { headers })
|
||||
return res.arrayBuffer()
|
||||
},
|
||||
},
|
||||
onComplete: data => console.log(data),
|
||||
})
|
||||
```
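
The `slice` method above converts an exclusive `end` offset into an HTTP `Range` header, whose byte ranges are inclusive. Because `end` is optional in the `AsyncBuffer` interface, a slightly more defensive version might also handle the open-ended case. The sketch below assumes a helper named `rangeHeader`, which is hypothetical and not part of hyparquet:

```javascript
// Hypothetical helper: convert AsyncBuffer-style offsets (end exclusive,
// possibly undefined) into an HTTP Range header value (end inclusive).
function rangeHeader(start, end) {
  // Open-ended range: request everything from `start` to the end of the file.
  if (end === undefined) return `bytes=${start}-`
  return `bytes=${start}-${end - 1}`
}

console.log(rangeHeader(0, 100)) // bytes=0-99
console.log(rangeHeader(100)) // bytes=100-
```

Inside `slice`, the returned string would take the place of the template literal passed to `headers.set('Range', ...)`.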
|
||||
|
||||
## Supported Parquet Files
|
||||
|
||||
The parquet format is known to be a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures.
|
||||
Hyparquet supports all parquet encodings: plain, dictionary, rle, bit packed, delta, etc.
|
||||
|
||||
Supported parquet encodings:
|
||||
- [X] PLAIN
|
||||
- [X] PLAIN_DICTIONARY
|
||||
- [X] RLE_DICTIONARY
|
||||
- [X] RLE
|
||||
- [X] BIT_PACKED
|
||||
- [X] DELTA_BINARY_PACKED
|
||||
- [X] DELTA_BYTE_ARRAY
|
||||
- [X] DELTA_LENGTH_BYTE_ARRAY
|
||||
- [X] BYTE_STREAM_SPLIT
|
||||
**Hyparquet is the most compliant parquet parser on earth** — hyparquet can open more files than pyarrow, rust, and duckdb.
|
||||
|
||||
## Compression
|
||||
|
||||
Supporting every possible compression codec available in parquet would blow up the size of the hyparquet library. In practice, most parquet files use snappy compression.
|
||||
By default, hyparquet supports uncompressed and snappy-compressed parquet files.
|
||||
To support the full range of parquet compression codecs (gzip, brotli, zstd, etc), use the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.
|
||||
|
||||
Parquet compression types supported by default:
|
||||
- [X] Uncompressed
|
||||
- [X] Snappy
|
||||
- [ ] GZip
|
||||
- [ ] LZO
|
||||
- [ ] Brotli
|
||||
- [ ] LZ4
|
||||
- [ ] ZSTD
|
||||
- [ ] LZ4_RAW
|
||||
| Codec | hyparquet | with hyparquet-compressors |
|
||||
|---------------|-----------|----------------------------|
|
||||
| Uncompressed | ✅ | ✅ |
|
||||
| Snappy | ✅ | ✅ |
|
||||
| GZip | ❌ | ✅ |
|
||||
| LZO | ❌ | ✅ |
|
||||
| Brotli | ❌ | ✅ |
|
||||
| LZ4 | ❌ | ✅ |
|
||||
| ZSTD | ❌ | ✅ |
|
||||
| LZ4_RAW | ❌ | ✅ |
|
||||
|
||||
You can provide custom compression codecs using the `compressors` option.
|
||||
### hysnappy
|
||||
|
||||
## hysnappy
|
||||
For faster snappy decompression, try [hysnappy](https://github.com/hyparam/hysnappy), which uses WASM for a 40% speed boost on large parquet files.
|
||||
|
||||
The most common compression codec used in parquet is snappy compression.
|
||||
Hyparquet includes a built-in snappy decompressor written in javascript.
|
||||
### hyparquet-compressors
|
||||
|
||||
We developed [hysnappy](https://github.com/hyparam/hysnappy) to make parquet parsing even faster.
|
||||
Hysnappy is a snappy decompression codec written in C, compiled to WASM.
|
||||
You can include support for ALL parquet `compressors` plus hysnappy using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.
|
||||
|
||||
To use hysnappy for faster parsing of large parquet files, override the `SNAPPY` compressor for hyparquet:
|
||||
|
||||
```js
|
||||
import { parquetRead } from 'hyparquet'
|
||||
import { snappyUncompressor } from 'hysnappy'
|
||||
|
||||
await parquetRead({
|
||||
file,
|
||||
compressors: {
|
||||
SNAPPY: snappyUncompressor(),
|
||||
},
|
||||
onComplete: console.log,
|
||||
})
|
||||
```
|
||||
|
||||
Parsing a [420mb wikipedia parquet file](https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.en/train-00000-of-00041.parquet) using hysnappy reduces parsing time by 40% (4.1s to 2.3s).
|
||||
|
||||
## hyparquet-compressors
|
||||
|
||||
You can include support for ALL parquet compression codecs using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) library.
|
||||
|
||||
```js
|
||||
import { parquetRead } from 'hyparquet'
|
||||
@ -259,11 +179,15 @@ await parquetRead({ file, compressors, onComplete: console.log })
|
||||
- https://github.com/dask/fastparquet
|
||||
- https://github.com/duckdb/duckdb
|
||||
- https://github.com/google/snappy
|
||||
- https://github.com/hyparam/hightable
|
||||
- https://github.com/hyparam/hysnappy
|
||||
- https://github.com/hyparam/hyparquet-compressors
|
||||
- https://github.com/ironSource/parquetjs
|
||||
- https://github.com/zhipeng-jia/snappyjs
|
||||
|
||||
## Contributions
|
||||
|
||||
Contributions are welcome!
|
||||
If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
|
||||
|
||||
Hyparquet development is supported by an open-source grant from Hugging Face :hugs:
|
||||
|
||||
BIN
demo.png
Binary file not shown.
|
Before Width: | Height: | Size: 558 KiB |
BIN
hyperparam.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 658 KiB |
@ -29,12 +29,12 @@
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/node": "22.10.1",
|
||||
"@vitest/coverage-v8": "2.1.6",
|
||||
"@vitest/coverage-v8": "2.1.8",
|
||||
"eslint": "9.16.0",
|
||||
"eslint-plugin-jsdoc": "50.6.0",
|
||||
"hyparquet-compressors": "0.1.4",
|
||||
"typescript": "5.7.2",
|
||||
"typescript-eslint": "8.16.0",
|
||||
"vitest": "2.1.6"
|
||||
"typescript-eslint": "8.17.0",
|
||||
"vitest": "2.1.8"
|
||||
}
|
||||
}
|
||||
|
||||