# hyparquet

![hyparquet parakeet](hyparquet.jpg)

[![npm](https://img.shields.io/npm/v/hyparquet)](https://www.npmjs.com/package/hyparquet)
[![minzipped](https://img.shields.io/bundlephobia/minzip/hyparquet)](https://www.npmjs.com/package/hyparquet)
[![workflow status](https://github.com/hyparam/hyparquet/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet/actions)
[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
![coverage](https://img.shields.io/badge/Coverage-96-darkred)
[![dependencies](https://img.shields.io/badge/Dependencies-0-blueviolet)](https://www.npmjs.com/package/hyparquet?activeTab=dependencies)

Dependency free since 2023!
## What is hyparquet?
**Hyparquet** is a lightweight, dependency-free, pure JavaScript library for parsing [Apache Parquet](https://parquet.apache.org) files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.

Hyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.
## Parquet Viewer
**Try hyparquet online**: Drag and drop your parquet file onto [hyperparam.app](https://hyperparam.app) to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.

[![hyperparam parquet viewer](./hyperparam.png)](https://hyperparam.app/)
## Features
1. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
2. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
3. **TypeScript**: Includes TypeScript definitions.
4. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.2kb min.gz!
5. **Highly compliant**: Supports all parquet encodings and compression codecs, and can open more parquet files than any other library.
## Why hyparquet?
Parquet is widely used in data engineering and data science for its efficient storage and processing of large datasets. What if you could use parquet files directly in the browser, without needing a server or backend infrastructure? That's what hyparquet enables.

Existing JavaScript-based parquet readers (like [parquetjs](https://github.com/ironSource/parquetjs)) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size.
Hyparquet is actively maintained and designed with modern web usage in mind.
## Demo
Check out a minimal parquet viewer demo that shows how to integrate hyparquet into a React web application using [HighTable](https://github.com/hyparam/hightable).

- **Live Demo**: [https://hyparam.github.io/demos/hyparquet/](https://hyparam.github.io/demos/hyparquet/)
- **Demo Source Code**: [https://github.com/hyparam/demos/tree/master/hyparquet](https://github.com/hyparam/demos/tree/master/hyparquet)
## Quick Start
### Node.js Example
To read the contents of a parquet file in a Node.js environment, use `asyncBufferFromFile`:

```javascript
const { asyncBufferFromFile, parquetRead } = await import('hyparquet')

// filename is the path to a local parquet file
await parquetRead({
  file: await asyncBufferFromFile(filename),
  onComplete: data => console.log(data)
})
```
Note: Hyparquet is published as an ES module, so dynamic `import()` may be required on the command line.
### Browser Example
In the browser, use `asyncBufferFromUrl` to wrap a URL for reading asynchronously over the network.
It is recommended that you filter by row and column to limit fetch size:

```javascript
const { asyncBufferFromUrl, parquetRead } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')

const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
await parquetRead({
  file: await asyncBufferFromUrl({url}),
  columns: ['Breed Name', 'Lifespan'],
  rowStart: 10,
  rowEnd: 20,
  onComplete: data => console.log(data)
})
```
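
Filtering by row helps because a parquet file stores its rows in row groups, so a reader only needs the groups that overlap the requested range. The sketch below illustrates that selection. It is not hyparquet's implementation: the name `overlappingRowGroups` and the group sizes are hypothetical, and `rowEnd` is assumed to be exclusive.

```javascript
/**
 * Return the indexes of the row groups that contain at least one row
 * in the half-open range [rowStart, rowEnd).
 * Hypothetical helper for illustration, not part of hyparquet.
 */
function overlappingRowGroups(groupSizes, rowStart, rowEnd) {
  const selected = []
  let groupStart = 0
  groupSizes.forEach((size, index) => {
    const groupEnd = groupStart + size
    // Two half-open ranges overlap when each one starts before the other ends
    if (groupStart < rowEnd && rowStart < groupEnd) selected.push(index)
    groupStart = groupEnd
  })
  return selected
}

// Three made-up row groups of 8 rows each: rows 10..19 touch only groups 1 and 2
console.log(overlappingRowGroups([8, 8, 8], 10, 20)) // [ 1, 2 ]
```

In this example the first row group never needs to be fetched, which is where the savings come from on large remote files.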
## Advanced Usage
### Reading Metadata
You can read just the metadata, including the schema and data statistics, using the `parquetMetadata` function.
To load a parquet file from a remote server with `fetch` and parse its metadata:

```javascript
import { parquetMetadata, parquetSchema } from 'hyparquet'

// url is the address of a remote parquet file
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
// Get total number of rows (convert bigint to number)
const numRows = Number(metadata.num_rows)
// Get nested table schema
const schema = parquetSchema(metadata)
// Get top-level column header names
const columnNames = schema.children.map(e => e.element.name)
```
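
Metadata can be read on its own because the parquet format keeps it in a footer. A file ends with the metadata block, then a 4-byte little-endian metadata length, then the 4-byte magic `PAR1`. The sketch below locates that block in an in-memory file. It illustrates the file layout only and does not show how hyparquet parses it; `locateFooter` is a hypothetical name.

```javascript
/**
 * Locate the metadata block of a parquet file held in an ArrayBuffer.
 * Hypothetical helper for illustration, not part of hyparquet.
 */
function locateFooter(arrayBuffer) {
  const end = arrayBuffer.byteLength
  if (end < 8) throw new Error('file too short to be parquet')
  const view = new DataView(arrayBuffer)
  // The last 4 bytes are ASCII "PAR1" (0x50 0x41 0x52 0x31),
  // which reads as 0x31524150 when interpreted as a little-endian uint32
  if (view.getUint32(end - 4, true) !== 0x31524150) {
    throw new Error('missing PAR1 magic at end of file')
  }
  // The 4 bytes before the magic hold the metadata length (little-endian uint32)
  const metadataLength = view.getUint32(end - 8, true)
  const metadataStart = end - 8 - metadataLength
  if (metadataStart < 0) throw new Error('metadata length exceeds file size')
  return { metadataStart, metadataLength }
}
```

For a remote file, the same layout means a reader only has to request the tail of the file to learn the schema and row counts.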
### AsyncBuffer
Hyparquet accepts a `file` argument of type `AsyncBuffer`, which is like a JavaScript `ArrayBuffer` except that its `slice` method may return a `Promise<ArrayBuffer>`.
If you have the entire file in memory, you can pass an `ArrayBuffer` anywhere that an `AsyncBuffer` is expected.

```typescript
type Awaitable<T> = T | Promise<T>
interface AsyncBuffer {
  byteLength: number
  slice(start: number, end?: number): Awaitable<ArrayBuffer>
}
```
You can define your own `AsyncBuffer` to create a virtual file that can be read asynchronously. In most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile`.
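
As a rough idea of what a custom `AsyncBuffer` can look like, here is a minimal sketch that serves each `slice` call with an HTTP range request. It is not how `asyncBufferFromUrl` is implemented. The name `rangeAsyncBuffer` and the `fetchFn` parameter are hypothetical, and the sketch assumes the file size is known up front, the server honors `Range` headers, and offsets are non-negative.

```javascript
/**
 * Build an object matching the AsyncBuffer interface on top of HTTP range requests.
 * Hypothetical helper for illustration, not part of hyparquet.
 */
function rangeAsyncBuffer({ url, byteLength, fetchFn = globalThis.fetch }) {
  return {
    byteLength,
    async slice(start, end = byteLength) {
      // HTTP byte ranges are inclusive at both ends, slice() is exclusive at the end
      const headers = { Range: `bytes=${start}-${end - 1}` }
      const res = await fetchFn(url, { headers })
      // 206 Partial Content means the server returned only the requested range
      if (res.status !== 206) {
        throw new Error(`expected 206 Partial Content, got ${res.status}`)
      }
      return res.arrayBuffer()
    },
  }
}
```

The returned object has the `byteLength` property and `slice` method that the interface above requires, so under those assumptions it could be passed as the `file` option.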
### Authorization
Pass the `requestInit` option to `asyncBufferFromUrl` to provide authentication information to a remote web server. For example:

```javascript
import { asyncBufferFromUrl, parquetRead } from 'hyparquet'

await parquetRead({
  file: await asyncBufferFromUrl({url, requestInit: {headers: {Authorization: 'Bearer my_token'}}}),
  onComplete: data => console.log(data)
})
```
### Returned row format
By default, the data passed to `onComplete` is an array of rows, where each row is an array of column values.
If you would prefer each row to be an object keyed by column name, set the `rowFormat` option to `'object'`.

```javascript
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  rowFormat: 'object',
  onComplete: data => console.log(data),
})
```
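
To make the difference between the two shapes concrete, the sketch below converts array rows into object rows by hand. It only illustrates the shapes and is not hyparquet's implementation; `rowsToObjects` is a hypothetical name and the sample values are made up.

```javascript
/**
 * Convert rows given as arrays of column values into objects keyed by column name.
 * Hypothetical helper for illustration, not part of hyparquet.
 */
function rowsToObjects(columnNames, rows) {
  return rows.map(row =>
    Object.fromEntries(columnNames.map((name, i) => [name, row[i]]))
  )
}

// Made-up sample values for the columns used in the browser example
const columnNames = ['Breed Name', 'Lifespan']
const arrayRows = [['Holland Lop', 7], ['Rex', 6]]

// Each row becomes an object with 'Breed Name' and 'Lifespan' keys
console.log(rowsToObjects(columnNames, arrayRows))
```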
## Supported Parquet Files
The parquet format is known to be sprawling: it includes options for a wide array of compression schemes, encoding types, and data structures.
Hyparquet supports all parquet encodings: plain, dictionary, rle, bit packed, delta, etc.

**Hyparquet is the most compliant parquet parser on earth**: it can open more files than pyarrow, rust, and duckdb.
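
As one concrete illustration of what an encoding does: with dictionary encoding, a column stores each distinct value once and then refers to values by index. The sketch below is conceptual only. Real files also pack the indexes with RLE and bit packing, which is omitted here, and `decodeDictionary` is a hypothetical name rather than hyparquet API.

```javascript
/**
 * Expand dictionary-encoded values: each index selects one dictionary entry.
 * Conceptual illustration only, not part of hyparquet.
 */
function decodeDictionary(dictionary, indexes) {
  return indexes.map(i => {
    if (!Number.isInteger(i) || i < 0 || i >= dictionary.length) {
      throw new RangeError(`dictionary index ${i} out of range`)
    }
    return dictionary[i]
  })
}

// Four values stored as two dictionary entries plus four small indexes
console.log(decodeDictionary(['red', 'green'], [0, 1, 1, 0]))
// [ 'red', 'green', 'green', 'red' ]
```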
## Compression
By default, hyparquet supports uncompressed and snappy-compressed parquet files.
To support the full range of parquet compression codecs (gzip, brotli, zstd, etc.), use the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.

| Codec         | hyparquet | with hyparquet-compressors |
|---------------|-----------|----------------------------|
| Uncompressed  | ✅        | ✅                         |
| Snappy        | ✅        | ✅                         |
| GZip          | ❌        | ✅                         |
| LZO           | ❌        | ✅                         |
| Brotli        | ❌        | ✅                         |
| LZ4           | ❌        | ✅                         |
| ZSTD          | ❌        | ✅                         |
| LZ4_RAW       | ❌        | ✅                         |
### hysnappy
For faster snappy decompression, try [hysnappy](https://github.com/hyparam/hysnappy), which uses WASM for a 40% speed boost on large parquet files.
### hyparquet-compressors
You can include support for ALL parquet `compressors` plus hysnappy using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.
```javascript
import { parquetRead } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'
await parquetRead({ file, compressors, onComplete: console.log })
```
## References
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing
- https://github.com/apache/thrift
- https://github.com/apache/arrow
- https://github.com/dask/fastparquet
- https://github.com/duckdb/duckdb
- https://github.com/google/snappy
- https://github.com/hyparam/hightable
- https://github.com/hyparam/hysnappy
- https://github.com/hyparam/hyparquet-compressors
- https://github.com/ironSource/parquetjs
- https://github.com/zhipeng-jia/snappyjs
## Contributions
Contributions are welcome!
If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.

Hyparquet development is supported by an open-source grant from Hugging Face :hugs: