# hyparquet

[npm package](https://www.npmjs.com/package/hyparquet)
[workflow status](https://github.com/hyparam/hyparquet/actions)
[MIT license](https://opensource.org/licenses/MIT)
[dependencies](https://www.npmjs.com/package/hyparquet?activeTab=dependencies)

Dependency free since 2023!

## What is hyparquet?
**Hyparquet** is a lightweight, dependency-free, pure JavaScript library for parsing [Apache Parquet](https://parquet.apache.org) files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.

Hyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.

## Parquet Viewer
**Try hyparquet online**: Drag and drop your parquet file onto [hyperparam.app](https://hyperparam.app) to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.

## Features
1. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
2. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
3. **TypeScript**: Includes TypeScript definitions.
4. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.2kb min.gz!
5. **Highly compliant**: Supports all parquet encodings and compression codecs, and can open more parquet files than any other library.

## Why hyparquet?
Existing JavaScript-based parquet readers (like [parquetjs](https://github.com/ironSource/parquetjs)) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size.
Hyparquet is actively maintained and designed with modern web usage in mind.

## Demo
Check out a minimal parquet viewer demo that shows how to integrate hyparquet into a React web application using [HighTable](https://github.com/hyparam/hightable).

- **Live Demo**: [https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/](https://hyparam.github.io/hyperparam-cli/apps/hyparquet-demo/)
- **Source Code**: [https://github.com/hyparam/hyperparam-cli/tree/master/apps/hyparquet-demo](https://github.com/hyparam/hyperparam-cli/tree/master/apps/hyparquet-demo)

## Quick Start

### Node.js Example
To read the contents of a parquet file in a Node.js environment, use `asyncBufferFromFile`:

```javascript
const { asyncBufferFromFile, parquetRead } = await import('hyparquet')

// filename is the path to a local parquet file
await parquetRead({
  file: await asyncBufferFromFile(filename),
  onComplete: data => console.log(data) // receives the parsed rows
})
```

Note: Hyparquet is published as an ES module, so dynamic `import()` may be required on the command line.

### Browser Example
In the browser, use `asyncBufferFromUrl` to wrap a URL for reading asynchronously over the network.
It is recommended that you filter by row and column to limit fetch size:

```js
const { asyncBufferFromUrl, parquetRead } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')

const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
await parquetRead({
  file: await asyncBufferFromUrl({url}),
  columns: ['Breed Name', 'Lifespan'], // only read these columns
  rowStart: 10, // limit the read to a range of rows
  rowEnd: 20,
  onComplete: data => console.log(data)
})
```

## Advanced Usage

### Reading Metadata
You can read just the metadata, including schema and data statistics, using the `parquetMetadata` function.

To read the metadata of a remote parquet file in the browser using `fetch`:

```javascript
import { parquetMetadata } from 'hyparquet'

// url points to a remote parquet file
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
```
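For context on why the metadata can be read on its own: per the Parquet format specification, a file ends with the serialized metadata, followed by a 4-byte little-endian metadata length and the magic bytes `PAR1`. The sketch below reads that length from the tail of an `ArrayBuffer`. The helper `footerLength` is hypothetical and written only for illustration; it is not part of hyparquet's API and may not match how hyparquet does this internally.

```javascript
// Hypothetical helper (not part of hyparquet): read the metadata length
// stored in the last 8 bytes of a parquet file.
function footerLength(arrayBuffer) {
  if (arrayBuffer.byteLength < 8) throw new Error('file too short')
  const view = new DataView(arrayBuffer)
  const end = arrayBuffer.byteLength
  // The last 4 bytes must be the magic string 'PAR1'
  const magic = String.fromCharCode(
    view.getUint8(end - 4),
    view.getUint8(end - 3),
    view.getUint8(end - 2),
    view.getUint8(end - 1)
  )
  if (magic !== 'PAR1') throw new Error('not a parquet file')
  // The 4 bytes before the magic hold the metadata length (little-endian)
  return view.getUint32(end - 8, true)
}

// Usage with a synthetic 8-byte file tail (made-up length of 1234)
const tail = new ArrayBuffer(8)
const tailView = new DataView(tail)
tailView.setUint32(0, 1234, true)
for (let i = 0; i < 4; i++) tailView.setUint8(4 + i, 'PAR1'.charCodeAt(i))
console.log(footerLength(tail)) // 1234
```

Because the length sits at a fixed offset from the end of the file, a reader only needs the file tail to locate the metadata.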
### AsyncBuffer
Hyparquet accepts a `file` argument of type `AsyncBuffer`, which is like a JavaScript `ArrayBuffer` except that the `slice` method returns `Promise<ArrayBuffer>`.

```typescript
interface AsyncBuffer {
  byteLength: number
  slice(start: number, end?: number): Promise<ArrayBuffer>
}
```

You can define your own `AsyncBuffer` to create a virtual file that can be read asynchronously. In most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile`.
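As a minimal sketch of a custom `AsyncBuffer`, the helper below wraps an in-memory `ArrayBuffer` so that it has the `byteLength` and `slice` shape described above. The name `asyncBufferFromArrayBuffer` is hypothetical and used only for illustration:

```javascript
// Hypothetical helper (not part of hyparquet): expose an in-memory
// ArrayBuffer through the AsyncBuffer shape.
function asyncBufferFromArrayBuffer(arrayBuffer) {
  return {
    byteLength: arrayBuffer.byteLength,
    // Same semantics as ArrayBuffer.prototype.slice, wrapped in a Promise
    slice(start, end) {
      return Promise.resolve(arrayBuffer.slice(start, end))
    },
  }
}

// Usage: asynchronously read bytes 2, 3 and 4 of an 8-byte buffer
const bytes = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7])
const file = asyncBufferFromArrayBuffer(bytes.buffer)
file.slice(2, 5).then(chunk => {
  console.log(Array.from(new Uint8Array(chunk)))
})
```

The same shape could be backed by any source that can return byte ranges on demand, which is what makes the interface useful for virtual files.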
### Authorization
Pass the `requestInit` option to `asyncBufferFromUrl` to provide authentication information to a remote web server. For example:

```js
import { asyncBufferFromUrl, parquetRead } from 'hyparquet'

await parquetRead({
  file: await asyncBufferFromUrl({url, requestInit: {headers: {Authorization: 'Bearer my_token'}}}),
  onComplete: data => console.log(data)
})
```

### Returned row format
By default, the data passed to the `onComplete` function is an array of rows, where each row is an array of column values.
If you would like each row to be an object keyed by column name, set the `rowFormat` option to `'object'`.

```javascript
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  rowFormat: 'object',
  onComplete: data => console.log(data),
})
```
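To illustrate the difference between the two row shapes, the sketch below converts array-style rows into object-style rows. The helper `rowsToObjects`, the column names, and the sample values are all made up for illustration; they do not come from hyparquet or from a real file.

```javascript
// Hypothetical helper: turn array rows into objects keyed by column name
function rowsToObjects(columnNames, rows) {
  return rows.map(row =>
    Object.fromEntries(columnNames.map((name, i) => [name, row[i]]))
  )
}

// Made-up sample data in the array row format
const columnNames = ['name', 'lifespan']
const arrayRows = [['Rex', 6], ['Lop', 7]]

const objectRows = rowsToObjects(columnNames, arrayRows)
console.log(objectRows[0].name) // Rex
console.log(objectRows[1].lifespan) // 7
```

Array rows are more compact, while object rows are easier to work with when columns are accessed by name.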
## Supported Parquet Files
Parquet is a sprawling format that includes options for a wide array of compression schemes, encoding types, and data structures.

Hyparquet supports all parquet encodings: plain, dictionary, RLE, bit packed, delta, etc.

**Hyparquet is the most compliant parquet parser on earth**: hyparquet can open more files than pyarrow, rust, and duckdb.

## Compression
By default, hyparquet supports uncompressed and snappy-compressed parquet files.
To support the full range of parquet compression codecs (gzip, brotli, zstd, etc.), use the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.

| Codec         | hyparquet | with hyparquet-compressors |
|---------------|-----------|----------------------------|
| Uncompressed  | ✅        | ✅                         |
| Snappy        | ✅        | ✅                         |
| GZip          | ❌        | ✅                         |
| LZO           | ❌        | ✅                         |
| Brotli        | ❌        | ✅                         |
| LZ4           | ❌        | ✅                         |
| ZSTD          | ❌        | ✅                         |
| LZ4_RAW       | ❌        | ✅                         |

### hysnappy
For faster snappy decompression, try [hysnappy](https://github.com/hyparam/hysnappy), which uses WASM for a 40% speed boost on large parquet files.

### hyparquet-compressors
You can include support for ALL parquet `compressors` plus hysnappy using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.

```js
import { parquetRead } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'

await parquetRead({ file, compressors, onComplete: console.log })
```

## References
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing
- https://github.com/apache/thrift
- https://github.com/apache/arrow
- https://github.com/dask/fastparquet
- https://github.com/duckdb/duckdb
- https://github.com/google/snappy
- https://github.com/hyparam/hightable
- https://github.com/hyparam/hysnappy
- https://github.com/hyparam/hyparquet-compressors
- https://github.com/ironSource/parquetjs
- https://github.com/zhipeng-jia/snappyjs

## Contributions
Contributions are welcome!
If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.

Hyparquet development is supported by an open-source grant from Hugging Face :hugs: