mirror of
https://github.com/asadbek064/hyparquet.git
synced 2025-12-25 23:06:36 +00:00
parquet file parser for javascript
hyparquet
JavaScript parser for Apache Parquet files.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Dependency free since 2023!
Features
- Designed to work with huge ML datasets (things like starcoder)
- Loads metadata separately from data
- Data can be filtered by row and column ranges
- Only fetches the data needed
- Fast data loading for large scale ML applications
- Brings data visualization closer to the user, in the browser
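Loading metadata separately from data is possible because of how Parquet files are laid out: an unencrypted file ends with the metadata, then a 4-byte little-endian metadata length, then the magic bytes "PAR1". The sketch below reads that length from the last 8 bytes. It illustrates the file format only and is not hyparquet's internal code; `footerMetadataLength` is a hypothetical helper name.

```javascript
// Sketch of the Parquet footer layout (assumption: unencrypted file).
// File tail: [metadata][4-byte little-endian metadata length]["PAR1"]

// Hypothetical helper: return the metadata length stored in the footer.
function footerMetadataLength(arrayBuffer) {
  if (arrayBuffer.byteLength < 8) throw new Error('too short to be a parquet file')
  const view = new DataView(arrayBuffer)
  const end = arrayBuffer.byteLength
  // The final 4 bytes must spell the magic "PAR1"
  const magic = String.fromCharCode(
    view.getUint8(end - 4),
    view.getUint8(end - 3),
    view.getUint8(end - 2),
    view.getUint8(end - 1)
  )
  if (magic !== 'PAR1') throw new Error('missing PAR1 magic')
  // The 4 bytes before the magic hold the metadata length, little-endian
  return view.getUint32(end - 8, true)
}

// A fake 8-byte file tail: length 42 followed by "PAR1"
const tail = new Uint8Array([0x2a, 0, 0, 0, 0x50, 0x41, 0x52, 0x31])
console.log(footerMetadataLength(tail.buffer)) // 42
```

With the length known, a reader only needs the final `length + 8` bytes of the file to parse the metadata, which is why the bulk of the data can be left unfetched until it is requested.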
Installation
npm install hyparquet
Usage
If you're in a Node.js environment, you can load a parquet file with the following example:
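// Load hyparquet and the Node.js file system module
const { parquetMetadata } = await import('hyparquet')
const fs = await import('fs')
// Read the whole file into a Node.js Buffer
const buffer = fs.readFileSync('example.parquet')
// Copy just this Buffer's bytes into a standalone ArrayBuffer
const arrayBuffer = buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
// Parse the parquet metadata
const metadata = parquetMetadata(arrayBuffer)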
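The `slice` call in the Node.js example matters because a Buffer is a view onto an underlying ArrayBuffer, and small Buffers often share one pooled ArrayBuffer. A minimal sketch of that conversion, runnable without hyparquet (`toArrayBuffer` is a hypothetical helper name):

```javascript
// A Buffer may start at a nonzero byteOffset inside a larger, shared
// ArrayBuffer, so buffer.buffer alone can contain unrelated bytes.
function toArrayBuffer(buffer) {
  // Copy only the bytes that belong to this Buffer
  return buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
}

const buffer = Buffer.from('PAR1')
const arrayBuffer = toArrayBuffer(buffer)
console.log(arrayBuffer.byteLength) // 4
// The backing store is at least as large as the copy, and often much larger
console.log(buffer.buffer.byteLength >= arrayBuffer.byteLength) // true
```

The copy costs one pass over the file's bytes, but the result is an ArrayBuffer whose contents are exactly the file.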
If you're in a browser environment, you'll probably get parquet file data either from a file that the user drags and drops onto the page, or from a file downloaded from the web.
To load parquet data in the browser from a remote server using fetch (here url is the address of the parquet file):
import { parquetMetadata } from 'hyparquet'
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
To parse parquet files from a user drag-and-drop action, see the example in index.html.
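For orientation, a drag-and-drop handler can be sketched with the standard browser drop event and `File.arrayBuffer()`. This is an illustrative sketch, not the code in index.html; `readDroppedFile` is a hypothetical helper name, and the listeners are attached to `document` only as an assumption about where drops should be accepted.

```javascript
// Hypothetical helper: take the first dropped file from a drop event
// and return its contents as an ArrayBuffer.
async function readDroppedFile(event) {
  event.preventDefault() // stop the browser from navigating to the file
  const file = event.dataTransfer.files[0]
  if (!file) throw new Error('no file dropped')
  return await file.arrayBuffer()
}

// Attach listeners only when a DOM is available
if (typeof document !== 'undefined') {
  // dragover must be cancelled, otherwise the drop event never fires
  document.addEventListener('dragover', e => e.preventDefault())
  document.addEventListener('drop', async e => {
    const arrayBuffer = await readDroppedFile(e)
    // arrayBuffer can now be passed to parquetMetadata as in the examples above
    console.log('dropped file size in bytes:', arrayBuffer.byteLength)
  })
}
```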
