# hyparquet

![hyparquet](hyparquet.jpg)

[![npm](https://img.shields.io/npm/v/hyparquet)](https://www.npmjs.com/package/hyparquet)
[![workflow status](https://github.com/hyparam/hyparquet/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet/actions)
[![mit license](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
![dependencies](https://img.shields.io/badge/Dependencies-0-blueviolet)

JavaScript parser for [Apache Parquet](https://parquet.apache.org) files.

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

Dependency free since 2023!

## Features

- Designed to work with huge ML datasets (things like [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata))
- Can load metadata separately from data
- Data can be filtered by row and column ranges
- Only fetches the data needed
- Written in JavaScript, checked with TypeScript
- Fast data loading for large scale ML applications
- Bring data visualization closer to the user, in the browser

Why make a new parquet parser in JavaScript?
First, existing libraries like [parquetjs](https://github.com/ironSource/parquetjs) are officially "inactive".
Second, they do not support the kind of stream processing needed for a performant parser in the browser.
Finally, having no dependencies keeps hyparquet lean and easy to package and deploy.

## Demo

An online parquet file reader demo is available at:

https://hyparam.github.io/hyparquet/

Demo source: [index.html](index.html)

## Installation

```bash
npm install hyparquet
```

## Usage

If you're in a Node.js environment, you can read the metadata of a parquet file with the following example:

```js
const { parquetMetadata } = await import('hyparquet')
const fs = await import('fs')

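// Read the file into a Node.js Buffer
const buffer = fs.readFileSync('example.parquet')
// Copy the bytes into a standalone ArrayBuffer (a Buffer may be a view into a larger pool)
const arrayBuffer = new Uint8Array(buffer).buffer
// Parse the file metadata
const metadata = parquetMetadata(arrayBuffer)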
```

If you're in a browser environment, you'll probably get parquet file data either from a file the user drags and drops onto the page, or from a download from the web.

To load parquet data in the browser from a remote server using `fetch`:

```js
import { parquetMetadata } from 'hyparquet'

// `url` is the address of a parquet file
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
```
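
Parquet metadata can be located without reading the whole file, because an unencrypted parquet file ends with the metadata block, a 4-byte little-endian metadata length, and the 4-byte magic `PAR1`. The following is a minimal sketch of that layout, not hyparquet's API: `metadataByteRange` and `FOOTER_SIZE` are hypothetical names introduced here for illustration.

```js
// Parquet files end with: [metadata bytes][4-byte little-endian metadata length]["PAR1"]
const FOOTER_SIZE = 8 // 4-byte length + 4-byte magic

/**
 * Locate the metadata block inside a parquet file.
 * Hypothetical helper for illustration, not part of hyparquet.
 *
 * @param {number} fileSize total size of the file in bytes
 * @param {ArrayBuffer} tail the last 8 bytes of the file
 * @returns {{ start: number, end: number }} byte range of the metadata, end exclusive
 */
function metadataByteRange(fileSize, tail) {
  if (tail.byteLength !== FOOTER_SIZE) {
    throw new Error(`expected ${FOOTER_SIZE} footer bytes, got ${tail.byteLength}`)
  }
  const view = new DataView(tail)
  // The final 4 bytes must be the ASCII magic "PAR1"
  const magic = String.fromCharCode(
    view.getUint8(4), view.getUint8(5), view.getUint8(6), view.getUint8(7)
  )
  if (magic !== 'PAR1') throw new Error('not a parquet file: missing PAR1 magic')
  // The 4 bytes before the magic hold the metadata length, little-endian
  const metadataLength = view.getUint32(0, true)
  const end = fileSize - FOOTER_SIZE
  const start = end - metadataLength
  // The file also begins with a 4-byte "PAR1" magic, so metadata cannot start before byte 4
  if (start < 4) throw new Error('invalid metadata length')
  return { start, end }
}
```

If the server supports HTTP range requests, these offsets could go into a `Range: bytes=start-(end - 1)` header (the header's end offset is inclusive) to download only the metadata.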

To parse parquet files from a user drag-and-drop action, see the example in [index.html](index.html).
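
As a rough sketch of the drag-and-drop case (not copied from `index.html`), a dropped file can be read into an `ArrayBuffer` with the standard `Blob.arrayBuffer()` method. The names `readFileAsArrayBuffer`, `attachDropHandler`, and `onBuffer` are hypothetical and exist only for this illustration.

```js
// Read a dropped File (or any Blob) into an ArrayBuffer.
async function readFileAsArrayBuffer(file) {
  return await file.arrayBuffer()
}

// Attach drag-and-drop listeners to an element.
// `onBuffer` is called with the contents of the first dropped file.
function attachDropHandler(element, onBuffer) {
  // Prevent the browser's default behavior so the drop event fires
  element.addEventListener('dragover', e => e.preventDefault())
  element.addEventListener('drop', async e => {
    e.preventDefault()
    const file = e.dataTransfer.files[0]
    if (!file) return // nothing was dropped
    onBuffer(await readFileAsArrayBuffer(file))
  })
}
```

In a page, this might be wired up as `attachDropHandler(document.body, arrayBuffer => console.log(parquetMetadata(arrayBuffer)))`, assuming `parquetMetadata` has been imported as in the examples above.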

## References

- https://github.com/apache/parquet-format
- https://github.com/dask/fastparquet
- https://github.com/apache/thrift
- https://github.com/google/snappy
- https://github.com/zhipeng-jia/snappyjs