# Hyparquet Writer
![hyparquet writer parakeet](hyparquet-writer.jpg)
[![npm](https://img.shields.io/npm/v/hyparquet-writer)](https://www.npmjs.com/package/hyparquet-writer)
[![minzipped](https://img.shields.io/bundlephobia/minzip/hyparquet-writer)](https://www.npmjs.com/package/hyparquet-writer)
[![workflow status](https://github.com/hyparam/hyparquet-writer/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet-writer/actions)
[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)
![coverage](https://img.shields.io/badge/Coverage-95-darkred)
[![dependencies](https://img.shields.io/badge/Dependencies-1-blueviolet)](https://www.npmjs.com/package/hyparquet-writer?activeTab=dependencies)

Hyparquet Writer is a JavaScript library for writing [Apache Parquet](https://parquet.apache.org) files. It is designed to be lightweight, fast, and to store data efficiently. It is a companion to [hyparquet](https://github.com/hyparam/hyparquet), a JavaScript library for reading parquet files.

## Quick Start

To write a parquet file to an `ArrayBuffer`, call `parquetWriteBuffer` with a `columnData` argument. Each column in `columnData` should contain:

- `name`: the column name
- `data`: an array of same-type values
- `type`: the parquet schema type (optional)
```javascript
import { parquetWriteBuffer } from 'hyparquet-writer'

const arrayBuffer = parquetWriteBuffer({
  columnData: [
    { name: 'name', data: ['Alice', 'Bob', 'Charlie'], type: 'BYTE_ARRAY' },
    { name: 'age', data: [25, 30, 35], type: 'INT32' },
  ],
})
```

Note: if `type` is not provided, it will be guessed from the data. The supported parquet types are:

- `BOOLEAN`
- `INT32`
- `INT64`
- `FLOAT`
- `DOUBLE`
- `BYTE_ARRAY`
- `FIXED_LEN_BYTE_ARRAY`

Strings are represented in parquet as type `BYTE_ARRAY`.

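As a rough illustration of how type guessing can work, here is a hypothetical sketch (not the library's actual inference logic) that picks a parquet type from the JavaScript values in a column:

```javascript
// Hypothetical sketch of guessing a parquet type from JavaScript values.
// Not the library's actual implementation.
function guessType(values) {
  // use the first non-null value as the representative
  const v = values.find(x => x !== null && x !== undefined)
  if (v === undefined) throw new Error('cannot guess type of empty column')
  if (typeof v === 'boolean') return 'BOOLEAN'
  if (typeof v === 'bigint') return 'INT64'
  if (typeof v === 'number') return Number.isInteger(v) ? 'INT32' : 'DOUBLE'
  if (typeof v === 'string') return 'BYTE_ARRAY'
  throw new Error('cannot guess type')
}

guessType([25, 30, 35]) // 'INT32'
guessType(['Alice', 'Bob']) // 'BYTE_ARRAY'
```

Note how a sketch like this mis-guesses `[1, 2.5]` as `INT32` from the first value alone, which is one reason to pass explicit types when you know them.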
### Node.js: Write to a Local Parquet File

To write a local parquet file in Node.js, use `parquetWriteFile` with arguments `filename` and `columnData`:

```javascript
const { parquetWriteFile } = await import('hyparquet-writer')

parquetWriteFile({
  filename: 'example.parquet',
  columnData: [
    { name: 'name', data: ['Alice', 'Bob', 'Charlie'], type: 'BYTE_ARRAY' },
    { name: 'age', data: [25, 30, 35], type: 'INT32' },
  ],
})
```

Note: hyparquet-writer is published as an ES module, so dynamic `import()` may be required on the command line.

## Advanced Usage

Options can be passed to `parquetWrite` to adjust parquet file writing behavior:

- `writer`: a writer object that receives the output bytes (e.g. `ByteWriter`)
- `compressed`: use snappy compression (default true)
- `statistics`: write column statistics (default true)
- `rowGroupSize`: number of rows in each row group (default 100000)
- `kvMetadata`: extra key-value metadata to be stored in the parquet footer
```javascript
import { ByteWriter, parquetWrite } from 'hyparquet-writer'

const writer = new ByteWriter()

const arrayBuffer = parquetWrite({
  writer,
  columnData: [
    { name: 'name', data: ['Alice', 'Bob', 'Charlie'], type: 'BYTE_ARRAY' },
    { name: 'age', data: [25, 30, 35], type: 'INT32' },
  ],
  compressed: false,
  statistics: false,
  rowGroupSize: 1000,
  kvMetadata: [
    { key: 'key1', value: 'value1' },
    { key: 'key2', value: 'value2' },
  ],
})
```
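For instance, `rowGroupSize` determines how many row groups the written file contains, which affects read parallelism and memory use when the file is consumed. A quick sketch of the arithmetic:

```javascript
// Number of row groups produced for a table of numRows rows
const numRows = 250000
const rowGroupSize = 100000 // the default
const numRowGroups = Math.ceil(numRows / rowGroupSize)
console.log(numRowGroups) // 3
```

Smaller row groups let readers fetch and decode less data at a time, at the cost of more metadata overhead per file.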

### Converted Types

You can provide additional type hints by adding a `converted_type` to `columnData` elements:

```javascript
import { parquetWrite } from 'hyparquet-writer'

parquetWrite({
  columnData: [
    {
      name: 'dates',
      data: [new Date(1000000), new Date(2000000)],
      type: 'INT64',
      converted_type: 'TIMESTAMP_MILLIS',
    },
    {
      name: 'json',
      data: [{ foo: 'bar' }, { baz: 3 }, 'imastring'],
      type: 'BYTE_ARRAY',
      converted_type: 'JSON',
    },
  ],
})
```

Most converted types will be auto-detected if you provide data without explicit types. However, it is still recommended to provide type information when possible: auto-detection can fail or mis-guess (zero rows throws an exception, floats might be typed as int, etc.).
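For example, per the parquet format specification, `TIMESTAMP_MILLIS` stores each value in the underlying `INT64` column as milliseconds since the Unix epoch, which is exactly what `Date.prototype.getTime()` returns:

```javascript
// A Date written with type INT64 / converted_type TIMESTAMP_MILLIS
// is stored as milliseconds since the Unix epoch
const date = new Date(1000000)
const millis = BigInt(date.getTime())
console.log(millis) // 1000000n
```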

## References

- https://github.com/hyparam/hyparquet
- https://github.com/hyparam/hyparquet-compressors
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing