docs.sheetjs.com/21-pandas.md at 952244b91739a6a52622e8864cfb201ab65a8b12

Rasmus/docs.sheetjs.com

2024-01-30 04:27:22 -05:00


title	sidebar_label	description	pagination_prev	pagination_next
Spreadsheet Data in Pandas	Python + Pandas	Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence.	demos/index	demos/frontend/index

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Pandas¹ is a Python software library for data analysis.

SheetJS is a JavaScript library for reading and writing data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate it to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.

The "Complete Example" includes a wrapper library that simplifies importing and exporting spreadsheets.

:::info pass

Pandas includes limited support for reading spreadsheets (pandas.read_excel) and writing XLSX spreadsheets (pandas.DataFrame.to_excel).

SheetJS supports common spreadsheet formats that Pandas cannot process.

SheetJS operations also offer more flexibility in processing complex worksheets.

:::

:::note Tested Environments

This demo was tested in the following deployments:

Architecture	JS Engine	Pandas	Python	Date
`darwin-x64`	Duktape `2.7.0`	2.0.3	3.11.7	2024-01-29
`linux-x64`	Duktape `2.7.0`	1.5.3	3.11.3	2024-01-29

:::

Integration Details

sheetjs.py is a wrapper script that provides helper methods for reading and writing spreadsheets. Installation notes are included in the "Complete Example" section.

JS in Python

JS code cannot be directly evaluated in Python implementations.

To run JS code from Python, JavaScript engines² can be embedded in Python modules or dynamically loaded using the ctypes foreign function library³. This demo uses ctypes with the Duktape engine.
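
As a rough illustration of the ctypes mechanics described above, the following sketch loads a shared library and calls a C function from Python. It uses `strlen` from the C standard library as a stand-in, since loading Duktape requires building `libduktape` first. The helper name `c_strlen` is hypothetical and is not part of sheetjs.py, and the sketch assumes a POSIX platform.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None) on
# POSIX systems exposes the symbols already loaded into the current process,
# which include the C standard library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(text: str) -> int:
    """Hypothetical helper: measure a UTF-8 encoded string with C strlen."""
    return libc.strlen(text.encode("utf-8"))

print(c_strlen("SheetJS"))
```

The same pattern (load the library, declare argument and return types, then call) applies to any shared library, including a Duktape build.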

Wrapper

The script exports a class named SheetJSWrapper. It is a context manager that initializes the Duktape engine and executes the SheetJS scripts on entry. All work should be performed within the context:

#!/usr/bin/env python3
from sheetjs import SheetJSWrapper

with SheetJSWrapper() as sheetjs:

  # Parse file
  wb = sheetjs.read_file("pres.numbers")
  print("Loaded file pres.numbers")

  # Get first worksheet name
  first_ws_name = wb.get_sheet_names()[0]
  print(f"Reading from sheet {first_ws_name}")

  # Generate DataFrame from first worksheet
  df = wb.get_df(first_ws_name)
  print(df.info())

  # Export DataFrame to XLSB
  sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
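
The context manager pattern used by the wrapper can be sketched in isolation. The `EngineContext` class below is a hypothetical stand-in that only records lifecycle events; it does not load Duktape, and the assumption that the engine is torn down on exit is an illustration, not a statement about how sheetjs.py is written.

```python
class EngineContext:
    """Hypothetical stand-in for a wrapper that owns an embedded JS engine."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # A real wrapper would create the engine and evaluate library scripts here
        self.events.append("engine started")
        self.events.append("scripts loaded")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the body raises, so cleanup always happens
        self.events.append("engine destroyed")
        return False  # do not suppress exceptions from the body

ctx = EngineContext()
with ctx as engine:
    engine.events.append("work performed")

print(ctx.events)
```

Keeping setup in `__enter__` and cleanup in `__exit__` is why all work should happen inside the `with` block: outside of it, the engine may no longer exist.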

Reading Files

sheetjs.read_file accepts a path to a spreadsheet file. It will parse the file and return an object representing the workbook.

The get_sheet_names method of the workbook returns a list of sheet names.

The get_df method of the workbook generates a DataFrame from a worksheet. A specific sheet can be selected by passing its name.

For example, the following code reads the file at path (such as pres.numbers) and generates a DataFrame from the second worksheet:

with SheetJSWrapper() as sheetjs:
  # Parse file
  wb = sheetjs.read_file(path)

  # Generate DataFrame from second worksheet
  ws_name = wb.get_sheet_names()[1]
  df = wb.get_df(ws_name)

  # Print metadata
  print(df.info())

Under the hood, sheetjs.py performs the following steps:

flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    bytes(Byte\nstring)
    wb((SheetJS\nWorkbook))
    csv(CSV\nstring)
  end
  subgraph Pandas operations
    stream(CSV\nStream)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| bytes
  bytes --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_csv`\nExtract Data| csv
  csv --> |`StringIO`\nPython ops| stream
  stream --> |`read_csv`\nParse CSV| df

1) Pure Python operations read the spreadsheet file and generate a byte string.

2) SheetJS libraries parse the string and generate a clean CSV.

- The read method⁴ parses file bytes into a SheetJS workbook object.⁵
- After selecting a worksheet, sheet_to_csv⁶ generates a CSV string.

3) Python operations convert the CSV string to a stream object.⁷

4) The Pandas read_csv method⁸ ingests the stream and generates a DataFrame.
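
The Pandas half of this pipeline can be tried without a JS engine. In the sketch below, the CSV text is made-up sample data standing in for the output of `sheet_to_csv`; the column names merely mirror the demo output and the rows are not taken from pres.numbers.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for the string returned by sheet_to_csv
csv_text = "Name,Index\nAlice,1\nBob,2\n"

# Wrap the string in a file-like stream object
stream = StringIO(csv_text)

# read_csv accepts file-like objects, so no temporary file is needed
df = pd.read_csv(stream)

print(df.shape)
print(df["Index"].tolist())
```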

Writing Files

sheetjs.write_df accepts a DataFrame and a path. It will attempt to export the data to a spreadsheet file.

For example, the following code exports a DataFrame to SheetJSPandas.xlsb:

with SheetJSWrapper() as sheetjs:
  # Export DataFrame to XLSB
  sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")

Under the hood, sheetjs.py performs the following steps:

flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    u8a(File\nbytes)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nUint8Array| u8a
  u8a --> |`open`/`write`\nPython ops| file

1) The Pandas DataFrame to_json method⁹ generates a JSON string.

2) JS engine operations translate the JSON string to an array of objects.

3) SheetJS libraries process the data array and generate file bytes.

- The json_to_sheet method¹⁰ creates a SheetJS sheet object from the data.
- The book_new method¹¹ creates a SheetJS workbook that includes the sheet.
- The write method¹² generates the spreadsheet file bytes.

4) Pure Python operations write the bytes to file.
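
The first two steps can be approximated in pure Python. This sketch assumes the `orient="records"` option, which produces one JSON object per row; whether sheetjs.py uses that exact option is an assumption. `json.loads` stands in for `JSON.parse` in the JS engine, and the DataFrame contents are made-up sample data.

```python
import json

import pandas as pd

# Made-up sample data
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Index": [1, 2]})

# Assumption: orient="records" serializes each row as a JSON object
payload = df.to_json(orient="records")

# json.loads stands in for JSON.parse; the result is a list of dicts,
# the Python analogue of the "array of objects" shape
rows = json.loads(payload)

print(rows)
```

An array of row objects is the input shape that `json_to_sheet` expects, with object keys becoming column headers.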

Complete Example

This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.

Install Pandas:

sudo python3 -m pip install pandas

:::caution pass

On Arch Linux-based platforms including the Steam Deck, the install may fail:

error: externally-managed-environment

In these situations, Pandas must be installed through the package manager:

sudo pacman -Syu python-pandas

:::

Build the Duktape shared library:

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..

Copy the shared library to the current folder. When the demo was last tested, the shared library file name differed by platform:

OS	name
Darwin	`libduktape.207.20700.so`
Linux	`libduktape.so.207.20700`

cp duktape-*/libduktape.* .

Download the SheetJS Standalone scripts and move them to the project directory:

shim.min.js
xlsx.full.min.js

curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js

Download the following test scripts and files:

curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py

Edit the sheetjs.py script.

The lib variable declares the path to the library:

# highlight-next-line
lib = "libduktape.207.20700.so"

On macOS (Darwin), the name of the library is libduktape.207.20700.so:

# highlight-next-line
lib = "libduktape.207.20700.so"

On Linux, the name of the library is libduktape.so.207.20700:

# highlight-next-line
lib = "libduktape.so.207.20700"
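
As an alternative to editing the value by hand, the library name could be chosen at runtime. The sketch below is a hypothetical helper, not part of sheetjs.py; the file names are copied from the platform table above and apply to the Duktape 2.7.0 build that was tested.

```python
import platform

# File names from the platform table (Duktape 2.7.0 shared library builds)
DUKTAPE_LIBS = {
    "Darwin": "libduktape.207.20700.so",
    "Linux": "libduktape.so.207.20700",
}

def duktape_lib_name(system: str = "") -> str:
    """Hypothetical helper: return the library file name for a platform."""
    # Default to the platform reported by the running interpreter
    system = system or platform.system()
    try:
        return DUKTAPE_LIBS[system]
    except KeyError:
        raise RuntimeError(f"no known Duktape library name for {system!r}") from None

print(duktape_lib_name("Linux"))
```

With such a helper, the assignment in the script could read `lib = duktape_lib_name()`. Other platforms or Duktape versions may produce different file names.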

Run the script:

python3 SheetJSPandas.py pres.numbers

If successful, the script will display DataFrame metadata:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)

It will also export the DataFrame to SheetJSPandas.xlsb. The file can be inspected with a spreadsheet editor that supports XLSB files.

1. The official documentation site is https://pandas.pydata.org/ and the official distribution point is https://pypi.org/project/pandas/
2. See "Other Languages" for more examples.
3. See ctypes in the Python documentation.
4. See read in "Reading Files"
5. See "Workbook Object"
6. See sheet_to_csv in "Utilities"
7. See the examples in "IO tools" in the Pandas documentation.
8. See pandas.read_csv in the Pandas documentation.
9. See pandas.DataFrame.to_json in the Pandas documentation.
10. See json_to_sheet in "Utilities"
11. See book_new in "Utilities"
12. See write in "Writing Files"
