---
title: Spreadsheet Data in Pandas
sidebar_label: Python (Pandas)
description: Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence.
pagination_prev: demos/cloud/index
pagination_next: demos/bigdata/index
---

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Pandas[^1] is a Python software library for data analysis.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate to the
Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts,
generate DataFrames from workbooks, and write DataFrames back to workbooks.

:::note

This demo was tested in the following deployments:

| Architecture | V8 version    | Pandas | Python | Date       |
|:-------------|:--------------|:-------|:-------|:-----------|
| `darwin-x64` | `11.5.150.16` | 2.0.3  | 3.11.4 | 2023-07-29 |

:::

:::info pass

Pandas includes limited support for reading spreadsheets (`pandas.read_excel`)
and writing XLSX spreadsheets (`pandas.DataFrame.to_excel`).

The SheetJS approach supports many common spreadsheet formats that are not
supported by the current set of Pandas codecs and offers greater flexibility in
processing complex worksheets.

:::

## Integration Details

JS code cannot be run directly in the Python interpreter. To run JS code from
Python, JavaScript engines[^2] can be embedded in CPython modules.

### Loading SheetJS

This demo uses the `STPyV8` module[^3] to access the V8 JavaScript engine.

_Initialize V8_

The engine library provides a convenient context manager `JSContext` for context
resource management. Within the context, the `eval` method can evaluate code:

```py
from STPyV8 import JSContext

# Initialize JS context
with JSContext() as ctxt:
  # Run code
  res = ctxt.eval("'Sheet' + 'JS'")

  # print result
  print(res)
```

`STPyV8` handles data interchange for common types. Arrays and JS objects can be
translated to Python `list` and `dict` respectively. The following `convert`
function is used in the test suite[^4]:

```py
# JSArray and JSObject are exported by the STPyV8 module
from STPyV8 import JSArray, JSObject

# from `tests/test_Wrapper.py` in the STPyV8 library
# License: Apache 2.0
def convert(obj):
  if isinstance(obj, JSArray):
    return [convert(v) for v in obj]
  if isinstance(obj, JSObject):
    return dict([[str(k), convert(obj.__getattr__(str(k)))] for k in obj.__dir__()])
  return obj
```

_Loading the Library_

The [Standalone scripts](/docs/getting-started/installation/standalone) can be
parsed and evaluated from the JS engine. Once evaluated, the `XLSX` variable is
available as a global.

Assuming the standalone library is in the same directory as the source file,
the script can be evaluated with `eval`:

```py
  # Within a JSContext, open `xlsx.full.min.js` and evaluate
  with open("xlsx.full.min.js") as f:
    ctxt.eval(f.read())
```

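The loading step can be wrapped in a small helper that fails early when the script is missing and confirms that the `XLSX` global is usable. This is a minimal sketch, not part of the demo scripts: `load_sheetjs` is a hypothetical name, and `ctxt` is assumed to be any object with an `eval` method that behaves like `JSContext`. `XLSX.version` is the version string exposed by the library.

```python
from pathlib import Path

def load_sheetjs(ctxt, script_path="xlsx.full.min.js"):
  """Evaluate the SheetJS standalone script in `ctxt` and return `XLSX.version`."""
  path = Path(script_path)
  if not path.is_file():
    # Fail before touching the JS engine if the script was not downloaded
    raise FileNotFoundError(f"SheetJS standalone script not found: {path}")
  # The standalone build is plain JavaScript text, read here as UTF-8
  ctxt.eval(path.read_text(encoding="utf-8"))
  # Reading a property of the global confirms that the script was evaluated
  return ctxt.eval("XLSX.version")
```

Within a `with JSContext() as ctxt:` block, `print(load_sheetjs(ctxt))` would print the version of the loaded library.
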
### Reading Files

The following diagram depicts the spreadsheet salsa:

```mermaid
flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    base64(Base64\nstring)
    wb((SheetJS\nWorkbook))
    aoo(array of\nobjects)
  end
  subgraph Pandas operations
    lod(list of\nrecords)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| base64
  base64 --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_json`\nExtract Data| aoo
  aoo --> |`convert`\nPython ops|lod
  lod --> |`from_records`\nPandas ops| df
```

At a high level:

1) Pure Python operations read the file and generate a Base64 string

2) SheetJS libraries parse the string and generate JS records

3) JS engine operations translate the rows to a Python `list` of `dict`s

4) Pandas operations translate the Python data to a DataFrame

#### Read files

The safest format for data interchange is Base64-encoded strings:

```py
from base64 import b64encode

with open(path, mode="rb") as f:
  file_bytes = f.read()
  b64 = b64encode(file_bytes)
```

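`b64encode` returns a `bytes` object rather than a `str`. The sketch below shows one way to package the read step as a function that returns text, which can be convenient when the value is logged or passed around. `file_to_base64` is a hypothetical helper name, and decoding to `str` is an assumption made for illustration rather than something the demo requires.

```python
from base64 import b64encode

def file_to_base64(path):
  """Read a file in binary mode and return its contents as a Base64 `str`."""
  with open(path, mode="rb") as f:
    raw = f.read()
  # b64encode returns bytes; the Base64 alphabet is ASCII, so decoding is lossless
  return b64encode(raw).decode("ascii")
```

Reading in binary mode (`"rb"`) matters: spreadsheet files such as XLSX or Numbers are binary containers, and text mode would attempt to decode them.
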
#### Parse bytes

From JS code, `XLSX.read`[^5] parses the Base64 string:

```py
wb = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")(b64)
```

The `wb` object follows the "Common Spreadsheet Format"[^6], an in-memory format
for representing workbooks, worksheets, cells, and spreadsheet features.

#### Get First Worksheet

As explained in the "Workbook Object"[^7] section:
- the `SheetNames` property is an ordered list of the sheet names in the workbook
- the `Sheets` property of the workbook object is an object whose keys are sheet
  names and whose values are sheet objects.

For use in Python, the `SheetNames` array must be converted to a `list`:

```py
sheet_names = convert(wb.SheetNames)
first_sheet_name = sheet_names[0]
```

Since utility functions will process the worksheet object from JavaScript, it is
preferable not to convert the object:

```py
first_sheet = wb.Sheets[first_sheet_name] # do not convert
```

#### Generate List of Records

In JavaScript, the equivalent of the "`list` of `dict`s" or "`list` of records"
is "array of objects". They can be created with `XLSX.utils.sheet_to_json`[^8]:

```py
rows = convert(ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")(first_sheet))
```

#### Generate Pandas DataFrame

`rows` is a `list` of `dict` objects. `from_records`[^9] understands this data
shape and generates a proper DataFrame:

```py
import pandas as pd

df = pd.DataFrame.from_records(rows)
```

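To make the data shape concrete, the following sketch builds a DataFrame from a hand-written `list` of `dict` objects. The sample rows mirror the records printed by the complete demo; they are typed in here for illustration and are not produced by SheetJS in this snippet.

```python
import pandas as pd

# Sample rows shaped like the converted `sheet_to_json` output
rows = [
  {"Name": "Bill Clinton", "Index": 42},
  {"Name": "GeorgeW Bush", "Index": 43},
  {"Name": "Barack Obama", "Index": 44},
]

df = pd.DataFrame.from_records(rows)

# Column labels come from the dictionary keys; each record becomes one row
print(list(df.columns))
print(len(df))
```
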
### Writing Files

The writing process looks similar to the reading process in reverse:

```mermaid
flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    base64(Base64\nstring)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nBase64| base64
  base64 --> |`open`/`write`\nPython ops| file
```

At a high level:

1) Pandas operations translate the Python data to a JSON string

2) JS engine operations translate the JSON string to an array of objects

3) SheetJS libraries convert the array to a worksheet and generate a
   Base64-encoded workbook

4) Pure Python operations decode the Base64 string and write the bytes to file.

#### Generate JSON

`DataFrame#to_json`[^10] with the option `orient="records"` generates a JSON
string that encodes an array of objects:

```py
json = df.to_json(orient="records")
```

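A short sketch of what `orient="records"` produces, using a small hand-built DataFrame (the sample values are assumptions for illustration). Parsing the string with the standard `json` module on the Python side mirrors what `JSON.parse` sees on the JS side. The result is stored in `payload` here only to avoid shadowing the `json` module.

```python
import json
import pandas as pd

df = pd.DataFrame({"Name": ["Bill Clinton", "Barack Obama"], "Index": [42, 44]})

# orient="records" emits one JSON object per DataFrame row
payload = df.to_json(orient="records")
print(payload)

# The string parses back to a list of dicts, one per row
records = json.loads(payload)
print(records[0]["Name"])
```
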
#### Generate Worksheet

In JavaScript, `JSON.parse` will interpret the string as an array of objects.
`XLSX.utils.json_to_sheet`[^11] generates a SheetJS worksheet object:

```py
sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)) )")(json)
```

#### Export Enhancements

At this point, there are many options for improving the appearance of the sheet.
For example, the "Export Tutorial"[^12] shows how to adjust column widths.

:::tip pass

[SheetJS Pro](https://sheetjs.com/pro) offers additional styling options such as
cell styling and frozen rows.

"Pro Edit" offers a special approach for inserting data into an existing file.

:::

#### Generate Workbook

`XLSX.utils.book_new`[^13] creates a new workbook and `XLSX.utils.book_append_sheet`[^14]
appends a worksheet to the workbook. The new worksheet will be called "Export":

:::note pass

The code in the string literal is reproduced below:

```js
(ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
}
```

:::

```py
book = ctxt.eval("""((ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
})""")(sheet, "Export")
```

#### Generate File

`XLSX.write`[^15] with the option `type: "base64"` generates the file data and
returns it as a Base64 string:

```py
b64 = ctxt.eval("(wb => XLSX.write(wb, {type:'base64', bookType:'xls'}))")(book)
```

With the Base64 string, standard Python operations can create a file:

```py
from base64 import b64decode

raw = b64decode(b64)
with open("export.xls", mode="wb") as f:
  f.write(raw)
```

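The decode-and-write step can also be expressed as a reusable function. This is a sketch under stated assumptions: `write_base64_file` is a hypothetical helper name, and `validate=True` assumes the string contains only Base64 alphabet characters with no line breaks. If the input may contain line breaks, omit that argument.

```python
from base64 import b64decode

def write_base64_file(b64, path):
  """Decode a Base64 string (or bytes) and write the raw bytes to `path`."""
  # validate=True raises binascii.Error on characters outside the Base64
  # alphabet instead of silently discarding them
  raw = b64decode(b64, validate=True)
  with open(path, mode="wb") as f:
    f.write(raw)
  # Return the byte count so callers can log or sanity-check the export
  return len(raw)
```

For example, `write_base64_file(b64, "export.xls")` would write the same file as the snippet above. Decoding happens before the file is opened, so invalid input does not leave an empty file behind.
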
## Complete Demo

This example will extract data from an Apple Numbers spreadsheet and generate a
DataFrame. The DataFrame will be exported to a legacy XLS spreadsheet.

### Engine Setup

0) Follow the official installation instructions[^16].

<details><summary><b>Instructions for macOS 12</b> (click to show)</summary>

- Install `boost-python3` package using `brew`:

```bash
brew install boost-python3
```

- Identify python version:

```bash
python3 --version
```

:::note pass

When the demo was last tested, the version was `3.11.4`

:::

- [Download latest release](https://github.com/cloudflare/stpyv8/releases)

```bash
curl -LO https://github.com/cloudflare/stpyv8/releases/download/v11.5.150.16/stpyv8-macos-12-python-3.11.zip
```

- Extract ZIP file and enter folder

```bash
unzip stpyv8-macos-12-python-3.11.zip
cd stpyv8-macos-12-3.11
```

- Move `icudtl.dat` to `/Library/Application Support/STPyV8/`:

```bash
sudo mkdir -p /Library/Application\ Support/STPyV8
sudo mv icudtl.dat /Library/Application\ Support/STPyV8/
```

- Install wheel:

```bash
sudo python3 -m pip install --upgrade *.whl
cd ..
```

</details>

### Demo

1) Follow the [standalone script](/docs/getting-started/installation/standalone)
   instructions to download the script:

<CodeBlock language="bash">{`\
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js`}
</CodeBlock>

2) Install Pandas. On macOS:

```bash
sudo python3 -m pip install pandas
```

3) Download the following test scripts and files:

- [`pres.numbers` test file](https://sheetjs.com/pres.numbers)
- [`sheetjs.py` wrapper](pathname:///pandas/sheetjs.py)
- [`SheetJSPandas.py` script](pathname:///pandas/SheetJSPandas.py)

```bash
curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
```

4) Run the script:

```bash
python3 SheetJSPandas.py pres.numbers
```

If successful, it will display data rows in the file:

```
Reading from sheet Sheet1
{'Name': 'Bill Clinton', 'Index': 42}
{'Name': 'GeorgeW Bush', 'Index': 43}
{'Name': 'Barack Obama', 'Index': 44}
{'Name': 'Donald Trump', 'Index': 45}
{'Name': 'Joseph Biden', 'Index': 46}
```

If Pandas is installed, the script will display DataFrame metadata:

```
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)
```

It will also export to `pres.xls`. The file can be read in a spreadsheet editor.

[^1]: The official documentation site is <https://pandas.pydata.org/> and the official distribution point is <https://pypi.org/project/pandas/>
[^2]: See ["Other Languages"](/docs/demos/engines/) for more examples.
[^3]: [`STPyV8`](https://github.com/cloudflare/stpyv8) is a fork of the original [`PyV8` project](https://pypi.org/project/PyV8/). It is available under the permissive Apache 2.0 License. Special thanks to Flier Lu and CloudFlare!
[^4]: See [`tests/test_Wrapper.py`](https://github.com/cloudflare/stpyv8/blob/410b31abe7a103b408d362cb872ce81604281c48/tests/test_Wrapper.py#L15) in the `STPyV8` code repository.
[^5]: See [`read` in "Reading Files"](/docs/api/parse-options)
[^6]: See ["SheetJS Data Model"](/docs/csf/)
[^7]: See ["Workbook Object"](/docs/csf/book)
[^8]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^9]: See [`pandas.DataFrame.from_records`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html) in the Pandas documentation.
[^10]: See [`pandas.DataFrame.to_json`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html) in the Pandas documentation.
[^11]: See [`json_to_sheet` in "Utilities"](/docs/api/utilities/array#array-of-objects-input)
[^12]: See ["Clean up Workbook"](/docs/getting-started/examples/export#clean-up-workbook) in "Export Tutorial".
[^13]: See [`book_new` in "Utilities"](/docs/api/utilities/wb)
[^14]: See [`book_append_sheet` in "Utilities"](/docs/api/utilities/wb)
[^15]: See [`write` in "Writing Files"](/docs/api/write-options)
[^16]: See ["Installing"](https://github.com/cloudflare/stpyv8#installing) in the `STPyV8` project documentation