---
title: Reading Files
sidebar_position: 3
hide_table_of_contents: true
---

The main SheetJS method for reading files is `read`. It expects developers to
supply the actual data in a [supported representation](#input-type).

The `readFile` helper method accepts a filename and tries to read the specified
file using standard APIs. *It does not work in web browsers!*

**Parse file data and generate a SheetJS workbook object**

```js
var wb = XLSX.read(data, opts);
```

`read` attempts to parse `data` and return [a workbook object](/docs/csf/book).

The [`type`](#input-type) property of the `opts` object controls how `data` is
interpreted. For string data, the default interpretation is Base64.

**Read a specified file and generate a SheetJS workbook object**

```js
var wb = XLSX.readFile(filename, opts);
```

`readFile` attempts to read a local file with the specified `filename`.

:::caution pass

`readFile` works on specific platforms. **It does not support web browsers!**

The [NodeJS installation note](/docs/getting-started/installation/nodejs#usage)
includes additional instructions for non-standard use cases.

:::

:::tip pass

The SheetJS file format import codecs focus on raw data. Not all codecs support
all features. Features not described in the documentation may not be extracted.

[SheetJS Pro](https://sheetjs.com/pro) offers support for additional features,
including styling, images, graphs, and PivotTables.

:::

## Parsing Options

The read functions accept an options argument:

| Option Name | Default | Description |
|:--------------|:--------|:---------------------------------------------------|
| `type` | | [Input data representation](#input-type) |
| `raw` | `false` | Disable [value parsing in plaintext formats](#raw) |
| `dense` | `false` | If true, [generate dense worksheets](#dense) |
| `codepage` | | Use specified [code page encoding](#codepage) |
| `cellFormula` | `true` | Save [formulae to the `f` field](#formulae) |
| `cellHTML` | `true` | Parse text and [save HTML to the `h` field](#html) |
| `cellNF` | `false` | Save [number format to the `z` field](#text) |
| `cellStyles` | `false` | Save [style/theme info to the `s` field](#style) |
| `cellText` | `true` | Save [formatted text to the `w` field](#text) |
| `cellDates` | `false` | [Generate proper date (type `d`) cells](#dates) |
| `dateNF` | | If specified, [override date code 14](#dates) |
| `sheetStubs` | `false` | [Create cells of type `z` for stub cells](#stubs) |
| `sheetRows` | `0` | If >0, read the [specified number of rows](#range) |
| `bookDeps` | `false` | If true, parse calculation chains |
| `bookFiles` | `false` | Add [raw files](#files) to book object |
| `bookProps` | `false` | If true, [only parse book metadata](#metadata) |
| `bookSheets` | `false` | If true, [only parse sheet names](#metadata) |
| `bookVBA` | `false` | If true, generate [VBA blob](#vba) |
| `password` | `""` | If specified, [decrypt workbook](#password) |
| `WTF` | `false` | [Do not suppress worksheet parsing errors](#wtf) |
| `sheets` | | Only parse [specified sheets](#sheets) |
| `nodim` | `false` | If true, calculate [worksheet ranges](#range) |
| `PRN` | `false` | If true, [allow parsing of PRN files](#prn) |
| `xlfn` | `false` | Use [raw formula function names](#formulae) |
| `FS` | | [DSV Field Separator override](#dsv) |
| `UTC` | `true` | Parse [text dates and times using UTC](#tz) |
### Cell-Level Options
#### Dates
By default, for consistency with spreadsheet applications, date cells are stored
as numeric cells (type `n`) with special number formats. If `cellDates` is
enabled, date codes are converted to proper Date objects.
Excel file formats (including XLSX, XLSB, and XLS) support a locale-specific
date format, typically stored as date code 14 or the string `m/d/yy`. The
formatted text for some cells will change based on the computer locale. SheetJS
parsers use the `en-US` form by default. If the `dateNF` option is set, that
number format string will be used.
["Dates and Times"](/docs/csf/features/dates) covers features in more detail.
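
As a rough illustration of how a date code relates to a Date object, the sketch below converts a date code from the 1900 date system. The helper name `dateCodeToDate` is hypothetical and is not part of the SheetJS API. It assumes UTC, ignores the 1904 date system, and is only meant for date codes of 61 or greater, since Excel treats 1900 as a leap year.

```js
// Hypothetical helper (not part of the SheetJS API): convert a date code from
// the 1900 date system to a JS Date, interpreting the value in UTC.
// Assumes code >= 61; smaller codes are affected by the 1900 leap year quirk.
var MS_PER_DAY = 24 * 60 * 60 * 1000;
var EPOCH_1900 = Date.UTC(1899, 11, 30); // the moment assigned to date code 0

function dateCodeToDate(code) {
  // the fractional part of the date code is the time of day
  return new Date(EPOCH_1900 + Math.round(code * MS_PER_DAY));
}

var d = dateCodeToDate(45000.5);
console.log(d.toISOString()); // 2023-03-15T12:00:00.000Z
```
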
#### Formulae
For some file formats, the `cellFormula` option must be explicitly enabled to
ensure that formulae are extracted.
Newer Excel functions are serialized with the `_xlfn.` prefix, hidden from the
user. By default, the file parsers will strip `_xlfn.` and similar prefixes.
If the `xlfn` option is enabled, the prefixes will be preserved.
[The "Formulae" docs](/docs/csf/features/formulae#prefixed-future-functions)
covers this in more detail.
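
The effect of the default behavior can be approximated with a simple text replacement. This is a minimal sketch, not the parser logic: `stripXlfn` is a hypothetical helper, it only handles the `_xlfn.` prefix, and it does not try to skip string literals inside the formula.

```js
// Hypothetical sketch: remove the `_xlfn.` prefix from a formula string.
// Other prefixes and quoted strings are not handled here.
function stripXlfn(formula) {
  return formula.replace(/_xlfn\./g, "");
}

var stored = "_xlfn.XLOOKUP(A1,B:B,C:C)+_xlfn.CONCAT(D1,E1)";
console.log(stripXlfn(stored)); // XLOOKUP(A1,B:B,C:C)+CONCAT(D1,E1)
```
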
#### Formatted Text {#text}
Many common spreadsheet formats (including XLSX, XLSB, and XLS) store numeric
values and number formats. Applications are expected to use the number formats
to display currency strings, dates, and other values.
Under the hood, parsers use the [SSF Number Formatter](/docs/constellation/ssf)
library to generate formatted text.
By default, formatted text is generated. If the `cellText` option is false,
formatted text will not be written.
By default, cell number formats are not preserved. If the `cellNF` option is
enabled, number format strings will be saved to the `z` field of cell objects.
["Number Formats"](/docs/csf/features/nf) covers the features in more detail.
:::note pass
Even if `cellNF` is false, formatted text will be generated and saved to `w`.
:::
#### Text and Cell Styling {#style}
By default, SheetJS CE parsers focus on data extraction.
If the `cellStyles` option is `true`, other styling metadata including
[row](/docs/csf/features/rowprops) and [column](/docs/csf/features/colprops)
properties will be parsed.
:::tip pass
[SheetJS Pro](https://sheetjs.com/pro) offers cell / text styling, conditional
formatting and additional styling options.
:::
#### HTML Formatted Text {#html}
Spreadsheet applications support a limited form of rich text styling.
If the `cellHTML` option is `true`, some file parsers will attempt to translate
the rich text to standard HTML with inner tags for bold text and other styles.
:::tip pass
[SheetJS Pro](https://sheetjs.com/pro) offers additional styling options,
conversions for all supported file formats, and whole-worksheet HTML generation.
:::
### Sheet-Level Options
#### Dense
By default, the `read` and `readFile` functions generate "sparse" worksheets.
When the `dense` option is set to `true`, the functions will generate "dense"
worksheets that may be more efficient in modern browsers.
The ["Cell Storage"](/docs/csf/sheet#cell-storage) section explains worksheet
structures in more detail.
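
The difference between the two layouts can be sketched with a lookup helper. The example assumes the layout described in the "Cell Storage" section: sparse worksheets store cells under A1-style keys, while dense worksheets store cells in the `!data` array of arrays. `encodeCol` and `getCell` are hypothetical helpers written for this sketch.

```js
// Convert a zero-based column index to a column label (0 -> "A", 26 -> "AA").
function encodeCol(c) {
  var s = "";
  for (++c; c > 0; c = Math.floor((c - 1) / 26)) {
    s = String.fromCharCode(((c - 1) % 26) + 65) + s;
  }
  return s;
}

// Hypothetical helper: fetch the cell at zero-based row `r` and column `c`.
function getCell(ws, r, c) {
  // dense: array of rows, each row an array of cell objects
  if (ws["!data"] != null) return (ws["!data"][r] || [])[c];
  // sparse: cell objects stored under A1-style keys
  return ws[encodeCol(c) + (r + 1)];
}

var sparse = { "!ref": "A1:B2", B2: { t: "n", v: 5 } };
var dense = { "!ref": "A1:B2", "!data": [[], [undefined, { t: "n", v: 5 }]] };
console.log(getCell(sparse, 1, 1).v, getCell(dense, 1, 1).v); // 5 5
```
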

:::note pass

[Utility functions that process SheetJS workbook objects](/docs/api/utilities/)
typically support sparse and dense worksheets.

:::

#### Range
Some file formats, including XLSX and XLS, can self-report worksheet ranges.
`read` and `readFile` assume the self-reported worksheet ranges are correct. If
files include cells outside this range, the parsers will save cell information
but other utility functions will ignore those cells.

If the `sheetRows` option is set, up to `sheetRows` rows will be parsed from the
worksheets. Since the header row counts as one of the parsed rows, JSON object
output will contain `sheetRows-1` rows. The `!ref` property of the worksheet
will hold the adjusted range. For formats that self-report sheet ranges, the
`!fullref` property will hold the original range.

The `nodim` option instructs the parser to ignore self-reported ranges and use
the actual cells in the worksheet to determine the range. This addresses known
issues with non-compliant third-party exporters.
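
The idea behind `nodim` can be sketched for a sparse worksheet: skip the stored `!ref` and derive the range from the cell addresses that are actually present. `decodeA1`, `encodeA1` and `actualRange` are hypothetical helpers for illustration, not the parser code.

```js
// Parse an A1-style address into zero-based row and column indices.
function decodeA1(addr) {
  var m = /^([A-Z]+)(\d+)$/.exec(addr);
  if (!m) return null;
  var c = 0;
  for (var i = 0; i < m[1].length; ++i) c = 26 * c + (m[1].charCodeAt(i) - 64);
  return { r: parseInt(m[2], 10) - 1, c: c - 1 };
}

// Build an A1-style address from zero-based row and column indices.
function encodeA1(r, c) {
  var s = "";
  for (++c; c > 0; c = Math.floor((c - 1) / 26)) {
    s = String.fromCharCode(((c - 1) % 26) + 65) + s;
  }
  return s + (r + 1);
}

// Hypothetical sketch: compute the range covered by the cells in `ws`.
function actualRange(ws) {
  var s = { r: Infinity, c: Infinity }, e = { r: -1, c: -1 };
  Object.keys(ws).forEach(function (key) {
    if (key.charAt(0) === "!") return; // skip metadata keys such as `!ref`
    var cell = decodeA1(key);
    if (!cell) return;
    s.r = Math.min(s.r, cell.r); s.c = Math.min(s.c, cell.c);
    e.r = Math.max(e.r, cell.r); e.c = Math.max(e.c, cell.c);
  });
  if (e.r < 0) return null; // no cells found
  return encodeA1(s.r, s.c) + ":" + encodeA1(e.r, e.c);
}

// The stored range claims one cell, but the sheet holds data through C5.
var ws = { "!ref": "A1:A1", A1: { t: "s", v: "x" }, C5: { t: "n", v: 1 } };
console.log(actualRange(ws)); // A1:C5
```
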

#### Stubs

Some file formats, including XLSX and XLS, can specify cells without cell data.
For example, cells covered by a [merged cell block](/docs/csf/features/merges)
are technically invalid but files may include metadata.

By default, the cells are skipped. If the `sheetStubs` option is `true`, these
cells will be parsed as [stub cells](/docs/csf/cell#cell-types).

### Book-Level Options

#### VBA
When a macro-enabled file is parsed, if the `bookVBA` option is `true`, the raw
VBA blob will be stored in the `vbaraw` property of the workbook.
["VBA and Macros"](/docs/csf/features/vba) covers the features in more detail.
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
The `bookVBA` option merely exposes the raw VBA CFB object. It does not parse
the data.
XLSM and XLSB store the VBA CFB object in `xl/vbaProject.bin`. BIFF8 XLS mixes
the VBA entries alongside the core Workbook entry, so the library generates a
new blob from the XLS CFB container that works in XLSM and XLSB files.
</details>
#### Workbook Metadata {#metadata}
By default, the data from each worksheet is parsed.
If any of the following options are passed, parsers will not parse sheet data.
They will parse enough of the workbook to extract the requested information.
| Option | Extracted Data |
|:-------------|:--------------------|
| `bookProps` | Workbook properties |
| `bookSheets` | Worksheet names |
The options apply to XLSX, XLSB, XLS and XLML parsers.
#### Worksheets {#sheets}
By default, all worksheets are parsed. The `sheets` option limits which sheets
are parsed.
If the `sheets` option is a number, the number is interpreted as a zero-based
index. For example, `sheets: 2` instructs the parser to read the third sheet.
If the `sheets` option is text, the string is interpreted as a worksheet name.
The name is case-insensitive. `sheets: "Sheet1"` instructs the parser to read
the worksheet named "Sheet1".
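
The selection rules in this section (numbers, names, and arrays that mix both) can be sketched as a function that maps the `sheets` option to worksheet indices. `selectSheets` is a hypothetical helper, and the worksheet names are made up for the example.

```js
// Hypothetical sketch: given worksheet names in workbook order, return the
// zero-based indices of the worksheets selected by the `sheets` option.
function selectSheets(names, sheets) {
  var wanted = sheets == null ? null : [].concat(sheets);
  var out = [];
  names.forEach(function (name, idx) {
    var hit = wanted == null || wanted.some(function (s) {
      if (typeof s === "number") return s === idx;           // zero-based index
      return String(s).toLowerCase() === name.toLowerCase(); // case-insensitive
    });
    if (hit) out.push(idx); // each worksheet is listed at most once
  });
  return out;
}

var names = ["Data", "Summary", "Sheet1"];
console.log(JSON.stringify(selectSheets(names, 2)));              // [2]
console.log(JSON.stringify(selectSheets(names, "sheet1")));       // [2]
console.log(JSON.stringify(selectSheets(names, [2, "Sheet1"])));  // [2]
console.log(JSON.stringify(selectSheets(names, [0, "summary"]))); // [0,1]
```
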
If the `sheets` option is an array of numbers and text, each matching worksheet
will be parsed. `sheets: [2, "Sheet1"]` instructs the parser to read the third
sheet and the sheet named "Sheet1". If the third worksheet is coincidentally
named "Sheet1", only one worksheet will be parsed.
### File-Level Options
#### Password Protection {#password}
SheetJS CE currently supports XOR encryption in XLS files. Errors will be thrown
when trying to parse files using unsupported encryption methods.
:::tip pass
[SheetJS Pro](https://sheetjs.com/pro) offers support for additional encryption
schemes, including the AES-CBC schemes used in XLSX / XLSM / XLSB files and the
RC4 schemes used in newer XLS files.
:::
#### Lotus Formatted Text (PRN) {#prn}
Lotus Formatted Text (`PRN`) worksheets are plain text files that do not include
delimiter characters. Each cell in a column has the same width.
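
To make the fixed-width layout concrete, the sketch below splits lines into cells using column widths supplied by the caller. `splitFixedWidth` and the sample data are hypothetical. The library parser has to work without being told the widths, so this only illustrates the file layout.

```js
// Hypothetical sketch: split fixed-width lines into cells. `widths` lists the
// number of characters in each column and is supplied by the caller.
function splitFixedWidth(text, widths) {
  return text.split(/\r?\n/).filter(function (line) {
    return line.length > 0;
  }).map(function (line) {
    var cells = [], pos = 0;
    widths.forEach(function (w) {
      cells.push(line.slice(pos, pos + w).trim()); // drop the padding
      pos += w;
    });
    return cells;
  });
}

// Two columns: 10 characters for the name, 5 characters for the quantity.
var prn = [
  "Name".padEnd(10) + "Qty".padEnd(5),
  "Apples".padEnd(10) + "12".padStart(5),
  "Pears".padEnd(10) + "7".padStart(5)
].join("\n");

console.log(JSON.stringify(splitFixedWidth(prn, [10, 5])));
// [["Name","Qty"],["Apples","12"],["Pears","7"]]
```
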
If the `PRN` option is set, the plaintext parser will attempt to parse some
plaintext files as if they follow the `PRN` format.
:::note pass
If the `PRN` option is set, text files that do not include commas or semicolons
or other common delimiters may not be parsed as expected.
This option should not be enabled unless it is known that the file was exported
from Lotus 1-2-3 or from Excel using the "Lotus Formatted Text (`PRN`)" format.
:::
#### Value Parsing {#raw}
Spreadsheet software, including Excel, aggressively tries to interpret values
from CSV and other plain text. This leads to surprising behavior[^1]!

If the `raw` option is true, value parsing will be suppressed. All cell values
are treated as strings.

The `raw` option affects the following formats: HTML, CSV, PRN, DIF, RTF.
The `raw` option does not affect XLSX, XLSB, XLS and other file formats that
support explicit value typing.
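
A toy model shows why suppressing value parsing matters for identifiers with leading zeroes. `parseValue` is a hypothetical stand-in that only recognizes numbers. The real parsers recognize more value types, so treat this as an illustration of the effect of `raw` rather than the library logic.

```js
// Toy illustration (not the library logic): convert numeric-looking text to a
// numeric cell unless `raw` is set, in which case the text is kept as-is.
function parseValue(text, raw) {
  if (!raw && text.trim() !== "" && isFinite(Number(text))) {
    return { t: "n", v: Number(text) };
  }
  return { t: "s", v: text };
}

console.log(JSON.stringify(parseValue("00123", false))); // {"t":"n","v":123}
console.log(JSON.stringify(parseValue("00123", true)));  // {"t":"s","v":"00123"}
```
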
:::note pass
See [Issue #3331](https://git.sheetjs.com/sheetjs/sheetjs/issues/3145) in the
SheetJS CE bug tracker for more details.
:::
#### Code Page Encoding {#codepage}
Spreadsheet applications support a number of legacy encodings. Plaintext files
will appear different when opened in different computers in different regions.
By default, the parsers use the most common "English (United States)" encodings.
The `codepage` option controls the encoding in BIFF2 - BIFF5 XLS files without
`CodePage` records, some legacy formats including DBF, and in CSV files without
BOM in `type: "binary"`. BIFF8 XLS always defaults to 1200.
The `codepage` support library is not guaranteed to be loaded by default. The
["Installation"](/docs/getting-started/installation/) section describes how to
install and load the support library.
See ["Legacy Codepages"](/docs/constellation/codepage) for more details.
#### Date Processing {#tz}
Plaintext formats may include date and time values without timezone info. The
time `12:30 AM` is ambiguous.

In the wild, there are two popular approaches:

A) Spreadsheet software typically interprets time values using local timezones.
When opening a file in New York, `12:30 AM` will be parsed as `12:30 AM ET`.
When opening a file in Los Angeles, the time will be parsed as `12:30 AM PT`.

B) APIs use [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), the
most popular global time standard. `12:30 AM` will be parsed as the absolute
moment in time corresponding to `8:30 PM EDT` or `7:30 PM EST`.

By default, the parsers assume files are specified in UTC. When the `UTC` option
is explicitly set to `false`, dates and times are interpreted in the timezone of
the web browser or JavaScript engine.
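
The two interpretations can be compared with plain `Date` objects. The calendar date in this sketch (January 15, 2024) is an arbitrary choice for illustration. Only the `12:30 AM` time comes from the discussion above.

```js
// Two readings of the wall-clock text "2024-01-15 12:30 AM".
var asUTC = new Date(Date.UTC(2024, 0, 15, 0, 30)); // approach B: UTC
var asLocal = new Date(2024, 0, 15, 0, 30);         // approach A: local timezone

console.log(asUTC.toISOString());   // 2024-01-15T00:30:00.000Z
console.log(asLocal.toISOString()); // depends on the timezone of the JS engine

// The two moments differ by exactly the local offset from UTC, in minutes.
console.log((asLocal - asUTC) / 60000 === asLocal.getTimezoneOffset()); // true
```
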
#### Delimiter-Separated Values {#dsv}
The plaintext parser applies a number of heuristics to determine if files are
CSV (fields separated by commas), TSV (fields separated by tabs), PSV (fields
separated by `|`) or SSV (fields separated by `;`). The heuristics are based on
the presence of characters not in a double-quoted value.
The `FS` option instructs the parser to use the specified delimiter if multiple
delimiter characters are in the text. This bypasses the default heuristics.
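
A simplified version of the delimiter guess can be written as a count of unquoted characters. This sketch follows the priority order listed under "Guessing File Type" and is not the library implementation. `countUnquoted` and `guessDelimiter` are hypothetical names, and checking `FS` before the `sep=` line is an assumption.

```js
// Count candidate delimiters outside double-quoted values in the first 1024
// characters of the text.
function countUnquoted(text) {
  var counts = { "|": 0, ";": 0, "\t": 0, ",": 0 };
  var inQuote = false;
  for (var i = 0; i < text.length && i < 1024; ++i) {
    var ch = text.charAt(i);
    if (ch === '"') inQuote = !inQuote;
    else if (!inQuote && Object.prototype.hasOwnProperty.call(counts, ch)) ++counts[ch];
  }
  return counts;
}

// Hypothetical sketch: pick a field separator for delimiter-separated text.
function guessDelimiter(text, FS) {
  if (FS) return FS; // explicit override bypasses the heuristics
  var m = /^sep=(.)\r?\n/.exec(text);
  if (m) return m[1]; // `sep=` line names the delimiter
  var n = countUnquoted(text);
  if (n["|"] > n[";"] && n["|"] > n["\t"] && n["|"] > n[","]) return "|";
  if (n[";"] > n["\t"] && n[";"] > n[","]) return ";";
  if (n["\t"] > n[","]) return "\t";
  return ","; // comma found, or CSV fallback
}

console.log(guessDelimiter("a;b;c\n1;2,5;3")); // ;
console.log(guessDelimiter("a|b,c", ","));     // ,
```
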
#### Internal Files {#files}
Some file formats are structured as larger containers that include sub-files.
For example, XLSX files are ZIP files with XML sub-files.
If the `bookFiles` option is `true`, each sub-file will be preserved in the
workbook. The behavior depends on file type:
- `keys` array (paths in the ZIP) for ZIP-based formats
- `files` hash (mapping paths to objects representing the files) for ZIP-based
  formats
- `cfb` object for formats using CFB containers
#### Parsing Errors {#wtf}
By default, the workbook parser will suppress errors when parsing worksheets.
This ensures the valid worksheets from a multi-sheet workbook are parsed.
If the `WTF` option is enabled, the errors will not be suppressed.

### Input Type

The `type` parameter for `read` controls how data is interpreted:

| `type` | expected input |
|:---------|:----------------------------------------------------------------|
| `base64` | string: Base64 encoding of the file |
| `binary` | string: binary string (byte `n` is `data.charCodeAt(n)`) |
| `string` | string: JS string (only appropriate for UTF-8 text formats) |
| `buffer` | nodejs Buffer |
| `array` | array: array of 8-bit unsigned integers (byte `n` is `data[n]`) |
| `file` | string: path of file that will be read (nodejs only) |
Some common types are automatically deduced from the data input type, including
NodeJS `Buffer` objects, `Uint8Array` and `ArrayBuffer` objects, and arrays of
numbers.
When a JS `string` is passed with no `type`, the library assumes the data is a
Base64 string. `FileReader#readAsBinaryString` or ASCII data requires `"binary"`
type. DOM strings including `FileReader#readAsText` should use type `"string"`.
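
The string representations are easy to mix up, so the following NodeJS sketch shows the same four bytes in several forms. It only uses the standard `Buffer` API. The variable names are arbitrary, and the comments note the `type` value that matches each representation in the table above.

```js
// The first four bytes of a ZIP file ("PK\x03\x04") in several representations.
var bytes = [0x50, 0x4b, 0x03, 0x04];

var asArray = bytes;                        // type: "array"
var asBuffer = Buffer.from(bytes);          // type: "buffer"
var asBase64 = asBuffer.toString("base64"); // type: "base64"
var asBinary = asBuffer.toString("binary"); // type: "binary"

console.log(asBase64);                            // UEsDBA==
console.log(asBinary.length);                     // 4
console.log(asBinary.charCodeAt(1) === bytes[1]); // true
```

Since untyped strings are assumed to be Base64, passing `asBinary` without `type: "binary"` would be misinterpreted.
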
### Guessing File Type
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
Excel and other spreadsheet tools read the first few bytes and apply other
heuristics to determine a file type. This enables file type punning: renaming
files with the `.xls` extension will tell your computer to use Excel to open the
file but Excel will know how to handle it. This library applies similar logic:

| Byte 0 | Raw File Type | Spreadsheet Types |
|:-------|:--------------|:----------------------------------------------------|
| `0xD0` | CFB Container | BIFF 5/8 or protected XLSX/XLSB or WQ3/QPW or XLR |
| `0x09` | BIFF Stream | BIFF 2/3/4/5 |
| `0x3C` | XML/HTML | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x50` | ZIP Archive | XLSB or XLSX/M or ODS or UOS2 or NUMBERS or text |
| `0x49` | Plain Text | SYLK or plain text |
| `0x54` | Plain Text | DIF or plain text |
| `0xEF` | UTF-8 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0xFF` | UTF-16 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x00` | Record Stream | Lotus WK\* or Quattro Pro or plain text |
| `0x7B` | Plain Text | RTF or plain text |
| `0x0A` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x0D` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x20` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |

DBF files are detected based on the first byte as well as the third and fourth
bytes (corresponding to month and day of the file date).

Works for Windows files are detected based on the `BOF` record with type `0xFF`.
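
The first-byte table can be read as a lookup keyed by the first byte of the file. The sketch below is a hypothetical restatement of the "Raw File Type" column, not the library implementation. Treating every unlisted byte as plain text is an assumption, and the extra DBF and Works checks are left out.

```js
// Hypothetical lookup: first byte of the file -> raw file type.
var FIRST_BYTE = {
  0xD0: "CFB Container",
  0x09: "BIFF Stream",
  0x3C: "XML/HTML",
  0x50: "ZIP Archive",
  0x49: "Plain Text",
  0x54: "Plain Text",
  0xEF: "UTF-8 Text",
  0xFF: "UTF-16 Text",
  0x00: "Record Stream",
  0x7B: "Plain Text",
  0x0A: "Plain Text",
  0x0D: "Plain Text",
  0x20: "Plain Text"
};

// `bytes` is an array of numbers, a Uint8Array or a NodeJS Buffer.
function rawFileType(bytes) {
  if (!bytes || bytes.length === 0) return "Plain Text";
  return FIRST_BYTE[bytes[0]] || "Plain Text"; // assumed fallback
}

console.log(rawFileType([0x50, 0x4b, 0x03, 0x04])); // ZIP Archive
console.log(rawFileType([0xd0, 0xcf, 0x11, 0xe0])); // CFB Container
```
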
Plain text format guessing follows the priority order:

| Format | Test |
|:-------|:--------------------------------------------------------------------|
| XML | `<?xml` appears in the first 1024 characters |
| HTML | starts with `<` and HTML tags appear in the first 1024 characters |
| XML | starts with `<` and the first tag is valid |
| RTF | starts with `{\rt` |
| DSV | starts with `sep=` followed by field delimiter and line separator |
| DSV | more unquoted `\|` chars than `;` `\t` or `,` in the first 1024 |
| DSV | more unquoted `;` chars than `\t` or `,` in the first 1024 |
| TSV | more unquoted `\t` chars than `,` chars in the first 1024 |
| CSV | one of the first 1024 characters is a comma `","` |
| ETH | starts with `socialcalc:version:` |
| PRN | `PRN` option is set to true |
| CSV | (fallback) |

HTML tags include `html`, `table`, `head`, `meta`, `script`, `style`, `div`.

</details>

<details open>
<summary><b>Why are random files valid?</b> (click to hide)</summary>

Excel is extremely aggressive in reading files. Adding the XLS extension to any
file tricks Excel into processing the file.

If the file matches certain heuristics, Excel will use a format-specific parser.

If it cannot deduce the file type, Excel will parse the unknown file as if it
were CSV or TSV. SheetJS attempts to replicate that behavior.

</details>

[^1]: The gene [`SEPT1`](https://en.wikipedia.org/wiki/SEPTIN1) was renamed to
`SEPTIN1` to avoid Excel value interpretations: the string `SEPT1` is parsed as
the date "September 1".