docs.sheetjs.com/03-parse-options.md at 1327b2f98c9c2aced13fe311683651ec849e98ed

sheetjs/docs.sheetjs.com

SheetJS 1327b2f98c vite moved fast and broke things

2025-11-16 02:08:21 -05:00

title	sidebar_position	hide_table_of_contents
Reading Files	3	true

The main SheetJS method for reading files is read. It expects developers to supply the actual data in a supported representation.

The readFile helper method accepts a filename and tries to read the specified file using standard APIs. It does not work in web browsers!

Parse file data and generate a SheetJS workbook object

var wb = XLSX.read(data, opts);

read attempts to parse data and return a workbook object.

The type property of the opts object controls how data is interpreted. For string data, the default interpretation is Base64.

Read a specified file and generate a SheetJS workbook object

var wb = XLSX.readFile(filename, opts);

readFile attempts to read a local file with specified filename.

:::caution pass

readFile works on specific platforms. It does not support web browsers!

The NodeJS installation note includes additional instructions for non-standard use cases.

:::

:::tip pass

The SheetJS file format import codecs focus on raw data. Not all codecs support all features. Features not described in the documentation may not be extracted.

SheetJS Pro offers support for additional features, including styling, images, graphs, and PivotTables.

:::

Parsing Options

The read functions accept an options argument:

| Option Name | Default | Description |
|:------------|:--------|:------------|
| `type` | | Input data representation |
| `raw` | `false` | Disable value parsing in plaintext formats |
| `dense` | `false` | If true, generate dense worksheets |
| `codepage` | | Use specified code page encoding |
| `cellFormula` | `true` | Save formulae to the `f` field |
| `cellHTML` | `true` | Parse text and save HTML to the `h` field |
| `cellNF` | `false` | Save number format to the `z` field |
| `cellStyles` | `false` | Save style/theme info to the `s` field |
| `cellText` | `true` | Save formatted text to the `w` field |
| `cellDates` | `false` | Generate proper date (type `d`) cells |
| `dateNF` | | If specified, override date code 14 |
| `sheetStubs` | `false` | Create cells of type `z` for stub cells |
| `sheetRows` | `0` | If >0, read the specified number of rows |
| `bookDeps` | `false` | If true, parse calculation chains |
| `bookFiles` | `false` | Add raw files to book object |
| `bookProps` | `false` | If true, only parse book metadata |
| `bookSheets` | `false` | If true, only parse sheet names |
| `bookVBA` | `false` | If true, generate VBA blob |
| `password` | `""` | If specified, decrypt workbook |
| `WTF` | `false` | Do not suppress worksheet parsing errors |
| `sheets` | | Only parse specified sheets |
| `nodim` | `false` | If true, calculate worksheet ranges |
| `PRN` | `false` | If true, allow parsing of PRN files |
| `xlfn` | `false` | Use raw formula function names |
| `FS` | | DSV Field Separator override |
| `UTC` | `true` | Parse text dates and times using UTC |

Cell-Level Options

Dates

By default, for consistency with spreadsheet applications, date cells are stored as numeric cells (type n) with special number formats. If cellDates is enabled, date codes are converted to proper Date objects.

Excel file formats (including XLSX, XLSB, and XLS) support a locale-specific date format, typically stored as date code 14 or the string m/d/yy. The formatted text for some cells will change based on the computer locale. SheetJS parsers use the en-US form by default. If the dateNF option is set, that number format string will be used.

"Dates and Times" covers features in more detail.
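The relationship between a numeric date cell and a proper date cell can be sketched in plain JavaScript. This is an illustration rather than the library's conversion code: it assumes the 1900 date system, treats 1899-12-30 as day 0, and uses a hypothetical helper named dateCodeToUTCDate.

```js
// Sketch: convert a date code to a Date, assuming the 1900 date system.
// Day 0 is taken to be 1899-12-30 so that codes after the phantom
// 1900-02-29 (codes >= 61) line up with real calendar dates.
var MS_PER_DAY = 24 * 60 * 60 * 1000;
var EPOCH = Date.UTC(1899, 11, 30);

function dateCodeToUTCDate(code) {
  return new Date(EPOCH + Math.round(code * MS_PER_DAY));
}

// A numeric date cell (the default representation) and a date cell
var numericCell = { t: "n", v: 45000, z: "m/d/yy" };
var dateCell = { t: "d", v: dateCodeToUTCDate(numericCell.v) };

console.log(dateCell.v.toISOString()); // 2023-03-15T00:00:00.000Z
```

The fractional part of a date code represents the time of day, so 44927.5 corresponds to noon on 2023-01-01 under the same assumptions.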

Formulae

For some file formats, the cellFormula option must be explicitly enabled to ensure that formulae are extracted.

Newer Excel functions are serialized with the _xlfn. prefix, hidden from the user. By default, the file parsers will strip _xlfn. and similar prefixes. If the xlfn option is enabled, the prefixes will be preserved.

"Formulae" covers this in more detail.
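The effect of prefix stripping can be shown with a small string transformation. The helper name stripXlfn and the regular expression are assumptions for illustration; the sketch only handles the _xlfn. prefix and not the "similar prefixes" that the parsers also remove.

```js
// Sketch of prefix stripping; not the library's implementation.
function stripXlfn(formula) {
  return formula.replace(/_xlfn\./g, "");
}

// Formula text as serialized in the file (roughly what xlfn: true preserves)
var stored = "_xlfn.XLOOKUP(A1,B:B,C:C)";

// Formula text after the default stripping
console.log(stripXlfn(stored)); // XLOOKUP(A1,B:B,C:C)
```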

Formatted Text

Many common spreadsheet formats (including XLSX, XLSB, and XLS) store numeric values and number formats. Applications are expected to use the number formats to display currency strings, dates, and other values.

Under the hood, parsers use the SSF Number Formatter library to generate formatted text.

By default, formatted text is generated. If the cellText option is false, formatted text will not be written.

By default, cell number formats are not preserved. If the cellNF option is enabled, number format strings will be saved to the z field of cell objects.

"Number Formats" covers the features in more detail.

:::note pass

Even if cellNF is false, formatted text will be generated and saved to w.

:::

Text and Cell Styling

By default, SheetJS CE parsers focus on data extraction.

If the cellStyles option is true, other styling metadata including row and column properties will be parsed.

:::tip pass

SheetJS Pro offers cell / text styling, conditional formatting and additional styling options.

:::

HTML Formatted Text

Spreadsheet applications support a limited form of rich text styling.

If the cellHTML option is true, some file parsers will attempt to translate the rich text to standard HTML with inner tags for bold text and other styles.

:::tip pass

SheetJS Pro offers additional styling options, conversions for all supported file formats, and whole-worksheet HTML generation.

:::

Sheet-Level Options

Dense

By default, the read and readFile functions generate "sparse" worksheets. When the dense option is set to true, the functions will generate "dense" worksheets that may be more efficient in modern browsers.

The "Cell Storage" section explains worksheet structures in more detail.

:::note pass

Utility functions that process SheetJS workbook objects typically support sparse and dense worksheets.

:::
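Code that reads cells directly has to account for both layouts. The following sketch uses hand-built worksheet objects and a hypothetical getCell helper. It assumes that dense worksheets keep cells in the "!data" array of arrays (the layout described in "Cell Storage") and that sparse worksheets use A1-style keys.

```js
// Convert a zero-based column index to a column label (0 -> A, 26 -> AA)
function encodeCol(c) {
  var s = "";
  for (++c; c > 0; c = Math.floor((c - 1) / 26)) {
    s = String.fromCharCode(((c - 1) % 26) + 65) + s;
  }
  return s;
}

// Hypothetical helper: cell at zero-based row r and column c, or undefined
function getCell(ws, r, c) {
  if (ws["!data"] != null) return (ws["!data"][r] || [])[c]; // dense
  return ws[encodeCol(c) + (r + 1)];                          // sparse
}

var sparse = { "!ref": "A1:B2", B2: { t: "n", v: 5 } };
var dense = { "!ref": "A1:B2", "!data": [[], [undefined, { t: "n", v: 5 }]] };

console.log(getCell(sparse, 1, 1).v, getCell(dense, 1, 1).v); // 5 5
```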

Range

Some file formats, including XLSX and XLS, can self-report worksheet ranges. read and readFile assume the self-reported worksheet ranges are correct. If files include cells outside this range, the parsers will save cell information but other utility functions will ignore those cells.

If the sheetRows option is set, up to sheetRows rows will be parsed from the worksheets. Because the header row counts toward this limit, JSON output generated with a header row will contain at most sheetRows-1 data rows. The !ref property of the worksheet will hold the adjusted range. For formats that self-report sheet ranges, the !fullref property will hold the original range.

The nodim option instructs the parser to ignore self-reported ranges and use the actual cells in the worksheet to determine the range. This addresses known issues with non-compliant third-party exporters.
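The relationship between !fullref and !ref under sheetRows can be sketched with a small range calculation. The clampRange helper is hypothetical and is not how the parsers compute ranges; it assumes a simple A1-style range whose first row falls within the first sheetRows rows.

```js
// Hypothetical helper: clamp an A1-style range to the first sheetRows rows
function clampRange(fullref, sheetRows) {
  var m = /^([A-Z]+)(\d+):([A-Z]+)(\d+)$/.exec(fullref);
  if (!m || !(sheetRows > 0)) return fullref; // sheetRows 0 means no limit
  var lastRow = Math.min(Number(m[4]), sheetRows);
  return m[1] + m[2] + ":" + m[3] + lastRow;
}

// A worksheet that self-reports ten rows, limited to three rows
var ws = { "!fullref": "A1:C10" };
ws["!ref"] = clampRange(ws["!fullref"], 3);

console.log(ws["!ref"]); // A1:C3
```

With a header row, the three parsed rows in this example would correspond to two data records.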

Stubs

Some file formats, including XLSX and XLS, can specify cells without cell data. For example, cells covered by a merged cell block are technically invalid but files may include metadata.

By default, these cells are skipped. If the sheetStubs option is true, they will be parsed as stub cells (type z).

Book-Level Options

VBA

When a macro-enabled file is parsed, if the bookVBA option is true, the raw VBA blob will be stored in the vbaraw property of the workbook.

"VBA and Macros" covers the features in more detail.

Implementation Details (click to show)

The bookVBA option merely exposes the raw VBA CFB object. It does not parse the data.

XLSM and XLSB store the VBA CFB object in xl/vbaProject.bin. BIFF8 XLS mixes the VBA entries alongside the core Workbook entry, so the library generates a new blob from the XLS CFB container that works in XLSM and XLSB files.

Workbook Metadata

By default, the data from each worksheet is parsed.

If any of the following options are passed, parsers will not parse sheet data. They will parse enough of the workbook to extract the requested information.

| Option | Extracted Data |
|:-------|:---------------|
| `bookProps` | Workbook properties |
| `bookSheets` | Worksheet names |

The options apply to XLSX, XLSB, XLS and XLML parsers.

Worksheets

By default, all worksheets are parsed. The sheets option limits which sheets are parsed.

If the sheets option is a number, the number is interpreted as a zero-based index. For example, sheets: 2 instructs the parser to read the third sheet.

If the sheets option is text, the string is interpreted as a worksheet name. The name is case-insensitive. sheets: "Sheet1" instructs the parser to read the worksheet named "Sheet1".

If the sheets option is an array of numbers and text, each specified worksheet will be parsed. sheets: [2, "Sheet1"] instructs the parser to read the third sheet and the sheet named "Sheet1". If the third worksheet is coincidentally named "Sheet1", only one worksheet will be parsed.
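The selection rules can be summarized in a few lines of plain JavaScript. This is a sketch of the rules as stated here, not the library's implementation; the selectSheets helper and the sample sheet names are hypothetical.

```js
// Returns the zero-based indices of the worksheets that would be parsed
function selectSheets(sheetNames, sheets) {
  if (sheets == null) return sheetNames.map(function (_, i) { return i; });
  var wanted = Array.isArray(sheets) ? sheets : [sheets];
  var lower = sheetNames.map(function (n) { return n.toLowerCase(); });
  var picked = [];
  wanted.forEach(function (s) {
    // numbers are zero-based indices; text is a case-insensitive name
    var idx = typeof s === "number" ? s : lower.indexOf(String(s).toLowerCase());
    if (idx >= 0 && idx < sheetNames.length && picked.indexOf(idx) === -1) picked.push(idx);
  });
  return picked.sort(function (a, b) { return a - b; });
}

var names = ["Summary", "Data", "Sheet1"];
console.log(selectSheets(names, 2));             // [ 2 ]
console.log(selectSheets(names, "sheet1"));      // [ 2 ]
console.log(selectSheets(names, [2, "Sheet1"])); // [ 2 ]
console.log(selectSheets(names, [0, "data"]));   // [ 0, 1 ]
```

The third call shows the coincidence case: the index and the name refer to the same worksheet, so only one worksheet is selected.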

File-Level Options

Password Protection

SheetJS CE currently supports XOR encryption in XLS files. Errors will be thrown when trying to parse files using unsupported encryption methods.

:::tip pass

SheetJS Pro offers support for additional encryption schemes, including the AES-CBC schemes used in XLSX / XLSM / XLSB files and the RC4 schemes used in newer XLS files.

:::

Lotus Formatted Text (PRN)

Lotus Formatted Text (PRN) worksheets are plain text files that do not include delimiter characters. Each cell in a column has the same width.

If the PRN option is set, the plaintext parser will attempt to parse some plaintext files as if they follow the PRN format.

:::note pass

If the PRN option is set, text files that do not include commas or semicolons or other common delimiters may not be parsed as expected.

This option should not be enabled unless it is known that the file was exported from Lotus 1-2-3 or from Excel using the "Lotus Formatted Text (PRN)" format.

:::
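Since every cell in a column has the same width, a PRN line can be cut at fixed offsets. The sketch below is a simplified illustration with a hypothetical splitFixedWidth helper; the column widths are supplied by hand, whereas a real parser has to infer them from the text.

```js
// Hypothetical fixed-width splitter: widths lists the width of each column
function splitFixedWidth(line, widths) {
  var out = [];
  var pos = 0;
  widths.forEach(function (w) {
    out.push(line.slice(pos, pos + w).trim());
    pos += w;
  });
  return out;
}

// Two rows with a 10-character column followed by a 5-character column
var prn = [
  "Name".padEnd(10) + "Qty".padEnd(5),
  "Widget".padEnd(10) + "12".padEnd(5)
].join("\n");

var rows = prn.split("\n").map(function (line) {
  return splitFixedWidth(line, [10, 5]);
});
console.log(rows); // [ [ 'Name', 'Qty' ], [ 'Widget', '12' ] ]
```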

Value Parsing

Spreadsheet software, including Excel, aggressively tries to interpret values from CSV and other plain text. This leads to surprising behavior¹!

If the raw option is true, value parsing will be suppressed. All cell values are treated as strings.

The raw option affects the following formats: HTML, CSV, PRN, DIF, RTF.

The raw option does not affect XLSX, XLSB, XLS and other file formats that support explicit value typing.

:::note pass

See Issue #3331 in the SheetJS CE bug tracker for more details.

:::
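The difference between parsed and raw values can be illustrated with a deliberately small model. The parseField helper below is hypothetical and only recognizes plain numbers; the actual plaintext parsers recognize many more patterns, including dates. Cells are shown with the t (type) and v (value) fields.

```js
// Simplified illustration of value parsing; not the library's parser
function parseField(text, raw) {
  if (raw) return { t: "s", v: text }; // raw: keep the text as-is
  if (text.trim() !== "" && !isNaN(Number(text))) return { t: "n", v: Number(text) };
  return { t: "s", v: text };
}

// An identifier with leading zeroes loses them when parsed as a number
console.log(parseField("00123", false)); // { t: 'n', v: 123 }
console.log(parseField("00123", true));  // { t: 's', v: '00123' }
```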

Code Page Encoding

Spreadsheet applications support a number of legacy encodings. Plaintext files will appear different when opened on different computers in different regions.

By default, the parsers use the most common "English (United States)" encodings. The codepage option controls the encoding in BIFF2 - BIFF5 XLS files without CodePage records, in some legacy formats including DBF, and in CSV files without a BOM when read with type: "binary". BIFF8 XLS always defaults to 1200.

The codepage support library is not guaranteed to be loaded by default. The "Installation" section describes how to install and load the support library.

See "Legacy Codepages" for more details.
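The reason a code page matters can be demonstrated with the standard TextDecoder API. This sketch does not use the SheetJS codepage library. It assumes a JavaScript engine whose TextDecoder supports the windows-1252 and windows-1251 labels, which correspond to code pages 1252 and 1251.

```js
// One byte decoded with two legacy encodings yields two different characters
var data = new Uint8Array([0xE9]);

var western = new TextDecoder("windows-1252").decode(data);  // Western European
var cyrillic = new TextDecoder("windows-1251").decode(data); // Cyrillic

console.log(western, cyrillic); // é й
```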

Date Processing

Plaintext formats may include date and time values without timezone info. The time 12:30 AM is ambiguous.

In the wild, there are two popular approaches:

A) Spreadsheet software typically interprets time values using local timezones. When opening a file in New York, 12:30 AM will be parsed as 12:30 AM ET. When opening a file in Los Angeles, the time will be parsed as 12:30 AM PT.

B) APIs use UTC, the most popular global time standard. 12:30 AM will be parsed as the absolute moment in time corresponding to 8:30 PM EDT or 7:30 PM EST on the previous day.

By default, the parsers assume files are specified in UTC. When the UTC option is explicitly set to false, dates and times are interpreted in the timezone of the web browser or JavaScript engine.
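The two interpretations can be reproduced with standard Date APIs. The sketch below uses a hypothetical timestamp (January 15, 2024 at 12:30 AM) and does not call the library; it only shows how the same text maps to different moments in time.

```js
// Components of "2024-01-15 12:30 AM" with no timezone information
var y = 2024, m = 0, d = 15, H = 0, M = 30;

// B) UTC interpretation: the same absolute moment on every computer
var asUTC = new Date(Date.UTC(y, m, d, H, M));

// A) local interpretation: depends on the timezone of the JavaScript engine
var asLocal = new Date(y, m, d, H, M);

console.log(asUTC.toISOString());                      // 2024-01-15T00:30:00.000Z
console.log(asLocal.getHours(), asLocal.getMinutes()); // 0 30
console.log((asLocal - asUTC) / 36e5); // difference in hours; 5 in New York (EST)
```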

Delimiter-Separated Values

The plaintext parser applies a number of heuristics to determine if files are CSV (fields separated by commas), TSV (fields separated by tabs), PSV (fields separated by |) or SSV (fields separated by ;). The heuristics are based on the presence of characters not in a double-quoted value.

The FS option instructs the parser to use the specified delimiter if multiple delimiter characters are in the text. This bypasses the default heuristics.

Internal Files

Some file formats are structured as larger containers that include sub-files. For example, XLSX files are ZIP files with XML sub-files.

If the bookFiles option is true, each sub-file will be preserved in the workbook. The behavior depends on file type:

keys array (paths in the ZIP) for ZIP-based formats
files hash (mapping paths to objects representing the files) for ZIP-based formats
cfb object for formats using CFB containers

Parsing Errors

By default, the workbook parser will suppress errors when parsing worksheets. This ensures the valid worksheets from a multi-sheet workbook are parsed.

If the WTF option is enabled, the errors will not be suppressed.

Input Type

The type parameter for read controls how data is interpreted:

| `type` | expected input |
|:-------|:---------------|
| `base64` | string: Base64 encoding of the file |
| `binary` | string: binary string (byte `n` is `data.charCodeAt(n)`) |
| `string` | string: JS string (only appropriate for UTF-8 text formats) |
| `buffer` | nodejs Buffer |
| `array` | array: array of 8-bit unsigned integers (byte `n` is `data[n]`) |
| `file` | string: path of file that will be read (nodejs only) |

Some common types are automatically deduced from the data input type, including NodeJS Buffer objects, Uint8Array and ArrayBuffer objects, and arrays of numbers.

When a JS string is passed with no type, the library assumes the data is a Base64 string. Strings from FileReader#readAsBinaryString and ASCII data require the "binary" type. DOM strings, including results from FileReader#readAsText, should use the "string" type.
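The same bytes can be expressed in several of the representations listed above. The following sketch uses hypothetical helper names and assumes NodeJS, where the Buffer global is available. It shows three bytes as an array, a binary string and a Base64 string.

```js
// type "array": byte n is data[n]
var bytes = [0x50, 0x4B, 0x03];

// type "binary": byte n is data.charCodeAt(n)
function toBinaryString(arr) {
  var out = "";
  for (var i = 0; i < arr.length; ++i) out += String.fromCharCode(arr[i]);
  return out;
}

// type "base64": Base64 encoding of the same bytes (NodeJS Buffer assumed)
function toBase64(arr) {
  return Buffer.from(arr).toString("base64");
}

var bstr = toBinaryString(bytes);
var b64 = toBase64(bytes);

console.log(bstr.charCodeAt(0) === bytes[0]); // true
console.log(b64);                             // UEsD
```

Each string must be passed with the matching type. Without a type, a string is assumed to be Base64.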

Guessing File Type

Implementation Details (click to show)

Excel and other spreadsheet tools read the first few bytes and apply other heuristics to determine a file type. This enables file type punning: renaming files with the .xls extension will tell your computer to use Excel to open the file but Excel will know how to handle it. This library applies similar logic:

| Byte 0 | Raw File Type | Spreadsheet Types |
|:-------|:--------------|:------------------|
| `0xD0` | CFB Container | BIFF 5/8 or protected XLSX/XLSB or WQ3/QPW or XLR |
| `0x09` | BIFF Stream | BIFF 2/3/4/5 |
| `0x3C` | XML/HTML | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x50` | ZIP Archive | XLSB or XLSX/M or ODS or UOS2 or NUMBERS or text |
| `0x49` | Plain Text | SYLK or plain text |
| `0x54` | Plain Text | DIF or plain text |
| `0xEF` | UTF-8 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0xFF` | UTF-16 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x00` | Record Stream | Lotus WK* or Quattro Pro or plain text |
| `0x7B` | Plain Text | RTF or plain text |
| `0x0A` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x0D` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x20` | Plain Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |

DBF files are detected based on the first byte as well as the third and fourth bytes (corresponding to month and day of the file date).

Works for Windows files are detected based on the BOF record with type 0xFF.
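The first-byte lookup amounts to a small dispatch table. The sketch below only mirrors the "Byte 0" and "Raw File Type" columns; the guessRawType name is hypothetical, and the sketch omits the DBF and Works for Windows checks as well as every later test the parsers apply.

```js
// First byte -> raw file type, copied from the table
var RAW_TYPES = {
  0xD0: "CFB Container",
  0x09: "BIFF Stream",
  0x3C: "XML/HTML",
  0x50: "ZIP Archive",
  0x49: "Plain Text",
  0x54: "Plain Text",
  0xEF: "UTF-8 Text",
  0xFF: "UTF-16 Text",
  0x00: "Record Stream",
  0x7B: "Plain Text",
  0x0A: "Plain Text",
  0x0D: "Plain Text",
  0x20: "Plain Text"
};

// Returns null for first bytes that are not listed in the table
function guessRawType(bytes) {
  var t = RAW_TYPES[bytes[0]];
  return t === undefined ? null : t;
}

console.log(guessRawType([0x50, 0x4B, 0x03, 0x04])); // ZIP Archive
console.log(guessRawType(Buffer.from("<?xml")));     // XML/HTML
```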

Plain text format guessing follows the priority order:

| Format | Test |
|:-------|:-----|
| XML | `<?xml` appears in the first 1024 characters |
| HTML | starts with `<` and HTML tags appear in the first 1024 characters |
| XML | starts with `<` and the first tag is valid |
| RTF | starts with `{\rt` |
| DSV | starts with `sep=` followed by field delimiter and line separator |
| DSV | more unquoted `\|` chars than `;` `\t` or `,` in the first 1024 |
| DSV | more unquoted `;` chars than `\t` or `,` in the first 1024 |
| TSV | more unquoted `\t` chars than `,` chars in the first 1024 |
| CSV | one of the first 1024 characters is a comma `","` |
| ETH | starts with `socialcalc:version:` |
| PRN | `PRN` option is set to true |
| CSV | (fallback) |

HTML tags include html, table, head, meta, script, style, div.
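The delimiter tests in the table can be approximated by counting characters outside double-quoted values. The guessDelimiter helper below is a hypothetical sketch, not the library's code. It covers only the pipe, semicolon, tab and comma rows, skips the sep= header and the ETH and PRN rows, and assumes that a supplied FS value always wins.

```js
// Returns the field delimiter suggested by the first 1024 characters
function guessDelimiter(text, FS) {
  if (FS) return FS; // an explicit override bypasses the heuristics
  var counts = { "|": 0, ";": 0, "\t": 0, ",": 0 };
  var inQuote = false;
  var end = Math.min(text.length, 1024);
  for (var i = 0; i < end; ++i) {
    var ch = text.charAt(i);
    if (ch === '"') inQuote = !inQuote;
    else if (!inQuote && ch in counts) ++counts[ch];
  }
  // Priority order from the table: pipe, then semicolon, then tab
  if (counts["|"] > Math.max(counts[";"], counts["\t"], counts[","])) return "|";
  if (counts[";"] > Math.max(counts["\t"], counts[","])) return ";";
  if (counts["\t"] > counts[","]) return "\t";
  return ","; // a comma is present, or the CSV fallback applies
}

console.log(guessDelimiter("a;b;c\n1;2;3"));                   // ;
console.log(JSON.stringify(guessDelimiter('"a,b,c"\t1\t2')));  // "\t"
console.log(guessDelimiter("a,b;c", ";"));                     // ;
```

In the second call the commas sit inside a quoted value, so only the tabs are counted.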

Why are random files valid? (click to hide)

Excel is extremely aggressive in reading files. Adding the XLS extension to any file tricks Excel into processing the file.

If the file matches certain heuristics, Excel will use a format-specific parser.

If it cannot deduce the file type, Excel will parse the unknown file as if it were CSV or TSV. SheetJS attempts to replicate that behavior.

¹ The gene SEPT1 was renamed to SEPTIN1 to avoid Excel value interpretations: the string SEPT1 is parsed as the date "September 1".
