---
title: Large Datasets
---

For maximal compatibility, the library reads entire files into memory and
generates complete files in memory. Browsers and other JS engines enforce tight
memory limits.  In these cases, the library offers strategies to reduce memory
usage by using platform-specific APIs.

## Dense Mode

The `dense` option (supported in `read`, `readFile` and `aoa_to_sheet`) creates
worksheet objects that use arrays of arrays under the hood:

```js
var dense_wb = XLSX.read(ab, {dense: true});

var dense_sheet = XLSX.utils.aoa_to_sheet(aoa);
```

<details><summary><b>Historical Note</b> (click to show)</summary>

The earliest versions of the library aimed for IE6+ compatibility.  In early
testing, both in Chrome 26 and in IE6, the most efficient worksheet storage for
small sheets was a large object whose keys were cell addresses.

Over time, V8 (the engine behind Chrome and NodeJS) evolved in a way that made
the array of arrays approach more efficient but reduced the performance of the
large object approach.

In the interest of preserving backwards compatibility, the library opts to make
the array of arrays approach available behind a special `dense` option.

</details>

The various API functions will seamlessly handle dense and sparse worksheets.
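
For example, the JSON and CSV export utilities accept a dense worksheet exactly
as they accept a sparse one. A minimal sketch, assuming `ab` holds raw file data
(for example, from an earlier `fetch`):

```js
var dense_wb = XLSX.read(ab, {dense: true});
var dense_ws = dense_wb.Sheets[dense_wb.SheetNames[0]];

/* these calls behave the same way for dense and sparse worksheets */
var rows = XLSX.utils.sheet_to_json(dense_ws);
var csv  = XLSX.utils.sheet_to_csv(dense_ws);
```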

## Streaming Write

The streaming write functions are available in the `XLSX.stream` object.  They
take the same arguments as the normal write functions:

- `XLSX.stream.to_csv` is the streaming version of `XLSX.utils.sheet_to_csv`.
- `XLSX.stream.to_html` is the streaming version of `XLSX.utils.sheet_to_html`.
- `XLSX.stream.to_json` is the streaming version of `XLSX.utils.sheet_to_json`.

"Stream" refers to the NodeJS push streams API.

<details><summary><b>Historical Note</b> (click to show)</summary>

NodeJS push streams were introduced in 2012.

The first streaming write function, `to_csv`, was introduced in April 2017.  It
used, and still uses, the same NodeJS streaming API.

Years later, browser vendors have settled on a different stream API.

For maximal compatibility, the library uses NodeJS push streams.

</details>

### NodeJS

:::note

In a CommonJS context, NodeJS Streams and `fs` immediately work with SheetJS:

```js
const XLSX = require("xlsx"); // "just works"
```

In NodeJS ESM, the dependency must be loaded manually:

```js
import * as XLSX from 'xlsx';
import { Readable } from 'stream';

XLSX.stream.set_readable(Readable); // manually load stream helpers
```

Additionally, for file-related operations in NodeJS ESM, `fs` must be loaded:

```js
import * as XLSX from 'xlsx';
import * as fs from 'fs';

XLSX.set_fs(fs); // manually load fs helpers
```
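
Both helpers can be wired up in the same ESM script. A combined sketch, assuming
the script uses both streaming and file operations:

```js
import * as XLSX from 'xlsx';
import * as fs from 'fs';
import { Readable } from 'stream';

XLSX.set_fs(fs);                    // manually load fs helpers
XLSX.stream.set_readable(Readable); // manually load stream helpers
```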

**It is strongly encouraged to use CommonJS in NodeJS whenever possible.**

:::

This example reads the file passed as an argument to the script, pulls the
first worksheet, converts it to CSV and writes the result to `out.csv`:

```js
var XLSX = require("xlsx");
var fs = require("fs");

var workbook = XLSX.readFile(process.argv[2]);
var worksheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
var stream = XLSX.stream.to_csv(worksheet);

var output_file_name = "out.csv";
// highlight-next-line
stream.pipe(fs.createWriteStream(output_file_name));
```
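
The other streaming writers follow the same pattern. For example, a sketch that
streams the HTML output of the same worksheet to a hypothetical `out.html`:

```js
/* `worksheet` and `fs` come from the previous example */
var html_stream = XLSX.stream.to_html(worksheet);
html_stream.pipe(fs.createWriteStream("out.html"));
```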

`stream.to_json` uses Object-mode streams. A `Transform` stream can be used to
generate a normal stream for streaming to a file or the screen:

```js
var XLSX = require("xlsx");
var { Transform } = require("stream");

var workbook = XLSX.readFile(process.argv[2], {dense: true});
var worksheet = workbook.Sheets[workbook.SheetNames[0]];
/* to_json returns an object-mode stream */
// highlight-next-line
var stream = XLSX.stream.to_json(worksheet, {raw:true});

/* this Transform stream converts JS objects to text and prints to screen */
var conv = new Transform({writableObjectMode:true});
conv._transform = function(obj, e, cb){ cb(null, JSON.stringify(obj) + "\n"); };
conv.pipe(process.stdout);

// highlight-next-line
stream.pipe(conv);
```
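
To save the generated rows to a newline-delimited JSON file instead of printing
to the screen, the `conv.pipe(process.stdout)` call can be replaced with a pipe
into a file write stream. A minimal sketch (the name `out.ndjson` is arbitrary):

```js
var fs = require("fs");

/* write the converted rows to a file instead of the screen */
stream.pipe(conv).pipe(fs.createWriteStream("out.ndjson"));
```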

### Browser

<details><summary><b>Live Demo</b> (click to show)</summary>

The following live demo fetches and parses a file in a Web Worker.  The `to_csv`
streaming function generates CSV rows and passes them back to the main thread
for further processing.

:::note

For Chromium browsers, the File System Access API provides a modern worker-only
approach. [The Web Workers demo](/docs/demos/worker#streaming-write) includes a
live example of CSV streaming write.

:::

The demo has a URL input box.  Feel free to change the URL.  For example:

- `https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls`
  is an XLS file over 50 MB.
- `https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx`
  is an XLSX file with 300000 rows (approximately 20 MB).

```jsx live
function SheetJSFetchCSVStreamWorker() {
  const [__html, setHTML] = React.useState("");
  const [state, setState]  = React.useState("");
  const [cnt, setCnt] = React.useState(0);
  const [url, setUrl] = React.useState("https://oss.sheetjs.com/test_files/large_strings.xlsx");

  return ( <>
    <b>URL: </b><input type="text" value={url} onChange={(e) => setUrl(e.target.value)} size="80"/>
    <button onClick={() => {
      /* this mantra embeds the worker source in the function */
      const worker = new Worker(URL.createObjectURL(new Blob([`\
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();

    /* Parse file */
    let len = ab.byteLength;
    if(len < 1024) len += " bytes"; else { len /= 1024;
      if(len < 1024) len += " KB"; else { len /= 1024; len += " MB"; }
    }
    postMessage({state: "parsing " + len});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);
      `])));
      /* when the worker sends back data, add it to the DOM */
      worker.onmessage = function(e) {
        if(e.data.error) return setHTML(e.data.error);
        else if(e.data.state) return setState(e.data.state);
        setHTML(e.data.csv);
        setCnt(cnt => cnt+1);
      };
      setCnt(0); setState("");
      /* post a message to the worker with the URL to fetch */
      worker.postMessage({url});
    }}><b>Click to Start</b></button>
    <pre>State: <b>{state}</b><br/>Number of rows: <b>{cnt}</b></pre>
    <pre dangerouslySetInnerHTML={{ __html }}/>
  </> );
}
```

</details>

NodeJS streaming APIs are not available in the browser.  The following function
supplies a pseudo stream object compatible with the `to_csv` function:

```js
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

// assuming `workbook` is a workbook, stream the first sheet
const ws = workbook.Sheets[workbook.SheetNames[0]];
const strm = sheet_to_csv_cb(ws, (csv)=>{ if(csv != null) console.log(csv); });
strm.resume();
```
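
Since the callback receives `null` after the final row, the rows can also be
collected into a single CSV string. A sketch, where `stream_to_string` and the
`done` callback are hypothetical names:

```js
/* accumulate rows pushed by the pseudo stream and pass the result to `done` */
function stream_to_string(ws, done) {
  const buf = [];
  sheet_to_csv_cb(ws, (csv) => {
    if(csv != null) buf.push(csv);   // each row already ends with a newline
    else done(buf.join(""));         // `null` marks the end of the stream
  }).resume();
}

stream_to_string(workbook.Sheets[workbook.SheetNames[0]], (csv) => { console.log(csv); });
```

For very large worksheets, it is better to process each row as it arrives rather
than accumulating the entire CSV text in memory.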

#### Web Workers

For processing large files in the browser, it is strongly encouraged to use Web
Workers. The [Worker demo](/docs/demos/worker#streaming-write) includes examples
using the File System Access API.

Typically, the file and stream processing occurs in the Web Worker.  CSV rows
can be sent back to the main thread in the callback:

```js title="worker.js"
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();

    /* Parse file */
    postMessage({state: "parsing"});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);
```

The main thread will receive messages with CSV rows for further processing:

```js
worker.onmessage = function(e) {
  if(e.data.error) { console.error(e.data.error); /* show an error message */ }
  else if(e.data.state) { console.info(e.data.state); /* current state */ }
  else {
    /* e.data.csv is the row generated by the stream */
    console.log(e.data.csv);
  }
};
```
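
The main thread is also responsible for spawning the Worker and sending the URL
of the file to process. A minimal sketch, assuming the script above is saved as
`worker.js`:

```js
/* spawn the worker */
const worker = new Worker("worker.js");

/* handle rows and state updates (see the `onmessage` handler above) */
worker.onmessage = function(e) { /* ... */ };

/* ask the worker to fetch, parse and stream the file */
worker.postMessage({ url: "https://sheetjs.com/pres.numbers" });
```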

### Deno

Deno does not support NodeJS streams in normal execution, so a wrapper is used.
This example fetches <https://sheetjs.com/pres.numbers> and prints CSV rows:

```ts title="sheet2csv.ts"
// @deno-types="https://cdn.sheetjs.com/xlsx-latest/package/types/index.d.ts"
import { stream, Sheet2CSVOpts, WorkSheet } from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';

interface Resumable { resume:()=>void; };
/* Generate row strings from a worksheet */
function sheet_to_csv_cb(ws: WorkSheet, cb:(d:string|null)=>void, opts: Sheet2CSVOpts = {}, batch = 1000): Resumable {
  stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d: any) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return stream.to_csv(ws, opts) as Resumable;
}

/* Callback invoked on each row (string) and at the end (null) */
const csv_cb = (d:string|null) => {
  if(d == null) return;
  /* The strings include line endings, so raw write ops should be used */
  Deno.stdout.write(new TextEncoder().encode(d));
};

/* Fetch https://sheetjs.com/pres.numbers, parse, and get first worksheet */
import { read } from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';
const ab = await (await fetch("https://sheetjs.com/pres.numbers")).arrayBuffer();
const wb = read(ab, { dense: true });
const ws = wb.Sheets[wb.SheetNames[0]];

/* Create and start CSV stream */
sheet_to_csv_cb(ws, csv_cb).resume();
```
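
This script can be run with `deno run --allow-net sheet2csv.ts`. The
`--allow-net` permission is required for the `fetch` call.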