---
title: Sheets in TensorFlow
sidebar_label: TensorFlow.js
pagination_prev: demos/index
pagination_next: demos/frontend/index
---

<head>
  <script src="https://docs.sheetjs.com/tfjs/tf.min.js"></script>
</head>

[TensorFlow.js](https://www.tensorflow.org/js) (shortened to TF.js) is a library
for machine learning in JavaScript.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses TensorFlow.js and SheetJS to process data in spreadsheets. We'll
explore how to load spreadsheet data into TF.js datasets and how to export
results back to spreadsheets.

- ["CSV Data Interchange"](#csv-data-interchange) uses SheetJS to process sheets
  and generate CSV data that TF.js can import.

- ["JS Array Interchange"](#js-array-interchange) uses SheetJS to process
  sheets and generate rows of objects that can be post-processed.

:::info pass

Live code blocks in this page use the TF.js `4.14.0` standalone build.

For use in web frameworks, the `@tensorflow/tfjs` module should be used.

For use in NodeJS, the native bindings module is `@tensorflow/tfjs-node`.

:::

:::note Tested Deployments

Each browser demo was tested in the following environments:

| Browser     | TF.js version | Date       |
|:------------|:--------------|:-----------|
| Chrome 119  | `4.14.0`      | 2023-12-09 |
| Safari 16.6 | `4.14.0`      | 2023-12-09 |

:::

## CSV Data Interchange

`tf.data.csv`[^1] generates a Dataset from CSV data. The function expects a URL.

:::note pass

When this demo was last tested, there was no direct method to pass a CSV string
to the underlying parser.

:::

Fortunately, blob URLs are supported.

```mermaid
flowchart LR
  ws((SheetJS\nWorksheet))
  csv(CSV\nstring)
  url{{Blob\nURL}}
  dataset[(TF.js\nDataset)]
  ws --> |sheet_to_csv\nSheetJS| csv
  csv --> |JavaScript\nAPIs| url
  url --> |tf.data.csv\nTensorFlow.js| dataset
```

The SheetJS `sheet_to_csv` method[^2] generates a CSV string from a worksheet
object. Using standard JavaScript techniques, a blob URL can be constructed:

```js
function worksheet_to_csv_url(worksheet) {
  /* generate CSV */
  const csv = XLSX.utils.sheet_to_csv(worksheet);

  /* CSV -> Uint8Array -> Blob */
  const u8 = new TextEncoder().encode(csv);
  const blob = new Blob([u8], { type: "text/csv" });

  /* generate a blob URL */
  return URL.createObjectURL(blob);
}
```

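Blob URLs keep the underlying data in memory until they are released with
`URL.revokeObjectURL`. The following is a minimal sketch of one way to scope the
lifetime of the URL. `with_csv_url` is a hypothetical helper name and is not part
of SheetJS or TF.js. Since TF.js datasets read lazily, the sketch assumes that
the callback finishes consuming the dataset before it resolves.

```js
/* hypothetical helper: create a blob URL for CSV text, run a callback, clean up */
async function with_csv_url(csv, callback) {
  /* Blob accepts strings directly and encodes them as UTF-8 */
  const blob = new Blob([csv], { type: "text/csv" });
  const url = URL.createObjectURL(blob);
  try {
    /* the callback must finish reading from the URL before it resolves */
    return await callback(url);
  } finally {
    /* release the blob data once the callback is done */
    URL.revokeObjectURL(url);
  }
}
```

In a page like this one, the callback would build the dataset with `tf.data.csv`
and await training before returning.
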
### CSV Demo

This demo shows a simple model fitting using the "cars" dataset from TensorFlow.
The [sample XLS file](https://sheetjs.com/data/cd.xls) contains the data. The
data processing mirrors the official "Making Predictions from 2D Data" demo[^3].

```mermaid
flowchart LR
  file[(Remote\nFile)]
  subgraph SheetJS Operations
    ab[(Data\nBytes)]
    wb(((SheetJS\nWorkbook)))
    ws((SheetJS\nWorksheet))
    csv(CSV\nstring)
  end
  subgraph TensorFlow.js Operations
    url{{Blob\nURL}}
    dataset[(TF.js\nDataset)]
    results((Results))
  end
  file --> |fetch\n\n| ab
  ab --> |read\n\n| wb
  wb --> |select\nsheet| ws
  ws --> |sheet_to_csv\n\n| csv
  csv --> |JS\nAPI| url
  url --> |tf.data.csv\nTF.js| dataset
  dataset --> |fitDataset\nTF.js| results
```

The demo builds a model for predicting MPG from Horsepower data. It:

- fetches <https://sheetjs.com/data/cd.xls>
- parses the data with the SheetJS `read`[^4] method
- selects the first worksheet[^5] and converts to CSV using `sheet_to_csv`[^6]
- generates a blob URL from the CSV text
- generates a TF.js dataset with `tf.data.csv`[^7] and selects data columns
- builds a model and trains with `fitDataset`[^8]
- predicts MPG from a set of sample inputs and displays results in a table

<details><summary><b>Live Demo</b> (click to show)</summary>

:::caution pass

In some test runs, the results did not make sense given the underlying data.
The dependent and independent variables are expected to be anti-correlated.

**This is a known issue in TF.js and affects the official demos.**

:::

:::caution pass

If the live demo shows the message

```
ReferenceError: tf is not defined
```

please refresh the page. This is a known bug in the documentation generator.

:::

```jsx live
function SheetJSToTFJSCSV() {
  const [output, setOutput] = React.useState("");
  const [results, setResults] = React.useState([]);
  const [disabled, setDisabled] = React.useState(false);

  function worksheet_to_csv_url(worksheet) {
    /* generate CSV */
    const csv = XLSX.utils.sheet_to_csv(worksheet);

    /* CSV -> Uint8Array -> Blob */
    const u8 = new TextEncoder().encode(csv);
    const blob = new Blob([u8], { type: "text/csv" });

    /* generate a blob URL */
    return URL.createObjectURL(blob);
  }

  const doit = React.useCallback(async () => {
    setResults([]); setOutput(""); setDisabled(true);
    try {
    /* fetch file */
    const f = await fetch("https://sheetjs.com/data/cd.xls");
    const ab = await f.arrayBuffer();
    /* parse file and get first worksheet */
    const wb = XLSX.read(ab);
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* generate blob URL */
    const url = worksheet_to_csv_url(ws);

    /* feed to tf.js */
    const dataset = tf.data.csv(url, {
      hasHeader: true,
      configuredColumnsOnly: true,
      columnConfigs:{
        "Horsepower": {required: false, default: 0},
        "Miles_per_Gallon":{required: false, default: 0, isLabel:true}
      }
    });

    /* pre-process data: flatten each record and drop rows with missing (0) values */
    let flat = dataset
      .map(({xs,ys}) =>({xs: Object.values(xs), ys: Object.values(ys)}))
      .filter(({xs,ys}) => [...xs,...ys].every(v => v>0));

    /* normalize manually :( */
    let minX = Infinity, maxX = -Infinity, minY = Infinity, maxY = -Infinity;
    await flat.forEachAsync(({xs, ys}) => {
      minX = Math.min(minX, xs[0]); maxX = Math.max(maxX, xs[0]);
      minY = Math.min(minY, ys[0]); maxY = Math.max(maxY, ys[0]);
    });
    flat = flat.map(({xs, ys}) => ({xs:xs.map(v => (v-minX)/(maxX - minX)),ys:ys.map(v => (v-minY)/(maxY-minY))}));
    flat = flat.batch(32);

    /* build and train model */
    const model = tf.sequential();
    model.add(tf.layers.dense({inputShape: [1], units: 1}));
    model.compile({ optimizer: tf.train.sgd(0.000001), loss: 'meanSquaredError' });
    await model.fitDataset(flat, { epochs: 100, callbacks: { onEpochEnd: async (epoch, logs) => {
      setOutput(`${epoch}:${logs.loss}`);
    }}});

    /* predict values */
    const inp = tf.linspace(0, 1, 9);
    const pred = model.predict(inp);
    /* `data` is the asynchronous counterpart of `dataSync` */
    const xs = await inp.data(), ys = await pred.data();
    setResults(Array.from(xs).map((x, i) => [ x * (maxX - minX) + minX, ys[i] * (maxY - minY) + minY ]));
    setOutput("");

    } catch(e) { setOutput(`ERROR: ${String(e)}`); } finally { setDisabled(false);}
  });
  return ( <>
    <button onClick={doit} disabled={disabled}>Click to run</button><br/>
    {output && <pre>{output}</pre> || <></>}
    {results.length && <table><thead><tr><th>Horsepower</th><th>MPG</th></tr></thead><tbody>
    {results.map((r,i) => <tr key={i}><td>{r[0]}</td><td>{r[1].toFixed(2)}</td></tr>)}
    </tbody></table> || <></>}
  </> );
}
```

</details>

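The live demo scales both columns to the range `[0, 1]` before training and maps
predictions back to the original units afterwards. The arithmetic can be
sketched in plain JavaScript. `make_scaler` is a hypothetical helper name and the
sample values are illustrative, not taken from the sample file:

```js
/* hypothetical helper: build min-max scaling functions for an array of numbers */
function make_scaler(values) {
  const min = Math.min(...values), max = Math.max(...values);
  return {
    /* map a raw value to the range [0, 1] */
    normalize: (v) => (v - min) / (max - min),
    /* map a value in [0, 1] back to the original units */
    denormalize: (n) => n * (max - min) + min
  };
}

/* illustrative horsepower values */
const hp = make_scaler([50, 100, 150, 250]);
console.log(hp.normalize(100));    // 0.25
console.log(hp.denormalize(0.5));  // 150
```

The sketch assumes the column has at least two distinct values. If every value
is the same, `max - min` is zero and the division does not yield a usable number.
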
## JS Array Interchange

[The official Linear Regression tutorial](https://www.tensorflow.org/js/tutorials/training/linear_regression)
loads data from a JSON file:

```json
[
  {
    "Name": "chevrolet chevelle malibu",
    "Miles_per_Gallon": 18,
    "Cylinders": 8,
    "Displacement": 307,
    "Horsepower": 130,
    "Weight_in_lbs": 3504,
    "Acceleration": 12,
    "Year": "1970-01-01",
    "Origin": "USA"
  },
  // ...
]
```

In real use cases, data is often stored in [spreadsheets](https://sheetjs.com/data/cd.xls).

Following the tutorial, the data fetching method can be adapted to handle arrays
of objects, such as those generated by the SheetJS `sheet_to_json` method[^9].

Differences from the official example are highlighted below:

```js
/**
 * Get the car data reduced to just the variables we are interested
 * and cleaned of missing data.
 */
async function getData() {
  // highlight-start
  /* fetch file */
  const carsDataResponse = await fetch('https://sheetjs.com/data/cd.xls');
  /* get file data (ArrayBuffer) */
  const carsDataAB = await carsDataResponse.arrayBuffer();
  /* parse */
  const carsDataWB = XLSX.read(carsDataAB);
  /* get first worksheet */
  const carsDataWS = carsDataWB.Sheets[carsDataWB.SheetNames[0]];
  /* generate array of JS objects */
  const carsData = XLSX.utils.sheet_to_json(carsDataWS);
  // highlight-end
  const cleaned = carsData.map(car => ({
    mpg: car.Miles_per_Gallon,
    horsepower: car.Horsepower,
  }))
  .filter(car => (car.mpg != null && car.horsepower != null));

  return cleaned;
}
```

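The `!= null` comparison in the filter rejects both `null` and `undefined`. This
matters for spreadsheet data: by default, `sheet_to_json` skips blank cells, so
a missing value typically surfaces as an absent key rather than `null`. The
following sketch uses rows shaped like `sheet_to_json` output. The second and
third rows are made up for illustration and are not in the sample file:

```js
/* rows shaped like sheet_to_json output; the last two rows are made up */
const rows = [
  { Name: "chevrolet chevelle malibu", Miles_per_Gallon: 18, Horsepower: 130 },
  { Name: "hypothetical car A", Miles_per_Gallon: 25 }, /* no Horsepower key */
  { Name: "hypothetical car B", Miles_per_Gallon: null, Horsepower: 90 }
];

const cleaned = rows
  .map(car => ({ mpg: car.Miles_per_Gallon, horsepower: car.Horsepower }))
  /* `!= null` rejects both null and undefined */
  .filter(car => car.mpg != null && car.horsepower != null);

console.log(cleaned); // [ { mpg: 18, horsepower: 130 } ]
```
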
## Low-Level Operations

### Data Transposition

A typical dataset in a spreadsheet will start with one header row and represent
each data record in its own row. For example, the Iris dataset stores one flower
per row, with columns such as `sepal length` and `sepal width`.

The SheetJS `sheet_to_json` method[^10] will translate worksheet objects into an
array of row objects:

```js
var aoo = [
  {"sepal length": 5.1, "sepal width": 3.5, /* ... */},
  {"sepal length": 4.9, "sepal width":   3, /* ... */},
  /* ... */
];
```

TF.js and other libraries tend to operate on individual columns, equivalent to:

```js
var sepal_lengths = [5.1, 4.9, /* ... */];
var sepal_widths = [3.5, 3, /* ... */];
```

When a `tensor2d` is exported, the data will look different from the spreadsheet:

```js
var data_set_2d = [
  [5.1, 4.9, /* ... */],
  [3.5, 3, /* ... */],
  /* ... */
];
```

This is the transpose of how people use spreadsheets!

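As a rough illustration of the row-to-column conversion, the following sketch
pulls named fields out of an array of row objects. `aoo_to_columns` is a
hypothetical helper name, and the two rows repeat the values shown above:

```js
/* hypothetical helper: pull named fields from an array of row objects */
function aoo_to_columns(aoo, fields) {
  const columns = {};
  for (const field of fields) columns[field] = aoo.map(row => row[field]);
  return columns;
}

const aoo = [
  { "sepal length": 5.1, "sepal width": 3.5 },
  { "sepal length": 4.9, "sepal width": 3 }
];
const cols = aoo_to_columns(aoo, ["sepal length", "sepal width"]);
console.log(cols["sepal length"]); // [ 5.1, 4.9 ]
console.log(cols["sepal width"]);  // [ 3.5, 3 ]
```
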
### Exporting Datasets to a Worksheet

The `aoa_to_sheet` method[^11] can generate a worksheet from an array of arrays.
ML libraries typically provide APIs to pull an array of arrays, but it will be
transposed. To export multiple data sets, the data should be transposed:

```js
/* assuming data is an array of typed arrays */
var aoa = [];
for(var i = 0; i < data.length; ++i) {
  for(var j = 0; j < data[i].length; ++j) {
    if(!aoa[j]) aoa[j] = [];
    aoa[j][i] = data[i][j];
  }
}
/* aoa can be directly converted to a worksheet object */
var ws = XLSX.utils.aoa_to_sheet(aoa);
```

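To make the transposition concrete, here is a small sketch that wraps the same
loop in a function and adds an optional header row. `columns_to_aoa` is a
hypothetical helper name and the column values are illustrative:

```js
/* hypothetical helper: transpose an array of columns into an array of rows */
function columns_to_aoa(data, headers) {
  const aoa = [];
  for (let i = 0; i < data.length; ++i) {
    for (let j = 0; j < data[i].length; ++j) {
      if (!aoa[j]) aoa[j] = [];
      aoa[j][i] = data[i][j];
    }
  }
  /* optional header row at the top */
  if (headers) aoa.unshift(headers);
  return aoa;
}

/* two columns with values that Float32Array stores exactly */
const data = [ new Float32Array([1, 2, 3]), new Float32Array([0.5, 1.5, 2.5]) ];
const aoa = columns_to_aoa(data, ["x", "y"]);
console.log(aoa); // [ [ 'x', 'y' ], [ 1, 0.5 ], [ 2, 1.5 ], [ 3, 2.5 ] ]
```

The resulting array of arrays is in spreadsheet order and has the shape that
`aoa_to_sheet` expects. Values that are not exactly representable in single
precision will read back slightly changed. For example, `5.1` stored in a
`Float32Array` reads back as `5.099999904632568`.
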
### Importing Data from a Spreadsheet

`sheet_to_json` with the option `header:1`[^12] will generate a row-major array
of arrays that can be transposed. However, it is more efficient to walk the
sheet manually:

```js
/* find worksheet range */
var range = XLSX.utils.decode_range(ws['!ref']);
var out = [];
/* walk the columns */
for(var C = range.s.c; C <= range.e.c; ++C) {
  /* create the typed array */
  var ta = new Float32Array(range.e.r - range.s.r + 1);
  /* walk the rows */
  for(var R = range.s.r; R <= range.e.r; ++R) {
    /* find the cell, skip it if the cell isn't numeric or boolean */
    var cell = ws["!data"] ? (ws["!data"][R]||[])[C] : ws[XLSX.utils.encode_cell({r:R, c:C})];
    if(!cell || cell.t != 'n' && cell.t != 'b') continue;
    /* assign to the typed array */
    ta[R - range.s.r] = cell.v;
  }
  out.push(ta);
}
```

If the data set has a header row, the loop can be adjusted to skip that row.

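For comparison, the `header:1` route can be sketched in plain JavaScript. The
example below assumes the first row holds the headers and treats everything
after it as data. `aoa_to_columns` is a hypothetical helper name, and unlike the
loop above it only copies numbers (booleans are left as `0`):

```js
/* hypothetical helper: split a row-major array of arrays into headers and columns */
function aoa_to_columns(aoa) {
  const [headers, ...rows] = aoa;
  const columns = headers.map((_, C) => {
    const ta = new Float32Array(rows.length);
    rows.forEach((row, R) => {
      /* leave non-numeric and missing cells as 0, the typed array default */
      if (typeof row[C] == "number") ta[R] = row[C];
    });
    return ta;
  });
  return { headers, columns };
}

/* illustrative data shaped like sheet_to_json output with header: 1 */
const { headers, columns } = aoa_to_columns([
  ["x", "y"],
  [1, 0.5],
  [2],        /* missing "y" value */
  [3, 2.5]
]);
console.log(headers);                // [ 'x', 'y' ]
console.log(Array.from(columns[0])); // [ 1, 2, 3 ]
console.log(Array.from(columns[1])); // [ 0.5, 0, 2.5 ]
```
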
### TF.js Tensors

A single `Array#map` can pull individual named fields from the result, which
can be used to construct TensorFlow.js tensor objects:

```js
const aoo = XLSX.utils.sheet_to_json(worksheet);
const lengths = aoo.map(row => row["sepal length"]);
const tensor = tf.tensor1d(lengths);
```

`tf.Tensor` objects can be directly transposed using `transpose`:

```js
var aoo = XLSX.utils.sheet_to_json(worksheet);
// "x" and "y" are the fields we want to pull from the data
var data = aoo.map(row => ([row["x"], row["y"]]));

// create a tensor representing two column datasets
var tensor = tf.tensor2d(data).transpose();

// individual columns can be accessed
var col1 = tensor.slice([0,0], [1,tensor.shape[1]]).flatten();
var col2 = tensor.slice([1,0], [1,tensor.shape[1]]).flatten();
```

For exporting, `stack` can be used to combine the columns into one tensor. After
transposing, the data can be pulled into a linear array in spreadsheet order:

```js
/* stack the columns and transpose so that each tensor row is a spreadsheet row */
var result = tf.stack([col1, col2]).transpose();
/* pull the shape and data (Float32Array) from the transposed result */
var shape = result.shape;
var f32 = result.dataSync();

/* construct an array of arrays of the data in spreadsheet order */
var aoa = [];
for(var j = 0; j < shape[0]; ++j) {
  aoa[j] = [];
  for(var i = 0; i < shape[1]; ++i) aoa[j][i] = f32[j * shape[1] + i];
}

/* add headers to the top */
aoa.unshift(["x", "y"]);

/* generate worksheet */
var worksheet = XLSX.utils.aoa_to_sheet(aoa);
```

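The index arithmetic relies on the flat data being in row-major order: the value
at row `j` and column `i` sits at offset `j * cols + i`. A plain JavaScript
sketch of the same reshaping, with a hypothetical helper name `unflatten` and
illustrative values, looks like this:

```js
/* hypothetical helper: rebuild rows from row-major flat data and a [rows, cols] shape */
function unflatten(flat, shape) {
  const [rows, cols] = shape;
  const aoa = [];
  for (let j = 0; j < rows; ++j) {
    /* row j occupies indices j * cols through (j + 1) * cols - 1 */
    aoa.push(Array.from(flat.slice(j * cols, (j + 1) * cols)));
  }
  return aoa;
}

/* three rows of two values, laid out as flat row-major data for a [3, 2] shape */
const f32 = new Float32Array([1, 0.5, 2, 1.5, 3, 2.5]);
console.log(unflatten(f32, [3, 2])); // [ [ 1, 0.5 ], [ 2, 1.5 ], [ 3, 2.5 ] ]
```
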
[^1]: See [`tf.data.csv`](https://js.tensorflow.org/api/latest/#data.csv) in the TensorFlow.js documentation
[^2]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^3]: The ["Making Predictions from 2D Data" example](https://codelabs.developers.google.com/codelabs/tfjs-training-regression/) uses a hosted JSON file. The [sample XLS file](https://sheetjs.com/data/cd.xls) includes the same data.
[^4]: See [`read` in "Reading Files"](/docs/api/parse-options)
[^5]: See ["Workbook Object"](/docs/csf/book)
[^6]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^7]: See [`tf.data.csv`](https://js.tensorflow.org/api/latest/#data.csv) in the TensorFlow.js documentation
[^8]: See [`tf.LayersModel.fitDataset`](https://js.tensorflow.org/api/latest/#tf.LayersModel.fitDataset) in the TensorFlow.js documentation
[^9]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^10]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^11]: See [`aoa_to_sheet` in "Utilities"](/docs/api/utilities/array#array-of-arrays-input)
[^12]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)