forked from sheetjs/docs.sheetjs.com
		
	
		
			
	
	
		
			216 lines
		
	
	
		
			7.2 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
		
		
			
		
	
	
			216 lines
		
	
	
		
			7.2 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
|  | --- | ||
|  | sidebar_position: 7 | ||
|  | --- | ||
|  | 
 | ||
|  | # Headless Automation
 | ||
|  | 
 | ||
|  | Headless automation involves controlling "headless browsers" to access websites | ||
|  | and submit or download data.  It is also possible to automate browsers using | ||
|  | custom browser extensions. | ||
|  | 
 | ||
|  | The [SheetJS standalone script](../../installation/standalone) can be added to | ||
|  | any website by inserting a `SCRIPT` tag.  Headless browsers usually provide | ||
|  | utility functions for running custom snippets in the browser and passing data | ||
|  | back to the automation script. | ||
|  | 
 | ||
|  | ## Use Case
 | ||
|  | 
 | ||
|  | This demo focuses on exporting table data to a workbook.  Headless browsers do | ||
|  | not generally support passing objects between the browser context and the | ||
|  | automation script, so the file data must be generated in the browser context | ||
|  | and sent back to the automation script for saving in the filesystem.  Steps: | ||
|  | 
 | ||
|  | 1) Launch the headless browser and load the target webpage. | ||
|  | 
 | ||
|  | 2) Add the standalone SheetJS build to the page in a `SCRIPT` tag. | ||
|  | 
 | ||
|  | 3) Add a script to the page (in the browser context) that will: | ||
|  | 
 | ||
|  | - Make a workbook object from the first table using `XLSX.utils.table_to_book` | ||
|  | - Generate the bytes for an XLSB file using `XLSX.write` | ||
|  | - Send the bytes back to the automation script | ||
|  | 
 | ||
|  | 4) When the automation context receives data, save to a file | ||
|  | 
 | ||
|  | This demo exports data from <https://sheetjs.com/demos/table>. | ||
|  | 
 | ||
|  | :::note | ||
|  | 
 | ||
|  | It is also possible to parse files from the browser context, but parsing from | ||
|  | the automation context is more performant and strongly recommended. | ||
|  | 
 | ||
|  | ::: | ||
|  | 
 | ||
|  | ## Puppeteer
 | ||
|  | 
 | ||
|  | Puppeteer enables headless Chromium automation for NodeJS.  Releases ship with | ||
|  | an installer script.  Installation is straightforward: | ||
|  | 
 | ||
|  | ```bash | ||
|  | npm i https://cdn.sheetjs.com/xlsx-latest/xlsx-latest.tgz puppeteer | ||
|  | ``` | ||
|  | 
 | ||
|  | Binary strings are the favored data type.  They can be safely passed from the | ||
|  | browser context to the automation script.  NodeJS provides an API to write | ||
|  | binary strings to file (`fs.writeFileSync` using encoding `binary`). | ||
|  | 
 | ||
|  | To run the example, after installing the packages, save the following script to | ||
|  | `SheetJSPuppeteer.js` and run `node SheetJSPuppeteer.js`.  Steps are commented: | ||
|  | 
 | ||
|  | ```js title="SheetJSPuppeteer.js" | ||
|  | const fs = require("fs"); | ||
|  | const puppeteer = require('puppeteer'); | ||
|  | (async () => { | ||
|  |   /* (1) Load the target page */ | ||
|  |   const browser = await puppeteer.launch(); | ||
|  |   const page = await browser.newPage(); | ||
|  |   page.on("console", msg => console.log("PAGE LOG:", msg.text())); | ||
|  |   await page.setViewport({width: 1920, height: 1080}); | ||
|  |   await page.goto('https://sheetjs.com/demos/table'); | ||
|  | 
 | ||
|  |   /* (2) Load the standalone SheetJS build from the CDN */ | ||
|  |   await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js' }); | ||
|  | 
 | ||
|  |   /* (3) Run the snippet in browser and return data */ | ||
|  |   const bin = await page.evaluate(() => { | ||
|  |     /* NOTE: this function will be evaluated in the browser context. | ||
|  |        `page`, `fs` and `puppeteer` are not available. | ||
|  |        `XLSX` will be available thanks to step 2 */ | ||
|  | 
 | ||
|  |     /* find first table */ | ||
|  |     var table = document.body.getElementsByTagName('table')[0]; | ||
|  | 
 | ||
|  |     /* call table_to_book on first table */ | ||
|  |     var wb = XLSX.utils.table_to_book(table); | ||
|  | 
 | ||
|  |     /* generate XLSB and return binary string */ | ||
|  |     return XLSX.write(wb, {type: "binary", bookType: "xlsb"}); | ||
|  |   }); | ||
|  | 
 | ||
|  |   /* (4) write data to file */ | ||
|  |   fs.writeFileSync("SheetJSPuppeteer.xlsb", bin, { encoding: "binary" }); | ||
|  | 
 | ||
|  |   await browser.close(); | ||
|  | })(); | ||
|  | ``` | ||
|  | 
 | ||
|  | ## Playwright
 | ||
|  | 
 | ||
|  | Playwright presents a unified scripting framework for Chromium, WebKit, and | ||
|  | other browsers.  It draws inspiration from Puppeteer.  In fact, the example | ||
|  | code is almost identical! | ||
|  | 
 | ||
|  | ```bash | ||
|  | npm i https://cdn.sheetjs.com/xlsx-latest/xlsx-latest.tgz playwright | ||
|  | ``` | ||
|  | 
 | ||
|  | To run the example, after installing the packages, save the following script to | ||
|  | `SheetJSPlaywright.js` and run `node SheetJSPlaywright.js`.  Import divergences | ||
|  | from the Puppeteer example are highlighted below: | ||
|  | 
 | ||
|  | ```js title="SheetJSPlaywright.js" | ||
|  | const fs = require("fs"); | ||
|  | // highlight-next-line | ||
|  | const { webkit } = require('playwright'); // import desired browser | ||
|  | (async () => { | ||
|  |   /* (1) Load the target page */ | ||
|  |   // highlight-next-line | ||
|  |   const browser = await webkit.launch(); // launch desired browser | ||
|  |   const page = await browser.newPage(); | ||
|  |   page.on("console", msg => console.log("PAGE LOG:", msg.text())); | ||
|  |   // highlight-next-line | ||
|  |   await page.setViewportSize({width: 1920, height: 1080}); // different name :( | ||
|  |   await page.goto('https://sheetjs.com/demos/table'); | ||
|  | 
 | ||
|  |   /* (2) Load the standalone SheetJS build from the CDN */ | ||
|  |   await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js' }); | ||
|  | 
 | ||
|  |   /* (3) Run the snippet in browser and return data */ | ||
|  |   const bin = await page.evaluate(() => { | ||
|  |     /* NOTE: this function will be evaluated in the browser context. | ||
|  |        `page`, `fs` and the browser engine are not available. | ||
|  |        `XLSX` will be available thanks to step 2 */ | ||
|  | 
 | ||
|  |     /* find first table */ | ||
|  |     var table = document.body.getElementsByTagName('table')[0]; | ||
|  | 
 | ||
|  |     /* call table_to_book on first table */ | ||
|  |     var wb = XLSX.utils.table_to_book(table); | ||
|  | 
 | ||
|  |     /* generate XLSB and return binary string */ | ||
|  |     return XLSX.write(wb, {type: "binary", bookType: "xlsb"}); | ||
|  |   }); | ||
|  | 
 | ||
|  |   /* (4) write data to file */ | ||
|  |   fs.writeFileSync("SheetJSPlaywright.xlsb", bin, { encoding: "binary" }); | ||
|  | 
 | ||
|  |   await browser.close(); | ||
|  | })(); | ||
|  | ``` | ||
|  | 
 | ||
|  | 
 | ||
|  | ## PhantomJS
 | ||
|  | 
 | ||
|  | PhantomJS is a headless web browser powered by WebKit.  Standalone binaries are | ||
|  | available at <https://phantomjs.org/download.html> | ||
|  | 
 | ||
|  | :::warning  | ||
|  | 
 | ||
|  | This information is provided for legacy deployments.  PhantomJS development has | ||
|  | been suspended and there are known vulnerabilities, so new projects should use | ||
|  | alternatives.  For WebKit automation, new projects should use Playwright. | ||
|  | 
 | ||
|  | ::: | ||
|  | 
 | ||
|  | Binary strings are the favored data type.  They can be safely passed from the | ||
|  | browser context to the automation script.  PhantomJS provides an API to write | ||
|  | binary strings to file (`fs.write` using mode `wb`). | ||
|  | 
 | ||
|  | To run the example, save the following script to `SheetJSPhantom.js` in the same | ||
|  | folder as `phantomjs.exe` or `phantomjs` and run | ||
|  | 
 | ||
|  | ``` | ||
|  | ./phantomjs SheetJSPhantom.js     ## macOS / Linux | ||
|  | .\phantomjs.exe SheetJSPhantom.js ## windows | ||
|  | ``` | ||
|  | 
 | ||
|  | The steps are marked in the comments: | ||
|  | 
 | ||
|  | ```js title="SheetJSPhantom.js" | ||
|  | var page = require('webpage').create(); | ||
|  | page.onConsoleMessage = function(msg) { console.log(msg); }; | ||
|  | 
 | ||
|  | /* (1) Load the target page */ | ||
|  | page.open('https://sheetjs.com/demos/table', function() { | ||
|  | 
 | ||
|  |   /* (2) Load the standalone SheetJS build from the CDN */ | ||
|  |   page.includeJs("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js", function() { | ||
|  | 
 | ||
|  |     /* (3) Run the snippet in browser and return data */ | ||
|  |     var bin = page.evaluateJavaScript([ "function(){", | ||
|  | 
 | ||
|  |       /* find first table */ | ||
|  |       "var table = document.body.getElementsByTagName('table')[0];", | ||
|  | 
 | ||
|  |       /* call table_to_book on first table */ | ||
|  |       "var wb = XLSX.utils.table_to_book(table);", | ||
|  | 
 | ||
|  |       /* generate XLSB file and return binary string */ | ||
|  |       "return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});", | ||
|  |     "}" ].join("")); | ||
|  | 
 | ||
|  |     /* (4) write data to file */ | ||
|  |     require("fs").write("SheetJSPhantomJS.xlsb", bin, "wb"); | ||
|  | 
 | ||
|  |     phantom.exit(); | ||
|  |   }); | ||
|  | }); | ||
|  | ``` | ||
|  | 
 | ||
|  | :::caution | ||
|  | 
 | ||
|  | PhantomJS is very finicky and will hang if there are script errors.  It is | ||
|  | strongly recommended to add verbose logging and to lint scripts before use. | ||
|  | 
 | ||
|  | ::: |