---
title: Browser Automation
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
					
						
Headless automation involves controlling "headless browsers" to access websites
and submit or download data.  It is also possible to automate browsers using
custom browser extensions.

The [SheetJS standalone script](/docs/getting-started/installation/standalone)
can be added to any website by inserting a `SCRIPT` tag.  Headless browsers
usually provide utility functions for running custom snippets in the browser and
passing data back to the automation script.
					
						
## Use Case

This demo focuses on exporting table data to a workbook.  Headless browsers
generally cannot pass arbitrary objects between the browser context and the
automation script, so the file data must be generated in the browser context,
encoded as a string, and sent back to the automation script to be saved to the
file system.

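
To illustrate why a string encoding is the practical choice, the following
sketch simulates the boundary with a JSON round trip.  This is an assumption
made for illustration: real automation tools use their own serialization, and
`crossBoundary` is a hypothetical helper that is not part of any library.

```js
/* Hypothetical stand-in for the browser-to-script boundary.
   ASSUMPTION: values cross the boundary in a JSON-like serialized form. */
function crossBoundary(value) {
  return JSON.parse(JSON.stringify(value));
}

/* the first four bytes of a ZIP-based file such as XLSB */
var bytes = new Uint8Array([0x50, 0x4B, 0x03, 0x04]);

/* a typed array arrives as a plain object keyed by index */
var received = crossBoundary(bytes);
console.log(received instanceof Uint8Array); // false

/* a binary string (one character per byte) arrives unchanged */
var bstr = String.fromCharCode.apply(null, bytes);
console.log(crossBoundary(bstr) === bstr); // true
```

Under this assumption the typed array loses its type in transit, while the
string survives intact and can be converted back to bytes by the script.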
					
						
```mermaid
sequenceDiagram
  autonumber off
  actor U as User
  participant C as Controller
  participant B as Browser
  U->>C: run script
  rect rgba(255, 0, 0, 0.25)
    C->>B: launch browser
    B->>C: ready
    C->>B: load URL
    B->>C: site loaded
  end
  rect rgba(0, 127, 0, 0.25)
    C->>B: add SheetJS script
    B->>C: script loaded
  end
  rect rgba(255, 0, 0, 0.25)
    C->>B: ask for file
    Note over B: scrape tables
    Note over B: generate workbook
    B->>C: file bytes
  end
  rect rgba(0, 127, 0, 0.25)
    C->>U: save file
  end
```
					
						
Steps:

1) Launch the headless browser and load the target site.

2) Add the standalone SheetJS build to the page in a `SCRIPT` tag.

3) Add a script to the page (in the browser context) that will:

- Make a workbook object from the first table using `XLSX.utils.table_to_book`
- Generate the bytes for an XLSB file using `XLSX.write`
- Send the bytes back to the automation script

4) When the automation context receives the data, save it to a file.
					
						
This demo exports data from <https://sheetjs.com/demos/table>.

:::note

It is also possible to parse files from the browser context, but parsing from
the automation context is more efficient and strongly recommended.

:::
					
						
## Puppeteer

Puppeteer enables headless Chromium automation for NodeJS.  Releases ship with
an installer script.  Installation is straightforward:

```bash
npm i https://cdn.sheetjs.com/xlsx-latest/xlsx-latest.tgz puppeteer
```
					
						
<Tabs>
  <TabItem value="nodejs" label="NodeJS">

Binary strings are the favored data type.  They can be safely passed from the
browser context to the automation script.  NodeJS provides an API to write
binary strings to file (`fs.writeFileSync` using encoding `binary`).

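
The encoding option matters because a binary string stores one byte per
character.  The following standalone sketch writes a short binary string to a
scratch file and reads it back.  The sample bytes and the scratch file name are
assumptions chosen for illustration; they are not part of the demo.

```js
const fs = require("fs");
const os = require("os");
const path = require("path");

/* a binary string stores one byte per character (char codes 0-255) */
const bin = "PK\x03\x04\xFF\x00";

/* ASSUMPTION: a scratch file in the temp directory, used only for this sketch */
const out = path.join(os.tmpdir(), "SheetJSBinaryString.bin");

/* the "binary" encoding writes each character code as a single byte */
fs.writeFileSync(out, bin, { encoding: "binary" });

/* reading the file back yields the same bytes */
const buf = fs.readFileSync(out);
console.log(buf.length === bin.length); // true
console.log(buf[4]); // 255

/* by contrast, UTF-8 would expand the 0xFF character to two bytes */
console.log(Buffer.from(bin, "utf8").length); // 7
```

Omitting the encoding would use UTF-8 and corrupt any byte above `0x7F`, which
is why the demo script passes `binary` explicitly.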
					
						
To run the example, after installing the packages, save the following script to
`SheetJSPuppeteer.js` and run `node SheetJSPuppeteer.js`.  Steps are commented:

```js title="SheetJSPuppeteer.js"
const fs = require("fs");
const puppeteer = require('puppeteer');
(async () => {
  /* (1) Load the target page */
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on("console", msg => console.log("PAGE LOG:", msg.text()));
  await page.setViewport({width: 1920, height: 1080});
  await page.goto('https://sheetjs.com/demos/table');

  /* (2) Load the standalone SheetJS build from the CDN */
  await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js' });

  /* (3) Run the snippet in browser and return data */
  const bin = await page.evaluate(() => {
    /* NOTE: this function will be evaluated in the browser context.
       `page`, `fs` and `puppeteer` are not available.
       `XLSX` will be available thanks to step 2 */

    /* find first table */
    var table = document.body.getElementsByTagName('table')[0];

    /* call table_to_book on first table */
    var wb = XLSX.utils.table_to_book(table);

    /* generate XLSB and return binary string */
    return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
  });

  /* (4) write data to file */
  fs.writeFileSync("SheetJSPuppeteer.xlsb", bin, { encoding: "binary" });

  await browser.close();
})();
```

This script will generate `SheetJSPuppeteer.xlsb` which can be opened in Excel.

  </TabItem>
					
						
  <TabItem value="deno" label="Deno">

:::caution

Deno Puppeteer is a fork. It is not officially supported by the Puppeteer team.

:::

Installation is straightforward:

```bash
env PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@14.1.1/install.ts
```

Base64 strings are the favored data type.  They can be safely passed from the
browser context to the automation script.  Deno can decode the Base64 strings
and write the decoded `Uint8Array` data to file with `Deno.writeFileSync`.

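
The decoding step can be sketched without the standard library helper.  The
following example assumes the global `atob` function, which is available in
Deno and in recent NodeJS versions.  `b64_to_u8` is a hypothetical helper name
used only for illustration.

```js
/* hypothetical helper: decode a Base64 string into a Uint8Array */
function b64_to_u8(b64) {
  /* atob returns a binary string with one character per byte */
  const bstr = atob(b64);
  const u8 = new Uint8Array(bstr.length);
  for (let i = 0; i < bstr.length; ++i) u8[i] = bstr.charCodeAt(i);
  return u8;
}

/* "UEsDBA==" is the Base64 encoding of the bytes 0x50 0x4B 0x03 0x04 */
const u8 = b64_to_u8("UEsDBA==");
console.log(Array.from(u8)); // [ 80, 75, 3, 4 ]
```

The resulting `Uint8Array` is the kind of value that `Deno.writeFileSync`
accepts as file data.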
					
						
To run the example, after installation, save the following script to
`SheetJSPuppeteer.ts` and run `deno run -A --unstable SheetJSPuppeteer.ts`.

```js title="SheetJSPuppeteer.ts"
import puppeteer from "https://deno.land/x/puppeteer@14.1.1/mod.ts";
import { decode } from "https://deno.land/std/encoding/base64.ts";

/* (1) Load the target page */
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewport({width: 1920, height: 1080});
await page.goto('https://sheetjs.com/demos/table');

/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js' });

/* (3) Run the snippet in browser and return data */
const b64 = await page.evaluate(() => {
  /* NOTE: this function will be evaluated in the browser context.
     `page`, `Deno` and `puppeteer` are not available.
     `XLSX` will be available thanks to step 2 */

  /* find first table */
  var table = document.body.getElementsByTagName('table')[0];

  /* call table_to_book on first table */
  var wb = XLSX.utils.table_to_book(table);

  /* generate XLSB and return Base64 string */
  return XLSX.write(wb, {type: "base64", bookType: "xlsb"});
});

/* (4) decode the Base64 string and write data to file */
Deno.writeFileSync("SheetJSPuppeteer.xlsb", decode(b64));

await browser.close();
```

This script will generate `SheetJSPuppeteer.xlsb` which can be opened in Excel.

  </TabItem>
</Tabs>
					
						
## Playwright

Playwright presents a unified scripting framework for Chromium, WebKit, and
other browsers.  It draws inspiration from Puppeteer.  In fact, the example
code is almost identical!

```bash
npm i https://cdn.sheetjs.com/xlsx-latest/xlsx-latest.tgz playwright
```
					
						
To run the example, after installing the packages, save the following script to
`SheetJSPlaywright.js` and run `node SheetJSPlaywright.js`.  Important
differences from the Puppeteer example are highlighted below:

```js title="SheetJSPlaywright.js"
const fs = require("fs");
// highlight-next-line
const { webkit } = require('playwright'); // import desired browser
(async () => {
  /* (1) Load the target page */
  // highlight-next-line
  const browser = await webkit.launch(); // launch desired browser
  const page = await browser.newPage();
  page.on("console", msg => console.log("PAGE LOG:", msg.text()));
  // highlight-next-line
  await page.setViewportSize({width: 1920, height: 1080}); // different name :(
  await page.goto('https://sheetjs.com/demos/table');

  /* (2) Load the standalone SheetJS build from the CDN */
  await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js' });

  /* (3) Run the snippet in browser and return data */
  const bin = await page.evaluate(() => {
    /* NOTE: this function will be evaluated in the browser context.
       `page`, `fs` and the browser engine are not available.
       `XLSX` will be available thanks to step 2 */

    /* find first table */
    var table = document.body.getElementsByTagName('table')[0];

    /* call table_to_book on first table */
    var wb = XLSX.utils.table_to_book(table);

    /* generate XLSB and return binary string */
    return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
  });

  /* (4) write data to file */
  fs.writeFileSync("SheetJSPlaywright.xlsb", bin, { encoding: "binary" });

  await browser.close();
})();
```
					
						
## PhantomJS

PhantomJS is a headless web browser powered by WebKit.

:::warning

This information is provided for legacy deployments.  PhantomJS development has
been suspended and there are known vulnerabilities, so new projects should use
alternatives.  For WebKit automation, new projects should use Playwright.

:::
					
						
Binary strings are the favored data type.  They can be safely passed from the
browser context to the automation script.  PhantomJS provides an API to write
binary strings to file (`fs.write` using mode `wb`).

To run the example, save the following script to `SheetJSPhantom.js` in the same
folder as `phantomjs.exe` or `phantomjs` and run:

```
./phantomjs SheetJSPhantom.js     ## macOS / Linux
.\phantomjs.exe SheetJSPhantom.js ## Windows
```
					
						
The steps are marked in the comments:

```js title="SheetJSPhantom.js"
var page = require('webpage').create();
page.onConsoleMessage = function(msg) { console.log(msg); };

/* (1) Load the target page */
page.open('https://sheetjs.com/demos/table', function() {

  /* (2) Load the standalone SheetJS build from the CDN */
  page.includeJs("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js", function() {

    /* (3) Run the snippet in browser and return data */
    var bin = page.evaluateJavaScript([ "function(){",

      /* find first table */
      "var table = document.body.getElementsByTagName('table')[0];",

      /* call table_to_book on first table */
      "var wb = XLSX.utils.table_to_book(table);",

      /* generate XLSB file and return binary string */
      "return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});",
    "}" ].join(""));

    /* (4) write data to file */
    require("fs").write("SheetJSPhantomJS.xlsb", bin, "wb");

    phantom.exit();
  });
});
```
					
						
:::caution

PhantomJS is very finicky and will hang if there are script errors.  It is
strongly recommended to add verbose logging and to lint scripts before use.

:::

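
As one way to add verbose logging, the following sketch formats error messages
and stack traces.  `format_error` is a hypothetical helper.  The `page.onError`
and `phantom.onError` callbacks are assumed to receive a message and an array
of trace entries with `file`, `line` and `function` fields; the commented lines
show where the handlers could be attached in `SheetJSPhantom.js`.

```js
/* hypothetical helper: build a readable message from an error and its trace */
function format_error(msg, trace) {
  var lines = ["ERROR: " + msg];
  (trace || []).forEach(function(t) {
    var fn = t["function"] ? " (" + t["function"] + ")" : "";
    lines.push("  at " + (t.file || "?") + ":" + t.line + fn);
  });
  return lines.join("\n");
}

/* In SheetJSPhantom.js, after creating `page`:

   // errors raised inside the page (browser context)
   page.onError = function(msg, trace) { console.log(format_error(msg, trace)); };

   // errors raised in the automation script: log and exit instead of hanging
   phantom.onError = function(msg, trace) {
     console.log(format_error(msg, trace));
     phantom.exit(1);
   };
*/

/* sample call with a made-up trace entry */
console.log(format_error("XLSX is not defined", [{ file: "SheetJSPhantom.js", line: 12 }]));
```

Exiting from the script-level handler is a design choice in this sketch: it
turns a silent hang into a visible failure with a nonzero exit code.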