Tesseract.js: How To OCR Remote Images from a URL in Node

November 01, 2016

Written by

Twilion

This post is part of Twilio’s archive and may contain outdated information. We’re always building something new, so be sure to check out our latest posts for the most up-to-date insights.

Tesseract.js is a JavaScript OCR library based on the world’s most popular Optical Character Recognition engine. It’s insanely easy to use on both the client-side and on the server with Node.js.

Server side, Tesseract.js only works with local images. But, with a little help from the request Node package, we can download a remote image from a URL and then OCR it with Tesseract.js.

We’ll tackle this in three steps:

Write code to download a remote file with Node
Write code to OCR that local file with Tesseract
Put the two snippets together

The final product will take just fifteen lines of JavaScript to OCR images from a URL. Sound good? Let’s get started.

Download a remote file with Node.js

Request “is designed to be the simplest way possible to make http calls” in Node.js. We’ll use it to open a URL and then pipe the stream to the local file system using the Node.js standard library. When finished, we’ll fire off a callback.

Paste this code into a new file called download.js:

var request = require('request')
var fs = require('fs')
var url = 'http://tesseract.projectnaptha.com/img/eng_bw.png'
var filename = 'pic.png'

var writeFileStream = fs.createWriteStream(filename)

// pipe() returns the destination stream, so 'close' fires once the file is written
request(url).pipe(writeFileStream).on('close', function() {
  console.log(url, 'saved to', filename)
})

We’re using the sample image from the Tesseract documentation.

Install request and run the script:

npm install request
node download.js

Check your directory and you should see a new file. Now let’s OCR that downloaded file.
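As an aside before moving on: download.js assumes the request always succeeds. The sketch below is one possible variation, not the approach used in the rest of this post. It swaps request for Node's built-in http and https modules, checks the status code, and reports failures through a callback. The download helper and its done guard are hypothetical names introduced for illustration, and redirects are treated as errors rather than followed.

```javascript
var http = require('http')
var https = require('https')
var fs = require('fs')

// Hypothetical helper (not part of request or Tesseract.js): download `url`
// to `filename`, then call `callback(err)` exactly once.
function download (url, filename, callback) {
  var finished = false
  function done (err) {
    if (finished) return
    finished = true
    callback(err || null)
  }

  // Pick the built-in client that matches the URL scheme.
  var client = url.indexOf('https:') === 0 ? https : http

  client.get(url, function (response) {
    if (response.statusCode !== 200) {
      response.resume() // drain the body so the socket can be released
      return done(new Error('Unexpected status code ' + response.statusCode))
    }
    var writeFileStream = fs.createWriteStream(filename)
    writeFileStream.on('error', done)
    // 'close' fires once the file has been written and closed.
    response.pipe(writeFileStream).on('close', function () { done(null) })
  }).on('error', done)
}
```

Calling download(url, 'pic.png', function (err) { ... }) would then give the callback a chance to bail out before any OCR is attempted.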

OCR a local image with Tesseract.js and Node.js

Getting started with Tesseract.js is dead simple. Paste this code into a new file called ocr.js:

var Tesseract = require('tesseract.js')
var filename = 'pic.png'

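Tesseract.recognize(filename)
  // Log progress updates while the engine loads and runs
  .progress(function (p) { console.log('progress', p) })
  .catch(err => console.error(err))
  .then(function (result) {
    // Print the recognized text, then exit explicitly
    console.log(result.text)
    process.exit(0)
  })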

Install Tesseract.js and run the script:

npm install tesseract.js
node ocr.js

Once Tesseract starts up (~10 seconds on my MacBook Pro), we’ll see progress updates and then find the recognized text in result.text. There’s a ton more data hiding in result if you’re inclined to go digging.
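If you do go digging, here is a minimal sketch of one thing you might pull out of result: the words the engine was least sure about. It assumes result.words is an array of objects with text and confidence (0 to 100) fields, which matches my understanding of the result object in the version of Tesseract.js used here but is worth verifying against your installed version. The lowConfidenceWords helper is a hypothetical name, not part of the library.

```javascript
// Hypothetical helper: list the words recognized below a confidence threshold.
// Assumes `result.words` is an array of { text, confidence } objects;
// a missing `words` property is treated as an empty list.
function lowConfidenceWords (result, threshold) {
  return (result.words || [])
    .filter(function (word) { return word.confidence < threshold })
    .map(function (word) { return word.text })
}
```

Inside the .then callback in ocr.js, something like console.log(lowConfidenceWords(result, 60)) would print the words that may need a second look.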

We now have code to download a remote file and code to OCR a local file — we just need to put them together.

OCR a remote image with Tesseract.js

Paste this code into a new file called download-and-ocr.js:

var Tesseract = require('tesseract.js')
var request = require('request')
var fs = require('fs')
var url = 'http://tesseract.projectnaptha.com/img/eng_bw.png'
var filename = 'pic.png'

var writeFile = fs.createWriteStream(filename)

// Download first; the 'close' callback runs once the file is written
request(url).pipe(writeFile).on('close', function() {
  console.log(url, 'saved to', filename)
  // The image is now local, so Tesseract.js can read it
  Tesseract.recognize(filename)
    .progress(function (p) { console.log('progress', p) })
    .catch(err => console.error(err))
    .then(function (result) {
      console.log(result.text)
      process.exit(0)
    })
})

All we’ve done here is:

Start with the script from download.js
Require Tesseract.js
Paste the code from ocr.js into the callback that’s run when the file finishes downloading.

Give the script a run, swapping in your own picture URL if you so please:

node download-and-ocr.js
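If you swap in your own URL often, editing the script each time gets tedious. As a rough sketch, the hypothetical filenameFromUrl helper below derives a local filename from the image URL using Node's built-in url and path modules, falling back to a default when the URL has no filename in its path. The helper name and the fallback parameter are assumptions made for illustration.

```javascript
var path = require('path')
var URL = require('url').URL

// Hypothetical helper: use the last segment of the URL's path as the local
// filename, or `fallback` when the path is empty (e.g. "https://example.com/").
function filenameFromUrl (imageUrl, fallback) {
  var pathname = new URL(imageUrl).pathname // excludes any query string
  var base = path.posix.basename(pathname)
  return base || fallback
}

// Possible use in download-and-ocr.js (left commented out here):
// var url = process.argv[2] || 'http://tesseract.projectnaptha.com/img/eng_bw.png'
// var filename = filenameFromUrl(url, 'pic.png')
```

With those two lines swapped in, the script could be run as node download-and-ocr.js followed by the URL of your choice.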

Next Steps

That’s it! Three simple steps and we’re using Tesseract.js to perform OCR on an image from a URL. My personal motivation is to use Tesseract.js in conjunction with Twilio MMS to process photos that I snap while running around NYC. Perhaps I’ll grep phone numbers out of ads and run them through our Lookup API to see if they’re Twilio numbers.
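To make the phone number idea a little more concrete, here is a minimal sketch of pulling US-style numbers out of recognized text with a regular expression. The extractPhoneNumbers helper is hypothetical, the pattern only targets common ten-digit North American formats, and OCR noise in real ad photos will likely call for something more forgiving. It stops at producing +1-prefixed strings and does not call any API.

```javascript
// Hypothetical helper: find US-style phone numbers in OCR output and
// normalize them to a "+1XXXXXXXXXX" form.
function extractPhoneNumbers (text) {
  // Optional "+1"/"1" prefix, then 3-3-4 digits separated by an optional
  // space, dot, or hyphen, with optional parentheses around the area code.
  var pattern = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g
  var matches = text.match(pattern) || []
  return matches.map(function (match) {
    var digits = match.replace(/\D/g, '') // keep digits only
    // 11 digits means the country code was already present.
    return digits.length === 11 ? '+' + digits : '+1' + digits
  })
}
```

Passing result.text from the OCR callback into extractPhoneNumbers would yield an array of candidate numbers to look up.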

What are you going to build?

If you’d like to learn more, check out:

If you enjoyed this post, give a shout-out to Guillermo Webster and Kevin Kwok for their heroic effort porting Tesseract to JS. And, of course, feel free to drop me a line if you have any questions or build something you’d like to show off.

Happy Hacking.
