Ubuntu – paperless office on a budget

Since paper and myself have never gotten on well I have always been dreaming of a paperless office. A while ago I purchased a Fujitsu ScanSnap S1500 scanner for the office. I did this after doing some research on which Automatic Document Feed (ADF) multipage & duplex scanners were both affordable as well as supported on Linux.

It took a while for me to get around to set all of this up, but the result now is that this scanner is connected to a headless Ubuntu VM and the press of the scanner button will:

  1. scan the document
  2. perform OCR to convert to text
  3. combine the text with PDF to create a searchable PDF
  4. OPTIONAL – send the resulting document into Alfresco Document Management Server via FTP

Install dependencies

NOTE: PPA is only required for support of Fujitsu ScanSnap S1500
sudo apt-add-repository ppa:rolfbensch/sane-git
sudo apt-get update
sudo apt-get install sane sane-utils imagemagick tesseract-ocr pdftk libtiff-tools libsane-extras exactimage wput

Install scanbuttond

Download the “Debian Experimental” package from http://pkgs.org/download/scanbuttond
sudo dpkg -i scanbuttond_0.2.3.cvs20090713-14_i386.deb

This step is only for the Fujitsu ScanSnap support. For other scanners you can probably install from the Ubuntu Repository

Scanner config

vim 40-libsane.rules
#add this line
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="11a2", ENV{libsane_matched}="yes"

Permissions

sudo adduser saned scanner

Useful command lines for troubleshooting

Since I had a few trouble getting this scanner to work properly I found the following commands highly useful in locating the issue.
man sane-usb
sane-find-scanner
scanimage -L
dmesg
tail /var/log/udev

NOTE: If you are using a notebook devices be careful as I spent quite a few hours troubleshooting an error when opening the device from saned. It turned out to be that the USB power-management on the Toshiba notebook caused havoc with saned (http://askubuntu.com/questions/55140/error-during-device-i-o-when-using-usb-scanner). Switching to the desktop that is now housing the scanner fixed that problem. Thank you VIRTUALBOX (I ended up setting up a dedicated VM for this task) !

Configure scanbuttond

vim /etc/default/scanbuttond
#change this line from no to yes
RUN=yes

cd /etc/scanbuttond
sudo cp initscanner.sh.example initscanner.sh
sudo vim initscanner.sh

Uncomment or copy any scanner init command(s).

sudo cp buttonpressed.sh.example buttonpressed.sh
sudo vim buttonpressed.sh

Copy the contents of the scan script below. The script is also hosted on GitHub (https://github.com/leogaggl/misc-scripts/blob/master/buttonpressed.sh)

Scan script

#!/bin/bash
OUT_DIR=/output/directory/name
TMP_DIR=`mktemp -d`
FILE_NAME=scan_`date +%Y%m%d-%H%M%S`
cd $TMP_DIR
echo "################## Scanning ###################"
scanimage --resolution 150 --batch=scan_%03d.pnm --format=pnm --mode Gray --device-name "fujitsu:ScanSnap S1500:67953" --source “ADF Duplex” --page-width 210 --page-height 297 --sleeptimer 1 -y 297 -x 210
echo "################## Cleaning ###################"
for f in ./*.pnm; do
unpaper --size "a4" --overwrite "$f" "$f"
done
echo "############## Converting to TIF ##############"
mogrify -format tif *.pnm
echo "################ OCR ################"
for f in ./*.tif; do
tesseract "$f" "$f" -l eng hocr
hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
echo "############## Converting to PDF ##############"
pdftk *.tif.pdf cat output "output.pdf" && rm *.tif.pdf && rm *.tif.html
echo "############## Copy Output File ##############"
cp $FILE_NAME.pdf $OUT_DIR/
echo "############## clean up ##############"
cd ..
rm -rf $TMP_DIR
echo "############## FTP Output File ##############"
#wput $OUT_DIR/$FILE_NAME.pdf ftp://user:pwd@ftp.alfrescoserver.com.au:21/autoscan/pdf/

Credits:

A big thank you & hat tip to the following authors of the following pages:


EDIT (2013-09-16): I found this link describing how to remove empty pages: http://philipp.knechtges.com/?p=190 – might have to investigate this when I have some time.