Ubuntu – paperless office on a budget

Paper and I have never gotten on well, so I have long dreamed of a paperless office. A while ago I purchased a Fujitsu ScanSnap S1500 scanner for the office, after researching which Automatic Document Feeder (ADF) multipage duplex scanners were both affordable and supported on Linux.

It took me a while to get around to setting all of this up, but the scanner is now connected to a headless Ubuntu VM, and a press of the scanner button will:

  1. scan the document
  2. perform OCR to convert to text
  3. combine the text with PDF to create a searchable PDF
  4. OPTIONAL – send the resulting document into Alfresco Document Management Server via FTP

Install dependencies

NOTE: PPA is only required for support of Fujitsu ScanSnap S1500
sudo apt-add-repository ppa:rolfbensch/sane-git
sudo apt-get update
sudo apt-get install sane sane-utils imagemagick tesseract-ocr pdftk libtiff-tools libsane-extras exactimage unpaper wput
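
Once the packages are installed, it can help to confirm that the tools the scan script calls are actually on the PATH. The following is a minimal sketch; the function name missing_tools is made up for illustration, and the command names are assumed to be the ones these packages provide.

```shell
#!/bin/sh
# Print the name of every command passed as an argument that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands used by the scan script (assumed names); no output means all were found.
missing_tools scanimage unpaper mogrify tesseract hocr2pdf pdftk
```

Anything printed is a tool that still needs installing before the button script can run end to end.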

Install scanbuttond

Download the “Debian Experimental” package from http://pkgs.org/download/scanbuttond
sudo dpkg -i scanbuttond_0.2.3.cvs20090713-14_i386.deb

This step is only needed for Fujitsu ScanSnap support. For other scanners you can probably install scanbuttond from the Ubuntu repository.

Scanner config

# on Ubuntu this file normally lives in /lib/udev/rules.d/
sudo vim /lib/udev/rules.d/40-libsane.rules
# add this line
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="11a2", ENV{libsane_matched}="yes"
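
For a different scanner, the same kind of rule can be built from the vendor:product pair that lsusb prints (for example "ID 04c5:11a2"). This is only a sketch: the helper name udev_rule_for is hypothetical, and it simply reproduces the rule format shown above.

```shell
#!/bin/sh
# Build a libsane udev rule line from a "vendor:product" pair as printed by lsusb.
udev_rule_for() {
    vendor=${1%%:*}     # text before the colon
    product=${1##*:}    # text after the colon
    printf 'ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", ENV{libsane_matched}="yes"\n' "$vendor" "$product"
}

# The ScanSnap S1500 IDs used in this post.
udev_rule_for 04c5:11a2
```

After editing the rules file, reloading udev (sudo udevadm control --reload-rules) and replugging the scanner should be enough for the rule to take effect.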

Permissions

sudo adduser saned scanner
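
To verify that the group change took effect, the membership can be checked with id. This is a small sketch; user_in_group is a made-up helper name, and it assumes the saned user and scanner group exist on your system.

```shell
#!/bin/sh
# Succeed if user $1 is a member of group $2 (fails if the user does not exist).
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

if user_in_group saned scanner; then
    echo "saned is in the scanner group"
else
    echo "saned is NOT in the scanner group (or the user does not exist)"
fi
```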

Useful command lines for troubleshooting

Since I had some trouble getting this scanner to work properly, I found the following commands very useful for locating the issue.
man sane-usb
sane-find-scanner
scanimage -L
dmesg
tail /var/log/udev
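
The device name that scanimage -L reports is the same string the scan script passes to --device-name, so it is handy to extract it. The sketch below assumes the usual "device `NAME' is a ..." output format; sane_device_names is a hypothetical helper name.

```shell
#!/bin/sh
# Read `scanimage -L` output on stdin and print only the SANE device names.
sane_device_names() {
    sed -n "s/^device \`\(.*\)' is a .*\$/\1/p"
}

# Typical use on the scanner host:
#   scanimage -L | sane_device_names
```

If nothing is printed, saned did not see a scanner and the udev rule or permissions are the first things to recheck.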

NOTE: If you are using a notebook, be careful: I spent quite a few hours troubleshooting an error when opening the device from saned. It turned out that USB power management on the Toshiba notebook was causing havoc with saned (http://askubuntu.com/questions/55140/error-during-device-i-o-when-using-usb-scanner). Moving the scanner to the desktop that now houses it fixed the problem. Thank you, VirtualBox (I ended up setting up a dedicated VM for this task)!

Configure scanbuttond

sudo vim /etc/default/scanbuttond
# change this line from no to yes
RUN=yes
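
The same change can be scripted instead of edited by hand, which is useful when rebuilding the VM. This is a minimal sketch, assuming the file contains a single RUN= line; enable_scanbuttond is a made-up function name.

```shell
#!/bin/sh
# Set RUN=yes in a scanbuttond defaults file, keeping a .bak copy of the original.
enable_scanbuttond() {
    conf=$1
    cp "$conf" "$conf.bak" &&
    sed 's/^RUN=.*$/RUN=yes/' "$conf.bak" > "$conf"
}

# Usage from a root shell (the real file is not writable by normal users):
#   enable_scanbuttond /etc/default/scanbuttond
```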

cd /etc/scanbuttond
sudo cp initscanner.sh.example initscanner.sh
sudo vim initscanner.sh

Uncomment or copy any scanner init command(s).

sudo cp buttonpressed.sh.example buttonpressed.sh
sudo vim buttonpressed.sh

Copy the contents of the scan script below. The script is also hosted on GitHub (https://github.com/leogaggl/misc-scripts/blob/master/buttonpressed.sh)
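
The scan script runs the same pipeline for every button press. If your scanner reports more than one button, a small dispatcher could pick a different action per button. This is only a sketch: it assumes scanbuttond passes the button number as the first argument and the SANE device name as the second (as described in the comments of the example file), and the action names are purely illustrative.

```shell
#!/bin/sh
# Hypothetical dispatcher: map a button number to an action name.
# Assumes scanbuttond calls: buttonpressed.sh <button-number> <device-name>
action_for_button() {
    case "$1" in
        1) echo "scan-to-pdf" ;;
        2) echo "scan-and-upload" ;;
        *) echo "ignore" ;;
    esac
}

# Inside buttonpressed.sh this would be driven by the script's own arguments:
#   action=$(action_for_button "$1"); device=$2
```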

Scan script

#!/bin/bash
OUT_DIR=/output/directory/name
TMP_DIR=$(mktemp -d)
FILE_NAME=scan_$(date +%Y%m%d-%H%M%S)
cd "$TMP_DIR"
echo "################## Scanning ###################"
scanimage --resolution 150 --batch=scan_%03d.pnm --format=pnm --mode Gray --device-name "fujitsu:ScanSnap S1500:67953" --source "ADF Duplex" --page-width 210 --page-height 297 --sleeptimer 1 -y 297 -x 210
echo "################## Cleaning ###################"
for f in ./*.pnm; do
unpaper --size "a4" --overwrite "$f" "$f"
done
echo "############## Converting to TIF ##############"
mogrify -format tif *.pnm
echo "################ OCR ################"
for f in ./*.tif; do
tesseract "$f" "$f" -l eng hocr
hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
echo "############## Converting to PDF ##############"
# write the merged PDF under the timestamped name that the copy step expects
pdftk *.tif.pdf cat output "$FILE_NAME.pdf" && rm *.tif.pdf && rm *.tif.html
echo "############## Copy Output File ##############"
cp "$FILE_NAME.pdf" "$OUT_DIR/"
echo "############## clean up ##############"
cd ..
rm -rf "$TMP_DIR"
echo "############## FTP Output File ##############"
#wput $OUT_DIR/$FILE_NAME.pdf ftp://user:pwd@ftp.alfrescoserver.com.au:21/autoscan/pdf/
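
One weakness of the script is that the temporary directory is only removed if every step runs through to the end. A possible hardening, sketched here under the assumption that the work is wrapped in a function or separate command, is to let a trap do the cleanup; run_in_scratch is a hypothetical helper, not part of the original script.

```shell
#!/bin/sh
# Run a command inside a fresh scratch directory that is removed afterwards,
# whether the command succeeds or fails.
run_in_scratch() {
    (
        scratch=$(mktemp -d) || exit 1
        trap 'rm -rf "$scratch"' EXIT   # fires when this subshell exits
        cd "$scratch" || exit 1
        "$@"
        exit $?                         # pass the command's status to the caller
    )
}

# Example: run_in_scratch sh -c 'scanimage --batch=scan_%03d.pnm --format=pnm'
```

Because the work happens in a subshell, the caller's working directory is left untouched and the exit status of the wrapped command is preserved.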

Credits:

A big thank you and hat tip to the authors of the following pages:


EDIT (2013-09-16): I found this link describing how to remove empty pages: http://philipp.knechtges.com/?p=190 – might have to investigate this when I have some time.

Bulk converting Office documents to PDF

When you need to convert multiple documents to PDF for distribution (or from one Office format to another), there are a few utilities around. The most workable one I found is unoconv, which is built on top of LibreOffice / OpenOffice. It uses the OpenOffice conversion facilities rather than a simple PDF print driver.

On Ubuntu it can be installed via Software Center or via apt-get from the core repositories.
sudo apt-get install unoconv
Combined with the -exec option of the Unix find command, this makes converting whole directory trees a breeze.
#find all Word Documents and convert to PDF
find . -name "*.doc*" -exec unoconv -f pdf {} \;
#find all Powerpoint Documents and convert to PDF
find . -name "*.ppt*" -exec unoconv -f pdf {} \;
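
The "*.doc*" pattern also matches names such as report.doc.bak, so it can be worth previewing which files will be picked up before converting. The sketch below is one way to do that; both function names are made up, the extension list is an assumption about what you want converted, and -iname relies on GNU or BSD find.

```shell
#!/bin/sh
# List Word and PowerPoint files below a directory (case-insensitive), one per line.
list_office_docs() {
    find "${1:-.}" -type f \( -iname '*.doc' -o -iname '*.docx' \
        -o -iname '*.ppt' -o -iname '*.pptx' \) | sort
}

# Convert every listed file to PDF and report failures instead of stopping.
# Assumes unoconv is installed; file names containing newlines are not handled.
convert_office_docs() {
    list_office_docs "$1" | while IFS= read -r doc; do
        unoconv -f pdf "$doc" </dev/null || echo "failed: $doc" >&2
    done
}
```

Running list_office_docs on its own first gives a dry run; convert_office_docs then works through the same list.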

To show all the possible conversion formats you can use:
unoconv --show
The following list of document formats are currently available:

bib – BibTeX [.bib]
doc – Microsoft Word 97/2000/XP [.doc]
doc6 – Microsoft Word 6.0 [.doc]
doc95 – Microsoft Word 95 [.doc]
docbook – DocBook [.xml]
html – HTML Document (OpenOffice.org Writer) [.html]
odt – ODF Text Document [.odt]
ott – Open Document Text [.ott]
ooxml – Microsoft Office Open XML [.xml]
pdf – Portable Document Format [.pdf]
rtf – Rich Text Format [.rtf]
latex – LaTeX 2e [.ltx]
sdw – StarWriter 5.0 [.sdw]
sdw4 – StarWriter 4.0 [.sdw]
sdw3 – StarWriter 3.0 [.sdw]
stw – Open Office.org 1.0 Text Document Template [.stw]
sxw – Open Office.org 1.0 Text Document [.sxw]
text – Text Encoded [.txt]
mediawiki – MediaWiki [.txt]
txt – Text [.txt]
uot – Unified Office Format text [.uot]
vor – StarWriter 5.0 Template [.vor]
vor4 – StarWriter 4.0 Template [.vor]
vor3 – StarWriter 3.0 Template [.vor]
xhtml – XHTML Document [.html]

The following list of graphics formats are currently available:

bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Draw) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odd – OpenDocument Drawing [.odd]
otg – OpenDocument Drawing Template [.otg]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
ppm – Portable Pixelmap [.ppm]
ras – Sun Raster Image [.ras]
std – OpenOffice.org 1.0 Drawing Template [.std]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxd – OpenOffice.org 1.0 Drawing [.sxd]
sxd3 – StarDraw 3.0 [.sxd]
sxd5 – StarDraw 5.0 [.sxd]
tiff – Tagged Image File Format [.tiff]
vor – StarDraw 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xhtml]
xpm – X PixMap [.xpm]

The following list of presentation formats are currently available:

bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Impress) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odg – ODF Drawing (Impress) [.odg]
odp – ODF Presentation [.odp]
otp – ODF Presentation Template [.otp]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
pot – Microsoft PowerPoint 97/2000/XP Template [.pot]
ppm – Portable Pixelmap [.ppm]
ppt – Microsoft PowerPoint 97/2000/XP [.ppt]
pwp – PlaceWare [.pwp]
ras – Sun Raster Image [.ras]
sda – StarDraw 5.0 (OpenOffice.org Impress) [.sda]
sdd – StarImpress 5.0 [.sdd]
sdd3 – StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
sdd4 – StarImpress 4.0 [.sdd]
sxd – OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
sti – OpenOffice.org 1.0 Presentation Template [.sti]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxi – OpenOffice.org 1.0 Presentation [.sxi]
tiff – Tagged Image File Format [.tiff]
uop – Unified Office Format presentation [.uop]
vor – StarImpress 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
vor4 – StarImpress 4.0 Template [.vor]
vor5 – StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xml]
xpm – X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

csv – Text CSV [.csv]
dbf – dBASE [.dbf]
dif – Data Interchange Format [.dif]
html – HTML Document (OpenOffice.org Calc) [.html]
ods – ODF Spreadsheet [.ods]
ooxml – Microsoft Excel 2003 XML [.xml]
ots – ODF Spreadsheet Template [.ots]
pdf – Portable Document Format [.pdf]
sdc – StarCalc 5.0 [.sdc]
sdc4 – StarCalc 4.0 [.sdc]
sdc3 – StarCalc 3.0 [.sdc]
slk – SYLK [.slk]
stc – OpenOffice.org 1.0 Spreadsheet Template [.stc]
sxc – OpenOffice.org 1.0 Spreadsheet [.sxc]
uos – Unified Office Format spreadsheet [.uos]
vor3 – StarCalc 3.0 Template [.vor]
vor4 – StarCalc 4.0 Template [.vor]
vor – StarCalc 5.0 Template [.vor]
xhtml – XHTML [.xhtml]
xls – Microsoft Excel 97/2000/XP [.xls]
xls5 – Microsoft Excel 5.0 [.xls]
xls95 – Microsoft Excel 95 [.xls]
xlt – Microsoft Excel 97/2000/XP Template [.xlt]
xlt5 – Microsoft Excel 5.0 Template [.xlt]
xlt95 – Microsoft Excel 95 Template [.xlt]