Generate Thumbnail Images from PDF Documents
By Jonathan Hodgson | Published: 23 June 2005 | Reader Level: Intermediate
This article presents VB.NET code to create thumbnail images from a directory of
Adobe Acrobat PDF documents using the .NET Framework.
Download VB.NET source files - 276 Kb
Download C# source files - 275 Kb

Introduction
This article presents VB.NET code to create thumbnail images from a directory of
Adobe Acrobat PDF documents.
Often when looking for documents it is much easier to find what you want
visually, for example by seeing the cover of a document.
The application was written for a website that I was developing that needed to
display links to PDF documents. Instead of just showing a little PDF icon next
to each document we wanted to display the front page of the actual document.
As shown below, this makes the listings look better and also helps users find a
document more quickly if they recognise its cover.

[Figure: listing with the generic PDF icon vs. listing with the front-page thumbnail]
Note: please ignore the strange text; lorem ipsum is simply dummy text for this example.
Hopefully people will agree that having the actual front cover displayed
next to the hyperlink works better than the generic PDF icon.
Background
The web site was a Content Management System (CMS) so new PDF documents were
uploaded to the site by the users. We then had this application scheduled as a
batch service to run every 5 minutes and check for new files.
In the backend system the documents have metadata stored in a SQL Server 2000
database. We would write a flag there to say the thumbnail had been created, and
when we generated the HTML content for the page request in ASP/ASP.NET we would
return the appropriate IMG tag and source.
Using the Acrobat SDK also meant we could programmatically read the PDF metadata
and retrieve the number of pages in the document, which could then be displayed
as well. Although the end users could have entered that information, reading it
automatically meant less work for them and a better overall impression of the
web site. Another advantage was that many users relied on the number of pages,
rather than the more technical KB/MB value, to judge how large a document was.
Approach
To generate the thumbnail image for each document I used the Adobe Acrobat 5.0
SDK and the Microsoft .NET 1.1 Framework.
Note: do not confuse the thumbnails that are part of a PDF document with the
.png files this application generates.
The Acrobat SDK, combined with the full version of Adobe Acrobat (sadly the free
reader does not expose the COM interfaces), provides a COM
library of objects that can be used to manipulate and access PDF information.
Using these COM objects via COM Interop, we can load the PDF
document, get the first page and render that page to the clipboard. Then,
using the .NET Framework, we can copy this to a bitmap, scale it, combine it
with a template image and save the result as a .gif or .png file.
At first I just saved the scaled-down image, but then decided to "fancy up" the
thumbnail with a drop-shadow and folded corner. To achieve this effect I
created a transparent .gif, called pdftemplate_portrait.gif, using Macromedia
Fireworks MX, in which the main body of the page template was transparent.
By making the bottom-left pixel transparent too, we can easily set the
transparent colour for a bitmap in .NET.
I keep the top-right of the image white where the corner folds over. That means
I can just combine the images by drawing the transparent template directly over
the PDF image to achieve the final look.
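The bottom-left pixel idea can be tried on its own with a stand-in image. The sketch below is an illustration rather than code from the download: it builds a tiny magenta bitmap in memory instead of loading pdftemplate_portrait.gif, and passes the bottom-left colour to MakeTransparent explicitly. The module and method names are made up for the example.

```vbnet
Imports System
Imports System.Drawing
Imports System.Drawing.Imaging

Module TransparencySketch
    ' Makes every pixel that matches the bottom-left pixel's colour transparent.
    Sub KeyOutBottomLeftColour(ByVal template As Bitmap)
        Dim keyColour As Color = template.GetPixel(0, template.Height - 1)
        template.MakeTransparent(keyColour)
    End Sub

    Sub Main()
        ' Stand-in for the template: magenta body, one white pixel top-right
        Dim template As New Bitmap(4, 4, PixelFormat.Format32bppArgb)
        Dim g As Graphics = Graphics.FromImage(template)
        g.Clear(Color.Magenta)
        g.Dispose()
        template.SetPixel(3, 0, Color.White)

        KeyOutBottomLeftColour(template)

        ' Alpha 0 means fully transparent, 255 fully opaque
        Console.WriteLine("Body alpha:   {0}", template.GetPixel(0, 3).A)
        Console.WriteLine("Corner alpha: {0}", template.GetPixel(3, 0).A)
    End Sub
End Module
```

Anything drawn underneath such a bitmap shows through the keyed-out body, while the white corner stays opaque.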

Pre-requisites
- The full version of Adobe Acrobat (the free reader does not expose the COM interfaces), which exposes a COM library of objects to manipulate and access PDF information.
- The Adobe Acrobat 5.0 SDK, which is a free download from the Adobe Solutions Network website (note: the site requires registration). The latest SDK for Acrobat 6.0 requires paid membership, so we will use the previous SDK version.

To quickly see if you have the full version of Adobe Acrobat installed, use regedit.exe
and look under HKEY_CLASSES_ROOT for an entry called AcroExch.PDDoc.
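The same check can be made from code instead of regedit. This is a minimal sketch, not part of the original download: it relies on Type.GetTypeFromProgID returning Nothing when a ProgID is not registered, and the module and function names are invented for the example.

```vbnet
Imports System

Module AcrobatCheck
    ' Returns True when the given COM ProgID is registered on this machine.
    ' Type.GetTypeFromProgID returns Nothing for an unknown ProgID.
    Function IsProgIdRegistered(ByVal progId As String) As Boolean
        Return Not (Type.GetTypeFromProgID(progId) Is Nothing)
    End Function

    Sub Main()
        If IsProgIdRegistered("AcroExch.PDDoc") Then
            Console.WriteLine("AcroExch.PDDoc is registered.")
        Else
            Console.WriteLine("AcroExch.PDDoc is not registered; the full version of Acrobat is needed.")
        End If
    End Sub
End Module
```

Running this check at start-up gives a clearer error than the one CreateObject raises later when the class cannot be created.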

You'll also need the .NET 1.1 Framework and some PDF files to test
the solution.
The code was written in VB.NET using the .NET 1.1 Framework and
Visual Studio.NET 2003 on Windows XP, but there is no reason it wouldn't work
on Windows NT/2000 or .NET 1.0.
Using the code
The code is quite simple, with a try/catch over the main body. It is purposely in
one large block so it's easy to see what is happening and to step through and
examine with the debugger.
Initially we create an instance of AcroExch.PDDoc using
late-binding. The referenced Adobe Acrobat 5.0 Type Library (Acrobat.tlb
from C:\Program Files\Adobe\Acrobat 5.0
SDK\InterAppCommunicationSupport\Headers) does not expose a COM class
you can create using early-binding. By referencing the type library we can get
the Intellisense and strong-typing of the other Acrobat objects.
Pass the filename of the PDF document to be opened to the PDDoc object,
which can then be queried for metadata on the document: GetNumPages() for the
page count and GetInfo() for custom document properties.
' Create the document (Can only create the AcroExch.PDDoc object using
' late-binding)
pdfDoc = CreateObject("AcroExch.PDDoc")
' Open the document
ret = pdfDoc.Open(inputFile)
If ret = False Then
    Throw New FileNotFoundException
End If
' Get the number of pages
pageCount = pdfDoc.GetNumPages()
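As a rough sketch of how the page count and a document property might be read together, the snippet below wraps the same late-bound calls in a function and releases the COM reference in a Finally block. It is an illustration under assumptions: the function names are made up, "Title" is used as an example information key, and the code can only run on a machine with the full version of Acrobat installed.

```vbnet
Option Strict Off   ' late binding on the Acrobat COM object requires this

Imports System
Imports System.IO
Imports System.Runtime.InteropServices
Imports Microsoft.VisualBasic

Module PdfInfoSketch
    ' Formats a page count for display, e.g. next to the document link.
    Function FormatPageCount(ByVal pageCount As Integer) As String
        If pageCount = 1 Then
            Return "1 page"
        End If
        Return pageCount.ToString() & " pages"
    End Function

    ' Opens a PDF through Acrobat and returns a one-line description.
    Function DescribePdf(ByVal inputFile As String) As String
        Dim pdfDoc As Object = CreateObject("AcroExch.PDDoc")
        Try
            Dim opened As Boolean = pdfDoc.Open(inputFile)
            If Not opened Then
                Throw New FileNotFoundException("Acrobat could not open the file.", inputFile)
            End If
            Dim pageCount As Integer = pdfDoc.GetNumPages()
            ' "Title" is one of the standard document information keys
            Dim title As String = pdfDoc.GetInfo("Title")
            pdfDoc.Close()
            Return title & " (" & FormatPageCount(pageCount) & ")"
        Finally
            ' Release the reference even if one of the calls above throws
            Marshal.ReleaseComObject(pdfDoc)
        End Try
    End Function
End Module
```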
Set a reference to the first page of the document as pdfPage, which
is of type Acrobat.CAcroPDPage. From this we can get a rectangle
object of the actual page dimensions. One strange point to notice here is that
the Adobe Acrobat SDK documentation seems incorrect, as the PDFRect
that is returned from the GetSize() method has IDispatch
properties x, y but the PDFRect we need to supply to CopyToClipboard
must have left, right, top, bottom.
Finally we render the PDF page to the clipboard at full size. We could have
Acrobat scale the image down for us by a percentage, but we can get better
visual results using the .NET scaling algorithms of the Bitmap class.
It would have been more efficient to render directly to an off-screen bitmap,
and that would also have avoided overwriting whatever was previously on the
clipboard, but I found the clipboard method the most stable way to get a
rendered bitmap of the page using Acrobat.
Although it looks like the pdfPage object has a DrawEx
method that can take an hDC, I couldn't get the method
to work consistently. Calling DrawEx in the
paint event of a Windows Forms application did work, but it still wouldn't write
to an off-screen bitmap directly. Therefore the clipboard method is used; if
the process runs on a batch server, overwriting the clipboard shouldn't cause
too much worry.
Note: the Draw method is deprecated, as it only works on Win16
systems where hWnd was unique to Windows and not to each process
as on NT.

' Get the first page
pdfPage = pdfDoc.AcquirePage(0)
' Get the size of the page
' This is a really strange bug/documentation problem
' The PDFRect you get back from GetSize has properties
' x and y, but the PDFRect you have to supply CopyToClipboard
' has left, right, top, bottom
pdfRectTemp = pdfPage.GetSize
' Create PDFRect to hold dimensions of the page
pdfRect = CreateObject("AcroExch.Rect")
pdfRect.Left = 0
pdfRect.right = pdfRectTemp.x
pdfRect.Top = 0
pdfRect.bottom = pdfRectTemp.y
' Render to clipboard, scaled by 100 percent (ie. original size)
' Even though we want a smaller image, better for us to scale in .NET
' than Acrobat as it would greek out small text
' see http://www.adobe.com/support/techdocs/1dd72.htm
Call pdfPage.CopyToClipboard(pdfRect, 0, 0, 100)
Dim clipboardData As IDataObject = Clipboard.GetDataObject()
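Because the clipboard is shared, it may be worth confirming that it really holds a bitmap before using it. The sketch below shows one defensive way to do that; it is not taken from the original project, and the module and function names are hypothetical. Note that the Windows Forms clipboard needs a single-threaded apartment, hence the STAThread attribute on Main.

```vbnet
Imports System
Imports System.Drawing
Imports System.Windows.Forms

Module ClipboardSketch
    ' Returns the bitmap currently on the clipboard, or Nothing when the
    ' clipboard is empty or holds some other kind of data.
    Function GetClipboardBitmap() As Bitmap
        Dim data As IDataObject = Clipboard.GetDataObject()
        If data Is Nothing OrElse Not data.GetDataPresent(DataFormats.Bitmap) Then
            Return Nothing
        End If
        Dim raw As Object = data.GetData(DataFormats.Bitmap)
        If TypeOf raw Is Bitmap Then
            Return DirectCast(raw, Bitmap)
        End If
        Return Nothing
    End Function

    <STAThread()> _
    Sub Main()
        Dim pageBitmap As Bitmap = GetClipboardBitmap()
        If pageBitmap Is Nothing Then
            Console.WriteLine("No bitmap on the clipboard.")
        Else
            Console.WriteLine("Bitmap is {0} x {1} pixels.", pageBitmap.Width, pageBitmap.Height)
        End If
    End Sub
End Module
```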
Grab the rendered page bitmap from the clipboard and, based on the pdfRectTemp
object, determine if it's a portrait or landscape document. Set the correct file
to load as the template, and if it is landscape, switch the width and height.

Dim pdfBitmap As Bitmap = clipboardData.GetData(DataFormats.Bitmap)
' Size of generated thumbnail in pixels
Dim thumbnailWidth As Integer = 38
Dim thumbnailHeight As Integer = 52
Dim templateFile As String
' Switch between portrait and landscape
If (pdfRectTemp.x < pdfRectTemp.y) Then
    templateFile = templatePortraitFile
Else
    templateFile = templateLandscapeFile
    ' Swap width and height (little trick not using third temp variable)
    thumbnailWidth = thumbnailWidth Xor thumbnailHeight
    thumbnailHeight = thumbnailWidth Xor thumbnailHeight
    thumbnailWidth = thumbnailWidth Xor thumbnailHeight
End If
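The Xor swap works, but the same decision can be written without any swapping by returning a Size. The following is an alternative sketch, not the article's code; the function name is invented, and the 38 x 52 figures are the thumbnail dimensions used above.

```vbnet
Imports System
Imports System.Drawing

Module OrientationSketch
    ' Returns the thumbnail size for a page: portrait dimensions when the
    ' page is taller than it is wide, otherwise the same dimensions swapped.
    Function ThumbnailSizeFor(ByVal pageWidth As Integer, ByVal pageHeight As Integer) As Size
        Dim portrait As New Size(38, 52)
        If pageWidth < pageHeight Then
            Return portrait
        End If
        Return New Size(portrait.Height, portrait.Width)
    End Function

    Sub Main()
        ' 612 x 792 points is a US Letter page
        Console.WriteLine(ThumbnailSizeFor(612, 792))
        Console.WriteLine(ThumbnailSizeFor(792, 612))
    End Sub
End Module
```

As in the original code, a square page falls into the landscape branch.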
Load the template file as a Bitmap and as an Image.
We use both because the Bitmap class supports MakeTransparent
and the image can easily be passed to the Graphics.DrawImage() method.
It is slightly inefficient but speed isn't the primary objective for this
application.
Render the pdfImage using the GetThumbnailImage() method
of the .NET Framework Bitmap class, which provides a very smooth
scaled version of the image.
Next create a blank bitmap with room for the template border. Set the templateBitmap
to use the bottom-left pixel of the image as the transparency colour by
calling MakeTransparent(). See an article on
Chris Sells' website for more on transparencies in .NET.

Using the new blank bitmap, draw the rendered PDF page image to it and then draw
the template, with its transparency, directly over the top. Because the main
area of the template is transparent, the page image will still show through.
Finally, save the composited image as a .png or .gif file,
although .png does look better.
' Load the template graphic
Dim templateBitmap As Bitmap = New Bitmap(templateFile)
Dim templateImage As Image = Image.FromFile(templateFile)
' Render to small image using the bitmap class
Dim pdfImage As Image = pdfBitmap.GetThumbnailImage(thumbnailWidth, _
                                                    thumbnailHeight, _
                                                    Nothing, Nothing)
' Create new blank bitmap (+ 7 for template border)
Dim thumbnailBitmap As Bitmap = New Bitmap(thumbnailWidth + 7, _
                                           thumbnailHeight + 7, _
                                           Imaging.PixelFormat.Format32bppArgb)
' To overlay the template on the image, we need to set the transparency
' http://www.sellsbrothers.com/writing/default.aspx?
' content=dotnetimagerecoloring.htm
templateBitmap.MakeTransparent()
Dim thumbnailGraphics As Graphics = Graphics.FromImage(thumbnailBitmap)
' Draw rendered pdf image to new blank bitmap
thumbnailGraphics.DrawImage(pdfImage, 2, 2, thumbnailWidth, thumbnailHeight)
' Draw template outline over the bitmap (pdf will show through the
' transparent area)
thumbnailGraphics.DrawImage(templateImage, 0, 0)
' Save as .png file
thumbnailBitmap.Save(outputFile, Imaging.ImageFormat.Png)
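GetThumbnailImage is convenient but gives no control over how the image is resampled. If more control is wanted, one alternative (an assumption on my part, not what the downloadable project does) is to draw the page onto a new bitmap with the interpolation mode set explicitly. The module and function names below are made up for the sketch.

```vbnet
Imports System
Imports System.Drawing
Imports System.Drawing.Drawing2D
Imports System.Drawing.Imaging

Module ScalingSketch
    ' Scales a source image to the given size using bicubic interpolation.
    Function ScaleImage(ByVal source As Image, ByVal width As Integer, _
                        ByVal height As Integer) As Bitmap
        Dim result As New Bitmap(width, height, PixelFormat.Format32bppArgb)
        Dim g As Graphics = Graphics.FromImage(result)
        Try
            g.InterpolationMode = InterpolationMode.HighQualityBicubic
            g.DrawImage(source, 0, 0, width, height)
        Finally
            g.Dispose()
        End Try
        Return result
    End Function
End Module
```

A call such as ScaleImage(pdfBitmap, thumbnailWidth, thumbnailHeight) could then stand in for the GetThumbnailImage call.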
Write some feedback to the console as we work through each of the files.
Then actively release the references to the COM objects, as Acrobat isn't
the application best suited to opening and closing multiple PDF documents
without falling over. Luckily the code doesn't cause Acrobat to display any UI
that might cause the process to hang waiting for user interaction.
Console.WriteLine("Generated thumbnail... {0}", outputFile)
thumbnailGraphics.Dispose()
pdfDoc.Close()
Marshal.ReleaseComObject(pdfPage)
Marshal.ReleaseComObject(pdfRect)
Marshal.ReleaseComObject(pdfDoc)
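If an exception is thrown part-way through, some of these references may never have been assigned. A small helper along the following lines, called from a Finally block, can release whatever was created and skip the rest. This is a sketch with invented names, not code from the download.

```vbnet
Imports System
Imports System.Runtime.InteropServices

Module ComCleanupSketch
    ' Releases each COM reference in turn. Entries that are Nothing or that
    ' are not COM objects are skipped. Returns the number released.
    Function ReleaseAll(ByVal ParamArray comObjects() As Object) As Integer
        Dim released As Integer = 0
        For Each o As Object In comObjects
            If Not (o Is Nothing) AndAlso Marshal.IsComObject(o) Then
                Marshal.ReleaseComObject(o)
                released += 1
            End If
        Next
        Return released
    End Function
End Module
```

In the article's code the call would look like ReleaseAll(pdfPage, pdfRect, pdfDoc), placed after pdfDoc.Close().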
Visual Studio.NET Solution
The project you can download has all the VB.NET code and the COM Interop DLL
that was generated. Even though the application is actually a console
application, we still need a reference to System.Windows.Forms, as the
Clipboard and DataFormats classes live there.
Use the app.config to set the input and output paths for the .pdf
files and .png files respectively. By default it reads from and writes to C:\thumbnails\.
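Reading such paths with a fallback might look like the sketch below. The key names "InputPath" and "OutputPath" are hypothetical (check the app.config in the download for the real ones), and ConfigurationSettings.AppSettings is the .NET 1.1 API; later framework versions mark it obsolete in favour of ConfigurationManager.

```vbnet
Imports System
Imports System.Configuration

Module ConfigSketch
    ' Returns the appSettings value for key, or fallback when the key is
    ' missing or empty.
    Function GetSetting(ByVal key As String, ByVal fallback As String) As String
        Dim value As String = ConfigurationSettings.AppSettings(key)
        If value Is Nothing OrElse value.Length = 0 Then
            Return fallback
        End If
        Return value
    End Function

    Sub Main()
        ' Hypothetical key names; the default matches the article
        Dim inputPath As String = GetSetting("InputPath", "C:\thumbnails\")
        Dim outputPath As String = GetSetting("OutputPath", "C:\thumbnails\")
        Console.WriteLine("Reading .pdf files from {0}", inputPath)
        Console.WriteLine("Writing .png files to {0}", outputPath)
    End Sub
End Module
```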

Output
Running the PDFThumbnail.exe console application will enumerate all the
.pdf files in the directory specified in the .config file,
writing out a .png image of the first page of each, as we can see in the
screenshot below.
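The enumeration step can be sketched as follows. This is an illustration only: the paths are the article's defaults, the function name is made up, and skipping files whose thumbnail already exists is an assumption for the example (the production system described earlier tracked this with a database flag instead).

```vbnet
Imports System
Imports System.IO

Module BatchSketch
    ' Maps a PDF path to the .png thumbnail path in the output directory.
    Function ThumbnailPathFor(ByVal pdfFile As String, ByVal outputDir As String) As String
        Return Path.Combine(outputDir, Path.GetFileNameWithoutExtension(pdfFile) & ".png")
    End Function

    Sub Main()
        Dim inputDir As String = "C:\thumbnails\"
        Dim outputDir As String = "C:\thumbnails\"
        If Not Directory.Exists(inputDir) Then
            Console.WriteLine("Input directory not found: {0}", inputDir)
            Return
        End If
        For Each pdfFile As String In Directory.GetFiles(inputDir, "*.pdf")
            Dim outputFile As String = ThumbnailPathFor(pdfFile, outputDir)
            If File.Exists(outputFile) Then
                Console.WriteLine("Skipping, thumbnail exists... {0}", outputFile)
            Else
                ' The thumbnail generation described above would run here
                Console.WriteLine("Needs thumbnail... {0}", outputFile)
            End If
        Next
    End Sub
End Module
```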

Further Enhancements
Further improvements might be to:
- Render directly to an off-screen bitmap rather than to the clipboard.
- Remove the reliance on having a full version of Adobe Acrobat by using Ghostscript libraries instead.
One case we had was documents that could be viewed internally but were blocked
for external users due to compliance issues. By designing different templates
and rendering them with the page, it was obvious that the document was private,
further enhancing usability.

Points of Interest
The Adobe Acrobat 5.0 SDK documentation is not the best written, but most of the
information is there if you dig a little.
If running under an NT service account, the screen resolution and colour depth
make a difference. For example, if your server is only set for 256 colours at
640 x 480 and the console application is run via the service, it will not be
able to render 24-bit colour thumbnails. I've seen the same effect when using
charting controls from ASP, where the production IIS servers had low screen
resolutions set and the colour depth of the charts was low.
Also, if running in a batch on a server you should check the terms of the
Acrobat license agreement to see whether you are allowed to run the Adobe
Acrobat application in a server-type process.
The images are about 2-3 KB in size, and for about 3 GB of documents the
thumbnails would take an additional 60 MB, so storage requirements are not
excessive. The actual time to generate thumbnails for thousands of documents
would be a few hours, as Acrobat needs to load each document and render it to
the clipboard before the .NET bitmap scaling and so on.
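Working backwards from these figures, 60 MB of thumbnails at 2-3 KB each suggests a collection somewhere in the region of 20,000 to 30,000 documents. The small sketch below does the same arithmetic for any document count; the function name and the 2.5 KB average are assumptions chosen for illustration.

```vbnet
Imports System

Module StorageSketch
    ' Estimated thumbnail storage in megabytes for a number of documents,
    ' given an average thumbnail size in kilobytes.
    Function ThumbnailStorageMb(ByVal documentCount As Integer, _
                                ByVal averageThumbnailKb As Double) As Double
        Return documentCount * averageThumbnailKb / 1024.0
    End Function

    Sub Main()
        ' 24,576 thumbnails at an assumed 2.5 KB average
        Console.WriteLine("{0} MB", ThumbnailStorageMb(24576, 2.5))
    End Sub
End Module
```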
References
- Microsoft .NET Framework 1.1 documentation
- Chris Sells' web site for the transparency example code
- Adobe Acrobat 5.0 SDK documentation and examples
- Code Complete Second Edition for the example PDF document (which I hope Steve doesn't mind me including and which I can totally recommend even nearly ten years since it was first published)
Conclusion
This article has shown how to manipulate PDF documents using the Acrobat SDK and
combine images using the .NET framework.
At first it can be quite daunting trying to find good information on working
with PDF documents programmatically, although there are now a number of good
commercial components which hide a lot of the underlying PostScript
complexities.
I originally wrote this utility in Visual Basic 6 using a third-party imaging
component, but now it is easier to share the code using the .NET framework,
especially as the complex imaging and manipulation can now be done with a few
simple statements.
Thanks and I hope you enjoyed reading this article; I'd be interested to hear if
people found it useful.
This article was originally published at Code Project
About Jonathan Hodgson
Jonathan Hodgson works as a Software Developer in London, UK. He started
programming in the '80s on a trusty 48k Spectrum before moving to PC
development in the early '90s. During the working week most of his time is spent
on application development, both Windows and Web-based: .NET, C#,
VB.NET, ASP/ASP.NET, SQL Server. He is a Microsoft Certified Solution Developer
(MCSD) and MCP for developing web applications using ASP.NET in C# and is
always looking for new projects and challenges to work on.
http://www.jonathanhodgson.co.uk/