Tackling documents, forms and fields with Aspose

The challenge

One subsystem of our Java-based mortgage processing system must generate and deliver about 50 forms for each loan application, populating them with data from our database and delivering them as PDF files.

Some of the forms are industry-standard, but most are custom Word and PDF forms implemented by our customers. We needed to let customers create those custom forms and mark them up with fields that our application would then populate and, for the Word forms, convert to high-fidelity PDF documents.

In addition to the forms, the subsystem also has to convert image files in a variety of formats to PDF.

To do this processing, we recently evaluated a number of tools to manipulate Microsoft Word and PDF documents. In this post, we’ll explain how we explored several options and ultimately arrived at our choice, Aspose.Words for Java and Aspose.PDF for Java.

First steps

The first forms we implemented were the industry-standard ones. We decided to have these forms created by engineers, not customers, and made available in a “standard form” library.

As we were more comfortable coding than using Word, these forms were implemented using Apache Velocity templates that were populated and then converted to PDF using Prince XML. Prince allows you to define a form in HTML and generate a PDF from it with high fidelity.

Both are excellent tools, and this approach worked for our industry-standard forms. But it requires programming: you have to write HTML/CSS that reproduces each form as a pixel-perfect facsimile, which is not simple. It is also time-consuming, and the resulting forms are not easily modified, so the approach does not scale well. Still, we have quite a few stable forms built and working well this way.

Removing the engineers

Our users’ custom forms, however, presented a different challenge. The Velocity/Prince approach required programming, which wasn’t viable for custom forms that number in the hundreds and are often revised.

We needed to allow our users to create forms with embedded fields and upload them to a repository in our application. When needed for a loan, our application would parse the forms for the merge fields, set the fields from database values, and convert the merged document to PDF for delivery. The Word-to-PDF conversion had to be high fidelity and retain all formatting.

The users were most comfortable working with Microsoft Word documents and fillable PDFs, so we sought a solution that would let them embed fields in Word documents and PDFs.

For Word, the solution was Word’s merge field feature. This allows a field to be inserted in the Word document, labeled with a field ID (known to our application), and then located using a Word inspection tool and set to database values.
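
To make the merge-field idea concrete, here is a minimal plain-Java sketch of the find-and-fill step. It is not the Aspose.Words API: it works on a plain string, assumes placeholders are written as <<FieldName>> (a notation chosen only for illustration), and uses hypothetical field names. It lists the fields a template contains, then fills those for which a value is available.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Conceptual stand-in for a merge-field pass; this is not the Aspose.Words API. */
final class MergeFieldSketch {

    // Assumed placeholder notation: <<FieldName>>, where the name is letters, digits, or underscores.
    private static final Pattern FIELD = Pattern.compile("<<(\\w+)>>");

    /** Returns the distinct field names in the template, in order of first appearance. */
    static Set<String> fieldNames(String template) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = FIELD.matcher(template);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Replaces each field that has a value; fields without a value are left in place. */
    static String merge(String template, Map<String, String> values) {
        Matcher m = FIELD.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps characters such as '$' in the data from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names and values, standing in for a database row.
        String template = "Borrower: <<BorrowerName>>, amount: <<LoanAmount>>, officer: <<OfficerName>>";
        Map<String, String> row = new HashMap<>();
        row.put("BorrowerName", "Jane Doe");
        row.put("LoanAmount", "$250,000");

        System.out.println(fieldNames(template));
        System.out.println(merge(template, row));
    }
}
```

Leaving unknown fields in place rather than blanking them is a choice made for this sketch; it makes missing data easy to spot in the output.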


Inserting a mail merge field in Microsoft Word

Word document with merge field placeholders


Exploring our options

So we set off to evaluate several tools by writing prototype code and generating documents.

We had some experience with the Apache POI open source project, which we use for some simple exports of data to Excel files, so we examined its ability to manipulate Word documents. While it worked, we found it was not yet robust enough for a mission-critical application like ours, especially for mail merge work, which is critical for us. Nor could it convert Word documents to PDF.

We also considered docx4j, which is open source, pretty robust (including merge field support), and has an active community. But it does not directly convert Word to PDF either. Some people use iText for the conversion, but we wanted a one-tool solution and saw fidelity loss using iText.

We decided to explore the Aspose suite of products. Aspose.Words for Java allows manipulation of Word documents, and Aspose.PDF offers PDF manipulation functionality. The products are not open-source.

Aspose offers a no-cost 30-day evaluation license. Given the complexity of what we were doing, we ended up needing more than 30 days and Aspose graciously extended our evaluation licenses.

The Aspose.Words API is intuitive and permitted all the granularity and control we needed with a minimum of coding. It even had the ability to use some advanced Word mail merge features we thought would be beyond a non-Microsoft tool. One such feature is the nested mail merge region feature. This feature allowed several forms to be designed in a way that would have required a lot of custom code if we didn’t have the feature in our tool.
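
As a rough illustration of what a nested region does, the plain-Java sketch below expands a region once per data row and recurses into any region nested inside it. It is a conceptual stand-in, not the Aspose.Words implementation: the <<TableStart:Name>> and <<TableEnd:Name>> markers follow the naming we understand mail merge regions to use, while the <<Field>> notation, the class, the helper, and the sample data are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Conceptual sketch of nested region expansion over plain text; not the Aspose.Words API. */
final class NestedRegionSketch {

    // A region runs from <<TableStart:Name>> to the matching <<TableEnd:Name>> (same name via \1).
    private static final Pattern REGION =
            Pattern.compile("<<TableStart:(\\w+)>>(.*?)<<TableEnd:\\1>>", Pattern.DOTALL);
    private static final Pattern FIELD = Pattern.compile("<<(\\w+)>>");

    /** Renders the template against one scope: a map of field values and named lists of rows. */
    static String render(String template, Map<String, Object> scope) {
        // Pass 1: repeat each region body once per row, recursing so inner regions expand too.
        Matcher region = REGION.matcher(template);
        StringBuffer expanded = new StringBuffer();
        while (region.find()) {
            StringBuilder rows = new StringBuilder();
            Object value = scope.get(region.group(1));
            if (value instanceof List) {
                for (Object row : (List<?>) value) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> rowScope = (Map<String, Object>) row;
                    rows.append(render(region.group(2), rowScope));
                }
            }
            region.appendReplacement(expanded, Matcher.quoteReplacement(rows.toString()));
        }
        region.appendTail(expanded);

        // Pass 2: fill simple fields from this scope; fields with no string value stay in place.
        Matcher field = FIELD.matcher(expanded);
        StringBuffer out = new StringBuffer();
        while (field.find()) {
            Object value = scope.get(field.group(1));
            String text = (value instanceof String) ? (String) value : field.group();
            field.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        field.appendTail(out);
        return out.toString();
    }

    /** Hypothetical helper: builds a row from alternating key/value arguments. */
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> row = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            row.put((String) keyValues[i], keyValues[i + 1]);
        }
        return row;
    }

    public static void main(String[] args) {
        String template = "Borrower: <<Borrower>>\n"
                + "<<TableStart:Accounts>>  <<Bank>>:"
                + "<<TableStart:Deposits>> <<Amount>><<TableEnd:Deposits>>\n"
                + "<<TableEnd:Accounts>>";
        Map<String, Object> loan = row(
                "Borrower", "Jane Doe",
                "Accounts", Arrays.asList(
                        row("Bank", "First Bank",
                            "Deposits", Arrays.asList(row("Amount", "1,200"), row("Amount", "800"))),
                        row("Bank", "City Credit Union",
                            "Deposits", Arrays.asList(row("Amount", "500")))));
        System.out.print(render(template, loan));
    }
}
```

Without region support in the tool, this kind of loop-within-a-loop is the custom code each such form would otherwise need.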

We were pleased with the speed of Aspose.Words in merging the fields and generating the PDFs. We were surprised to find that Aspose.Words generates, populates and converts to PDF faster than our Velocity/Prince approach, which we had assumed was more lightweight.

But wait, there’s more

While Aspose.Words provided the tools we needed to process documents with fields and convert them to PDF, we had other PDF needs as well:

  • Setting PDF fields directly
  • Converting files in various formats (bmp, gif, jpeg, png, svg, tiff, and txt) to PDF
  • Concatenating and splitting PDFs
  • Inspecting PDF documents to find and extract specific text, which we use to locate e-signature fields

The tools that do these kinds of things are Aspose.PDF, iText, PDFBox, and JPedal. Aspose.PDF was the only one that provided all of the features we required. Using one tool versus several was a major addition to the “pro” column for Aspose, especially given we were using Aspose.Words as well.

We had experience using iText before and gave it a shot. While you could find and extract page and x/y coordinates of fields, it wasn’t out of the box. You had to do some of the work yourself. You also could not use regular expressions. Only Aspose provided us the ability to use regular expressions, which was very helpful.
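
To show why regular expressions matter for this task, here is a small plain-Java sketch of the search step. It assumes the text of each page has already been extracted by a PDF library, and it assumes a hypothetical marker format such as [[SIGN_BORROWER_1]] for e-signature anchors. It reports the page and character offset of each match; the real x/y coordinates would come from the PDF library, which this sketch does not call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a regex search over already-extracted page text; no PDF library is involved. */
final class SignatureAnchorSketch {

    /** One match: 1-based page number, the matched text, and its character offset in that page. */
    static final class Hit {
        final int page;
        final String text;
        final int offset;

        Hit(int page, String text, int offset) {
            this.page = page;
            this.text = text;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "page " + page + " @" + offset + ": " + text;
        }
    }

    /** Finds every match of the pattern on every page, in page order. */
    static List<Hit> find(List<String> pageTexts, Pattern pattern) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = pattern.matcher(pageTexts.get(i));
            while (m.find()) {
                hits.add(new Hit(i + 1, m.group(), m.start()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical marker format: [[SIGN_<ROLE>_<n>]]
        Pattern anchor = Pattern.compile("\\[\\[SIGN_[A-Z]+_\\d+\\]\\]");
        List<String> pages = Arrays.asList(
                "Disclosure text with nothing to sign.",
                "Sign here: [[SIGN_BORROWER_1]] and [[SIGN_COBORROWER_2]]");
        for (Hit hit : find(pages, anchor)) {
            System.out.println(hit);
        }
    }
}
```

A single pattern covers every signer role and index, which is what a literal-string search cannot do without one lookup per possible marker.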

One of our requirements was the ability to merge a large number of PDFs (80-100) into one. Using iText, it took an unacceptable (for us) amount of time. We thought this might simply be the nature of the task, but we were surprised to find that Aspose did it significantly (and acceptably) faster.

We found Aspose.PDF was the most capable for converting image files to PDF with high fidelity. Below is one such conversion, from JPG to PDF.

JPG (top) converted to PDF (bottom) with Aspose.PDF


Working with Aspose

We completed the prototyping for all our features using Aspose. As we worked more and more with it, we became convinced that it offered the most robust API, required the least amount of custom coding (very little), and was the most stable and reliable.

Some portions of the Aspose Java API are somewhat non-standard, with method names that contain underscores and some that could be more intuitive (e.g. getRectangel_Rename_NameStake, Form.get_xfa()). But Aspose appears to be cleaning up several of these issues with each release.

Because you don’t have access to the source code, for real nitty-gritty questions or debugging you must rely on the detail level of the documentation (which is quite good), and support (also good). Some of the Java documentation isn’t as complete as the .NET documentation, but the APIs are so similar, you can reference the .NET version even though you’re using Java. And the documentation includes plenty of (.NET and Java) examples for most operations. These are most helpful.

In addition to the API and functionality itself, we were impressed with other factors of the product as well.

The Aspose development and release cycle indicates an active project, with monthly updates generally consisting of dozens of improvements or fixes. Until recently, Aspose did not have a Maven repo, but support was open to the idea and said they’d do that soon. (Maven users will know that without it, you must download the jar after each monthly release or create a Maven artifact in your own repository.) In the two months or so since we brought it up, they have already created a Maven repo. That is a good sign that they both listen to customer input and are actively improving their product.

The Aspose support staff has been very responsive to our queries, even during our evaluation phase. We typically receive a response with a helpful answer in under 24 hours. Support often asks for code and the artifacts so that they can replicate the problem. The staff has shown themselves to be very technically competent in understanding and resolving our issues, and pleasant to work with.

Aspose it is

Considering all these factors, we decided to go with Aspose.Words and Aspose.PDF. They integrated well into our application, our engineers found them easy to use, and the users are able to work with their familiar Microsoft Office tools. They’re robust, solid, and fast, and Aspose provides good documentation and support. They’re not inexpensive, but you pay for quality and they do the job we need very well.

As much as we use and support open source libraries and tools, sometimes the commercial product is the better choice.

Given our good experience with Aspose.Words and Aspose.PDF, we are planning on evaluating Aspose.Cells in the future, to handle our application’s need to process Excel spreadsheets.