The Story of the Document Feature

Generating styled documents and reports from data is clearly an essential feature for a database system. That’s why a document generation feature was planned and implemented in the initial TeamDesk release back in fall 2005.

The process of document generation is relatively straightforward. The user creates a nice-looking template, marks certain points as placeholders for the data, and uploads the template to TeamDesk. When the document is requested, the system loads the template, finds the placeholders, substitutes them with actual data, and sends the result to the user. Sounds easy.
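To make the load, find, and substitute steps concrete, here is a minimal sketch in Python. It is only an illustration: the `{{Name}}` marker syntax, the `fill_template` function, and the sample record are made up for this example and are not the actual TeamDesk placeholder syntax or implementation.

```python
import re

# Hypothetical placeholder syntax: {{Field Name}}.
# The real TeamDesk syntax may well be different.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")

def fill_template(template: str, record: dict) -> str:
    """Replace every {{name}} marker with the matching value from record."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so that mistakes stay visible.
        return str(record.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)

template = "Invoice {{Number}} for {{Customer}}: total {{Total}}"
record = {"Number": 1042, "Customer": "Acme Ltd", "Total": "150.00"}
print(fill_template(template, record))
# prints: Invoice 1042 for Acme Ltd: total 150.00
```

Leaving unknown placeholders in the output is a design choice of this sketch; a real system might prefer to report an error instead.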

Then the question arises: how does the user create a template? There are lots of tools. The report designer included with Microsoft Access is one of them. Crystal Reports from SAP is another. There are many more, and each has its own unique features and its own template designer UI. This means the user has to spend time learning how to use it.

But you can hardly find a user who is unfamiliar with a text editor. There are plenty of them, from plain text editors like UNIX vi or Windows Notepad to complex ones like TeX or Word. Among those we need the ones with rich text capabilities, and Microsoft Word clearly dominates this area.

So, the decision was made. The user should be able to create templates as Microsoft Word documents. But then the next question arose.

How would we process .doc files?

Microsoft Word stores documents in a proprietary binary .doc format. To find the placeholders we would need to crunch bits and bytes, and unfortunately, as of 2005, the format was not documented. All we could find were scattered bits of information on how to build simple documents consisting of a few paragraphs with little styling. And since the template creator has the complete set of editor tools at hand, we would need to understand every element in the file. The only way forward in this direction would be reverse-engineering the file format, which the license terms prohibit.

Luckily, Microsoft Word provides a way to manipulate the content of a document via its OLE Automation interface. That turned out to be a no-go as well, first because of technical problems and second because of licensing terms. Here are excerpts from a Microsoft Knowledge Base article:

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

Current licensing guidelines prevent Office applications from being used on a server to service client requests, unless those clients themselves have licensed copies of Office.

Clearly, we can’t ensure that every request for a document is covered by an appropriate Office license, and we can’t provide one if it is missing (you may not have Word at all, yet still request the document in order to forward it by email, for example). Yet this requirement led us to a solution.

If we can’t do it on the server side, and if clients are required to have a licensed copy of Word, let’s do the template processing on the client’s computer. All we needed to do was write a browser component that takes the template and the data and drives Word through automation to produce the result.

This path imposed some limitations as well. Automation interfaces are available only in Windows versions of Office (bad luck for non-Windows users and for those who only have Word Viewer), and since the browser component has to deal with OLE, the solution was limited to Microsoft Internet Explorer. However, in 2005 the share of non-IE users was relatively low.

The situation has changed over the last five years. The rise of Firefox (which has now overtaken Internet Explorer 6 in market share) and growing Mac sales led to numerous requests to make the document functionality cross-browser and cross-platform, and they pushed us to rethink the approach.

As time passed and old Office versions faded away, we started experimenting with the Word 2003 XML format. It was not documented either, but at least it was based on the XML standard, so there was no need to deal with bits and bytes anymore. The outcome of these experiments was a tool that generates help pages for our products from Word documents. It was limited and far from ideal, yet it did its job well enough to make this approach look promising.

The situation changed radically in 2007, when Microsoft released a new version of its Office suite and introduced the .docx format for text documents. In practice, a .docx file is a set of XML documents bundled together in a ZIP archive. The XML format for the document was simplified (compared to Word 2003 XML) and, moreover, was fully documented on Microsoft’s site (one year later they documented the binary .doc format as well, but it was clear it had no future).
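Because a .docx file is just a ZIP archive of XML parts, it can be processed with nothing but a standard library. The sketch below is an illustration under stated assumptions, not our production code: the `{{Customer}}` marker, the `fill_docx` and `make_template` helpers, and the one-part "template" built in memory are all hypothetical. A real .docx carries additional parts (content types, relationships, styles), and Word often splits text across several runs, so a naive string replace like this one can miss a placeholder.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Minimal main document part. A real .docx contains more parts
# ([Content_Types].xml, relationships, styles); they are omitted here.
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello, {{Customer}}!</w:t></w:r></w:p></w:body>"
    "</w:document>"
)

def make_template() -> bytes:
    """Build a tiny in-memory archive that stands in for an uploaded template."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()

def fill_docx(template_bytes: bytes, values: dict) -> bytes:
    """Copy every part of the archive, substituting markers in word/document.xml."""
    out_buffer = io.BytesIO()
    source = zipfile.ZipFile(io.BytesIO(template_bytes))
    with source, zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for name, value in values.items():
                    # Escape the value so that &, < and > keep the XML well-formed.
                    xml = xml.replace("{{" + name + "}}", escape(str(value)))
                data = xml.encode("utf-8")
            target.writestr(item.filename, data)
    return out_buffer.getvalue()

result = fill_docx(make_template(), {"Customer": "Acme & Sons"})
with zipfile.ZipFile(io.BytesIO(result)) as archive:
    filled_xml = archive.read("word/document.xml").decode("utf-8")
print(filled_xml)
```

All other parts of the archive are copied unchanged, which is what lets the styling of the template survive the substitution.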

We only had to wait for users to adapt to the new format. As more and more people switched to DOCX over the last three years, the rating of the “Server-Side Document Generation” idea we posted back in 2007 jumped to 200, putting it in the Top 5 list. For those not willing to upgrade, Microsoft released the Office Compatibility Pack, which allows Word 2003 to read and write files in the new format. And finally, mainstream support for Word 2003 ended in April 2009, and Office 2010 is coming out.

So there is nothing left to prevent us from switching from component-assisted, client-based document generation to a true server-side solution.

The advantages of a server-side solution are clear. All you need to create a template is a DOCX-compatible editor: Word 2007, Word 2003 with the Compatibility Pack, or OpenOffice. All you need to view the result is a DOCX-compatible viewer: Word 2003+, Word 2007, Word Viewer 2007, WordPad on Windows 7, OpenOffice, or even an iPhone. Want proof? Here are some development screenshots of a default document (generated on the server side) from our Invoicing application template.

Here is the one from my lovely iPhone…

…and here is the one from WordPad on Windows 7. The warning is displayed because WordPad lacks support for the complex scripts used in the Asian languages we mention in the styles. Yet the document is displayed correctly.

Q: Why don’t you support ODF, it’s a standard format after all?

A: Speaking of standards, DOCX is a standard too (ISO/IEC 29500). But we are not open-standard zealots; our goal is to make it work for the vast majority of users. Unfortunately, at the moment the adoption of OpenOffice is hardly comparable to the adoption of Microsoft Office. And while OpenOffice claims to read DOCX and Microsoft Office claims to read ODF, certain compatibility issues may still arise. In this situation we are on the side of those who use the product with the bigger market share.

Q: Why don’t you generate PDF files?

A: Consider three keywords: adoption, licensing and transcoding. While we recognize the value of PDF output, when it comes to template creation the adoption of word processors is probably ten times higher than the adoption of Adobe Acrobat, the tool for creating PDF files. And DOC-to-PDF transcoding won’t work well because the two formats have different architectures. Most DOC-to-PDF transcoders on the market require an installation of Word on the server to load the document and then “print” it to a so-called PDF printer. You can use such a configuration in-house, but it is not an option in a public production environment because of the licensing problems described above. A few others try to do the job on their own, but the quality is not ideal; even Word’s own PDF exporter usually draws a few extra pixels here and there. If you can bear such issues, you are free to go that way: there are a lot of free online converters, and with either Word or OpenOffice you can open the DOCX and save the document as PDF.

Author

Kirill Bondar

Date

19 Apr 2010