Friday, August 16, 2013

Extending the Swiss Army knife - an overview about writing of filters for LibreOffice

LibreOffice is sometimes regarded as the Swiss Army knife of office file-formats. Although that might be a slight exaggeration, it is a point of honour for the development team to let users load as many of their documents into the suite as possible. Every major release since the first one, LibreOffice 3.3, has come with new and improved import filters, often for file-formats that are under-documented, if any documentation can be found at all. In this article, we would like to present the way import filters interface with LibreOffice and give an interested developer a starting point for adding her favourite file-format to those LibreOffice is able to open.

Filters creating documents directly into LibreOffice internal structures

In general, an import filter's task is to parse the foreign document, extract useful information from it, and feed it to the application in a way it can understand. Many internal filters, like the MS Word filter, communicate with LibreOffice directly. They import the document straight into the internal structures that represent those documents. The advantage of this approach is the lack of an intermediary: the document is immediately understood by the application and no additional processing is needed. The disadvantage is that this approach requires an intimate knowledge of the internal structures used and thus has a steep learning curve. The next two types of filters will better suit a developer who does not want to dive too deep into LibreOffice internals, yet wants to get the work done.

OpenDocument format as an interchange format

Who has not heard about OpenDocument? Hardly anybody is unaware of its existence. But it is also a convenient interchange format for filter writers. In this case there is no need to understand the LibreOffice internals, apart from some hundred lines of boilerplate code that are documented in various places. It suffices to read the source document and generate a "flat" OpenDocument representation of it, that is, a single XML file rather than a zipped package. LibreOffice is able to load this kind of representation as if it were loading an ODF document.

XSLT filters

The easiest way to write a filter for an XML-based file-format is to use the XSLT filter dialogue. All you need is an XSL transform that converts the foreign XML-based file-format to the "flat" ODF, for import filters, and one that converts the corresponding ODF XML to the XML used by the foreign file-format, for export filters. Once those transforms exist, the integration with LibreOffice can be done using the user interface.

Picture 1

In the Tools menu, choose XML Filter Settings. You will see a list of all the XSLT filters already present in your LibreOffice installation, along with the application that is supposed to receive the resulting ODF document. The list also shows the direction of the conversion: whether it is an import filter, an export filter, or a filter that can both import and export a foreign file-format.

If you click on "New", this dialogue will appear.

Picture 2

In the "General" tab, you will be able to choose the user-visible information about the filter: its name and the application that will receive the converted document (for instance LibreOffice Calc (.ods) for a spreadsheet converted to the OpenDocument Spreadsheet format). This information is also used by LibreOffice to group different types of documents. If you choose presentations in the file-picker and your filter specifies that it converts into the LibreOffice Impress application, then all files having the file-extension associated with the file-format will be shown in the list.

In the "Name of file type" field, you will be able to describe the file-format that your filter will handle, and in the "File extension" field, you will need to put a semicolon-separated list of possible extensions for files in the given file-format. For instance, files in the Microsoft Excel 2003 XML file-format typically have the extension xml or xls. You can add a comment in the "Comments" field. This last field is optional and you can leave it empty if you desire.

Picture 3

The next tab holds the actual information about the XSL transformations that will do the conversion. The DocType field makes sense principally for import filters. The typedetection of XSLT filters will scan the first 4000 bytes of the file for the string you enter there. Since the typedetection searches for this string only in those first 4000 bytes, it is necessary to ensure that the string one specifies can invariably be found at the very beginning of the file. You can leave the field empty if you desire. The typedetection will then be done purely on the basis of the extension.
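The DocType rule can be pictured with a small, self-contained C++ sketch. It is not the LibreOffice implementation: the function name docTypeFound and the constant kDetectionWindow are made up for illustration, and the assumption that the string has to lie entirely within the first 4000 bytes is ours.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Number of leading bytes inspected, as stated in the text above.
constexpr std::size_t kDetectionWindow = 4000;

// Return true if docType occurs entirely within the first kDetectionWindow
// bytes of content. An empty docType never matches in this sketch; detection
// would then fall back to the file extension.
bool docTypeFound(const std::string& content, const std::string& docType)
{
    if (docType.empty())
        return false;
    const std::size_t window = std::min(content.size(), kDetectionWindow);
    // Searching a truncated copy keeps matches beyond the window from counting.
    return content.substr(0, window).find(docType) != std::string::npos;
}
```

A string that only shows up after a long preamble would be missed by such a scan, which is why a namespace URI or root element name near the top of the file is a safer choice than something buried deep in the document.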

If you are writing an export filter, you will provide in the "XSLT for export" field the transform that does the conversion from the OpenDocument XML to the file-format for which you are writing your filter. If this field remains empty, LibreOffice will know that your filter is not an export filter. The same is valid for the "XSLT for import" field. It will contain the path to the XSLT sheet that does the import transformation. Leaving it empty tells LibreOffice that your filter is not an import filter. There are already several filters bundled with LibreOffice that convert in one direction only. For instance, the XHTML filters and the MediaWiki filter are used only to export to the corresponding file-formats.

You also have the option to specify the default template for filters that import from file-formats that don't carry style information. For instance, the bundled DocBook filter uses a template to specify styles of different outline levels. If you don't specify the template, there are two possibilities. Either your transform creates a document with full styles, or you rely on the default styles that LibreOffice uses.

The check-box "The filter needs XSLT 2.0 processor" is to be checked only if your transforms use features exclusive to XSLT 2.0. It is nevertheless advisable to write XSLT sheets of the 1.0 version. They are much simpler and, because of the performance issues of other XSLT processors out there, LibreOffice uses libxslt under the hood. The fact that libxslt has only limited support for the 2.0 features is largely offset by the performance improvement that its use brought.

Now you are done with the integration of your filter. The dialogue in Picture 1 allows you to test your transforms, and even to export your filter as an extension package and deploy it on different installations of LibreOffice or to distribute it over our extension web-site http://extensions.libreoffice.org

As you can see, the integration of an XSLT-based filter into LibreOffice is rather simple. That is the biggest advantage of this approach. Nevertheless, there are also some disadvantages. Despite the migration of the XSLT engine to the relatively fast libxslt, the use of XSL transforms on large documents can be relatively slow. Another disadvantage is that the transforms are not really good at converting documents where the concepts of the source and target file-formats cannot be easily mapped.

XFilter framework

The XFilter framework is the other way to integrate import filters with LibreOffice. In fact, the previous XSLT-based filters use an intermediary layer that uses this framework too. The advantage of using the XFilter framework directly is the use of higher-level programming languages that allow much easier mapping of incompatible concepts, parsing of documents in several passes, as well as much more complex processing of the gathered information. Moreover, this is the way to go if you need to write a filter for a file-format that is not XML-based, since the XSLT-based filters cannot be used to convert binary document file-formats.

The use of the XFilter framework is a bit more complicated than the use of the XSLT-based filter dialogue. Nevertheless, it is far from being rocket science. We will examine the steps needed for a typical import filter using the example of the Microsoft Publisher filter recently added in LibreOffice 4. For the sake of simplicity, we start with the configuration files. You will need to craft two XML fragments, one for the filter description and one for the file-type.

Filter description:

<node oor:name="Publisher Document" oor:op="replace">
    <prop oor:name="Flags">
        <value>IMPORT ALIEN USESOPTIONS 3RDPARTYFILTER PREFERRED</value>
    </prop>
    <prop oor:name="FilterService">
       <value>com.sun.star.comp.Draw.MSPUBImportFilter</value>
    </prop>
    <prop oor:name="UIName">
       <value xml:lang="x-default">Microsoft Publisher 97-2010</value>
    </prop>
    <prop oor:name="FileFormatVersion">
       <value>0</value>
    </prop>
    <prop oor:name="Type">
       <value>draw_Publisher_Document</value>
    </prop>
    <prop oor:name="DocumentService">
       <value>com.sun.star.drawing.DrawingDocument</value>
    </prop>
</node>

The oor:name attribute gives the name of the filter used internally. This name is important because the file-type and the corresponding filter are linked using it. As to the flags, I will mention here only two or three. The others can be used just as they are. The IMPORT flag specifies that we are implementing an import filter. For export filters, the flag is EXPORT, and both flags are present for a bi-directional filter. The ALIEN flag indicates that the filter handles a file-format that is non-native from the point of view of LibreOffice. When used with the EXPORT flag, it will trigger a dialogue warning about possible data loss on export to the given file-format.
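To make the role of the flags concrete, here is a minimal sketch in plain C++ of how such a whitespace-separated flag string could be interpreted. The helper names (parseFlags, isImportFilter, isExportFilter, warnsOnExport) are hypothetical and only mirror the rules described above; LibreOffice's own configuration code is organised differently.

```cpp
#include <set>
#include <sstream>
#include <string>

// Split the whitespace-separated content of the "Flags" property into a set.
std::set<std::string> parseFlags(const std::string& value)
{
    std::set<std::string> flags;
    std::istringstream in(value);
    std::string flag;
    while (in >> flag)
        flags.insert(flag);
    return flags;
}

bool isImportFilter(const std::set<std::string>& f) { return f.count("IMPORT") != 0; }
bool isExportFilter(const std::set<std::string>& f) { return f.count("EXPORT") != 0; }

// Per the text: ALIEN together with EXPORT leads to the data-loss warning.
bool warnsOnExport(const std::set<std::string>& f)
{
    return isExportFilter(f) && f.count("ALIEN") != 0;
}
```

With the flag string of the Publisher filter, isImportFilter is true while isExportFilter and warnsOnExport are false, since the filter only imports.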

The FilterService property specifies the service that will be used for converting the document. It is necessary that it corresponds exactly to the implementation name of your import filter. Since the filter is a so-called UNO component, it uses Java-like naming. The part com.sun.star.comp.Draw indicates that the filter is a component and converts a drawing, and MSPUBImportFilter is the actual name of the filter. The UIName indicates the name that will appear in the file-selection dialogue for file-formats where none of the typedetections is able to detect them. The DocumentService property specifies which service will receive the result of the conversion. Here we are converting Microsoft Publisher files into LibreOffice Draw as a drawing, which is why the document service is com.sun.star.drawing.DrawingDocument. If we were converting a text document, the document service would be com.sun.star.text.TextDocument.

The Type property specifies the file type that the filter handles. This value is important because it must correspond to the oor:name attribute of the corresponding file-type description. It is necessary that the name of the file-type starts with the indication of the receiving application. Here we use draw_Publisher_Document; for the WordPerfect file-format, for instance, LibreOffice uses writer_WordPerfect_Document. Let's take this opportunity to have a look at the second XML fragment, the file-type one. Here is the one that corresponds to our example:

<node oor:name="draw_Publisher_Document" oor:op="replace">
    <prop oor:name="DetectService">
       <value>com.sun.star.comp.Draw.MSPUBImportFilter</value>
    </prop>
    <prop oor:name="Extensions">
       <value>pub</value>
    </prop>
    <prop oor:name="MediaType">
       <value>application/x-mspublisher</value>
    </prop>
    <prop oor:name="Preferred">
       <value>true</value>
    </prop>
    <prop oor:name="PreferredFilter">
        <value>Publisher Document</value>
    </prop>
    <prop oor:name="UIName">
        <value>Microsoft Publisher</value>
    </prop>
</node>

The DetectService specifies a service that is able to determine whether a document is of the given file-format. In our case, com.sun.star.comp.Draw.MSPUBImportFilter is able to do both the conversion and the type-detection. In the Extensions property, semicolon-separated values indicate possible extensions for files of the given file-format. In the case of an export filter, the first extension in the list is used for saving when the automatic file-extension is enabled. The MediaType property basically specifies the MIME type of the file-format. The other element that links the file-format with the corresponding filter is the PreferredFilter property. LibreOffice will invoke the "Publisher Document" filter to convert the document if the typedetection identifies it as "draw_Publisher_Document". As to the UIName, it specifies the way the document format will be referenced in the list of file-formats in the file-picker.
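The way the two fragments point at each other can be summarised in a short sketch. The structs FilterEntry and TypeEntry and the functions isLinked and firstExtension are hypothetical stand-ins that hold only the properties discussed here; they are not part of any LibreOffice API.

```cpp
#include <string>

// Only the properties needed to show the linkage between the two fragments.
struct FilterEntry
{
    std::string name;             // oor:name of the filter node
    std::string type;             // its "Type" property
};

struct TypeEntry
{
    std::string name;             // oor:name of the file-type node
    std::string preferredFilter;  // its "PreferredFilter" property
    std::string extensions;       // semicolon-separated "Extensions" property
};

// In the example, the fragments reference each other by name in both directions.
bool isLinked(const FilterEntry& filter, const TypeEntry& type)
{
    return filter.type == type.name && type.preferredFilter == filter.name;
}

// First entry of the list, the one used when saving with automatic extension.
std::string firstExtension(const TypeEntry& type)
{
    return type.extensions.substr(0, type.extensions.find(';'));
}
```

A mismatch in either name is an easy mistake to make when copying fragments from an existing filter, so a check of this kind is worth doing by eye before building.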

Now that we have finished crafting the configuration files, it is time to create some boilerplate C++ code. Our filter not only converts Microsoft Publisher files, but is also able to determine whether a given document is in a file-format it can import. For this purpose, it has to support two services: "com.sun.star.document.ImportFilter" and "com.sun.star.document.ExtendedTypeDetection". If we were implementing an export filter, we would also have to support the service "com.sun.star.document.ExportFilter". Besides the com::sun::star::document::XFilter interface that both are bound to implement, the ExportFilter service must also implement the com::sun::star::document::XExporter interface and the ImportFilter service has to implement com::sun::star::document::XImporter. For initialization, the filter must also implement com::sun::star::lang::XInitialization. And since the filter implements UNO services, it should also implement the com::sun::star::lang::XServiceInfo interface.

But let us concentrate on the interfaces that are specific to the import filter. The XFilter interface has two functions, filter and cancel. In our example we will implement cancel() as a do-nothing function. As for the filter function, it is the one that will do the actual filtering.
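The overall shape of such a component can be sketched without the UNO headers. Everything below is a stand-in: FilterLike, ImporterLike and ServiceInfoLike only imitate the roles of XFilter, XImporter and XServiceInfo with simplified signatures and standard C++ types, and SketchImportFilter is a hypothetical class, not the actual MSPUBImportFilter.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-ins for the UNO interfaces named above. The real ones come from the
// LibreOffice headers and use UNO types (OUString, Sequence, Reference).
struct FilterLike                 // role of XFilter
{
    virtual bool filter() = 0;    // the real function takes a property sequence
    virtual void cancel() = 0;
    virtual ~FilterLike() = default;
};

struct ImporterLike               // role of XImporter
{
    virtual void setTargetDocument(void* document) = 0;
    virtual ~ImporterLike() = default;
};

struct ServiceInfoLike            // role of XServiceInfo
{
    virtual std::string getImplementationName() const = 0;
    virtual std::vector<std::string> getSupportedServiceNames() const = 0;
    virtual ~ServiceInfoLike() = default;
};

// An import filter that also detects its own file type offers both services.
class SketchImportFilter : public FilterLike, public ImporterLike, public ServiceInfoLike
{
public:
    bool filter() override { return target != nullptr; }   // placeholder: no conversion
    void cancel() override {}                               // do-nothing
    void setTargetDocument(void* document) override { target = document; }

    std::string getImplementationName() const override
    {
        return "com.sun.star.comp.Draw.MSPUBImportFilter";
    }
    std::vector<std::string> getSupportedServiceNames() const override
    {
        return {"com.sun.star.document.ImportFilter",
                "com.sun.star.document.ExtendedTypeDetection"};
    }
    bool supportsService(const std::string& name) const
    {
        const std::vector<std::string> names = getSupportedServiceNames();
        return std::find(names.begin(), names.end(), name) != names.end();
    }

private:
    void* target = nullptr;       // stand-in for the Reference to the target document
};
```

The implementation name returned here is the same string that the FilterService and DetectService properties in the configuration fragments refer to.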

sal_Bool SAL_CALL MSPUBImportFilter::filter(const Sequence<PropertyValue> &aDescriptor) {

First, we will have to get the reference to the InputStream that represents the document we want to import. The aDescriptor is a sequence of pairs consisting of the value name and the actual value. The operator>>= will extract the value from the UNO Any (that can contain values of different types) into a variable of the requested type.

    sal_Int32 nLength = aDescriptor.getLength();
    const PropertyValue *pValue = aDescriptor.getConstArray();
    OUString sURL;
    Reference <XInputStream> xInputStream;
    for (sal_Int32 i = 0; i<nLength; i++)
       if (pValue[i].Name == "InputStream")
           pValue[i].Value >>= xInputStream;
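For readers who do not know UNO, the extraction pattern can be imitated in standard C++17. The sketch below uses std::any as a stand-in for the UNO Any and a hypothetical extract function in the role of operator>>=. Unlike the UNO operator, std::any_cast requires the stored type to match exactly.

```cpp
#include <any>
#include <string>
#include <vector>

// Stand-in for a property value: a name plus a value of any type.
struct NamedValue
{
    std::string name;
    std::any value;
};

// Copy the value called `name` into `out` if it is present and holds a T.
// The function reports success and leaves `out` untouched on failure.
template <typename T>
bool extract(const std::vector<NamedValue>& descriptor, const std::string& name, T& out)
{
    for (const NamedValue& entry : descriptor)
    {
        if (entry.name != name)
            continue;
        if (const T* typed = std::any_cast<T>(&entry.value))
        {
            out = *typed;
            return true;
        }
        return false;   // found, but holding a different type
    }
    return false;       // no entry with that name
}
```
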

Next we will have to specify the import service that will receive the converted document in the form of SAX messages. The com.sun.star.comp.Draw.XMLOasisImporter service is a service that receives the OpenDocument Graphics XML.

    OUString sXMLImportService ("com.sun.star.comp.Draw.XMLOasisImporter");
    Reference <XDocumentHandler> xInternalHandler(
       comphelper::ComponentContext(mxContext).createComponent(sXMLImportService),
       UNO_QUERY);

The XImporter sets up an empty target document for XDocumentHandler to write to.

    Reference <XImporter> xImporter(xInternalHandler, UNO_QUERY_THROW);
    xImporter->setTargetDocument(mxDoc);

At this point, everything is in place to plug in a filter that reads the xInputStream and writes the resulting XML into the xInternalHandler. The filter function should return true on success and false on failure. After the implementation of this filter function, we will have to implement XImporter's setTargetDocument function.
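What "writing the resulting XML into the handler" means can be illustrated with a self-contained sketch. DocumentHandler below is a simplified stand-in for the role XDocumentHandler plays (the real interface has more callbacks), StringHandler and emitEmptyDrawing are hypothetical, and namespace declarations and attribute escaping are left out for brevity.

```cpp
#include <string>
#include <utility>
#include <vector>

using Attributes = std::vector<std::pair<std::string, std::string>>;

// Stand-in for the document handler role: it receives SAX-style events.
struct DocumentHandler
{
    virtual void startElement(const std::string& name, const Attributes& attrs) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual ~DocumentHandler() = default;
};

// A handler that serializes the events, so the emitted stream can be inspected.
struct StringHandler : DocumentHandler
{
    std::string xml;
    void startElement(const std::string& name, const Attributes& attrs) override
    {
        xml += "<" + name;
        for (const auto& a : attrs)
            xml += " " + a.first + "=\"" + a.second + "\"";
        xml += ">";
    }
    void endElement(const std::string& name) override { xml += "</" + name + ">"; }
};

// Emit the skeleton of a flat OpenDocument drawing with `pageCount` empty pages.
// A real filter would derive the events from the parsed input stream instead.
bool emitEmptyDrawing(DocumentHandler& handler, int pageCount)
{
    if (pageCount < 1)
        return false;   // report failure, as the filter function is expected to
    handler.startElement("office:document",
        {{"office:mimetype", "application/vnd.oasis.opendocument.graphics"}});
    handler.startElement("office:body", {});
    handler.startElement("office:drawing", {});
    for (int i = 0; i < pageCount; ++i)
    {
        handler.startElement("draw:page", {});
        handler.endElement("draw:page");
    }
    handler.endElement("office:drawing");
    handler.endElement("office:body");
    handler.endElement("office:document");
    return true;
}
```

Keeping the emitting code behind an abstract handler means the same code can feed a serializer while testing and the application's importer in production.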

void SAL_CALL MSPUBImportFilter::setTargetDocument(const Reference <XComponent> & xDoc)
{
    mxDoc = xDoc;
}

In our case we just keep the Reference to XComponent in a member variable, which we used in the previous snippet to set up an empty target that receives our imported document. And that would be all for the integration of an import filter. For an export filter we would also have to implement XExporter's setSourceDocument, which is basically symmetrical to XImporter's setTargetDocument.

It is good to note that another way of integrating filters into LibreOffice is to use the com::sun::star::xml::XExportFilter and com::sun::star::xml::XImportFilter interfaces, which are roughly equivalent to the described method. The difference is that the FilterService in the configuration XML file will in this case always be com.sun.star.comp.Writer.XmlFilterAdaptor, and the actual filter component, as well as the target and source services, are specified in the configuration file in the UserData property. But this is mentioned only in passing, since the method I described in detail is much more generic.

When we were creating the XML configuration files, we said that the com.sun.star.comp.Draw.MSPUBImportFilter component is also able to do the type-detection. For that purpose, it must support the com::sun::star::document::XExtendedFilterDetection interface, and thus its detect function. This function should return the string corresponding to the type name in the configuration file if it detects the document, and an empty string when it is not able to identify the document.

OUString SAL_CALL MSPUBImportFilter::detect(Sequence <PropertyValue> &Descriptor)
{
    OUString sTypeName;
    sal_Int32 nLength = Descriptor.getLength();
    sal_Int32 location = nLength;
    const PropertyValue *pValue = Descriptor.getConstArray();
    Reference <XInputStream> xInputStream;
    for (sal_Int32 i = 0; i<nLength; ++i)
       if (pValue[i].Name == "TypeName")
            location=i;
       else if (pValue[i].Name == "InputStream")
           pValue[i].Value >>= xInputStream;

As in the filter function, we need to extract from the sequence the InputStream that we will examine. There is one difference: we keep the position of the TypeName property, so that we can fill it with the name of the type in case we detect it. The detection code itself, not shown here, should fill the variable sTypeName with the right string if the detection was successful. Only in that case do we store this information in the Descriptor and return the name of the type.

    if (!sTypeName.isEmpty())
    {
       if (location == Descriptor.getLength())
       {
           Descriptor.realloc(nLength+1);
           Descriptor[location].Name = "TypeName";
       }
       Descriptor[location].Value <<= sTypeName;
    }
    return sTypeName;
}
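The detection step that the snippets leave out could, in its simplest form, be a check of the first bytes of the stream. The sketch below is only an illustration in standard C++: it assumes that Publisher files are OLE2 compound documents, and it checks nothing but the OLE2 signature, which on its own cannot tell a Publisher file from any other OLE2-based format. The names Descriptor, startsWithOle2Magic and detectSketch are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Descriptor = std::vector<std::pair<std::string, std::string>>;

// Signature that opens an OLE2 compound file.
constexpr std::array<unsigned char, 8> kOle2Magic =
    {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

bool startsWithOle2Magic(const std::vector<unsigned char>& header)
{
    return header.size() >= kOle2Magic.size()
        && std::equal(kOle2Magic.begin(), kOle2Magic.end(), header.begin());
}

// Return typeName on a match and "" otherwise. On a match, also record the
// type in the descriptor, appending a "TypeName" entry if none exists yet.
std::string detectSketch(const std::vector<unsigned char>& header,
                         const std::string& typeName, Descriptor& descriptor)
{
    if (!startsWithOle2Magic(header))
        return std::string();
    auto it = std::find_if(descriptor.begin(), descriptor.end(),
        [](const std::pair<std::string, std::string>& p) { return p.first == "TypeName"; });
    if (it == descriptor.end())
        descriptor.emplace_back("TypeName", typeName);
    else
        it->second = typeName;
    return typeName;
}
```

A production detector has to look further into the file before claiming it, otherwise it would hijack documents that belong to other filters.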

It would not be true to say that this is all that is needed to integrate a filter into LibreOffice. There are still some ten to fifty lines of code needed for the implementation of the generic UNO boilerplate, an XML file for the UNO component registration during the build, and some makefile changes. Nevertheless, those changes are trivial and can be done by mimicking existing filters like those in the writerperfect module of the LibreOffice code.

Getting involved

Free software is about people, and the LibreOffice project highly values all contributors, regardless of the size of their contribution. The community is thrilled to welcome anybody who wants to lend a hand to make the software better. And why not you? If you think that writing filters for LibreOffice is enough fun for you, there are plenty of dedicated developers ready to help you, either on the developer list libreoffice@lists.freedesktop.org or on IRC in the #libreoffice-dev channel of the Freenode server. Just drop by and we will help you to write your first filter. We guarantee that you will enjoy it and stick with the project.