This document provides step-by-step instructions for generating valid
PDF/A from LaTeX sources.
This page is still a draft. The files and instructions below are subject to change, even in ways that aren't backward compatible. If you follow these instructions and find any mistakes, or run into problems you cannot solve, please feel free to email me at selinger@mathstat.dal.ca. I will do my best to improve this document and add troubleshooting hints where necessary.
A PDF/A document is a special kind of PDF document that has been
optimized for long-term archiving. PDF/A is a standard of the
International Organization for Standardization (ISO). Some of the
main features of PDF/A documents are:
Many organizations, such as universities, require certain documents,
such as Master's and Ph.D. theses, to be prepared in PDF/A format.
This makes sense, because such documents will be archived, and
should remain readable for the indefinite future.
In such situations, either the document's author, or the institution, or both, will perform a "check" to ensure that the document actually conforms to the PDF/A requirements. When it is done correctly, this check has two parts: automated validation and manual validation.
There are a number of software tools available that perform automated validation of PDF/A documents. Probably the most common of these is the "Preflight" tool in Adobe® Acrobat® Pro ("View" → "Tools" → "Print Production" → "Preflight"). Unfortunately, Acrobat Pro is not free software. There are also a number of free and online tools available; a web search will turn them up. Note that not all tools are equivalent: a document validated by one tool could still fail validation using another.
Some institutions cut corners by relying exclusively on software
validation of PDF/A documents, skipping manual validation. This
encourages authors to submit PDF/A documents of poor quality,
i.e., documents that meet the technical requirements, but not the
intent, of the PDF/A standard.
Almost any PDF document can easily be converted to PDF/A-1b, using automated software tools such as the "Convert to PDF/A-1b" option of the Preflight tool of Acrobat Pro. So you could generate a "plain old" PDF file from your LaTeX sources, and then convert it to PDF/A using Acrobat Pro. While one can sometimes get away with this method, the problem is that the resulting PDF/A-1b is of poor quality: if the input document did not contain metadata or Unicode mappings, then the conversion software will not be able to create this data out of thin air. It will typically generate empty metadata and incorrect Unicode mappings. For example, consider two PDF/A documents containing this paragraph of text:
The first document example1.pdf was generated from LaTeX sources using pdflatex without any special steps, and then converted to PDF/A-1b using Acrobat Pro. Let us try copying and pasting its contents into a text file. The result is gibberish:
It is obvious that the document will not be searchable, nor can it be read aloud by a screen reader. Foreign letters such as "ö" and "α", ligatures such as "fi" and "ffi", and most mathematical symbols have simply been omitted, while some other symbols were translated to random characters. The second document example2.pdf was generated by the methods described below. When copying and pasting from this document, we get this result:
While this is not absolutely perfect (for example, there are no subscripts and superscripts), it is certainly quite readable, and much better than the above.
Here are my step-by-step instructions for converting an existing LaTeX
document to PDF/A. These steps should work for most LaTeX documents.
To illustrate the steps, I will use my Ph.D. thesis as a running example. I wrote my thesis in 1997, long before I had ever heard of PDF, let alone PDF/A.
Within the metadata in the *.xmpdata file, all printable
ASCII characters except '\', '{', '}',
and '%' represent themselves. Also, all printable Unicode
characters from the basic multilingual plane (i.e., up to code point
U+FFFF) can be used directly with the UTF-8 encoding. Consecutive
whitespace characters are combined into a single space. Whitespace
after a macro such as \copyright, \backslash,
or \sep is ignored. Blank lines are not permitted. Moreover,
the following markup can be used:
\␣ - a literal space (for example after a macro)
\% - a literal '%'
\{ - a literal '{'
\} - a literal '}'
\backslash - a literal '\'
\copyright - the © copyright symbol

The macro \sep is only permitted within \Author, \Keywords, and \Org. It is used to separate multiple authors, keywords, etc.
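As a small illustration of these escaping rules, here is a sketch of what a few entries in a *.xmpdata file might look like. The names, the year, and the title text are invented placeholders, and the comment line assumes that '%' starts an ordinary LaTeX comment outside of the field arguments.

```latex
% Hypothetical entries; every name and value here is a placeholder.
\Title{Sets \{A, B\}: a 100\% elementary survey of the \backslash\ operator}
\Author{First Author\sep Second Author}
\Copyright{Copyright \copyright\ 2024 First Author and Second Author}
```

Going by the rules above, the title should come out as "Sets {A, B}: a 100% elementary survey of the \ operator". The "\ " after \backslash and \copyright is needed because plain whitespace following a macro is ignored; for the same reason, the space after \sep is harmless.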
Here is a complete list of user-definable metadata fields currently
supported, and their meanings. More may be added in the future.
General information:
\Author - the document's human author. Separate multiple authors with \sep.
\Title - the document's title.
\Keywords - list of keywords, separated with \sep.
\Subject - the abstract.
\Org - publishers.

Copyright information:
\Copyright - a copyright statement.
\CopyrightURL - location of a web page describing the owner and/or rights statement for this document.
\Copyrighted - 'True' if the document is copyrighted, and 'False' if it isn't. This is automatically set to 'True' if either \Copyright or \CopyrightURL is specified, but can be overridden. For example, if the copyright statement is "Public Domain", this should be set to 'False'.

Publication information:
\PublicationType - The type of publication. If defined, must be one of book, catalog, feed, journal, magazine, manual, newsletter, pamphlet. This is automatically set to "journal" if \Journaltitle is specified, but can be overridden.
\Journaltitle - The title of the journal in which the document was published.
\Journalnumber - The ISSN for the publication in which the document was published.
\Volume - Journal volume.
\Issue - Journal issue/number.
\Firstpage - First page number of the published version of the document.
\Lastpage - Last page number of the published version of the document.
\Doi - Digital Object Identifier (DOI) for the document, without the leading "doi:".
\CoverDisplayDate - Date on the cover of the journal issue, as a human-readable text string.
\CoverDate - Date on the cover of the journal issue, in a format suitable for storing in a database field with a 'date' data type.

See the file sample.xmpdata for an example.
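To show how these fields fit together, here is a sketch of a *.xmpdata file for a made-up journal article. It is not the contents of sample.xmpdata. Every value is a placeholder, including the ISSN, the DOI, and the URL. The YYYY-MM-DD form used for \CoverDate is my assumption about what counts as a database-friendly date.

```latex
% Hypothetical metadata for a journal article; all values are placeholders.
% General information
\Title{An Example Article Title}
\Author{First Author\sep Second Author}
\Keywords{first keyword\sep second keyword\sep third keyword}
\Subject{A one-paragraph abstract of the article goes here.}
\Org{Example University Press}
% Copyright information (\Copyrighted is set to True automatically)
\Copyright{Copyright \copyright\ 2024 Example University Press}
\CopyrightURL{https://www.example.org/copyright}
% Publication information (\PublicationType is set to journal automatically)
\Journaltitle{Journal of Examples}
\Journalnumber{1234-5678}
\Volume{12}
\Issue{3}
\Firstpage{101}
\Lastpage{118}
\Doi{10.1000/example.2024.101}
\CoverDisplayDate{March 2024}
% Assumed date format: YYYY-MM-DD
\CoverDate{2024-03-01}
```

Fields that do not apply can simply be left out; a thesis, for instance, would probably need only the general and copyright fields. For a public-domain document, one would add \Copyrighted{False} to override the automatic setting, as described above.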
While the above instructions were written for a general audience, here
are some additional remarks that may be relevant to students at
Dalhousie University.
Adobe and Acrobat are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries. All other trademarks are the property of their respective owners.