Archiving WordPress websites with Python and DITA/XML

Why we archive our WordPress websites

We have invested a lot of time and effort in creating content on several WordPress websites. For example, our News from Nan site holds over 2 GB of text and images created over a number of years.

News from Nan home page

Although we greatly appreciate the accessibility, interactivity, and flexibility of the websites, we wanted to archive the site content in a common file format such as PDF or EPUB, both to protect the current content and to preserve it if a site becomes unavailable or is taken down.

In the archive we wanted to include all the links, images, and metadata (for example, the categories and tags associated with pages and posts), and we wanted to increase the value and usability of our archive by converting the metadata into an index.

Our archiving process

The archiving process we developed has the following steps:

  1. Export the website using the export facility built into WordPress.
  2. Run a Python script to convert the WordPress output to DITA/XML.
  3. Use a DITA/XML processing tool to transform the DITA files to the output format. We have had success with the oXygen XML Editor, XMLmind DITA Editor Personal Edition, and DITA Open Toolkit. In our example, the output is a PDF file.

Why we use DITA/XML

DITA (Darwin Information Typing Architecture) is an OASIS (Organization for the Advancement of Structured Information Standards) standard used in structured authoring and publishing. We use DITA for the following reasons:

  • DITA is an open (not proprietary) standard and is widely used.
  • Many tools are available that can render the files into various output formats, including PDF, RTF, and EPUB.
  • Once the website content, images, and metadata have been converted to DITA format, they can be reused.

Step 1: Export the site using the WordPress export facility

We export all content, which includes posts, pages, and media.

Here is the exported file displayed in the oXygen editor.
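To give a feel for what the script has to read, here is a minimal sketch (not our actual script) of pulling posts and pages out of a WordPress export with Python's standard library. The export is an RSS-based XML file (WXR) in which each post, page, or attachment is an `item`. The `wp` namespace URI below assumes export version 1.2 and may differ for other WordPress releases; the function name `read_items` and the small inline sample are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Namespaces used in WordPress export (WXR) files. The version number in the
# "wp" URI is an assumption; check the <rss> element of your own export.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

def read_items(root):
    """Return one dict per post or page found in a parsed export tree."""
    items = []
    for item in root.iter("item"):
        post_type = item.findtext("wp:post_type", default="", namespaces=NS)
        if post_type not in ("post", "page"):
            continue  # skip attachments, menu items, and so on
        terms = {"category": [], "post_tag": []}
        for cat in item.findall("category"):
            domain = cat.get("domain")
            if domain in terms and cat.text:
                terms[domain].append(cat.text)
        items.append({
            "title": item.findtext("title", default=""),
            "type": post_type,
            "date": item.findtext("wp:post_date", default="", namespaces=NS),
            "html": item.findtext("content:encoded", default="", namespaces=NS),
            "categories": terms["category"],
            "tags": terms["post_tag"],
        })
    return items

# A tiny hand-written sample standing in for a real export file.
SAMPLE = """<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:wp="http://wordpress.org/export/1.2/">
 <channel>
  <item>
   <title>Hello</title>
   <content:encoded><![CDATA[<p>First post.</p>]]></content:encoded>
   <wp:post_date>2018-07-01 09:30:00</wp:post_date>
   <wp:post_type>post</wp:post_type>
   <category domain="category" nicename="family"><![CDATA[Family]]></category>
   <category domain="post_tag" nicename="travel"><![CDATA[Travel]]></category>
  </item>
  <item>
   <title>photo.jpg</title>
   <wp:post_type>attachment</wp:post_type>
  </item>
 </channel>
</rss>"""

posts = read_items(ET.fromstring(SAMPLE))
for post in posts:
    print(post["date"], post["title"], post["categories"], post["tags"])
```

For a real export, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`.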

Step 2: Run the Python script to transform the export file to DITA

The inputs to the script are the following:

  • The WordPress site export file.
  • A run parameter file, also in XML format, that defines how the export file is to be processed. The parameters include the following:
    • The path to the export file
    • The title of the archive output
    • The path to a DITA splash page and splash page image to be used in the output
    • An optional list of publication years to be selected
    • An optional list of categories to be selected
    • An optional list of tags to be selected

If no selections are made, the script output will include all of the content from the site.
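The selection logic can be sketched roughly as follows. This is an illustration only: the element names in the parameter file (`run`, `export-file`, `year`, `category`, `tag`) and the post dictionaries are hypothetical, and the sketch assumes that a post must satisfy every non-empty list, which is one plausible reading of how the selections combine.

```python
import xml.etree.ElementTree as ET

# Hypothetical run parameter file; the real file's element names may differ.
PARAMS_XML = """<run>
  <export-file>export/site.xml</export-file>
  <title>Site archive 2018</title>
  <years><year>2018</year></years>
  <categories/>
  <tags><tag>Travel</tag></tags>
</run>"""

def load_params(xml_text):
    """Read the run parameters into a plain dict; lists become sets."""
    root = ET.fromstring(xml_text)
    return {
        "export_file": root.findtext("export-file", default=""),
        "title": root.findtext("title", default=""),
        "years": {e.text for e in root.iter("year") if e.text},
        "categories": {e.text for e in root.iter("category") if e.text},
        "tags": {e.text for e in root.iter("tag") if e.text},
    }

def is_selected(post, params):
    """Apply each non-empty selection list; empty lists select everything."""
    if params["years"] and post["date"][:4] not in params["years"]:
        return False
    if params["categories"] and not params["categories"] & set(post["categories"]):
        return False
    if params["tags"] and not params["tags"] & set(post["tags"]):
        return False
    return True

# Illustrative posts, shaped as a reader of the export file might produce them.
POSTS = [
    {"title": "A", "date": "2018-07-01 09:30:00", "categories": ["Family"], "tags": ["Travel"]},
    {"title": "B", "date": "2017-03-12 18:00:00", "categories": [], "tags": ["Travel"]},
    {"title": "C", "date": "2018-02-02 08:15:00", "categories": [], "tags": []},
]

params = load_params(PARAMS_XML)
selected = [p["title"] for p in POSTS if is_selected(p, params)]
print(selected)
```

With a parameter file that contains no years, categories, or tags, `is_selected` returns `True` for every post, which matches the "include everything" default.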

Here is the run parameter file, displayed in oXygen.

Here is the Python script.

The outputs produced by the script are the following:

  • A DITA concept topic for each page or post from the site. In addition to all the original text, each topic includes any images in the post or page and all of its category and tag metadata as indexterm elements.
  • Two ditamaps, one suitable for PDF processing, and one for HTML.
  • A log listing any errors encountered during processing.
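As a rough illustration of the first output, the sketch below builds a DITA concept topic with the categories and tags placed as `indexterm` elements in the topic prolog. It is a simplified stand-in for the real script: the helper names `topic_id` and `build_concept` are made up for this example, the body is assumed to be already split into plain-text paragraphs, and image handling and HTML-to-DITA conversion are left out.

```python
import re
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">'

def topic_id(title):
    """Derive an XML-safe id from a title; the prefix avoids a leading digit."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_").lower()
    return "t_" + slug

def build_concept(title, paragraphs, terms):
    """Return a concept topic as a string, with one indexterm per term."""
    concept = ET.Element("concept", id=topic_id(title))
    ET.SubElement(concept, "title").text = title
    if terms:
        # The prolog must come before the body in a DITA topic.
        prolog = ET.SubElement(concept, "prolog")
        metadata = ET.SubElement(prolog, "metadata")
        keywords = ET.SubElement(metadata, "keywords")
        for term in terms:
            ET.SubElement(keywords, "indexterm").text = term
    body = ET.SubElement(concept, "conbody")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    xml = ET.tostring(concept, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + xml

doc = build_concept("Hello, World", ["Tea & cake at Nan's."], ["Family", "Travel"])
print(doc)
```

Placing the index terms in the prolog keeps them out of the running text while still letting the DITA processor collect them into the index.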

Here is a DITA file.

Here is a DITA map.
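A ditamap is essentially an ordered list of references to the topic files. The sketch below generates a plain map with one `topicref` per topic; it does not reproduce whatever differences exist between our PDF and HTML maps, and the function name `build_map` and the file paths are illustrative only.

```python
import xml.etree.ElementTree as ET

MAP_DOCTYPE = '<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">'

def build_map(title, topic_files):
    """Return a simple ditamap string that references each topic in order."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = title
    for href in topic_files:
        ET.SubElement(root, "topicref", href=href, type="concept")
    xml = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + MAP_DOCTYPE + "\n" + xml

# Hypothetical topic paths, relative to the location of the map file.
ditamap = build_map("Site archive", ["topics/hello.dita", "topics/about.dita"])
print(ditamap)
```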

Step 3: Use a DITA processing tool to transform the DITA file to PDF

After the DITA files are produced, any of several tools can be used to process the ditamaps to produce a PDF, EPUB, or other output archive. We usually use oXygen, which is our tool of choice for DITA editing and production.
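When DITA Open Toolkit is the processing tool, the transformation can also be started from Python. The sketch below assumes the toolkit's `dita` command is on the PATH and accepts the short options `-i` (input map), `-f` (output format), and `-o` (output directory); confirm these against `dita --help` for your release. The function names and the map file name are hypothetical.

```python
import shutil
import subprocess

def dita_command(ditamap, fmt="pdf", out_dir="out", dita_exe="dita"):
    """Build the argument list for a DITA Open Toolkit run (assumed options)."""
    return [dita_exe, "-i", ditamap, "-f", fmt, "-o", out_dir]

def run_dita(ditamap, fmt="pdf", out_dir="out"):
    """Run the toolkit, raising an error if it is missing or the build fails."""
    exe = shutil.which("dita")
    if exe is None:
        raise FileNotFoundError("dita command not found on PATH")
    subprocess.run(dita_command(ditamap, fmt, out_dir, exe), check=True)

# Show the command that would be executed, without running it.
print(" ".join(dita_command("archive.ditamap")))
```

Calling `run_dita("archive.ditamap")` would then perform the build; keeping the command construction separate makes it easy to inspect before anything is executed.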

Archive output file

Our preferred output file is PDF. On the document’s title page we include information such as the date it was produced.

Here are a couple of example pages.

Tools we used to create the archive

Here are the specific tools and environments we use to produce the archive file:

  • Our News from Nan website runs the latest release of WordPress, which is currently 4.9.7.
  • The script is executed on a MacBook running macOS High Sierra 10.13.6.
  • We run the script with Python 3.7.0. All development was done using IDLE.
  • We have successfully processed the ditamaps using the following tools:
    • oXygen XML Editor 18.1
    • XMLmind DITA Editor Personal Edition 8.1.0
    • DITA Open Toolkit 2.3.3