This ‘plugin’ is a DITA to WordPress importer. Specifically it is a WordPress import module which will take the two-pane ‘Web Help’ output from the DITA Open Toolkit and import the hierarchy of XHTML pages into WordPress. It will import images too, though not as WordPress attachments.
This tool was written as part of an online help project in my last job. As an add-on to WordPress to be distributed to customers, it was licensed under the GNU GPL Version 2 with the explicit understanding of my employers.
I have retained their copyright notice as it was written for them, though the concept, ideas and implementation are all mine.
There is also a zip file for you to download containing the sample DITA web help files that come with the DITA-OT.
Feedback is welcome. Please use the comment box below.
Here are the contents of the readme, almost verbatim:
It was written to import the XHTML output of the DITA Open Toolkit, a tool which takes XML topics in DITA format and converts them to a number of formats, including PDF, Win Help, and XHTML. It uses the body tag to grab what it needs.
It is very rough and specific to the in-house requirements of Northgate (my last company). It also works on WPMU.
It uses PHP5’s XML manipulation, and at least one part requires MySQL 5 (for sub-selects). It also has some quirky stuff in it. For instance, importing 1200 files in one go on Windows always used to time out (the PHP timeout calculation on Windows uses wall-clock time, not CPU time), so it can be restarted and it will process from where it left off.
I mentioned it and DITA over in this post on WP-Docs as part of this conversation.
It expects the XHTML output from producing ‘web help’ with DITA-OT 1.4. This is a hierarchical tree of XHTML files with a top-level, two-pane frame index file that has a table of contents in one pane and the help topics in the other.
It imports those help topics, grabbing the contents of the body tag and doing some manipulation to get everything to work in WordPress as well as satisfy the original requirements.
It uses a staging table (automatically created) and can be re-run to update the same topics (if you regenerate them). It can also be re-run to continue processing if there is a failure half way through.
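The restart behaviour can be pictured with a small sketch. It is written in Python with SQLite purely for illustration (the importer itself is PHP against MySQL), and the table name `staging`, the `done` column and the helper names are all assumptions rather than names taken from the source.

```python
import sqlite3

def make_staging(conn):
    # Hypothetical staging table: one row per topic file, plus a flag
    # recording whether that row has already been imported.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        "slug TEXT PRIMARY KEY, body TEXT, done INTEGER DEFAULT 0)"
    )

def stage(conn, slug, body):
    # Re-staging an existing slug replaces its content and marks it as
    # pending again, so regenerated topics are picked up on a re-run.
    conn.execute(
        "INSERT OR REPLACE INTO staging (slug, body, done) VALUES (?, ?, 0)",
        (slug, body),
    )
    conn.commit()

def process_batches(conn, handle, batch_size=75):
    # Work through pending rows one batch at a time. Each batch is
    # committed as a unit; if handle() fails, the batch in flight is
    # rolled back and retried on the next run.
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT slug, body FROM staging WHERE done = 0 "
            "ORDER BY slug LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return processed
        try:
            for slug, body in rows:
                handle(slug, body)
                conn.execute(
                    "UPDATE staging SET done = 1 WHERE slug = ?", (slug,)
                )
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        processed += len(rows)
```

Calling `process_batches` again after a failure simply continues with whatever rows are still marked as pending, which is the property described above.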
Basic processing is as follows:
You supply the path to the top of the DITA output tree (where index.html is generated). If under WPMU, you supply the blog into which you want to import.
It then loads all the files it can find (explicitly ignoring index.html) into the staging table.
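As a rough illustration of that file discovery step, here is a Python sketch (not the importer's PHP). The function name `find_topic_files` is hypothetical, and two details are assumptions: that only `.html`/`.htm` files count as topics, and that only the top-level index.html is skipped.

```python
import os

def find_topic_files(root):
    # Walk the DITA-OT output tree and collect every HTML file, as a
    # path relative to root, except the top-level frameset index.html.
    topics = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue  # assumption: other file types are not topics
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel == "index.html":
                continue  # the two-pane frame index is not a topic
            topics.append(rel.replace(os.sep, "/"))
    return sorted(topics)
```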
The load process does the following:
* converts the paths of any links to other files
* strips out the empty anchor tags that DITA-OT generates (it adds ids *and* empty anchors as fragment targets!)
* takes the meta tag ‘description’ and uses that as the excerpt
* takes the meta tag ‘keywords’ and pops them into a page meta tag
* looks for some specific internal meta tags and saves those (deleted, replaced-by, and prodname)
* finds image references, copies the images to the blog directory and adjusts the paths in the HTML (for WPMU it puts them in the correct blog files directory; for standard WP it puts them in the blog root!)
* removes some DITA-OT specific stuff we didn’t want (the short description for related links, though it leaves the links)
* finds a specific span that ought to be a heading and turns it into one (h3)
* finds the parent of the page (if there is one) and stores it so the hierarchy will work
* extracts the cleaned body contents ready to be the page contents
* grabs the HTML page title and uses that as the WP page title
* uses the DITA id as the slug for the page
* we also had a requirement that the DITA id of the page match the HTML filename; I’ve made that optional, in which case it uses the filename as the slug
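A few of the steps in the list above (title, description, keywords, empty anchors, body contents) can be sketched as follows. This is an illustrative Python version using the standard `html.parser` module, not the importer's PHP code; the class and function names and the sample markup are made up for the example, and link rewriting, image copying and slug handling are left out.

```python
import html
from html.parser import HTMLParser

class TopicExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False
        self._in_body = False
        self._out = []           # pieces of the rebuilt body markup
        self._anchor_at = None   # index in _out of a pending href-less <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "body":
            self._in_body = True
        elif self._in_body:
            if tag == "a" and "href" not in attrs:
                self._anchor_at = len(self._out)
            self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> or <br />.
        if tag == "meta":
            self.handle_starttag(tag, attrs)
        elif self._in_body:
            if tag == "a" and "href" not in dict(attrs):
                return  # a self-closed fragment target: drop it
            self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif self._in_body:
            if tag == "a" and self._anchor_at == len(self._out) - 1:
                self._out.pop()  # nothing between <a> and </a>: drop both
            else:
                self._out.append("</%s>" % tag)
            if tag == "a":
                self._anchor_at = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            # Character references arrive decoded, so escape them again.
            self._out.append(html.escape(data, quote=False))

def extract_topic(xhtml):
    parser = TopicExtractor()
    parser.feed(xhtml)
    parser.close()
    return {
        "title": parser.title.strip(),
        "excerpt": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "content": "".join(parser._out).strip(),
    }
```

Passing the text of one topic file to `extract_topic` returns a dictionary whose `content` entry is the body markup with the empty fragment anchors removed.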
The next step of the import looks to see which, if any, of the imported pages are updates to existing ones (the id/filename will match an existing slug). It will do an update for those rather than an insert, and it will record the updated ones that had comments so they can be squirted into a post about updates (internal requirement).
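The update-or-insert decision boils down to a slug lookup. A minimal Python sketch, assuming the existing pages are available as a slug-to-id mapping and comment counts as an id-to-count mapping (both hypothetical structures, as is the function name `plan_import`):

```python
def plan_import(staged_slugs, existing, comment_counts):
    # staged_slugs: slugs found in the staging table
    # existing: slug -> page id for pages already in the blog
    # comment_counts: page id -> number of comments on that page
    updates, inserts, announce = [], [], []
    for slug in staged_slugs:
        page_id = existing.get(slug)
        if page_id is None:
            inserts.append(slug)             # no matching slug: new page
        else:
            updates.append((page_id, slug))  # matching slug: update in place
            if comment_counts.get(page_id, 0) > 0:
                announce.append(slug)        # commented pages get reported
    return updates, inserts, announce
```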
Then it will process those updates. By the way, the WP revision stuff works — it will create a new revision for each time you update the page.
Next it inserts new pages.
Then it has to flush the rewrite rules. We had great problems with internal links and rewrite rules – so there is probably a bit of belt and braces stuff going on.
Next it revisits all the pages resolving the parents correctly — so that the hierarchy is created properly. And then it has to flush the rewrite rules again (the paths have changed).
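One plausible reason for doing this as a separate pass is that a child page may be inserted before its parent exists, so the parent's id is not yet known at insert time. A Python sketch of such a second pass follows; the function name and data structures are illustrative assumptions, and the only WordPress detail relied on is that a parent of 0 means a top-level page.

```python
def resolve_parents(parent_of, slug_to_id):
    # parent_of: slug -> parent slug (or None) as stored during the load
    # slug_to_id: slug -> page id, known once every page has been inserted
    resolved = {}
    for slug, parent_slug in parent_of.items():
        if parent_slug is None:
            parent_id = 0  # top-level page
        else:
            # Assumption: a parent that was never imported falls back to 0.
            parent_id = slug_to_id.get(parent_slug, 0)
        resolved[slug_to_id[slug]] = parent_id
    return resolved
```

The result maps each page id to the parent id it should be given, whatever order the pages were inserted in.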
Finally it calls update_guids, which is more belt and braces.
There is an option to empty your posts table before importing. You would not normally want to do that! Another option deletes the pages which are still referenced in the staging table. It is a clean-up step for use after a failed import, less drastic than cleaning up everything. It is hopefully not needed now.
In the source file itself there are a couple more settings you can adjust. There are two different debug levels: set $debug and/or $debug_extra to true.
The loop size (how many records to process at once) is adjustable (default 75).
And there is even an option to import posts instead of pages — this is experimental and probably wouldn’t work. For instance it needs to detect category meta tags in the XHTML and add them to the post. The add to post code is half there.
== Installation ==
Copy this file to the wp-admin/import folder. It is not a plugin.