Word Document to AsciiDoc Conversion
I had content in Word documents that I needed to convert to AsciiDoc for our book. Here are the steps I found to work best:
- Save Word doc as HTML
- Encode as UTF-8
- Use pandoc to convert from HTML to AsciiDoc
- Use Sublime Text 2 search and replace (using some regular expressions) to strip out crazy things
- Use Sublime Text 2 to perform any remaining formatting
Save Word doc as HTML
Open the document in Word, and then save as a web page. Select the "Save only Display Information into HTML" option when saving. Exit from Word (and wave it goodbye as you do!).
Encode as UTF-8
Open the HTML file in Sublime Text 2. Avert your eyes at the horror that is Word-formatted HTML. Reopen the file with UTF-8 encoding and save it.
If I don’t recode as UTF-8, then the next step will fail with the error:
pandoc: Cannot decode byte '\x6f': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
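If you would rather script the re-encoding than do it in an editor, here is a minimal Python sketch. The function name `reencode_to_utf8` is made up for illustration, and the default source encoding of Windows-1252 (`cp1252`) is an assumption about what Word wrote; if your HTML file declares a different charset in its `<meta>` tag, pass that instead.

```python
from pathlib import Path


def reencode_to_utf8(src, dest, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dest as UTF-8.

    source_encoding defaults to cp1252, which is an assumption about
    Word's output, not something guaranteed for every file.
    """
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = Path(src).read_text(encoding=source_encoding, errors="replace")
    Path(dest).write_text(text, encoding="utf-8")
    return text
```

For the file used in the next step, that would be something like `reencode_to_utf8("ConventionSheet.htm", "ConventionSheet.utf8.htm")`, where the output name is again just an example.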
Use Pandoc to convert from HTML to AsciiDoc
Run pandoc. For example, the following command takes ConventionSheet.htm and converts it to the AsciiDoc file file.asc:
pandoc -f html -t asciidoc -o file.asc ConventionSheet.htm
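With more than a handful of documents, it can be handy to run the same command from a script. The sketch below wraps the exact command above using Python's `subprocess` module; the helper names `build_pandoc_command` and `convert` are hypothetical, and it assumes `pandoc` is on your PATH.

```python
import subprocess


def build_pandoc_command(src, dest):
    """Return the argument list for the HTML-to-AsciiDoc conversion shown above."""
    return ["pandoc", "-f", "html", "-t", "asciidoc", "-o", dest, src]


def convert(src, dest):
    """Run pandoc on one file (assumes pandoc is installed and on PATH)."""
    # check=True raises CalledProcessError if pandoc exits with a non-zero status.
    subprocess.run(build_pandoc_command(src, dest), check=True)
```

Calling `convert("ConventionSheet.htm", "file.asc")` runs the same conversion as the command line above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.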
Use Sublime Text 2 search and replace (using some regular expressions) to strip out crazy things
Weird single quotes need to go:
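As a scripted alternative to the editor search and replace, here is a minimal Python sketch. It assumes the weird quotes are the typographic single quotes U+2018 and U+2019 and that plain apostrophes are what you want instead; your converted file may contain different characters, so check before running it. The function name `straighten_single_quotes` is made up for this example.

```python
# Assumption: the unwanted characters are the curly single quotes
# U+2018 (left) and U+2019 (right); both become a plain apostrophe.
CURLY_TO_STRAIGHT = {0x2018: "'", 0x2019: "'"}


def straighten_single_quotes(text):
    """Replace typographic single quotes with ASCII apostrophes."""
    return text.translate(CURLY_TO_STRAIGHT)
```

`str.translate` does the whole replacement in one pass, so there is no need for a regular expression here.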
If you had reviewing turned on in Word, then reviewer comments and changes will likely be present in the HTML. Remove these using a search and replace with the following regex in the search field:
\[line-through\]\*(.+)\*
When the matched text spans line breaks, you can use the single-line option (?s) in your regex so that . also matches newlines:
(?s)\[line-through\]\*.(.*?)\*
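The same cleanup can be scripted. This Python sketch reuses the multi-line pattern above unchanged and replaces every match with nothing, which assumes you want the struck-through text deleted outright; the names `STRUCK_TEXT` and `strip_struck_text` are hypothetical.

```python
import re

# Same pattern as the multi-line search above: (?s) lets "." match newlines,
# and the lazy (.*?) stops at the first closing asterisk.
STRUCK_TEXT = re.compile(r"(?s)\[line-through\]\*.(.*?)\*")


def strip_struck_text(asciidoc):
    """Delete every [line-through]*...* span from the AsciiDoc text."""
    return STRUCK_TEXT.sub("", asciidoc)
```

The lazy quantifier matters when a paragraph contains more than one struck-through span: a greedy `.*` would swallow everything between the first opening and the last closing asterisk, including text you want to keep.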
Use Sublime Text 2 to perform any remaining AsciiDoc formatting
Monospace any regex or other special characters in the document; left as plain text, these will cause problems for the AsciiDoc parser.
Edit the AsciiDoc document as you wish! Note that GitHub now natively displays AsciiDoc files (using Asciidoctor behind the scenes), just as it does for Markdown.