Friday, July 12, 2013

Reading the Microsoft Word docx file format

After having done some programming to read Microsoft Word files, I thought I'd write about how the Word 2007 (Office Open XML) file format is put together. This isn't a complete reference, but it will get you started.

Cracking the door open

When investigating a mystery file, the first thing a Unix junkie does is run file on it. file is a nifty program that will try to identify what sort of data it's looking at, without paying any attention to the file extension. Let's do that now:
$ ls
Lecture 1.docx
$ file Lecture\ 1.docx 
Lecture 1.docx: Zip archive data, at least v2.0 to extract

That's cool. It's really a zip archive! Let's unzip it:
$ mkdir data 
$ cd data 
$ unzip ../Lecture\ 1.docx 
Archive:  ../Lecture 1.docx
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/footnotes.xml      
  inflating: word/endnotes.xml       
  inflating: word/header1.xml        
  inflating: word/theme/theme1.xml   
 extracting: docProps/thumbnail.jpeg  
  inflating: word/settings.xml       
  inflating: word/fontTable.xml      
  inflating: word/styles.xml         
  inflating: word/stylesWithEffects.xml  
  inflating: docProps/app.xml        
  inflating: docProps/core.xml       
  inflating: word/webSettings.xml    
  inflating: word/numbering.xml   
$ ls -F
[Content_Types].xml  _rels/         docProps/     word/
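
The same peek inside can be done from a program. Here's a minimal Python sketch using only the standard zipfile module. list_docx_parts is a made-up helper name, and the tiny in-memory archive only stands in for a real .docx, so treat it as an illustration rather than a finished tool:

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the names of the parts inside a .docx (path or file-like object)."""
    if not zipfile.is_zipfile(source):
        raise ValueError("not a zip archive, so not a .docx")
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

print(list_docx_parts(buffer))
```

With a real document you would pass the path instead, for example list_docx_parts("Lecture 1.docx").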

Delve deeper and deeper, for all is in it

Alright! Now we are getting somewhere. The most important directory is the word directory, so I'll cd there:
$ cd word
$ ls -F
_rels/                 footnotes.xml          styles.xml
document.xml           header1.xml            stylesWithEffects.xml
endnotes.xml           numbering.xml          theme/
fontTable.xml          settings.xml           webSettings.xml
There are a bunch of useful files here. Some highlights:
  • document.xml - Contains the text of the document. In my case, it also holds the audio notes I recorded during the lecture.
  • media/ - This folder may or may not be present, depending on whether you have inserted any pictures. Yep. They're stored here.
  • numbering.xml - Tells Word how to indent and number or bullet any auto-lists you might have. Also tells it how to number or bullet any lists you don't have. To nine levels of indentation (w:ilvl 0 through 8). Yeah.
  • _rels/document.xml.rels - If you have hyperlinks or embedded pictures, the hyperlink URLs and internal paths to the pictures are here.
  • styles.xml - Contains info about the default font face and size in the document. If you are one of the few people who bother to use Styles in Word instead of manually choosing fonts and colors, that info is here too.
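
To make the _rels/document.xml.rels entry concrete, here's a rough Python sketch of reading it. The sample XML is invented for illustration (a real file has more entries), and read_relationships is a hypothetical helper name, not part of any library:

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship files inside the package.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

# Invented sample: one external hyperlink and one embedded picture.
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" TargetMode="External" Target="http://example.com/" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"/>'
    '<Relationship Id="rId2" Target="media/image1.png" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>'
    '</Relationships>'
)


def read_relationships(rels_xml):
    """Map each relationship Id to a (kind, target) pair."""
    root = ET.fromstring(rels_xml)
    result = {}
    for rel in root.iter(REL_NS + "Relationship"):
        # The last path segment of the Type URI says what the target is.
        kind = rel.get("Type", "").rsplit("/", 1)[-1]
        result[rel.get("Id")] = (kind, rel.get("Target"))
    return result


print(read_relationships(SAMPLE_RELS))
```

Picture targets such as media/image1.png are relative to the word directory, which is how they lead back to the media/ folder.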

Rubber Ducky, you're the one... You make bath time so much fun...

If we open one of these files up, it's going to look nasty: all the tags are smooshed together. Use your favorite text editor to tidy it with pretty indents and line breaks. Since I'm using vim and I have xmllint, I added this key mapping to my .vimrc:
map ,x :silent 1,$!xmllint --format --recover - 2>/dev/null <CR>
The lone dash after --recover is important! It tells xmllint to read from standard input.
To invoke the command, I leave insert mode and press the comma and x in sequence. Works great! (Or so I thought, until I fed it my monster document.xml with 17 MB of embedded audio data. More on that later. You probably won't have a 17 MB XML document, though.)
If you don't have xmllint, you can get it on Debian-based Linux systems (such as Ubuntu) with:
$ sudo apt-get install libxml2-utils
If you decide to modify things and zip it up into a docx again, Microsoft Word won't care about the new whitespace. It'll read the document just fine, then cheerfully obliterate the indents and newlines when saving again.
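
If you'd rather tidy the XML from a script than from an editor, Python's standard library can do a similar job. This is only a sketch: pretty_xml is a made-up name, and unlike xmllint --recover, minidom simply raises an error on malformed XML instead of trying to repair it:

```python
from xml.dom import minidom


def pretty_xml(xml_text):
    """Re-indent XML with two-space indents and one element per line."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


# A squashed one-liner comes back with indents and line breaks.
print(pretty_xml("<a><b>hi</b></a>"))
```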

OOXML in a nutshell

For most programs reading OOXML (Office Open XML), you only need to know about a handful of different tags. I've made a tree of some of them for convenience:
  • w:document
    • w:body
      • w:p - paragraph
        • w:pPr - paragraph properties
          • w:ind - indent
          • w:jc - justification
          • w:pStyle - paragraph style ID (looked up in word/styles.xml)
          • w:numPr - numbered/bulleted list properties. Links against word/numbering.xml
            • w:ilvl - indent level into the list
            • w:numId - ID of the list (a w:num entry in word/numbering.xml, which in turn points to an abstract numbering definition)
          • w:tabs - tabstops
            • w:tab - tabstop position
        • w:r - styled text run
          • w:t - the actual text itself
          • w:tab - a tab character (different from w:tab above!)
          • w:rPr - run properties
            • w:b - bold
            • w:color - text color
            • w:highlight
            • w:i - italic
            • w:rFonts - font name
            • w:strike - strikethrough
            • w:sz - font size
            • w:u - underline
        • w:hyperlink - contains 1 or more w:r's, as above. Links against word/_rels/document.xml.rels
Since different runs of text may have different styling, each paragraph (w:p) contains several runs (w:r). Since a run may contain a tab in the middle, each run may have multiple tabs (w:tab) and texts (w:t).
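
Here's a small Python sketch of walking that paragraph/run/text structure to pull out plain text. It uses only the standard library. The sample XML is a hand-written fragment rather than real Word output, paragraph_texts is a hypothetical helper name, and the sketch ignores everything except w:t and w:tab inside runs:

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace-uri}localname".
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hand-written fragment: a bold run, a run with a tab, and a hyperlink run.
SAMPLE_DOCUMENT = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:tab/><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:hyperlink><w:r><w:t>a link</w:t></w:r></w:hyperlink></w:p>'
    '</w:body></w:document>'
)


def paragraph_texts(document_xml):
    """Return one plain-text string per w:p, joining the text of its runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        pieces = []
        # iter() also finds runs nested inside w:hyperlink.
        for run in paragraph.iter(W + "r"):
            for child in run:
                if child.tag == W + "t":
                    pieces.append(child.text or "")
                elif child.tag == W + "tab":
                    pieces.append("\t")
        paragraphs.append("".join(pieces))
    return paragraphs


print(paragraph_texts(SAMPLE_DOCUMENT))
```

Looking only at the direct children of each run is what keeps the tab character w:tab separate from the tabstop w:tab, which lives under w:pPr.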
Font sizes (w:sz) are in half-points, so a value of 24 means 12 pt. Indent values are measured in an interesting unit: twips. There are 1440 twips per inch (20 per point), or about 567 per cm.
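
The unit arithmetic is easy to get wrong, so here it is as a few lines of Python. The function names are invented for this sketch:

```python
TWIPS_PER_INCH = 1440
CM_PER_INCH = 2.54


def twips_to_inches(twips):
    """Convert an indent value in twips to inches."""
    return twips / TWIPS_PER_INCH


def twips_to_cm(twips):
    """Convert an indent value in twips to centimetres."""
    return twips / TWIPS_PER_INCH * CM_PER_INCH


def half_points_to_points(half_points):
    """Convert a font size stored in half-points to points."""
    return half_points / 2


print(twips_to_inches(720))        # a half-inch indent
print(half_points_to_points(24))   # a 12 pt font
```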
If you don't see how to encode something here, create different Word documents and experiment. That's how I found all of these. It's possible to create Word documents programmatically by copying everything but the document.xml file, and then creating that one component from scratch! (In fact, that's exactly what python-docx does....)
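
The copy-everything-but-document.xml trick can be sketched with the standard zipfile module. This is an illustration under assumptions, not how any particular library does it: replace_document_xml is a made-up name, and the in-memory "template" below is a stand-in with two tiny parts, not a valid Word file:

```python
import io
import zipfile


def replace_document_xml(source, destination, new_document_xml):
    """Copy every part of a .docx, swapping in new word/document.xml content."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name == "word/document.xml":
                dst.writestr(name, new_document_xml)
            else:
                dst.writestr(name, src.read(name))


# Stand-in template so the sketch runs without a real document.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as archive:
    archive.writestr("word/styles.xml", "<styles/>")
    archive.writestr("word/document.xml", "<old/>")

output = io.BytesIO()
replace_document_xml(template, output, "<new/>")

with zipfile.ZipFile(output) as archive:
    print(archive.read("word/document.xml"))
    print(archive.read("word/styles.xml"))
```

With real files, source and destination would be paths, and the new XML would be a complete w:document tree.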

Special stuff: Extracting the audio

In my file, I had recorded a lecture, but when listening to it, I discovered that it was too soft to hear easily. So, I mucked around with it.
In the first step, I Googled the tag name (w:fldData) where I found the audio data. That turned up this nice little gem:
...fldData... Word expects Base64 encoded data....
And that told me everything I needed to know. I copied and pasted the base64 data into a new text file, saved it, and then ran it through base64 to decode it in the terminal (these are the macOS flags; GNU base64 on Linux uses -d and shell redirection instead):
$ base64 -D -i base64audio.txt -o audio
$ file audio
audio: ISO Media, Apple QuickTime movie
Aha! I'll just stick on the correct file extension now.
$ mv audio audio.mov
And now I can double-click it and open it! Great!
I modified it in Audacity, saved it as an AAC (which is how the original was), and ran the result through base64 to encode it again:
$ base64 -b 80 -i audio2 -o audio2.txt
Unfortunately, sticking this into the OOXML hasn't been working; there are evidently several ways of encoding AAC, and Word is expecting a particular format of AAC. If anyone's got any ideas, please comment!
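
For reference, the decode and re-encode steps above can also be done in Python with the standard base64 module. This is only a sketch with invented helper names, and it does nothing about the AAC format problem. It just reproduces the base64 round trip, including the 80-column line wrapping:

```python
import base64


def decode_field_data(b64_text):
    """Decode base64 text copied out of the XML, ignoring line breaks and spaces."""
    return base64.b64decode("".join(b64_text.split()))


def encode_field_data(data, width=80):
    """Encode bytes as base64 text wrapped at the given line width."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


# Stand-in bytes instead of real audio, to show the round trip.
sample = bytes(range(256))
wrapped = encode_field_data(sample)
print(wrapped)
print(decode_field_data(wrapped) == sample)
```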
