Cracking the door open

When investigating a mystery file, the first thing a Unix junkie does is run file on it. file is a nifty program that tries to identify what sort of data it's looking at, without paying any attention to the file extension. Let's do that now:
$ ls
Lecture 1.docx
$ file Lecture\ 1.docx
Lecture 1.docx: Zip archive data, at least v2.0 to extract
$ mkdir data
$ cd data
$ unzip ../Lecture\ 1.docx
Archive:  ../Lecture 1.docx
  inflating: [Content_Types].xml
  inflating: _rels/.rels
  inflating: word/_rels/document.xml.rels
  inflating: word/document.xml
  inflating: word/footnotes.xml
  inflating: word/endnotes.xml
  inflating: word/header1.xml
  inflating: word/theme/theme1.xml
 extracting: docProps/thumbnail.jpeg
  inflating: word/settings.xml
  inflating: word/fontTable.xml
  inflating: word/styles.xml
  inflating: word/stylesWithEffects.xml
  inflating: docProps/app.xml
  inflating: docProps/core.xml
  inflating: word/webSettings.xml
  inflating: word/numbering.xml
$ ls -F
[Content_Types].xml  _rels/  docProps/  word/
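If you'd rather poke at the archive from a script, roughly the same check and listing can be done with Python's standard library. This is only a sketch: the helper names looks_like_zip and list_members are made up for illustration, and the signature test is far cruder than what file does.

```python
import io
import zipfile

# Local-file-header signature found at the start of most zip archives.
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes begin with the zip local-file-header signature."""
    return data.startswith(ZIP_MAGIC)


def list_members(data: bytes) -> list:
    """Return the member names stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    # Build a tiny stand-in archive so the sketch runs without a real .docx.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    sample = buffer.getvalue()

    print(looks_like_zip(sample))
    print(list_members(sample))
```

For a real document you would read the bytes with open("Lecture 1.docx", "rb").read() and pass them to the same two functions.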
Delve deeper and deeper, for all is in it

Alright! Now we are getting somewhere. The most important directory is the word directory, so I'll cd there:
$ cd word
$ ls -F
_rels/         footnotes.xml  styles.xml
document.xml   header1.xml    stylesWithEffects.xml
endnotes.xml   numbering.xml  theme/
fontTable.xml  settings.xml   webSettings.xml
- document.xml - Contains the text of the document. In my case, it also holds the audio notes I recorded during the lecture.
- media/ - This folder may or may not be present, depending on whether you have inserted any pictures. Yep, they're stored here.
- numbering.xml - Tells Word how to indent and number or bullet any auto-lists you might have. Also tells it how to number or bullet any lists you don't have. To 8 levels of indentation. Yeah.
- _rels/document.xml.rels - If you have hyperlinks or embedded pictures, the hyperlink URLs and internal paths to the pictures are here.
- styles.xml - Contains info about the default font face and size in the document. If you are one of the few people who bother to use Styles in Word instead of manually choosing fonts and colors, that info is here too.
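To make the relationships file from the list above a bit more concrete, here is a rough sketch that maps each relationship id to its target. The sample XML is a hand-written stand-in, not taken from my lecture file, and relationship_targets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the elements in word/_rels/document.xml.rels.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-written stand-in for a real .rels file (illustrative values only).
SAMPLE_RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="{OFFICE_REL}/hyperlink"
                Target="http://example.com/" TargetMode="External"/>
  <Relationship Id="rId2" Type="{OFFICE_REL}/image"
                Target="media/image1.png"/>
</Relationships>"""


def relationship_targets(rels_xml: str) -> dict:
    """Map each relationship Id to its Target (a URL or an internal path)."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    }


if __name__ == "__main__":
    for rel_id, target in relationship_targets(SAMPLE_RELS).items():
        print(rel_id, "->", target)
```

The ids are what the main document uses to point at a link or picture, so a lookup table like this is usually all you need from the file.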
Rubber Ducky, you're the one... You make bath time so much fun...

If we open one of these files up, it's going to look nasty: all the tags are smooshed together. Use your favorite text editor to tidy it with pretty indents and line breaks. Since I'm using vim and I have xmllint, I added this key mapping to my .vimrc:
map ,x :silent 1,$!xmllint --format --recover - 2>/dev/null <CR>
To invoke the command, I leave insert mode and press comma, then x. Works great! (Or so I thought, until I fed it my monster document.xml with 17 MB of embedded audio data. More on that later. You probably won't have a 17 MB XML document, though.)
If you don't have xmllint, you can get it on Debian-based Linux systems (Ubuntu, for example) with:
$ sudo apt-get install libxml2-utils
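If installing xmllint isn't an option, Python's standard library can do a similar tidy-up. A minimal sketch, assuming the XML is well-formed: unlike xmllint --recover, minidom raises an error on broken input, and pretty_xml is just a name I picked for the example.

```python
import xml.dom.minidom


def pretty_xml(raw: str) -> str:
    """Re-indent an XML string with two-space indents and line breaks."""
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")


if __name__ == "__main__":
    # A tiny smooshed-together snippet standing in for a real document.xml.
    print(pretty_xml("<a><b>hi</b><c><d/></c></a>"))
```

minidom builds the whole tree in memory, so expect it to be slow on very large files.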
OOXML in a nutshell

For most programs reading OOXML (Office Open XML), you only need to know about a handful of different tags. I've made a tree of some of them for convenience:
- w:p - paragraph
  - w:pPr - paragraph properties
    - w:ind - indent
    - w:jc - justification
    - w:pStyle - paragraph style name
    - w:numPr - numbered/bulleted list properties. Links against word/numbering.xml
      - w:ilvl - indent level into the list
      - w:numId - numbering definition id (a w:num entry in word/numbering.xml)
    - w:tabs - tabstops
      - w:tab - tabstop position
  - w:r - styled text run
    - w:t - the actual text itself
    - w:tab - a tab character (different from w:tab above!)
    - w:rPr - run properties
      - w:b - bold
      - w:color - text color
      - w:i - italic
      - w:rFonts - font name
      - w:strike - strikethrough
      - w:sz - font size
      - w:u - underline
  - w:hyperlink - contains 1 or more w:r's, as above. Links against word/_rels/document.xml.rels
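To show how little of the tree you need in practice, here is a rough sketch that walks w:p, w:r, w:t and w:b with Python's ElementTree. The sample document is hand-written for the example, and paragraphs is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for a real document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:jc w:val="center"/></w:pPr>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Hello, </w:t></w:r>
      <w:r><w:t>world</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraphs(xml_text: str) -> list:
    """Return one list per w:p, holding a (text, is_bold) tuple per w:r."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        runs = []
        # Only direct child runs; runs nested in w:hyperlink are skipped here.
        for r in p.findall("w:r", NS):
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            # Simplification: any w:b element counts as bold, even w:val="0".
            bold = r.find("w:rPr/w:b", NS) is not None
            runs.append((text, bold))
        result.append(runs)
    return result


if __name__ == "__main__":
    print(paragraphs(SAMPLE))
```

The same pattern extends to the other run properties: look for w:i, w:u or w:sz under w:rPr in the same way.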
Font sizes (w:sz) are in half-points, so a value of 24 means 12 pt. Indent values are measured in an interesting unit: twips. A twip is a twentieth of a point, so there are 1440 twips per inch, or about 567 per cm.
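The arithmetic is simple enough to do in your head, but for reference, here it is as a small sketch. The function names are mine, chosen for the example.

```python
TWIPS_PER_INCH = 1440
CM_PER_INCH = 2.54


def twips_to_inches(twips: float) -> float:
    """Convert an indent value in twips to inches."""
    return twips / TWIPS_PER_INCH


def twips_to_cm(twips: float) -> float:
    """Convert an indent value in twips to centimeters."""
    return twips / TWIPS_PER_INCH * CM_PER_INCH


def half_points_to_points(half_points: float) -> float:
    """Convert a w:sz value (half-points) to points."""
    return half_points / 2


if __name__ == "__main__":
    print(twips_to_inches(720))       # half an inch
    print(round(twips_to_cm(567), 3))
    print(half_points_to_points(24))
```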
If you don't see how to encode something here, create different Word documents and experiment. That's how I found all of these. It's possible to create Word documents programmatically by copying everything but the document.xml file, and then creating that one component from scratch! (In fact, that's exactly what python-docx does....)
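The copy-everything-but-document.xml trick can be sketched in a few lines with the zipfile module. This is my own illustration of the idea, not python-docx's actual code, and replace_document_xml is a hypothetical helper name.

```python
import io
import zipfile


def replace_document_xml(template: bytes, new_document_xml: str) -> bytes:
    """Copy every member of a .docx, swapping in new word/document.xml content."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # The one component we build from scratch.
                dst.writestr(item.filename, new_document_xml)
            else:
                # Everything else is copied over unchanged.
                dst.writestr(item, src.read(item.filename))
    return out.getvalue()


if __name__ == "__main__":
    # Tiny stand-in template so the sketch runs without a real .docx.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<old/>")

    result = replace_document_xml(buffer.getvalue(), "<new/>")
    with zipfile.ZipFile(io.BytesIO(result)) as archive:
        print(archive.namelist())
        print(archive.read("word/document.xml"))
```

Whether Word accepts the result depends entirely on the new document.xml being valid and consistent with the other parts (relationships, numbering and so on), which this sketch does not check.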
Special stuff: Extracting the audio

In my file, I had recorded a lecture, but when listening to it, I discovered that it was too soft to hear easily. So, I mucked around with it.
First, I Googled the name of the tag where I had found the audio data (w:fldData). That turned up this nice little gem:
...fldData... Word expects Base64 encoded data....

And that told me everything I needed to know. I copied and pasted the base64 data into a new text file, saved it, and then ran it through base64 to decode it in the terminal:
$ base64 -D -i base64audio.txt -o audio
$ file audio
audio: ISO Media, Apple QuickTime movie
$ mv audio audio.mov
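The copy-and-paste step can also be scripted. Here is a rough sketch that pulls every w:fldData payload out of the XML and decodes it. The sample document is a stand-in with a few fake bytes rather than real audio, and extract_fld_data is a name I made up.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Stand-in payload and document; a real file would hold the audio here.
FAKE_PAYLOAD = b"not really audio"
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body><w:p><w:r>
    <w:fldChar w:fldCharType="begin">
      <w:fldData>{base64.b64encode(FAKE_PAYLOAD).decode("ascii")}</w:fldData>
    </w:fldChar>
  </w:r></w:p></w:body>
</w:document>"""


def extract_fld_data(xml_text: str) -> list:
    """Return the decoded bytes of every w:fldData element in the document."""
    root = ET.fromstring(xml_text)
    # b64decode ignores line breaks and other non-alphabet characters by default.
    return [
        base64.b64decode(el.text or "")
        for el in root.iter(f"{{{W}}}fldData")
    ]


if __name__ == "__main__":
    for index, blob in enumerate(extract_fld_data(SAMPLE)):
        print(index, len(blob), "bytes")
```

With a real document you would write each blob to disk and run file on it, as above, to see what you got.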
I modified it in Audacity, saved it as AAC (the same format as the original), and ran the result through base64 to encode it again:
$ base64 audio2 -o audio2.txt -b 80
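The re-encoding step has a straightforward Python equivalent, in case your base64 doesn't take the same flags. A minimal sketch: encode_wrapped is a hypothetical helper, and the default width of 80 simply mirrors the -b 80 above.

```python
import base64


def encode_wrapped(data: bytes, width: int = 80) -> str:
    """Base64-encode bytes and break the result into lines of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(
        encoded[start:start + width]
        for start in range(0, len(encoded), width)
    )


if __name__ == "__main__":
    # Stand-in bytes; a real run would read the edited audio file instead.
    print(encode_wrapped(bytes(range(256))))
```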