A newbie's guide to .docx
.docx is the default file format in Microsoft Word 2007, the latest version of Word. All documents in the new Office family are based on an open,
standardized specification called Office Open XML.
A .docx file is actually a collection of many files stored in a single archive (a zip file). Let's dive into an example .docx file and see what's inside.
Here is an ordinary document (in Norwegian, sorry): a little bit of text and some
formatting. You could work with this document just like any other document you
created with Word before. However, if you are interested in the internal
representation, read on...
Step 1: Rename your document to .zip
Find the document in Windows Explorer, right-click it, and rename its extension from .docx to .zip.
Step 2: Extract zip-file to a new folder
Once the file is renamed to .zip, you can use it like any other zip file. Obviously,
we want to look inside, and this is where the magic appears. Extract it to the current
folder and a number of files and directories appear.
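The rename is only needed for Windows Explorer; a zip library can open the .docx file directly. Here is a minimal Python sketch of Steps 1 and 2. The helper names `list_package_parts` and `build_sample_package` are made up for this example, and the sample package is a stand-in built in memory with placeholder contents, not the document from this guide.

```python
import io
import zipfile


def list_package_parts(docx_file):
    """Return the names of all files stored inside a .docx package."""
    # zipfile looks at the file contents, not the extension, so the
    # .docx file does not have to be renamed to .zip first.
    with zipfile.ZipFile(docx_file) as package:
        return sorted(package.namelist())


def build_sample_package():
    """Build a tiny stand-in package in memory (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in ("[Content_Types].xml", "_rels/.rels",
                     "docProps/app.xml", "docProps/core.xml",
                     "word/document.xml"):
            package.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_package_parts(build_sample_package()):
        print(name)
```

For a real file you would pass its path instead of the in-memory buffer. Calling `extractall()` on the opened `ZipFile` unpacks everything into a folder, which matches Step 2.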
Step 3: Explore the various files and folders
At the root level, we have three folders: "_rels", "docProps" and "word". In addition,
we have a file called [Content_Types].xml. The [Content_Types].xml file describes
the contents of the zip package and is used internally by Word as a table of contents
for further processing. The _rels folder holds a map of the relationships
within the package: it lists the files in the package and how they relate
to each other.
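To make the "table of contents" idea concrete, here is a small Python sketch that reads a [Content_Types].xml file. The sample XML is a hand-written minimal illustration, not the file from the example document, and `read_content_types` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in Office Open XML packages.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hand-written minimal sample (an assumption, not taken from the example file).
SAMPLE_CONTENT_TYPES = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def read_content_types(content_types_xml):
    """Return (defaults by file extension, overrides by part name)."""
    root = ET.fromstring(content_types_xml)
    defaults = {
        item.get("Extension"): item.get("ContentType")
        for item in root.findall(f"{{{CT_NS}}}Default")
    }
    overrides = {
        item.get("PartName"): item.get("ContentType")
        for item in root.findall(f"{{{CT_NS}}}Override")
    }
    return defaults, overrides


if __name__ == "__main__":
    defaults, overrides = read_content_types(SAMPLE_CONTENT_TYPES)
    for part_name, content_type in overrides.items():
        print(part_name, "->", content_type)
```

Default entries apply to every file with a given extension, while Override entries assign a content type to one specific file in the package.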
Folder: _rels
In a minimal document it holds a single file, .rels, which is an XML file.
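As a rough illustration, the Python sketch below parses a .rels file and lists where each relationship points. The sample XML is a hand-written minimal example (a real .rels file usually has more entries, for instance for the docProps files), and `read_relationships` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by .rels files in Office Open XML packages.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hand-written minimal sample (an assumption, not the example document's file).
SAMPLE_RELS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""


def read_relationships(rels_xml):
    """Return one dict (Id, Type, Target) per Relationship element."""
    root = ET.fromstring(rels_xml)
    return [
        {"Id": rel.get("Id"), "Type": rel.get("Type"), "Target": rel.get("Target")}
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
    ]


if __name__ == "__main__":
    for rel in read_relationships(SAMPLE_RELS):
        print(rel["Id"], "->", rel["Target"])
```

The Target attribute is how a reader of the package finds the main document without hard-coding its path.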
Folder: docProps
The docProps folder contains at least app.xml and core.xml. These files hold meta-information about the document, such as its creator and when it was last opened, saved, edited and so forth. They also hold the word count, the number of paragraphs, and so on.
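Here is a hedged Python sketch of how this meta-information could be read. Both XML samples are hand-written with invented values (author name, date, counts); they are not the files from the example document. `read_metadata` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces used by core.xml and app.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

# Invented sample data, for illustration only.
SAMPLE_CORE = b"""<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>Example Author</dc:creator>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2007-01-01T12:00:00Z</dcterms:modified>
</cp:coreProperties>"""

SAMPLE_APP = b"""<Properties
    xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">
  <Words>9</Words>
  <Paragraphs>2</Paragraphs>
</Properties>"""


def read_metadata(core_xml, app_xml):
    """Collect a few properties from core.xml and app.xml into one dict."""
    core = ET.fromstring(core_xml)
    app = ET.fromstring(app_xml)
    return {
        "creator": core.findtext("dc:creator", namespaces=NS),
        "modified": core.findtext("dcterms:modified", namespaces=NS),
        # Missing counters fall back to 0 instead of raising an error.
        "words": int(app.findtext("ep:Words", default="0", namespaces=NS)),
        "paragraphs": int(app.findtext("ep:Paragraphs", default="0", namespaces=NS)),
    }


if __name__ == "__main__":
    print(read_metadata(SAMPLE_CORE, SAMPLE_APP))
```

In this sketch the creator and modification date come from core.xml, and the counters come from app.xml.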
Folder: word
Moving on to the word folder, we get to the actual content of the Word document. As the folder structure shows, it contains a number of XML files. The
most important XML file in the entire zip package is document.xml.
Why? Because this is where the content as you know it is stored. Let's look at the one from
our Hello World example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="00B45928" w:rsidRDefault="004D007F" w:rsidP="004D007F">
      <w:pPr>
        <w:pStyle w:val="IntenseQuote" />
      </w:pPr>
      <w:r>
        <w:t>Heisann, dette er en test!</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="004D007F" w:rsidRDefault="004D007F" w:rsidP="004D007F" />
    <w:p w:rsidR="004D007F" w:rsidRPr="004D007F" w:rsidRDefault="004D007F" w:rsidP="004D007F">
      <w:r>
        <w:t>Jommen sa jeg smør....</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="004D007F" w:rsidRPr="004D007F" w:rsidSect="00B45928">
      <w:pgSz w:w="11906" w:h="16838" />
      <w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>
  </w:body>
</w:document>
Here you can recognize our text from the first screenshot. Inside all that XML, our content sits in plain text... *phew*. So, if you are really desperate and
need the actual text from a document, this is the place to look. But I recommend
that you use this online conversion tool or, even better, purchase Microsoft Office
2007 and start creating .docx files yourself.
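If you do want to pull the text out yourself, the lookup described above can be sketched in Python. This is a minimal illustration that assumes the text you need sits in w:t elements inside w:p paragraphs of word/document.xml; it ignores headers, footnotes, and other parts. The helper names and the trimmed-down sample package are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed-down stand-in for the document.xml shown in this guide.
SAMPLE_DOCUMENT_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Heisann, dette er en test!</w:t></w:r></w:p>
    <w:p />
    <w:p><w:r><w:t>Jommen sa jeg smør....</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def build_sample_docx():
    """Build an in-memory package that only contains word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", SAMPLE_DOCUMENT_XML.encode("utf-8"))
    buffer.seek(0)
    return buffer


def extract_paragraphs(docx_file):
    """Return the plain text of each w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph may be split over several runs; join all of its w:t texts.
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


if __name__ == "__main__":
    for line in extract_paragraphs(build_sample_docx()):
        print(line)
```

For a real document you would pass the path to the .docx file instead of the in-memory sample. An empty paragraph, like the second one here, comes back as an empty string.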