The “corrupt Word docx” scam

This is an oldie but a goodie: suppose the homework paper is due on Friday at midnight, and you go to grade the papers on Sunday. You open a .docx paper from a student, and Word shows you the following messages:


What happened? Well, most of us will contact the student and say, “Hey, your file was corrupted!” and then the student says, “Oh, I’m so sorry” and returns a “correct” version. Was it a simple mistake, or did the student just scam you out of 2 extra days to work on the paper?

There are instructions online for how to purposely corrupt a file for this purpose, plenty of videos on how to do it, and there is even a whole Web site that will do the work for you: (I am reluctant to link to these bottom feeders, so you’ll have to type these in yourself.)

With a little work, it is possible to recover any existing text from a corrupt .docx file. Here are the steps to do so. I am on a Mac, but variations of this might work on Windows as well.

1. Download the .docx file to a safe place on your computer. I put mine in a subdirectory on my Desktop called ‘test’.

2. Find out what type of file you are really dealing with. Open Terminal (in Applications | Utilities) and run the ‘file’ command.

flossmole2:Desktop megan$ cd test

flossmole2:test megan$ ls -l
-rw-r--r--@ 1 megan staff 499449 Dec 10 15:41 test.docx

flossmole2:test megan$ file test.docx
test.docx: Microsoft OOXML
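
If you would rather do this check from a script, here is a minimal Python sketch of the same idea. It is not how the ‘file’ command works internally (that tool consults a much larger rule database); it only peeks at the first few bytes of the file. The name sniff_container is made up for this illustration.

```python
# Hypothetical helper: guess the container format from the leading "magic" bytes.
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header (.docx is a ZIP)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # older binary Office files (.doc)


def sniff_container(head: bytes) -> str:
    """Return 'zip', 'ole2', or 'unknown' based on the first bytes of a file."""
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"
```

To try it on the file from step 1, read the first eight bytes with open("test.docx", "rb").read(8) and pass them in. A result of "unknown" on something named .docx is a hint that the file was never a Word document in the first place.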

3. It looks like we are dealing with an OOXML file. This is really a ZIP archive holding the XML files and folders that together make up the “single” Word document. We need to uncompress it and start poking around at the folder and file structure stored within it. This document from Microsoft explains the different folders and files within an OOXML file.
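
Because the package is a ZIP archive, a quick integrity check is possible before doing anything by hand. The sketch below uses Python's standard zipfile module; the function name package_status and the three status strings are my own invention, not part of any Word or Microsoft tooling.

```python
import io
import zipfile
import zlib


def package_status(data: bytes) -> str:
    """Classify a .docx (ZIP) payload as 'ok', damaged, or unreadable."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every member and returns the first name
            # whose CRC does not match, or None if all members check out.
            bad = zf.testzip()
    except (zipfile.BadZipFile, zlib.error, EOFError):
        # The archive directory or a compressed stream is too damaged
        # for the standard library to read.
        return "unreadable archive"
    return f"bad CRC in {bad}" if bad else "ok"
```

An intact paper should come back as "ok". Anything else means you are in the situation the rest of this post deals with.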

4. The first thing we need to do is uncompress the file. To do this, rename the file so it has a .zip extension (“mv oldname newname”), and then use the command-line unzipper (“unzip newname”) to uncompress and extract it:

flossmole2:test megan$ mv test.docx

flossmole2:test megan$ unzip 
error []: missing 126 bytes in zipfile
 (attempting to process anyway)
error []: attempt to seek before beginning of zipfile
 (please check that you have transferred or created the zipfile in the
 appropriate BINARY mode and that you have compiled UnZip properly)
 (attempting to re-compensate)
 inflating: _rels/.rels 
 inflating: docProps/core.xml 
 inflating: docProps/app.xml 
 inflating: word/document.xml bad CRC 7f17798c (should be fea69872)
file #5: bad zipfile offset (local header sig): 7940
 (attempting to re-compensate)
 inflating: word/styles.xml 
 inflating: word/fontTable.xml 
 inflating: word/theme/theme1.xml 
 inflating: word/theme/_rels/theme1.xml.rels 
 inflating: word/header1.xml 
 inflating: word/footer1.xml 
 inflating: word/media/image1.png 
 inflating: word/settings.xml 
 inflating: word/_rels/document.xml.rels 
 inflating: [Content_Types].xml

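Right away you’ll see that the unzipper is having trouble with the file because it has been corrupted. That’s OK; we’ll still be able to poke around and find some stuff inside it. Notice that word/document.xml was still inflated, just with a bad CRC, which means its contents came out at least partly damaged.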
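
If unzip gives up entirely on a file, a similar salvage can be attempted in Python. This is only a sketch under some assumptions: it ignores the (possibly missing) archive directory, scans for ZIP local file headers, and inflates each entry in small chunks so that whatever was decoded before the damage is kept. The names inflate_partial and salvage_entries are hypothetical, and the scan can be fooled if the header signature happens to occur inside compressed data.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"
# Fixed 30-byte local file header: signature, version, flags, method,
# time, date, CRC-32, compressed size, uncompressed size, name len, extra len.
HEADER = struct.Struct("<4s5H3I2H")


def inflate_partial(raw: bytes, chunk: int = 256) -> bytes:
    """Inflate a raw deflate stream, keeping output produced before any error."""
    d = zlib.decompressobj(-zlib.MAX_WBITS)  # negative wbits = raw deflate
    out = bytearray()
    for i in range(0, len(raw), chunk):
        try:
            out += d.decompress(raw[i:i + chunk])
        except zlib.error:
            break  # damaged from here on; keep what we have
        if d.eof:
            break  # clean end of this entry's stream
    return bytes(out)


def salvage_entries(data: bytes) -> dict[str, bytes]:
    """Recover what we can of each entry by walking local file headers."""
    recovered = {}
    pos = data.find(LOCAL_SIG)
    while pos != -1 and pos + HEADER.size <= len(data):
        nxt = data.find(LOCAL_SIG, pos + 4)
        (_sig, _ver, _flags, method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = HEADER.unpack_from(data, pos)
        name_start = pos + HEADER.size
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        body_start = name_start + name_len + extra_len
        body = data[body_start:nxt if nxt != -1 else len(data)]
        if method == 8:      # deflated
            recovered[name] = inflate_partial(body)
        elif method == 0:    # stored without compression
            recovered[name] = body[:csize] if csize else body
        pos = nxt
    return recovered
```

Calling salvage_entries(open("test.docx", "rb").read()) returns a dictionary keyed by the names inside the archive, so the body text (if any survived) would be under "word/document.xml". No CRC check is done, so treat the output as best-effort rather than verified.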

5. Here is the directory listing now that we’ve extracted everything:

flossmole2:test megan$ ls -la
total 984
drwxr-xr-x 7 megan staff 238 Dec 10 16:36 .
drwx------+ 14 megan staff 476 Dec 10 16:29 ..
-rw-r--r--@ 1 megan staff 1999 Dec 7 18:17 [Content_Types].xml
drwxr-xr-x@ 3 megan staff 102 Dec 10 16:36 _rels
drwxr-xr-x@ 4 megan staff 136 Dec 10 16:36 docProps
-rw-r--r--@ 1 megan staff 499449 Dec 10 15:41
drwxr-xr-x@ 11 megan staff 374 Dec 10 16:36 word

6. Let’s explore down into the ‘word’ folder, since the Microsoft page said that’s where all the interesting stuff would be. Here’s what the folder structure looks like in the Finder.

7. We will need to open document.xml (inside the ‘word’ folder, where the body text of the paper lives) in a text editor or programmer’s editor. I like TextWrangler (download TextWrangler free). Right-click the file and tell it to open in TextWrangler:

8. You will see an error about incorrectly formatted XML and UTF-8. Click past that.

9. Use the Find dialogue box, with grep (regular expression) matching turned on, to build a search string like this and replace every match with nothing. It will remove all the XML tags.

10. Voila! Now you can see the text. You can remove the remaining stray characters with a Find | Replace as you need to. Remember that as you go down further and further in the file, you will see more and more stray marks. This is because of the way the file was corrupted (intentionally or not).
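
The tag-stripping in steps 8 through 10 can also be scripted. Here is a minimal Python sketch; the pattern <[^>]*> is my own choice and not necessarily the exact string used in the editor above, and the function name xml_to_text is hypothetical. It assumes the input is the raw bytes of the extracted word/document.xml.

```python
import re
from html import unescape

PARA_END = re.compile(r"</w:p>")   # end of a Word paragraph element
ANY_TAG = re.compile(r"<[^>]*>")   # anything that looks like an XML tag


def xml_to_text(raw: bytes) -> str:
    """Turn (possibly damaged) document.xml bytes into readable plain text."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    text = PARA_END.sub("\n", text)   # keep paragraph breaks as newlines
    text = ANY_TAG.sub("", text)      # drop all remaining tags
    return unescape(text)             # turn &amp; and friends back into characters
```

Running this on a damaged file will still leave stray characters toward the end, for the same reason as in the editor: a regular expression can only remove tags that survived the corruption intact.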