Converting Word to HTML

If you are the local “webworker”, you’ll know that converting Word ((Microsoft Office Word or OpenOffice Writer. Probably applies to StarOffice, AbiWord etc. as well.)) documents into valid, semantic HTML is a common, tedious and time-consuming task.

So, after spending a gazillion hours ((approximately)) doing that, I’d like to show you how I handle it.

But let me first say some words on, well, Word.

As you probably know, Word can actually export to HTML all by itself, and since it saves in XML internally anyway, you’d expect the transition to XHTML or HTML (via XSLT, for example) to be fairly easy and flawless. Let me assure you, it is not.

The first problem is that most documents are pretty messed up to start with. The second is that Word has some tricks up its sleeve for us.

Imagine a simple Word document that contains only one heading and one paragraph.

[Figure: the same document, one heading (“Heading 1”) and one paragraph (“Paragraph”), built two ways: as a semantic Word document and as a nonsemantic Word document.]

As you can see, you can save yourself a lot of headaches if you use Word wisely (that is, in a semantic rather than visual manner). But most of the time the document will be written by others and handed to you, so you’ll be stuck with that crap anyway. This is the scenario I devised my “technique” for.

For this solution you’ll need TinyMCE, a text editor, and Firefox with a Tidy extension.

The procedure:

  1. Copy the document into TinyMCE using the “Paste from Word” feature.
  2. Use the “Cleanup messy code” function.
  3. Copy the result into your editor and save it as .html.
  4. Open the file in Firefox and run “Cleanup” with Tidy.
  5. Copy the cleaned HTML and save it again as .html.
  6. Repeat steps 4 and 5 until all errors are gone.
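If you would rather script the repeat-until-clean part than click through it, here is a minimal Python sketch. It is not the manual workflow itself: it assumes the `tidy` command-line tool is installed and on your PATH instead of the Firefox extension, and it approximates “until all errors are gone” with “until another pass no longer changes the output”. The function names `tidy_once` and `clean_until_stable` are made up for this illustration.

```python
import subprocess
from typing import Callable


def tidy_once(html: str) -> str:
    """Run the HTML through the `tidy` command-line tool once.

    Assumption: `tidy` is installed and on PATH; it reads stdin and
    writes the cleaned document to stdout when no file is given.
    """
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml", "-utf8", "--force-output", "yes"],
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # tidy exits with 1 for warnings and 2 for errors, so we do not
    # treat a non-zero exit status as a failure here.
    return result.stdout


def clean_until_stable(
    html: str,
    clean: Callable[[str], str] = tidy_once,
    max_rounds: int = 10,
) -> str:
    """Repeat the cleanup pass until the output stops changing."""
    for _ in range(max_rounds):
        cleaned = clean(html)
        if cleaned == html:
            break  # another pass would not change anything
        html = cleaned
    return html
```

The `clean` parameter is there so you can swap in any other cleanup function, and `max_rounds` keeps a misbehaving cleaner from looping forever.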

This should have eliminated most errors (and most formatting as well). We’re now ready to go into the details.

Both TinyMCE and Tidy insert classes and styles into our document. We’ll want to get rid of these, and while we’re at it, we can use some regular expressions to clean up a bit more.
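To give an idea of what getting rid of classes and styles can look like, here is a small Python sketch. It is my own illustration, not a fixed recipe: the function name `strip_class_and_style` and the sample class name are made up, and it assumes the attribute values are already quoted and contain no `>` character.

```python
import re

# An opening tag: "<", a letter, then anything up to the next ">".
OPENING_TAG = re.compile(r"<[a-zA-Z][^<>]*>")

# A class="..." or style="..." attribute (single or double quotes),
# including the whitespace in front of it.
CLASS_OR_STYLE = re.compile(
    r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def strip_class_and_style(html: str) -> str:
    """Remove class and style attributes from all opening tags."""
    # Only touch text inside opening tags, so that ordinary prose such
    # as 'set class="x"' in the body text is left alone.
    return OPENING_TAG.sub(lambda m: CLASS_OR_STYLE.sub("", m.group(0)), html)


print(strip_class_and_style('<p class="MsoNormal" style="margin:0">Hi</p>'))
# prints: <p>Hi</p>
```

Regular expressions are not an HTML parser, so check the result by eye before trusting it on a large document.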

Purpose | (Perl-compatible) regular expression | Replacement string
Remove all superfluous attributes | !<([^]*?"?)>! |
Put all attributes in double quotes | !! |
Duplicate name attributes in form fields as ids | !! |
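For the quoting and the name-to-id steps, here is a rough Python sketch of how such replacements could be written. These are not the original patterns, just one way to get the same effect; the function names `quote_attributes` and `add_ids_from_names` are invented for this example, and the code assumes that attribute values never contain a `>` character.

```python
import re

# An opening tag: "<", a letter, then anything up to the next ">".
OPENING_TAG = re.compile(r"<[a-zA-Z][^<>]*>")

# name=value, where the value is double-quoted, single-quoted or bare.
ATTRIBUTE = re.compile(r"""([\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'<>]+)""")

# Form fields that should get an id matching their name.
FORM_FIELD = re.compile(r"<(?:input|select|textarea)\b[^<>]*>", re.IGNORECASE)


def _requote(match: re.Match) -> str:
    name, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        value = value[1:-1]  # drop the existing quotes
    value = value.replace('"', "&quot;")  # keep the new quoting valid
    return f'{name}="{value}"'


def quote_attributes(html: str) -> str:
    """Put every attribute value inside opening tags in double quotes."""
    return OPENING_TAG.sub(lambda m: ATTRIBUTE.sub(_requote, m.group(0)), html)


def add_ids_from_names(html: str) -> str:
    """Copy name="x" to id="x" on form fields that have no id yet.

    Expects double-quoted attributes, so run quote_attributes() first.
    """
    def fix(match: re.Match) -> str:
        tag = match.group(0)
        if re.search(r"\sid\s*=", tag, re.IGNORECASE):
            return tag  # already has an id
        name = re.search(r'\sname\s*=\s*"([^"]*)"', tag, re.IGNORECASE)
        if name is None:
            return tag  # nothing to copy
        end = name.end()
        return f'{tag[:end]} id="{name.group(1)}"{tag[end:]}'

    return FORM_FIELD.sub(fix, html)


html = quote_attributes("<input type=text name='email'>")
print(add_ids_from_names(html))
# prints: <input type="text" name="email" id="email">
```

One thing to watch out for: radio buttons in a group share one name, so copying it blindly gives you duplicate ids, and those you will have to fix by hand.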

Well, that’s it. All that’s left is up to you. You’ll just have to go through the source manually… good luck.