CopyDoc2XML – Part 1: The evolution of a tool

I haven’t really posted anything in a while because I’ve been working since June-ish as Tech Lead on a mammoth mobile project. It has kept me incredibly busy, but it has also been a fantastic learning experience so far. I can’t mention the client yet, so for now I will refer to it as Project Blackbeard.

This is the first in a series of posts about a tool I am midway through writing, which converts what the client sees as a copydeck into the XML file needed to make the copy appear in an app or website.

Part 1 focuses on the problem, and the convoluted process I forced myself into when solving it.
Part 2 will focus on the actual development of the final solution up to its present state.
Part 3 will focus on features that I hope to add in the future if I am blessed with more free time.


One mundane but required part of every project is copy. For those not in the know, copy is the stuff you read, or as regular humans call it, the text. Copy inevitably leads to hours and hours of XML editing or the creation of a CMS of some kind. I prefer XML over a full-blown CMS, simply because XML provides a greater level of freedom.

As far as my love of XML goes, after the fifth or sixth hour of modifying copy nodes or attributes to match all manner of changes, my patience and concentration always wear thin. In addition to this we have localization. Localization is the art – yes, it truly is an art – of introducing different bodies of text for different locales, while maintaining the style and flow of the originally designed text. People often forget that last bit. Localization has broken the spirits of many a [unsuspecting junior] developer, including myself.

With my project background, I have had a pretty intimate relationship with localization and its finer points. FedEx Experience, Heineken Know the Signs and Philips Cinema, to name a few, all had a backbreaking amount of copy in various languages. It’s not just about dumping the text on the page and hoping it displays nicely, either; we’ve had to build subtitle engines, text animation engines and voice-over engines, all of which must be able to handle lots of different languages.

I’ve always had plans to write a decent XML editor so that we can take the editing duties away from the developers and place them in less skilled hands. My plan was to write an AIR application that loads an XML Schema and a config file with fonts, sizes, etc. written by the developers, and then lets someone else input the copy and modify the properties defined in the schema, with a live preview showing how these elements display. Unfortunately, as always, I never have time to get this done.

Version I – Birth

On Project Blackbeard, we have a couple of close deadlines, and I didn’t want my team to spend a lot of time building masses of XML for all the copy. I asked my backend developer on the project to build a simple CMS that did the following:

  • allow us to assign an id for that piece of copy
  • allow us to write a description of that piece of copy
  • allow us to specify an example image
  • allow us to specify a max character limit and single-line/multiline
  • provide us with a field to write comments on the copy provided
  • allow the client to edit the copy within the limits we had specified
  • allow the client to write comments about the copy
  • save each line to the backend
  • save an XML file to the dev SVN for the app devs

He gave me a tool. It did the job, but it had no finesse. Time to do some fixes.

Version II – Infancy

The tool wasn’t something I could ask the client to spend their time in, and from a usability point of view it wasn’t very friendly either: every time you added a new row it would refresh the page. I spent a little time over the next couple of days making it pretty, adding some JavaScript to dynamically add rows, and making other tweaks to make it a bit more user friendly.

After that little effort I reassessed the current workflow. The client was writing their copy into a GoogleDocs spreadsheet, then our Project Manager was taking this and forcing it into the XML CMS we had created. Then the developers edited the generated XML file to suit their needs. The XML tool still wasn’t particularly user friendly, and it was nowhere near as feature complete as I’d have liked given the limited time I could spend on it. It wasn’t really working.

I wrote down the features I’d like to include in the tool, and realized that what I wanted to do was rewrite the GoogleDocs spreadsheet editing mechanism, but with fewer columns. Not brilliant. The client was using our GoogleDocs spreadsheet file anyway, so why not put that file to use? After all, Google Docs has a great interface which is familiar to everyone and allows import and export of different formats. Enter Mach III.

Version III – Confusing Adolescence

After a little digging around in GoogleDocs, I decided that the best way to go about this was to use the ‘Download as’ feature, as this allows the table to be exported to HTML, which is far more structured than the text or CSV options. I wasn’t planning on writing a parser for Excel spreadsheets, PDFs or OpenOffice files.

The plan was now to export the contents of the GoogleDoc as HTML, parse this HTML and grab the bits I needed. Everything was nicely placed in a table element, so parsing the HTML and generating the XML proved ridiculously easy. It was all accomplished in about 20-30 lines of JavaScript. Unfortunately I hit a block with where to put the output: I couldn’t write it to a file from my own HTML page. I tried dropping the generated XML into the DOM, but lost all manner of formatting and newlines within the body of the text.
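The original parsing code isn’t shown here, but a rough sketch of the table-walking step might look something like this. The real version walked the live DOM; purely for illustration, this sketch chews through the raw HTML string with regular expressions instead, and every name in it is my own invention:

```javascript
// Sketch: pull the cell text out of an exported GoogleDocs HTML table.
// A regex-based stand-in for the DOM walk described above.
function tableToRows(html) {
  const rows = [];
  const rowRe = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  let rowMatch;
  while ((rowMatch = rowRe.exec(html)) !== null) {
    // A fresh cell regex per row avoids stale lastIndex state.
    const cellRe = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    const cells = [];
    let cellMatch;
    while ((cellMatch = cellRe.exec(rowMatch[1])) !== null) {
      cells.push(cellMatch[1]);
    }
    rows.push(cells);
  }
  return rows;
}
```

In the browser the same rows fall out of a simple walk over `document.querySelectorAll('table tr')`, which is what made the original version so short.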

At this point I moved from an HTML page to an AIR application so that I could export directly to a file without losing my formatting. I quickly realized that I was missing something that had made my life so easy: the DOM. Running up and down the DOM tree in my JavaScript version was natural. In ActionScript it wasn’t so simple. Again I was running on limited time, so I quickly dropped AIR and started thinking DOM, but more powerful. I considered where the user was when they were working on the copydeck: in their browser. I didn’t want to make them change to another application, but a web page wasn’t cutting it. Then it hit me: I could write an extension for Google Chrome. I could have written a Firefox extension, but I’ve done that before, so I figured I’d try something new.

Bam, V4.

Version IV – Adulthood and Learning New Tricks

Now I started picking up speed. I had the basic functionality done with my original JavaScript and HTML version; I just had to format it into something Chrome-worthy. I made one concession: Chrome extensions are just web pages with slightly elevated permissions, not high enough for proper filesystem access. Chrome does, however, allow access to the system clipboard, so my output is put there to be pasted into a file of the user’s choosing.
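For reference, clipboard output from an extension page can be done by selecting text in a scratch textarea; this is a hedged sketch of that general technique, not the tool’s actual code, and the function name is mine:

```javascript
// Sketch: push generated XML onto the system clipboard from an
// extension page by selecting the text in a temporary textarea.
function copyToClipboard(text) {
  const scratch = document.createElement('textarea');
  scratch.value = text;
  document.body.appendChild(scratch);
  scratch.select();              // select everything in the textarea
  document.execCommand('copy');  // copy the current selection
  document.body.removeChild(scratch);
}
```

Nothing visible is left behind, since the textarea is removed as soon as the copy command has fired.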

The first iteration of this tool was pretty simple. In the GoogleDoc the user clicks File -> Download As -> HTML. This automatically opens a popup, and they need to copy the source of that HTML popup. They then open the extension and paste the source into the input text area. All the column numbers were hard-coded at this point: the copy was in column 8, the ID attribute was the row number, and the nodes created were one level deep and called “string”. The output was a bunch of XML ‘string’ tags, each with an id and the copy wrapped in CDATA. Next I hard-coded the ID column into the extension so that the XML nodes used ID values from the document rather than the row numbers.
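To make the hard-coded layout concrete, here is a sketch of what the XML-building step might look like. Only the column-8 copy position and the `<string>`/CDATA output come from the description above; the function name, the ID column position and the `<copy>` root node are my own guesses:

```javascript
const ID_COLUMN = 0;    // assumed position of the ID column
const COPY_COLUMN = 7;  // copy lives in column 8 (zero-based index 7)

// Sketch: turn extracted rows into one-level-deep <string> nodes,
// wrapping the copy in CDATA to preserve newlines and formatting.
function rowsToXml(rows) {
  const nodes = rows.map((cells) =>
    `  <string id="${cells[ID_COLUMN]}"><![CDATA[${cells[COPY_COLUMN]}]]></string>`);
  // The root node name here is an assumption for illustration.
  return `<copy>\n${nodes.join('\n')}\n</copy>`;
}
```

CDATA is what saves the copy from the formatting loss mentioned earlier: newlines and stray markup inside the text pass through untouched.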

And that was that. I had accomplished what was needed at least. The client could continue to edit in the GoogleDoc, and the developers were given an XML file that had all the required data.

Since then I have devoted as little time as possible to the tool, as I still have a lot to do on Project Blackbeard. I try to limit myself to one feature a day. I have since added some nice features, like only activating the extension when on a GoogleDoc Spreadsheet, all of which will be covered in Part 2. I’ll hopefully find some time to write that soon.