Simple clear advice in plain English

New document file formats explained

The war between Open Document and Open XML is raging but users will benefit from the XML-based formats

Small, compatible and secure are the watchwords of the new generation of office files.

In future, all the important office software manufacturers are planning to use XML-based file formats. Despite this, there’s no sign of a standard format emerging.

There are already two rival camps: the Open Document format being promoted by IBM, Sun (Star Office) and Openoffice.org, and Microsoft’s own variant of XML.

Office 2007 will read and write Microsoft’s own Open XML files, but it won’t support Open Document out of the box.

Microsoft has recently relented somewhat with the announcement of its Open XML Translator project, which will let developers create a bridge between the rival formats.

Although this is being presented as a battle of the document formats, both sides are technically quite similar. Both file types have a common basis in XML (Extensible Markup Language).

All alphanumeric document content – presentations, text or tables – is stored in XML files. All other document elements, such as graphics and OLE (Object Linking and Embedding) or VBA (Visual Basic for Applications) objects, are strictly separated from them.

Further XML files belonging to the document can hold supplementary information (known as metadata) about format templates and definitions, comments, paths to linked resources, the author, number of characters and so on.

Open minded
In both Open Document and Open XML, all the constituent parts of the document are kept together in a Zip container file that appears as the actual document to the user.

Both types of file use a compressed archive, which reduces the storage space required. XML is plain text and compresses very well, so the finished files end up smaller still.
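To see why XML packs down so well, here is a minimal Python sketch. It uses the standard zlib module, which implements Deflate, the compression method Zip archives normally use. The sample table is made up for illustration and is far more repetitive than a real document, so treat the saving it reports as an upper bound rather than a typical figure.

```python
import zlib

# Made-up, highly repetitive XML standing in for a real document part.
xml_text = "<table>" + "<row><cell>value</cell></row>" * 200 + "</table>"
raw = xml_text.encode("utf-8")

# Deflate is the method Zip archives normally use for their members.
packed = zlib.compress(raw)

saving = 100 * (1 - len(packed) / len(raw))
print(f"{len(raw)} bytes before, {len(packed)} bytes after ({saving:.0f}% smaller)")
```

Real documents contain less repetition than this sample, so they will shrink less.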

Embedded picture files are converted into a space-saving format during saves and then the lossless Zip compression shrinks them further. In our tests, files which were saved in the new format shrank by 50-90 per cent compared with their original size.

Better data integrity is promised with the use of a CRC checksum (Cyclic Redundancy Check), a standard part of the Zip file format. A checksum is stored for each file in the archive and is used to verify that file when it is extracted.

The CRC is highly sensitive to any modifications to the archived data. But even if part of the Zip archive contains errors, you can still make use of the remaining data.
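Both points can be tried out with a short Python sketch. It builds a tiny two-part archive in memory, changes one character in one part to simulate damage, and then asks Python's zipfile module to test the archive. The part names and contents are invented for the example, and the parts are stored uncompressed only so that the damage is easy to simulate.

```python
import io
import zipfile
import zlib

original = b"<p>Quarterly report</p>"
damaged = original.replace(b"Quarterly", b"Quarterlx")

# A one-character change is enough to alter the CRC-32 value.
print(zlib.crc32(original) != zlib.crc32(damaged))

# Build a small archive in memory (stored, not compressed).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("content.xml", original)
    archive.writestr("meta.xml", b"<author>Example</author>")

# Simulate damage: swap the stored bytes of content.xml for the altered ones.
corrupted = buffer.getvalue().replace(original, damaged, 1)

with zipfile.ZipFile(io.BytesIO(corrupted)) as archive:
    first_bad = archive.testzip()       # name of the first part that fails its CRC
    intact = archive.read("meta.xml")   # the undamaged part is still readable

print(first_bad)
print(intact)
```

The testzip() call reports the first part whose checksum does not match, while the other part can still be read as normal.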

Once a document has been saved from Star Office (Odt, Open Document Text) or Microsoft Word (DocX, Word Open XML), you can rely on accidental damage to the stored content being detected, because every part is checked by a proven algorithm.

Separate data storage, compression and CRC testing also have other advantages.

As Windows has its own decompression routine, the use of Zip compression is a plus point. If need be, the files can be worked on without any special software.

Simply change the document extension from Odt or Docx to Zip. You can then view the data container like a compressed file with Windows Explorer or a Zip-compatible compression utility such as PKzip for Windows.
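The same look inside the container can be had from a script. The Python sketch below is one possible approach, not the only one: the zipfile module recognises an archive by its contents, so the file does not even need renaming. The sample document it creates is a hypothetical stand-in with simplified XML, included only so the example runs on its own.

```python
import os
import tempfile
import zipfile

def list_parts(path):
    """Return the names of all parts inside a document container."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

def read_part(path, name):
    """Return one part of the container as text (assumes UTF-8 XML)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(name).decode("utf-8")

# Hypothetical stand-in document with simplified XML, not a real Odt file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.odt")
with zipfile.ZipFile(sample_path, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p></doc>")
    archive.writestr("meta.xml", "<meta><author>Example</author></meta>")

print(list_parts(sample_path))
print(read_part(sample_path, "content.xml"))
```

Pointing list_parts() at a real Odt or Docx file should show the XML files, pictures and other parts described above.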

In the case of Openoffice.org, all the text is saved in pure XML files. You can use copy and paste to move it to another program – for example, a text editor – without having the suite’s Writer component installed on your PC.

By using a PHP script and the add-on Pclzip, it’s possible to extract the content from large quantities of documents automatically.
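The same idea can be sketched in Python, using its built-in zipfile module in place of PHP and Pclzip. This is a rough illustration under stated assumptions, not a finished tool: it looks for the main text part of each format (content.xml in Open Document, word/document.xml in Word Open XML) and joins together all the text it finds. The function names and the two sample files are invented for the example, and the samples use simplified XML rather than real document markup.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET

# Main text part in each format: Open Document, then Word Open XML.
TEXT_PARTS = ("content.xml", "word/document.xml")

def extract_text(path):
    """Return the plain text of one document, or None if no text part exists."""
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        for part in TEXT_PARTS:
            if part in names:
                root = ET.fromstring(archive.read(part))
                pieces = (text.strip() for text in root.itertext())
                return " ".join(piece for piece in pieces if piece)
    return None

def harvest(folder):
    """Map each .odt or .docx file name in a folder to its extracted text."""
    results = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith((".odt", ".docx")):
            results[name] = extract_text(os.path.join(folder, name))
    return results

# Hypothetical sample files with simplified XML, so the sketch runs on its own.
folder = tempfile.mkdtemp()
samples = {
    "letter.odt": ("content.xml", "<doc><p>Hello from Writer</p></doc>"),
    "memo.docx": ("word/document.xml", "<doc><p>Hello from Word</p></doc>"),
}
for filename, (part, xml_text) in samples.items():
    with zipfile.ZipFile(os.path.join(folder, filename), "w") as archive:
        archive.writestr(part, xml_text)

harvested = harvest(folder)
print(harvested)
```

On real documents the output will be rougher, because text is often split across many small XML elements, so some words may be broken up by extra spaces.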

The possibility of harvesting all or part of the content from Office files is very attractive for organisations needing to process the data from document management systems, and XML simplifies this process.
