Difference Algorithm Applied To Excel Workbooks

Finding out how your document has changed from one version to another is an important topic, especially if it has modifications made to it by several different authors. This is especially the case for Microsoft Excel documents, which are used for financial planning, complex calculations and often contain Visual Basic for Applications macros. You may want to change one value between otherwise identical Microsoft Excel workbooks and see if there are any significant numeric differences. DiffEngineX is ideal for this as it can ignore differences below a user specified value or percentage change.

If you want to compare a plain text file, you have at your disposal a vast array of tools, including Microsoft WinDiff, which is included as part of Visual Studio. Microsoft Word has a built-in facility to compare an original with a revised document. However from Office '95 to Office 2010, Excel has no intrinsic functionality to find out how one spreadsheet differs from another one.

Excel documents are not straightforward to compare. The spreadsheet cells can either contain a formula (which performs a calculation based on the value of other cells) or a constant (text, number, date...). If it contains a formula, there is a choice of whether the formula itself should be compared or its calculated value (i.e. =6*7 or 42). Not only do the visible cells have to be compared, but also defined names, cell comments and the Visual Basic for Applications (VBA) macros.

Some spreadsheets contain data, Visual Basic macros and formulae. Other people use Excel to import and visualize imported database data. If using a difference algorithm to compare the latter, it is important to realize the data needs to be in sorted order first. (Excel's built-in sort functionality is under its Data tab.) Diff algorithms will not reorder rows to find similarities, but are limited to the insertion of blanks to line up the similarities.

What is a difference algorithm?

Excel Difference Algorithm Align Rows

There isn't just one difference algorithm, but rather a family of them. They all have the aim of finding the longest, in-order run of similarities between two strings of letters, lines of source code or rows. The aim is always to report the minimum number of differences.

Difference algorithms are used in the field of biology to compare protein and DNA sequences for similarities. Often different life processes will use very similar protein sequences with the only differences between them being small insertions and deletions. A difference algorithm will describe the changes necessary to convert one sequence into another.

Standard difference algorithms do have drawbacks. If a run of characters is moved out-of-sequence it won't be recognized as being shared between two documents. Consider the case below in which the string of characters "two three" is compared against "three two". The algorithm matches the longest, in-order run of characters and so correctly spots "three" is common between the two strings, but fails to realize "two" has only been moved back.

two-three compared against three-two
t w o t h r e e
t h r e e t w o

After diff algorithm applied
t w o t h r e e
t h r e e t w o

If you are comparing database rows imported into Excel, pre-sorting them first will ensure that no similar row between two sets of data is missed. If you have to carry out this step, make sure you save your changes to the file system (File->Save) before opening the workbooks in DiffEngineX.

Microsoft Excel Workbooks

DiffEngineX uses a standard diff algorithm to align the similar rows and columns of worksheets before comparing the cells (and any attached comments). A more sophisticated algorithm is used to compare the Visual Basic macros embedded in the spreadsheets, in so much as lines of code moved up or down (but not changed) will be recognized.

Names have their definitions simply compared against each other for equality. A diff algorithm is not needed here.

DiffEngineX can optionally highlight in red the precise characters that differ between two spreadsheet cells.

Row Alignment by DiffEngineX Alters Cell References of Differences in Workbook Copies

One of the principles of DiffEngineX is not to alter the workbooks being compared. It does however need to insert blank rows in order to align similar rows. Similar rows need to be aligned as they won't be recognized as identical if they have different Excel row numbers. This is why DiffEngineX automatically creates copies of your workbooks and compares those instead.

However the process of row alignment alters the workbook copies. Rows are shifted down and so each difference will have a different cell reference as compared to the original workbook. The difference report produced by DiffEngineX lists each pair of cell differences against their Excel cell reference.

It seems hardly fair to give a cell reference for a difference that refers to a temporary workbook (altered by blank row insertion) produced during the process of comparison instead of the real, unaltered originals.

The solution is that DiffEngineX gives cell references with respect to both the altered workbook copies and the unaltered originals. DiffEngineX provides an option to hyperlink each reported difference back to its corresponding spreadsheet cell.

Software Download

A free 30 day trial of our software to find the differences between Excel spreadsheets is available.