[SGVLUG] diff command and binary

Emerson, Tom Tom.Emerson at wbconsultant.com
Thu Mar 2 14:25:18 PST 2006


> -----Original Message-----
> Behalf Of matti
> 
> Ok - I've encountered a frustrating problem
> 
> pulled up a backup file and compared it
> the same file... they should be the same,
> but - perhaps the user did a slight modification
> this morning on one.
> 
> SO I got 2 binary excel files which differ.
> 
> I want to determine HOW much they differ
> and in what ways.

The simple act of OPENING the file will change it -- to test, I did the following:

  * started Excel
  * immediately saved the (blank) worksheet
  * exited Excel
  * copied the file from the command prompt
  * started Excel
  * opened the COPY
  * exited Excel

FC shows differences between the files -- not a lot, but "different" at a binary level:

C:\>copy book1.xls book2.xls
        1 file(s) copied.

C:\>fc book1.xls book2.xls
Comparing files Book1.xls and BOOK2.XLS
***** Book1.xls


***** BOOK2.XLS

 G???>???   

*****

***** Book1.xls


?   

***** BOOK2.XLS

W
o
r
k
b
o
o
k

*****

***** Book1.xls

W
o
r
k
b
o
o
k

> 
> is there a way to do this.
> (the file sizes are the same.. )

note that "fc" [the "dos" equivalent to "diff"] also has a binary comparison mode:


C:\>help fc
Compares two files or sets of files and displays the differences between
them


FC [/A] [/C] [/L] [/LBn] [/N] [/T] [/U] [/W] [/nnnn] [drive1:][path1]filename1
          [drive2:][path2]filename2
FC /B [drive1:][path1]filename1 [drive2:][path2]filename2

   /A     Displays only first and last lines for each set of differences.
   /B     Performs a binary comparison.
   /C     Disregards the case of letters.
   /L     Compares files as ASCII text.
   /LBn   Sets the maximum consecutive mismatches to the specified number of
          lines.
   /N     Displays the line numbers on an ASCII comparison.
   /T     Does not expand tabs to spaces.
   /U     Compare files as UNICODE text files.
   /W     Compresses white space (tabs and spaces) for comparison.
   /nnnn  Specifies the number of consecutive lines that must match after a
          mismatch.


C:\>fc /b book1.xls book2.xls
Comparing files Book1.xls and BOOK2.XLS
0000346C: 00 20
0000346D: 00 47
0000346E: 00 C6
0000346F: 00 05
00003470: 00 3F
00003471: 00 3E
00003472: 00 C6
00003473: 00 01
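
If "fc" isn't handy (on a Linux box, say), the same byte-by-byte listing can be sketched in a few lines of Python. This is only a rough stand-in for "fc /b", not a reimplementation of it; the function names binary_diff and format_diff are made up for illustration.

```python
from pathlib import Path


def binary_diff(path_a, path_b):
    """Compare two files byte by byte.

    Returns (diffs, size_delta): diffs is a list of (offset, byte_a, byte_b)
    for each differing position in the common prefix, and size_delta is
    len(a) - len(b), which is 0 when the files are the same size.
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs, len(a) - len(b)


def format_diff(diffs):
    """Render the differences in the 'OFFSET: AA BB' layout shown above."""
    return [f"{offset:08X}: {x:02X} {y:02X}" for offset, x, y in diffs]
```

Something like print("\n".join(format_diff(binary_diff("book1.xls", "book2.xls")[0]))) should then give a listing in the same shape as the one above.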

Considering that what was zeroes before is now binary data, I'd imagine this is some form of timestamp -- sure enough, a second copy-and-compare operation yields:


C:\>copy book2.xls book3.xls
        1 file(s) copied.

C:\>fc /b book2.xls book3.xls
Comparing files book2.xls and BOOK3.XLS
FC: no differences encountered

[open book3 in Excel and immediately close]

C:\>fc /b book2.xls book3.xls
Comparing files book2.xls and BOOK3.XLS
0000346C: 20 30
0000346D: 47 6A
0000346E: C6 FA
0000346F: 05 2E
00003470: 3F 40

C:\>

three bytes are still the same (perhaps they represent the day?)  Note that marking the file "read only" (attrib +r) will keep Excel from updating this value.
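
One way to test the timestamp guess: eight bytes at a fixed offset look like they could be a Windows FILETIME (a 64-bit little-endian count of 100-nanosecond ticks since 1601-01-01 UTC). That is purely an assumption; nothing here was checked against the Excel file format. The sketch below decodes the byte values from the two "fc /b" listings above. The last three bytes for book3 are carried over from book2, since fc reported no change in them.

```python
from datetime import datetime, timedelta, timezone

# Bytes at offsets 0x346C-0x3473, copied from the fc /b listings above.
BOOK2_BYTES = bytes([0x20, 0x47, 0xC6, 0x05, 0x3F, 0x3E, 0xC6, 0x01])
# book3: first five bytes changed, last three assumed unchanged from book2.
BOOK3_BYTES = bytes([0x30, 0x6A, 0xFA, 0x2E, 0x40, 0x3E, 0xC6, 0x01])


def filetime_to_datetime(raw):
    """Interpret 8 bytes as a little-endian FILETIME (assumed, not confirmed)."""
    ticks = int.from_bytes(raw, "little")  # 100 ns units since 1601-01-01
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=ticks // 10)


print(filetime_to_datetime(BOOK2_BYTES))
print(filetime_to_datetime(BOOK3_BYTES))
```

Decoded this way, both values land on 2 March 2006 (UTC), a little over eight minutes apart, which fits the timestamp theory; the three unchanged bytes would then simply be the high-order part of the counter rather than "the day" as such.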

I'd suggest copying the "current" file to a workfile, opening the workfile and closing it as I have done here, then compare backup-->current and current-->workfile; if the differences are similar to what I've shown here, I'd feel reasonably confident there is no actual change other than "having opened the file".
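
A rough sketch of that check, assuming all three files are the same size (as reported in this case). The helper names diff_offsets and offsets_needing_a_look are hypothetical. It lists the offsets where backup and current differ that did NOT also change when the workfile was merely opened and closed.

```python
from pathlib import Path


def diff_offsets(path_a, path_b):
    """Return the set of byte offsets at which two equal-sized files differ."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    if len(a) != len(b):
        raise ValueError("files differ in size; offsets are not comparable")
    return {i for i, (x, y) in enumerate(zip(a, b)) if x != y}


def offsets_needing_a_look(backup, current, workfile):
    """Offsets changed backup-->current but untouched by a plain open/close."""
    changed = diff_offsets(backup, current)
    open_noise = diff_offsets(current, workfile)
    return sorted(changed - open_noise)
```

An empty result suggests nothing changed beyond "having opened the file". A non-empty one is not proof of an edit: if the changing bytes are a timestamp, its higher-order bytes may differ between a backup and today's file without changing during a quick open/close, so offsets right next to the noisy ones deserve a second look before drawing conclusions.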

BUT... to get at what /actual/ changes may have occurred, how about openoffice?  I believe the 2.0 version saves files in XML format (which is then zipped to reclaim the space taken by the XML formatting) -- open the backup and "current" file in openoffice (marking them read-only prior to the start "just to be safe") and save them in the "native" openoffice format [there may even be an option to not zip them as well, but if not I think gunzip will work against the file].  You can then use XML tools to determine the difference (or even an ascii FC/diff against the file should work, as the file is essentially "human readable" at this point).  [note: it may take some finessing to get the output in the first place -- you might need the -c and -S switches for gunzip...]

tom at osnut:/srv/nethome/tom2/Documents/wireless> less wfinder_data.sxc
Archive:  wfinder_data.sxc
 Length   Method    Size  Ratio   Date   Time   CRC-32    Name
--------  ------  ------- -----   ----   ----   ------    ----
   10191  Defl:N     1949  81%  05-18-03 17:12  90241a31  content.xml
    5433  Defl:N     1301  76%  05-18-03 17:12  1404ddb8  styles.xml
    1076  Stored     1076   0%  05-18-03 17:12  f5381c62  meta.xml
    7478  Defl:N     1355  82%  05-18-03 17:12  f15b0f40  settings.xml
     750  Defl:N      252  66%  05-18-03 17:12  5313cb53  META-INF/manifest.xml
--------          -------  ---                            -------
   24928             5933  76%                            5 files

gunzip complains that there are "multiple entries in the file" and stops after the first one; using the "-c" flag and piping the result to a newly named file shows that the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC "-//OpenOffice.org//DTD OfficeDocument1.0//EN" "office.dtd">
<office:document-content
xmlns:office="http://openoffice.org/2000/office"

[...snipped a dozen similar lines...]

office:class="spreadsheet"
office:version="1.0">
<office:script/>
<office:font-decls><style:font-decl style:name="Arial Unicode MS" 
fo:font-family="&apos;Arial Unicode

[etc.]

<office:body>
<table:table table:name="Sheet1" table:style-name="ta1">
<table:table-column table:style-name="co1" table:default-cell-style-name="Default"/>
<table:table-column table:style-name="co2" table:default-cell-style-name="Default"/>
<table:table-column table:style-name="co3" table:default-cell-style-name="Default"/>

[ditto above]

<table:table-row table:style-name="ro1">
  <table:table-cell>
    <text:p>Location</text:p>
  </table:table-cell>
  <table:table-cell>
    <text:p>Operator</text:p>
  </table:table-cell>

(actually, I pretty-printed this a bit -- the actual file has no line breaks or extraneous whitespace...)
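
Since the listing above shows the file is really a zip archive with content.xml as one member, the extract / pretty-print / diff steps can be sketched with nothing but the Python standard library. This is a minimal illustration, not a proper XML-aware diff; the function names content_lines and diff_documents are made up, and it assumes the member holding the cell data is called content.xml as in the listing.

```python
import difflib
import zipfile
from xml.dom import minidom


def content_lines(path, member="content.xml"):
    """Pull one member out of the archive and pretty-print it into lines."""
    with zipfile.ZipFile(path) as archive:
        raw = archive.read(member)
    pretty = minidom.parseString(raw).toprettyxml(indent="  ")
    # Drop blank lines so the diff only shows real content.
    return [line for line in pretty.splitlines() if line.strip()]


def diff_documents(path_a, path_b):
    """Return a unified diff of the pretty-printed content.xml of two files."""
    return list(
        difflib.unified_diff(
            content_lines(path_a),
            content_lines(path_b),
            fromfile=str(path_a),
            tofile=str(path_b),
            lineterm="",
        )
    )
```

Calling print("\n".join(diff_documents("backup.sxc", "current.sxc"))) on the two saved copies (file names here are placeholders) should show changed cells as ordinary -/+ lines. Pretty-printing first matters because the raw file has no line breaks, so a plain line-based diff would otherwise report the whole document as one changed line.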


