Thursday, August 26, 2021

Git Diff using Pandoc for Binary Documents

As I stated in my Documents as Code post, text formats such as Markdown work well with Git because Git was written to track source code, which is text. For the same reason, Git cannot show what has changed between two revisions of a binary document.

So, if others are writing most of their documentation in Microsoft Word or OpenOffice Writer, how can you use git diff to examine how the content evolves from commit to commit in a Git repository?

First, create a git repository:

$ git init binary_diff
$ cd binary_diff/

Then, create an .odt document (file.odt in this example) and add a simple line of text such as “hello.” Stage the file and commit the doc to the repo:

$ git add file.odt
$ git commit -m "Create file.odt with hello"

Now, change the text in the doc to “Hello Solar System.” Add and commit the updated doc:

$ git commit -am "Update the file.odt file"

Let’s see the git log output:

$ git log --oneline

f14e810 (HEAD -> main) Update the file.odt file
a2f8e6a Create file.odt with hello

Next, run git diff between the first and last commits. For a binary file, Git reports only that the two versions differ:

$ git diff a2f8e6a..f14e810

diff --git a/file.odt b/file.odt
index e08debd..02d4dce 100644
Binary files a/file.odt and b/file.odt differ

Not very helpful, huh?

In order to enable diffs on binary files, do the following. First, create a .gitattributes file and add the following:

*.docx diff=docx
*.odt diff=odt
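If you want to confirm that Git picks up these patterns, git check-attr reports the diff driver assigned to a path. The sketch below is only an illustration: it works in a scratch repository created with mktemp so it can be tried safely, and report.docx is a made-up file name (check-attr does not need the files to exist).

```shell
# Scratch repository so the check does not touch real work.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Same two patterns as above, written to .gitattributes in the repo root.
printf '%s\n' '*.docx diff=docx' '*.odt diff=odt' > .gitattributes

# Ask Git which diff driver it would use for each path.
attrs=$(git check-attr diff -- file.odt report.docx)
printf '%s\n' "$attrs"
# file.odt: diff: odt
# report.docx: diff: docx
```

In the walkthrough repository you would run only the git check-attr line from inside binary_diff.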

Then, add this to the .git/config file (pandoc must be installed and on your PATH):

[diff "docx"]
    textconv = pandoc --to=plain
[diff "odt"]
    textconv = pandoc --to=plain
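If you would rather not edit .git/config by hand, the same entries can be written with git config. This is a minimal sketch, again in a scratch repository for safety. It does not need pandoc to be installed, because the command string is only stored here and not run.

```shell
# Scratch repository; in the walkthrough you would already be inside binary_diff.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Each command writes one textconv entry to the repository-local .git/config.
git config diff.docx.textconv "pandoc --to=plain"
git config diff.odt.textconv "pandoc --to=plain"

# Read the settings back to confirm what was stored.
docx_conv=$(git config --get diff.docx.textconv)
odt_conv=$(git config --get diff.odt.textconv)
echo "docx: $docx_conv"
echo "odt:  $odt_conv"
```

Adding --global to the two git config commands stores the drivers in your user-level configuration instead, so they are available in every repository. The .gitattributes patterns still have to be present in each repository where you want them applied.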

Now, run the same git diff on the first and last commit again. This time Git converts each revision to plain text with pandoc and shows the differences:

$ git diff a2f8e6a..f14e810
diff --git a/file.odt b/file.odt
index e08debd..02d4dce 100644
--- a/file.odt
+++ b/file.odt
@@ -1 +1 @@
-hello
+Hello Solar System

You will find that you can get the same result with *.docx file diffs.

With this configuration in place, you can view how the .docx/.odt files have changed between the various commits.
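To reproduce the whole walkthrough without opening a word processor, the steps can be scripted. The following is a rough sketch under a few assumptions that are not part of the original steps: the documents are generated by pandoc from one-line Markdown input, the commits use a placeholder identity (Demo User), and everything happens in a scratch directory. The pandoc-dependent part is skipped if pandoc is not installed.

```shell
# Scratch repository so nothing touches real work.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Placeholder identity so commits work even if no Git identity is configured.
commit() {
    git -c user.name="Demo User" -c user.email="demo@example.com" commit -q "$@"
}

# Attribute and driver setup from the steps above.
printf '%s\n' '*.docx diff=docx' '*.odt diff=odt' > .gitattributes
git config diff.docx.textconv "pandoc --to=plain"
git config diff.odt.textconv "pandoc --to=plain"

result="pandoc not found; skipped"
raw=""
if command -v pandoc >/dev/null 2>&1; then
    # First revision: pandoc picks the ODT writer from the .odt extension.
    echo "hello" | pandoc --from=markdown --output=file.odt
    git add .gitattributes file.odt
    commit -m "Create file.odt with hello"

    # Second revision of the same document.
    echo "Hello Solar System" | pandoc --from=markdown --output=file.odt
    commit -am "Update the file.odt file"

    # Text diff through the pandoc driver.
    result=$(git diff HEAD~1 HEAD -- file.odt)

    # Same comparison with the driver bypassed, for contrast.
    raw=$(git diff --no-textconv HEAD~1 HEAD -- file.odt)
fi

printf '%s\n' "$result"
printf '%s\n' "$raw"
```

The first diff should contain the removed and added text lines, while the --no-textconv run falls back to the "Binary files ... differ" message. Changing file.odt to file.docx in the script exercises the docx driver in the same way.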
