Thursday, August 26, 2021

Git Diff using Pandoc for Binary Documents

As I stated in my Documents as Code post, text formats such as Markdown work well with Git because Git was written for source code in a text-based format. For the same reason, Git doesn’t understand what has changed between two revisions of a binary document.

So, if others are writing most of their documentation in either Microsoft Word or OpenOffice’s Writer applications, how can you examine the evolving content between the various commits via a git diff in a Git repository?

First, create a git repository:

$ git init binary_diff
$ cd binary_diff/

Then, create a *.odt document and add a simple line of text such as “hello.” Stage the file and commit the doc to the repo:

$ git add file.odt
$ git commit -m "Create file.odt with hello"

Now, change the text in the doc to “Hello Solar System.” Add and commit the updated doc:

$ git commit -am "Update the file.odt file"

Let’s see the git log output:

$ git log --oneline

f14e810 (HEAD -> main) Update the file.odt file
a2f8e6a Create file.odt with hello

Next, run a git diff between the first and last commits. For binary files, Git only reports that the files differ, not what changed:

$ git diff a2f8e6a..f14e810

diff --git a/file.odt b/file.odt
index e08debd..02d4dce 100644
Binary files a/file.odt and b/file.odt differ

Not very helpful, huh?

To enable text diffs for these binary files, do the following. First, create a .gitattributes file in the root of the repository and add the following:

*.docx diff=docx
*.odt diff=odt

Then, add this to the .git/config file (Pandoc must be installed and on your PATH for the textconv commands to work):

[diff "docx"]
    textconv = pandoc --to=plain
[diff "odt"]
    textconv = pandoc --to=plain
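The same settings can also be written from the command line instead of editing the files by hand. This is only a sketch, assuming it is run from the parent directory of the binary_diff repository created earlier; re-running git init on an existing repository is harmless.

```shell
# git init on an existing repository does no damage; it is here only so
# the snippet also works in a fresh directory.
git init -q binary_diff
cd binary_diff

# Map the file extensions to named diff drivers.
printf '%s\n' '*.docx diff=docx' '*.odt diff=odt' >> .gitattributes

# Register the text converters; this writes the [diff "..."] sections
# into .git/config for this repository only.
git config diff.docx.textconv "pandoc --to=plain"
git config diff.odt.textconv "pandoc --to=plain"

# Show what was stored.
git config --get-regexp '^diff\..*\.textconv$'
```

Note that .gitattributes can be committed and shared, while .git/config is local to each clone, so collaborators need to repeat the git config step themselves.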

Now, run the same git diff between the first and last commits. This time Git shows the actual text differences:

$ git diff a2f8e6a..f14e810
diff --git a/file.odt b/file.odt
index 02d4dce..e08debd 100644
--- a/file.odt
+++ b/file.odt
@@ -1 +1 @@
-hello
+Hello Solar System

You will find that you can get the same result with *.docx file diffs.

This fix enables you to view how the .docx/.odt files have changed between the various commits.

Wednesday, August 25, 2021

What Text Format is Best for Git and GitHub?

For me this question was answered initially by considering what works best in Git and GitHub. Given that the readme file format in GitHub is Markdown, this is the path that I am on.

What are the other formats? One is LaTeX. Per Introduction to LaTeX (latex-project.org), LaTeX “is a document preparation system for high-quality typesetting. It is most often used for medium-to-large technical or scientific documents, but it can be used for almost any form of publishing and provides a powerful platform for layout and format.”

My goal is to write documentation for the software projects/repositories in which I am engaged. I don't need high-quality typesetting to present scientific formulas; rather, I need to explain a codebase's business function and construction. Markdown is easy to learn and well supported.

Moreover, I have started using Hugo, one of the most popular open-source static site generators, to build my own personal website as well as a potential tool for documentation. After years of wrestling HTML/CSS and JavaScript, I am happy to be able to stand up a static site in minutes with Hugo. Hugo also has excellent Markdown support out of the box. In fact, you write your posts in Markdown.

As I stated in my previous post, Documents as Code, when writing documents, I like to use either Microsoft Word or OpenOffice’s Writer. Both provide spell check along with grammar help and a thesaurus. Here is where the problem emerges. Both Microsoft Word (*.docx) and OpenOffice’s Writer (*.odt) produce binary files. Git and GitHub do not play well with binary files.

So, what did I do to better accommodate the conversion of *.docx/*.odt files to Markdown? Enter Pandoc. As Pandoc's site states, “If you need to convert files from one markup format into another, pandoc is your swiss-army knife.”

In addition to this, Pandoc is a CLI tool. There is no graphical user interface. Therefore, you have to open a terminal in your operating system of choice. For example, to convert this *.odt doc to Markdown, I did the following from the CLI:

$ pandoc 'What Text Format is best for Git and GitHub.odt' \
    -o BestTextFormatForGit.md

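For *.docx/*.odt files that have tables, I use the following options for table conversion into Markdown. The -simple_tables-multiline_tables-grid_tables suffix switches off those three Pandoc table extensions, so tables are written as pipe tables instead: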

$ pandoc 'What Text Format is best for Git and GitHub.odt' \
    -f odt -t markdown-simple_tables-multiline_tables-grid_tables \
    -o BestTextFormatForGit.md

In fact, this post was generated with the initial pandoc command above from an *.odt binary file. I only added the fenced code block sections to the Markdown to highlight the pandoc command output.

The next question to answer is, “How do you do a git diff with binary *.docx/*.odt files?”

Wednesday, August 18, 2021

Documents as Code

This post assumes a basic knowledge of Git and GitHub.

I love the workflow of using Git and GitHub in developing code. I have been thinking how cool it would be to use the same tools and processes that I use with Git and GitHub for other disciplines such as the legal field or any job where you create and edit documents. In short, almost all, if not all, fields. Part of my motivation is the fact that I teach programming to Business Informatics students at a local university. Most of them will not be Software Developers when they graduate. However, how great would it be for them if they understood the common workflow of their Software Developer co-workers? Secondly, I love how the Git and GitHub workflows assist me in better understanding the cause and effect of my work as well as other possibilities within counterfactual scenarios.

What is the Git workflow? Per Atlassian, “A Git workflow is a recipe or recommendation for how to use Git to accomplish work in a consistent and productive manner.” Essentially, Git workflows are governed by branches. Using a branch means you diverge from the main line of development and continue to do work without interfering with that main line (see Git - Branches in a Nutshell (git-scm.com)). Branches allow different team members to work independently and then combine their work when ready (for more, see https://about.gitlab.com/topics/version-control/what-is-git-workflow/). A commonly utilized Git workflow is the Gitflow Workflow. This workflow was first published and made popular by Vincent Driessen at nvie.
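For readers who have not used branches yet, the idea can be sketched in a few commands. Everything here is made up for illustration: the throwaway directory, the placeholder identity, the contract.md file, and the review-edits branch name.

```shell
# Throwaway repository; names, files, and messages are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch "main"
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "Draft agreement" > contract.md
git add contract.md
git commit -q -m "Add first draft"

# Diverge from the main line of work on a separate branch.
git checkout -q -b review-edits
echo "Clause 1: payment terms" >> contract.md
git commit -q -am "Add payment clause"

# Combine the work back into main when it is ready.
git checkout -q main
git merge -q review-edits
git log --oneline
```

While the edits live on review-edits, main still holds only the first draft, which is what lets team members work independently until they merge.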

When writing documents, I like to use either Microsoft Word or OpenOffice’s Writer. Both provide spell check along with grammar help and a thesaurus. Here is where the problem emerges. Both Microsoft Word and OpenOffice’s Writer produce binary files. Git and GitHub do not play well with binary files. Git was written for source code that is in a text-based format and therefore doesn’t understand what has changed between two revisions of a binary document. Most enterprises use some office suite such as Microsoft Word which produces binary files. While tools such as MS Word, or OpenOffice’s Writer, which I am using now, work great to produce and read docs, you can’t use Git or GitHub to review the document’s history.

Again, my point is that to get buy-in from non-coders, great tools such as Git and GitHub need to function with binary files such as *.docx and *.odt.

When searching for a resource that discusses treating an enterprise’s document knowledge base as artifacts within the Git workflow, and that could assist with the docx/odt conversion to text, I found the book Docs Like Code by Anne Gentle. This is from Docs Like Code:

When we say docs, we mean streamlined, tightly phrased, and fast-moving information that helps developers understand complex application interfaces. Docs can be anything from a single web page for a startup to an entire developer reference site. Modern docs, with their web and mobile interfaces and supportive user experience, are purposeful, instructive, and even beautiful. When we say treat docs like code, we mean that you: Store the doc source files in a version control system. Build the doc artifacts automatically. Ensure that a trusted set of reviewers meticulously reviews the docs. Publish the artifacts without much human intervention.

The next question is, what text format is best for Git and GitHub?

See https://blog.front-matter.io/mfenner/using-microsoft-word-with-git as a resource for your Git config, etc.
See Generate PDF invoices from Markdown using Pandoc - DEV Community for Markdown to PDF conversion.

Monday, August 02, 2021

The Basic Git Rebase

I have a bash shell script that creates three commits in the master branch. Then, the script creates and checks out a new branch called feature. In the feature branch two commits are created. Finally, two more commits are created in the master branch.
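The original script is not reproduced in this post, so the following is only a sketch of what such a script could look like. The commit messages M1 to M5 and F1, F2 follow the names used later in the post; the throwaway directory, the placeholder identity, and the commit helper are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a setup script: 3 commits on master, 2 on feature, 2 more on master.
set -e

repo=$(mktemp -d)                         # assumption: a throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder identity

# Hypothetical helper: one new file per commit, so the two branches never
# touch the same file and a later rebase cannot run into conflicts.
commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit M1; commit M2; commit M3

git checkout -q -b feature
commit F1; commit F2

git checkout -q master
commit M4; commit M5

# Show both branches and the point where they diverge.
git log --oneline --graph --all
```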

Here we run the script:

We will take a look at the master branch:

Now, a look at the feature branch:

Note that the feature branch history includes the initial commits from the master branch. In addition, you should note that we diverged our work when making commits on two different branches.

Let’s now take the changes that were introduced in F1 and F2 and reapply them on top of M5. In Git, this is called rebasing. With the rebase command, you can take all the changes that were committed on one branch and replay them on a different branch.

For this example, we are on the feature branch. From here we rebase it onto the master branch as follows:

As per https://git-scm.com/book/en/v2/Git-Branching-Rebasing, “This operation works by going to the common ancestor of the two branches (the one you’re on and the one you’re rebasing onto), getting the diff introduced by each commit of the branch you’re on, saving those diffs to temporary files, resetting the current branch to the same commit as the branch you are rebasing onto, and finally applying each change in turn.”

Now that we have rebased the feature branch commits onto the master branch, here is the resulting commit history:

Next, you can go back to the master branch, view its history, do a fast-forward merge, and then view its history again.
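The rebase and the fast-forward merge described above can be sketched as follows. The first half compactly re-creates the diverged history in a throwaway directory so the snippet runs on its own; the directory, the placeholder identity, and the use of empty commits for M1 to M5 are assumptions, not the original script.

```shell
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder identity

# Compact re-creation of the diverged history (M commits are empty here).
for m in M1 M2 M3; do git commit -q --allow-empty -m "$m"; done
git checkout -q -b feature
for m in F1 F2; do echo "$m" > "$m.txt"; git add "$m.txt"; git commit -q -m "$m"; done
git checkout -q master
for m in M4 M5; do git commit -q --allow-empty -m "$m"; done

# Replay F1 and F2 on top of M5.
git checkout -q feature
git rebase -q master
git log --oneline feature

# master is now a direct ancestor of feature, so the merge only moves the
# master pointer forward; --ff-only makes git refuse anything else.
git checkout -q master
git merge -q --ff-only feature
git log --oneline master
```

After the merge, master and feature point at the same commit and the history is a single straight line from M1 to F2.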

Here is the result:

Enjoy!