While onboarding to a new team at work, I asked for a simple task to help me get better acquainted with our codebases and processes. (Plus, I’d be AFK for a few days and needed something on the small side.) We’ve been reorganizing one of our repositories by changing some plugins from git submodules to git subtrees. It’s not a big priority, but it’s a good maintenance job for learning more about our processes. We also needed to sync some changes from another repo into the subtree in the main repo. Seems like a simple task, eh?
Supposedly, one of our directories was already a subtree. I started looking up how to interact with subtrees, and the update command should be a simple pull command:
$ git subtree pull --prefix=cron/cron-control --squash git@github.com:Automattic/Cron-Control.git main
From https://github.com/Automattic/Cron-Control
* branch main -> FETCH_HEAD
fatal: can't squash-merge: 'cron/cron-control' was never added.
So… it’s not a subtree? After some poking around, I was able to get the subtree working by deleting the directory and adding the subtree fresh. But why? If we have to do this every time we want to update the subtree, it’d be pretty clunky. (Plus, why even use subtrees at all if git doesn’t think it’s a subtree?)
After checking with ChatGPT, I found a way to view local subtrees, and went back to a fresh clone to see what was up.
$ git log | grep git-subtree-dir
git-subtree-dir: cron/cron-control
git-subtree-dir: cron/cron-control
...
So it is a subtree already?
Something fishy was going on. I found a Stack Overflow question about a similar problem. None of the answers seemed to apply directly, but they left some hints that the commit message could be the issue. So I dug into the source, using GIT_TRACE=1 to provide a bit more info.
$ GIT_TRACE=1 git subtree pull --prefix=cron/cron-control --squash cron-control main
...
13:35:54.672155 run-command.c:659 trace: run_command: git maintenance run --auto --no-quiet
13:35:54.676014 git.c:463 trace: built-in: git maintenance run --auto --no-quiet
13:35:54.681205 git.c:463 trace: built-in: git rev-parse -q --verify 'FETCH_HEAD^{commit}'
13:35:54.686303 git.c:463 trace: built-in: git diff-index HEAD --exit-code --quiet
13:35:54.694880 git.c:463 trace: built-in: git diff-index --cached HEAD --exit-code --quiet
13:35:54.700252 git.c:463 trace: built-in: git log '--grep=^git-subtree-dir: cron/cron-control/*$' --no-show-signature '--pretty=format:START %H%n%s%n%n%b%nEND%n' HEAD
fatal: can't squash-merge: 'cron/cron-control' was never added.
Hm. So to detect subtrees, git uses git log '--grep=^git-subtree-dir: cron/cron-control/*$'. (Source code here.) Running this locally provided no matches. However, since we used grep previously and got results, I figured a different regex should show a match. git log '--grep=^git-subtree-dir: cron/cron-control/*' shows a few entries. Maybe there’s some dangling whitespace. Yep, git log '--grep=^git-subtree-dir: cron/cron-control/*\s$' does match the commit which originally added this directory as a subtree.
This made me wonder: what whitespace is there? Dumping the original commit with git show -s --format=%B 048779a0bebc4dbfb899f3ad8c0b68f64d489e8d > test.txt, and looking at it in my editor showed no extra space at the end of the line about subtree-dir. My next thought was line endings… and sure enough, this commit has mixed line endings, but most commits do not. 🙃
$ git show -s --format=%B 048779a0bebc4dbfb899f3ad8c0b68f64d489e8d > test.txt
$ file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF, LF line terminators
And of course, an updated regex proves it too: git log '--grep=^git-subtree-dir: cron/cron-control/*\r$' matches the commit. Something wrote this commit with the wrong line endings.
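To see why the anchored pattern misses a CRLF trailer, here is a minimal sketch that feeds two simulated trailer lines through plain grep rather than a real repository. The `count_matches` helper and the sample strings are illustrative assumptions, and this uses grep's regex engine rather than git's, though the results above suggest the `$` anchor behaves the same way in both.

```shell
# The same pattern that git subtree passes to "git log --grep".
pattern='^git-subtree-dir: cron/cron-control/*$'

# Hypothetical helper: print how many lines of the given text match the pattern.
count_matches() {
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # aborting the script if it runs under "set -e".
  printf '%s\n' "$1" | grep -c "$pattern" || true
}

# One trailer line as git writes it locally (LF only), and one as it would
# look with a carriage return left at the end of the line (CRLF).
lf_line='git-subtree-dir: cron/cron-control'
crlf_line=$(printf 'git-subtree-dir: cron/cron-control\r')

lf_hits=$(count_matches "$lf_line")
crlf_hits=$(count_matches "$crlf_line")

echo "LF: $lf_hits, CRLF: $crlf_hits"
```

The carriage return sits between the directory name and the end of the line, so `$` never gets a chance to match and the CRLF count comes back as zero.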
Equipped with this knowledge, a Google search led me to this discussion, which indicates that any commit message created in the GitHub web interface uses CRLF line endings.
And what do we use frequently that might cause that? The squash merge strategy in GitHub PRs. I tested this by creating a few PRs, squash merging them, dumping the resulting commits, and checking their line endings. That confirmed that even if you don’t make any edits to the squashed commit in GitHub’s interface, it’ll still get CRLF line endings. The only way to avoid this is to make sure the original commit message is used. The normal merge strategy doesn’t change the line endings, because it doesn’t overwrite the original commit data you created locally. (As far as I can tell!)
In other words, if you use squash merge in GitHub when creating a subtree, git will not understand that this directory is in fact a subtree. All because the commit message which creates the subtree uses a carriage return!
The quick solution on our end is to just not use “squash merge” in GitHub when dealing with subtrees. If we want to squash or rewrite the history locally, we can definitely still do that, so long as that locally created commit is copied identically to GitHub when the PR is merged. (Plus, the metadata about subtrees needs to remain in the commit message.)
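If you want to check whether a repository already has subtree commits affected by this, something along these lines might help. It is only a sketch: `has_cr` and `list_cr_subtree_commits` are hypothetical helper names, and it assumes `git show -s --format=%B` passes carriage returns through unchanged, as it did in the `file` check earlier.

```shell
# Hypothetical helper: succeeds if the text on stdin contains a carriage return.
has_cr() {
  cr=$(printf '\r')
  grep -q "$cr"
}

# Hypothetical helper: print the hashes of commits reachable from HEAD whose
# message has a git-subtree-dir trailer and also contains a carriage return.
list_cr_subtree_commits() {
  # No "$" anchor here, so both LF and CRLF trailers are picked up.
  git log --format=%H --grep='^git-subtree-dir: ' | while read -r rev; do
    if git show -s --format=%B "$rev" | has_cr; then
      echo "$rev"
    fi
  done
}
```

Running `list_cr_subtree_commits` inside a clone should list any subtree commits whose messages picked up CRLF line endings; an empty result would suggest `git subtree pull` can find its metadata.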
As a result, updating a subtree in the future should be as simple as git subtree pull --prefix=$subtree_path --squash $remote $branch.