Git LFS: The better way to handle large files in git

In this tutorial, we discuss how to handle large and binary files effectively with Git LFS, a must-know feature that most git users have never heard of. We will look at how to use it and where it fits.

Problems with git

Before moving to a newer technology, it is important to acknowledge and analyze the problems with the current one. Git is a great tool, but it is not perfect (no software is). Git faces the following issues when handling large and binary files:

  • Large files make clones, fetches, and pulls slower.
  • Any update to a binary file is registered as a complete file change. Git keeps each version of the binary in its history, and binary content rarely compresses well against earlier versions, so frequent updates bloat the repository. A larger repository means slower fetches and pulls.

These problems are well known, and there is no perfect solution (yet). There are, however, workarounds that take most of the pain away.

The good news is that you do not need to apply them manually. Git LFS, an open-source extension to git, employs these workarounds to make working with large files smoother. But how exactly does it do it?

How does Git LFS work?

Git LFS downloads large files and their versions lazily. The repository itself only stores a small text pointer for each LFS file, while the actual content lives in separate LFS storage. The full history of these files is therefore not downloaded by default; the relevant version is fetched only when you check out a commit that contains an LFS file. This saves a lot of disk space and shortens pulls and fetches. We will not go into the intricate details of how git-lfs works. If you want to learn more, you can refer to this blog by Atlassian.
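To make the idea more concrete, here is a rough sketch of the small text pointer that is committed to the repository in place of the real file content. It is hand-rolled for illustration only: the function name make_pointer is made up, it assumes GNU sha256sum is available, and in real use git-lfs writes these pointers for you.

```shell
# Print pointer text in the format Git LFS commits in place of a large file.
# make_pointer is a hypothetical helper, not part of git or git-lfs.
make_pointer() {
  file="$1"
  oid=$(sha256sum "$file" | cut -d ' ' -f 1)   # SHA-256 of the file content
  size=$(wc -c < "$file" | tr -d ' ')          # file size in bytes
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}

# Usage: make_pointer video.mp4
```

The pointer is only three short lines no matter how large the file is, which is why a repository full of LFS files stays quick to fetch.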

Installing git-lfs on Linux

On Debian-based and Arch-based systems, git-lfs can be installed from the official repositories.

# For Debian-based system
sudo apt install git-lfs

# For Arch-based system
sudo pacman -S git-lfs

If your Debian-based distribution does not ship git-lfs in its repositories, you can add the packagecloud repository first and then install the package:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

sudo apt-get install git-lfs

To activate git LFS, run the following command. It only needs to be done once per user account:

# Activate git LFS
git lfs install

You are all set to get going with git LFS.
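If you want to double-check the installation, a small sketch like the one below can help. The function name check_lfs is made up for this example; it relies only on the git lfs version subcommand.

```shell
# Report whether git-lfs is reachable through git.
# check_lfs is a hypothetical helper name used only in this sketch.
check_lfs() {
  if git lfs version >/dev/null 2>&1; then
    echo "git-lfs available: $(git lfs version)"
  else
    echo "git-lfs not found on PATH"
    return 1
  fi
}

# Usage: check_lfs
```

If the second message appears, revisit the installation steps above before tracking any files.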

How to use git lfs?

Git LFS works under the hood. In most cases, you will not interact with it directly: push, pull, and checkout are still handled by the usual git commands. git-lfs downloads LFS files on demand behind the scenes, so you do not need to worry about its inner workings.

Probably the only thing you need to worry about is what files should be treated as LFS.

To add a file to git as an LFS object, use the track command. It tells git-lfs which files are to be treated as LFS objects. To mark a new file as an LFS object, run:

# Mark a particular file as an LFS object
git lfs track "<filename.extension>"

To track a whole type of file as LFS objects, you can use wildcards. Here is an example that marks all mp4 files as LFS:

# Mark all mp4 files as LFS objects
git lfs track "*.mp4"

Similar to how git stores the files to be ignored in .gitignore, git-lfs lists the patterns to be tracked in a file called .gitattributes. This file defines the LFS objects for the repository. Committing and sharing .gitattributes ensures that all developers working on the same code reuse the same tracking list.
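To see what ends up in .gitattributes, the sketch below writes by hand the line that git lfs track "*.mp4" records, and then asks plain git which filter applies to a few paths. The throwaway repository and the file names (clip.mp4, assets/intro.mp4, notes.txt) are made up for this example; the line is written manually only so that the sketch also runs on a machine without git-lfs.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# The entry that `git lfs track "*.mp4"` adds to .gitattributes.
printf '*.mp4 filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# Ask git which filter applies to some example paths (the files need not exist).
git check-attr filter -- clip.mp4 assets/intro.mp4 notes.txt
```

Both mp4 paths should report the lfs filter, including the one in a subdirectory, while notes.txt is reported as unspecified. In a real repository you would let git lfs track write the file and then commit it with git add .gitattributes.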

The above method only works for files that have not been committed to git before. If you already have a repository with large files tracked by git, you need to migrate those files from git tracking to git-lfs tracking.

Migrating rewrites the history of the current branch so that those files are stored through git-lfs. To rewrite all branches you can use the --everything flag, but it is absolutely not recommended.

# Migrate the files to be tracked under git-lfs
git lfs migrate import --include="<files to be tracked>"

Because git lfs migrate rewrites a lot of git history, you might want to see its effect before actually migrating. You can do this by replacing import with info.

# Preview what migrate would affect, without rewriting anything
git lfs migrate info --include="<files to be tracked>"
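Before deciding what to pass to --include, it can also help to see which files are actually the largest in your history. The sketch below uses only plain git commands for that; the function name largest_blobs is made up, and this is a complementary check rather than something git-lfs does for you.

```shell
# List the largest blobs reachable from any ref, biggest first.
# largest_blobs is a hypothetical helper; the optional argument is the row count.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { $1 = ""; sub(/^ /, ""); print }' |
    sort -rn |
    head -n "${1:-10}"
}

# Usage (inside a repository): largest_blobs 5
```

Each output line holds a size in bytes followed by the path, which gives a reasonable starting point for choosing the patterns to migrate.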

Closing Remarks

Before we end this article, here are a few things that may seem trivial but are actually important when you are working with Git LFS:

  • LFS files are stored in the local git LFS cache, which is not cleaned automatically. To stop the repository from occupying an absurd amount of space over time, clean the cache manually with git lfs prune.
  • Make sure all developers working on the repository have git LFS installed. This is very important because without it their checkouts contain pointer files instead of the real content, which results in weird-looking errors.
  • git lfs migrate rewrites the history of the repository. Make sure you fully understand the consequences before using it.

References