With help from Fernando Perez, Karthik Ram, and Matthew Brett, and with illustrations from Scott Chacon's Pro Git book

Git for scientists

Who uses a version control system?

We all do

Using a proper version control system promotes:

  • Reproducibility : logging of every step
  • Peace of mind : a robust backup system
  • Flexibility : zero-cost branching
  • Collaboration : synchronization across multiple computers/people

This is preferable to a system like Dropbox, because it gives you fine-grained control over what you version and when you version it

You can use version control for:

  • Documents: papers/grants
  • Analysis code
  • Tabular and text data
  • Teaching

Introducing git

Try typing git at the terminal

The first time you start to use git, you will need to configure your author information (use your own name and email address):

$ git config --global user.name 'Ariel Rokem'

$ git config --global user.email 'arokem@gmail.com'

You will want to make sure that your editor is configured:

$ git config --global core.editor nano  # Mac/Linux

$ git config --global core.editor "'C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin" # Windows

TextWrangler, SublimeText and TextMate are not good editors for this, because git needs the editor command to block until you close the commit message file. If you know how to use vi, you can use that instead

While we're at it, let's configure it to use colors, which is helpful:

$ git config --global color.ui "auto"

To see all the items in your configuration:

$ git config --global --list
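
The same keys can also be set for a single repository by leaving out --global. Here is a minimal sketch of that, run in a scratch directory; the repository name demo and the name "Example Author" are made up for illustration:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory, so nothing real is touched
cd "$workdir"
git init -q demo
cd demo

# Without --global, the setting is stored in this repository's .git/config
git config user.name "Example Author"      # hypothetical name

# --get prints the effective value of one key; the local value wins over the global one
name=$(git config --get user.name)
echo "$name"

# --local --list shows only the settings stored in this repository
git config --local --list
```

This can be handy when one project needs a different name or email address than the rest of your work.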

Git keeps snap-shots of your work over time

Five levels of git usage:

  • Local - linear
  • Local - branching
  • Remote - single user
  • Remote - small team
  • Remote - collaboration

Local and linear use

Make a directory

$ mkdir science

$ cd science

Initialize a git repository, a set of files and directories that are associated with each other and tracked together

$ git init

or

$ git init science

What is in here?

$ ls -a
./    ../   .git/

The repository stores its information in the .git directory.

To check what is going on

$ git status
# On branch master
#
# Initial commit
#
nothing to commit (create/copy files and use "git add" to track)

Let's make a file with some text:

$ echo "Cogito ergo sum" > cogito.txt
$ git status

    # On branch master
    #
    # Initial commit
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #   cogito.txt
    nothing added to commit but untracked files present (use "git add" to track)

Files need to be explicitly added to the repository to be tracked by git:

$ git add cogito.txt
$ git status

    # On branch master
    #
    # Initial commit
    #
    # Changes to be committed:
    #   (use "git rm --cached <file>..." to unstage)
    #
    #   new file:   cogito.txt
    #

This might be a good point to save a snap-shot of the current state of our repository

Type:

$ git commit

This should open a text editor with the following content:

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Author:    arokem <arokem@gmail.com>
#
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   cogito.txt
#

Enter an informative commit message above this:

Initial commit of the cogito.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Author:    arokem <arokem@gmail.com>
#
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   cogito.txt
#

Save and exit the commit message file

Something like this should appear in your terminal:

$ git commit
[master (root-commit) dbedc93] Initial commit of the cogito.
 Author: arokem <arokem@gmail.com>
 1 file changed, 1 insertion(+)
 create mode 100644 cogito.txt

To see the history of the repo, you can always type:

$ git log
commit dbedc93842554629a0bc441b38dc4b6824355aee
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:12:12 2013 -0700

    Initial commit of the cogito.

Let's make some changes to the file

$ echo "Edo ergo sum" >> cogito.txt

What has changed?

$ git diff
diff --git a/cogito.txt b/cogito.txt
index 01c7cba..9084743 100644
--- a/cogito.txt
+++ b/cogito.txt
@@ -1 +1,2 @@
 Cogito ergo sum
+Edo ergo sum
$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified:   cogito.txt
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git commit 
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified:   cogito.txt
#
no changes added to commit (use "git add" and/or "git commit -a")

Nothing was committed, because the change to cogito.txt has not been staged. We could run:

$ git add cogito.txt 

Followed by:

$ git commit       

A shortcut is given by:

$ git commit -a

After doing that:

$ git status
# On branch master
nothing to commit (working directory clean)
$ git log
commit 48143556b5a8b86aecb78d06ed749ab2b1c37248
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:32:01 2013 -0700

    Man's gotta eat.

commit dbedc93842554629a0bc441b38dc4b6824355aee
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:12:12 2013 -0700

    Initial commit of the cogito.

The cycle of git virtue: work, (add), commit, work, (add), commit

Check out this wonderful visualization: http://ndpsoftware.com/git-cheatsheet.html

git commands we've seen so far:

  • git init
  • git status
  • git add
  • git commit
  • git log
  • git diff
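
The commands in this list can be strung together into one pass through the cycle. The following is only a sketch: it works in a scratch directory, uses a made-up author identity, and passes the commit message with -m so that no editor opens:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

# work, add, commit
echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."

# work, commit (-a stages the changes to tracked files for us)
echo "Edo ergo sum" >> cogito.txt
git diff --stat
git commit -q -a -m "Man's gotta eat."

git status --short        # prints nothing when the working directory is clean
git log --oneline         # one line per commit, newest first
n_commits=$(git rev-list --count HEAD)
echo "commits so far: $n_commits"
```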

Before moving on: the anatomy of a commit

Now we can understand that a repository is simply a group of linked commits
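
You can look at these links yourself. As a rough sketch (scratch directory, made-up author identity), git cat-file -p prints the raw content of a commit, including the hash of its parent:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."
echo "Edo ergo sum" >> cogito.txt
git commit -q -a -m "Man's gotta eat."

# A commit records a tree (the snapshot), its parent commit, the author and the message
git cat-file -p HEAD

parent=$(git rev-parse HEAD^)                # the commit that HEAD links back to
root=$(git rev-list --max-parents=0 HEAD)    # the commit that has no parent
echo "parent of HEAD: $parent"
echo "root commit:    $root"
```

With only two commits, the parent of the latest commit is the root commit, so the two hashes printed at the end are the same.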

Exercise:

  • Make changes to the file we already have
  • Add another file and make some changes to it.
  • Make commits as you go along
  • Remove this second file (Take a look at git help rm)

Next stage

Local and branching

What is a branch? Simply a label for the 'current' commit in a sequence of ongoing commits:

Let's make another branch that we will call testing:

$ git branch testing

We can switch over to this other branch by checking it out:

$ git checkout testing

There can be multiple branches alive at any point in time; the working directory reflects the commit referenced by a special pointer called HEAD, which normally points to the currently checked-out branch.

In this example there are two branches, master and testing, and testing is the currently active branch since it's what HEAD points to:

Let's see what git has done. If we type:

$ less .git/HEAD
ref: refs/heads/testing
$ less .git/refs/heads/testing
48143556b5a8b86aecb78d06ed749ab2b1c37248
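
The same information is available without reading files under .git directly. This is a small sketch (scratch directory, made-up author identity) that asks git where HEAD points and which commit each branch label refers to:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master, as in the text
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."

git branch testing
git checkout -q testing

# Which branch does HEAD point to?
head_ref=$(git symbolic-ref HEAD)
echo "$head_ref"

# Which commits do the two labels refer to? No new commits yet, so both print the same hash
git rev-parse testing master
```

Going through git symbolic-ref and git rev-parse keeps working even when git stores its references in a packed form rather than as one file per branch.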

Once new commits are made on a branch, HEAD and the branch label move with the new commits:

$ echo "dx/dy=(x2-x1)/(y2-y1)" > geometry.txt

$ git add geometry.txt

$ git status
# On branch testing
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   new file:   geometry.txt
#
$ git commit 

[testing 0c00437] Another kind of science
 Author: arokem <arokem@gmail.com>
 1 file changed, 1 insertion(+)
 create mode 100644 geometry.txt
$ git log

commit 0c004374d232b0c49068f7c4b31226710fc2599f
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 13:24:59 2013 -0700

    Another kind of science

commit 48143556b5a8b86aecb78d06ed749ab2b1c37248
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:32:01 2013 -0700

    Man's gotta eat.

commit dbedc93842554629a0bc441b38dc4b6824355aee
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:12:12 2013 -0700

    Initial commit of the cogito.

If we continue to work on both branches, the history could diverge:

$ git checkout master
    Switched to branch 'master'

Note that this changes the files on your disk!

$ ls
cogito.txt

If we do some more work here:

$ echo "Non pane solo" >> cogito.txt 

Passing a file name to git commit commits the current changes to that file without a separate git add:

$ git commit cogito.txt
[master 61da4a4] Not on bread alone
 Author: arokem <arokem@gmail.com>
 1 file changed, 1 insertion(+)

$ git log

commit 61da4a4963c00b2a51b4503d3eb13894a4bcac41
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 13:40:11 2013 -0700

    Not on bread alone

commit 48143556b5a8b86aecb78d06ed749ab2b1c37248
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:32:01 2013 -0700

    Man's gotta eat.

commit dbedc93842554629a0bc441b38dc4b6824355aee
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 12:12:12 2013 -0700

    Initial commit of the cogito.

This is our current situation

But based on this graph structure, git can compute the necessary information to merge the divergent branches back:

$ git merge testing

Merge made by the 'recursive' strategy.
 geometry.txt |    1 +
 1 file changed, 1 insertion(+)
 create mode 100644 geometry.txt
$ git log --graph


*   commit 7c8adb5345ce035ca436b779472e38d56a2c3fcf
|\  Merge: 61da4a4 0c00437
| | Author: arokem <arokem@gmail.com>
| | Date:   Sun Apr 28 13:43:41 2013 -0700
| | 
| |     Merge branch 'testing'
| |   
| * commit 0c004374d232b0c49068f7c4b31226710fc2599f
| | Author: arokem <arokem@gmail.com>
| | Date:   Sun Apr 28 13:24:59 2013 -0700
| | 
| |     Another kind of science
| |   
* | commit 61da4a4963c00b2a51b4503d3eb13894a4bcac41
|/  Author: arokem <arokem@gmail.com>
|   Date:   Sun Apr 28 13:40:11 2013 -0700
|   
|       Not on bread alone
|  
* commit 48143556b5a8b86aecb78d06ed749ab2b1c37248
| Author: arokem <arokem@gmail.com>
| Date:   Sun Apr 28 12:32:01 2013 -0700
| 
|     Man's gotta eat.
|  
* commit dbedc93842554629a0bc441b38dc4b6824355aee
  Author: arokem <arokem@gmail.com>
  Date:   Sun Apr 28 12:12:12 2013 -0700

      Initial commit of the cogito.
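
The piece of information git needs from the graph is the last commit that both branches share. Here is a sketch that rebuilds a small diverged history in a scratch directory (made-up author identity, one commit fewer than above) and asks for that commit with git merge-base before merging:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."

# One commit on testing ...
git checkout -q -b testing
echo "dx/dy=(x2-x1)/(y2-y1)" > geometry.txt
git add geometry.txt
git commit -q -m "Another kind of science"

# ... and one on master, so the two histories diverge
git checkout -q master
echo "Non pane solo" >> cogito.txt
git commit -q -a -m "Not on bread alone"

# The most recent commit that both branches have in common
base=$(git merge-base master testing)
git log --oneline -1 "$base"

# --no-edit accepts the default merge message without opening an editor
git merge -q --no-edit testing
git log --oneline --graph
n_parents=$(git cat-file -p HEAD | grep -c '^parent ')
echo "the merge commit has $n_parents parents"
```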

Note: I have my bash terminal configured to give me information about what branch HEAD is currently pointing to

You can get that too, by putting the following in your ~/.bash_profile file:

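# Set the prompt to show the current git branch:
# Color codes used in the prompt below
RED='\[\033[0;31m\]'
NO_COLOR='\[\033[0m\]'

function parse_git_branch {
  # Print the current branch name in parentheses; print nothing outside a repository
  ref=$(git symbolic-ref HEAD 2> /dev/null) || return
  echo "(${ref#refs/heads/})"
}

PS1="\h:\W$RED \$(parse_git_branch)$NO_COLOR $ "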

We've now seen:

  • git branch
  • git checkout
  • git merge

Exercise - play around with branches a bit:

Make a different branch, make some changes and merge your changes into master

What happens if you edit the same content on two different branches?

Merge conflicts

$ git checkout -b nap

After a little bit of editing:

$ git diff
diff --git a/cogito.txt b/cogito.txt
index d2fb549..48cd2cc 100644
--- a/cogito.txt
+++ b/cogito.txt
@@ -1,3 +1,3 @@
 Cogito ergo sum
-Edo ergo sum
+Edo ergo dormio
 Non pane solo
$ git commit cogito.txt

[nap 200d2ac] I need a nap
 Author: arokem <arokem@gmail.com>
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git log

commit 200d2ac38dbd8f6266a7c2b1fca422ab9c18e920
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 14:02:23 2013 -0700

    I need a nap

commit 7c8adb5345ce035ca436b779472e38d56a2c3fcf
Merge: 61da4a4 0c00437
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 13:43:41 2013 -0700

    Merge branch 'testing'

Meanwhile, on master:

$ git log -p

    commit 54dfe80854b4fc8311974f194292ec286207284e
    Author: arokem <arokem@gmail.com>
    Date:   Sun Apr 28 14:03:33 2013 -0700

        Work will set you free.

    diff --git a/cogito.txt b/cogito.txt
    index d2fb549..e30514f 100644
    --- a/cogito.txt
    +++ b/cogito.txt
    @@ -1,3 +1,3 @@
     Cogito ergo sum
    -Edo ergo sum
    +Edo ergo laboro
     Non pane solo

    commit 7c8adb5345ce035ca436b779472e38d56a2c3fcf
    Merge: 61da4a4 0c00437
    Author: arokem <arokem@gmail.com>
    Date:   Sun Apr 28 13:43:41 2013 -0700

    Merge branch 'testing'
$ git merge nap

Auto-merging cogito.txt
CONFLICT (content): Merge conflict in cogito.txt
Automatic merge failed; fix conflicts and then commit the result.

When I open the file I see the following:

Cogito ergo sum
<<<<<<< HEAD
Edo ergo laboro
=======
Edo ergo dormio
>>>>>>> nap
Non pane solo

I edit this to:

Cogito ergo sum
Edo ergo laboro
Non pane solo
$ git add cogito.txt

$ git commit 

The commit message will automatically open as:

Merge branch 'nap'

Conflicts:
        cogito.txt
#
# It looks like you may be committing a merge.
# If this is not correct, please remove the file
#       .git/MERGE_HEAD
# and try again.


# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Author:    arokem <arokem@gmail.com>
#
# On branch master
$ git log --graph

*   commit ad3cd0eff8454f92e3251e831fc9541225374cdc
|\  Merge: 54dfe80 200d2ac
| | Author: arokem <arokem@gmail.com>
| | Date:   Sun Apr 28 14:09:09 2013 -0700
| | 
| |     Merge branch 'nap'
| |     
| |     Conflicts:
| |             cogito.txt
| |   
| * commit 200d2ac38dbd8f6266a7c2b1fca422ab9c18e920
| | Author: arokem <arokem@gmail.com>
| | Date:   Sun Apr 28 14:02:23 2013 -0700
| | 
| |     I need a nap
| |   
* | commit 54dfe80854b4fc8311974f194292ec286207284e
|/  Author: arokem <arokem@gmail.com>
|   Date:   Sun Apr 28 14:03:33 2013 -0700
|   
|       Work will set you free.
|    
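
The whole conflict can be replayed in a scratch directory. This is only a sketch under some assumptions: the author identity and the commit message on master are made up, and the conflict is resolved by overwriting the file instead of editing it by hand. It also shows one way to list the files that are still in conflict:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

printf 'Cogito ergo sum\nEdo ergo sum\nNon pane solo\n' > cogito.txt
git add cogito.txt
git commit -q -m "Starting point"

# The same line is edited differently on two branches
git checkout -q -b nap
printf 'Cogito ergo sum\nEdo ergo dormio\nNon pane solo\n' > cogito.txt
git commit -q -a -m "I need a nap"
git checkout -q master
printf 'Cogito ergo sum\nEdo ergo laboro\nNon pane solo\n' > cogito.txt
git commit -q -a -m "Back to work"           # hypothetical message

# git merge exits with a non-zero status when it stops on a conflict
if git merge --no-edit nap > /dev/null 2>&1; then
  echo "merged cleanly"
else
  echo "merge stopped on a conflict"
fi

# List the files that still contain conflict markers
conflicted=$(git diff --name-only --diff-filter=U)
echo "in conflict: $conflicted"

# Resolve by writing the version we want to keep, then add and commit
printf 'Cogito ergo sum\nEdo ergo laboro\nNon pane solo\n' > cogito.txt
git add cogito.txt
git commit -q --no-edit                      # keep the prepared merge message
n_parents=$(git cat-file -p HEAD | grep -c '^parent ')
```

If you decide you are not ready to deal with the conflict, git merge --abort returns the working directory to the state it was in before the merge started.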

The cycle of virtue extends: branch, work, commit, merge, branch, work, commit, merge

Once you have merged a branch, you no longer need its label, and you can delete it with git branch -d:

$ git branch -a

    * master
      nap
      testing
$ git branch -d nap

    Deleted branch nap (was 200d2ac).
$ git branch -a

    * master
      testing
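
If you are not sure whether a branch is safe to delete, git can tell you. In this sketch (scratch directory, made-up author identity, and a made-up unmerged branch called draft), git branch --merged lists the branches whose commits are already contained in master, and git branch -d refuses to delete one that is not:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"
git init -q science
cd science
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."

# nap gets a commit and is merged back into master
git checkout -q -b nap
echo "Edo ergo dormio" >> cogito.txt
git commit -q -a -m "I need a nap"
git checkout -q master
git merge -q nap

# draft (a hypothetical branch) gets a commit that is never merged
git checkout -q -b draft
echo "unfinished thought" >> cogito.txt
git commit -q -a -m "Work in progress"
git checkout -q master

# Branches whose work is already part of master
merged=$(git branch --merged master)
echo "$merged"

git branch -d nap
if ! git branch -d draft 2> /dev/null; then
  echo "draft has unmerged commits, so git branch -d keeps it"
fi
```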

Suggestion: never work on master

You should now know:

  • How to make branches and work with them
  • What to do in case of a conflict

Moving on: single user working with a remote

We will demonstrate this using github.com, but there are other options, such as bitbucket.org

Go to github.com and get yourself a user account

Meanwhile, I will entertain you by showing you some interesting github repositories:

Get a free micro account here

Do we have any remotes here?

$ git remote -v

Since the call git remote -v didn't produce any output, it means we have no remote repositories configured. We will now proceed to do so. Once logged into github, go to the new repository page and make a repository called science.

Do not check the box that says Initialize this repository with a README, since we already have an existing repository here. That option is useful when you're starting first at Github and don't have a repo made already on a local computer.

We can now follow the instructions from the next page:

$ git remote add origin https://github.com/arokem/science.git # Put your user name here
$ git remote -v

origin  https://github.com/arokem/science.git (fetch)
origin  https://github.com/arokem/science.git (push)
$ git push -u origin master

    Counting objects: 21, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (12/12), done.
    Writing objects: 100% (21/21), 1.90 KiB, done.
    Total 21 (delta 2), reused 0 (delta 0)
    To https://github.com/arokem/science.git
     * [new branch]      master -> master
    Branch master set up to track remote branch master from origin.

This is useful for backup

Our repo is copied to github's webservers and if our laptop is stolen, we can easily recover our work.

This is useful for synchronization

We can see that by simulating another computer as another directory on the same machine.

$ git clone https://github.com/arokem/science.git

    Cloning into 'science'...
    remote: Counting objects: 21, done.
    remote: Compressing objects: 100% (10/10), done.
    remote: Total 21 (delta 2), reused 21 (delta 2)
    Unpacking objects: 100% (21/21), done.
$ cd science/

$ ls

cogito.txt    geometry.txt

Going back to the original directory, I make a change in one of the files:

$ echo "c = sqrt(pow(a,2)+pow(b,2))" >> geometry.txt

$ git commit -a -m"Pythagoras knew this"
[master 6e56586] Pythagoras knew this
 Author: arokem <arokem@gmail.com>
 1 file changed, 1 insertion(+)

We push that up to github:

$ git push

Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 337 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/arokem/science.git
   ad3cd0e..6e56586  master -> master

Switching to the new directory (think "another computer")

$ git pull

remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
Unpacking objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 3 (delta 0)
From https://github.com/arokem/science
   ad3cd0e..6e56586  master     -> origin/master
Updating ad3cd0e..6e56586
Fast-forward
 geometry.txt |    1 +
 1 file changed, 1 insertion(+)

git pull actually performs two distinct operations:

$ git fetch 

Which brings in changes from the remote, and:

$ git merge origin/master

which, well - you know what merge does, no?
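
Doing the two steps separately lets you look at what is coming in before you merge it. This is a sketch under some assumptions: a bare repository in a local directory (hub.git) stands in for github, the directories laptop and desktop stand in for two computers, and the author identity is made up:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"

# A bare repository in a local directory stands in for github
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# First computer: create the project and push it
git init -q laptop
cd laptop
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"
echo "dx/dy=(x2-x1)/(y2-y1)" > geometry.txt
git add geometry.txt
git commit -q -m "Another kind of science"
git remote add origin "$workdir/hub.git"
git push -q -u origin master

# Second computer: a clone in another directory
git clone -q "$workdir/hub.git" "$workdir/desktop"

# A new commit is pushed from the first computer
echo "a second line" >> geometry.txt
git commit -q -a -m "More geometry"
git push -q

# On the second computer: fetch, inspect, then merge
cd "$workdir/desktop"
git fetch -q
incoming=$(git rev-list --count master..origin/master)
echo "commits on origin/master that master does not have: $incoming"
git log --oneline master..origin/master
git merge -q origin/master
```

The range master..origin/master means "commits reachable from origin/master but not from master", which is exactly what the merge is about to bring in.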

Let's recap

You should now know how to:

  • Set up a repository on github
  • Connect a local repository with a remote on github
  • Synchronize the repository across multiple computers

Collaborating with others : small teams

To demonstrate collaboration with a small team, we will set up a shared repository with one partner (the person sitting next to you). This will show the basic workflow of collaborating on a project with a small team where everyone has write privileges to the same repository.

Note for SVN users: this is similar to the classic SVN workflow, with the distinction that commit and push are separate steps. SVN, having no local repository, commits directly to the shared central resource, so to a first approximation you can think of svn commit as being synonymous with git commit; git push.

We will have two people, let's call them Alice and Bob, sharing a repository. Alice will be the owner of the repo and she will give Bob write privileges.

We begin with a simple synchronization example, much like we just did above, but now between two people instead of one person. Otherwise it's the same:

  • Bob clones Alice's repository.
  • Bob makes changes to a file and commits them locally.
  • Bob pushes his changes to github.
  • Alice pulls Bob's changes into her own repository.

Next, we will have both parties make non-conflicting changes each, and commit them locally. Then both try to push their changes:

  • Alice adds a new file, alice.txt to the repo and commits.
  • Bob adds bob.txt and commits.
  • Alice pushes to github.
  • Bob tries to push to github. What happens here?

The problem is that Alice's push added a commit to the github repository that Bob does not have yet, so Bob's history has diverged from the remote's, even though the two changes touch different files. Git refuses a push that would not simply add new commits on top of the remote history. It forces Bob to first do the merge on his machine, so that if there is a conflict in the merge, Bob deals with the conflict manually (git could try to do the merge on the server, but in that case if there's a conflict, the server repo would be left in a conflicted state without a human to fix things up). The solution is for Bob to first pull the changes (pull in git is really fetch+merge), and then push again.
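
If you want to try this without a partner, the situation can be simulated on one machine. In this sketch, a bare repository in a local directory (shared.git) stands in for github, and two clones with made-up identities play Alice and Bob:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"

# Stand-in for the shared repository on github
git init -q --bare shared.git
git --git-dir=shared.git symbolic-ref HEAD refs/heads/master

# Alice creates the project and pushes it
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"
echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."
git remote add origin "$workdir/shared.git"
git push -q -u origin master

# Bob clones the shared repository
git clone -q "$workdir/shared.git" "$workdir/bob"
cd "$workdir/bob"
git config user.name "Bob"
git config user.email "bob@example.com"

# Both commit locally; Alice pushes first
cd "$workdir/alice"
echo "Alice was here" > alice.txt
git add alice.txt
git commit -q -m "Add alice.txt"
git push -q

cd "$workdir/bob"
echo "Bob was here" > bob.txt
git add bob.txt
git commit -q -m "Add bob.txt"

# Bob's push is rejected: the remote has a commit that he has not merged yet
if git push -q 2> /dev/null; then
  push_rejected=no
else
  push_rejected=yes
fi
echo "push rejected: $push_rejected"

# Bob merges Alice's work locally (fetch + merge), then pushes again
git pull -q --no-rebase --no-edit
git push -q
```

The --no-rebase flag spells out that the pull should merge, which newer versions of git ask you to state when the histories have diverged.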

Pull requests

A pull request is a mechanism implemented on github that allows you to discuss changes before merging them into master.

Remember that I suggested not to work on master?

Here's why. Imagine that I am working on cleaning up one of the files in our project

$ git checkout -b cogito_cleanup 

After some work:

$ git diff

diff --git a/cogito.txt b/cogito.txt
index e30514f..01c7cba 100644
--- a/cogito.txt
+++ b/cogito.txt
@@ -1,3 +1 @@
 Cogito ergo sum
-Edo ergo laboro
-Non pane solo
$ git commit -a -m"Only the essentials"

[cogito_cleanup 21e0764] Only the essentials
 Author: arokem <arokem@gmail.com>
 1 file changed, 2 deletions(-)
$ git push
fatal: The current branch cogito_cleanup has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream origin cogito_cleanup
$ git push --set-upstream origin cogito_cleanup

Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 305 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/arokem/science.git
 * [new branch]      cogito_cleanup -> cogito_cleanup
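
I then open a pull request for cogito_cleanup on github, and after discussion it is merged into master there. Back on my machine, I switch to master and pull the merge:

$ git checkout master
$ git pull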
remote: Counting objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (1/1), done.
From https://github.com/arokem/science
   892c594..c89c54c  master     -> origin/master
Updating 892c594..c89c54c
Fast-forward
 cogito.txt |    2 --
 1 file changed, 2 deletions(-)
$ git log -p

commit c89c54ca3b08af7424ce6b06c9e3792d0ee770e3
Merge: 892c594 21e0764
Author: Ariel Rokem <arokem@gmail.com>
Date:   Sun Apr 28 16:05:47 2013 -0700

    Merge pull request #1 from arokem/cogito_cleanup

    Only the essentials

commit 21e07642eb65fe768b68215cef95f1b32834ab85
Author: arokem <arokem@gmail.com>
Date:   Sun Apr 28 15:55:10 2013 -0700

    Only the essentials

diff --git a/cogito.txt b/cogito.txt
index e30514f..01c7cba 100644
--- a/cogito.txt
+++ b/cogito.txt
@@ -1,3 +1 @@
 Cogito ergo sum
-Edo ergo laboro
-Non pane solo
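
The upstream setting is what lets a plain git push work on a new branch. The following sketch uses a bare repository in a local directory (hub.git) as a stand-in for github and a made-up author identity; it pushes a new branch with --set-upstream and then asks git which remote branch it tracks:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"

# Stand-in for the repository on github
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

git init -q science
cd science
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"
printf 'Cogito ergo sum\nEdo ergo laboro\nNon pane solo\n' > cogito.txt
git add cogito.txt
git commit -q -m "Starting point"
git remote add origin "$workdir/hub.git"
git push -q -u origin master

# A new local branch has no upstream until we set one
git checkout -q -b cogito_cleanup
echo "Cogito ergo sum" > cogito.txt
git commit -q -a -m "Only the essentials"
git push -q --set-upstream origin cogito_cleanup     # -u is the short form

# Ask git which remote branch cogito_cleanup now tracks
upstream=$(git rev-parse --abbrev-ref "cogito_cleanup@{upstream}")
echo "$upstream"

# With the upstream recorded, a bare git push is enough for later commits
echo "Sum" >> cogito.txt
git commit -q -a -m "A later change"
git push -q
```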

Exercise :

  • Grant your partner permissions on your github repository (Settings => Collaborators), then make a pull request.
  • Have a discussion on the diff of the PR
  • Merge the PR, resolving conflicts as they arise

"Full-contact git + github": distributed collaboration with large teams

We'll say only a few words about this.

The main additional concept that you need to know about is the concept of a fork. This is a copy of the original repository that lives on your own github account. In this workflow, you clone your own fork onto your machine, push your branches to the fork, and issue pull requests from the fork to the original repository
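
Here is a rough local sketch of the two-remote setup that usually goes with a fork. Everything in it is a stand-in: original.git plays the original repository, fork.git plays the copy on your own account, the author identity is made up, and the remote name upstream is a common convention rather than something git requires:

```shell
set -e
workdir=$(mktemp -d)      # scratch directory for the demonstration
cd "$workdir"

# Build a small project to play the role of the original repository
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"
echo "Cogito ergo sum" > cogito.txt
git add cogito.txt
git commit -q -m "Initial commit of the cogito."
cd "$workdir"

git clone -q --bare seed original.git        # stands in for the original project
git clone -q --bare original.git fork.git    # stands in for the fork on your account

# Clone your own fork; it becomes the remote called origin
git clone -q "$workdir/fork.git" science
cd science
git config user.name "Example Author"
git config user.email "author@example.com"

# Add the original repository as a second remote, so you can follow its changes
git remote add upstream "$workdir/original.git"
git fetch -q upstream

# Start a branch from the original's master and push it to your fork
git checkout -q --no-track -b cogito_cleanup upstream/master
echo "Edo ergo sum" >> cogito.txt
git commit -q -a -m "Man's gotta eat."
git push -q origin cogito_cleanup

git remote -v
```

The branch now exists on the fork but not on the original; on github, the pull request is what proposes moving it across.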

Questions?