
# Software Carpentry at Nikhef, day 1

:::info
:information_source: On this page you will find notes for the first day of the Software Carpentry workshop organized on 26 September 2023.
:::

## Code of Conduct

Everyone who participates in Carpentries activities is required to conform to the [Code of Conduct](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html). This document also outlines how to report an incident if needed.

## :watch: Schedule

September 26

| | **Unix Shell**|
|------|------|
| 09:30 | Navigating and working with files and directories |
| 10:30 | Morning break |
| 10:45 | Automation (pipes, filters, loops & scripts) |
| 12:30 | Lunch break |
| 13:15 | Finding things |
| 14:00 | *END* |
| | **Git** |
| 14:15 | Setting up and working with Git |
| 15:45 | Afternoon break |
| 16:00 | Collaborating via Git |
| 17:30 | *END* |

## Unix shell

### :link: Links

* Setup page: https://kb.nikhef.nl/computing-course/swc-setup/
* Lesson material: https://kb.nikhef.nl/computing-course/shell-novice/
* Reference page: https://kb.nikhef.nl/computing-course/shell-novice/reference.html

### 1. Introducing the Shell

<code>ls</code>

<code>ks</code> (error)

### 2. Navigating Files and Directories

```bash
pwd
ls
ls -F
ls --help
man ls                  # press q to quit
ls -j                   # error + info
ls -F
ls -F Desktop/shell-lesson-data
cd Desktop
pwd
cd shell-lesson-data
cd exercise-data
pwd
ls -F
cd shell-lesson-data    # error
cd ..
pwd
ls -F -a
cd
pwd
cd Desktop/shell-lesson-data/exercise-data
pwd
cd /Users/reinder/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
cd exercise-data/creatures
cd -
cd -
```

:::success
:pencil: **Absolute vs Relative Paths**

Starting from `/Users/amanda/data`, which of the following commands could Amanda use to navigate to her home directory, which is `/Users/amanda`?

1. `cd .`
1. `cd /`
1. `cd /home/amanda`
1. `cd ../..`
1. `cd ~`
1. `cd home`
1. `cd ~/data/..`
1. `cd`
1.
   `cd ..`

<details>
<summary>Solution</summary>
<ol type="1">
<li>No: <code>.</code> stands for the current directory.</li>
<li>No: <code>/</code> stands for the root directory.</li>
<li>No: Amanda’s home directory is <code>/Users/amanda</code>.</li>
<li>No: this command goes up two levels, i.e. ends in <code>/Users</code>.</li>
<li>Yes: <code>~</code> stands for the user’s home directory, in this case <code>/Users/amanda</code>.</li>
<li>No: this command would navigate into a directory <code>home</code> in the current directory if it exists.</li>
<li>Yes: unnecessarily complicated, but correct.</li>
<li>Yes: shortcut to go back to the user’s home directory.</li>
<li>Yes: goes up one level.</li>
</ol>
</details>

:::

```
clear
cd ..
cd ..
pwd
ls north-pacific-gyre/
ls nor[tab]                        # completes nor to north-pacific-gyre
ls n[tab]
ls north-pacific-gyre/goo[tab]     # 2 options; press multiple times
ls north-pacific-gyre/goos[tab]    # completes
```

### 3. Working With Files and Directories

```
pwd
cd exercise-data/writing
ls -F
mkdir thesis
ls -F
ls -F thesis
mkdir -p ../project/data ../project/results
ls -FR ../project
cd thesis
nano draft.txt             # opens editor
# Add text
# CTRL-O (save, enter)
# CTRL-X (editor closes)
ls
cd ..
pwd
mv thesis/draft.txt thesis/quotes.txt
ls thesis
mv thesis/quotes.txt .
ls thesis/quotes.txt
# [arrow up: gets previous command(s)]
ls quotes.txt
cp quotes.txt thesis/quotations.txt
ls quotes.txt thesis/quotations.txt
cp -r thesis thesis_backup
ls thesis thesis_backup
cp thesis thesis_backup    # error
```

:::success
:pencil: **Renaming Files**

Suppose that you created a plain-text file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: `statstics.txt`

After creating and saving this file you realize you misspelled the filename! You want to correct the mistake. Which of the following commands could you use to do so?

1. `cp statstics.txt statistics.txt`
1. `mv statstics.txt statistics.txt`
1.
   `mv statstics.txt .`
1. `cp statstics.txt .`

<details>
<summary>Solution</summary>

1. No. While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
1. Yes, this would work to rename the file.
1. No, the period (.) indicates where to move the file, but does not provide a new file name; identical file names cannot be created.
1. No, the period (.) indicates where to copy the file, but does not provide a new file name; identical file names cannot be created.

</details>

:::

```
rm quotes.txt
ls
rm thesis        # error
rm -ri thesis    # y: yes, n: no; finishes the command
```

:::success
:pencil: **Copy with Multiple Filenames**

For this exercise, you can test the commands in the `shell-lesson-data/exercise-data` directory.

In the example below, what does `cp` do when given several filenames and a directory name?

```
$ mkdir backup
$ cp creatures/minotaur.dat creatures/unicorn.dat backup/
```

In the example below, what does `cp` do when given three or more file names?

```
$ cd creatures
$ ls -F
basilisk.dat minotaur.dat unicorn.dat
$ cp minotaur.dat unicorn.dat basilisk.dat
```

<details>
<summary>Solution</summary>

If given more than one file name followed by a directory name (i.e. the destination directory must be the last argument), `cp` copies the files to the named directory.

If given three file names, `cp` throws an error such as the one below, because it is expecting a directory name as the last argument.

```
cp: target 'basilisk.dat' is not a directory
```

</details>

:::

```
cd ..
cd alkanes
ls
ls *.pdb
ls p*.pdb
ls ?ethane.pdb
ls *ethane.pdb
ls ???ane.pdb
```

### 4. Pipes and Filters

```
ls
cd exercise-data/alkanes
ls
wc cubane.pdb
wc *.pdb
wc -l *.pdb
wc -m *.pdb
wc -l
# Ctrl-C
wc -l *.pdb
wc -l *.pdb > lengths.txt
ls lengths.txt
cat lengths.txt
less lengths.txt
# q (quits less)
```

:::success
:pencil: **What Does `sort -n` Do?**

The file `shell-lesson-data/exercise-data/numbers.txt` contains the following lines:

```
10
2
19
22
6
```

If we run `sort` on this file, the output is:

```
10
19
2
22
6
```

If we run `sort -n` on the same file, we get this instead:

```
2
6
10
19
22
```

Explain why `-n` has this effect.

<details>
<summary>Solution</summary>

The `-n` option specifies a numerical rather than an alphanumerical sort.

</details>

:::

```
sort -n lengths.txt
sort lengths.txt
sort -n lengths.txt > sorted_lengths.txt
head -n 1 sorted_lengths.txt
head -n 2 sorted_lengths.txt
sort -n lengths.txt | head -n 1
wc -l *.pdb | sort -n
wc -l *.pdb | sort -n | head -n 1
```

:::success
:pencil: **Pipe Reading Comprehension**

A file called `animals.csv` (in the `shell-lesson-data/exercise-data/animal-counts` folder) contains the following data:

```
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1
```

What text passes through each of the pipes and the final redirect in the pipeline below? Note, the `sort -r` command sorts in reverse order.

```
$ cat animals.csv | head -n 5 | tail -n 3 | sort -r > final.txt
```

Hint: build the pipeline up one command at a time to test your understanding.

<details>
<summary>Solution</summary>

The `head` command extracts the first 5 lines from `animals.csv`. Then, the last 3 lines are extracted from the previous 5 by using the `tail` command. With the `sort -r` command those 3 lines are sorted in reverse order and finally, the output is redirected to a file `final.txt`. The content of this file can be checked by executing `cat final.txt`.
The file should contain the following lines:

```
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-05,raccoon,7
```

</details>

:::

:::success
:pencil: **Pipe Construction**

For the file `animals.csv` from the previous exercise, consider the following command:

```
$ cut -d , -f 2 animals.csv
```

The `cut` command is used to remove or ‘cut out’ certain sections of each line in the file, and `cut` expects the lines to be separated into columns by a `Tab` character. A character used in this way is called a **delimiter**. In the example above we use the `-d` option to specify the comma as our delimiter character. We have also used the `-f` option to specify that we want to extract the second field (column). This gives the following output:

```
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
```

The `uniq` command filters out adjacent matching lines in a file. How could you extend this pipeline (using `uniq` and another command) to find out what animals the file contains (without any duplicates in their names)?

<details>
<summary>Solution</summary>

```
$ cut -d , -f 2 animals.csv | sort | uniq
```

</details>

:::

```
cd ..
ls
cd ../north-pacific-gyre/
ls
wc -l *.txt
wc -l *.txt | sort -n | head -n 5
wc -l *.txt | sort -n | tail -n 5
ls *Z.txt
```

### 5. Loops

```
cd exercise-data/creatures
ls
head -n 5 basilisk.dat minotaur.dat unicorn.dat
head -n 5 basilisk.dat minotaur.dat unicorn.dat | tail -n 2
for filename in basilisk.dat minotaur.dat unicorn.dat
do
    echo $filename
    head -n 2 $filename | tail -n 1
done
for x in basilisk.dat minotaur.dat unicorn.dat
do
    echo $x
    head -n 2 $x | tail -n 1
done
```

:::success
:pencil: **Saving to a File in a Loop - Part One**

In the `shell-lesson-data/exercise-data/alkanes` directory, what is the effect of this loop?

```
for alkanes in *.pdb
do
    echo $alkanes
    cat $alkanes > alkanes.pdb
done
```

1.
   Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.
1. Prints `cubane.pdb`, `ethane.pdb`, and `methane.pdb`, and the text from all three files would be concatenated and saved to a file called `alkanes.pdb`.
1. Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, and `pentane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.
1. None of the above.

<details>
<summary>Solution</summary>

1. The text from each file in turn gets written to the `alkanes.pdb` file. However, the file gets overwritten on each loop iteration, so the final content of `alkanes.pdb` is the text from the `propane.pdb` file.

</details>

:::

```
for filename in *.dat
do
    echo $filename
    head -n 100 $filename | tail -n 20
done
echo hello there
for filename in *.dat
do
    $filename    # gives error
    head -n 100 $filename | tail -n 20
done
for filename in "red dragon.dat" "purple unicorn.dat"
do
    echo "$filename"
    head -n 100 "$filename" | tail -n 20
done
for filename in "red dragon.dat" "purple unicorn.dat"
do
    echo $filename
    head -n 10 $filename | tail -n 4
done
cp *.dat original-*.dat
cp basilisk.dat minotaur.dat unicorn.dat original-*.dat
for filename in *.dat
do
    cp $filename original-$filename
done
ls -l
cd ../..
cd north-pacific-gyre
ls
for datafile in NENE*A.txt NENE*B.txt
do
    echo $datafile
done
for datafile in NENE*A.txt NENE*B.txt
do
    echo $datafile stats-$datafile
done
for datafile in NENE*A.txt NENE*B.txt; do bash goostats.sh $datafile stats-$datafile; done
# Ctrl-C
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile; bash goostats.sh $datafile stats-$datafile; done
```

### 6. Shell Scripts

```
cd ../exercise-data/alkanes
nano middle.sh
head -n 15 octane.pdb | tail -n 5
>>Ctrl-X<<
bash middle.sh
nano middle.sh
head -n 15 "$1" | tail -n 5
>>Ctrl-X<<
bash middle.sh octane.pdb
bash middle.sh pentane.pdb
nano middle.sh
head -n "$2" "$1" | tail -n "$3"
>>Ctrl-X<<
bash middle.sh octane.pdb 15 5
bash middle.sh octane.pdb 20 10
nano middle.sh
# Select lines from the middle of the file
# Usage: bash middle.sh filename end_line number_of_lines
head -n "$2" "$1" | tail -n "$3"
>>Ctrl-X<<
wc -l *.pdb | sort -n
nano sorted.sh
# sort files by their length
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
>>Ctrl-X<<
bash sorted.sh *.pdb ../creatures/*.dat
cd ../../north-pacific-gyre/
nano do-stats.sh
# Calculate stats for data files
for datafile in "$@"
do
    echo $datafile
    bash goostats.sh $datafile stats-$datafile
done
>>Ctrl-X<<
bash do-stats.sh NENE*A.txt NENE*B.txt
bash do-stats.sh NENE*A.txt NENE*B.txt | wc -l
nano do-stats.sh
# Calculate stats for data files
for datafile in NENE*A.txt NENE*B.txt
do
    echo $datafile
    bash goostats.sh $datafile stats-$datafile
done
>>Ctrl-X<<
bash do-stats.sh
```

:::success
:pencil: **List Unique Species**

Leah has several hundred data files, each of which is formatted like this:

```
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
```

An example of this type of file is given in `shell-lesson-data/exercise-data/animal-counts/animals.csv`.

We can use the command `cut -d , -f 2 animals.csv | sort | uniq` to produce the unique species in `animals.csv`. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.

Write a shell script called `species.sh` that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.

<details>
<summary>Solution</summary>

```
# Script to find unique species in csv files where species is the second data field
# This script accepts any number of file names as command line arguments

# Loop over all files
for file in "$@"
do
    echo "Unique species in $file:"
    # Extract species names
    cut -d , -f 2 "$file" | sort | uniq
done
```

</details>

:::

### 7. Finding Things

```
cd
cd Desktop/shell-lesson-data/exercise-data/writing
cat haiku.txt
grep not haiku.txt
grep The haiku.txt
grep -w The haiku.txt
grep "is not" haiku.txt
grep -n "it" haiku.txt
grep -nw "the" haiku.txt
grep -nwi "the" haiku.txt
grep -nwv "the" haiku.txt
grep -r Yesterday .
grep -E "^.o" haiku.txt
find .
cd ..
find .
find . -type d
find . -type f
find . -name *.txt
find . -name "*.txt"
wc -l $(find . -name "*.txt")
grep "searching" $(find . -name "*.txt")
```

## Version Control with Git

### :link: Links

* Setup page: https://kb.nikhef.nl/computing-course/swc-setup/
* Lesson material: https://kb.nikhef.nl/computing-course/git-novice/
* Reference page: https://kb.nikhef.nl/computing-course/git-novice/reference.html
* The Turing Way chapter: https://the-turing-way.netlify.app/reproducible-research/vcs.html
* List of Git GUIs: https://en.wikipedia.org/wiki/Comparison_of_Git_GUIs

### 1. Automated Version Control

### 2. Setting Up Git

```
cd
cd Desktop
git config --global user.name "Your Name"
cat ~/.gitconfig
git config --global user.email "your.email@address.com"
```

Linux & Mac:

```
git config --global core.autocrlf input
```

Windows:

```
git config --global core.autocrlf true
```

For all:

```
git config --global core.editor "nano -w"
git config --global init.defaultBranch main
git config --global --edit
git config --list
git config -h
man git
man git-config
git help
```

### 3. Creating a Repository

```
mkdir planets
ll
cd planets
git init
ll
ll -a
git status
```

:::success
:pencil: **Places to create git repositories**

Along with tracking information about planets (the project we have already created), Dracula would also like to track information about moons. Despite Wolfman’s concerns, Dracula creates a moons project inside his planets project with the following sequence of commands:

```
$ cd ~/Desktop   # return to Desktop directory
$ cd planets     # go into planets directory, which is already a Git repository
$ ls -a          # ensure the .git subdirectory is still present in the planets directory
$ mkdir moons    # make a subdirectory planets/moons
$ cd moons       # go into moons subdirectory
$ git init       # make the moons subdirectory a Git repository
$ ls -a          # ensure the .git subdirectory is present indicating we have created a new Git repository
```

Is the `git init` command, run inside the moons subdirectory, required for tracking files stored in the moons subdirectory?

<details>
<summary>Solution</summary>

No. Dracula does not need to make the moons subdirectory a Git repository because the planets repository can track any files, sub-directories, and subdirectory files under the planets directory. Thus, in order to track all information about moons, Dracula only needed to add the moons subdirectory to the planets directory.

Additionally, Git repositories can interfere with each other if they are “nested”: the outer repository will try to version-control the inner repository. Therefore, it’s best to create each new Git repository in a separate directory. To be sure that there is no conflicting repository in the directory, check the output of `git status`. If it looks like the following, you are good to go to create a new repository as shown above:

```
$ git status
```

```
fatal: Not a git repository (or any of the parent directories): .git
```

</details>

:::

```
rm -rf moons/
mkdir moons
ll
```

### 4. Tracking Changes

```
cd planets
nano mars.txt
Cold and dry, but everything is my favorite colour
Ctrl-X
cat mars.txt
git status
git add mars.txt
git status
git commit -m "Start notes on Mars as base"
git status
git log
ll -a .git/
ll
nano mars.txt
Cold and dry, but everything is my favorite colour
The two moons may be a problem for Wolfman
Ctrl-X
cat mars.txt
git status
git diff
git commit -m "Changes"
git add mars.txt
git commit
Add concern about effect of Mars' moons on Wolfman
Ctrl-X
nano mars.txt
Cold and dry, but everything is my favorite colour
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
Ctrl-X
git diff
git add mars.txt
git status
git diff
git diff --staged
git commit
Discuss concerns about Mars' climate for Mummy
Ctrl-X
git commit
git status
git diff --staged
git log
git log -1
git log -2
git log --oneline
git log --oneline --graph
ll
mkdir spaceships
git status
git add spaceships
git status
touch spaceships/apollo-11 spaceships/sputnik-1
ll spaceships/
git status
git add spaceships
git status
git commit -m 'Add some initial thoughts on spaceships'
```

:::success
:pencil: **`bio` Repository**

1. Create a new Git repository on your computer called `bio`.
2. Write a three-line biography for yourself in a file called `me.txt` and commit your changes.
3. Modify one line and add a fourth line.
4. Display the differences between its updated state and its original state.

:::

### 5. Exploring History

```
nano mars.txt
Cold and dry, but everything is my favorite colour
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
An ill-considered change
Ctrl-X
cat mars.txt
git diff HEAD mars.txt
git diff HEAD~1 mars.txt
git diff HEAD~3 mars.txt
git show HEAD~3 mars.txt
git diff f22b25e mars.txt
git status
git diff
git checkout HEAD mars.txt
cat mars.txt
git status
git log
git checkout f22b25e mars.txt
git checkout HEAD mars.txt
git status
git checkout f22b25e
cat mars.txt
git checkout main
cat mars.txt
nano mars.txt
Cold and dry, but everything is my favorite colour
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
Extra line
Ctrl-X
git add mars.txt
git status
git restore --staged mars.txt
git status
git add mars.txt
git status
nano mars.txt
git diff
git diff --staged
git add mars.txt
git diff --staged
git restore --staged mars.txt
```

:::success
:pencil: **Understanding Workflow and History**

What is the output of the last command in

```
$ cd planets
$ echo "Venus is beautiful and full of love" > venus.txt
$ git add venus.txt
$ echo "Venus is too hot to be suitable as a base" >> venus.txt
$ git commit -m "Comment on Venus as an unsuitable base"
$ git checkout HEAD venus.txt
$ cat venus.txt # this will print the contents of venus.txt to the screen
```

1. `Venus is too hot to be suitable as a base`
2. `Venus is beautiful and full of love`
3. `Venus is beautiful and full of love`
   `Venus is too hot to be suitable as a base`
4. Error because you have changed `venus.txt` without committing the changes

<details>
<summary>Solution</summary>

The answer is 2.

The command `git add venus.txt` places the current version of `venus.txt` into the staging area. The changes to the file from the second `echo` command are only applied to the working copy, not the version in the staging area.
So, when `git commit -m "Comment on Venus as an unsuitable base"` is executed, the version of `venus.txt` committed to the repository is the one from the staging area and has only one line.

At this time, the working copy still has the second line (and `git status` will show that the file is modified). However, `git checkout HEAD venus.txt` replaces the working copy with the most recently committed version of `venus.txt`.

So, `cat venus.txt` will output `Venus is beautiful and full of love`.

</details>

:::

```
nano mars.txt
git status
git add mars.txt
git commit -m 'Bad change'
git log
git revert f22b25e
git status
git log
git log --patch mars.txt
```

### 6. Ignoring Things

```
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
ll results
git status
nano .gitignore
*.dat
results/
Ctrl-X
git status
git add .gitignore
git commit -m "ignore data files and the results folder"
git status
git add a.dat
git add -f a.dat
git status
git restore --staged a.dat
git status --ignored
nano .gitignore
*.dat
results/
!a.dat
Ctrl-X
git add a.dat
nano .gitignore
!a.dat
*.dat
results/
Ctrl-X
git add a.dat
```

### 7. Remotes in GitHub

```
git remote add origin git@gitlab.nikhef.nl:vlad/planets.git
git remote -v
git remote add origin2 git@gitlab.nikhef.nl:vlad/planets.git
git remote -v
git remote remove origin2
ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"
ls -al ~/.ssh
ssh -T git@gitlab.nikhef.nl
cat ~/.ssh/id_ed25519.pub
ssh -T git@gitlab.nikhef.nl
cd planets
ll
git status
git add .gitignore
git diff --staged
git commit
git status
git remote -v
git push origin main
git branch -v
git branch -r
git branch -vv
git push -u origin main
git branch -vv
git pull
```

### 8. Collaborating

```
mkdir copy
cd copy
git clone git@gitlab.nikhef.nl:vlad/planets.git
cd planets
nano venus.txt
git add venus.txt
git commit
```
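
The commands above stop at the collaborator's commit. As a rough sketch of the remaining round trip (push from the clone, then pull in the original copy), the script below replays the same steps against a throwaway local bare repository instead of the GitLab remote. The directory names `remote.git`, `owner` and `collaborator`, the identity passed with `-c`, and the commit messages are assumptions made for illustration only; they are not part of the workshop notes. It also assumes Git 2.28 or newer for `--initial-branch`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so no real repository is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical identity, so commits work even without a global git config.
git_id=(-c user.name="Vlad Dracula" -c user.email="vlad@tran.sylvan.ia")

# Assumption: a local bare repository stands in for the GitLab remote.
git init --quiet --bare --initial-branch=main remote.git

# Owner: create the project and publish it.
mkdir owner
git init --quiet --initial-branch=main owner/planets
cd owner/planets
echo "Cold and dry, but everything is my favorite colour" > mars.txt
git add mars.txt
git "${git_id[@]}" commit --quiet -m "Start notes on Mars as base"
git remote add origin "$workdir/remote.git"
git push --quiet -u origin main

# Collaborator: clone, add a file, commit and push.
mkdir "$workdir/collaborator"
cd "$workdir/collaborator"
git clone --quiet "$workdir/remote.git" planets
cd planets
echo "Venus is too hot to be suitable as a base" > venus.txt
git add venus.txt
git "${git_id[@]}" commit --quiet -m "Add notes about Venus"
git push --quiet origin main

# Owner: fetch and fast-forward to the collaborator's commit.
cd "$workdir/owner/planets"
git pull --quiet origin main
git log --oneline
ls
```

After the final `git pull`, the owner's working copy should contain both `mars.txt` and `venus.txt`. With a real remote, the only differences are the remote URL and the SSH authentication set up in the previous section.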