# Hello World! example for looper

This tutorial demonstrates how to install looper and use it to run a pipeline on a PEP project.
## 1. Install the latest version of looper

```console
pip install --user --upgrade looper
```
## 2. Download and unzip the hello_looper repository

The hello_looper repository contains a basic functional example project (in `/project`) and a looper-compatible pipeline (in `/pipeline`) that can run on that project. Let's download and unzip it:
```console
wget https://github.com/pepkit/hello_looper/archive/master.zip
```

```console
--2020-05-21 08:23:43--  https://github.com/pepkit/hello_looper/archive/master.zip
Resolving github.com (github.com)... 220.127.116.11
Connecting to github.com (github.com)|18.104.22.168|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/pepkit/hello_looper/zip/master [following]
--2020-05-21 08:23:43--  https://codeload.github.com/pepkit/hello_looper/zip/master
Resolving codeload.github.com (codeload.github.com)... 22.214.171.124
Connecting to codeload.github.com (codeload.github.com)|126.96.36.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [ <=>                ]   5.20K  --.-KB/s    in 0.004s

2020-05-21 08:23:44 (1.25 MB/s) - ‘master.zip’ saved
```
```console
unzip master.zip
```

```console
Archive:  master.zip
c8c4088d6e14df05071fb99809dfc86b2a55d86a
   creating: hello_looper-master/
  inflating: hello_looper-master/README.md
   creating: hello_looper-master/data/
  inflating: hello_looper-master/data/frog1_data.txt
  inflating: hello_looper-master/data/frog2_data.txt
  inflating: hello_looper-master/looper_pipelines.md
  inflating: hello_looper-master/output.txt
   creating: hello_looper-master/pipeline/
  inflating: hello_looper-master/pipeline/count_lines.sh
  inflating: hello_looper-master/pipeline/pipeline_interface.yaml
   creating: hello_looper-master/project/
  inflating: hello_looper-master/project/project_config.yaml
  inflating: hello_looper-master/project/sample_annotation.csv
```
## 3. Run it

Run it by changing to the directory where you unzipped the repository and then invoking `looper run` on the project configuration file:

```console
looper run hello_looper-master/project/project_config.yaml
```
```console
Looper version: 1.2.0-dev
Command: run
Ignoring invalid pipeline interface source: ../pipeline/pipeline_interface.yaml. Caught exception: FileNotFoundError(2, 'No such file or directory')
> Not submitted: No pipeline interfaces defined
> Not submitted: No pipeline interfaces defined
Looper finished
Samples valid for job generation: 0 of 2
Commands submitted: 0 of 0
Jobs submitted: 0
1 unique reasons for submission failure: No pipeline interfaces defined
Summary of failures:
No pipeline interfaces defined: frog_2, frog_1
```
Voila! You've run your very first pipeline across multiple samples using looper!
## Exploring the results
Now, let's inspect the `hello_looper` repository you downloaded. It has 3 components, each in a subfolder:
```console
hello_looper-master/data/
├── frog1_data.txt
└── frog2_data.txt
hello_looper-master/pipeline/
├── count_lines.sh
└── pipeline_interface.yaml
hello_looper-master/project/
├── project_config.yaml
└── sample_annotation.csv

0 directories, 6 files
```
- `/data` -- contains 2 data files for 2 samples. These input files were each passed to the pipeline.
- `/pipeline` -- contains the script we want to run on each sample in our project. Our pipeline is a very simple shell script named `count_lines.sh`, which (duh!) counts the number of lines in an input file.
- `/project` -- contains 2 files that describe metadata for the project (`project_config.yaml`) and the samples (`sample_annotation.csv`). This particular project describes just two samples, listed in the annotation file. These files together make up a PEP-formatted project, and can therefore be read by any PEP-compatible tool, including looper.
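To make the pipeline step concrete, here is a minimal sketch of what a line-counting script like `count_lines.sh` could look like. This is an illustrative guess at such a script, not the actual file shipped in hello_looper:

```shell
#!/bin/bash
# Hypothetical sketch of a count_lines.sh-style pipeline step:
# take one input file as the first argument and print its line count.
count_lines() {
    # tr -d ' ' strips the padding some wc implementations add
    wc -l < "$1" | tr -d ' '
}

# Demo with a throwaway file standing in for frog1_data.txt:
tmpfile=$(mktemp)
printf 'ribbit\nribbit\nribbit\nribbit\n' > "$tmpfile"
count_lines "$tmpfile"
rm -f "$tmpfile"
```

Looper's job is simply to call a script like this once per sample, substituting each sample's input file path.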
When we invoked looper from the command line, we told it to `run` the project configuration file. Looper reads the `project/project_config.yaml` file, which points to a few things:
- the `project/sample_annotation.csv` file, which specifies a few samples, their type, and the path to each data file
- the `output_dir`, which is where looper results are saved. Results will be saved in `$HOME/hello_looper_results`.
- the `pipeline_interface.yaml` file (`pipeline/pipeline_interface.yaml`), which tells looper how to connect to the pipeline (`pipeline/count_lines.sh`).
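For orientation, a pipeline interface is itself a small YAML file. Its exact schema depends on your looper version, so treat the following as an illustrative sketch rather than the actual contents of `pipeline/pipeline_interface.yaml` (the pipeline name, sample attribute, and command template here are all assumptions):

```yaml
# Hypothetical sketch of a pipeline interface; schema varies by looper version
pipeline_name: count_lines
pipeline_type: sample
command_template: >
  pipeline/count_lines.sh {sample.file}
```

The key idea is that the interface maps each sample's attributes (like a data file path) into the command line that runs the pipeline.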
The 3 folders (`data`, `project`, and `pipeline`) are modular; there is no need for them to live in any predetermined folder structure. For this example, the data and pipeline are included locally, but in practice they are usually in separate folders; you can point to anything, so data, pipelines, and projects may reside in distinct spaces on disk. You may also include more than one pipeline interface in your `project_config.yaml`, so in a looper project, many-to-many relationships are possible.
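For example, pointing one project at several pipelines might look like this (an illustrative sketch; the second path is hypothetical):

```yaml
metadata:
  sample_annotation: sample_annotation.csv
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces:
    - ../pipeline/pipeline_interface.yaml
    - /path/to/another/pipeline_interface.yaml
```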
Outputs of pipeline runs will be placed under the directory specified by the `output_dir` variable in the `metadata` section of the project config file (see defining a project). Let's inspect that `project_config.yaml` file to see what it says under `output_dir`:
```yaml
metadata:
  sample_annotation: sample_annotation.csv
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml
```
Alright, next let's explore what this pipeline stuck into our `output_dir`:
```console
/home/nsheff/hello_looper_results
├── results_pipeline
└── submission
    ├── count_lines.sh_frog_1.log
    ├── count_lines.sh_frog_1.sub
    ├── count_lines.sh_frog_2.log
    ├── count_lines.sh_frog_2.sub
    ├── frog_1.yaml
    └── frog_2.yaml

2 directories, 6 files
```
Inside the `output_dir` there will be two directories:

- `results_pipeline` - a directory with output of the pipeline(s), for each sample/pipeline combination (often one per sample)
- `submission` - which holds a YAML representation of each sample and a log file for each submitted job
From here, scaling up to hundreds of samples of various types takes virtually the same effort!
## A few more basic looper options
Looper also provides a few other simple arguments that let you adjust what it does. You can find a complete reference of usage in the docs. Here are a few of the more common options:
- `-d`: Dry run mode (creates submission scripts, but does not execute them)
- `--limit`: Only run a few samples
- `--lumpn`: Run several commands together as a single job. This is useful when you have a quick pipeline to run on many samples and want to group them.
There are also other commands:
- `looper check`: checks on the status (running, failed, or completed) of your jobs
- `looper summarize`: produces an output file that summarizes your project results
- `looper destroy`: completely erases all results so you can restart
- `looper rerun`: reruns only the jobs that have failed
## On your own
To use looper on your own, you will need to prepare 2 things: a project (metadata that defines what you want to process) and pipelines (which define how to process data). To link your project to looper, you will need to define a project. You will want to either use pre-made looper-compatible pipelines or link your own custom-built pipelines. These docs will also show you how to connect your pipeline to your project.