Developer Tutorial - Creating Pipeline

Introduction

Via Foundry is an easy-to-use platform for creating, deploying, and executing complex Nextflow pipelines for high-throughput data processing.

Foundry provides:

A drag-and-drop user interface to build Nextflow pipelines
Reproducible pipelines with version tracking
Seamless portability to different computing environments with containerization
Simplified pipeline sharing using GitHub (github.com)
Support for continuous integration and tests (travis-ci.org)
Easy re-execution of pipelines by copying a previous run's settings
Integrated data analysis and reporting interface with R markdown support

Our aims are:

Reusability
Reproducibility
Shareability
Easy execution
Easy monitoring
Easy reporting

Expected learning outcome

By the end of this tutorial, you should understand the basics of Foundry, know how to use the pipeline builder for different objectives, and be familiar with Nextflow and the standard software packages used in the exercises (FastQC, Hisat2, and RSeQC). This guide walks you through using Foundry pipelines and creating new ones.

Before you start

Please go to https://viafoundry.com and log in to your account. If you have trouble logging in, please let us know at support@viascientific.com and we will set up an account for you.

Exercise 1 - Creating processes

Once logged in, click the Projects section in the top menu, then click the Add a New Project button. This is where you configure your project. To open the pipeline builder page, click the Pipelines tab and then click the Create Pipeline button.

Now you can write a description of your pipeline in the Description tab, start developing your pipeline in the Workflow tab, and add extra files or set extra parameters in the Advanced tab. Let's look at the pipeline elements in more detail.

What is a "process"?

A process is the basic programming element in Nextflow for running user scripts. See the Nextflow documentation to learn more about processes.

A process usually has inputs, outputs, and script sections. In this tutorial, you will see sections that contain the information necessary to define a process, shown on the left side of the picture below. Please use that information to fill in the "Add new process" form shown in the middle section of the picture. Foundry will then convert this information into a Nextflow process, shown on the right side of the picture. Once a process is created, it can be used in the pipeline builder; an example of how it looks is shown at the bottom left of the picture. The mapping between the sections is shown with colored rectangles.

The processes we will create in this exercise:

FastQC process
Hisat2 process
RSeQC process

You’ll notice several buttons in the left menu. New processes are created by clicking the blue New process button.

1. FastQC process

a. First, click the blue New process button in the left menu to open the "Add New Process" window.

b. Please enter FastQC for the process name and define a new "Menu Group".

c. The FastQC process has one input, one output, and a one-line command that runs FastQC.

Name: "FastQC"
Menu Group: "Tutorial"
Inputs: 
  reads(fastq,set) name: val(name),file(reads)

Outputs: 
  outputFileHTML(html,file) name: "*.html"

Script:
  fastqc ${reads}

d. Let's select the input and output parameters (reads and outputFileHTML) and define the "Input Names" that we are going to use in the script section.

e. Let's enter the script in the script section.

f. Press the "Save changes" button at the bottom of the modal to create the process. The process is now ready to use. We will use it in Exercise 2.
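The output of the FastQC process is declared as the pattern "*.html" rather than as a fixed file name, because FastQC derives the report name from the input file. The shell sketch below illustrates this. The helper fastqc_report_name and the sample file name are hypothetical, and the sketch assumes reads with a .fastq or .fq extension, optionally gzipped; it does not run FastQC itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: predict the HTML report name FastQC writes for a reads file.
# Assumption: the input ends in .fastq or .fq, optionally followed by .gz.
fastqc_report_name() {
  local file
  file=$(basename "$1")   # drop any leading directory
  file=${file%.gz}        # drop a trailing .gz, if present
  file=${file%.fastq}     # drop a trailing .fastq, if present
  file=${file%.fq}        # drop a trailing .fq, if present
  echo "${file}_fastqc.html"
}

# Hypothetical sample file name.
report=$(fastqc_report_name "control_rep1.fastq.gz")
echo "$report"

# The output pattern "*.html" is a glob, so it matches whatever name was derived.
case "$report" in
  *.html) matched=yes ;;
  *)      matched=no ;;
esac
echo "matched by *.html: $matched"
```

A glob in the output section keeps the process independent of the sample names in your collection.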

2. Hisat2 process

Let's create the Hisat2 process.

a. First, click the blue “New process” button to open the "Add New Process" modal.

b. Define the inputs, outputs, and script as follows:

Name: "Hisat2"
Menu Group: "Tutorial"
Inputs: 
  reads(fastq,set) name: val(name),file(reads)
  hisat2Index(file) name: hisat2Index

Outputs: 
  mapped_reads(bam,set) name: val(name), file("${name}.bam")
  outputFileTxt(txt,file) name: "${name}.align_summary.txt"

Script:
  basename=\$(basename ${hisat2Index}/*.8.ht2 | cut -d. -f1)
  hisat2 -x ${hisat2Index}/\${basename} -U ${reads} -S ${name}.sam &> ${name}.align_summary.txt
  samtools view -bS ${name}.sam > ${name}.bam

c. After you select the input parameters (reads and hisat2Index) and the output parameters (mapped_reads and outputFileTxt), add their names and enter the script. The page should look like this:

d. Please save changes before you close the screen.
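The first line of the Hisat2 script derives the index prefix that hisat2 -x expects from the files inside the index directory. In a Nextflow script, ${hisat2Index} is filled in by Nextflow, while the escaped \$ and \${...} are left for bash to evaluate at run time. The sketch below reproduces that line in plain bash against a mock directory. The prefix "genome" and the variable name prefix are assumptions made for illustration; the mock files are empty and are not a usable index.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a mock index directory; "genome" is a hypothetical index prefix.
hisat2Index=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  touch "${hisat2Index}/genome.${i}.ht2"
done

# Same logic as the process script, without the Nextflow "\$" escapes:
# the glob matches genome.8.ht2, basename drops the directory part,
# and cut keeps the text before the first dot.
prefix=$(basename "${hisat2Index}"/*.8.ht2 | cut -d. -f1)

# This is the value the script passes to hisat2 -x.
echo "index argument: ${hisat2Index}/${prefix}"

# Clean up the mock directory.
rm -rf "${hisat2Index}"
```

Because cut keeps only the text before the first dot, an index whose prefix itself contains a dot would be truncated by this approach.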

3. RSeQC process

a. First, click the blue “New process” button to open the "Add New Process" modal.

b. Fill in the form using the information below:

Name: "RSeQC"
Menu Group: "Tutorial"
Inputs:
  mapped_reads(bam,set) name: val(name), file(bam)
  bedFile(bed,file) name: bed

Outputs: 
  outputFileTxt(txt,file) name: "RSeQC.${name}.txt"

Script:
  read_distribution.py -i ${bam} -r ${bed} > RSeQC.${name}.txt

c. After you select the input and output parameters, enter their names and the script. The page should look like this:

d. Please save changes before you close the screen.
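In the Hisat2 and RSeQC definitions, ${name} travels with each sample, so every output file name is built from it. The sketch below lists the file names that the Outputs sections of these two processes declare for a given sample. The function expected_outputs and the sample names are hypothetical; in a real run each sample's name is supplied together with its reads.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the files that Hisat2 and RSeQC declare as outputs for one sample name.
# The patterns are copied from the Outputs sections of the two processes.
expected_outputs() {
  local name=$1
  echo "${name}.bam"                 # Hisat2: mapped_reads
  echo "${name}.align_summary.txt"   # Hisat2: outputFileTxt
  echo "RSeQC.${name}.txt"           # RSeQC: outputFileTxt
}

# Hypothetical sample names.
for sample in control_rep1 exper_rep1; do
  expected_outputs "$sample"
done
```

If a script writes a file under a different name than its output pattern declares, Nextflow will not find the output, so it helps to compare the two before saving a process.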

This completes Exercise 1. Please move on to Exercise 2 to build the pipeline using the processes you created here.

Exercise 2 - Building a pipeline

Before you start building the pipeline, make sure the processes you created are available in your Process Menu.

a. At the top of the page, you’ll notice the Pipeline Name box. You can rename your pipeline by clicking it. Please enter a name for your pipeline, e.g. "RNA-Seq-Tutorial", and press the save button.

b. Please drag and drop FastQC, Hisat2, and RSeQC into your workspace.

c. Please drag and drop three Input parameters, rename them Input_Reads, Hisat2_Index, and bedFile, and connect them to their processes.

d. Connect your Hisat2 process to the RSeQC process using the mapped_reads parameter that both define. Note that when the types match, you can connect two processes through their matching output and input parameters.

e. Drag and drop three output parameters from the sidebar, name them FastQC_output, Hisat2_Summary, and RSeQC_output, and connect them to their corresponding processes. While naming each one, click its "Publish to Web Directory" option and choose the output format that matches the output type of the process.

f. The overall pipeline should look like the one below:

Exercise 3 - Executing a pipeline

1. Once a pipeline is created, you will notice the “Run” button at the top right of the page.

2. This button opens a new window where you select your project by clicking on it. Then enter a run name, which will be added to the run list of the project. Clicking “Save run” redirects you to the “run page”, where you can initiate your run.

3. On the run page, please choose your Run Environment: Via Demo Environment (AWS Batch).

4. Then click the Advanced tab and go to the Run Container section. Click Use Docker Image and enter the Image Path below:

Run Container:
Use Docker Image: Checked
Image Path: public.ecr.aws/t4w5x8f2/viascientific/rnaseq:3.0

5. Now we are ready to enter the inputs we defined for the pipeline. Click the Run Settings tab to enter the bed file. Please use the Manually tab.

bedFile:  s3://viascientific/run_data/genome_data/mousetest/mm10/refseq_170804/genes/genes.bed

6. Next, enter the Hisat2 index directory. Please use the Manually tab.

Hisat2_Index: s3://viascientific/run_data/genome_data/mousetest/mm10/refseq_170804/Hisat2Index

Creating Collection

7. To enter Input_Reads, click the Enter File button. Then go to the Files tab and click the "Add File" button.

8. Enter the location of your files and click the Search button to get the list of files:

File Location:

s3://viascientific/run_data/test_data/fastq_mouse_single

9. Then please choose Single List for the Collection Type and press the add all files button.

10. Here there is an option to change the names, but we will keep them as they are. Enter a collection name and click "save files".

collection name: test collection

11. On the next screen, you can still add or remove samples. Let's click the "Save file" button to process all samples.

Running Pipeline

12. After the inputs are filled in, the orange "Waiting" button at the top right should turn into a green "Run" button. Now you can press that button to start your run.

13. The whole run should finish in a couple of minutes. When the run has finished, the log section will look like the following:

a. Logs:

b. Timeline:

Reports

14. In the report section, you can view all the reports defined in the pipeline:

a. FastQC

b. Hisat2

c. RSeQC