Genome assembly
Overview
Teaching: 45 min
Exercises: 30 min
Questions
How can I assemble a genome from reads generated by Nanopore sequencers?
Objectives
Hybrid genome assembly using Nanopore and Illumina reads.
Scaffolding assembled contigs using RagTag.
Assessing genome assembly quality using QUAST and BUSCO.
Genome assembly
An overview of Genome Assembly
Once we have the “clean” Nanopore reads, we are ready to assemble them into contiguous sequences (contigs). Many tools, using different algorithms, can assemble genomes from Nanopore reads. Because long-read sequencing technologies are error-prone, pipelines for assembling Nanopore reads typically incorporate error correction to facilitate the identification of overlapping reads and to improve the assembly.
Genome assembly using SMARTdenovo
We will generate a draft assembly from our Nanopore reads using the de novo assembler SMARTdenovo. This software does not include an error-correction step; it uses the overlap-layout-consensus (OLC) algorithm.
Overlap-Layout-Consensus (OLC) algorithm
An OLC assembler first finds overlaps among all reads, then builds an overlap graph and lays the reads out into contigs, and finally aligns the reads within each contig to produce a consensus sequence.
SLURM scripts
To get started with SMARTdenovo, let's create a new directory in your work directory and copy a submission script template from the /blue/general_workshop/share/bash_files directory.
$ cd /blue/general_workshop/<username>
$ mkdir SMARTdenovo
$ cd SMARTdenovo
$ cp /blue/general_workshop/share/bash_files/Smartdenovo.sh ./Smartdenovo.sh
Note: .sh is a commonly used extension for shell scripts. Using an extension is not mandatory.
Adding information to SLURM script
We have to modify the template to provide SLURM with information about the job.
We can use a small text editor program called nano to edit the file. Running the command below opens it in a basic editor.
$ nano Smartdenovo.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: Smartdenovo.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=Smartdenovo # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=8gb # Job memory request
#SBATCH --time 12:00:00 # Time limit hrs:min:sec
#SBATCH --output=SuwSmartdenovo_%j.out  # Standard output and error log
pwd; hostname; date
module load smartdenovo/20180219
# Run genome assembly
smartdenovo.pl -p Suw -t 1 -c 1 /blue/general_workshop/share/Suwannee/Suw2_filtered_3000bp_60X.fastq > Suw.mak
make -f Suw.mak
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Editing in nano
nano is a command-line editor. You can only move the cursor with the arrow keys: ↑, ↓, ← and →. Clicking with the mouse does not move the cursor, so be careful: you may be editing in the wrong place.
If you accidentally edit the wrong place, exit nano, delete the script Smartdenovo.sh, and copy it again from the share directory.
Change the <email_address> to your email address where you can check email. Once you are done, press Ctrl+x to return to bash prompt. Press Y and Enter to save the changes made to the file.
Checking usage of smartdenovo.pl
$ module load smartdenovo/20180219
$ smartdenovo.pl
Usage: smartdenovo.pl [options] <reads.fa>
Options:
-p STR output prefix [wtasm]
-e STR engine of overlaper, compressed kmer overlapper(zmo), dot matrix overlapper(dmo), [dmo]
-t INT number of threads [8]
-k INT k-mer length for overlapping [16]
for large genome as human, please use 17
-J INT min read length [5000]
-c INT generate consensus, [0]
Make: software build automation tool
The make program is a build-automation utility: it reads rules from a makefile (here, Suw.mak) and re-runs only the steps whose inputs have changed, which simplifies building multi-step pipelines.
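If the assembly job is interrupted, re-running make on the same makefile picks up where it left off, because steps whose outputs are already up to date are skipped. As a quick sketch, make's standard -n (dry-run) flag prints the remaining steps without executing them:
$ make -n -f Suw.mak     # preview which assembly steps would still run
$ make -f Suw.mak        # re-run the pipeline; up-to-date steps are skipped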
Running a job in SLURM
Submitting a SMARTdenovo job
To submit the job to SLURM, the sbatch command is used.
$ sbatch Smartdenovo.sh
Submitted batch job <jobid>
Checking status of a SLURM job
You can check the status of the job using the squeue command.
The -u argument accepts <username> and displays all jobs submitted by that user.
The -A argument accepts an account name and displays all jobs using the resources allocated to that account.
$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
43807246 hpg-milan Smartden <username> R 0:14 1 c0709a-s3
$ squeue -A general_workshop
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<jobid> hpg2-comp Smartden <user1> R 0:07 1 c29a-s2
<jobid> hpg2-comp Smartden <user2> R 1:32 1 c15a-s1
<jobid> hpg2-comp Smartden <user3> R 2:01 1 c09a-s4
Understanding Job Status
Under the status column ST, R stands for Running and PD stands for Pending. If a job is pending, a reason may be shown in the last column, e.g.:
- None: the job is just taking a while to start.
- Priority: higher-priority jobs exist in this partition.
- QOSMaxCpuPerUserLimit: the user is already using the maximum number of CPUs they are allowed to use.
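If you want more detail about a particular job, for example why it is pending or how much memory it requested, the standard SLURM commands scontrol and sacct can help; replace <jobid> with your own job ID:
$ scontrol show job <jobid>                                      # full details of a queued or running job
$ sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS   # accounting summary once the job has started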
Checking the log file
The SLURM submission script contains the line #SBATCH --output=SuwSmartdenovo_%j.out. Thus, the output of this job will be written to the file SuwSmartdenovo_<jobid>.out.
The job might take about a day, so we will not have the output by the end of today. Instead, let's copy a pre-computed log file from the /blue/general_workshop/share/Suwannee/Smartdenovo directory.
$ cp /blue/general_workshop/share/Suwannee/Smartdenovo/SuwSmartdenovo_58583802.out ./SuwSmartdenovo_58583802.out
$ ls
Smartdenovo.sh Suw.fa.gz SuwSmartdenovo_<jobid>.out
Suw.dmo.ovl Suw.mak SuwSmartdenovo_58583802.out
$ head SuwSmartdenovo_58583802.out
/blue/jeremybrawner/smithk/F_Cir/Suwannee2/Smartdenovo # working directory
c5a-s23.ufhpc # hostname (a label that is assigned to a device connected to a computer network)
Sat Sep 12 12:02:30 EDT 2020 (Date)
/apps/smartdenovo/20180219/bin/wtpre -J 5000 /blue/jeremybrawner/smithk/F_Cir/Suwannee2/Suw2_filtered_3000bp_60X.fastq | gzip -c -1 > Suw.fa.gz
/apps/smartdenovo/20180219/bin/wtzmo -t 8 -i Suw.fa.gz -fo Suw.dmo.ovl -k 16 -z 10 -Z 16 -U -1 -m 0.1 -A 1000
[Sat Sep 12 12:04:43 2020] loading long reads
[Sat Sep 12 12:05:39 2020] Done, 93129 reads (length >= 0)
[Sat Sep 12 12:05:39 2020] sorted sequences by length dsc
[Sat Sep 12 12:05:39 2020] calculating overlaps, 8 threads
[Sat Sep 12 12:05:39 2020] indexing 1/1
$ tail SuwSmartdenovo_58583802.out
103 tips, 7 bubbles, 0 chimera, 5 non-bog, 0 recoveries
[Sun Sep 13 07:02:30 2020] repair 115 bog elements
3 tips, 0 bubbles, 0 chimera, 1 non-bog, 0 recoveries
[Sun Sep 13 07:02:30 2020] repair 4 bog elements
0 tips, 0 bubbles, 0 chimera, 0 non-bog, 0 recoveries
[Sun Sep 13 07:02:30 2020] generated 2067 unitigs
[Sun Sep 13 07:02:30 2020] recover 16 edges inter unitigs
[Sun Sep 13 07:03:00 2020] output 21 independent unitigs
[Sun Sep 13 07:03:00 2020] Done
/apps/smartdenovo/20180219/bin/wtcns -t 8 Suw.dmo.lay > Suw.dmo.cns 2> Suw.dmo.cns.log
The end of the log file tells us that the Nanopore reads were assembled into 21 unitigs.
Unitig vs Contig
A contig is a contiguous sequence assembled from overlapping reads. A unitig is a high-confidence contig built only from reads whose overlaps are unambiguous; contigs are often formed by joining unitigs.
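As a quick sanity check, you can count the sequences in the final unitig FASTA yourself; each sequence header starts with >, and the count should match the 21 unitigs reported in the log. The path below is the pre-computed SMARTdenovo output used in the next step:
$ grep -c "^>" /blue/general_workshop/share/Suwannee/Smartdenovo/Suw.utg.fa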
Polishing contigs using Racon
SMARTdenovo does not include an error-correction or polishing step. Racon generates consensus sequences of similar or better quality than assemblers with built-in correction or consensus steps, and it supports data from different long-read sequencing technologies.
Mapping of Nanopore reads to the assembly
To get started with Racon, we need to map the Nanopore reads to our genome assembly.
The Burrows-Wheeler Aligner (BWA) maps DNA reads to an assembly or reference genome. Three algorithms are available: BWA-backtrack (short-read alignment only), BWA-SW, and BWA-MEM. For long reads, BWA-MEM is faster and more accurate than BWA-SW.
Let's create a new directory in your work directory and copy a submission script template from the /blue/general_workshop/share/bash_files directory.
$ cd /blue/general_workshop/<username>
$ mkdir Polishing
$ cd Polishing
$ cp /blue/general_workshop/share/bash_files/bwa.sh ./bwa.sh
Once the script is in your working directory, we need to edit it.
$ nano bwa.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: bwa.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=BWA # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=8gb # Job memory request
#SBATCH --time 12:00:00 # Time limit hrs:min:sec
#SBATCH --output=Suw_index_%j.out # Standard output and error log
pwd; hostname; date
module load bwa/0.7.17
# First generate index file of assembly
bwa index /blue/general_workshop/share/Suwannee/Smartdenovo/Suw.utg.fa
# Align filtered Nanopore reads to the assembly
bwa mem -t 1 -x ont2d /blue/general_workshop/share/Suwannee/Smartdenovo/Suw.utg.fa \
/blue/general_workshop/share/Suwannee/Suw2_filtered_3000bp_60X.fastq > Suw.sam
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Once you are done, press Ctrl+x to return to bash prompt. Press Y and Enter to save the changes made to the file.
Checking usage of bwa mem
$ module load bwa
$ bwa mem
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
Algorithm options:
-t INT number of threads [1]
Scoring options:
-x STR read type. Setting -x changes multiple parameters unless overridden [null]
pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg: -B9 -O16 -L5 (intra-species contigs to ref)
Note: Due to the page size limitation, only partial options are shown.
A faster option for mapping
minimap2 has largely replaced BWA-MEM for aligning PacBio and Nanopore reads. It retains the major features of BWA-MEM but is about 50 times faster and more accurate.
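If you want to try it, a minimal minimap2 command for this mapping step might look like the sketch below; the map-ont preset is minimap2's setting for Nanopore reads, while the module name/version and the output file name are assumptions (check module spider minimap2 on your system):
module load minimap2   # module name/version is an assumption
minimap2 -ax map-ont \
    /blue/general_workshop/share/Suwannee/Smartdenovo/Suw.utg.fa \
    /blue/general_workshop/share/Suwannee/Suw2_filtered_3000bp_60X.fastq > Suw_minimap2.sam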
Submitting a BWA job
Before we submit a new job, cancel your previous job if it is still running.
$ squeue -u <username>
$ scancel <jobid>
$ squeue -u <username>
To submit the job to SLURM, sbatch
command is used.
$ sbatch bwa.sh
Submitted batch job <jobid>
Constructing the index from the assembly only takes about a minute, but the alignment will take about an hour. Therefore, we will use a pre-computed result to proceed to the next step.
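Once the alignment finishes (or using the pre-computed Suw.sam in the share directory), samtools flagstat gives a quick summary of how many reads were mapped:
$ module load samtools
$ samtools flagstat /blue/general_workshop/share/Suwannee/Polishing/Suw.sam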
Run Racon
Let's copy a submission script template from the /blue/general_workshop/share/bash_files directory.
Check your current working directory.
$ pwd
If for some reason you are not in /blue/general_workshop/<username>/Polishing, navigate to the right directory.
$ cd /blue/general_workshop/<username>/Polishing
$ cp /blue/general_workshop/share/bash_files/Racon.sh ./Racon.sh
Once the script is in your working directory, we need to edit it.
$ nano Racon.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: Racon.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=Racon # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=8gb # Job memory request
#SBATCH --time 12:00:00 # Time limit hrs:min:sec
#SBATCH --output=Racon_Suw_%j.out # Standard output and error log
pwd; hostname; date
module load racon
racon -t 1 /blue/general_workshop/share/Suwannee/Suw2_filtered_3000bp_60X.fastq \
/blue/general_workshop/share/Suwannee/Polishing/Suw.sam \
/blue/general_workshop/share/Suwannee/Smartdenovo/Suw.utg.fa \
> ./SuwRacon.fasta
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Once you are done, press Ctrl+x to return to bash prompt. Press Y and Enter to save the changes made to the file.
Checking usage of racon
$ module load racon
$ racon -h
usage: racon [options ...] <sequences> <overlaps> <target sequences>
#default output is stdout
<sequences>
input file in FASTA/FASTQ format (can be compressed with gzip)
containing sequences used for correction
<overlaps>
input file in MHAP/PAF/SAM format (can be compressed with gzip)
containing overlaps between sequences and target sequences
<target sequences>
input file in FASTA/FASTQ format (can be compressed with gzip)
containing sequences which will be corrected
options:
-t, --threads <int>
default: 1
number of threads
-h, --help
prints the usage
Submitting a Racon job
Before we submit a new job, cancel your previous job if it is still running.
$ squeue -u <username>
$ scancel <jobid>
$ squeue -u <username>
To submit the job to SLURM, the sbatch command is used.
$ sbatch Racon.sh
Submitted batch job <jobid>
Checking the log file
Correction takes about 30 minutes, so let's copy a pre-computed log file from the /blue/general_workshop/share/Suwannee/Polishing directory.
$ cp /blue/general_workshop/share/Suwannee/Polishing/Racon_Suw_58629660.out ./Racon_Suw_58629660.out
$ ls
bwa.sh Racon_Suw_<jobid>.out Suw_index_<jobid>.out SuwRacon.fasta
Racon.sh Racon_Suw_58629660.out Suw_index_58628076.out Suw.sam
$ cat Racon_Suw_58629660.out
/blue/jeremybrawner/smithk/F_Cir/Suwannee2/mapping
c2a-s22.ufhpc
Mon Sep 14 21:03:13 EDT 2020
[racon::Polisher::initialize] loaded target sequences 0.962586 s
[racon::Polisher::initialize] loaded sequences 57.437520 s
[racon::Polisher::initialize] loaded overlaps 40.405221 s
[racon::Polisher::initialize] aligning overlaps [====================] 19.403566 s
[racon::Polisher::initialize] transformed data into windows 15.280284 s
[racon::Polisher::polish] generating consensus [====================] 1307.944380 s
[racon::Polisher::] total = 1442.492319 s
Racon takes about 30 minutes to finish correcting our assembly, so we might not have time for it to complete today. That is alright: pre-computed output (the corrected draft assembly) is provided at /blue/general_workshop/share/Suwannee/Polishing.
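A quick way to compare assemblies before and after polishing is to count sequences and total bases. The awk one-liner below sums the length of every non-header line in a FASTA file; here it is applied to the pre-computed Racon output:
$ grep -c "^>" /blue/general_workshop/share/Suwannee/Polishing/SuwRacon.fasta        # number of sequences
$ awk '!/^>/ {total += length($0)} END {print total}' /blue/general_workshop/share/Suwannee/Polishing/SuwRacon.fasta   # total assembled bases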
Assembly polishing using Pilon
Pilon is a tool that aims to automatically improve draft assemblies. It requires two inputs: a FASTA file of the assembly and a BAM file of Illumina reads aligned to that assembly. Pilon identifies inconsistencies between the assembly and the reads and attempts to fix: 1) single-base differences, 2) small indels, 3) larger indel or block substitution events, 4) gaps, and 5) local misassemblies. Its outputs are a FASTA file of the improved assembly and, optionally, a VCF file describing the discrepancies between the assembly and the Illumina reads.
Polishing Racon-corrected assembly using Pilon
Three major steps are involved:
- Align the Illumina reads to the Racon-corrected assembly using BWA
- Convert the SAM file to BAM and sort it by read position in the Racon-corrected assembly
- Correct the assembly with Pilon, using the sorted BAM file of Illumina paired-end alignments against the Racon-corrected assembly
Run Pilon
Let's create a new directory in your work directory and copy a submission script template from the /blue/general_workshop/share/bash_files directory.
$ cd /blue/general_workshop/<username>/Polishing
$ mkdir Pilon
$ cd Pilon
$ cp /blue/general_workshop/share/bash_files/pilonRound1.sh ./pilonRound1.sh
Once the script is in your working directory, we need to edit it.
$ nano pilonRound1.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: pilonRound1.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=Pilon # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=8gb # Job memory request
#SBATCH --time 12:00:00 # Time limit hrs:min:sec
#SBATCH --output=SuwGenome_pilon_%j.out # Standard output and error log
pwd; hostname; date
module load bwa/0.7.17 samtools/1.15 pilon/1.24
# First create index of Racon-corrected assembly.
bwa index /blue/general_workshop/share/Suwannee/Polishing/SuwRacon.fasta
# Two steps here:
# 1. map the Illumina reads to the Racon-corrected assembly
# 2. convert the output SAM to BAM and sort it by mapping position
bwa mem -t 1 /blue/general_workshop/share/Suwannee/Polishing/SuwRacon.fasta \
/blue/general_workshop/share/Suwannee/TrimFastX_R1.fq \
/blue/general_workshop/share/Suwannee/TrimFastX_R2.fq | \
samtools view - -Sb | samtools sort - -o pilon1.sorted.bam
# Create index of mapping evidence
samtools index pilon1.sorted.bam
# Telling JAVA how much memory it should use (8 GB)
export _JAVA_OPTIONS="-Xmx8g"
# Run pilon
pilon --genome /blue/general_workshop/share/Suwannee/Polishing/SuwRacon.fasta \
--fix all --changes --frags pilon1.sorted.bam --threads 1 --output Suw_pilon1 \
--outdir /blue/general_workshop/<username>/Polishing/Pilon
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Also, replace <username> with your username. Once you are done, press Ctrl+x to return to the bash prompt. Press Y and Enter to save the changes made to the file.
Checking usage of samtools view
$ module load samtools
$ samtools view
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Output options:
-b, --bam Output BAM
General options:
-S Ignored (input format is auto-detected)
Note: Due to the page size limitation, only partial options are shown.
samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
-o FILE Write final output to FILE rather than standard output
Note: Due to the page size limitation, only partial options are shown.
$ module load pilon
$ pilon --help
Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]
[...other options...]
--genome genome.fasta
The input genome we are trying to improve, which must be the reference used
for the bam alignments. At least one of --frags or --jumps must also be given.
--frags frags.bam
A bam file consisting of fragment paired-end alignments, aligned to the --genome
argument using bwa or bowtie2. This argument may be specifed more than once.
--changes
If specified, a file listing changes in the <output>.fasta will be generated.
--fix fixlist
A comma-separated list of categories of issues to try to fix:
"snps": try to fix individual base errors;
"indels": try to fix small indels;
"gaps": try to fill gaps;
"local": try to detect and fix local misassemblies;
"all": all of the above (default);
"bases": shorthand for "snps" and "indels" (for back compatibility);
"none": none of the above; new fasta file will not be written.
Submitting a Pilon job
Racon should be done by now; check the job status first.
Before we submit a new job, cancel your previous job if it is still running.
$ squeue -u <username>
$ scancel <jobid>
$ squeue -u <username>
To avoid any potential errors, we will submit the script using the Racon-corrected assembly that we computed previously.
To submit the job to SLURM, the sbatch command is used.
$ sbatch pilonRound1.sh
Submitted batch job <jobid>
Further polishing Pilon assembly using Pilon
We polished the assembly with Pilon twice: Illumina reads were mapped to the first Pilon-polished assembly to obtain the second Pilon-polished assembly. Each Pilon run takes about 30 minutes, so we have prepared the second Pilon-polished assembly for the next step.
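The second round is not part of today's exercise, but as a sketch it simply repeats the round-1 steps with the first Pilon output as the new genome. The commands below assume the same modules as pilonRound1.sh are loaded; the BAM file name pilon2.sorted.bam and the local path to Suw_pilon1.fasta are placeholders:
# Sketch of Pilon round 2 (same modules and Illumina reads as round 1)
bwa index Suw_pilon1.fasta
bwa mem -t 1 Suw_pilon1.fasta \
    /blue/general_workshop/share/Suwannee/TrimFastX_R1.fq \
    /blue/general_workshop/share/Suwannee/TrimFastX_R2.fq | \
    samtools view - -Sb | samtools sort - -o pilon2.sorted.bam
samtools index pilon2.sorted.bam
pilon --genome Suw_pilon1.fasta --fix all --changes \
    --frags pilon2.sorted.bam --threads 1 --output Suw_pilon2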
Scaffolding: mapping the polished assembly to the reference genome
RagTag is a collection of software tools for scaffolding and improving genome assemblies. RagTag performs:
- Homology-based misassembly correction
- Homology-based assembly scaffolding and patching
- Scaffold merging
In our case, we will use the scaffolding function.
RagTag
- Correction: RagTag identifies and corrects potential misassemblies in the query assembly using a reference genome.
- Scaffold: RagTag orders and orients the sequences of a draft query assembly into longer sequences using a reference genome. Sequences are joined with gaps, without altering the sequences themselves.
- Patch: RagTag fills gaps in the query assembly using a reference genome.
- Merge: RagTag merges different scaffoldings of the same query assembly, e.g. those produced by different scaffolding tools.
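We will only use the scaffold module today, but the other modules follow the same command pattern. The calls below are a sketch of their basic usage; the file names are placeholders:
ragtag.py correct reference.fna query.fasta                    # homology-based misassembly correction
ragtag.py patch target.fasta query.fasta                       # fill gaps in the target using the query
ragtag.py merge query.fasta scaffolds1.agp scaffolds2.agp      # merge multiple scaffoldings of the same query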
Run RagTag
First, create a RagTag folder under /blue/general_workshop/<username>/Polishing, then copy the script from /blue/general_workshop/share/bash_files.
$ cd /blue/general_workshop/<username>/Polishing
$ mkdir RagTag
$ cd RagTag
$ cp /blue/general_workshop/share/bash_files/ragtag.sh ./ragtag.sh
Adding information to SLURM script
We have to modify the template to provide SLURM with information about the job.
$ nano ragtag.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: ragtag.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=RagTag # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB # Job memory request
#SBATCH --time 1:00:00 # Time limit hrs:min:sec
#SBATCH --output=Ragtag_%j.out # Standard output and error log
module load ragtag/2.0.1
ragtag.py scaffold /blue/general_workshop/share/GCA_000497325.3/GCA_000497325.3_OldRef.fna \
/blue/general_workshop/share/Suwannee/Polishing/pilon_round2/Suw_pilon2.fasta
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Once you are done, press Ctrl+x to return to bash prompt. Press Y and Enter to save the changes made to the file.
Running a job in SLURM
Submitting a RagTag job
Before we submit a new job, we need to cancel your previous job.
$ squeue -u <username>
$ scancel <jobid>
$ squeue -u <username>
To submit the job to SLURM, the sbatch command is used.
$ sbatch ragtag.sh
Submitted batch job <jobid>
RagTag only takes a minute, so let's check the output in /blue/general_workshop/<username>/Polishing/RagTag/ragtag_output.
$ cd /blue/general_workshop/<username>/Polishing/RagTag/ragtag_output
$ ls
ragtag.scaffold.agp ragtag.scaffold.confidence.txt ragtag.scaffold.stats
ragtag.scaffold.asm.paf ragtag.scaffold.err
ragtag.scaffold.asm.paf.log ragtag.scaffold.fasta
Outputs of RagTag
- ragtag.scaffold.agp: the order and orientation of the query assembly sequences, in AGP format.
- ragtag.scaffold.fasta: the scaffolds in FASTA format, as defined by ragtag.scaffold.agp.
- ragtag.scaffold.stats: summary statistics for the scaffolding process.
Let’s look at the statistics report.
$ cat ragtag.scaffold.stats
placed_sequences placed_bp unplaced_sequences unplaced_bp gap_bp gap_sequences
19 46269058 2 917303 600 6
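For more detail, ragtag.scaffold.agp records where each query sequence was placed within the new scaffolds (AGP is a tab-separated format listing the object, coordinates, component type, component name or gap length, and orientation), and ragtag.scaffold.confidence.txt gives confidence scores for each placement:
$ head -5 ragtag.scaffold.agp
$ head -5 ragtag.scaffold.confidence.txt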
Quality assessment of assemblies
Computing metrics of assemblies using QUAST
To perform assembly evaluation, we will run QUAST. QUAST computes several common metrics, including misassemblies, contig N50, and genome fraction (aligned bases in the assembly relative to the reference genome). QUAST provides several outputs, including report.txt, an assessment summary in plain text, and report.html, a report with interactive plots. You can read the complete manual here.
Run QUAST
First, create a QUAST folder at /blue/general_workshop/<username>, then copy a submission script from /blue/general_workshop/share/bash_files.
$ cd /blue/general_workshop/<username>
$ mkdir QUAST
$ cd QUAST
$ cp /blue/general_workshop/share/bash_files/quast.sh ./quast.sh
Adding information to SLURM script
We have to modify the template to provide SLURM with information about the job.
$ nano quast.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: quast.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=QUAST # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb # Job memory request
#SBATCH --time 1:00:00 # Time limit hrs:min:sec
#SBATCH --output=QUAST_%j.out # Standard output and error log
pwd; hostname; date
module load quast
quast.py /blue/general_workshop/<username>/Polishing/RagTag/ragtag_output/ragtag.scaffold.fasta \
/blue/general_workshop/share/GCA_000497325.3/GCA_000497325.3_OldRef.fna \
/blue/general_workshop/share/GCA_024047395.1/GCA_024047395.1_New_Ref.fna \
-o quastResult
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Also, replace <username> with your username. Once you are done, press Ctrl+x to return to the bash prompt. Press Y and Enter to save the changes made to the file.
Fusarium circinatum has 21 assemblies available on NCBI Assembly. We will compare the two most recently uploaded assemblies: GCA_000497325.3 (uploaded in 2018; the previous reference genome) and GCA_024047395.1 (uploaded in July 2022; the current reference genome).
The GCA_000497325.3 assembly was built from reads generated by ABI SOLiD and 454 sequencing (earlier next-generation sequencing technologies); GCA_024047395.1 is a hybrid assembly using PacBio and Illumina HiSeq data.
Running a job in SLURM
Submitting a QUAST job
To submit the job to SLURM, the sbatch command is used.
$ sbatch quast.sh
Submitted batch job <jobid>
QUAST analysis only takes a couple of minutes, so let's look at the outputs.
$ ls
QUAST_<jobid>.out quastResult quast.sh
$ cd quastResult
$ ls
basic_stats report.html transposed_report.tex
icarus.html report.pdf transposed_report.tsv
icarus_viewers report.tex transposed_report.txt
QUAST_43889307.out report.tsv
quast.log report.txt
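The same metrics are available as plain text, so you can take a quick look without leaving the terminal:
$ head -30 report.txt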
Let's download the report.html file. On your UFRC OnDemand dashboard:
- Click Files, then /blue/general_workshop.
- Find your folder, click QUAST, then click quastResult.
- Select report.html, then click Download.
Open the HTML file in your local download folder.
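If you prefer the command line, you can instead copy the report to your own computer with scp. Run this in a terminal on your local machine; the hostname below is an assumption, so use whatever login address you normally use for the cluster:
# Run on your local machine, not on the cluster; hostname is an assumption
scp <username>@hpg.rc.ufl.edu:/blue/general_workshop/<username>/QUAST/quastResult/report.html .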
Evaluate assembly completeness using BUSCO
BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative measure of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. It is open-source software distributed with lineage-specific sets of single-copy orthologs (Simão et al., 2015).
Run BUSCO analysis
First, create a BUSCO folder in your work directory, then copy the BUSCO configuration (containing all the required dependencies) from /blue/general_workshop/share/BUSCO/augustus and the submission script from /blue/general_workshop/share/bash_files.
$ cd /blue/general_workshop/<username>
$ mkdir BUSCO
$ cd BUSCO
$ cp -r /blue/general_workshop/share/BUSCO/augustus ./augustus
$ cp /blue/general_workshop/share/bash_files/busco.sh ./busco.sh
Adding information to SLURM script
We have to modify the template to provide SLURM with information about the job.
$ nano busco.sh
-----------------------------------------------------------------------------------------------
GNU nano 2.3.1 File: busco.sh
-----------------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name=BUSCO # Job name
#SBATCH --account=general_workshop # Account to run the computational task
#SBATCH --qos=general_workshop # Account allocation
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # You need to provide your email address
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --mem=2gb # Job memory request
#SBATCH --time 1:00:00 # Time limit hrs:min:sec
#SBATCH --output=BUSCO_%j.out # Standard output and error log
pwd; hostname; date
module load busco/5.3.0
busco -f -i /blue/general_workshop/share/Suwannee/Polishing/ragtag/ragtag_output/ragtag.scaffolds.fasta \
-o BUSCO_out_Suw --lineage_dataset hypocreales_odb10 -m genome
-----------------------------------------------------------------------------------------------
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
-----------------------------------------------------------------------------------------------
Change the <email_address> to your email address where you can check email. Once you are done, press Ctrl+x to return to bash prompt. Press Y and Enter to save the changes made to the file.
Checking usage of busco
$ module load busco/5.3.0
$ busco -h
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
optional arguments:
-i FASTA FILE, --in FASTA FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
-l LINEAGE, --lineage_dataset LINEAGE
Specify the name of the BUSCO lineage to be used.
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
Running a job in SLURM
Submitting a BUSCO job
To submit the job to SLURM, the sbatch command is used.
$ sbatch busco.sh
Submitted batch job <jobid>
BUSCO analysis takes about an hour, so we have prepared pre-computed outputs.
We will copy the BUSCO summaries for our assembly, the current reference genome (GCA_024047395.1), and the previous reference genome available on NCBI (GCA_000497325.3).
$ cp /blue/general_workshop/share/BUSCO/BUSCO_out_Suw/short_summary.specific.hypocreales_odb10.BUSCO_out_Suw.txt ./BUSCO_out_Suw.txt
$ cp /blue/general_workshop/share/BUSCO/BUSCO_out_Fc_ref/short_summary.specific.hypocreales_odb10.BUSCO_out_Fc_ref.txt ./BUSCO_out_Fc_ref.txt
$ cp /blue/general_workshop/share/BUSCO/BUSCO_out_Fc_oldref/short_summary.specific.hypocreales_odb10.BUSCO_out_Fc_oldref.txt ./BUSCO_out_Fc_oldref.txt
$ ls
augustus
busco.sh
BUSCO_<jobid>.out
BUSCO_out_Fc_oldref.txt
BUSCO_out_Fc_ref.txt
BUSCO_out_Suw.txt
$ cat BUSCO_out_Suw.txt
# BUSCO version is: 5.3.0
# The lineage dataset is: hypocreales_odb10 (Creation date: 2020-08-05, number of genomes: 50, number of BUSCOs: 4494)
# Summarized benchmarking in BUSCO notation for file /blue/general_workshop/share/Suwannee/Polishing/ragtag/ragtag_output/ragtag.scaffolds.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:98.0%[S:97.7%,D:0.3%],F:0.3%,M:1.7%,n:4494
4401 Complete BUSCOs (C)
4389 Complete and single-copy BUSCOs (S)
12 Complete and duplicated BUSCOs (D)
12 Fragmented BUSCOs (F)
81 Missing BUSCOs (M)
4494 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
$ cat BUSCO_out_Fc_oldref.txt
# BUSCO version is: 5.3.0
# The lineage dataset is: hypocreales_odb10 (Creation date: 2020-08-05, number of genomes: 50, number of BUSCOs: 4494)
# Summarized benchmarking in BUSCO notation for file /blue/general_workshop/share/BUSCO/GCA_000497325.3/GCA_000497325.3_old_ref.fna
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:94.0%[S:93.9%,D:0.1%],F:3.1%,M:2.9%,n:4494
4225 Complete BUSCOs (C)
4219 Complete and single-copy BUSCOs (S)
6 Complete and duplicated BUSCOs (D)
139 Fragmented BUSCOs (F)
130 Missing BUSCOs (M)
4494 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
$ cat BUSCO_out_Fc_ref.txt
# BUSCO version is: 5.3.0
# The lineage dataset is: hypocreales_odb10 (Creation date: 2020-08-05, number of genomes: 50, number of BUSCOs: 4494)
# Summarized benchmarking in BUSCO notation for file /blue/general_workshop/share/BUSCO/GCA_024047395.1/GCA_024047395.1_assembly_ref.fa
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:97.6%[S:97.3%,D:0.3%],F:0.6%,M:1.8%,n:4494
4384 Complete BUSCOs (C)
4372 Complete and single-copy BUSCOs (S)
12 Complete and duplicated BUSCOs (D)
28 Fragmented BUSCOs (F)
82 Missing BUSCOs (M)
4494 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
metaeuk: 5.34c21f2
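BUSCO also ships a helper script, generate_plot.py, that summarises several short_summary files into a single comparison bar chart (it needs R with ggplot2 to draw the figure). A sketch, assuming the script is on your PATH after module load busco and using the pre-computed summaries in the share directory:
$ mkdir -p BUSCO_summaries
$ cp /blue/general_workshop/share/BUSCO/BUSCO_out_*/short_summary.specific.hypocreales_odb10.*.txt BUSCO_summaries/
$ generate_plot.py -wd BUSCO_summaries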
References and additional reading
Key Points
Don't forget to edit the email address and username in the SLURM scripts.
There are several other ways to submit jobs: some software supports multithreading, and you can write array scripts to request resources for many similar jobs.