Quick start

1. Good to know

The following practices are recommended when working with Ananas:

Use fastq or fasta files interchangeably; either can be gzipped (the format is detected automatically).
Do not trim reads for quality or other reasons (removing entire pairs is OK).
In a multi-sample data set, assemble each sample separately (with both its forward and reverse reads).
Use strand-specific data, if possible.
Take advantage of the information Ananas provides on exactly which read has been used in which contig and at what positions.
If you need isoforms, run Ananas with the top parameter set to 0. If your final assembly includes isoforms and you later need to remove them, post-process the output by running GetTopFromFasta.
Perform test runs with the test data (downloadable here) to get familiar with Ananas and its options.
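
The per-sample recommendation above can be sketched as a small dry run that prints one Ananas command per sample. The file layout (reads/<sample>_1.fastq.gz and reads/<sample>_2.fastq.gz), the sample names, the output directory naming and the choice of -dir fr are all assumptions made for illustration; only the option names come from the option list in section 3.

```shell
# Sketch: print one Ananas command per sample (dry run, nothing is executed).
# Hypothetical layout: <reads_dir>/<sample>_1.fastq.gz and <reads_dir>/<sample>_2.fastq.gz
ananas_cmd_for_sample() {
    sample=$1      # sample name, reused as output directory stem and contig prefix
    reads_dir=$2   # directory holding the read files (assumption)
    cores=$3       # value passed to -n
    # -dir fr assumes pairs that face each other; adjust to your library
    printf 'Ananas -i %s/%s_1.fastq.gz,%s/%s_2.fastq.gz -o %s_out -dir fr -prefix %s -n %s\n' \
        "$reads_dir" "$sample" "$reads_dir" "$sample" "$sample" "$sample" "$cores"
}

# Example: two hypothetical samples, four cores each
for s in sampleA sampleB; do
    ananas_cmd_for_sample "$s" reads 4
done
```

Giving each sample its own -prefix keeps contig names distinguishable if the assemblies are later combined. Once the printed commands look right, they can be run directly.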

2. Installation

Ananas runs on Linux and requires cmake and gcc version 5 or higher. To download and compile, type:

git clone https://github.com/AnanasAssembler/AnanasAssembler.git
cd AnanasAssembler
./configure
make -C build -j 4
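
Before compiling, it may help to confirm that the prerequisites are present. The following is a minimal sketch, not part of Ananas; the function name gcc_version_ok is hypothetical, and the check only looks at the major version reported by gcc -dumpversion.

```shell
# Succeed if a gcc version string (e.g. "9.4.0" or "11") has a major version >= 5.
gcc_version_ok() {
    major=${1%%.*}                 # keep the text before the first dot
    case $major in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as failure
    esac
    [ "$major" -ge 5 ]
}

# Report on the local toolchain without aborting the shell.
if ! command -v cmake >/dev/null 2>&1; then
    echo "cmake not found"
fi
if command -v gcc >/dev/null 2>&1; then
    if gcc_version_ok "$(gcc -dumpversion)"; then
        echo "gcc version OK"
    else
        echo "gcc is older than version 5"
    fi
else
    echo "gcc not found"
fi
```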

3. Assemble a transcriptome

N.B. Before running your data we recommend downloading some test data here and performing a few test runs.

Run:

Ananas -i <file1_r.fastq,file1_l.fastq> -o <output_directory> -dir <read_orientation> -n <cores>

Every parameter that has a default value shows it at the end of its line as (def= ).
If the user gives no value, this default is used.

The list of all options:

    -i<string> : input fasta file
    This should be in the form of either one file containing all reads or several files separated by comma. 
    These files can be in fasta or fastq format and can also be gzipped.

    -dir<string> : direction of pairs: fr towards each other, ff same direction, na unpaired
    fr: forward-reverse reads face each other, one in the sense direction and one in the antisense direction
    ff: forward-forward reads both face the same way and are both sense or both antisense
    na: not-applicable applies to reads that are not paired

    -o<string> : output directory (def=ananas_out)
    The directory where all the output and intermediate files will be placed 

    -m<double> : minimum overlap identity (def=0.98)
    The minimum acceptable identity of an overlapping read for it to be used in creating assemblies

    -mg<double> : minimum identity for grouping (def=0.99) 
    The minimum acceptable identity of sequences to be grouped together into a consensus read

    -b<int> : bandwidth of alignments (maximum indel size) (def=0)  
    This is the bandwidth used in the alignment of reads
    If the sequenced reads are expected to contain high levels of indels, this bandwidth should be set to higher values 

    -strand<bool> : strand specificity (0=no 1=yes) (def=0) 
    Determine whether the given reads are for strand-specific data or not 

    -libSize<int> : Maximum library size (def=500)   
    The maximum library size, as specified by the sequencing platform that produced the reads

    -minContigLen<int> : minimum length of a single-contig scaffold to report (def=200) 
    The minimum length of a contig to be accepted as output 

    -ml<int> : minimum overlap (for alignments) (def=35)
    Minimum length of overlapping region between two reads to be used in the assembly 

    -maxOverlap<int> : Threshold on the maximum number of overlaps per read, default is twice the read size (def=0)
    The maximum number of overlaps to allow for each read. This is to limit the significance of reads that contain repeated structures

    -s<int> : step size (for alignments) (def=5)
    The step size to use in the modified suffix array when building the read overlaps

    -n<int> : number of CPU cores (def=1)
    Number of cores to use in the general search stages

    -n2<int> : number of CPU cores for isoform enumeration (def=1)
    Number of cores to use in the exhaustive search stage. Unlike the other stages, this stage does not use shared memory among the cores

    -no<int> : number of processes for overlap finding (def=2)
    Number of cores to use in the overlap finding stage 

    -readGroupFile<string> : read grouping information file if available (def=)
    Provide read group file if available from previous runs 

    -outReadNames<string> : Print grouped read names associating them to their index (def=)
    Output file name for detailed information on which reads are grouped with which other reads

    -prefix<string> : The prefix to add to all generated contig names (def=Sample1)
    A prefix to be added to the names of all final assembled contigs

    -rr<bool> : Remove redundant transcripts (def=0)
    Run assembled contigs through self alignment and remove those that are fully contained in another contig 

    -top<bool> : Keep only the top contig from any scaffold (def=1)
    When set to 1, isoforms are not included in the final assembly

    -gaps<bool> : Use gapped alignments for consensus (def=0)

    -noIso<bool> : No isoforms (skips exhaustive search) (def=0)
    Skip final exhaustive search step 

    -pairRestrict<bool> : Restrict contigs with pair-end support (def=1)
    Keep contigs only to the point where they are fully supported by paired reads 

    -ll<int> : Application logging level - defaults to 0, choose 1 to 4 for debugging (def=0)    
    Used only for debugging purposes
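
As an illustration of combining several of the options above, the sketch below prints (without running) a command for a strand-specific library that keeps isoforms and removes redundant transcripts. It assumes that boolean options take an explicit 0 or 1, as the help text suggests; the function name, file names and output directory are placeholders.

```shell
# Dry run: print an Ananas command that overrides several defaults.
ananas_isoform_cmd() {
    reads=$1       # comma-separated read files for -i
    outdir=$2      # output directory for -o
    cores=$3       # cores for the general search stages (-n)
    iso_cores=$4   # cores for isoform enumeration (-n2); memory is not shared here
    # -dir fr and the 0/1 values for -strand, -top and -rr are illustrative choices
    printf 'Ananas -i %s -o %s -dir fr -strand 1 -top 0 -rr 1 -n %s -n2 %s\n' \
        "$reads" "$outdir" "$cores" "$iso_cores"
}

# Example with placeholder file names
ananas_isoform_cmd left.fastq.gz,right.fastq.gz isoform_run 8 2
```

Options that are not listed on the command line keep the defaults shown in the option list.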

Output:

All assembled transcripts will be in the output directory in final.fa.
All the corresponding contig details will be in the output directory in final.layout.

N.B. If you want only the top contigs (those with the highest reliability, defined as read-pair support), run:

GetTopFromFasta -if final.fa -il final.layout

The top assembled transcripts (without isoforms) will be in the output directory in final.fa.top.
Info about read ids and coordinates in all the top assembled contigs will be in the output directory in final.layout.top.
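
A quick sanity check after this step is to compare the number of sequences in the full assembly and in the top-contig subset. The sketch below counts FASTA header lines; the helper name count_fasta_records is hypothetical, and the directory ananas_out is simply the default value of -o.

```shell
# Count FASTA records (lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print a record count for each assembly file that exists.
outdir=ananas_out   # default output directory; change if -o was set
for f in "$outdir/final.fa" "$outdir/final.fa.top"; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    fi
done
```

If the assembly was run with isoforms enabled, final.fa.top would be expected to contain no more records than final.fa.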

4. Downstream processing

The output assembly can be assessed using InSeqt or Quast.
All downstream processing tools in the Trinity package, such as those for annotation, should also be applicable to Ananas assemblies.

If you are interested in read count data (e.g. for gene expression analysis), you can obtain it with the script Reads_Count_Statistics.pl located here. The script generates a file with the length, read count and number of read pairs for every contig in the assembly.
For subsequent quantitative comparisons, normalization, and differential expression analyses, we recommend moose².

If you are interested in read location data (e.g. for alternative splicing analysis), look at the final.layout file. It lists all alternative contigs belonging to each scaffold, as well as the read ids and coordinates in all the assembled contigs.
