Faster2

News

What is Faster2

Faster2 is an extensible C++11 framework and program for efficient access to, and extraction of, DNA/RNA and protein sequences from FASTA and FASTQ files. It works with large collections of raw as well as compressed files, and is built around a set of filters that can be organized into a pipeline. Faster2 indexes its input data in order to accelerate all supported operations. It can be easily customized and extended with new filters, and its pipeline building sub-system can be incorporated into other tools. Faster2 is neither a database system nor a data analytics tool. Its sole purpose is to simplify the tedious operations that bioinformaticians and computational biologists perform routinely, and that otherwise often require writing specialized text-processing scripts.


Requirements

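Faster2 is written in C++11 and makes extensive use of the Boost library. The Makefile distributed with Faster2 links against the Boost iostreams, system and filesystem libraries, and the build described in the tutorial below was tested with g++ 4.6 and clang++ 3.1.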


Download

The latest version of Faster2 is 0.31 (2014-09-09), and it can be obtained from here.
The tarball provides a README file that explains how to install Faster2.
If you are a Linux user you can download our 64-bit executable from here.
You can download pre-release updates from GitLab.


Tutorial

In this tutorial I will first show you how to download and install Faster2, and then how to use some of its basic filters. At the end I will explain how to write a simple filter for Faster2 and how to add it to the main program.

Installation

Let us start by downloading and compiling Faster2. In your favorite terminal, enter the directory that will be our working location. For example, I will use the tmp folder in my home directory:

$ cd
$ cd tmp/

Now run wget to get the latest Faster2 archive:

$ wget http://www.jzola.org/faster2/faster2-current.tar.bz2
$ ls -la
total 20
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 68 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 10571 xxx x xx:xx faster2-current.tar.bz2

The next step is to uncompress the archive with tar and bzip2:

$ tar xfjv faster2-current.tar.bz2
faster2/
faster2/bio/
faster2/bio/fastx_iterator.hpp
faster2/AbstractFilter.hpp
faster2/faster2.cpp
faster2/index.hpp
faster2/NamesFilter.hpp
faster2/jaz/
faster2/jaz/string.hpp
faster2/jaz/LICENSE
faster2/jaz/algorithm.hpp
faster2/jaz/files.hpp
faster2/Makefile
faster2/SampleFilter.hpp
faster2/SelectFilter.hpp
faster2/README
faster2/FilterFilter.hpp
faster2/LICENSE
faster2/ReportFilter.hpp
faster2/FilterFactory.hpp
faster2/pipe.hpp
faster2/stream.hpp
faster2/PrintFilter.hpp

Voila, we are ready to compile. Faster2 makes extensive use of C++11 features and of the Boost library. These days Boost is routinely provided with any Linux distro, so just make sure that you have it installed. C++11 support, on the other hand, varies between compilers, but GCC and Clang seem to be the most advanced. I tested Faster2 with g++ 4.6 and clang++ 3.1, and so I recommend you use one of them. Let us take a look at the Faster2 Makefile:

$ cd faster2
$ head -n 10 Makefile
CXX=g++

#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib

CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
#CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)

LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem

This Makefile requires no tuning if you have the standard Boost installation and you are fine with using GCC. If you want to change the compiler, simply edit the CXX variable and, if you need compiler-specific options, the CXXFLAGS variable. For instance, to use Clang I make the following changes:

$ head -n 10 Makefile
CXX=clang++

#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib

#CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)

LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem

This selects the clang++ compiler and makes it use the libc++ standard library (which must be available in the system). If the Boost library is not installed in the default location, you should uncomment and customize the BOOST_INCLUDE and BOOST_LIB variables. For instance, in the Makefile distributed with Faster2 both variables are prepared for the case where Boost has been installed in the /usr/local/boost directory. OK, we are ready to run make.

$ make
clang++ -std=c++0x -stdlib=libc++ -O3 -I. -I/usr/local/boost/include faster2.cpp
 -o faster2 -L/usr/local/boost/lib -lboost_iostreams -lboost_system -lboost_filesystem

If everything ran smoothly we can launch faster2:

$ ./faster2
Version: faster2 0.1 2012-06-22
Copyright: (c) 2012 Jaroslaw Zola <jaroslaw.zola@gmail.com>
License: Distributed under the MIT License

Usage: faster2 DIR COMMAND|FILTER[,FILTER1,FILTER2,...]
where DIR is the database directory
and COMMAND is one of:
   index ['fasta'|'fastq']           create database index
and FILTER is any of:
   filter <'N'|size>                 filter by string or size
   names [file]                      write names of sequences
   print [file] ['fasta'|'fastq']    write sequences
   report [file]                     write report
   sample <size> [seed]              create sample without replacement
   select <name> [name1 name2 ...]   select by name

Congratulations, the installation is complete. Now you can (but do not have to) copy the faster2 executable to some more convenient location, for instance the bin/ directory in your home folder (if you have one), and you can remove the unpacked faster2/ directory.

Creating Index

Faster2 relies on data indexing. The principal idea is the following: first we create an index of the FASTA or FASTQ files in a given directory, which is a one-time effort. Next, we pass the index to filters, which in turn are organized into a pipeline implementing the task of interest. Faster2 works with raw text files as well as gzip and bzip2 compressed files. However, it does not support directories that contain FASTA and FASTQ files at the same time. In other words, it can index either FASTA or FASTQ files but not both together. Consider the following data directory with FASTA files:

$ ls -la data/
drwxr-xr-x 2 zola zola xxxx xxx x xx:xx .
drwxr-xr-x 5 zola zola xxxx xxx x xx:xx ..
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2

To index this directory we call the index command and specify the type of sequence. Here we can use either "nt" for DNA/RNA or "aa" for proteins:

$ ./faster2 data/ index nt
$ ls -la data/
total 28
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 5 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 219 xxx x xx:xx .f2index
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2

Observe that a new file, .f2index, has been created in the indexed directory. This is a binary file that stores the actual index. Note also that Faster2 transparently handled the different data compression formats. By default Faster2 assumes that files are in the FASTA format. To create an index of FASTQ files it is sufficient to add the fastq option to the indexing command. Note that our example directory holds FASTA files, which is presumably why the command reports an error here:

$ ./faster2 data/ index nt fastq
Error: failed to build index

One important thing to keep in mind is that the index is static and captures the state of the indexed directory as it was at the time of indexing. Hence, whenever the content of the directory changes, the index must be recreated. Finally, Faster2 is very flexible and handles all possible variants of FASTA and FASTQ files.
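To make the idea of indexing more concrete, here is a minimal sketch of what an index over a single uncompressed FASTA stream could look like. It is only an illustration: the IndexEntry layout and the functions build_fasta_index and fetch_sequence are hypothetical names introduced here, and the sketch says nothing about the actual binary format of the index file or about how Faster2 handles compressed input.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One index entry per sequence. Hypothetical layout, not Faster2's format.
struct IndexEntry {
    std::string name;    // header text up to the first whitespace
    std::size_t offset;  // byte offset of the '>' that opens the record
    std::size_t length;  // number of sequence characters in the record
};

// Scan a FASTA stream once and record where every sequence starts.
std::vector<IndexEntry> build_fasta_index(std::istream& in) {
    std::vector<IndexEntry> index;
    std::string line;
    std::size_t pos = 0;  // byte offset of the line currently being read
    while (std::getline(in, line)) {
        const std::size_t consumed = line.size() + 1;  // line plus its '\n'
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!line.empty() && line[0] == '>') {
            const std::size_t end = line.find_first_of(" \t");
            IndexEntry entry;
            entry.name = (end == std::string::npos) ? line.substr(1)
                                                    : line.substr(1, end - 1);
            entry.offset = pos;
            entry.length = 0;
            index.push_back(entry);
        } else if (!index.empty()) {
            index.back().length += line.size();
        }
        pos += consumed;
    }
    return index;
}

// Jump straight to a record using its stored offset and return its sequence.
std::string fetch_sequence(std::istream& in, const IndexEntry& entry) {
    in.clear();  // the indexing pass leaves the stream in an end-of-file state
    in.seekg(static_cast<std::streamoff>(entry.offset));
    std::string line, seq;
    std::getline(in, line);
    assert(!line.empty() && line[0] == '>');  // offset must point at a header
    while (in.peek() != '>' && std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        seq += line;
    }
    return seq;
}
```

The sketch also shows why a static index goes stale: the stored offsets are only valid for the exact bytes that were scanned, so any change to the file requires rebuilding the index.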

Using Filters

Once the index has been created we are ready to start using filters. In general we can specify as many filters as we want, in any order we like. Filters must be separated by commas. Together, the specified filters form a pipeline in which the output of one filter is passed as input to the next. Faster2 implicitly adds to the beginning of the pipeline a filter that selects all sequences. The example below demonstrates a simple pipeline that first selects sequences that are strictly DNA/RNA or protein (e.g. DNA/RNA sequences with no 'N' or 'X' bases), then creates a random sample of size 10 from the selected sequences, and finally writes the sample to a FASTA file.

$ ./faster2 data/ filter N, sample 10, print sample10.fa
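The way data moves through such a pipeline can be sketched in a few lines of C++. This is a simplified model written for this tutorial, not the code found in Faster2: the Sequence, Batch and Filter types, and the functions run_pipeline, drop_containing and select_names, are hypothetical, and the real classes (for example those in AbstractFilter.hpp) may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Illustrative types only; Faster2's own filter classes may differ.
struct Sequence {
    std::string name;
    std::string residues;
};

typedef std::vector<Sequence> Batch;
// A filter consumes the output of the previous stage and returns its own.
typedef std::function<Batch(const Batch&)> Filter;

// Apply the filters left to right, starting from all sequences.
Batch run_pipeline(const Batch& all, const std::vector<Filter>& filters) {
    Batch current = all;  // implicit first stage: select everything
    for (const Filter& f : filters) {
        assert(static_cast<bool>(f));  // an empty std::function cannot be called
        current = f(current);
    }
    return current;
}

// Stand-in for a "filter N" style stage: drop sequences that contain
// any of the given symbols.
Filter drop_containing(const std::string& symbols) {
    return [symbols](const Batch& in) -> Batch {
        Batch out;
        for (const Sequence& s : in) {
            if (s.residues.find_first_of(symbols) == std::string::npos) out.push_back(s);
        }
        return out;
    };
}

// Stand-in for a "select" style stage: keep only sequences with listed names.
Filter select_names(const std::vector<std::string>& names) {
    return [names](const Batch& in) -> Batch {
        Batch out;
        for (const Sequence& s : in) {
            if (std::find(names.begin(), names.end(), s.name) != names.end()) out.push_back(s);
        }
        return out;
    };
}
```

Calling run_pipeline(all, {drop_containing("NX"), select_names({"s3"})}) first removes ambiguous sequences and then keeps only s3 if it survived. Because every stage sees only what the previous one produced, the order of filters matters.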

Creating Report

The most basic filter we can run is report. This filter summarizes the output of the previous filter in the pipeline. For instance, the following command generates a summary of the entire indexed data:

$ ./faster2 data/ report
total files     3
total reads     3
quality scores  no
clean sequences     3
average sequence    7915.67
shortest sequence   7788
longest sequence    8166

By default the report is written to the standard output. However, you can specify the name of a file in which the report should be stored. Keep in mind that the special name '-' denotes the standard output. The same is true for other filters that write output, such as print and names.
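The length statistics in the report above are simple to reproduce. The sketch below computes the count, average, shortest and longest values from a list of sequence lengths. The Report struct and the summarize function are hypothetical names used only for illustration, and the sketch does not cover the other report fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Length summary of a set of sequences (hypothetical struct).
struct Report {
    std::size_t count;
    double average;
    std::size_t shortest;
    std::size_t longest;
};

// Compute the summary in a single pass over the sequence lengths.
Report summarize(const std::vector<std::size_t>& lengths) {
    assert(!lengths.empty());  // the statistics are undefined for zero sequences
    Report r = {lengths.size(), 0.0, lengths[0], lengths[0]};
    std::size_t total = 0;
    for (std::size_t len : lengths) {
        total += len;
        r.shortest = std::min(r.shortest, len);
        r.longest = std::max(r.longest, len);
    }
    r.average = static_cast<double>(total) / static_cast<double>(lengths.size());
    return r;
}
```

For the three sequences reported above, the lengths 7788 and 8166 are given directly, and the third, 7793, follows from the reported average. Calling summarize on these three lengths gives an average of about 7915.67.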

Writing Selected Sequences

As already explained, data flows between filters sequentially. At any point in the pipeline we can insert the print filter, which writes the result of the processing so far to a FASTA or FASTQ file. For instance, to write a randomly selected sequence from the pool of all sequences to the standard output, we can combine the sample and print filters:

$ ./faster2 data/ sample 1, print
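For readers curious about what sampling without replacement involves, here is one common way to do it, a partial Fisher-Yates shuffle over sequence positions. I do not claim that this is the algorithm used by the sample filter. The function sample_indices and its parameters are assumptions made for this sketch, with the seed playing the role of the optional seed argument.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Pick k distinct positions out of [0, n) with a partial Fisher-Yates shuffle.
// The same seed gives the same sample on a given standard library.
std::vector<std::size_t> sample_indices(std::size_t n, std::size_t k, unsigned seed) {
    assert(k <= n);  // cannot draw more items than exist without replacement
    std::vector<std::size_t> pool(n);
    std::iota(pool.begin(), pool.end(), std::size_t(0));  // 0, 1, ..., n-1
    std::mt19937 rng(seed);
    for (std::size_t i = 0; i < k; ++i) {
        // Choose one of the not yet selected positions and move it to slot i.
        std::uniform_int_distribution<std::size_t> pick(i, n - 1);
        std::swap(pool[i], pool[pick(rng)]);
    }
    pool.resize(k);  // the first k slots hold the sample
    return pool;
}
```

Since each chosen position is swapped out of the remaining pool, no sequence can be drawn twice, which is exactly what "without replacement" means. A call such as sample_indices(3, 1, 42u) would correspond to picking one of the three indexed sequences.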


Filters


FAQ


Copyright © 2012-2016 Jaroslaw Zola