HAYSTAC documentation
Introduction
HAYSTAC is a comprehensive computational tool for identifying species from DNA sequence data. It can pre-process sequencing data (adapter trimming), build a database, analyse sequencing data and perform a deamination profile analysis. It can work in two different modes. One mode performs metagenomic identifications from samples containing multiple organisms, and outputs mean posterior species abundances. The second mode can perform species identifications from single organism samples and it outputs species assignment posterior probabilities.
Installation
HAYSTAC can be run on either macOS or Linux based systems, and the source code is available on [GitHub](https://github.com/antonisdim/haystac).
The recommended way to install haystac, and all of its dependencies, is via the [mamba](https://mamba.readthedocs.io/en/latest/installation.html) package manager (a fast replacement for [conda](https://docs.conda.io/projects/conda/en/latest/index.html)). You can install haystac using conda; however, it will take significantly longer to install, and analyses will run slower.
Install mamba
If you do not have either mamba or conda already installed, please refer to the [install instructions](https://mamba.readthedocs.io/en/latest/installation.html) for mambaforge.
If you have conda installed, but not mamba, then install mamba into the base environment:
conda install -n base -c conda-forge mamba
Install haystac
Then use mamba to install haystac into a new environment:
mamba create -c conda-forge -c bioconda -n haystac haystac
And activate the environment:
conda activate haystac
We recommend that you install haystac into a new environment to avoid dependency conflicts with other software.
Git
Clone from GitHub:
git clone https://github.com/antonisdim/haystac.git
Workflow

HAYSTAC was designed to be used as a species identifier for either single-organism or metagenomic samples. The full execution of the pipeline includes the construction of a database, the processing of a sample file and, lastly, the analysis of a sample against a specific database. For that purpose, three modules have been designed, each of which has its own outputs.
First, the haystac database module can be used to construct a database based on the user’s needs and preferences (custom NCBI query, and/or custom NCBI accessions for each species, and/or prokaryotic representative RefSeq, and/or custom sequences). The user must also provide a path where the outputs of the database will be stored.
The haystac sample module prepares a sample to be analysed. It trims adapters and collapses PE reads if needed, and it counts the number of reads included in the sample file provided by the user.
The haystac analyse module performs the analysis of a sample against a specific database. All its outputs are created under the path that the user specifies with the --output flag.
Outputs
This is a list of all the outputs produced after a successful run of each of haystac’s modules. All these outputs will be found under the --output path specified by the user when running any of haystac’s modules. We advise creating separate output directories for each of the modules (database, sample and analyse).
Expected outputs for haystac database
db_taxa_accessions.tsv: includes all the taxon/accession pairs that are included in the database.
idx_database.done: file indicating that all the individual bowtie2 indices for each taxon have been prepared.
entrez: directory containing all the results from querying NCBI, including the nucleotide and taxonomy databases.
bowtie: directory containing the bowtie2 index files for the whole database, which are used for the filtering alignment.
database_inputs: directory containing the representative RefSeq table that is downloaded from NCBI.
Expected outputs for haystac sample
fastq_inputs: folder containing the outputs of the sample module.
fastq_inputs/meta: directory that includes the read count file.
fastq_inputs/(SE | PE | COLLAPSED): directory containing the trimmed reads produced by AdapterRemoval.
Expected outputs for haystac analyse
bam: directory containing the bam file of the filtering alignment.
fastq: directory containing the filtered reads in fastq format, along with their average read length.
alignments: directory containing the individual alignment bam files of the filtered reads against each taxon in the database.
ts_tv_counts: directory where the transition and transversion counts are stored per taxon.
probabilities: directory where the likelihood matrix and the final posterior abundances/probabilities are stored. The final output for the abundance calculation has the suffix posterior_abundance.tsv.
mapdamage: directory that includes the mapDamage profiles for every taxon in the database.
reads: directory containing the Dirichlet-assigned reads for each taxon in the database.
Tutorial
Configuring HAYSTAC
If a user is running haystac for the first time on a machine, they might want to run the haystac config module first. The user can use this module to provide an API key for querying NCBI and their preferred path for the genome cache folder, among other things. All of the options have default values that can be used. If a user later wishes to change any of these parameters, they can run haystac config again to pass a value to the specific argument.
Here is an example command that configures conda as the package manager for running the other haystac modules::
haystac config --use-conda True
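For instance, if you also wanted to set a personal NCBI API key and a custom cache location, a command along these lines could be used (the key and path below are placeholders, not real values; both flags are listed in the Command Line Interface section)::
haystac config --api-key <your_ncbi_api_key> \
--cache ~/haystac_cache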
Building the database
In order to build the database we will be using the database module of HAYSTAC.
First we need to know what organisms we would like to include in our database. Do we only need the complete genomes of a specific genus or do we want more genera?
Constructing the Query
After deciding what taxa we would like to include in our database, we need to construct an NCBI query that will return all the accessions that belong to the taxa that we are interested in. One of the best ways to construct such a query is to go on the website of NCBI’s Nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/), type in our query and get the correctly formatted query string from the “Search details” box. We can then use that string for the construction of our database.
For example, if we would like to build a database of all the complete genomes of the species in the Yersinia genus, we can use the following command::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example
For each species (or any other user defined taxonomic rank), the longest sequence per taxon will be used to populate our database.
Representative RefSeq species
When constructing a database there is always the option to also include the species of the prokaryotic representative RefSeq database. All you need to do is include the corresponding flag in your command::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example \
--refseq-rep prokaryote_rep
Important note on RefSeq databases
haystac database can currently build databases from three RefSeq tables: the prokaryotic representative RefSeq table, the eukaryotes RefSeq table and the viruses RefSeq table. When the prokaryotic representative database is built, only species of microorganisms are included (strains are excluded), whereas the eukaryotes and viruses databases include subspecies and strains respectively. To build any of the above databases, specify the desired RefSeq table with the --refseq-rep flag (prokaryote_rep for the prokaryotic representative table, eukaryotes for the eukaryotes table and viruses for the viruses table).
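For example, a database of all the species and strains in the viruses RefSeq table could be built with a command along these lines (the output directory name here is just an example)::
haystac database --mode build \
--refseq-rep viruses \
--output viruses_example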
Providing custom accessions
It is also possible to provide your own accessions for a selected species/taxon. For that you will need to prepare a tab delimited file with two columns. The first column is the name of the taxon, which cannot contain any special characters other than an underscore (‘_’), and the second column is a valid NCBI accession code.
Here is an example of the contents of such a file::
Yersinia_ruckeri NZ_CP025800.1
Then we can simply run the following command::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example \
--accessions-file acc_example.txt
Providing custom sequences
It is also possible to provide your own sequences for a taxon. To do that you will need a tab delimited file containing the name of the taxon (with no special characters, and underscores (‘_’) instead of spaces), a user defined accession code and the path to the fasta file. The fasta file that the path points to can be either uncompressed or compressed with gzip/bgzip.
Here is an example of such a file::
Yersinia_ruckeri user_seq_1 ~/example_sequence/user_seq_1.fasta
Then we can simply run the following command::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example \
--sequences-file seq_example.txt
Combinations
All of the previous options can be combined into one command. It is important to note that only one sequence file per taxon is allowed in the database, and that priority is given first to user defined accessions or sequences, then to the representative RefSeq, and lastly to the user specified query. It should be noted that for custom NCBI queries, plasmids can be fetched if they are part of a genome assembly. The only exception to the one-sequence-file-per-taxon rule is plasmid sequences of the representative RefSeq that are not part of complete genome assemblies, which for that reason are downloaded separately.
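As a sketch of how these options can be combined (reusing the Yersinia query and the hypothetical accessions and sequences files from the previous sections), a single command could look like this::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--refseq-rep prokaryote_rep \
--accessions-file acc_example.txt \
--sequences-file seq_example.txt \
--output yersinia_example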
Index building
For the first part of the analysis, an index of all the genomes that are included in our database needs to be built. This process can require a large amount of memory, depending on the number and the complexity of the sequences that our database includes. For that reason the user can specify the amount of memory available to haystac, and the program will try to build the required index within that limit. This can be specified through the --mem flag, which can be appended to any of the commands shown above. Memory resources need to be specified in MB. If the memory provided is less than the size of the files that need to be indexed, an error will be raised. We also advise caution when changing the bowtie2 file size scaling factor.
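For example, to cap the memory used during database construction at roughly 8 GB (the value of 8000 MB below is arbitrary and should be adjusted to your machine), the --mem flag can be appended like so::
haystac database --mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example \
--mem 8000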
Database building modes
For the complete construction of a database, sequences need to be downloaded and subsequently indexed.
By specifying --mode build to haystac database, the program downloads and indexes all the sequences that have been requested by the user in one step.
If a user would like to download the sequence data only and index it later, it is possible to do so: run haystac database --mode fetch to download the sequences first, and then execute haystac database --mode index to perform the indexing.
If mode fetch is run first, then mode index should be run subsequently, and not mode build, otherwise an error will be raised.
Here is an example of building a database in two steps instead of one::
haystac database --mode fetch \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_example
haystac database --mode index \
--output yersinia_example
Building a mitochondrial DNA database
When a user provides a query about eukaryotes, it is also possible to build a database containing only mitochondrial genomes (by default, whole genome assemblies will be fetched for a given query). To do that, the user can specify the --mtDNA flag when running haystac database. We strongly advise against having a mixed database of full eukaryotic genome assemblies for certain taxa and only mtDNA sequences for other taxa, as this will bias the identifications towards the taxa with full genome assemblies.
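As an illustration, a mitochondrial-only database for a eukaryotic genus could be built with a command along these lines (the Canis query and output directory name are just examples)::
haystac database --mode build \
--query '"Canis"[Organism] AND "complete genome"[All Fields]' \
--output canis_mtdna_example \
--mtDNA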
Preparing a sample for analysis
After our database is built we need to prepare our samples for analysis. For that purpose, we use the sample module of haystac. The input files can be SE, PE or collapsed reads. If the reads are collapsed, they will be treated as SE reads.
It is possible to trim sequencing adapters and collapse PE reads by specifying the relevant flags. Samples (specific sequencing runs) can also be downloaded from the SRA if an SRA run accession is provided.
If you have SE or already collapsed reads, you only need to specify a file path for the --fastq flag.
If your input is PE reads, then you will need to specify file paths for both the --fastq-r1 and --fastq-r2 flags.
If you want to download files from the SRA, all you need to do is provide an SRA accession for the --sra flag.
Here is an example of downloading reads from the SRA, trimming sequencing adapters and collapsing the reads::
haystac sample --sra ERR1018966 \
--collapse True \
--output sample_example
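If you instead have local paired-end fastq files, a command along these lines (the file paths are placeholders) would trim the adapters and collapse the overlapping read pairs::
haystac sample --fastq-r1 ~/example_data/sample_R1.fastq.gz \
--fastq-r2 ~/example_data/sample_R2.fastq.gz \
--collapse True \
--output sample_example_pe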
Sample analysis
In order to analyse any sample we will need to use the analyse module of haystac.
Filtering Alignment
The first step of the sample analysis is to filter in all the reads that align to any of the genomes in our database. For that we will need to use haystac analyse --mode filter.
Here is an example command::
haystac analyse --mode filter \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Database Alignments
After we have filtered our libraries, we can align the filtered reads against all the genomes that are included in our database. This can be done by using mode align of haystac analyse.
For example::
haystac analyse --mode align \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Unless the user has a deep understanding of their dataset, we advise caution when changing the base mismatch probability that is used later on in the method’s probabilistic model.
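If you do decide to change it, the base mismatch probability can be set with the --mismatch-probability flag of haystac analyse (0.05 is the default); for example::
haystac analyse --mode align \
--database yersinia_example \
--sample sample_example \
--output analysis_output \
--mismatch-probability 0.05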
Likelihood calculation
After all the individual alignments have been completed, the number of transitions and transversions will be counted for every read that has aligned against any of the reference genomes in our database. Then the likelihood and posterior probability of each read being sampled from a given reference genome will be calculated. For this step we can use the likelihoods mode of haystac analyse::
haystac analyse --mode likelihoods \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Important Note on the Dirichlet Assignment process during Likelihood calculation
It is important to be aware of the individual read posterior probability threshold that is required for a read to be assigned to a taxon. By default, HAYSTAC uses a conservative probability threshold of 0.75 for the Dirichlet assignment. The higher the value you pick, the more conservative the assignments become. It can be useful to pick a value depending on what taxa are being identified. If there is a need to distinguish between closely related taxa, a more conservative threshold will increase the specificity of the analysis and is therefore more appropriate, whereas when you are trying to broadly characterise a metagenome, a less conservative value can increase the sensitivity of the analysis and be more helpful.
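The threshold is controlled by the --min-prob flag of haystac analyse; for example, a stricter threshold of 0.9 could be requested like so::
haystac analyse --mode likelihoods \
--database yersinia_example \
--sample sample_example \
--output analysis_output \
--min-prob 0.9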
Single organism sample or metagenome?
Depending on whether we would like to identify the species a sample belongs to, or perform a metagenomic analysis, we can use the probabilities or abundances mode of haystac analyse, respectively.
Assignment Probability Calculation
In order to calculate posterior assignment probabilities we can run the following command::
haystac analyse --mode probabilities \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Mean Posterior Abundances
In order to calculate mean posterior abundances we can run the following command::
haystac analyse --mode abundances \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Along with the abundance calculation, we also perform a chi-squared test to assess whether the reads that have been assigned to a taxon cluster around specific genomic areas or whether they represent a random sample of the organism’s genome. The results of this test should be trusted for low depth sequencing data (1X or lower). The null hypothesis is that there is no read clustering.
Reads
After the mean posterior abundances have been calculated for a sample, all the reads that have been assigned to a taxon through the Dirichlet process can be output in separate bam files, ready for further downstream analyses (such as assembly or variant calling), via the reads mode. Reads that have been assigned to the Grey and Dark Matter are output in fastq files, as they have not been uniquely assigned to a taxon.
Here is an example command::
haystac analyse --mode reads \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Mapdamage analysis
If our samples are ancient, we can use mapDamage to estimate the level of deamination in the reads that have aligned to any taxon in our database. For that we can use the mapdamage mode of haystac analyse. The mapDamage analysis will be performed on the subset of reads that have been uniquely assigned to a taxon through the Dirichlet process. This mode can be run either independently or after the reads mode.
Here is an example command::
haystac analyse --mode mapdamage \
--database yersinia_example \
--sample sample_example \
--output analysis_output
Important note on sample analysis
The first three steps (modes filter, align and likelihoods) can be executed automatically when you call the probabilities or abundances mode of haystac analyse.
Command Line Interface
haystac config
Optional arguments:
-h, --help Show this help message and exit
--cache <path> Cache folder for storing genomes downloaded from NCBI
and other shared data (default:
/home/antony/haystac/cache)
--clear-cache Clear the contents of the cache folder, and delete the
folder itself (default: False)
--api-key <code> Personal NCBI API key (increases max concurrent requests
from 3 to 10,
https://www.ncbi.nlm.nih.gov/account/register/)
--use-conda <bool> Use conda as a package manager (default: False)
haystac database
Required arguments:
--mode <mode> Database creation mode for haystac [fetch, index,
build]
--output <path> Path to the database output directory
Required choice:
--query <query> Database query in the NCBI query language. Please
refer to the documentation for assistance with
constructing a valid query.
--query-file <path> File containing a database query in the NCBI query
language.
--accessions-file <path>
Tab delimited file containing one record per row: the
name of the taxon, and a valid NCBI accession code
from the nucleotide, assembly or WGS databases.
--sequences-file <path>
Tab delimited file containing one record per row: the
name of the taxon, a user defined accession code, and
the path to the fasta file (optionally compressed).
--refseq-rep <table> Use one of the RefSeq curated tables to construct a
DB. Includes all prokaryotic species (excluding
strains) from the representative RefSeq DB, or all the
species and strains from the viruses DB, or all the
species and subspecies from the eukaryotes DB. If
multiple accessions exist for a given species/strain,
the first pair of species/accession is kept. Available
RefSeq tables to use [prokaryote_rep, viruses,
eukaryotes].
Optional arguments:
--force-accessions Disable validation checks for 'anomalous' assembly
flags in NCBI (default: False)
--exclude-accessions <accession> [<accession> ...]
List of NCBI accessions to exclude. (default: [])
--resolve-accessions Pick the first accession when two accessions for a
taxon can be found in user provided input files
(default: False)
--bowtie2-scaling <float>
Rescaling factor to keep the bowtie2 multifasta index
below the maximum memory limit (default: 25.0)
--rank <rank> Taxonomic rank to perform the identifications on
[genus, species, subspecies, serotype] (default:
species)
--genera <genus> [<genus> ...]
List of genera to restrict the abundance calculations.
--mtDNA For eukaryotes, download mitochondrial genomes only.
Not to be used with --refseq-rep or queries containing
prokaryotes (default: False)
--seed <int> Random seed for database indexing
Common arguments:
-h, --help Show this help message and exit
--cores <int> Maximum number of CPU cores to use (default: MAX_CPUs)
--mem <int> Maximum memory (MB) to use (default: MAX_MEM)
--unlock Unlock the output directory following a crash or hard
restart (default: False)
--debug Enable debugging mode (default: False)
--snakemake '<json>' Pass additional flags to the `snakemake` scheduler.
haystac sample
Required arguments:
--output <path> Path to the sample output directory
Required choice:
--fastq <path> Single-end fastq input file (optionally compressed).
--fastq-r1 <path> Paired-end forward strand (R1) fastq input file.
--fastq-r2 <path> Paired-end reverse strand (R2) fastq input file.
--sra <accession> Download fastq input from the SRA database
Optional arguments:
--collapse <bool> Collapse overlapping paired-end reads, e.g. for aDNA
(default: False)
--trim-adapters <bool>
Automatically trim sequencing adapters from fastq
input (default: True)
Common arguments:
-h, --help Show this help message and exit
--cores <int> Maximum number of CPU cores to use (default: MAX_CPUs)
--mem <int> Maximum memory (MB) to use (default: MAX_MEM)
--unlock Unlock the output directory following a crash or hard
restart (default: False)
--debug Enable debugging mode (default: False)
--snakemake '<json>' Pass additional flags to the `snakemake` scheduler.
haystac analyse
Required arguments:
--mode <mode> Analysis mode for the selected sample [filter, align,
likelihoods, probabilities, abundances, reads,
mapdamage]
--database <path> Path to the database output directory
--sample <path> Path to the sample output directory
--output <path> Path to the analysis output directory
Optional arguments:
--genera <genus> [<genus> ...]
List of genera to restrict the abundance calculations
(default: [])
--min-prob <float> Minimum posterior probability to assign an aligned
read to a given species (default: 0.75)
--mismatch-probability <float>
Base mismatch probability (default: 0.05)
Common arguments:
-h, --help Show this help message and exit
--cores <int> Maximum number of CPU cores to use (default: MAX_CPUs)
--mem <int> Maximum memory (MB) to use (default: MAX_MEM)
--unlock Unlock the output directory following a crash or hard
restart (default: False)
--debug Enable debugging mode (default: False)
--snakemake '<json>' Pass additional flags to the `snakemake` scheduler.
Developer documentation
Todo
Write developer documentation.
FAQs
Todo
Write FAQs.
Tracking issues and bugs
haystac is under active development and we encourage you to report any issues you encounter via the GitHub issue tracker (https://github.com/antonisdim/haystac/issues).
Citations
A preprint describing haystac
is available on bioRxiv:
Dimopoulos, E.A.*, Carmagnini, A.*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. bioRxiv 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1
Contributing
Evangelos Antonios Dimopoulos, Evan K. Irving-Pease, Alberto Carmagnini
License
MIT