Usage

Quick Start

./bin/run.sh --project <project> --sample <samples.tsv> --pheno <phenos.tsv>

Detailed Usage

The global run script (bin/run.sh) is idempotent, meaning the program fails, you can rerun it with the same arguments and it will skip any preprocessing files that already exist. Note that it is possible for the program to fail having written some, but not all data to a file. In this case, you will need to delete the file (in <project>/data/preprocessed) so that PSL-GWAS does not skip that file.

PSL-GWAS automatically uses the max number of available threads and 95% of the max memory of the system it runs on.

PSL-GWAS takes in de novo assembled contigs or fully sequenced genomes in FASTA (.fa, .fsa) format.

  1. Create a file holding the string IDs and paths to your sample genomes.

    This is a tab-separated file with two columns. It must include a header but the column names in the header can be anything. The first column holds unique string IDs for each sample, and the second column holds paths to that sample’s FASTA file. The paths are should either be absolute or relative to the root of the repo. So, if you store your sample genomes in a directory called contigs/ which you put in the raw/ directory, your sample file would look like this:

    ID Path
    s1 <project>/data/raw/contigs/s1_ctgs.fa
    s2 <project>/data/raw/contigs/s2_ctgs.fa
    s3 <project>/data/raw/contigs/s3_ctgs.fa

    You can call this file anything. samples.tsv is simple and easy. This file goes in <project>/data/raw/

  2. Create a file holding the string IDs and phenotype values for your samples.

    This is a tab-separated file with at least two. It must include a header. The first column of the header can be anything (we recommend “ID”), and the remaining columns of the header should be the names of the phenotypes you are testing. The first column holds unique string IDs for each sample (that must match the ids in the samples file), and the remaining columns hold data on whether each sample displays the ID. When phenotype information is not known, use “NA”. Phenotype information can be binary {0, 1} or continuous [0, 1]. 0 means the sample does not display the phenotype, 1 means it displays it with high strength, and intermediate values mean it displays it with weaker strength.

    If you are testing antibiotic resistance, your phenotypes file could look like this:

    ID Penicillin Ampicillin
    s1 0 0.75
    s2 1 NA
    s3 1 1

    You can call this file anything. phenos.tsv is simple and easy. This file goes in <project>/data/raw/

  3. Run the gwas from the root of the repo

./bin/run.sh --project <project> --sample <samples.tsv> --pheno <phenos.tsv> ...

Options Reference for run.sh

Required flags:
--project PROJECT
 Name of project, defined with startproject.sh <project>.
--sample SAMPLE
 Basename of samples file. Ex: samples.tsv
--pheno PHENO Basename of phenos file. Ex: phenos.tsv
Optional flags:
-d, --debug More verbose logging.
-k K, --k K Kmer length in nucleotide bases. Default: 31
--minkf MINKF Minimum kmer frequency used during preprocessing filtering. Default: 0.01
--maxkf MAXKF Maximum kmer frequency, used during preprocessing filtering. Default: 0.99
--correlation-thresh CORRELATION_THRESH
 Correlation threshold for filtering in preprocessing. Default: 0.5
-p, --param Ignore k, minkf, maxkf, and correlation-thresh options and use param file in project directory.
--truth TRUTH Fasta file holding truths data for benchmarking or weight learning. Labels correspond to phenos, sequences hold genes or unitigs that cause the phenotype.
--postgres DATABASE
 Postgres database the default user (usually your username) can access.
--baseline BASELINE
 Optional baseline data (if you have some prior information on the strength of specific kmer, pheno associations)
--separate-phenos N
 Separate PSL output into one file per phenotype, taking the best N kmers for each phenotype
--no-consolidate
 Do not consolidate best N kmers for each phenotype using custom assembler. Defaults to false.