Fast and scalable pathogen discovery program with accurate genome relative abundance estimation.

READSCAN is a highly scalable parallel program that identifies non-host sequences (of potential pathogen origin) and estimates their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences in a simulated dataset of 20.1 million reads in under 27 minutes on a small Beowulf compute cluster with 16 nodes.
The software is an effort of the Microbial Genomics Laboratory (PI: Arnab Pain) at KAUST.

Publication:
Naeem, R., M. Rashid, and A. Pain, READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation. Bioinformatics, 2013. 29(3): p. 391-392.

DOWNLOAD, INSTALL and TEST RUN on a bash shell

  1. wget http://www.cbrc.kaust.edu.sa/readscan/download/readscan-0.5.tar.gz

  2. tar -zxvf readscan-0.5.tar.gz

  3. Follow the instructions in the README inside

  4. The output consists of a venn_stats file and microbes_stats.txt. If you run into problems, please refer to TROUBLESHOOTING for solutions. Please keep the readscan_search.log and readscan.log files, which help to debug any problems.

MANUAL

For the updated manual, see

  • perldoc readscan_lsf.pl

  • perldoc readscan_makeflow.pl

  • perldoc readscan_sge.pl

  • perldoc readscan.pl

TROUBLESHOOTING

  1. What are the prerequisites for installing readscan?

    readscan depends on

    • perl

    • smalt v 0.6.3

    • Unix utilities: make, sort, split, cat, etc.

    • To run on Platform LSF or Sun Grid Engine, no additional tools are needed.

    • To run on load levelers (job schedulers) other than Platform LSF or Sun Grid Engine, Makeflow is required.

  2. normal: User cannot use the queue. Job not submitted.

    On LSF clusters the jobs are submitted to the default queue. Try changing the queue

    by passing --lsf q=anotherqueuename

  3. Error in rusage section: Job-level resource requirement values must satisfy limits set by the queue-level resource requirement values. Job not submitted.

    On some LSF clusters the jobs are not submitted because the rusage section does not satisfy the queue-level resource requirements. The default LSF resource requirement string (i.e., the -R string) can be overridden. Passing

    --lsf RMh='span[ptile=8]' --lsf RMp='span[ptile=8]'

    should fix this

    or alternatively

    --lsf R=-1

    would totally suppress the resource string passed to bsub

  4. TERM_MEMLIMIT: job killed after reaching LSF memory usage limit. Exited with signal termination: Killed.

    [0] smalt.c:330 ERROR: memory allocation failed

    On LSF, try increasing the memory limit with

    readscan_lsf.pl index -k 13 -s 6 --lsf R='select[mem>6291] rusage[mem=6291]' --lsf M=6291000 bacterial_all.fasta

    On SGE, try increasing the memory limit with

    readscan_sge.pl index -k 13 --sampling_step_size 6 --sge l='h_vmem=6291M,virtual_free=2645.5M' bacterial_all.fasta
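In the LSF command above, the same memory figure appears in three places: twice inside the R string and once, multiplied by 1000, in M. The sketch below builds those options from a single number so they stay consistent when the limit is raised. The helper name lsf_memory_options is hypothetical, and the 1000x relationship between R and M is an assumption read off the example (6291 and 6291000), not a documented rule; check the units your LSF installation expects.

```python
import shlex


def lsf_memory_options(mem_mb):
    """Return the --lsf options for a given memory figure.

    Assumption: M is 1000 times the value used in the R string,
    as in the example command (6291 -> 6291000).
    """
    resource = f"select[mem>{mem_mb}] rusage[mem={mem_mb}]"
    return ["--lsf", f"R={resource}", "--lsf", f"M={mem_mb * 1000}"]


# Rebuild the indexing command from the troubleshooting entry.
command = [
    "readscan_lsf.pl", "index", "-k", "13", "-s", "6",
    *lsf_memory_options(6291),
    "bacterial_all.fasta",
]

# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

To try a larger limit, change only the argument to lsf_memory_options.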

  5. cannot create <outputdir> at readscan.pl line <line>.

    Try deleting the output directory if it already exists. readscan tries to create a new directory named after the input fastq file.
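When readscan is launched from a wrapper script, the stale directory can be cleared before each run. The following is a minimal sketch; the function name remove_stale_outdir is hypothetical, and the directory path is passed in explicitly because the exact naming rule is not spelled out here. Note that it deletes the directory and everything in it.

```python
import shutil
from pathlib import Path


def remove_stale_outdir(outdir):
    """Delete outdir if it exists as a directory.

    Returns True when a directory was removed, False when there
    was nothing to remove.
    """
    path = Path(outdir)
    if path.is_dir():
        shutil.rmtree(path)  # removes the directory and all its contents
        return True
    return False
```

Call it with the path of the previous output directory before starting readscan again.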

  6. How to interpret the results stats file?

    sample stats file

    The stats file has seven sections: species, genus, family, order, class, phylum and sequence. The first six sections have five columns: rank, parent taxon_id, taxon_id, name and genome relative abundance (GRA).
    They are sorted by GRA value, from the most abundant to the least abundant taxon.
    The last section, sequence, has the following additional columns:

    NO_OF_ALIGNS - number of alignments on a particular reference
    BASES_COVERED - number of bases covered on a particular reference
    REF_LENTH - length of the reference
    PERC_COVERAGE - percentage of the reference covered by at least 1 base
    MEAN_CONTIG_LENGTH - sum(contig_length * number_of_reads_supporting_the_contig) / sum(number_of_reads)
    REF_NAME - name of the reference sequence
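The two derived columns can be checked by hand. The sketch below applies the MEAN_CONTIG_LENGTH formula as given above, and computes PERC_COVERAGE as BASES_COVERED divided by the reference length, which is an assumption based on the column descriptions rather than on the readscan source. The function names and the input numbers are made up for illustration.

```python
def mean_contig_length(contigs):
    """Read-weighted mean contig length.

    contigs is a list of (contig_length, supporting_reads) pairs:
    sum(length * reads) / sum(reads).
    """
    total_reads = sum(reads for _, reads in contigs)
    if total_reads == 0:
        return 0.0
    weighted = sum(length * reads for length, reads in contigs)
    return weighted / total_reads


def perc_coverage(bases_covered, ref_length):
    """Percentage of the reference covered (assumed definition)."""
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    return 100.0 * bases_covered / ref_length


# Hypothetical reference: a 100 bp contig supported by 3 reads and a
# 250 bp contig supported by 1 read, with 1500 of 6000 bases covered.
print(mean_contig_length([(100, 3), (250, 1)]))  # (300 + 250) / 4 = 137.5
print(perc_coverage(1500, 6000))                 # 25.0
```

Because the mean is weighted by read support, a long contig backed by a single read shifts the value less than a short contig backed by many reads.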

  7. Very low percentage of reads map to the host and pathogen databases

    Try setting the minid parameter for smalt to 0.01 or less

    --smalt yMh=0.01 --smalt yMp=0.01

  8. How to compile an updated reference dataset?

    The reference datasets (bacterial, viral, fungal and human) are nothing but multiple FASTA sequences concatenated into a single multifasta file. The reference datasets provided on this page may not be up to date. Users who wish to compile up-to-date microbial and human reference datasets may download the sequences from the NCBI RefSeq FTP page.
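Since a reference dataset is just FASTA files joined end to end, compiling one can be sketched in a few lines. The function name build_multifasta and the file names in the usage note are hypothetical; the input files are assumed to be uncompressed FASTA already downloaded to disk. The one detail worth handling is a missing final newline, which would otherwise glue the next header onto the previous sequence line.

```python
def build_multifasta(fasta_paths, out_path):
    """Concatenate FASTA files into one multifasta file.

    Streams line by line so large genomes are not held in memory.
    Returns the number of sequence records (header lines) written.
    """
    records = 0
    with open(out_path, "w") as out:
        for path in fasta_paths:
            last_line = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        records += 1
                    out.write(line)
                    last_line = line
            # Keep the next file's header on its own line.
            if not last_line.endswith("\n"):
                out.write("\n")
    return records
```

For example, build_multifasta(["virus_a.fasta", "virus_b.fasta"], "viral_all.fasta") would write both inputs to viral_all.fasta and return the record count, which is a quick sanity check before indexing.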

  9. How to compile an updated Taxon file?

    Up-to-date Taxon files can be downloaded from the NCBI Taxonomy FTP page.

All Downloads

CONTACT

Please contact Raeece Naeem or Mamoon Rashid for any questions and comments about READSCAN.