README_CSBLAST

INTRODUCTION
------------

Our method CS-BLAST for context-specific protein sequence searching is a simple
extension of BLAST that is able to significantly improve its sensitivity and
alignment quality: First, a context-specific sequence profile is generated for
the query sequence using a libray of context profiles.  This step is fast
(runtime about 1s for a typical protein of length 250). Then PSI-BLAST is
jump-started with this profile. Since the output of CS-BLAST is generated by
PSI-BLAST, users do not have to get accustomed to different command line
options or output formats.

References:
Angermueller C, Biegert A, and Soeding J (2012), 
"Discriminative modelling of context-specific amino acid substitution probabilities", 
Bioinformatics, 28 (24), 3240-3247.
Biegert A, and Soeding J (2009), 
"Sequence context-specific profiles for homology searching", 
Proc Natl Acad Sci USA, 106 (10), 3770-3775.

Abstract:
Sequence alignment and database searching are essential tools in biology
because a protein's function can often be inferred from homologous
proteins. Standard sequence comparison methods use substitution matrices to
find the alignment with the best sum of similarity scores between aligned
residues. These similarity scores do not take the local sequence context into
account. Here, we present an approach that derives context-specific amino acid
similarities from short windows centered on each query sequence residue. Our
results demonstrate that the sequence context contains much more information
about the expected mutations than just the residue itself: By employing our
context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we
increase the sensitivity more than two-fold on a difficult benchmark set,
without loss of speed.  Alignment quality is likewise improved
significantly. Furthermore, we demonstrate considerable improvements when
applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our
context-specific version of PSI-BLAST, are more sensitive than five iterations
of PSI-BLAST.  The paradigm for biological sequence comparison presented here
is very general.  It can replace substitution matrices in sequence- and
profile-based alignment and search methods for both protein and nucleotide
sequences.


AVAILABILITY
------------

CS-BLAST executables are freely available for academic users and can be 
downloaded from:

ftp://toolkit.lmb.uni-muenchen.de/csblast/

A free CS-BLAST webserver can be accessed at:

http://toolkit.tuebingen.mpg.de/cs_blast


CONTACT
-------
If you have any problems or questions regarding the suite of cs-tools please
contact Christof Angermueller at angermueller@lmb.uni-muenchen.de.


CS-BLAST USAGE
--------------

Since CS-BLAST is essentially a wrapper around PSI-BLAST, it accepts the same
command line arguments as PSI-BLAST plus additional arguments that govern the
generation of the context-specific profile.

csblast version 2.2.3
Usage: csblast -i <infile> -D <context data> --blast-path <blastpgp dir> [options] [blastpgp options]

Options:
  -i, --infile <file>            Input file with query sequence
  -D, --context-data <file>      Path to profile library with context profiles
  -o, --outfile <file>           Output file with search results (def=stdout)
  -B, --alifile <file>           Input alignment file for CSI-BLAST restart
  -d, --database <dbname>        Protein database to search against (def=nr)
  -j, --iters [1,inf[            Maximum number of iterations to use in CSI-BLAST (def=1)
  -h, --incl-evalue [0,inf[      E-value threshold for inclusion in  CSI-BLAST (def=0.0020)
  -v, --descr [1,inf[            Number of sequences to show one-line descriptions for (def=500)
  -b, --alis [1,inf[             Number of sequences to show alignments for (def=250)
  -x, --pc-admix ]0,1]           Pseudocount admix for context-specific pseudocounts (def=0.90)
  -z, --pc-neff [1,inf[          Target Neff for pseudocounts admixture (def=0.00)
  -c, --pc-ali [0,inf[           Constant for alignment pseudocounts in CSI-BLAST (def=12.0)
      --pc-engine crf|lib        Specify engine for pseudocount generation (def=auto)
      --alignhits <file>         Write multiple alignment of hits in PSI format to file
      --weight-center [0,inf[    Weight of central profile column (def=1.60)
      --weight-decay [0,inf[     Parameter for exponential decay of window weights (def=0.85)
      --global-weights           Use global instead of position-specific sequence weights (def=off)
      --best                     Include only the best HSP per hit in alignment (def=off)
      --blast-path <path>        Path to directory with blastpgp executable (or set BLAST_PATH)
      --shift [-1,1]             Substitution score offset (def=-0.005)

For example, to search with a query sequence in FASTA format against the NR
database using the discriminative model, type the following command:

> bin/csblast -i query.seq -d nr -D data/K4000.crf --blast-path <blastpgp dir>

where <blastpgp dir> is the absolute path to the directory with the blastpgp
executable (alternatively you can also set the BLAST_PATH environment variable).


CSI-BLAST USAGE
---------------

The usage of CSI-BLAST, the context-specific version of PSI-BLAST, is very
similar to CS-BLAST. To perform iterative search with CSI-BLAST you simply use the
csblast executable and specify the desired number of iterations (-j argument):

> bin/csblast -i query.seq -d nr -D K4000.crf -j 3 -h 0.002 --blast-path <blastpgp dir>

This command will run a three iteration CSI-BLAST search against the NR database
with an inclusion threshold of 0.002. Please note that because CSI-BLAST has to
parse the PSI-BLAST output in order to build an alignment of hits after each
iteration, the output of CSI-BLAST is currently restricted to "-m 0" format.


BUILDING A CONTEXT-SPECIFIC PROFILE
-----------------------------------

In case you don't want to search for homologous proteins but simply want to build
a context-specific profile from an input sequence or alignment you can use the
program csbuild (formerly called cscounts). Its usage is as follows:

csbuild version 2.2.3
Build a profile or PSSM from an alignment or sequence.
Copyright (c) 2010-2012 Andreas Biegert, Christof Angermueller, Johannes Soeding, and LMU Munich

Usage: csbuild -i <infile> [options]
       csbuild -i <infile> -D <contextdata> [options]

Options:
  -i, --infile <file>            Input file with alignment or sequence
  -o, --outfile <file>           Output file for generated profile (def: <infile>.prf)
  -I, --informat seq|fas|...     Input format: seq, fas, a2m, or a3m (def=auto)
  -O, --outformat prf|chk        Outformat: profile or PSI-BLAST checkpoint (def=prf)
  -M, --match-assign [0:100]     Make all FASTA columns with less than X% gaps match columns
                                 (def: make columns with residue in first sequence match columns)
  -D, --context-data <file>      Add context-specific pseudocounts with profile library (def=off)
  -x, --pc-admix [0,1]           Pseudocount admixture for context-specific pseudocounts (def=0.90)
  -z, --pc-neff [1,inf[          Target Neff for pseudocounts admixture (def=0.00)
  -c, --pc-ali [0,inf[           Constant in pseudocount calculation for alignments (def=12.0)
      --pc-engine crf|lib        Specify engine for pseudocount generation (def=auto)
      --weight-center [0,inf[    Weight of central profile column for context-specific pseudocounts (def=1.60)
      --weight-decay [0,inf[     Parameter for exponential decay of window weights (def=0.85)
      --global-weights           Use global instead of position-specific sequence weights (def=off)

For example, to add context-specific pseudocounts to your FASTA formatted input
sequence in infile.seq and write the resulting context-specific profile to
outfile.prf, execute the following command:

> csbuild -i infile.seq -o outfile.prf -D K4000.lib

The format of the context-specific profile in outfile.prf is analogous to the
format of the context profile library (described below). It can be easily
parsed to recover the amino acid frequencies in the profile. As alternative to
this profile format you can also write the context-profile as binary PSI-BLAST
checkpoint file, which can then be used to jumpstart PSI-BLAST via its -R
option. The csbuild command would then look like this:

> csbuild -i infile.seq -o outfile.chk -D K4000.crf -O chk


OBTAINING POSTERIOR PROBABILITIIES
----------------------------------

You can use the program csprobs to obtain posterior probabilities of the
K=4000 context profiles at each query postion. For example, running

> csbuild -i infile.seq -D K4000.lib

gives you index and probability of all context profiles with probability above
10%. Each line contains the posteriors for one query position formatted as

query_position: profile_index -> profile_probability"  ...

You can also use the "--matrix" switch to get the full L*K posterior probability
matrix.


PROFILE LIBRARY FORMAT
----------------------

The profile library file (K4000.lib) basically contains a small header with
general information about the library itself, followd by a list of
human-readable profile records each starting with the keyword "ContextProfile":

ContextProfile
INDEX   0
PRIOR   0.0004827394287
NCOLS   13
ALPH    20
LOG     1
        A       R       N       D       C       Q       E       G       H       I       L       K       M       F       P       S       T       W       Y       V
1       3321    3730    4360    4012    7086    4413    3605    3930    5750    4491    3423    3728    5770    5181    4773    3694    4327    7559    5721    4435
2       3578    3733    4049    3805    7154    4377    3766    3500    5298    4821    3803    3661    6276    5155    4873    3795    4164    6846    5334    4502
3       3671    4590    4604    4445    6861    4979    4422    4516    5335    3535    2880    4536    5354    4065    4492    3913    4333    6745    4651    3670
4       3774    4428    4413    3754    7642    4927    3734    3678    5896    5693    5048    3697    6940    6127    2035    3770    4561    8201    6105    5337
5       4301    4280    3365    2894    8074    4710    3697    2011    5396    6456    5789    4156    7554    6724    4686    3921    4707    8211    6470    5855
6       4540    6578    7000    7283    6471    7102    6574    6381    6683    2799    2255    6529    4886    2688    6719    5956    5362    5338    3361    2676
7       4460    3224    4338    4020    7723    3842    2734    5434    5266    5102    5220    3087    6943    6140    5109    3783    3392    7684    5837    4400
8       4846    6273    7253    7340    6430    6888    6534    6582    7761    2088    2273    6739    5445    4039    5692    6007    4915    6791    5655    1936
9       3831    4094    5318    4989    6748    5108    4088    4914    5673    3330    3012    4276    5703    4661    5970    4136    3721    7007    4999    2802
10      3455    4226    3773    3100    6943    4723    3483    2561    5448    6267    5441    4316    7048    6586    4139    3509    4443    7693    6160    5368
11      3923    4896    5176    4997    6562    5282    4251    4802    6115    3296    3056    4406    5580    4023    3964    4263    3952    5468    4964    3116
12      3640    4409    4316    3863    6438    4876    3854    3683    5225    4394    4072    4426    5414    5101    4110    3583    3701    6198    4592    4185
13      3734    4466    4548    3491    6379    4936    3763    2852    5750    4548    4013    4344    6168    5201    4121    3828    3992    7115    5732    4040
//

The header section describes attributes such as the profile index (starting from
zero), prior probability of the profile (PRIOR), as well as the number of profile
columns (NCOLS) and the alphabet size used (ALPH, in this case amino acid). The
main body of the profile record describes the amino acid distribution at each
position of the profile (one line per profile column). To save space each amino
acid probability p is written as

v = -1000 * log2(p)

or  '*' if p is zero. To recover the original amino acid probability from its
value v in K4000.lib you simply calculate

p = 2^(v / -1000).