/
README_CSBLAST
228 lines (173 loc) · 12 KB
/
README_CSBLAST
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
INTRODUCTION
------------
Our method CS-BLAST for context-specific protein sequence searching is a simple
extension of BLAST that is able to significantly improve its sensitivity and
alignment quality: First, a context-specific sequence profile is generated for
the query sequence using a libray of context profiles. This step is fast
(runtime about 1s for a typical protein of length 250). Then PSI-BLAST is
jump-started with this profile. Since the output of CS-BLAST is generated by
PSI-BLAST, users do not have to get accustomed to different command line
options or output formats.
References:
Angermueller C, Biegert A, and Soeding J (2012),
"Discriminative modelling of context-specific amino acid substitution probabilities",
Bioinformatics, 28 (24), 3240-3247.
Biegert A, and Soeding J (2009),
"Sequence context-specific profiles for homology searching",
Proc Natl Acad Sci USA, 106 (10), 3770-3775.
Abstract:
Sequence alignment and database searching are essential tools in biology
because a protein's function can often be inferred from homologous
proteins. Standard sequence comparison methods use substitution matrices to
find the alignment with the best sum of similarity scores between aligned
residues. These similarity scores do not take the local sequence context into
account. Here, we present an approach that derives context-specific amino acid
similarities from short windows centered on each query sequence residue. Our
results demonstrate that the sequence context contains much more information
about the expected mutations than just the residue itself: By employing our
context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we
increase the sensitivity more than two-fold on a difficult benchmark set,
without loss of speed. Alignment quality is likewise improved
significantly. Furthermore, we demonstrate considerable improvements when
applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our
context-specific version of PSI-BLAST, are more sensitive than five iterations
of PSI-BLAST. The paradigm for biological sequence comparison presented here
is very general. It can replace substitution matrices in sequence- and
profile-based alignment and search methods for both protein and nucleotide
sequences.
AVAILABILITY
------------
CS-BLAST executables are freely available for academic users and can be
downloaded from:
ftp://toolkit.lmb.uni-muenchen.de/csblast/
A free CS-BLAST webserver can be accessed at:
http://toolkit.tuebingen.mpg.de/cs_blast
CONTACT
-------
If you have any problems or questions regarding the suite of cs-tools please
contact Christof Angermueller at angermueller@lmb.uni-muenchen.de.
CS-BLAST USAGE
--------------
Since CS-BLAST is essentially a wrapper around PSI-BLAST, it accepts the same
command line arguments as PSI-BLAST plus additional arguments that govern the
generation of the context-specific profile.
csblast version 2.2.3
Usage: csblast -i <infile> -D <context data> --blast-path <blastpgp dir> [options] [blastpgp options]
Options:
-i, --infile <file> Input file with query sequence
-D, --context-data <file> Path to profile library with context profiles
-o, --outfile <file> Output file with search results (def=stdout)
-B, --alifile <file> Input alignment file for CSI-BLAST restart
-d, --database <dbname> Protein database to search against (def=nr)
-j, --iters [1,inf[ Maximum number of iterations to use in CSI-BLAST (def=1)
-h, --incl-evalue [0,inf[ E-value threshold for inclusion in CSI-BLAST (def=0.0020)
-v, --descr [1,inf[ Number of sequences to show one-line descriptions for (def=500)
-b, --alis [1,inf[ Number of sequences to show alignments for (def=250)
-x, --pc-admix ]0,1] Pseudocount admix for context-specific pseudocounts (def=0.90)
-z, --pc-neff [1,inf[ Target Neff for pseudocounts admixture (def=0.00)
-c, --pc-ali [0,inf[ Constant for alignment pseudocounts in CSI-BLAST (def=12.0)
--pc-engine crf|lib Specify engine for pseudocount generation (def=auto)
--alignhits <file> Write multiple alignment of hits in PSI format to file
--weight-center [0,inf[ Weight of central profile column (def=1.60)
--weight-decay [0,inf[ Parameter for exponential decay of window weights (def=0.85)
--global-weights Use global instead of position-specific sequence weights (def=off)
--best Include only the best HSP per hit in alignment (def=off)
--blast-path <path> Path to directory with blastpgp executable (or set BLAST_PATH)
--shift [-1,1] Substitution score offset (def=-0.005)
For example, to search with a query sequence in FASTA format against the NR
database using the discriminative model, type the following command:
> bin/csblast -i query.seq -d nr -D data/K4000.crf --blast-path <blastpgp dir>
where <blastpgp dir> is the absolute path to the directory with the blastpgp
executable (alternatively you can also set the BLAST_PATH environment variable).
CSI-BLAST USAGE
---------------
The usage of CSI-BLAST, the context-specific version of PSI-BLAST, is very
similar to CS-BLAST. To perform iterative search with CSI-BLAST you simply use the
csblast executable and specify the desired number of iterations (-j argument):
> bin/csblast -i query.seq -d nr -D K4000.crf -j 3 -h 0.002 --blast-path <blastpgp dir>
This command will run a three iteration CSI-BLAST search against the NR database
with an inclusion threshold of 0.002. Please note that because CSI-BLAST has to
parse the PSI-BLAST output in order to build an alignment of hits after each
iteration, the output of CSI-BLAST is currently restricted to "-m 0" format.
BUILDING A CONTEXT-SPECIFIC PROFILE
-----------------------------------
In case you don't want to search for homologous proteins but simply want to build
a context-specific profile from an input sequence or alignment you can use the
program csbuild (formerly called cscounts). Its usage is as follows:
csbuild version 2.2.3
Build a profile or PSSM from an alignment or sequence.
Copyright (c) 2010-2012 Andreas Biegert, Christof Angermueller, Johannes Soeding, and LMU Munich
Usage: csbuild -i <infile> [options]
csbuild -i <infile> -D <contextdata> [options]
Options:
-i, --infile <file> Input file with alignment or sequence
-o, --outfile <file> Output file for generated profile (def: <infile>.prf)
-I, --informat seq|fas|... Input format: seq, fas, a2m, or a3m (def=auto)
-O, --outformat prf|chk Outformat: profile or PSI-BLAST checkpoint (def=prf)
-M, --match-assign [0:100] Make all FASTA columns with less than X% gaps match columns
(def: make columns with residue in first sequence match columns)
-D, --context-data <file> Add context-specific pseudocounts with profile library (def=off)
-x, --pc-admix [0,1] Pseudocount admixture for context-specific pseudocounts (def=0.90)
-z, --pc-neff [1,inf[ Target Neff for pseudocounts admixture (def=0.00)
-c, --pc-ali [0,inf[ Constant in pseudocount calculation for alignments (def=12.0)
--pc-engine crf|lib Specify engine for pseudocount generation (def=auto)
--weight-center [0,inf[ Weight of central profile column for context-specific pseudocounts (def=1.60)
--weight-decay [0,inf[ Parameter for exponential decay of window weights (def=0.85)
--global-weights Use global instead of position-specific sequence weights (def=off)
For example, to add context-specific pseudocounts to your FASTA formatted input
sequence in infile.seq and write the resulting context-specific profile to
outfile.prf, execute the following command:
> csbuild -i infile.seq -o outfile.prf -D K4000.lib
The format of the context-specific profile in outfile.prf is analogous to the
format of the context profile library (described below). It can be easily
parsed to recover the amino acid frequencies in the profile. As alternative to
this profile format you can also write the context-profile as binary PSI-BLAST
checkpoint file, which can then be used to jumpstart PSI-BLAST via its -R
option. The csbuild command would then look like this:
> csbuild -i infile.seq -o outfile.chk -D K4000.crf -O chk
OBTAINING POSTERIOR PROBABILITIIES
----------------------------------
You can use the program csprobs to obtain posterior probabilities of the
K=4000 context profiles at each query postion. For example, running
> csbuild -i infile.seq -D K4000.lib
gives you index and probability of all context profiles with probability above
10%. Each line contains the posteriors for one query position formatted as
query_position: profile_index -> profile_probability" ...
You can also use the "--matrix" switch to get the full L*K posterior probability
matrix.
PROFILE LIBRARY FORMAT
----------------------
The profile library file (K4000.lib) basically contains a small header with
general information about the library itself, followd by a list of
human-readable profile records each starting with the keyword "ContextProfile":
ContextProfile
INDEX 0
PRIOR 0.0004827394287
NCOLS 13
ALPH 20
LOG 1
A R N D C Q E G H I L K M F P S T W Y V
1 3321 3730 4360 4012 7086 4413 3605 3930 5750 4491 3423 3728 5770 5181 4773 3694 4327 7559 5721 4435
2 3578 3733 4049 3805 7154 4377 3766 3500 5298 4821 3803 3661 6276 5155 4873 3795 4164 6846 5334 4502
3 3671 4590 4604 4445 6861 4979 4422 4516 5335 3535 2880 4536 5354 4065 4492 3913 4333 6745 4651 3670
4 3774 4428 4413 3754 7642 4927 3734 3678 5896 5693 5048 3697 6940 6127 2035 3770 4561 8201 6105 5337
5 4301 4280 3365 2894 8074 4710 3697 2011 5396 6456 5789 4156 7554 6724 4686 3921 4707 8211 6470 5855
6 4540 6578 7000 7283 6471 7102 6574 6381 6683 2799 2255 6529 4886 2688 6719 5956 5362 5338 3361 2676
7 4460 3224 4338 4020 7723 3842 2734 5434 5266 5102 5220 3087 6943 6140 5109 3783 3392 7684 5837 4400
8 4846 6273 7253 7340 6430 6888 6534 6582 7761 2088 2273 6739 5445 4039 5692 6007 4915 6791 5655 1936
9 3831 4094 5318 4989 6748 5108 4088 4914 5673 3330 3012 4276 5703 4661 5970 4136 3721 7007 4999 2802
10 3455 4226 3773 3100 6943 4723 3483 2561 5448 6267 5441 4316 7048 6586 4139 3509 4443 7693 6160 5368
11 3923 4896 5176 4997 6562 5282 4251 4802 6115 3296 3056 4406 5580 4023 3964 4263 3952 5468 4964 3116
12 3640 4409 4316 3863 6438 4876 3854 3683 5225 4394 4072 4426 5414 5101 4110 3583 3701 6198 4592 4185
13 3734 4466 4548 3491 6379 4936 3763 2852 5750 4548 4013 4344 6168 5201 4121 3828 3992 7115 5732 4040
//
The header section describes attributes such as the profile index (starting from
zero), prior probability of the profile (PRIOR), as well as the number of profile
columns (NCOLS) and the alphabet size used (ALPH, in this case amino acid). The
main body of the profile record describes the amino acid distribution at each
position of the profile (one line per profile column). To save space each amino
acid probability p is written as
v = -1000 * log2(p)
or '*' if p is zero. To recover the original amino acid probability from its
value v in K4000.lib you simply calculate
p = 2^(v / -1000).