icaass - Whole DNA database clustering and querying


     icaass [-mode Update|U|u|Query|Q|q|Orphans|O|o]
     [-anti-sense Yes|No|y|n] [-index filename]
     [-ini filename] [-threshold nn.n] [-score n]
     [-extmin n] [-screen n] -seq filename ...


     ICAass takes files of DNA sequence information and  produces
     an  index  file  which  links  similar sequences together in
     clusters.  ICAass differs from N2tool and ICAtool because it
     uses a novel type of global sequence comparison algorithm to
     determine when two sequences are  considered  similar.   The
     comparison  is asymmetric because the algorithm searches for
     instances  where  one  sequence  is  approximately  repeated
     within the length of another. To facilitate the rapid detec-
     tion of such matches, the program needs to be provided  with
     a  size  sorted  file  of  DNA  sequences  (longest sequence
     first).  This allows the program to use the rapid incremental
     clustering approach also used in ICAtool.  In Update mode,
     ICAass has been used to reduce the redundancy in copies of the
     separate divisions of Release 28 of the EMBL DNA database.
     The size-sorted sequence file can be produced using the
     program ssort.
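
     The "longest sequence first" ordering can be illustrated with a
     short Python sketch.  This is not the ssort program; the
     function names read_fasta and size_sort are hypothetical, and
     the sketch assumes simple FASTA input with the description on
     the '>' line.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def size_sort(records: Iterable[Record]) -> List[Record]:
    """Return the records ordered longest sequence first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)
```

     With the longest sequences first, every later sequence can only
     be equal in length to, or shorter than, the sequences already
     seen, which is what the asymmetric comparison relies on.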

     ICAass also has a query mode that allows it to be  used  for
     rapid  database  searches.   ICAass is especially quick when
     querying a database with a whole batch  of  query  sequences
     because  the  database is only loaded a single time and then
     stored in a compressed form in memory for repeated scanning.
     The  comparison  algorithm  used  when querying is different
     from that used when creating a cluster index.  The  sequence
     comparisons  are  potentially less sensitive than those per-
     formed by ICAtool but always more sensitive than those  per-
     formed  by  BLASTN using its default settings. When in query
     mode, it is possible to select how many matches are returned
     either by a simple count (-print option) or by their score
     (-score option).

     The program also has a special 'orphans' mode which allows a
     simple index to be created almost instantly without perform-
     ing any sequence comparisons.  This mode enables local data-
     bases  to  be  indexed and searched with a minimum of effort
     and resources.  Indexes produced on the  basis  of  sequence
     similarity are quicker to search but the indexing can take a
     significant time.  If the  database  is  only  going  to  be
     searched  a  hundred  or  so times then the effort of proper
     cluster indexing is not worthwhile.

     Sequences can be spread amongst any number of files and  new
     files  can  be  added  at any time to increase the number of
     sequences clustered.  Various sequence formats are supported,
     including GenBank, EMBL, plain (unformatted sequence files),
     Staden's semi-colon and Experiment file formats, and two
     NBRF/FASTA style formats: one with the description on the same
     line as '>sequence-name', the other with the description on
     the line immediately following the sequence name.  Extra files
     of sequences can be added at any time without any penalty of
     recalculation, but no sequences referenced by an index should
     ever be deleted.


     ICAass can take its configuration parameters from the command
     line, from a user initial configuration file, or from built-in
     defaults.  Later sources override earlier ones: the defaults
     are set first, then the configuration file is applied, and
     finally the command line.
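
     The precedence order can be sketched in Python as a layered
     lookup.  This only illustrates the rule described above and is
     not ICAass's own parsing code; resolve_settings is a
     hypothetical name, and the keys used in the usage note are the
     option names and defaults documented in this page.

```python
from collections import ChainMap
from typing import Dict


def resolve_settings(defaults: Dict[str, str],
                     ini_file: Dict[str, str],
                     command_line: Dict[str, str]) -> Dict[str, str]:
    """Merge three layers: defaults < configuration file < command line."""
    # ChainMap searches its maps left to right, so the command line wins,
    # then the configuration file, then the built-in defaults.
    return dict(ChainMap(command_line, ini_file, defaults))
```

     For example, with defaults of {"anti-sense": "No", "index":
     "cluster.index"}, a configuration file setting "anti-sense" to
     "Yes" and a command line setting "index" to "my.index" would
     yield both of those overriding values.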


  -anti-sense Yes|No|y|n
     Determines whether sequences should also be compared in  the
     opposite sense to how they are entered. Default is no.
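
     Comparing a DNA sequence "in the opposite sense" normally means
     using its reverse complement.  The sketch below shows that
     transformation in Python; it is an assumption that ICAass does
     the equivalent internally, and reverse_complement is a
     hypothetical name.

```python
# Map each base to its complement; 'N' stays 'N'.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]
```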

  -index filename
     Defines the name of the index file, whether existing or to be
     created.  Default is "cluster.index" in the current directory.

  -ini filename
     Defines the name of the user's initial configuration file.
     Default is "ICAtool.ini" in the current directory.

  -mode Update|U|u|Query|Q|q|Orphans|O|o
     Defines how the program will operate. In Update mode  ICAass
     will  perform full database clustering on the basis of pair-
     wise sequence  comparisons.  In  Orphans  mode  ICAass  will
     almost  instantly  index  all  the  sequences separately and
     without any sequence comparisons being performed.  In  Query
     mode ICAass will use an existing cluster or orphan index and
     search for local sequence similarities between  the  indexed
     and query sequences.

  -seq filename1 filename2 filenameN
     This flag denotes the start of a  list  of  space  separated
     filenames  which  hold DNA sequence information. No default,
     always required.

  -threshold nn.n
     When creating a cluster  index,  this  flag  determines  the
     subsequence  similarity  score that defines the threshold at
     which one sequence is said to be an Approximate  SubSequence
     of  another.  The threshold corresponds to the percentage of
     the putative ASS that is also represented  in  the  superse-
     quence.   A  fixed  gap-start penalty is subtracted from the
     number of matching bases for alignment gaps  in  either  the
     ASS or the supersequence.  This gap-start cost is equivalent
     to 8 bases of the  ASS  not  being  present  in  the  longer
     sequence.  The minimum value is 25.0 and the maximum is 100,
     but take care with high values because 'N' characters are
     converted to random bases.
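
     One possible reading of this score, written as a Python sketch:
     the number of matching bases, less 8 for each gap start, taken
     as a percentage of the length of the putative ASS.  The exact
     formula ICAass uses is not given here, so the function names
     and the use of "greater than or equal to" for the threshold
     test are assumptions.

```python
GAP_START_COST = 8  # bases charged per gap start, as stated above


def ass_score(matching_bases: int, gap_starts: int, ass_length: int) -> float:
    """Percentage of the putative ASS represented in the supersequence."""
    if ass_length <= 0:
        raise ValueError("ass_length must be positive")
    adjusted = matching_bases - GAP_START_COST * gap_starts
    return 100.0 * adjusted / ass_length


def is_approximate_subsequence(matching_bases: int, gap_starts: int,
                               ass_length: int, threshold: float) -> bool:
    """Assumed test: the score must reach the -threshold value."""
    return ass_score(matching_bases, gap_starts, ass_length) >= threshold
```

     Under this reading, a 100-base sequence with 96 matching bases
     and one gap scores 88.0, so it would pass a threshold of 85.0
     but not one of 90.0.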

  -score n
     When querying it is useful to select matches on the basis of
     their scores.  ICAass uses a simple +1 (match), -1 (mismatch)
     scoring scheme so that score screening of ungapped matching
     segments is easy.
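
     Under this scheme an ungapped segment of length L containing m
     mismatches scores L - 2m.  A minimal Python sketch of the
     arithmetic follows; ungapped_score is a hypothetical name, and
     the case-insensitive comparison is an assumption.

```python
def ungapped_score(a: str, b: str) -> int:
    """Score two aligned, equal-length segments: +1 match, -1 mismatch."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(1 if x == y else -1 for x, y in zip(a.upper(), b.upper()))
```

     For instance, two 6-base segments that differ at one position
     score 5 - 1 = 4.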

     If in query mode then this determines the number of  charac-
     ters per printed line. Default is 80.

  -extmin n
     This is a kind of sensitivity control.  It limits how far
     the  program  searches  past mismatches before deciding that
     there is no more alignment to be found.  The  more  negative
     the number, the further beyond the edges of a good alignment
     the program will search.  Useful values range from 8 (most
     insensitive) to -20.  The default is approximately


     If the initial configuration file (see -ini) is present then
     all startup details in it will be read.  An example would be

     If the index file (see -index) or an equivalent is present in
     Update mode then any extra sequences are added to the existing
     index.  A cluster or orphan index file is needed to perform
     database querying.


     N2tool(1),     ICAass(1),     ICAprint(1),      ICAstats(1),
     ICAmatches(1), tofasta(1), ssort(1), just30(1)


     None of the ICAtools check  their  command  line  parameters
     fully.   Only  those  parameters  that  are  recognized  are

     Base ambiguity symbols are not handled properly: use only 'n'
     or 'N', which are converted to random bases.
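
     The effect of the conversion can be sketched in Python.  How
     ICAass chooses the random bases is not documented here, so the
     uniform choice over A, C, G and T and the name
     randomize_ambiguous are assumptions.

```python
import random


def randomize_ambiguous(seq: str, rng: random.Random) -> str:
    """Replace every 'n' or 'N' with a randomly chosen base."""
    return "".join(rng.choice("ACGT") if ch in ("n", "N") else ch
                   for ch in seq)
```

     Because each 'N' becomes an arbitrary base, two otherwise
     identical sequences containing 'N' characters need not match
     exactly after conversion, which is why very high -threshold
     values call for care.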