MCL FAQ(1)                USER COMMANDS                MCL FAQ(1)



  NAME
          mclfaq - faqs and facts about the MCL cluster algorithm

          MCL  refers to the generic MCL algorithm and the MCL pro-
          cess on which the algorithm is based. mcl refers  to  the
          implementation.  This  FAQ  answers  questions related to
          both. In some places MCL is written where MCL or mcl  can
          be  read. This is the case for example in section 2, What
          kind of graphs.  It should in general be obvious from the
          context.

          This FAQ does not begin to attempt to explain the motiva-
          tion and mathematics  behind  the  MCL  algorithm  -  the
          internals  are  not  explained.  A broad view is given in
          faq 1.2, and see also faq 1.5 and section REFERENCES.

          Some additional sections precede the actual faq entries.
          The TOC section contains a listing of all questions.

  RESOURCES
          mcl  development is discussed on mcl-devel@lists.mdcc.cx,
          this list is archived at http://lists.mdcc.cx/mcl-devel/.

          See the REFERENCES Section for publications detailing the
          mathematics behind the MCL algorithm.

  TOC
   1..... General questions
    1.1.. For whom is mcl and for whom is this FAQ?
    1.2.. What is the relationship between the MCL process, the MCL
          algorithm, and the 'mcl' implementation?
    1.3.. What do the letters MCL stand for?
    1.4.. How could you be so feebleminded as to use MCL as an
          abbreviation? Why is it labeled 'Markov cluster' anyway?
    1.5.. Where can I learn about the  innards  of  the  MCL  algo-
          rithm/process?
    1.6.. For which platforms is mcl available?
    1.7.. How does mcl's versioning scheme work?

   2..... What kind of graphs
    2.1.. What is legal input for MCL?
    2.2.. What is sensible input for MCL?
    2.3.. Does MCL work for weighted graphs?
    2.4.. Does MCL work for directed graphs?
    2.5.. Can  MCL  work  for  lattices / directed acyclic graphs /
          DAGs?
    2.6.. Does MCL work for tree graphs?
    2.7.. For what kind of graphs does MCL work well and for  which
          does it not?
    2.8.. What  makes  a  good input graph?  How do I construct the
          similarities?  How to make them satisfy this Markov  con-
          dition?
    2.9.. My input graph is directed. Is that bad?
    2.10. Why  does mcl like undirected graphs and why does it dis-
          like uni-directed graphs so much?
    2.11. How do I check that my  graph/matrix  is  symmetric/undi-
          rected?

   3..... Resource tuning / accuracy
    3.1.. What do you mean by resource tuning?
    3.2.. How do I compute the maximum amount of RAM needed by mcl?
    3.3.. How  much  does  the  mcl  clustering  differ  from   the
          clustering  resulting  from a perfectly computed MCL pro-
          cess?
    3.4.. How do I know that I am using enough resources?
    3.5.. Where is the mathematical analysis of  this  mcl  pruning
          strategy?
    3.6.. What  qualitative statements can be made about the effect
          of pruning?
    3.7.. At different high resource levels my clusterings are  not
          identical.  How can I trust the output clustering?

   4..... Tuning cluster granularity
    4.1.. How do I tune cluster granularity?
    4.2.. The effect of inflation on cluster granularity.
    4.3.. The  effect  of  similarity  distribution  homogeneity on
          cluster granularity.
    4.4.. The effect of initial centering on cluster granularity.
    4.5.. How to implement two-level approaches using mcl.

   5..... Implementing the MCL algorithm
    5.1.. How easy is it to implement the MCL algorithm?

   6..... Cluster overlap / MCL iterand cluster interpretation
    6.1.. Introduction
    6.2.. Can the clusterings returned by mcl contain overlap?
    6.3.. How do I  obtain  the  clusterings  associated  with  MCL
          iterands?

   7..... Miscellaneous
    7.1.. How do I find the default settings of mcl?
    7.2.. What's next?

  FAQ
                              General questions

    1.1   For whom is mcl and for whom is this FAQ?

          For  everybody  with  an  appetite  for graph clustering.
          Regarding the FAQ, I have kept the amount of  mathematics
          as  low  as  possible,  as far as matrix analysis is con-
          cerned.  Inevitably, some terminology pops  up  and  some
          references  are made to the innards of the MCL algorithm,
          especially in the  section  on  resources  and  accuracy.
          Graph   terminology  is  used  somewhat  more  carelessly
          though. The future might bring definition entries;
          right now you have to do without. Mathematically
          inclined people may be interested in the pointers
          found in the REFERENCES section.

          Given  this mention of mathematics, let me point out this
          one time only that using mcl is extremely straightforward
          anyway. You need only mcl and an input graph (refer
          to the mcl manual page), and many people trained in
          something other than mathematics are using mcl happily.

    1.2   What is the relationship between the MCL process, the MCL
          algorithm, and the 'mcl' implementation?

          mcl is what you use for clustering. It implements the MCL
          algorithm,  which  is a cluster algorithm for graphs. The
          MCL algorithm is basically a shell in which the MCL  pro-
          cess is computed and interpreted. I will describe them in
          the natural, reverse, order.

          The MCL process generates a sequence of stochastic
          matrices given some initial stochastic matrix. The
          elements with
          even  index  are  obtained  by  expanding  the   previous
          element,  and the elements with odd index are obtained by
          inflating the previous element given some inflation  con-
          stant.  Expansion  is nothing but normal matrix squaring,
          and inflation  is  a  particular  way  of  rescaling  the
          entries  of  a  stochastic  matrix  such  that it remains
          stochastic.
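          The two operations can be sketched in a few lines of
          Python (an illustrative toy using dense lists of lists
          as column-stochastic matrices; the real mcl works with
          sparse matrices):

```python
# Toy sketch of the two MCL operations. Column j of a matrix
# holds the transition probabilities out of node j, so every
# column sums to one.

def expand(m):
    """Expansion: ordinary matrix squaring, m times m."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(m, r):
    """Inflation: raise entries to the power r, then rescale
    each column so that it sums to one again."""
    n = len(m)
    out = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    for j in range(n):
        colsum = sum(out[i][j] for i in range(n))
        for i in range(n):
            out[i][j] /= colsum
    return out
```

          One step of the process is then inflate(expand(m), r);
          both operations keep the matrix stochastic.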

          The sequence of MCL elements (from the MCL process)
          is in principle without end, but what happens is
          that the elements converge to some specific kind of
          matrix, called
          the  limit  of  the process. The heuristic underlying MCL
          predicts that the interaction of expansion with inflation
          will  lead to a limit exhibiting cluster structure in the
          graph associated with the initial matrix. This is  indeed
          the  case,  and  several  mathematical  results  tie  MCL
          iterands and limits and the MCL  interpretation  together
          (REFERENCES).

          The  MCL  algorithm is simply a shell around the MCL pro-
          cess in which an input graph is transformed into an  ini-
          tial  matrix  suitable for starting the process, in which
          inflation parameters are set, and in which the  MCL  pro-
          cess  is  stopped once the limit is reached, and in which
          the result is interpreted as a clustering.

          The mcl implementation supplies the functionality of  the
          MCL  algorithm,  with some extra facilities for manipula-
          tion of the input graph, interpreting the result, manipu-
          lating  resources  while computing the process, and moni-
          toring the state of these manipulations.

    1.3   What do the letters MCL stand for?

          For Markov Cluster. The MCL algorithm is a cluster  algo-
          rithm  that  is  basically  a shell in which an algebraic
          process is computed.  This process iteratively  generates
          stochastic matrices, also known as Markov matrices, named
          after the famous Russian mathematician Andrei Markov.

    1.4   How could you be so feebleminded as to use MCL as an
          abbreviation? Why is it labeled 'Markov cluster' anyway?

          Sigh.  It is a widely known fact that a TLA or Three-Let-
          ter-Acronym is the canonical self-describing abbreviation
          for  the name of a species with which computing terminol-
          ogy is infested (quoted from the Free  Online  Dictionary
          of Computing). Back when I was thinking of a nice tag for
          this cute algorithm, I was totally  unaware  of  this.  I
          naturally  dismissed  MC (and would still do that today).
          Then MCL occurred to  me,  and  without  giving  it  much
          thought  I  started  using it.  A Google search (or was I
          still using Alta-Vista back then?)  might  have  kept  me
          from going astray.

          Indeed,  MCL  is used as a tag for Macintosh Common Lisp,
          Mission Critical Linux,  Monte  Carlo  Localization,  MUD
          Client  for  Linux, Movement for Canadian Literacy, and a
          gazillion other things - refer to  the  file  mclmcl.txt.
          Confusing. It seems that the three characters MCL possess
          otherworldly  magical  powers  making  them  an  ever  so
          strange  and  strong  attractor  in the space of TLAs. It
          probably helps that Em-See-Ell (Em-Say-Ell in Dutch)  has
          some  rhythm  to  it  as well. Anyway MCL stuck, and it's
          here to stay.

          On a more general level, the label Markov Cluster is  not
          an  entirely fortunate choice either. Although phrased in
          the language of stochastic  matrices,  MCL  theory  bears
          very little relation to Markov theory, and is much closer
          to matrix analysis (including Hilbert's distance) and the
          theory of dynamical systems. No results have been derived
          in the latter framework, but many conjectures  are  natu-
          rally posed in the language of dynamical systems.

    1.5   Where  can  I  learn  about  the innards of the MCL algo-
          rithm/process?

          Currently, the most basic explanation of  the  MCL  algo-
          rithm  is  found in the technical report [2]. It contains
          sections on several other (related) subjects though,  and
          it  assumes  some  working  knowledge  on  graphs, matrix
          arithmetic, and stochastic matrices.

    1.6   For which platforms is mcl available?

          It should compile and run on  virtually  any  flavour  of
          UNIX  (including  Linux  and the BSD variants of course).
          Following the instructions in the  INSTALL  file  shipped
          with mcl should be straightforward and sufficient.
          Thanks to Joost van Baal, who completely autofooled mcl.

          Building MCL on Wintel (Windows on Intel chip) should  be
          straightforward  if  you  use  the  full  suite of cygwin
          tools. Install cygwin if you do not have it yet.  In  the
          cygwin  shell,  unpack  mcl and simply issue the commands
          ./configure, make, make install, i.e. follow the instruc-
          tions in INSTALL.

          This MCL implementation has not yet been reported to
          run on the Mac. Under the latest Mac OS X it should
          certainly be possible to make this happen.

          If  you  have further questions or news about this issue,
          contact mcl-devel <at> lists <dot> mdcc <dot> cx.

    1.7   How does mcl's versioning scheme work?

          The current setup, which I hope to continue, is this. All
          releases  are  identified  by  a  date stamp. For example
          02-095 denotes day 95 in the year 2002. This  date  stamp
          agrees  (as  of  April  2000)  with the (differently pre-
          sented) date stamp used in all manual pages shipped  with
          that release.  For example, the date stamp of the FAQ you
          are reading is  4 Jul 2003, which  corresponds  with  the
          MCL  stamp 03-185.  The Changelog file contains a list of
          what's changed/added with each  release.  Currently,  the
          date  stamp  is  the  primary  way  of identifying an mcl
          release. When asked for its version by  using  --version,
          mcl  outputs  both  the date stamp and a version tag (see
          below).

          In early 2002 it occurred to me that mcl should, in addi-
          tion  to  time  stamps,  also have something like version
          numbers, wanting to  use  those  to  indicate  noteworthy
          changes. The April 2002 release got version tag 1.001, in
          order to celebrate the then-recent addition of this  FAQ,
          mcl's  new logging facility --log, and clmimac to the MCL
          distribution. The January 2003 release  had  its  version
          number bumped to 1.002, marking MCL's ability to directly
          deal with a much more general  type  of  graph  encoding.
          Currently, the version tag is not used in the mcl distri-
          bution name - only the date stamp is used for that.

                             What kind of graphs

    2.1   What is legal input for MCL?

          Any graph (encoded as a matrix of similarities)  that  is
          nonnegative,  i.e.  all  similarities are greater than or
          equal to zero.

    2.2   What is sensible input for MCL?

          It is ok for graphs  to  be  weighted,  and  they  should
          preferably  be symmetric.  They should certainly not con-
          tain parts that are  (almost)  cyclic,  although  nothing
          stops you from experimenting with such input.

    2.3   Does MCL work for weighted graphs?

          Yes,  unequivocally.  They  should  preferably be symmet-
          ric/undirected though.  See entries 2.7 and 2.8.

    2.4   Does MCL work for directed graphs?

          Maybe, with a big caveat. See entries 2.8 and 2.9.

    2.5   Can MCL work for lattices /  directed  acyclic  graphs  /
          DAGs?

          Such  graphs  [term]  can  surely  exhibit  clear cluster
          structure. If they do, there is only one way for  mcl  to
          find  out.  You have to change all arcs to edges, i.e. if
          there is an arc from i to j with similarity s(i,j)  -  by
          the  DAG  property  this  implies  s(j,i) = 0 - then make
          s(j,i) equal to s(i,j).
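          The arc-to-edge conversion just described can be
          sketched as follows (a hypothetical snippet; arcs are
          stored in a dictionary mapping node pairs to
          similarities, which is not the mci format mcl itself
          reads):

```python
# Make a directed similarity matrix undirected by copying each
# arc weight to the reverse direction. For a DAG at most one of
# s(i,j), s(j,i) is nonzero, so max() simply copies the weight.

def symmetrize(sim):
    out = dict(sim)
    for (i, j), w in sim.items():
        out[(j, i)] = max(out.get((j, i), 0.0), w)
    return out
```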

          This may feel like throwing  away  valuable  information,
          but  in truth the information that is thrown away (direc-
          tion) is not informative with respect to the presence  of
          cluster structure. This may well deserve a longer discus-
          sion than would be justified here.

    2.6   Does MCL work for tree graphs?

          Nah, I don't think so. More info at entry 2.7.

    2.7   For what kind of graphs does MCL work well and for  which
          does it not?

          Graphs in which the diameter [term] of (subgraphs induced
          by) natural clusters  is  not  too  large.  Additionally,
          graphs  should  preferably  be  (almost)  undirected (see
          entry below) and not so sparse that  the  cardinality  of
          the edge set is close to the number of nodes.

          A  class  of  such  very  sparse  graphs  is that of tree
          graphs. You might look into graph visualization  software
          and  research  if you are interested in decomposing trees
          into 'tight' subtrees.

          The diameter criterion could be violated by neighbourhood
          graphs  derived from vector data. In the specific case of
          2 and 3 dimensional data,  you  might  be  interested  in
          image  segmentation  and  boundary detection, and for the
          general case there is a  host  of  other  algorithms  out
          there. [add]

          In  case  of  weighted  graphs, the notion of diameter is
          sometimes  not  applicable.  Generalizing   this   notion
          requires  inspecting  the mixing properties of a subgraph
          induced by a natural cluster in terms  of  its  spectrum.
          However,  the diameter statement is something grounded on
          heuristic considerations (confirmed by practical evidence
          [4])  to  begin with, so you should probably forget about
          mixing properties.

    2.8   What makes a good input graph?  How do  I  construct  the
          similarities?   How to make them satisfy this Markov con-
          dition?

          To begin with the last one: you need  not  and  must  not
          make  the  input  graph  such  that  it is stochastic aka
          Markovian [term]. What you need to do  is  make  a  graph
          that   is  preferably  symmetric/undirected,  i.e.  where
          s(i,j) = s(j,i) for all nodes i and j.  It  need  not  be
          perfectly undirected, see the following faq for a discus-
          sion of that.  mcl will work with  the  graph  of  random
          walks  that is associated with your input graph, and that
          is the natural state of affairs.

          The input graph should preferably be honest in the  sense
          that  if  s(x,y)=N and s(x,z)=200N (i.e. the similarities
          differ by a factor 200), then this should really  reflect
          that the similarity of y to x is negligible compared
          with the similarity of z to x.

          For the rest, anything goes. Try  to  get  a  feeling  by
          experimenting.  Sometimes it is a good idea to filter out
          high-frequency and/or low-frequency data, i.e. nodes with
          either  very many neighbours or extremely few neighbours.

    2.9   My input graph is directed. Is that bad?

          It depends. The class of directed graphs can be viewed as
          a  spectrum  going from undirected graphs to uni-directed
          graphs. Uni-directed is terminology I am inventing  here,
          which I define as the property that for all node pairs i,
          j, at least one of s(i,j) or s(j,i)  is  zero.  In  other
          words,  if  there  is  an arc going from i to j in a uni-
          directed graph, then there is no arc going from j to i. I
          call  a  node pair i, j, almost uni-directed if s(i,j) <<
          s(j,i) or vice versa, i.e. if the similarities differ  by
          an order of magnitude.

          If  a  graph  does  not  have  (large)  subparts that are
          (almost) uni-directed, have a go with mcl. Otherwise, try
          to make your graph less uni-directed.  You are in charge,
          so do anything with  your  graph  as  you  see  fit,  but
          preferably  abstain from feeding mcl uni-directed graphs.
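          To get an impression of how uni-directed a graph is,
          you could count the offending node pairs with
          something like this (a hypothetical sketch; sim maps
          ordered node pairs to arc weights):

```python
# Report node pairs for which one arc weight is zero or at
# least 'factor' times larger than the reverse arc weight.

def unidirected_pairs(sim, factor=10.0):
    bad, seen = [], set()
    for (i, j), w in sim.items():
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue
        seen.add(pair)
        back = sim.get((j, i), 0.0)
        if back == 0.0 or w >= factor * back or back >= factor * w:
            bad.append(pair)
    return bad
```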

    2.10  Why does mcl like undirected graphs and why does it  dis-
          like uni-directed graphs so much?

          Mathematically,  the  mcl  iterands will be nice when the
          input graph is symmetric, where  nice  is  in  this  case
          diagonally symmetric to a positive semi-definite matrix
          (ignore as needed). For one thing, such nice matrices can
          be  interpreted  as clusterings in a way that generalizes
          the interpretation of the mcl limit as a  clustering  (if
          you are curious about these intermediate clusterings, see
          faq entry 6.3).  See the REFERENCES section for  pointers
          to mathematical publications.

          The  reason  that mcl dislikes uni-directed graphs is not
          very mcl specific, it has more to do with the  clustering
          problem  itself.   Somehow,  directionality  thwarts  the
          notion of cluster structure.  [add].

    2.11  How do I check that my  graph/matrix  is  symmetric/undi-
          rected?

          Whether  your  graph  is  created by third-party software
          (e.g. the TribeMCL module (maintained by Anton  Enright))
          or by custom software written by someone you know (e.g.
          yourself), it is advisable to test whether  the  software
          generates symmetric matrices. This can be done as follows
          using the mcx utility, assuming that you want to test the
          matrix  stored in file matrix.mci. The mcx utility should
          be available on your system if mcl was installed  in  the
          normal way.

          mcx /matrix.mci lm tp -1 mul add /check wm

          This loads the graph/matrix stored in matrix.mci into
          mcx's memory with the mcx lm primitive (the leading
          slash is how strings are introduced in the stack
          language interpreted by mcx). The transpose of that
          matrix is then
          pushed  on the stack with the tp primitive and multiplied
          by minus one. The two matrices are added, and the  result
          is  written  to the file check.  The transposed matrix is
          the mirrored version of the  original  matrix  stored  in
          matrix.mci.  If  a  graph/matrix is undirected/symmetric,
          the mirrored image is necessarily the  same,  so  if  you
          subtract  one  from the other it should yield an all zero
          matrix.

          Thus, the file check should look like this:

          (mclheader
          mcltype matrix
          dimensions <num>x<num>
          )
          (mclmatrix
          begin
          )

          Where <num> is the same as in  the  file  matrix.mci.  If
          this  is  not  the  case, find out what's prohibiting you
          from  feeding  mcl  symmetric  matrices.  Note  that  any
          nonzero  entries found in the matrix stored as check cor-
          respond to node pairs for which the arcs in the two  pos-
          sible directions have different weight.
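          If the matrix is available inside a program rather
          than in mci format, the same check (subtract the
          transpose, test for zero) amounts to the following
          hypothetical sketch using dense lists of lists:

```python
# A graph/matrix is undirected/symmetric exactly when it equals
# its transpose, i.e. when m minus its transpose is all zeros.

def is_symmetric(m, tol=0.0):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(n) for j in range(n))
```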

                         Resource tuning / accuracy

    3.1   What do you mean by resource tuning?

          mcl  computes  a process in which stochastic matrices are
          alternately expanded and inflated. Expansion  is  nothing
          but  standard  matrix multiplication, inflation is a par-
          ticular way of rescaling the matrix entries.

          Expansion causes problems  in  terms  of  both  time  and
          space. mcl works with matrices of dimension N, where N is
          the number of nodes in the input graph.   If  no  precau-
          tions  are  taken,  the  number  of  entries  in  the mcl
          iterands  (which  are  stochastic  matrices)  will   soon
          approach  the  square  of N. The time it takes to compute
          such a matrix will be proportional to the cube of  N.  If
          your input graph has 100,000 nodes, the memory
          requirements become infeasible and the time
          requirements become
          impossible.

          What mcl does is perturb the process it computes a
          little by removing the smallest entries -  it  keeps  its
          matrices  sparse.  This is a natural thing to do, because
          the matrices are sparse in a weighted sense (a very  high
          proportion  of  the stochastic mass is contained in rela-
          tively few entries), and the process converges to  matri-
          ces  that are extremely sparse, with usually no more than
          N entries.  It is thus known that the  MCL  iterands  are
          sparse  in a weighted sense and are usually very close to
          truly sparse matrices.  The way mcl perturbs its matrices
          is  by  the  strategy of pruning, selection, and recovery
          that is extensively described in the mcl(1) manual  page.
          The  question then is: What is the effect of this pertur-
          bation on the resulting clustering, i.e.  how  would  the
          clustering  resulting  from a perfectly computed mcl pro-
          cess compare with the clustering I  have  on  disk?   Faq
          entry 3.3 discusses this issue.

          The  amount  of resources used by mcl is bounded in terms
          of the maximum number of neighbours a node is allowed  to
          have  during all computations.  Equivalently, this is the
          maximum number of nonzero entries  a  matrix  column  can
          possibly  have.  This  number, finally, is the maximum of
          the values corresponding with the -S and -R options.
          The  latter  two are listed when using the -z option (see
          faq 7.1).

    3.2   How do I compute the maximum amount of RAM needed by mcl?

          It is roughly equal to

          2 * s * K * N

          bytes,  where  2 is the number of matrices held in memory
          by mcl, s is the size of a single cell (i.e. a matrix entry
          or  node/arc  specification), N is the number of nodes in
          the input graph, and where K is the maximum of the values
          corresponding  with  the  -S  and  -R  options  (and this
          assumes that the average node degree in the  input  graph
          does not exceed K either). The value of s can be found by
          using the -z option. It is listed in  one  of  the  first
          lines  of  the  resulting output. s equals the size of an
          int plus the size of a float, which will  be  8  on  most
          systems.   The  estimate  above will in most cases be way
          too pessimistic (meaning you do not need that  amount  of
          memory).

          The -how-much-ram option is provided by mcl for computing
          the bound given above. This option takes as argument the
          number of nodes in the input graph.

          The  theoretically  more  precise upper bound is slightly
          larger due to overhead. It is something like

          ( 2 * s * (K + c)) * N

          where c is 5 or so, but one should not pay  attention  to
          such a small difference anyway.
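          The two bounds are easy to tabulate with a small
          helper (a hypothetical sketch; k is the maximum of
          the -S and -R values, cell_size is the per-entry size
          reported by -z, typically 8 bytes, and overhead is
          the small constant c above):

```python
def mcl_ram_bound(n_nodes, k, cell_size=8, overhead=5):
    """Return (simple, precise) upper bounds in bytes for mcl's
    peak RAM use: 2*s*K*N and (2*s*(K+c))*N respectively."""
    simple = 2 * cell_size * k * n_nodes
    precise = 2 * cell_size * (k + overhead) * n_nodes
    return simple, precise
```

          For example, 100,000 nodes with K = 500 gives a
          simple bound of 800 million bytes, which as noted
          above is usually far too pessimistic.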

    3.3   How much does the mcl clustering differ from the cluster-
          ing resulting from a perfectly computed MCL process?

          For graphs with up to a few thousand nodes a perfectly
          computed  MCL  process can be achieved by abstaining from
          pruning  and  doing  full-blown  matrix  arithmetic.   Of
          course, this still leaves the issue of machine precision,
          but let us wholeheartedly ignore that.

          Such experiments give evidence (albeit  incidental)  that
          pruning  is  indeed  really  what it is thought to be - a
          small perturbation. In  many  cases,  the  'approximated'
          clustering  is  identical  to  the 'exact' clustering. In
          other cases, they are very close to each other  in  terms
          of the metric split/join distance as computed by clmdist.
          Some experiments with  randomly  generated  test  graphs,
          clustering, and pruning are described in [4].

          On  a different level of abstraction, note that perturba-
          tions of the inflation parameter will also lead  to  per-
          turbations  in  the  resulting  clusterings,  and surely,
          large changes in the inflation parameter will in  general
          lead  to  large  shifts  in the clusterings. Node/cluster
          pairs that are different for  the  approximated  and  the
          exact  clustering  will very likely correspond with nodes
          that are in a boundary region between two or  more  clus-
          ters  anyway, as the perturbation is not likely to move a
          node from one core of attraction to another.

          Faq entry 3.6 has more to say about this subject.

    3.4   How do I know that I am using enough resources?

          In mcl parlance, this becomes  how  do  I  know  that  my
          -scheme  parameter is high enough or more elaborately how
          do I know that the values of the {-P, -S, -R, -pct} combo
          are high enough?

          There  are  several  aspects. First, watch the jury marks
          reported by mcl when it's done.  The jury marks are three
          grades,  each  out of 100. They indicate how well pruning
          went. If the marks are in  the  seventies,  eighties,  or
          nineties,  mcl is probably doing fine. If they are in the
          eighties or lower, try to see if you can  get  the  marks
          higher  by  spending  more  resources  (e.g. increase the
          parameter to the -scheme option).

          Second, you  can  do  multiple  mcl  runs  for  different
          resource  schemes,  and compare the resulting clusterings
          using clmdist. See the clmdist manual for a case study.

    3.5   Where is the mathematical analysis of  this  mcl  pruning
          strategy?

          There is none. [add]

          Ok, the next entry gives an engineer's rule of thumb.

    3.6   What  qualitative statements can be made about the effect
          of pruning?

          The more severe pruning is, the more the computed process
          will  tend  to  converge prematurely. This will generally
          lead to finer-grained clusterings.  In cases where  prun-
          ing  was severe, the mcl clustering will likely be closer
          to a clustering ideally resulting from another  MCL  pro-
          cess  with higher inflation value, than to the clustering
          ideally resulting from the same MCL process. Strong  sup-
          port   for   this  is  found  in  a  general  observation
          illustrated by the following  example.  Suppose  u  is  a
          stochastic vector resulting from expansion:

          u   =  0.300 0.200 0.200 0.100 0.050 0.050 0.050 0.050

          Applying inflation with inflation value 2.0 to u gives

          v   =  0.474 0.211 0.211 0.053 0.013 0.013 0.013 0.013

          Now  suppose  we first apply pruning to u such that the 3
          largest entries 0.300, 0.200 and 0.200 survive,  throwing
          away  30 percent of the stochastic mass (which is quite a
          lot by all means).  We rescale those  three  entries  and
          obtain

          u'  =  0.429 0.286 0.286 0.000 0.000 0.000 0.000 0.000

          Applying inflation with inflation value 2.0 to u' gives

          v'  =  0.529 0.235 0.235 0.000 0.000 0.000 0.000 0.000

          If  we  had applied inflation with inflation value 2.5 to
          u, we would have obtained

          v'' =  0.545 0.198 0.198 0.035 0.006 0.006 0.006 0.006

          The vectors v' and v'' are much closer to each other than
          the vectors v' and v, illustrating the general idea.

          In  practice, mcl should (on average) do much better than
          in this example.
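          The vectors above can be reproduced with a short
          Python sketch (hypothetical; inflation of a
          stochastic vector is entrywise powering followed by
          rescaling):

```python
def inflate(v, r):
    """Raise entries to the power r, rescale to sum one."""
    powered = [x ** r for x in v]
    total = sum(powered)
    return [x / total for x in powered]

u = [0.300, 0.200, 0.200, 0.100, 0.050, 0.050, 0.050, 0.050]

v = inflate(u, 2.0)             # the vector v above
u_pruned = inflate(u[:3], 1.0)  # keep 3 largest entries, rescale
v1 = inflate(u_pruned, 2.0)     # the vector v'
v2 = inflate(u, 2.5)            # inflation 2.5 applied to u
```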

    3.7   At different high resource levels my clusterings are  not
          identical.  How can I trust the output clustering?

          Did  you  read  all  other  entries in this section? That
          should have reassured you somewhat,  except  perhaps  for
          Faq answer 3.5.

          You  need  not  feel  uncomfortable  with the clusterings
          still being different at high resource levels, if ever so
          slightly. In all likelihood there are in any case
          nodes which are not in any core of attraction, and
          that are on the boundary between two or more
          clusters. They may go one
          way or another, and these are the  nodes  which  will  go
          different  ways  even at high resource levels. Such nodes
          may be stable in clusterings obtained for lower inflation
          values (i.e. coarser clusterings), in which the different
          clusters to which they are attracted are merged.

          By the way, you do know all  about  clmdist,  don't  you?
          Because  the statement that clusterings are not identical
          should be quantified: How much do they differ? This issue
          is  discussed  in the clmdist manual page - clmdist gives
          you a robust measure  for  the  distance  (dissimilarity)
          between two clusterings.

          There  are  other means of gaining trust in a clustering,
          and there are different issues at play. There is the
          matter of how accurately mcl computed the MCL process,
          and there is the matter of how well the chosen  inflation
          parameter fits the data. The first can be judged by look-
          ing at the jury marks (faq 3.4) and applying  clmdist  to
          different  clusterings.  The second can be judged by mea-
          surement (e.g. use clminfo) and/or inspection  (use  your
          judgment).

                         Tuning cluster granularity

    4.1   How do I tune cluster granularity?

          There are several ways of influencing cluster granular-
          ity. These ways and their relative  merits  are  succes-
          sively discussed below. The clmdist(1)  manual  contains
          an example of doing multiple mcl runs to  find  cluster-
          ings of different granularity,  using  the  most  common
          approach, namely that of varying inflation.

    4.2   The effect of inflation on cluster granularity.

          The main handle for changing inflation is the -I  option.
          This  is also the principal handle for regulating cluster
          granularity. Unless you are mangling huge graphs it could
          be  the  only mcl option you ever need besides the output
          redirection option -o.

          Increasing the value of -I will increase  cluster  granu-
          larity.   Conceivable  values  are from 1.1 to 5.0 or so,
          but the range of suitable values will certainly depend on
          your  input  graph.  For many graphs, 1.1 will be far too
          low, and for many other graphs, 5.0 will be far too high.
          You  will have to find the right value or range of values
          by experimenting, using your judgment, and using measure-
          ment  tools  such as clmdist and clminfo. The default 2.0
          is a good value to begin the experimental stage with.

          For experiments that are  more  subtle  with  respect  to
          inflation, mcl provides the -i option in conjunction with
          the -l (small letter ell) option. Do  this  only  if  you
          intend to play around with mcl in  order  to  study  the
          characteristics of the process that it computes, and
          maybe, just maybe, use it in a production environment if
          you find it useful. In the same vein, you  may  be  in-
          terested to know that mcx is a stack language/inter-
          preter in which the entire MCL algorithm can be  written
          in three lines of code. It provides comprehensive access
          to the MCL graph and matrix libraries. However, the  mcx
          interface to the MCL pruning facilities is not yet
          satisfactory.

    4.3   The effect  of  similarity  distribution  homogeneity  on
          cluster granularity.

          How  similarities  in  the input graph were derived, con-
          structed, adapted, filtered (et cetera) will affect clus-
          ter  granularity.   It is important that the similarities
          are honest; refer to faq 2.8.

          Another issue is that homogeneous  similarities  tend  to
          result in more coarse-grained clusterings. You can make a
          set of similarities more  homogeneous  by  applying  some
          function to all of them, e.g. for all pairs of  nodes  (x
          y) replace S(x,y) by its square root, its  logarithm,  or
          some other concave function. Note that you need not worry
          about scaling, i.e. the possibly large changes in  magni-
          tude of the similarities. MCL is not affected by absolute
          magnitudes; it is only affected by magnitudes taken rela-
          tive to each other.
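
          As a plain illustration of such a re-weighting (this is
          Python, not mcl syntax, and the edge weights are hypo-
          thetical), applying a concave function such as the
          square root compresses the spread of the weights:

```python
import math

# hypothetical edge weights S(x,y)
weights = {("a", "b"): 100.0, ("a", "c"): 4.0, ("b", "c"): 1.0}

# concave re-weighting: apply the square root to every similarity
homogeneous = {edge: math.sqrt(w) for edge, w in weights.items()}

# the spread between the largest and smallest weight shrinks
ratio = max(homogeneous.values()) / min(homogeneous.values())
```

          The ratio between the largest and smallest weight drops
          from 100:1 to 10:1; since MCL only looks at relative
          magnitudes, this is exactly the kind of change that
          matters.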

          Here is how to make a graph more homogeneous with respect
          to the weight function. Given orig.mci,  clustering  the
          graph revised.mci constructed  below  should  generally
          lead to coarser clusterings.

          mcx /orig.mci lm 0.5 hdp /revised.mci wm

          The parameter 0.5 can be changed to other values  in  the
          range [0..1.0].  The closer it is to zero, the more clus-
          terings will tend to be coarse.

          If the parameter is chosen larger than 1.0,  say  in  the
          range [1.2..5.0], then clusterings will tend to  be  more
          fine-grained. For example,

          mcx /orig.mci lm 3.0 hdp /revised.mci wm

    4.4   The effect of initial centering on cluster granularity.

          This refers to the -c parameter, which adds loops to  the
          input graph. Its default value is 1.0, which causes loops
          of a somewhat 'neutral' weight to be added.  If  you  need
          to really fine-tune granularity, this  option  can  be  of
          use; otherwise you should abstain from using it.  Increas-
          ing its value will increase cluster granularity.

          Conceivable/normal  values  are  in the range 1.0 to 5.0,
          but nothing stops  you  from  going  higher  or  slightly
          lower.  Going  lower  than  0.5  is definitely not a good
          idea.

          If you are into clustering at high levels of granularity,
          there is the issue of whether to further increase -I,  or
          whether to start increasing or further  increase  -c.  It
          will  really  depend  on the characteristics of the graph
          you are working with, and at this point in time I  cannot
          even  give  advice  in terms of a general categorization.
          Experiment, learn, and let me know  the  results  if  you
          like.

    4.5   How to implement two-level approaches using mcl.

          If changing inflation does not yield clusterings that are
          sufficiently coarse for your liking, you may consider
          trying a two-level approach. Presumably your input  graph
          is very large if you find yourself  in  this  situation.
          You should be aware of the possibility that the graph you
          are clustering simply does not possess the type of
          coarse-grained structure that you are looking for.

          Two-level  approaches  can be implemented in a variety of
          ways, and you may wish to invoke tools  other  than  mcl.
          However,  it  is  possible  to  experiment with two-level
          approaches using mcl and its associated utility mcx. Here
          is  how, assuming your original graph is called orig.mci.

          Warning
          This approach is a  little  crude,  and  will  suffer  if
          (many) small clusters are present.

          mcl orig.mci -I 5.0 -c 3.0 -scheme 5 -o orig.i5.mco

          Cluster  it first so that you get a fine-grained cluster-
          ing.  Since orig.mci is likely a large graph, I opted for
          a high scheme.

          mcx /orig.i5.mco lm tp exch     # line continues
                      /orig.mci lm exch mul mul tp add /coarse.mci wm

          This  transforms  the  clustering+graph  into a new graph
          coarse.mci where the clusters are nodes.  You  may,  upon
          inspection,  wish to change the homogeneity of the weight
          distribution by applying  the  method  described  in  faq
          entry 4.3 - but that's something best left for optionally
          fine-tuning this method once you decide it has merits.

          mcl coarse.mci -I 2.0 -c 0.0 -scheme 5 -o coarse.mco

          Cluster the coarsened graph, and keep the loops  as  com-
          puted in the coarsening step.

          mcx /orig.i5.mco lm /coarse.mco lm mul /projected.mco wm

          Project the 'coarsened' clustering back onto the original
          graph.  Now projected.mco should be a coarse cluster  for
          orig.mci.

          There are a lot of parameters to play with here; e.g. the
          5.0,  3.0  and  2.0,  and  1.0.  These  seem   reasonable
          defaults.

                       Implementing the MCL algorithm

    5.1   How easy is it to implement the MCL algorithm?

          Very easy, if you will be doing small graphs only, say up
          to a few thousand entries at most. These  are  the  basic
          ingredients:

          o  Adding  loops  to  the  input  graph,  conversion to a
             stochastic matrix.
          o  Matrix multiplication and matrix inflation.
          o  The interpretation function mapping  MCL  limits  onto
             clusterings.

          These  must be wrapped in a program that does graph input
          and  cluster  output,  alternates  multiplication   (i.e.
          expansion)  and  inflation in a loop, monitors the matrix
          iterands thus found, quits the loop when  convergence  is
          detected, and interprets the last iterand.

          Implementing matrix multiplication is a standard exercise.
          Implementing inflation is  nearly  trivial.  The  hardest
          part may actually be the interpretation function, because
          you need to cover the corner cases of overlap and attrac-
          tor systems of cardinality greater than one.

          In Mathematica or Maple, this should be doable in at most
          50 lines of code.  For perl you may need 50 more lines  -
          note that MCL does not use intricate and expensive opera-
          tions such as matrix inversion or matrix  reductions.  In
          lower  level  languages such as C a basic MCL program may
          need a few hundred lines, but the largest part will prob-
          ably be input/output and interpretation.

          Implementing the basic MCL algorithm may even make a nice
          programming exercise. However, if you need an implementa-
          tion that scales to several hundreds of thousands of
          nodes and possibly beyond, then your  duties  become  much
          heavier. This is because one needs to prune MCL  iterands
          (i.e. matrices) such that they remain  sparse.  This  must
          be done carefully and preferably in such a way that a
          trade-off between speed, memory usage, and potential
          losses or gains in accuracy can be controlled via monitor-
          ing and logging of relevant characteristics.  Other  points
          of attention are i) support for  threading  via  pthreads,
          openMP, or some other parallel programming  API,  and  ii)
          a robust and generic interpretation function  written  in
          terms of weakly connected components.
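
          As an illustration of the ingredients listed above, here
          is a minimal, unpruned MCL sketch in Python (dense
          matrices, suitable for small graphs only; the interpre-
          tation step is simplified and does not cover every cor-
          ner case mentioned above):

```python
def mcl(adj, inflation=2.0, loops=1.0, max_iters=100, eps=1e-9):
    """Basic MCL sketch: add loops, make columns stochastic, then
    alternate expansion (matrix squaring) and inflation until the
    iterands stop changing."""
    n = len(adj)
    # add self loops and column-normalize: M[i][j] = P(step from j to i)
    M = [[adj[i][j] + (loops if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        s = sum(M[i][j] for i in range(n))
        for i in range(n):
            M[i][j] /= s
    for _ in range(max_iters):
        # expansion: M <- M * M
        E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # inflation: entrywise power followed by column renormalization
        for j in range(n):
            s = sum(E[i][j] ** inflation for i in range(n))
            for i in range(n):
                E[i][j] = E[i][j] ** inflation / s
        diff = max(abs(E[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = E
        if diff < eps:
            break
    # interpretation: attractors have a positive diagonal entry in the
    # (near-)limit; a cluster is the row support of an attractor
    clusters = []
    for i in range(n):
        if M[i][i] > 1e-3:
            members = frozenset(j for j in range(n) if M[i][j] > 1e-3)
            if members not in clusters:
                clusters.append(members)
    return clusters

# hypothetical test graph: two triangles joined by a single edge
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0
clusters = sorted(sorted(c) for c in mcl(adj))
```

          On this toy graph the sketch recovers the two triangles
          as clusters; a real implementation would add pruning,
          convergence diagnostics, and the full interpretation
          function.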

            Cluster overlap / MCL iterand cluster interpretation

    6.1   Introduction

          A   natural  mapping  exists  of  MCL  iterands  to  DAGs
          (directed acyclic graphs). This is because  MCL  iterands
          are  generally  diagonally  positive  semi-definite - see
          [3].  Such a DAG can be interpreted as a clustering, sim-
          ply  by  taking as cores all endnodes (sinks) of the DAG,
          and by attaching to each core all the  nodes  that  reach
          it.  This  procedure may result in clusterings containing
          overlap.
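
          A sketch of this interpretation procedure in Python (the
          DAG below is a hypothetical toy example): cores are the
          sinks, and each core collects every node that reaches
          it, so a node reaching two sinks ends up in overlap:

```python
def dag_clusters(nodes, edges):
    """Map a DAG to a clustering: cores are the sinks, and the
    cluster of a sink s consists of all nodes that reach s."""
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)

    def reach(n):
        # iterative DFS collecting every node reachable from n
        seen, stack = set(), [n]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(out[x])
        return seen

    sinks = [n for n in nodes if not out[n]]
    return {s: {n for n in nodes if s in reach(n)} for s in sinks}

# toy DAG: sinks are 2 and 3; node 1 reaches both
clusters = dag_clusters([0, 1, 2, 3], [(0, 2), (1, 2), (1, 3)])
```

          Here node 1 reaches both sinks, and consequently appears
          in both clusters: overlap.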

          In the MCL limit, the associated DAG  has  in  general  a
          very degenerate (no offense meant) form,  which  induces
          overlap only on very rare occasions (see faq entry  6.2).

          Interpreting mcl iterands  as  clusterings  may  well  be
          interesting. Few experiments have been done  so  far.  It
          is clear though that early iterands generally contain the
          most overlap (when interpreted as  clusterings).  Overlap
          disappears quickly as the iterand index  increases.  For
          more information, consult the other entries in this  sec-
          tion and the clmimac manual page.

    6.2   Can the clusterings returned by mcl contain overlap?

          No. Clusterings resulting from the abstract MCL algorithm
          may in theory contain overlap, but the default  behaviour
          in mcl is to remove it should it occur, by allocating the
          nodes in overlap to the first cluster in which  they  are
          seen. mcl will warn you if this occurs. This behaviour is
          switched off by supplying --overlap.

          Do note that overlap is mostly a theoretical possibility.
          It  is  conjectured that it requires the presence of very
          strong symmetries in the input graph, to the extent  that
          there  exists  an automorphism of the input graph mapping
          the overlapping part onto itself.

          It is possible  to  construct  (highly  symmetric)  input
          graphs leading to cluster overlap. Examples of overlap in
          which a few nodes are involved  are  easy  to  construct;
          examples  with  many nodes are exceptionally hard to con-
          struct.

          Clusterings  associated   with   intermediate/early   MCL
          iterands may very well contain overlap, see the introduc-
          tion in this section and other entries.

    6.3   How do I  obtain  the  clusterings  associated  with  MCL
          iterands?

          There are two options. If you are interested in  cluster-
          ings containing overlap, you should go for the second. If
          not, use the first, but be aware that even the  resulting
          clusterings may still contain overlap.

          The first solution is to use -dump cls (probably in  con-
          junction  with  either -L or -dumpi in order to limit the
          number of matrices written). This will cause mcl to write
          the  clustering  generically associated with each iterand
          to file. The -dumpstem option may be convenient as  well.

          The  second  solution  is  to  use  the  -dump ite option
          (-dumpi and -dumpstem may be of  use  again).  This  will
          cause  mcl  to  write  the intermediate iterands to file.
          After that, you can apply clmimac  (interpret  matrix  as
          clustering)  to  those  iterands.  clmimac  has  a -tight
          parameter which affects the mapping of matrices to  clus-
          terings.  It takes a value between 0 and 100 as argument.
          The default is 100 and corresponds with the  strict  map-
          ping.  Lowering the -tight value will generally result in
          clusterings containing more overlap. This will  have  the
          largest effect for early iterands; its effect will dimin-
          ish as the iterand index increases.

          When set to 0, the -tight parameter results in the  clus-
          tering associated with the full DAG of an MCL iterand  as
          described in [3]. Increasing the -tight parameter  prunes
          this DAG, thus possibly resulting in less overlap in  the
          clustering.

                                Miscellaneous

    7.1   How do I find the default settings of mcl?

          Use -z to find out the actual settings  -  it  shows  the
          settings as resulting from the command line options (e.g.
          the default settings if no other options are given).

    7.2   What's next?

          I'd like to port MCL to cluster computing, using  one  of
          the  PVM,  MPI,  or  openMP  frameworks.   For  the 1.002
          release, mcl's internals were  rewritten  to  allow  more
          general  matrix  computations.  Among other things, mcl's
          data structures and primitive  operations  are  now  more
          suited to be employed in a distributed computing environ-
          ment. However, much remains to be  done  before  mcl  can
          operate in such an environment.

          At  some  point  in the future a second, xml-based, ascii
          input format may be introduced.

          If you feel that mcl should support some  other  standard
          matrix format, let us know.

  BUGS
          This  FAQ  tries  to compromise between being concise and
          comprehensive. The collection of answers  should  prefer-
          ably  cover the universe of questions at a pleasant level
          of semantic granularity  without  too  much  overlap.  It
          should offer value to people interested in clustering but
          without sound mathematical training. Therefore,  if  this
          FAQ has not failed somewhere, it must have failed.

          Send criticism and missing questions for consideration to
          mcl-faq at micans.org.

  AUTHOR
          Stijn van Dongen.

  SEE ALSO
          mcxio(5), mcl, mclfaq, mclpipeline, mcxassemble, mcxsubs,
          mcxconvert,  mcxmap,  clmdist, clminfo, clmmeet, clmimac,
          clmresidue, clmformat.

          If enabled in  this  installation:  mclblastline,  mcxde-
          blast.

          mcl's home at http://micans.org/mcl/.

  REFERENCES
          [1]  Stijn  van  Dongen. Graph Clustering by Flow Simula-
          tion.  PhD thesis, University of Utrecht, May 2000.
          http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm

          [2]  Stijn  van  Dongen.  A cluster algorithm for graphs.
          Technical Report INS-R0010, National  Research  Institute
          for  Mathematics and Computer Science in the Netherlands,
          Amsterdam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z

          [3] Stijn van Dongen. A stochastic uncoupling process for
          graphs.   Technical  Report  INS-R0011, National Research
          Institute for Mathematics and  Computer  Science  in  the
          Netherlands, Amsterdam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z

          [4]  Stijn  van  Dongen.  Performance  criteria for graph
          clustering  and  Markov  cluster  experiments.  Technical
          Report  INS-R0012, National Research Institute for Mathe-
          matics and Computer Science in the  Netherlands,  Amster-
          dam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z

          [5]  Enright A.J., Van Dongen S., Ouzounis C.A.  An effi-
          cient algorithm for large-scale detection of protein fam-
          ilies, Nucleic Acids Research 30(7):1575-1584 (2002).

  NOTES
          This   page   was  generated  from  ZOEM  manual  macros,
          http://micans.org/zoem. Both html and roff pages  can  be
          created  from  the  same  source without having to bother
          with all the usual  conversion  problems,  while  keeping
          some level of sophistication in the typesetting.



  MCL FAQ 1.003, 03-185       4 Jul 2003                 MCL FAQ(1)
