  SylFilter - a message filter

  Copyright (C) 2011-2013 Hiroyuki Yamamoto <hiro-y@kcn.ne.jp>
  Copyright (C) 2011-2013 Sylpheed Development Team


About This Program
==================

This is SylFilter, a generic message filter library with command-line tools.
SylFilter provides a Bayesian filter, an approach widely used for spam
filtering. SylFilter is also internationalized and can be applied to any
language.

The SylFilter library provides simple but powerful C APIs and can be used
from C programs.

The SylFilter command-line tool can be used as a junk filter program, like
other major tools such as bogofilter and bsfilter.

SylFilter is free software distributed under a BSD-like license.
See COPYING for details.


Install
=======

This program requires GLib and a key-value store engine. Install them before
building. Currently SQLite (enabled by default), QDBM, and GDBM are supported
as the key-value store engine.

  $ ./configure
  ( $ ./configure --disable-sqlite --enable-qdbm (enables QDBM) )
  ( $ ./configure --disable-sqlite --enable-gdbm (enables GDBM) )

  $ make
  $ sudo make install

By default, a built-in subset of libsylph is used for message parsing.
To use a libsylph installed on your system, specify the --with-libsylph option.

  ./configure --with-libsylph=builtin     use built-in LibSylph (default)
  ./configure --with-libsylph=standalone  use standalone version of LibSylph
  ./configure --with-libsylph=sylpheed    use Sylpheed's LibSylph

If libsylph is installed in a non-standard location, also use the
--with-libsylph-dir option.


Usage
=====

SylFilter accepts rfc822 message files (for example: MH, Maildir, eml).

Learning junk mail

  $ sylfilter -j ~/Mail/junk/*

Learning clean mail

  $ sylfilter -c ~/Mail/clean/*

Classifying mail

  $ sylfilter ~/Mail/inbox/1234

Showing the learning status

  $ sylfilter -s

Showing the learning status and all learned tokens

  $ sylfilter -s -v

Showing the help message

  $ sylfilter -h
  $ sylfilter --help


Usage with Sylpheed
===================

Under 'Common preferences... - Junk mail - Learning command:', manually set
each command as follows:

Junk                : sylfilter -j
Not Junk            : sylfilter -c
Classifying command : sylfilter


Other information
=================

Token database files are created under ~/.sylfilter/.
(On Windows: %APPDATA%\SylFilter\)


Library Design
==============

Filtering in SylFilter is performed by a chain of simple filter modules.

         (Learning)                   (Classifying)

        rfc822 message                rfc822 message
              |                             |
   [ text content filter ]       [ text content filter ]
              |                             |
  [ word separator filter ]       [ blacklist filter ]  --> spam
              |                             |
      [ n-gram filter ]         [ word separator filter ]
              |                             |
     [ learning filter ]            [ n-gram filter ]
                                            |
                                   [ bayesian filter ]  --> spam
                                            |
                                         non-spam

Library users can create arbitrary combinations of the provided filters.
Users can also add their own custom filters.

Please read the source of src/sylfilter.c for library usage.
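
As an illustration of the chain idea only, the following Python sketch models
each stage of the classifying diagram as a function that either returns a
verdict or passes the message on to the next stage. None of these names come
from the SylFilter C API; they are hypothetical. The blacklist check and the
n-gram rule (character bigrams for non-ASCII words) are simplified assumptions,
and the final bayesian stage is left out.

```python
import re

# Hypothetical filter interface (NOT the SylFilter C API): a filter takes a
# shared state dict and returns "spam" to stop the chain, or None to continue.

def text_content_filter(state):
    # Stand-in for real message parsing: split headers from body at the
    # first blank line.
    headers, _, body = state["raw"].partition("\n\n")
    state["headers"] = headers
    state["text"] = body
    return None

def blacklist_filter(state):
    # Assumption: the blacklist is a set of sender addresses searched for
    # in the raw header text.
    if any(addr in state["headers"] for addr in state.get("blacklist", ())):
        return "spam"
    return None

def word_separator_filter(state):
    # Lowercase the text and split it into runs of word characters.
    state["words"] = re.findall(r"\w+", state["text"].lower())
    return None

def ngram_filter(state):
    # Assumption: ASCII words are kept whole, other words are cut into
    # overlapping character bigrams.
    tokens = []
    for word in state["words"]:
        if word.isascii() or len(word) < 2:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    state["tokens"] = tokens
    return None

def run_chain(filters, state, default="non-spam"):
    # Run filters in order; the first verdict wins, otherwise fall through.
    for f in filters:
        verdict = f(state)
        if verdict is not None:
            return verdict
    return default

CLASSIFY_CHAIN = [text_content_filter, blacklist_filter,
                  word_separator_filter, ngram_filter]
```

A caller would build a state such as {"raw": message_text, "blacklist": {...}}
and pass it to run_chain(CLASSIFY_CHAIN, state). A blacklisted sender stops the
chain before any tokens are produced, which mirrors the early "--> spam" exit
in the diagram.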


Algorithm of Bayesian Filter
============================

SylFilter implements Fisher's method as described by Gary Robinson.
The same method is also implemented by bogofilter and bsfilter.

  http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html
  http://www.bgl.nu/bogofilter/fisher.html

SylFilter initially implemented a customized version of the algorithm
described by Paul Graham.

  http://paulgraham.com/spam.html
  http://paulgraham.com/better.html

The Robinson-Fisher method is used by default.

In outline, the algorithm works as follows:

1. Count the number of occurrences of each word in spam and non-spam
   messages.
2. For each word in a message, calculate the probability that a message
   containing it is spam.
3. Calculate the combined probability using the important words in the
   message.

See the web pages above for details.
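
The three steps can be sketched in Python as below. This is a minimal sketch
of the Robinson-Fisher approach as described in the pages above, not
SylFilter's actual code. The class and function names, the per-message
counting, and the parameter values (s = 1, x = 0.5, and a minimum deviation
of 0.1 for picking "important" words) are assumptions chosen for
illustration.

```python
import math
from collections import Counter

class Corpus:
    """Step 1: per-token counts of spam and non-spam messages."""
    def __init__(self):
        self.spam, self.clean = Counter(), Counter()
        self.n_spam = self.n_clean = 0

    def learn(self, tokens, is_spam):
        # Assumption: each token is counted once per message.
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.clean.update(set(tokens))
            self.n_clean += 1

def token_prob(corpus, token, s=1.0, x=0.5):
    """Step 2: Robinson's f(w), the spam probability of one token."""
    b, g = corpus.spam[token], corpus.clean[token]
    n = b + g
    if n == 0:
        return x  # unseen token: fall back to the assumed prior x
    bad = b / max(corpus.n_spam, 1)
    good = g / max(corpus.n_clean, 1)
    p = bad / (bad + good)
    # Pull p toward the prior x with strength s when there is little data.
    return (s * x + n * p) / (s + n)

def chi2q(x2, df):
    """Upper-tail chi-square probability for an even number of degrees df."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Step 3: Fisher's method; a result near 1 suggests spam."""
    if not probs:
        return 0.5
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0

def classify(corpus, tokens, min_dev=0.1):
    # "Important" words are assumed to be those far enough from 0.5.
    probs = [token_prob(corpus, t) for t in set(tokens)]
    return combine([f for f in probs if abs(f - 0.5) >= min_dev])
```

After calling learn() on a few spam and non-spam token lists, classify()
returns a value between 0 and 1. Scores near 1 indicate spam, scores near 0
indicate non-spam, and a message containing only unseen tokens scores exactly
0.5. Where to place the cut-off thresholds is left to the caller.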
