Avoiding Vogon Poetry

Generally speaking, SubEtha tries to modify messages as little as possible as they pass through the system. In most cases this means the receiver will see the message in the same encoding and character set that it was sent with. However, there are a few exceptions:

  • The SubEtha web interface always uses UTF-8. All message content is converted to this encoding before being rendered as HTML.
  • SubEtha allows list administrators to use plugin modules called Mail Filters to arbitrarily manipulate the message stream. When a mail filter adds text or replaces text in a header, it (almost) always uses the Java platform default encoding. For example, on a server whose platform default is UTF-8, a SubjectFilter may receive a message with the subject header encoded as Shift-JIS, prepend "[ListName] ", and save the new header encoded as UTF-8.
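The second case can be illustrated with a minimal sketch in plain Java. It is not SubEtha's actual SubjectFilter code: the class name, the `prependTag` helper, and the sample subject are all hypothetical, and ISO-8859-1 stands in for the sender's charset because every JVM is guaranteed to support it. The point is only that decoding a header, prepending text, and encoding it again produces bytes in whatever target charset is used, not in the sender's original one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SubjectReencodeSketch {
    // Hypothetical helper (not part of SubEtha): decode the raw subject with the
    // sender's charset, prepend a tag, and re-encode with the target charset.
    static byte[] prependTag(byte[] rawSubject, Charset original, Charset target, String tag) {
        String decoded = new String(rawSubject, original);
        return (tag + decoded).getBytes(target);
    }

    public static void main(String[] args) {
        // Assumption: the sender used ISO-8859-1 and the platform default is UTF-8.
        Charset senderCharset = StandardCharsets.ISO_8859_1;
        byte[] sent = "Caf\u00e9".getBytes(senderCharset);

        byte[] stored = prependTag(sent, senderCharset, StandardCharsets.UTF_8, "[ListName] ");

        // The accented character takes one byte in ISO-8859-1 but two in UTF-8,
        // so the stored header is no longer in the encoding it was sent with.
        System.out.println("sent subject bytes:   " + sent.length);
        System.out.println("stored subject bytes: " + stored.length);
        System.out.println("stored subject text:  " + new String(stored, StandardCharsets.UTF_8));
    }
}
```

The text survives the round trip, but the byte encoding changes, which is why the recommendations below aim to make the encoding SubEtha writes as portable as possible.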

If you're running all your lists in your native language, you can ignore all of these settings and everything will probably work fine. However, we recommend the following, especially if you plan to host lists in multiple languages:

  1. Set the JVM platform default character encoding to UTF-8

    This will cause any text that SubEtha creates or modifies to be encoded in the most portable form possible.

    HOW: In the JBoss start script (look in ${jboss.dir}/bin/run.conf or run.sh or run.bat), add -Dfile.encoding=UTF-8 to JAVA_OPTS, keeping any options that are already set there.

  2. Make sure your database tables are UTF-8

    While message content is stored as-is in the database as BLOBs, the Subject is also kept in a separate column for performance reasons. You'll want this to be encoded UTF-8. Fortunately most modern databases use UTF-8 by default.

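    HOW: This is, unfortunately, very specific to the database vendor. For MySQL, set the server's default character set to utf8 in your my.cnf config file (for example, with the character-set-server option in the [mysqld] section; option names vary between MySQL versions, so check the documentation for yours).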

    Note that you must set this before you run SubEtha for the first time. SubEtha creates the database schema for you, and will happily create the tables with the wrong charset. If you get this wrong, either manually change the charset for each table (MySQL: alter table [tablename] character set=utf8) or drop the database and start again (this will delete your data!).

  3. Use substitution filters sparingly

    The more you manipulate text, the more you risk replacing the original encoding of a message. In particular, avoid the Subject filter on lists that converse in multiple languages.
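
To confirm that the first recommendation took effect, you can print the JVM's default charset. The following is a minimal sketch and not part of SubEtha; the class name and the `isUtf8` helper are hypothetical. It is only meaningful if it runs with the same JAVA_OPTS as the application server, since a JVM started without those options may report a different default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Hypothetical helper: true when the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() is what String.getBytes() and similar calls
        // fall back to when no charset is passed explicitly.
        Charset platformDefault = Charset.defaultCharset();

        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset:        " + platformDefault.name());

        if (!isUtf8(platformDefault)) {
            System.out.println("Warning: default charset is not UTF-8; "
                    + "check -Dfile.encoding=UTF-8 in JAVA_OPTS.");
        }
    }
}
```

Running it as `java -Dfile.encoding=UTF-8 DefaultCharsetCheck` mirrors the setting described in step 1.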