/* ========================================================================
 * Copyright 1988-2006 University of Washington
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 *
 * ========================================================================
 */

Mailbox Format Characteristics
Mark Crispin
11 December 2006


When a mailbox storage technology uses local files and
directories directly, the file(s) and directories are laid out in a
mailbox format.

I. Flat-File Formats

In these formats, a mailbox and all the messages inside are a
single file on the filesystem. The mailbox name is the name of the
file in the filesystem, relative to the user's "mail home directory."

A flat-file format mailbox is always a file, never a directory.
This means that it is impossible to have a flat-file format mailbox
that has inferior mailbox names under it (so-called "dual-usage"
mailboxes). For some inexplicable reason, some people want this.

The mail home directory is usually the same as the user login
home directory if that concept is meaningful; otherwise, it is some
other default directory (e.g. "C:\My Documents" on Windows 98). This
can be redefined by modifying the c-client source code or in an
application via the SET_HOMEDIR mail_parameters() call.

For example, a mailbox named "project" is likely to be found in
the file "project" in the user's home directory. Similarly, a mailbox
named "test/trial1" (assuming a UNIX system) is likely to be found in
the file "trial1" in the subdirectory "test" in the user's home
directory.

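The name-to-file mapping in the example above can be sketched as
follows. This is only an illustration in Python, not c-client code:
the function name mailbox_path and the "/home/fred" directory are
hypothetical, the sketch assumes UNIX-style "/" separators, and it
deliberately refuses "INBOX" rather than guess at its special rules.

```python
from pathlib import Path, PurePosixPath


def mailbox_path(name: str, mail_home: str) -> Path:
    """Hypothetical helper: resolve a flat-file mailbox name against
    the mail home directory (UNIX-style names only)."""
    if name.upper() == "INBOX":
        # INBOX has special semantics; this sketch does not model them.
        raise ValueError("INBOX is not handled by this sketch")
    relative = PurePosixPath(name)
    if relative.is_absolute():
        # Assumption of this sketch: only names relative to the mail home.
        raise ValueError("only relative mailbox names are handled")
    return Path(mail_home).joinpath(*relative.parts)


if __name__ == "__main__":
    print(mailbox_path("project", "/home/fred"))
    print(mailbox_path("test/trial1", "/home/fred"))
```
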
Note that the name "INBOX" has special semantics and rules, as
described in the file naming.txt.

The following flat-file formats are supported by c-client as of
the time of this writing:

. unix  This is the traditional UNIX mailbox format, in use for nearly
        30 years. It uses a line starting with "From " to indicate
        start of message, and stores the message status inside the
        RFC822 message header.

        unix is not particularly efficient; the entire mailbox file
        must be read when the mailbox is opened, and when reading
        message texts it is necessary to convert the newline
        convention to Internet standard CR LF form. unix preserves
        UIDs, and allows the creation of keywords.

        Only one process may have a unix-format mailbox open
        read/write at a time.

. mmdf  This is the format used by the MMDF mailer. It uses a line
        consisting of 4 <CTRL/A> (0x01) characters to indicate start
        and end of message. Optionally, there may also be a unix
        format "From " line. It otherwise has the same
        characteristics as unix format.

. mbx   This is the current preferred mailbox format. It can be
        handled quite efficiently by c-client, without the problems
        that exist with unix and mmdf formats. Messages are stored
        in Internet standard CR LF format.

        mbx permits shared access, including shared expunge. It
        preserves UIDs, and allows the creation of keywords.

. mtx   This is supported for compatibility with the past. This is
        the old Tenex/TOPS-20 mail.txt format. It can be handled
        quite efficiently by c-client, and has most of the
        characteristics of mbx format.

        mtx is deficient in that it does not support shared expunge;
        it has no means to store UIDs; and it has no way to define
        keywords except through an external configuration file.

. tenex This is supported for compatibility with the past. This is
        the old Columbia MM format. This is similar to mtx format,
        only it uses UNIX-style bare-LF newlines instead of CR LF
        newlines, thus incurring a performance penalty for newline
        conversion.

. phile This is not strictly a format. Any file which is not in a
        recognized format is in phile format, which treats the entire
        contents of the file as a single message.


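To make the cost of the unix format concrete, the following Python
sketch splits a unix-format mailbox at its "From " lines and computes
the size of a message after conversion to CR LF. It is a simplified
illustration, not how c-client parses mailboxes: it ignores "From "
quoting inside message bodies, treats every line starting with
"From " as a separator, and assumes the status is carried in a
"Status:" header. All function names are hypothetical.

```python
import io


def split_unix_mailbox(data: bytes) -> list[bytes]:
    """Split mailbox bytes into messages at lines starting with "From ".
    The separator line itself is not part of the returned message."""
    messages: list[list[bytes]] = []
    for line in io.BytesIO(data):        # iterates on "\n" boundaries
        if line.startswith(b"From "):
            messages.append([])          # a "From " line starts a message
        elif messages:
            messages[-1].append(line)
    return [b"".join(lines) for lines in messages]


def crlf_size(message: bytes) -> int:
    """Size of the message once every newline is in CR LF form."""
    normalized = message.replace(b"\r\n", b"\n")
    return len(normalized.replace(b"\n", b"\r\n"))


def status_flags(message: bytes) -> str:
    """Value of the Status: header (assumed name), or "" if absent."""
    header = message.split(b"\n\n", 1)[0]
    for line in header.split(b"\n"):
        if line.lower().startswith(b"status:"):
            return line[len(b"status:"):].strip().decode("ascii")
    return ""
```

Note that both the message count and every message size require a
pass over the whole file, which is the inefficiency described for the
unix, mmdf and tenex formats; formats that store CR LF on disk avoid
the conversion step.
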
II. File/Message Formats

In these formats, a mailbox is a directory, and each of the
messages inside is a separate file inside the directory. The file
names of these files are generally the text form of a number, which
also matches the UID of the message.

In the case of mx, the mailbox name is the name of the directory
in the filesystem, relative to the user's "mail home directory." In
the case of news and mh, the mailbox name is in a separate namespace
as described in the file naming.txt.

A file/message format mailbox is always a directory. This means
that it is possible to have a file/message format mailbox that has
inferior mailbox names under it (so-called "dual-usage" mailboxes).
For some inexplicable reason, some people want this.

Note that the name "INBOX" has special semantics and rules, as
described in the file naming.txt.

The following file/message formats are supported by c-client as of
the time of this writing:

. mx    This is an experimental format, and may be removed in a future
        release. An mx format mailbox has a .mxindex file which holds
        the message status and unique identifiers. Messages are
        stored in Internet standard CR LF form, so the file size of
        the message file equals the size of the message.

        mx is somewhat inefficient; the entire directory must be read
        and each file stat()'d. We found it intolerable for a
        moderate-sized mailbox (2000 messages) and have more or less
        abandoned it.

. mh    This is supported for compatibility with the past. This is
        the format used by the old mh program.

        mh is very inefficient; the entire directory must be read
        and each file stat()'d, and in order to determine the size
        of a message, the entire file must be read and newline
        conversion performed.

        mh is deficient in that it does not support any permanent
        flags or keywords, and it has no means to store UIDs (because
        the mh "compress" command renames all the files).

. news  This is an export of the local filesystem's news spool, e.g.
        /var/spool/news. Access to mailboxes in news format is read
        only; however, message "deleted" status is preserved in a
        .newsrc file in the user's home directory. There is no other
        status or keywords.

        news is very inefficient; the entire directory must be
        read and each file stat()'d, and in order to determine the
        size of a message, the entire file must be read and newline
        conversion performed.

        news is deficient in that it does not support permanent flags
        other than deleted; does not support keywords; and has no
        expunge.


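The per-message costs described for these formats can be sketched in
Python as follows. This is an illustration under stated assumptions,
not c-client's implementation: message files are assumed to be named
with plain ASCII digits matching the UID, anything else in the
directory (such as an index file) is skipped, and both function names
are hypothetical.

```python
import os


def scan_message_files(directory: str) -> list[tuple[int, int]]:
    """Return (uid, size_on_disk) for each all-digit file name,
    sorted by UID. Opening the mailbox costs one directory read
    plus one stat() per message."""
    found = []
    for entry in os.scandir(directory):
        name = entry.name
        if name.isascii() and name.isdigit() and entry.is_file():
            found.append((int(name), entry.stat().st_size))
    return sorted(found)


def crlf_message_size(path: str) -> int:
    """Message size in CR LF form for a file that may use bare LF
    newlines. The whole file must be read to count the newlines."""
    with open(path, "rb") as f:
        data = f.read()
    bare_lf = data.count(b"\n") - data.count(b"\r\n")
    return len(data) + bare_lf
```

For a format that stores CR LF on disk, as described for mx, the
stat() size from the first function is already the message size. For
bare-LF storage, as described for mh and news, the second function
has to run on every message whose size is needed.
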
Soapbox on File/Message Formats

If it sounds from the above descriptions as though we're not
putting much effort into file/message formats, you are correct.

There's a general reason why file/message formats are a bad idea.
Just about every filesystem in existence serializes file creation and
deletion because these manipulate the free space map. This turns out
to be an enormous problem when you start creating/deleting more than a
few messages per second; you spend all your time thrashing in the
filesystem.

It is also extremely slow to do a text search through a
file/message format mailbox. All of those open()s and close()s really
add up to major filesystem thrashing.


What about Cyrus and Maildir?

Both formats are vulnerable to the filesystem thrashing outlined
above.

The Cyrus format used by CMU's Cyrus server (and Esys' server)
has a special associated flat file in each directory that contains
extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
about the messages. Put another way, it's a (considerably) more
featureful form of mx. It also uses certain operating system
facilities (e.g. file/memory mapping) which are not available on older
systems, at a cost of much more limited portability than c-client.
These considerably ameliorate the fundamental problems with
file/message formats; in fact, Cyrus is halfway to being a database.
Rather than support Cyrus format in c-client, you should run Cyrus or
Esys if you want that format.

The Maildir format used by qmail has all of the performance
disadvantages of mh noted above, with the additional problem that the
files are renamed in order to change their status, so you end up
having to rescan the directory frequently to locate the current names
(particularly in a shared mailbox scenario). It doesn't scale, and it
represents a support nightmare; it is therefore not supported in the
official distribution. Maildir support code for c-client is available
from third parties; but, if you use it, it is entirely at your own
risk (read: don't complain about its bugs or about how poorly it
performs).


So what does this all mean?

A database (such as used by Exchange) is really a much better
approach if you want to move away from flat files. mx and especially
Cyrus take a tentative step in that direction; mx failed mostly because
it didn't go anywhere near far enough. Cyrus goes much further, and
scores remarkable benefits from doing so.

However, a well-designed pure database without the overhead of
separate files would do even better.