2009/Aug/21, version 0.005

 

SIXPACK MEDIUM is a Unicode font, consisting of about 4000 characters

SIXPACK is a symbol font

It is a programmer’s font, for displaying the symbols programmers use

–—   names, literals, numerals, special symbols and punctuation   —–

in

specification, code and documentation

using all the six scripts of modern European languages

SIXPACK is a work-in-progress, but the recent additions of Yiddish, Armenian and Georgian mean that all modern indigenous European languages can be represented, while algebra and logic are covered fairly well, so maybe it is time to get some feedback.

The rest of this document is unlikely to interest more than a dozen people on the planet, but we might as well make it a matter of record anyway.

 

1.                              the character set               

1.a)                                    names

1.b)                                    literals

1.b) (i)                                       modern European languages

1.b) (ii)                                      scripts

1.b) (iii)                                    easy extensions

1.b) (iv)                                    insular forms

1.b) (v)                                     diacritics

1.c)                                    numerals

1.d)                                    special symbols

1.d) (i)                                       APL and Z

1.d) (ii)                                      mathematical alphanumerics

1.d) (iii)                                    shapes

1.e)                                    punctuation

1.e) (i)                                        spaces

1.e) (ii)                                      other punctuation

2.                  the letter shapes

3.                  the font file

4.                  the future

5.                  the competition

6.                  versions

0.     the name of the font

The font was originally designed on a grid based on 1/6th of an em-space, which, it was hoped, would enable it to be used with a very simple renderer, as part of a personal IDE.

Both renderer and IDE remain unfulfilled ambitions, and the grid now divides the em-space into 12 parts, but the name was retained on the slender excuse that it denoted the principal source of inspiration, when things got tough and imagination failed.

Finally, now that the current set of scripts looks something like complete, it seemed appropriate to name the font after the 6 scripts it covers. OTOH, maybe IPA can be counted as a separate script, and the font renamed SEVENPACK? So, if and when Arabic gets added, would the name need to be changed again, to EIGHTPACK?

It might not be a bad idea, because the font world contains other entities using a similar name:

Fonts.com supply a font called Sixpack Regular from Panache Typography, at http://www.fonts.com/findfonts/detail.htm?pid=418135 .

Another font, called Sergeant Sixpack (named after a cartoon character possessing an anatomical feature which surely(?) has nothing to do with the consumption of beer), is available free for personal use, from http://www.dafont.com/sergeant-sixpack.font and a number of other sites.

The FontShop and ITC package together 6 styles of a single face, under the title Sixpack, as (e.g) at http://www.fontshop.com/fonts/downloads/lucasfonts/thesans_light_sixpack/ and
http://www.itcfonts.com/fonts/detail.htm?Pid=4342140 .

MyFonts package together 6 different fonts under the same label, as at
http://www.myfonts.com/PurchaseOptions?sku[]=293086&sku[]=293085 .

None of which is in any way related to the following.

1.     the character set

Most programming languages confine themselves rigidly to 7-bit ASCII. That is no longer sufficient.

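It is no longer sufficient because literals also get restricted to 7-bit, and we end up with this sort of illegible code (here, the Russian for “hello world” written as escaped codepoints):
            \u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440  .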
Or 8-bits are permitted, but with no guarantee that the display device will know what encoding is intended.
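
Both failure modes can be reproduced in a few lines. This is only a sketch, in Python rather than any language this document is tied to; the sample phrase and the choice of KOI8-R and Windows-1251 as the two competing 8-bit encodings are assumptions made for the illustration.

```python
# Assumed sample literal: the Russian greeting "Привет мир" ("hello world").
text = "Привет мир"

# Failure mode 1: a 7-bit-only source forces the literal into escapes.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")
print(escaped)   # a run of \uXXXX escapes, unreadable to a human

# Failure mode 2: 8-bit bytes carry no record of the encoding that made them.
raw = text.encode("koi8_r")      # the writer assumed KOI8-R
misread = raw.decode("cp1251")   # the reader assumed Windows-1251
print(misread)                   # still Cyrillic letters, but the wrong ones
```

The second case is the more insidious one: no error is raised, and the result looks plausible to anyone who does not read the language.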

On the other hand, aiming to support proper display for all literals is altogether too broad a specification, given the amount of time available for this project, so the objective is restricted to modern European languages, and easy extensions thereof.

And aiming to support all the mathematical formulae that might be found in a program’s documentation is, likewise, too broad a specification. The set of special symbols grows slowly, new characters being added as the need arises, or requests are received, while formatting of mathematical equations (as in MathML) is not a font issue, and is therefore simply ignored.

1.a)     names

Most programming languages confine themselves rigidly to 7-bit ASCII, using it for names (A-Z upper- and/or lowercase, the digits 0-9, maybe an underscore or a hyphen) and as the source of special symbols.

SIXPACK includes 7-bit ASCII as a subset, as required by the Unicode standard, and therefore appears to be adequate for all valid names in all currently available programming languages.

1.b)     literals

SIXPACK aims to provide the characters necessary to enable those programming languages which support Unicode natively, to display literals as literals (and not as strings of codepoints) – provided, that is, the literal uses a modern European language.

1.b (i)     modern European languages

Europe is taken to extend north to the Pole, south to the Med to include Malta and Crete, west to include Greenland, and east to the Urals and the Caspian Sea, which includes Georgia and the Caucasus in general, Turkey, Armenia and Azerbaijan. Others may argue where Europe begins and ends, but that’s a good enough definition for this font.

Modern means the 21st Century. The font will be kept up-to-date with all official changes, such as the introduction by Germany of an uppercase ß.

It already covers the latter part of the 20th Century as well:

    —    in 1991, Azeri reverted from Cyrillic to Latin script;

    —    in the 1950s, Irish moved to the “new orthography” (see below), lenition being denoted, in all cases, by a following h, rather than a dot above;

    —    in 1928, Atatürk switched the Turkish language from Arabic script to Latin script;

    —    in 1905, the most recent use of Glagolitic was a book printed in Rome.

In the case of Azeri and Irish, both forms are provided for, but any claim to cover 20th Century European usage prior to 1928 would require this font to include Arabic characters (and Arabic typographical rules), which may happen one day, but not soon.

Arabic apart, much nineteenth century usage is provided for, and some medievalist requirements, but that’s by-the-by.

Language is undefined. No attempt is made to distinguish languages from dialects: anything in a published standard is a candidate for inclusion, but it helps if the thing is in Unicode.

SIXPACK meets the requirements of MES-2 (http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf) 
save that (i) the angle brackets provided are U+27e8 and 27e9 (not U+2329 and 232a, whose use for mathematics is now discouraged), and (ii) polytonic Greek is perhaps best provided via OpenType tables, and is therefore deferred.
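
One commonly cited reason for preferring U+27e8 and 27e9 (offered here as background, not necessarily the reasoning behind this font) is that U+2329 and 232a are canonically equivalent to the CJK angle brackets U+3008 and 3009, so any normalising software will silently replace them. A short Python sketch, using only the standard unicodedata module:

```python
import unicodedata

older_pair = "\u2329\u232a"      # LEFT- and RIGHT-POINTING ANGLE BRACKET
mathematical = "\u27e8\u27e9"    # MATHEMATICAL LEFT and RIGHT ANGLE BRACKET

# Normalisation rewrites the older pair as CJK angle brackets ...
normalised_old = unicodedata.normalize("NFC", older_pair)
# ... while the mathematical pair passes through unchanged.
normalised_new = unicodedata.normalize("NFC", mathematical)

print([f"U+{ord(c):04X}" for c in normalised_old])
print([f"U+{ord(c):04X}" for c in normalised_new])
```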

1.b (ii)     scripts

Here, a script is the set of characters used to write a language, or a group of (not necessarily related) languages.

In Europe, the big three are Latin, Greek and Cyrillic. Latin and Greek coverage includes all characters for 21st century usage, and quite a lot more. The following Unicode blocks are covered:

     0000 - 00FF Latin-1

     0100 - 017F Latin Extended-A

     0180 - 024F Latin Extended-B

     1E00 - 1EFF Latin Extended Additional

     2C60 - 2C7F Latin Extended-C

     A720 - A7FF Latin Extended-D

     0370 - 03FF Greek and Coptic

     0400 - 04FF Cyrillic

     0500 - 052F Cyrillic Supplement

     A640 - A69F Cyrillic Extended-B

Yiddish is a Germanic language, whose written form uses the Hebrew script (while Maltese, by way of contrast, is a Semitic language whose written form uses the Latin script). It is not clear how far this font will meet the needs of Yiddish speakers, but it does provide glyphs for all the characters in the Mimer SQL Yiddish Collation Chart ( http://developer.mimer.com/charts/yiddish.htm ).

The modern Hebrew language should be an easy extension, but the matter of vowels needs re-examining, and cantillation marks are only a distant possibility.

Including Caucasian languages raises the need for two extra scripts: Armenian and the Georgian Mkhedruli script have now been added. The following Unicode blocks are covered:

     0530 - 058F Armenian

     0590 - 05FF Hebrew (in part)

     10A0 - 10FF Georgian (in part)

The International Phonetic Alphabet is used for the written representation of the spoken form(s) of a language. IPA differs from Latin in that it is unicameral, or caseless, and it is therefore sometimes convenient to consider it as a distinct script.

SIXPACK provides a complete set of characters for the current IPA standard, and will incorporate any additions as necessary.

A full set of glyphs is also provided for that overblown gallimaufry, the Uralic Phonetic Alphabet and its extensions. Teuthonista (used for German dialects) promises to be even more difficult, but the layout rules are still a matter for discussion.

The following Unicode blocks are covered:

     0250 - 02AF IPA Extensions

     1D00 - 1D7F Phonetic Extensions

     1D80 - 1DBF Phonetic Extensions Supplement
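
The block lists in this section lend themselves to a simple lookup. The sketch below (Python; the names BLOCKS and block_of are invented for the illustration) reports which listed block, if any, contains a given character. Note the caveat: falling inside a listed block does not guarantee a glyph, particularly for the blocks marked “in part”.

```python
# Ranges copied from the block lists in this section.
BLOCKS = [
    (0x0000, 0x00FF, "Latin-1"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew (in part)"),
    (0x10A0, 0x10FF, "Georgian (in part)"),
    (0x1D00, 0x1D7F, "Phonetic Extensions"),
    (0x1D80, 0x1DBF, "Phonetic Extensions Supplement"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA640, 0xA69F, "Cyrillic Extended-B"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None

for sample in "aжəა":
    print(f"U+{ord(sample):04X}", block_of(sample))
```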

1.b (iii)     easy extensions

The character set is not necessarily limited to modern European usage. When the addition of just a handful of characters to the existing character set will enable the font to cover other languages, then these additional characters will almost always be added. That is the policy of allowing easy extensions.

It sometimes happens that, when previously unwritten languages are reduced to a written form using Latin script, IPA symbols are used to augment the familiar A-to-Z — leading to a need for corresponding uppercase forms.

Even without extensions, the character set for Modern European languages will cover the vast majority of usage in North and South America, and Australasia.

Easy extensions to the Latin script enable it to cover the written forms of much of sub-Saharan Africa, and parts of Central and Southern Asia, while easy extensions to the Cyrillic script cover large parts of Central Asia.

Also, where it is possible to provide for liturgical, ecclesiastical or religious uses by easy extensions, this has been done, including the provision of Romanised Pali for Buddhists.

Coptic and Georgian Khutsuri (asomtavruli and nuskhuri) are not considered easy extensions.

1.b (iv)     insular forms

The insulae in question here are the British Isles, and insular forms describes the letter shapes used for writing Celtic languages and Anglo-Saxon, but above all, for copying the Gospels in Latin. The style, developed from half-uncial, can be seen in the Durham Gospel (ca. 670 AD), but undoubtedly originated sometime earlier in Ireland, and remains in use today.

(This is sometimes described as “insular script”, where “script” refers to the style of the letter shapes. The uncial or half-uncial letter forms differ from roman letter forms, as do italic letter forms—the three styles remaining, nonetheless, representations of the Latin script.)

20th century Irish can be written using uncial forms or roman forms, while lenition (basically, a “softening” or “weakening” of the sound) can be represented by a dot above, or by a following letter h, or a mixture of the two. All four combinations can be found, although the dot above is more commonly used with uncial forms, the dots all being positioned at the same height above the base line. This neat horizontal alignment is not so easy with roman letter forms, and is not even attempted in this font, but the necessary characters are provided, all the same.

The insular forms included within Unicode (U+1d79, U+a779~a787) are all provided, but they are really only useful where it is required to distinguish an insular form from a Roman form in plaintext—not a common requirement in a programming environment.

A similar comment applies to the letter i, which commonly loses its dot (or tittle) in uncial forms. With or without a dot, the letter is still U+0069, LATIN SMALL LETTER I, not the U+0131 LATIN SMALL LETTER DOTLESS I, found in Azeri and Turkish. With insular forms, the presence or absence of the dot is a feature of the font employed, not of the character portrayed.
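
For programmers, the distinction between the two characters can be checked directly. A minimal Python sketch, using the standard unicodedata module and Python's default, language-neutral case mapping (a Turkish-aware mapping would behave differently):

```python
import unicodedata

dotted = "\u0069"    # LATIN SMALL LETTER I
dotless = "\u0131"   # LATIN SMALL LETTER DOTLESS I

names = {ch: unicodedata.name(ch) for ch in (dotted, dotless)}

# Under the default case mapping both letters share the uppercase form I,
# so the distinction survives only in the lowercase codepoints.
same_upper = dotted.upper() == dotless.upper()

print(names[dotted], "/", names[dotless], "/ same uppercase:", same_upper)
```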

1.c)     numerals

Numbers are an abstract concept, numerals are their written forms. Numerals are constructed from digits, just as words are constructed from letters. The following numerals all represent the same quantity:
            1100    in binary
            14        in octal
            12        in decimal
            0x0C    in hexadecimal
 and      XII       in Roman numerals.
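
The equivalence of the five numerals above can be confirmed mechanically. In this Python sketch the built-in int() handles the positional notations; roman_to_int is a small helper written for the illustration, not a library function.

```python
def roman_to_int(numeral):
    """Convert a Roman numeral (uppercase ASCII letters) to an integer."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    # Pair each digit with its successor; a blank marks the end of the numeral.
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        v = values[ch]
        # Subtractive notation: a smaller digit before a larger one is subtracted.
        total += -v if values.get(nxt, 0) > v else v
    return total

quantities = [
    int("1100", 2),       # binary
    int("14", 8),         # octal
    int("12", 10),        # decimal
    int("0x0C", 16),      # hexadecimal (the 0x prefix is accepted for base 16)
    roman_to_int("XII"),  # Roman numerals
]
print(quantities)
```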

Neat columnar tabulation requires all digits to be the same width. Not only are the decimal digits, U+2007, the figure space (of course), and the nut fractions, all the same width, but so too are old-style figures, and the hexadecimal digits A-F. (The latter are available at codepoints U+FF21~FF26, a truly reprehensible misuse of fullwidth forms, which will need correcting before Japanese is added to the list of scripts. Old-style figures cannot be accessed until the font has been equipped with the necessary OpenType tables). Piece fractions are untested, but will not, in general, be the same width as the case fractions.

Complex numbers may be written with their real and imaginary parts separated by i or j or U+0131, the dotless i, or U+1d6a4, the italic dotless i, or U+1d6a5, the italic dotless j – according to choice – but they will not necessarily align vertically with the digits.

Specifically Roman numerals (found in the block U+2150~218f Number Forms) are not currently provided.

1.d)     special symbols

In programming languages, each of the strings ‘*’, ‘×’, ‘{times}’ and ‘multiply … by …’ is known as a “symbol”. Each denotes a single semantic entity (in this case, the same semantic entity). We shall avoid the unqualified use of the term symbol.

In this context, special symbol means pretty well everything that isn’t obviously a letter from the written form of a natural language, a digit or punctuation.

SIXPACK aims to provide all the special symbols currently used in computing science, whether practical, theoretical or not yet implemented.

As with names, most programming languages confine themselves rigidly to 7-bit ASCII for their special symbols (using ‘*’ to denote multiplication, and ‘^’ to denote exponentiation, for example), but there are a few additional symbols required for APL and Z, relational algebra and functional programming. Theoretical computer science builds on the foundations of mathematics, and so a number of logic symbols are also included.

SIXPACK is intended not only for coding, but also for documentation, which may include a conventional mathematical representation of a process being computerised. This will most commonly be an algebraic process, or a discrete approximation to a continuous process, so the symbols provided are mostly algebraic, with only the commoner symbols from analysis.

Increasing use of symbolic manipulation, as in computational algebra, means this set will inevitably need expanding.

1.d (i)     APL and Z

Glyphs are provided for all the codepoints stipulated in the respective repertoires for APL ( http://www.dkuug.dk/jtc1/sc22/open/n3067.pdf ) and Z (Appendix A of ISO/IEC 13568:2002(E), from http://standards.iso.org/ittf/PubliclyAvailableStandards/c021573_ISO_IEC_13568_2002(E).zip ).

Both repertoires include circles. One side-effect of including a set of shapes to UTR 25’s specifications (v.i), is that the circles provided in SIXPACK are not the same size as the sample glyphs given in the repertoires for APL and Z. One day, it may be possible to alter the display according to context, using something like language tags, but right now, it is not possible to reconcile these differences.

For APL applications, Arial Unicode MS will suffice; APL385 ( http://www.vector.org.uk/resource/ ) is popular within the APL community, and has the added merits of (i) providing identical glyphs at duplicate codepoints, to smooth out minor differences between vendors, and (ii) ensuring that U+233e, the circle-jot, is consistent with U+25cb, the APL circle and U+2218, the APL jot.

Zed.ttf (obtainable from http://www.cs.kent.ac.uk/people/staff/rej/Zedfont/latest/ or http://fonts.goldenweb.it/pan_file/l/en/font2/Zed.ttf/d2/Freeware_fonts/c/z/default.html) is recognised by a number of Z tools, but it is not a Unicode font; CZTSans.ttf (http://www.cs.waikato.ac.nz/~marku/czt/eclipse.html ), as used by the CZT Eclipse Plugin, is. (The dividing issue seems to be whether you are working to the Z Standard, or the Z Reference Manual.)

1.d (ii)     mathematical alphanumerics

Mathematical Alphanumeric Symbols may be found in the block U+1d400 to 1d7ff. They are to be used only where the format (italic, bold, double-struck, &c) of a letter carries semantic significance.

Apart from the dotless-i and the dotless-j mentioned above, only uppercase doublestruck A-Z are currently provided in this version of the font.
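
One wrinkle worth knowing when addressing the doublestruck capitals from code: Unicode does not encode all 26 in the Mathematical Alphanumeric Symbols block. Seven of them (C, H, N, P, Q, R and Z) were encoded earlier, in Letterlike Symbols, and their slots in the newer block are left unassigned. The Python sketch below looks each letter up by its Unicode character name; the function name double_struck is invented for the illustration, and which codepoints the font actually populates is not something this sketch can verify.

```python
import string
import unicodedata

def double_struck(letter):
    """Return the Unicode double-struck capital for an ASCII capital letter."""
    try:
        return unicodedata.lookup(f"MATHEMATICAL DOUBLE-STRUCK CAPITAL {letter}")
    except KeyError:
        # C, H, N, P, Q, R and Z live in the Letterlike Symbols block.
        return unicodedata.lookup(f"DOUBLE-STRUCK CAPITAL {letter}")

alphabet = "".join(double_struck(c) for c in string.ascii_uppercase)
print([f"U+{ord(c):04X}" for c in alphabet])
```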

1.d (iii)     shapes

UTR 25, Unicode Support for Mathematics ( http://www.unicode.org/reports/tr25/ ), identifies a number of existing characters as being part of a sequence of circles of increasing size. Additional conditions – the need for the largest to enclose uppercase letters, the fact that the second-largest must by implication be about caps height, and the requirement that (some of) the larger circles should be able to enclose (some of) the smaller circles – define the sizes of the circles with very little latitude.

Similar (less complete) sequences of other simple shapes are defined therefrom, via an undefined requirement that dissimilar shapes should have the “same visual impact”.

SIXPACK is the first, and currently the only, font to include a set of shapes meeting all the requirements of UTR 25.

The current requirements, however, are not without their problems (some of the problems described in Shapes I and Shapes II persist). One issue is that, for instance, the glyphs for U+2B25~U+2B28 MEDIUM DIAMONDs and MEDIUM LOZENGEs are shown in Table 2.5 of UTR 25 in the column headed “medium small”. According to UTR 25, “the intended sizes of existing characters and their names as shown in the code charts are not always consistent”, but there seems little point in naming a character “medium” and then pretending it is “medium small”. In this font, if a character’s name, or its notes, define it as “medium” sized, then the glyph provided within this font is “medium” sized.

If “medium small” diamonds and lozenges are required, they will have to be provided by some other mechanism.

Another issue is the (probably unintended) change to the appearance of APL and Z. The Z repertoire specifies the use of U+2218 RING OPERATOR for function composition, which is perfectly sensible given that the comment for U+2218 says “= composite function”. The same codepoint is used for the “APL jot”, with rather different semantics. That may have been a good idea once (before version 1 of UTR 25), but Table 2.5 of UTR 25 now shows U+2218 as a “very small” circle, which is certainly not the letterform used for function composition in any of the recent texts consulted.

Similar comments apply to APL’s use of a diamond.

APL and Z aficionados will presumably continue to use their own, specialised, fonts, untroubled by the unexplained reduction in size of U+2218. Mathematicians seeking a suitable symbol for functional composition may opt for U+26ac MEDIUM SMALL WHITE CIRCLE if they find it more visually appealing, so long as they don’t need MathML’s parsing facilities.

1.e)     punctuation

1.e (i)     spaces

Unicode defines quite a lot of space characters, and all those which fall within our “modern European” remit are included.

Much software does not recognise the more esoteric spaces, and programs that do recognise them do not always respect the advance widths defined within a font: for instance, Winword, up to and including 2003, uses its own advance width for U+2002.

The em quad, U+2001, is a square whose height and width are both equal to the nominal point size of the font. The en quad, U+2000, is half that width. There is no obvious visual relationship between these areas of white space, and the letters M and N.

The em space, U+2003, on the other hand, has precisely the same advance width as the letter M. The en space, U+2002, is half that width, the thick space, U+2004, one third, the mid space, U+2005, one quarter, and the thin space, U+2009, one sixth. The hair space, U+200a, is half the width of a thin space, and thus one-twelfth the width of an em space. Apart from the quads mentioned above, all characters have an advance width which is an integer multiple of the width of a hair space. This allows low-tech rendering machines to achieve precise vertical alignment (give or take a rounding error or two) by concatenating spaces, and thus permits the piece-wise assembly of multi-storey symbols.
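
The arithmetic behind this can be sketched as follows (Python; the names SPACE_UNITS, pad and width_of are invented for the illustration). The widths are this font's design values as stated above, expressed in hair-space units, and are not a general Unicode rule. Given a target width in those units, a greedy fill chooses the spaces to concatenate.

```python
# Advance widths in hair-space units (em space = 12), widest first.
SPACE_UNITS = [
    ("\u2003", 12),  # em space
    ("\u2002", 6),   # en space
    ("\u2004", 4),   # thick space
    ("\u2005", 3),   # mid space
    ("\u2009", 2),   # thin space
    ("\u200a", 1),   # hair space
]

def pad(units):
    """Return a run of spaces whose advance widths total `units` hair spaces."""
    out = []
    for ch, width in SPACE_UNITS:
        count, units = divmod(units, width)  # take as many of this width as fit
        out.append(ch * count)
    return "".join(out)

def width_of(spaces):
    """Total advance width, in hair-space units, of a run of the spaces above."""
    lookup = dict(SPACE_UNITS)
    return sum(lookup[ch] for ch in spaces)

print([f"U+{ord(c):04X}" for c in pad(7)])
```

Because the hair space is worth exactly one unit, the greedy fill can always finish, whatever the target width.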

MathML, with monolithic multi-storey glyphs, will give a superior result, but this is not always available.

Precise vertical alignment of the numbers in a table is facilitated by U+2007, the figure space, and U+2008, a punctuation space whose advance width matches both the comma and the full-stop.

U+0020, the ASCII space, and U+00a0, the non-breaking space, have the same widths as the en space. That is to say, the advance width within the font has the same value as the advance width of the en space – some software may see things differently.

U+1680 OGHAM SPACE MARK, U+180e MONGOLIAN VOWEL SEPARATOR and U+3000 IDEOGRAPHIC SPACE are not included.

1.e (ii)     other punctuation

The usual dots, commas, colons and semicolons are all there, plus sundry delimiters and fences.

Delimiters come in pairs: usually opening and closing glyphs, sometimes left and right, often mirror images of each other. Brackets, braces and parentheses are the archetypical delimiters, but the set includes quotation marks, question and exclamation marks, among others. Corners, of course, come in sets of four.

Delimiters may be nested, and they may enclose more than one line of text, so delimiters need to be able to vary in height. Brackets and parentheses can be built from top, centre and bottom pieces. These need to be vertically aligned carefully, but it does give unlimited height. Braces are not quite so well served. Top, centre and bottom pieces can be used to build braces 3-lines high and, with the aid of extension pieces, 5‑ and 7-lines high. Braces 2-lines and 4-lines high require special treatment, and the necessary pieces are provided in the PUA. With the aid of extension pieces, these can be extended to 6- or more lines high.
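
The assembly rules can be sketched in Python. The codepoints used are the delimiter pieces from Unicode's Miscellaneous Technical block; that these are the pieces meant above is an assumption, since the codepoints are not named here, and the PUA pieces for even-height braces are deliberately left out because their codepoints are not given. The function names are invented for the illustration.

```python
# Left-hand delimiter pieces (assumed codepoints, Miscellaneous Technical block).
BRACKET = ("\u23a1", "\u23a2", "\u23a3")  # upper corner, extension, lower corner
PAREN = ("\u239b", "\u239c", "\u239d")    # upper hook, extension, lower hook
BRACE_TOP, BRACE_MID, BRACE_BOTTOM, BRACE_EXT = "\u23a7", "\u23a8", "\u23a9", "\u23aa"

def tall_fence(pieces, lines):
    """Bracket or parenthesis of any height >= 2, one piece per line."""
    if lines < 2:
        raise ValueError("need at least 2 lines")
    top, ext, bottom = pieces
    return [top] + [ext] * (lines - 2) + [bottom]

def tall_brace(lines):
    """Left brace of odd height >= 3: the middle piece sits on the centre line."""
    if lines < 3 or lines % 2 == 0:
        raise ValueError("even heights need special pieces not modelled here")
    half = (lines - 3) // 2
    return ([BRACE_TOP] + [BRACE_EXT] * half + [BRACE_MID]
            + [BRACE_EXT] * half + [BRACE_BOTTOM])

print("\n".join(tall_brace(5)))
```

The odd-height restriction on braces falls straight out of the geometry: the pointed middle piece needs an equal number of lines above and below it.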

Some other characters are defined singly, but commonly used in pairs — U+007c, U+2016 and U+2980.

These glyphs are centred on the same horizontal as brackets, braces and parentheses, and they have equal left and right bearings, so it’s up to the user to add any space felt necessary to distinguish left and right. There is no provision for glyphs more than 1 line high.

Yet other glyphs, such as summation (U+23b2 and U+23b3) and integral (U+2320, U+23ae and U+2321), may span more than one line, but lack a closing delimiter. In the event of ambiguity, when scope is not obvious from the layout, or cannot be derived by precedence rules, parentheses will be necessary.

Technology intended for text processing is not entirely satisfactory for box drawing. A subset of the characters in the block U+2500~U+257f is provided, but the problems of vertical alignment will be felt more acutely. In anticipation of possible applications, the box drawing characters have the same advance width as the digits, but for boxed text, the horizontals will inevitably be too long or too short. All in all, a low-tech solution, for simple cases only.

2.     the letter shapes

“Form follows function”, to quote Louis Sullivan. Granted, he was talking about rather more concrete objects, but the principle applies here.

Hitherto, programmers’ fonts have been bit-mapped 256 character fonts, usually monospaced, intended for 25‑by-80 displays of green-on-black. A modern programmer’s font needs to be able to display an extended character set as distinguishable scaled glyphs, at low resolutions.

Unlike fixed spacing horizontally, fixed vertical spacing is very helpful. The space within the bounding box has been allocated so that, without breaking the box, there is space for a single diacritical marking above all uppercase letters, and space for another below.  Likewise, there is space above lowercase letters with an ascender, and below lowercase letters without descenders. Lowercase letters without ascenders (which includes all vowels) can accommodate two diacritical markings above. When it is necessary to stack two marks above an uppercase letter (as in Vietnamese), a smallcap is used – not everybody’s taste in typography, but the first priority is to provide the programmer with an identifiable glyph.

In normal text, it is possible to identify unclear characters, using context. In a program, it is often necessary to manipulate individual letters, or small groups of letters, while (e.g) constructing a prompt where adjectives, nouns and verbs must agree in number, gender and case.

Letter shapes need to be unambiguously identifiable, even out of context. This means even the letter i needs a certain minimum width, while punctuation and diacritical markings are heavier than usual. (For further justification, see http://www.dclab.com/dclnews0802.asp#ASTORY1 .)

The letter shapes in SIXPACK are geometrical, showing no contrast. This is not a display font: quirky features which succeed well in display fonts can get very irritating when reading and rereading a chunk of code, trying to locate a bug. Not only can subtle curves not be seen at small sizes on low-res equipment, they may make matters worse, by moving part of a supposed vertical one pixel to the left or the right, to deleterious effect.

The result is legible when this font is printed at 8pt on a 300dpi inkjet. It is still legible at 6pt, but some unevenness is apparent in the stroke widths.

If we make the simplifying assumption that the screen has 72dpi, then (unzoomed, in physical inches) the body height of a 12pt character is precisely 12 pixels, leaving approximately 5 pixels for the x-height. This is just about sufficient for the letter s and the 2-storey a. At 10pt, the lowercase counters disappear – continuous text is still intelligible, but not sufficiently legible for code listings.
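
The pixel arithmetic is simple enough to sketch in Python. One point is 1/72 inch, so body height in pixels is point size × dpi / 72. The x-height ratio of 5/12 used below is inferred from the “approximately 5 pixels at 12pt” figure above; it is not a published metric of the font, and the function names are invented for the illustration.

```python
def body_pixels(point_size, dpi=72):
    """Body height in pixels, given that one point is 1/72 inch."""
    return point_size * dpi / 72

# Assumed ratio of x-height to body height (inferred, see above).
X_HEIGHT_RATIO = 5 / 12

def x_height_pixels(point_size, dpi=72):
    """Approximate x-height in pixels under the assumed ratio."""
    return body_pixels(point_size, dpi) * X_HEIGHT_RATIO

for size, dpi in [(12, 72), (10, 72), (8, 300), (6, 300)]:
    print(size, "pt at", dpi, "dpi:",
          round(body_pixels(size, dpi), 1), "px body,",
          round(x_height_pixels(size, dpi), 1), "px x-height")
```

Under these assumptions a 10pt character on screen has an x-height of only about 4 pixels, which squares with the loss of the lowercase counters, while 8pt on a 300dpi printer has roughly 14 pixels to work with.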

3.     the font file

The font file contains glyph outlines, encoding details and not much else.

The outlines all have carefully positioned T1 hinting, which helps control horizontal and vertical stems. TT hinting would allow diagonal hinting, triple hinting and hinting of white space, but that would involve a lot more time and effort. If it does get done, that version of the font will not be available gratis.

There are no kerning pairs, no OT tables and no anchors.

(It wasn't supposed to be that way. A moment's inattention destroyed much of the necessary information, while a week's blissful ignorance of the damage meant any useful backups got overwritten. These tables will be restored in due course.)

4.     the future

i)       some stylistic unity between the scripts would be nice, but there are difficulties -- apart from the differences between the scripts themselves, mistakes are discovered, skills improve, tastes change, technology advances -- so this may be a forlorn hope;

ii)      add in some TT hinting (but not delta-hinting and the like, which may no longer be appropriate);

iii)     define anchors, so that base characters and their diacritical markings may be assembled automatically;

iv)      provide smallcaps for all text characters (for caseless scripts, this will mean scaling ascenders down to x-height);

v)       provide OT tables, so that (e.g) OSF, alternative fractions and smallcaps may be accessed by suitable software;

vi)      include pre-21st century usage: chronologically, 20th and 19th century usage ought to be implemented first, but extensions for mediæval Latin have already been added, out of sequence;

vii)     consider user requests.

Some of this may prove unnecessary as software and rendering machines improve.

5.     the competition

currently under review

6.     versions

2009 Aug 21          0.005   U+2200~22FF Mathematical Operators completed

2009 Jun 24           0.004   Armenian and Georgian Mkhedruli added

2009 May 20         0.002   early draft

2009 Jan 20           0.001   initial draft

 

 

----------------------------------