Uniscribe: The Missing Documentation & Examples

Author: admin | Category: C++ | Published: 2013-03-14 09:52





Microsoft created an extremely powerful API called Uniscribe that allows applications to do typography of scripts that may have complex rules for transforming the input string (a list of Unicode code points) to the proper thing that should be rendered on the screen.

Unfortunately, Microsoft did not document this library very well, gave no examples, and blessed it with an extremely complex API. I have attempted to document and give examples for some aspects of the Uniscribe library that I am familiar with in the hopes that it will be useful to other developers.

This document comes from my contribution to getting Uniscribe to work in Google Chrome. You can see the production versions of the code in UniscribeHelper.h and UniscribeHelper.cpp. This document was written from memory, without referencing the Google Chrome code (to avoid leaks), before Chrome was released, so there may be bugs or typos here that are not in the production code. Chrome is the only browser other than IE to get Arabic justification using kashidas correct, and the only browser other than Safari to get extra character spacing in Hebrew correct. There are bugs, however. The main limitation is that it doesn’t handle font (face, color, style, etc.) changes in the middle of shaped words.

Note: In the examples, I use Unicode characters rather than images, and count on your browser to be able to display complex scripts properly. This may not be the case for all systems. It is at least the case for the newest versions of Firefox and IE on Windows 2000 and above.

Why should you use Uniscribe?

  • You want to be able to render right-to-left and left-to-right languages in the same layout.
  • You want to be able to handle the complex ligaturization rules of languages like Arabic. The transformations can be quite dramatic, for example: Incorrect, left-to-right layout without ligatures: ة ي د و ع س ل ا Correct, right-to-left with ligatures: السعودية (assuming your browser can render Arabic properly).
  • You want to be able to handle combining characters like Hebrew vowel points or even combining accents that can be used in Latin-based languages. For example, é can be represented by the single code point U+00E9, or by the combination of U+0065 e and U+0301 ´ (combining acute accent).
  • You want to use nice ligatures or other decoration that your font has available. Many fonts have a ligature for “fi” (compare fi and ﬁ, assuming your font has this glyph) and “fl”, and some high-end OpenType fonts have fancier ligatures for pairs like “Th” that can look very nice (for example, see the Glyph Complement Sheet for Adobe Minion Pro [pdf]).


Uniscribe works in two modes. In the more basic mode, the programmer calls ScriptStringAnalyze on their input, calls any number of other ScriptString… functions to get information about the text, ScriptStringOut to draw the string, and ScriptStringFree to free the internal data for the string.

This basic mode is not discussed here. It is slightly easier to use than the non-basic mode, and I also have no experience with it, so I have nothing to add other than the MSDN documentation for Uniscribe functions.

Instead, I document the parts of the more complex do-it-yourself API that I am familiar with. The important thing to know is that you are either in basic mode, in which case all your functions start with ScriptString…, or you are in do-it-yourself mode, where you cannot call these functions. The approximate outline for do-it-yourself mode is as follows:

  1. Call ScriptItemize on your input string. This will identify the “runs” in the string that consist of a single direction of text. Most of the rest of the Uniscribe functions operate on these runs individually, so you’ll have to keep track of them yourself.
  2. Call ScriptLayout to convert your list of runs to the order that they should appear on the screen, from left to right. This allows you to have a sequence of runs that are right to left, embedded in runs that are left to right.
  3. Call ScriptShape with the text of each run to convert it to a series of glyph indices (these are internal references into the font you selected that identify the glyphs to use). One character in the input may be composed of zero, one, or several glyphs.
  4. Call ScriptPlace with the glyph indices of each run to find out where they should be placed relative to each other. After this, you will also be able to measure the width of your run.
  5. Optional: call ScriptJustify to fill the text out to a given width.
  6. Optional: call ScriptCPtoX and ScriptXtoCP as needed to convert between character offsets and pixel positions in a run.
  7. Call ScriptTextOut to draw the placed glyphs on the screen.


I wrote this documentation and examples based on what I learned when using Uniscribe. It is likely I am incorrect about some aspects of the library, and there are surely errors in the examples, which have never been compiled. Use at your own risk! If something isn’t working the way you expect, don’t automatically assume my code is correct. If you do find errors, please email me and I’ll try to fix them.


MSDN Documentation for ScriptItemize


HRESULT ScriptItemize(
const WCHAR* pwcInChars,
int cInChars,
int cMaxItems,
const SCRIPT_CONTROL* psControl,
const SCRIPT_STATE* psState,
SCRIPT_ITEM* pItems,
int* pcItems);
pwcInChars, cInChars
Easy: the input string and its length.
cMaxItems
The number of spaces in the pItems array that Uniscribe can write into. If this number is insufficient, it will return E_OUTOFMEMORY.
It appears that Uniscribe versions before XP SP2 have a buffer overflow in this function (see Mozilla bug 366643). The workaround is to always give it a buffer one item larger than you report. The documentation mysteriously says that you must give it one more byte than you would expect, which might be a workaround for this same bug. I am unsure whether this one extra byte is sufficient to work around the problem, so I would always just give it an additional full item.
psControl, psState
Pointer to a SCRIPT_CONTROL that sets application-level preferences for the type of formatting to do, and a SCRIPT_STATE that tells Uniscribe about the surrounding context of the input. Most times, the only important thing to do is set SCRIPT_STATE.uBidiLevel to the value indicating the direction of the surrounding text so that Uniscribe can handle your additional run correctly. The rest of the values will most often be 0 for normal use.
Important: psControl must not be NULL. The MSDN documents say it can be if you don’t want to set any options, but it seems some RTL code inside Uniscribe doesn’t run if you don’t provide a SCRIPT_CONTROL pointer, and you will have some rendering bugs (for example, some punctuation will be rendered LTR even when it is inside an RTL run). Instead, provide a pointer to a structure initialized to all 0s when you don’t have any options.
pItems, pcItems
A pointer to an array of SCRIPT_ITEM structures that contain the information about the run. On success, the number of items written to the list (corresponding to each run of text) will be in *pcItems.
It seems that Uniscribe often requires many more items structures internally than it will actually return. For example, it may require an array of 10 items to not return E_OUTOFMEMORY, even though it actually returns 3 items in the end.
Uniscribe will write one more SCRIPT_ITEM structure to the array than it reports, meaning the output items will always be at least one less than the maximum number of input items you report.

Example input and output

Here is an example of an input string for which ScriptItemize returns *pcItems = 3:

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character: H e l l o (space) ا ل س ع و د ي ة !

item[0].iCharPos = 0    item[0].a.fRTL = false
item[1].iCharPos = 6    item[1].a.fRTL = true
item[2].iCharPos = 14   item[2].a.fRTL = false

There will also be a magical item[3].iCharPos = 15 so you can tell that the last run has only one character in it. The first and the last run are left-to-right, but the middle run is Arabic so it will have item[1].a.fRTL = true.

Example code
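A minimal sketch of how the extra terminating item can be used to recover the start and length of every run. It is portable C++ and does not call Uniscribe: Item is a hypothetical stand-in for the two SCRIPT_ITEM fields used in the example above, and RunsFromItems is an illustrative helper name, not part of the API. On Windows the array would be filled in by ScriptItemize.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the two SCRIPT_ITEM fields used in the example
// above (the real structure keeps the direction in SCRIPT_ITEM.a.fRTL).
struct Item {
    int iCharPos;
    bool fRTL;
};

// One run of text: where it starts in the input, how long it is, and
// whether it reads right-to-left.
struct Run {
    int start;
    int length;
    bool rtl;
};

// items must hold item_count real items followed by the one extra
// terminating item, whose iCharPos is the length of the input string.
std::vector<Run> RunsFromItems(const std::vector<Item>& items, int item_count) {
    assert(item_count >= 0);
    assert(static_cast<int>(items.size()) >= item_count + 1);
    std::vector<Run> runs;
    for (int i = 0; i < item_count; ++i) {
        // The next item (or the terminator) marks where this run ends.
        int length = items[i + 1].iCharPos - items[i].iCharPos;
        runs.push_back(Run{items[i].iCharPos, length, items[i].fRTL});
    }
    return runs;
}
```

For the example above (items at 0, 6, and 14, plus the terminator at 15), this gives runs of 6, 8, and 1 characters.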


MSDN Documentation for ScriptLayout

ScriptLayout tells you what order the runs returned by ScriptItemize should appear in on the screen. If you are only dealing with left-to-right text, then the order that the runs appear on the screen is the same order as the input text, so there is nothing that needs to be done. However, if one or more runs is right-to-left, there may be some shuffling that needs to happen to get things in the correct order. This function gives you the mapping between logical order (in your input text) and visual order (on the screen).

ScriptLayout is fairly straightforward and is documented pretty well on MSDN, so I’ll skip the documentation and go to the example. I will use these mapping arrays in other examples below.

Example input and output

Logical order from ScriptItemize: Run one, LTR | Run two, RTL | Run three, RTL | Run four, LTR
Desired screen order: Run one, LTR | Run three, RTL | Run two, RTL | Run four, LTR

Example code

Assuming we have our “input” array in items from ScriptItemize, we can construct two lookup tables that allow us to convert between logical and visual run indices:
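ScriptLayout takes one embedding level per run (SCRIPT_ITEM.a.s.uBidiLevel) and fills in the two tables. The sketch below is a portable stand-in, not Uniscribe's implementation: it builds the same kind of tables with the reordering rule from the Unicode bidirectional algorithm (rule L2: from the highest level down to the lowest odd level, reverse every contiguous sequence of runs at that level or higher). RunOrder and OrderRuns are hypothetical names, and I am assuming ScriptLayout's output agrees with this rule.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct RunOrder {
    std::vector<int> visual_to_logical;  // screen position -> input run index
    std::vector<int> logical_to_visual;  // input run index -> screen position
};

// levels[i] is the embedding level of logical run i (even = LTR, odd = RTL).
RunOrder OrderRuns(const std::vector<unsigned char>& levels) {
    const int n = static_cast<int>(levels.size());
    RunOrder order;
    order.visual_to_logical.resize(n);
    order.logical_to_visual.resize(n);
    std::vector<int>& v2l = order.visual_to_logical;
    std::iota(v2l.begin(), v2l.end(), 0);  // start in logical order

    int highest = 0;
    int lowest_odd = -1;  // -1 means there is no RTL run at all
    for (unsigned char level : levels) {
        highest = std::max(highest, static_cast<int>(level));
        if (level % 2 == 1 && (lowest_odd < 0 || level < lowest_odd)) {
            lowest_odd = level;
        }
    }

    if (lowest_odd >= 0) {
        for (int level = highest; level >= lowest_odd; --level) {
            int i = 0;
            while (i < n) {
                if (levels[v2l[i]] < level) {
                    ++i;
                    continue;
                }
                // Find the end of this sequence of runs at `level` or higher.
                int end = i;
                while (end < n && levels[v2l[end]] >= level) ++end;
                std::reverse(v2l.begin() + i, v2l.begin() + end);
                i = end;
            }
        }
    }

    // The second table is simply the inverse of the first.
    for (int visual = 0; visual < n; ++visual) {
        order.logical_to_visual[v2l[visual]] = visual;
    }
    return order;
}
```

With the levels 0, 1, 1, 0 from the example above, visual_to_logical comes out as 0, 2, 1, 3: run three is drawn before run two.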


MSDN Documentation for ScriptShape

ScriptShape computes which glyphs to use for a given run that has already been identified with ScriptItemize. With the output of this function, you can call ScriptPlace to compute how the glyphs should be arranged. You can’t do much with just the glyphs, so I treat ScriptShape and ScriptPlace as a pair of functions that are always called together.


HRESULT ScriptShape(
HDC hdc,
SCRIPT_CACHE* psc,
const WCHAR* pwcChars,
int cChars,
int cMaxGlyphs,
SCRIPT_ANALYSIS* psa,
WORD* pwOutGlyphs,
WORD* pwLogClust,
SCRIPT_VISATTR* psva,
int* pcGlyphs);
hdc, psc
An optional HDC with the desired font selected into it, plus the SCRIPT_CACHE of that font from a previous call. See MSDN’s explanation of how caching works.
pwcChars, cChars
The characters and length of the run as identified by ScriptItemize.
cMaxGlyphs
The size of the per-glyph buffers that you are giving it (the pwOutGlyphs and psva arrays). If these buffers are not big enough, the function will return E_OUTOFMEMORY and you will need to call it again with bigger buffers.
psa
A pointer to the SCRIPT_ANALYSIS structure that ScriptItemize computed for this run. This structure is in SCRIPT_ITEM.a for each run.
You can modify this SCRIPT_ANALYSIS structure after getting it from ScriptItemize but before giving it to ScriptShape to override run parameters. For example, Mozilla disables shaping by setting SCRIPT_ANALYSIS.eScript = SCRIPT_UNDEFINED if ScriptShape fails. This might be a good approach if you find that ScriptShape fails with USP_E_SCRIPT_NOT_IN_FONT and you don’t have a good alternative font to use that might support the script. If you do this, note Mozilla bug 341500. Apparently, Uniscribe occasionally crashes if you disable shaping and there is a UTF-16 surrogate pair (representing a character above U+FFFF). Mozilla’s solution is to generate a new input string with UTF-16 surrogates (search for GenerateAlternativeString in gfxWindowsFonts.cpp) replaced by U+FFFD (the Unicode “replacement character”).
pwOutGlyphs
A pointer to an array that will receive the glyph indices. This should be cMaxGlyphs long.
pwLogClust
A pointer to an array that will receive the logical cluster information. It should be the same size as the input. It will tell you, for each character in the input, the index of the first glyph in pwOutGlyphs that was generated from it.
Some input characters won’t generate a glyph or may share a glyph with another character. Other input characters will generate more than one glyph. Also be aware that for RTL runs, the indices in pwLogClust will count backwards, since the glyph indices will be put in screen order from left to right.
See the example input and output below for more information.
psva
A pointer to an array that will receive one SCRIPT_VISATTR structure for each glyph. These tell you information about each glyph, such as whether it is the first glyph in a cluster (see the example below). There are other flags that may be interesting to you (see the MSDN documentation).
pcGlyphs
On success, indicates how many glyphs were actually written to pwOutGlyphs and psva.

Example input and output

Let’s say you are processing a run consisting of the word “fiancé”, and that the font you are using maps “fi” to a single-glyph ligature, and “é” to two glyphs, one for the “e” and one for the accent.

Per-character information: The log tells you, for each input character, which glyph is the first glyph it generated.

0 1 2 3 4 5
Input pwcChars: f i a n c é
Output pwLogClust: 0 0 1 2 3 4

Per-glyph information: You can see that the fClusterStart flag is set whenever the glyph is the first glyph in a “cluster.” A cluster is something that the user would think of as one letter or logical unit. Here, most clusters correspond to one input character (the “fi” ligature covers two), but that is not necessarily the case. If a combining accent had been used for the “é” instead, the input would have been two code points, but the cluster would still have been the same as in this example.

0 1 2 3 4 5
Output pwOutGlyphs: ﬁ a n c e ´
SCRIPT_VISATTR[x].fClusterStart: 1 1 1 1 1 0
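As a rough illustration of how pwLogClust can be read, the sketch below finds the range of glyphs that belongs to the cluster of a given input character. It is portable C++ operating on a plain array; GlyphRangeForChar is a hypothetical helper name, and it only handles left-to-right runs (for RTL runs the indices count backwards, so the search would have to be mirrored).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the half-open glyph range [first, end) of the cluster that
// contains input character char_index. LTR runs only.
// log_clust has one entry per input character, as filled in by ScriptShape.
std::pair<int, int> GlyphRangeForChar(const std::vector<unsigned short>& log_clust,
                                      int glyph_count, int char_index) {
    const int char_count = static_cast<int>(log_clust.size());
    assert(char_index >= 0 && char_index < char_count);
    const int first = log_clust[char_index];
    // The cluster ends where a later character starts a new first glyph,
    // or at the end of the glyph array if there is no such character.
    int end = glyph_count;
    for (int j = char_index + 1; j < char_count; ++j) {
        if (log_clust[j] > first) {
            end = log_clust[j];
            break;
        }
    }
    return std::make_pair(first, end);
}
```

For “fiancé” above (pwLogClust 0 0 1 2 3 4 and six glyphs), both “f” and “i” map to glyph range [0, 1), and “é” maps to [4, 6).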

Example code

How to loop over the runs identified by calling ScriptItemize above.

How to turn each of those runs into a list of glyphs with ScriptShape:
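One part of calling ScriptShape that is easy to get wrong is the retry when the glyph buffers are too small. The sketch below shows only that retry pattern in portable C++. ShapeStatus, ShapeWithGrowingBuffer, and the shape callback are hypothetical stand-ins: in real code the callback would wrap ScriptShape and translate S_OK and E_OUTOFMEMORY into these values, and the psva array would be resized together with the glyph array. The starting size follows the estimate MSDN suggests for cMaxGlyphs (1.5 * cChars + 16).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical status values standing in for S_OK, E_OUTOFMEMORY, and any
// other failure HRESULT.
enum class ShapeStatus { kOk, kBufferTooSmall, kFailed };

// shape(out_glyphs, max_glyphs, out_glyph_count) is a caller-supplied
// callback that would wrap ScriptShape for one run.
template <typename ShapeFn>
bool ShapeWithGrowingBuffer(int char_count, ShapeFn shape,
                            std::vector<unsigned short>* glyphs) {
    int capacity = char_count + char_count / 2 + 16;
    // Give up after a few doublings rather than looping forever.
    for (int attempt = 0; attempt < 8; ++attempt) {
        glyphs->resize(static_cast<std::size_t>(capacity));
        int glyph_count = 0;
        ShapeStatus status = shape(glyphs->data(), capacity, &glyph_count);
        if (status == ShapeStatus::kOk) {
            // Trim the buffer to the glyphs that were actually written.
            glyphs->resize(static_cast<std::size_t>(glyph_count));
            return true;
        }
        if (status != ShapeStatus::kBufferTooSmall) return false;
        capacity *= 2;  // the buffers were too small: retry with more room
    }
    return false;
}
```

Any other failure (for example USP_E_SCRIPT_NOT_IN_FONT) is returned to the caller, which can then try another font or disable shaping as described above.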


MSDN Documentation for ScriptPlace

ScriptPlace computes the advance widths and offsets of the glyphs in a run. It is called with the output of ScriptShape. With the output of this function, you can compute the width of the run for layout purposes, and draw the run using ScriptTextOut.


HRESULT ScriptPlace(
HDC hdc,
SCRIPT_CACHE* psc,
const WORD* pwGlyphs,
int cGlyphs,
const SCRIPT_VISATTR* psva,
SCRIPT_ANALYSIS* psa,
int* piAdvance,
GOFFSET* pGoffset,
ABC* pABC);
hdc, psc
An optional HDC with the desired font selected into it, plus the SCRIPT_CACHE of that font from a previous call. MSDN actually has a pretty good explanation of how caching works.
pwGlyphs, cGlyphs, psva
These are the things computed by ScriptShape: the glyph indices, the number of glyphs, and the SCRIPT_VISATTR array corresponding to each glyph.
psa
The SCRIPT_ANALYSIS object filled in by ScriptItemize for this run. MSDN says this structure will be modified, but it is not clear in what ways.
piAdvance
Contains an array of the advance widths, one per glyph. This is the amount to advance after drawing the corresponding glyph, to get to the next glyph. Some advance widths may be 0 to cause glyphs to overlap, for example, to combine a base glyph with a combining accent (see pGoffset below). See the “Example input and output” below.
pGoffset
This is an array of GOFFSET structures, one for each glyph. The GOFFSET indicates an offset amount (horizontally GOFFSET.du and vertically GOFFSET.dv) that the associated glyph should be shifted when it is drawn. Generally, this shifting amount will be 0, but will be nonzero to move combining accents, Hebrew vowel points, or other “decorations” to the correct position relative to the base glyph. The application generally doesn’t have to pay attention to these offsets at all. They are generated by ScriptPlace and used by ScriptTextOut, and all the application needs to do is keep track of the values in the meantime. For example:
ế   =   e   +   ˆ   +   ́
In this example, there are three glyphs, the first two with an advance of 0, and the third with an advance of the width of the combination. This causes the glyphs to be drawn over the top of each other. Depending on the font, however, the position of the accents if rendered over the top of the “e” may not be correct (in this example, the top (acute) accent may not be high enough to fit over the circumflex). In this case, the GOFFSET indicates how they should be moved to produce the proper combination.
pABC
This is a pointer to one ABC structure that contains the width information for the entire run. Normally, the ABC structure tells you how to place a glyph in relation to the surrounding characters.
Here, however, I’m not sure what the use is beyond giving you an easy way to compute the width of the run. It appears that the sum of the ABC widths abc.abcA + abc.abcB + abc.abcC of the run exactly equals the sum of the advance widths of each glyph in the run. It seems that one is not supposed to treat the A and C widths (normally extra space that one needs to apply to calculations) manually, as these are included in the advance width of the first and the last glyph, respectively.
Since the application does not account for the advance itself, and the ABC width is not passed to any other Uniscribe functions, I’m not sure how ScriptTextOut knows how to place the first character in relation to the start of the run. Perhaps this is one of the “reserved” fields in the SCRIPT_VISATTR structure, or perhaps this information is stored elsewhere.

Example input and output

This example shows how “écrit” might be represented. Notice that the “e” has no advance, causing the next glyph, an accent, to be drawn over the top. The accent also has a small offset to move it into the appropriate place over the “e”. The advance for the accent takes us over the “e” and to where the “c” should begin.

Input glyph    Output advance    Output offset
e              0                 (0,0)
´              16                (1,-2)
c              16                (0,0)
r              11                (0,0)
i              8                 (0,0)
t              10                (0,0)

Example code
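A small sketch of what can be done with the arrays ScriptPlace returns: summing the advances gives the width of the run, and walking them with a running “pen” position gives the origin of each glyph. This is portable C++ with hypothetical names (Offset stands in for GOFFSET; RunWidth and GlyphOrigins are illustrative helpers). Treating du and dv as plain additions to the pen position is an assumption on my part about how ScriptTextOut applies them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for GOFFSET: du is the horizontal shift and dv the
// vertical shift of one glyph.
struct Offset {
    int du;
    int dv;
};

struct GlyphOrigin {
    int x;
    int y;
};

// The width of a run is the sum of its advance widths.
int RunWidth(const std::vector<int>& advances) {
    int width = 0;
    for (int advance : advances) width += advance;
    return width;
}

// Computes where each glyph would be drawn relative to the start of the run.
// Glyphs are assumed to be in visual (left-to-right) order.
std::vector<GlyphOrigin> GlyphOrigins(const std::vector<int>& advances,
                                      const std::vector<Offset>& offsets) {
    assert(advances.size() == offsets.size());
    std::vector<GlyphOrigin> origins;
    int pen = 0;
    for (std::size_t i = 0; i < advances.size(); ++i) {
        // The offset shifts only this glyph; it does not move the pen.
        origins.push_back(GlyphOrigin{pen + offsets[i].du, offsets[i].dv});
        pen += advances[i];
    }
    return origins;
}
```

With the “écrit” values above, the run is 61 units wide, the accent is drawn at x = 1 over the “e” at x = 0, and the “c” starts at x = 16.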


MSDN Documentation for ScriptJustify

ScriptJustify allows you to expand text to fit a column width. For most languages, justification is straightforward because one can just distribute the additional space between all the spaces in the line. Given input in English, for example, this is exactly what ScriptJustify will do. Arabic, however, is more complicated. Justification involves adding additional lines called kashidas between certain characters, and ScriptJustify will handle this properly.


ScriptJustify is therefore very good for justification of Latin or Arabic scripts. It will also be good for longer runs of Arabic that have a few Latin-based words in them. ScriptJustify also does not require that its input be only one run. As long as you can collapse your runs to form a single input array, it will distribute space appropriately between all the runs on the line. You will then have to expand the result back into arrays that correspond to your runs so you can use the rest of the Uniscribe functions.

However, because it will favor adding kashidas rather than spaces between words, if you have some text that is mostly in a Latin script but with one or two Arabic words in it, ScriptJustify probably doesn’t do what you want. It will assign all the extra space to the Arabic words in the form of kashidas, which may make them look overly extended. If you want to handle this case, you may want to write your own algorithm. For example, one approach would be to count the number of space-separated words, and distribute the amount of space you are adding to each of the runs in individual calls to ScriptJustify in proportion to the number of words they have. This will distribute space evenly between Arabic kashidas and spaces between Latin-based words.
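A minimal sketch of the proportional split described above, in portable C++. SplitExtraSpace is a hypothetical helper name, and how the words in each run are counted is left to the caller. Each returned share would be passed as iDx to a separate ScriptJustify call for that run. The running-total rounding is one possible design choice: it guarantees that the shares add up to exactly the requested amount.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `extra` units of space between runs in proportion to the number of
// words in each run. The shares always sum to `extra`.
std::vector<int> SplitExtraSpace(int extra, const std::vector<int>& words_per_run) {
    long long total_words = 0;
    for (int words : words_per_run) {
        assert(words >= 0);
        total_words += words;
    }
    std::vector<int> shares(words_per_run.size(), 0);
    if (total_words == 0) return shares;  // nothing to distribute over

    long long words_so_far = 0;
    int given = 0;
    for (std::size_t i = 0; i < words_per_run.size(); ++i) {
        words_so_far += words_per_run[i];
        // Space that should have been handed out once this run is included.
        int target = static_cast<int>(extra * words_so_far / total_words);
        shares[i] = target - given;
        given = target;
    }
    return shares;
}
```

For example, 30 extra units over runs with 4, 1, and 1 words are split as 20, 5, and 5.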


HRESULT ScriptJustify(
const SCRIPT_VISATTR* psva,
const int* piAdvance,
int cGlyphs,
int iDx,
int iMinKashida,
int* piJustify);
psva
An array of SCRIPT_VISATTR structures, one for each glyph, that was computed by ScriptShape.
piAdvance
An array of advance widths, one for each glyph, computed by ScriptPlace.
cGlyphs
The number of glyphs in the run.
iDx
The amount of space that should be added to the input. The MSDN documentation states that this is the length of the desired line, but it is incorrect. You need to measure the total of the advance widths of all the glyphs in all the runs (this is also the sum of the ABC widths of the runs), and subtract that from your desired line width to get iDx.
iMinKashida
The minimum amount of space to assign to a kashida. MSDN offers little guidance on what this is for. In talking to somebody more informed than I am, I found that “it’s the width of the glyph used for the tatweel character U+640 in the current font family and font size.” The reason it needs this is presumably that you haven’t given it an HFONT or an HDC at this point in layout, so it can’t retrieve the value itself. Michael Kaplan also wrote a blog post discussing kashidas which touches on this topic. For what it’s worth, Chrome hard-codes this to 2 and it works OK, though possibly this will cause bugs with extreme sizes.
piJustify
The output array, one entry for each input glyph, that contains the new widths that the glyphs take up. This is the old advance width plus any additional width added for justification purposes. You should use this in place of advance widths for many applications, such as measuring the width of the text or using ScriptXtoCP and ScriptCPtoX.
This value is passed into ScriptTextOut in addition to the regular advance widths. I’m guessing, but I assume that ScriptTextOut compares the justified advances with the regular advances. Any extra space is treated according to the SCRIPT_VISATTR.uJustification field associated with each glyph, be it a kashida, a space or a number of other possibilities.
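Because iDx is the amount to add rather than the target width, it has to be computed from the advances first. A minimal sketch in portable C++, with JustifyDelta as a hypothetical helper name and the advances of each run passed in as plain arrays:

```cpp
#include <cassert>
#include <vector>

// Returns the value to pass as iDx: the desired line width minus the total
// of the advance widths of all glyphs in all runs on the line.
int JustifyDelta(int desired_line_width,
                 const std::vector<std::vector<int>>& advances_per_run) {
    assert(desired_line_width >= 0);
    int current_width = 0;
    for (const std::vector<int>& advances : advances_per_run) {
        for (int advance : advances) current_width += advance;
    }
    return desired_line_width - current_width;
}
```

For two runs with advances 10, 12, 8 and 16, 16 (62 units in total) and a desired line width of 80, iDx is 18.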


MSDN Documentation for ScriptXtoCP

ScriptXtoCP converts a pixel offset to a character position given a whole lot of information computed by ScriptPlace (see above). Because ScriptXtoCP operates only on single runs, you will need to skip over whole runs yourself until you find the run with the given offset. Only then can you call this function. See the example on how to do this.

This function is not very difficult to call. You mostly just have to gather all the information you computed previously, so I’ll refer you to the MSDN documentation for this. Note that if the X position occurs before any characters, the return value will be -1.

One tricky thing is that if you called ScriptJustify to expand the glyphs, the advances returned by ScriptPlace won’t represent where the glyphs actually are and you’ll get incorrect results. In this case, you should instead pass in the piJustify array computed by ScriptJustify for the piAdvance parameter.

Example code

This example assumes your input runs are in the order returned by ScriptItemize, which is the same order as the input. However, the presence of right-to-left text will mean that these items should actually be displayed in a different order. This example uses the visual_to_logical lookup table computed in the example for ScriptLayout above.
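A sketch of the run-skipping step in portable C++. FindRunAtX and RunHit are hypothetical names, and run_widths is assumed to hold the width of each run (the sum of its advances, or of its justified advances) indexed in logical order. The result identifies which run to pass to ScriptXtoCP and the X offset within that run.

```cpp
#include <cassert>
#include <vector>

struct RunHit {
    int logical_run;  // index into the ScriptItemize items, or -1 if no run
    int x_in_run;     // x relative to the left edge of that run
};

// Walks the runs in screen order, from left to right, until it finds the
// run that contains x.
RunHit FindRunAtX(int x, const std::vector<int>& run_widths,
                  const std::vector<int>& visual_to_logical) {
    assert(run_widths.size() == visual_to_logical.size());
    int left = 0;  // left edge of the run currently being examined
    for (int logical : visual_to_logical) {
        const int width = run_widths[logical];
        if (x >= left && x < left + width) {
            return RunHit{logical, x - left};
        }
        left += width;
    }
    return RunHit{-1, 0};  // x is before the first run or past the last one
}
```

With run widths 50, 30, 40, 20 and the visual order 0, 2, 1, 3, an x of 60 falls 10 units into logical run 2, which is drawn second.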


MSDN Documentation for ScriptCPtoX

ScriptCPtoX converts character positions to offsets. Like ScriptXtoCP, the parameters, though numerous, are not very difficult to figure out, so I’ll mostly refer you to the MSDN documentation.

The key to calling this function is that it only handles one run, so you have to manually compute the advance for the runs preceding it to the left on the screen (the visual order of the runs is computed by ScriptLayout).

As with ScriptXtoCP, if you have justified the text, you should pass the justified advances returned by ScriptJustify for the advance parameter of ScriptCPtoX to get the correct results.

Example code

This example assumes you have the visual_to_logical lookup table computed in the example above for ScriptLayout.
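A sketch of the manual part in portable C++: finding the left edge of a run by adding up the widths of the runs drawn to its left. RunLeftEdge is a hypothetical helper name, and run_widths is assumed to be indexed in logical order. The offset that ScriptCPtoX returns for a character in the run would then be added to this left edge.

```cpp
#include <cassert>
#include <vector>

// Returns the x position of the left edge of logical run `logical_run`,
// or -1 if that run does not appear in the visual order.
int RunLeftEdge(int logical_run, const std::vector<int>& run_widths,
                const std::vector<int>& visual_to_logical) {
    assert(run_widths.size() == visual_to_logical.size());
    int left = 0;
    // Walk the runs in screen order, accumulating the widths of every run
    // that is drawn before the one we are looking for.
    for (int logical : visual_to_logical) {
        if (logical == logical_run) return left;
        left += run_widths[logical];
    }
    return -1;
}
```

With run widths 50, 30, 40, 20 and the visual order 0, 2, 1, 3, logical run 1 starts at x = 90 because runs 0 and 2 are drawn to its left.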

Copyright © 2007 Brett Wilson.



This article is from 王牌软件. When reposting, please credit the source and include the corresponding link.

Permalink: http://www.softwareace.cn/?p=250


