author | Corinna Vinschen <corinna@vinschen.de> | 2009-03-25 10:37:06 +0000 |
---|---|---|
committer | Corinna Vinschen <corinna@vinschen.de> | 2009-03-25 10:37:06 +0000 |
commit | f276aab75a74d66ab8388a8eab1cb902fea6dffc | |
tree | 3b74cde68c508eb72b6138a0fc4029576c9c98ed | |
parent | 4747078502ec9a9aaa7e867ef7b29acd194c2b8d | |
* new-features.sgml: Add missing GB2312 and eucKR character sets.
* pathnames.sgml: Change "DOS devices" title to "Invalid filenames"
and rephrase that section.
Add section "Filenames with unusual (foreign) characters".
Fix an emphasis.
* setup-net.sgml: Integrate setup-locale section.
* setup2.sgml: Add locale variables to section "Environment Variables".
Add section "Internationalization".
-rw-r--r-- | winsup/doc/ChangeLog | 11 |
-rw-r--r-- | winsup/doc/new-features.sgml | 5 |
-rw-r--r-- | winsup/doc/pathnames.sgml | 62 |
-rw-r--r-- | winsup/doc/setup-net.sgml | 1 |
-rw-r--r-- | winsup/doc/setup2.sgml | 284 |
5 files changed, 352 insertions, 11 deletions
diff --git a/winsup/doc/ChangeLog b/winsup/doc/ChangeLog
index a85e8168c..b49413d31 100644
--- a/winsup/doc/ChangeLog
+++ b/winsup/doc/ChangeLog
@@ -1,3 +1,14 @@
+2009-03-25 Corinna Vinschen <corinna@vinschen.de>
+
+ * new-features.sgml: Add missing GB2312 and eucKR character sets.
+ * pathnames.sgml: Change "DOS devices" title to "Invalid filenames"
+ and rephrase that section.
+ Add section "Filenames with unusual (foreign) characters".
+ Fix an emphasis.
+ * setup-net.sgml: Integrate setup-locale section.
+ * setup2.sgml: Add locale variables to section "Environment Variables".
+ Add section "Internationalization".
+
 2009-03-24 Corinna Vinschen <corinna@vinschen.de>
 
  * new-features.sgml: Add section about chaged (no)winsymlink default.
diff --git a/winsup/doc/new-features.sgml b/winsup/doc/new-features.sgml
index cae0b492e..3889efa6d 100644
--- a/winsup/doc/new-features.sgml
+++ b/winsup/doc/new-features.sgml
@@ -195,8 +195,9 @@
  in 1-16, except 12, "UTF-8", Windows codepages "CPxxx", with xxx in
  (437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125, 1250,
  1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258), "JIS", "SJIS",
- "eucJP", "Big5". The leading language and territory part (en_US) is not
- used by Cygwin yet, but is required for POSIX compatibility.
+ "GB2312", "eucJP", "eucKR", and "Big5". The leading language and territory
+ part (en_US, for instance) is not used by Cygwin yet, but is required
+ for POSIX compatibility.
 
 - Allow multiple concurrent read locks per thread for pthread_rwlock_t.
 
diff --git a/winsup/doc/pathnames.sgml b/winsup/doc/pathnames.sgml
index 97706e99a..722c98b80 100644
--- a/winsup/doc/pathnames.sgml
+++ b/winsup/doc/pathnames.sgml
@@ -311,21 +311,25 @@ to be readable by the $USER user account itself.</para>
 
 </sect2>
 
-<sect2 id="pathnames-dosdevices"><title>DOS devices</title>
+<sect2 id="pathnames-dosdevices"><title>Invalid filenames</title>
 
 <para>Filenames invalid under Win32 are not necessarily invalid
-under Cygwin since release 1.7.0. There are a couple of rules which
-apply to Windows filenames. First of all, DOS device names like
+under Cygwin since release 1.7.0. There are a few rules which
+apply to Windows filenames. Most notably, DOS device names like
 <filename>AUX</filename>, <filename>COM1</filename>,
 <filename>LPT1</filename> or <filename>PRN</filename> (to name a few)
-cannot be used in a native Win32 application, even with an
-extension (<filename>prn.txt</filename>). Cygwin can handle files with
-these names just fine.</para>
+cannot be used as filename or extension in a native Win32 application.
+So filenames like <filename>prn.txt</filename> or <filename>foo.aux</filename>
+are invalid filenames for native Win32 applications.</para>
+
+<para>This restriction doesn't apply to Cygwin applications. Cygwin
+can create and access files with such names just fine. Just don't try
+to use these files with native Win32 aqpplications...</para>
 
 </sect2>
 
 <sect2 id="pathnames-specialchars">
-<title>Special characters in filenames</title>
+<title>Forbidden characters in filenames</title>
 
 <para>Win32 filenames can't contain trailing dots and spaces for
 backward compatibility. When trying to create files with trailing dots or spaces,
@@ -346,6 +350,48 @@ are converted to special UNICODE characters in the range 0xf000 to 0xf0ff
 
 </sect2>
 
+<sect2 id="pathnames-unusual">
+<title>Filenames with unusual (foreign) characters</title>
+
+<para> Windows filesystems use the Unicode character set in the UTF-16
+encoding to store filename information. If you don't use the UTF-8
+character set (see <xref linkend="setup-locale"></xref>) then there's a
+chance that a filename is using one or more characters which have no
+representation in the character set you're using.</para>
+
+<para>For instance, there are no chinese characters in the ISO-8859-1
+character set. So, converting a filename containing a chinese character
+to ISO-8859-1 leaves you with a wrongly converted filename, for instance
+containing a question mark '?' as replacement for the unconvertable
+character. When trying to access the file, Cygwin has to convert the
+filename back to UTF-16. However, this doesn't result in the original
+filename because the question mark will not translate back to the original
+chinese character, but to a simple question mark instead. This in turn
+results in strange "File not found" messages.</para>
+
+<note><para>To avoid this scenario altogether, just use always UTF-8 as
+character set.</para></note>
+
+<para>If you don't want or can't use UTF-8 as character set for whatever
+reason, you will nevertheless be able to access the file. How does that
+work? When Cygwin converts the filename from UTF-16 to your character
+set, it recognizes characters which can't be converted. If that occurs,
+Cygwin replaces the non-convertible character with a special character
+sequence. The sequence starts with an ASCII SO character (hex code
+0x0e, equivalent Control-N), followed by the UTF-8 representation of the
+character. The result is a filename containing some ugly looking
+characters. While it doesn't <emphasis>look</emphasis> nice, it
+<emphasis>is</emphasis> nice, because Cygwin knows how to convert this
+filename back to UTF-16. The filename will be converted using your
+usual character set. However, when Cygwin recognizes an ASCII SO
+character, it skips over the ASCII SO and handles the following bytes as
+a UTF-8 character. Thus, the filename is symmetrically converted back to
+UTF-16 and you can access the file.</para>
+
+<para>Again, by using UTF-8 you can avoid this problem entirely.</para>
+
+</sect2>
+
 <sect2 id="pathnames-casesensitive">
 <title>Case sensitive filenames</title>
 
@@ -369,7 +415,7 @@ HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\obcaseinsensitive
 this registry value also on Windows NT4 and Windows 2000, which usually
 both don't know this registry key. If you want case-sensitivity on these
 systems, create that registry value and set it to 0. On these systems
-(and *only* on these systems) you don't have to reboot to bring it
+(and <emphasis role='bold'>only</emphasis> on these systems) you don't have to reboot to bring it
 into effect, rather stopping all Cygwin processes and then restarting them
 is sufficient.</para>
 
diff --git a/winsup/doc/setup-net.sgml b/winsup/doc/setup-net.sgml
index 165924d07..57c3fb185 100644
--- a/winsup/doc/setup-net.sgml
+++ b/winsup/doc/setup-net.sgml
@@ -254,6 +254,7 @@ Problems with Cygwin</ulink>.
 
 DOCTOOL-INSERT-setup-env
 DOCTOOL-INSERT-setup-maxmem
+DOCTOOL-INSERT-setup-locale
 DOCTOOL-INSERT-ntsec
 DOCTOOL-INSERT-setup-files
 </chapter>
diff --git a/winsup/doc/setup2.sgml b/winsup/doc/setup2.sgml
index 4ae4d4fd3..20718b955 100644
--- a/winsup/doc/setup2.sgml
+++ b/winsup/doc/setup2.sgml
@@ -13,13 +13,22 @@ The <envar>CYGWIN</envar> variable is used to configure many global
 settings for the Cygwin runtime system. Initially you can leave
 <envar>CYGWIN</envar> unset or set it to <literal>tty</literal> (e.g. to support
 job control with ^Z etc...) using a syntax like this in the
-DOS shell, before launching bash. </para>
+DOS shell, before launching bash.</para>
 
 <screen>
 <prompt>C:\></prompt> <userinput>set CYGWIN=tty notitle glob</userinput>
 </screen>
 
 <para>
+Locale support is controlled by the <envar>LANG</envar> and
+<envar>LC_xxx</envar> environment variables. You can set all of them
+but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
+<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
+to the POSIX standard. The first one found rules. For a more detailed
+description see <xref linkend="setup-locale"></xref>.
+</para>
+
+<para>
 The <envar>PATH</envar> environment variable is used by Cygwin
 applications as a list of directories to search for executable files
 to run. This environment variable is converted from Windows format
@@ -124,6 +133,279 @@ Run the program and it will output the maximum amount of allocatable memory.
 
 </sect1>
 
+<sect1 id="setup-locale"><title>Internationalization</title>
+
+<sect2 id="setup-locale-ov"><title>Overview</title>
+
+<para>
+Internationalization support is controlled by the <envar>LANG</envar> and
+<envar>LC_xxx</envar> environment variables. You can set all of them
+but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
+<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
+to the POSIX standard. The content of these variables should follow the
+POSIX standard for a locale specifier. The correct form of a locale
+specifier is</para>
+
+<screen>
+ language[[_TERRITORY][.charset][@modifier]]
+</screen>
+
+<para>"language" is a lowercase two character string per ISO 639-1,
+"TERRITORY" is an uppercase two character string per ISO 3166, charset is
+one of a list of supported character sets, and the modifier doesn't matter
+here (though it might for some applications). If you're interested in the
+exact description, you can find it in the online publication of the POSIX
+manual pages on the homepage of the
+<ulink url="http://www.opengroup.org/">Open Group</ulink>.</para>
+
+<para>Typical locale specifiers are</para>
+
+<screen>
+ "de_CH"        language = German, territory = Switzerland, default charset
+ "fr_FR.UTF-8"  language = french, territory = France, charset = UTF-8
+ "ko_KR.eucKR"  language = korean, territory = South Korea, charset = eucKR
+</screen>
+
+<para>
+And let's not forget the default locale called "C" or "POSIX"
+which basically only supports plain ASCII code. If the aforementioned
+environment variables are not set, or set to "C" or "POSIX", you get the
+default ASCII-only behaviour.
+</para>
+
+<para>
+Right now the language and territory content is not evaluated by Cygwin any
+further. The only important part so far is the character set. How does that
+work?
+</para>
+
+</sect2>
+
+<sect2 id="setup-locale-how"><title>How to set the locale</title>
+
+<itemizedlist mark="bullet">
+
+<listitem><para>
+The default locale is the "C" or "POSIX" locale. In this locale, basically
+only ASCII characters are supported. Even if one of the aforementioned
+environment variables are set to something else, it's the application's
+responsibility to call the function <function>setlocale</function>,
+typically like this</para>
+
+<screen>
+ setlocale (LC_ALL, "");
+</screen>
+
+<para>to switch to another locale according to the settings of the
+internationalization environment variables.
+</para></listitem>
+
+<listitem><para>
+Assuming you set one of the aforementioned environment variables to some
+valid POSIX locale value, other than "C" and "POSIX", and assuming you
+call an application which calls <function>setlocale</function> as above.</para>
+
+<para>Assuming further you're living in Japan. So you might want to use
+the language code "ja" and the territory "JP", thus setting, say,
+<envar>LANG</envar> to "ja_JP". You didn't set a character set, so
+what will Cygwin use now? Easy! It will use the default Windows ANSI
+codepage of your system, if it's supported by Cygwin. Hopefully Cygwin
+supports all relevant default ANSI codepages...</para>
+
+<note><para>For a list of supported character sets, see
+<xref linkend="setup-locale-charsetlist"></xref>
+</para></note>
+</listitem>
+
+<listitem><para>
+You don't want to use the default Windows codepage as character set?
+In that case you have to specify the charset explicitely. For instance,
+assume you're from Italy and don't want to use the default Windows codepage
+1252, but the more portable ISO-8859-15 character set. What you can do is
+to set the <envar>LANG</envar> variable in the
+<filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
+to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
+
+<screen>
+ @echo off
+
+ C:
+ chdir C:\cygwin\bin
+ set LANG=it_IT.ISO-8859-15
+ bash --login -i
+</screen>
+</listitem>
+
+<listitem><para>
+Most singlebyte or doublebyte charsets have a disadvantage. Windows
+filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters
+from the Unicode character set are available in a singlebyte or doublebyte
+charset. While Cygwin has a workaround to access files with unusual
+characters (see <xref linkend="pathnames-unusual"></xref>), a better
+workaround is to use always the UTF-8 character set. UTF-8 is the only
+multibyte character set which can represent <emphasis>every</emphasis>
+Unicode character.</para>
+
+<screen>
+ set LANG=es_MX.UTF-8
+</screen>
+
+<para>For a description of the Unicode standard, see the homepage of the
+<ulink url="http://www.unicode.org/">Unicode Consortium</ulink>.
+</para></listitem>
+
+</itemizedlist>
+
+</sect2>
+
+<sect2 id="setup-locale-problems"><title>Potential Problems</title>
+
+<para>
+You can set the above internationalization variables not only in
+<filename>Cygwin.bat</filename> or in the Windows environment, but also
+in your Cygwin shell on the fly, even switch to yet another character
+set, and yet another. In bash for instance:</para>
+
+<screen>
+ <prompt>bash$</prompt> export LC_CTYPE="nl_BE.UTF-8"
+</screen>
+
+<para>However, here's a problem. At the start of the first Cygwin process
+in a session, the Windows environment has to be converted from UTF-16 to
+some singlebyte or multibyte charset. If the internationalization environment
+variable hasn't been set <emphasis>before</emphasis> starting this process,
+Cygwin has to make an educated guess which charset to use to convert
+the environment itself. The only reproducible way to do that in the absence
+of <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, or <envar>LANG</envar>,
+is to use the current Windows ANSI codepage.</para>
+
+<para>As long as the environment only contains ASCII characters, this is
+no problem. But if it does, and you're planning to use, say, UTF-8,
+the environment will result in invalid characters in the UTF-8 charset.
+This would be especially a problem in variables like <envar>PATH</envar>.</para>
+
+<note><para>Per POSIX, the name of an environment variable should only
+consist of valid ASCII characters, and only of uppercase letters, digits, and
+the underscore for maximum portablilty.</para></note>
+
+<para>And here's another problem when switching charsets on the fly.
+Symbolic links. A symbolic link contains the filename of the target
+file the symlink points to. When a symlink is created, the current
+character set is used to store the target filename. If the target
+filename contains non-ASCII characters and you switch to another
+character set, the target filename of the symlink is now potentially
+an invalid character sequence in the new character set. This behaviour
+is not different from the behaviour in other Operating Systems. So,
+if you suddenly can't access a symlink anymore, maybe it's because you
+switched to another character set?
+</para>
+
+</sect2>
+
+<sect2 id="setup-locale-missing"><title>What does not work?</title>
+
+<para>
+Except for <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>,
+and <envar>LANG</envar>, all other LC_xxx environment variables,
+<envar>LC_COLLATE</envar>, <envar>LC_MESSAGES</envar>,
+<envar>LC_MONETARY</envar>, <envar>LC_NUMERIC</envar>,
+and <envar>LC_TIME</envar>, are ignored right now. This means, while Cygwin
+supports different character sets, it does <emphasis>not</emphasis> support
+real localization so far. There's no support for locale-specific monetary
+symbols, for a decimalpoint other than '.', no support for native time
+formats, and no support for native language sorting orders.
+</para>
+
+<para>However, internationalization is work in progress and we would be glad
+for coding help in this area.</para>
+
+</sect2>
+
+<sect2 id="setup-locale-charsetlist"><title>List of supported character sets</title>
+
+<para>Last but not least, here's the list of currently supported character
+sets. The left-hand expression is the name of the charset, as you would use
+it in the internationalization environment variables as outlined above.
+</para>
+
+<para>The right-hand side is the number of the equivalent Windows
+codepage as well as the Windows name of the codepage. They are only
+noted here for reference. Don't try to use the bare codepage number or
+the Windows name of the codepage as charset in locale specifiers, unless
+they happen to be identical with the left-hand side. Especially in case
+oif the "CPxxx" style charsets, always use them with the trailing "CP".</para>
+
+<para>This works:</para>
+
+<screen>
+ set LC_ALL=en_US.CP437
+</screen>
+
+<para>This does <emphasis>not</emphasis> work:</para>
+
+<screen>
+ set LC_ALL=en_US.437
+</screen>
+
+<para>You can find a full list of Windows codepages on the Microsoft MSDN page
+<ulink url="http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx">Code Page Identifiers</ulink>.</para>
+
+<screen>
+ Charset       Codepage
+
+ CP437         437    (OEM United States)
+ CP720         720    (DOS Arabic)
+ CP737         737    (OEM Greek)
+ CP775         775    (OEM Baltic)
+ CP850         850    (OEM Latin 1, Western European)
+ CP852         852    (OEM Latin 2, Central European)
+ CP855         855    (OEM Cyrillic)
+ CP857         857    (OEM Turkish)
+ CP858         858    (OEM Latin 1 + Euro Symbol)
+ CP862         862    (OEM Hebrew)
+ CP866         866    (OEM Russian)
+ CP874         874    (ANSI/OEM Thai)
+ CP1125        1125   (OEM Ukraine)
+ CP1250        1250   (ANSI Central European)
+ CP1251        1251   (ANSI Cyrillic)
+ CP1252        1252   (ANSI Latin 1, Western European)
+ CP1253        1253   (ANSI Greek)
+ CP1254        1254   (ANSI Turkish)
+ CP1255        1255   (ANSI Hebrew)
+ CP1256        1256   (ANSI Arabic)
+ CP1257        1257   (ANSI Baltic)
+ CP1258        1258   (ANSI/OEM Vietnamese)
+
+ ISO-8859-1    28591  (ISO-8859-1)
+ ISO-8859-2    28592  (ISO-8859-2)
+ ISO-8859-3    28593  (ISO-8859-3)
+ ISO-8859-4    28594  (ISO-8859-4)
+ ISO-8859-5    28595  (ISO-8859-5)
+ ISO-8859-6    28596  (ISO-8859-6)
+ ISO-8859-7    28597  (ISO-8859-7)
+ ISO-8859-8    28598  (ISO-8859-8)
+ ISO-8859-9    28599  (ISO-8859-9)
+ ISO-8859-10   -      (not available)
+ ISO-8859-11   -      (not available)
+ ISO-8859-13   28563  (ISO-8859-13)
+ ISO-8859-14   -      (not available)
+ ISO-8859-15   28565  (ISO-8859-15)
+ ISO-8859-16   -      (not available)
+
+ SJIS          932    (ANSI/OEM Japanese)
+ GB2312        936    (ANSI/OEM Simplified Chinese, GBK)
+ Big5          950    (ANSI/OEM Traditional Chinese)
+ JIS           50220  (ISO2022 Japanese w/o halfwidth Katakana)
+ eucJP         51932  (EUC Japanese)
+ eucKR         51949  (EUC Korean)
+
+ UTF-8         65001  (UTF-8)
+</screen>
+
+</sect2>
+
+</sect1>
+
 <sect1 id="setup-files"><title>Customizing bash</title>
 
 <para>
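
Illustration of the pathnames.sgml text added above: the new section "Filenames with unusual (foreign) characters" describes an escape that starts with an ASCII SO byte (0x0e) followed by the UTF-8 form of the character that cannot be converted. The Python below is only a minimal sketch of that documented behaviour, not Cygwin's actual implementation. The names `encode_filename`, `decode_filename`, and `utf8_len` are hypothetical, and the sketch assumes an ASCII-compatible single-byte charset such as ISO-8859-1.

```python
SO = 0x0E  # ASCII "Shift Out" (Control-N), the escape marker named in the docs


def utf8_len(lead: int) -> int:
    """Return the length of a UTF-8 sequence, judged by its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def encode_filename(name: str, charset: str) -> bytes:
    """Convert a Unicode filename to `charset`, escaping what does not fit."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not representable: emit SO followed by the UTF-8 bytes.
            out.append(SO)
            out += ch.encode("utf-8")
    return bytes(out)


def decode_filename(raw: bytes, charset: str) -> str:
    """Reverse encode_filename (assumption: `charset` is single-byte)."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == SO and i + 1 < len(raw):
            # Skip the SO byte and read one UTF-8 character after it.
            n = utf8_len(raw[i + 1])
            out.append(raw[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            out.append(raw[i:i + 1].decode(charset))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    name = "report_\u4e2d\u6587.txt"
    escaped = encode_filename(name, "iso-8859-1")
    # The escaped form converts back to the original name ...
    assert decode_filename(escaped, "iso-8859-1") == name
    # ... whereas a plain '?' replacement loses the original characters.
    lossy = name.encode("iso-8859-1", errors="replace")
    assert lossy.decode("iso-8859-1") != name
```

The sketch ignores one edge case: a filename that already contains a literal 0x0e byte would be misread as an escape on the way back. How Cygwin itself treats that case is not stated in the documentation text above.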