
unicode



DESCRIPTION

       The international standard ISO 10646 defines the Universal
       Character Set (UCS).  UCS contains all characters  of  all
       other  character  set standards. It also guarantees round-
       trip compatibility, i.e., conversion tables can  be  built
       such  that  no  information  is lost when a string is con­
       verted from any other encoding to UCS and back.

       UCS contains the characters required to represent  practi­
       cally  all  known  languages.  This  includes not only the
       Latin, Greek,  Cyrillic,  Hebrew,  Arabic,  Armenian,  and
       Georgian  scripts,  but  also  Chinese,  Japanese,  and
       Korean Han ideographs as well as scripts such as Hiragana,
       Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
       Oriya,  Tamil,  Telugu,  Kannada,  Malayalam,  Thai,  Lao,
       Khmer,  Bopomofo,  Tibetan, Runic, Ethiopic, Canadian Syl­
       labics,  Cherokee,  Mongolian,  Ogham,  Myanmar,  Sinhala,
       Thaana,  Yi,  and  others.  For  scripts  not yet covered,
       research on how to best encode them for computer usage  is
       still  going  on  and  they will be added eventually. This
       might eventually include not only Hieroglyphs and  various
       historic  Indo-European  languages, but even some selected
       artistic scripts such as Tengwar, Cirth, and Klingon.  UCS
       also  covers  a  large number of graphical, typographical,
       mathematical and scientific symbols, including those  pro­
       vided  by TeX, Postscript, APL, MS-DOS, MS-Windows, Macin­
       tosh, OCR fonts, as well as many word processing and  pub­
       lishing systems, and more are being added.

       The  UCS standard (ISO 10646) describes a 31-bit character
       set architecture consisting of  128  24-bit  groups,  each
       divided  into  256 16-bit planes made up of 256 8-bit rows
       with 256 column positions, one for each character. Part  1
       of the standard (ISO 10646-1) defines the first 65534 code
       positions (0x0000 to 0xfffd), which form the Basic  Multi­
       lingual Plane (BMP), that is plane 0 in group 0. Part 2 of
       the standard (ISO 10646-2) adds characters to group 0 out­
       side  the BMP in several supplementary planes in the range
       0x10000 to 0x10ffff. There are no plans to add characters
       beyond 0x10ffff to the standard; therefore, of the entire
       code space, only a small fraction of group 0 will ever be
       actually used in the foreseeable future. The BMP contains
       all characters found in the other commonly used character
       sets. The supplemental planes added by ISO 10646-2 cover
       only more exotic characters for special  scientific,  dic­
       tionary printing, publishing industry, higher-level proto­
       col and enthusiast needs.
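
       The group/plane/row/column layout described above can be
       sketched as plain bit arithmetic. The following is a minimal
       illustration, not part of the standard; the function names
       ucs_position and in_bmp are hypothetical and chosen only for
       this example.

```python
def ucs_position(code_point):
    """Split a 31-bit UCS code value into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS code values are limited to 31 bits")
    group = code_point >> 24            # 7 bits: one of 128 groups
    plane = (code_point >> 16) & 0xFF   # one of 256 planes in the group
    row = (code_point >> 8) & 0xFF      # one of 256 rows in the plane
    cell = code_point & 0xFF            # one of 256 columns in the row
    return group, plane, row, cell


def in_bmp(code_point):
    """True if the code value lies in plane 0 of group 0 (the BMP)."""
    group, plane, _, _ = ucs_position(code_point)
    return group == 0 and plane == 0


print(ucs_position(0x00C4))     # a BMP character
print(ucs_position(0x10FFFF))   # the last position the standard plans to use
print(in_bmp(0x10000))          # first position outside the BMP
```

       Every position up to 0x10ffff falls in group 0, which is why
       only a small fraction of the 31-bit space is ever expected to
       be used.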

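       The representation of each UCS character as a 2-byte word
       is referred to as the UCS-2 form (only for BMP characters),
       whereas UCS-4 is the representation of each character by a
       4-byte word. In addition, there exist two encoding forms:
       UTF-8, for backward compatibility with ASCII processing
       software, and UTF-16, for the backward-compatible handling
       of non-BMP characters up to 0x10ffff by UCS-2 software.

       The UCS characters 0x0000 to 0x007f are identical to those
       of the classic US-ASCII character set, and the characters
       in the range 0x0000 to 0x00ff are identical to those in the
       ISO 8859-1 Latin-1 character set.


COMBINING CHARACTERS

       Some code points in UCS have been assigned to combining
       characters. These are similar to the non-spacing accent
       keys on a typewriter. A combining character just adds an
       accent to the previous character. The most important
       accented characters have codes of their own in UCS;
       however, the combining character mechanism allows us to add
       accents and other diacritical marks to any character. The
       combining characters always follow the character which
       they modify. For example, the German character Umlaut-A
       ("Latin capital letter A with diaeresis") can either be
       represented by the precomposed UCS code 0x00c4, or
       alternatively as the combination of a normal "Latin capital
       letter A" followed by a "combining diaeresis": 0x0041
       0x0308.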
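
       The Umlaut-A example can be checked with a short script.
       This is a sketch that assumes Python's standard unicodedata
       module as a convenient stand-in; it is not what the C
       library does internally.

```python
import unicodedata

precomposed = "\u00c4"      # Latin capital letter A with diaeresis
combined = "\u0041\u0308"   # Latin capital letter A + combining diaeresis

# The two strings are different code sequences for the same character.
print(precomposed == combined)

# Composing the pair yields the precomposed code; decomposing reverses it.
composed = unicodedata.normalize("NFC", combined)
decomposed = unicodedata.normalize("NFD", precomposed)
print([hex(ord(c)) for c in composed])
print([hex(ord(c)) for c in decomposed])

# A nonzero combining class marks 0x0308 as a combining character.
print(unicodedata.combining("\u0308"))
```

       Software that compares strings byte by byte will treat the
       two forms as different unless it normalizes them first.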

       Combining characters are essential, for instance, for
       encoding the Thai script, for mathematical typesetting, and
       for users of the International Phonetic Alphabet.


IMPLEMENTATION LEVELS

       As not all systems are expected to support advanced mecha­
       nisms like combining characters, ISO 10646-1 specifies the
       following three implementation levels of UCS:

       Level 1  Combining  characters  and Hangul Jamo (a variant
                encoding of the Korean  script,  where  a  Hangul
                syllable  glyph  is coded as a triplet or pair of
                 vowel/consonant codes) are not supported.

       Level 2  In addition to level 1, combining characters  are
                now  allowed  for  some  languages where they are
                essential  (e.g.,  Thai,  Lao,  Hebrew,   Arabic,
                Devanagari, Malayalam, etc.).

       Level 3  All UCS characters are supported.
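
       As a rough illustration of the level 1 restriction, the
       sketch below tests whether a string avoids combining
       characters and Hangul Jamo. It is an approximation under
       stated assumptions: ISO 10646-1 has its own lists of
       combining characters, whereas this code uses the Unicode
       general categories Mn, Mc and Me, and it treats the block
       0x1100 to 0x11ff as the Hangul Jamo. The function names are
       hypothetical.

```python
import unicodedata


def is_combining(ch):
    # Assumption: "combining character" is approximated by the Unicode
    # mark categories rather than by the lists in ISO 10646-1.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def is_conjoining_jamo(ch):
    # Assumption: the Hangul Jamo block 0x1100..0x11ff.
    return 0x1100 <= ord(ch) <= 0x11FF


def fits_level_1(text):
    """True if text uses neither combining characters nor Hangul Jamo."""
    return not any(is_combining(ch) or is_conjoining_jamo(ch) for ch in text)


print(fits_level_1("\u00c4"))               # precomposed Umlaut-A
print(fits_level_1("A\u0308"))              # A + combining diaeresis
print(fits_level_1("\ud55c"))               # precomposed Hangul syllable
print(fits_level_1("\u1112\u1161\u11ab"))   # same syllable as a Jamo triplet
```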

       The  Unicode 3.0 Standard published by the Unicode Consor­
       tium contains exactly the UCS Basic Multilingual Plane  at
       implementation  level 3, as described in ISO 10646-1:2000.
       Unicode 3.1 added the supplemental planes of ISO  10646-2.
       The  Unicode  standard  and technical reports published by
       the Unicode Consortium provide much additional information
       on the semantics and recommended usages of various charac­
       ters. They provide guidelines and algorithms for  editing,
       sorting, comparing, normalizing, converting and displaying
       Unicode strings.


UNICODE UNDER LINUX

       Under GNU/Linux, the C type wchar_t  is  a  signed  32-bit
       integer  type.  Its values are always interpreted by the C
       library as UCS code values (in all locales), a  convention
       that  is  signaled by the GNU C library to applications by
       defining the constant __STDC_ISO_10646__ as  specified  in
       the ISO C 99 standard.


       Under Linux, in general only  the  BMP  at  implementation
       level  1 should be used at the moment. Up to two combining
       characters per base character for certain scripts (in par­
       ticular  Thai)  are  also supported by some UTF-8 terminal
       emulators and ISO 10646 fonts (level 2),  but  in  general
       precomposed characters should be preferred where available
       (Unicode calls this Normalization Form C).
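
       The advice above (prefer precomposed characters, and expect
       at most two combining characters per base character) could
       be checked along the following lines. This is a sketch only:
       the helper name max_combining_run is hypothetical, and
       non-spacing and enclosing marks (categories Mn and Me) are
       assumed to be the characters a terminal would overstrike.

```python
import unicodedata


def max_combining_run(text):
    """Longest run of combining marks left after conversion to NFC."""
    text = unicodedata.normalize("NFC", text)
    longest = current = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0   # a base character starts a new run
    return longest


print(max_combining_run("A\u0308"))                # composes to 0x00c4
print(max_combining_run("\u0e01\u0e31\u0e49"))     # Thai base + two marks
print(max_combining_run("x\u0301\u0302\u0303"))    # three marks on one base
```

       A result greater than 2 suggests the text may not display
       correctly on terminal emulators with the limit described
       above.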


PRIVATE AREA

       In the BMP, the range  0xe000  to  0xf8ff  will  never  be
       assigned to any characters by the standard and is reserved
       for private usage. For the Linux community,  this  private
       area  has been subdivided further into the range 0xe000 to
       0xefff which can be used individually by any end-user  and
       the  Linux zone in the range 0xf000 to 0xf8ff where exten­
       sions are coordinated among all Linux users. The  registry
       of  the characters assigned to the Linux zone is currently
       maintained by H. Peter Anvin <Peter.Anvin@linux.org>.
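
       The subdivision of the private area can be expressed as a
       simple range check. The function name and the returned
       labels below are hypothetical, chosen only for illustration;
       the ranges are the ones given above.

```python
def private_area_zone(code_point):
    """Classify a code value within the BMP private area, or return None."""
    if 0xE000 <= code_point <= 0xEFFF:
        return "end-user"   # free for individual use
    if 0xF000 <= code_point <= 0xF8FF:
        return "linux"      # coordinated Linux zone
    return None             # not in the private area


print(private_area_zone(0xE000))
print(private_area_zone(0xF8FF))
print(private_area_zone(0x0041))
```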


LITERATURE

       * Information technology -- Universal Multiple-Octet Coded
         Character  Set  (UCS)  -- Part 1: Architecture and Basic
         Multilingual  Plane.   International  Standard   ISO/IEC
         10646-1, International Organization for Standardization,
         Geneva, 2000.

         This is the official specification of UCS.  Available as
         a PDF file on CD-ROM from http://www.iso.ch/.

       * The  Unicode Standard, Version 3.0.  The Unicode Consor­
         tium,   Addison-Wesley,   Reading,   MA,   2000,    ISBN
         0-201-61633-5.

       * S.  Harbison,  G.  Steele. C: A Reference Manual. Fourth
         edition, Prentice Hall,  Englewood  Cliffs,  1995,  ISBN
         0-13-326224-3.

         A  good reference book about the C programming language.
         The fourth edition covers the 1994 Amendment  1  to  the
         ISO  C  90  standard, which adds a large number of new C
         library functions for handling wide and multi-byte char­
         acter  encodings,  but  it  does not yet cover ISO C 99,
         which improved wide  and  multi-byte  character  support
         even further.

       * Unicode Technical Reports.
         http://www.unicode.org/unicode/reports/

       * Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux.
         http://www.cl.cam.ac.uk/~mgk25/unicode.html


BUGS

       Current general UCS support under Linux usually provides
       for CJK double-width characters and sometimes even simple
       overstriking combining characters, but usually does not
       include support for scripts with right-to-left writing
       direction or ligature substitution requirements such as
       Hebrew, Arabic, or the Indic scripts. These scripts are
       currently only supported in certain GUI applications (HTML
       viewers, word processors) with sophisticated text rendering
       engines.
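
       The two features mentioned here, double-width CJK characters
       and overstriking combining characters, amount to a simple
       column count. The sketch below is an approximation in the
       spirit of wcwidth(3), not a reimplementation of it: the
       function name display_columns is hypothetical, control
       characters are not treated specially, and right-to-left
       ordering and ligatures are ignored.

```python
import unicodedata


def display_columns(text):
    """Estimate how many terminal columns a string occupies."""
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue        # overstrikes the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2    # CJK double-width character
        else:
            columns += 1
    return columns


print(display_columns("A"))             # one ordinary character
print(display_columns("\u6f22\u5b57"))  # two Han ideographs
print(display_columns("A\u0308"))       # base character plus combining mark
```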


AUTHOR

       Markus Kuhn <mgk25@cl.cam.ac.uk>


SEE ALSO

       utf-8(7), charsets(7), setlocale(3)

GNU                         2001-05-11                 UNICODE(7)
  



