- ISO, an international organisation that defines the characters that Unicode encodes, has recommended that the Bengali script be renamed “Bengali-Assamese”.
Background: An understanding of Unicode
- Computers communicate in bits and bytes.
- Any idea expressed in a language, say the letter A, must therefore be converted into raw data (bits and bytes) before a computer can store or transmit it.
- An encoding is a method to transform an idea (like the letter “A”) into raw data (bits and bytes).
- Every character that we type on a keyboard has a unique code number.
- The American Standard Code for Information Interchange (ASCII) maps the numeric values 0-127 to Western characters (A, a, B, b, C, c, etc.) and control codes (newline, tab, etc.).
- ASCII is a 7-bit code (2 to the 7th power gives 128 values, numbered 0 to 127 in decimal); in practice each character is stored in an 8-bit byte.
- ASCII encoding worked great for English text.
- With only 128 code numbers, however, ASCII cannot represent the characters of the world’s many other languages.
- To solve this problem Unicode was introduced.
- The Unicode Consortium, a non-governmental body, has standardised and maintains a Universal Character Set (UCS), commonly called Unicode, whose code numbers are conventionally written in hexadecimal.
- ISO does the encoding on the basis of submissions from the governments concerned.
- Unicode is a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers, for example on a web page or in a programming language.
- It aims to be a superset of all other character sets that have been encoded.
- These include characters for all the world’s main languages along with a selection of symbols for various purposes.
- The characters and their code numbers are defined by the international standard ISO/IEC 10646.
- The same code numbers are published in the Unicode Standard, which is maintained by the Unicode Consortium.
- The Unicode Consortium classifies characters according to the scripts they belong to.
- For example, the letter ‘A’ has the code number U+0041 and is listed as “Latin capital letter A” in the “Basic Latin” code chart.
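The relationship between a character, its code number and its encoded bytes can be checked with a short Python sketch. It uses only the standard library; the choice of the Assamese letter U+09F0 as the non-ASCII example and the variable names are illustrative assumptions, not part of the source.

```python
import unicodedata

# ASCII maps 'A' to the number 65; Unicode keeps the same number, written U+0041.
code_point = ord("A")
print(code_point, f"U+{code_point:04X}", unicodedata.name("A"))
# prints: 65 U+0041 LATIN CAPITAL LETTER A

# ASCII covers only 0-127, so a letter outside that range cannot be encoded with it.
assamese_ra = "\u09F0"  # illustrative choice of a non-ASCII character
try:
    assamese_ra.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("fits in ASCII:", ascii_ok)

# A Unicode encoding such as UTF-8 turns the same code number into raw bytes.
utf8_bytes = assamese_ra.encode("utf-8")
print(utf8_bytes.hex(" "))  # prints: e0 a7 b0
```

The code number identifies the character; the bytes depend on which encoding (here UTF-8) is used to store it.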
Background of the Assamese-Bengali Issue
- Although Unicode encodes the scripts of many Indian languages, the Assamese language is either misrepresented or not represented in the Unicode Standard.
- The code chart used for writing Assamese on computers is named the Bengali code chart.
- This was because the two scripts share a large number of characters, some pronounced the same way, others denoting different sounds.
- In Unicode’s charts, these shared forms are defined as Bengali characters.
- Only a few characters exclusive to Assamese are listed, as “additions for Assamese”, within the chart for Bengali.
- The script was therefore named Bengali, and all character descriptors in the Unicode code chart follow the Bengali nomenclature.
- Misrepresentation: Same sound, Different characters and No representation
- At times, the same sound is expressed by different characters in Assamese and Bengali.
- Example 1:
- The definitive example is the letter ‘ra’, which takes two different forms in the two languages, besides a letter ‘wa’ that is exclusive to Assamese.
- The Unicode chart lists the Assamese ‘ra’ and ‘wa’ among additions for Assamese, and defines them as being both Bengali and Assamese characters.
- At other times, the same letter denotes two different sounds.
- For example, three Assamese characters denote the sound ‘xa’, defined as “a soft ‘kh’ with the air released from the throat with the base of the tongue not touching the palate or the roof of the mouth”.
- In Bengali, the same three characters denote the sounds ‘sa’, ‘sha’ and ‘ssa’.
- Example 2:
- The Assamese letter “ৰ” (Ro) is described as Bengali letter “র” (Ro) with middle diagonal in the Bengali chart of the Unicode Standard.
- The Assamese letter “ৱ” (Wobo) is described as Bengali letter “র” (Ro) with lower diagonal in the Bengali chart of the Unicode Standard.
- Thirteen other Assamese letters are similarly misrepresented in the Bengali chart of the Unicode Standard.
- Not Represented
- The Assamese letter “ক্ষ” (Khya) is not represented at all in the Bengali code chart of the Unicode Standard.
- This results in collation errors when sorting software is run on Assamese text, because “ৰ” (Ro) and “ৱ” (Wobo) are not in their proper places and “ক্ষ” (Khya) is absent from the Bengali code chart of the Unicode Standard.
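The naming and ordering points above can be inspected directly with Python's standard `unicodedata` module. This is a minimal sketch, not a statement of how any sorting software works: it assumes that “ক্ষ” is entered as the three-code-point sequence ka + virama + ssa, and it uses plain code-point sorting as a stand-in for an untailored sort.

```python
import unicodedata

ra_bengali = "\u09B0"   # র
ra_assamese = "\u09F0"  # ৰ
wa_assamese = "\u09F1"  # ৱ

# The official character names all carry the "BENGALI" prefix.
names = [unicodedata.name(ch) for ch in (ra_bengali, ra_assamese, wa_assamese)]
for ch, name in zip((ra_bengali, ra_assamese, wa_assamese), names):
    print(f"U+{ord(ch):04X}", name)

# Assumption: Khya typed as a conjunct, i.e. a sequence rather than one code point.
khya = "\u0995\u09CD\u09B7"
khya_parts = [unicodedata.name(ch) for ch in khya]
print(len(khya), khya_parts)

# Plain sorting compares code numbers, so U+09F0 lands after U+09B2 (la),
# at the end of the block rather than beside U+09B0 (ra).
letters = ["\u09F0", "\u09B2", "\u09B0"]
sorted_letters = sorted(letters)
print([f"U+{ord(ch):04X}" for ch in sorted_letters])
# prints: ['U+09B0', 'U+09B2', 'U+09F0']
```

Software that sorts by raw code number therefore places “ৰ” and “ৱ” after every other consonant in the chart, which is one way the collation complaint shows up in practice.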
What has been done?
- ISO, an international organisation that defines the characters that Unicode encodes, has recommended renaming the current Bengali script in the Unicode Standard as “Bengali-Assamese”.
- The major problem lies on the Assamese side.
- The question is whether the renaming will be limited to the names of the script and code chart, or will also cover the nomenclature of the misrepresented character descriptors.
- The problem may be solved if such an alteration is possible: every character is given both Assamese and Bengali descriptors, the script is given an acceptable name, and the displaced and missing Assamese characters “ৰ” (Ro), “ৱ” (Wobo) and “ক্ষ” (Khya) are put in their proper places in the chart for correct collation.
- But under the basic principle of a unique code, one entity can have only one identifier; in this case around fifteen characters would have one identifier for two entities.
Solution: Way Forward
The Assam government’s proposal lists 104 characters and symbols: 35 identical to Bengali ones in name and shape, 42 similarly shaped but with different sounds or uses, and 27 yet to be encoded.
SEPARATE SLOT/RANGE FOR THE ASSAMESE SCRIPT
- The allocation of a separate slot/range for the Assamese Script remains the only solution.
- This is perhaps easier for the Unicode Consortium to do.
- The Government of Assam has also moved the Government of India seeking a separate slot/range for the Assamese script.
- Allocating a separate slot/range for the Assamese script would mean the Unicode Consortium allowing and accepting duplication of characters.
- The Unicode Consortium has already allowed and accepted duplication, and for some characters triplication, across the three major European writing systems, viz. Cyrillic, Greek and Latin.
- Consequently, the Unicode Standard contains the following number of duplicate characters, or more:
a=2, A=3, B=3, c=2, C=2, e=2, E=3, H=3, i=2, I=3, j=2, J=2, K=2, M=3, N=2, o=2, O=3, p=2, P=3, s=2, S=2, T=2, x=2, X=3, y=2, Y=2 and Z=2
The solution therefore lies in duplication, since duplication of characters already exists in the Unicode Standard.
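The kind of duplication counted in the list can be seen with the capital letter A, which the list gives as “A=3”. The sketch below is only an illustration using Python's standard library; the three code numbers chosen are the Latin, Greek and Cyrillic capitals that look like A.

```python
import unicodedata

# Three letters that look alike but belong to three different scripts.
look_alike_a = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

names = [unicodedata.name(ch) for ch in look_alike_a]
for ch, name in zip(look_alike_a, names):
    print(f"U+{ord(ch):04X}", name)

# Each has its own code number, so the three strings are not equal to one another.
distinct_code_points = len(set(look_alike_a))
print(distinct_code_points)  # prints: 3
```

Each script keeps its own code number for the shared shape, which is the precedent the argument for a separate Assamese range relies on.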