Organization Of Characters

Characters may be organized at three different levels in terms of their relative complexity, namely: (1) at the level of "related characters", (2) at the level of a "dictionary" (for indexing purposes), and (3) at the level of a "data base system" for word-processing and research involving a very large character set.

While sets of "related characters" are useful for teaching beginners, a dictionary with a more effective indexing system is better for intermediate learners. A "data base system" is best for the more advanced learners. In the following, we shall discuss these in greater detail.

A. THE CONCEPT OF RELATED CHARACTERS

For beginning learners, characters should be organized into families of "related characters". Examples are shown in Fig. 3.7a and b, headed by the derived significs Two and Rain.

These related sets resulted from applying the hierarchical methodology to the etymological definitions of the characters (Appendix i). In order to show that the above two examples are not mere isolated instances, we need to investigate other such related groups, if any.

Fig. 6.1 shows just such a related group, headed by the primary signific, Ten. Fig. 6.1.a shows the modern script, while Fig. 6.1.b shows the corresponding group in Small Seal. These were arrived at by applying the same hierarchical methodology to the etymological definitions as before (Appendix iii).

In Fig. 6.1, we again notice the vertical series of "direct descendants", {Ten, Twenty, Thirty, A Generation}, and the horizontal series of siblings at level 2 of the hierarchical tree. This pattern, too, is common.
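The vertical and horizontal series can be made concrete with a small tree sketch. The parent links below assume that each member of the series {Ten, Twenty, Thirty, A Generation} derives from the one before it, and that the head signific sits at level 1; the actual links are those of Fig. 6.1. The function names are illustrative only.

```python
# Hypothetical parent links: child -> parent. Only the vertical series named
# in the text is encoded; each member is ASSUMED to derive from the previous.
PARENT = {
    "Twenty": "Ten",
    "Thirty": "Twenty",
    "A Generation": "Thirty",
}

def level(name):
    """Return the tree level of a character; the head signific is level 1."""
    depth = 1
    while name in PARENT:
        name = PARENT[name]
        depth += 1
    return depth

def lineage(name):
    """Return the chain from the head signific down to the given character."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain[::-1]

def siblings(name):
    """Return the other characters sharing the same parent (a horizontal series)."""
    parent = PARENT.get(name)
    return sorted(c for c, p in PARENT.items() if p == parent and c != name)
```

Adding further child-to-parent entries from the figure would populate the horizontal series returned by siblings().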

Yet another example is the set of related characters for Sun, which is a primary signific. Our research revealed that there exists a large body of related characters: 37 derived significs and phonetics, which in terms of the tree hierarchy extend down to the fourth level, not including the phonograms (see Fig. 6.2).

In Fig. 6.2.a we again see the presence of a "direct descendant" series, such as {Dawn, Solar Rays, To Wound}, and a horizontal series as well. Fig. 6.2a', b', and c' show the corresponding Small Seal characters.

B. INDEXING OF SIGNIFICS

We have organized the complete set of 214 significs into an ordered hierarchical structure, headed by 117 primary significs. This is helpful, but it is still too much to handle efficiently. According to Miller (1956), humans are good at handling only seven items, plus or minus two. Therefore, some kind of system is needed to further structure and organize the significs into meaningful groups of manageable size for learners at the beginning and intermediate levels.

Fig. 6.3 introduces an indexing system for the set of primary significs. This system consists of only 8 major groups (7 + 1 items), which are: (1) Heaven, (2) Five States of Change, (3) Humanity, (4) Divination, (5) The Physical Body, (6) Four Necessities, (7) Environment, (8) Others.

These eight groups were chosen because they have deep cultural significance for the Chinese people; that is, they are already familiar entities in the existing knowledge structures of the Chinese learner.

C. DATA BASE CONSIDERATIONS

Beyond the levels of the beginning and intermediate learners, there are needs at the advanced level. Linguists, systems engineers, and other researchers sometimes must work with tens of thousands of characters. How is one to approach such a task?

We feel that a problem of such magnitude cannot be efficiently handled by the human mind alone; a combination of human and computer seems to be the solution. That is, we envision a dynamic system of man and machine together, in which an inexpensive data base management system is included as part of the solution.

Although it is beyond the scope of this research to do a systems design, some general discussion is appropriate here.

Each Chinese character may be viewed as a "record" in the computer memory (say, a disk memory). Therefore, we may expect:

.the volume of data to be large, of the order of tens of thousands of characters.

.the nature of the data to be complex (in the sense that there exist numerous and varied relationships among the formal elements of a character, as well as among characters).

.the coding of sets of characters to be difficult (i.e., it is quite a challenge to have an efficient file system for information manipulation).

.the lifetime of such a system, once it is set up, to be long (i.e., considerably longer than the life of the computer hardware or software systems that support it).

In view of the above considerations, it is important to develop efficient methods of (i) referring to the "Chinese character" files and their contents, (ii) expressing operations on the file data, (iii) designing the logical structure of the Chinese characters such that it is separated from the storage structure of the physical system as much as possible, and (iv) interfacing with users.

Should such a system be developed, we would expect that most of the users will not be programmers, and that most users will probably not be knowledgeable about the operating system or the file organization of the system. Therefore, as a general rule, the more the logical concepts related to the application are separated from the physical concepts related to the computer operation, the greater will be the possibilities for wide-scale use of such a Chinese-character data-base system. We should therefore be aware of methodologies for managing such a data base so that it is conveniently accessible to many types of users with a diversity of applications and needs.

Specifically, we are interested in how Chinese-character data can be described and structured, and how retrieval, searching, and updating procedures can be specified in a Chinese-character DBMS (Data Base Management System).

.REVIEW

In order to facilitate a better understanding of the "data-base management" concept, it is helpful to review some basic definitions:

.Data Base: A collection of stored data that models some aspects of the objects of the real world. A data base contains two kinds of information: (1) descriptions of entities, and (2) representation of relationships.

.Entity: An object that has independent existence in the context of the application for which the data base is intended. An entity is described by a set of characteristics or attributes.

.Relationship: A named association among sets of entities. Relationships can be of the following kinds: (1) 1 : 1, (2) 1 : n, and (3) m : n. When it is 1 : 1, it is called a "binary" relation. When it is 1 : n, it is said to be "hierarchical". When it is m : n, it is referred to as "network".

.DISCUSSION

A DBMS maintains a "data structure" representation of the entities and relationships. The logical structure (i.e., a data model), together with its "operation set", constitutes the interface through which the data base is accessed. It also employs specific techniques for mapping information into the storage, and for information retrieval.

Relationships are expressed through: (1) Proximity, (2) Position, and (3) Pointer mechanisms. Regarding "data modeling", there are currently three main approaches: (1) Hierarchical, (2) Network, and (3) Relational.

.THE RELATIONAL MODEL

Since Codd's relational model was built upon a mathematical theory, it is anticipated that as time goes on it will gain greater prominence and exert greater impact on DBMS design and implementation. Therefore, in the following we shall place more emphasis on the relational model in our discussion and in our exploratory work of simulated Chinese-character data-base studies.

The relational model is based on the mathematical notion of a "relation" (Codd, 1970). Codd adapted the concept of relation to data-base use and formally developed it for data representation and retrieval.

.PHILOSOPHY

The philosophy of the "Relational Data Model" may be expressed as follows:

(1) The entities and relationships making up a data base can be represented by n-ary relations. A relation is simply a two-dimensional table, and each row of the relation is known as a tuple. These relations are: (a) time-varying, in the sense that a tuple may be changed, inserted, or deleted; and (b) normalized or flat, in the sense that each component of a tuple is either of primitive type or a character string (but not a set, relation, or other composite).

(2) By a process of successive decomposition (further normalization), a relation can be split into several relations, which behave in a consistent manner when modified, and from which the information in the original can be reconstructed without loss.
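Point (2) can be illustrated with a minimal sketch, in which a relation is modeled as a set of tuples and is split by projection and then rebuilt by a natural join. The relation name, its attributes, and the sample rows (in particular the group assigned to each signific) are invented for illustration and are not taken from the figures.

```python
# A relation is modeled as a set of tuples; attribute names are kept separately.
# Sample rows are illustrative only (the group assignments are ASSUMED).
ATTRS = ("character", "signific", "group")
CHARS = {
    ("Dawn", "Sun", "Heaven"),
    ("Solar Rays", "Sun", "Heaven"),
    ("Twenty", "Ten", "Others"),
}

def project(rows, attrs, keep):
    """Project a relation onto the attributes in `keep` (duplicates vanish)."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}

def natural_join(r1, attrs1, r2, attrs2):
    """Join two relations on their shared attribute names."""
    shared = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[attrs1.index(a)] == t2[attrs2.index(a)] for a in shared):
                out.add(t1 + tuple(t2[attrs2.index(a)] for a in extra))
    return out, tuple(attrs1) + tuple(extra)

# Decompose on the assumed dependency signific -> group, then reconstruct.
CHAR_SIG = project(CHARS, ATTRS, ("character", "signific"))
SIG_GROUP = project(CHARS, ATTRS, ("signific", "group"))
REJOINED, REJOINED_ATTRS = natural_join(
    CHAR_SIG, ("character", "signific"), SIG_GROUP, ("signific", "group")
)
```

Here the group of a signific is stored once in SIG_GROUP rather than once per character, and the join returns exactly the original rows. The reconstruction is lossless only because each signific is assumed to belong to a single group.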

(3) It is possible to develop relational data languages that are highly data independent and which can, for practical purposes, express any query whose answer is contained in the data base.

.COMMENT

In the "relational model", there is really no logical distinction between "entities" and "relationships". The basic element is a "relation" in the mathematical sense--a set of n-tuples--and it is a matter of interpretation whether a relation represents a set of entities or a relationship. Frequently, a relation is pictured as a table in which rows correspond to "tuples", and columns to "domains". Since the relation is actually a "set", identical rows are not permitted, and the ordering of rows is considered immaterial. The ordering of columns, however, is significant, reflecting the fact that, in general, the ith and jth components of tuples cannot be interchanged without affecting the meaning (Gottlieb & Gottlieb, 1978).
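These set properties can be checked directly in a short sketch. The relation DERIVES(parent, child) and its two rows are hypothetical, and assume that Twenty derives from Ten and Thirty from Twenty.

```python
# A binary relation DERIVES(parent, child), modeled as a set of 2-tuples.
# The rows are ASSUMED for illustration.
r1 = {("Ten", "Twenty"), ("Twenty", "Thirty")}
r2 = {("Twenty", "Thirty"), ("Ten", "Twenty"), ("Ten", "Twenty")}

# Row order and repeated rows make no difference to the relation.
same_rows_any_order = (r1 == r2)

# Interchanging the two columns yields a different relation.
swapped = {(child, parent) for parent, child in r1}
columns_matter = (swapped != r1)
```

Both flags evaluate to True: r2 lists the same rows in another order with one repeated, yet equals r1, while the column-swapped relation does not.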

D. DBMS DESIGN AND IMPLEMENTATION

.INTRODUCTION

Considering current advances in micro-computer technology, we envision the possibility of a practical system that combines a powerful operating system with a DBMS for educational applications. We believe that it is feasible to build an electronic Chinese-character dictionary, a word-processor for Chinese characters, and an interactive computer-based system for learning Chinese characters.

In the following, we shall briefly sketch a minimum system for the input and retrieval of Chinese characters, which serves to demonstrate feasibility. The system lets the user enter either an English word or a romanization in order to fetch the corresponding character from a disk memory via a DBMS.

.DBMS COMMANDS

The DBMS should have commands to do the following in order to function properly: (i) to create a file; (ii) to change records, such as to edit, insert, or delete a record from the file; sometimes, to restore a record, and to do "garbage collection" on a given file after an extended period of usage; (iii) to provide a full screen editing capability; (iv) to generate reports.
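The record-level commands can be sketched as a toy in-memory file. This is only an illustration under stated assumptions: the class name CharFile, its method names, and the fields "english" and "roman" are hypothetical, and deletion is assumed to mark a record so that it can be restored until garbage collection is run.

```python
class CharFile:
    """Toy record file: deletion only marks a record; pack() reclaims space."""

    def __init__(self):
        self.records = []          # each entry: [fields_dict, deleted_flag]

    def insert(self, **fields):
        self.records.append([dict(fields), False])
        return len(self.records) - 1          # record number

    def edit(self, recno, **changes):
        self.records[recno][0].update(changes)

    def delete(self, recno):
        self.records[recno][1] = True         # mark only; data is kept

    def restore(self, recno):
        self.records[recno][1] = False        # undo a delete before packing

    def pack(self):
        """Garbage collection: physically drop records marked as deleted."""
        self.records = [r for r in self.records if not r[1]]

    def report(self):
        """Return the live records, one formatted line per record."""
        return ["%s | %s" % (r[0]["english"], r[0]["roman"])
                for r in self.records if not r[1]]
```

Note that record numbers shift after pack(), so any index built on the file would have to be rebuilt at that point.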

.DBMS ORGANIZATION

It should have an indexing command that allows only the data within the "key" to be sorted. This makes possible a fast search operation. It should also have a special command for finding a piece of known data when an index file is in use, and the capability of manipulating multiple index files.
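A minimal sketch of such key-only indexing follows. The function names and the sample records are hypothetical; each index holds only (key value, record number) pairs, so the data file itself is never reordered, and a binary search over the sorted pairs stands in for the "find" command.

```python
import bisect

def build_index(records, key):
    """Sort only (key value, record number) pairs; the data file is untouched."""
    return sorted((rec[key], recno) for recno, rec in enumerate(records))

def find(index, value):
    """Binary-search the index; return the matching record numbers."""
    lo = bisect.bisect_left(index, (value,))
    hits = []
    while lo < len(index) and index[lo][0] == value:
        hits.append(index[lo][1])
        lo += 1
    return hits

# Hypothetical data file with two index files built over it.
records = [
    {"english": "sun", "roman": "ri"},
    {"english": "rain", "roman": "yu"},
    {"english": "ten", "roman": "shi"},
]
by_english = build_index(records, "english")
by_roman = build_index(records, "roman")
```

With these definitions, find(by_english, "sun") and find(by_roman, "yu") each locate a record without scanning the whole file, and the two indexes coexist over the same data.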

.DATA STRUCTURE

1. FILES

The user must have two kinds of files: (i) The data file for storing the font images of the Chinese characters, each of which is stored as a record in the auxiliary memory. Here we shall refer to this file as CHNEW, meaning a data file for storing newly created Chinese characters. (ii) The various program files with which the user commands the system on what to do at the appropriate moment. The five program modules might be named: SEARCH, ENGWORD, ROMWORD, GETCHAR, and DRAWCHAR.

The user needs to construct the following files:

a. A data file for all the new characters created, called the CHNEW data file.

Five files for the user's application programs:

b. A "SEARCH" file to search for the character desired.

c. An "ENGWORD" file to allow the user to input the English word meaning for the corresponding Chinese character desired.

d. A "ROMWORD" file to allow the user to input the romanization of the Chinese character desired.

e. A "GETCHAR" file to get the Chinese character from the data base in the auxiliary memory (i.e., the disk memory).

f. A "DRAWCHAR" file to display the Chinese character on the CRT screen, or to be printed out on a printer.
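How the five modules might cooperate can be sketched as follows. This is an illustration only: the division of work among the functions is our assumption, the CHNEW file is stood in for by a list holding a single hypothetical record, and a tiny 3x3 glyph replaces a real font image.

```python
# CHNEW: a stand-in for the data file; one hypothetical record with a tiny
# 3x3 glyph in place of a real font image.
CHNEW = [
    {"english": "ten", "roman": "shi", "font": ["010", "111", "010"]},
]

def engword(word):
    """ENGWORD: normalize an English-meaning query."""
    return ("english", word.strip().lower())

def romword(word):
    """ROMWORD: normalize a romanization query."""
    return ("roman", word.strip().lower())

def search(query):
    """SEARCH: return the record number matching the query, or None."""
    field, value = query
    for recno, rec in enumerate(CHNEW):
        if rec[field] == value:
            return recno
    return None

def getchar(recno):
    """GETCHAR: fetch the font image of a record from the data file."""
    return CHNEW[recno]["font"]

def drawchar(font):
    """DRAWCHAR: render the font image as text, '#' for a set dot."""
    return "\n".join(row.replace("1", "#").replace("0", ".") for row in font)
```

A retrieval then chains the modules, for example drawchar(getchar(search(engword("ten")))), with romword() substituted when the user enters a romanization instead.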

2. RECORDS

Each font image of a Chinese character is defined as a record in the memory. Thus, there will be tens of thousands of records, and each record is an "m x n" matrix. In the feasibility study, we have used a 10x10 grid of positions for the font, and two rows of 1x10 for the two identifying attributes of the character. Later on, more attributes may be added as the data structure is modified.
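The record layout of the feasibility study can be sketched as a 12 x 10 matrix of characters. The function names are hypothetical, and the choice of the English meaning and the romanization as the two identifying attributes is an assumption made for the example.

```python
WIDTH = 10          # columns per row, as in the feasibility study
FONT_ROWS = 10      # rows holding the font image

def make_record(font, attr1, attr2):
    """Build a 12 x 10 record: 10 font rows followed by 2 attribute rows."""
    if len(font) != FONT_ROWS or any(len(row) != WIDTH for row in font):
        raise ValueError("font must be 10 rows of 10 positions")
    rows = list(font)
    for attr in (attr1, attr2):
        if len(attr) > WIDTH:
            raise ValueError("attribute longer than one 1x10 row")
        rows.append(attr.ljust(WIDTH))      # pad to a full row
    return rows

def read_record(rows):
    """Split a record back into its font image and two attributes."""
    return rows[:FONT_ROWS], rows[FONT_ROWS].rstrip(), rows[FONT_ROWS + 1].rstrip()
```

Because every record has the same fixed size, the n-th record can be located in the file by simple arithmetic. Adding an attribute later means adding one more 1x10 row to each record.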

3. FIELDS AND DATA TYPES

In the data structure for our initial feasibility study, we need to identify each of the fields, and to specify the data type, be it numeric, string or logical.
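A field list of this kind might be sketched as below. The field names, widths, and the validate() helper are all hypothetical; only the three data types (numeric, string, logical) come from the discussion above.

```python
# Hypothetical field list: (name, data type, width). Names and widths are ASSUMED.
FIELDS = [
    ("ENGLISH", "string", 10),
    ("ROMAN", "string", 10),
    ("STROKES", "numeric", 2),
    ("PRIMARY", "logical", 1),
]

PYTHON_TYPES = {"string": str, "numeric": int, "logical": bool}

def validate(record):
    """Return a list of problems found in a record; empty means it conforms."""
    problems = []
    for name, dtype, width in FIELDS:
        value = record.get(name)
        expected = PYTHON_TYPES[dtype]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if not isinstance(value, expected) or (
            dtype == "numeric" and isinstance(value, bool)
        ):
            problems.append("%s: expected %s" % (name, dtype))
        elif dtype == "string" and len(value) > width:
            problems.append("%s: longer than %d" % (name, width))
        elif dtype == "numeric" and len(str(value)) > width:
            problems.append("%s: wider than %d digits" % (name, width))
    return problems
```

Checking each record against the field list before it is stored keeps the logical description of a character separate from how the record is laid out on disk.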