Design and Construction of a Virtual Environment
for Japanese Language Instruction

by Howard Rose



Chapter III
The Zengo Sayu Application

To support the TPR and NA methods effectively, a virtual environment must have certain characteristics. At the interface level it must be highly interactive, allow multi-sensory gestural and voice commands, and provide natural speech feedback. The application itself must support language acquisition in incremental stages and facilitate the development of concrete associations between language and meaning.

Technical Details of the Computer System

Zengo Sayu (which roughly translates into English as "up-down-all-around") meets these requirements by coupling speech and gesture recognition with a virtual environment and digitized speech output. The student is represented as a virtual hand in the environment (Figure 1). The setting is a Japanese-style tatami room with a view of Mt. Fuji (Figure 2). The room contains a low table, a number of colored boxes and colored orbs as shown in Figures 2 and 3. These contents change as the user adds and deletes objects during the various stages of the program.

The Zengo Sayu system hardware includes a head mounted display, microphone, hand-held wand with three push-buttons, Polhemus tracking system and three networked computers. It should be stressed that the high-cost hardware described here to run Zengo Sayu was used because these were the most appropriate machines available at the time. Recent advances in virtual reality computing have drastically reduced the cost of computer systems by one-half to two-thirds. The next generation of hardware and software due sometime in 1996 will put low-cost, immersive virtual reality systems within reach of average consumers. In light of the fast paced progress in computer technology, it is inevitable that immersive VR systems are soon going to proliferate out of the laboratory and into industry, homes and schools.

The student wears a fully immersive head mounted display (HMD) which can be fitted with a microphone for the voice recognition system. Figures 4 and 5 show students using the HMD without the microphone attached. The HMD has two NTSC video displays. A VR4 HMD was chosen for use in this system because it offers an adequate field of view of approximately 55 degrees. The VR4 is also a lightweight HMD, which is preferable to reduce physical stress on students using the system.

A magnetically tracked wand controls a graphical representation of a white hand. This virtual hand is the icon which represents the user in the world. The hand responds in real-time to movements of the wand. The wand is equipped with three buttons. The first enables the user to fly through the virtual world forward and backward. Normal flight is locked on a level plane. A second button unlocks the fly-level function and enables the user to ascend or descend according to the direction she is facing. The third button is a trigger mounted on the front of the wand, like a pistol trigger. Pressing the trigger enables the user to pick up and move objects in the virtual environment, and to collide objects together to cause various events to occur.

Table 3 summarizes some common terms used to describe elements and actions in virtual environments. These terms are used below to describe the Zengo Sayu system and how students use it to learn Japanese.

(Figure 1&2&3)

Figure 1&2&3: Zengo Sayu Environment

(Figure 4&5)

Figure 4&5: Students Using Environment

Table 3: Common Terms used in Virtual Environments

Virtual hand: The person immersed in the virtual environment is represented by a white hand, which moves in response to the motion of the user's hand.
Events: Events are programmed actions which the computer causes to happen in response to various user actions. Events can cue the computer to perform a variety of functions such as: play a sound, change an object's position or make objects visible or invisible.
Touch (event): To contact the virtual hand with a virtual object. No buttons on the wand are pressed. The virtual hand touching an object can cue an event.
Grab/Pick (event): To take hold of a virtual object by placing the virtual hand inside the object and depressing the trigger-button. Grabbing an object makes it stick to the virtual hand so it can be moved. Grabbing an object can cue an event.
Drop (event): Releasing the trigger-button causes the virtual hand to release the object it is currently grabbing. Dropping an object can cue an event.
Collide (event): Collide is an event which the computer registers each time two virtual objects come in contact. Collision is one of the most frequently used events in virtual environments.

The Zengo Sayu environment also allows the user to interact by pointing at virtual objects with the wand. When the user points, the computer calculates a line from the user's eye to their hand and then decides at which object the user is pointing. The system holds the name of this object in a variable to be used in programmed event strings. For example, when the user points at the red box and says: "Put it on the blue box.", the computer substitutes "red box" for the word "it" in the spoken sentence, and automatically moves the red box on top of the blue box.
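The pointing mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original dVS or C++ code: the function names, the representation of objects as center points, and the 10 degree angular tolerance are all hypothetical. The sketch casts a ray from the eye through the hand, selects the object whose center lies closest in angle to that ray, and then substitutes the selected name for the pronoun.

```python
import math

def pick_pointed_object(eye, hand, objects, max_angle_deg=10.0):
    """Return the name of the object nearest the eye-to-hand ray, or None.

    eye, hand: (x, y, z) positions. objects: dict of name -> center position.
    max_angle_deg is an assumed tolerance, not a value from the original system.
    """
    ray = [h - e for h, e in zip(hand, eye)]
    ray_len = math.sqrt(sum(c * c for c in ray))
    if ray_len == 0.0:
        return None  # hand coincides with the eye: no pointing direction
    best_name, best_angle = None, max_angle_deg
    for name, center in objects.items():
        to_obj = [c - e for c, e in zip(center, eye)]
        dist = math.sqrt(sum(c * c for c in to_obj))
        if dist == 0.0:
            continue
        cos_angle = sum(r * t for r, t in zip(ray, to_obj)) / (ray_len * dist)
        # Clamp to guard against rounding just outside [-1, 1].
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name

def substitute_pronoun(sentence, pointed_name, pronoun="it"):
    """Replace the stand-alone pronoun with the name of the pointed object."""
    return " ".join(pointed_name if w == pronoun else w for w in sentence.split())
```

With the eye at the origin and the hand held straight ahead, an object directly in front is selected, and the spoken sentence "Put it on the blue box." becomes "Put the red box on the blue box." when the pointed object is named "the red box".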

Zengo Sayu was created using Division LTD's dVS virtual world authoring system, version 2.0. The dVS system both generates the graphical scene experienced by the user and controls the flow of user interactions with the environment. Additional custom functions were written in C++ to detect relative object positioning, and for speech and gesture recognition.

The voice recognition software is a research prototype which is speaker dependent, meaning it must be trained to understand the voice of each individual user (Savage, Holden, & Billinghurst, 1994). During voice training, the student recites a given phrase into a microphone connected directly to the computer. The voice is digitally sampled and stored in a voice library for the individual user. This system requires between five and ten individual digitized samples for each phrase, and calculates an average waveform for each set of samples. This average sample is used later to match incoming sound signals with the computer's library of sounds for a given student. As the student trains the system, he is presented with a visual display of the sound wave for each utterance. While this process is slightly time-consuming, it does compel the student to practice speaking with consistency and vocal control. This system currently supports a one-hundred-phrase vocabulary with a recognition accuracy of over 90%.
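The train-then-match idea in this paragraph can be illustrated with a small sketch. It is not the prototype's algorithm, whose details are not given here. It assumes each utterance has already been reduced to a fixed-length list of numbers, averages the training samples for each phrase into one template, and recognizes a new utterance by finding the closest template by Euclidean distance. The function names and the rejection threshold max_distance are hypothetical.

```python
import math

def average_template(samples):
    """Average several equal-length feature vectors into one template."""
    if not samples:
        raise ValueError("at least one sample is required")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must have the same length")
    return [sum(s[i] for s in samples) / len(samples) for i in range(length)]

def train_library(recordings):
    """recordings: dict of phrase -> list of samples for one student."""
    return {phrase: average_template(samples)
            for phrase, samples in recordings.items()}

def recognize(utterance, library, max_distance):
    """Return the phrase whose template is closest, or None if none is close.

    max_distance is an assumed rejection threshold.
    """
    best_phrase, best_distance = None, max_distance
    for phrase, template in library.items():
        if len(template) != len(utterance):
            continue
        distance = math.sqrt(sum((u - t) ** 2 for u, t in zip(utterance, template)))
        if distance <= best_distance:
            best_phrase, best_distance = phrase, distance
    return best_phrase
```

Because each student has a separate library, the same function serves every user; only the recordings passed to train_library differ. A real system would also need to align utterances of different durations, which this sketch leaves out.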

The Japanese speech samples heard in Zengo Sayu are all recorded, natural speech which has been digitized on the computer. The audio quality varies between 11 and 22 kHz, and is significantly better than telephone quality sound. Some of the longer sound samples have been parsed into phrases to reduce the number of digitized samples required. For example, the sentence: "The red box is on the table." is parsed into two samples: "The red box is" and "on the table." Thus the parsed phrases can be combined in multiple ways to create a large number of sentences, while still maintaining an overall quality very close to natural human speech. The choice was made to parse the sounds into phrases rather than individual words so as to avoid computer-like speech with unnatural pauses and awkward intonation.
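The saving from phrase-level parsing is simple arithmetic: with m subject phrases and n location phrases, m + n recordings yield m x n sentences. The sketch below illustrates this with five of each, so ten recordings cover twenty-five sentences. The clip file names and the function names are hypothetical, and English glosses stand in for the Japanese recordings.

```python
from itertools import product

# Hypothetical clip names; one recording per phrase.
SUBJECT_CLIPS = {
    "red box": "subj_red_box.aiff",
    "blue box": "subj_blue_box.aiff",
    "white box": "subj_white_box.aiff",
    "black box": "subj_black_box.aiff",
    "yellow box": "subj_yellow_box.aiff",
}
LOCATION_CLIPS = {
    "on the table": "loc_on_table.aiff",
    "under the table": "loc_under_table.aiff",
    "next to the table": "loc_next_to_table.aiff",
    "in front of the table": "loc_front_of_table.aiff",
    "behind the table": "loc_behind_table.aiff",
}

def playlist(subject, location):
    """Return the two clips to play back to back for one sentence."""
    return [SUBJECT_CLIPS[subject], LOCATION_CLIPS[location]]

def all_sentences():
    """List every sentence that the recorded phrases can produce."""
    return [f"The {s} is {l}." for s, l in product(SUBJECT_CLIPS, LOCATION_CLIPS)]
```

Parsing at the word level would need even fewer recordings, but, as noted above, at the cost of unnatural pauses and intonation.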

The current form of Zengo Sayu runs on three computers networked together using UNIX sockets. The graphics for the virtual environment are rendered on a Silicon Graphics Onyx computer. The digitized sound samples are played on a Silicon Graphics Indy computer. The voice recognition software is run on a DEC Alpha workstation. In spite of the use of the network, the Zengo Sayu interface responds almost instantaneously to vocal and gestural commands.

(Figure 6)

Figure 6: Hardware system diagram for Zengo Sayu

One of the unique aspects of this interface is the use of combined voice and gesture recognition in an educational setting. This is a very powerful way of interacting with the virtual environment because the two modalities complement each other. Hauptman and McAvinny (1993) have shown how natural language is ideally suited for descriptive tasks, while gestural interaction is ideal for direct manipulation of objects. Hauptman and McAvinny have also shown that users prefer using combined voice and gesture interaction with computer graphics over either modality alone (Cohen, 1992).

The Teaching Method

Zengo Sayu is a whole language approach for teaching Japanese prepositions to students with no prior exposure to the language. The target vocabulary includes five colors (red, blue, white, black, yellow), two nouns (box, table), five prepositions (on, under, next to, front, behind) and two verbs ('is' and 'put'). Because this is a whole language approach, it does not teach discrete language elements such as grammar, pronunciation or syntax, but students are exposed to these elements within the context of their interactions and experiences in the environment. The environment is designed to be experienced totally in Japanese without the need for English translation, ensuring a total immersion language experience for the student.

This lesson consists of gradual steps, consistent with the Natural Approach principles of silent period, gradual knowledge acquisition and the development of concrete associations between language and meaning. Though there are distinct phases to complete as students progress through the program, each student is free to choose their own path throughout the process. In other words, there is no prescribed 'correct' path to take. Students can move through these steps at their own pace and each step relies on knowledge gained during previous stages. This iterative process builds knowledge like climbing a staircase.

Zengo Sayu is intended to give students great control over their own learning process. The teacher is no longer in sole possession of the knowledge, but acts as a guide to facilitate students' understanding. Knowledge is held within the virtual objects, which can speak and react according to the students' wishes. For example, if a student would like to hear a specific sentence repeated over and over, he is able to do so as many times as he wishes. This approach frees the teacher from performing the tedious role of a human tape player, and allows her to concentrate on higher level learning beyond the practical capabilities of the computer.

Zengo Sayu is designed to be flexible for the student. For example, all the functionality in the program is simultaneously available to allow progress or review at any point in the program. The student is free at any time to revisit or repeat portions of the lesson.

Using Zengo Sayu

The following scenario describes a typical lesson using Zengo Sayu in order to clarify how each of these steps works in practice. Due to its interactive nature, there is a lot of flexibility in the sequencing, pace and methods which individual students may choose to cover the material. It may be completed in one long sitting or spread over weeks of instruction. Alternative presentations not described here might include using Zengo Sayu with more traditional language instruction interspersed along the way. Actual student interaction will vary from person to person, but by the time they have completed the lesson, all students will have covered essentially the same content.

The research study described in this thesis tests Zengo Sayu for use with small groups of seven students. Using Zengo Sayu in a class situation opens up instructional options such as stimulating conversation and interaction between students both inside and outside of the virtual environment. Students outside can follow the immersed student through the virtual environment by watching and listening to a video monitor. In this way, substantial learning can take place as the students get a secondary experience of the virtual world.

Preparation: Getting Acquainted with VR

A major concern in testing this system has to do with students' lack of experience with VR systems, and unfamiliarity with virtual icons and metaphors. Therefore, it is crucial that students become acquainted with moving and interacting with a virtual environment before they progress to the task of learning Japanese.

For this purpose, an English version of the environment was developed to introduce both the technology and the specific functionality of the Zengo Sayu system. Preliminary trials have shown that users who spend five to ten minutes in the English environment are more comfortable and adept with the technology when they move to the Japanese version of Zengo Sayu. The result is that students report being less distracted by the mechanics of the system, and better able to concentrate on the task of language learning.

Step I: Exploratory Learning

All initial learning occurs through an interactive process of exploration and discovery driven by the student. The student starts by touching translucent orbs colored red, blue, black, yellow and white. These orbs are floating in space, arranged in a line in front of the student when she enters the environment (Figure 7). Picking one of these orbs (using the trigger button on the wand) causes the orb to turn opaque and announce its color, for example, "Aka" (red) as shown in Figure 8. Directly below the floating orbs is a translucent box. This box is constrained in space and cannot be moved, but picking the box with the wand causes it to announce its name: "Hako".

Picking any of the orbs and colliding it with the translucent box spawns a new box the same color as the orb (Figures 9 and 10). For example, the student picks the red orb and hears "Aka." She places it into the box and releases the trigger to hear "Hako". The translucent box is then transformed with an opaque, red texture and the student hears "Akai hako. Akai hako." (red box). The red box then flies from the table and onto a stage area across the room, leaving the translucent box covered in the red texture. The student hears "Akai hako." once more for reinforcement before the texture disappears and the box returns to its original translucent state.
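The sequence of events described above can be written out as a short sketch. It is an illustration only, not the dVS event code: the function names, the action labels such as "fly_to_stage", and the representation of the stage as a list are hypothetical. Of the color nouns, only "Aka" is quoted in this chapter; the others are the standard Japanese forms and are filled in here as an assumption.

```python
# (noun form, adjective form) for each color. Only "Aka" and the adjective
# forms appear in the chapter; the remaining nouns are assumed.
COLOR_WORDS = {
    "red": ("Aka", "Akai"),
    "blue": ("Ao", "Aoi"),
    "white": ("Shiro", "Shiroi"),
    "black": ("Kuro", "Kuroi"),
    "yellow": ("Kiiro", "Kiiroi"),
}

def on_orb_picked(color):
    """Picking an orb turns it opaque and makes it announce its color."""
    noun, _ = COLOR_WORDS[color]
    return [("make_opaque", f"{color} orb"), ("say", f"{noun}.")]

def on_orb_released_in_box(color, stage):
    """Releasing an orb inside the translucent box spawns a colored box."""
    _, adjective = COLOR_WORDS[color]
    phrase = f"{adjective} hako."
    stage.append(f"{color} box")  # the new box ends up on the stage area
    return [
        ("say", "Hako."),
        ("texture_template_box", color),
        ("say", phrase),
        ("say", phrase),
        ("fly_to_stage", f"{color} box"),
        ("say", phrase),  # once more for reinforcement
        ("reset_template_box", None),
    ]
```

For the red orb, the spoken output of the second handler is "Hako.", then "Akai hako." three times, matching the order the student hears in the example above.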

The translucent quality of virtual objects in Zengo Sayu is intended to convey meaning to the user. Translucent objects are objects which students can manipulate in some way. Translucent objects also are 'meta-objects', meaning that they represent a broad concept beyond the single object itself. For example, the orbs are translucent because they represent the general word for color, rather than the simple noun for 'orb.' The box is translucent because it has the capacity to spawn colored boxes, demonstrating the broad grammatical concept of combining adjectives with nouns. Translucence is one experiment at creating a set of metaphors for virtual environments.

Students are encouraged at this early stage to develop their understanding of Japanese and the VR system through exploration. The relationship between the virtual environment and the Japanese language content can be seen as symbiotic. Students are explicitly introduced to the fundamental functionality of the virtual environment, but they are not led by the hand through all the steps. Consequently, some interactions produce events which surprise the student. The intent is to add a dramatic element to the environment, and also to cause the student to reflect on the causes and relationships which lead to a surprising occurrence, such as a colored box being spawned. Such reflection and hypothesis development is a key element of inductive learning approaches such as the Silent Way (Gattegno, 1972, 1976) or the Natural Approach.

(Figure 7&8&9&10)

Figure 7&8&9&10: Colored Orbs and Box

Step II: Speaking Practice

Once the student feels familiar with the above content, the next step is to train the voice recognition system to enable voice interaction. Based on evidence which supports a silent period for beginning learners (Mangubhai, 1991; Atherton, 1993; Gary, 1975; Winitz & Reeds, 1973), students should be allowed ample time to develop their hearing capabilities before progressing to voice training. I suggest the best gauges for determining speaking readiness are 1) students' ability to effortlessly produce the target language, and 2) student attitude. In all cases, the teacher tests the students' pronunciation before allowing them to train the computer in order to avoid developing bad habits and practicing mistakes.

After training the voice recognition system, the students are able to interact with objects in the virtual environment through speech input. For example, when the student says "Kuroi hako", the black box grows and shrinks to indicate that the student has been understood. If the speech is not understood, the environment answers: "Wakarimasen." (I don't understand.)

(Figure 11&12&13&14)

Figure 11&12&13&14: The Preposition Table

Step III: Developing Complex Meaning: Prepositions and Full Sentences

Prepositions are introduced at the Preposition Table shown in Figures 11 to 14. The Preposition Table is a translucent table with five translucent boxes placed on, under, beside, behind, and in front of the table, respectively. This table has multiple layers of functionality. Touching the table turns it opaque, which indicates that picking the object will cause an event. Picking the table with the trigger button causes the table to announce its name: "Teburu." Touching the boxes placed around the table turns them opaque. Picking the boxes causes them to report their position relative to the table. For example, picking the box under the table, the student will hear: "Shita (under). Shita. Teburu no shita ni arimasu (under the table). Shita." (Figure 11). The word arimasu is a form of the verb for 'to be' and connotes existence. Note that no explicit definition or conjugations are presented regarding this verb. It is expected that students will assimilate this and other verbs somewhat subconsciously, through repeated exposure and usage as in TPR and the NA.

Next, students are introduced to the virtual Query Wand. The Query Wand is a virtual tool which enables the student to query the environment without the need to speak. The question mark shape of the Query Wand's head readily suggests its purpose to the user. Colliding the Query Wand with one of the boxes on the Preposition Table reveals a deeper level of language and functionality (Figure 12). For example, when the Query Wand collides with the box under the table, the box turns white and the student hears a full sentence stating the relationship of the white box to the table: "Shita (under). Shita. Shiroi hako wa teburu no shita ni arimasu. Shita." (The white box is under the table.) Each of the remaining four boxes behaves similarly, turning to one of the other four respective colors and stating its relationship to the table (Figures 13 and 14). In this way the student is exposed to complex elements of Japanese such as 'wa', which denotes the topic of a sentence and has no English equivalent, and particles of direction such as 'ni' and 'no'. All complex structures are directly linked to, and explained through, concrete examples and experience.

Once the student is familiar with the language at the level of the Preposition Table, she advances to manipulating the colored boxes by hand. For example, placing a black box on the yellow box, the student will hear: "Kuroi hako wa kiiroi hako no ue ni arimasu" (The black box is on the yellow box.) (Figure 15). In this way, the student can create complex arrangements using all the boxes. Once an arrangement is set, the student can use the Query Wand to review or ask further questions. In this situation, the Query Wand detects sequential collisions with two boxes, and then reports the relationship of the first box touched with the second box touched (Figure 16).
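A minimal sketch of how such a report could be generated is given below. It is not the custom C++ positioning code mentioned earlier. It assumes boxes sit on an integer grid with x to the right, y up and z toward the viewer, and that two boxes are related only when they occupy adjacent cells. The function names are hypothetical. The words "ue", "shita" and "mae" appear in this chapter; "ushiro" (behind) and "yoko" (beside) are standard Japanese words filled in as an assumption.

```python
ADJECTIVES = {"red": "akai", "blue": "aoi", "white": "shiroi",
              "black": "kuroi", "yellow": "kiiroi"}

# Position of the first box minus position of the second box -> preposition.
# Grid axes (x right, y up, z toward the viewer) are an assumption.
RELATIONS = {
    (0, 1, 0): "ue",       # on
    (0, -1, 0): "shita",   # under
    (0, 0, 1): "mae",      # in front of
    (0, 0, -1): "ushiro",  # behind (assumed word)
    (1, 0, 0): "yoko",     # next to (assumed word)
    (-1, 0, 0): "yoko",
}

def relation(first_pos, second_pos):
    """Return the preposition relating the first box to the second, or None."""
    offset = tuple(a - b for a, b in zip(first_pos, second_pos))
    return RELATIONS.get(offset)

def describe(first, second, positions):
    """Build the sentence reported after touching `first`, then `second`."""
    word = relation(positions[first], positions[second])
    if word is None:
        return None  # the boxes are not adjacent
    sentence = (f"{ADJECTIVES[first]} hako wa {ADJECTIVES[second]} hako "
                f"no {word} ni arimasu.")
    return sentence[0].upper() + sentence[1:]
```

Because the relation is computed from the first box to the second, touching the same two boxes in the opposite order yields the inverse sentence, for example "shita" in place of "ue".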

The Query Wand enables extensive interaction without pressuring students to speak. These methods of exploration and explanation allow students to learn at their own pace in a stress-free environment. In keeping with the documented correlation between motivation, anxiety and performance in language programs, this should result in lower levels of learner anxiety and higher levels of learning (Ganschow et al., 1994; Aida, 1994; Williams, 1991; Horwitz et al., 1986).

(Figure 15&16)

Figure 15&16: Manipulating Boxes

Step IV: Listen, Understand and Respond

Once the component pieces of language have been absorbed, it is time for the students to practice and put them to use. Step IV presents model arrangements of boxes to the student (Figures 17 to 19). Each time a model is shown, the student is also given five colored boxes placed in a row (Figure 17). The student's task is to use the boxes she is given to recreate the model. When the student touches a box, it gives a spoken command describing where to place it. For example, touching the yellow box would instruct the student: "Kiiroi hako o shiroi hako no ue ni oite kudasai." (Put the yellow box on the white box.), just as can be seen in the model (Figure 18).

In the initial stages, the model remains visible to the student throughout the exercise. This enables the student to compare the aural commands to the model to check her understanding. Later, the model is kept hidden and only five boxes are presented to the student in a line. This makes the task more difficult, forcing the student to listen more closely and rely only on the aural instructions.

Once the student has built her own configuration according to the directions she received, she can use the Query Wand to check her accuracy (Figure 19). Once all five boxes have been placed, the Query Wand can be used to go step by step through the boxes to check each relationship. When the student touches two adjoining boxes in sequence, the computer informs her if she is correct or incorrect. For incorrect answers, the system will say: "Chigaimasu" and repeat the instruction to place the given box or boxes correctly. In this way, the student can work through the problem until she reaches a correct solution.
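The checking step can be sketched as follows. This is an illustration under assumptions, not the original implementation: boxes are assumed to sit on an integer grid (x right, y up, z toward the viewer), the model is assumed to be stored as a mapping from each box to its reference box and preposition, and the function names are hypothetical. The chapter does not say what the system speaks for a correct answer, so the sketch returns no speech in that case.

```python
ADJECTIVES = {"red": "akai", "blue": "aoi", "white": "shiroi",
              "black": "kuroi", "yellow": "kiiroi"}

# Preposition -> required offset of the box from its reference box.
# "ushiro" (behind) is an assumed word; the grid axes are an assumption.
OFFSETS = {"ue": (0, 1, 0), "shita": (0, -1, 0),
           "mae": (0, 0, 1), "ushiro": (0, 0, -1)}

def instruction(box, reference, word):
    """Build the spoken command for placing `box` relative to `reference`."""
    text = (f"{ADJECTIVES[box]} hako o {ADJECTIVES[reference]} hako "
            f"no {word} ni oite kudasai.")
    return text[0].upper() + text[1:]

def check_pair(box, reference, positions, model):
    """Check the pair touched by the Query Wand against the model.

    model maps box -> (reference box, preposition). Returns None if the
    model holds no instruction for `box`, else (is_correct, speech).
    """
    if box not in model:
        return None
    expected_reference, word = model[box]
    offset = tuple(a - b for a, b in
                   zip(positions[box], positions[expected_reference]))
    if reference == expected_reference and offset == OFFSETS[word]:
        return True, []
    return False, ["Chigaimasu.", instruction(box, expected_reference, word)]
```

Repeating the instruction after "Chigaimasu" gives the student the information needed to correct the arrangement and query it again.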

After the student successfully completes a series of model building problems, she is allowed to train the voice recognition system to the full extent of her ability. Again, a teacher checks and monitors the student's ability before giving final approval to train the computer.

(Figure 17&18&19)

Figure 17&18&19: Model Arrangements

Step V: Initiation

Last, students play a game in the virtual environment where they assemble stacks of boxes that match sample models. This game is intended to be played in a number of forms, according to the instructional setting.

The first form is for a single student using the world independently. The student's virtual hand is disabled, forcing her to use vocal and pointing commands to arrange the boxes. She can either give complete vocal commands such as "Aoi hako o akai hako no ue ni oite kudasai" (Put the blue box on the red box.), or combined voice and gesture commands like "Sore o shiroi hako no mae ni oite kudasai." (Put that in front of the white box) while pointing at an object. The objects are monitored by the software to check when the target configuration is reached.
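How a recognized command might be turned into an action is sketched below. The original recognizer matches whole trained phrases, so parsing the text of a command with a regular expression is an assumption made for illustration, as are the function names, the integer grid (x right, y up, z toward the viewer) and the word "ushiro" (behind). The sketch parses the command, replaces "sore" with the object currently pointed at, moves the box, and compares the arrangement with the target.

```python
import re

# Preposition -> offset from the reference box on an assumed integer grid.
OFFSETS = {"ue": (0, 1, 0), "shita": (0, -1, 0),
           "mae": (0, 0, 1), "ushiro": (0, 0, -1)}

COMMAND = re.compile(
    r"^(?P<obj>sore|\w+ hako) o (?P<ref>\w+ hako) "
    r"no (?P<rel>\w+) ni oite kudasai\.?$"
)

def parse_command(utterance, pointed=None):
    """Return (object, reference, preposition), or None if not understood.

    `pointed` is the name of the object the student is pointing at, if any.
    """
    match = COMMAND.match(utterance.strip().lower())
    if match is None or match.group("rel") not in OFFSETS:
        return None
    obj = match.group("obj")
    if obj == "sore":
        if pointed is None:
            return None  # "that" with nothing pointed at
        obj = pointed
    return obj, match.group("ref"), match.group("rel")

def apply_command(command, positions):
    """Move the object to the cell named by the command."""
    obj, ref, rel = command
    positions[obj] = tuple(p + d for p, d in zip(positions[ref], OFFSETS[rel]))

def target_reached(positions, target):
    """True when every box sits where the target configuration puts it."""
    return positions == target
```

Calling target_reached after each applied command corresponds to the monitoring described above: the game ends as soon as the arrangement equals the sample model.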

The second form is to be used in a class situation. A student immersed in the virtual world works with a student or group of students outside the world. The students outside arrange a set of five boxes. Then the students take turns explaining their model to the student in the virtual environment, trying to help him recreate the model as accurately as possible. Students can take turns asking and answering questions until the model is correct.

A third form of this game would use a multi-participant virtual reality system where a number of people can interact within the same virtual space. The game in this case could be a combination of the two previous forms. A multi-participant virtual environment could potentially be shared between remote sites as part of distance learning programs. Developing virtual environments which promote collaboration and group interaction over a distance is one of the most promising educational applications of this technology.

Summary

The process of using Zengo Sayu for teaching Japanese language covers a series of gradual steps. This methodology is consistent with the Natural Approach principles of having a silent period, gradual knowledge acquisition and the development of concrete associations between language and meaning. Attempts have been made to make student interaction with the world as seamless and natural as possible. At the same time, specific attention has been given to developing a set of easily identified iconic tools for use in virtual environments. The underlying theory guiding Zengo Sayu's development is that giving students control over their own learning process will increase their learning and motivation, while reducing anxiety and stress commonly associated with foreign language learning.




Human Interface Technology Laboratory