Barra, Paola, Cantone, Andrea Antonio, Francese, Rita, Giammetti, Marco, Sais, Raffaele, Santosuosso, Otino Pio, Sepe, Aurelio, Spera, Simone, Tortora, Genoveffa, Vitiello, Giuliana (2024): A Task-oriented Multimodal Conversational Interface for a CSCW Immersive Virtual Environment. In: Proceedings of the 22nd European Conference on Computer-Supported Cooperative Work: The International Venue on Practice-centered Computing on the Design of Cooperation Technologies - Exploratory papers, Reports of the European Society for Socially Embedded Technologies (ISSN 2510-2591), DOI: 10.48340/ecscw2024_ep05. Copyright 2024 held by the Authors.

A Task-oriented Multimodal Conversational Interface for a CSCW Immersive Virtual Environment

Paola Barra1, Andrea Antonio Cantone2, Rita Francese2, Marco Giammetti2, Raffaele Sais2, Otino Pio Santosuosso2, Aurelio Sepe2, Simone Spera2, Genoveffa Tortora2, and Giuliana Vitiello2
1 Department of Science and Technology, University of Naples Parthenope, Italy
2 Department of Computer Science, University of Salerno, Italy

Abstract. In CSCW immersive Virtual Reality environments, users may be uncomfortable when interacting with a two-dimensional menu. Multimodal conversational interfaces may enhance the interaction by enabling users to communicate with the system in different modalities. In this paper, we investigate the use of an embodied multimodal chatbot for improving interaction in a Virtual Reality (VR) environment simulating a working context. In particular, we adopt a User-Centered Design approach to build a multimodal conversational interface, named Muxi, in which a task-oriented voice avatar is enhanced with an interactive board for supporting meeting organization in VR. Users were involved in all the development phases, from task definition to iterative user testing. To assess the usability of the proposed interface, we conducted a controlled experiment involving 32 participants to compare the use of Muxi with a traditional menu-based interface in a CSCW environment. We performed quantitative analysis, concerning efficiency and effectiveness, and qualitative analysis, related to participants' cognitive load and perceived usability. Results revealed that our multimodal interface increases usability by greatly alleviating cognitive load and improving user performance, representing a good alternative to a menu-based interface.

1 Introduction

Since the advent of the Graphical User Interface (GUI), menus have been recognized as essential tools for computer users. They help users navigate through various items and select one of them. They are also adopted in 3D VR environments, where they require the user to select a menu item by using gestures or controllers to point at the object and confirm the selection (Mundt and Mathew, 2020; Wang et al., 2021). In VR, however, the user is immersed in a three-dimensional, spatial environment, and it may be uncomfortable to interact with a two-dimensional menu.
Multimodal conversational interfaces may enable the user to communicate with the computer in different modalities, such as speech, text, gesture, image, video, and sound. Introducing them in VR environments may improve the system's usability. VR voice assistants are generally implemented by using an avatar (Zhao et al., 2022) to increase the users' perception of presence and engagement by providing a more realistic interaction. The design of a multimodal conversational interface is not an easy task (Crovari et al., 2020; Francese et al., 2022). It requires choosing the most appropriate interaction modalities for the user, the task, and the context. In addition, multiple modalities have to be integrated coherently and consistently, providing clear and intuitive feedback to the user (Sebillo et al., 2009). In multimodal conversational interfaces, interaction may also depend on the chatbot type, ranging from service chatbots, useful for customer support (Mohamad Suhaili et al., 2021), to task-oriented chatbots, which help users complete tasks in specific domains, to personal assistants, which serve the user continuously, to general-purpose chatbots (Følstad et al., 2019). Chatbots are also adopted to support collaborative work and learning in VR environments (Trappey et al., 2022; David et al., 2019; De Lucia et al., 2009).

In this paper, we equipped the multi-user VR CSCW environment MetaCUX (Barra et al., 2023a,b) with a multimodal conversational interface, named Muxi, for helping users in tasks related to the setting up of a working environment, such as creating a meeting room. The main contributions of the paper are the following:
• We describe the User-Centered Design (UCD) approach we followed to design a multimodal task-oriented CSCW conversational interface.
• The proposed interface enhances the vocal interaction provided by an embodied avatar with a board GUI.
• We conduct a user study involving 32 participants aimed at assessing the impact of the use of a multimodal task-oriented chatbot versus a menu-based interface on user performance and perception when interacting in an immersive virtual environment.

The paper is structured as follows: Section 2 discusses related work. Section 3 describes the MetaCUX system and Section 4 describes the UCD methodology used for the development of the multimodal conversational interface Muxi. Section 5 describes the experimental user study, while in Section 6 results are reported and discussed. Finally, Section 7 concludes the paper.

2 Related work

In this section, we discuss the research efforts devoted to the use and assessment of menu-based and chatbot interfaces in VR environments, and the support offered by task-oriented chatbots in VR CSCW environments.

2.1 Menu-based interfaces in VR environments

Das and Borst (2010) compared different design choices for menus in VR: layout (pie vs. linear list), placement (fixed vs. contextual), and pointing method (ray vs. pointer-attached-to-menu), reporting the pros and cons of each of them. They involved 34 participants and compared time and errors. Mundt and Mathew (2020) assessed several types of pie menus with 24 participants, measuring usability, user experience, presence, error rate, and selection time. Lipari and Borst (2015) integrated touch menus into a cohesive smartphone-based VR controller, where users transitioned between a menu interaction area and areas for spatial interactions such as VR object navigation.
The study involved 20 participants and compared touch-menu selection and ray-based selection by measuring time, errors, and user satisfaction. Results showed that both techniques have advantages and disadvantages. Wang et al. (2021) assessed the use of handheld menus in VR that follow users as they move, without obstructing their vision. They compared two types of menu interfaces (fixed menu and handheld menu) and three selection techniques. The choice of solution depends on the context of use and the end-users.

2.2 Chatbots in VR environments

The number of experiments on chatbot usability reported in the literature has increased. Usability is generally assessed with experiments measuring effectiveness, efficiency, and satisfaction (Ren et al., 2022). The study proposed in (Nguyen et al., 2022) investigated disparities in user satisfaction between a chatbot and a menu-based interface in a mobile app. The findings reveal that the use of the chatbot results in a decreased level of perceived autonomy and an increased cognitive load compared to menu-based interfaces, ultimately leading to lower user satisfaction. This study suggests that advanced technology may not always be the optimal solution to organizational problems and could lead to unintended negative consequences if user concerns are not adequately addressed. Concerning the usability assessment of chatbots in VR, few works performed this kind of analysis. Trappey et al. (2022) introduced a VR chatbot trained to answer frequently asked questions (FAQs) from a power transformer manufacturer. They assessed only the performance of the NLP models, which achieved an accuracy rate exceeding 91%; no user study was conducted. In (Xie et al., 2023), chatbots are integrated into a university platform to assist both students and teachers with various tasks; again, no user study was conducted. Pick et al. (2017) compared speech-based and pie-menu-based interaction for the control of complex VR applications. They conducted a user study with 20 participants and assessed their performance in terms of time and errors, as well as perceived usability. They found that speech is faster, whereas pie menus are less error-prone.

2.3 Supporting CSCW with task-oriented chatbots

Task-oriented chatbots in VR are designed with specific purposes. They focus on assisting users in achieving well-defined tasks within the working VR environment. Examples include managing virtual meetings, coordinating complex projects, or providing real-time information. Wang et al. (2023) provided guidance for online retailers to design chatbots with appropriate communication styles for effective service recovery in electronic commerce. Trappey et al. (2022) considered the context of industrial equipment manufacturing, involving customized design, assembly, installation, and maintenance services for electric power transformers. These services cater to the specific needs of customers. They proposed a VR-enabled chatbot for intelligent engineering consultation, which provides VR users with highly interactive and realistic graphical views during engineering counseling sessions.

Figure 1: The adopted development process.

Unlike previous works that investigated the use of chatbot or menu-based interfaces, we propose a multimodal conversational interface enhancing a vocal chatbot, represented by a virtual avatar, with an interactive board.
The novelty of this paper also lies in adapting UCD principles to chatbot development in a CSCW VR meeting environment, ensuring that these AI-driven interfaces truly enhance user experiences and seamlessly integrate into users' daily interactions. We also assess the usability of the multimodal interface by comparing it with an already existing menu-based interface.

3 The MetaCUX system

The MetaCUX system is a multi-user CSCW VR immersive environment (Barra et al., 2023b,a). It allows users to choose a customized avatar, offered by Meta, for navigating different virtual rooms. A user can play two roles: the organizer, who can create a new public or private room, select the scenario, and manage the creation and scheduling of various activities, such as meetings and interviews; or the participant, who can enter rooms and take part in activities organized by other users. Whenever the organizer changes the room scenario, all users in the room can view the change in real time; the new scenario is automatically loaded for everyone.

4 Enhancing MetaCUX through a UCD approach

The goal of introducing a multimodal assistant into MetaCUX is to simplify the interaction. To this aim, we adopted a UCD methodological approach consisting of the steps summarized in Fig. 1. We involved the users in the various development phases.

4.1 Requirement definition

Muxi is expected to assist the user in performing meeting management inside the MetaCUX environment.

4.1.1 Chatbot definition

The chatbot is task-oriented (chatbot type definition), to be used in a CSCW VR environment. For this type of chatbot, we selected a user-driven dialogue (dialogue type definition): the chatbot has to identify the user intent, serve it, and provide feedback on the result. The relation is short-term (relation type definition): the chatbot considers the user as a newcomer and does not remember past interactions (Følstad et al., 2019).

4.1.2 Task definition

To identify the tasks that require greater support, we performed a preliminary study on the management of a meeting in MetaCUX, involving three HCI experts. We first let them freely use the original system and then asked them to use the features for creating rooms, changing the scenarios, scheduling and participating in a meeting, writing on the whiteboard, and so on. Then, immediately after leaving the experience, we conducted a focus group (Kontio et al., 2008) following a discussion template previously prepared by the authors of this paper (Cassell et al., 2004). During the focus group meeting, which lasted 30 minutes, participants had to reveal the positive and negative aspects of their experience. From the discussion it emerged that the most critical interaction aspects concerned the following tasks:
• Creating a new room;
• Scheduling a new meeting;
• Changing the room scenario.

4.2 NLP model development

In this study, targeted data collection was conducted to develop two Deep Learning models: one for intent recognition and one for Named Entity Recognition (NER), so as to understand and interpret users' intentions and recognize named entities within a voice request. We associated each interaction task with an intent that Muxi has to detect in order to accomplish the task.
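As an illustration of how the two models together turn a voice request into a structured command, the minimal sketch below pairs a detected intent with the entities extracted from the sentence. The intent identifiers and slot names used here are placeholders of ours; the actual NER tag set adopted in Muxi is defined in Section 4.2.3.

from dataclasses import dataclass, field

@dataclass
class ParsedRequest:
    intent: str                                    # e.g., create_room, create_meeting, change_scenario
    entities: dict = field(default_factory=dict)   # slot name -> value extracted by the NER model

# Example: "Create an interview room for twenty people called Job Interview"
example = ParsedRequest(
    intent="create_room",
    entities={"scenario_type": "interview", "participants": "twenty", "room_name": "Job Interview"},
)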
4.2.1 Dataset creation

To create a dataset for training the NLP models implementing the chatbot Muxi, we first studied possible human-based dialogues for performing those tasks. Thus, we conducted a survey that provided an example of each of the three intents the chatbot has to serve and asked respondents to write two sentences for each of them; this small number of sentences was chosen to avoid overloading the respondents. A group of 31 volunteers (Computer Science students) was involved. They were asked to simulate the inquiry of a voice assistant for performing the considered tasks and to fill out the form with their requests. We collected 186 sentences.

4.2.2 Data Augmentation

We removed the duplicated sentences. Then, the original dataset, consisting of 157 sentences with their corresponding intent labels, was given as input to ChatGPT, which was asked to generate similar sentences and to alter the structure of the existing ones. This approach resulted in a new dataset of 900 sentences, 300 for each of the three intents. To create a training dataset for intent detection, the collected data were pre-processed as follows.
1. Data Cleaning, consisting of removing duplicate and inconsistent sentences.
2. Tokenization, the splitting of a large text into tokens, which are shorter texts.
3. Stopwording, consisting of the removal of commonly used terms, such as "a", "an", and "the".
4. Lemmatization, consisting of reducing words to their root, e.g., "running" is reduced to "run".
5. Vectorization and Transformation, in which the text data were converted into a numeric format so that they can be used as input for BERT.

4.2.3 NER dataset creation

The tagging of the NER dataset was done manually. In particular, we defined the tags for each intent as follows:
• Create rooms:
  – Scenario type: B_TYPE_SCEN;
  – Number of participants: B_NUM_PART;
  – Room name: B_NAME.
• Create meeting:
  – Meeting name: B_NAME-MEETING;
  – Meeting description: B_DESCR;
  – Day: B_DAY;
  – Month: B_MONTH;
  – Start time: B_HOUR-START;
  – Finish time: B_HOUR-END.
• Change scenario:
  – Scenario name: B_NAME_SCEN.
The dataset was divided into sentences and words, along with their respective named entity labels. Additionally, missing labels were filled using the forward-fill method to ensure dataset consistency. The labels were converted to uppercase for uniform formatting.

4.3 Model development

For intent recognition, we adopted the pre-trained BERT model (https://huggingface.co/docs/transformers/index). In particular, we adapted BERT to the specific task of intent recognition by adding a dropout layer to prevent overfitting and an output layer for the three intent classes. The NER model is also based on BERT, using the implementation provided by the simpletransformers library (https://simpletransformers.ai/). Both models were validated with K-fold cross-validation (K = 5). Their performance on the test set was assessed using standard multiclass evaluation metrics, namely Macro Average Precision (MAPrecision), Macro Average Recall (MARecall), and Macro Average F1 (MAF1) (Berger and Guda, 2020), reported in Table I for both models. These measures assess task completion success and are computed as follows, where p_i and r_i are the precision and recall computed on the multiclass confusion matrix for the i-th class, i = 1, ..., 3:

MAPrecision = (1/3) * sum_{i=1..3} p_i,   MARecall = (1/3) * sum_{i=1..3} r_i,
MAF1 = 2 * (MAPrecision * MARecall) / (MAPrecision + MARecall)

Table I: Model performance
Model        MAPrecision   MARecall   MAF1
Intent rec.  92.72         91.52      92.12
NER          89.24         88.07      88.65
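For illustration, the sketch below shows one way to realize the intent-recognition architecture described above with the Hugging Face transformers library: a pre-trained BERT encoder followed by a dropout layer and a three-way output layer. The bert-base-uncased checkpoint, the dropout rate, and the use of the pooled [CLS] representation are our assumptions; the paper only specifies a pre-trained BERT model with a dropout layer and a three-class output layer (the NER model instead relies on the simpletransformers implementation).

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class IntentClassifier(nn.Module):
    """BERT encoder with a dropout layer and an output layer for the three intents."""
    def __init__(self, n_intents: int = 3, dropout: float = 0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)          # regularization to limit overfitting
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_intents)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                  # [CLS]-based sentence representation
        return self.classifier(self.dropout(pooled))

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = IntentClassifier()
batch = tokenizer(["Create a meeting room for twenty people called job interviews"],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.softmax(logits, dim=-1)               # per-intent confidence scores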
When the classification score is below 75% for all three intents, we consider that the chatbot has not understood the question or that the question is inappropriate.

4.4 Multimodal interface design

The design phase is concerned with both the visual appearance of the interface and the interaction modality the interface offers.

Table II: The adopted usability guidelines (Crovari et al., 2020)
ID   Guideline
P1   Show, don't tell.
P2   Separate feedback from support.
P3   Show information only when necessary.
P4   Design a light interface: emphasize content.
P5   Show one modality at a time.
P6   Do not overload multiple modalities beyond user preferences and capabilities.
P7   Use multi-modality to resolve ambiguities.

4.4.1 Visual Interface design

To make the user experience more engaging and realistic in a task-oriented interaction, we decided to represent the chatbot with an avatar. In some cases, visual interaction may be preferred to the vocal one, e.g., when a list of the available actions is provided. Thus, we decided to offer a multimodal interface consisting of the chatbot avatar equipped with an interactive board. The appearance of the avatar and the board should be appropriate for the type of environment in which they are introduced, a working setting in our case. The final result is shown in Fig. 2, where the user is on the left (with a label showing their name) and the avatar is on the right, near the board. We animated the avatar with movements that resemble a person gesturing while speaking.

4.4.2 Interaction design

The different communication approaches have to be designed so that the two elements (chatbot and board) are combined without overloading or confusing the user, also considering the wide space of the virtual environment. We followed the design guidelines (Crovari et al., 2020) summarized in Table II, while Table III describes how the guidelines have been applied to the Muxi design. As shown in Fig. 2, the user avatar starts the interaction with the chatbot by pressing the "Ask me" button on the board. In particular, the P1 guideline is related to providing the user with feedback on what the chatbot has understood of the request. The user may pronounce a sentence, such as "Create a meeting room for twenty people called job interviews." Visual feedback is provided on the upper part of the board, where the text of the user command is displayed. This allows the user to repeat the command in case of misunderstanding. If the conversation is outside the three identified topics, the chatbot vocally states that it does not understand the question and shows on the board a description of the tasks it can perform (P2).

Figure 2: The multimodal conversational interface.

Table III: Application of the usability guidelines to the Muxi interface design
ID   Guideline application
P1   Use the visual interface to display the user sentence after it is pronounced.
P2   Feedback on the performed operation is provided vocally, while support (e.g., what the user can or should do in the next interactions) is shown visually.
P3   The GUI changes according to the conversation.
P4   The vocal interface provides only essential information.
P5   One modality at a time is adopted to provide information.
P6   Feedback is provided vocally; the list of actions is provided in the support visual interface.
P7   Both the conversational and the visual interfaces produce a message when a task is successfully executed.
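To make the combination of the two modalities concrete, the sketch below shows one possible control flow for a single voice request, combining the 75% confidence fallback with guidelines P1, P2, and P7. The class and function names (Board, Voice, handle_request, execute_intent) are hypothetical and do not come from the MetaCUX code base; this is an illustration of the routing logic, not the actual implementation.

class Board:
    def show_user_sentence(self, text): print(f"[board] You said: {text}")
    def show_supported_tasks(self, tasks): print(f"[board] I can help you: {', '.join(tasks)}")
    def show_confirmation(self, msg): print(f"[board] {msg}")

class Voice:
    def speak(self, text): print(f"[voice] {text}")

def execute_intent(intent, transcript):
    # Placeholder for the actual MetaCUX action (room creation, scheduling, ...).
    return f"executed '{intent}'"

CONFIDENCE_THRESHOLD = 0.75

def handle_request(transcript, intent_scores, board, voice):
    # P1 (show, don't tell): display the transcribed sentence on the upper part of the board.
    board.show_user_sentence(transcript)
    best_intent, best_score = max(intent_scores.items(), key=lambda kv: kv[1])
    if best_score < CONFIDENCE_THRESHOLD:
        # All scores below 75%: say the request was not understood (vocal feedback)
        # and describe the supported tasks on the board (visual support), guideline P2.
        voice.speak("Sorry, I did not understand your request.")
        board.show_supported_tasks(["create a room", "schedule a meeting", "change the scenario"])
        return
    result = execute_intent(best_intent, transcript)
    # P7: both modalities signal a successfully executed task.
    voice.speak(result)
    board.show_confirmation(result)

handle_request("Create a meeting room for twenty people called job interviews",
               {"create_room": 0.93, "create_meeting": 0.05, "change_scenario": 0.02},
               Board(), Voice())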
4.5 Deployment

The multimodal conversational interface is deployed on a client-server system and communicates with the client via API calls. The interaction between the user and the bot has been implemented as follows.
• Speech-To-Text: for Speech-To-Text (STT) we adopted the wit.ai NLP platform (https://wit.ai/), which provides various tools and services to build conversational interfaces.
• Bot intent elaboration: the transcribed text is sent to the trained NLP models, which recognize the intent of the sentence and the related attributes.
• Text-To-Speech result: the open-source TTS engine eSpeak (https://espeak.sourceforge.net) was selected.
• Avatar implementation: the Muxi avatar performs the required action and provides feedback to the user. It was created using the Ready Player Me SDK (https://readyplayer.me/) and has lip and body movements synchronized with the bot's voice during conversation. It is included in MetaCUX.

4.6 Pilot user testing

We performed two iterations. In the first, we involved three users (the same who participated in the task definition phase) to experiment with a first Wizard-of-Oz prototype, in which the avatar speech was pronounced by one of the authors, while another author managed the room changes and the display. One of the participants suggested better highlighting the "Ask me" button. In the first iteration, the avatar was a futuristic man with a head-mounted display; an avatar more appropriate to a working setting was preferred, such as the formally dressed man shown in Fig. 2. We performed a second iteration with the same users and a running prototype. Participants suggested displaying the user sentence (see the top of the board in Fig. 2) and adding chatbot feedback when the intent is not understood. We enhanced the final prototype with these last suggestions and then performed a user study with real users, as described in the following.

5 Evaluation planning and design

The experimental design and the other measures were approved by the Computer Science Ethics Board of the University of Salerno. Participants joined the study voluntarily, and they could leave at any time without having to justify their decision.

Figure 3: (a) Age and (b) experience with VR devices of the participants.

5.1 Goal

The goal of the study is the following (Basili and Rombach, 1988): Experiment with the interaction in a CSCW VR environment in order to evaluate the impact of the use of a task-oriented multimodal conversational interface when compared with a menu-based interface with respect to usability from the point of view of end-users in the context of a meeting management setting. Starting from this goal, we formulated the following Research Question (RQ):
RQ: When immersed in a virtual environment, is there a difference in usability between using a multimodal conversational interface and a menu-based interface?

5.2 Participants

We involved 32 participants from the University of Salerno: 22 males and 10 females. Their age distribution is shown in Fig. 3(a), while their previous experience with head-mounted VR devices is summarized in Fig. 3(b).

5.3 Tasks

We identified the following two tasks:
• T1: create a new room and change the room scenario;
• T2: schedule a meeting.
In particular, for task T1, we asked participants to perform these activities: "Create an interview room for 20 people called Job Interview" and "Change the environment in a meeting room"; for task T2: "Schedule a meeting with the development team on December 17th from seven to eight o'clock".

5.4 Study design

Participants performed the two tasks T1 and T2 described in the previous section. They were randomly assigned to two groups, named Group1 and Group2, except for participants expert in VR use, who were equally distributed between the groups. All participants performed both tasks and were exposed to two treatments: Menu, in which the user interacts with the menu-based interface, and Chatbot, in which the interaction occurs with Muxi. To avoid bias due to task ordering, we adopted a crossover design (Vegas et al., 2016), where the Menu treatment is applied to T1 for Group1 and to T2 for Group2, and vice versa for the Chatbot treatment. Figure 4 shows the study design.

5.5 Variables and Measurements

As independent variable, we considered the treatment, with the two levels Menu and Chatbot. To assess the two interaction modalities, we measured the following dependent variables representing usability, grouped into performance measures and user perceptions.
Performance measures, measuring performance in terms of:
• Efficiency: Time. It measures the time to perform a task.
• Effectiveness: Errors. It measures the number of errors committed during the execution, e.g., the number of times the chatbot failed in the Chatbot treatment and the number of user errors in the Menu treatment.
Users' perceptions, measuring user satisfaction in terms of:
• Cognitive load, measured through the NASA Task Load Index (NASA-TLX) questionnaire (Hart, 2006). It consists of six subscales representing the following factors: Mental, Physical, and Temporal Demand, Frustration, Effort, and Performance. For simplicity, the raw NASA-TLX version was applied (Hart, 2006), where each subscale gets a score from 0 (low) to 100 (high), except Performance, which goes from 0 (good) to 100 (poor). The final score is the mean of the individual scores; a smaller score means a lower cognitive load experienced while performing a task.
• Perceived usability, measured through the System Usability Scale (SUS) questionnaire (Bangor et al., 2008), a widely used usability tool based on a ten-item survey. The SUS was evaluated following the standard approach: all items were rated on a 1-5 Likert scale; positively worded items were transformed as (x_pos - 1) (adjusting them to 0-4), while negatively worded items were transformed as (5 - x_neg) (reversing the scale and adjusting it to 0-4). The SUS score was then computed by summing all items and multiplying the sum by 2.5, resulting in a final score on a scale of 0-100.
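A minimal sketch of the two scoring procedures described above is given below. It assumes the standard SUS ordering in which odd-numbered items are positively worded and even-numbered items negatively worded; the variable names and the example ratings are ours and do not correspond to the experimental data.

def sus_score(items):
    """items: ten 1-5 Likert ratings in questionnaire order (odd items positive, even negative)."""
    total = 0
    for i, x in enumerate(items, start=1):
        total += (x - 1) if i % 2 == 1 else (5 - x)   # positive items -> x-1, negative -> 5-x
    return total * 2.5                                # final score on a 0-100 scale

def raw_tlx_score(subscales):
    """subscales: six 0-100 ratings (Mental, Physical, Temporal Demand, Frustration, Effort, Performance)."""
    return sum(subscales) / len(subscales)            # raw (unweighted) TLX: mean of the subscales

print(sus_score([3] * 10))                  # a respondent rating every item neutrally scores 50.0
print(raw_tlx_score([60, 30, 40, 20, 50, 40]))  # example raw TLX score: 40.0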
Figure 4: Study design.
Figure 5: The MetaCUX interface for the Menu treatment.

5.6 Experimental objects

The Chatbot and Menu treatments are conducted in a virtual environment developed ad hoc and hosted on the MetaCUX platform, exposing the two interfaces depicted in Fig. 2 and Fig. 5, respectively. To interact with the former interface, users have to speak with the avatar, while they have to adopt controllers to perform the tasks when using the latter.

5.7 Procedure

During the experiment, we followed this procedure:
• Recruitment. Participants were recruited at the University of ***. After collecting their consent form, they filled in a pre-test questionnaire collecting demographic information and their experience with VR technology.
• Assignment. Considering the results of the pre-test, we randomly distributed the participants with and without previous VR experience between the two groups (Wohlin et al., 2012).
• Training. Participants received training on how to use the Meta Quest 2 device and its controllers to engage with the virtual environment. The duration of this training session was twenty minutes.
• Operation. The participants individually performed the two tasks according to the study design in Fig. 4. At the end of each task, they filled in the questionnaires described in Section 5.5. A post-task single open question was also proposed: What are the positive and negative aspects of this mode of interaction?

5.8 Analysis procedure

We aim to assess the effect of one factor, the interaction modality, on the dependent variables Effectiveness and Efficiency. For quantitative variables, a t-test for normally distributed data or, otherwise, a Wilcoxon Signed Rank Test may be adopted, as our factor has only two levels. We also measure the effect size, using Cohen's d in the case of normally distributed data and Cliff's delta (Cliff, 2014) otherwise. We fixed the significance level (α) at 0.05. Concerning the user perception analysis related to cognitive load and perceived usability, since all questions are measured on an ordinal Likert scale, we analyze the questionnaire results by examining the medians and adopting nonparametric tests.
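As a rough illustration of this analysis procedure, the sketch below runs a normality check, the paired Wilcoxon test, and a Cliff's delta computation with scipy and numpy. The arrays are placeholder values, not the measurements collected in the experiment.

import numpy as np
from scipy.stats import shapiro, wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta: fraction of pairs with a > b minus fraction with a < b."""
    a, b = np.asarray(a), np.asarray(b)
    greater = sum((x > b).sum() for x in a)
    smaller = sum((x < b).sum() for x in a)
    return (greater - smaller) / (len(a) * len(b))

# Placeholder paired measurements (one value per participant and treatment).
chatbot_time = np.array([40, 35, 50, 42, 38, 61, 45, 39])
menu_time    = np.array([58, 60, 72, 55, 49, 80, 66, 59])

_, p_norm = shapiro(chatbot_time - menu_time)      # normality of the paired differences
stat, p_value = wilcoxon(chatbot_time, menu_time)  # non-parametric paired test
delta = cliffs_delta(chatbot_time, menu_time)      # negative delta: Chatbot times are lower
print(p_norm, p_value, delta)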
6 Results

6.1 Performance analysis

Descriptive statistics of the dependent variables by treatment are reported in Table IV. It is possible to see that the Chatbot-based interaction modality reached better time performance (Median = 40.5 sec.) when compared to the menu-based interaction (Median = 61 sec.). This is confirmed by the statistical analysis: Time was not normally distributed, thus we applied the Wilcoxon Signed Rank Test, which resulted in a statistically significant difference with a large negative effect size (see Table V). Similar results were also obtained for Errors. Fig. 6 shows the boxplots of the times for the two tasks, which confirm this trend for both tasks. A similar trend also occurs for Errors (see Fig. 7).

Table IV: Descriptive statistics for the dependent variables
Variable     Treatment   Mean    SD      Min   Median   Max
Time (sec.)  Chatbot     43.63   16.25   27    40.5     99
             Menu        59.69   15.52   31    61       98
Errors       Chatbot     0.59    1.01    0     0        4
             Menu        2       1.19    0     2        4

Figure 6: Boxplots of the times to accomplish the two tasks.
Figure 7: Boxplots of the errors committed during the two tasks.

Table V: Results of the statistical analysis of the quantitative variables
Variable   p-value     Cliff's delta
Time       8.303e-06   -0.6484375 (large)
Errors     2.69e-06    -0.6484375 (large)

Table VI: Medians for the NASA-TLX questionnaire (lower values mean less load and better performance)
Variable               Chatbot   Menu
Mental Demand          35        70
Physical Demand        20        65
Temporal Demand        25        50
Frustration            15        35
Performance            15        70
Effort                 30        60
NASA-TLX total score   33.33     58.33

Table VII: SUS score
Variable    Chatbot   Menu
SUS score   82.5      48.75

6.2 User perception analysis

6.2.1 Cognitive load

As shown in Table VI, the medians of all the NASA-TLX scales for the Chatbot treatment outperform those of the Menu treatment. This is confirmed by the statistical analysis (Table VIII): users perceived less cognitive load on all the scales when using the chatbot, with a large negative effect size.

6.2.2 Perceived usability

We assessed the SUS score for each participant and treatment. Fig. 8 shows the results for the single questions. Globally, the Chatbot interface is always better perceived (note that even-numbered questions have been reversed). The SUS score is 54.06 and 82.19 for the Menu and Chatbot treatments, respectively, as shown in Table VII.

Table VIII: Results of the statistical analysis of the user perception variables (Chatbot vs Menu)
Variable          p-value     Cliff's delta
Mental Demand     3.168e-08   -0.79 (large)
Physical Demand   1.635e-07   -0.75 (large)
Temporal Demand   3.646e-11   -0.9492188 (large)
Frustration       1.49e-05    -0.6152344 (large)
Performance       6.007e-07   -0.71875 (large)
Effort            2.535e-07   -0.7363281 (large)
Total             6.695e-10   -0.8955078 (large)
SUS score         1.573e-07   0.7617188 (large)

Figure 8: SUS Likert scores summarized for both treatments (negative answers are reversed).

6.3 Discussion

Performance measures revealed that, to support task-oriented activities in a CSCW VR environment, a multimodal conversational interface may represent a good alternative to a controller-based menu interface. Indeed, the new interface performed better in both tasks, as shown in Figures 6 and 7. According to Bangor et al. (2008), a SUS score higher than 77.8 is in the fourth quartile. This indicates that Muxi, which scored 82.5, has no relevant usability problem when compared with the menu approach of the previous version of the system, which scored 48.75. NASA-TLX scores highlight that the cognitive load was significantly lower for all the subscales. In particular, Physical and Mental Demand and Performance seem the most affected by the different interaction types. The performance of the NLP models was further assessed in the experiment, obtaining good results: Muxi user performance and perceptions were far better than the menu ones, with a large effect size. This may suggest that the proposed UCD development approach has successfully met the users' needs. Concerning the open question related to the chatbot experience, an expert participant wrote: "Using an avatar allowed me to do what was asked of me quickly and in a short time. In this way, however, I had less interaction with the virtual environment in general." A non-expert user commented: "It was easier to use your voice rather than the headset controllers to experiment. I did the task much faster." A non-expert user commented on the menu interface: "The negative aspect is that this mode of interaction, for those who are less accustomed to the use of viewers or technology in general, can cause frustration." These comments may indicate that chatbot-based interfaces can be particularly useful for non-expert users to start familiarizing themselves with the environment. Only two experts participated in the experiment; they showed the same trend for all the factors except Performance and Physical effort: both rated the menu Performance better than the Chatbot one and reported a reduced physical effort in the Menu case.

6.4 Threats to validity

To address the threats that may affect the validity of our findings, we follow the recommendations by Wohlin et al. (2012).
External validity. We conducted our experiment with a few participants having different abilities in the use of the technology, which may pose a threat to the interaction of selection and treatment (i.e., the findings may not apply to all people with the same skills). We tried to limit this threat by uniformly distributing the most skilled participants between the two groups. Furthermore, the adopted multimodal conversational interface was designed to be appealing and easy to use, but we acknowledge that our findings may not apply to a different setting (interaction of setting and treatment). The selected tasks were also associated specifically with the MetaCUX environment. We formulated the two tasks in such a way as to have about the same duration, to avoid different cognitive loads in the Menu treatment.
Internal validity. The voluntary participation may introduce a selection threat because volunteers are usually more motivated than the whole population.
Construct validity. We mitigated the social threats. In particular, participants were not evaluated (evaluation apprehension), and we did not communicate the experiment's aim, to avoid influencing their opinion (experimenter expectancy).
Conclusion validity. The threat of violated assumptions of statistical tests may exist. To mitigate this threat, we adopted non-parametric tests and effect size measures for data that were not normally distributed and for qualitative data.

7 Conclusion

In this paper, we described the User-Centered Design process we adopted to create the task-oriented multimodal conversational interface for the CSCW VR environment named MetaCUX. A vocal chatbot embodied by an avatar is enhanced by an interactive board for supporting meeting management, easing interaction with respect to the original menu-based interface and showing additional content. The empirical investigation involving 32 users aimed to compare the usability of a menu-based interface with the proposed multimodal interface. Both the performance and the user perception analyses revealed that the multimodal interface obtained better results in all the considered aspects. Thus, the proposed multimodal interface may constitute a valid solution for designing task-oriented chatbots in CSCW VR environments.

References

Bangor, A., P. T. Kortum, and J. T. Miller (2008): 'An empirical evaluation of the system usability scale'. Intl. Journal of Human-Computer Interaction, vol. 24, no. 6, pp. 574–594.
Barra, P., A. A. Cantone, R. Francese, M. Giammetti, R. Sais, O. P. Santosuosso, A. Sepe, S. Spera, G. Tortora, and G. Vitiello (2023a): 'MetaCUX - a multi-user, multi-scenario environment for a cooperative workspace'. In: Proceedings of the 15th Biannual Conference of the Italian SIGCHI Chapter. pp. 1–3.
Barra, P., A. A. Cantone, R. Francese, M. Giammetti, R. Sais, O. P. Santosuosso, A. Sepe, S. Spera, G. Tortora, and G. Vitiello (2023b): 'MetaCUX: Social Interaction and Collaboration in the Metaverse'. In: IFIP Conference on Human-Computer Interaction. pp. 528–532.
Basili, V. R. and H. D. Rombach (1988): 'The TAME Project: Towards Improvement-Oriented Software Environments'. IEEE Transactions on Software Engineering, vol. 14, no. 6, pp. 758–773.
Berger, A. and S. Guda (2020): 'Threshold optimization for F measure of macro-averaged precision and recall'. Pattern Recognition, vol. 102, p. 107250.
Cassell, C., G. Symon, and N. King (2004): Using Templates in the Thematic Analysis of Text, pp. 257–270. SAGE Publications, London.
Cliff, N. (2014): Ordinal methods for behavioral data analysis. Psychology Press.
Crovari, P., S. Pidó, F. Garzotto, and S. Ceri (2020): 'Show, don't tell. Reflections on the design of multi-modal conversational interfaces'. In: International Workshop on Chatbot Research and Design. pp. 64–77.
Das, K. and C. W. Borst (2010): 'An evaluation of menu properties and pointing techniques in a projection-based VR environment'. In: 2010 IEEE Symposium on 3D User Interfaces (3DUI). pp. 47–50.
David, B., R. Chalon, B. Zhang, and C. Yin (2019): 'Design of a collaborative learning environment integrating emotions and virtual assistants (chatbots)'. In: 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD). pp. 51–56.
De Lucia, A., R. Francese, I. Passero, and G. Tortora (2009): 'Development and evaluation of a system enhancing Second Life to support synchronous role-based collaborative learning'. Softw. Pract. Exp., vol. 39, no. 12, pp. 1025–1054.
Følstad, A., M. Skjuve, and P. B. Brandtzaeg (2019): 'Different chatbots for different purposes: towards a typology of chatbots to understand interaction design'. In: Internet Science: INSCI 2018 International Workshops, St. Petersburg, Russia, October 24–26, 2018, Revised Selected Papers 5. pp. 145–156.
Francese, R., A. Guercio, V. Rossano, and D. Bhati (2022): 'A Multimodal Conversational Interface to Support the creation of customized Social Stories for People with ASD'. In: P. Bottoni and E. Panizzi (eds.): AVI 2022: International Conference on Advanced Visual Interfaces, Frascati, Rome, Italy, June 6-10, 2022. pp. 19:1–19:5, ACM.
Hart, S. G. (2006): 'NASA-task load index (NASA-TLX); 20 years later'. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 50. pp. 904–908.
Kontio, J., J. Bragge, and L. Lehtola (2008): Guide to Advanced Empirical Software Engineering, Chapt. The Focus Group Method as an Empirical Tool in Software Engineering, pp. 93–116. Springer.
Lipari, N. G. and C. W. Borst (2015): 'Handymenu: Integrating menu selection into a multifunction smartphone-based VR controller'. In: 2015 IEEE Symposium on 3D User Interfaces (3DUI). pp. 129–132.
Mohamad Suhaili, S., N. Salim, and M. N. Jambli (2021): 'Service chatbots: A systematic review'. Expert Systems with Applications, vol. 184, p. 115461.
Mundt, M. and T. Mathew (2020): 'An evaluation of pie menus for system control in virtual reality'. In: Proceedings of the 11th Nordic Conference on Human-Computer Interaction: Shaping Experiences, Shaping Society. pp. 1–8.
Nguyen, Q. N., A. Sidorova, and R. Torres (2022): 'User interactions with chatbot interfaces vs. menu-based interfaces: An empirical study'. Computers in Human Behavior, vol. 128, p. 107093.
Pick, S., A. S. Puika, and T. W. Kuhlen (2017): 'Comparison of a speech-based and a pie-menu-based interaction metaphor for application control'. In: 2017 IEEE Virtual Reality (VR). pp. 381–382.
Ren, R., M. Zapata, J. W. Castro, O. Dieste, and S. T. Acuña (2022): 'Experimentation for Chatbot Usability Evaluation: A Secondary Study'. IEEE Access, vol. 10, pp. 12430–12464.
Sebillo, M., G. Vitiello, and M. De Marsico (2009): Multimodal Interfaces, pp. 1838–1843. Boston, MA: Springer US.
Trappey, A. J., C. V. Trappey, M.-H. Chao, and C.-T. Wu (2022): 'VR-enabled engineering consultation chatbot for integrated and intelligent manufacturing services'. Journal of Industrial Information Integration, vol. 26, p. 100331.
Vegas, S., C. Apa, and N. Juristo (2016): 'Crossover Designs in Software Engineering Experiments: Benefits and Perils'. IEEE Transactions on Software Engineering, vol. 42, no. 2, pp. 120–135.
Wang, S., Q. Yan, and L. Wang (2023): Task-oriented vs. social-oriented: Chatbot communication styles in electronic commerce service recovery, pp. 1–33. Springer.
Wang, Y., Y. Hu, and Y. Chen (2021): 'An experimental investigation of menu selection for immersive virtual environments: fixed versus handheld menus'. Virtual Reality, vol. 25, pp. 409–419.
Wohlin, C., P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén (2012): Experimentation in Software Engineering. Springer Science & Business Media.
Xie, Q., W. Lu, Q. Zhang, L. Zhang, T. Zhu, and J. Wang (2023): 'Chatbot Integration for Metaverse - A University Platform Prototype'. In: 2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS). pp. 1–6.
Zhao, Y., J. Jiang, Y. Chen, R. Liu, Y. Yang, X. Xue, and S. Chen (2022): 'Metaverse: Perspectives from graphics, interactions and visualization'. Visual Informatics, vol. 6, no. 1, pp. 56–67.