Shall I describe it or shall I move closer? Verbal references and locomotion in VR collaborative search tasks. Riccardo Bovo, Daniele Giunchi, Enrico Costanza, Anthony Steed, Thomas Heinis. Imperial College London, University College London. rb1619@ic.ac.uk, d.giunchi@ucl.ac.uk

Riccardo Bovo, Daniele Giunchi, Enrico Costanza, Anthony Steed, Thomas Heinis (2022): Shall I describe it or shall I move closer? Verbal references and locomotion in VR collaborative search tasks. In: Proceedings of the 20th European Conference on Computer-Supported Cooperative Work: The International Venue on Practice-centred Computing on the Design of Cooperation Technologies - Exploratory Papers, Reports of the European Society for Socially Embedded Technologies (ISSN 2510-2591), DOI: 10.48340/ecscw2022_ep02. Copyright 2022 held by Authors, DOI: 10.18420/ecscw2022_ep02. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact the Authors.

Abstract. Research in pointing-based communication within immersive collaborative virtual environments (ICVEs) remains a compelling area of study. Previous studies explored techniques to improve accuracy and reduce errors when hand-pointing from a distance. In this study, we explore how users adapt their behaviour to cope with a lack of pointing accuracy. In an ICVE where users can move (i.e., locomotion), the pointing inaccuracy caused by the lack of laser pointers can be avoided by getting closer to the object of interest. Alternatively, collaborators can enrich their utterances with details to compensate for the lack of pointing precision. Inspired by previous CSCW work on remote desktop collaboration, we measure visual coordination, the implicitness of deictic utterances, and the amount of locomotion. We design an experiment that compares the effects of the presence/absence of laser pointers across hard/easy-to-describe referents. Results show that when users face pointing inaccuracy, they prefer to move closer to the referent rather than enrich the verbal reference.

Figure 1: On the left, an example of an implicit verbal reference aided by a hand-pointing action at a close distance from the referent. On the right, the equivalent reference is aided by a more detailed verbal description of the referent but lacks the hand-pointing action from a close distance.

1 Introduction

Immersive collaborative virtual environments (ICVEs) with user embodiment (i.e., avatars) support collaboration by providing a shared setting where collaborators have a sense of each other's presence (Benford et al., 1995). In ICVEs, users' embodied hand behaviour is a non-verbal cue that complements verbal communication during collaboration (Hindmarsh et al., 1998). For example, users can point to a referent during an utterance to trigger mutual orientation and visual coordination (Moore et al., 2007). Hand pointing in conjunction with verbal spatial references is called deictic pointing. Previous studies explore deictic pointing with distant targets (from fixed distances), measuring the accuracy of different hand-pointing supports (Mayer et al., 2018, 2020; Wong and Gutwin, 2014, 2010).
Outcomes from previous studies highlight how the degree of precision needed for pointing gestures depends on how complex it is to describe the referent using utterances (Wong and Gutwin, 2014, 2010). However, in modern ICVEs, users might get as close to the referent as needed to achieve the accuracy required for the pointing gesture. Therefore, when faced with a lack of accuracy, will users spend time adjusting their distance from the target, or will they overcome the difficulties of describing the target? While in a physical environment it is not always possible to move closer to an object of interest, in an ICVE this is not a problem, as there are no physical barriers. In such scenarios, users can avoid inaccurate distance pointing by moving closer to the referent. However, the movement has a temporal cost: the time required to move closer to the point of interest. As Wong and Gutwin (2014) highlight, another approach consists of users enriching their verbal references with enough details to compensate for the pointing gesture's lack of precision. On the other hand, such a verbal supplement comes with a higher temporal and cognitive cost for both the performer and the recipient of the reference (Wong and Gutwin, 2014; D'Angelo and Begel, 2017).

Previous studies define pointing accuracy as a function of distance from the referent (Mayer et al., 2020; Wong and Gutwin, 2010). Accurate pointing can be performed from a far distance with a laser pointer for support, or without one from a close distance to the referent. Inaccurate pointing occurs when users who do not use or have laser pointers point from a distance and compensate with explicit verbal references. The research community established the importance of laser pointers for achieving accurate pointing, but like any other tool or interaction metaphor, laser pointers can be included in an ICVE or not. Some reasons for not including pointers are the following: data visualisation issues such as a hidden or occluding cursor (especially if the informative area is dense), many users with cursors, and noise induced by body jittering in high-density information areas (Batmaz and Stuerzlinger, 2019).

This study explores the trade-off between using locomotion to approach the referent and the alternative use of explicit verbal references to deal with the lack of pointing accuracy. We look at how this trade-off varies across conditions of lack/availability of laser pointers and conditions related to how complex/easy it is to describe the various referents in the scene (Figure 1). We explore such trade-offs in the context of visual search tasks, which are recognised as a proxy for many other tasks performed synchronously by pairs of participants in ICVEs (Prilla, 2019). We run an experiment with 20 participants quantifying implicit/explicit references, locomotion, and visual coordination, which is highly correlated with the quality of pointing-based communication (Schneider and Pea, 2013). We use two datasets of different complexity representing two levels of difficulty in describing the referent: a simple puzzle and a very detailed 3D satellite map. In the simple 3D puzzle, pieces can be described by colours or labels, while on the map, places need to be referenced via 2D coordinates, which requires a greater cognitive effort. Inspired by previous CSCW work (D'Angelo and Begel, 2017), we measure the number of implicit/explicit deixes and the number of successful/unsuccessful deixes.
Moreover, we measure users' movement in the space and task performance (task score and completion time). The data collected show statistically significant differences in the locomotion performed when distance-pointing support is unavailable. Both data and observations confirm that when users lack support for distance pointing, they prefer to move closer to the referent to perform accurate pointing gestures rather than formulate a more complex verbal reference. We see this behaviour regardless of the complexity of the task. The data collected also show a statistically significant increase in visual coordination when laser pointers are available, which confirms previous work (Moore et al., 2007).

Our results enable designers to understand how different elements available in ICVEs (embodiment, locomotion, laser pointers) impact pointing-based communication during a generic collaborative visual search task. Thus, our work can contribute to a more proficient interaction by outlining design implications. The presence of locomotion and the freedom for users to move throughout the whole environment remove the need for distance-pointing support. In this way, an efficient locomotion system increases the rate of proximal pointing instead of promoting a cursor for distal pointing. However, laser pointer support may need to be considered if the collaborative task requires high visual coordination. Our study thus helps to make informed choices when designing an ICVE.

2 Related work

Pointing-based communication is ubiquitous in collaborative work. Within physically co-located scenarios, a pair of collaborators may use their hands and voice to engage in pointing-based communication. For example, indicating an object of interest by pointing a hand towards it during an utterance is a common interaction called deictic pointing or deixis. During deixis, the interlocutor (i.e., the recipient of the deixis) has to mentally project the collaborator's hand direction onto the observed scene to understand the referent of the deixis (i.e., to identify the target object) (Higuch et al., 2016; Pfeiffer et al., 2008; Wong and Gutwin, 2014). Pointing-based communication, however, can also be supported by laser pointers. A pointer's spotlight projected onto the observed scene allows identifying the referent unambiguously (Hindmarsh et al., 1998). Additionally, it facilitates the interpretation of the pointing gesture by removing the cognitive effort of projecting the hand/head direction onto the observed scene. Using a laser pointer might avoid incorrect mental projections or ambiguous, unclear projection results. Essentially, pointers increase awareness of a collaborator's visual focus during deixis (Piumsomboon et al., 2017). Pointing-based communication is possible in co-located scenarios and remote scenarios thanks to embodiment (i.e., avatars) and enhanced behaviour (i.e., pointers). There are several examples of remote collaboration scenarios in which pointing-based communication is possible, to mention a few: remote pair programming (D'Angelo and Begel, 2017), support of local workers by remote experts (Bai et al., 2020), and remote collaboration in immersive VR environments (Moore et al., 2007).

2.1 Pointing-based communication in remote desktop collaborations

Pointing-based communication can occur as long as collaborators have the means to point towards an object of interest while also communicating verbally.
For example, several studies investigate pointing-based communication using gaze pointers (i.e., enhanced eye behaviour) in the context of 2D desktop remote collaboration (Villamor and Rodrigo, 2018; Jermann et al., 2011; Nüssli, 2011; Pietinen et al., 2008). These studies show how visual aids based on the eye-tracked behaviour of collaborators (i.e., gaze pointers) lead to increased mutual awareness of visual focus, higher visual coordination, and better collaboration quality. Schneider and Pea (2013) explore how depicting gaze in a remote desktop collaboration of two users performing a visual task increases visual coordination and enhances visual collaboration quality. When visual aids such as pointers are used, collaborators look at the same objects at the same time more often than without visual aids. Additionally, such increased visual coordination seems to aid communication about the visual context. For example, D'Angelo and Begel (2017) explore visual aids based on real-time eye-tracked behaviour and show that such aids improve communication by reducing the number of explicit utterances during deixis.

However, findings from 2D desktop remote programming and visual analysis do not necessarily generalise to immersive VR environments, because the reviewed scenarios lack embodiment and locomotion (both elements present in state-of-the-art immersive VR collaboration environments). Embodiment, especially the representation of hands and their real-time tracked behaviour, supports the natural behaviour used in deictic pointing. In 2D desktop environments, by contrast, gaze is used as the input for pointing. While gaze can be thought of as coinciding with visual attention, it is a behaviour that is less deliberate and thus less controllable than the behaviour of hands. A second significant difference is related to fragmentation (Wong and Gutwin, 2014; Hindmarsh et al., 1998), or in other words, the fact that large parts of the environment in VR are not visible to the users, unlike a 2D desktop screen. Fragmentation impacts pointing-based communication because the pair of collaborators may not be seeing the same subset of the 3D environment during deixis. They may thus not be able to see the collaborator's embodiment or the pointing visual aid. Moore et al. (2007) highlight how the observability of embodied activity and the projectability of gestures are essential aspects of pointing-based communication. While 2D desktop remote programming work may inspire metrics such as visual coordination and the implicitness/explicitness of deixis utterances, its results are not necessarily generalisable to immersive VR collaboration.

2.2 Pointing-based communication in ICVE

ICVEs offer the same degree of embodiment as mixed reality scenarios. The real-time tracked behaviour of hands and head allows natural pointing behaviour and natural exploration of the scene via head movements and locomotion. Several immersive VR studies explore the accuracy of hand-pointing gestures. Mayer et al. (2018) propose adaptations to hand pointing in immersive VR that enhance the precision and accuracy of the pointer representations through spatial distortion. Mayer et al. (2020), in a similar way to Sousa et al. (2019), explore approaches to improve precision by warping gestures to adjust pointing to the target. However, while these recent studies aim to improve hand-pointing accuracy, they do not evaluate the effect that pointers have on collaboration, focusing only on the quantification of pointing accuracy.
All these works measure the accuracy of pointing from fixed distances, avoiding any form of locomotion within the scene. Our work aims to fill this gap by introducing specific tasks in which participants can move freely in the scene. An additional study by Bai et al. (2020) proposes a remote collaboration system that introduces an asymmetric interaction between a VR user and an AR user sharing a live 3D panorama of their surroundings. Unlike that study, our VR system provides a symmetric interaction and interface, and we focus on measuring the impact of locomotion on pointing-based communication.

2.3 How users compensate for inaccuracies during distance pointing

Previous studies explore techniques to improve accuracy and reduce errors when hand-pointing during pointing-based communication in immersive collaborative virtual environments (CVEs). However, in a CVE in which users can move (i.e., locomotion), distance pointing (and its negative consequences) can easily be avoided by users choosing to increase their proximity to the referent. Additionally, a user could choose to compensate for imprecise distance pointing by enriching (adding details to) a verbal reference during a pointing gesture. In an immersive CVE with embodiment and locomotion, we compare the presence and absence of pointers to understand if and how users compensate to avoid pointing errors and lack of precision. We also use several quantitative measures to understand how behavioural changes impact the quality of pointing-based communication. Inspired by previous CSCW work on remote desktop collaboration, we identify three easily quantifiable metrics: visual coordination, the implicitness of deictic utterances, and the success of references. Such metrics represent the quality of pointing-based communication during a collaborative task.

Previous literature allows us to define accurate pointing (from the points of view of both the producer and the observer) and inaccurate pointing. Pointing gestures can be either proximal or distal (Schmidt, 1999). When indicating proximal referents, the producer of a pointing gesture can touch the target, and observers can identify targets with confidence (Bangerter and Oppenheimer, 2006). Therefore, proximal pointing is considered accurate, as there is no room for misinterpretation. With distal pointing, the observer instead needs to extrapolate the vector direction defined by the pointer's posture (Bangerter and Oppenheimer, 2006; Batmaz and Stuerzlinger, 2019). However, previous studies have found that using a cursor improves mid-air pointing precision thanks to visual feedback, and removes the need to extrapolate the direction of the pointing gesture thanks to the visual depiction of the cursor (Mayer et al., 2018). Therefore, we consider distal pointing with a cursor accurate, as there is no room for misinterpretation, while we define distal pointing without a cursor as inaccurate. While previous works offer several methods to improve the accuracy of pointing via machine learning models, in our study we explore how users deal with the lack of accuracy in an ecological context, in particular in relation to visual analysis tasks.

3 Study Design

In the following subsections, we detail the different aspects of the experiment. This study was approved by the UCL Interaction Centre (UCLIC) Research Department's Ethics Chair.

3.1 Participants

Twenty-four participants (twelve pairs) volunteered to take part in the remote study.
The data of two pairs of participants were used to pilot the study and test the application, while the remaining ten pairs were used for the data analysis. One condition during recruitment was for participants to own or have access to a specific VR HMD: the Oculus Quest. This requirement was due to the experimental session being conducted remotely, first via teleconference software and then via the VR application. Participants were recruited online via forums and social network groups dedicated to the Oculus Quest headset and via Slack channels dedicated to HCI VR research participant pooling. Participants were recruited individually and then matched up in pairs based on their time availability for the experiment. All participants provided informed consent and received £15 compensation for participating. For the study, pairs of participants were asked to work together on a remote collaborative visual analysis task. Participants were familiar with VR devices, as they owned or had access to an HMD. All participants had at least a university degree (6 PhD candidates, 8 PhDs, 6 MScs, 4 BAs). The mean age was 33 years with a standard deviation of 8.3; 88% of the participants were male and 12% female.

3.2 Setup

To keep the application development simple and to avoid noise due to differences across VR HMDs, we decided to target a single device for the experiment. The selected headset (Oculus Quest) is a 6 degrees of freedom (DoF) untethered VR HMD with a 60 Hz refresh rate. We chose this headset because of its popularity and low retail price. We developed an application for collaborative visual analysis of 3D data using Unity (version 2018.4.14f1) and the Oculus Unity SDK. The application enables the visualisation of different types of 3D data sets (i.e., terrains, 3D networks, CAD files). The application enables each participant to join a real-time session in which other participants' presence is represented by avatars (i.e., the Oculus Avatar SDK), as shown in Figure 2. Each participant in the VR space is free to move in any direction using a thumbstick controller or to move physically using the 6 DoF of the VR HMD. Avatar movements are streamed over the network, so their behaviour (head and hand movements) and position in the virtual space are reproduced with low latency. The application also enables participants to talk to each other using the embedded microphone and speakers of the VR HMD. Additionally, the setup allows an observer/moderator to be present in the VR session and environment.

Figure 2: A pair of participants collaborate on the visual analysis tasks in the 3D environment. (a) Puzzle task: participants use a hand pointer while performing a four-part 3D puzzle. (b) Terrain task: participants identify the four largest settlements in a terrain dataset using a hand pointer. The hand direction, visualised as a series of dotted lines, is displayed in the image only to illustrate the difference between head and hand pointers. Both task environments have a size of 3x3 metres.

3.3 Pointers

The pointer consists of a small (1 cm) sphere depicted at the intersection between the direction of the hand and the visualised data. Hands are tracked via controllers, and the hand pointer is associated with the dominant hand via the Oculus Unity SDK. The VR HMD tracks the head direction and position.
The hand direction, or in other words the ray departing from the hand, is not visualised; instead, only the small sphere is visualised, depicting a small spotlight and therefore producing the same effect as a laser pointer. The pointers can be seen in the "Detail pointers" window in Figure 2. When the pointer is not present, participants can still point using the hand embodiment, as if they were in a physically co-located collaborative scene. The controller triggers approximate the posture of the hands, so if a trigger is pressed/released, the corresponding finger is depicted as fully contracted or straight. Users can therefore intuitively use the index finger to point at referents (Figure 2).

3.4 Experiment Design

We design a 2 (pointer) x 2 (reference difficulty) within-subjects experiment (Figure 3a). Participants collaborate on two visual search tasks consisting of identifying visual features in two data sets. The reference difficulty factor consists of two levels: a 3D terrain with hard-to-describe features and a 3D puzzle with easy-to-describe features. On the hard level, verbal references can be made using map coordinates or by describing features in detail. On the easy level, verbal references can refer to the colour of puzzle blocks or to a unique label number. We argue that the features of the satellite map are more complex to describe and disambiguate than the simple geometric shapes of the puzzle. Moreover, map coordinates are more complex to reference than a single puzzle label, as they require users to compose the coordinate by reading both the longitude and latitude labels. Therefore, we argue that the cognitive effort required to describe the map's referents is higher than for the puzzle's. We validated this assumption in pilot runs of the experiment. Moreover, the experimental results on the number of implicit references further validate this level classification. The pointer factor consisted of two conditions: a condition without any pointer and a condition with a hand laser pointer, as previous work validates pointers as successfully supporting pointing-based communication (Moore et al., 2007).

3.5 Task

The two tasks are collaborative visual search tasks. Visual search is considered a proxy for many other tasks performed together synchronously in VR, which include finding virtual objects or information together and jointly referencing the same referent (Schmalstieg and Höllerer, 2016; Prilla, 2019). For the hard task, we used a scenario common in HCI studies that consists of identifying features on 3D terrain maps, taking inspiration from previous works (Šašinka et al., 2019; Liu et al., 2017). 3D terrain data is rich in detail and therefore complicated to describe verbally. In the 3D terrain visual analysis task (i.e., the hard verbal reference task), participants must identify the four largest settlements (i.e., cities) and the four largest lakes. The terrain consists of satellite images and elevation data extracted from Mapbox; the corner coordinates of the first dataset are top-left latitude 46.56, longitude 11.53 and bottom-right latitude 46.17, longitude 11.92, while in the second dataset they are top-left latitude 46.62, longitude 10.53 and bottom-right latitude 46.23, longitude 11.92. For the easy task, we selected a scenario that is very common in collaborative VR studies: a puzzle. Many studies in the literature use puzzle tasks (Slater et al., 2000; Steptoe et al., 2009; Schroeder et al., 2001; Widestrom et al., 2000; Kim et al., 2014).
Such tasks contain a visual analysis component which requires participants to identify compatible blocks by comparing them. In our specific case, we avoided any manipulation in order to focus on visual analysis and the related pointing-based communication. In the 3D puzzle task, users must identify the four puzzle blocks that fit together (two puzzles were present for each experimental condition). Each block measures 50x50x25 cm, and each of the two sides of the block contains 3x3 puzzle joints. Both puzzle conditions are available to be downloaded from ANON-REPOSITORY. At the beginning of each trial, participants were asked to collaboratively identify and report the four correct features to the experiment moderator. If there was a leading effect (i.e., one participant being the only one active), the experiment moderator would remind the pair to discuss and agree upon features before reporting them. Both task search spaces are equal in size and correspond to 3x3 m. The time given to participants is displayed as a countdown in the VR scene and is capped at 5 minutes for each scene.

3.6 Procedure

At the beginning of each experimental block, participants are given a chance to practise the task and familiarise themselves with sample datasets. The practice time consists of a maximum of 5 minutes, but participants can interrupt it earlier if needed. The sample dataset used in practice was not used for the task. Users were allowed to train on both the easy (blocks) and hard (map) tasks. During familiarisation, participants can ask questions; this phase ends once both participants confirm that they understand the task. Following the familiarisation, participants are asked to perform the task across the two conditions: hand pointer and no pointer. For each of the two conditions, an equivalent variation of each data set is used (two terrains and two puzzles), for a total of four data sets (Figure 3b). Trial order and experimental block order were randomised to counterbalance learning effects. Once participants agree on a feature, they are asked to communicate it to the observer verbally. The observer only acknowledges the communicated data features as recorded if both participants explicitly agree on them; otherwise, the observer prompts a reminder that both participants have to agree. Such a constraint forces pairs to work collaboratively. To incentivise engagement with the task, participants are told that if they score above a specific threshold value, they will receive a £15 voucher instead of a £10 voucher (in the end, every participant receives £15 regardless of their score). We recorded audio and video in VR and logged positions for all the experimental sessions.

Conditions (Figure 3a): No Pointer Terrain, Hand Pointer Terrain, No Pointer Puzzle, Hand Pointer Puzzle. Session structure (Figure 3b): for each participant dyad, Experimental Block 1 (Terrain) and Experimental Block 2 (Puzzle), each containing, e.g., Trial 1 (No Pointer) and Trial 2 (Hand Pointer).

Figure 3: (a) Experiment Design: the experiment has two factors, dataset and pointer. The dataset factor has two levels: 3D surface (terrain) and 3D volumes (puzzle). The pointer factor has two levels: No Pointer and Hand Pointer. (b) The experimental procedure is divided into experimental blocks, one for each level of the independent variable difficulty, and experimental trials, one for each level of the independent variable pointer, plus one trial for task familiarisation at the start of each experimental block. Trial order and experimental block order were randomised to counterbalance learning effects.
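As noted in the procedure and in Figure 3b, block and trial order are randomised per dyad to counterbalance learning effects. Purely as an illustrative sketch (not the authors' code), with hypothetical condition names, such a randomisation could look as follows:

```python
import random

POINTER_LEVELS = ["no_pointer", "hand_pointer"]      # Factor 1: pointer
DIFFICULTY_BLOCKS = ["terrain", "puzzle"]            # Factor 2: difficulty (hard / easy)

def session_plan(dyad_id):
    """Randomised block and trial order for one pair of participants."""
    rng = random.Random(dyad_id)          # reproducible per dyad
    blocks = DIFFICULTY_BLOCKS[:]
    rng.shuffle(blocks)                   # randomise experimental block order
    plan = []
    for block in blocks:
        trials = POINTER_LEVELS[:]
        rng.shuffle(trials)               # randomise trial order within the block
        plan.append((block, ["familiarisation"] + trials))
    return plan

# e.g. session_plan(3) might yield:
# [('puzzle',  ['familiarisation', 'hand_pointer', 'no_pointer']),
#  ('terrain', ['familiarisation', 'no_pointer',  'hand_pointer'])]
```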
4 Measures

This section gives an overview of the measures collected during the experiment and how we post-process them. We record the head behaviour of both participants. Head gaze is the intersection between the visualised data and the ray starting from the head position in the direction of the head rotation; it is used to calculate concurrent head-pointing behaviour (i.e., visual coordination, Section 4.1). Additionally, for every experimental session we record a video/audio stream of the virtual environment and of the participants' avatars, containing the verbal communication between participants. We use these data to perform the implicit/explicit reference analysis (Section 4.3). To understand whether the experimental conditions impact temporal and accuracy performance, we also record the task time and task score. Task time is capped at 300 seconds (5 minutes) to keep the duration of the whole experiment to a maximum of 20 minutes. The maximum number of correct answers for each task is four.

4.1 Visual Coordination

Visual coordination consists of the coupling of participants' visual focus, or in other words, how well synchronised their visual attention is. As previous work suggests, when users point to a referent during an utterance, this triggers mutual orientation, an essential part of visual coordination. Pointing-based communication is, in this sense, an effort aimed at negotiating shared visual attention during collaborative work (Moore et al., 2007). Previous work also shows that visual coordination is highly correlated with the quality of collaboration (Schneider and Pea, 2013). Therefore, visual coordination is a crucial dimension of collaboration in visual search tasks. The ideal measure of visual coordination would require eye-gaze behaviour. However, our study did not use eye trackers, as most low-cost VR HMDs do not have them, and running a remote user study during the pandemic required us to target popular low-cost headsets such as the Oculus Quest. Instead, we use head-gaze behaviour, which several studies have reported as a good proxy for eye movements (Biguer et al., 1982; Pelz et al., 2001; Wang et al., 2019).

Concurrent head pointing measures the time two participants point their heads towards the same target simultaneously. For example, when collaborators discuss a visual feature, they are likely to point their heads towards that feature concurrently. This effect is also described as mutual orientation, identified by Moore et al. (2007) as the first stage of deixis in pointing-based communication. We post-process the recorded head-gaze data to measure the head-gaze overlap time during each experimental trial. We define a distance of 20 cm as the threshold for the Euclidean distance calculation: below this threshold, the two head gazes are considered to point at the same location; above it, they are considered to point at different data features. The distance between the two head-gaze points is calculated as the Euclidean distance for every sample at time t; then, we multiply the number of samples below the threshold by the sampling interval (the inverse of the sampling frequency) to obtain the cumulative time of concurrent head pointing (Figure 4).

Figure 4: A view of the collected measures of head position, head direction and head-signal intersection at a specific moment in time.
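A minimal sketch of the visual coordination post-processing described above, assuming the two participants' head-gaze intersection points are logged at a fixed sampling rate as N x 3 arrays (array and parameter names are illustrative, not the authors' code):

```python
import numpy as np

def concurrent_head_pointing_time(gaze_a, gaze_b, sampling_rate_hz, threshold_m=0.20):
    """Cumulative time (s) the two head-gaze points stay within `threshold_m`.

    gaze_a, gaze_b: (N, 3) arrays of head-gaze/data intersection points for the
    two participants, sampled at `sampling_rate_hz` over one experimental trial.
    """
    # Euclidean distance between the two intersection points at every sample.
    distances = np.linalg.norm(gaze_a - gaze_b, axis=1)
    # Samples below the 20 cm threshold count as concurrent head pointing;
    # each one contributes one sampling interval (1 / f) to the total time.
    concurrent_samples = np.count_nonzero(distances < threshold_m)
    return concurrent_samples / sampling_rate_hz
```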
4.2 Locomotion

When performing an implicit spatial reference (i.e., pointing plus utterance) during pointing-based communication, the referent can be misunderstood by the collaborator (i.e., the recipient). Such a misunderstanding happens because the gesture performer might point imprecisely. Alternatively, the recipient may fail to correctly project the direction of the hand/arm onto the observed scene. One way to improve the accuracy of a pointing action during pointing-based communication consists of moving closer to the referent, to make sure that the observer/listener does not misinterpret the direction of the pointing action (Wong and Gutwin, 2010). Laser pointers, instead, allow participants to perform precise pointing. Using a laser pointer, the performer of the pointing action can adjust the cursor position until the cursor lies on the referent, removing ambiguities. Pointers, therefore, allow users to perform accurate pointing gestures from a distance, i.e., without having to travel towards the referent (Wong and Gutwin, 2014). However, during collaborative visual tasks, participants might be interested in reducing the distance to a referent for other reasons, such as observing it in greater detail or simply increasing their presence by joining a collaborator's working area.

To investigate the impact of locomotion on pointing-based communication, we measure how much time each participant spends moving in the ICVE during each trial. As part of the experiment guidelines, for safety reasons we expressly asked participants to explore the space only via the thumbstick controller rather than by moving physically. We therefore used the locomotion speed set in the Unity environment (1.6 m/s) to determine the threshold for classifying intended movement versus noise. We post-process the head position data to measure the cumulative time of locomotion and compare it across the different experimental conditions. To calculate the locomotion time, we consider only the samples where the velocity is above a threshold of 0.8 m/s, calculating the distance from the sampling frequency and the velocity and thereby removing small movements and noise.
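Analogously, a minimal sketch of the locomotion post-processing, assuming head positions are logged at a fixed sampling rate (names are illustrative, not the authors' code):

```python
import numpy as np

def locomotion_time_and_distance(head_positions, sampling_rate_hz, speed_threshold=0.8):
    """Cumulative time (s) and distance (m) spent moving above `speed_threshold`.

    head_positions: (N, 3) array of head positions sampled at `sampling_rate_hz`;
    per-sample speeds below the threshold are treated as stillness or noise.
    """
    dt = 1.0 / sampling_rate_hz
    # Per-sample displacement and speed from consecutive head positions.
    steps = np.linalg.norm(np.diff(head_positions, axis=0), axis=1)
    moving = (steps / dt) > speed_threshold
    return moving.sum() * dt, steps[moving].sum()
```

Multiplying the extra moving time observed without a pointer by the nominal locomotion speed of 1.6 m/s gives the kind of distance estimate reported in Section 5.2.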
4.3 Implicit references

Deixis consists of a verbal reference supported by a pointing gesture. Within a visual search task, deixes are common occurrences, as they allow negotiating the collaborative shared visual context. Deixes can be implicit or explicit: the first require less uttered information and are also cognitively less demanding (D'Angelo and Begel, 2017; Wong and Gutwin, 2014). Implicit deixes tend to rely more on the accuracy of the pointing action, as the utterance does not carry sufficient information to disambiguate the referent. We consider an implicit spatial reference to occur whenever a participant refers to a data feature without explicitly naming any unique property of the object (i.e., name, location, colour). Explicit deixes, instead, contain information to disambiguate the referent from the rest of the data set. Such explicit information can consist of: the position relative to the user (e.g., on my left/right), object characteristics (e.g., the red block), labels (i.e., a unique textual description) or the absolute position expressed in coordinates (i.e., the data feature in B5). Understanding how pointing-based communication changes in the presence of hard-to-describe referents or a lack of distance-pointing support means classifying each deixis as implicit or explicit. Such a classification gives us an understanding of how smooth and fast verbal communication is. It additionally allows us to understand the balance with behavioural alternatives, such as getting close to the referent to pinpoint it more accurately. Inspired by previous CSCW work by D'Angelo and Begel (2017), we transcribed the audio of the collected videos and carried out a double-blind video/text classification of the spatial verbal references. Two analysts performed the analysis to counteract the subjectivity of the classification process. If the two analysts were unsure whether an instance was implicit or explicit, they conducted a collaborative post-analysis to reach convergence. We also classify each reference as successful/unsuccessful. Such a classification allows us to understand if and how locomotion impacts the effectiveness of pointing-based communication when there is a lack of support for distance pointing. A reference is considered unsuccessful when the recipient misinterprets the referent or ignores the deixis.

5 Statistical Analysis

We performed a repeated-measures ANOVA (using JASP) on the data we collected and post-processed. For the measures of temporal and accuracy performance and the number of unsuccessful deixes, the analysis did not return any significant difference across conditions; for conciseness, we do not report these results. Our visual coordination results are based on a set of 10 samples (10 pairs of participants), while for locomotion and implicit references all 20 participants are measured individually, giving 20 samples.

5.1 Visual Coordination

The two-way ANOVA results show one main effect, related to the Pointer factor (p < .001; Table I and Figure 5a). When participants have a laser pointer, they spend approximately 8 seconds more pointing their heads towards the same data subset. To contextualise this measure, the average duration of a task is 230 seconds, so this difference represents approximately 3.4% of the time. However, from observations we can see that the task time is split between independent work (scanning the data visualisation independently) and collaborative work (discussing the interpretation of data features). Considering that visual coordination only relates to collaborative work, we argue that this 3.4% of the time represents a much higher share within the collaborative stages.

Figure 5: Descriptive plots of (a) visual coordination, (b) locomotion and (c) implicit references: the horizontal axes show the pointer conditions, the separate lines show the difficulty of explicit references (i.e., hard task and easy task), and error bars display the 95% confidence interval.
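The within-subjects effects in Table I below were computed in JASP. For readers who prefer a scripted pipeline, an equivalent two-way repeated-measures ANOVA could be run in Python, for example with statsmodels; the file and column names here are illustrative assumptions, not part of the study materials:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format table: one row per participant and condition, e.g. columns
# participant, pointer (no_pointer / hand_pointer),
# difficulty (terrain / puzzle), seconds_moving.
df = pd.read_csv("locomotion_time.csv")

result = AnovaRM(df, depvar="seconds_moving", subject="participant",
                 within=["pointer", "difficulty"]).fit()
print(result.anova_table)   # F and p for both main effects and the interaction
```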
Table I: ANOVA: Within-Subjects Effects

Cases                       Sum of Squares   df   Mean Square        F         p
(a) Visual Coordination
    pointer                       1001.618    1      1001.618    22.919   < .001*
    difficulty                      11.694    1        11.694     0.114     0.743
    pointer * difficulty            29.941    1        29.941     0.695     0.426
(b) Locomotion
    pointer                       6906.361    1      6906.361    15.816   < .001*
    difficulty                   32328.758    1     32328.758    19.590   < .001*
    pointer * difficulty           469.447    1       469.447     0.887     0.358
(c) Implicit references
    pointer                          0.140    1         0.140     7.031     0.026
    difficulty                       3.560    1         3.560    80.807   < .001*
    pointer * difficulty          2.984e-4    1      2.984e-4     0.010     0.924

* p < .005

5.2 Locomotion

We statistically compare the measures of locomotion (i.e., time spent moving) by performing a two-way repeated-measures ANOVA (Table I and Figure 5b). The ANOVA results show two main effects, related to the factors Pointer (p < .001) and Difficulty (p < .001). While we see an effect on locomotion related to the differences between the tasks, the important result is the effect of the pointer factor and the lack of interaction between the two factors. When participants do not have a laser pointer, they spend approximately 18 seconds more moving. To give a contextual understanding of this measure, the average duration of a task is 230 seconds, so this difference represents approximately 7% of the time. Considering that the average locomotion speed for this experiment is set to 1.6 m/s, this means that participants without support for distance pointing travelled approximately 28 metres more (in a 3 m x 3 m visualisation space).

5.3 Implicit References

We statistically compare the repeated measures of the dependent variable (number of implicit deixes) by performing a two-way repeated-measures ANOVA (Table I and Figure 5c). The ANOVA results show one main effect, related to the Difficulty factor (p < .001). This result validates the designed difficulty levels: if the referent is simple to identify with an explicit reference, users tend to describe it verbally. On the other hand, when the referent is difficult to identify by verbal description, users adopt the strategy of pointing at it and using implicit references.

6 Discussion

Previous studies of distance pointing in ICVEs and real-world scenarios show that collaborators' pointing accuracy from a distance often depends either on having access to a laser pointer or on how hard the referent is to describe (Wong and Gutwin, 2010, 2014). However, ICVEs allow participants to move in the environment and, therefore, to get as close as they need to the referent to perform an accurate pointing gesture. Therefore, what do users do when faced with the choice of moving closer to the referent or describing it in greater detail? Such a question is worth answering to better understand the dynamics of pointing-based communication in ICVEs. A better understanding of such collaborative dynamics is fundamental to developing solutions that can better support collaboration in ICVEs. Therefore, within this study, we introduce the ability for users to move in the ICVE to investigate the trade-off between moving close to a referent and the effort of composing a verbal reference when the referent is difficult to describe. We do so within the context of a collaborative visual search task, which is recognised as a proxy for many other collaborative tasks in VR (Prilla, 2019).

Figure 6: Heat map of physical movement for the four experimental conditions.
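Figure 6 is built from the logged head positions. A minimal sketch of how such a condition-level heat map can be generated, assuming the horizontal (x, z) head coordinates of all trials in one condition are pooled into a single array (illustrative, not the authors' code):

```python
import numpy as np
import matplotlib.pyplot as plt

def head_position_heatmap(xz, extent=(-1.5, 1.5, -1.5, 1.5), bins=60):
    """2D histogram of head positions over the 3 m x 3 m task space.

    xz: (N, 2) array of horizontal head coordinates pooled over all trials
    of one experimental condition.
    """
    heat, _, _ = np.histogram2d(xz[:, 0], xz[:, 1], bins=bins,
                                range=[extent[:2], extent[2:]])
    plt.imshow(heat.T, origin="lower", extent=extent, cmap="hot")
    plt.xlabel("x (m)")
    plt.ylabel("z (m)")
    plt.colorbar(label="samples")
    plt.show()
```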
6.1 Impact of locomotion on pointing-based communication

Our results extend the work of Wong and Gutwin (2010, 2014) by exploring a different dynamic of pointing-based communication in collaborative search tasks. While Wong and Gutwin measured accuracy at fixed user distances from the referent, we explore a more ecologically valid scenario: users are free to move in the ICVE and are instructed to perform a generic search task. We extend their work by showing that, when faced with the choice of verbally describing a referent or moving closer to it, users choose to locomote no matter how hard the referent is to describe. This statement is supported by the statistical analysis of locomotion, which shows a significant movement increase in both the hard and easy tasks when the pointer is absent. Furthermore, we complement the analysis of locomotion by generating cumulative head position heat maps for each experimental condition (Figure 6). It is evident that the different datasets led to different exploration patterns and that the support for distance pointing did not impact how users explored the environment. If we cross-reference the data from Figure 6 and Figure 5b and notice that locomotion lasts about 20 seconds longer in the no-pointer condition, we can infer that this difference is not due to exploration but to compensating for the lack of a laser pointer.

6.2 Impact of pointers on verbal communication

Previous CSCW studies of 2D desktop collaboration in remote programming show how pointers can increase the number of implicit references during deixis, making verbal communication faster and smoother (D'Angelo and Begel, 2017). Inspired by such work, we counted and analysed the number of implicit references. In our ICVE experiment, results and observations suggest that when a pointer is not available, the number of implicit references during deixis stays the same (Figure 5c). Our results differ from D'Angelo and Begel (2017), suggesting that when embodiment is available and users are free to move throughout the data, pointer visualisations do not influence verbal communication.

6.3 Impact of pointers on visual coordination

Previous research explored visual attention cues based on head or eye-gaze behaviour in ICVEs during visual search tasks (Piumsomboon et al., 2017), measuring how such cues increase visual coordination. In general, hand pointing is recognised to trigger mutual orientation and visual coordination (Wong and Gutwin, 2010; Moore et al., 2007); however, to the best of our knowledge, no study measures visual coordination with and without laser pointers in ICVEs. Our study fills this gap by showing that hand pointer availability increases the amount of time that collaborators spend concurrently pointing their heads towards the same subset of the data (Section 5.1).

7 Future work and Design Implications

In this study, we answered the following question: what will users do when faced with a lack of pointing accuracy: move closer to the referent or describe it in greater detail? While previous studies have shown pointers in VR to be extremely useful (Hindmarsh et al., 1998; Hoppe et al., 2018; Bai et al., 2020), we observe that the inclusion of visual pointers might depend on several factors: the complexity of the user interface, how crowded the ICVE is, and the confusion that multiple pointers may cause. Such considerations impact the design of ICVEs, which needs to balance the advantages and disadvantages of pointers, compensating with alternative approaches that support pointing accuracy.
In addition, since there are benefits in moving closer to a referent, such as observing it in more detail or improving engagement with collaborators, we aim to identify methods that allow participants to semi-automatically move closer to an intended referent with or without pointing at it. A further approach could be to identify the intended referent by leveraging shared focus or by adding semantic augmentation. Our study does not consider distance perception as a crucial factor. This assumption is inherited from several works (Mayer et al., 2020, 2018, 2015; Schweigert et al., 2019; Sousa et al., 2019; Wong and Gutwin, 2014) that, conversely, assign distance an active role in pointing accuracy. However, the possible implication of distance perception in deictic pointing could be a good topic for future studies, as the research community has not yet explored it in detail; studies that explore the perception of distance in VR include Finnegan et al. (2016) and Maruhn et al. (2019).

Another interesting aspect is the implication of different locomotion strategies in ICVEs. For example, teleportation is a locomotion method that requires pointing to translate the user's location in the ICVE. Such a technique depends on the individual and the environment. However, our study, which explores the relations between pointing and locomotion, could inspire the community to investigate a collaborative version of locomotion. For example, when someone is making a pointing reference, the system could offer a "privileged" position and orientation to the observer that can be applied instantly. In addition, such a mechanism could be used for different collaboration tasks. Moreover, we hope that the research community could use our results to explore novel ways of referencing targets based on different paradigms or input channels, such as speech. Previous studies demonstrate that a natural language processing pipeline could be used to describe and possibly display visual cues on specific object parts (Giunchi et al., 2021). Our study suggests that when the referent is easy to describe, such a speech-based system could be used to highlight referents, much as collaborators do naturally during a collaborative task. On the other hand, if the referent is hard to describe, such a system may not be used effectively.

8 Conclusions

In this paper, we designed and carried out an experiment to test participants' behaviour in a pointing-based task in an ICVE. We conclude that deictic referencing in ICVEs with embodiment and locomotion does not require pointers to be accurate and implicit, as long as the users are free to move as close as they need to the data they are observing. One main reason is that when users face the problem of inaccuracy during pointing, they instinctively move closer to the referent rather than use verbal references to compensate for the imprecision of their pointing. Moreover, this effect is independent of how hard the referent is to describe. Locomotion allows users to move closer to the referent while performing deixis, improving pointing accuracy. We outline some design implications by highlighting how designers and engineers should consider two essential elements when deciding whether to support distance pointing: first, whether users are able to move within the environment, and second, whether the collaborative task requires high visual coordination.

References
Bai, H., P. Sasikumar, J. Yang, and M. Billinghurst (2020): 'A User Study on Mixed Reality Remote Collaboration with Eye Gaze and Hand Gesture Sharing'. In: Conference on Human Factors in Computing Systems - Proceedings.
Bangerter, A. and D. M. Oppenheimer (2006): 'Accuracy in detecting referents of pointing gestures unaccompanied by language'. Gesture, vol. 6, no. 1, pp. 85–102.
Batmaz, A. U. and W. Stuerzlinger (2019): 'Effects of 3d rotational jitter and selection methods on 3d pointing tasks'. In: 26th IEEE Conference on Virtual Reality and 3D User Interfaces, VR 2019 - Proceedings. pp. 1687–1692.
Benford, S., J. Bowers, L. E. Fahlen, C. Greenhalgh, and D. Snowdon (1995): 'User embodiment in collaborative virtual environments'. Conference on Human Factors in Computing Systems - Proceedings, vol. 1, pp. 242–249.
Biguer, B., M. Jeannerod, and C. Prablanc (1982): 'The coordination of eye, head, and arm movements during reaching at a single visual target'. Experimental Brain Research, vol. 46, no. 2, pp. 301–304.
D'Angelo, S. and A. Begel (2017): 'Improving communication between pair programmers using shared gaze awareness'. Conference on Human Factors in Computing Systems - Proceedings, vol. 2017-January, pp. 6245–6255.
Finnegan, D. J., E. O'Neill, and M. J. Proulx (2016): 'Compensating for distance compression in audiovisual virtual environments using incongruence'. In: Conference on Human Factors in Computing Systems - Proceedings. pp. 200–212.
Giunchi, D., A. Sztrajman, S. James, and A. Steed (2021): 'Mixing modalities of 3D sketching and speech for interactive model retrieval in virtual reality'. In: IMX 2021 - Proceedings of the 2021 ACM International Conference on Interactive Media Experiences. pp. 144–155, Association for Computing Machinery.
Higuch, K., R. Yonetani, and Y. Sato (2016): 'Can eye help you?: Effects of visualizing eye fixations on remote collaboration scenarios for physical tasks'. In: Conference on Human Factors in Computing Systems - Proceedings. New York, NY, USA, pp. 5180–5190, Association for Computing Machinery.
Hindmarsh, J., M. Fraser, C. Heath, S. Benford, and C. Greenhalgh (1998): 'Fragmented interaction: Establishing mutual orientation in virtual environments'. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work. New York, New York, USA, pp. 217–226, ACM Press.
Hoppe, A. H., K. Westerkamp, S. Maier, F. van de Camp, and R. Stiefelhagen (2018): 'Multi-user collaboration on complex data in virtual and augmented reality'. Communications in Computer and Information Science, vol. 851, pp. 258–265.
Jermann, P., D. Mullins, and M.-A. Nüssli (2011): 'Collaborative Gaze Footprints: Correlates of Interaction Quality'.
Kim, S., G. Lee, N. Sakata, and M. Billinghurst (2014): 'Improving co-presence with augmented visual communication cues for sharing experience through video conference'. In: ISMAR 2014 - IEEE International Symposium on Mixed and Augmented Reality - Science and Technology 2014, Proceedings. pp. 83–92, IEEE.
Liu, C., O. Chapuis, M. Beaudouin-Lafon, and E. Lecolinet (2017): 'CoReach: Cooperative gestures for data manipulation on wall-sized displays'. Conference on Human Factors in Computing Systems - Proceedings, vol. 2017-May, pp. 6730–6741.
Maruhn, P., S. Schneider, and K. Bengler (2019): 'Measuring egocentric distance perception in virtual reality: Influence of methodologies, locomotion and translation gains'. PLoS ONE, vol. 14, no. 10.
Mayer, S., J. Reinhardt, R. Schweigert, B. Jelke, V. Schwind, K. Wolf, and N. Henze (2020): 'Improving Humans' Ability to Interpret Deictic Gestures in Virtual Reality'. In: Conference on Human Factors in Computing Systems - Proceedings. pp. 1–14.
Mayer, S., V. Schwind, R. Schweigert, and N. Henze (2018): 'The effect of offset correction and cursor on mid-air pointing in real and virtual environments'. Conference on Human Factors in Computing Systems - Proceedings, vol. 2018-April, pp. 1–13.
Mayer, S., K. Wolf, S. Schneegass, and N. Henze (2015): 'Modeling distant pointing for compensating systematic displacements'. In: Conference on Human Factors in Computing Systems - Proceedings, Vol. 2015-April. pp. 4165–4168.
Moore, R. J., N. Ducheneaut, and E. Nickell (2007): 'Doing virtually nothing: Awareness and accountability in massively multiplayer online worlds'. Computer Supported Cooperative Work, vol. 16, no. 3, pp. 265–305.
Nüssli, M.-A. (2011): 'Dual Eye-Tracking Methods for the Study of Remote Collaborative Problem Solving'. PhD Thesis, École Polytechnique Fédérale de Lausanne, vol. 5232.
Pelz, J., M. Hayhoe, and R. Loeber (2001): 'The coordination of eye, head, and hand movements in a natural task'. Experimental Brain Research, vol. 139, no. 3, pp. 266–277.
Pfeiffer, T., M. E. Latoschik, and I. Wachsmuth (2008): 'Conversational pointing gestures for virtual reality interaction: Implications from an empirical study'. In: Proceedings - IEEE Virtual Reality. pp. 281–282.
Pietinen, S., R. Bednarik, T. Glotova, V. Tenhunen, and M. Tukiainen (2008): 'A Method to Study Visual Attention Aspects of Collaboration: Eye-Tracking Pair Programmers Simultaneously'.
Piumsomboon, T., A. Dey, B. Ens, G. Lee, and M. Billinghurst (2017): 'CoVAR: Mixed-Platform Remote Collaborative Augmented and Virtual Realities System with Shared Collaboration Cues'. In: Adjunct Proceedings of the 2017 IEEE International Symposium on Mixed and Augmented Reality, ISMAR-Adjunct 2017. pp. 218–219, Institute of Electrical and Electronics Engineers Inc.
Prilla, M. (2019): '"I simply watched where she was looking at": Coordination in short-term synchronous cooperative mixed reality'. Proceedings of the ACM on Human-Computer Interaction, vol. 3, no. GROUP.
Šašinka, C., Z. Stachoň, M. Sedlák, J. Chmelík, L. Herman, P. Kubíček, A. Šašinková, M. Doležal, H. Tejkl, T. Urbánek, H. Svatoňová, P. Ugwitz, and V. Juřík (2019): 'Collaborative immersive virtual environments for education in geography'. ISPRS International Journal of Geo-Information, vol. 8, no. 1.
Schmalstieg, D. and T. Höllerer (2016): Augmented Reality: Principles and Practice. Addison-Wesley.
Schmidt, C. L. (1999): 'Adult Understanding of Spontaneous Attention-Directing Events: What Does Gesture Contribute?'. In: Ecological Psychology, Vol. 11. pp. 139–174.
Schneider, B. and R. Pea (2013): 'Real-time mutual gaze perception enhances collaborative learning and collaboration quality'. International Journal of Computer-Supported Collaborative Learning, vol. 8, no. 4, pp. 375–397.
Schroeder, R., A. Steed, A. S. Axelsson, I. Heldal, A. Abelin, J. Widestrom, A. Nilsson, and M. Slater (2001): 'Collaborating in networked immersive spaces: as good as being there together?'. Computers & Graphics, vol. 25, no. 5, pp. 781–788.
Schweigert, R., V. Schwind, and S. Mayer (2019): 'EyePointing: A Gaze-Based Selection Technique'. vol. 19.
Slater, M., A. Sadagic, M. Usoh, and R. Schroeder (2000): 'Small-group behavior in a virtual and real environment: A comparative study'. Presence: Teleoperators and Virtual Environments, vol. 9, no. 1, pp. 37–51.
Sousa, M., R. K. Dos Anjos, D. Mendes, M. Billinghurst, and J. Jorge (2019): 'Warping deixis: Distorting gestures to enhance collaboration'. In: Conference on Human Factors in Computing Systems - Proceedings, Vol. 12. New York, NY, USA, pp. 1–12, Association for Computing Machinery.
Steptoe, W., O. Oyekoya, A. Murgia, R. Wolff, J. Rae, E. Guimaraes, D. Roberts, and A. Steed (2009): 'Eye tracking for avatar eye gaze control during object-focused multiparty interaction in immersive collaborative virtual environments'. In: Proceedings - IEEE Virtual Reality. pp. 83–90, IEEE.
Villamor, M. and M. M. Rodrigo (2018): 'Predicting successful collaboration in a pair programming eye tracking experiment'. In: UMAP 2018 - Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization. pp. 263–268.
Wang, P., S. Zhang, X. Bai, M. Billinghurst, W. He, S. Wang, X. Zhang, J. Du, and Y. Chen (2019): 'Head pointer or eye gaze: Which helps more in MR remote collaboration'. In: 26th IEEE Conference on Virtual Reality and 3D User Interfaces, VR 2019 - Proceedings. pp. 1219–1220, Institute of Electrical and Electronics Engineers Inc.
Widestrom, J., A. S. Axelsson, R. Schroeder, A. Nilsson, I. Heldal, and A. Abelin (2000): 'The Collaborative Cube Puzzle: A Comparison of Virtual and Real Environments'.
Wong, N. and C. Gutwin (2010): 'Where are you pointing? The accuracy of deictic pointing in CVEs'. In: Conference on Human Factors in Computing Systems - Proceedings, Vol. 2. New York, New York, USA, pp. 1029–1038, ACM Press.
Wong, N. and C. Gutwin (2014): 'Support for deictic pointing in CVEs'. pp. 1377–1387.