Co-Creating a Research Data Infrastructure with Social Policy Researchers

Gabriela Molina León, Gabriella Skitalinskaya, Nils Düpont, Jonas Klaff, Anton Schlegel, Hendrik Heuer, Andreas Breiter
University of Bremen, Germany
Contact Author: molina@uni-bremen.de

Gabriela Molina León, Gabriella Skitalinskaya, Nils Düpont, Jonas Klaff, Anton Schlegel, Hendrik Heuer, Andreas Breiter (2022): Co-Creating a Research Data Infrastructure with Social Policy Researchers. In: Proceedings of the 20th European Conference on Computer-Supported Cooperative Work: The International Venue on Practice-centred Computing on the Design of Cooperation Technologies - Posters and Demos, Reports of the European Society for Socially Embedded Technologies (ISSN 2510-2591), DOI: 10.48340/ecscw2022_p03

Abstract. We present a case study on co-creating a research data infrastructure together with social policy researchers. Over three years, we investigated how the social scientists worked with data, and designed a collaborative system to support them in the harmonization, validation, exploration, and sharing of research data. We conducted several co-creation workshops, interviews, surveys, and user studies, not only to co-design the system but also to assess the benefits and limitations of our participatory approach for this interdisciplinary collaboration. The evaluation showed that the researchers were satisfied with the processes and tools that we developed, and that the system was successfully adopted. We found that when working in a large interdisciplinary project, especially in the context of social policy research, it is critical to assess the status of the data early on, and to discuss how the group and individual goals connect with each other, to ensure long-term engagement and commitment.

Copyright 2022 held by Authors, DOI: 10.18420/ecscw2022_p03. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact the Authors.

Introduction

In the last decade, we have witnessed a rapid increase in the quantity of data available in science. Accordingly, CSCW researchers have been studying how experts work with data in diverse domains to find out how technology can support cooperative scientific work (Velden et al., 2014). Vertesi and Dourish (2011) showed that the way planetary scientists produce data is a key factor in how they share it. Neang et al. (2021) investigated the social and organizational concerns surrounding data integration in oceanography. Overall, the scientific culture and practices of the disciplines play a critical role in how computer-support systems can facilitate scientific work (Jirotka et al., 2013). This is what Lee et al. (2006) call the human infrastructure of cyberinfrastructure.

Tenopir et al. (2015) found that the norms of data sharing vary widely between disciplines. While astronomy and biodiversity researchers have a culture of data sharing, researchers in medicine and the social sciences are less likely to share. According to Savage and Vickers (2009), researchers rarely create appropriate metadata early enough, and the workload of creating it later leads them not to release the data. Given the need for more efforts to support sharing in the social sciences, we sought to co-design a research data infrastructure together with social science researchers.
Over three years, we collaborated with social policy experts in a multidisciplinary project aimed at analyzing and explaining social policy dynamics worldwide. We supported them in the harmonization, validation, exploration, and sharing of their datasets. Accordingly, we present a case study tackling the following research question:

RQ What should be considered when applying co-creation as a design methodology to create a data infrastructure system for social policy researchers?

We present our insights on how social policy researchers organize their data work, and how we co-designed a data infrastructure to support them. According to the evaluation, the system was successfully adopted. We share our recommendations for data infrastructure projects based on our co-creation study.

Motivation and methods

Our case study is based on a multidisciplinary research project on global social policy involving 29 researchers from political science, sociology, geography, and computer science (CRC 1342: Global Dynamics of Social Policy, 2022). We report our insights from the first three years of our ongoing collaboration. The main goal of the project is to collect data on social policies worldwide. The data involves not only social policy indicators (i.e., variables) created by the researchers, but also indicators collected by institutions such as the World Bank. We designed an information system to harmonize, share, and explore this data.

We applied co-creation as a design methodology (co-design). Co-creation is based on conducting regular workshops with the stakeholders to design a solution not only for them but also with them (Sanders, 2008). In the workshops, we used well-known methods for creative work such as wishful thinking (Kerzner et al., 2019), paper prototyping (Snyder, 2003), and reflective discussions (Molina León and Breiter, 2020).
To learn more about their work, we conducted contextualized interviews with researchers in different project roles, and collected artifacts such as data files, papers, and data analysis scripts. All the interviews and discussions were recorded and analyzed through open coding according to grounded theory. To evaluate the collaboration and the system, we conducted a survey and two user studies whose results we present in the Evaluation section.

The Information System

Through the workshops and interviews, we elicited and iteratively refined the following design requirements for the system:

R1 Support data harmonization. The researchers collected time series data from various sources in different formats (e.g., books, CSV files). They required support in combining the datasets and preparing them for analysis.
R2 Support data validation. Compliance with the agreed data standards needed to be checked systematically. The researchers wished for support in checking the data, e.g., verifying country names.
R3 Enable interactive data exploration. Once the data was in the system, the social scientists wished for tools to search and filter the indicators according to their research interests.
R4 Allow flexible sharing of data and resources. Sharing was a priority to collaborate with other researchers. Sharing tools would help ensure transparency, reproducibility, and reuse of their research and data.

To support data harmonization (R1), we established data and metadata standards as guidelines for the data collection and merging processes. We created a dedicated wiki to document the standards and the data itself, ensuring a high level of documentation quality and transparency. Furthermore, we co-developed a universal dataset template. The template covered all necessary attributes for each data point and its metadata. We also harmonized existing practices in data coding and established coding rules.
These rules described the requirements for each template item, such as country codes and naming guidelines.

For the data validation (R2), we implemented a validation pipeline, which checked whether the uploaded data fulfilled the standards and gave detailed feedback otherwise.

For data exploration (R3), we designed three interfaces that present the data in different ways:

1. Indicator page. This page presents all the information about a particular indicator, covering its coding rules, sources, and more. Since the researchers wished to discover and analyze spatio-temporal data patterns, the page supports exploration through a coverage visualization, interactive search and filtering options, and a wide range of visualizations tailored for each data type (see Figure 1).
2. Country profile. Many theories and explanations in social policy research focus on countries as the focal unit of analysis. Thus, we co-designed profiles that zoom in on a specific country and shift the focus to the development within it. As such, the profile is a valuable tool to inform area studies, providing easy access to a set of key indicators.
3. Data Explorer. Here, we focus on supporting the analysis of multiple indicators simultaneously by providing basic correlation insights and visualizations tailored to different combinations of indicator types. While correlation is not causation, it helps uncover possible relationships that can be further inspected and may inform inductive reasoning.

Figure 1. The Electoral democracy index indicator page (upper part of the interface).

To support data sharing (R4), all pages provide various exporting options with version control, and all visualizations are downloadable. While the system is still being prepared for general public access, registered users can compile indicators into so-called "datasets" and share them with non-registered users via token-based URLs.
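
The token-based sharing described above could work roughly as in the following minimal sketch. It is not the project's implementation: the `ShareRegistry` class, the `BASE_URL` address, and the indicator identifier are hypothetical names introduced for illustration. The sketch only assumes that each shared dataset is a list of indicator identifiers and uses Python's standard `secrets` module to generate hard-to-guess tokens.

```python
import secrets
from typing import Dict, List, Optional

# Hypothetical base address; the real system's URL is not given in the text.
BASE_URL = "https://example.org/shared"


class ShareRegistry:
    """Maps hard-to-guess tokens to compiled datasets (illustrative only)."""

    def __init__(self) -> None:
        self._datasets_by_token: Dict[str, List[str]] = {}

    def create_share_url(self, indicator_ids: List[str]) -> str:
        """Store a dataset (a list of indicator IDs) and return a share URL."""
        token = secrets.token_urlsafe(16)  # 16 random bytes as URL-safe text
        self._datasets_by_token[token] = list(indicator_ids)
        return f"{BASE_URL}/{token}"

    def resolve(self, url: str) -> Optional[List[str]]:
        """Return the dataset behind a share URL, or None for unknown tokens."""
        token = url.rsplit("/", 1)[-1]
        return self._datasets_by_token.get(token)


# Usage: a registered user shares one indicator; anyone with the URL can read it.
registry = ShareRegistry()
share_url = registry.create_share_url(["electoral_democracy_index"])
print(registry.resolve(share_url))
```

In this design, access depends only on knowing the token, so no account is needed to open a shared dataset, and an unknown or mistyped token simply resolves to nothing.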
For script sharing, we created the Community Notebooks page, where researchers can upload computational notebooks to reproduce and replicate results.

Evaluation

After the first five workshops, we conducted a survey to investigate how the researchers perceived the collaboration so far. Eight researchers participated. Despite the small sample, the results provided relevant insights. Paper prototyping and group discussions were the most preferred activities, as they allowed the experts to concretize their ideas and refine them by discussing them with their peers. While researchers with high attendance were more positive about how their participation influenced the outcome, half of the participants did not find such regular meetings helpful for their work but noted that the workshops were the place where they learned most about the research of their colleagues.

A few months later, the first version of the system was almost ready to be released within the project. Before doing so, we conducted a small user study to evaluate the interface design and to further assess the benefits and limitations of our participatory approach. The researchers performed three navigation tasks focused on the data visualizations, and participated in an interview. We had six participants. That was the first time they could interact with the system, and four participants reported being impressed because it offered more options than other systems they knew. This led to more positive answers about our collaboration being helpful for their work. In the interviews, the most frequently mentioned issue was that not everyone was attending the workshops. Initially, we invited all researchers to encourage openness and diversity, but only a few attended regularly.

Shortly after releasing the system, we conducted a second study with 12 researchers to evaluate the validation and exploration features. The study consisted of five tasks. The first and second tasks required uploading a dataset, with and without errors.
The other tasks involved searching and exploring a given indicator, interacting with a Country profile, and exploring indicator relationships in the Data Explorer. After each task, participants rated its difficulty and shared any problems they had. Figure 2 presents the difficulty ratings. All but one participant completed the validation tasks successfully, and everyone finished the exploration tasks successfully. Overall, the outcome was positive because most participants found all tasks easy to perform. The researchers found the validation tests especially helpful for verifying the data. However, adjusting the data to the established standards required additional work compared to their previous manual approach. They especially appreciated the option to combine indicators in the Data Explorer, which is missing in other systems.

Regarding the co-creation process, the evaluation showed that the system fulfilled the requirements and that the participants felt that their ideas were included. However, the diversity of goals among the researchers, combined with the inconsistent attendance, made it challenging to design custom features. Moreover, the researchers saw the benefit of a systematized workflow for future colleagues but considered that co-creating increased their workload.

Figure 2. Difficulty ratings per task in the second user study, evaluating the data validation and exploration features. Tasks: T1: Uploading a valid file; T2: Uploading a file with errors; T3: Browsing by indicator; T4: Accessing a country profile; T5: Using the Data Explorer. Rating scale: Very easy, Easy, Neutral, Difficult, Very difficult.

Recommendations for data infrastructure projects

Based on our case study, we propose the following recommendations for researchers and practitioners who plan to co-create a data infrastructure:

1. Ensure that a limited yet representative group of participants is actively involved in the process. Initially, we invited all researchers. We noticed that too many people were involved, some attended rarely, and power structures influenced who voiced their opinion (e.g., doctoral students hesitated before disagreeing with their supervisors). Over time, we decided to invite only two people per research group and to organize teams mixing different groups and roles.
2. Assess the status and amount of data available early on. We planned to use example datasets for designing the system early on, yet such datasets were not ready. Thus, the design and development had to happen in parallel to the data collection, which is not rare for research data management systems.
3. Connect individual and group goals, working in short iterations. Long-term projects struggle with keeping participants engaged. Discuss the individual goals of every participant and how they connect to the project goal, prioritizing a balance between both. Short work iterations lead to less repetition and make it easier to include the input of the participants in every step.
4. Define the roles and tasks of the participants early on. The expectations of the social scientists about the computer scientists, and vice versa, were mismatched because each group overestimated the work speed of the other. This illustrates how misconceptions can easily occur in multidisciplinary projects. Although participatory methods are favored to get everyone's voice heard, it is also important to clearly define the tasks and commitment needed for the collaboration to succeed.

Discussion and conclusions

Tenopir et al. (2015) suggest that creating a sound data infrastructure is a way to promote data sharing among researchers. However, designing for reproducibility has multiple constraints and challenges (Feger et al., 2020). Our study shows that designing such a system is a long-term process that requires a close and exhaustive collaboration.
In the workshops, we found that some researchers did not identify themselves as users because it would take a long time for the system to reach a state where it could provide immediate benefits. This reflects one of the challenges of developing groupware applications reported by Grudin (1994): the disparity between work and (immediate) benefit.

Promoting collaboration among the researchers was another positive outcome beyond the system adoption. Participants developed a shared understanding of their collaborative research in the workshops. This confirms the findings of Neang et al. (2021) with oceanographers.

Overall, our case study presents insights on how to co-create a data infrastructure for social policy research. Accordingly, we provide our recommendations for similar endeavors. Our work contributes to the open science efforts within the scientific community.

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Projektnummer 374666841, SFB 1342.

References

CRC 1342: Global Dynamics of Social Policy (2022): ‘About the CRC 1342’.
Feger, S. S., P. W. Wozniak, L. Lischke, and A. Schmidt (2020): ‘"Yes, I Comply!": Motivations and Practices around Research Data Management and Reuse across Scientific Fields’. Proc. ACM Hum.-Comput. Interact., vol. 4, no. CSCW2.
Grudin, J. (1994): ‘Groupware and Social Dynamics: Eight Challenges for Developers’. Commun. ACM, vol. 37, no. 1, pp. 92–105.
Jirotka, M., C. P. Lee, and G. M. Olson (2013): ‘Supporting scientific collaboration: Methods, tools and concepts’. Computer Supported Cooperative Work (CSCW), vol. 22, no. 4, pp. 667–715.
Kerzner, E., S. Goodwin, J. Dykes, S. Jones, and M. Meyer (2019): ‘A Framework for Creative Visualization-Opportunities Workshops’. IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 1, pp. 748–758.
Lee, C. P., P. Dourish, and G. Mark (2006): ‘The Human Infrastructure of Cyberinfrastructure’. In: Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work. New York, NY, USA, pp. 483–492, Association for Computing Machinery.
Molina León, G. and A. Breiter (2020): ‘Co-creating Visualizations: A First Evaluation with Social Science Researchers’. Computer Graphics Forum, vol. 39, no. 3, pp. 291–302.
Neang, A. B., W. Sutherland, M. W. Beach, and C. P. Lee (2021): ‘Data Integration as Coordination: The Articulation of Data Work in an Ocean Science Collaboration’. Proc. ACM Hum.-Comput. Interact., vol. 4, no. CSCW3.
Sanders, E. (2008): ‘An evolving map of design practice and design research’. interactions, vol. 15, no. 6, pp. 13–17.
Savage, C. J. and A. J. Vickers (2009): ‘Empirical Study of Data Sharing by Authors Publishing in PLoS Journals’. PLOS ONE, vol. 4, no. 9, pp. 1–3.
Snyder, C. (2003): Paper Prototyping: The Fast and Easy Way to Design and Refine User Interfaces. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Tenopir, C., E. D. Dalton, S. Allard, M. Frame, I. Pjesivac, B. Birch, D. Pollock, and K. Dorsett (2015): ‘Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide’. PLOS ONE, vol. 10, no. 8, pp. 1–24.
Velden, T., M. J. Bietz, E. I. Diamant, J. D. Herbsleb, J. Howison, D. Ribes, and S. B. Steinhardt (2014): ‘Sharing, Re-Use and Circulation of Resources in Cooperative Scientific Work’. In: Proceedings of the Companion Publication of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing. New York, NY, USA, pp. 347–350, Association for Computing Machinery.
Vertesi, J. and P. Dourish (2011): ‘The Value of Data: Considering the Context of Production in Data Economies’. In: Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work. New York, NY, USA, pp. 533–542, Association for Computing Machinery.