Interdisciplinary Approaches to Data Collection, Annotation and Computational Processing of Code-Switched Languages around the World

A special issue of Languages (ISSN 2226-471X).

Deadline for manuscript submissions: closed (1 May 2022)

Special Issue Editors


Guest Editor
Department of Translation, Interpreting and Communication, Faculty of Arts and Philosophy, Ghent University, 9000 Ghent, Belgium
Interests: computational sociolinguistics; multilingualism; social media; spoken data; computational social sciences; language contact; language variation and change; low-resource languages; linguistic data curation and analysis; code-switching; usage-based approaches to language; multi-word expressions; natural language processing; constructions

Guest Editor
Microsoft Research India, Bengaluru 560001, India
Interests: code-switching; speech recognition; speech synthesis; low-resource languages; spoken dialogue systems

Special Issue Information

Dear Colleagues,

Code-switching (C-S) in multilingual settings has been studied extensively across language pairs and tuples, from both linguistic and computational linguistic points of view. Although there are valuable theoretical and data-based studies of C-S in linguistics (e.g., Bullock & Toribio, 2009; Fernández Fuertes et al., 2019), they usually collect and analyze (relatively) small-scale data sets, and their results are published in academic venues targeting fellow linguists (e.g., journals, books, workshops, and conferences in linguistics, bilingualism, and multilingualism). Research on C-S from a linguistic point of view is therefore often less visible to computational linguists who also work on C-S.

Recent developments in computational research make it possible to analyze large-scale, multilingual data with automated methods, some of which can also be applied to the analysis of C-S (e.g., Rijhwani et al., 2017; Vilares et al., 2016). So far, computational research has mostly focused on developing algorithms for processing code-switched languages. However, there are few systematic analyses of the types and frequencies of errors that arise from the computational processing of code-switched languages. Similarly, standards for annotating and evaluating C-S for computational processing are lacking. Although interest in analyzing code-switched languages is growing in computational research, there is an imbalance between well-studied and low-resource language pairs (e.g., C-S across African, Southeast Asian, and Indigenous languages). Finally, most computational research is published in venues attended by fellow computational linguists and/or speech technologists (e.g., Bali et al., 2020; Solorio et al., 2021; Sitaram et al., 2019), and is therefore less visible to the linguistic audience.

As described above, there is a gap, and a lack of collaboration, between linguistic and computational research on C-S, and the research output is not always visible across these domains. The first goal of this Special Issue is to bridge this gap and increase collaboration across disciplines. More specifically, we aim to familiarize the linguistic audience with the large-scale data sets, methods, and techniques available for C-S in computational research (e.g., natural language processing (NLP) and automatic speech processing). Secondly, we aim to make computational researchers aware of the rich linguistic research on C-S and multilingualism in general. We also hope to increase awareness among researchers of the linguistic and social factors that lead to different types of C-S across diverse language pairs and multilingual contexts (e.g., Doğruöz et al., 2021).

We invite submissions authored by computational linguists, speech technologists, and linguists describing novel or existing research on computational processing of C-S and/or proposing interdisciplinary solutions for the existing challenges (e.g., how linguistic research in C-S could be useful for the computational processing of C-S). We also welcome papers describing, analyzing, or providing alternatives for the creation, curation, and annotation of C-S datasets across languages around the world. Submissions may include, but need not be limited to:

  • Computational techniques for processing code-switched languages (including less documented and/or low-resource languages) around the world
  • Collection and annotation of code-switched spoken data for automatic speech processing
  • Collection and annotation of code-switched textual data for NLP
  • Development and use of multilingual computational models for processing spoken and textual C-S data sets
  • Self-supervised models for processing spoken and textual C-S data sets
  • Evaluation benchmarks for code-switched NLP methods and speech systems

The journal's submission guidelines can be found in the Manuscript Submission Information section below.

Tentative completion schedule:

  • Abstract Submission Deadline: 1 February 2022
  • Notification of Acceptance: 1 March 2022
  • Full Manuscript Deadline: 1 May 2022

References:

  1. Bali, K., et al. (2020). Proceedings of the First Workshop on Speech Technologies for Code-Switching in Multilingual Communities (WSTCSMC 2020).
  2. Bullock, B. E., & Toribio, A. J. (Eds.). (2009). The Cambridge Handbook of Linguistic Code-switching. Cambridge University Press.
  3. Doğruöz, A. S., Sitaram, S., Bullock, B. E., & Toribio, A. J. (2021). "A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies". Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021). Association for Computational Linguistics.
  4. Fernández Fuertes, R., Gómez Carrero, T., & Martínez, A. (2019). "Where the Eye Takes You: The Processing of Gender in Codeswitching".
  5. Rijhwani, S., Sequiera, R., Choudhury, M., Bali, K., & Maddila, C. S. (2017). "Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique". Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1971–1982.
  6. Vilares, D., Alonso, M. A., & Gómez-Rodríguez, C. (2016). "EN-ES-CS: An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis". Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 4149–4153.
  7. Solorio, T., Chen, S., Black, A. W., Diab, M., Sitaram, S., Soto, V., & Yilmaz, E. (2021). Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching (CALCS 2021).
  8. Sitaram, S., Chandu, K. R., Rallabandi, S. K., & Black, A. W. (2019). "A Survey of Code-switched Speech and Language Processing". arXiv preprint arXiv:1904.00784.

Dr. A.Seza Doğruöz
Dr. Sunayana Sitaram
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a double-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Languages is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1400 CHF (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • code-switching
  • multilingualism
  • data annotation
  • computational approaches
  • Natural Language Processing
  • Automatic Speech Processing
  • linguistic approaches
  • low-resource languages

Published Papers (5 papers)


Research

28 pages, 500 KiB  
Article
Multilingual Children’s Motivations to Code-Switch: A Qualitative Analysis of Code-Switching in Dutch-English Bilingual Daycares
by Nina-Sophie Sczepurek, Suzanne P. Aalberse and Josje Verhagen
Languages 2022, 7(4), 274; https://doi.org/10.3390/languages7040274 - 26 Oct 2022
Cited by 5
Abstract
This paper investigates code-switching in young multilingual children through a qualitative analysis. Our aim was to examine which types of code-switches occur and to categorize these in terms of children’s motivations for code-switching. Data were collected from 70 children aged two to three years who attended Dutch-English daycare in the Netherlands where teachers adopted a one-teacher-one-language approach. We observed seven types of code-switches. Motivations for code-switching related to social, metalinguistic, lexical, or conversational factors. These data indicate that young children can tailor their language choices towards the addressee, suggesting a certain level of meta-linguistic awareness and perspective taking. Implications for computational approaches are discussed.
19 pages, 1951 KiB  
Article
Traceback and Chunk-Based Learning: Comparing Usage-Based Computational Approaches to Child Code-Mixing
by Nikolas Koch, Stefan Hartmann and Antje Endesfelder Quick
Languages 2022, 7(4), 271; https://doi.org/10.3390/languages7040271 - 25 Oct 2022
Cited by 1
Abstract
Recent years have seen increased interest in code-mixing from a usage-based perspective. In usage-based approaches to monolingual language acquisition, a number of methods have been developed that allow for detecting patterns from usage data. In this paper, we evaluate two of those methods with regard to their performance when applied to code-mixing data: the traceback method, as well as the chunk-based learner model. Both methods make it possible to automatically detect patterns in speech data. In doing so, however, they place different theoretical emphases: while traceback focuses on frame-and-slot patterns, chunk-based learner focuses on chunking processes. Both methods are applied to the code-mixing of a German–English bilingual child between the ages of 2;3 and 3;11. Advantages and disadvantages of both methods will be discussed, and the results will be interpreted against the background of usage-based approaches.

18 pages, 534 KiB  
Article
Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation
by Joshua Jansen van Vüren and Thomas Niesler
Languages 2022, 7(3), 236; https://doi.org/10.3390/languages7030236 - 13 Sep 2022
Cited by 1
Abstract
In this study, we present improvements in N-best rescoring of code-switched speech achieved by n-gram augmentation as well as optimised pretraining of long short-term memory (LSTM) language models with larger corpora of out-of-domain monolingual text. Our investigation specifically considers the impact of the way in which multiple monolingual datasets are interleaved prior to being presented as input to a language model. In addition, we consider the application of large pretrained transformer-based architectures, and present the first investigation employing these models in English-Bantu code-switched speech recognition. Our experimental evaluation is performed on an under-resourced corpus of code-switched speech comprising four bilingual code-switched sub-corpora, each containing a Bantu language (isiZulu, isiXhosa, Sesotho, or Setswana) and English. We find in our experiments that, by combining n-gram augmentation with the optimised pretraining strategy, speech recognition errors are reduced for each individual bilingual pair by 3.51% absolute on average over the four corpora. Importantly, we find that speech recognition at language boundaries improves by 1.14%, even though the additional data is monolingual. Utilising the augmented n-grams for lattice generation, we then contrast these improvements with those achieved after fine-tuning pretrained transformer-based models such as distilled GPT-2 and M-BERT. We find that, even though these language models have not been trained on any of our target languages, they can improve speech recognition performance even in zero-shot settings. After fine-tuning on in-domain data, these large architectures offer further improvements, achieving a 4.45% absolute decrease in overall speech recognition errors and a 3.52% improvement over language boundaries. Finally, a combination of the optimised LSTM and fine-tuned BERT models achieves a further gain of 0.47% absolute on average for three of the four language pairs compared to M-BERT. We conclude that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.

18 pages, 481 KiB  
Article
Building Educational Technologies for Code-Switching: Current Practices, Difficulties and Future Directions
by Li Nguyen, Zheng Yuan and Graham Seed
Languages 2022, 7(3), 220; https://doi.org/10.3390/languages7030220 - 18 Aug 2022
Cited by 7
Abstract
Code-switching (CSW) is the phenomenon where speakers use two or more languages in a single discourse or utterance—an increasingly recognised natural product of multilingualism in many settings. In language teaching and learning in particular, code-switching has been shown to bring in many pedagogical benefits, including accelerating students’ confidence, increasing their access to content, as well as improving their participation and engagement. Unfortunately, however, current educational technologies are not yet able to keep up with this ‘multilingual turn’ in education, and are partly responsible for the constraint of this practice to only classroom contexts. In an effort to make progress in this area, we offer a data-driven position paper discussing the current state of affairs, difficulties of the existing educational natural language processing (NLP) tools for CSW and possible directions for future work. We specifically focus on two cases of feedback and assessment technologies, demonstrating how the current state-of-the-art in these domains fails with code-switching data due to a lack of appropriate training data, lack of robust evaluation benchmarks and lack of end-to-end user-facing educational applications. We present some empirical user cases of how CSW manifests and suggest possible technological solutions for each of these scenarios.
18 pages, 1480 KiB  
Article
And She Be like ‘Tenemos Frijoles en la Casa’: Code-Switching and Identity Construction on YouTube
by Michael Wentker and Carolin Schneider
Languages 2022, 7(3), 219; https://doi.org/10.3390/languages7030219 - 18 Aug 2022
Cited by 1
Abstract
This empirical case study explores the (co-)construction and negotiation of identities through code-switching (CS) as found on the video-sharing platform YouTube, disentangling the complexities of social practice anchored in a discursive online environment. Drawing on a YouTube comment corpus and paying special attention to the socio-technical affordances of the platform, the study examines users’ positioning practices and metapragmatic replies in response to a culturally themed video priming discussion about LatinX family stereotypes. More specifically, it analyses how users discursively position themselves vis-à-vis the video and which linguistic strategies they exploit to (co-)construct and negotiate their cultural identity. Focusing on interrelated positioning devices such as code-choice, identity labels and quoting, this contribution proposes a multi-level model of analysis to account for the dynamic interplay between CS practices and identity construction in a heterogeneous online space. Following a social-constructivist approach to identity, CS is shown to reinforce in-group solidarity rooted in the shared experience and discussion of LatinX culture and provides evidence of a sense of togetherness in an emerging community of practice.
