The BiRCh corpus is a project in progress. The ultimate product will be twofold.
First, we are collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia and Ukraine, Germany, and the U.S. and Canada.
We are aiming for data collection over a decade, and hoping that at least a few families from each geographical area will participate for 5-10 years.
Transcripts of this data, amounting to several million words, time-aligned with the audio speech signal, and fully text-searchable, will constitute the "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)" (Dubinina, Malamud & Denisova-Schmidt).
Second, we are building a 1-million word corpus based on a subset of this data, the "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)" (Malamud, Dubinina, Lưu & Xue) with two basic components:
NEWS! Our project "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)" has been awarded an NSF grant. The funds will be used to create the 1-million-word corpus and to conduct research on passives, impersonals, and politeness markers in monolingual and bilingual Russian.
The purpose of this project is to create an online, freely available audio-aligned parsed corpus of language produced by children acquiring Russian in monolingual and bilingual contexts. Though the language of immigrant communities is often stigmatized and deprecated (even by its speakers), it is of central importance to the cultural identity and practices of these communities, and its study is crucial to understanding the fundamental properties of linguistic knowledge, language acquisition, and language maintenance.
The first and major step in creating a corpus is data collection. Therefore, the main aim of this project is to collect and aggregate longitudinal data on language development of monolingual children in Russia, Russian-English bilingual children in the US, and Russian-German bilingual children in Germany. The recordings will present a wealth of data on speech phenomena, such as speech rate and intonation, and time-aligning the digital recordings with the transcripts will allow researchers to rapidly find desired parts of the speech signal by searching the transcribed text.
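As a rough illustration of how time alignment makes the audio searchable through the text, the sketch below stores each transcript segment with its start and end times and looks up segments by a text query. It is not the project's actual tooling or data format: the `Segment` class, its field names, the speaker labels, and the toy utterances are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript segment aligned to a span of the audio recording."""
    start_s: float  # segment onset in seconds (hypothetical field)
    end_s: float    # segment offset in seconds (hypothetical field)
    speaker: str    # illustrative labels: "CHI" for child, "MOT" for mother
    text: str       # transcribed utterance


def find_segments(segments, query):
    """Return every segment whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [seg for seg in segments if needle in seg.text.casefold()]


# Toy data invented for illustration; not taken from the corpus.
transcript = [
    Segment(0.00, 1.85, "MOT", "Где твоя книга?"),
    Segment(2.10, 3.40, "CHI", "Книга тут."),
    Segment(3.90, 5.25, "MOT", "Давай читать."),
]

hits = find_segments(transcript, "книга")
for seg in hits:
    # The time span tells a researcher where to listen in the recording.
    print(f"{seg.speaker} {seg.start_s:.2f}-{seg.end_s:.2f}s: {seg.text}")
```

Each hit carries its own time span, so a text match leads directly to the corresponding stretch of the recording, where features such as speech rate and intonation can be examined.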
The second aim of the project is the creation of a grammatically annotated corpus. Such a corpus will serve as a powerful research tool for investigating the grammar of Russian as spoken in Russia and by immigrants in Germany and the U.S., and the different factors influencing language acquisition in monolingual and immigrant bilingual contexts. This resource will ultimately help advance knowledge in the fields of linguistics, language acquisition, and bilingualism.
Research on grammar, meaning, and language use must be based on data that allows researchers to see linguistic structure, meaning, and context. As research in other subfields of linguistics has shown, large collections of language data annotated with information about linguistic structure can bring about major advances. For instance, parsed corpora of historical English (Kroch & Taylor 1999, Taylor et al. 2003, Kroch et al. 2004) led to groundbreaking discoveries about the processes that defined the shape of English today and allowed linguists to gain a greater understanding of the very nature of language change.
An annotated corpus of monolingual and bilingual child speech would provide crucial data for researchers investigating the culture and speech of immigrant and monolingual Russian communities, the development of heritage languages, and language acquisition more generally. It would also supply the necessary information for practitioners developing language materials for heritage learners, for parents raising bilingual children, and for policy makers drafting appropriate rules and procedures.
Dubinina, Irina, Sophia A. Malamud, and Elena Denisova-Schmidt. "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)". 2013-present.
Malamud, Sophia A., Irina Dubinina, Alex Lưu, and Nianwen Xue. "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)". 2017-present.
• Irina Dubinina, Assistant Professor of Russian, Director of the Russian Language Program, Brandeis University
• Sophia Malamud, Associate Professor of Language & Linguistics, Department of Computer Science, Brandeis University
• Elena Denisova-Schmidt, Lecturer, School of Humanities and Social Sciences, University of St. Gallen
• Nianwen Xue, Associate Professor of Linguistics and Computer Science, Brandeis University
Research Assistants, Brandeis:
• Alex Lưu, computational tools; corpus linguistics, pragmatics, and computational linguistics research, including converting SynTagRus into a Penn Treebank-style syntactically parsed corpus of Russian
• Benjamin Rozonoyer, segmentation/sentence tokenization, transcription, adjudication
• Masha Shaposhnikova, transcription, adjudication, research on disfluencies
• Yan Shneyderman, adjudication, annotation, research on disfluencies
Other Research Assistants:
Dina Akselrod, Robin Goodfellow Malamud, Olena Prusikin, Ilya Rozonoyer
This project is supported by
A Leonardo da Vinci, EU grant to Elena Denisova-Schmidt [project BILIUM], 09/2012 - 07/2014
Theodore and Jane Norman Award, Brandeis University to Irina Dubinina, summer 2014
Provost Research Grant, Brandeis University to Sophia Malamud and Irina Dubinina, 07/2015 - 07/2016
The Faculty Grant from the Mandel Foundation for Humanities to Sophia Malamud and Irina Dubinina, 01/2016 - 12/2017
Provost Research Grant, Brandeis University to Irina Dubinina, 07/2016 - 07/2017
Brandeis Dean of Arts and Sciences Collaborative Faculty-Student research award to Sophia Malamud, Irina Dubinina, Masha Shaposhnikova, Yan Shneyderman, spring 2017
National Science Foundation Award BCS-1651083 to Sophia Malamud (PI), Irina Dubinina (co-PI), Nianwen Xue (co-PI), 08/2017 - 01/2021
Welcome letter and instructions for parents
Guidelines for transcription & disfluency annotation
Segmentation for aligning transcripts with audio follows the principle that each segment = sentence token.
Disfluency annotation and tokenization follow the Penn Syntactic annotation manual for audio-aligned parsed corpora.
[full citation on MILa: publications page]