
Mandarin-English Code-Switching in South-East Asia

Resource
URL
https://dss2.princeton.edu/data/247/
Blurb

This corpus comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers, with associated transcripts. Code-switching refers to the practice of shifting between languages or language varieties during conversation. The corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends, and daily activities.

Data

The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.

The speech recordings were made in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, each between 20 and 120 minutes in length.
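For planning storage, the stated sample rate and bit depth give a rough upper bound on the decoded size of each file. The following is a minimal sketch, not part of the corpus documentation. It assumes single-channel audio (the channel count is not stated here), and the function name `pcm_size_bytes` is hypothetical. FLAC files on disk will be smaller than these figures.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the corpus description
BYTES_PER_SAMPLE = 2      # 16-bit samples


def pcm_size_bytes(minutes: float, channels: int = 1) -> int:
    """Uncompressed PCM payload size for a recording of the given length.

    channels=1 is an assumption; check the release documentation.
    """
    seconds = minutes * 60
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)


# Shortest and longest recording lengths mentioned in the description.
for minutes in (20, 120):
    print(f"{minutes} min: {pcm_size_bytes(minutes) / 1e6:.1f} MB decoded")
```

Under the single-channel assumption, this works out to about 38.4 MB for a 20-minute file and 230.4 MB for a 120-minute file once decoded.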

Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances. The transcription for each audio file is stored as a UTF-8, tab-separated text file.
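Because the transcripts are UTF-8, tab-separated text, they can be read with a standard delimited-text parser. The sketch below is illustrative only: the column layout (start time, end time, text) and the sample rows are hypothetical, so consult the release documentation for the actual fields. The helper `is_code_switched` is a crude heuristic, not the corpus's own annotation: it flags text containing both Han characters and Latin letters.

```python
import csv
import io
import re

HAN = re.compile(r"[\u4e00-\u9fff]")   # basic CJK Unified Ideographs block
LATIN = re.compile(r"[A-Za-z]")


def read_rows(stream):
    """Yield each non-empty row of a tab-separated stream as a list of fields."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


def is_code_switched(text: str) -> bool:
    """Heuristic: True if the text mixes Han characters and Latin letters."""
    return bool(HAN.search(text)) and bool(LATIN.search(text))


# Hypothetical sample; assumes the transcript text is the last column.
sample = "0\t1500\t我 觉得 this one 很 好\n1500\t2800\tokay lah\n"
rows = list(read_rows(io.StringIO(sample)))
flags = [is_code_switched(row[-1]) for row in rows]
print(flags)  # [True, False]
```

For a real transcript, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")`, where `path` points to one of the transcription files.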

The development and training divisions are available as a separate download (SEAME_train_dev_division.zip) and on the provider's GitHub page.

Link time
2023-02-23 23:26:00 UTC
Resource type
Single study
Subjects
  • Art & Culture
Regions
  • Asia
Countries
  • China
  • Malaysia
  • Singapore