張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset

Open-source Cantonese voice dataset for ASR, TTS, LLMs, linguistic and literary research, and more

License: CC0 (Public Domain)
Language: Cantonese (ISO 639-3: yue)
Total Duration: 66.01 hours (3960.73 minutes)
Total Characters (including punctuation): 946176
Voice Actor: 張悦楷 (Zoeng Jyut Gaai)

Introduction

This dataset is built from 1980s radio broadcasts in which Zoeng Jyut Gaai (張悦楷), the most famous drama actor and storyteller (講古佬) in Canton, narrates Romance of the Three Kingdoms. All text was transcribed by hand and proofread against the original text of Romance of the Three Kingdoms to ensure accuracy.

The dataset serves several purposes, including text-to-speech (TTS), automatic speech recognition (ASR), language modeling (LLM), and linguistic analysis. For example, 張悦楷語音合成 is a TTS system trained on this dataset.

Data samples

當今天下嘅英雄,就係使君你,同我喇。

唉!既生瑜,何生亮!既生瑜,何生亮!既生瑜,何生亮啊!

王朗講完,孔明喺架車上哈哈大笑佢話:哈哈哈哈哈哈哈哈,我仲以為堂堂漢朝嘅大老元臣,所講嘅道理必定十分高明嘅,點估到竟然如此卑鄙啊!

Download

To fetch only the wav files without cloning the entire repo, use the following commands to check out the wav/ directory alone:

mkdir zoengjyutgaai_saamgwokjinji
cd zoengjyutgaai_saamgwokjinji
git init

git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji
git sparse-checkout init --cone

# Tell git which directory you want
git sparse-checkout set wav

# Pull the content
git pull origin main
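
The same subset could also be fetched from Python rather than git. The sketch below assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function with an `allow_patterns` filter. The function name `download_wav_subset`, the default `local_dir`, and the injectable `downloader` parameter are illustrative choices, not part of the dataset.

```python
from typing import Callable, Optional

# Repository id taken from the git remote URL used in the commands above.
REPO_ID = "CanCLID/zoengjyutgaai_saamgwokjinji"


def download_wav_subset(
    local_dir: str = "zoengjyutgaai_saamgwokjinji",
    downloader: Optional[Callable[..., str]] = None,
) -> str:
    """Fetch only the files under wav/ and return the local path."""
    if downloader is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    return downloader(
        repo_id=REPO_ID,
        repo_type="dataset",       # the repository is a dataset, not a model
        allow_patterns=["wav/*"],  # skip everything outside wav/
        local_dir=local_dir,
    )
```

Calling `download_wav_subset()` with no arguments would download into `zoengjyutgaai_saamgwokjinji/`. The `downloader` hook exists only so the function can be exercised without network access.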

All text transcriptions are in wav/metadata.csv.

Info

All source subtitle SRT files are stored in the srt/ directory of the Hugging Face repository. All source audio is stored in .webm format in the webm/ directory.

  • All text is standardized to the orthography described in jyutping.org/blog/typo and jyutping.org/blog/particles/.
  • All text uses full-width punctuation only; there is no half-width punctuation.
  • All text is written in Chinese characters, with no Latin letters or Arabic numerals.
  • All source audio is stored in webm/.

Statistics

| Metric | Value |
| --- | --- |
| Total duration | 66.01 hours (3960.73 minutes) |
| Average clip duration | 6.065 seconds |
| Median clip duration | 5.606 seconds |
| Minimum clip duration | 0.339 seconds |
| Maximum clip duration | 31.822 seconds |
| Average characters per clip (including punctuation) | 24.00 |
| Median characters per clip (including punctuation) | 23 |
| Total characters (including punctuation) | 946176 |
| Unique Chinese characters covered | 3988 |
| Average speaking rate (including punctuation) | 3.98 characters per second |
| Sampling rate | 44100 Hz |
| Audio file format | .wav |
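
The reported speaking rate is consistent with the totals: 946176 characters / (3960.73 × 60 seconds) ≈ 3.98 characters per second. Figures of this kind could be recomputed along the lines of the sketch below. It is not the script used to produce the table; the function names are illustrative, and the Han-character test covers only the basic CJK block and Extension A, which is an assumption about how "Chinese characters" are counted.

```python
import statistics
import wave
from typing import Dict, Iterable, Tuple


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM .wav file: frame count divided by frame rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def is_han(ch: str) -> bool:
    """Approximate test: CJK Unified Ideographs plus Extension A only."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def summarize(clips: Iterable[Tuple[float, str]]) -> Dict[str, float]:
    """Summarize (duration_seconds, transcription) pairs."""
    clips = list(clips)
    durations = [d for d, _ in clips]
    lengths = [len(t) for _, t in clips]  # len() counts punctuation too
    total_seconds = sum(durations)
    total_chars = sum(lengths)
    return {
        "total_hours": total_seconds / 3600,
        "mean_duration": statistics.mean(durations),
        "median_duration": statistics.median(durations),
        "min_duration": min(durations),
        "max_duration": max(durations),
        "mean_chars": statistics.mean(lengths),
        "median_chars": statistics.median(lengths),
        "total_chars": total_chars,
        "unique_han_chars": len({c for _, t in clips for c in t if is_han(c)}),
        "chars_per_second": total_chars / total_seconds,
    }
```

Pairing `wav_duration_seconds` for each clip with its transcription from wav/metadata.csv gives the input that `summarize` expects.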

Citation

This dataset is in the public domain under the CC0 dedication. You may use it freely, without permission and without attribution. If you do use it, though, we hope you will cite this page as a tribute to Gaai Suk (楷叔):

@misc{zoengjyutgaai2025,
    title={張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset},
    author={粵語計算語言學基礎建設組 Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
    howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
    year={2025}
}

Feedback

No dataset is free of oversights. If you find any errors or problems, or have any suggestions, feel free to raise them in the Hugging Face discussion forum.