授權許可 License | CC0 公共領域 Public Domain |
語言 Language | 粵語 Cantonese(ISO 639-3: yue) |
總時長 Total Duration | 66.01 個鐘 hours(3960.73 分鐘 minutes) |
總字數(含標點) Total Characters # (including punctuation) | 946176 |
發音人 Voice Actor | 張悦楷 |
介紹 Introduction
本數據集由廣州最出名嘅話劇演員、説書藝人(講古佬)張悦楷喺 1980 年代電台播講《三國演義》嘅錄音製成。數據集所有文本均由人工轉寫,並根據《三國演義》原文校對嚟確保準確性。
This dataset is built from 1980s radio recordings of Zoeng Jyut Gaai, the most famous drama actor and storyteller in Canton, narrating Romance of the Three Kingdoms. All text in the dataset was transcribed by hand and proofread against the original text of Romance of the Three Kingdoms to ensure accuracy.
本數據集可用於各種用途,例如語音合成(TTS)、語音識別(ASR)、語言模型(LLM)、語言學分析等等。 張悦楷語音合成 就係一個用本數據集訓練出嚟嘅 TTS 系統。
This dataset can serve many purposes, such as text-to-speech (TTS), automatic speech recognition (ASR), language modeling (LLM), and linguistic analysis. For example, 張悦楷語音合成 is a TTS system trained on this dataset.
數據樣例 Data samples
當今天下嘅英雄,就係使君你,同我喇。
唉!既生瑜,何生亮!既生瑜,何生亮!既生瑜,何生亮啊!
王朗講完,孔明喺架車上哈哈大笑佢話:哈哈哈哈哈哈哈哈,我仲以為堂堂漢朝嘅大老元臣,所講嘅道理必定十分高明嘅,點估到竟然如此卑鄙啊!
下載 Download
如果你想單純克隆所有 wav 文件,可以用下面嘅命令嚟凈係克隆個 wav/ 路徑,避免 clone 晒成個 repo:
If you only want the wav files and do not want to clone the entire repo, use the following commands to clone the wav/ directory only:
mkdir zoengjyutgaai_saamgwokjinji
cd zoengjyutgaai_saamgwokjinji
git init
git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji
git sparse-checkout init --cone
# 指定凈係下載個別路徑 Tell git which directory you want
git sparse-checkout set wav
# 開始下載 Pull the content
git pull origin main
所有文字轉寫都喺 wav/metadata.csv 入面。
All text transcriptions are in wav/metadata.csv.
説明 Info
所有源字幕 SRT 文件都存放喺 Hugging Face 倉庫嘅 srt/ 路徑下。所有源音頻都以 .webm 格式放喺 webm/ 路徑下。
All source subtitle (SRT) files are stored in the srt/ directory of the Hugging Face repository. All source audio is stored in .webm format in the webm/ directory.
- 所有文本都根據 jyutping.org/blog/typo 同 jyutping.org/blog/particles/ 規範用字
- 所有文本都使用全角標點,冇半角標點
- 所有文本都用漢字轉寫,無阿拉伯數字無英文字母
- 所有音頻源都存放喺 webm/ 下面
- All text follows the orthography conventions at jyutping.org/blog/typo and jyutping.org/blog/particles/.
- All text uses full-width punctuation only; there is no half-width punctuation.
- All text is written in Chinese characters, with no Latin letters or Arabic numerals.
- All source audio is stored in webm/.
數據統計 Statistics
總時長 Total Duration | 66.01 個鐘 hours(3960.73 分鐘 minutes) |
平均音頻時長 Average Clip Duration | 6.065 秒 seconds |
中位音頻時長 Median Clip Duration | 5.606 秒 seconds |
最短音頻時長 Min Clip Duration | 0.339 秒 seconds |
最長音頻時長 Max Clip Duration | 31.822 秒 seconds |
平均每句字數(含標點) Average Characters Per Clip (including punctuation) | 24.00 |
中位每句字數(含標點) Median Characters Per Clip (including punctuation) | 23 |
文本總字數(含標點) Total Characters # (including punctuation) | 946176 |
覆蓋漢字數 Unique Chinese Characters Coverage | 3988 |
平均語速(含標點) Average Speaking Rate (including punctuation) | 3.98 字/秒 characters per second |
採樣率 Sampling Rate | 44100 Hz |
音頻文件格式 Audio file format | .wav |
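The headline figures in the table are related by simple arithmetic, which can be used as a sanity check. The sketch below only uses the totals reported above (3960.73 minutes and 946176 characters); the function names are illustrative, not part of any dataset tooling.

```python
# Totals as reported in the statistics table.
TOTAL_MINUTES = 3960.73
TOTAL_CHARS = 946176


def hours(minutes):
    """Convert a duration in minutes to hours."""
    return minutes / 60


def chars_per_second(total_chars, minutes):
    """Average speaking rate: characters (including punctuation) per second."""
    return total_chars / (minutes * 60)


if __name__ == "__main__":
    print(f"{hours(TOTAL_MINUTES):.2f} hours")  # 66.01 hours
    print(f"{chars_per_second(TOTAL_CHARS, TOTAL_MINUTES):.2f} characters per second")  # 3.98
```

Both results agree with the reported total duration and average speaking rate.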
引用 Citation
本數據集屬公共領域,遵循 CC0 許可聲明。即係話你可以無需授權免費任用本數據集,亦都唔需要註明出處。不過如果你用咗本數據集,我哋都希望你可以引用本頁面,作為對楷叔嘅懷念同致敬:
This dataset is in the public domain under the CC0 declaration. You may use it freely without asking for permission, and attribution is not required. If you do use it, however, we hope you will cite this page in memory of, and as a tribute to, Gaai Suk:
@misc{zoengjyutgaai2025,
  title = {張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset},
  author = {粵語計算語言學基礎建設組 Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
  howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
  year = {2025}
}
意見反饋 Feedback
數據集建設難免有疏漏,如果你發現有任何錯誤、問題,或者有任何意見,歡迎喺 Hugging Face 討論區 提出。
Some oversights are inevitable in building a dataset. If you find any errors or problems, or have any suggestions, please raise them in the Hugging Face discussion forum.