授權許可
License
CC0 公共領域
Public Domain
語言
Language
粵語
Cantonese
ISO 639-3: yue
總時長
Total Duration
112.54 個鐘 hours
(6752.70 分鐘 minutes)
總字數(含標點)
Total # Characters (including punctuation)
1,679,097
發音人
Voice Actor
張悦楷
介紹 Introduction
本數據集由廣州最出名嘅話劇演員、説書藝人(講古佬)張悦楷喺 1980 年代電台播講《三國演義》《水滸傳》《走進毛澤東的最後歲月》嘅錄音製成。數據集所有文本均由人工轉寫,並根據原文校對嚟確保準確性。
This dataset is built from 1980s radio recordings of Zoeng Jyut Gaai (張悦楷), the most famous drama actor and storyteller in Guangzhou (Canton), performing Romance of the Three Kingdoms, Water Margin, and The Final Days of Mao Zedong. All text in the dataset was transcribed manually and proofread against the original texts to ensure accuracy.
本數據集可用於各種用途,例如語音合成(TTS)、語音識別(ASR)、語言模型(LLM)、語言學分析等等。 張悦楷語音合成 就係一個用本數據集訓練出嚟嘅 TTS 系統。
This dataset is multi-purpose. It can be used for text-to-speech (TTS), automatic speech recognition (ASR), language modeling, linguistic analysis, and more. As an example, 張悦楷語音合成 is a TTS system trained on this dataset.
數據樣例 Data samples
當今天下嘅英雄,就係使君你,同我喇。
武松見到隻老虎返轉頭唄,就雙手捹起條棍出盡平生之力,將條棍喺半空中劈落去嘞喎。
佢連珠炮噉話:孟錦雲,你反動透頂,你反對毛主席,你罪該萬死!
下載 Download
如果想單純將 `opus/` 入面所有嘢下載落嚟,可以跑下面嘅 Python 代碼,注意要安裝 `pip install --upgrade huggingface_hub` 先:
If you want to download only the `opus/` directory, run the following Python code. Install the dependency first with `pip install --upgrade huggingface_hub`:
from huggingface_hub import snapshot_download
# 如果淨係想下載啲字幕或者源音頻,就將 `opus/*` 改成 `srt/*` 或者 `source/*`
# If you only want to download subtitles or source audio, change `opus/*` to `srt/*` or `source/*`
snapshot_download(
    repo_id="CanCLID/zoengjyutgaai",
    allow_patterns="opus/*",
    local_dir="./",
    repo_type="dataset",
)
如果唔想用 Python,可以用下面嘅命令嚟淨係克隆個 `opus/` 路徑,避免克隆晒成個倉庫:
If you don't want to use Python, use the following commands to clone only the `opus/` directory instead of the whole repository:
mkdir zoengjyutgaai
cd zoengjyutgaai
git init
git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai
git sparse-checkout init --cone
# 指定凈係下載個別路徑 Tell git which directory you want
git sparse-checkout set opus
# 開始下載 Pull the content
git pull origin main
所有文字轉寫都喺 `opus/saamgwokjinji/metadata.csv` 同 `opus/seoiwuzyun/metadata.csv` 入面。
All text transcriptions are in `opus/saamgwokjinji/metadata.csv` and `opus/seoiwuzyun/metadata.csv`.
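As a rough illustration of how a downloaded `metadata.csv` could be inspected, here is a minimal sketch using only the Python standard library. It assumes the file is UTF-8 encoded and starts with a header row. The column names are not documented on this page, so the function reports whatever header it finds rather than assuming any. `load_metadata` is a hypothetical helper name, not part of the dataset tooling.

```python
import csv
from pathlib import Path


def load_metadata(csv_path):
    """Read a metadata.csv file and return (column names, rows as dicts).

    Assumption: the file is UTF-8 and its first line is a header row.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


# Example usage after downloading the opus/ directory:
# columns, rows = load_metadata("opus/saamgwokjinji/metadata.csv")
# print(columns)    # inspect the actual column names first
# print(len(rows))  # number of transcribed clips in this subset
```

Checking the header before relying on any particular column keeps the sketch valid even if the file layout differs from what you expect.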
説明 Info
所有源字幕 SRT 文件都存放喺 Hugging Face 倉庫嘅 `srt/` 路徑下。為咗節省儲存空間,所有源音頻都已經轉成 opus 格式放喺 `source/` 路徑下,切分好嘅句子音頻都放喺 `opus/` 路徑下。
All source subtitle SRT files are stored in the `srt/` directory of the Hugging Face repository. To save space, all source audio is stored in opus format in the `source/` directory, and the segmented sentence-level clips are stored in the `opus/` directory.
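For readers who want to work with the subtitle files directly, the following is a minimal sketch of a parser for the standard SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines, with blank lines between cues). It assumes the files in `srt/` follow that standard layout, which is not confirmed here. `parse_srt` and `to_seconds` are hypothetical helper names.

```python
import re

# Matches an SRT timestamp such as 00:01:02,345
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp string to seconds as a float."""
    match = TIMESTAMP.match(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    text = text.lstrip("\ufeff").replace("\r\n", "\n")
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Locate the timing line; skip blocks that have none.
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue
        start, end = lines[timing].split("-->", 1)
        cues.append((to_seconds(start), to_seconds(end), "\n".join(lines[timing + 1:])))
    return cues
```

Each cue's start and end time could then be used to locate the matching span in the corresponding file under `source/`.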
- 所有文本都根據 jyutping.org/blog/typo 同 jyutping.org/blog/particles/ 規範用字
- 所有文本都使用全角標點,冇半角標點
- 所有文本都用漢字轉寫,無阿拉伯數字無英文字母
- 所有音頻都轉為 opus 格式,48000 Hz 採樣率
- All text follows the orthography described in jyutping.org/blog/typo and jyutping.org/blog/particles/.
- All text uses full-width punctuation, with no half-width punctuation.
- All text is written in Chinese characters, with no Latin letters or Arabic numerals.
- All audio is in opus format with a 48,000 Hz sampling rate.
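The text conventions above can be spot-checked mechanically. The sketch below flags characters that would break them, under two simplifying assumptions: "Latin letters or Arabic numerals" is approximated by ASCII letters and digits, and "half-width punctuation" by ASCII punctuation. It does not check the orthography rules from jyutping.org. `find_violations` is a hypothetical helper name.

```python
import string

# Assumption: ASCII letters and digits stand in for "Latin letters" and "Arabic numerals".
ASCII_ALNUM = set(string.ascii_letters + string.digits)
# Assumption: ASCII punctuation stands in for "half-width punctuation".
HALF_WIDTH_PUNCT = set(string.punctuation)


def find_violations(text):
    """Return (index, character, reason) for each character that breaks the conventions."""
    problems = []
    for i, ch in enumerate(text):
        if ch in ASCII_ALNUM:
            problems.append((i, ch, "Latin letter or Arabic numeral"))
        elif ch in HALF_WIDTH_PUNCT:
            problems.append((i, ch, "half-width punctuation"))
    return problems


if __name__ == "__main__":
    # A transcript that follows the conventions yields an empty list.
    print(find_violations("當今天下嘅英雄,就係使君你,同我喇。"))
```

A check like this is useful when adding or correcting transcripts, since a single half-width comma is easy to miss by eye.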
數據統計 Statistics
項目 Metric | 全集 Total | 三國演義 saamgwokjinji | 水滸傳 seoiwuzyun | 走進毛澤東的最後歲月 mouzaakdung |
---|---|---|---|---|
總時長 Total Duration (個鐘 hours / 分鐘 minutes) | 112.54 / 6752.70 | 66.01 / 3960.73 | 38.62 / 2317.43 | 7.91 / 474.49 |
平均音頻時長 Average Clip Duration (秒 seconds) | 5.901 | 6.067 | 5.619 | 6.004 |
中位音頻時長 Median Clip Duration (秒 seconds) | 5.443 | 5.607 | 5.198 | 5.553 |
最短音頻時長 Min Clip Duration (秒 seconds) | 0.322 | 0.339 | 0.322 | 0.735 |
最長音頻時長 Max Clip Duration (秒 seconds) | 33.144 | 31.822 | 33.144 | 21.252 |
平均每句字數,含標點 Average Characters Per Clip, including punctuation | 24.36 | 24.00 | 24.86 | 24.77 |
中位每句字數,含標點 Median Characters Per Clip, including punctuation | 23 | 23 | 23 | 23 |
文本總字數,含標點 Total Characters, including punctuation | 1679097 | 940107 | 621682 | 117308 |
覆蓋漢字數 Unique Chinese Characters Coverage | 4597 | 3993 | 3520 | 2709 |
平均語速,含標點 Average Speaking Rate, including punctuation (字/秒 characters per second) | 4.14 | 3.96 | 4.47 | 4.12 |
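The speaking-rate row can be reproduced from the character counts and durations in the same table: characters divided by duration in seconds. The short sketch below does this for the full set and two of the subsets, copying the figures from the table. The variable and function names are illustrative only.

```python
# (total characters including punctuation, total duration in minutes), from the table
stats = {
    "total": (1679097, 6752.70),
    "seoiwuzyun": (621682, 2317.43),
    "mouzaakdung": (117308, 474.49),
}


def speaking_rate(characters, minutes):
    """Average speaking rate in characters per second."""
    return characters / (minutes * 60)


rates = {name: round(speaking_rate(chars, mins), 2) for name, (chars, mins) in stats.items()}
print(rates)  # {'total': 4.14, 'seoiwuzyun': 4.47, 'mouzaakdung': 4.12}
```

Because punctuation is included in the character counts, these rates slightly overstate the number of spoken syllables per second.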
引用 Citation
本數據集屬公共領域,遵循 CC0 許可聲明。即係話你可以無需授權免費任用本數據集,亦都唔需要註明出處。不過如果你用咗本數據集,我哋都希望你可以引用本頁面,作為對楷叔嘅懷念同致敬:
This dataset is in the public domain and is released under CC0. This means you can use it freely, without asking for permission and without attribution. However, if you do use this dataset, we hope you will cite this page as a tribute to Gaai Suk (楷叔):
@misc{zoengjyutgaai2025,
  title={The Zoeng Jyut Gaai Story-telling Speech Dataset},
  author={Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
  howpublished={\url{https://canclid.github.io/zoengjyutgaai/}},
  year={2025}
}
意見反饋 Feedback
數據集建設難免有疏漏,如果你發現有任何錯誤、問題,或者有任何意見,歡迎喺 Hugging Face 討論區 提出。
Some omissions are inevitable when building a dataset. If you find any errors or problems, or have any suggestions, feel free to raise them in the Hugging Face discussion forum.