張悦楷講古語音數據集

授權許可 License: CC0 公共領域 Public Domain

語言 Language: 粵語 Cantonese (ISO 639-3: yue)

總時長 Total Duration: 188.25 個鐘 hours（11295.07 分鐘 minutes）

總字數（含標點） Total # Characters (including punctuation): 2,903,094

發音人 Voice Actor: 張悦楷 Zoeng Jyut Gaai

介紹 Introduction

本數據集由廣州最出名嘅話劇演員、説書藝人（講古佬）張悦楷喺 1980 年代電台播講《三國演義》《水滸傳》《走進毛澤東的最後歲月》《鹿鼎記》嘅錄音製成。數據集所有文本均由人工轉寫，並根據原文校對嚟確保準確性。

This dataset is built from 1980s radio recordings of Zoeng Jyut Gaai (張悦楷), the best-known drama actor and storyteller (講古佬) in Guangzhou, performing Romance of the Three Kingdoms, Water Margin, The Final Days of Mao Zedong, and The Deer and the Cauldron. All text was transcribed by hand and proofread against the original books to ensure accuracy.

本數據集可用於各種用途，例如語音合成（TTS）、語音識別（ASR）、語言模型（LLM）、語言學研究等等。張悦楷語音合成就係一個用本數據集訓練出嚟嘅 TTS 系統。

This dataset serves many purposes, such as text-to-speech (TTS), automatic speech recognition (ASR), language modeling, and linguistics research. For example, 張悦楷語音合成 (Zoeng Jyut Gaai speech synthesis) is a TTS system trained on this dataset.

數據樣例 Data samples

當今天下嘅英雄，就係使君你，同我喇。

武松見到隻老虎返轉頭唄，就雙手捹起條棍出盡平生之力，將條棍喺半空中劈落去嘞喎。

佢連珠炮噉話：孟錦雲，你反動透頂，你反對毛主席，你罪該萬死！

下載 Download

前往 🤗 Hugging Face 下載 Download from 🤗 Hugging Face

如果想單純將 opus/ 入面所有嘢下載落嚟，可以跑下面嘅 Python 代碼，注意要安裝 pip install --upgrade huggingface_hub 先：

To download only the opus/ directory, run the following Python code. Install the dependency first with pip install --upgrade huggingface_hub:

from huggingface_hub import snapshot_download

# 如果淨係想下載啲字幕或者源音頻，就將 `opus/*` 改成 `srt/*` 或者 `source/*`
# If you only want to download subtitles or source audio, change `opus/*` to `srt/*` or `source/*`
snapshot_download(
    repo_id="CanCLID/zoengjyutgaai",
    allow_patterns="opus/*",
    local_dir="./",
    repo_type="dataset",
)
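
The comment above says the pattern can be swapped between `opus/*`, `srt/*`, and `source/*`. A minimal sketch of a wrapper that makes this choice explicit is shown below. The helper names `allow_pattern_for` and `download_subset` are hypothetical and not part of the dataset or of `huggingface_hub`; only `snapshot_download` and its arguments come from the snippet above.

```python
# Directories named in this README; anything else is rejected.
KNOWN_SUBSETS = ("opus", "srt", "source")


def allow_pattern_for(subset):
    """Build the allow_patterns value for one top-level directory (hypothetical helper)."""
    if subset not in KNOWN_SUBSETS:
        raise ValueError(f"unknown subset {subset!r}, expected one of {KNOWN_SUBSETS}")
    return f"{subset}/*"


def download_subset(subset, local_dir="./"):
    """Download one directory of the dataset (hypothetical helper, needs network access)."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="CanCLID/zoengjyutgaai",
        allow_patterns=allow_pattern_for(subset),
        local_dir=local_dir,
        repo_type="dataset",
    )


# Example (commented out because it downloads files):
# download_subset("srt", local_dir="./zoengjyutgaai")
```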

如果唔想用 Python ，可以用下面嘅命令嚟凈係克隆個 opus/ 路徑，避免克隆晒成個倉庫：

If you would rather not use Python, use the following commands to clone only the opus/ directory instead of the whole repository:

mkdir zoengjyutgaai
cd zoengjyutgaai
git init

git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai
git sparse-checkout init --cone

# 指定凈係下載個別路徑 Tell git which directory you want
git sparse-checkout set opus

# 開始下載 Pull the content
git pull origin main

所有文字轉寫都喺 opus/saamgwokjinji/metadata.csv 同 opus/seoiwuzyun/metadata.csv 入面。

All text transcriptions are in opus/saamgwokjinji/metadata.csv and opus/seoiwuzyun/metadata.csv.
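
Since the column layout of metadata.csv is not described here, a cautious first step is to inspect a file before relying on any column name. The sketch below assumes only that the file is UTF-8 encoded and has a header row; both are assumptions, and `summarize_metadata` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def summarize_metadata(csv_path):
    """Return (column names, number of data rows) for one metadata.csv file.

    Assumes UTF-8 text with a header row; adjust if the real file differs.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # reading the rows also populates reader.fieldnames
        return list(reader.fieldnames or []), len(rows)


# Example (commented out because it needs the downloaded file):
# columns, n_rows = summarize_metadata("opus/saamgwokjinji/metadata.csv")
# print(columns, n_rows)
```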

説明 Info

所有源字幕 SRT 文件都存放喺 Hugging Face 倉庫嘅 srt/ 路徑下。為咗節省儲存空間，所有源音頻都已經轉成 opus 格式放喺 source/ 路徑下，切分好嘅句子音頻都放喺 opus/ 路徑下。

All source subtitle (SRT) files are stored in the srt/ directory of the Hugging Face repository. To save storage space, all source audio has been converted to opus format and stored in the source/ directory, and the segmented sentence-level clips are stored in the opus/ directory.

所有文本都根據 jyutping.org/blog/typo 同 jyutping.org/blog/particles/ 規範用字
所有文本都使用全角標點，冇半角標點
所有文本都用漢字轉寫，無阿拉伯數字無英文字母
所有音頻都轉為 opus 格式，48000 採樣率

All text follows the orthography described at jyutping.org/blog/typo and jyutping.org/blog/particles/.
All text uses full-width punctuation, with no half-width punctuation.
All text is written in Chinese characters, with no Latin letters or Arabic numerals.
All audio is in opus format with a 48000 Hz sampling rate.
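
The text conventions above can be spot-checked mechanically. The following is a minimal sketch, not the maintainers' validation tool: `find_violations` is a hypothetical helper, and it only flags ASCII letters, ASCII digits, and ASCII punctuation, so it says nothing about the orthography rules or about full-width letters and digits.

```python
import string

# ASCII letters and digits stand in for "Latin letters or Arabic numerals".
_ASCII_ALNUM = set(string.ascii_letters + string.digits)
# ASCII punctuation stands in for "half-width punctuation".
_HALF_WIDTH_PUNCT = set(string.punctuation)


def find_violations(text):
    """Return a list of labels for conventions that `text` appears to break."""
    problems = []
    if any(ch in _ASCII_ALNUM for ch in text):
        problems.append("latin_or_digit")
    if any(ch in _HALF_WIDTH_PUNCT for ch in text):
        problems.append("half_width_punctuation")
    return problems


# A sample sentence from this README passes both checks.
print(find_violations("當今天下嘅英雄，就係使君你，同我喇。"))  # []
```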

數據統計 Statistics

	全集 Total	三國演義 saamgwokjinji	水滸傳 seoiwuzyun	走進毛澤東的最後歲月 mouzaakdung	鹿鼎記 lukdinggei
總時長 Total Duration (個鐘 hours | 分鐘 minutes)	188.25 | 11295.07	66.01 | 3960.73	38.62 | 2317.43	7.91 | 474.49	75.71 | 4542.37
平均音頻時長 Average Clip Duration (秒 seconds)	5.826	6.067	5.619	6.004	5.718
中位音頻時長 Median Clip Duration (秒 seconds)	5.385	5.607	5.198	5.553	5.302
最短音頻時長 Min Clip Duration (秒 seconds)	0.322	0.339	0.322	0.735	0.329
最長音頻時長 Max Clip Duration (秒 seconds)	33.144	31.822	33.144	21.252	31.528
平均每句字數，含標點 Average Characters Per Clip, including punctuation	24.90	24.00	24.86	24.77	25.68
中位每句字數，含標點 Median Characters Per Clip, including punctuation	23	23	23	23	24
文本總字數，含標點 Total Characters, including punctuation	2,903,094	952,427	621,682	117,308	1,223,997
覆蓋漢字數 Unique Chinese Characters Coverage	5017	3993	3520	2709	4130
平均語速，含標點 Average Speaking Rate, including punctuation (字/秒 characters per second)	4.28	3.96	4.47	4.12	4.49
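
As a worked check of how the figures in the Total column relate, the sketch below derives hours from minutes and a speaking rate from total characters and total duration. The formula (total characters divided by total seconds) is an assumption about how the rate was computed; the table does not state it. The function names are hypothetical.

```python
def hours_from_minutes(minutes):
    """Convert a duration in minutes to hours."""
    return minutes / 60


def chars_per_second(total_chars, total_minutes):
    """Speaking rate as total characters over total seconds (assumed definition)."""
    return total_chars / (total_minutes * 60)


# Values copied from the Total column of the table above.
total_minutes = 11295.07
total_chars = 2_903_094

print(round(hours_from_minutes(total_minutes), 2))             # 188.25
print(round(chars_per_second(total_chars, total_minutes), 2))  # 4.28
```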

引用 Citation

本數據集屬公共領域，遵循 CC0 許可聲明。即係話你可以無需授權免費任用本數據集，亦都唔需要註明出處。不過如果你用咗本數據集，我哋都希望你可以引用本頁面，作為對楷叔嘅懷念同致敬：

This dataset is in the public domain under the CC0 declaration. This means you can use it freely, without asking for permission and without attribution. However, if you use this dataset, we hope you will cite this page in memory of and as a tribute to Gaai Suk (楷叔):

@misc{zoengjyutgaai2025,
    title={The Zoeng Jyut Gaai Story-telling Speech Dataset},
    author={Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
    howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
    year={2025}
}

意見反饋 Feedback

數據集建設難免有疏漏，如果你發現有任何錯誤、問題，或者有任何意見，歡迎喺 Hugging Face 討論區提出。

Omissions are inevitable when building a dataset. If you find any errors or problems, or have any suggestions, feel free to raise them in the Hugging Face discussion forum.
