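Building Real-Time Multimodal Dialogue Systems with Python and Large Language Models (Pythonと大規模言語モデルで作るリアルタイムマルチモーダル対話システム)

Authors:	Ryuichiro Higashinaka (Nagoya University), Koh Mitsuda (rinna Co., Ltd.), Yuya Chiba (NTT Communication Science Laboratories), Akinobu Lee (Nagoya Institute of Technology)
List price:	3,960 yen (3,600 yen + tax)
Format:	A5
Pages:	241
ISBN:	978-4-910558-30-1
Release date:	2024/6/20
Catalog No.:	128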

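【Table of Contents】

Chapter 1: Dialogue Systems

1-1 What Is a Dialogue System?
1-2 What Is a Real-Time Multimodal Dialogue System?
1-3 Architecture of a Typical Dialogue System
1-4 Architecture of a Real-Time Multimodal Dialogue System
1-5 Remdis: A Toolkit for Real-Time Multimodal Dialogue Systems
1-6 Installing Tools and Software
    1-6-1 Obtaining API Keys
    1-6-2 Installation Procedure on Windows
    1-6-3 Installation Procedure on Mac
    1-6-4 Installation Procedure Common to Windows and Mac
1-7 Launching the Programs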

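Chapter 2: Text Dialogue Systems Based on Large Language Models

2-1 Large Language Models
    2-1-1 Theory of Large Language Models
    2-1-2 Implementing Response Generation with a Large Language Model
    2-1-3 Implementing Large Language Model Replacement
2-2 Real-Time Text Dialogue Systems
    2-2-1 Theory of Real-Time Text Dialogue Systems
    2-2-2 Implementing a Real-Time Text Dialogue System
2-3 Improving the Real-Time Text Dialogue System
    2-3-1 Implementing Response Generation (Fast Version)
    2-3-2 Implementing Spontaneous Utterance Generation
2-4 Chapter Summary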

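Chapter 3: Spoken Dialogue Systems

3-1 Speech Recognition
    3-1-1 Theory of Speech Recognition
    3-1-2 Implementing a Streaming Speech Recognition System
3-2 Speech Synthesis
    3-2-1 Theory of Speech Synthesis
    3-2-2 Implementing a Spoken Dialogue System
3-3 Turn-Taking
    3-3-1 Theory of Turn-Taking
    3-3-2 Voice Activity Projection (VAP)
    3-3-3 Implementing a Real-Time Spoken Dialogue System
3-4 Chapter Summary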

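Chapter 4: Multimodal Dialogue Systems

4-1 Multimodal Dialogue Systems
4-2 Theory of Multimodal Dialogue Systems
    4-2-1 Input and Output
    4-2-2 Methods of Expression
    4-2-3 Designing the Agent's Appearance
4-3 MMDAgent-EX
    4-3-1 Obtaining and Setting Up
    4-3-2 Basic Operation
    4-3-3 Content Structure
    4-3-4 Control via Messages
    4-3-5 Displaying and Saving Logs
    4-3-6 Behavior Scripts
    4-3-7 Displaying a CG Agent
    4-3-8 Playing Motions
    4-3-9 Playing Audio
    4-3-10 Speech Playback with Lip Sync
4-4 Implementing a Real-Time Multimodal Dialogue System
    4-4-1 Startup
    4-4-2 File Structure
    4-4-3 About the Bundled 3D Models
    4-4-4 main.mdf
    4-4-5 Walkthrough of the Behavior Script
    4-4-6 RabbitMQ Plugin
4-5 Customization
    4-5-1 Adjusting Lip Sync
    4-5-2 Presenting Text and Images
    4-5-3 Extending the Integration with Remdis
    4-5-4 Development Information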

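Chapter 5: Future Outlook

5-1 More Intelligent Responses
5-2 Grounding in the Real World
5-3 Multi-Party Dialogue
5-4 Systems That Change in Real Time
5-5 Shared Understanding
5-6 Intentions and Desires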

【References】

Joseph Weizenbaum. Eliza―a computer program for the study of natural language communication between man and machine. Communications of the ACM, Vol. 9, No. 1, pp. 36–45, 1966.
Terry Winograd. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. Technical report, MIT, 1971. (AITR-235).
松尾豊. 人工知能は人間を超えるか ディープラーニングの先にあるもの. 角川EPUB選書, 2015.
松尾豊ほか. 人工知能とは. 近代科学社, 2016.
https://say-can.github.io/
小磯花絵, 居關友里子, 臼田泰如, 柏野和佳子, 川端良子, 田中弥生, 伝康晴, 西川賢哉. 『日本語日常会話コーパス』の構築. 言語処理学会第23回年次大会発表論文集, pp. 775–778, 2017.
Michael Johnston, et al. MATCH: An Architecture for Multimodal Dialogue Systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 376–383, 2002.
Timothy Bickmore and Justine Cassell. Relational agents: a model and implementation of building user trust. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 396–403, 2001.
Dan Bohus and Eric Horvitz. Models for Multiparty Engagement in Open-World Dialog. In Proceedings of the SIGDIAL 2009 Conference, pp. 225–234, 2009.
David DeVault, et al. Simsensei kiosk: a virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1061–1068, 2014.
Dian Yu, et al. Gunrock: A Social Bot for Complex and Engaging Long Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pp. 79–84, 2019.
Tom Brown, et al. Language Models are Few-Shot Learners. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 1877–1901, 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of NAACL, pp. 4171–4186, 2019.
Jared Kaplan, et al. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.
Ashish Vaswani, et al. Attention is All You Need. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 5998–6008, 2017.
Laria Reynolds and Kyle McDonell. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Proceedings of the Conference on Human Factors in Computing Systems, pp. 1–7, 2021.
Long Ouyang, et al. Training language models to follow instructions with human feedback. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 27730–27744, 2022.
河原達也. IT Text 音声認識システム. オーム社, 2016.
山本龍一, 高道慎之介. Python で学ぶ音声合成. インプレス, 2021.
高梨克也. 基礎から分かる会話コミュニケーションの分析法. ナカニシヤ出版, 2016.
Gabriel Skantze. Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, Vol. 67, pp. 1–26, 2021.
Erik Ekstedt and Gabriel Skantze. Voice activity projection: Self-supervised learning of turn-taking events. In Proceedings of Interspeech, pp. 5190–5194, 2022.
Stephen Levinson and Francisco Torreira. Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, Vol. 6, pp. 10–26, 2015.
Anne Anderson, et al. The HCRC map task corpus. Language and Speech, Vol. 34, No. 4, pp. 351–366, 1991.
Koji Inoue, Bing’er Jiang, Erik Ekstedt, Tatsuya Kawahara, and Gabriel Skantze. Real-time and continuous turn-taking prediction using voice activity projection. arXiv preprint arXiv:2401.04868, pp. 1–10, 2024.
Aaron Powers, Sara Kiesler, Susan Fussell, and Cristen Torrey. Comparing a computer agent with a humanoid robot. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pp. 145–152, 2007.
小松孝徳, 山田誠二. 適応ギャップがユーザのエージェントに対する印象変化に与える影響. 人工知能学会論文誌, Vol. 24, No. 2, pp. 232–240, 2009.
Joon Sung Park, et al. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.

【Reviews】

Note: There are no reviews yet.
