# A Reinforcement Learning Approach to Speech Coding


## Abstract


## 1. Introduction

## 2. Current Standardized Speech Codecs

## 3. Reinforcement Learning and Stochastic Control: Terminology and Methods

- A model of the environment that we call the System Model;
- A control policy that describes the behavior of the learning agent;
- A cost function instead of a reward function;
- A cost-to-go that tries to emulate a Value Function.
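The four components listed above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the scalar dynamics, the squared-error cost, the three-symbol control alphabet, and all function names are hypothetical choices for illustration and are not taken from the codec described in this paper.

```python
from itertools import product


def system_model(state, control):
    """Hypothetical scalar System Model: next state from state and control."""
    return 0.9 * state + control


def cost(state, target):
    """Cost function (squared error), used in place of a reward."""
    return (state - target) ** 2


def cost_to_go(state, controls, targets):
    """Accumulated cost of applying a control sequence starting at `state`."""
    total = 0.0
    for control, target in zip(controls, targets):
        state = system_model(state, control)
        total += cost(state, target)
    return total


def control_policy(state, targets, alphabet=(-1.0, 0.0, 1.0)):
    """Search all control sequences over the lookahead window and
    release only the first control of the cheapest sequence."""
    best = min(
        product(alphabet, repeat=len(targets)),
        key=lambda seq: cost_to_go(state, seq, targets),
    )
    return best[0]


# Two-step lookahead from state 0.0 toward the targets 1.0 and 1.9.
first_control = control_policy(0.0, [1.0, 1.9])
```

Here the cost-to-go plays the role of a (negated) value function: the policy minimizes accumulated cost rather than maximizing accumulated reward.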

## 4. Speech Coding as Reinforcement Learning

- A System Model;
- A Cost Function;
- A Cost-to-Go function;
- A Control Policy.

#### 4.1. The System Model

## 5. Learning (Parameter Adaptation)

#### 5.1. AR and MA Parameter Adaptation

#### 5.2. Quasi-Periodic Excitation Adaptation

## 6. Error Shaping

## 7. The Value Function

## 8. Control Policy

#### 8.1. Control Tree Sequences

#### 8.2. Control Tree Gain Adaptation

## 9. L Step Lookahead Cost Function

## 10. Exploitation and Exploration

## 11. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
AMR-WB | Adaptive Multirate-Wideband |
AR | Autoregressive |
MA | Moving Average |
PESQ-MOS | Perceptual Evaluation of Speech Quality-Mean Opinion Score |
RLS | Recursive Least Squares |
VAD/CNG | Voice Activity Detection/Comfort Noise Generation |

## Appendix A. DPCM and Tree Coding

#### Appendix A.1. DPCM

#### Appendix A.2. Tree Coding

## Appendix B. Pitch Lag Adaptation

## Appendix C. Pitch Stability Test

#### Appendix C.1. Stability Test

- If $|a| \ge |b|$, the following condition is sufficient for stability:
  - (a) $|\beta_{-1}| + |\beta_{0}| + |\beta_{+1}| < 1$
- If $|a| < |b|$, satisfying both of the following conditions is sufficient for stability:
  - (a) $|\beta_{0}| + |a| < 1$
  - (b) either (i) $b^{2} \le |a|$ or (ii) $b^{2}\beta_{0}^{2} - (1 - b^{2})(b^{2} - a^{2}) < 0$
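The test above translates directly into a short function. The sketch below is hedged in one important respect: this section does not define $a$ and $b$, so the code assumes $a = \beta_{-1} + \beta_{+1}$ and $b = \beta_{-1} - \beta_{+1}$, the usual convention for three-tap pitch filters. The function names are illustrative only.

```python
def taps_to_ab(beta_m1, beta_p1):
    """ASSUMPTION (not stated in this section): a and b are the sum and
    the difference of the two outer pitch taps."""
    return beta_m1 + beta_p1, beta_m1 - beta_p1


def passes_stability_test(beta_m1, beta_0, beta_p1):
    """Return True if the three taps satisfy the sufficient conditions.

    The conditions are sufficient, not necessary: False means the taps
    are not certified stable, not that the filter is unstable.
    """
    a, b = taps_to_ab(beta_m1, beta_p1)
    if abs(a) >= abs(b):
        # Case |a| >= |b|: condition (a) on the sum of tap magnitudes.
        return abs(beta_m1) + abs(beta_0) + abs(beta_p1) < 1
    # Case |a| < |b|: conditions (a) and (b) must both hold.
    cond_a = abs(beta_0) + abs(a) < 1
    cond_b = (b**2 <= abs(a)) or (
        b**2 * beta_0**2 - (1 - b**2) * (b**2 - a**2) < 0
    )
    return cond_a and cond_b
```

For example, `passes_stability_test(0.1, 0.5, 0.2)` falls in the first case and passes because the tap magnitudes sum to 0.8.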

#### Appendix C.2. Stabilization Procedure

- If $|a| \ge |b|$, $$c=\frac{1}{|\beta_{-1}|+|\beta_{0}|+|\beta_{+1}|}.$$
- If $|a| < |b|$ and $b^{2} \le |a|$, $$c=\frac{1}{|a|+|\beta_{0}|}.$$
- If $|a| < |b|$ and $b^{2} > |a|$, $$c=\sqrt{\frac{b^{2}-a^{2}}{b^{4}+b^{2}\beta_{0}^{2}-b^{2}a^{2}}}.$$
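A minimal sketch of the scaling-factor computation follows, again assuming $a = \beta_{-1} + \beta_{+1}$ and $b = \beta_{-1} - \beta_{+1}$ (this section does not define them) and treating the center tap as $\beta_{0}$ in all three cases. The function name is hypothetical, and the function is meant to be called only for tap sets that failed the stability test, so all-zero taps are not handled.

```python
import math


def stabilizing_scale(beta_m1, beta_0, beta_p1):
    """Scaling factor c for three pitch taps that failed the stability test."""
    # ASSUMPTION: a and b are the sum and difference of the outer taps.
    a = beta_m1 + beta_p1
    b = beta_m1 - beta_p1
    if abs(a) >= abs(b):
        return 1.0 / (abs(beta_m1) + abs(beta_0) + abs(beta_p1))
    if b**2 <= abs(a):
        return 1.0 / (abs(a) + abs(beta_0))
    # Here |a| < |b|, so numerator and denominator are both positive.
    return math.sqrt(
        (b**2 - a**2) / (b**4 + b**2 * beta_0**2 - b**2 * a**2)
    )


taps = (0.9, 0.5, -0.9)
c = stabilizing_scale(*taps)
scaled_taps = [c * t for t in taps]  # all three taps share the same factor
```

Each formula places the scaled taps exactly on the boundary of the corresponding condition, so an implementation would presumably multiply by a factor slightly smaller than $c$ to meet the strict inequality; how much margin to use is not specified here.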

## Appendix D. Voice Activity Detection/Comfort Noise Generation (VAD/CNG)


Control | Value Function | F1 | F2 | M1 | M2 | Avg | Std Dev |
---|---|---|---|---|---|---|---|
Random 4-2 | PESQ-MOS | 3.377 | 3.376 | 3.443 | 3.556 | 3.438 | 0.073 |
Random 4-2 | Rate (kbit/s) | 5.92 | 5.57 | 4.09 | 5.41 | 5.25 | |
Random 4-2 plus 5-level | PESQ-MOS | 3.425 | 3.515 | 3.544 | 3.569 | 3.513 | 0.054 |
Random 4-2 plus 5-level | Rate (kbit/s) | 6.03 | 5.66 | 4.15 | 5.49 | 5.33 | |
Random 4-2, 5-level, 5pol | PESQ-MOS | 3.471 | 3.562 | 3.584 | 3.599 | 3.554 | 0.050 |
Random 4-2, 5-level, 5pol | Rate (kbit/s) | 6.03 | 5.66 | 4.15 | 5.49 | 5.33 | |

Control | Value Function | F1 | F2 | M1 | M2 | Avg | Std Dev |
---|---|---|---|---|---|---|---|
Random 4-2 | PESQ-MOS | 3.646 | 3.581 | 3.784 | 3.699 | 3.678 | 0.074 |
Random 4-2 | Rate (kbit/s) | 12 | 12 | 12 | 12 | 12 | |
Random 4-2 plus 5-level | PESQ-MOS | 3.747 | 3.798 | 4.022 | 3.793 | 3.840 | 0.107 |
Random 4-2 plus 5-level | Rate (kbit/s) | 12.24 | 12.24 | 12.24 | 12.24 | 12.24 | |
Random 4-2, 5-level, 5pol | PESQ-MOS | 3.766 | 3.827 | 4.039 | 3.819 | 3.863 | 0.104 |
Random 4-2, 5-level, 5pol | Rate (kbit/s) | 12.24 | 12.24 | 12.24 | 12.24 | 12.24 | |
AMR, Narrowband | PESQ-MOS | 4.04 | 4.001 | 4.089 | 4.063 | 4.048 | 0.032 |
AMR, Narrowband | Rate (kbit/s) | 12.2 | 12.2 | 12.2 | 12.2 | 12.2 | |

**Table 3.** PESQ-MOS value function results for three control policies and AMR/voiced only: additional female sentences.

Control | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | Avg | Std Dev |
---|---|---|---|---|---|---|---|---|---|---|
Random 4-2 | 3.826 | 3.543 | 3.597 | 3.577 | 3.602 | 3.54 | 3.634 | 3.669 | 3.624 | 0.087 |
Random 4-2 plus 5-level | 3.878 | 3.769 | 3.632 | 3.512 | 3.721 | 3.677 | 3.624 | 3.787 | 3.700 | 0.106 |
Random 4-2, 5-level, 5pol | 3.861 | 3.772 | 3.639 | 3.514 | 3.733 | 3.68 | 3.638 | 3.79 | 3.703 | 0.102 |
AMR, Narrowband | 3.978 | 3.96 | 3.721 | 3.818 | 3.923 | 3.634 | 3.954 | 3.697 | 3.836 | 0.128 |

**Table 4.** PESQ-MOS value function results for three control policies and AMR/voiced only: additional male sentences.

Control | M4 | M5 | M6 | M7 | M8 | M9 | M10 | Avg | Std Dev |
---|---|---|---|---|---|---|---|---|---|
Random 4-2 | 3.645 | 3.684 | 3.856 | 3.662 | 3.796 | 3.848 | 3.828 | 3.760 | 0.086 |
Random 4-2 plus 5-level | 3.806 | 3.792 | 3.877 | 3.723 | 3.801 | 3.878 | 3.793 | 3.810 | 0.050 |
Random 4-2, 5-level, 5pol | 3.804 | 3.797 | 3.88 | 3.736 | 3.806 | 3.875 | 3.808 | 3.815 | 0.046 |
AMR, Narrowband | 3.984 | 4.091 | 3.973 | 3.902 | 3.824 | 3.745 | 4.052 | 3.939 | 0.114 |

**Table 5.** Average PESQ-MOS value function for three control policies and AMR/voiced only: male and female sentences outside of the optimization set.

Control | Average | Standard Deviation |
---|---|---|
Random 4-2 | 3.687 | 0.110 |
Random 4-2 plus 5-level | 3.751 | 0.101 |
Random 4-2, 5-level, 5pol | 3.756 | 0.098 |
AMR, Narrowband | 3.884 | 0.132 |
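The Table 5 entries appear to be pooled statistics over the 15 additional sentences of Tables 3 and 4, with the standard deviation taken as the population value (dividing by N rather than N - 1). That reading is an inference from the numbers rather than something the captions state. The short check below reproduces the "Random 4-2" row under that assumption; the variable names are illustrative.

```python
from statistics import fmean, pstdev

# "Random 4-2" PESQ-MOS scores copied from Table 3 (F4-F11) and Table 4 (M4-M10).
female = [3.826, 3.543, 3.597, 3.577, 3.602, 3.54, 3.634, 3.669]
male = [3.645, 3.684, 3.856, 3.662, 3.796, 3.848, 3.828]

pooled = female + male
pooled_mean = fmean(pooled)
pooled_std = pstdev(pooled)  # population standard deviation (divide by N)

print(f"{pooled_mean:.3f} {pooled_std:.3f}")
```

To three decimals, the pooled mean and population standard deviation agree with the 3.687 and 0.110 reported for "Random 4-2" in Table 5.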


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gibson, J.; Oh, H.
A Reinforcement Learning Approach to Speech Coding. *Information* **2022**, *13*, 331.
https://doi.org/10.3390/info13070331
