TY - JOUR
T1 - Estimating the user's state before exchanging utterances using intermediate acoustic features for spoken dialog systems
AU - Chiba, Yuya
AU - Nose, Takashi
AU - Ito, Masashi
AU - Ito, Akinori
PY - 2016/2/1
Y1 - 2016/2/1
N2 - A spoken dialog system (SDS) is a speech interface that has been incorporated into many devices to help users operate them. An SDS benefits the user because it does not restrict the style of the user's input utterances, but this flexibility sometimes makes it difficult for the user to speak to the system. Conventional systems cannot give appropriate help to a user who does not produce an explicit input utterance, because they must recognize and parse the user's utterance in order to decide the next prompt. Therefore, the system should estimate the state of a user who has encountered a problem, so that it can initiate the dialog and provide appropriate help before the user abandons it. Based on this assumption, we aim to construct a system that responds to a user who does not speak to the system. In this research, we define two basic states of a user who does not speak to the system: embarrassed by the prompt, or thinking about how to answer it. We discriminate these user states by using intermediate acoustic features and the facial orientation of the user. Our previous approach used several intermediate acoustic features determined manually, so the user's state could not be discriminated automatically. Therefore, the present paper examines a method to extract intermediate acoustic features from low-level features such as MFCC, log F0, and the zero-crossing count (ZCC). We introduce a new annotation rule and compare the discrimination performance with that of the previous feature set. Finally, the user's state is discriminated using a combination of the intermediate acoustic features and facial orientation.
KW - Multi-modal information
KW - Spoken dialog system
KW - User's state
UR - http://www.scopus.com/inward/record.url?scp=84962782995&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962782995&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:84962782995
VL - 43
SP - 1
EP - 9
JO - IAENG International Journal of Computer Science
JF - IAENG International Journal of Computer Science
SN - 1819-656X
IS - 1
ER -
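
Illustrative note (not part of the record above, and not the authors' implementation): the abstract names MFCC, log F0, and the zero-crossing count as the low-level features from which intermediate acoustic features are derived. A minimal Python sketch of extracting such frame-level features and collapsing them into an utterance-level vector for a generic classifier, assuming librosa and scikit-learn are available, might look like this; all function and variable names here are hypothetical.

    # Hypothetical sketch: low-level acoustic features (MFCC, log F0, ZCC)
    # summarized per utterance, then fed to a generic classifier.
    import numpy as np
    import librosa

    def low_level_features(wav_path, sr=16000):
        """Return a fixed-length vector of utterance-level feature statistics."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, T)
        zcr = librosa.feature.zero_crossing_rate(y)               # shape (1, T)
        f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr) # NaN where unvoiced
        log_f0 = np.log(f0[voiced]) if np.any(voiced) else np.zeros(1)
        # Collapse frame-level trajectories to per-utterance means and std devs.
        stats = [mfcc.mean(axis=1), mfcc.std(axis=1),
                 zcr.mean(axis=1), zcr.std(axis=1),
                 [log_f0.mean()], [log_f0.std()]]
        return np.concatenate([np.atleast_1d(s) for s in stats])

    # Toy usage with placeholder data: X holds feature vectors for segments in
    # which the user stayed silent, y_labels marks "embarrassed" (0) vs. "thinking" (1).
    # from sklearn.svm import SVC
    # clf = SVC(kernel="rbf").fit(X, y_labels)
    # state = clf.predict(low_level_features("segment.wav")[None, :])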