A spoken dialog system (SDS) is a typical speech application and is sometimes regarded as one of the ideal interfaces. However, most conventional SDSs cannot help their users while waiting for an input utterance, because they treat the user's utterance as the trigger for processing. This architecture differs greatly from human-human interaction, and it is a factor that inconveniences users when they cannot respond appropriately to the system's prompt. To solve this problem, the system should be able to estimate the internal state of the user before observing the user's input utterance. In the present paper, we propose a two-step discrimination method that uses multi-modal information to estimate the user's state frame by frame.