Camera-based character recognition applications equipped with a voice synthesizer help blind users read text in their surroundings. However, such applications on the current market, and similar prototypes under research, require users to actively perform reading actions, which hampers other activities. At ICCHP 2014 we presented a different approach in which the user can remain passive while the device actively finds useful text in the scene, and a text tracking feature was introduced to avoid reading the same text more than once. This report presents an improved system built around two key components, scene text detection and text tracking, that can handle text in various languages, including Japanese and Chinese, and resolves some scene analysis problems such as the merging of text lines. We employed the MSER (Maximally Stable Extremal Regions) algorithm to obtain better text images and developed a new text validation filter. Some technical challenges for future device design are also presented.
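The abstract does not say how text tracking prevents duplicate reading, so the following is only a minimal sketch under an assumed design, not the system described here. It assumes each frame yields bounding boxes of detected text regions (for example, from an MSER-based detector). Each box is greedily associated with an existing track by intersection-over-union (IoU), and only boxes that start a new track are passed on to be read aloud. The names `TextTracker`, `iou_threshold`, and `max_missed` are hypothetical, and recognition and speech output are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical box format: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    box: Box
    missed: int = 0  # consecutive frames without a matching detection


@dataclass
class TextTracker:
    """Assumed IoU-based tracker; reports only text regions not seen before."""
    iou_threshold: float = 0.5  # hypothetical association threshold
    max_missed: int = 5         # hypothetical track lifetime without matches
    tracks: List[Track] = field(default_factory=list)

    def update(self, detections: List[Box]) -> List[Box]:
        new_boxes: List[Box] = []
        matched: Set[int] = set()
        for det in detections:
            # Greedily pick the best unmatched track above the IoU threshold.
            best, best_score = None, self.iou_threshold
            for i, tr in enumerate(self.tracks):
                if i in matched:
                    continue
                score = iou(det, tr.box)
                if score >= best_score:
                    best, best_score = i, score
            if best is None:
                # Unseen text region: start a track and report it for reading.
                self.tracks.append(Track(det))
                matched.add(len(self.tracks) - 1)
                new_boxes.append(det)
            else:
                # Same text as before: update its position, do not re-read it.
                self.tracks[best].box = det
                self.tracks[best].missed = 0
                matched.add(best)
        # Age unmatched tracks and drop those missing for too long.
        survivors: List[Track] = []
        for i, tr in enumerate(self.tracks):
            if i not in matched:
                tr.missed += 1
            if tr.missed <= self.max_missed:
                survivors.append(tr)
        self.tracks = survivors
        return new_boxes
```

In use, `update` would be called once per camera frame with that frame's detected boxes. A sign that stays in view is reported once, while a region that disappears for more than `max_missed` frames is treated as new if it returns. Greedy IoU matching is chosen here only for brevity; a real system might also compare recognized strings or use optical flow.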