Visually impaired people struggle daily with their inability to read textual information. One of the most anticipated assistive devices for blind users is a system equipped with a wearable camera that finds textual information in natural scenes and converts it into sound through a speech synthesizer. To avoid duplicate readings, the device should recognize text areas with the same content and group them to obtain a single result. Scene text detection and tracking methods have therefore attracted considerable interest. However, the field remains challenging, and existing methods for scene text detection and tracking are far from perfect. This paper proposes a scene text tracking system that finds text regions and tracks them across video frames captured by a wearable camera. By combining a text detection method with a feature point tracker, we obtain a robust text tracker that produces far fewer false-positive text images and runs 2.9 times faster than the conventional method.