When an autonomous robot searches for text in its surroundings using an onboard video camera, the same text appears in many consecutive video frames. To avoid recognizing the same text repeatedly, the number of text candidate regions passed to the recognizer must be reduced. This paper presents a text-capturing robot that looks around its environment with an active camera. Text candidate regions are extracted from the images using an improved DCT feature, and the detected regions are then tracked across the video sequence so that fewer text images need to be recognized. In an experiment on 460 images of a corridor containing fifteen signboards with text, our text tracking method reduced the number of text candidate regions by 90.1%.
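To illustrate the general idea behind DCT-based text detection (not the paper's specific improved feature), the following is a minimal sketch: an image is divided into small blocks, and blocks with high AC energy in the 2-D DCT — a common cue for the high-frequency strokes of text — are kept as text candidates. The function names, block size, and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def block_dct_energy(gray, block=8):
    """Per-block AC energy of the 2-D DCT, a common text-likeness cue."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a whole number of blocks
    energy = np.zeros((h // block, w // block))
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = gray[i:i + block, j:j + block].astype(float)
            # Separable 2-D DCT (type II, orthonormal)
            c = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            c[0, 0] = 0.0  # discard the DC term; keep AC (texture) energy
            energy[i // block, j // block] = np.abs(c).sum()
    return energy

def text_candidate_mask(gray, thresh=50.0):
    """Blocks whose DCT AC energy exceeds a threshold become candidates."""
    return block_dct_energy(gray) > thresh
```

A uniform block has zero AC energy, while blocks containing sharp strokes score highly; the real system would then track such candidate blocks across frames so each sign is recognized only once.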