We present a learning-based method for extracting distinctive features from video objects. From these features, we derive dense correspondences between objects in the current video frame and those in a reference template. We train a deep-learning model with non-local blocks, which capture long-range dependencies, to predict dense feature maps. We introduce a new video object correspondence dataset for training and evaluation. Furthermore, we propose a feature-aggregation technique based on the optical flow between consecutive frames, and we apply it to integrate multiple feature maps and thereby alleviate uncertainty. We also use the local information provided by optical flow to assess the reliability of feature matching. Experimental results show that our local and non-local fusion approach reduces unreliable correspondences and thus improves matching accuracy.
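As a rough illustration of the non-local operation mentioned above (a minimal numpy sketch, not the authors' implementation; the projection matrices and the embedded-Gaussian affinity are standard assumptions), each spatial position aggregates features from all other positions via a softmax-normalized pairwise affinity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation on a flattened feature map.

    x                   : (N, C) array, N spatial positions with C channels.
    w_theta, w_phi, w_g : (C, C_inner) projection matrices.
    w_out               : (C_inner, C) output projection.
    Returns x plus the aggregated non-local response (residual form).
    """
    theta = x @ w_theta                       # query embeddings, (N, C_inner)
    phi = x @ w_phi                           # key embeddings,   (N, C_inner)
    g = x @ w_g                               # value embeddings, (N, C_inner)
    attn = softmax(theta @ phi.T, axis=-1)    # (N, N) pairwise affinities
    y = attn @ g                              # each position attends to all others
    return x + y @ w_out                      # residual connection
```

Because the affinity matrix is dense over all position pairs, the block models long-range dependencies in a single layer, at O(N^2) cost in the number of positions.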