In this study, we propose a voice conversion technique with two-stage conversion, which is realized by using two models consisting of U-Net and pix2pix. Using U-Net, we tried to reproduce intonation of a target speaker by performing low-dimensional feature conversion considering the time direction. We introduced pix2pix for the task of spectrogram enhancement. The pix2pix is trained to map from low definition spectrogram to high definition spectrogram (low-to-high spectrogram mapping). Low definition spectrogram is reconstructed from low dimensional mel-cepstrum converted by U-Net and high definition spectrogram is extracted from natural speech. In objective evaluations, we showed that the proposed method was effective in improvement of mel-cepstral distance (MCD) and Log F0 RMSE. Subjective evaluations revealed that the use of the proposed method had a certain effect in improving speech individuality while maintaining the same level of naturalness as the conventional method.