Using an imaging system in which the image plane can be tilted with respect to the optical axis of the lens, the image of a large-scale scene that appears to be a miniature to human eyes can be captured. This phenomenon suggests that the image contains information regarding the scale of the scene and that human vision can extract this information and recognize the scene scale from a single image. In this study, we consider how human vision can perform this single-view scale estimation. Although it is obvious that the existence of defocus blur in the image that simulates a shallow DOF plays an essential role in the scale estimation, we propose that this alone is not sufficient to explain the estimtation mechanism. By incorporating a few assumptions, we theoretically show that scale estimation is made possible when (1) the 3D structure of the scene can be recovered from the image and furthermore, (2) the structure is combined with the defocus blur. Further, we present a simple algorithm for scale recognition and demonstrate its working using a real image.