This paper proposes a highly efficient AES hardware architecture that supports both encryption and decryption for the CBC mode. Some conventional AES architectures employ pipelining techniques to enhance the throughput and efficiency. However, such pipelined architectures are frequently unfit because many practical cryptographic applications work in the CBC mode, where block-wise parallelism is not available for encryption. In this paper, we present an efficient AES encryption/ decryption hardware design suitable for such block-chaining modes. In particular, new operation-reordering and register-retiming techniques allow us to unify the inversion circuits for encryption and decryption (i.e., SubBytes and InvSubBytes) without any delay overhead. A new unification technique for linear mappings further reduces both the area and critical delay in total. Our design employs a common loop architecture and can therefore efficiently perform even in the CBC mode. We also present a shared key scheduling datapath that can work on-the-fly in the proposed architecture. To the best of our knowledge, the proposed architecture has the shortest critical path delay and is the most efficient in terms of throughput per area among conventional AES encryption/ decryption architectures with tower-field S-boxes. We evaluate the performance of the proposed and some conventional datapaths by logic synthesis results with the TSMC 65-nm standard-cell library and Nan-Gate 45-and 15-nm open-cell libraries. As a result, we confirm that our proposed architecture achieves approximately 53–72% higher efficiency (i.e., a higher bps/GE) than any other conventional counterpart.