In this paper, we present an end-to-end learning approach for human motion inference from 3D point cloud data. Examples of human motion are collected as point clouds through a 3D sensor, mapped into 3D occupancy grids, and then used as supervised learning samples for a 3D Convolutional Neural Network (3D CNN). The 3D CNN learns spatiotemporal features from sequences of occupancy-grid time steps and predicts human motion intention with 83% accuracy within the first 60% of the motion. We demonstrate the model's real-time performance by predicting the intended target of a human arm motion from a set of predetermined targets, and further show that the model generalises to new users whose data were not used in training. This approach is useful for human-robot interaction and human-computer interaction applications that require learning human motion without explicitly modelling its dynamics.
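The mapping from point clouds to 3D occupancy grids described above can be sketched as follows. This is a minimal illustrative sketch, assuming point clouds arrive as N×3 NumPy arrays; the grid shape, bounding volume, and binary occupancy encoding are assumptions for illustration, not the paper's actual parameters.

```python
import numpy as np

def to_occupancy_grid(points, bounds, shape=(32, 32, 32)):
    """Map an (N, 3) point cloud into a binary 3D occupancy grid.

    `bounds` is ((xmin, ymin, zmin), (xmax, ymax, zmax)); the grid
    shape and bounds here are illustrative, not the paper's values.
    """
    lo = np.asarray(bounds[0], dtype=float)
    hi = np.asarray(bounds[1], dtype=float)
    grid = np.zeros(shape, dtype=np.float32)
    # Normalise points into [0, 1) within the bounding volume
    scaled = (points - lo) / (hi - lo)
    # Discard points outside the volume
    inside = np.all((scaled >= 0) & (scaled < 1), axis=1)
    # Convert normalised coordinates to voxel indices and mark occupied
    idx = (scaled[inside] * np.array(shape)).astype(int)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

# A motion sample becomes a sequence of such grids, one per time
# step, which can then be stacked as input to a 3D CNN.
cloud = np.random.rand(500, 3)  # synthetic stand-in for sensor data
grid = to_occupancy_grid(cloud, bounds=((0, 0, 0), (1, 1, 1)))
```

Voxelising each time step this way turns a variable-sized point cloud into a fixed-size tensor, which is what allows a convolutional architecture to consume the motion sequence directly.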