In this paper, we demonstrate an end-to-end spatiotemporal gesture learning approach for 3D point cloud data using a new gestures dataset of point clouds acquired from a 3D sensor. Nine classes of gestures were learned from gestures sample data. We mapped point cloud data into dense occupancy grids, then time steps of the occupancy grids are used as inputs into a 3D convolutional neural network which learns the spatiotemporal features in the data without explicit modeling of gesture dynamics. We also introduced a 3D region of interest jittering approach for point cloud data augmentation. This resulted in an increased classification accuracy of up to 10% when the augmented data is added to the original training data. The developed model is able to classify gestures from the dataset with 84.44% accuracy. We propose that point cloud data will be a more viable data type for scene understanding and motion recognition, as 3D sensors become ubiquitous in years to come.