Object Shape Feature Extraction from Motion Parallax using Convolutional Neural Network [Not invited]
ChengJun Shao; Makoto Murakami
13th World Congress on Computational Mechanics, 2018/07, Oral presentation
We propose a neural network that recognizes objects from a sequence of RGB images captured with a single camera, using two different convolutional neural networks. The learning process is divided into two steps: training a CNN for spatial feature extraction and training a CNN for spatiotemporal feature extraction. The spatial feature extraction CNN extracts position-invariant spatial feature vectors, which are input to the subsequent spatiotemporal feature extraction CNN; this network convolves them temporally to obtain depth information based on motion parallax.

In the spatial feature extraction CNN, each frame of the image sequence is convolved with several spatial filters, the convolved values are passed through an activation function, and spatial features are extracted in the convolutional layer. These features are input to a local contrast normalization layer and then to a pooling layer for downsampling. With these three layers as one set, three sets are stacked to extract low-, medium-, and high-level spatial features. The high-level features are then flattened into a one-dimensional vector, and weighted sums of its elements are passed through an activation function in the fully connected layer. Dropout may be used to reduce the degrees of freedom of the network and to prevent overfitting.

In the spatiotemporal feature extraction CNN, a sequence of the low- and medium-level spatial features extracted by the spatial feature extraction CNN, with frame length T, is input to the convolutional layer. The sequence of each spatial feature is convolved with several temporal filters, the convolved values are passed through an activation function, and temporal features, including depth information from motion parallax, are extracted. These features are input to a local contrast normalization layer, a pooling layer, and a fully connected layer.
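One conv → activation → local contrast normalization → pooling set can be sketched minimally in NumPy as below. The filter count, filter size, pooling size, toy input size, and the whole-map approximation of local contrast normalization are illustrative assumptions, not the settings used in the presented work.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with one spatial filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation function applied to the convolved values."""
    return np.maximum(x, 0.0)

def local_contrast_normalize(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation.
    For brevity 'local' is approximated by the whole feature map here."""
    return (x - x.mean()) / (x.std() + eps)

def max_pool(x, size=2):
    """Downsample by taking the maximum over non-overlapping windows."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def spatial_block(image, kernels):
    """One conv -> activation -> LCN -> pooling set; three such sets
    would be stacked for low-, medium-, and high-level features."""
    return [max_pool(local_contrast_normalize(relu(conv2d(image, k))))
            for k in kernels]

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 16))                      # toy single-channel frame
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]  # 4 assumed spatial filters
features = spatial_block(frame, kernels)
print(len(features), features[0].shape)                    # 4 pooled feature maps
```

Stacking `spatial_block` three times, each stage consuming the previous stage's pooled maps, mirrors the three concatenated sets of layers described above.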
The high-level spatial features extracted by the spatial feature extraction CNN are also input to the fully connected layer, and these two kinds of features are integrated in the output layer. To evaluate the proposed method, we conducted an experiment using objects with simple shapes and extracted shape information from motion parallax.
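The temporal convolution over the feature sequence and the integration of the two feature streams in the output layer can be sketched as follows. The frame length T, feature dimensions, temporal filter length, and number of output classes are hypothetical placeholders chosen only to make the example run.

```python
import numpy as np

def temporal_conv(seq, t_filter):
    """Convolve a (T, D) sequence of per-frame spatial feature vectors with a
    length-k temporal filter, independently for each feature dimension."""
    T, D = seq.shape
    k = len(t_filter)
    out = np.zeros((T - k + 1, D))
    for t in range(T - k + 1):
        out[t] = t_filter @ seq[t:t + k]   # weighted sum across k frames
    return out

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
T, D = 8, 6                                # assumed frame length and feature size
seq = rng.standard_normal((T, D))          # low/medium spatial features per frame
t_filter = rng.standard_normal(3)          # one assumed temporal filter

# Spatiotemporal features: temporal convolution, activation, then pooling
# over time (mean pooling here, for brevity).
st_feat = relu(temporal_conv(seq, t_filter)).mean(axis=0)

# High-level spatial features from the spatial feature extraction CNN.
high_level = rng.standard_normal(5)

# Integrate both kinds of features in the output layer (a linear map here).
combined = np.concatenate([st_feat, high_level])
W = rng.standard_normal((3, combined.size))  # 3 assumed object classes
scores = W @ combined
print(scores.shape)
```

In the full model each temporal filter would produce its own feature sequence and the output layer would be trained, but the flow, per-dimension temporal convolution followed by concatenation with the high-level spatial features, matches the description above.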