4.1 Introduction This section describes the procedures we used to classify remote sensing scene images. We applied several hand-engineered feature extraction methods, proposed a CNN model, and fine-tuned VGG16 and ResNet50. We also used VGG16 and ResNet50 as fixed feature extractors and trained a classifier on the extracted features. The major contributions of this paper are as follows: 1. We give a brief review of hand-engineered feature extraction methods and deep learning procedures. 2. We discuss transfer learning and its application in different deep learning scenarios. 3. We fine-tuned two popular deep learning models, VGG16 and ResNet50, on …
The input image is first zero-padded so that the feature-map size remains the same after convolution. The padded image is convolved with 64 kernels, producing 64 feature maps of size 224x224. These are max-pooled (2x2 grid with 2x2 stride), producing 64 feature maps of size 112x112. The output of this max-pooling layer is zero-padded and convolved with 128 kernels of size 64x3x3, producing 128 feature maps of size 112x112. This output is max-pooled to give 128 feature maps of size 56x56. These are zero-padded and convolved with 256 kernels of size 128x3x3, producing 256 feature maps of size 56x56, which are max-pooled (2x2 grid with 2x2 stride) to 256 feature maps of size 28x28.

Figure 4.1: Proposed CNN model

The 256 feature maps are then zero-padded and convolved with 512 kernels of size 256x3x3. This fourth convolution produces 512 feature maps of size 28x28, which are max-pooled (2x2 grid with 4x4 stride) to 512 feature maps of size 7x7. A flatten layer converts the 512x7x7 input into a one-dimensional vector. A fully-connected layer with 256 neurons is then added, connected to every neuron of the previous layer. A dropout layer with probability 0.5 is placed immediately after this first fully-connected layer to reduce overfitting. Another fully-connected layer with 256 neurons follows the first dropout
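Since the paragraph above walks through the spatial size at each stage, the shapes can be traced with a short script. This is a sketch, assuming "same" zero-padding for every convolution and the pooling strides stated in the text; it only checks dimensions, not the actual model.

```python
# Trace the spatial dimensions through the proposed CNN's four blocks.
# Assumptions (from the text): zero-padded convolutions preserve size,
# and each max-pooling layer divides the size by its stride.
def trace_shapes(size=224):
    shapes = []
    # (number of kernels, pooling stride) for each block described above
    blocks = [(64, 2), (128, 2), (256, 2), (512, 4)]
    for kernels, stride in blocks:
        # padded convolution keeps `size`; pooling divides it by `stride`
        size = size // stride
        shapes.append((kernels, size, size))
    return shapes

print(trace_shapes())
# [(64, 112, 112), (128, 56, 56), (256, 28, 28), (512, 7, 7)]
```

The trace confirms the sequence 224 -> 112 -> 56 -> 28 -> 7 given in the text.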
Shi et al. [13] used a local projection profile at each pixel of the image and transformed the original image into an adaptive local connectivity map (ALCM). For this process, the grayscale image is first reversed so that the foreground pixels (text) take intensity values up to 255. The image is then downsampled to ¼ of its size (½ in each direction). Next, a sliding window of size 2c scans the image from left to right and from right to left in order to compute the cumulative intensity of every neighborhood. This technique is equivalent to computing the projection profile of every sliding window (i.e., counting the foreground pixels), but instead of outputting a projection-profile histogram, the entire sliding-window sum is stored in the ALCM image. Finally
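The sliding-window accumulation can be illustrated in one dimension. This is a minimal sketch, not the paper's implementation: the window width 2c and the toy intensity values are illustrative.

```python
# Cumulative intensity of a 2c-wide neighborhood centred at each pixel,
# analogous to the ALCM sliding-window step described above (1-D sketch).
def window_sums(row, c):
    out = []
    for i in range(len(row)):
        lo, hi = max(0, i - c), min(len(row), i + c)  # clip at the borders
        out.append(sum(row[lo:hi]))
    return out

row = [0, 255, 255, 0, 0, 255]  # reversed image: text pixels are bright
print(window_sums(row, 2))
```

Each output value is the neighborhood sum saved into the ALCM image rather than into a histogram.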
For activity 2, the frame is run through a thresholding function that creates a binary image based on the histogram levels similar to those of a stop sign. Any pixel that falls within these levels is set to one and the rest are set to zero. The binary image is then put through two morphological operators, open and close. The open operator removes small clusters of pixels, as defined by the parameters passed to the function. The close operator fills in clusters of pixels matching the shape of the structuring element passed to the function; in this case, the shape is an octagon, like a stop sign. The last step is to run the binary image through blob analysis to obtain the ROIs of at least a specified minimum size. Once the ROIs have been found, they are passed into the cascade object detector for verification of
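The threshold / open / close / blob-analysis pipeline can be sketched in pure Python on a toy grid. This is a simplified illustration with a 3x3 square structuring element rather than the octagon used in the text, and hand-picked threshold levels; a real system would use a library's morphology and blob routines on full frames.

```python
# Toy version of the pipeline: threshold -> opening -> closing -> blobs.
def threshold(img, lo, hi):
    return [[1 if lo <= v <= hi else 0 for v in row] for row in img]

def erode(b):  # pixel survives only if its whole 3x3 neighborhood is set
    h, w = len(b), len(b[0])
    return [[1 if all(0 <= y+dy < h and 0 <= x+dx < w and b[y+dy][x+dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def dilate(b):  # pixel is set if any 3x3 neighbor is set
    h, w = len(b), len(b[0])
    return [[1 if any(0 <= y+dy < h and 0 <= x+dx < w and b[y+dy][x+dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def open_close(b):
    # opening (erode, dilate) removes small clutter;
    # closing (dilate, erode) fills small holes
    return erode(dilate(dilate(erode(b))))

def blobs(b, min_size):
    # connected-component labeling, keeping components >= min_size
    h, w = len(b), len(b[0])
    seen, rois = [[False] * w for _ in range(h)], []
    for y0 in range(h):
        for x0 in range(w):
            if b[y0][x0] and not seen[y0][x0]:
                stack, blob = [(y0, x0)], []
                seen[y0][x0] = True
                while stack:
                    y, x = stack.pop()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and b[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(blob) >= min_size:
                    rois.append(blob)
    return rois

img = [[0] * 6 for _ in range(6)]
for y in range(1, 5):
    for x in range(1, 5):
        img[y][x] = 200        # a 4x4 "sign-colored" region
img[0][5] = 200                # an isolated noise pixel
rois = blobs(open_close(threshold(img, 150, 255)), min_size=10)
print(len(rois), len(rois[0]))
```

The opening removes the isolated pixel while the 4x4 region survives and is returned as a single ROI.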
These 16 features included 12 features calculated from the 6 multispectral bands, namely the mean value and standard deviation of each band. In addition, we chose intensity, texture variance, texture mean, and NDVI (Normalized Difference Vegetation Index) for classification. Finally, training samples were selected for each classification category based on the previously segmented and merged objects
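The per-band statistics and the NDVI mentioned above can be computed as follows. The reflectance values in the example are illustrative, not from the study's data.

```python
# Per-band mean and standard deviation (the 12 band features), plus NDVI.
def band_stats(band):
    n = len(band)
    mean = sum(band) / n
    var = sum((v - mean) ** 2 for v in band) / n
    return mean, var ** 0.5

def ndvi(nir, red):
    # Normalized Difference Vegetation Index for one pixel:
    # (NIR - Red) / (NIR + Red); vegetation reflects NIR strongly.
    return (nir - red) / (nir + red) if (nir + red) else 0.0

print(band_stats([1, 2, 3, 4]))
print(ndvi(0.6, 0.2))  # high NDVI suggests dense vegetation
```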
x 8 blocks and 1-valued 8 x 8 blocks. Trimming the images helps to remove redundant background frames.
The goal of feature extraction and selection is to reduce the dimension of the data. In this experiment, the dimensions of the AVIRIS and HYDICE images were reduced from 220 and 191, respectively, to 20 using PCA. From the PCA analysis we can see that the image of principal component 1 is brighter and sharper than the other principal-component images, as illustrated in Figure 2.
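A band-wise PCA reduction of this kind can be sketched via the covariance eigendecomposition. This is a generic sketch on random data (100 pixels, 220 bands standing in for AVIRIS), not the experiment's actual preprocessing chain.

```python
import numpy as np

# Reduce n_bands spectral features to n_components principal components.
def pca_reduce(X, n_components):
    # X: (n_pixels, n_bands) matrix of spectral vectors
    Xc = X - X.mean(axis=0)                 # center each band
    cov = np.cov(Xc, rowvar=False)          # band-by-band covariance
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]  # top components first
    return Xc @ vecs[:, order]              # project onto the top PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 220))             # stand-in for 220-band pixels
Z = pca_reduce(X, 20)
print(Z.shape)  # (100, 20)
```

Because the components are sorted by eigenvalue, the first projected band carries the most variance, consistent with the observation that the PC-1 image is the brightest and sharpest.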
Both methods work on a 64x64 input window to extract features, and both start by determining edges oriented at 0, 90, 45, and -45 degrees. In the method known as cell edge distribution (CED), each 64x64 edge map is divided into 16 cells of 16x16 pixels, and the number of edge pixels in each cell is counted. In the method known as principal projected edge distribution (PPED), edge pixels are counted along the orientation of the edge map; for example, edge pixels in a horizontal edge map are counted over every 4 rows of the map. The result of either feature extraction method is a 64-element feature vector. A third suggested feature vector is a simple concatenation of the CED and PPED vectors into a 128-element feature vector, simply referred to as
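The CED counting step can be sketched directly: 16 cell counts per orientation map, over four orientation maps, give the 64-element vector. The all-ones/all-zeros edge maps below are illustrative inputs, not real edge-detector output.

```python
# CED sketch: count edge pixels in each 16x16 cell of each 64x64 edge map.
# Four orientation maps (0, 90, 45, -45 degrees) x 16 cells = 64 features.
def ced(edge_maps):
    feat = []
    for em in edge_maps:                      # one binary map per orientation
        for cy in range(0, 64, 16):
            for cx in range(0, 64, 16):
                feat.append(sum(em[y][x]
                                for y in range(cy, cy + 16)
                                for x in range(cx, cx + 16)))
    return feat

# toy input: every pixel is an edge in the 0-degree map, none elsewhere
maps = [[[1] * 64 for _ in range(64)]] + \
       [[[0] * 64 for _ in range(64)] for _ in range(3)]
print(len(ced(maps)))  # 64
```

A concatenation of this vector with the 64-element PPED vector would give the 128-element combined feature.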
The Haar transform is based on the calculation of averages (approximation coefficients) and differences (detail coefficients). Given two adjacent pixels a and b, the principle is to calculate the average s = (a + b)/2 and the difference d = a − b. If a and b are similar, s will be similar to both and d will be small, i.e., require few bits to represent. This transform is reversible, since a = s + d/2 and b = s − d/2, and it can be written using matrix notation as (s, d)ᵀ = [[1/2, 1/2], [1, −1]] (a, b)ᵀ
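A minimal sketch of this pairwise step and its inverse, using the usual unnormalized Haar pair s = (a + b)/2 and d = a − b:

```python
# One pairwise Haar step: average (approximation) and difference (detail).
def haar(a, b):
    s = (a + b) / 2   # approximation coefficient
    d = a - b         # detail coefficient
    return s, d

# Exact inverse: a = s + d/2, b = s - d/2.
def inverse_haar(s, d):
    return s + d / 2, s - d / 2

s, d = haar(48, 52)
print(s, d)  # similar pixels give a small detail coefficient
assert inverse_haar(s, d) == (48, 52)
```

The small detail value for similar neighbors is exactly what makes the transform useful for compression.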
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework.
maps, each with a rectifier activation function and a size of 5x5. It expects images with
The first study discusses the basic operations of erosion and dilation and presents a system to recognize six handwritten digits. For this purpose, a novel method to recognize cursive and degraded text was developed by Badr & Haralick (1994) using OCR technology; parts of symbols (primitives) are detected to interpret the symbols with this method (Kumar et al. 2010). The study employs mathematical morphology operations to perform the recognition process. LeCun et al. (1998) addressed this problem by designing a neural-network-based classifier to discriminate handwritten numerals, achieving a reliable system with very high accuracy (over 99%) on the MNIST database. Moreover, the gradient and curvature of the grey character image were taken into consideration by Shi et al. (2002) to improve the accuracy of handwritten digit recognition. Uniquely, Teow & Loe (2002) proposed a new approach based on a biological vision model, with excellent results and a very low error rate (0.59%). Discoveries and developments have continued through new algorithms and learning rules. Besides the efficiency of using these rules individually in machine learning, some researchers have gone further in improving accuracy and performance by combining several rules to support one
Classification is a principal technique in hyperspectral image (HSI) analysis, where a label is assigned to each pixel based on its characteristics. Applying machine learning techniques to these datasets needs special consideration, since hyperspectral images are typically represented by feature vectors of extremely high dimension. A robust HSI classification requires a prudent combination of a deep feature extractor and a powerful classifier. In the last decade, many classification methods have been designed for hyperspectral data, but these approaches are prone to learning correlated features and usually fail to consider the structural information in the feature space. Most CNN methods extract image features at the last layer
The non-linearity of neural networks is the most attractive reason for using them. A single-layer adaptive network, known as WISARD, was first used for face recognition; it maintains a separate network for each enrolled individual. For effective recognition, the construction of the neural network structure is very important and depends on the application. Different types of neural networks are used for different tasks: for example, the multilayer perceptron shown in Fig. 2.1 is used for face detection, and the multi-resolution pyramid structure shown in Fig. 2.2 is used for face verification.
Recently, George Azzopardi and Nicolai Petkov [36] proposed a trainable filter called Combination Of Shifted FIlter REsponses (COSFIRE) and used it for keypoint detection and pattern recognition.
256 gray levels. This leads to 256^(84 × 84 × 4) ≈ 10^67970 combinations or possible game
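The size of this state space can be checked with a quick calculation. Assuming the input is a stack of four 84 × 84 frames with 256 gray levels (a common setup for Atari-style inputs; the stack depth of 4 is inferred from the magnitude of the quoted figure), the exponent of the power of ten follows directly:

```python
import math

# 256^(84*84*4) expressed as a power of ten:
# log10(256^n) = n * log10(256), with n = 84 * 84 * 4 pixels.
exponent = 84 * 84 * 4 * math.log10(256)
print(int(exponent))  # 67970
```

So the number of distinct inputs is about 10^67970, far too many to enumerate, which is why function approximation is needed.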
The network settings are described in Table I and represented as 64(9)-32(1)-32(5)-32(3)-64(1)-1[9]-s. We used strided convolution and deconvolution with stride s > 1 to speed up the network. In our experiments we consider only the Y (luminance) color component, and the compressed luminance frames become the input to the MDCNN network. There are no max-pooling or fully-connected (FC) layers in our network. The Rectified Linear Unit (ReLU), max(0, x) [19], is applied to the filter responses. Our CNN is designed to learn the residue between the output and input frames, since there is a lot of similarity between them, so the final output frame F(Y) is simply the sum of the input frame and the learnt residual frame.
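The residual formulation above can be sketched as follows. The `toy_net` stand-in is purely illustrative; in the actual system the residue R(Y) comes from the trained MDCNN.

```python
import numpy as np

# Residual learning: the network predicts only the residue R(Y);
# the restored frame is F(Y) = Y + R(Y).
def restore(y, residual_net):
    return y + residual_net(y)

relu = lambda x: np.maximum(0, x)          # ReLU applied to filter responses

# hypothetical stand-in for the learnt network: a constant correction
toy_net = lambda y: np.full_like(y, 0.5)
y = np.array([1.0, 2.0, 3.0])              # toy compressed luminance values
print(restore(y, toy_net))  # [1.5 2.5 3.5]
```

Learning only the (small) residue rather than the full frame is what makes this formulation easier to train when input and output are highly similar.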