Essay On Sound Event Classification

Several CNN approaches have been proposed for sound event classification (SEC); see [20, 21], to cite a few. However, most of these works are formulated around strongly labeled data. Those that use weakly labeled data are almost always limited in scale [22, 23, 4], offering little insight into how well they might generalize to large-scale scenarios or how useful they would be for transfer learning. The DCASE 2017 [7] weakly labeled challenge, and the works based on it, also consider only 17 events from AudioSet. [17] analyzes popular CNN architectures such as VGG and ResNet for large-scale sound event classification on web videos. However, the training procedure in [17] makes a simplistic strong-label assumption for weakly labeled audio. The sound…
Max pooling is done over a 2 × 2 window, with a stride of 2 × 2. F1 is also a convolutional layer with ReLU activation. It uses 1024 filters of size 2 × 2 with a stride of 1 and no padding. F2 is the secondary output layer, a convolutional layer of Cs filters of size 1 × 1 with sigmoid output. This layer produces segment-level output (Cs × K × 1), where Cs is the number of classes (in the source task) and K is the number of segments. The segment-level outputs are aggregated by a global pooling layer to produce a Cs × 1 dimensional output for the whole recording. The network scans through the whole input (log-mel spectrogram) and produces outputs corresponding to segments of 128 frames moving by 64 frames. For example, an input log-mel spectrogram of 896 frames, X ∈ R^(896 × 128), will produce K = 13 segments at F1 and F2, since (896 − 128) / 64 + 1 = 13. Because weakly labeled audio has labels only for the full recording, the segment-level outputs are pooled to obtain the recording-level output, and the loss is then computed with respect to this recording-level output. Hence, this network treats weak labels as weak. Overall, it is a VGG-style [25] CNN for weak label learning of sounds. We will refer to this network as NS. The segment size and hop size can be controlled by the network design; for NS these are 128 and 64, respectively. Note that, if needed, one can get temporal localization of events during inference through the segment-level outputs. Unlike SLAT where fully connected dense layers are used in
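
The segment arithmetic and the pooling step described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names `num_segments` and `pool_segments` are hypothetical, and the text does not say which global pooling NS uses, so mean pooling is an assumption here (max pooling is included as an alternative).

```python
def num_segments(num_frames, segment_size=128, hop_size=64):
    """Count the full segments a sliding window yields over num_frames.

    Defaults follow the segment size (128) and hop size (64) stated for NS.
    """
    if num_frames < segment_size:
        return 0
    return (num_frames - segment_size) // hop_size + 1


def pool_segments(segment_scores, mode="mean"):
    """Aggregate K segment-level score vectors into one recording-level vector.

    segment_scores: list of K lists, each holding Cs sigmoid scores.
    mode: "mean" or "max". The pooling type is an assumption, since the
    text only says "global pooling".
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    per_class = list(zip(*segment_scores))  # Cs tuples, each with K scores
    if mode == "mean":
        return [sum(scores) / len(scores) for scores in per_class]
    if mode == "max":
        return [max(scores) for scores in per_class]
    raise ValueError(f"unknown mode: {mode}")


# The 896-frame example from the text: (896 - 128) / 64 + 1 = 13 segments.
K = num_segments(896)
print(K)  # 13
```

The weak-label loss would then be computed between the pooled vector and the recording-level labels, while the unpooled segment scores remain available for temporal localization at inference time.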