Code and Dataset
(1) Code: https://github.com/LAION-AI/CLAP
(2) LAION-Audio-630K: https://github.com/LAION-AI/audio-dataset
Details of LAION-Audio-630K
(1) We list the specifications of the websites/sources from which we collect the audio samples and text captions for LAION-Audio-630K in Table 1.
(2) We list the details of the three datasets in Table 2. We use their combination to train the model in Section 4 of the submission.
Attentional Feature Fusion
The fusion architecture accepts two inputs: X, the global information, and Y, the merged local information. The two inputs are passed through two CNN branches to produce a fusion coefficient, and X and Y are then combined as a weighted sum using this coefficient, as sketched below.
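To make the fusion mechanism concrete, here is a minimal PyTorch sketch of an AFF-style fusion block. The channel size, reduction ratio, and branch layout are illustrative assumptions and do not necessarily match the exact configuration in our released code.

```python
# A minimal sketch of attentional feature fusion (AFF-style) for 1D audio
# features of shape (batch, channels, time). Layer sizes are illustrative.
import torch
import torch.nn as nn

class AttentionalFeatureFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Local branch: point-wise convolutions over the time axis.
        self.local_att = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        # Global branch: the same bottleneck applied to globally pooled features.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: global information, y: merged local information.
        mixed = x + y
        # Fusion coefficient alpha in [0, 1], computed from both branches.
        alpha = self.sigmoid(self.local_att(mixed) + self.global_att(mixed))
        return alpha * x + (1.0 - alpha) * y

# Usage: fuse the global view and the merged local views of a long audio clip.
fusion = AttentionalFeatureFusion(channels=64)
x_global = torch.randn(2, 64, 1024)
y_local = torch.randn(2, 64, 1024)
fused = fusion(x_global, y_local)  # (2, 64, 1024)
```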
Examples of Keyword-to-Caption Augmentation
Additionally, when applying keyword-to-caption augmentation, we excluded samples shorter than 2 seconds, as in such cases the audio usually contains only a single event and therefore matches the generated caption poorly. When using keyword-to-caption augmentation on training datasets that include AudioSet, we use only the captions generated by keyword-to-caption and exclude the captions generated by the template.
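For illustration, below is a minimal sketch of keyword-to-caption augmentation, assuming the open-source keytotext package (a T5-based keyword-to-text model) and its pipeline interface. The model variant, generation parameters, and the de-biasing word list are illustrative assumptions rather than the exact settings used to build LAION-Audio-630K.

```python
# A minimal sketch of keyword-to-caption augmentation, assuming the keytotext
# package; "k2t-base" and the generation parameters are illustrative choices.
from keytotext import pipeline

k2t = pipeline("k2t-base")  # T5 model fine-tuned to turn keywords into a sentence

def keywords_to_caption(keywords, duration_sec):
    """Generate a caption from AudioSet-style keywords/tags.

    Samples shorter than 2 seconds are skipped, since such clips usually
    contain a single event and match the generated caption poorly.
    """
    if duration_sec < 2.0:
        return None
    caption = k2t(keywords, num_beams=4, no_repeat_ngram_size=3, early_stopping=True)
    # Illustrative de-biasing step: replace gendered words with a neutral term.
    for word in ("woman", "man"):
        caption = caption.replace(word, "person")
    return caption

print(keywords_to_caption(["dog", "bark", "park"], duration_sec=10.0))
```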
Loss Trend of Different Audio/Text Encoder Combinations
The final decision is to use RoBERTa, as the result with the CLIP Transformer is not as good as with the other two text encoders. We further visualize the loss trends of these three text encoders (+ PANN audio encoder) on the AudioCaps evaluation set below.
We attribute the worse result of the CLIP Transformer to overfitting: the CLIP Transformer is trained on OpenAI's large-scale image-text dataset of about 4 billion samples. This makes it hard for the audio encoder to learn a cross-modal representation, because the text encoder is already very powerful.
Compared to the CLIP Transformer, RoBERTa and BERT appear to be better choices, as they have stronger generalization ability brought by large-scale text pretraining rather than by contrastive learning between text and image data.
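For reference, the loss tracked in these trends is the symmetric contrastive (InfoNCE) loss between audio and text embeddings. A minimal PyTorch sketch is given below; the embedding dimension and temperature are illustrative, not the exact values of our released models.

```python
# A minimal sketch of the symmetric contrastive (InfoNCE) loss evaluated on
# paired audio/text embeddings; dimensions and temperature are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     logit_scale: torch.Tensor) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) projected outputs of the two encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * audio_emb @ text_emb.t()          # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage with random embeddings and a fixed temperature.
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(contrastive_loss(audio, text, torch.tensor(1 / 0.07)))
```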
Additional Experiment of Feature Fusion on the Freesound Dataset
The results are shown in the table below; the notation is the same as in Table 3 of our submission.
From this table, we can further confirm that feature fusion improves the retrieval performance on the Freesound dataset. The performance on the Freesound dataset shares a similar trend with that on the Clotho dataset:
(1) the performance of the model trained on "AudioCaps + Clotho + LA." is better than that trained on "AudioCaps + Clotho + LA. + AudioSet". As demonstrated in Section 4.2, similar to Clotho, the Freesound dataset contains audio samples that differ from AudioSet; adding AudioSet into the training moves the model's distribution away from general audio data toward AudioSet-like audio data, thus decreasing the performance.
(2) the performance with feature fusion is better than that without feature fusion, as the Freesound dataset contains samples longer than 10 seconds, the same as the Clotho dataset. Their performance trends are similar.
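For completeness, below is a minimal sketch of how the recall@K retrieval metrics reported in such tables can be computed from paired audio and text embeddings; the embeddings and K values are illustrative.

```python
# A minimal sketch of text-to-audio recall@K from CLAP-style embeddings.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, audio_emb: torch.Tensor, ks=(1, 5, 10)):
    """text_emb, audio_emb: (N, dim); row i of each forms a matched text-audio pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    sim = text_emb @ audio_emb.t()                     # (N, N) similarity matrix
    # Rank of the ground-truth audio for each text query (0 = best).
    ranks = (sim.argsort(dim=-1, descending=True)
                .argsort(dim=-1)
                .diagonal())
    return {f"R@{k}": (ranks < k).float().mean().item() for k in ks}

# Usage with random embeddings in place of real encoder outputs.
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512)))
```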
Experiment Settings on Data Exclusion
Acknowledgement
Our codebase is built on the following open-source projects: (1) PANN, (2) HTSAT, (3) open-clip, and (4) PyTorch. We would like to thank LAION, Stability.ai, and the Summit cluster at Oak Ridge National Laboratory for the support of computation infrastructure. We would like to thank Christoph Schuhmann, Richard Vencu, Irina Rish, and Romain Beaumon, as this project would not be possible without them. We would like to thank all the community contributors for contributing to the collection of the LAION-Audio-630K dataset. These community contributors include, but are not limited to: @marianna13#7139, @Chr0my#0173, @PiEquals4#1909, @Yuchen Hui#8574, @Antoniooooo#4758, @IYWO#9072, krishna#1648, @dicknascarsixtynine#3885, and @turian#1607. We would like to thank Xinhao Mei for explaining and helping with the retrieval metrics.