Evaluation
We evaluated Perch 2.0 using a few-shot linear probe on marine tasks, such as distinguishing different baleen whale species or different killer whale subpopulations. We compared its performance against the other pre-trained models supported in our Perch Hoplite repository for agile modeling and transfer learning. The full set of models evaluated is Perch 2.0, Perch 1.0, SurfPerch, and the multispecies whale model.
For underwater data evaluation, we used three datasets: NOAA PIPAN, ReefSet, and DCLDE.
- NOAA PIPAN: An annotated subset of the NOAA NCEI Passive Acoustic Data Archive, drawn from NOAA Pacific Islands Fisheries Science Center recordings. It includes labels used in our prior whale models as well as new annotations for baleen species such as common minke whale, humpback whale, sei whale, blue whale, fin whale, and Bryde’s whale.
- ReefSet: Developed for SurfPerch model training, this dataset leverages data annotations from the Google Arts and Culture project Calling in Our Corals. It includes a mix of biological reef noises (croaks, crackles, growls), specific species/genera classes (e.g., damselfish, dolphins, and groupers), and anthropogenic noise and wave classes.
- DCLDE: This dataset is evaluated using three different label sets:
  - Species: For distinguishing between killer whales, humpbacks, abiotic sounds, and unknown underwater sounds (with some uncertainty in the killer whale and humpback labels).
  - Species Known Bio: For the killer whale and humpback labels that are certain.
  - Ecotype: For distinguishing between killer whale subpopulations (ecotypes), including Transient/Biggs, Northern Residents, Southern Residents, Southeastern Alaska killer whales, and offshore killer whales.
In this protocol, for a given target dataset with labeled data, we compute embeddings from each of the candidate models. We then select a fixed number of examples per class (4, 8, 16, or 32) and train a simple multi-class logistic regression model on top of the embeddings. We use the resulting classifier to compute the area under the receiver operating characteristic curve (AUC-ROC), where values closer to 1 indicate a stronger ability to distinguish between classes. This process simulates using a given pre-trained embedding model to create a custom classifier from a small number of labeled examples.
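The protocol above can be sketched roughly as follows. This is a minimal illustration, not the evaluation code behind the reported results: it uses NumPy and scikit-learn, replaces real model embeddings with synthetic Gaussian clusters, and scores on all examples not used for training. The helper names (`sample_k_per_class`, `few_shot_probe_auc`), the held-out split, and the one-vs-rest macro averaging of AUC-ROC are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def sample_k_per_class(labels, k, rng):
    """Return indices of k randomly chosen examples for each class."""
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        chosen.append(rng.choice(cls_idx, size=k, replace=False))
    return np.concatenate(chosen)


def few_shot_probe_auc(embeddings, labels, k, seed=0):
    """Fit a logistic-regression probe on k examples per class; score the rest."""
    rng = np.random.default_rng(seed)
    train_idx = sample_k_per_class(labels, k, rng)
    test_mask = np.ones(len(labels), dtype=bool)
    test_mask[train_idx] = False  # everything not sampled is held out

    probe = LogisticRegression(max_iter=1000)
    probe.fit(embeddings[train_idx], labels[train_idx])
    scores = probe.predict_proba(embeddings[test_mask])
    # One-vs-rest, macro-averaged AUC-ROC (an assumed averaging choice;
    # this call expects three or more classes).
    return roc_auc_score(
        labels[test_mask], scores, multi_class="ovr", average="macro"
    )


# Synthetic stand-in for embeddings from one candidate model:
# 3 classes, 100 examples each, 64-dimensional Gaussian clusters.
data_rng = np.random.default_rng(42)
n_classes, per_class, dim = 3, 100, 64
centers = data_rng.normal(scale=2.0, size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), per_class)
embeddings = centers[labels] + data_rng.normal(size=(labels.size, dim))

# Repeat the probe for each few-shot budget used in the protocol.
results = {k: few_shot_probe_auc(embeddings, labels, k) for k in (4, 8, 16, 32)}
for k, auc in results.items():
    print(f"k={k:2d}  AUC-ROC={auc:.3f}")
```

In a real comparison, the synthetic `embeddings` array would be replaced by the embeddings each candidate model produces for the same labeled clips, and the loop would be repeated per model and per dataset.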
Our results show that more examples per class improve performance for all models. The exception is ReefSet, where every model other than the multispecies whale model already performs well with only four examples per class. Notably, Perch 2.0 is consistently either the top or second-best performing model for each dataset and sample size.