The deep learning revolution, embodied in part by transformer architectures and pre-trained self-supervised models, opens many perspectives for the study of linguistics and animal communication. By exploring transfer learning approaches for computational bioacoustics applied to primate vocalizations, we investigate the explainability of pre-trained speech models to understand what they can teach us about the origins of language. To examine the divergences and similarities between speech and primate vocalizations from a deep learning perspective, our method consists of probing and fine-tuning experiments based on self-supervised acoustic models. By analyzing the ability of these models to process primate vocalizations, we test the effect of model architecture, pre-training dataset, and task specificity on transfer learning performance. In doing so, we aim to evaluate the validity of deep transfer learning as a scientific tool for studying the origins of language from a comparative standpoint.