Soham Ghosh

Research Scientist · Mistral AI

I am a Research Scientist at Mistral AI working on open-source multimodal large language models such as Pixtral and Voxtral. Previously, I worked at Google DeepMind on improving audio-visual representation learning with language supervision. During my Master's at Carnegie Mellon University, I was fortunate to work with Alexander Hauptmann on video-language grounding and with Ruslan Salakhutdinov on meta reinforcement learning. I completed my undergraduate studies at Nanyang Technological University, where I received the Nanyang Scholarship and the President's Scholarship for Research. I am thankful to have been advised there by Justin Dauwels and Adams Kong.

The best way to reach me is by email at ghosh.soham@gmail.com

Vision-Language Models · Audio LLMs · Representation Learning · Reinforcement Learning

Google Scholar · LinkedIn

Mission: Building artificial general intelligence that can perceive, understand, and interact with us in natural ways, and that augments our own abilities.

Education

Carnegie Mellon University (School of Computer Science)
2017 – 2018
M.S. · Computational Data Science (Analytics)
CGPA: 4.03/4.33. Courses: Probabilistic Graphical Models, Deep RL, Large-Scale ML, Advanced Multimodal ML. Teaching Assistant for Introduction to Machine Learning and Introduction to Deep Learning.
Nanyang Technological University
2012 – 2016
B.Eng. & M.Sc. · Engineering Science (Computer Science), Technology Management
First Class Honours, Nanyang Scholarship, President’s Research Scholar. CGPA: 4.75/5.0.

Publications

Voxtral

Mistral AI team

Voxtral Mini and Small are audio-chat multimodal models that understand speech and text, offering state-of-the-art performance on audio tasks with up to 40-minute audio context.

arXiv 2025 PDF

Pixtral 12B

Mistral AI team

Pixtral 12B is a 12-billion-parameter multimodal model that combines vision-language understanding with text generation, achieving leading performance on multimodal benchmarks.

arXiv 2024 PDF

TIPS: Text-Image Pretraining with Spatial Awareness

K.K. Maninis, K. Chen, S. Ghosh*, A. Karpur, K. Chen, Y. Xia, B. Cao, D. Salz, …

TIPS improves spatial reasoning in image-text models by blending contrastive learning with masked image modeling, yielding strong performance on dense and global vision tasks.

arXiv 2024

ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images

X. Zhang, Z. Wang, H. Zhou, S. Ghosh*, D. Gnanapragasam, V. Jampani, H. Su, …

ConDense jointly learns 2D and 3D representations using a NeRF-like pipeline, ensuring consistent dense and sparse features across views.

ECCV 2024 PDF

VideoCoCa: Video-text modeling with zero-shot transfer from contrastive captioners

S. Yan, T. Zhu, Z. Wang, Y. Cao, M. Zhang, S. Ghosh*, Y. Wu, J. Yu

VideoCoCa adapts CoCa to video-text tasks, enabling zero-shot video classification, retrieval, and QA with minimal additional training.

arXiv 2022

Concurrent Meta Reinforcement Learning

E. Parisotto, S. Ghosh*, S.B. Yalamanchi, V. Chinnaobireddy, Y. Wu, …

This work proposes a concurrent meta-RL framework that learns multiple tasks simultaneously, improving sample efficiency and generalization.

arXiv 2019 PDF

Extractive Clip Localization using Natural Language Descriptions

S. Ghosh*, A. Agarwal, Z. Parekh, A. Hauptmann

Presents a method for localizing video segments described by natural language, using alignment between text and temporal visual features.

ACL 2019 PDF

Denoising Autoencoders for Fast Real-time Traffic Estimation on Urban Road Networks

S. Ghosh*, M.T. Asif, L. Wynter

Uses denoising autoencoders to rapidly estimate real-time traffic states on large urban road networks from incomplete sensor data.

IEEE CDC 2017 PDF

Tattoo Detection based on CNN and Remarks on the NIST Database

Q. Xu, S. Ghosh*, X. Xu, Y. Huang, A.W.K. Kong

Develops a CNN-based approach for tattoo detection in images, with analysis of the NIST tattoo image database.

ICB 2016 PDF

Evaluation of Smart-phone Performance for Real-time Traffic Prediction

R. Ansar, P. Sarampakhul, S. Ghosh*, N. Mitrovic, M.T. Asif, J. Dauwels, …

Evaluates smartphones’ capability to perform real-time traffic prediction, highlighting feasibility and constraints for mobile deployment.

IEEE ITSC 2014 PDF