Soham Ghosh

Research Scientist · Mistral AI

I am a Research Scientist at Mistral AI working on open-source multimodal large language models such as Pixtral and Voxtral. Previously, I worked at Google DeepMind on improving audio-visual representation learning with language supervision. During my Master's at Carnegie Mellon University, I was fortunate to work with Alexander Hauptmann on video-language grounding and with Ruslan Salakhutdinov on meta reinforcement learning. I completed my undergraduate studies at Nanyang Technological University, where I received the Nanyang Scholarship and the President's Scholarship for Research. I am thankful to have been advised there by Justin Dauwels and Adams Kong.

The best way to reach me is by email at ghosh.soham@gmail.com.

Vision-Language Models · Audio LLMs · Representation Learning · Reinforcement Learning

Mission: Building artificial general intelligence that can perceive, understand, and interact with us in natural ways, and that augments our own abilities.

Education

  • Carnegie Mellon University (School of Computer Science)
    2017 – 2018
    M.S. · Computational Data Science (Analytics)
    CGPA: 4.03/4.33. Courses: Probabilistic Graphical Models, Deep RL, Large-Scale ML, Advanced Multimodal ML. Teaching Assistant for Introduction to Machine Learning and Introduction to Deep Learning.
  • Nanyang Technological University
    2012 – 2016
    B.Eng. & M.Sc. · Engineering Science (Computer Science), Technology Management
    First Class Honours, Nanyang Scholarship, President’s Research Scholar. CGPA: 4.75/5.0.

Publications

Voxtral

Mistral AI team

Voxtral Mini and Small are audio-chat multimodal models that understand speech and text, offering state-of-the-art performance on audio tasks with up to 40-minute audio context.

arXiv 2025 PDF
Pixtral 12B

Mistral AI team

Pixtral 12B is a 12B-parameter multimodal model that combines vision-language understanding with text generation, achieving leading performance on multimodal benchmarks.

arXiv 2024 PDF
TIPS: Text-Image Pretraining with Spatial Awareness

K.K. Maninis, K. Chen, S. Ghosh*, A. Karpur, K. Chen, Y. Xia, B. Cao, D. Salz, …

TIPS improves spatial reasoning in image-text models by blending contrastive learning with masked image modeling, yielding strong performance on dense and global vision tasks.

arXiv 2024
ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images

X. Zhang, Z. Wang, H. Zhou, S. Ghosh*, D. Gnanapragasam, V. Jampani, H. Su, …

ConDense jointly learns 2D and 3D representations using a NeRF-like pipeline, ensuring consistent dense and sparse features across views.

ECCV 2024 PDF
VideoCoCa: Video-text modeling with zero-shot transfer from contrastive captioners

S. Yan, T. Zhu, Z. Wang, Y. Cao, M. Zhang, S. Ghosh*, Y. Wu, J. Yu

VideoCoCa adapts CoCa to video-text tasks, enabling zero-shot video classification, retrieval, and QA with minimal additional training.

arXiv 2022
Concurrent Meta Reinforcement Learning

E. Parisotto, S. Ghosh*, S.B. Yalamanchi, V. Chinnaobireddy, Y. Wu, …

This work proposes a concurrent meta-RL framework that learns multiple tasks simultaneously, improving sample efficiency and generalization.

arXiv 2019 PDF
Extractive Clip Localization using Natural Language Descriptions

S. Ghosh*, A. Agarwal, Z. Parekh, A. Hauptmann

This work presents a method for localizing video segments described in natural language by aligning text with temporal visual features.

ACL 2019 PDF
Denoising Autoencoders for Fast Real-time Traffic Estimation on Urban Road Networks

S. Ghosh*, M.T. Asif, L. Wynter

This work uses denoising autoencoders to rapidly estimate real-time traffic states on large urban road networks from incomplete sensor data.

IEEE CDC 2017 PDF
Tattoo Detection based on CNN and Remarks on the NIST Database

Q. Xu, S. Ghosh*, X. Xu, Y. Huang, A.W.K. Kong

This work develops a CNN-based approach for tattoo detection in images and analyzes the NIST tattoo image database.

ICB 2016 PDF
Evaluation of Smart-phone Performance for Real-time Traffic Prediction

R. Ansar, P. Sarampakhul, S. Ghosh*, N. Mitrovic, M.T. Asif, J. Dauwels, …

This work evaluates the capability of smartphones to perform real-time traffic prediction, highlighting the feasibility and constraints of mobile deployment.

IEEE ITSC 2014 PDF