Slides for "regular" presentations can be found in the publications section, associated with their papers.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, awareness of the dynamics of the environment, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics, embodied computer vision and (intuitive) physics. We will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, awareness of the dynamics of the environment, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics, embodied computer vision and (intuitive) physics. We will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning, and transfer the acquired knowledge to real robots operating in physical environments. We cover the learning of physical phenomena like fluid dynamics, with applications to UAV control. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics and embodied computer vision. In particular, we will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning, and transfer the acquired knowledge to real robots operating in physical environments. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
In this talk we address perception and navigation problems in robotics settings, in particular mobile terrestrial robots and intelligent vehicles acting in 3D environments from visual input. We focus on learning structured representations, which allow high-level reasoning about the presence of objects and actors in a scene and support planning and control decisions. In particular, we compare different ways to design inductive biases for deep reinforcement learning: neural metric maps, neural topological maps, and neural implicit representations.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision. In particular, we will present solutions for learning robot navigation in complex 3D environments, where we combine large-scale training in simulation and virtual environments with the transfer of the acquired knowledge to real robots operating in physical environments.
High-capacity deep networks trained on large-scale data are increasingly used to learn agents capable of automatically taking complex decisions on high-dimensional data like text, images and videos. Certain applications require robustness: the capacity to take the right decisions at the right moments, with high risks associated with wrong decisions. We require these agents to acquire the right kind of reasoning capabilities, i.e. to take decisions for the reasons the designers had in mind. This is made difficult by the diminishing role experts have in the design and engineering process, as the agents' decisions are in large part dominated by the impact of the training data.
In this talk we will address the problem of learning explainable and interpretable models, in particular deep networks. We start by exploring the question of explainability in a broader sense, in terms of feasibility and trade-offs.
We then focus on visual reasoning tasks and target different situations in which agents need to acquire a certain approximation of common knowledge, including robotics and vision-and-language reasoning. We explore this problem in a holistic way and study it from various angles: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning?
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning? Can we structure neural networks with inductive biases to improve the emergence of reasoning?
In this lecture we will discuss models and algorithms for deep learning, a variant of machine learning which emphasizes the learning of high-capacity neural networks from large amounts of data, most often embedded in high-dimensional spaces (images, signals, audio, language, etc.). We will go over the different model variants used for different tasks, and link their structure and inductive biases to symmetries and invariances we want to enforce in the data, and to high-level goals: multi-layer perceptrons, convolutional neural networks, recurrent neural networks, graph networks and transformers (attention mechanisms).
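As a small, purely illustrative sketch of the link between architecture and symmetry mentioned above (assuming PyTorch; the image, filter and shift are arbitrary), the following snippet checks numerically that a convolution with circular padding commutes with a horizontal translation of its input:

# Illustrative only: translation equivariance of a convolution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 32, 32)   # one single-channel "image"
k = torch.randn(1, 1, 3, 3)     # one 3x3 filter

def conv_circ(t):
    # Convolution with circular padding, so the translation symmetry is exact.
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode='circular'), k)

def shift(t):
    # Horizontal translation by 5 pixels (wrapping around).
    return torch.roll(t, shifts=5, dims=-1)

# Convolving a shifted image equals shifting the convolved image.
print(torch.allclose(conv_circ(shift(x)), shift(conv_circ(x)), atol=1e-5))  # True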
We will link models and algorithms to an important subgoal of AI, namely the creation of intelligent agents, which require high-level reasoning capabilities and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
On the application side, we will cover the automatic learning of reasoning capabilities in different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning?
The second part of this lecture deals with more practical and hands-on aspects of deep learning and covers differentiable programming, the basis for the implementation of deep neural networks which can be trained by gradient descent. We will learn how to represent tensors in PyTorch, one of the most widely used deep learning frameworks. A particular emphasis will be put on autograd, the automatic differentiation of models based on the backpropagation algorithm applied to dynamically created computation graphs. Interested participants with access to a laptop can apply these techniques to a small toy problem during the lecture.
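As a minimal sketch of the kind of toy problem participants could try during the lecture (PyTorch assumed; the data and parameters are invented for illustration), the following code fits a line by gradient descent, with autograd building a computation graph in the forward pass and backpropagating through it:

# Minimal autograd sketch (illustrative only): fit y = w*x + b by gradient descent.
import torch

# Toy data: y = 3x - 1 with a little noise.
x = torch.linspace(-1, 1, 100)
y = 3 * x - 1 + 0.05 * torch.randn(100)

# Parameters are leaf tensors tracked by autograd.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    pred = w * x + b                    # forward pass builds a dynamic graph
    loss = ((pred - y) ** 2).mean()
    loss.backward()                     # backpropagation through the graph
    with torch.no_grad():               # gradient descent step, outside the graph
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())               # close to 3 and -1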
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning? Can we structure neural networks with inductive biases to improve the emergence of reasoning?
Several case studies will be presented, illustrating academic/industrial collaborations on the use of deep learning: automatic processing of digitized documents (text block detection, character recognition); human-machine interfaces (navigation in 3D environments on touch tables); visual question answering systems.
In this talk we address perception and navigation problems in robotics, in particular mobile terrestrial robots and intelligent vehicles. We focus on learning structured representations, which allow complex reasoning, planning and control. Our control policies are automatically learned from interactions with photo-realistic 3D environments using Deep Reinforcement Learning. While classical methods learn a representation of the history of observations in the form of a flat vectorial hidden state, we propose two different methods, which structure memory by imbuing neural networks with inductive biases of different kinds.
The first method structures its hidden state as a metric map in a bird's-eye view, updated through affine transforms given ego-motion. The semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias for deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself.
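A rough sketch of this kind of map update, assuming a PyTorch tensor as the egocentric map and a simple planar ego-motion; the function and its parameters are illustrative, not the exact model of the talk:

# Sketch (illustrative): warp an egocentric bird's-eye-view feature map
# according to ego-motion, using a differentiable affine transform.
import math
import torch
import torch.nn.functional as F

def egomotion_warp(feature_map, dx, dy, dtheta, meters_per_cell):
    # feature_map: (B, C, H, W); dx, dy in meters; dtheta in radians.
    B, C, H, W = feature_map.shape
    cos, sin = math.cos(dtheta), math.sin(dtheta)
    # Translation expressed in normalized grid coordinates ([-1, 1]).
    tx = 2.0 * dx / (W * meters_per_cell)
    ty = 2.0 * dy / (H * meters_per_cell)
    theta = torch.tensor([[cos, -sin, -tx],
                          [sin,  cos, -ty]], dtype=feature_map.dtype)
    theta = theta.unsqueeze(0).expand(B, -1, -1)
    grid = F.affine_grid(theta, feature_map.shape, align_corners=False)
    # Bilinear resampling keeps the update differentiable end-to-end.
    return F.grid_sample(feature_map, grid, align_corners=False)

m = torch.zeros(1, 32, 128, 128)   # learned map content, 32 feature channels
m = egomotion_warp(m, dx=0.2, dy=0.0, dtheta=0.05, meters_per_cell=0.05)

Because the warp is differentiable, gradients from the task reward can flow back into whatever network writes features into the map.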
The second method introduces a differentiable topological representation, i.e. memory in graph form. Here, our main contribution is a data-driven approach for planning under uncertainty, requiring an estimate of shortest paths in valued graphs with a probabilistic structure. Whereas classical symbolic algorithms achieve optimal results on noiseless topologies, or optimal results in a probabilistic sense on graphs with probabilistic structure, we aim to show that machine learning can overcome missing information in the graph by taking into account rich high-dimensional node features, for instance visual information available at each location of the map. Compared to purely learned neural white box algorithms, we structure our neural model with an inductive bias for dynamic-programming-based shortest path algorithms, and we show that a particular parameterization of our neural model corresponds to the Bellman-Ford algorithm. Through an empirical analysis in simulated photo-realistic 3D environments, we demonstrate that the learned neural planner, by including visual features, outperforms classical symbolic solutions for graph-based planning.
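The following is a hedged sketch of the general idea rather than the exact model of the talk: edge costs are predicted from high-dimensional node features, and a Bellman-Ford-style relaxation is unrolled with a soft minimum so that the planner stays differentiable (PyTorch assumed; all names and sizes are illustrative):

# Sketch (illustrative): a differentiable, Bellman-Ford-style neural planner.
import torch
import torch.nn as nn

class NeuralBellmanFord(nn.Module):
    def __init__(self, feat_dim, hidden=64, iterations=20, tau=0.1):
        super().__init__()
        # Predicts a cost for edge (i, j) from the two node feature vectors.
        self.edge_cost = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.iterations = iterations
        self.tau = tau

    def forward(self, node_feats, adjacency, goal):
        # node_feats: (N, F); adjacency: (N, N) in {0, 1}; goal: node index.
        N = node_feats.shape[0]
        pairs = torch.cat([node_feats.unsqueeze(1).expand(N, N, -1),
                           node_feats.unsqueeze(0).expand(N, N, -1)], dim=-1)
        cost = self.edge_cost(pairs).squeeze(-1).abs()          # non-negative costs
        cost = cost.masked_fill(adjacency == 0, float('inf'))   # only real edges

        dist = torch.full((N,), float('inf'))
        dist[goal] = 0.0
        for _ in range(self.iterations):
            # Relaxation: dist[i] <- softmin_j (cost[i, j] + dist[j]).
            cand = cost + dist.unsqueeze(0)
            soft = -self.tau * torch.logsumexp(-cand / self.tau, dim=1)
            dist = torch.minimum(dist, soft)
        return dist   # soft distance-to-goal for every node

node_feats = torch.randn(6, 16)                # e.g. visual features per node
adj = (torch.rand(6, 6) < 0.5).float()
adj = ((adj + adj.t()) > 0).float()            # symmetric toy graph
planner = NeuralBellmanFord(feat_dim=16)
d = planner(node_feats, adj, goal=3)           # differentiable distances to node 3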
In this talk we will give a (necessarily very short) introduction to deep learning, i.e. learning hierarchical high-capacity models from large amounts of data. After a short explanation of deep networks, gradient backpropagation and its implementation through autograd in a standard deep learning framework, the talk will focus on two points: (i) the visualization and transfer of learned knowledge from source data to a target application, including efforts to model the shift in distribution, and (ii) the combination of deep learning with more traditional models in the context of robot and vehicle navigation.
In this talk we address the problem of automatically learning the behavior of intelligent agents navigating in 3D environments from interactions with Deep Reinforcement Learning. We discuss the reasoning capabilities required for these problems: reasoning about the presence of objects and actors in a scene and taking planning and control decisions. We present a new benchmark and a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments.
We propose a method which structures its state as a metric map in a bird's-eye view, dynamically updated through affine transforms given ego-motion. The semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias in deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself. We show that this kind of geometric structure significantly improves the agent's capability of storing objects and their locations, and we visualize this reasoning in concrete scenarios.
Humans are able to infer what happened in a video given only a few sample frames. This faculty is called reasoning and is a key component of human intelligence. A detailed understanding requires reasoning over semantic structures, determining which objects were involved in interactions, of what nature, and what the results of these were. To compound the problem, the semantic structure of a scene may change and evolve. In this talk we present research in high-level reasoning from images and videos, with the goals of understanding visual content (scene comprehension), making predictions of probable future outcomes, or acting in simulated environments based on visual observations. We present neural models addressing these goals through structured deep learning, i.e. inductive biases in deep neural networks which explicitly model object relationships. We learn these models from data or from interactions between an agent and an environment.
Many AI applications known to the general public rely on machine learning: automatic interpretation of medical X-rays by a "robot radiologist", computer vision for autonomous vehicles, intelligent chatbots for managing online services...
In many of these applications, the data are signals (images, sounds, videos) produced by a physical sensor. These applications therefore naturally overlap with those dear to our community, such as image or sound recognition (segmentation, face recognition, music indexing), the enhancement of this content (denoising, super-resolution), or, more generally, the exploitation of these data (detection of objects and of actions in videos). Machine learning methods also rely on concepts and methods shared with the signal processing community (filtering, optimization, multi-scale decompositions, information theory, etc.).
Signal processing is involved upstream of machine learning (data acquisition), downstream of it (exploitation of the useful information), and also intertwined with it. Today, the question therefore arises of the identity of the signal processing community in the era of machine learning.
The objective of this round table is to explore the connections between machine learning and signal processing, the individual specificities of these two fields and their current limits. To address the many questions surrounding this "je t'aime - moi non plus" (love-hate) relationship, we have invited a panel of experts:
Caroline Chaux, CNRS Research Scientist, I2M, Aix-Marseille Université (optimization)
Rémi Gribonval, Research Director, Inria Rennes (signal representation and learning)
Olivier Pietquin, Research Director, Google Brain Paris (language and learning)
Jean Ponce, Research Director, Inria and the Computer Science Department of ENS Paris (computer vision and learning)
Nicolas Thome, Professor at the Conservatoire National des Arts et Métiers (CNAM) Paris, Cédric laboratory (computer vision and learning)
Christian Wolf, Associate Professor, INSA Lyon, LIRIS and Inria Grenoble (computer vision and learning)
In this talk we address perception and navigation problems in robotics settings, in particular mobile terrestrial robots and intelligent vehicles. We focus on learning structured representations, which allow high-level reasoning about the presence of objects and actors in a scene and support planning and control decisions. Two different methods will be compared, both of which structure their state as metric maps in a bird's-eye view, updated through affine transforms given ego-motion.
The first method combines Bayesian filtering and Deep Learning to fuse LIDAR input and monocular RGB input, resulting in a semantic occupancy grid centered on a vehicle. A deep network is trained to segment the RGB input and to fuse it with Bayesian occupancy grids. The second method automatically learns robot navigation in 3D environments from interactions and reward using Deep Reinforcement Learning. Similar to the first method, it keeps a metric map of the environment in a bird's-eye view, which is dynamically updated. Different from the first method, the semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias in deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself.
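A minimal sketch of the kind of fusion used in the first method, assuming the grids are already aligned in the same bird's-eye-view frame; the inverse sensor model and the RGB segmentation network are abstracted away, and all names are illustrative:

# Sketch (illustrative): Bayesian fusion of a LIDAR occupancy measurement with a
# semantic probability predicted from RGB segmentation, in log-odds form.
import torch

def logit(p, eps=1e-6):
    p = p.clamp(eps, 1 - eps)
    return torch.log(p) - torch.log(1 - p)

def fuse_grids(prior_logodds, lidar_occ_prob, rgb_semantic_prob):
    # All arguments are (H, W) grids in the same bird's-eye-view frame.
    # Classical Bayesian occupancy update: log-odds of (assumed independent)
    # measurements are additive.
    return prior_logodds + logit(lidar_occ_prob) + logit(rgb_semantic_prob)

H, W = 128, 128
prior = torch.zeros(H, W)                        # log-odds 0  <=>  p = 0.5
lidar = torch.full((H, W), 0.5); lidar[60:70, 60:70] = 0.9
rgb   = torch.full((H, W), 0.5); rgb[60:70, 60:70] = 0.8
post  = torch.sigmoid(fuse_grids(prior, lidar, rgb))   # back to probabilities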
We also present a new benchmark and a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24 hours. Solving our tasks requires substantially more complex reasoning capabilities than the standard benchmarks available for this kind of environment.
Humans are able to infer what happens in a scene from a few sample glimpses, and they are able to take decisions on actions taking into account observations, context and memory. This faculty is called reasoning and is a key component of human intelligence. A detailed understanding requires reasoning over semantic structures, determining which objects were involved in interactions, of what nature, and what the results of these were. To compound the problem, the semantic structure of a scene may change and evolve. In this talk we present research in artificial intelligence, in particular in high-level reasoning from images and videos, with the goals of understanding visual content (scene comprehension), making predictions of probable future outcomes, or acting in real or simulated environments based on visual observations. We present (artificial) neural models addressing these goals through structured deep learning, i.e. inductive biases in deep neural networks which explicitly model ego-centric or allo-centric spatial representations, attention mechanisms, and object relationships. We learn these models from data or from interactions between an agent and simulated or real environments, and we show visualizations of these mechanisms indicating the reasoning capabilities the agents have learned from data.
The field of computer vision addresses the high-level understanding of visual content, such as images and video sequences from various media: cameras embedded in telephones, robots or smart cars, multimedia content (television films and newscasts), digital documents, medical imaging, etc. The problems are diverse, ranging from simple image classification and the recognition of objects, gestures and human activities, to structured prediction and the detailed understanding of a scene: identification of all the actors in a scene, an estimate of their dense pose, relationships between the actors and possibly the objects of the scene, dense labeling of all the elements of the scene; for some applications, the reconstruction of a 3D model of the scene is necessary.
The main scientific challenge is the semantic gap between low-level input signals and the semantic predictions, for example the class of objects to be recognized. Machine Learning from large amounts of data has been a major driving force of the evolution of the field in recent years, with a significant impact both on the academic world and on the industrial world. Deep learning has established itself as a reference method for a large number of problems by winning important scientific competitions.
This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to the general challenges in machine learning on high-dimensional input, like images, signals and video sequences. We will present common deep models like convolutional neural networks and recurrent networks, and various widely used standard tools and problems, like attention mechanisms and transfer learning. Several applications will be presented: gesture recognition, human pose estimation, mobile robotics, and the automatic identification of smartphone users. We will finish with links to neuroscience and parallels between human or biological learning and machine learning.
Representation Learning, or Deep Learning, consists in automatically learning layered and hierarchical representations with various levels of abstraction from large amounts of data. This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to the general challenges in machine learning on high-dimensional input, like images, signals and text. We will present common deep models like convolutional neural networks and recurrent networks, and various widely used standard tools and problems, like attention mechanisms, transfer learning and learning structured output. Implementing these models in deep learning frameworks (TensorFlow, PyTorch) will be briefly touched upon. Finally, we will go into more detail on some selected applications in computer vision and signal processing.
Visual data consists of massive amounts of variables, and making sense of their content requires modeling their complex dependencies and relationships. This talk presents an overview of our past activities, which aim at enforcing coherence in this large ensemble of observed and latent variables and at inferring estimates from it. In particular, the presentation deals with work on attention mechanisms for video analysis, where structure in the data is not imposed but predicted from the input through a fully trained model.
Application-wise, we address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We explore how articulated pose can be complemented, and in some cases replaced, by mechanisms which draw attention to local positions in space and time. This makes it possible to model interactions between humans and relevant objects in the scene, as well as regularities between objects themselves.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
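A simplified sketch of such a glimpse sensor (PyTorch assumed; patch size, encoder and attention head are invented for illustration and are not the exact architecture of the talk):

# Sketch (illustrative): crop patches around hand positions given by the pose
# stream, encode them, and softly attend over the four hands.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandGlimpse(nn.Module):
    def __init__(self, patch=32, feat=128):
        super().__init__()
        self.patch = patch
        self.encoder = nn.Sequential(                 # tiny patch encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat))
        self.attn = nn.Linear(feat, 1)                # scores one glimpse

    def forward(self, frame, hand_xy):
        # frame: (3, H, W); hand_xy: (4, 2) pixel coordinates of the 4 hands.
        half = self.patch // 2
        _, H, W = frame.shape
        feats = []
        for (x, y) in hand_xy.round().long():
            x = int(x.clamp(half, W - half)); y = int(y.clamp(half, H - half))
            patch = frame[:, y - half:y + half, x - half:x + half]
            feats.append(self.encoder(patch.unsqueeze(0)))
        feats = torch.cat(feats, dim=0)               # (4, feat)
        # Soft attention over the four hands; the weights can change per frame.
        w = F.softmax(self.attn(feats), dim=0)
        return (w * feats).sum(dim=0)                 # attended glimpse feature

sensor = HandGlimpse()
frame = torch.rand(3, 240, 320)
hands = torch.tensor([[100., 120.], [150., 130.], [200., 110.], [60., 140.]])
feature = sensor(frame, hands)                        # (128,) per-frame feature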
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Application-wise, we address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. Our method has been designed to explicitly remove the dependency on pose at test time, making it more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the major challenges and a few key techniques.
It will then present a recent concept, attention mechanisms. Like a human scanning a scene through eye movements, these methods allow a neural network to focus on a relevant part of its input data: a part of a face for face recognition, or a part of a sentence for translation.
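A minimal sketch of the underlying mechanism, scaled dot-product attention, with invented dimensions (PyTorch assumed):

# Sketch (illustrative): scaled dot-product attention, the building block that
# lets a network weight the parts of its input differently for each query.
import math
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    # query: (d,); keys: (n, d); values: (n, d_v).
    scores = keys @ query / math.sqrt(query.shape[0])   # one score per input part
    weights = F.softmax(scores, dim=0)                  # where to "look"
    return weights @ values, weights

# Example: 5 input elements (e.g. words or image regions) of dimension 16.
keys = torch.randn(5, 16)
values = torch.randn(5, 32)
query = torch.randn(16)
context, weights = attention(query, keys, values)       # weights sum to 1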
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the major challenges and a few key techniques.
It will then present a recent concept, attention mechanisms. Like a human scanning a scene through eye movements, these methods allow a neural network to focus on a relevant part of its input data: a part of a face for face recognition, or a part of a sentence for translation.
Visual data consists of massive amounts of variables, and making sense of their content requires modeling their complex dependencies and relationships. This talk presents an overview of our past activities, which aim at enforcing coherence in this large ensemble of observed and latent variables and at inferring estimates from it. After a very brief overview of earlier work on structured models (graphs and graphical models), I will present my contributions in Representation Learning (also known by its more popular title « Deep Learning »), which consists in automatically learning layered and hierarchical representations with various levels of abstraction directly from large amounts of data.
A unifying thread of my research consists in integrating structure into deep neural networks in various forms and with different objectives: structure in the output space, typically in the form of spatial and temporal relationships in the measurements, or structure in the label space itself, often geometrical or topological. The presentation will conclude with an overview of our work on attention mechanisms for video analysis, where structure in the data is not imposed but predicted from input through a fully trained model.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Representation Learning (also known by its more popular title « Deep Learning ») consists in automatically learning layered and hierarchical representations with various levels of abstraction from large amounts of data. This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to common deep models like convolutional neural networks and recurrent networks, before going into more detail on some selected applications in signal processing.
In particular, we present a large-scale study exploring the capability of temporal deep neural networks to interpret natural human kinematics, and we introduce the first method for active biometric authentication with mobile inertial sensors. This work has been done in collaboration with Google, where a first-of-its-kind dataset of human movements has been passively collected by 1500 volunteers using their smartphones daily over several months. We propose an optimized shift-invariant dense convolutional mechanism (DCWRNN) and incorporate the discriminatively trained dynamic features in a probabilistic generative framework taking temporal characteristics into account. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.
This talk is devoted to (deep) learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals; (ii) human activity recognition using models of visual attention; (iii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iv) mobile biometrics, in particular the automatic authentication of smartphone users through learning from data acquired with inertial sensors.
In this talk we present recurrent neural networks and variants (LSTM, 2D-LSTM, CWRNN, DCWRNN) and show how these networks can model spatial and temporal information for two different applications: (i) object detection and (ii) human identification from motion.
For application (i), we propose a new neural model which directly predicts bounding box coordinates. The particularity of our contribution lies in the local computation of predictions, with a new form of local parameter sharing which keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data is not as abundant as in the classical configuration of natural images and ImageNet/Pascal VOC tasks.
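A simplified sketch of the idea of propagating spatial context with recurrent layers, here with one horizontal and one vertical LSTM sweep rather than a full 2D-LSTM; all names and sizes are illustrative (PyTorch assumed):

# Sketch (simplified, not the exact 2D-LSTM of the talk): spreading contextual
# information across a feature map with row and column recurrent sweeps.
import torch
import torch.nn as nn

class SpatialContext(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.row_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, fmap):
        # fmap: (B, C, H, W)  ->  (B, 2*hidden, H, W) with spatial context.
        B, C, H, W = fmap.shape
        rows = fmap.permute(0, 2, 3, 1).reshape(B * H, W, C)   # every row is a sequence
        rows, _ = self.row_rnn(rows)
        rows = rows.reshape(B, H, W, -1)
        cols = rows.permute(0, 2, 1, 3).reshape(B * W, H, -1)  # every column is a sequence
        cols, _ = self.col_rnn(cols)
        return cols.reshape(B, W, H, -1).permute(0, 3, 2, 1)

ctx = SpatialContext(channels=64, hidden=32)
x = torch.randn(2, 64, 16, 20)
y = ctx(x)   # (2, 64, 16, 20): each location now "sees" its row and column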
As for application (ii), we present a large-scale study in interpreting natural human kinematics and active biometric authentication with mobile inertial sensors. We propose an optimized shift-invariant dense convolutional mechanism (DCWRNN) and incorporate the discriminatively trained dynamic features in a probabilistic generative framework taking temporal characteristics into account. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.
We first present a very brief introduction to graphical models and their principal inference algorithms. We then present the differences with deep networks and give concrete examples for each family (deformable parts models, kinematic trees, attention mechanisms, etc.). We will finish with some parallels to human psychology and human thinking.
We will first present a brief introduction to common deep models for computer vision, like convolutional neural networks and recurrent networks, and the main challenges of the field.
The second part is devoted to developing learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals (such as video, depth and mocap data), where we will present a training strategy for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation; (ii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iii) mobile biometrics, in particular the automatic authentication of smartphone users through deep learning from data acquired with inertial sensors.
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the main actors and the major challenges. Key techniques are briefly sketched, followed by some results on various applications such as object recognition, human-machine interfaces and mobile applications.
We will first present a brief introduction to common deep models for computer vision, like convolutional neural networks and recurrent networks, and the main challenges of the field.
The second part is devoted to developing learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals (such as video, depth and mocap data), where we will present a training strategy for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation; (ii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iii) mobile biometrics, in particular the automatic authentication of smartphone users through deep learning from data acquired with inertial sensors.
In this talk I present a brief overview of the differences between random forests and deep networks, and I then present some recent work by Microsoft Research (with which I am not affiliated) on combining random forests with Deep Learning.