Slides for "regular" presentations can be found in the publications section, associated with their papers.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, awareness of the dynamics of the environment, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics, embodied computer vision and (intuitive) physics. We will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, awareness of the dynamics of the environment, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics, embodied computer vision and (intuitive) physics. We will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning, and transfer the acquired knowledge to real robots operating in physical environments. We cover the learning of physical phenomena like fluid dynamics, with applications to UAV control. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations in robotics and embodied computer vision. In particular, we will present solutions for learning robot navigation in complex 3D environments, where we combine training in simulation with classical mapping and planning, and transfer the acquired knowledge to real robots operating in physical environments. We will also showcase the fleet of autonomous robots operated by Naver Labs Korea in Seoul in the world's first robot-friendly building.
In this talk we address perception and navigation problems in robotics settings, in particular mobile terrestrial robots and intelligent vehicles acting in 3D environments from visual input. We focus on learning structured representations, which allow high-level reasoning about the presence of objects and actors in a scene and support planning and control decisions. In particular, we compare different ways to design inductive biases for deep reinforcement learning: neural metric maps, neural topological maps, and neural implicit representations.
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision. In particular, we will present solutions for learning robot navigation in complex 3D environments, where we combine large-scale training in simulation and virtual environments with the transfer of the acquired knowledge to real robots operating in physical environments.
High-capacity deep networks trained on large-scale data are increasingly used to learn agents capable of automatically taking complex decisions on high-dimensional data like text, images and videos. Certain applications require robustness: the capacity to take the right decisions at the right moments, with high risks associated with wrong decisions. We require these agents to acquire the right kind of reasoning capabilities, i.e. to take decisions for the reasons the designers had in mind. This is made difficult by the diminishing role experts have in the design and engineering process, as the agents' decisions are in large part dominated by the impact of the training data.
In this talk we will address the problem of learning explainable and interpretable models, in particular deep networks. We start by exploring the question of explainability in a broader sense, in terms of feasibility and trade-offs.
We then focus on visual reasoning tasks and target different situations in which agents need to acquire a certain approximation of common knowledge, including robotics and vision-and-language reasoning. We explore this problem in a holistic way and study it from various angles: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning?
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning? Can we structure neural networks with inductive biases to improve the emergence of reasoning?
In this lecture we will discuss models and algorithms for deep learning, a variant of machine learning which emphasizes the learning of high-capacity neural networks from large amounts of data, most often embedded in high-dimensional spaces (images, signals, audio, language, etc.). We will go over the different model variants used for different tasks, and link their structure and inductive biases to symmetries and invariances we want to enforce in the data, and to high-level goals: multi-layer perceptrons, convolutional neural networks, recurrent neural networks, graph networks and transformers (attention mechanisms).
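As a small, purely illustrative sketch of the link between architecture and symmetry mentioned above (assuming PyTorch; the image, filter and shift are arbitrary), the following snippet checks numerically that a convolution with circular padding commutes with a horizontal translation of its input:

# Illustrative only: translation equivariance of a convolution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 32, 32)   # one single-channel "image"
k = torch.randn(1, 1, 3, 3)     # one 3x3 filter

def conv_circ(t):
    # Convolution with circular padding, so the translation symmetry is exact.
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode='circular'), k)

def shift(t):
    # Horizontal translation by 5 pixels (wrapping around).
    return torch.roll(t, shifts=5, dims=-1)

# Convolving a shifted image equals shifting the convolved image.
print(torch.allclose(conv_circ(shift(x)), shift(conv_circ(x)), atol=1e-5))  # True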
We will link models and algorithms to an important subgoal of AI, namely the creation of intelligent agents, which require high-level reasoning capabilities and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
On the application side, we will cover the automatic learning of reasoning capabilities in different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning?
The second part of this lecture deals with more practical and hands-on aspects of deep learning and covers differentiable programming, the basis for the implementation of deep neural networks which can be trained by gradient descent. We will learn how to represent tensors in PyTorch, one of the most widely used deep learning frameworks. A particular emphasis will be put on autograd, the automatic differentiation of models based on the backpropagation algorithm applied to dynamically created computation graphs. Interested participants with access to a laptop can apply these techniques to a small toy problem during the lecture.
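As a minimal sketch of the kind of toy problem participants could try during the lecture (PyTorch assumed; the data and parameters are invented for illustration), the following code fits a line by gradient descent, with autograd building a computation graph in the forward pass and backpropagating through it:

# Minimal autograd sketch (illustrative only): fit y = w*x + b by gradient descent.
import torch

# Toy data: y = 3x - 1 with a little noise.
x = torch.linspace(-1, 1, 100)
y = 3 * x - 1 + 0.05 * torch.randn(100)

# Parameters are leaf tensors tracked by autograd.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    pred = w * x + b                    # forward pass builds a dynamic graph
    loss = ((pred - y) ** 2).mean()
    loss.backward()                     # backpropagation through the graph
    with torch.no_grad():               # gradient descent step, outside the graph
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())               # close to 3 and -1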
An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and shortcuts in the training data picked up through low-level statistics, which can lead to dramatic losses in generalization beyond the training data.
In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, like robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision-and-language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning? Can we structure neural networks with inductive biases to improve the emergence of reasoning?
Several case studies will be presented, illustrating academic/industrial collaborations on the use of deep learning: automatic processing of digitized documents (text block detection, character recognition); human-machine interfaces (navigation in 3D environments on touch tables); visual question answering systems.
In this talk we address perception and navigation problems in robotics, in particular mobile terrestrial robots and intelligent vehicles. We focus on learning structured representations, which allow complex reasoning, planning and control. Our control policies are automatically learned from interactions with photo-realistic 3D environments using Deep Reinforcement Learning. While classical methods learn a representation of the history of observations in the form of a flat vectorial hidden state, we propose two different methods, which structure memory by imbuing neural networks with inductive biases of different kinds.
The first method structures its hidden state as a metric map in a bird's-eye view, updated through affine transforms given ego-motion. The semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias for deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself.
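A rough sketch of this kind of map update, assuming a PyTorch tensor as the egocentric map and a simple planar ego-motion; the function and its parameters are illustrative, not the exact model of the talk:

# Sketch (illustrative): warp an egocentric bird's-eye-view feature map
# according to ego-motion, using a differentiable affine transform.
import math
import torch
import torch.nn.functional as F

def egomotion_warp(feature_map, dx, dy, dtheta, meters_per_cell):
    # feature_map: (B, C, H, W); dx, dy in meters; dtheta in radians.
    B, C, H, W = feature_map.shape
    cos, sin = math.cos(dtheta), math.sin(dtheta)
    # Translation expressed in normalized grid coordinates ([-1, 1]).
    tx = 2.0 * dx / (W * meters_per_cell)
    ty = 2.0 * dy / (H * meters_per_cell)
    theta = torch.tensor([[cos, -sin, -tx],
                          [sin,  cos, -ty]], dtype=feature_map.dtype)
    theta = theta.unsqueeze(0).expand(B, -1, -1)
    grid = F.affine_grid(theta, feature_map.shape, align_corners=False)
    # Bilinear resampling keeps the update differentiable end-to-end.
    return F.grid_sample(feature_map, grid, align_corners=False)

m = torch.zeros(1, 32, 128, 128)   # learned map content, 32 feature channels
m = egomotion_warp(m, dx=0.2, dy=0.0, dtheta=0.05, meters_per_cell=0.05)

Because the warp is differentiable, gradients from the task reward can flow back into whatever network writes features into the map.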
The second method introduces a differentiable topological representation, i.e. memory in graph form. Here, our main contribution is a data-driven approach for planning under uncertainty, requiring an estimate of shortest paths in valued graphs with a probabilistic structure. Whereas classical symbolic algorithms achieve optimal results on noiseless topologies, or optimal results in a probabilistic sense on graphs with probabilistic structure, we aim to show that machine learning can overcome missing information in the graph by taking into account rich high-dimensional node features, for instance visual information available at each location of the map. Compared to purely learned neural white box algorithms, we structure our neural model with an inductive bias for dynamic-programming-based shortest path algorithms, and we show that a particular parameterization of our neural model corresponds to the Bellman-Ford algorithm. Through an empirical analysis in simulated photo-realistic 3D environments, we demonstrate that the learned neural planner, by including visual features, outperforms classical symbolic solutions for graph-based planning.
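The following is a hedged sketch of the general idea rather than the exact model of the talk: edge costs are predicted from high-dimensional node features, and a Bellman-Ford-style relaxation is unrolled with a soft minimum so that the planner stays differentiable (PyTorch assumed; all names and sizes are illustrative):

# Sketch (illustrative): a differentiable, Bellman-Ford-style neural planner.
import torch
import torch.nn as nn

class NeuralBellmanFord(nn.Module):
    def __init__(self, feat_dim, hidden=64, iterations=20, tau=0.1):
        super().__init__()
        # Predicts a cost for edge (i, j) from the two node feature vectors.
        self.edge_cost = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.iterations = iterations
        self.tau = tau

    def forward(self, node_feats, adjacency, goal):
        # node_feats: (N, F); adjacency: (N, N) in {0, 1}; goal: node index.
        N = node_feats.shape[0]
        pairs = torch.cat([node_feats.unsqueeze(1).expand(N, N, -1),
                           node_feats.unsqueeze(0).expand(N, N, -1)], dim=-1)
        cost = self.edge_cost(pairs).squeeze(-1).abs()          # non-negative costs
        cost = cost.masked_fill(adjacency == 0, float('inf'))   # only real edges

        dist = torch.full((N,), float('inf'))
        dist[goal] = 0.0
        for _ in range(self.iterations):
            # Relaxation: dist[i] <- softmin_j (cost[i, j] + dist[j]).
            cand = cost + dist.unsqueeze(0)
            soft = -self.tau * torch.logsumexp(-cand / self.tau, dim=1)
            dist = torch.minimum(dist, soft)
        return dist   # soft distance-to-goal for every node

node_feats = torch.randn(6, 16)                # e.g. visual features per node
adj = (torch.rand(6, 6) < 0.5).float()
adj = ((adj + adj.t()) > 0).float()            # symmetric toy graph
planner = NeuralBellmanFord(feat_dim=16)
d = planner(node_feats, adj, goal=3)           # differentiable distances to node 3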
In this talk we will give a (necessarily very short) introduction to deep learning, i.e. learning hierarchical high-capacity models from large amounts of data. After a short explanation of deep networks, gradient backpropagation and its implementation through autograd in a standard deep learning framework, the talk will focus on two points: (i) the visualization and transfer of learned knowledge from source data to a target application, including efforts to model the shift in distribution, and (ii) the combination of deep learning with more traditional models in the context of robot and vehicle navigation.
In this talk we address the problem of automatically learning the behavior of intelligent agents navigating in 3D environments from interactions with Deep Reinforcement Learning. We discuss the reasoning capabilities required for these problems: reasoning about the presence of objects and actors in a scene and taking planning and control decisions. We present a new benchmark and a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments.
We propose a method which structures its state as a metric map in a bird's-eye view, dynamically updated through affine transforms given ego-motion. The semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias in deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself. We show that this kind of geometric structure significantly improves the agent's capability of storing objects and their locations, and we visualize this reasoning in concrete scenarios.
Humans are able to infer what happened in a video given only a few sample frames. This faculty is called reasoning and is a key component of human intelligence. A detailed understanding requires reasoning over semantic structures, determining which objects were involved in interactions, of what nature, and what the results of these were. To compound the problem, the semantic structure of a scene may change and evolve. In this talk we present research in high-level reasoning from images and videos, with the goals of understanding visual content (scene comprehension), making predictions of probable future outcomes, or acting in simulated environments based on visual observations. We present neural models addressing these goals through structured deep learning, i.e. inductive biases in deep neural networks which explicitly model object relationships. We learn these models from data or from interactions between an agent and an environment.
Many AI applications known to the general public rely on machine learning: automatic interpretation of medical X-rays by a "robot radiologist", computer vision for autonomous vehicles, intelligent chatbots for managing online services...
In many of these applications, the data are signals (images, sounds, videos) produced by a physical sensor. These applications therefore naturally overlap with those dear to our community, such as image or sound recognition (segmentation, face recognition, music indexing), the enhancement of this content (denoising, super-resolution), or, more generally, the exploitation of these data (detection of objects and of actions in videos). Machine learning methods also rely on concepts and methods shared with the signal processing community (filtering, optimization, multi-scale decompositions, information theory, etc.).
Signal processing is involved upstream of machine learning (data acquisition), downstream of it (exploitation of the useful information), and also intertwined with it. Today, the question therefore arises of the identity of the signal processing community in the era of machine learning.
The objective of this round table is to explore the connections between machine learning and signal processing, the individual specificities of these two fields and their current limits. To address the many questions surrounding this "je t'aime - moi non plus" (love-hate) relationship, we have invited a panel of experts:
Caroline Chaux, CNRS Research Scientist, I2M, Aix-Marseille Université (optimization)
Rémi Gribonval, Research Director, Inria Rennes (signal representation and learning)
Olivier Pietquin, Research Director, Google Brain Paris (language and learning)
Jean Ponce, Research Director, Inria and the Computer Science Department of ENS Paris (computer vision and learning)
Nicolas Thome, Professor at the Conservatoire National des Arts et Métiers (CNAM) Paris, Cédric laboratory (computer vision and learning)
Christian Wolf, Associate Professor, INSA Lyon, LIRIS and Inria Grenoble (computer vision and learning)
In this talk we address perception and navigation problems in robotics settings, in particular mobile terrestrial robots and intelligent vehicles. We focus on learning structured representations, which allow high-level reasoning about the presence of objects and actors in a scene and support planning and control decisions. Two different methods will be compared, both of which structure their state as metric maps in a bird's-eye view, updated through affine transforms given ego-motion.
The first method combines Bayesian filtering and Deep Learning to fuse LIDAR input and monocular RGB input, resulting in a semantic occupancy grid centered on a vehicle. A deep network is trained to segment the RGB input and to fuse it with Bayesian occupancy grids. The second method automatically learns robot navigation in 3D environments from interactions and reward using Deep Reinforcement Learning. Similar to the first method, it keeps a metric map of the environment in a bird's-eye view, which is dynamically updated. Different from the first method, the semantic meaning of the map's content is not determined beforehand or learned from supervision. Instead, projective geometry is used as an inductive bias in deep neural networks. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself.
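A minimal sketch of the kind of fusion used in the first method, assuming the grids are already aligned in the same bird's-eye-view frame; the inverse sensor model and the RGB segmentation network are abstracted away, and all names are illustrative:

# Sketch (illustrative): Bayesian fusion of a LIDAR occupancy measurement with a
# semantic probability predicted from RGB segmentation, in log-odds form.
import torch

def logit(p, eps=1e-6):
    p = p.clamp(eps, 1 - eps)
    return torch.log(p) - torch.log(1 - p)

def fuse_grids(prior_logodds, lidar_occ_prob, rgb_semantic_prob):
    # All arguments are (H, W) grids in the same bird's-eye-view frame.
    # Classical Bayesian occupancy update: log-odds of (assumed independent)
    # measurements are additive.
    return prior_logodds + logit(lidar_occ_prob) + logit(rgb_semantic_prob)

H, W = 128, 128
prior = torch.zeros(H, W)                        # log-odds 0  <=>  p = 0.5
lidar = torch.full((H, W), 0.5); lidar[60:70, 60:70] = 0.9
rgb   = torch.full((H, W), 0.5); rgb[60:70, 60:70] = 0.8
post  = torch.sigmoid(fuse_grids(prior, lidar, rgb))   # back to probabilities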
We also present a new benchmark and a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24 hours. Solving our tasks requires substantially more complex reasoning capabilities than the standard benchmarks available for this kind of environment.
Humans are able to infer what happens in a scene from a few sample glimpses, and they are able to take decisions on actions taking into account observations, context and memory. This faculty is called reasoning and is a key component of human intelligence. A detailed understanding requires reasoning over semantic structures, determining which objects were involved in interactions, of what nature, and what the results of these were. To compound the problem, the semantic structure of a scene may change and evolve. In this talk we present research in artificial intelligence, in particular in high-level reasoning from images and videos, with the goals of understanding visual content (scene comprehension), making predictions of probable future outcomes, or acting in real or simulated environments based on visual observations. We present (artificial) neural models addressing these goals through structured deep learning, i.e. inductive biases in deep neural networks which explicitly model ego-centric or allo-centric spatial representations, attention mechanisms, and object relationships. We learn these models from data or from interactions between an agent and simulated or real environments, and we show visualizations of these mechanisms indicating the reasoning capabilities the agents have learned from data.
The field of computer vision addresses the high-level understanding of visual content, such as images and video sequences from various media: cameras embedded in telephones, robots or smart cars, multimedia content (television films and newscasts), digital documents, medical imaging, etc. The problems are diverse, ranging from simple image classification and the recognition of objects, gestures and human activities, to structured prediction and the detailed understanding of a scene: identification of all the actors in a scene, an estimate of their dense pose, relationships between the actors and possibly the objects of the scene, dense labeling of all the elements of the scene; for some applications, the reconstruction of a 3D model of the scene is necessary.
The main scientific challenge is the semantic gap between low-level input signals and the semantic predictions, for example the class of objects to be recognized. Machine Learning from large amounts of data has been a major driving force of the evolution of the field in recent years, with a significant impact both on the academic world and on the industrial world. Deep learning has established itself as a reference method for a large number of problems by winning important scientific competitions.
This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to the general challenges in machine learning on high-dimensional input, like images, signals and video sequences. We will present common deep models like convolutional neural networks and recurrent networks, and various widely used standard tools and problems, like attention mechanisms and transfer learning. Several applications will be presented: gesture recognition, human pose estimation, mobile robotics, and the automatic identification of smartphone users. We will finish with links to neuroscience and parallels between human or biological learning and machine learning.
Representation Learning, or Deep Learning, consists in automatically learning layered and hierarchical representations with various levels of abstraction from large amounts of data. This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to the general challenges in machine learning on high-dimensional input, like images, signals and text. We will present common deep models like convolutional neural networks and recurrent networks, and various widely used standard tools and problems, like attention mechanisms, transfer learning and learning structured output. Implementing these models in deep learning frameworks (TensorFlow, PyTorch) will be briefly touched upon. Finally, we will go into more detail on some selected applications in computer vision and signal processing.
Visual data consists of massive amounts of variables, and making sense of their content requires modeling their complex dependencies and relationships. This talk presents an overview of our past activities, which aim at enforcing coherence in this large ensemble of observed and latent variables and at inferring estimates from it. In particular, the presentation deals with work on attention mechanisms for video analysis, where structure in the data is not imposed but predicted from the input through a fully trained model.
Application-wise, we address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We explore how articulated pose can be complemented, and in some cases replaced, by mechanisms which draw attention to local positions in space and time. This makes it possible to model interactions between humans and relevant objects in the scene, as well as regularities between objects themselves.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
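A simplified sketch of such a glimpse sensor (PyTorch assumed; patch size, encoder and attention head are invented for illustration and are not the exact architecture of the talk):

# Sketch (illustrative): crop patches around hand positions given by the pose
# stream, encode them, and softly attend over the four hands.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandGlimpse(nn.Module):
    def __init__(self, patch=32, feat=128):
        super().__init__()
        self.patch = patch
        self.encoder = nn.Sequential(                 # tiny patch encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat))
        self.attn = nn.Linear(feat, 1)                # scores one glimpse

    def forward(self, frame, hand_xy):
        # frame: (3, H, W); hand_xy: (4, 2) pixel coordinates of the 4 hands.
        half = self.patch // 2
        _, H, W = frame.shape
        feats = []
        for (x, y) in hand_xy.round().long():
            x = int(x.clamp(half, W - half)); y = int(y.clamp(half, H - half))
            patch = frame[:, y - half:y + half, x - half:x + half]
            feats.append(self.encoder(patch.unsqueeze(0)))
        feats = torch.cat(feats, dim=0)               # (4, feat)
        # Soft attention over the four hands; the weights can change per frame.
        w = F.softmax(self.attn(feats), dim=0)
        return (w * feats).sum(dim=0)                 # attended glimpse feature

sensor = HandGlimpse()
frame = torch.rand(3, 240, 320)
hands = torch.tensor([[100., 120.], [150., 130.], [200., 110.], [60., 140.]])
feature = sensor(frame, hands)                        # (128,) per-frame feature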
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Application-wise, we address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. Our method has been designed to explicitly remove the dependency on pose at test time, making it more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the major challenges and a few key techniques.
It will then present a recent concept, attention mechanisms. Like a human scanning a scene through eye movements, these methods allow a neural network to focus on a relevant part of its input data: a part of a face for face recognition, or a part of a sentence for translation.
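A minimal sketch of the underlying mechanism, scaled dot-product attention, with invented dimensions (PyTorch assumed):

# Sketch (illustrative): scaled dot-product attention, the building block that
# lets a network weight the parts of its input differently for each query.
import math
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    # query: (d,); keys: (n, d); values: (n, d_v).
    scores = keys @ query / math.sqrt(query.shape[0])   # one score per input part
    weights = F.softmax(scores, dim=0)                  # where to "look"
    return weights @ values, weights

# Example: 5 input elements (e.g. words or image regions) of dimension 16.
keys = torch.randn(5, 16)
values = torch.randn(5, 32)
query = torch.randn(16)
context, weights = attention(query, keys, values)       # weights sum to 1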
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the major challenges and a few key techniques.
It will then present a recent concept, attention mechanisms. Like a human scanning a scene through eye movements, these methods allow a neural network to focus on a relevant part of its input data: a part of a face for face recognition, or a part of a sentence for translation.
Visual data consists of massive amounts of variables, and making sense of their content requires modeling their complex dependencies and relationships. This talk presents an overview of our past activities, which aim at enforcing coherence in this large ensemble of observed and latent variables and at inferring estimates from it. After a very brief overview of earlier work on structured models (graphs and graphical models), I will present my contributions in Representation Learning (also known by its more popular title « Deep Learning »), which consists in automatically learning layered and hierarchical representations with various levels of abstraction directly from large amounts of data.
A unifying thread of my research consists in integrating structure into deep neural networks in various forms and with different objectives: structure in the output space, typically in the form of spatial and temporal relationships in the measurements, or structure in the label space itself, often geometrical or topological. The presentation will conclude with an overview of our work on attention mechanisms for video analysis, where structure in the data is not imposed but predicted from input through a fully trained model.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
We address human action recognition from RGB data and study the role of articulated pose and of visual attention mechanisms for this application. In particular, articulated pose is well established as an intermediate representation capable of providing precise cues relevant to human motion and behavior. We describe two different methods which use pose in different ways, either during training and testing, or during training only.
The first method uses a trainable glimpse sensor to extract features at a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. The model not only learns to make choices relevant to the task, but also to draw attention away from joints which have been incorrectly located by the pose middleware.
A second method has been designed to explicitly remove the dependency on pose at test time, making the method more broadly applicable in situations where pose is not available. Instead, a sparse representation of focus points is calculated by a dynamic visual attention model and passed to a set of distributed recurrent neural workers. State-of-the-art results are achieved on several datasets, among which is the largest dataset for human activity recognition, namely NTU RGB+D.
Representation Learning (also known by its more popular title « Deep Learning ») consists in automatically learning layered and hierarchical representations with various levels of abstraction from large amounts of data. This presentation will review the history of the field, the main actors and the major scientific challenges. We will first present a brief introduction to common deep models like convolutional neural networks and recurrent networks, before going into more detail on some selected applications in signal processing.
In particular, we present a large-scale study exploring the capability of temporal deep neural networks to interpret natural human kinematics, and we introduce the first method for active biometric authentication with mobile inertial sensors. This work has been done in collaboration with Google, where a first-of-its-kind dataset of human movements has been passively collected by 1500 volunteers using their smartphones daily over several months. We propose an optimized shift-invariant dense convolutional mechanism (DCWRNN) and incorporate the discriminatively trained dynamic features in a probabilistic generative framework taking temporal characteristics into account. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.
This talk is devoted to (deep) learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals; (ii) human activity recognition using models of visual attention; (iii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iv) mobile biometrics, in particular the automatic authentication of smartphone users through learning from data acquired with inertial sensors.
In this talk we present recurrent neural networks and variants (LSTM, 2D-LSTM, CWRNN, DCWRNN) and show how these networks can model spatial and temporal information for two different applications: (i) object detection and (ii) human identification from motion.
For application (i), we propose a new neural model which directly predicts bounding box coordinates. The particularity of our contribution lies in the local computation of predictions, with a new form of local parameter sharing which keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data is not as abundant as in the classical configuration of natural images and ImageNet/Pascal VOC tasks.
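A simplified sketch of the idea of propagating spatial context with recurrent layers, here with one horizontal and one vertical LSTM sweep rather than a full 2D-LSTM; all names and sizes are illustrative (PyTorch assumed):

# Sketch (simplified, not the exact 2D-LSTM of the talk): spreading contextual
# information across a feature map with row and column recurrent sweeps.
import torch
import torch.nn as nn

class SpatialContext(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.row_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, fmap):
        # fmap: (B, C, H, W)  ->  (B, 2*hidden, H, W) with spatial context.
        B, C, H, W = fmap.shape
        rows = fmap.permute(0, 2, 3, 1).reshape(B * H, W, C)   # every row is a sequence
        rows, _ = self.row_rnn(rows)
        rows = rows.reshape(B, H, W, -1)
        cols = rows.permute(0, 2, 1, 3).reshape(B * W, H, -1)  # every column is a sequence
        cols, _ = self.col_rnn(cols)
        return cols.reshape(B, W, H, -1).permute(0, 3, 2, 1)

ctx = SpatialContext(channels=64, hidden=32)
x = torch.randn(2, 64, 16, 20)
y = ctx(x)   # (2, 64, 16, 20): each location now "sees" its row and column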
As for application (ii), we present a large-scale study in interpreting natural human kinematics and active biometric authentication with mobile inertial sensors. We propose an optimized shift-invariant dense convolutional mechanism (DCWRNN) and incorporate the discriminatively trained dynamic features in a probabilistic generative framework taking temporal characteristics into account. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.
We first present a very brief introduction to graphical models and their principal inference algorithms. We then present the differences with deep networks and give concrete examples for each family (deformable parts models, kinematic trees, attention mechanisms, etc.). We will finish with some parallels to human psychology and human thinking.
We will first present a brief introduction to common deep models for computer vision, like convolutional neural networks and recurrent networks, and the main challenges of the field.
The second part is devoted to developing learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals (such as video, depth and mocap data), where we will present a training strategy for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation; (ii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iii) mobile biometrics, in particular the automatic authentication of smartphone users through deep learning from data acquired with inertial sensors.
Deep representation learning (Deep Learning) is a family of methods from the field of artificial intelligence that makes it possible to learn knowledge from massive amounts of data (text, images, videos, etc.). More precisely, these models make predictions on new data. This talk will review the history of the field, the main actors and the major challenges. Key techniques are briefly sketched, followed by some results on various applications such as object recognition, human-machine interfaces and mobile applications.
We will first present a brief introduction to common deep models for computer vision, like convolutional neural networks and recurrent networks, and the main challenges of the field.
The second part is devoted to developing learning methods advancing the automatic analysis and interpretation of human motion from different perspectives and based on various sources of information, such as images, video, depth, mocap data, audio and inertial sensors. We propose several models and associated training algorithms for supervised classification and for semi-supervised and weakly-supervised feature learning, as well as for modelling temporal dependencies, and show their efficiency on a set of fundamental tasks, including detection, classification, parameter estimation and user verification.
Advances in several applications will be shown, including (i) gesture spotting and recognition based on multi-scale and multi-modal deep learning from visual signals (such as video, depth and mocap data), where we will present a training strategy for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation; (ii) hand pose estimation through deep regression from depth images, based on semi-supervised and weakly-supervised learning; (iii) mobile biometrics, in particular the automatic authentication of smartphone users through deep learning from data acquired with inertial sensors.
In this talk I present a brief overview of the differences between random forests and deep networks, and I then present some recent work by Microsoft Research (with which I am not affiliated) on combining random forests with Deep Learning.