J-X: journal paper. c-X: conference/workshop paper. arxiv: pre-print.


Guillaume Bono, Hervé Poirier, Leonid Antsfeld, Gianluca Monaci, Boris Chidlovskii, and Christian Wolf. Learning to navigate efficiently and precisely in real environments. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

In the context of autonomous navigation of terrestrial robots, the creation of realistic models for agent dynamics and sensing is a widespread habit in the robotics literature and in commercial applications, where they are used for model-based control and/or for localization and mapping. The more recent Embodied AI literature, on the other hand, focuses on modular or end-to-end agents trained in simulators like Habitat or AI-Thor, where the emphasis is put on photo-realistic rendering and scene diversity, but high-fidelity robot motion is assigned a less privileged role. The resulting sim2real gap significantly impacts transfer of the trained models to real robotic platforms. In this work we explore end-to-end training of agents in simulation in settings which minimize the sim2real gap both in sensing and in actuation. Our agent directly predicts (discretized) velocity commands, which are maintained through closed-loop control in the real robot. The behavior of the real robot (including the underlying low-level controller) is identified and simulated in a modified Habitat simulator. Noise models for odometry and localization further contribute to lowering the sim2real gap. We evaluate on real navigation scenarios, explore different localization and point goal calculation methods and report significant gains in performance and robustness compared to prior work.

Pierre Marza, Laetitia Matignon, Olivier Simonin and Christian Wolf. Task-conditioned adaptation of visual features in multi-task policy learning. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Successfully addressing a wide variety of tasks is a core ability of autonomous agents, which requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the underlying perception modules. An analogous argument would be the human visual system, which uses top-down signals to focus attention determined by the current task. Similarly, in this work, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the policy and visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks of the CortexBench benchmark and show that, compared to existing work, they can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given visual demonstrations.

Steeven Janny, Madiha Nadri, Julie Digne and Christian Wolf. Space and time continuous physics simulation from partial observations. In International Conference on Learning Representations (ICLR), 2024 (spotlight presentation).

Modern techniques for physical simulations rely on numerical schemes and mesh-refinement methods to address trade-offs between precision and complexity, but these handcrafted solutions are tedious and require high computational power. Data-driven methods based on large-scale machine learning promise high adaptivity by integrating long-range dependencies more directly and efficiently. In this work, we focus on computational fluid dynamics and address the shortcomings of a large part of the literature, which is based on fixed support for computations and predictions in the form of regular or irregular grids. We propose a novel setup to perform predictions in a continuous spatial and temporal domain while being trained on sparse observations. We formulate the task as a double observation problem and propose a solution with two interlinked dynamical systems defined on, respectively, the sparse positions and the continuous domain, which allows forecasting and interpolating a solution from the initial condition. Our practical implementation involves recurrent GNNs and a spatio-temporal attention observer capable of interpolating the solution at arbitrary locations. Our model not only generalizes to new initial conditions (as standard auto-regressive models do) but also performs evaluation at arbitrary space and time locations. We evaluate on three standard datasets in fluid dynamics and compare to strong baselines, which are outperformed in classical settings and the extended new task requiring continuous predictions.

Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel and Christian Wolf. End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon. In International Conference on Learning Representations (ICLR), 2024.

Most recent work in goal-oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy that requires solving an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion, is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.

Guillaume Bono, Leonid Antsfeld, Assem Sadek, Gianluca Monaci and Christian Wolf. Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction. In International Conference on Learning Representations (ICLR), 2024.

Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision making and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resulting in some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub-episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, significantly improving performance.

Gianluca Monaci, Leonid Antsfeld, Boris Chidlovskii and Christian Wolf. Zero-BEV: Zero-shot Projection of any First-Person Modality to BEV Maps. In International Conference on 3D Vision (3DV), 2024 (spotlight).

Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV representation, and are therefore restricted to the output modality they have been trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first-person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, e.g. RGB to occupancy. The method is general and we showcase experiments projecting to BEV three different modalities: semantic segmentation, motion vectors and object bounding boxes detected in first person. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.


Pierre Marza, Laetitia Matignon, Olivier Simonin, Dhruv Batra, Christian Wolf and Devendra Singh Chaplot. AutoNeRF: Training Implicit Scene Representations with Autonomous Agents. pre-print arxiv:2304.11241, 2023.

Implicit representations such as Neural Radiance Fields (NeRF) have been shown to be very effective at novel view synthesis. However, these models typically require manual and careful human data collection for training. In this paper, we present AutoNeRF, a method to collect data required to train NeRFs using autonomous embodied agents. Our method allows an agent to explore an unseen environment efficiently and use the experience to build an implicit map representation autonomously. We compare the impact of different exploration strategies including handcrafted frontier-based exploration and modular approaches composed of trained high-level planners and classical low-level path followers. We train these models with different reward functions tailored to this problem and evaluate the quality of the learned representations on four different downstream tasks: classical viewpoint rendering, map reconstruction, planning, and pose refinement. Empirical results show that NeRFs can be trained on actively collected data using just a single episode of experience in an unseen environment, and can be used for several downstream robotic tasks, and that modular trained exploration models significantly outperform the classical baselines.

Pierre Marza, Laetitia Matignon, Olivier Simonin and Christian Wolf. Multi-Object Navigation with dynamically learned neural implicit representations. In International Conference on Computer Vision (ICCV), 2023.

Understanding and mapping a new environment are core abilities of any autonomously navigating agent. While classical robotics usually estimates maps in a stand-alone manner with SLAM variants, which maintain a topological or metric representation, end-to-end learning of navigation keeps some form of memory in a neural network. Networks are typically imbued with inductive biases, which can range from vectorial representations to bird's-eye metric tensors or topological structures. In this work, we propose to structure neural networks with two neural implicit representations, which are learned dynamically during each episode and map the content of the scene: (i) the Semantic Finder predicts the position of a previously seen queried object; (ii) the Occupancy and Exploration Implicit Representation encapsulates information about explored area and obstacles, and is queried with a novel global read mechanism which directly maps from function space to a usable embedding space. Both representations are leveraged by an agent trained with Reinforcement Learning (RL) and learned online during each episode. We evaluate the agent on Multi-Object Navigation and show the high impact of using neural implicit representations as a memory source.

Sombit Dey, Assem Sadek, Gianluca Monaci, Boris Chidlovskii and Christian Wolf. Learning whom to trust in navigation: dynamically switching between classical and neural planning. In International Conference on Intelligent Robots and Systems (IROS), 2023.

Navigation of terrestrial robots is typically addressed either with localization and mapping (SLAM) followed by classical planning on the dynamically created maps, or by machine learning (ML), often through end-to-end training with reinforcement learning (RL) or imitation learning (IL). Recently, modular designs have achieved promising results, and hybrid algorithms that combine ML with classical planning have been proposed. Existing methods implement these combinations with hand-crafted functions, which cannot fully exploit the complementary nature of the policies and the complex regularities between scene structure and planning performance. Our work builds on the hypothesis that the strengths and weaknesses of neural planners and classical planners follow some regularities, which can be learned from training data, in particular from interactions. This is grounded on the assumption that both trained planners and the mapping algorithms underlying classical planning are subject to failure cases depending on the semantics of the scene and that this dependence is learnable: for instance, certain areas, objects or scene structures can be reconstructed more easily than others. We propose a hierarchical method composed of a high-level planner dynamically switching between a classical and a neural planner. We fully train all neural policies in simulation and evaluate the method in both simulation and real experiments with a LoCoBot robot, showing significant gains in performance, in particular in the real environment. We also qualitatively conjecture on the nature of data regularities exploited by the high-level planner.

Steeven Janny, Aurélien Bénéteau, Madiha Nadri, Julie Digne, Nicolas Thome, Christian Wolf. EAGLE: Large-scale Learning of Turbulent Fluid Dynamics with Mesh Transformers. In International Conference on Learning Representations (ICLR), 2023.

Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE: a large-scale dataset of ∼1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure of varying geometries, with 600 different scenes of three different types in total. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer outperforms the state of the art on both existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.

Assem Sadek, Guillaume Bono, Boris Chidlovskii, Atilla Baskurt, and Christian Wolf. Multi-Object Navigation in real environments using hybrid policies. In International Conference on Robotics and Automation (ICRA), 2023.

Navigation has been classically solved in robotics through the combination of SLAM and planning. More recently, beyond waypoint planning, problems involving significant components of (visual) high-level reasoning have been explored in simulated environments, mostly addressed with large-scale machine learning, in particular RL, offline-RL or imitation learning. These methods require the agent to learn various skills like local planning, mapping objects and querying the learned spatial representations. In contrast to simpler tasks like waypoint planning (PointGoal), for these more complex tasks the current state-of-the-art models have been thoroughly evaluated in simulation but, to the best of our knowledge, not yet in real environments. In this work we focus on sim2real transfer. We target the challenging Multi-Object Navigation (Multi-ON) task [41] and port it to a physical environment containing real replicas of the originally virtual Multi-ON objects. We introduce a hybrid navigation method, which decomposes the problem into two different skills: (1) waypoint navigation is addressed with classical SLAM combined with a symbolic planner, whereas (2) exploration, semantic mapping and goal retrieval are dealt with by deep neural networks trained with a combination of supervised learning and RL. We show the advantages of this approach compared to end-to-end methods both in simulation and in a real environment and outperform the SOTA for this task [28].

Nicolas Thome, Christian Wolf. Histoire des réseaux de neurones et du deep learning en traitement des signaux et des images. Oral presentation at GRETSI, 2023.

This submission traces the history of artificial neural networks for signal and image processing, from their foundations to the successes of modern deep learning. We present the key elements of the models in terms of architecture and training, and highlight the importance of Big Data and of GPU hardware acceleration. Finally, we mention the current strong trends: self-supervised learning, attention mechanisms and transformers.


Quentin Possamaï, Steeven Janny, Madiha Nadri, Laurent Bako, and Christian Wolf. Learning to estimate UAV created turbulence from scene structure observed by onboard cameras. pre-print arxiv:2203.14726, 2022.

Controlling UAV flights precisely requires a realistic dynamic model and accurate state estimates from onboard sensors like IMU, GPS and visual observations. Obtaining a precise dynamic model is extremely difficult, as important aerodynamic effects are hard to model, in particular ground effect and other turbulences. While machine learning has been used in the past to estimate UAV created turbulence, this was restricted to flat grounds or diffuse in-flight air turbulences, both without taking into account obstacles. In this work we address the complex problem of estimating in-flight turbulences caused by obstacles, in particular the complex structures in cluttered environments. We learn a mapping from control input and images captured by onboard cameras to turbulence. In a large-scale setting, we train a model over a large number of different simulated photo-realistic environments loaded into a simulator augmented with a dynamic UAV model and an analytic ground effect model. We transfer the model from simulation to a real environment and evaluate on real UAV flights from the EuRoC-MAV dataset, showing that the model is capable of good sim2real generalization performance. The dataset will be made publicly available upon acceptance.

Pierre Marza, Corentin Kervadec, Grigory Antipov, Moez Baccouche and Christian Wolf. An experimental study of the vision-bottleneck in VQA. pre-print arxiv:2202.06858, 2022.

As in many tasks combining vision and language, both modalities play a crucial role in Visual Question Answering (VQA). To properly solve the task, a given model should both understand the content of the proposed image and the nature of the question. While the fusion between modalities, which is another obviously important part of the problem, has been highly studied, the vision part has received less attention in recent work. Current state-of-the-art methods for VQA mainly rely on off-the-shelf object detectors delivering a set of object bounding boxes and embeddings, which are then combined with question word embeddings through a reasoning module. In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images. We also study the impact of two methods to incorporate the information about objects necessary for answering a question, in the reasoning module directly, and earlier in the object selection stage. This work highlights the importance of vision in the context of VQA, and the interest of tailoring vision methods used in VQA to the task at hand.

Quentin Possamaï, Steeven Janny, Guillaume Bono, Madiha Nadri, Laurent Bako, and Christian Wolf. MoCap-less Quantitative Evaluation of Ego-Pose Estimation Without Ground Truth Measurements. pre-print arxiv:2202.00403, 2022.

The emergence of data-driven approaches for control and planning in robotics has highlighted the need for developing experimental robotic platforms for data collection. However, their implementation is often complex and expensive, in particular for flying and terrestrial robots where the precise estimation of the position requires motion capture devices (MoCap) or Lidar. In order to simplify the use of a robotic platform dedicated to research on a wide range of indoor and outdoor environments, we present a data validation tool for ego-pose estimation that does not require any equipment other than the on-board camera. The method and tool allow a rapid, visual and quantitative evaluation of the quality of ego-pose sensors and are sensitive to different sources of flaws in the acquisition chain, ranging from desynchronization of the sensor flows to misevaluation of the geometric parameters of the robotic platform. Using computer vision, the information from the sensors is used to calculate the motion of a semantic scene point through its projection to the 2D image space of the on-board camera. The deviations of these keypoints from references created with a semi-automatic tool allow rapid and simple quality assessment of the data collected on the platform. To demonstrate the performance of our method, we evaluate it on two challenging standard UAV datasets as well as one dataset taken from a terrestrial robot.

Steeven Janny, Quentin Possamaï, Laurent Bako, Madiha Nadri and Christian Wolf. Learning Reduced Nonlinear State-Space Models: an Output-Error Based Canonical Approach. In Control and Decision Conference (CDC), 2022.

The identification of a nonlinear dynamic model is an open topic in control theory, especially from sparse input-output measurements. A fundamental challenge of this problem is that little to no prior knowledge is available on both the state and the nonlinear system model. To cope with this challenge, we investigate the effectiveness of deep learning in the modeling of dynamic systems with nonlinear behavior by advocating an approach which relies on three main ingredients: (i) we show that under some structural conditions on the to-be-identified model, the state can be expressed as a function of a sequence of the past inputs and outputs; (ii) this relation, which we call the state map, can be modelled by resorting to the well-documented approximation power of deep neural networks; (iii) then, taking advantage of existing learning schemes, a state-space model can finally be identified. After the formulation and analysis of the approach, we show its ability to identify three different nonlinear systems. The performance is evaluated in terms of open-loop prediction on test data generated in simulation as well as on a real-world dataset of unmanned aerial vehicle flight measurements.

Pierre Marza, Laetitia Matignon, Olivier Simonin and Christian Wolf. Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. In International Conference on Intelligent Robots and Systems (IROS), 2022.

In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object affordances. In classical Reinforcement Learning (RL) setups, this capacity is learned from reward alone. We introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, which build either an explicit or an implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input.

Steeven Janny, Fabien Baradel, Natalia Neverova, Madiha Nadri, Greg Mori, Christian Wolf. Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space. In International Conference on Learning Representations (ICLR), 2022 (oral presentation, 1.6% acceptance rate).

Learning causal relationships in high-dimensional data (images, videos) is a hard task, as they are often defined on low dimensional manifolds and must be extracted from complex signals dominated by appearance, lighting, textures and also spurious correlations in the data. We present a method for learning counterfactual reasoning of physical processes in pixel space, which requires the prediction of the impact of interventions on initial conditions. Going beyond the identification of structural relationships, we deal with the challenging problem of forecasting raw video over long horizons. Our method does not require the knowledge or supervision of any ground truth positions or other object or scene properties. Our model learns and acts on a suitable hybrid latent representation based on a combination of dense features, sets of 2D keypoints and an additional latent vector per keypoint. We show that this better captures the dynamics of physical processes than purely dense or sparse representations. We introduce a new challenging and carefully designed counterfactual benchmark for predictions in pixel space and outperform strong baselines in physics-inspired ML and video prediction.

Assem Sadek, Guillaume Bono, Boris Chidlovskii and Christian Wolf. An in-depth experimental study of sensor usage and visual reasoning of robots navigating in real environments. In International Conference on Robotics and Automation (ICRA), 2022.

Visual navigation by mobile robots is classically tackled through SLAM plus optimal planning, and more recently through end-to-end training of policies implemented as deep networks. While the former are often limited to waypoint planning, but have proven their efficiency even on real physical environments, the latter solutions are most frequently employed in simulation, but have been shown to be able to learn more complex visual reasoning, involving complex semantic regularities. Navigation by real robots in physical environments is still an open problem. End-to-end training approaches have been thoroughly tested in simulation only, with experiments involving real robots being restricted to rare performance evaluations in simplified laboratory conditions. In this work we present an in-depth study of the performance and reasoning capacities of real physical agents, trained in simulation and deployed to two different physical environments. Beyond benchmarking, we provide insights into the generalization capabilities of different agents trained in different conditions. We visualize sensor usage and the importance of the different types of signals. We show that, for the PointGoal task, an agent pre-trained on a wide variety of tasks and fine-tuned on a simulated version of the target environment can reach competitive performance without modelling any sim2real transfer, i.e. by deploying the trained agent directly from simulation to a real physical robot.

Edward Beeching, Maxim Peter, Philippe Marcotte, Jilles Dibangoye, Olivier Simonin, Joshua Romoff and Christian Wolf. Graph augmented Deep Reinforcement Learning in the GameRLand3D environment. In AAAI Workshop on Reinforcement Learning in Games, 2022.

We address planning and navigation in challenging 3D video games featuring maps with disconnected regions reachable by agents using special actions. In this setting, classical symbolic planners are not applicable or difficult to adapt. We introduce a hybrid technique combining a low-level policy trained with reinforcement learning and a graph-based high-level classical planner. In addition to providing human-interpretable paths, the approach improves the generalization performance of an end-to-end approach in unseen maps, where it achieves a 20% absolute increase in success rate over a recurrent end-to-end agent on a point-to-point navigation task in yet unseen large-scale maps of size 1km×1km. In an in-depth experimental study, we quantify the limitations of end-to-end Deep RL approaches in vast environments and we also introduce “GameRLand3D”, a new benchmark and soon to be released environment built with the Unity engine able to generate complex procedural 3D maps for navigation tasks. An overview video is available here.

Edward Beeching, Jilles Dibangoye, Olivier Simonin and Christian Wolf. Godot Reinforcement Learning Agents. In AAAI Workshop on Reinforcement Learning in Games, 2022.

We present Godot Reinforcement Learning (RL) Agents, an open-source interface for developing environments and agents in the Godot Game Engine. The Godot RL Agents interface allows the design, creation and learning of agent behaviors in challenging 2D and 3D environments with various on-policy and off-policy Deep RL algorithms. We provide a standard Gym interface, with wrappers for learning in the Ray RLlib and Stable Baselines RL frameworks. This gives users access to over 20 state-of-the-art on-policy, off-policy and multi-agent RL algorithms. The framework is a versatile tool that allows researchers and game designers to create environments with discrete, continuous and mixed action spaces. The interface is relatively performant, with 12k interactions per second on a high-end laptop computer, when parallelized on 4 CPU cores. An overview video is available here.


Boris Chidlovskii, Assem Sadek and Christian Wolf. Universal Domain Adaptation in Ordinal Regression. pre-print arXiv:2106.11576, 2021.

We address the problem of universal domain adaptation (UDA) in ordinal regression (OR), which attempts to solve classification problems in which labels are not independent, but follow a natural order. We show that the UDA techniques developed for classification and based on the clustering assumption under-perform in OR settings. We propose a method that complements the OR classifier with an auxiliary task of order learning, which plays the double role of discriminating between common and private instances, and expanding class labels to the private target images via ranking. Combined with adversarial domain discrimination, our model is able to address the closed set, partial and open set configurations. We evaluate our method on three face age estimation datasets, and show that it outperforms the baseline methods.

Théo Jaunet, Corentin Kervadec, Grigory Antipov, Moez Baccouche, Romain Vuillemot and Christian Wolf. VisQA: X-raying Vision and Language Reasoning in Transformers. In IEEE Transactions on Visualization and Computer Graphics (Proceedings of VIS 2021).

Visual Question Answering systems target answering open-ended textual questions given input images. They are a testbed for learning high-level reasoning with a primary use in HCI, for instance assistance for the visually impaired. Recent research has shown that state-of-the-art models tend to produce answers exploiting biases and shortcuts in the training data, and sometimes do not even look at the input image, instead of performing the required reasoning steps. We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation. It exposes the key element of state-of-the-art neural models: attention maps in transformers. Our working hypothesis is that reasoning steps leading to model predictions are observable from attention distributions, which are particularly useful for visualization. The design process of VisQA was motivated by well-known bias examples from the fields of deep learning and vision-language reasoning and evaluated in two ways. First, as a result of a collaboration of three fields, machine learning, vision and language reasoning, and data analytics, the work led to a direct impact on the design and training of a neural model for VQA, improving model performance as a consequence. Second, we also report on the design of VisQA, and a goal-oriented evaluation of VisQA targeting the analysis of a model decision process from multiple experts, providing evidence that it makes the inner workings of models accessible to users.

Corentin Kervadec*, Christian Wolf*, Grigory Antipov, Moez Baccouche and Madiha Nadri. Supervising the Transfer of Reasoning Patterns in VQA. In Neural Information Processing Systems (NeurIPS), 2021 (*=equal contribution).

Methods for Visual Question Answering (VQA) are notorious for leveraging dataset biases rather than performing reasoning, hindering generalization. It has been recently shown that better reasoning patterns emerge in attention layers of a state-of-the-art VQA model when they are trained on perfect (oracle) visual inputs. This provides evidence that deep neural networks can learn to reason when training conditions are favorable enough. However, transferring this learned knowledge to deployable models is a challenge, as much of it is lost during the transfer. We propose a method for knowledge transfer based on a regularization term in our loss function, supervising the sequence of required reasoning operations. We provide a theoretical analysis based on PAC-learning, showing that such program prediction can lead to decreased sample complexity under mild hypotheses. We also demonstrate the effectiveness of this approach experimentally on the GQA dataset and show its complementarity to BERT-like self-supervised pre-training.

Théo Jaunet, Guillaume Bono, Romain Vuillemot and Christian Wolf. Sim2RealViz: Visualizing the Sim2Real Gap in Robot Ego-Pose Estimation. NeurIPS XAI Workshop on eXplainable AI approaches for debugging and diagnosis, 2021 (oral).

The Robotics community has started to heavily rely on increasingly realistic 3D simulators for large-scale training of robots on massive amounts of data. But once robots are deployed in the real world, the simulation gap, as well as changes in the real world (e.g. lights, object displacements), lead to errors. In this paper, we introduce Sim2RealViz, a visual analytics tool to assist experts in understanding and reducing this gap for robot ego-pose estimation tasks, i.e. the estimation of a robot's position using trained models. Sim2RealViz displays details of a given model and the performance of its instances in both simulation and the real world. Experts can identify environment differences that impact model predictions at a given location and explore through direct interactions with the model hypothesis to fix it. We detail the design of the tool, and case studies related to the exploitation of the regression to the mean bias and how it can be addressed, and how models are perturbed by the vanishing of landmarks such as bikes.

Steeven Janny, Vincent Andrieu, Madiha Nadri, Christian Wolf. Deep KKL: Data-driven Output Prediction for Non-Linear Systems. In Control and Decision Conference (CDC), 2021.

We address the problem of output prediction, i.e. designing a model for autonomous nonlinear systems capable of forecasting their future observations. We first define a general framework bringing together the necessary properties for the development of such an output predictor. In particular, we look at this problem from two different viewpoints, control theory and data-driven techniques (machine learning), and try to formulate it in a consistent way, reducing the gap between the two fields. Building on this formulation and problem definition, we propose a predictor structure based on the Kazantzis-Kravaris/Luenberger (KKL) observer and we show that KKL fits well into our general framework. Finally, we propose a constructive solution for this predictor that solely relies on a small set of trajectories measured from the system. Our experiments show that our solution yields an efficient predictor over a subset of the observation space.

Corentin Kervadec*, Théo Jaunet*, Grigory Antipov, Moez Baccouche, Romain Vuillemot and Christian Wolf. How Transferrable are Reasoning Patterns in VQA? In International Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (*=equal contribution).

Since its inception, Visual Question Answering (VQA) has been notoriously known as a task in which models are prone to exploit biases in datasets to find shortcuts instead of performing the high-level reasoning required for generalization. Classical methods address these issues with different techniques including removing biases from training data, or adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision and language problems. We train a visual oracle with perfect sight, and in a large scale study provide experimental evidence that it is much less prone to exploiting spurious dataset biases compared to standard models. In particular, we propose to study the attention mechanisms at work in the visual oracle and compare them with a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of reasoning patterns obtained with an online visualization tool which we make publicly available. We exploit these insights by transferring reasoning patterns from the oracle model to a SOTA Transformer-based VQA model taking as input standard noisy inputs. Experiments show successful transfer as evidenced by higher overall accuracy, as well as accuracy on infrequent answers for each type of question, which provides evidence for improved generalization and a decrease of the dependency on dataset biases.

Corentin Kervadec, Grigory Antipov, Moez Baccouche and Christian Wolf. Roses Are Red, Violets Are Blue... but Should VQA Expect Them To? In International Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

To be reliable on rare events is an important requirement for systems based on machine learning. In this work we focus on Visual Question Answering (VQA), where, in spite of recent efforts, datasets remain imbalanced, causing shortcomings of current models: tendencies to overly exploit dataset biases and struggles to generalise to unseen associations of concepts. We focus on a systemic evaluation of model error distributions and address fundamental questions: How is the prediction error distributed? What is the prediction accuracy on infrequent vs. frequent concepts? In this work, we design a new benchmark based on a fine-grained reorganization of the GQA dataset [1], which allows us to answer these questions precisely. It introduces distribution shifts in both validation and test splits, which are defined on question groups and are thus tailored to each question. We performed a large-scale study and we experimentally demonstrate that several state-of-the-art VQA models, even those specifically designed for bias reduction, fail to address questions involving infrequent concepts. Furthermore, we show that the high accuracy obtained on the frequent concepts alone is mechanically increasing overall accuracy, covering up the true behavior of current VQA models.

Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi and Graham W. Taylor. SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (oral presentation).

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art.


Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf. Estimating semantic structure for the VQA answer space. pre-print arXiv:2006.05726, 2020.

Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image), has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the problem, limiting the answering to a choice between independent proposals, without taking into account the similarity between them (e.g. equally penalizing for answering "cat" or "German shepherd" instead of "dog"). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss which takes into account the estimated proximity. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic since it allows consistent improvements with three different VQA models. Finally, by combining our method with a language bias reduction approach, we report SOTA-level performance on the challenging VQAv2-CP dataset.
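
As a rough illustration of the idea of a proximity-aware loss, the sketch below replaces the one-hot target of a standard cross-entropy with a soft target derived from a class proximity matrix. This is only one possible form and is not taken from the paper: the function names, the proximity values and the three-class answer set ("dog", "German shepherd", "cat") are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def proximity_loss(logits, true_idx, proximity):
    """Cross-entropy against a soft target built from class proximities.

    proximity[i][j] is an assumed (hypothetical) similarity in [0, 1]
    between answer classes i and j, with proximity[i][i] == 1.
    """
    row = proximity[true_idx]
    total = sum(row)
    target = [p / total for p in row]  # normalize to a distribution
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, logp))

# Hypothetical classes: 0 = dog, 1 = German shepherd, 2 = cat.
proximity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
# The ground truth is "dog"; compare two wrong predictions.
loss_shepherd = proximity_loss([0.0, 3.0, 0.0], 0, proximity)
loss_cat = proximity_loss([0.0, 0.0, 3.0], 0, proximity)
```

With such a target, predicting "German shepherd" for a dog is penalized less than predicting "cat", whereas a one-hot cross-entropy would penalize both equally.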

Théo Jaunet, Romain Vuillemot and Christian Wolf. DRLViz: Understanding Decisions and Memory in Deep Reinforcement Learning. In Computer Graphics Forum (Proceedings of Eurovis), 2020.

We present DRLViz, a visual analytics interface to interpret the internal memory of an agent (e.g. a robot) trained using deep reinforcement learning. This memory is composed of large temporal vectors updated when the agent moves in an environment and is not trivial to understand. It is often referred to as a black box as only inputs (images) and outputs (actions) are intelligible for humans. Using DRLViz, experts are assisted to interpret decisions using memory reduction interactions, to investigate the role of parts of the memory when errors have been made, and ultimately to improve the agent training process. We report on several examples of use of DRLViz, in the context of video game simulators (ViZDoom) for a navigation scenario with item gathering tasks. We also report on an expert evaluation using DRLViz, and on the applicability of DRLViz to other scenarios and navigation problems beyond simulation games, as well as its contribution to black box model interpretability and explainability in the field of visual analytics.

Théo Jaunet, Romain Vuillemot and Christian Wolf. Théo Guesser: Could you beat an AI guessing where you are in Theo’s apartment? In IEEE VIS Workshop on AI Explainability, 2020.

Edward Beeching, Jilles Dibangoye, Olivier Simonin and Christian Wolf. Learning to plan with uncertain topological maps. In European Conference on Computer Vision (ECCV), 2020 (spotlight, 5% acceptance rate).

We train an agent to navigate in 3D environments using a hierarchical strategy including a high-level graph based planner and a local policy. Our main contribution is a data driven learning based approach for planning under uncertainty in topological maps, requiring an estimate of shortest paths in valued graphs with a probabilistic structure. Whereas classical symbolic algorithms achieve optimal results on noise-less topologies, or optimal results in a probabilistic sense on graphs with probabilistic structure, we aim to show that machine learning can overcome missing information in the graph by taking into account rich high-dimensional node features, for instance visual information available at each location of the map. Compared to purely learned neural white box algorithms, we structure our neural model with an inductive bias for dynamic programming based shortest path algorithms, and we show that a particular parameterization of our neural model corresponds to the Bellman-Ford algorithm. By performing an empirical analysis of our method in simulated photo-realistic 3D environments, we demonstrate that the inclusion of visual features in the learned neural planner outperforms classical symbolic solutions for graph based planning.
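
For reference, the classical Bellman-Ford algorithm named in the abstract can be sketched as below. This is the textbook symbolic algorithm on a graph with known edge weights, not the learned neural planner; the function name and the small example graph are assumptions made for illustration.

```python
import math

def bellman_ford(num_nodes, edges, source):
    """Return shortest-path distances from `source` to every node.

    edges: list of directed (u, v, weight) tuples.
    Raises ValueError if a negative cycle is reachable from `source`.
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    # Relax every edge up to num_nodes - 1 times.
    for _ in range(num_nodes - 1):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updated = True
        if not updated:  # converged early
            break
    # One more pass: any further improvement implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a reachable negative cycle")
    return dist

# Hypothetical 4-node graph.
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
distances = bellman_ford(4, edges, source=0)
```

Each relaxation step only combines a node's current estimate with information from its neighbors, which is the kind of local, iterative update that a neural model with a dynamic programming inductive bias can mimic.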

Edward Beeching, Jilles Dibangoye, Olivier Simonin and Christian Wolf. EgoMap: Projective mapping and structured egocentric memory for Deep RL. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2020.

Tasks involving localization, memorization and planning in partially observable 3D environments are an ongoing challenge in Deep Reinforcement Learning. We present EgoMap, a spatially structured neural memory architecture. EgoMap augments a deep reinforcement learning agent's performance in 3D environments on challenging tasks with multi-step objectives. The EgoMap architecture incorporates several inductive biases including a differentiable inverse projection of CNN feature vectors onto a top-down spatially structured map. The map is updated with ego-motion measurements through a differentiable affine transform. We show this architecture outperforms both standard recurrent agents and state of the art agents with structured memory. We demonstrate that incorporating these inductive biases into an agent's architecture allows for stable training with reward alone, circumventing the expense of acquiring and labelling expert trajectories. A detailed ablation study demonstrates the impact of key aspects of the architecture and through extensive qualitative analysis, we show how the agent exploits its structured internal memory to achieve higher performance.

Johan Peralez, Francesco Galuppo, Pascal Dufour, Christian Wolf and Madiha Nadri. Data-driven multi-model control for a waste heat recovery system. In Control and Decision Conference (CDC), 2020.

We consider the problem of supervised learning of a multi-model based controller for non-linear systems. Selected multiple linear controllers are used for different operating points and combined with a local weighting scheme, whose weights are predicted by a deep neural network trained online. The network uses process and model outputs to drive the controller towards a suitable mixture of operating points. The proposed approach, which combines machine learning and classical control of linear processes, allows efficient implementation on complex industrial processes. In this work, the control problem consists in the design of a controller for a waste heat recovery system (WHRS) mounted on a heavy duty (HD) truck engine to decrease fuel consumption and meet the future pollutant emissions standard. Note that the contribution of this work is not specific to HD truck processes since it can be applied to any nonlinear system with an existing linear controller bank. The proposed control scheme is successfully evaluated on an Organic Rankine Cycle (ORC) process simulator and compared to a standard linear controller and to several strong multi-model baselines without learning.
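
The weighting scheme described above can be illustrated with a minimal sketch that blends a bank of linear controllers using normalized weights. Everything here is an assumption made for the example: the controllers are scalar proportional gains, the weights come from a softmax over hand-picked scores (in the paper they are predicted by a neural network), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blended_control(error, gains, scores):
    """Weighted mixture of proportional controllers u_k = gain_k * error.

    gains:  one gain per linear controller in the bank (hypothetical).
    scores: one score per controller; a learned model would supply these.
    """
    weights = softmax(scores)
    return sum(w * k * error for w, k in zip(weights, gains))

# Two controllers tuned for different operating points, equal scores.
u_equal = blended_control(error=2.0, gains=[1.0, 3.0], scores=[0.0, 0.0])
# A strong preference for the second controller.
u_second = blended_control(error=2.0, gains=[1.0, 3.0], scores=[0.0, 20.0])
```

With equal scores the command is the average of the two controller outputs; as one score dominates, the mixture approaches the output of that single controller.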

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf. COPHY: Counterfactual Learning of Physical Dynamics. In International Conference on Learning Representations (ICLR), 2020 (spotlight, 6% acceptance rate).

Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the COPHY benchmark to assess the capacity of the state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions at the level of superhuman performance.

Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf. Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks. In European Conference on Artificial Intelligence (ECAI), 2020 (oral presentation).

The large adoption of the self-attention (i.e. transformer model) and BERT-like training principles has recently resulted in a number of high performing models on a large panoply of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well in structuring information inside a single modality but, despite their impressive performances, they tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross attention modules, in this work, we demonstrate (1) that the latter assumption does not hold, i.e. modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. In particular, this new learning signal allows obtaining SOTA-level performances on the GQA dataset (VQA task) with pre-trained models without finetuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we also illustrate the impact of the contribution on the model's reasoning by visualizing attention distributions.

Edward Beeching, Christian Wolf, Jilles Dibangoye and Olivier Simonin. Deep Reinforcement Learning on a Budget: 3D Control and Reasoning Without a Supercomputer. In International Conference on Pattern Recognition (ICPR), 2020.

An important goal of research in Deep Reinforcement Learning in mobile robotics is to train agents capable of solving complex tasks, which require a high level of scene understanding and reasoning from an egocentric perspective. When trained from simulations, optimal environments should satisfy a currently unobtainable combination of high-fidelity photographic observations, massive amounts of different environment configurations and fast simulation speeds. In this paper we argue that research on training agents capable of complex reasoning can be simplified by decoupling from the requirement of high fidelity photographic observations. We present a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24h. Our scenarios combine two key advantages: (i) they are based on a simple but highly efficient 3D environment (ViZDoom) which allows high speed simulation (12000fps); (ii) the scenarios provide the user with a range of difficulty settings, in order to identify the limitations of current state of the art algorithms and network architectures. We aim to increase accessibility to the field of Deep-RL by providing baselines for challenging scenarios where new ideas can be iterated on quickly. We argue that the community should be able to address challenging problems in reasoning of mobile agents without the need for a large compute infrastructure.


Tom Gillooly, Nicolas Coltice and Christian Wolf. An anticipation experiment for plate tectonics. In "Tectonics", 38(11):3916-3938, 2019.

Although plate tectonics has pushed the frontiers of geosciences in the past 50 years, it has legitimate limitations and among them we focus on both the absence of dynamics in the theory, and the difficulty of reconstructing tectonics when data is sparse. In this manuscript, we propose an anticipation experiment, proposing a singular outlook on plate tectonics in the digital era. We hypothesize that mantle convection models producing self-consistently plate-like behavior will capture the essence of the self-organisation of plate boundaries. Such models exist today in a preliminary fashion and we use them here to build a database of mid-ocean ridge and trench configurations. To extract knowledge from it we develop a machine learning framework based on Generative Adversarial Networks (GANs) that learns the regularities of the self-organisation in order to fill gaps of observations when working on reconstructing a plate configuration. The user provides the distribution of known ridges and trenches, the location of the region where observations lack, and our digital architecture proposes a horizontal divergence map from which missing plate boundaries are extracted. Our framework is able to prolongate and interpolate plate boundaries within an unresolved region, but fails to retrieve a plate boundary that would be completely contained in it. The attempt we make is certainly too early because geodynamic models need improvement and a larger amount of geodynamic model outputs, as independent as possible, is required. However, this work suggests applying such an approach to expand the capabilities of plate tectonics is within reach.

Pejman Rasti, Christian Wolf, Hugo Dorez, Raphael Sablong, Driffa Moussata, Salma Samiei and David Rousseau. Machine Learning-Based Classification of the Health State of Mice Colon in Cancer Study from Confocal Laser Endomicroscopy. In "Nature Scientific Reports 9", 2019.

In this article, we address the problem of classifying the health state of the colon wall of mice, possibly injured by cancer, with machine learning approaches. This problem is essential for translational research on cancer and is a priori challenging since the amount of data is usually limited in all preclinical studies for practical and ethical reasons. Three tissue states are considered: cancerous, healthy, and inflammatory. Fully automated machine learning-based methods are proposed, including deep learning, transfer learning, and shallow learning with SVM. These methods address different training strategies corresponding to clinical questions such as the automatic clinical state prediction on unseen data using a pre-trained model, or in an alternative setting, real-time estimation of the clinical state of individual tissue samples during the examination. Experimental results show a best performance of 99.93% correct recognition rate for the second strategy, and a performance of 98.49% for the more difficult first case.

Ozgur Erkent, Christian Wolf and Christian Laugier. End-to-End Learning of Semantic Grid Estimation Deep Neural Network with Occupancy Grids. In "Unmanned Systems", 07(03):171-181, 2019.

We propose semantic grid, a spatial 2D map of the environment around an autonomous vehicle consisting of cells which represent the semantic information of the corresponding region such as car, road, vegetation, bikes, etc. It consists of an integration of an occupancy grid, which computes the grid states with a Bayesian filter approach, and semantic segmentation information from monocular RGB images, which is obtained with a deep neural network. The network fuses the information and can be trained in an end-to-end manner. The output of the neural network is refined with a conditional random field. The proposed method is tested in various datasets (KITTI dataset, Inria-Chroma dataset and SYNTHIA) and different deep neural network architectures are compared.

Quentin Debard, Jilles Dibangoye, Stéphane Canu and Christian Wolf. Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2019.

Using touch devices to navigate in virtual 3D environments such as computer assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to automatically learn a new interaction protocol that maps 2D user input to 3D actions in virtual environments using reinforcement learning (RL). A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting the 2D finger trajectories from the first agent and translating them to 3D operations. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational auto encoder (VAE).

Anshul Paigwar, Ozgur Erkent, Christian Wolf and Christian Laugier. Attentional PointNet for 3D-Object Detection in Point Clouds. In CVPR Workshop on autonomous driving, 2019.

Accurate detection of objects in 3D point clouds is a central problem for autonomous navigation. Most existing methods use techniques of hand-crafted features representation or multi-modal approaches prone to sensor failure. Approaches like PointNet that directly operate on sparse point data have shown good accuracy in the classification of single 3D objects. However, LiDAR sensors on Autonomous Vehicles generate a large scale point cloud. Real-time object detection in such a cluttered environment still remains a challenge. In this study, we propose Attentional PointNet, which is a novel end-to-end trainable deep architecture for object detection in point clouds. We extend the theory of visual attention mechanism to 3D point clouds and introduce a new recurrent 3D Localization Network module. Rather than processing the whole point cloud, the network learns where to look (finding regions of interest), which significantly reduces the number of points to be processed and inference time. Evaluation on KITTI car detection benchmark shows that our Attentional PointNet achieves comparable results with the state-of-the-art LiDAR-based 3D detection methods in detection and speed.

Théo Jaunet, Romain Vuillemot and Christian Wolf. What if we Reduced the Memory of an Artificial Doom Player? In IEEE VIS Workshop on AI Explainability (Best submission prize), 2019.


Gerard Bailly, Alaeddine Mihoub, Christian Wolf and Fréderic Elisei. Gaze and face-to-face interaction: from multimodal data to behavioral models. Book chapter in "Eye-tracking in Interaction: Studies on the role of eye gaze in dialogue (Advances in Interaction Studies) ", Geert Brône & Bert Oben, ed., 2018.

This chapter describes experimental and modeling work aiming at describing gaze patterns that are mutually exchanged by interlocutors during situated and task-directed face-to-face two-way interactions. We will show that these gaze patterns (incl. blinking rate) are significantly influenced by the cognitive states of the interlocutors (speaking, listening, thinking, etc.), their respective roles in the conversation (e.g. instruction giver, respondent) as well as their social relationship (e.g. colleague, supervisor).

This chapter provides insights into the (micro-)coordination of gaze with other components of attention management as well as methodologies for capturing and modeling behavioral regularities observed in experimental data. A particular emphasis is put on statistical models, which are able to learn behaviors in a data-driven way.

We will introduce several statistical models of multimodal behaviors that can be trained on such multimodal signals and generate behaviors given perceptual cues. We will notably compare performances and properties of models which explicitly model the temporal structure of studied signals, and which relate them to internal cognitive states. In particular we study Semi-Hidden Markov Models and Dynamic Bayesian Networks and compare them to classifiers without sequential models (Support Vector Machines and Decision Trees).

We will further show that the gaze of conversational agents (virtual talking heads, speaking robots) may have a strong impact on communication efficiency. One of the conclusions we draw from these experiments is that multimodal behavioral models able to generate co-verbal gaze patterns should be designed with great care in order not to increase cognitive load. Experiments involving an impoverished or irrelevant control of the gaze of artificial agents (virtual talking heads and humanoid robots) have demonstrated its negative impact on communication (Garau, Slater, Bee, & Sasse, 2001).

Bastien Moysset, Christopher Kermorvant, Christian Wolf. Learning to detect, localize and recognize many text objects in document images from few examples. In International Journal on Document Analysis and Recognition (IJDAR), 21(3):161–175, 2018.
The current trend in object detection and localization is to learn predictions with high capacity deep neural networks trained on a very large amount of annotated data and using a high amount of processing power. In this work, we pro- pose a new neural model which directly predicts bounding box coordinates. The particularity of our contribution lies in the local computations of predictions with a new form of local parameter sharing which keeps the overall amount of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data is not as abun- dant as in the classical configuration of natural images and Imagenet/Pascal VOC tasks. We particularly target the de- tection of text in document images, but our method is not limited to this setting. The proposed model also facilitates the detection of many objects in a single image and can deal with inputs of variable sizes without resizing.
Emre Dogan, Gonen Eren, Christian Wolf, Eric Lombardi, Atilla Baskurt. Multi-view pose estimation with mixtures-of-parts and adaptive viewpoint selection. In IET Computer Vision, 12(4):403–411, 2018.
We propose a new method for human pose estimation which leverages information from multiple views to impose a strong prior on articulated pose. The novelty of the method concerns the types of coherence modelled. Consistency is maximised over the different views through different terms modelling classical geometric information (coherence of the resulting poses) as well as appearance information which is modelled as latent variables in the global energy function. Moreover, adequacy of each view is assessed and their contributions are adjusted accordingly. Experiments on the HumanEva and UMPM datasets show that the proposed method significantly decreases the estimation error compared to single-view results.
Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, Greg Mori. Object Level Visual Reasoning in Videos. In European Conference on Computer Vision (ECCV), 2018.
Human activity recognition is typically addressed by training models to detect key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context. The next open challenges in activity recognition require a level of understanding that pushes beyond this, requiring fine distinctions and a detailed comprehension of the interactions between actors and objects in a scene. We propose a model capable of learning to reason about semantically meaningful spatio-temporal interactions in videos. Key to our approach is the choice of performing this reasoning on an object level through the integration of state-of-the-art object instance segmentation networks. This allows the model to learn detailed spatial interactions that exist at a semantic, object-interaction relevant level. We evaluate our method on three standard datasets: the Twenty-BN Something-Something dataset, the VLOG dataset and the EPIC Kitchens dataset, and achieve state-of-the-art results on both. Finally, we also show visualizations of the interactions learned by the model, which illustrate object classes and their interactions corresponding to different activity classes.
Fabien Baradel, Christian Wolf, Julien Mille and Graham W. Taylor. Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points. In Computer Vision and Pattern Recognition (CVPR), 2018.

We propose a method for human activity recognition from RGB data which does not rely on any pose information during test time, and which does not explicitly calculate pose information internally. Instead, a visual attention module learns to predict glimpse sequences in each frame. These glimpses correspond to interest points in the scene which are relevant to the classified activities. No spatial coherence is forced on the glimpse locations, which gives the module liberty to explore different points at each frame and better optimize the process of scrutinizing visual information.

Tracking and sequentially integrating this kind of unstructured data is a challenge, which we address by separating the set of glimpses from a set of recurrent tracking/recognition workers. These workers receive the glimpses, jointly performing subsequent motion tracking and prediction of the activity itself. The glimpses are soft-assigned to the workers, optimizing coherence of the assignments in space, time and feature space using an external memory module. No hard decisions are taken, i.e. each glimpse point is assigned to all existing workers, albeit with different importance. Our method outperforms state-of-the-art methods on the largest human activity recognition dataset available to date, the NTU RGB+D dataset, and on a smaller human action recognition dataset, the Northwestern-UCLA Multiview Action 3D dataset.

Fabien Baradel, Christian Wolf, Julien Mille. Human Activity Recognition by attending to RGB frames from deep pose features. In British Machine Vision Conference (BMVC), 2018.
We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the four hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally, a temporal attention mechanism learns how to fuse LSTM features over time. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D.
Ozgur Erkent, Christian Wolf, Christian Laugier, David Sierra Gonzalez and Victor Romero Cano. Semantic Grid Estimation with a Hybrid Bayesian and Deep Neural Network Approach. In International Conference on Intelligent Robots and Systems (IROS), 2018.

In an autonomous vehicle setting, we propose a method for the estimation of a semantic grid, i.e. a bird's eye grid centered on the car's position and aligned in its driving direction, which contains high-level semantic information on the environment and its actors. Each grid cell contains a semantic label from a diverse set of classes, for instance {Road, Vegetation, Building, Pedestrian, Car ...}.

We propose a hybrid approach, which combines the advantages of two different methodologies: we use Deep Learning to perform semantic segmentation on monocular RGB images with supervised learning from labeled groundtruth data. We combine these segmentations with occupancy grids calculated from LIDAR data using a generative Bayesian particle filter. The fusion itself is carried out with a deep network, which learns to integrate geometric information from the LIDAR with semantic information from the RGB data.

We tested our method on two datasets, namely the KITTI dataset, which is publicly available and widely used, and our own dataset obtained with our own platform, a Renault ZOE equipped with a LIDAR and various sensors. We largely outperform baselines which calculate the semantic grid either from the RGB image alone or from LIDAR output alone, showing the interest of this hybrid approach.

Fabrice Jumel, Jacques Saraydaryan, Raphael Leber, Laetitia Matignon, Eric Lombardi, Christian Wolf and Olivier Simonin. Context Aware Robot Architecture, Application to the RoboCup@Home Challenge. RoboCup Symposium, 2018.

This paper presents an architecture dedicated to the orchestration of high level abilities of a humanoid robot, such as a Pepper, which must perform tasks such as those proposed in the RoboCup@Home competition. We present the main abilities that a humanoid service robot should provide. We choose to build them based on recent methodologies linked to social navigation and deep learning. We detail the architecture and how high level abilities are connected with low level sub-functions. Finally, we present first experimental results with a Pepper humanoid.

Quentin Debard, Christian Wolf, Stéphane Canu and Julien Arne. Learning to recognize touch gestures: recurrent vs. convolutional features and dynamic sampling. In International Conference on Automatic Face and Gesture Recognition (FG), oral presentation, 2018.

We propose a fully automatic method for learning gestures on big touch devices in a potentially multi-user context. The goal is to learn general models capable of adapting to different gestures, user styles and hardware variations (e.g. device sizes, sampling frequencies and regularities).

Based on deep neural networks, our method features a novel dynamic sampling and temporal normalization component, transforming variable length gestures into fixed length representations while preserving finger/surface contact transitions, that is, the topology of the signal. This sequential representation is then processed with a convolutional model capable, unlike recurrent networks, of learning hierarchical representations with different levels of abstraction.

To demonstrate the interest of the proposed method, we introduce a new touch gestures dataset with 6758 gestures performed by 27 people, which is, to our knowledge, the first of its kind: a publicly available multi-touch gesture dataset for interaction. We also tested our method on a standard dataset in symbolic touch gesture recognition, the MMG dataset, outperforming the state of the art and reporting close to perfect performance.


Natalia Neverova, Christian Wolf, Florian Nebout, Graham W. Taylor. Hand Pose Estimation through Weakly-Supervised Learning of a Rich Intermediate Representation. In Computer Vision and Image Understanding (CVIU) 167:56-67, 2017.
We propose a method for hand pose estimation based on a deep regressor trained on two different kinds of input. Raw depth data is fused with an intermediate representation in the form of a segmentation of the hand into parts. This intermediate representation contains important topological information and provides useful cues for reasoning about joint locations. The mapping from raw depth to segmentation maps is learned in a semi/weakly-supervised way from two different datasets: (i) a synthetic dataset created through a rendering pipeline including densely labeled ground truth (pixelwise segmentations); and (ii) a dataset with real images for which ground truth joint positions are available, but not dense segmentations. Loss for training on real images is generated from a patch-wise restoration process, which aligns tentative segmentation maps with a large dictionary of synthetic poses. The underlying premise is that the domain shift between synthetic and real data is smaller in the intermediate representation, where labels carry geometric and topological meaning, than in the raw input domain. Experiments on the NYU dataset show that the proposed training method decreases error on joints over direct regression of joints from depth data by 15.7%.
Eric Guerin, Eric Galin, Julie Digne, Adrien Peytavie, Christian Wolf, Bedrich Benes, Benoit Martinez. Interactive Example-Based Terrain Authoring with Conditional Generative Adversarial Networks. In Transactions on Graphics (SIGGRAPH Asia), 2017.

Authoring virtual terrains presents a challenge and there is a strong need for authoring tools able to create realistic terrains with simple user-inputs and with high user control. We propose an example-based authoring pipeline that uses a set of terrain synthesizers dedicated to specific tasks.

Each terrain synthesizer is a Conditional Generative Adversarial Network trained by using real-world terrains and their sketched counterparts. The training sets are built automatically with a view that the terrain synthesizers learn the generation from features that are easy to sketch. During the authoring process, the artist first creates a rough sketch of the main terrain features, such as rivers, valleys and ridges, and the algorithm automatically synthesizes a terrain corresponding to the sketch using the learned features of the training samples. Moreover, an erosion synthesizer can also generate terrain evolution by erosion at a very low computational cost. Our framework allows for an easy terrain authoring and provides a high level of realism for a minimum sketch cost. We show various examples of terrain synthesis created by experienced as well as inexperienced users who are able to design a vast variety of complex terrains in a very short time.

Damien Fourure, Remi Emonet, Elisa Fromont, Damien Muselet, Natalia Neverova, Alain Trémeau, Christian Wolf. Multi-task, Multi-domain Learning: application to semantic segmentation and pose regression. In Neurocomputing 251:68-80, 2017.
We present an approach that leverages multiple datasets annotated for different tasks (e.g., classification with different labelsets) to improve the predictive accuracy on each individual dataset. Domain adaptation techniques can correct dataset bias but they are not applicable when the tasks differ, and they need to be complemented to handle multi-task settings. We propose a new selective loss function that can be integrated into deep neural networks to exploit training data coming from multiple datasets annotated for related but possibly different label sets. We show that the gradient-reversal approach for domain adaptation can be used in this setup to additionally handle domain shifts. We also propose an auto-context approach that further captures existing correlations across tasks. Thorough experiments on two types of applications (semantic segmentation and hand pose estimation) show the relevance of our approach in different contexts.
Fabien Baradel, Christian Wolf, Julien Mille. Human Action Recognition: Pose-based Attention draws focus to Hands. ICCV Workshop on Hands in Action, 2017.

We propose a new spatio-temporal attention based mechanism for human action recognition able to automatically attend to most important human hands and detect the most discriminative moments in an action. Attention is handled in a recurrent manner employing Recurrent Neural Network (RNN) and is fully-differentiable. In contrast to standard soft-attention based mechanisms, our approach does not use the hidden RNN state as input to the attention model. Instead, attention distributions are drawn using external information: human articulated pose. We performed an extensive ablation study to show the strengths of this approach and we particularly studied the conditioning aspect of the attention mechanism.

We evaluate the method on the largest currently available human action recognition dataset, NTU-RGB+D, and report state-of-the-art results. Another advantage of our model is a degree of explainability, as the spatial and temporal attention distributions at test time make it possible to study and verify which parts of the input data the method focuses on.

Damien Fourure, Remi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, Christian Wolf. Residual Conv-Deconv Grid Network for Semantic Segmentation. In British Machine Vision Conference (BMVC), 2017.
This paper presents GridNet, a new Convolutional Neural Network (CNN) architecture for semantic image segmentation (full scene labelling). Classical neural networks are implemented as one stream from the input to the output with subsampling operators applied in the stream in order to reduce the feature maps size and to increase the receptive field for the final prediction. However, for semantic image segmentation, where the task consists in providing a semantic class to each pixel of an image, feature maps reduction is harmful because it leads to a resolution loss in the output prediction. To tackle this problem, our GridNet follows a grid pattern allowing multiple interconnected streams to work at different resolutions. We show that our network generalizes many well known networks such as conv-deconv, residual or U-Net networks. GridNet is trained from scratch and achieves competitive results on the Cityscapes dataset.
Bastien Moysset, Christopher Kermorvant, Christian Wolf. Full-Page Text Recognition: Learning Where to Start and When to Stop. In International Conference on Document Analysis and Recognition (ICDAR), 2017.

Text line detection and localization is a crucial step for full page document analysis, but still suffers from heterogeneity of real life documents. In this paper, we present a new approach for full page text recognition. Localization of the text lines is based on regressions with Fully Convolutional Neural Networks and Multidimensional Long Short-Term Memory as contextual layers.

In order to increase the efficiency of this localization method, only the position of the left side of the text lines is predicted. The text recognizer is then in charge of predicting the end of the text to recognize. This method has shown good results for full page text recognition on the highly heterogeneous Maurdor dataset.

Fan Li, Natalia Neverova, Christian Wolf and Graham W. Taylor. Modout: Learning Multi-Modal Architectures by Stochastic Regularization. In International Conference on Automatic Face and Gesture Recognition (FG), 2017.
Model selection methods based on stochastic regularization such as Dropout have been widely used in deep learning due to their simplicity and effectiveness. The standard Dropout method treats all units, visible or hidden, in the same way, thus ignoring any a priori information related to grouping or structure. Such structure is present in multi-modal learning applications such as affect analysis and gesture recognition, where subsets of units may correspond to individual modalities. In this paper we describe Modout, a model selection method based on stochastic regularization, which is particularly useful in the multi-modal setting. Different from previous methods, it is capable of learning whether or when to fuse two modalities in a layer, which is usually considered to be an architectural hyper-parameter by deep learning researchers and practitioners. Modout is evaluated on one synthetic and two real multi-modal datasets. The results indicate improved performance compared to other stochastic regularization methods. The result on the Montalbano dataset shows that learning a fusion structure by Modout is on par with a state-of-the-art carefully designed architecture.


Natalia Neverova, Christian Wolf, Graham W. Taylor and Florian Nebout. ModDrop: adaptive multi-modal gesture recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence - PAMI 38(8):1692-1706, 2016.
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed "ModDrop") for learning cross-modality correlations while preserving uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels, so that it produces meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
Natalia Neverova, Christian Wolf, Griffin Lacey, Lex Fridman, Deepak Chandra, Brandon Barbello and Graham W. Taylor. Learning Human Identity from Motion Patterns. In IEEE Access (4):1810-1820, 2016.
We present a large-scale study exploring the capability of temporal deep neural networks in interpreting natural human kinematics and introduce the first method for active biometric authentication with mobile inertial sensors. At Google, we have created a first-of-its-kind dataset of human movements, passively collected by 1500 volunteers using their smartphones daily over several months. We (1) compare several neural architectures for efficient learning of temporal multi-modal data representations, (2) propose an optimized shift-invariant dense convolutional mechanism (DCWRNN) and (3) incorporate the discriminatively-trained dynamic features in a probabilistic generative framework taking into account temporal characteristics. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.
Alaeddine Mihoub, Gerard Bailly, Christian Wolf and Frédéric Elisei. Graphical models for social behavior modeling in face-to-face interaction. In Pattern Recognition Letters (75):82-89, 2016.
The goal of this paper is to model the coverbal behavior of a subject involved in face-to-face social interactions. To this end, we present a multimodal behavioral model based on a Dynamic Bayesian Network (DBN). The model was inferred from multimodal data of interacting dyads in a specific scenario designed to foster mutual attention and multimodal deixis of objects and places in a collaborative task. The challenge for this behavioral model is to generate coverbal actions (gaze, hand gestures) for the subject given his verbal productions, the current phase of the interaction and the perceived actions of the partner. In our work, the structure of the DBN was learned from data, which revealed an interesting causality graph describing precisely how verbal and coverbal human behaviors are coordinated during the studied interactions. Using this structure, the DBN exhibits better performance than classical baseline models such as Hidden Markov Models (HMMs) and Hidden Semi-Markov Models (HSMMs). We outperform the baselines in both measures of performance, i.e. interaction unit recognition and behavior generation. The DBN also reproduces more faithfully the coordination patterns between modalities observed in ground truth compared to the baseline models.
Damien Fourure, Remi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, Christian Wolf. Semantic Segmentation via Multi-task, Multi-domain Learning. In joint IAPR International Workshops on Structural and Syntactic Pattern Recognition (SSPR 2016) and Statistical Techniques in Pattern Recognition (SPR 2016).
We present an approach that leverages multiple datasets possibly annotated using different classes to improve the semantic segmentation accuracy on each individual dataset. We propose a new selective loss function that can be integrated into deep networks to exploit training data coming from multiple datasets with possibly different tasks (e.g., different label-sets). We show how the gradient-reversal approach for domain adaptation can be used in this setup. Thorough experiments on semantic segmentation applications show the relevance of our approach.
Bastien Moysset, Jérome Louradour, Christopher Kermorvant, Christian Wolf. Learning text-line localization with shared and local regression neural networks. In International Conference on Frontiers in Handwriting Recognition, 2016.
Text line detection and localisation is a crucial step for full page document analysis, but still suffers from heterogeneity of real life documents. In this paper, we present a novel approach for text line localisation based on Convolutional Neural Networks and Multidimensional Long Short-Term Memory cells as a regressor in order to predict the coordinates of the text line bounding boxes directly from the pixel values. Targeting typically large images in document image analysis, we propose a new model using weight sharing over local blocks. We compare two strategies: directly predicting the four coordinates or predicting lower-left and upper-right points separately followed by matching. We evaluate our work on the highly unconstrained Maurdor dataset and show that our method outperforms both other machine learning and image processing methods.
Damien Fourure, Remi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, Christian Wolf. Mixed pooling Neural Networks for Color Constancy. In International Conference on Image Processing (ICIP), 2016.
Color constancy is the ability of the human visual system to perceive constant colors for a surface despite changes in the spectrum of the illumination. In computer vision, the main approach consists in estimating the illuminant color and then removing its impact on the color of the objects. Many image processing algorithms have been proposed to tackle this problem automatically. However, most of these approaches are handcrafted and mostly rely on strong empirical assumptions, e.g., that the average reflectance in a scene is gray. State-of-the-art approaches can perform very well on some given datasets but adapt poorly to others. In this paper, we have investigated how neural network-based approaches can be used to deal with the color constancy problem. We have proposed a new network architecture based on existing successful hand-crafted approaches and a large number of improvements to tackle this problem by learning a suitable deep model. We show our results on most of the standard benchmarks used in the color constancy domain.


Oya Celiktutan, Christian Wolf, Bülent Sankur and Eric Lombardi. Fast Exact Hyper-Graph Matching with Dynamic Programming for Spatio-Temporal Data. In Journal on Mathematical Imaging and Vision, pp. 1-21, 2015.

Graphs and hyper-graphs are frequently used to recognize complex and often non-rigid patterns in computer vision, either through graph matching or point-set matching with graphs. Most formulations resort to the minimization of a difficult energy function containing geometric or structural terms, frequently coupled with data attached terms involving appearance information. Traditional methods solve the minimization problem approximately, for instance resorting to spectral techniques. In this paper, we deal with spatio-temporal data, taking human actions in video sequences as a concrete example. In this context, we first make three realistic assumptions: (i) causality of human movements; (ii) sequential nature of human movements; and (iii) one-to-one mapping of time instants. We show that, under these assumptions, the correspondence problem can be decomposed into a set of subproblems such that each subproblem can be solved recursively in terms of the others, and hence an efficient exact minimization algorithm can be derived using a dynamic programming approach. Secondly, we propose a special graphical structure which is elongated in time. We argue that, instead of approximately solving the original problem, a solution can be obtained by exactly solving an approximated problem. An exact minimization algorithm is derived for this structure and successfully applied to action recognition in two settings: video data and Kinect coordinate data.

Alaeddine Mihoub, Gerard Bailly, Christian Wolf and Frédéric Elisei. Learning multimodal behavioral models for face-to-face social interaction. In Journal on Multimodal User Interfaces, (9):3, pp 195-210, 2015.
The aim of this paper is to model multimodal perception-action loops of human behavior in face-to-face interactions. The long-term goal of this research is to give artificial agents social skills to engage in believable interactions with human interlocutors. To this end, we propose trainable behavioral models that generate optimal actions given others’ perceived actions and joint goals. We first compare sequential models - in particular Discrete Hidden Markov Models (DHMMs) - with standard classifiers (SVMs and Decision Trees). We propose a modification of the initialization of the DHMMs in order to better capture the recurrent structure of the sensory-motor states. We show that the explicit state duration modeling by Hidden Semi-Markov Models (HSMMs) improves prediction performance. We applied these models to parallel speech and gaze data collected from interacting dyads. The challenge was to predict the gaze of one subject given the gaze of the interlocutor and the voice activity of both. For both HMMs and HSMMs, the Short-Time Viterbi concept is used for incremental decoding and generation. For the proposed models, we objectively evaluated many properties in order to go beyond pure classification performance. Results show that while Incremental Discrete HMMs (IDHMMs) were more efficient than classic classifiers, the Incremental Discrete HSMMs (IDHSMMs) give the best performance. This result emphasizes the relevance of state duration modeling.
Bastien Moysset, Christopher Kermorvant, Christian Wolf, Jérome Louradour. Paragraph text segmentation into lines with Recurrent Neural Networks. In International Conference on Document Analysis and Recognition (ICDAR), 2015.
The detection of text lines, as a first processing step, is critical in all Text Recognition systems. State-of-the-art methods to locate lines of text are based on handcrafted heuristics fine-tuned by the Image Processing Community's experience. They succeed under certain constraints; for instance the background has to be roughly uniform. We propose to use more "agnostic" Machine Learning-based approaches to address text line location. The main motivation is to be able to process either damaged documents, or flows of documents with a high variety of layouts and other characteristics. A new method is presented in this work, inspired by the latest generation of optical models used for Text Recognition, namely Recurrent Neural Networks. As these models are sequential, a column of text lines in our application plays the same role as a line of characters in more traditional text recognition settings. A key advantage of the proposed method over other data-driven approaches is that compiling a training dataset does not require labeling line boundaries: only the number of lines is required for each paragraph. Experimental results show that our approach gives similar or better results than traditional handcrafted approaches, with little engineering effort and less hyper-parameter tuning.
Bastien Moysset, Pierre Adam, Christian Wolf, Jérome Louradour. Space Displacement Localization Neural Networks to locate origin points of handwritten text lines in historical documents. In ICDAR Workshop on Historical Document Imaging and Processing, 2015.
We describe a new method for detecting and localizing multiple objects in an image using context aware deep neural networks. Common architectures either proceed locally per pixel-wise sliding-windows, or globally by predicting object localizations for a full image. We improve on this by training a semi-local model to detect and localize objects inside a large image region, which covers an object or a part of it. Context knowledge is integrated, combining multiple predictions for different regions through a spatial context layer modeled as an LSTM network. The proposed method is applied to a complex problem in historical document image analysis, where we show that it is capable of robustly detecting text lines in the images from the ANDAR-TL competition. Experiments indicate that the model can cope with difficult situations and reach the state of the art in Vision such as other deep models.
Emre Dogan, Gonen Eren, Christian Wolf, Atilla Baskurt. Activity recognition with volume motion templates and histograms of 3D gradients. In International Conference on Image Processing (ICIP), 2015.
We propose a new method for activity recognition based on a view independent representation of human motion. Robust 3D volume motion templates (VMTs) are calculated from tracklets. View independence is achieved through a rotation with respect to a canonical orientation. From these volumes, features based on 3D gradients are extracted, projected to a codebook and pooled into a bag-of-words model classified with an SVM classifier. Experiments show that the method outperforms the original HoG3D method.
Leslie Guillaume, Véronique Aubergé, Romain Magnani, Frédéric Aman, Cécile Cottier, Yuko Sasa, Christian Wolf, Florian Nebout, Natalia Neverova, Nicolas Bonnefond, Amaury Negre, Liliya Tsvetanova, Maxence Girard-Rivier. Gestural HRI in an ecological dynamic experiment: the GEE corpus based approach for the Emox robot. In International Workshop on Advanced Robotics and its Social Impacts (ARSO), 2015.
As part of a human-robot interaction project, the gestural modality is one possible way to communicate. In order to develop a relevant gesture recognition system associated with a smart home butler robot, our methodology is based on an IQ game-like Wizard of Oz experiment to collect spontaneous and implicitly produced gestures in an ecological context where the robot is the referee of the game. These gestures are compared with explicitly produced gestures to determine a relevant ontology of gestures. This preliminary qualitative analysis will be the basis for building a big data corpus in order to optimize acceptance of the gesture dictionary in coherence with the “socio-affective glue” dynamics.
Gerard Bailly, Alaeddine Mihoub, Christian Wolf and Frédéric Elisei. Learning joint multimodal behaviors for face-to-face interaction: performance & properties of statistical models. In HRI Workshop on Behavior Coordination between Animals, Humans, and Robots, 2015.

We evaluate here the ability of statistical models, namely Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs), to capture the interplay and coordination between multimodal behaviors of two individuals involved in a face-to-face interaction. We structure the intricate sensory-motor coupling of the joint multimodal scores by segmenting the whole interaction into so-called interaction units (IU). We show that the proposed statistical models are able to capture the natural dynamics of the interaction and that DBNs are particularly suitable for reproducing original distributions of so-called coordination histograms.


Christian Wolf, Eric Lombardi, Julien Mille, Oya Celiktutan, Mingyuan Jiu, Emre Dogan, Gonen Eren, Moez Baccouche, Emmanuel Dellandréa, Charles-Edmond Bichot, Christophe Garcia, Bülent Sankur. Evaluation of video activity localizations integrating quality and quantity measurements. In Computer Vision and Image Understanding (127):14-30, 2014.

Evaluating the performance of computer vision algorithms is classically done by reporting classification error or accuracy, if the problem at hand is the classification of an object in an image, the recognition of an activity in a video or the categorization and labeling of the image or video. If in addition the detection of an item in an image or a video, and/or its localization are required, frequently used metrics are Recall and Precision, as well as ROC curves. These metrics give quantitative performance values which are easy to understand and to interpret even by non-experts. However, an inherent problem is the dependency of quantitative performance measures on the quality constraints that we need to impose on the detection algorithm. In particular, an important quality parameter of these measures is the spatial or spatio-temporal overlap between a ground-truth item and a detected item, and this needs to be taken into account when interpreting the results.

We propose a new performance metric addressing and unifying the qualitative and quantitative aspects of the performance measures. The performance of a detection and recognition algorithm is illustrated intuitively by performance graphs which present quantitative performance values, like Recall, Precision and F-Score, depending on quality constraints of the detection. In order to compare the performance of different computer vision algorithms, a representative single performance measure is computed from the graphs, by integrating out all quality parameters. The evaluation method can be applied to different types of activity detection and recognition algorithms. The performance metric has been tested on several activity recognition algorithms participating in the ICPR 2012 HARL competition.
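
As a rough illustration of the idea of integrating out a quality parameter, the following minimal sketch computes Recall, Precision and F-Score for temporal detections at several overlap thresholds and averages the F-Score over them. It is a simplification under stated assumptions, not the measure defined in the paper: the function names, the greedy matching, the use of a single quality parameter (temporal intersection over union) and the uniform threshold grid are all hypothetical choices made for this example.

```python
def overlap(a, b):
    """Intersection over union of two temporal intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_precision(ground_truth, detections, threshold):
    """Greedy one-to-one matching; a pair counts only if its overlap >= threshold."""
    used = set()
    matched = 0
    for g in ground_truth:
        best, best_j = 0.0, None
        for j, d in enumerate(detections):
            if j in used:
                continue
            o = overlap(g, d)
            if o > best:
                best, best_j = o, j
        if best_j is not None and best >= threshold:
            used.add(best_j)
            matched += 1
    recall = matched / len(ground_truth) if ground_truth else 0.0
    precision = matched / len(detections) if detections else 0.0
    return recall, precision


def integrated_f_score(ground_truth, detections, thresholds):
    """Average the F-Score over a grid of quality thresholds (assumed uniform)."""
    scores = []
    for t in thresholds:
        r, p = recall_precision(ground_truth, detections, t)
        scores.append(2 * r * p / (r + p) if r + p > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gt = [(0, 10), (20, 30)]
    det = [(0, 10), (22, 32), (50, 60)]
    print(integrated_f_score(gt, det, [0.5, 0.9]))
```

Plotting the per-threshold values instead of averaging them gives a graph of performance against the quality constraint, which is the kind of performance graph the abstract describes.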

Mingyuan Jiu, Christian Wolf, Graham W. Taylor and Atilla Baskurt. Human body part estimation from depth images via spatially-constrained deep learning. In Pattern Recognition Letters 50(1):122-129, 2014.

Object recognition, human pose estimation and scene recognition are applications which are frequently solved through a decomposition into a collection of parts. The resulting local representation has significant advantages, especially in the case of occlusions and when the subject is non-rigid. Detection and recognition require modelling the appearance of the different object parts as well as their spatial layout. This representation has been particularly successful in body part estimation from depth images. Integrating the spatial layout of parts may require the minimization of complex energy functions. This is prohibitive in most real world applications and therefore often omitted. However, ignoring the spatial layout puts all the burden on the classifier, whose only available information is local appearance. We propose a new method to integrate spatial layout into parts classification without costly pairwise terms during testing. Spatial relationships are exploited in the training algorithm, but not during testing. As with competing methods, the proposed method classifies pixels independently, which makes real-time processing possible. We show that training a classifier with spatial relationships increases generalization performance when compared to classical training minimizing classification error on the training set. We present an application to human body part estimation from depth images.

Elisa Fromont, Remi Emonet, Taygun Kekec, Alain Trémeau, Christian Wolf. Contextually Constrained Deep Networks for Scene Labeling. In British Machine Vision Conference (BMVC), 2014.
Learning using deep learning architectures is a difficult problem: the complexity of the prediction model and the difficulty of solving non-convex optimization problems inherent in most learning algorithms can both lead to overfitting phenomena and bad local optima. To overcome these problems we would like to constrain parts of the network using some semantic context to 1) control its capacity while still allowing complex functions to be learned and 2) obtain more meaningful layers. We first propose to learn a weak convolutional network which would provide us with rough label maps over the neighborhood of a pixel. Then, we incorporate this weak learner in a bigger network. This iterative process aims at increasing the interpretability by constraining some feature maps to learn precise contextual information. Using the Stanford and SIFT Flow scene labeling datasets, we show how this contextual knowledge improves the accuracy of state-of-the-art architectures. The approach is generic and can be applied to similar networks where contextual cues are available at training time.
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout. Hand segmentation with structured convolutional learning. In Asian Conference on Computer Vision (ACCV), 2014.
The availability of cheap and effective depth sensors has resulted in recent advances in human pose estimation and tracking. Detailed estimation of hand pose, however, remains a challenge since fingers are often occluded and may represent just a few pixels. Moreover, labelled data is difficult to obtain. We propose a deep learning-based approach for hand pose estimation, targeting gesture recognition, that requires very little labelled data. It leverages both unlabelled data and synthetic data from renderings. The key to making it work is to integrate structural information not into the model architecture, which would slow down inference, but into the training objective. We show that adding unlabelled real-world samples significantly improves results compared to a purely supervised setting.
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout. Multi-scale deep learning for gesture detection and localization. In ECCV ChaLearn Workshop on Looking at People, 2014. (This paper describes the winning entry of the ChaLearn 2014 gesture recognition competition)
We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy which exploits i) careful initialization of individual modalities; and ii) gradual fusion of modalities from strongest to weakest cross-modality structure. We present experiments on the "ChaLearn 2014 Looking at People Challenge" gesture recognition track, in which we placed first out of 17 teams.
Alaeddine Mihoub, Gerard Bailly and Christian Wolf. Modeling Perception-Action Loops: Comparing Sequential Models with Frame-Based Classifiers. In ACM Human-Agent Interaction, 2014.

Modeling multimodal perception-action loops in face-to-face interactions is a crucial step in the process of building sensory-motor behaviors for social robots or user-aware Embodied Conversational Agents (ECA). In this paper, we compare trainable behavioral models based on sequential models (HMMs) and on classifiers (SVMs and Decision Trees), which are inherently ill-suited to modeling sequential aspects. These models aim at giving pertinent perception/action skills to robots in order to generate optimal actions given the perceived actions of others and joint goals. We applied these models to parallel speech and gaze data collected from interacting dyads. The challenge was to predict the gaze of one subject given the gaze of the interlocutor and the voice activity of both. We show that the Incremental Discrete HMM (IDHMM) generally outperforms classifiers and that injecting input context in the modeling process significantly improves the performance of all algorithms.

Simon Gay, Olivier Georgeon, Christian Wolf. Autonomous object modeling based on affordances for spatial organization of behavior. In International joint conference on development and learning and on epigenetic robotics, 2014.
We present an architecture for self-motivated agents to organize their behaviors in space according to possibilities of interactions afforded by initially unknown objects. The long-term goal is to design agents that construct their own knowledge of objects through experience, rather than exploiting precoded knowledge. Self-motivation is defined here as a tendency to experiment and to respond to behavioral opportunities afforded by the environment. Some interactions have predefined valences that specify inborn behavioral preferences. Over time, the agent learns the relation between its perception of objects and the interactions that they afford, in the form of data structures, called signatures of interaction, which encode the minimal spatial configurations that afford an interaction. The agent keeps track of enacted interactions in a topological spatial memory, to recognize and localize subsequent possibilities of interaction (through their signatures) afforded by surrounding objects. Experiments with a simulated agent and a robot show that they learn to navigate in their environment, taking into account multiple surrounding objects, reaching or avoiding objects according to the valence of the interactions that they afford.
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout. Ranked 1st of 17 in the "ChaLearn 2014 Looking at People: Gesture Recognition" Competition, in conjunction with ECCV 2014 (Results; description in the ECCV Workshop paper).


Natalia Neverova, Christian Wolf, Giulio Paci, Giacomo Sommavilla, Graham W. Taylor, Florian Nebout. A multi-scale approach to gesture detection and recognition. In ICCV Workshop on Understanding Human Activities: Context and Interactions, 2013.
We propose a generalized approach to human gesture recognition based on multiple data modalities such as depth video, articulated pose and speech. In our system, each gesture is decomposed into large-scale body motion and local subtle movements such as hand articulation. The idea of learning at multiple scales is also applied to the temporal dimension, such that a gesture is considered as a set of characteristic motion impulses, or dynamic poses. Each modality is first processed separately in short spatio-temporal blocks, where discriminative data-specific features are either manually extracted or learned. Finally, we employ a Recurrent Neural Network for modeling large-scale temporal dependencies, data fusion and ultimately gesture classification. Our experiments on the 2013 Challenge on Multi-modal Gesture Recognition dataset have demonstrated that using multiple modalities at several spatial and temporal scales leads to a significant increase in performance, allowing the model to compensate for errors of individual classifiers as well as noise in the separate channels.
Oya Celiktutan, Ceyhun Burak Akgül, Christian Wolf and Bülent Sankur. Graph-Based Analysis of Physical Exercise Actions. In the Proceedings of the ACM Multimedia Workshop on Multimedia Indexing and Information Retrieval for Healthcare, 2013.
In this paper, we develop a graph-based method to align two dynamic sequences, and apply it both to action recognition tasks and to the objective quantification of the goodness of the action performance. The automated measurement of “action quality” has the potential to be used to monitor action imitations, for example, during physical therapy. We seek matches between a query sequence and model sequences selected with graph mining. The best matches are obtained through minimizing an energy function that jointly measures space and time domain discrepancies. This graph discrepancy measure has been used for recognizing actions, for separating acceptable and unacceptable action performances, or as a continuous quantification of the action performance goodness. Experimental evaluations demonstrate the improved results of our scheme vis-à-vis its nearest competitors. Furthermore, a plausible relationship has been obtained between action perturbation, given by the joint noise variances, and quality measure, given by matching energies averaged over a sequence.
Olivier Georgeon, Christian Wolf, Simon Gay. An Enactive Approach to Autonomous Agent and Robot Learning. In the Proceedings of the international joint conference on development and learning and on epigenetic robotics, 2013.
A novel way to model autonomous learning in artificial agents and robots is introduced, called an Enactive Markov Decision Process (EMDP). An EMDP keeps perception and action embedded within sensorimotor schemes rather than dissociated. On each decision cycle, the agent tries to enact a sensorimotor scheme, and the environment informs the agent whether it was indeed enacted or whether another sensorimotor scheme was enacted instead. This new modeling approach leads to implementing a new form of self-motivation called interactional motivation. An EMDP learning algorithm is presented. Results show that this algorithm allows the agent to develop active perception as it learns to master the sensorimotor contingencies afforded by its coupling with the environment.
Mingyuan Jiu, Christian Wolf, Atilla Baskurt. Integrating spatial layout of object parts into classification without pairwise terms: application to fast body parts estimation from depth images. In the Proceedings of the international conference on computer vision theory and applications (Visapp), oral presentation, 2013.
Object recognition or human pose estimation methods often resort to a decomposition into a collection of parts. This local representation has significant advantages, especially in the case of occlusions and when the “object” is non-rigid. Detection and recognition require modelling the appearance of the different object parts as well as their spatial layout. The latter can be complex and requires the minimization of complex energy functions, which is prohibitive in most real world applications and therefore often omitted. However, ignoring the spatial layout puts all the burden on the classifier, whose only available information is local appearance. We propose a new method to integrate the spatial layout into the parts classification without costly pairwise terms. We present an application to body parts classification for human pose estimation.
Alaeddine Mihoub, Gerard Bailly and Christian Wolf. Social behavior modeling based on Incremental Discrete Hidden Markov Models. In the Proceedings of the International Workshop on Human Behavior Understanding, 2013.

Modeling multimodal face-to-face interaction is a crucial step in the process of building social robots or user-aware Embodied Conversational Agents (ECA). In this context, we present a novel approach for human behavior analysis and generation based on what we call the “Incremental Discrete Hidden Markov Model” (IDHMM). Joint multimodal activities of interlocutors are first modeled by a set of DHMMs that are specific to supposed joint cognitive states of the interlocutors. Respecting a task-specific syntax, the IDHMM is then built from these DHMMs and split into i) a recognition model that will determine the most likely sequence of cognitive states given the multimodal activity of the interlocutor, and ii) a generative model that will compute the most likely activity of the speaker given this estimated sequence of cognitive states. Short-Term Viterbi (STV) decoding is used to incrementally recognize and generate behavior. The proposed model is applied to parallel speech and gaze data of interacting dyads.


Mingyuan Jiu, Christian Wolf, Christophe Garcia and Atilla Baskurt. Supervised learning and codebook optimization for bag of words models. In Cognitive Computation, Springer Verlag, (4):409-419, 2012.

In this paper, we present a novel approach for supervised codebook learning and optimization for bag of words models. This type of model is frequently used in visual recognition tasks like object class recognition or human action recognition. An entity is represented as a histogram of codewords, which are traditionally clustered with unsupervised methods like k-means or random forests, and then classified in a supervised way. We propose a new supervised method for joint codebook creation and class learning, which learns the cluster centers of the codebook in a goal-directed way using the class labels of the training set. As a result, the codebook is highly correlated to the recognition problem, leading to a more discriminative codebook. We propose two different learning algorithms, one based on error backpropagation and one based on cluster label reassignment. We apply the proposed method to human action recognition from video sequences and evaluate it on the KTH dataset, reporting very promising results. The proposed technique makes it possible to improve the discriminative power of an unsupervised learned codebook, or to keep the discriminative power while decreasing the size of the learned codebook, thus decreasing the computational complexity due to the nearest neighbor search.
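
To make the underlying representation concrete, here is a minimal sketch of the standard bag of words step the abstract builds on: each local descriptor is assigned to its nearest codeword and the entity is summarized as a normalized histogram. It only illustrates the unsupervised representation and the nearest neighbor search mentioned above, not the supervised codebook learning proposed in the paper. The function name `bow_histogram` and the toy data are assumptions made for this example.

```python
import numpy as np


def bow_histogram(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments.

    descriptors: array of shape (n, d), one local descriptor per row.
    codebook:    array of shape (k, d), one cluster center per row.
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # Nearest neighbor search: index of the closest codeword per descriptor.
    assignments = dist2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()


if __name__ == "__main__":
    codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
    descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]])
    print(bow_histogram(descriptors, codebook))
```

The cost of the distance computation grows linearly with the number of codewords, which is why a smaller codebook with the same discriminative power reduces the computational complexity.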

Vincent Vidal, Christian Wolf, Florent Dupont. Combinatorial Mesh Optimization. In The Visual Computer, 28(5):511-525, 2012.

A new mesh optimization framework for 3D triangular surface meshes is presented, which formulates the task as an energy minimization problem in the same spirit as in Hoppe et al. [1]. The desired mesh properties are controlled through a global energy function including data-attachment terms measuring the fidelity to the original mesh, shape potentials favoring high quality triangles and connectivity, as well as budget terms controlling the sampling density. The optimization algorithm modifies mesh connectivity as well as the vertex positions. Solutions for the vertex repositioning step are obtained by a discrete graph cut algorithm examining global combinations of local candidates. Results on various 3D meshes compare favorably to recent state-of-the-art algorithms. Applications consist in optimizing triangular meshes and in simplifying meshes, while maintaining high mesh quality. Targeted areas are the improvement of the accuracy of numerical simulations, the convergence of numerical schemes, improvements of mesh rendering (normal field smoothness) or improvements of the geometric prediction in mesh compression techniques.

Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt. Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. In the Proceedings of the British Machine Vision Conference (BMVC), oral presentation, 2012.
We present in this paper a novel learning-based approach for video sequence classification. Contrary to the dominant methodology, which relies on hand-crafted features that are manually engineered to be optimal for a specific task, our neural model automatically learns a sparse shift-invariant representation of the local 2D+t salient information, without any use of prior knowledge. To that aim, a spatio-temporal convolutional sparse auto-encoder is trained to project a given input in a feature space, and to reconstruct it from its projection coordinates. Learning is performed in an unsupervised manner by minimizing a global parametrized objective function. The sparsity is ensured by adding a sparsifying logistic between the encoder and the decoder, while the shift-invariance is handled by including an additional hidden variable in the objective function. The temporal evolution of the obtained sparse features is learned by a long short-term memory recurrent neural network trained to classify each sequence. We show that, since the feature learning process is problem-independent, the model achieves outstanding performance when applied to two different problems, namely human action and facial expression recognition. Obtained results are superior to the state of the art on the GEMEP-FERA dataset and among the very best on the KTH dataset.
Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt. Sparse Shift-Invariant Representation of Local 2D Patterns and Sequence Learning for Human Action Recognition. In the Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), oral presentation, 2012.
Most existing methods for action recognition mainly rely on manually engineered features which, despite their good performance, are highly problem dependent. We propose in this paper a fully automated model, which learns to classify human actions without using any prior knowledge. A convolutional sparse auto-encoder learns to extract sparse shift-invariant representations of the 2D local patterns present in each video frame. The evolution of these mid-level features is learned by a Recurrent Neural Network trained to classify each sequence. Experimental results on the KTH dataset show that the proposed approach outperforms existing models which rely on learned features, and gives comparable results with the best related works.
Vincent Vidal, Christian Wolf, Florent Dupont. Mesh Segmentation and Global 3D Model Extraction. Symposium on Geometry Processing, Poster, 2012.
This paper presents a method for segmenting noisy 2-manifold meshes based on a decomposition into local shape primitives maximizing global coherence. This technique works by partitioning the input mesh into regions which can be approximated by a simple geometrical primitive such as a plane, a sphere or a cylinder. The partitioning is guided by robust shape extractions based on RANSAC sampling, and the final decision to keep a 3D model in the final decomposition is based on a global graphical model which involves spatial and label cost priors. Obtained segmentations on noisy mesh models outperform other approaches in terms of region contour smoothness and consistency with mechanical object decomposition. Applications of this work are reverse engineering, mesh structure analysis, mesh feature enhancement, noise removal, mesh compression, piecewise approximation of mesh geometry (points, normals, curvatures), and remeshing.
Christian Wolf, Atilla Baskurt. Action recognition in videos. Invited talk at International Conference on Image Processing Theory, Tools and Applications, Istanbul, 2012.

Activity recognition in video sequences is a difficult problem due to the complex characteristics of human articulated motion and its large variations. It requires motion estimation, which involves the separation of motion and visual appearance information, the suppression of irrelevant background clutter and background motion, the separation of motion belonging to different people, and the creation of models describing actions. In this talk we will briefly describe the different frameworks for action recognition, based on background subtraction and on space-time interest points, and we will focus on structured and on semi-structured models. These models attempt to bridge the gap between the rich descriptive power of fully structured models constructed from sets of local features and the convenience and the power of machine learning algorithms, which are mostly based on unstructured features embedded in vector spaces. Semi-structured models proceed by translating structured information into unstructured information, while structured models keep a full representation. As an example we will deal with graphs and graph matching algorithms. Hierarchical representations and parts based models will be investigated, which make it possible to decompose complex activities into smaller parts of less sophisticated elementary actions or elementary descriptors.

Oya Celiktutan, Christian Wolf, Bülent Sankur and Eric Lombardi. Real-Time Exact Graph Matching with Application in Human Action Recognition. In the Proceedings of the International Workshop on Human Behavior Understanding, Istanbul, 2012. Oral presentation.

Graph matching is one of the principal methods to formulate the correspondence between two sets of points in computer vision and pattern recognition. Most formulations are based on the minimization of a difficult energy function which is known to be NP-hard. Traditional methods solve the minimization problem approximately. In this paper, we derive an exact minimization algorithm and successfully apply it to action recognition in videos. In this context, we take advantage of special properties of the time domain, in particular causality and the linear order of time, and propose a new spatio-temporal graphical structure. We show that a better solution can be obtained by exactly solving an approximated problem instead of approximately solving the original problem.


Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia and Atilla Baskurt. Sequential Deep Learning for Human Action Recognition. In the Proceedings of the International Workshop on Human Behavior Understanding: Inducing Behavioral Change, 2011. Oral presentation.

We propose in this paper a fully automated deep model, which learns to classify human actions without using any prior knowledge. The first step of our scheme, based on the extension of Convolutional Neural Networks to 3D, automatically learns spatio-temporal features. A Recurrent Neural Network is then trained to classify each sequence considering the temporal evolution of the learned features for each timestep. Experimental results on the KTH dataset show that the proposed approach outperforms existing deep models, and gives comparable results with the best related works.

Vincent Vidal, Christian Wolf, Florent Dupont. Robust feature line extraction on CAD triangular meshes. In the Proceedings of the International Conference on Computer Graphics Theory and Applications, oral presentation, 2011.


Christian Wolf. Document Ink bleed-through removal with two hidden Markov random fields and a single observation field. In IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(3):431-447, 2010.

We present a new method for blind document bleed-through removal based on separate Markov Random Field (MRF) regularization for the recto and for the verso side, where separate priors are derived from the full graph. The segmentation algorithm is based on Bayesian Maximum a Posteriori (MAP) estimation. The advantages of this separate approach are the adaptation of the prior to the contents creation process (e.g. superimposing two handwritten pages), and the improvement of the estimation of the recto pixels through an estimation of the verso pixels covered by recto pixels. Moreover, the formulation as a binary labeling problem with two hidden labels per pixel naturally leads to an efficient optimization method based on the minimum cut/maximum flow in a graph. The proposed method is evaluated on scanned document images from the 18th century, showing an improvement of character recognition results compared to other restoration methods.

Christian Wolf and Gérald Gavin. Inference and parameter estimation on hierarchical belief networks for image segmentation. In Neurocomputing 73(4-6):563-569, 2010.

We introduce a new causal hierarchical belief network for image segmentation. Contrary to classical tree structured (or pyramidal) models, the factor graph of the network contains cycles. Each level of the hierarchical structure features the same number of sites as the base level and each site on a given level has several neighbors on the parent level. Compared to tree structured models, the (spatial) random process on the base level of the model is stationary, which avoids known drawbacks, namely visual artifacts in the segmented image. We propose different parameterizations of the conditional probability distributions governing the transitions between the image levels. A parametric distribution depending on a single parameter allows the design of a fast inference algorithm based on graph cuts, whereas for arbitrary distributions, we propose inference with loopy belief propagation. The method is evaluated on scanned documents, showing an improvement of character recognition results compared to other methods.

Christian Wolf and Jean-Michel Jolion. Integrating a discrete motion model into GMM based background subtraction. In the Proceedings of the IEEE International Conference on Pattern Recognition, oral presentation, 2010.
GMM based algorithms have become the de facto standard for background subtraction in video sequences, mainly because of their ability to track multiple background distributions, which allows them to handle complex scenes including moving trees, flags moving in the wind etc. However, it is not always easy to determine which distributions of the mixture belong to the background and which distributions belong to the foreground, which disturbs the results of the labeling process for each pixel. In this work we tackle this problem by taking the labeling decision jointly for all pixels of several consecutive frames, minimizing a global energy function that takes into account spatial and temporal relationships. A discrete approximate optical-flow-like motion model is integrated into the energy function, which is minimized with Ishikawa's convex graph cuts algorithm.
Anh-Phong Ta, Christian Wolf, Guillaume Lavoué, Atilla Baskurt and Jean-Michel Jolion. Pairwise features for human action recognition. In International Conference on Pattern Recognition (ICPR), 2010.
Existing action recognition approaches mainly rely on the discriminative power of individual local descriptors extracted from spatio-temporal interest points (STIP), while the geometric relationships among the local features are ignored. This paper presents new features, called pairwise features (PWF), which encode both the appearance and the spatio-temporal relations of the local features for action recognition. First STIPs are extracted, then PWFs are constructed by grouping pairs of STIPs which are both close in space and close in time. We propose a combination of two codebooks for video representation. Experiments on two standard human action datasets, the KTH dataset and the Weizmann dataset, show that the proposed approach outperforms most existing methods.
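The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the paper: interest points are taken to be plain `(x, y, t)` tuples, closeness is defined by two hypothetical thresholds `max_space` (Euclidean distance in the image plane) and `max_time` (frame difference), and the appearance descriptors that a pairwise feature would also carry are left out.

```python
from itertools import combinations
import math


def pair_stips(points, max_space, max_time):
    """Return index pairs (i, j) of interest points close in space and in time.

    points: list of (x, y, t) tuples, one per spatio-temporal interest point.
    """
    pairs = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        spatial_dist = math.hypot(p[0] - q[0], p[1] - q[1])
        temporal_dist = abs(p[2] - q[2])
        # Keep the pair only if BOTH constraints hold.
        if spatial_dist <= max_space and temporal_dist <= max_time:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    stips = [(0, 0, 0), (3, 4, 1), (100, 100, 1), (0, 0, 50)]
    print(pair_stips(stips, max_space=5, max_time=2))
```

The exhaustive loop over all pairs is quadratic in the number of points; it is used here only for clarity.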
Anh-Phong Ta, Christian Wolf, Guillaume Lavoué and Atilla Baskurt Recognizing and localizing individual activities through graph matching, in the Proceedings of the International Conference on Advanced Video and Signal-Based Surveillance, 2010 (IEEE), oral presentation, 22.5% acceptance rate; Best Paper for track 'recognition', 5% acceptance rate.
In this paper we tackle the problem of detecting individual human actions in video sequences. While the most successful methods are based on local features, which proved that they can deal with changes in background, scale and illumination, most existing methods have two main shortcomings: first, they are mainly based on the individual power of spatio-temporal interest points (STIP), and therefore ignore the spatio-temporal relationships between them. Second, these methods mainly focus on direct classification techniques to classify the human activities, as opposed to detection and localization. In order to overcome these limitations, we propose a new approach, which is based on a graph matching algorithm for activity recognition. In contrast to most previous methods which classify entire video sequences, we design a video matching method from two sets of ST-points for human activity recognition. First, points are extracted, and hypergraphs are constructed from them, i.e. graphs with edges involving more than 2 nodes (3 in our case). The activity recognition problem is then transformed into a problem of finding instances of model graphs in the scene graph. By matching local features instead of classifying entire sequences, our method is able to detect multiple different activities which occur simultaneously in a video sequence. Experiments on two standard datasets demonstrate that our method is comparable to the existing techniques on classification, and that it can, additionally, detect and localize activities.
Pierre-Yves Laffont, Jong-Yun Jun, Christian Wolf, Yu-Wing Tai, Khalid Idrissi, George Drettakis, Sung-Eui Yoon, Interactive Content-Aware Zooming, In the Proceedings of Graphics Interface, 2010.
We propose a novel, interactive content-aware zooming operator that allows effective and efficient visualization of high resolution images on small screens, which may have different aspect ratios compared to the input images. Our approach applies an image retargeting method in order to fit an entire image into the limited screen space. This can provide global, but approximate views for lower zoom levels. However, as we zoom more closely into the image, we continuously unroll the distortion to provide local, but more detailed and accurate views for higher zoom levels. In addition, we propose to use an adaptive view-dependent mesh to achieve high retargeting quality, while maintaining interactive performance. We demonstrate the effectiveness of the proposed operator by comparing it against the traditional zooming approach, and a method stemming from a direct combination of existing works.
Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks, In International Conference on Artificial Neural Networks (ICANN), 2010.
In this paper, we propose a novel approach for action classification in soccer videos using a recurrent neural network scheme. Thereby, we extract from each video action at each timestep a set of features which describe both the visual content (by means of a BoW approach) and the dominant motion (with a key point based approach). A Long Short-Term Memory-based Recurrent Neural Network is then trained to classify each video sequence considering the temporal evolution of the features for each timestep. Experimental results on the MICC-Soccer-Actions-4 database show that the proposed approach outperforms classification methods of related works (with a classification rate of 77 %), and that the combination of the two features (BoW and dominant motion) leads to a classification rate of 92 %.


Ranked 5th of 43 in the ICDAR 2009 document image binarisation contest!
Anh-Phong Ta, Christian Wolf, Guillaume Lavoué, Atilla Baskurt 3D Object detection and viewpoint selection in sketch images using local patch-based Zernike moments, in the Proceedings of the IEEE Workshop on Content Based Multimedia Indexing, pp. 189-194, 2009.
In this paper we present a new approach to detect and recognize 3D models in 2D storyboards which have been drawn during the production process of animated cartoons. Our method is robust to occlusion, scale and rotation. The lack of texture and color makes it difficult to extract local features of the target object from the sketched storyboard. Therefore the existing approaches using local descriptors like interest points can fail in such images. We propose a new framework which combines patch-based Zernike descriptors with a method enforcing spatial constraints for exactly detecting 3D models represented as a set of 2D views in the storyboards. Experimental results show that the proposed method can deal with partial object occlusion and is suitable for poorly textured objects.
Marc Mouret, Christine Solnon, Christian Wolf Classification of images based on Hidden Markov Models, in the Proceedings of the IEEE Workshop on Content Based Multimedia Indexing, pp. 169-174, 2009.
We propose to use hidden Markov models (HMMs) to classify images. Images are modeled by extracting symbols corresponding to 3x3 binary neighborhoods of interest points, and by ordering these symbols by decreasing saliency order, thus obtaining strings of symbols. HMMs are learned from sets of strings modeling classes of images. The method has been tested on the SIMPLIcity database and shows an improvement over competing approaches based on interest points. We also evaluate these approaches for classifying thumbnail images, i.e., low resolution images.
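As a rough illustration of how an image could be turned into a string of symbols, the sketch below encodes each 3x3 binary neighborhood as an integer and orders the symbols by decreasing saliency. The 9-bit row-major encoding, the `(row, col, saliency)` point format and both function names are assumptions made for this example; how the binary image and the saliency values are obtained is not covered, and the HMM training itself is omitted.

```python
def neighborhood_symbol(binary, row, col):
    """Encode the 3x3 binary neighborhood centred on (row, col) as an int in [0, 511].

    The row-major bit order is an assumption made for this sketch.
    """
    symbol = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            symbol = (symbol << 1) | binary[row + dr][col + dc]
    return symbol


def image_to_string(binary, interest_points):
    """Return the symbols of the interest points, ordered by decreasing saliency.

    interest_points: list of (row, col, saliency) tuples (hypothetical format).
    """
    ordered = sorted(interest_points, key=lambda p: p[2], reverse=True)
    return [neighborhood_symbol(binary, r, c) for r, c, _ in ordered]


# A tiny hypothetical binary image with two interior interest points.
binary = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
points = [(1, 1, 0.3), (2, 2, 0.9)]
symbols = image_to_string(binary, points)
```

The resulting lists of integers play the role of the symbol strings from which one HMM per image class would then be learned.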
Vincent Vidal, Christian Wolf, Florent Dupont, Guillaume Lavoué Global triangular mesh regularization using conditional Markov random fields. Poster (refereed but not published, acceptance rate ~35%) at Symposium on Geometry Processing, 2009
We present a global mesh optimization framework based on a Conditional Markov Random Field (CMRF or CRF) model suited for 3D triangular meshes of arbitrary topology. The remeshing task is formulated as a Bayesian estimation problem including data attached terms measuring the fidelity to the original mesh as well as a prior favoring high quality triangles. Since the best solution for vertex relocation is strongly related to the mesh connectivity, our approach iteratively modifies the mesh structure (connectivity plus vertex addition/removal) as well as the vertex positions, which are moved according to a well-defined energy function resulting from the CMRF model. Good solutions for the proposed model are obtained by a discrete graph cut algorithm examining global combinations of local candidates. Results on various 3D meshes compare favorably to recent state-of-the-art algorithms regarding the trade-off between triangle shape improvement and surface fidelity. Applications of this work mainly consist in regularizing meshes for numerical simulations and for improving mesh rendering.
Christian Wolf Families of Markov models for document image segmentation, In IEEE Machine Learning for Signal Processing Workshop, 2009
In this paper we compare several directed and undirected graphical models for different image segmentation problems in the domain of document image processing and analysis. We show that adapting the structure of the model to the specific situation at hand, for instance character restoration, recto/verso separation and segmenting high resolution character images, can significantly improve segmentation performance. We propose inference algorithms for the different models and we test them on different data sets.


Christian Wolf, Improving recto document side restoration with an estimation of the verso side from a single scanned page, In the Proceedings of the IEEE International Conference on Pattern Recognition, pp. 1-4, 2008.
We present a new method for blind document bleed through removal based on separately restoring the recto and the verso side. The segmentation algorithm is based on separate Markov random fields (MRF) which results in a better adaptation of the prior to the content creation process (e.g. superimposing two pages), and the improvement of the estimation of the verso pixels through an estimation of the verso pixels covered by recto pixels. The labels of the initial recto and verso clusters are recognized without using any color or gray value information. The proposed method is evaluated empirically as well as through OCR improvement.
Guillaume Lavoué and Christian Wolf, Markov Random Fields for Improving 3D Mesh Analysis and Segmentation, In the Proceedings of the Eurographics 2008 Workshop on 3D Object Retrieval.
Mesh analysis and clustering have become important issues in order to improve the efficiency of common processing operations like compression, watermarking or simplification. In this context we present a new method for clustering / labeling a 3D mesh given any field of scalar values associated with its vertices (curvature, density, roughness etc.). Our algorithm is based on Markov Random Fields, graphical probabilistic models. This Bayesian framework allows (1) to integrate both the attributes and the geometry in the clustering, and (2) to obtain an optimal global solution using only local interactions, due to the Markov property of the random field. We have defined new observation and prior models for 3D meshes, adapted from image processing, which achieve very good results in terms of spatial coherency of the labeling. All model parameters are estimated, resulting in a fully automatic process (the only required parameter is the number of clusters) which works in reasonable time (several seconds).


Christian Wolf and Jean-Michel Jolion Quality, quantity and generality in the evaluation of object detection algorithms Proceedings of the Image Eval Conference, July 12th, 2007, Amsterdam, NL. 8 pages.

Evaluation of object detection algorithms is a non-trivial task: a detection result is usually evaluated by comparing the bounding box of the detected object with the bounding box of the ground truth object. The commonly used precision and recall measures are computed from the overlap area of these two rectangles. However, these measures have several drawbacks: they don't give intuitive information about the proportion of the correctly detected objects and the number of false alarms, and they cannot be accumulated across multiple images without creating ambiguity in their interpretation. Furthermore, quantitative and qualitative evaluation is often mixed resulting in ambiguous measures.

In this paper we propose an approach to evaluation which tackles these problems. The performance of a detection algorithm is illustrated intuitively by performance graphs which present object level precision and recall depending on constraints on detection quality. In order to compare different detection algorithms, a representative single performance value is computed from the graphs. The evaluation method can be applied to different types of object detection algorithms. It has been tested on different text detection algorithms, among which are the participants of the Image Eval text detection competition.
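For reference, the conventional area-based measures that the abstract criticizes can be sketched as follows for a single detected rectangle and a single ground truth rectangle. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example; the sketch shows the baseline measures only, not the evaluation method proposed in the paper.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero for an empty box."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def area_precision_recall(detected, ground_truth):
    """Area precision = overlap / detected area; area recall = overlap / ground truth area."""
    inter = overlap_area(detected, ground_truth)
    det_area = area(detected)
    gt_area = area(ground_truth)
    precision = inter / det_area if det_area > 0 else 0.0
    recall = inter / gt_area if gt_area > 0 else 0.0
    return precision, recall


# Hypothetical example: the detection covers exactly the right half of the object.
ground_truth = (0, 0, 10, 10)
detected = (5, 0, 10, 10)
precision, recall = area_precision_recall(detected, ground_truth)
```

In this example the detection lies entirely inside the ground truth box, so the area precision is 1 while the area recall is only 0.5. The numbers alone do not say whether the object counts as detected, which is the kind of ambiguity the abstract points out.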


Christian Wolf and Jean-Michel Jolion. Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms, In International Journal on Document Analysis and Recognition , 8(4):280-296, 2006.

Evaluation of object detection algorithms is a non-trivial task: a detection result is usually evaluated by comparing the bounding box of the detected object with the bounding box of the ground truth object. The commonly used precision and recall measures are computed from the overlap area of these two rectangles. However, these measures have several drawbacks: they don't give intuitive information about the proportion of the correctly detected objects and the number of false alarms, and they cannot be accumulated across multiple images without creating ambiguity in their interpretation. Furthermore, quantitative and qualitative evaluation is often mixed resulting in ambiguous measures.

In this paper we propose a new approach which tackles these problems. The performance of a detection algorithm is illustrated intuitively by performance graphs which present object level precision and recall depending on constraints on detection quality. In order to compare different detection algorithms, a representative single performance value is computed from the graphs. The influence of the test database on the detection performance is illustrated by performance/generality graphs. The evaluation method can be applied to different types of object detection algorithms. It has been tested on different text detection algorithms, among which are the participants of the ICDAR 2003 text detection competition.

  Author         = {C. Wolf and J.-M. Jolion},
  Title          = {Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms},
  Journal        = {International Journal on Document Analysis and Recognition},
  year           = {2006},
  volume     = {8},
  number     = {4},
  pages      = {280-296}


S.M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J.-M. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 Robust Reading Competitions: Entries, Results and Future Directions. International Journal on Document Analysis and Recognition (IJDAR), 7(2-3):105-122, 2005 (Special Issue on Camera-based Text and Document Recognition)
This paper describes the robust reading competitions for ICDAR 2003. With the rapid growth in research over the last few years on recognizing text in natural scenes, there is an urgent need to establish some common benchmark datasets, and gain a clear understanding of the current state of the art. We use the term robust reading to refer to text images that are beyond the capabilities of current commercial OCR packages. We chose to break down the robust reading problem into three sub-problems, and run competitions for each stage, and also a competition for the best overall system. The sub-problems we chose were text locating, character recognition and word recognition. By breaking down the problem in this way, we hoped to gain a better understanding of the state of the art in each of the sub-problems. Furthermore, our methodology involved storing detailed results of applying each algorithm to each image in the data sets, allowing researchers to study in depth the strengths and weaknesses of each algorithm. The text locating contest was the only one to have any entries. We give a brief description of each entry, and present the results of this contest, showing cases where the leading entries succeed and fail. We also describe an algorithm for combining the outputs of the individual text locaters, and show how the combination scheme improves on any of the individual systems.


Graham W. Taylor and Christian Wolf Reinforcement Learning for Parameter Control of Text Detection in Images and Video Sequences Proceedings of the IEEE International Conference on Information & Communication Technologies , 2004. 6 pages.
A framework for parameterization in computer vision algorithms is evaluated by optimizing ten parameters of the text detection for semantic indexing algorithm proposed by Wolf et al. The Fuzzy ARTMAP neural network is used for generalization, offering much faster learning than in a previous tabular implementation. Difficulties in using a continuous action space are overcome by employing the DIRECT method for global optimization without derivatives. The chosen parameters are evaluated using metrics of recall and precision, and are shown to be superior to the parameters previously recommended.


Christian Wolf and Jean-Michel Jolion. Extraction and Recognition of Artificial Text in Multimedia Documents. Pattern Analysis and Applications, 6(4):309-326, 2003.
The systems currently available for content based image and video retrieval work without semantic knowledge, i.e. they use image processing methods to extract low level features of the data. The similarity obtained by these approaches does not always correspond to the similarity a human user would expect. A way to include more semantic knowledge into the indexing process is to use the text included in the images and video sequences. It is rich in information but easy to use, e.g. by key word based queries. In this paper we present an algorithm to localize artificial text in images and videos using a measure of accumulated gradients and morphological processing. The quality of the localized text is improved by robust multiple frame integration. A new technique for the binarization of the text boxes based on a criterion maximizing local contrast is proposed. Finally, detection and OCR results for a commercial OCR are presented, justifying the choice of the binarization technique.
  Author         = {C. Wolf and J.-M. Jolion},
  Title          = {Extraction and {R}ecognition of {A}rtificial {T}ext in {M}ultimedia {D}ocuments},
  Journal        = {Pattern {A}nalysis and {A}pplications},
  year           = {2003},
  volume     = {6},
  number     = {4},
  pages      = {309-326}


Christian Wolf, Jean-Michel Jolion and Francoise Chassaing. Text Localization, Enhancement and Binarization in Multimedia Documents. Proceedings of the International Conference on Pattern Recognition (ICPR), volume 2, pages 1037-1040, IEEE Computer Society. August 11th-15th, 2002, Quebec City, Canada. 4 pages.
The systems currently available for content based image and video retrieval work without semantic knowledge, i.e. they use image processing methods to extract low level features of the data. The similarity obtained by these approaches does not always correspond to the similarity a human user would expect. A way to include more semantic knowledge into the indexing process is to use the text included in the images and video sequences. It is rich in information but easy to use, e.g. by key word based queries. In this paper we present an algorithm to localize artificial text in images and videos using a measure of accumulated gradients and morphological post processing to detect the text. The quality of the localized text is improved by robust multiple frame integration. A new technique for the binarization of the text boxes is proposed. Finally, detection and OCR results for a commercial OCR are presented.
  Author         = {C. Wolf and J.-M. Jolion and F. Chassaing},
  Title          = {Text {L}ocalization, {E}nhancement and {B}inarization in {M}ultimedia {D}ocuments},
  BookTitle      = {Proceedings of the {I}nternational {C}onference on {P}attern {R}ecognition},
  Volume         = {2},
  Pages          = {1037-1040},
  year           = 2002,
Christian Wolf and David Doermann. Binarization of Low Quality Text using a Markov Random Field Model. Proceedings of the International Conference on Pattern Recognition (ICPR), volume 3, pages 160-163, IEEE Computer Society. August 11th-15th, 2002, Quebec City, Canada. 4 pages.
Binarization techniques have been developed in the document analysis community for over 30 years and many algorithms have been used successfully. On the other hand, document analysis tasks are more and more frequently being applied to multimedia documents such as video sequences. Due to low resolution and lossy compression, the binarization of text included in the frames is a non trivial task. Existing techniques work without a model of the spatial relationships in the image, which makes them less powerful. We introduce a new technique based on a Markov Random Field (MRF) model of the document. The model parameters (clique potentials) are learned from training data and the binary image is estimated in a Bayesian framework. The performance is evaluated using commercial OCR software.
  Author         = {C. Wolf and D. Doermann},
  Title          = {Binarization of {L}ow {Q}uality {T}ext using a {M}arkov {R}andom {F}ield {M}odel},
  BookTitle      = {Proceedings of the {I}nternational {C}onference on {P}attern {R}ecognition},
  Volume         = {3},
  Pages          = {160-163},
  year           = 2002,
Christian Wolf, David Doermann and Mika Rautiainen. Video Indexing and Retrieval at UMD, Proceedings of the Text Retrieval Conference (TREC), November 19th-22nd, 2002, Gaithersburg, USA. 10 pages.

Our team from the University of Maryland and INSA de Lyon participated in the feature extraction evaluation with overlay text features and in the search evaluation with a query retrieval and browsing system. For search we developed a weighted query mechanism by integrating 1) text (OCR and speech recognition) content using full text and n-grams through the MG system, 2) color correlogram indexing of image and video shots reported last year in TREC, and 3) ranked versions of the extracted binary features. A command line version of the interface allows users to formulate simple queries, store them and use weighted combinations of the simple queries to generate compound queries.

One novel component of our interactive approach is the ability for the users to formulate dynamic queries previously developed for database applications at Maryland. The interactive interface treats each video clip as a visual object in a multi-dimensional space, and each "feature" of that clip is mapped to one dimension. The user can visualize any two dimensions by placing any two features on the horizontal and vertical axes, with additional dimensions visualized by adding attributes to each object.


Christian Wolf, Jean-Michel Jolion, Walter Kropatsch, and Horst Bischof. Content based Image Retrieval using Interest Points and Texture Features, Proceedings of the International Conference on Pattern Recognition (ICPR), volume 4, pages 234-237. IEEE Computer Society. September 3rd, 2000, Barcelona, Spain. 4 pages.

Interest point detectors are used in computer vision to detect image points with special properties, which can be geometric (corners) or non-geometric (contrast etc.). Gabor functions and Gabor filters are regarded as excellent tools for feature extraction and texture segmentation. This article presents ways to combine these methods for content based image retrieval and to generate a textural description of images. Special emphasis is devoted to distance measures for the texture descriptions. Experimental results of a query system are given.

This work was supported in part by the Austrian Science Foundation (FWF) under grant S-7002-MAT.

  Author         = {C. Wolf and J.-M. Jolion and W. Kropatsch and H. Bischof},
  Title          = {Content {B}ased {I}mage {R}etrieval using {I}nterest {P}oints and {T}exture {F}eatures},
  BookTitle      = {Proceedings of the {I}nternational {C}onference on {P}attern {R}ecognition},
  Volume         = {4},
  Pages          = {234-237},
  year           = 2000,