CBMI 2014 Program


Session

Visual Concept Detection in TV and Movies

Session Chair: Georges Quénot, LIG, FR; 9:15-10:30

Comparison of Two Methods for Unsupervised Person Identification in TV Shows

Paul Gay, Gregor Dupuy, Jean-Marc Odobez, Sylvain Meignier and Paul Deléglise

We address the task of identifying the persons appearing in TV shows. The target people are all those whose identity is spoken or written on screen, such as journalists and well-known figures (politicians, athletes, celebrities, etc.). In our approach, overlaid names displayed on the images are used to identify the persons, without any biometric models for speakers or faces. Two identification methods are evaluated as part of the French REPERE evaluation campaign. The first relies on co-occurrence times between overlaid person names and speaker/face clusters, with a rule-based decision that assigns a name to each monomodal cluster. The second uses a Conditional Random Field (CRF) that combines different types of co-occurrence statistics and pairwise constraints to identify speakers and faces jointly.
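
A minimal sketch of the first (rule-based) strategy may make it concrete: each monomodal cluster receives the overlaid name it co-occurs with longest. The data layout, the Python rendition, and the duration threshold are illustrative assumptions, not the authors' exact rules.

    from collections import defaultdict

    def name_clusters(cooccurrences, min_seconds=1.0):
        """cooccurrences: iterable of (cluster_id, name, seconds)."""
        totals = defaultdict(lambda: defaultdict(float))
        for cluster, name, seconds in cooccurrences:
            totals[cluster][name] += seconds
        assignment = {}
        for cluster, names in totals.items():
            best_name, best_time = max(names.items(), key=lambda kv: kv[1])
            if best_time >= min_seconds:  # ignore spurious overlaps
                assignment[cluster] = best_name
        return assignment

    print(name_clusters([("spk0", "A. Merkel", 12.4),
                         ("spk0", "F. Hollande", 1.2),
                         ("face3", "F. Hollande", 8.0)]))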


Scene understanding for identifying persons in TV shows: beyond face authentication

Meriem Bendris, Delphine Charlet, Géraldine Damnati, Benoit Favre and Mickaël Rouvier

Our goal is to automatically identify people in TV news and debates without any predefined dictionary of people. In this paper, we focus on person identification beyond face authentication, in order to improve identification results even when no face is detectable. We propose to use automatic scene analysis as features for people identification, exploiting two features: scene classification (studio vs. report) and camera identification. People are then identified by propagating overlaid names (OCR results) and speakers to scene classes and specific camera shots. Experiments performed on the REPERE corpus show an improvement in face identification when using scene understanding features (+13.9% F-measure compared to the baseline).
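
A loose sketch of the propagation idea: within each scene class or recurring camera shot, a name obtained on some shots from OCR or speaker cues is propagated to the remaining shots of the same class. The data layout and the majority-vote rule are assumptions for illustration, not the authors' strategies.

    from collections import Counter, defaultdict

    def propagate_names(shots):
        """shots: list of dicts with keys 'scene' (class id) and 'name' (or None)."""
        votes = defaultdict(Counter)
        for shot in shots:
            if shot["name"]:
                votes[shot["scene"]][shot["name"]] += 1
        for shot in shots:
            if not shot["name"] and votes[shot["scene"]]:
                shot["name"] = votes[shot["scene"]].most_common(1)[0][0]
        return shots

    print(propagate_names([{"scene": "studio-cam2", "name": "C. Ockrent"},
                           {"scene": "studio-cam2", "name": None}]))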



Session

Local Features for Visual Retrieval

Session Chair: Patrick Gros, INRIA Rennes, FR; 11:00-12:30

Bags of Trajectory Words for Video Indexing

Sabin Tiberius Strat, Alexandre Benoit and Patrick Lambert

A semantic indexing system capable of detecting both spatial appearance and motion-related semantic concepts requires the use of both spatial and motion descriptors. However, extracting motion descriptors on very large video collections requires great computational resources, which has caused most approaches to limit themselves to a spatial description. This paper explores the use of motion descriptors to complement such spatial descriptions and improve the overall performance of a generic semantic indexing system. We propose a framework for extracting and describing trajectories of tracked points that keeps computational cost manageable, and then construct Bag-of-Words representations from these trajectories. After supervised classification, a late fusion step combines information from spatial descriptors with that from our proposed Bag-of-Trajectory-Words descriptors to improve overall results. We evaluate our approach in the challenging context of the TRECVid Semantic Indexing (SIN) dataset.
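
As a rough illustration of the Bag-of-Trajectory-Words step, the sketch below quantizes fixed-length trajectory descriptors against a k-means vocabulary and accumulates a normalized histogram per video. The descriptor layout and vocabulary size are generic assumptions, not the paper's configuration.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    train_desc = rng.normal(size=(5000, 30))  # stand-in trajectory descriptors
    video_desc = rng.normal(size=(400, 30))   # descriptors from one video

    k = 64                                    # vocabulary size (assumed)
    vocab = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_desc)

    words = vocab.predict(video_desc)         # quantize each trajectory
    bow = np.bincount(words, minlength=k).astype(float)
    bow /= bow.sum()                          # L1-normalized histogram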


Searching Images with Mpeg-7 Powered Localized dEscriptors: The SIMPLE answer to effective Content Based Image Retrieval

Chryssanthi Iakovidou, Nektarios Anagnostopoulos, Athanasios Kapoutsis, Yiannis Boutalis and Savvas Chatzichristofis

In this paper we propose and evaluate a new technique that localizes the description ability of the well-established MPEG-7 global descriptors. We employ the SURF detector to define salient image patches of blob-like textures, and use the MPEG-7 Scalable Color (SC), Color Layout (CL) and Edge Histogram (EH) descriptors to produce the final local feature vectors. In order to test the new descriptors in the most straightforward fashion, we use the Bag-of-Visual-Words framework for indexing and retrieval. Experiments conducted on two different benchmark databases with varying codebook sizes revealed an astonishing boost in the retrieval performance of the proposed descriptors, compared both to their own performance in their original global form and to other state-of-the-art local and global descriptors. An open-source implementation of the proposed descriptors is available in C#.
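
The general recipe (a salient-point detector for localization, a global descriptor computed per patch) can be sketched as follows. SURF requires an OpenCV build with the non-free contrib modules, and the per-patch color histogram below is only a stand-in for the MPEG-7 SC/CL/EH descriptors the paper actually uses.

    import cv2
    import numpy as np

    def simple_style_features(image_bgr, max_patches=100):
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        keypoints = sorted(surf.detect(gray), key=lambda k: -k.response)
        features = []
        for kp in keypoints[:max_patches]:
            x, y = int(kp.pt[0]), int(kp.pt[1])
            r = max(int(kp.size // 2), 1)
            patch = image_bgr[max(0, y - r):y + r, max(0, x - r):x + r]
            if patch.size == 0:
                continue
            hist = cv2.calcHist([patch], [0, 1, 2], None, [4, 4, 4],
                                [0, 256, 0, 256, 0, 256]).flatten()
            features.append(hist / (hist.sum() or 1.0))
        return np.array(features)  # one local vector per salient patch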


Locally Linear Salient Coding for Image Classification

Mohammadreza Babaee, Reza Bahmanyar, Gerhard Rigoll and Mihai Datcu

Representing images with descriptive features is the fundamental problem in CBIR. Feature coding, a key step in feature description, has attracted much attention in recent years. Among the proposed coding strategies, Bag-of-Words (BoW) is the most widely used model. Recently, saliency has been identified as a fundamental characteristic of BoW, and Salient Coding (SaC) was introduced based on this idea. Empirical studies show that SaC can represent the global structure of the data only with a sufficiently large number of codewords. In this paper, we remedy this limitation by introducing Locally Linear Salient Coding (LLSaC). This method discovers the global structure of the data by exploiting local linear reconstructions of the data points. This knowledge, in addition to the salient responses provided by SaC, helps to describe the structure of the data even with few codewords. Experimental results show that LLSaC obtains state-of-the-art results on various data types, such as multimedia and Earth Observation images.
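
For reference, the baseline SaC response that LLSaC builds on can be sketched as below: a descriptor activates only its closest codeword, with a strength measuring how much closer that codeword is than the next few. The saliency formula follows the common SaC formulation and may differ in detail from the paper's.

    import numpy as np

    def salient_code(x, codebook, k=5):
        d = np.linalg.norm(codebook - x, axis=1)  # distances to all codewords
        order = np.argsort(d)
        nearest, rivals = order[0], order[1:k]
        saliency = 1.0 - d[nearest] / (d[rivals].mean() + 1e-12)
        code = np.zeros(len(codebook))
        code[nearest] = max(saliency, 0.0)        # single salient response
        return code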



Session

Multi-Modal Retrieval

Session Chair: Werner Bailer, Joanneum Research, AT; 14:00-15:30

Online multimodal matrix factorization for human action video retrieval

Fabian Paez, Jorge A. Vanegas and Fabio Gonzalez

This paper addresses the problem of searching for videos containing instances of specific human actions. The proposed strategy builds a multimodal latent space representation in which both visual content and annotations are simultaneously mapped. The hypothesis behind the method is that such a latent space yields better results when built from multiple data modalities. The semantic embedding is learned using matrix factorization through stochastic gradient descent, which makes it suitable for large-scale collections. The method is evaluated on a large-scale human action video dataset with three modalities corresponding to action labels, action attributes and visual features. The evaluation is based on a query-by-example strategy, where a sample video is used as input to the system. A retrieved video is considered relevant if it contains an instance of the same human action present in the query. Experimental results show that the learned multimodal latent semantic representation produces improved performance when compared to an exclusively visual representation.
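
A toy rendition of the shared-space learning may clarify the setup: visual features X_v and annotations X_a of the same videos are both reconstructed from one latent matrix H, updated by stochastic gradient descent. Dimensions, the squared loss, and the learning rate are assumptions; the paper's factorization objective may differ.

    import numpy as np

    rng = np.random.default_rng(0)
    n, dv, da, k = 200, 64, 10, 16
    X_v = rng.normal(size=(n, dv))                 # visual features per video
    X_a = (rng.random(size=(n, da)) > 0.8) * 1.0   # labels/attributes per video

    W_v = rng.normal(scale=0.1, size=(k, dv))
    W_a = rng.normal(scale=0.1, size=(k, da))
    H = rng.normal(scale=0.1, size=(n, k))
    lr = 0.01
    for epoch in range(50):
        for i in rng.permutation(n):               # one stochastic step per video
            h = H[i].copy()
            ev = h @ W_v - X_v[i]                  # reconstruction errors
            ea = h @ W_a - X_a[i]
            H[i] -= lr * (ev @ W_v.T + ea @ W_a.T)
            W_v -= lr * np.outer(h, ev)
            W_a -= lr * np.outer(h, ea)
    # query-by-example then reduces to nearest neighbours among the rows of H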


Scalable Video Summarization of Cultural Video Documents in Cross-Media Space based on Data Cube Approach

Karina Ruby Perez Daniel, Jenny Benois-Pineau, Sofian Maabout, Gabriel Sargent and Mariko Nakano

Video summarization has become a core problem in managing the growing amount of content in multimedia databases. An efficient video summary should display an overview of the video content, and most existing approaches fulfill this goal. However, such an overview does not allow the user to access the details of interest selectively and progressively. This paper proposes a scalable video summarization approach which provides multiple views and levels of detail. Our method relies on a cross-media space and a consensus clustering approach. A video document is modeled as a data cube in which the level of detail is refined over the non-consensual features of the space. The method is designed for weakly structured content, such as cultural documentaries, and was tested on the INA corpus of cultural archives.


Inverse Square Rank Fusion for Multimodal Information Search

André Mourão, Flávio Martins and João Magalhães

Rank fusion is the task of combining multiple ranked document lists (ranks) into a single ranked list. It is a late fusion approach designed to improve document weighting and the performance of the individual systems. Rank fusion techniques have been applied in multiple domains, e.g. combining results from multiple textual retrieval functions, multimodal queries, and federated search. In this paper, we present the Inverse Square Rank (ISR) fusion method family, a set of novel, fully unsupervised rank fusion methods based on quadratic decay and logarithmic document frequency normalization. Our experiments on standard IR datasets (image and text fusion) and image datasets (image feature fusion) show that ISR outperforms existing rank fusion algorithms. The proposed technique thus has comparable or better performance than existing state-of-the-art approaches, while maintaining low computational complexity and avoiding the need for document scores or training data.
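
One plausible reading of the ISR family is sketched below: a document's fused score decays quadratically with its rank in each list, and the number of lists in which it appears contributes a logarithmic factor. The exact normalization in the paper may differ from this assumption.

    import math
    from collections import defaultdict

    def isr_fuse(ranked_lists):
        """ranked_lists: lists of doc ids, best first."""
        score, freq = defaultdict(float), defaultdict(int)
        for ranking in ranked_lists:
            for rank, doc in enumerate(ranking, start=1):
                score[doc] += 1.0 / rank ** 2     # quadratic rank decay
                freq[doc] += 1
        fused = {d: s * math.log1p(freq[d]) for d, s in score.items()}
        return sorted(fused, key=fused.get, reverse=True)

    print(isr_fuse([["a", "b", "c"], ["b", "a", "d"]]))  # -> ['a', 'b', ...]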



Special Session

Image Retrieval in Remote Sensing

Session Chairs: Sébastien Lefèvre, University of Bretagne Sud and Philippe-Henri Gosselin, ENSEA, FR; Room E.2.42, 14:00-15:30

Evaluation of Second-order Visual Features for Land-Use Classification

Romain Negrel, David Picard and Philippe-Henri Gosselin

This paper investigates the use of recent visual features based on second-order statistics, as well as new processing techniques to improve the quality of features. More specifically, we present and evaluate Fisher Vectors (FV), Vectors of Locally Aggregated Descriptors (VLAD), and Vectors of Locally Aggregated Tensors (VLAT). These techniques are combined with several normalization techniques, such as power-law normalization and orthogonalisation/whitening of descriptor spaces. Results on the UC Merced land-use dataset show the relevance of these new methods for land-use classification, as well as a significant improvement over Bag-of-Words.
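
Of the normalizations evaluated, the power-law step is easy to state exactly; a minimal version with the usual alpha = 0.5 followed by L2 normalization is sketched below (the paper evaluates such normalizations rather than prescribing this snippet).

    import numpy as np

    def power_law_l2(v, alpha=0.5):
        v = np.sign(v) * np.abs(v) ** alpha  # damp bursty components
        return v / (np.linalg.norm(v) + 1e-12)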


Comparing the Information Extracted by Feature Descriptors from Images Using Huffman Coding

Gholamreza Bahmanyar, Gerhard Rigoll and Mihai Datcu

Traditionally, images are understood based on their primitive features, such as color, texture, and shape. Existing feature extraction methods usually cover a range of primitive features; SIFT, for example, extracts texture and color information to some extent in addition to shape-based information. Thus, different descriptors may cover a common range of primitive features, which we call information overlap. Selecting a set of feature descriptors with low information overlap allows a more comprehensive understanding of the data by providing a broader range of new features. This article introduces a new information-theoretic method for comparing various descriptors. The idea is to code each description of an image by Huffman coding. The distance between the coded descriptions is then measured as the information overlap, using the Levenshtein distance. Results show that the computed information overlap clearly describes the differences between learning from different descriptions of Earth Observation images.
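
The comparison pipeline can be sketched end to end: quantize each image description to a symbol stream, Huffman-code it, and measure the Levenshtein distance between the resulting bit strings. The 16-bin quantization and the random stand-in descriptors below are assumptions for illustration.

    import heapq
    from collections import Counter
    import numpy as np

    def huffman_code(symbols):
        """Prefix-free bit string per symbol; frequent symbols get shorter codes."""
        heap = [[count, [sym, ""]] for sym, count in Counter(symbols).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return {sym: code or "0" for sym, code in heap[0][1:]}

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    rng = np.random.default_rng(0)
    desc_a = np.digitize(rng.normal(size=200), np.linspace(-2, 2, 15)).tolist()
    desc_b = np.digitize(rng.normal(size=200), np.linspace(-2, 2, 15)).tolist()
    code_a, code_b = huffman_code(desc_a), huffman_code(desc_b)
    bits_a = "".join(code_a[s] for s in desc_a)
    bits_b = "".join(code_b[s] for s in desc_b)
    print(levenshtein(bits_a, bits_b))  # proxy for information overlap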


Bag of morphological words for content-based geographical retrieval

Erchan Aptoula

Placed in the context of geographical content-based image retrieval, in this paper we explore the description potential of morphological texture descriptors combined with the popular bag-of-visual-words paradigm. In particular, we adapt existing global morphological texture descriptors so that they are computed within local subwindows, and then form a vocabulary of "visual morphological words" through clustering. The resulting image features are thus visual-word histograms and are evaluated on the UC Merced Land Use/Land Cover dataset. Moreover, the local approach under study is compared against alternative global and local descriptors across a variety of settings. Despite being one of the initial attempts at localized morphological content description, the retrieval scores indicate that vocabulary-based morphological content description possesses significant discriminatory potential.
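
One way to localize a morphological texture descriptor, sketched under assumptions: a small granulometry (residues of openings with growing structuring elements) is computed per subwindow, and the resulting vectors would then be clustered into the vocabulary of morphological words. Window and structuring-element sizes here are illustrative, and the paper's actual morphological descriptors are richer.

    import cv2
    import numpy as np

    def granulometry(gray, sizes=(3, 5, 7, 9)):
        prev, feats = gray.astype(np.float32), []
        for s in sizes:
            se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (s, s))
            opened = cv2.morphologyEx(gray, cv2.MORPH_OPEN, se).astype(np.float32)
            feats.append(float((prev - opened).sum()))  # bright detail removed
            prev = opened
        return np.array(feats)

    def local_morph_descriptors(gray, win=32):
        h, w = gray.shape
        return np.array([granulometry(gray[y:y + win, x:x + win])
                         for y in range(0, h - win + 1, win)
                         for x in range(0, w - win + 1, win)])
    # k-means over these vectors yields the "visual morphological words"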


An adaptive CBIR system for remote sensed data

Houria Sebai and Assia Kourgli

Nowadays, content-based image retrieval techniques constitute powerful tools for archiving and mining large remote sensing image databases. However, the gap between the low-level features extracted in content-based retrieval and the high-level semantic concepts of user queries limits their performance. For this reason, we propose an adaptive content-based image retrieval (CBIR) approach based on 3D-LBP (Local Binary Pattern) and HOG (Histogram of Oriented Gradients) features. The aim is to increase performance by selecting image features according to the nature of the image (more or less textured and structured), while keeping the feature size small for better matching and lower complexity. The feature adaptation is based on two measures: a statistical measure on the HOG distribution to quantify shape information, and the mean range of local variances to measure texture. Experiments demonstrate that the adaptive scheme achieves higher accuracy and better performance in terms of both retrieval results and computation time.
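
The selection step might look like the following sketch, where a peakiness statistic on the HOG distribution stands in for the shape measure and the range of local block variances for the texture measure. Both statistics and the thresholds are assumptions for illustration; the paper defines its own measures.

    import numpy as np
    from skimage.feature import hog
    from skimage.util import view_as_blocks

    def choose_descriptor(gray, shape_thr=0.2, texture_thr=50.0):
        h = hog(gray, orientations=9, pixels_per_cell=(16, 16))
        shape_score = (h > h.mean() + 2 * h.std()).mean()  # strong-edge ratio
        crop = gray[:gray.shape[0] // 16 * 16, :gray.shape[1] // 16 * 16]
        var = view_as_blocks(crop, (16, 16)).reshape(-1, 256).var(axis=1)
        texture_score = var.max() - var.min()              # range of variances
        if shape_score > shape_thr:
            return "HOG"
        return "3D-LBP" if texture_score > texture_thr else "HOG+3D-LBP"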



Session

Poster & Demo Session

Session Chair: Mathias Lux; 16:00-18:00

Automatic Object Annotation From Weakly Labeled Data With Latent Structured SVM

Christian Ries, Fabian Richter, Stefan Romberg and Rainer Lienhart


Ultrasound Image Processing based on Machine Learning for the Fully Automatic Evaluation of the Carotid Intima-Media Thickness

Rosa-María Menchón-Lara and José-Luis Sancho-Gómez


Mode of Teaching Based Segmentation and Annotation of Video Lectures

Yogesh Singh Rawat, Chidansh Bhatt and Mohan S Kankanhalli


Detecting Image Communities

Ersin Esen, Savas Özkan, Seda Tankiz, Ilkay Atil and Mehmet Ali Arabaci


Efficient Approximate Nearest Neighbor Search by Optimized Residual Vector Quantization

Liefu Ai, Junqing Yu, Tao Guan and Yunfeng He


Novel Fourier Descriptor Based on Complex Coordinates Shape Signature

Emir Sokic and Samim Konjicija


Annotation of still images by multiple visual concepts

Abdelkader Hamadi, Philippe Mulhem and Georges Quénot


Augmenting Training Sets with Still Images for Video Concept Detection

Sebastian Gerke, Antje Linnemann and Patrick Ndjiki-Nya


Improving Tag Transfer for Image Annotation using Visual and Semantic Information

Sergio Rodriguez-Vaamonde, Lorenzo Torresani, Koldo Espinosa and Estibaliz Garrote


Uploader models for Video Concept Detection

Bernard Merialdo and Usman Niaz


Enhancing region-based object tracking with the SP-SIFT feature

Fulgencio Navarro, Marcos Escudero-Viñolo and Jesús Bescós


Automatic propagation of manual annotations for multimodal person identification

Mateusz Budnik, Johann Poignant, Laurent Besacier and Georges Quénot


LabelMovie: Semi-supervised Machine Video Annotation Tool with Quality Assurance and Crowdsourcing Options (DEMO)

Zsolt Palotai, Miklós Láng, András Sárkány, Zoltán Tősér, Daniel Sonntag, Takumi Toyama and András Lőrincz


Towards Efficient Multimedia Exploration Using The Metric Space Approach (DEMO)

Jakub Lokoč, Tomáš Grosup, Přemysl Čech and Tomáš Skopal


AXES-RESEARCH — A User-Oriented Tool for Enhanced Multimodal Search and Retrieval in Audiovisual Libraries (DEMO)

Peggy van der Kreeft, Kay Macquarrie, Max Kemman, Martijn Kleppe and Kevin McGuinness

