The long-tail phenomenon appears when a small number of objects/words/classes are very frequent and thus easy to model, while many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long-tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image description, which can be seen as an ultimate rare-class problem, since each image is attributed a novel, yet structured, class in the form of a meaningful descriptive sentence.
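To make the embedding idea concrete, here is a toy sketch (not the speaker's actual method) of zero-shot classification in a shared space: images are projected into the space where class labels already live, and a test image is assigned to the nearest label embedding, which may belong to a class never seen in training. All vectors, the projection `W`, and the label embeddings are hypothetical stand-ins; in practice `W` is learned and label vectors might come from word embeddings or attributes.

```python
# Toy sketch of embedding-based zero-shot classification.
# Everything here is synthetic illustration, not a trained model.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-dim label embeddings: three seen classes and one
# class ("zebra") that was unseen at training time.
label_emb = {
    "cat":   np.array([1.0, 0.0, 0.0, 0.0]),
    "dog":   np.array([0.0, 1.0, 0.0, 0.0]),
    "horse": np.array([0.0, 0.0, 1.0, 0.0]),
    "zebra": np.array([0.0, 0.0, 0.7, 0.7]),  # unseen class
}

# A stand-in for a learned projection from 8-dim image features
# into the 4-dim label space.
W = rng.normal(size=(4, 8))

def predict(x, classes):
    """Project image features x into the label space and return the
    class whose embedding has the highest cosine similarity."""
    z = W @ x
    z = z / np.linalg.norm(z)
    scores = {c: float(label_emb[c] @ z) / np.linalg.norm(label_emb[c])
              for c in classes}
    return max(scores, key=scores.get)
```

Because classification reduces to nearest-neighbour search among label embeddings, adding a new class only requires its label vector, not retraining — this is what makes the zero-shot setting tractable.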
Bio: Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a research scientist at Google since 2007. Before that, he was a senior researcher in statistical machine learning at the IDIAP Research Institute from 1999. His most recent research interests are in machine learning, in particular large-scale online learning, image ranking and annotation, music and speech processing, and deep learning. He is an action editor of the Journal of Machine Learning Research and on the editorial board of the Machine Learning Journal. He was an associate editor of the Journal of Computational Statistics, general chair of the Workshops on Machine Learning for Multimodal Interactions (MLMI'2004, 2005 and 2006), programme chair of the IEEE Workshop on Neural Networks for Signal Processing (NNSP'2002), chair of BayLearn (2012, 2013, 2014), and on the programme committee of several international conferences such as NIPS, ICML, ECML and ICLR. More information can be found on his website: http://bengio.abracadoudou.com.
A traditional third-person camera passively watches the world, typically from a stationary position. In contrast, a first-person (wearable) camera is inherently linked to the ongoing experiences of its wearer. It encounters the visual world in the context of the wearer's physical activity, behavior, and goals. This distinction has many intriguing implications for computer vision research, in topics ranging from fundamental visual recognition problems to high-level multimedia applications.
I will present our recent work in this space, driven by the notion that the camera wearer is an active participant in the visual observations received. First, I will show how to exploit egomotion when learning image representations. Cognitive science tells us that proper development of visual perception requires internalizing the link between "how I move" and "what I see"---yet today's best recognition methods are deprived of this link, learning solely from bags of images downloaded from the Web. We introduce a deep feature learning approach that embeds information not only from the video stream the observer sees, but also the motor actions he simultaneously makes. We demonstrate the impact for recognition, including a scenario where features learned from ego-video on an autonomous car substantially improve large-scale scene recognition. Next, I will present our work exploring video summarization from the first person perspective. Leveraging cues about ego-attention and interactions to infer a storyline, we automatically detect the highlights in long videos. We show how hours of wearable camera data can be distilled to a succinct visual storyboard that is understandable in just moments, and examine the possibility of person- and scene-independent cues for heightened attention. Overall, whether considering action or attention, the first-person setting offers exciting new opportunities for large-scale visual learning.
Bio: Kristen Grauman is an Associate Professor in the Department of Computer Science at the University of Texas at Austin. Her research in computer vision and machine learning focuses on visual search and recognition. Before joining UT-Austin in 2007, she received her Ph.D. in the EECS department at MIT, in the Computer Science and Artificial Intelligence Laboratory. She is an Alfred P. Sloan Research Fellow and Microsoft Research Faculty Fellow, and a recipient of the 2013 PAMI Young Researcher Award, IJCAI Computers and Thought Award, and Presidential Early Career Award for Scientists and Engineers (PECASE). She and her collaborators were recognized with the CVPR Best Student Paper Award in 2008 for their work on hashing algorithms for large-scale image retrieval, and the Marr Best Paper Prize at ICCV in 2011 for their work on modeling relative visual attributes.
Small image patches tend to repeat "as is" at multiple scales of a natural image. This fractal-like behavior has been used (by us and by others) for various tasks, including image compression, super-resolution, and denoising. However, it turns out that this internal patch recurrence property is strong only in images taken under *ideal* imaging conditions, but significantly diminishes when the imaging conditions deviate from ideal ones. In the first part of my talk I will briefly review some methods for exploiting the cross-scale recurrence prior for image enhancement. In the second part of my talk I will show how we can exploit the *deviations* from ideal patch recurrences in order to recover important information about the unknown sensor, as well as unknown physical properties of the scene.
In particular, I will show how the deviations from ideal patch recurrence can be used for:
- Recovering the unknown camera blur (giving rise to "Blind Deblurring").
- Recovering the unknown atmospheric scattering parameters from images taken under fog, haze, and bad weather conditions (giving rise to "Blind Dehazing").
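As a rough illustration of the cross-scale recurrence property itself (a toy sketch, not the speaker's algorithm), one can measure how well each small patch of an image recurs in a coarser version of the same image: under ideal conditions the nearest-neighbour distances are small, and deviations from this behaviour carry information about blur or scattering. The patch size and downscaling scheme below are arbitrary choices for illustration.

```python
# Toy sketch: quantifying cross-scale patch recurrence within one image.
import numpy as np

def patches(img, size):
    """All overlapping size x size patches, flattened to vectors."""
    h, w = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(h - size + 1)
                     for j in range(w - size + 1)])

def downscale2(img):
    """Naive 2x downscaling by averaging 2x2 blocks."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def cross_scale_recurrence(img, size=3):
    """Mean distance from each patch to its nearest patch one scale down.
    Small values = strong internal patch recurrence."""
    p = patches(img, size)
    q = patches(downscale2(img), size)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

A perfectly self-similar image (e.g. a constant one) scores zero; the talk's point is that systematic growth of this score under non-ideal imaging is itself a measurable signal about the sensor and the scene.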
Bio: Michal Irani is a Professor at the Weizmann Institute of Science, Israel, in the Department of Computer Science and Applied Mathematics. She received a B.Sc. degree in Mathematics and Computer Science from the Hebrew University of Jerusalem, and M.Sc. and Ph.D. degrees in Computer Science from the same institution. During 1993-1996 she was a member of the Vision Technologies Laboratory at the Sarnoff Research Center (Princeton). She joined the Weizmann Institute in 1997. Michal's research interests center around computer vision, image processing, and video information analysis. Michal's prizes and honors include the David Sarnoff Research Center Technical Achievement Award (1994), the Yigal Allon three-year Fellowship for Outstanding Young Scientists (1998), and the Morris L. Levinson Prize in Mathematics (2003). She received the ECCV Best Paper Award in 2000 and in 2002, and was awarded the Honorable Mention for the Marr Prize in 2001 and in 2005.
Inspired by the ability of humans to interpret and understand 3D scenes nearly effortlessly, the problem of 3D scene understanding has long been advocated as the "holy grail" of computer vision. In the early days this problem was addressed in a bottom-up fashion, without enabling satisfactory or reliable results for scenes of realistic complexity. In recent years there has been considerable progress on many sub-problems of the overall 3D scene understanding task. As these sub-tasks start to reach remarkable performance levels, we argue that the problem of automatically inferring and understanding 3D scenes should be addressed again. This talk highlights recent progress on some essential components (such as object recognition and person detection), on our attempt towards 3D scene understanding, as well as on our work towards activity recognition and the ability to describe video content with natural language. These efforts are part of a longer-term agenda towards visual scene understanding, a challenge we believe it is time to take on again given the progress of recent years.
In computer vision research, considerable progress has been made on automatically naming things in digital pictures, to the level that human-like cognition is in sight. Initially, facilities like Google Search relied on the annotations accompanying an image. Recognising an instance of an arbitrary object class from the picture data itself essentially started in 2003, when Fergus and Zisserman recognised whether or not a picture contained a "motor bicycle". In hindsight, they made a few assumptions: 1. parts of one object are spatially close (so it is unnecessary to first delineate the object in the scene), 2. specific visual parts recur among all members of the concept class (such as a part of a wheel, which will be visible in all pictures of a motor bicycle), and 3. the recognition of these class-specific parts can be learned from examples. Visual concept recognition by computation from the data, given some visual examples, has achieved remarkable progress in the last decade: we will soon be able to recognise, from 50-1000 examples each, the 20,000 different things we live by. The talk gives an overview of the standard recognition algorithm, including new extensions such as algorithms to localise the object in addition to recognising it, algorithms to recognise actions, and algorithms to search for all images of one specific object of which just one image is available. We end with the question: is this algorithm also present in the brain, and what more is there to visual cognition?
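The three assumptions above underpin the bag-of-visual-words pipeline that the talk reviews. A minimal toy sketch (with a hypothetical hand-made codebook rather than one learned from data): local patch descriptors are quantised against a codebook of recurring class-specific parts, and the image is represented by the histogram of parts it contains, with no need to delineate the object first.

```python
# Toy bag-of-visual-words sketch; codebook and descriptors are
# illustrative stand-ins, not learned from real images.
import numpy as np

# Hypothetical codebook of 2-dim "part" descriptors.
codebook = np.array([[1.0, 0.0],   # part 0 (e.g. a wheel fragment)
                     [0.0, 1.0],   # part 1
                     [1.0, 1.0]])  # part 2

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return the normalised histogram of codeword counts."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histogram is what a standard classifier is trained on, which is how recognition of a class can be learned from a modest number of examples.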
Bio: Arnold W.M. Smeulders graduated from Delft Technical University in Physics and from the Leyden Medical Faculty (PhD), both on computer vision topics of the time. He is a professor at the University of Amsterdam, leading the ISIS group on visual search engines, which has been a top-three performer in the international TRECVID scientific competition for visual concept search over the last 10 years. In 2010, Euvision was co-founded as a university spin-off. He is scientific director of COMMIT/, the large public-private ICT research program of the Netherlands. He is a past associate editor of IEEE Transactions on PAMI and a current editor of IJCV. He was a visiting professor in Hong Kong, Tsukuba, Modena, Cagliari, and Orlando. He is an IAPR Fellow and a member of the Academia Europaea.
In this talk, I will present our recent work on deep learning for face recognition. With a novel deep model and a moderate training set of 400,000 face images, 99.47% accuracy has been achieved on LFW, the most challenging and extensively studied face recognition dataset. Deep learning provides a powerful tool to separate intra-personal and inter-personal variations, whose distributions are complex and highly nonlinear, through hierarchical feature transforms. It is essential to learn effective face representations by using two supervisory signals simultaneously, i.e. the face identification and verification signals. Some people understand the success of deep learning as simply fitting a dataset with a complex model with many parameters. To clarify this misunderstanding, we further investigate the face recognition process in deep nets: what information is encoded in neurons, and how robust they are to data corruption. We discovered several interesting properties of deep nets, including sparseness, selectiveness, and robustness. In Multi-View Perception, a hybrid deep model is proposed to simultaneously accomplish the tasks of face recognition, pose estimation, and face reconstruction. It employs deterministic and random neurons to encode identity and pose information, respectively. Given a face image taken from an arbitrary view, it can untangle the identity and view features, and at the same time the full spectrum of multi-view images of the same identity can be reconstructed. It is also capable of interpolating and predicting images under viewpoints unobserved in the training data.
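To illustrate what combining the two supervisory signals can look like, here is a toy sketch (assumptions of ours, not the speaker's exact formulation): an identification term (softmax cross-entropy over identity classes) plus a verification term (a contrastive loss on pairs of face features), combined with a hypothetical balancing weight `lam`.

```python
# Toy sketch of joint identification + verification supervision.
import numpy as np

def identification_loss(logits, label):
    """Softmax cross-entropy over identity classes."""
    logits = logits - logits.max()            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def verification_loss(f1, f2, same, margin=1.0):
    """Contrastive loss: pull same-identity feature pairs together,
    push different-identity pairs at least `margin` apart."""
    d = np.linalg.norm(f1 - f2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def joint_loss(logits1, y1, logits2, y2, f1, f2, lam=0.05):
    """Weighted sum of the two signals for a pair of face images.
    `lam` is an illustrative balancing weight."""
    ident = identification_loss(logits1, y1) + identification_loss(logits2, y2)
    verif = verification_loss(f1, f2, same=(y1 == y2))
    return ident + lam * verif
```

The identification term encourages features that separate identities, while the verification term directly shrinks intra-personal variation relative to inter-personal variation, matching the abstract's point that both signals are needed simultaneously.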
Bio: Xiaogang Wang received his Bachelor's degree in Electrical Engineering and Information Science from the Special Class for the Gifted Young at the University of Science and Technology of China in 2001, his M.Phil. degree in Information Engineering from the Chinese University of Hong Kong in 2004, and his PhD degree in Computer Science from the Massachusetts Institute of Technology in 2009. He has been an assistant professor in the Department of Electronic Engineering at the Chinese University of Hong Kong since August 2009. He received the Outstanding Young Researcher in Automatic Human Behaviour Analysis Award in 2011, the Hong Kong RGC Early Career Award in 2012, and the Young Researcher Award of the Chinese University of Hong Kong. He is an associate editor of the Image and Vision Computing journal. He was an area chair of ICCV 2011, ECCV 2014 and ACCV 2014. His research interests include computer vision, deep learning, crowd video surveillance, object detection, and face recognition.