3C Vision: Cues, Contexts, and Channels
By Virginio Cantoni, Stefano Levialdi, and Bertrand Zavidovique
Elsevier. Copyright © 2011 Elsevier Inc.
All rights reserved.
Chapter One: Natural and Artificial Vision
Vision can be considered as the activity performed by biological systems to exploit perceived light distributions for building adequate representations of natural scenes in the mind. Such representations can, in turn, be redescribed and presented in pictorial form, for example, as an artist's rendering on a canvas or on a computer screen. Two main processes are considered here. The former is mainly an analysis task, described as "the process of discovering from images what is present in the world and where it is." The latter is a synthesis task that may be described, by paraphrasing the previous definition, as "the process of rendering, through images, a model of the world." In this sense, the two processes may be considered duals of each other: the first structures the acquired data into information, whereas the second embodies an internal concept pictorially.
These two tasks may naturally be performed by humans. Nevertheless, in the 1960s, different dedicated systems were designed to execute each task.
Initially, pattern recognition and then image analysis were gradually developed to achieve practical results in different areas of image science (biomedical image classification, optical character recognition (OCR), human signature recognition, remote sensing, automatic inspection, etc.). Further, new methodologies and algorithms for extending and generalizing existing recognition techniques were established. Machine vision has often been divided into computer vision and image understanding: the former corresponds to the engineering disciplines, which try to optimize a system's performance for a class of specific actions, whereas in the latter the goal is to explain scene contents with reference to a class of models.
Initially, the artificial representations on a screen were graphical sketches made of geometrical primitives (straight lines, circles, and rectangles) using a few gray levels; next, a set of packages called Computer-Aided X (X for design, engineering, manufacturing, etc.) was implemented, igniting the explosion of the field of computer graphics. From data visualization to visual realism and pictorial manipulation, a large number of algorithms have been developed to improve image quality in texture and color rendering, illumination control, dimensionality, and other areas. Today, multimedia systems integrating images, sounds, text, and animation allow us to build virtual reality equipment, in both first and third person, which enables the user to control and interact with the visualized information.
It is worthwhile to describe, using a functional schema, the characteristics of the vision processes. By looking at Figure 1.1, we may see how the dual aspects of vision can be subdivided into six activities:
Looking: The field of view is scrutinized with both spread and purposive attention. In the spread (diffuse) instance, the observing system passively acquires all the light information from the scene, whereas in the purposive instance, the system actively inspects the visual field pursuing a goal.
Perceiving: A region of interest (ROI) is discovered in space, and/or in time, respectively, on a still frame or on an image sequence: all the computing resources of the system are then concentrated on the ROIs.
Seeing: ROIs are interpreted; the scene components are recognized, locating their position and orientation, and described in a more abstract form, where symbols give access to an associated semantics. Geometric modeling and then a variety of knowledge representations are used to achieve image understanding.
Moving upward along the other arm of V(ision), we find a brief description of the three activities that are dual to the previous ones:
Conceptualizing: A high-level decision process takes place, by means of which a visual form (concrete or abstract) is conceived to aid reasoning, understanding, and learning.
Sinopiting: A sketching activity that roughly describes the scene's spatial (and, further, spatiotemporal) composition by graphical annotation, indicating placements, shapes, and relationships (and likely further evolutions). This is still a creative activity that may later be modified and refined.
Rendering: The final composition of pictorial elements (using a gray scale or a color palette) designed to visually grasp the concepts and relationships (as in a synthetic molecule, a weather map, or a biomedical image). In some instances, like in many dynamic simulations (e.g., flight simulators), all the laws of physical optics are applied to achieve a visual realism, so providing good three-dimensionality and lifelike computer-generated image sequences. Resulting productions belong to what is now called virtual reality.
Within biological systems, along the phylogenetic evolution of primates, the activity of looking has been accomplished through a powerful and reliable system, with a limited amount of resources, that is briefly sketched in Figure 1.2.
Three different facets of the human vision system may be considered, namely, functional, physiological, and computational ones.
The first covers, by means of task-oriented strategies, operations that start from an unconscious attitude. The eye-scanning mechanism is first a random process in a wide visual field (about 180° for a fixed eye position) mainly geared to detect relevant regions (the where function). Next, an attentive inspection of such regions is performed (the what function): for this purpose the corresponding detailed information is phototransduced and delivered through the optic nerve to the brain (the acquisition function).
The second aspect is the physiological one, in which the previously mentioned functions are associated with the anatomy of the human visual system. The where function is achieved through full retinal vision, performed with a spatially nonuniform distribution of different photosensors. The what function is accomplished by foveal vision, operating with some thousands of cone cells covering a field of view of approximately 2° around the optical axis. Such cones, belonging to red, green, and blue classes, are uniformly distributed, and the minimum visual angle that can be resolved is 0.5 minutes of arc. Finally, the last function, acquisition, is obtained by transmitting nervous pulses from each retina to the brain through about 1 million fibers of the optic nerve. Three levels of preprocessing have been found in the retina that allow a compression of the signal data by two orders of magnitude.
The third, computational aspect is directly connected to the data processing mode, which differs for each function. First, a low-resolution process is executed by the full retinal vision, in parallel: each photoreceptor simultaneously acquires light information, whereas a network of retinal cells performs preprocessing. Second, foveal vision is performed on one relevant region at a time, in sequence. There, the resolution is much higher and the analysis is more sophisticated. The order in which the ROIs are analyzed, the scan path, depends on the task, as demonstrated by Yarbus. A large number of experiments on eye fixation points (the ROIs) over many different pictures were reported, to understand the causes of ROI saliency. The last function, which essentially provides neurological data to the brain, particularly to its cortical region, has been proven to operate bidirectionally so that, in turn, the brain may send control signals to the eye.
As for artificial vision, the phylogenetic evolution of machines is condensed into half a century. However, the dramatic concentration of several hundred specific vision projects into the single current PC-based multicore system can be summarized through an historical perspective. By the same token, main activities mentioned in the V-cycle, and corresponding difficulties, are reintroduced within the frame of computer evolution.
Initially, the image analysis was performed by raster-scanning gray-level images (usually having pixels with 8 bits). In the same period, the synthesis of images, so-called computer graphics, was featuring vector representation, where the image components—that is, the lines—were coded by vectors. There was a remarkable difference between the complexity of the images to be analyzed and that of the artificially generated ones, the latter usually of a simple geometrical nature.
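The difference between the two codings can be made concrete with a small sketch (a minimal illustration in Python with NumPy; the image size and byte counts are assumptions for illustration, not figures from the text):

```python
import numpy as np

# Raster representation: every pixel is stored explicitly,
# here an 8-bit gray-level image as in early image analysis.
raster = np.zeros((64, 64), dtype=np.uint8)
for t in range(64):
    raster[t, t] = 255  # draw a diagonal line pixel by pixel

# Vector representation: the same line coded as a single
# endpoint pair, as on early graphics displays.
vector_line = ((0, 0), (63, 63))

raster_bytes = raster.size  # 4096 bytes at 8 bits per pixel
vector_bytes = 4 * 2        # four coordinates at ~2 bytes each
```

The raster image costs storage proportional to its area regardless of content, whereas the vector code grows only with the number of primitives, which is one reason early synthetic images were kept geometrically simple.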
At the end of the 1970s, systems had a display-oriented image memory with interactive capabilities and video processing. Such modules were specialized for local convolution (with a limited kernel), template correlation and edge detection, as well as a high-speed address controller for zoom, pan, roam operations, and gray-scale modification.
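A minimal sketch of the local convolution such modules hard-wired (Python with NumPy assumed; the Sobel kernel and step-edge test image are illustrative choices, not taken from the text):

```python
import numpy as np

def convolve2d(image, kernel):
    """Local convolution with a small kernel, over the valid
    region only, as early video-processing modules performed."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    # True convolution slides the flipped kernel over the image.
    k = np.flipud(np.fliplr(kernel))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

# A 3x3 Sobel kernel responds to vertical intensity changes,
# one of the edge-detection operations mentioned above.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
image = np.zeros((5, 5))
image[:, 3:] = 1.0  # a vertical step edge
edges = convolve2d(image, sobel_x)
```

Restricting the kernel to a limited size such as 3×3 kept the per-pixel cost low enough for the video-rate hardware of the period.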
Currently, special hardware functions can generate vectors and characters on a graphic plane. With the addition of these features, plus high-speed central processors, increased bits per pixel, and professional color monitors, today's vision systems are suitable for both image analysis and synthesis. Finally, new interaction tools like mice, gloves, helmets, spectacles, and multimedia devices enable present workstations to achieve extremely high performance for all instances of multidimensional image processing.
The last five decades have witnessed the historical evolution of systems that have tried to emulate and compete with the human visual system. These will now be described, following the first track of the V-cycle, as may be seen in Figure 1.3.
The first systems for image analysis and pattern recognition were designed around a commercial computer on which a special-purpose interface was built for image acquisition, generally using a standard camera. At that time (1960–1970), serious obstacles were met in image transfer rate and core memory size. Moreover, the processing of raster-scanned images was computationally intensive and slow relative to the tasks at which such systems were aimed: character recognition, track recognition in spark and bubble chambers, chromosome karyotype classification, and so on.
The second decade (1970–1980) may be described as one in which image preprocessing was developed using both ad hoc hardware and suitable algorithms. Basically, most negative effects introduced by digitization, noise, limited resolution, and such were compensated through enhancement and restoration techniques so that further analysis and recognition could be conveniently performed.
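As one illustrative example of such an enhancement technique, a linear contrast stretch compensates for a digitizer that used only part of the gray-level range (a minimal sketch in Python with NumPy; the algorithm is a standard textbook operation, not one quoted from this text):

```python
import numpy as np

def contrast_stretch(image):
    """Linearly stretch the gray levels to the full 8-bit range,
    compensating for poor contrast introduced by digitization."""
    img = image.astype(float)
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(image, dtype=np.uint8)
    return ((img - lo) / (hi - lo) * 255).astype(np.uint8)

# A dim image occupying only gray levels 100-130 ...
dim = np.array([[100, 110], [120, 130]], dtype=np.uint8)
# ... is remapped onto the full 0-255 range.
enhanced = contrast_stretch(dim)
```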
During the third decade (1980–1990), to improve the machine perception strategies, sensing and processing were operating in a closed loop, achieving the so-called active perception. In this way, only the most significant information is considered, facilitating high-level processing. A wide number of new strategies were developed for obtaining only the relevant information from an image. Two of the most significant ones are the synergic exploitation of the spatial and temporal parallelisms and the multiresolution approach for smart sensing.
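The multiresolution idea can be sketched as an image pyramid (again a minimal illustration in Python with NumPy; the 2×2 averaging scheme is one common choice, not necessarily the one referenced here):

```python
import numpy as np

def build_pyramid(image, levels):
    """Multiresolution pyramid for smart sensing: each level halves
    the resolution by averaging 2x2 blocks, so coarse levels can be
    scanned cheaply before committing resources to full resolution."""
    pyramid = [image]
    for _ in range(levels - 1):
        img = pyramid[-1]
        # Trim to even dimensions so 2x2 blocks tile exactly.
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        img = img[:h, :w]
        # Average each 2x2 block into one coarse pixel.
        coarse = (img[0::2, 0::2] + img[1::2, 0::2]
                  + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0
        pyramid.append(coarse)
    return pyramid

image = np.arange(64, dtype=float).reshape(8, 8)
pyr = build_pyramid(image, 3)  # shapes: 8x8, 4x4, 2x2
```

A detector run on the coarsest level touches a small fraction of the pixels; only the regions it flags need be revisited at the finer levels, which is the essence of smart sensing.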
The critical elements of a scene may be captured by means of motion tracking, stereo vision, and change detection, using a front-end processor that may operate autonomously. In this new configuration, besides the peripheral control loop typical of active vision systems, another central control loop, which includes the main vision computer, enables us to sequentially probe the external world (artificial seeing) and compare it with its internal model of the world.
The evolution of the fourth decade (1990–2000) has brought higher computing power to the sensory area; more particularly, smart artificial retinas have been built, where the optoelectronic function is combined with moderate processing capabilities.
Finally, during the fifth decade (2000–2010), the Internet revolution and wireless technology sparked novel architectures based on a large number of interconnected active elements (sensor networks, smart dust, up to cognitive networks) that may all process images cooperatively in a natural environment. These new systems are able to provide fast, robust, and efficient image processing. Typical applications include distributed robotics, environment monitoring, emergency management (fires, earthquakes, and floods), urban sensing, etc.