Metadata Synthesis for Fast Photo Analysis
Balancing Accuracy, Performance and Programming Simplicity

The popularity of digital cameras and the growth in Internet-based photo sharing have triggered a dramatic explosion in the number of photos stored by average consumers. For most of us, the real value of having a photo collection is the ability to browse and retrieve images of people, places and events, quickly and easily.

The problem is that within just a few years, family photo collections can easily mushroom to thousands of images. As this occurs, the task of organizing, browsing and searching them can lead to a lot of frustration.

Many people organize their photo collections by using descriptive file names, grouping photos into folders, or manually tagging them with informative text. These techniques become increasingly cumbersome as the volume of a photo collection continues to grow. This issue has motivated Intel to develop solutions for practical machine-assisted photo classification.

Machine-based photo search involves synthesizing metadata – which can be defined in this case as data that describes the context of photos – by extracting structural features (descriptors) from photo media files, using them to semantically classify the images, and storing the results to enable fast index-based retrieval and organization.

This technology is very promising, but there are major obstacles to deploying it on a large scale. Limited processor performance constrains usage models. And programming complexity limits re-use and increases software development and maintenance costs.

The Intel Digital Home Innovations Team is developing a software framework that will simplify the programming model. Our goal is to make fast photo search capability accessible to developers of third-party applications.

At the same time, we are investigating ways to accelerate the process of math-intensive metadata extraction and classification by harnessing the performance of multi-core processors. The goal is to make scene classification practical so that future generations of consumer electronics devices will be able to support a variety of interesting usage models using third-party applications.

New Usage Models
Automated classification holds considerable promise within a variety of usage models and applications involving digital photos. These include searching through media stored on the desktop, photo editing, authoring slideshows and Web-enabled photo applications. These applications will ultimately include searching for photos using next-generation set top boxes and other connected CE devices in the living room.

Thematic grouping is one of the compelling consumer application areas that could be brought to life by improvements in metadata synthesis. In this type of application, the computing device is provided with a theme, and then uses compiled semantic information to assemble an appropriate photo set, such as “our baby grows up” or “Susan’s soccer career.” Improved photo search technology can make it easy and hassle-free for users to assemble photo albums, create slide shows, build scrap books for Web sites or create photo-illustrated blogs.

Intel has developed a software architecture in which the fundamental building blocks of metadata synthesis are the same regardless of the application. This software has been prototyped in Intel’s labs and is currently being shared and refined through collaboration with a select group of third-party companies. The goal of this effort is to provide software ingredients that add value to many present and future consumer software applications involving digital images.

Image Search in the Digital Home
Figure 1 gives a sequential look at a potential future usage model based on image search with metadata synthesis.


Figure 1. Image search process. Stages A, B, and C are described below.

A. Retrieval: The photos are moved from the camera to the Photo Collection (image database). Information harvested can include dates, exposure settings, filenames, and other metadata originating in the digital camera. This information is represented by the dates in the bottom of Figure 1A.

B. Machine Tagging: Illustrates the automated analysis and classification process. The photos are now sorted and organized in multiple ways, shown in the illustration by “Sunsets”, “Beaches” and “Friends”. Computing systems can organize the original photo collection with little or no user interaction. Photo analysis and classification add more information about each photo in the collection.

C. Semantic Annotation: In this example, the Internet is used to further define the “Beach” category by identifying photos of Venice Beach. The photos are each individually annotated. For example, the first photo may say “What a Sunset!”, the second “With My Best Friend Susan”, and the third “With Beach Dog”. Previous classifications could be used to provide item labels in the third set.

In this example, a user arrives home from her vacation and decides to take a quick look at her digital photos. She plugs her camera into the A/V system, picks up the remote, and sits on her sofa. As her TV comes on, she sees thumbnails of the many photographs she took while on vacation. They are loosely organized using information from the camera like the date or file name, but the system quickly analyzes the photos to better describe and organize them. While there are too many photos to view at one time, the TV’s graphical user interface does a good job of balancing miniatures of the more recent photos with graphics to illustrate the volume of other photos available. The user browses the photos, quickly moving past some while lingering on others. She finds one with her friends at the beach and captures the theme to remember what a great day that was.

Our user now hits the ‘Guide’ button on her remote. From the navigation pane, she selects ‘Beaches’ under the ‘Scenes’ menu. The navigation pane fades away, leaving a photo set that looks similar to the one she saw before, but with fewer photos in it. Each photo is of the beach. She browses these more intently. All the photos on one day are from Venice Beach.

She selects the entire day and presses the ‘Guide’ button on her remote and ‘Places’ from the navigation pane. She taps a few letters on the remote for ‘CA Beaches’ until she sees a list of beaches. She scrolls down to Venice Beach and selects that. The photos are now all annotated ‘My Day at Venice Beach, CA Oct 12th 2007.’

Balancing Speed and Simplicity
At Intel we are investigating ways to help consumers quickly and easily find the content they want in a huge domain of digital images. This research involves four areas:

  1. Collecting existing information about media by harvesting existing metadata
  2. Synthesizing metadata directly from the media itself
  3. Organizing and storing metadata for fast and easy retrieval
  4. Providing mechanisms to make the information available to users
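As an illustration of the third area, the sketch below shows one simple way that harvested and synthesized metadata could be stored for fast index-based retrieval: an inverted index that maps each tag to the set of photos carrying it. The class name, the file names, and the tags are hypothetical and chosen only for illustration; this is not a description of Intel's implementation.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index: tag -> set of photo identifiers."""

    def __init__(self):
        self._photos_by_tag = defaultdict(set)

    def add(self, photo_id, tags):
        """Record every tag (harvested or synthesized) for one photo."""
        for tag in tags:
            self._photos_by_tag[tag.lower()].add(photo_id)

    def search(self, *tags):
        """Return the photos that carry all of the given tags, sorted by id."""
        if not tags:
            return []
        matches = [self._photos_by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*matches))


# Hypothetical collection: a date harvested from the camera plus
# scene labels produced by machine classification.
index = MetadataIndex()
index.add("IMG_0001.jpg", ["2007-10-12", "Beaches", "Sunsets"])
index.add("IMG_0002.jpg", ["2007-10-12", "Beaches", "Friends"])
index.add("IMG_0003.jpg", ["2007-10-13", "Friends"])

beach_day = index.search("Beaches", "2007-10-12")
print(beach_day)
```

Because each lookup is a set intersection over precomputed entries, a query of this kind does not need to re-analyze any image at search time.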

In a related article, we recently provided an overview of some of Intel’s exploration of advanced user interfaces designed to make information available to users. The focus of this article is synthesis of new metadata directly from media.

Regardless of the application in which photo search is used, we need to overcome three major obstacles before we can realize the benefits of this technology:

  • Is the technology accurate enough? Accuracy has improved significantly over the past decade due to advances in mathematics. Classification accuracy has risen to about 90 percent, which is acceptable for many consumer applications.
  • Is the performance adequate? The classification process can require tens of seconds today. By continuing to refine classification engine algorithms and hardware architecture, we anticipate sub-second processing time in the future.
  • Is the software simple enough to support widespread adoption? Can we make the technology so readily accessible that developers can cost-effectively integrate it within applications on multiple platforms?
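To put the performance question in perspective, the short calculation below applies the per-image figures quoted later in this article (around 22 seconds today, a projected 250-500 ms) to a whole collection. The collection size of 5,000 photos is an assumption chosen for illustration, as is the simplification that images are processed one after another.

```python
SECONDS_PER_IMAGE_TODAY = 22.0             # figure quoted in this article
SECONDS_PER_IMAGE_PROJECTED = (0.25, 0.5)  # projected range quoted in this article
PHOTO_COUNT = 5000                         # assumption: a typical family collection


def collection_hours(photo_count, seconds_per_image):
    """Total sequential processing time, in hours, for a whole collection."""
    return photo_count * seconds_per_image / 3600.0


hours_today = collection_hours(PHOTO_COUNT, SECONDS_PER_IMAGE_TODAY)
minutes_projected = [
    collection_hours(PHOTO_COUNT, seconds) * 60
    for seconds in SECONDS_PER_IMAGE_PROJECTED
]
speedups = [SECONDS_PER_IMAGE_TODAY / seconds for seconds in SECONDS_PER_IMAGE_PROJECTED]

print(f"Today: {hours_today:.1f} hours")
print(f"Projected: {minutes_projected[0]:.1f} to {minutes_projected[1]:.1f} minutes")
print(f"Speedup: {speedups[1]:.0f}x to {speedups[0]:.0f}x")
```

Under these assumptions, a batch job of roughly 30 hours shrinks to between about 21 and 42 minutes, which corresponds to a 44x to 88x speedup per image.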

The remainder of this article discusses each of these challenges, and suggests solutions. At Intel our goal is to balance performance with programming complexity, subject to application requirements. We are working to make the programming model practical and make fast photo search accessible within applications running on available and emerging Intel® architecture processors.

Accuracy I: Machine Training
We look for combinations of features that are good discriminators for classifying photo scenes. If we can identify a set of features, we can teach a computer how to use those features to classify the photo. This process is called training, and the output is known as a classifier. The training process involves these steps:

  1. Select an interesting classification
  2. Choose features that can assist in the classification
  3. Train a machine to understand the association between sets of features and the classification
  4. Measure accuracy
  5. Improve and repeat

For example, let us say that we want to train the machine to detect the classification “water” in photos. We observe that water tends to have features including “soft edges” and “blue color.” We give our learning machine the ability to extract edge information and color from photographs. We then feed our learning machine many thousands of photos called the “training set.” For each of these photos, we tell the machine whether the photo has water or not. The machine extracts the feature information and looks for patterns that associate the features with the water-presence information we provided. Fundamentally, the machine calculates a decision boundary by iteratively considering many examples, in much the same way that human beings learn by example.
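The training loop described above can be sketched in a few lines. The example below uses a perceptron, a classic linear learning rule swapped in here for illustration; the article does not say which learning algorithm Intel's prototype uses. The two features (edge softness and blueness, each scaled from 0 to 1), the eight-photo training set, and all function names are hypothetical. A real training set would contain thousands of labeled photos.

```python
def predict(weights, bias, features):
    """Return 1 ('water') if the features fall on the positive side of the boundary."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(training_set, epochs=200, learning_rate=0.1):
    """Iteratively adjust a linear decision boundary from labeled examples."""
    weights = [0.0] * len(training_set[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in training_set:
            error = label - predict(weights, bias, features)
            if error:
                # Nudge the boundary toward classifying this example correctly.
                weights = [w + learning_rate * error * f
                           for w, f in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical training set: ((edge_softness, blueness), has_water)
TRAINING_SET = [
    ((0.90, 0.80), 1), ((0.80, 0.90), 1), ((0.70, 0.70), 1), ((0.85, 0.60), 1),
    ((0.20, 0.10), 0), ((0.30, 0.30), 0), ((0.10, 0.40), 0), ((0.40, 0.20), 0),
]

weights, bias = train_perceptron(TRAINING_SET)
training_errors = sum(
    predict(weights, bias, features) != label for features, label in TRAINING_SET
)
print("training errors:", training_errors)
```

The returned weights and bias together play the role of the classifier: they define the decision boundary that later separates “water” photos from the rest.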

Accuracy II: Classification
The whole point of scene classification is having the machine classify a novel photo it has not seen before. We present the machine with photographs that have not been used to train the classifier and the machine determines their classification.

When a new photo is presented for classification, the algorithm extracts the descriptors and then plots the results. Values that fall to one side of the decision boundary are negative for the feature while those falling on the other side are positive for the feature. This final step, which we call classification, is the objective of scene classification engines. We can precisely calculate the ability of the system to correctly classify photos if ground truth data is available.
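As a minimal sketch of this step, the code below classifies novel photos against a linear decision boundary and then measures accuracy against ground truth. The weights, bias, descriptor values, and ground-truth labels are all hypothetical; in practice the boundary would come from training and the descriptors from the extractors.

```python
def classify(descriptor, weights, bias):
    """True if the descriptor falls on the positive side of the decision boundary."""
    score = sum(w * d for w, d in zip(weights, descriptor)) + bias
    return score > 0


def accuracy(predictions, ground_truth):
    """Fraction of photos whose predicted class matches the known answer."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Hypothetical decision boundary: edge_softness + blueness > 1.0 means "water".
WEIGHTS, BIAS = (1.0, 1.0), -1.0

# Hypothetical descriptors (edge_softness, blueness) for photos not used in training.
novel_descriptors = [(0.9, 0.7), (0.2, 0.3), (0.6, 0.3), (0.8, 0.5), (0.1, 0.2)]
ground_truth = [True, False, True, True, False]

predictions = [classify(d, WEIGHTS, BIAS) for d in novel_descriptors]
measured_accuracy = accuracy(predictions, ground_truth)
print(predictions, measured_accuracy)
```

In this toy run the third photo contains water but falls on the negative side of the boundary, so four of five photos are classified correctly and the measured accuracy is 0.8.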

Scene classification requires many training examples, on the order of thousands to as many as 15,000. Achieving acceptable accuracy requires careful selection of training sets, which in turn requires a careful definition of what ‘scene positive’ means.

There are two extremes to getting this job done. One extreme is to train the classifier in advance and include the classification models with the software distributions. The idea here is to allow professional experts to decide what water is, train for it, and then provide the models that are used by application software. The other extreme is to train in place. The application would use training services provided by the photo analysis library to train classifiers on user-provided examples. Either approach is capable of delivering the accuracy we need for entertainment applications. Pre-training yields good out-of-the-box classification while training-in-place provides more flexibility. The middle road might combine classifier models with a set of source photos used for initial training and then allow the training models to be refined with subsequent user-provided examples.

Performance: Opportunities for Acceleration
Because descriptor extraction is the most demanding aspect of pattern classification, accelerating this process is a great way to improve photo classification performance.

We optimize descriptor extractors along three vectors: support functions, image extraction, and parallelism. While each of these optimization vectors can include aspects of the others, it is useful to consider optimization in these three layers.

  • Support functions
    If we can improve classification capability and reduce the processing time for each image, we can set the stage for compelling new usages around photo and eventually video classification. Improving classification capability requires us to increase the number and sophistication of extracted descriptors.

    Photo structural analysis algorithms segment the original image into multiple sub-image blocks. Then each image block is independently analyzed and the results are combined to yield whole image descriptions. As a class, structure descriptor extraction is inherently a Single Instruction Multiple Data (SIMD) parallel processing task, where each sub image is a basic component of parallel distribution.

    Descriptor extractors employ a large number of transforms, filters, and math services. Using optimized routines like those from the Intel® Integrated Performance Primitives yields a significant performance improvement over stock algorithms. Intel estimates that metadata generation time can be eventually cut from around 22 seconds per image today to 250-500 ms per image by 2010. (Source: Intel metrics derived from photo search prototype and engineering estimates.)

    Actual performance improvements will depend on the implementation of optimized routines, but the important point is that you can achieve significant performance improvement by a simple substitution of support functions without re-structuring the essential descriptor implementation.
     
  • Image extraction
    A software-based descriptor extractor procedure may work two ways. It can take a photo as input and implement image extraction as the first step of the procedure. Alternatively, the descriptor extractor may take an image as input leaving the work of image extraction to the caller.

    The second approach provides flexibility and programming conservation benefits, but regardless of the approach, the job of doing descriptor extraction includes the work required to do image extraction, so it is an important consideration in descriptor extractor optimization.

    Image extraction begins with capturing a digital photo using a format such as JPEG or TIFF, or a graphics capture format such as PNG or BMP. Before we extract a descriptor, we need to ‘unwrap’ the essential image from the file format. The image is a sequence of numbers describing the pixels that make up the image.

    Image representations can vary widely based on color and structure. Just as we can convert one file format to another, we can also convert image representations. By targeting a specific set of parameters for an image representation, we can ‘normalize’ images to make post-processing easier.

    There are a number of image extractors available as commercial and open source software. Intel provides a rich set of image extractors optimized for Intel® architecture as part of the Image and Video Processing libraries in the Intel® Integrated Performance Primitives. The JPEG image extractor, for example, is highly optimized to employ parallel processing where the underlying hardware supports it, and to reduce cache thrashing.
     
  • Parallelism
    Another performance optimization technique is re-structuring the descriptor extractor architecture to optimize for threaded (parallel) execution. Optimization proceeds by deconstructing a linear implementation, partitioning functionality to concurrent instruction threads, laying out the solution space to effectively utilize the memory architecture, and then re-assembling a functionally equivalent algorithm post-analysis.

    Actual performance gains are directly related to the concurrency implementation of the software and the concurrency and specific memory architecture of the processing system. Well-designed concurrency implementations will demonstrate performance acceleration across a variety of concurrent processing systems including Intel® Pentium® processors with HT Technology and Intel® Core™ microarchitecture.

    Another factor to consider is that at the component level, structural descriptor extractions are mutually independent. An edge histogram extraction, for example, is not dependent on a color structure extraction. And while classification relies on available structural descriptions, classifier computations themselves are mutually independent. As a result, distribution of the components in the photo analysis chain is straightforward and linear acceleration is readily achievable across a relatively small set of multiple processing machines.
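The block-based structure and the mutual independence of descriptor extractions described above can be sketched as follows. The image is segmented into sub-image blocks, every (extractor, block) pair is submitted as an independent task, and the per-block results are combined into a whole-image description. The two toy descriptors, the 4x4 grayscale image, and all function names are hypothetical stand-ins for real structural descriptors.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(image, block_size):
    """Segment a 2-D grayscale image (a list of rows) into square sub-image blocks."""
    blocks = []
    for top in range(0, len(image), block_size):
        for left in range(0, len(image[0]), block_size):
            blocks.append([row[left:left + block_size]
                           for row in image[top:top + block_size]])
    return blocks


def mean_intensity(block):
    """Toy color-like descriptor: average pixel value in the block."""
    pixels = [p for row in block for p in row]
    return sum(pixels) / len(pixels)


def edge_strength(block):
    """Toy edge descriptor: mean absolute difference between horizontal neighbors."""
    diffs = [abs(row[i + 1] - row[i]) for row in block for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs) if diffs else 0.0


# The extractors do not depend on one another, so they can run concurrently.
EXTRACTORS = {"mean_intensity": mean_intensity, "edge_strength": edge_strength}


def describe_image(image, block_size=2, workers=4):
    """Run every extractor on every block concurrently, then combine the results."""
    blocks = split_into_blocks(image, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: [pool.submit(extract, block) for block in blocks]
                   for name, extract in EXTRACTORS.items()}
        per_block = {name: [f.result() for f in submitted]
                     for name, submitted in futures.items()}
    # Combine per-block values into one whole-image value per descriptor.
    return {name: sum(values) / len(values) for name, values in per_block.items()}


IMAGE = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 10, 0, 10],
    [0, 10, 0, 10],
]

description = describe_image(IMAGE)
print(description)
```

This sketch shows the task decomposition only. In standard CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so real gains would come from process-based workers or from natively optimized routines that run the per-block work outside the interpreter.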

Simplicity: Reducing Programming Complexity
There are many high performance and highly accurate intelligent search solutions in university labs today. So why not use them?

Accessibility is one of the major reasons. The programming model is so complicated that third-party software vendors do not consider it cost effective to deploy the technology as they would need to do it – on a large scale across multiple computing platforms. What application vendors need is a practical programming model that removes the complexity and permits accessibility. Intel is working to meet that need.

Intel’s Goal: Accurate, Faster and Simpler Image Search
There is an immediate path to acceleration through Intel’s existing multi-core solutions. In the multi-core environment, both descriptor extraction and classification can be done in parallel to yield impressive performance gains. Combining the right APIs with the right accelerators will enable a single image search application to work across multiple processor architectures while reducing programming complexity.

As shown in Figure 2, the Intel software architecture for photo search includes an Image Analysis Framework (IAF) that abstracts media analysis from applications. This architecture is designed to accelerate metadata extraction and application performance, while enabling single applications to work with a variety of processors.


Figure 2. Intel image analysis software architecture is designed to enable faster time-to-market for third-party photo search capability based on metadata synthesis. The Image Analysis Framework (IAF) provides an abstraction layer beneath third-party applications. Intel® Integrated Performance Primitives (Intel® IPP) are multi-core software functions optimized for multimedia applications, including image analysis and metadata management. Because the IAF is transparent to the foreground application, it enables a single third-party photo application to work with multiple processors and benefit from the accelerated media analysis and metadata generation enabled by multi-core processors. Intel® IPP are optimized for Intel® processor cores to achieve a 2x-4x performance gain, and the IAF distributes tasks across cores, providing 3x performance when graphics processing units (GPU) are available. Intel estimates that metadata generation time can be eventually cut from around 22 seconds per image today to 250-500 ms per image by 2010. (Source: Intel – Metrics derived from search prototype and engineering estimates.)

Summary
As digital photography and imaging applications continue to grow in popularity, many consumers are discovering that searching for and locating photos from among thousands of images can be a cumbersome and frustrating task. Traditional text search based on file and folder names is a time-consuming process. And machine image recognition is currently effective but slow, requiring more than 20 seconds per photo and slowing the performance of foreground applications.

Intel image analysis software ingredients are designed to help application vendors take advantage of the optimized Intel® Integrated Performance Primitives and Intel multi-core processors to dramatically accelerate photo search based on metadata synthesis.

Increasing automated classification capability and dramatically reducing the processing time for each image promise to enable compelling new usage models and applications based on fast automated classification of photos and video.
