The popularity of digital cameras and the growth in Internet-based photo sharing have triggered a dramatic explosion
in the number of photo images stored by average consumers. For most of us, the real value of having a photo collection
is the ability to browse and retrieve images of people, places and events, quickly and easily.
The problem is that within just a few years, family photo collections can easily mushroom to thousands of images. As this
occurs, the task of organizing, browsing and searching them can lead to a lot of frustration.
Many people organize their photo collections by using descriptive file names, grouping photos into folders, or manually
tagging them with informative text. These techniques become increasingly cumbersome as the volume of a photo collection
continues to grow. This issue has motivated Intel to develop solutions for practical machine-assisted photo classification.
Machine-based photo search involves synthesizing metadata – which can be defined in this case as data that describes the
context of photos – by extracting structural features (descriptors) from photo media files, using them to semantically
classify the images, and storing the results to enable fast index-based retrieval and organization.
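The storage step described above can be illustrated with a small inverted index that maps each label to the photos that carry it. This is only a sketch: the function names `build_index` and `find`, the file names, and the labels are hypothetical, and the classifier output is simply supplied as a dictionary rather than computed from images.

```python
from collections import defaultdict


def build_index(labels_by_photo):
    """Map each label to the set of photos that carry it."""
    index = defaultdict(set)
    for photo, labels in labels_by_photo.items():
        for label in labels:
            index[label].add(photo)
    return index


def find(index, *labels):
    """Return the photos that carry every requested label, sorted by name."""
    if not labels:
        return []
    matches = [index.get(label, set()) for label in labels]
    return sorted(set.intersection(*matches))


# Hypothetical classifier output for four photos.
labels_by_photo = {
    "img_001.jpg": {"beach", "sunset"},
    "img_002.jpg": {"beach", "friends"},
    "img_003.jpg": {"friends"},
    "img_004.jpg": {"beach", "sunset", "friends"},
}

index = build_index(labels_by_photo)
beach_sunsets = find(index, "beach", "sunset")
```

Because the index is built once when photos are classified, a query only intersects a few sets instead of re-analyzing every image.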
This technology is very promising, but there are major obstacles to deploying it on a large scale. Limited processor
performance constrains usage models. And programming complexity limits re-use and increases software development and
maintenance costs.
The Intel Digital Home Innovations Team is developing a software framework that will simplify the programming model.
Our goal is to make fast photo search capability accessible to developers of third-party applications.
At the same time, we are investigating ways to accelerate the process of math-intensive metadata extraction and
classification by harnessing the performance of multi-core processors. The goal is to make scene classification
practical so that future generations of consumer electronics devices will be able to support a variety of interesting usage
models using third-party applications.
Automated classification holds considerable promise within a variety of usage models and applications involving digital
photos. These include searching through media stored on the desktop, photo editing, authoring slideshows and Web-enabled
photo applications. These applications will ultimately include searching for photos using next-generation set top boxes
and other connected CE devices in the living room.
Thematic grouping is one of the compelling and cool consumer application areas that could be brought to life by improvements
in metadata synthesis. In this type of application, the computing device is provided with a theme, and then uses compiled
semantic information to assemble an appropriate photo set, such as “our baby grows up” or “Susan’s soccer career.”
Improved photo search technology can make it easy and hassle-free for users to assemble photo albums, create slide shows,
build scrap books for Web sites or create photo-illustrated blogs.
Intel has developed a software architecture in which the fundamental building blocks of metadata synthesis are the same
regardless of the application. This software has been prototyped in Intel’s labs and is currently being shared and refined
through collaboration with a select group of third-party companies. The goal of this effort is to provide software ingredients
that add value to many present and future consumer software applications involving digital images.
Image Search in the Digital Home
Figure 1 gives a sequential look at a potential future usage model based on image search with metadata synthesis.

Figure 1. Image search process.
A. Retrieval: The photos are moved from the camera to the Photo Collection (image database).
Information harvested can include dates, exposure settings, filenames, and other metadata originating in the digital camera.
This information is represented by the dates in the bottom of Figure 1A.
In this example, a user arrives home from her vacation and decides to take a quick look at her digital photos. She
plugs her camera into the A/V system, picks up the remote, and sits on her sofa. As her TV comes on, she sees
thumbnails of the many photographs she took while on vacation. They are loosely organized using information from
the camera like the date or file name, but the system quickly analyzes the photos to better describe and organize them.
While there are too many photos to view at one time, the TV’s graphical user interface does a good job of balancing
miniatures of the more recent photos with graphics to illustrate the volume of other photos available. The user browses
the photos, quickly moving past some while lingering on others. She finds one with her friends at the beach and captures
the theme to remember what a great day that was.
B. Machine Tagging: Illustrates the automated analysis and classification process. The photos are now sorted and organized in
multiple ways, shown in the illustration by “Sunsets”, “Beaches” and “Friends”. Computing systems can organize the original
photo collection with little or no user interaction. Photo analysis and classification add more information about each photo
in the collection.
Our user now hits the ‘Guide’ button on her remote. From the navigation pane, she selects ‘Beaches’ under the ‘Scenes’ menu.
The navigation pane fades away, leaving a photo set that looks similar to the one she saw before, but with fewer photos in it.
Each photo is of the beach. She browses these more intently. All the photos on one day are from Venice Beach.
She selects the entire day and presses the ‘Guide’ button on her remote and ‘Places’ from the navigation pane. She taps a
few letters on the remote for ‘CA Beaches’ until she sees a list of beaches. She scrolls down to Venice Beach and selects
that. The photos are now all annotated ‘My Day at Venice Beach, CA Oct 12th 2007.’
C. Semantic Annotation: In this example, the Internet is used to further define the “Beach” category by identifying
photos of Venice Beach. The photos are each individually annotated. For example, the
first photo may say “What a Sunset!”, the second “With My Best Friend Susan”, and the third “With Beach Dog”. Previous
classifications could be used to provide item labels in the third set.
Balancing Speed and Simplicity
At Intel we are investigating ways to help consumers quickly and easily find the content they want in a huge domain of
digital images. This research involves four areas:
- Collecting existing information about media by harvesting existing metadata
- Synthesizing metadata directly from the media itself
- Organizing and storing metadata for fast and easy retrieval
- Providing mechanisms to make the information available to users
In a related article, we recently provided an overview of some of Intel’s exploration of advanced user interfaces
designed to make information available to users. The focus of this article is synthesis of new metadata directly from media.
Regardless of the application in which photo search is used, we need to overcome three major obstacles before we can realize
the benefits of this technology:
- Is the technology accurate enough? Accuracy has improved significantly over the past decade due to advances in
mathematics. Classification accuracy has risen to about 90 percent, which is acceptable for many consumer applications.
- Is the performance adequate? The classification process can require tens of seconds today. By continuing to refine
classification engine algorithms and hardware architecture, we anticipate sub-second processing time in the future.
- Is the software simple enough to support widespread adoption? Can we make the technology so readily accessible that
developers can cost-effectively integrate it within applications on multiple platforms?
The remainder of this article discusses each of these challenges, and suggests solutions. At Intel our goal is to balance
performance with programming complexity, subject to application requirements. We are working to make the programming model
practical and make fast photo search accessible within applications running on available and emerging Intel® architecture
processors.
Accuracy I: Machine Training
We look for combinations of features that are good discriminators for classifying photo scenes. If we can identify a
set of features, we can teach a computer how to use those features to classify the photo. This process is called training,
and the output is known as a classifier. The training process involves these steps:
- Select an interesting classification
- Choose features that can assist in the classification
- Train a machine to understand the association between sets of features and the classification
- Measure accuracy
- Improve and repeat
For example, let us say that we want to train the machine to detect the classification “water” in photos. We observe that
water tends to have features including “soft edges” and “blue color.” We give our learning machine the ability to extract
edge information and color from photographs. We then feed our learning machine many thousands of photos called the
“training set.” For each of these photos, we tell the machine whether the photo has water or not. The machine extracts the
feature information and looks for patterns that associate the features with the water-presence information we provided.
Fundamentally, the calculation of the decision boundary is done by iterative consideration of many examples, in much the
same way that learning by example is done in human beings.
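The training loop described above can be sketched in a few lines. The article does not name a learning algorithm, so a simple perceptron is used here purely as a stand-in. The two features per photo ("blueness" and "edge softness", each scaled from 0 to 1), the six-photo training set, and the function names are all assumptions made for illustration; a real training set would contain thousands of photos.

```python
def train_perceptron(samples, labels, epochs=50, rate=0.1):
    """Learn weights and a bias so that a positive score means 'water'."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, features)) + bias
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error:
                # Move the decision boundary toward the misclassified example.
                weights = [w + rate * error * v for w, v in zip(weights, features)]
                bias += rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the features fall on the 'water' side of the boundary."""
    score = sum(w * v for w, v in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Hypothetical (blueness, edge softness) features and water labels.
training_set = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7),
                (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
has_water = [1, 1, 1, 0, 0, 0]

weights, bias = train_perceptron(training_set, has_water)
```

The returned weights and bias together play the role of the classifier: they define the decision boundary that separates "water" from "no water" in feature space.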
Accuracy II: Classification
The whole point of scene classification is having the machine classify a photo it has not seen before. We present
the machine with photographs that have not been used to train the classifier and the machine determines their classification.
When a new photo is presented for classification, the algorithm extracts the descriptors and then plots the results. Values
that fall to one side of the decision boundary are negative for the feature while those falling on the other side are positive
for the feature. This final step, which we call classification, is the objective of scene classification engines. We can
precisely calculate the ability of the system to correctly classify photos if ground truth data is available.
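As a minimal sketch of this step, the code below applies a linear decision boundary to four held-out descriptors and measures accuracy against ground truth. The weights, bias, descriptors, and labels are invented for illustration, and a linear boundary is an assumption; the article does not specify the form of the boundary.

```python
def classify(weights, bias, descriptor):
    """Return True when the descriptor falls on the positive side of the boundary."""
    score = sum(w * v for w, v in zip(weights, descriptor)) + bias
    return score > 0


def accuracy(weights, bias, descriptors, ground_truth):
    """Fraction of photos whose predicted label matches the ground truth."""
    correct = sum(
        classify(weights, bias, d) == truth
        for d, truth in zip(descriptors, ground_truth)
    )
    return correct / len(ground_truth)


# Hypothetical classifier parameters and held-out photos (not used in training).
weights, bias = [0.13, 0.11], -0.1
held_out = [(0.85, 0.75), (0.15, 0.20), (0.60, 0.50), (0.40, 0.30)]
truth = [True, False, True, True]

held_out_accuracy = accuracy(weights, bias, held_out, truth)
```

In this toy data the last photo is labelled positive but falls just on the negative side of the boundary, so three of the four held-out photos are classified correctly.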
Scene classification requires many training examples, from thousands to as many as 15,000. Achieving acceptable accuracy
requires carefully chosen training sets and a careful definition of what ‘scene positive’ means.
There are two extremes to getting this job done. One extreme is to train the classifier in advance and include the
classification models with the software distributions. The idea here is to allow professional experts to decide what water
is, train for it, and then provide the models that are used by application software. The other extreme is to train in
place. The application would use training services provided by the photo analysis library to train classifiers on user-provided
examples. Either approach is capable of delivering the accuracy we need for entertainment applications. Pre-training yields
good out-of-the-box classification while training-in-place provides more flexibility. The middle road might combine
classifier models with a set of source photos used for initial training and then allow the training models to be refined with
subsequent user-provided examples.
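The middle road can be sketched as starting from a shipped model and nudging it with a handful of user-labelled photos. Everything here is hypothetical: the shipped weights, the user examples, the `refine` function, and the perceptron-style update rule are assumptions chosen for brevity, not a description of any Intel library.

```python
def is_positive(weights, bias, descriptor):
    """Return True when the descriptor scores above the decision boundary."""
    return sum(w * v for w, v in zip(weights, descriptor)) + bias > 0


def refine(weights, bias, examples, rate=0.1, passes=10):
    """Adjust a pre-trained model using user-labelled (descriptor, label) pairs."""
    weights = list(weights)  # leave the shipped model untouched
    for _ in range(passes):
        for descriptor, label in examples:
            error = int(label) - int(is_positive(weights, bias, descriptor))
            if error:
                weights = [w + rate * error * v for w, v in zip(weights, descriptor)]
                bias += rate * error
    return weights, bias


# Hypothetical model shipped with the application.
shipped_weights, shipped_bias = [0.13, 0.11], -0.1

# The user marks one photo the shipped model missed and confirms one negative.
user_examples = [((0.40, 0.30), True), ((0.20, 0.10), False)]

refined_weights, refined_bias = refine(shipped_weights, shipped_bias, user_examples)
```

The shipped model gives reasonable results out of the box, while the refinement step moves the boundary just enough to agree with the user's own examples.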
Performance: Opportunities for Acceleration
Because descriptor extraction is the most demanding aspect of pattern classification, accelerating this process is a great
way to improve photo classification performance.
We optimize descriptor extractors along three vectors: support functions, image extraction, and parallelism. While each of
these optimization vectors can include aspects of the others, it is useful to consider optimization in these three layers.
- Support functions
If we can improve classification capability and reduce the processing time for each image, we can set the stage for
compelling new usages around photo and eventually video classification. Improving classification capability requires
us to increase the number and sophistication of extracted descriptors.
Photo structural analysis algorithms segment the original image into multiple sub-image blocks. Then each image block
is independently analyzed and the results are combined to yield whole image descriptions. As a class, structure descriptor
extraction is inherently a Single Instruction Multiple Data (SIMD) parallel processing task, where each sub image is a
basic component of parallel distribution.
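The block-wise structure described above can be illustrated with a toy descriptor. In this sketch the image is a small 2-D list of grayscale values, the per-block statistic is a plain mean, and the function names are hypothetical. The code runs the blocks one after another; it shows only the decomposition that makes the work suitable for parallel hardware, not an actual SIMD implementation.

```python
def split_into_blocks(image, block_size):
    """Yield square sub-image blocks of a 2-D list, row by row."""
    height, width = len(image), len(image[0])
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            yield [row[left:left + block_size]
                   for row in image[top:top + block_size]]


def block_mean(block):
    """Analyze one block independently of all the others."""
    pixels = [p for row in block for p in row]
    return sum(pixels) / len(pixels)


def describe(image, block_size):
    """Combine the per-block results into a whole-image descriptor."""
    return [block_mean(block) for block in split_into_blocks(image, block_size)]


# A 4x4 grayscale test image made of four uniform 2x2 blocks.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [128, 128, 64, 64],
    [128, 128, 64, 64],
]

descriptor = describe(image, 2)
```

Because `block_mean` never looks outside its own block, the same operation can be applied to every block at once on hardware that supports it.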
Descriptor extractors employ a large number of transforms, filters, and math services. Using optimized routines like
those from the Intel® Integrated Performance Primitives yields a significant performance improvement over stock algorithms. Intel
estimates that metadata generation time can eventually be cut from around 22 seconds per image today to 250-500 ms per
image by 2010. (Source: Intel metrics derived from photo search prototype and engineering estimates.)
Actual performance improvements will depend on the implementation of optimized routines, but the important point is
that you can achieve significant performance improvement by a simple substitution of support functions without re-structuring
the essential descriptor implementation.
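To put the estimate quoted above in perspective, the short calculation below converts the per-image times into a speedup factor and into total processing time for a whole collection. The 22-second and 250-500 ms figures come from the text; the 5,000-photo collection size is an assumption chosen only for illustration.

```python
current_seconds = 22.0          # per image today (from the text)
target_seconds = (0.25, 0.5)    # per image, projected range (from the text)
collection_size = 5000          # hypothetical family photo collection

# Speedup implied by the projected range, best case first.
speedups = [current_seconds / t for t in target_seconds]

# Time to generate metadata for the whole collection.
hours_today = current_seconds * collection_size / 3600
minutes_projected = [t * collection_size / 60 for t in target_seconds]
```

Under these assumptions the projection corresponds to a 44x to 88x speedup, which turns roughly 30 hours of processing for the collection into about 21 to 42 minutes.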
- Image extraction
A software-based descriptor extractor procedure may work two ways. It can take a photo as input and implement image
extraction as the first step of the procedure. Alternatively, the descriptor extractor may take an image as input
leaving the work of image extraction to the caller.
The second approach provides flexibility and programming conservation benefits, but regardless of the approach, the job
of doing descriptor extraction includes the work required to do image extraction, so it is an important consideration
in descriptor extractor optimization.
Image extraction begins with capturing a digital photo using a format such as JPEG or TIFF, or a graphics capture format
such as PNG or BMP. Before we extract a descriptor, we need to ‘unwrap’ the essential image from the file format.
The image is a sequence of numbers describing the pixels that make up the image.
Image representations can vary widely based on color and structure. Just as we can convert one file format to another,
we can also convert image representations. By targeting a specific set of parameters for an image representation, we can
‘normalize’ images to make post-processing easier.
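A minimal sketch of normalization follows, assuming the image has already been unwrapped from its file format into rows of (r, g, b) tuples. The target representation (8-bit grayscale at a fixed size), the ITU-R BT.601 luma weights, the nearest-neighbor resize, and the function names are all choices made for this illustration rather than anything the article prescribes.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(image, out_height, out_width):
    """Resize a 2-D list by picking the nearest source pixel for each output pixel."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def normalize(rgb_image, size=(2, 2)):
    """Bring any RGB image to one fixed grayscale representation."""
    return resize_nearest(to_grayscale(rgb_image), *size)


WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# A 4x4 photo: left half white, right half black.
photo = [[WHITE, WHITE, BLACK, BLACK] for _ in range(4)]
normalized = normalize(photo)
```

Once every image arrives in the same representation, each descriptor extractor can be written against a single pixel layout.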
There are a number of image extractors available as commercial and open source software. Intel provides a rich set of
image extractors optimized for Intel® architecture as part of the Image and Video Processing libraries in the Intel®
Integrated Performance Primitives. The JPEG image extractor, for example, is highly optimized to employ parallel processing,
where it is supported by the underlying hardware, and reduce cache thrashing.
- Parallelism
Another performance optimization technique is re-structuring the descriptor extractor architecture to optimize for
threaded (parallel) execution. Optimization proceeds by deconstructing a linear implementation, partitioning functionality
to concurrent instruction threads, laying out the solution space to effectively utilize the memory architecture, and
then re-assembling a functionally equivalent algorithm post-analysis.
Actual performance gains are directly related to the concurrency implementation of the software and the concurrency and
specific memory architecture of the processing system. Well-designed concurrency implementations will demonstrate
performance acceleration across a variety of concurrent processing systems including Intel® Pentium® processors with
HT Technology and Intel® Core microarchitecture.
Another factor to consider is that at the component level, structural descriptor extractions are mutually independent.
An edge histogram extraction, for example, is not dependent on a color structure extraction. And while classification
relies on available structural descriptions, classifier computations themselves are mutually independent. As a result,
distribution of the components in the photo analysis chain is straightforward and linear acceleration is readily
achievable across a relatively small set of multiple processing machines.
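The independence of descriptor extractions can be sketched with Python's standard `concurrent.futures` module. The two extractors below are toy stand-ins invented for this example, not real edge histogram or color structure descriptors. Note also that in CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock; the sketch shows how the work is structured, and a real implementation would rely on native code or separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def edge_histogram(image):
    """Toy edge descriptor: count soft and hard horizontal transitions."""
    bins = [0, 0]  # [soft, hard]
    for row in image:
        for a, b in zip(row, row[1:]):
            bins[1 if abs(a - b) > 64 else 0] += 1
    return bins


def color_histogram(image, levels=4):
    """Toy color descriptor: count pixels in evenly spaced brightness levels."""
    bins = [0] * levels
    for row in image:
        for pixel in row:
            bins[pixel * levels // 256] += 1
    return bins


# Each extractor reads the image but shares no state with the others.
EXTRACTORS = {"edges": edge_histogram, "color": color_histogram}


def extract_all(image):
    """Submit every extractor as its own task and collect the results."""
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = {name: pool.submit(fn, image) for name, fn in EXTRACTORS.items()}
        return {name: future.result() for name, future in futures.items()}


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [128, 128, 64, 64],
    [128, 128, 64, 64],
]

descriptors = extract_all(image)
```

Adding a third extractor only means adding an entry to `EXTRACTORS`; no existing task needs to change, which is what makes distribution straightforward.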
Simplicity: Reducing Programming Complexity
There are many high performance and highly accurate intelligent search solutions in university labs today.
So why not use them?
Accessibility is one of the major reasons. The programming model is so complicated that third-party software vendors
do not consider it cost-effective to deploy the technology the way they would need to, on a large scale across
multiple computing platforms. What application vendors need is a practical programming model that removes the
complexity and permits accessibility. Intel is working to meet that need.
Intel’s Goal: Accurate, Faster and Simpler Image Search
There is an immediate path to acceleration through Intel’s existing multi-core solutions. In the multi-core
environment, both descriptor extraction and classification can be done in parallel to yield impressive performance
gains. Combining the right APIs with the right accelerators will enable a single image search application to work
across multiple processor architectures while reducing programming complexity.
As shown in Figure 2, the Intel software architecture for photo search includes an Image Analysis Framework (IAF)
that abstracts media analysis from applications. This architecture is designed to accelerate metadata extraction
and application performance, while enabling single applications to work with a variety of processors.
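The idea of an abstraction layer can be sketched as follows. This is not the IAF API, which the article does not describe in detail; the class names, the `available` check, and the single `mean_brightness` operation are all hypothetical. The point is only that the application calls one interface while the framework picks whichever implementation the platform supports.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical interface that every processing backend implements."""
    name = "abstract"

    @abstractmethod
    def available(self):
        ...

    @abstractmethod
    def mean_brightness(self, image):
        ...


class ScalarBackend(Backend):
    """Portable fallback that works on any processor."""
    name = "scalar"

    def available(self):
        return True

    def mean_brightness(self, image):
        pixels = [p for row in image for p in row]
        return sum(pixels) / len(pixels)


class AcceleratedBackend(Backend):
    """Placeholder for an optimized backend; reports itself unavailable here."""
    name = "accelerated"

    def available(self):
        return False

    def mean_brightness(self, image):
        raise NotImplementedError("no accelerator in this sketch")


class AnalysisFramework:
    """Selects the first available backend and hides the choice from callers."""

    def __init__(self, backends):
        self._backend = next(b for b in backends if b.available())

    @property
    def backend_name(self):
        return self._backend.name

    def mean_brightness(self, image):
        return self._backend.mean_brightness(image)


# The application lists backends in order of preference and never checks which ran.
framework = AnalysisFramework([AcceleratedBackend(), ScalarBackend()])
brightness = framework.mean_brightness([[0, 255], [255, 0]])
```

The application code in the last two lines would stay the same on a platform where the accelerated backend is available.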

Figure 2. Intel image analysis software architecture is designed to enable faster time-to-market for
third-party photo search capability based on metadata synthesis. The Image Analysis Framework (IAF) provides
an abstraction layer beneath third-party applications. Intel® Integrated Performance Primitives (Intel® IPP)
are multi-core software functions optimized for multimedia applications, including image analysis and metadata management.
Because the IAF is transparent to the foreground application, it enables a single third-party photo application
to work with multiple processors and benefit from the accelerated media analysis and metadata generation
enabled by multi-core processors. Intel® IPP are optimized for Intel® processor cores to achieve a
2x-4x performance gain, and the IAF distributes tasks across cores, providing 3x performance when graphics
processing units (GPUs) are available. Intel estimates that metadata generation time can eventually be cut
from around 22 seconds per image today to 250-500 ms per image by 2010. (Source: Intel – Metrics derived
from search prototype and engineering estimates.)
As digital photography and imaging applications continue to grow in popularity, many consumers are discovering that
searching for and locating photos from among thousands of images can be a cumbersome and frustrating task. Traditional
text search based on file and folder names is a time-consuming process. And machine image recognition is currently
effective but slow, requiring more than 20 seconds per photo and slowing the performance of foreground applications.
Intel image analysis software ingredients are designed to help application vendors take advantage of optimized Intel
Performance Primitives and Intel multi-core processors to dramatically accelerate photo search based on metadata synthesis.
Increasing automated classification capability and dramatically speeding up the processing time for each image promises
to enable compelling new usage models and applications based on fast automated classification of photos and video.