
The popularity of digital cameras and the growth in Internet-based photo sharing have triggered a dramatic explosion in the number of photos stored by average consumers. For most of us, the real value of having a photo collection is the ability to browse and retrieve images of people, places and events, quickly and easily.
The problem is that within just a few years, family photo collections can easily mushroom to thousands of images. As this occurs, the task of organizing, browsing and searching them can lead to a lot of frustration.
Many people organize their photo collections by using descriptive file names, grouping photos into folders, or manually tagging them with informative text. These techniques become increasingly cumbersome as the volume of a photo collection continues to grow. This issue has motivated Intel to develop solutions for practical machine-assisted photo classification.
Machine-based photo search involves synthesizing metadata – which can be defined in this case as data that describes the context of photos – by extracting structural features (descriptors) from photo media files, using them to semantically classify the images, and storing the results to enable fast index-based retrieval and organization.
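As a minimal sketch of that extract, classify and index pipeline (an illustration only, not Intel's implementation), the Python below uses a toy mean-color descriptor, a hypothetical rule-based classifier standing in for a trained model, and an in-memory inverted index. The function names, labels and thresholds are all assumptions made for the example.

```python
from collections import defaultdict

def extract_descriptors(pixels):
    """Toy descriptor: mean red, green and blue over a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def classify(descriptor):
    """Hypothetical rule-based stand-in for a trained scene classifier."""
    r, g, b = descriptor
    labels = set()
    if b > r and b > g:
        labels.add("water")    # assumption: blue-dominant means water
    if r > g and r > b:
        labels.add("sunset")   # assumption: red-dominant means sunset
    return labels

class MetadataIndex:
    """Inverted index mapping each semantic label to the photos that carry it."""

    def __init__(self):
        self._by_label = defaultdict(set)

    def add(self, photo_id, pixels):
        # Synthesize metadata once, at import time, and store it.
        for label in classify(extract_descriptors(pixels)):
            self._by_label[label].add(photo_id)

    def search(self, label):
        # Retrieval is a dictionary lookup; no image is re-analyzed.
        return sorted(self._by_label.get(label, ()))

index = MetadataIndex()
index.add("IMG_0001.jpg", [(20, 60, 200), (30, 80, 220)])
index.add("IMG_0002.jpg", [(230, 120, 40), (210, 90, 30)])
index.add("IMG_0003.jpg", [(10, 40, 180)])
print(index.search("water"))   # ['IMG_0001.jpg', 'IMG_0003.jpg']
```

The design point is that the expensive analysis runs once per photo, while every later search touches only the stored metadata.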
This technology is very promising, but there are major obstacles to deploying it on a large scale. Limited processor performance constrains usage models. And programming complexity limits re-use and increases software development and maintenance costs.
The Intel Digital Home Innovations Team is developing a software framework that will simplify the programming model. Our goal is to make fast photo search capability accessible to developers of third-party applications.
At the same time, we are investigating ways to accelerate the process of math-intensive metadata extraction and classification by harnessing the performance of multi-core processors. The goal is to make scene classification practical so that future generations of consumer electronics devices will be able to support a variety of interesting usage models using third-party applications.
Automated classification holds considerable promise within a variety of usage models and applications involving digital photos. These include searching through media stored on the desktop, photo editing, authoring slideshows and Web-enabled photo applications. These applications will ultimately include searching for photos using next-generation set top boxes and other connected CE devices in the living room.
Thematic grouping is one of the most compelling consumer application areas that could be brought to life by improvements in metadata synthesis. In this type of application, the computing device is provided with a theme, such as “our baby grows up” or “Susan's soccer career,” and then uses compiled semantic information to assemble an appropriate photo set. Improved photo search technology can make it easy and hassle-free for users to assemble photo albums, create slide shows, build scrapbooks for Web sites or create photo-illustrated blogs.
Intel has developed a software architecture in which the fundamental building blocks of metadata synthesis are the same regardless of the application. This software has been prototyped in Intel’s labs and is currently being shared and refined through collaboration with a select group of third-party companies. The goal of this effort is to provide software ingredients that add value to many present and future consumer software applications involving digital images.
Figure 1 gives a sequential look at a potential future usage model based on image search with metadata synthesis.

Figure 1. Image search process.
A. Retrieval: The photos are moved from the camera to the Photo Collection (image database). Information harvested can include dates, exposure settings, filenames, and other metadata originating in the digital camera. This information is represented by the dates in the bottom of Figure 1A.
B. Machine Tagging: This panel illustrates the automated analysis and classification process. The photos are now sorted and organized in multiple ways, shown in the illustration by “Sunsets”, “Beaches” and “Friends”. Computing systems can organize the original photo collection with little or no user interaction. Photo analysis and classification add more information about each photo in the collection.
C. Semantic Annotation: In this example, the Internet is used to further define the “Beach” category by identifying photos of Venice Beach. The photos are each individually annotated. For example, the first photo may say “What a Sunset!”, the second “With My Best Friend Susan”, and the third “With Beach Dog”. Previous classifications could be used to provide item labels in the third set.
In this example, a user arrives home from her vacation and decides to take a quick look at her digital photos. She plugs her camera into the A/V system, picks up the remote, and sits on her sofa. As her TV comes on, she sees thumbnails of the many photographs she took while on vacation. They are loosely organized using information from the camera, such as the date or file name, but the system quickly analyzes the photos to better describe and organize them. While there are too many photos to view at one time, the TV’s graphical user interface does a good job of balancing miniatures of the more recent photos with graphics that illustrate the volume of other photos available. The user browses the photos, quickly moving past some while lingering on others. She finds one with her friends at the beach and captures the theme to remember what a great day that was.
Our user now hits the ‘Guide’ button on her remote. From the navigation pane, she selects ‘Beaches’ under the ‘Scenes’ menu. The navigation pane fades away, leaving a photo set that looks similar to the one she saw before, but with fewer photos in it. Each photo is of the beach. She browses these more intently. All the photos from one day were taken at Venice Beach.
She selects the entire day, presses the ‘Guide’ button on her remote and chooses ‘Places’ from the navigation pane. She taps a few letters on the remote for ‘CA Beaches’ until she sees a list of beaches. She scrolls down to Venice Beach and selects that. The photos are now all annotated ‘My Day at Venice Beach, CA Oct 12th 2007.’
At Intel we are investigating ways to help consumers quickly and easily find the content they want in a huge domain of digital images. This research involves four areas:
In a related article, we recently provided an overview of some of Intel’s exploration of advanced user interfaces designed to make information available to users. The focus of this article is synthesis of new metadata directly from media.
Regardless of the application in which photo search is used, we need to overcome three major obstacles before we can realize the benefits of this technology:
The remainder of this article discusses each of these challenges, and suggests solutions. At Intel our goal is to balance performance with programming complexity, subject to application requirements. We are working to make the programming model practical and make fast photo search accessible within applications running on available and emerging Intel® architecture processors.
We look for combinations of features that are good discriminators for classifying photo scenes. If we can identify a set of features, we can teach a computer how to use those features to classify the photo. This process is called training, and the output is known as a classifier. The training process involves these steps:
For example, let us say that we want to train a classifier to detect water in photos. We observe that water tends to have features including “soft edges” and “blue color.” We give our learning machine the ability to extract edge information and color from photographs. We then feed our learning machine many thousands of photos called the “training set.” For each of these photos, we tell the machine whether the photo has water or not. The machine extracts the feature information and looks for patterns that associate the features with the water-presence information we provided. The result is a decision boundary: a dividing line in feature space that separates photos with water from photos without it. Fundamentally, the calculation of the decision boundary is done by iterative consideration of many examples, in much the same way that human beings learn by example.
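To make the water example concrete, here is a minimal Python sketch that learns a linear decision boundary with a perceptron. The perceptron is one simple learning rule chosen for illustration; the article does not say which learning algorithm Intel uses. The two features (edge softness and blueness, each scaled to the range 0 to 1) and the six-photo training set are invented for the example, where a real training set would contain thousands of photos.

```python
def train_perceptron(examples, epochs=100, lr=0.1):
    """Learn a linear decision boundary from labeled examples.

    examples: list of ((edge_softness, blueness), label) pairs, where both
    features are scaled to [0, 1] and label is True if the photo has water.
    Returns (weights, bias).
    """
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x0, x1), label in examples:
            target = 1 if label else -1
            score = w[0] * x0 + w[1] * x1 + bias
            if target * score <= 0:      # wrong side: nudge the boundary
                w[0] += lr * target * x0
                w[1] += lr * target * x1
                bias += lr * target
                mistakes += 1
        if mistakes == 0:                # every training photo is on the right side
            break
    return w, bias

def predicts_water(model, features):
    """True if the features fall on the positive side of the boundary."""
    w, bias = model
    return w[0] * features[0] + w[1] * features[1] + bias > 0

# Invented training set: (edge_softness, blueness), has_water
training_set = [
    ((0.9, 0.8), True), ((0.8, 0.9), True), ((0.7, 0.7), True),
    ((0.2, 0.1), False), ((0.3, 0.2), False), ((0.1, 0.4), False),
]
model = train_perceptron(training_set)
print(all(predicts_water(model, f) == label for f, label in training_set))
```

Each pass over the training set moves the boundary only when a photo lands on the wrong side, which is the "iterative consideration of many examples" described above in its simplest form.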
The whole point of scene classification is having the machine classify a novel photo it has not seen before. We present the machine with photographs that have not been used to train the classifier and the machine determines their classification.
When a new photo is presented for classification, the algorithm extracts the descriptors and then plots the results. Values that fall on one side of the decision boundary are classified as negative for the scene, while those falling on the other side are classified as positive. This final step, which we call classification, is the objective of scene classification engines. If ground truth data is available, we can precisely measure how well the system classifies photos.
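The classification step and its evaluation against ground truth can be sketched in a few lines of Python. The weights, bias and test photos below are assumptions chosen for illustration; in practice the boundary would come from training and the labels from human annotation.

```python
def classify(weights, bias, descriptor):
    """Positive if the descriptor falls on the positive side of the boundary."""
    score = sum(w * x for w, x in zip(weights, descriptor)) + bias
    return score > 0

def evaluate(weights, bias, labeled_photos):
    """Compare predictions with ground truth labels."""
    tp = fp = tn = fn = 0
    for descriptor, truth in labeled_photos:
        predicted = classify(weights, bias, descriptor)
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical boundary and held-out photos: (descriptor, ground truth)
weights, bias = (1.0, 1.0), -1.0
held_out = [
    ((0.9, 0.7), True),    # correctly classified as positive
    ((0.6, 0.3), True),    # missed (false negative)
    ((0.2, 0.3), False),   # correctly classified as negative
    ((0.8, 0.5), False),   # false positive
    ((0.7, 0.9), True),    # correctly classified as positive
]
result = evaluate(weights, bias, held_out)
print(result["accuracy"])   # 0.6
```

Reporting precision and recall alongside accuracy matters when one class is rare: a classifier that never reports "water" can still score high accuracy on a collection with few water photos.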
Scene classification requires many training examples, from thousands to as many as 15,000. Achieving acceptable accuracy requires careful selection of the training sets, which in turn depends on a careful definition of what ‘scene positive’ means.
There are two extremes to getting this job done. One extreme is to train the classifier in advance and include the classification models with the software distributions. The idea here is to allow professional experts to decide what water is, train for it, and then provide the models that are used by application software. The other extreme is to train in place. The application would use training services provided by the photo analysis library to train classifiers on user-provided examples. Either approach is capable of delivering the accuracy we need for entertainment applications. Pre-training yields good out-of-the-box classification while training-in-place provides more flexibility. The middle road might combine classifier models with a set of source photos used for initial training and then allow the training models to be refined with subsequent user-provided examples.
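One way to picture the middle road is a classifier that ships with pre-trained parameters and is then corrected by the user's own examples. The Python sketch below assumes a linear model and a perceptron-style correction; the class name, the shipped weights and the update rule are hypothetical and are not taken from the photo analysis library described here.

```python
class RefinableClassifier:
    """Linear classifier that starts from shipped weights and accepts corrections."""

    def __init__(self, weights, bias):
        self.weights = list(weights)   # pre-trained values shipped with the software
        self.bias = bias

    def predict(self, descriptor):
        score = sum(w * x for w, x in zip(self.weights, descriptor)) + self.bias
        return score > 0

    def refine(self, descriptor, is_positive, lr=0.1):
        """Apply one correction if the user's label disagrees with the model.

        Returns True if the model was changed.
        """
        if self.predict(descriptor) == is_positive:
            return False
        sign = 1 if is_positive else -1
        self.weights = [w + lr * sign * x
                        for w, x in zip(self.weights, descriptor)]
        self.bias += lr * sign
        return True

# Hypothetical shipped model, then one user-provided correction.
clf = RefinableClassifier([1.0, 1.0], -1.0)
user_photo = (0.55, 0.4)            # the user says this photo shows water
print(clf.predict(user_photo))      # False: the shipped model misses it
clf.refine(user_photo, True)
print(clf.predict(user_photo))      # True after the correction
```

A production design would also need to guard against a handful of user corrections degrading the shipped model, for example by limiting the step size or keeping the original source photos in the training mix.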
Because descriptor extraction is the most demanding aspect of pattern classification, accelerating this process is a great way to improve photo classification performance.
We optimize descriptor extractors along three vectors: support functions, image extraction, and parallelism. While each of these optimization vectors can include aspects of the others, it is useful to consider optimization in these three layers.
There are many high performance and highly accurate intelligent search solutions in university labs today. So why not use them?
Accessibility is one of the major reasons. The programming model is so complicated that third-party software vendors do not consider it cost effective to deploy the technology as they would need to do it – on a large scale across multiple computing platforms. What application vendors need is a practical programming model that removes the complexity and permits accessibility. Intel is working to meet that need.
There is an immediate path to acceleration through Intel’s existing multi-core solutions. In the multi-core environment, both descriptor extraction and classification can be done in parallel to yield impressive performance gains. Combining the right APIs with the right accelerators will enable a single image search application to work across multiple processor architectures while reducing programming complexity.
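Because each photo can be analyzed independently, descriptor extraction is straightforward to spread across workers. The Python sketch below uses the standard library's concurrent.futures module with a toy blue-channel histogram as the descriptor; both the descriptor and the function names are assumptions for illustration. Note that in CPython, threads do not speed up CPU-bound pure-Python code, so a real workload would pass ProcessPoolExecutor (run under an `if __name__ == "__main__":` guard) or call into native code that releases the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def blue_histogram(pixels, bins=4):
    """Toy descriptor: normalized histogram of the blue channel (values 0-255)."""
    counts = [0] * bins
    for _, _, b in pixels:
        counts[min(b * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def extract_all(photos, executor_cls=ThreadPoolExecutor, workers=4):
    """Extract a descriptor for every photo concurrently.

    photos: dict mapping photo name to a list of (r, g, b) pixels.
    Returns a dict mapping photo name to its descriptor.
    """
    names = list(photos)
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so names and results line up.
        descriptors = list(pool.map(blue_histogram,
                                    (photos[n] for n in names)))
    return dict(zip(names, descriptors))

photos = {
    "a.jpg": [(0, 0, 250), (0, 0, 10)],
    "b.jpg": [(0, 0, 100)],
}
print(extract_all(photos))
# {'a.jpg': [0.5, 0.0, 0.0, 0.5], 'b.jpg': [0.0, 1.0, 0.0, 0.0]}
```

Keeping the executor type as a parameter mirrors the goal stated above: the calling application stays the same while the acceleration strategy underneath it changes.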
As shown in Figure 2, the Intel software architecture for photo search includes an Image Analysis Framework (IAF) that abstracts media analysis from applications. This architecture is designed to accelerate metadata extraction and application performance, while enabling single applications to work with a variety of processors.

Figure 2. Intel image analysis software architecture is designed to enable faster time-to-market for third-party photo search capability based on metadata synthesis. The Image Analysis Framework (IAF) provides an abstraction layer beneath third-party applications. Intel® Integrated Performance Primitives (Intel® IPP) are multi-core software functions optimized for multimedia applications, including image analysis and metadata management. Because the IAF is transparent to the foreground application, it enables a single third-party photo application to work with multiple processors and benefit from the accelerated media analysis and metadata generation enabled by multi-core processors. Intel® IPP are optimized for Intel® processor cores to achieve a 2x-4x performance gain, and the IAF distributes tasks across cores, providing 3x performance when graphics processing units (GPUs) are available. Intel estimates that metadata generation time can eventually be cut from around 22 seconds per image today to 250-500 ms per image by 2010. (Source: Intel – Metrics derived from search prototype and engineering estimates.)
As digital photography and imaging applications continue to grow in popularity, many consumers are discovering that searching for and locating photos from among thousands of images can be a cumbersome and frustrating task. Traditional text search based on file and folder names is a time-consuming process. And machine image recognition is currently effective but slow, requiring more than 20 seconds per photo and slowing the performance of foreground applications.
Intel image analysis software ingredients are designed to help application vendors take advantage of optimized Intel® Integrated Performance Primitives (Intel® IPP) and Intel multi-core processors to dramatically accelerate photo search based on metadata synthesis.
Increasing automated classification capability and dramatically reducing the processing time for each image promise to enable compelling new usage models and applications based on fast automated classification of photos and video.
1: Tolba, A.S., El-Baz, A.H., El-Harby, A.A. (2005). Face Recognition: A Literature Review. International Journal of Signal Processing, Vol. 2, No. 2. ISSN 1304-4494.
2: Duda, Richard O., Hart, Peter E., Stork, David G. (2000). Pattern Classification, 2nd edition. Wiley-Interscience. ISBN-10: 0471056693.