In recent years, as demand for low-energy, high-performance computing has steadily increased, heterogeneous computing has emerged as an important and promising solution. Because most workloads run most efficiently on particular types of cores, mapping tasks to the best-suited available resources can both save energy and deliver high performance. However, optimal task scheduling for performance and/or energy remains an open problem on heterogeneous platforms. The work presented herein uses queueing theory to formulate optimal task scheduling on heterogeneous systems as a mathematical optimization problem. We solve the common case of two processor types (e.g., CPU+GPU) analytically and give an optimal policy, CAB. For any number of processor types, we design the GrIn heuristic, which efficiently finds a near-optimal policy (within 1.6% of the optimal). Both policies work for any task size distribution and processing order, and are therefore general and practical. We validate the theory through extensive simulation, and implement the proposed policy on a real CPU-GPU platform to demonstrate the throughput and energy improvements. Compared with classic policies such as load balancing, our policies achieve 1.08x to 2.24x better performance or 1.08x to 2.26x better energy efficiency in simulations, and 2.37x to 9.07x better performance in experiments.

Many multimedia information retrieval and machine learning problems require efficient high-dimensional nearest neighbor search techniques. For instance, multimedia objects (images, music, or videos) can be represented by high-dimensional feature vectors. Finding two similar multimedia objects then comes down to finding two objects that have similar feature vectors. With the mass use of social networks, large-scale multimedia databases and large-scale machine learning applications are increasingly common, calling for efficient nearest neighbor search approaches. This thesis builds on product quantization, an efficient nearest neighbor search technique that compresses high-dimensional vectors into short codes. This makes it possible to store very large databases entirely in RAM, enabling low response times. We propose several contributions that exploit the capabilities of modern CPUs, especially SIMD and the cache hierarchy, to further reduce the response times of product quantization.
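
To make the compression step concrete, the following is a minimal pure-Python sketch of product quantization as it is commonly described; it is not the optimized implementation contributed by this thesis. Each vector is split into `m` sub-vectors, each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook, and queries are answered with per-query lookup tables (asymmetric distance computation). All function names, the toy k-means, the parameters `m` and `k`, and the random data are assumptions made for illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Toy Lloyd's k-means (illustrative only; requires k <= len(points))."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, m):
    """Split vec into m contiguous sub-vectors (assumes len(vec) % m == 0)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train_pq(vectors, m, k):
    """Learn one codebook of k centroids for each of the m subspaces."""
    parts = [split(v, m) for v in vectors]
    return [kmeans([p[j] for p in parts], k, seed=j) for j in range(m)]

def encode(vec, codebooks):
    """Compress vec into a short code: one centroid index per subspace."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(subs, codebooks)]

def adc_tables(query, codebooks):
    """Per-query tables: distance from each query sub-vector to each centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, centroid) for centroid in cb]
            for sub, cb in zip(subs, codebooks)]

def adc_distance(code, tables):
    """Approximate squared distance to a stored code using table lookups only."""
    return sum(table[idx] for table, idx in zip(tables, code))

def search(query, codes, codebooks):
    """Index of the stored code with the smallest approximate distance."""
    tables = adc_tables(query, codebooks)
    return min(range(len(codes)), key=lambda i: adc_distance(codes[i], tables))

# Hypothetical toy data: 200 random 8-dimensional vectors.
rng = random.Random(42)
database = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_pq(database, m=4, k=16)
codes = [encode(v, codebooks) for v in database]  # 4 small integers per vector
best = search(database[0], codes, codebooks)
```

In this sketch each 8-dimensional vector is stored as four indices in the range 0 to 15, which illustrates how short codes let a large database stay in RAM. The scan in `search` reduces to repeated table lookups and additions in `adc_distance`; under the assumptions of this sketch, that inner loop is the part where SIMD and cache behavior would matter most.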