Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve huge complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN's implementation. At massive parallelism of computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as a lack of the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces the bandwidth requirements from multiple optimizations including on/off-chip data allocation, data flow optimization and data reuse. The proposed 2-level reconfigurability is utilized to enable fast and efficient reconfiguration, which is based on the control logic and the multiboot feature of FPGA. As a result, an external memory bandwidth requirement of 1.94MB/GFlop is achieved, which is 55% lower than prior arts. Under limited DRAM bandwidth, a system throughput of 1244GFlop/s is achieved at the Vertex UltraScale platform, which is 5.48 times higher than the state-of-the-art FPGA implementations.
The emerging hybrid DRAM-NVM architecture is challenging the existing memory management mechanism in operating system. In this paper, we introduce memos, which can schedule memory resources over the entire memory hierarchy including cache, channels, main memory comprising DRAM and NVM simultaneously. Powered by our newly designed kernel-level monitoring module and page migration engine, memos can dynamically optimize the data placement at the memory hierarchy in terms of the on-line memory patterns, current resource utilization and feature of memory medium. Our experimental results show that memos can achieve high memory utilization, contributing to system throughput by 19.1% and QoS by 23.6% on average. Moreover, memos can reduce the NVM side memory latency by 3~83.3%, energy consumption by 25.1~99%, and benefit the NVM lifetime significantly (40X improvement on average).
Time variation during program execution can leak sensitive information. Time variations due to program control flow and hardware resource contention have been used to steal encryption keys in cipher implementations such as AES and RSA. A number of approaches to mitigate timing-based side-channel attacks have been proposed including cache partitioning, control-flow obfuscation and injecting timing noise into the outputs of code. While these techniques make timing-based side-channel attacks more difficult, they do not eliminate the risks. Prior techniques are either too specific or too expensive, and all leave remnants of the original timing side channel for later attackers to attempt to exploit. In this work, we show that the state-of-the-art techniques in timing side-channel protection, which limit timing leakage but do not eliminate it, still have significant vulnerabilities to timing-based side-channel attacks. To provide a means for total protection from timing-based side-channel attacks, we develop Ozone, the first zero timing leakage execution resource for a modern microarchitecture. Code in Ozone execute under a special hardware thread that gains exclusive access to a single core's resources for a fixed (and limited) number of cycles during which it cannot be interrupted. Memory access under Ozone thread execution is limited to a fixed size uncached scratchpad memory, and all Ozone threads begin execution with a known fixed microarchitectural state. We evaluate Ozone using a number of security sensitive kernels that have previously been targets of timing side-channel attacks, and show that Ozone eliminates timing leakage with minimal performance overhead.