# Distributed, Parallel, and Cluster Computing (cs.DC)

• Currently, progressively larger deep neural networks are trained on ever-growing data corpora. As this trend is only going to increase in the future, distributed training schemes are becoming increasingly relevant. A major issue in distributed training is the limited communication bandwidth between contributing nodes, or prohibitive communication cost in general. These challenges become even more pressing as the number of computation nodes increases. To counteract this development we propose sparse binary compression (SBC), a compression framework that allows for a drastic reduction of communication cost for distributed training. SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight update encoding to push compression gains to new limits. By doing so, our method also allows us to smoothly trade off gradient sparsity and temporal sparsity to adapt to the requirements of the learning task. Our experiments show that SBC can reduce the upstream communication on a variety of convolutional and recurrent neural network architectures by more than four orders of magnitude without significantly harming the convergence speed in terms of forward-backward passes. For instance, we can train ResNet50 on ImageNet in the same number of iterations to the baseline accuracy using $3531\times$ fewer bits, or train it to a $1\%$ lower accuracy using $37208\times$ fewer bits. In the latter case, the total upstream communication required is cut from 125 terabytes to 3.35 gigabytes for every participating client.
• We study the fundamental problem of distributed energy-aware network formation with mobile agents of limited computational power that have the capability to wirelessly transmit and receive energy in a peer-to-peer manner. Specifically, we design simple distributed protocols consisting of a small number of states and interaction rules for the construction of both arbitrary and binary trees. Further, we theoretically and experimentally evaluate a plethora of energy redistribution protocols that exploit different levels of knowledge in order to achieve desired energy distributions which require, for instance, that every agent has twice the energy of the agents of higher depth (according to the tree network). Our study shows that without using any knowledge about the network structure, such energy distributions cannot be achieved in a timely manner, which means that there might be high energy loss during the redistribution process. On the other hand, only a few extra bits of information seem to be enough to guarantee quick convergence to energy distributions that satisfy particular properties, yielding low energy loss.
• Distributed asynchronous SGD has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous SGD under the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition ensuring its stability as long as the collective energy induced by the active workers remains below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, GEM, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential SGD with momentum. Experimental results highlight the stability and speedup of GEM compared to existing schemes, even when scaling to one hundred asynchronous workers. Results also indicate better generalization compared to the targeted SGD with momentum.
• Deep learning has emerged as an important new resource-intensive workload and has been successfully applied in computer vision, speech, natural language processing, and so on. Distributed deep learning is becoming a necessity to cope with growing data and model sizes. Its computation is typically characterized by a simple tensor data abstraction to model multi-dimensional matrices, a data-flow graph to model computation, and iterative executions with relatively frequent synchronizations, thereby making it substantially different from Map/Reduce style distributed big data computation. RPC, commonly used as the communication primitive, has been adopted by popular deep learning frameworks such as TensorFlow, which uses gRPC. We show that RPC is sub-optimal for distributed deep learning computation, especially on an RDMA-capable network. The tensor abstraction and data-flow graph, coupled with an RDMA network, offer the opportunity to reduce unnecessary overhead (e.g., memory copy) without sacrificing programmability and generality. In particular, from a data access point of view, a remote machine is abstracted just as a "device" on an RDMA channel, with a simple memory interface for allocating, reading, and writing memory regions. Our graph analyzer looks at both the data-flow graph and the tensors to optimize memory allocation and remote data access using this interface. The result is a speedup of up to 25 times in representative deep learning benchmarks against the standard gRPC in TensorFlow and up to 169% improvement even against an RPC implementation optimized for RDMA, leading to faster convergence in the training process.
• Tendermint-core blockchains offer strong consistency (no forks) in an open system relying on two ingredients: (i) a set of validators that generate blocks via a variant of the Practical Byzantine Fault Tolerant (PBFT) consensus protocol and (ii) a rewarding mechanism that dynamically selects nodes to be validators for the next block via proof-of-stake, a non-energy-consuming alternative to proof-of-work. It is well known that in those open systems the main threat is the tragedy of the commons, which may cause the system to collapse if the rewarding mechanism is not adequate. At a minimum, the rewarding mechanism must be fair, i.e., it must distribute the rewards in proportion to the merit of participants. The contribution of this paper is twofold. First, we provide a formal description of the Tendermint-core protocol and we prove that in eventually synchronous systems (i) it verifies a variant of one-shot consensus for the validation of one single block and (ii) a variant of the repeated consensus problem for multiple blocks. Our second contribution relates to the fairness of the Tendermint rewarding mechanism. We prove that Tendermint rewarding is not fair. However, a small twist in the protocol makes it eventually fair. Additionally, we prove that there exists an (eventually) fair rewarding mechanism in repeated consensus-based blockchains if and only if the system is (eventually) synchronous.
• Age estimation is a difficult task which requires the automatic detection and interpretation of facial features. Recently, Convolutional Neural Networks (CNNs) have made remarkable improvements in learning age patterns from benchmark datasets. However, for a face "in the wild" (from a video frame or the Internet), the existing algorithms are not as accurate as for a frontal and neutral face. In addition, with the increasing amount of in-the-wild aging data, the computation speed of existing deep learning platforms becomes another crucial issue. In this paper, we propose a highly efficient age estimation system with joint optimization of the age estimation algorithm and the deep learning system. In cooperation with the city surveillance network, this system can provide age group analysis for intelligent demographics. First, we build a three-tier fog computing architecture including an edge, a fog, and a cloud layer, which directly processes age estimation from raw videos. Second, we optimize the age estimation algorithm based on CNNs with label distribution and K-L divergence distance embedded in the fog layer and evaluate the model on the latest wild aging dataset. Experimental results demonstrate that: 1. our system collects the demographics data dynamically at a distance without contact, and performs the city population analysis automatically; and 2. the age model training has been sped up without losing training progress or model quality. To the best of our knowledge, this is the first intelligent demographics system with potential applications in improving the efficiency of smart cities and urban living.
• This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second moment of the data with a low-density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead, and the number of decoding iterations can be made to adjust automatically to the number of stragglers in the system. We show that for a random model for stragglers, the proposed moment encoding based gradient descent method can be viewed as the stochastic gradient descent method. This allows us to obtain convergence guarantees for the proposed solution. Furthermore, the proposed moment encoding based method is shown to outperform the existing schemes in a real distributed computing setup.
• The intrinsic error tolerance of neural networks (NNs) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods: one actively searches for a network parameter distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximate multiplication strategies demonstrate AxTrain's ability to obtain resilient neural network parameters and improve system energy efficiency.
• Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from software design are no longer sufficient to implement high-performance codes, due to fundamental differences between software and hardware architectures. In this work, we propose a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures with little off-chip data movement. To quantify the effect of our transformations, we use them to optimize a set of high-throughput FPGA kernels, demonstrating that they are sufficient to scale up parallelism within the hardware constraints of the device. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS.
• In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
• Virtualization is a promising technology that has helped cloud computing become the next wave of the Internet revolution. Adopted by data centers, millions of applications that are powered by various virtual machines improve the quality of services. Although virtual machines are well isolated from each other, they suffer from redundant boot volumes and slow provisioning time. To address these limitations, containers were introduced to deploy and run distributed applications without launching entire virtual machines. As a dominant player, Docker is an open-source implementation of container technology. When managing a cluster of Docker containers, the management tool, Swarmkit, does not take the heterogeneities in both physical nodes and virtualized containers into consideration. The heterogeneity lies in the fact that different nodes in the cluster may have various configurations, concerning resource types, availabilities, etc., and the demands generated by services vary, for example CPU-intensive (e.g., clustering services) versus memory-intensive (e.g., web services). In this paper, we investigate Docker container clusters and develop DRAPS, a resource-aware placement scheme to boost system performance in a heterogeneous cluster.
• A smart contract on a blockchain cannot keep a secret because its data is replicated on all nodes in a network. To remedy this problem, it has been suggested to combine blockchains with trusted execution environments (TEEs), such as Intel SGX, for executing applications that demand privacy. Untrusted blockchain nodes cannot get access to the data and computations inside the TEE. This paper first explores some pitfalls that arise from the combination of TEEs with blockchains. Since TEEs are, in principle, stateless, they are susceptible to rollback attacks, which should be prevented to maintain privacy for the application. However, in blockchains with non-final consensus protocols, such as the proof-of-work in Ethereum and others, the contract execution must handle rollbacks by design. This implies that TEEs for securing blockchain execution cannot be directly used for such blockchains; this approach works only when the consensus decisions are final. Second, this work introduces an architecture and a prototype for smart-contract execution within Intel SGX technology for Hyperledger Fabric, a prominent platform for enterprise blockchain applications. Our system resolves difficulties posed by the execute-order-validate architecture of Fabric and prevents rollback attacks on TEE-based execution as far as possible. To increase security, our design encapsulates each application on the blockchain within its own enclave that shields it from the host system. An evaluation shows that the overhead of moving execution into SGX is within 10%-20% for a sealed-bid auction application.
• Consider a network of agents connected by communication links, where each agent holds a real value. The gossip problem consists in estimating, in a distributed manner, the average of the values diffused in the network. Current techniques for gossiping are designed to deal with worst-case scenarios, which are irrelevant in applications to distributed statistical learning and denoising in sensor networks. We design second-order gossip methods tailor-made for the case where the real values are i.i.d. samples from the same distribution. In some regular network structures, we are able to prove optimality of our methods, and simulations suggest that they are efficient in a wide range of random networks. Our approach to gossip stems from a new acceleration framework using the family of orthogonal polynomials with respect to the spectral measure of the network graph.
• As the cost-per-byte of storage systems dramatically decreases, SSDs are finding their way into emerging cloud infrastructure. A similar trend is happening for the main memory subsystem, as advanced DRAM technologies with higher capacity, frequency, and number of channels are being deployed for cloud-scale solutions, especially for non-virtualized environments where cloud subscribers can exactly specify the configuration of the underlying hardware. Given the performance sensitivity of standard workloads to the memory hierarchy parameters, it is important to understand the role of memory and storage for data-intensive workloads. In this paper, we investigate how the choice of DRAM (high-end vs. low-end) impacts the performance of Hadoop, Spark, and MPI based Big Data workloads in the presence of different storage types on bare-metal cloud. Through a methodical experimental setup, we have analyzed the impact of DRAM capacity, operating frequency, the number of channels, storage type, and scale-out factors on the performance of these popular frameworks. Based on micro-architectural analysis, we classified data-intensive workloads into three groups, namely I/O bound, compute bound, and memory bound. The characterization results show that neither DRAM capacity, frequency, nor the number of channels plays a significant role in the performance of any of the studied Hadoop workloads, as they are mostly I/O bound. On the other hand, our results reveal that iterative tasks (e.g., machine learning) in Spark and MPI benefit from high-end DRAM, in particular high frequency and a large number of channels, as they are memory or compute bound. Our results show that using a PCIe SSD cannot shift the bottleneck from storage to memory, while it can change the workload behavior from I/O bound to compute bound.
• We present Stamp-it, a new, concurrent, lock-less memory reclamation scheme with amortized, constant-time (thread-count independent) reclamation overhead. Stamp-it has been implemented and proved correct in the C++ memory model using memory-consistency assumptions that are as weak as possible. We have likewise (re)implemented six other comparable reclamation schemes. We give a detailed performance comparison, showing that Stamp-it performs favorably compared with most of these other schemes (sometimes better, and at least as well) while being able to reclaim free memory nodes earlier.
• In this paper, we study the symmetric rendezvous search problem on the line with n > 2 robots that are unaware of their locations and the initial distances between them. In the symmetric version of this problem, the robots execute the same strategy. The multi-robot symmetric rendezvous algorithm, MSR, presented in this paper is an extension of our symmetric rendezvous algorithm, SR, presented in [23]. We study both the synchronous and asynchronous cases of the problem. The asynchronous version of the MSR algorithm is called MASR. We assume that robots start executing MASR at different times. We perform a theoretical analysis of MSR and MASR, and show that their competitive ratios are $O(n^{0.67})$ and $O(n^{1.5})$, respectively. Finally, we confirm our theoretical results through simulations.
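
For the sparse binary compression (SBC) entry in the list above, the idea of combining gradient sparsification with binarization can be illustrated with a small sketch. The abstract does not spell out the binarization rule, so the rule used here (keep only the sign group with the larger mean magnitude and replace its entries by that mean) is an assumption, as are the function names `sparse_binarize` and `decompress`. Communication delay and the weight-update encoding are not modelled.

```python
def sparse_binarize(update, fraction):
    """Compress a weight update to (indices, single value).

    Assumed rule, for illustration only: take the `fraction` largest positive
    and most negative entries, then keep whichever sign group has the larger
    mean magnitude and replace its entries by that mean.
    """
    k = max(1, int(len(update) * fraction))
    order = sorted(range(len(update)), key=lambda i: update[i])
    neg_idx = [i for i in order[:k] if update[i] < 0]
    pos_idx = [i for i in order[-k:] if update[i] > 0]
    mean_pos = sum(update[i] for i in pos_idx) / len(pos_idx) if pos_idx else 0.0
    mean_neg = sum(update[i] for i in neg_idx) / len(neg_idx) if neg_idx else 0.0
    if mean_pos >= -mean_neg:
        return sorted(pos_idx), mean_pos
    return sorted(neg_idx), mean_neg


def decompress(kept, value, length):
    """Rebuild the dense update: `value` at the kept positions, zero elsewhere."""
    dense = [0.0] * length
    for i in kept:
        dense[i] = value
    return dense


# Hypothetical update vector; only positions and one float need to be sent.
update = [0.9, -0.1, 0.05, -1.2, 0.7, 0.0, -0.2, 1.1, -0.9, 0.02]
kept, value = sparse_binarize(update, fraction=0.2)
restored = decompress(kept, value, len(update))
```

In this toy case only two indices and one floating-point value are transmitted instead of ten floats; the compression ratios quoted in the abstract additionally rely on delayed communication and an efficient encoding of the index positions.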
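
For the moment-encoding entry in the list above, it may help to see why the second moment of the data is a natural object to encode. Assuming a least-squares objective (an assumption; the abstract does not name the loss), with data matrix $X$, targets $y$, and parameters $\theta$:

```latex
f(\theta) = \tfrac{1}{2}\,\lVert X\theta - y \rVert_2^2,
\qquad
\nabla f(\theta) = X^{\top}X\,\theta - X^{\top}y = M\theta - b,
\quad\text{where } M = X^{\top}X,\; b = X^{\top}y .
```

Under this assumption each gradient step touches the data only through the fixed matrix $M$ and vector $b$, so the per-iteration work reduces to a matrix-vector product $M\theta$ that can be split across workers and protected against stragglers by a code.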
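
For the multi-query optimization entry in the list above, the cost-based formulation "akin to the multiple-choice knapsack problem" can be sketched as a small dynamic program. The modelling here is an assumption: each group stands for one common (sub)expression, each option is a hypothetical `(memory_cost, saving)` pair for one way of caching it, at most one option is chosen per group, and the total memory must fit a budget. The paper's actual cost model may differ.

```python
def choose_cached_expressions(groups, memory_budget):
    """Multiple-choice knapsack by dynamic programming.

    groups: list of option lists; each option is (memory_cost, saving).
    Returns (best_total_saving, choices) where choices[g] is the index of the
    option picked for group g, or None if nothing is cached for that group.
    """
    # best[m] = (total saving, choices so far) using at most m memory units
    best = [(0, [])] * (memory_budget + 1)
    for options in groups:
        new_best = []
        for m in range(memory_budget + 1):
            # Default: cache nothing for this group.
            candidate = (best[m][0], best[m][1] + [None])
            for j, (cost, saving) in enumerate(options):
                if cost <= m:
                    total = best[m - cost][0] + saving
                    if total > candidate[0]:
                        candidate = (total, best[m - cost][1] + [j])
            new_best.append(candidate)
        best = new_best
    return best[memory_budget]


# Hypothetical numbers: three shared subexpressions, memory budget of 6 units.
groups = [[(2, 5), (4, 8)], [(3, 6)], [(1, 2), (5, 9)]]
total_saving, choices = choose_cached_expressions(groups, memory_budget=6)
```

The table has `memory_budget + 1` entries per group, so the running time is proportional to the budget times the total number of options; this is practical only when memory is expressed in coarse units.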
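
For the gossip entry in the list above, the difference between plain averaging and a second-order method can be seen on a toy ring network. This sketch uses the classical fixed-weight two-step recursion $x_{t+1} = \omega W x_t + (1-\omega) x_{t-1}$ as a stand-in; it is not the orthogonal-polynomial method of the paper, and the ring topology, the uniform weights of 1/3, and the choice of `omega` from the second-largest eigenvalue are all assumptions made for illustration.

```python
import math


def gossip_step(values):
    """One synchronous round on a ring: each node averages itself and its two neighbours."""
    n = len(values)
    return [(values[i - 1] + values[i] + values[(i + 1) % n]) / 3 for i in range(n)]


def simple_gossip(values, rounds):
    """First-order gossip: repeatedly apply the averaging matrix W."""
    x = list(values)
    for _ in range(rounds):
        x = gossip_step(x)
    return x


def second_order_gossip(values, rounds, omega):
    """Two-step recursion x_{t+1} = omega * W x_t + (1 - omega) * x_{t-1}, with x_1 = W x_0."""
    prev, cur = list(values), gossip_step(values)
    for _ in range(rounds - 1):
        mixed = gossip_step(cur)
        prev, cur = cur, [omega * m + (1 - omega) * p for m, p in zip(mixed, prev)]
    return cur


def max_error(values, target):
    return max(abs(v - target) for v in values)


n = 20
initial = [float(i) for i in range(n)]
mean = sum(initial) / n
# Second-largest eigenvalue of W on the ring, and the matching classical weight.
lam2 = (1 + 2 * math.cos(2 * math.pi / n)) / 3
omega = 2 / (1 + math.sqrt(1 - lam2 ** 2))
accelerated = second_order_gossip(initial, 100, omega)
err_simple = max_error(simple_gossip(initial, 100), mean)
err_second = max_error(accelerated, mean)
```

Both schemes preserve the network average because the weights in each row of `W` sum to one and `omega + (1 - omega) = 1`; the second-order variant merely reuses the previous iterate, so it costs one extra stored value per node.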

wenling yang Jan 30 2018 19:08 UTC

Luhao Wang Jan 30 2018 00:28 UTC

well written paper! State-of-art works that are good to publish to some decent conferences/journals

mahdi aliakbari Jan 29 2018 20:49 UTC

Very well written paper with formal problem formulation and extensive results on multiple benchmarks

Faraz Rabbani Jan 29 2018 07:53 UTC