results for au:Watanabe_S in:cs

- Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1 % relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
- The strong converse for a coding theorem shows that the optimal asymptotic rate possible with vanishing error cannot be improved by allowing a fixed error. Building on a method introduced by Gu and Effros for centralized coding problems, we develop a general and simple recipe for proving strong converse that is applicable for distributed problems as well. Heuristically, our proof of strong converse mimics the standard steps for proving a weak converse, except that we apply those steps to a modified distribution obtained by conditioning the original distribution on the event that no error occurs. A key component of our recipe is the replacement of the hard Markov constraints implied by the distributed nature of the problem with a soft information cost using a variational formula introduced by Oohama. We illustrate our method by providing a short proof of the strong converse for the Wyner-Ziv problem and new strong converse theorems for interactive function computation, common randomness and secret key agreement, and the wiretap channel.
- Apr 24 2018 cs.CL arXiv:1804.08050v1This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then, they are integrated into a single attention. On the other hand, instead of the integration in the attention level, our proposed method uses multiple decoders for each attention and integrates their outputs to generate a final output. Furthermore, in order to make each head to capture the different modalities, different attention functions are used for each head, leading the improvement of the recognition performance with an ensemble effect. To evaluate the effectiveness of our proposed method, we conduct an experimental evaluation using Corpus of Spontaneous Japanese. Experimental results demonstrate that our proposed method outperforms the conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.
- Apr 03 2018 cs.CL arXiv:1804.00015v1This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.
- We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using \emphsymbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input. The MMDA architecture attempts to eliminate the need for an external LM, by enabling seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements on CER and achieves 8-10\% relative WER improvement on the WSJ data set.
- The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing , and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech and recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
- Spectral mask estimation using bidirectional long short-term memory (BLSTM) neural networks has been widely used in various speech enhancement applications, and it has achieved great success when it is applied to multichannel enhancement techniques with a mask-based beamformer. However, when these masks are used for single channel speech enhancement they severely distort the speech signal and make them unsuitable for speech recognition. This paper proposes a student-teacher learning paradigm for single channel speech enhancement. The beamformed signal from multichannel enhancement is given as input to the teacher network to obtain soft masks. An additional cross-entropy loss term with the soft mask target is combined with the original loss, so that the student network with single-channel input is trained to mimic the soft mask obtained with multichannel input through beamforming. Experiments with the CHiME-4 challenge single channel track data shows improvement in ASR performance.
- This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) state-of-the-art system with a simplified single system comparable to the complicated top systems in the challenge, 2) publicly available and reproducible recipe through the main repository in the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) trained with augmented all six microphones plus the enhanced data after beamforming. Finally, we use a LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74\% WER for the real test set in the 6-channel track, which corresponds to the 2nd place in the challenge. In addition, the proposed baseline recipe includes four different speech enhancement measures, short-time objective intelligibility measure (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ) and speech distortion ratio (SDR) for the simulation test set. Thus, the recipe also provides an experimental platform for speech enhancement studies with these performance measures.
- In the distributed function computation problem, dichotomy theorems, initiated by Han-Kobayashi, seek to classify functions by whether the rate regions for function computation improve on the Slepian-Wolf regions or not. In this paper, we develop a general approach to derive converse bounds on the distributed function computation problem. By using this approach, we recover the sufficiency part, i.e. the conditions such that the Slepian-Wolf regions become optimal, of the known dichotomy theorems in the two-terminal distributed computing. Furthermore, we derive an improved sufficient condition on the dichotomy theorem in the multiterminal distributed computing for the class of i.i.d. sources with positivity condition. Finally, we derive the matching sufficient and necessary condition on the dichotomy theorem in the multiterminal distributed computing for the class of smooth sources.
- Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions which results in a set of timevarying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on CHiME-3 real evaluation set.
- Stochastic matrix factorization (SMF) can be regarded as a restriction of non-negative matrix factorization (NMF). SMF is useful for inference of topic models, NMF for binary matrices data, Markov chains, and Bayesian networks. However, SMF needs strong assumptions to reach a unique factorization and its theoretical prediction accuracy has not yet been clarified. In this paper, we study the maximum the pole of zeta function (real log canonical threshold) of a general SMF and derive an upper bound of the generalization error in Bayesian inference. The results give a foundation for a widely applicable and rigorous factorization method of SMF and mean that the generalization error in SMF becomes smaller than regular statistical models by Bayesian inference.
- In the present paper, we apply the network simplex algorithm for solving the minimum cost flow problem, to the maximum flow problem. Then we prove that the cycling phenomenon which causes the infinite loop in the algorithm, does not occur in the network simplex algorithm associated with the maximum flow problem.
- Jun 12 2017 cs.CL arXiv:1706.02737v1We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10\% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
- We show a reduction method to construct a code for the Gray-Wyner (GW) network from a given code for the Wyner-Ahlswede-Körner (WAK) network. By combining this reduction with a converse bound on the GW network, we derive a converse bound on the WAK network. The derived bound gives an alternative proof of the strong converse theorem for the WAK network.
- The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.
- Non-negative matrix factorization (NMF) is a new knowledge discovery method that is used for text mining, signal processing, bioinformatics, and consumer analysis. However, its basic property as a learning machine is not yet clarified, as it is not a regular statistical model, resulting that theoretical optimization method of NMF has not yet established. In this paper, we study the real log canonical threshold of NMF and give an upper bound of the generalization error in Bayesian learning. The results show that the generalization error of the matrix factorization can be made smaller than regular statistical models if Bayesian learning is applied.
- The problem of zero-rate multiterminal hypothesis testing is revisited from the perspective of information-spectrum approach and finite blocklength analysis. A Neyman-Pearson-like test is proposed and its non-asymptotic performance is clarified; for a short block length, it is numerically determined that the proposed test is superior to the previously reported Hoeffding-like test proposed by Han-Kobayashi. For a large deviation regime, it is shown that our proposed test achieves an optimal trade-off between the type I and type II exponents presented by Han-Kobayashi. Among the class of symmetric (type based) testing schemes, when the type I error probability is non-vanishing, the proposed test is optimal up to the second-order term of the type II error exponent; the latter term is characterized in terms of the variance of the projected relative entropy density. The information geometry method plays an important role in the analysis as well as the construction of the test.
- Sep 23 2016 cs.CL arXiv:1609.06773v2Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the performance of the attention has shown poor results in noisy condition and is hard to learn in the initial training stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases due to the lack of left-to-right constraints as used in CTC. This paper presents a novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
- Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal to distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
- Multiple parties observing correlated data seek to recover each other's data and attain omniscience. To that end, they communicate interactively over a noiseless broadcast channel - each bit transmitted over this channel is received by all the parties. We give a universal interactive communication protocol, termed the recursive data exchange protocol (RDE), which attains omniscience for any sequence of data observed by the parties and provide an individual sequence guarantee of performance. As a by-product, for observations of length $n$, we show the universal rate optimality of RDE up to an $\mathcal{O}\left(n^{-\frac 12}\sqrt{\log n}\right)$ term in a generative setting where the data sequence is independent and identically distributed (in time). Furthermore, drawing on the duality between omniscience and secret key agreement due to Csiszar and Narayan, we obtain a universal protocol for generating a multiparty secret key of rate at most $\mathcal{O}\left(n^{-\frac 12}\sqrt{\log n}\right)$ less than the maximum rate possible. A key feature of RDE is its recursive structure whereby when a subset $A$ of parties recover each-other's data, the rates appear as if the parties have been executing the protocol in an alternative model where the parties in $A$ are collocated.
- The problem of distributed function computation is studied, where functions to be computed is not necessarily symbol-wise. A new method to derive a converse bound for distributed computing is proposed; from the structure of functions to be computed, information that is inevitably conveyed to the decoder is identified, and the bound is derived in terms of the optimal rate needed to send that information. The class of informative functions is introduced, and, for the class of smooth sources, the optimal rate for computing those functions is characterized. Furthermore, for i.i.d. sources with joint distribution that may not be full support, functions that are composition of symbol-wise function and the type of a sequence are considered, and the optimal rate for computing those functions is characterized in terms of the hypergraph entropy. As a byproduct, our method also provides a conceptually simple proof of the known fact that computing a Boolean function may require as large rate as reproducing the entire source.
- Two parties observing correlated data seek to exchange their data using interactive communication. How many bits must they communicate? We propose a new interactive protocol for data exchange which increases the communication size in steps until the task is done. Next, we derive a lower bound on the minimum number of bits that is based on relating the data exchange problem to the secret key agreement problem. Our single-shot analysis applies to all discrete random variables and yields upper and lower bound of a similar form. In fact, the bounds are asymptotically tight and lead to a characterization of the optimal rate of communication needed for data exchange for a general sequence such as mixture of IID random variables as well as the optimal second-order asymptotic term in the length of communication needed for data exchange for the IID random variables, when the probability of error is fixed. This gives a precise characterization of the asymptotic reduction in the length of optimal communication due to interaction; in particular, two-sided Slepian-Wolf compression is strictly suboptimal.
- This work establishes connection between channel simulation and coded source compression. First, we consider classical source coding with quantum side-information where the quantum side-information is observed by a helper and sent to the decoder via a classical channel. We derive a single-letter characterization of the achievable rate region for this problem. The direct part of our result is proved via the measurement compression theory by Winter, a quantum to classical channel simulation. Our result reveals that a helper's scheme that separately conducts a measurement and a compression is suboptimal, and the measurement compression is fundamentally needed to achieve the optimal rate region. We then study coded source compression in the fully quantum regime. We characterise the quantum resources involved in this problem, and derive a single-letter expression of the achievable rate region when entanglement assistance is available. The direct coding proof is based on a combination of two fundamental protocols, namely the quantum state merging protocol and the quantum reverse Shannon theorem. Our work hence resolves coded source compression in the quantum regime.
- The coding problem over the Gray-Wyner network is studied from the second-order coding rates perspective. A tilted information density for this network is introduced in the spirit of Kostina-Verdú, and, under a certain regularity condition, the second-order region is characterized in terms of the variance of this tilted information density and the tangent vector of the first-order region. The second-order region is proved by the type method: the achievability part is proved by the type-covering argument, and the converse part is proved by a refinement of the perturbation approach that was used by Gu-Effros to show the strong converse of the Gray-Wyner network. This is the first instance that the second-order region is characterized for a multi-terminal problem where the characterization of the first-order region involves an auxiliary random variable.
- We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function that to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
- Two parties observing correlated random variables seek to run an interactive communication protocol. How many bits must they exchange to simulate the protocol, namely to produce a view with a joint distribution within a fixed statistical distance of the joint distribution of the input and the transcript of the original protocol? We present an information spectrum approach for this problem whereby the information complexity of the protocol is replaced by its information complexity density. Our single-shot bounds relate the communication complexity of simulating a protocol to tail bounds for information complexity density. As a consequence, we obtain a strong converse and characterize the second-order asymptotic term in communication complexity for indepedent and identically distributed observation sequences. Furthermore, we obtain a general formula for the rate of communication complexity which applies to any sequence of observations and protocols. Connections with results from theoretical computer science and implications for the function computation problem are discussed.
- We study source compression with a helper in the fully quantum regime, extending our earlier result on classical source compression with a quantum helper [arXiv:1501.04366, 2015]. We characterise the quantum resources involved in this problem, and derive a single-letter expression of the achievable rate region when entanglement assistance is available. The direct coding proof is based on a combination of two fundamental protocols, namely the quantum state merging protocol and the quantum reverse Shannon theorem (QRST). This result demonstrates an unexpected connection between distributed source compression with the QRST protocol, a quantum protocol that consumes noiseless resources to simulate a noisy quantum channel.
- Prior design is one of the most important problems in both statistics and machine learning. The cross validation (CV) and the widely applicable information criterion (WAIC) are predictive measures of the Bayesian estimation, however, it has been difficult to apply them to find the optimal prior because their mathematical properties in prior evaluation have been unknown and the region of the hyperparameters is too wide to be examined. In this paper, we derive a new formula by which the theoretical relation among CV, WAIC, and the generalization loss is clarified and the optimal hyperparameter can be directly found. By the formula, three facts are clarified about predictive prior design. Firstly, CV and WAIC have the same second order asymptotic expansion, hence they are asymptotically equivalent to each other as the optimizer of the hyperparameter. Secondly, the hyperparameter which minimizes CV or WAIC makes the average generalization loss to be minimized asymptotically but does not the random generalization loss. And lastly, by using the mathematical relation between priors, the variances of the optimized hyperparameters by CV and WAIC are made smaller with small computational costs. Also we show that the optimized hyperparameter by DIC or the marginal likelihood does not minimize the average or random generalization loss in general.
- In this paper, we derive non-asymptotic achievability and converse bounds on the random number generation with/without side-information. Our bounds are efficiently computable in the sense that the computational complexity does not depend on the block length. We also characterize the asymptotic behaviors of the large deviation regime and the moderate deviation regime by using our bounds, which implies that our bounds are asymptotically tight in those regimes. We also show the second order rates of those problems, and derive single letter forms of the variances characterizing the second order rates. Further, we address the equivocation rates for these problems.
- We study classical source coding with quantum side-information where the quantum side-information is observed by a helper and sent to the decoder via a classical channel. We derive a single-letter characterization of the achievable rate region for this problem. The direct part of our result is proved via the measurement compression theory by Winter. Our result reveals that a helper's scheme that separately conducts a measurement and a compression is suboptimal, and the measurement compression is fundamentally needed to achieve the optimal rate region.
- We revisit the problem of secret key agreement using interactive public communication for two parties and propose a new secret key agreement protocol. The protocol attains the secret key capacity for general observations and attains the second-order asymptotic term in the maximum length of a secret key for independent and identically distributed observations. In contrast to the previously suggested secret key agreement protocols, the proposed protocol uses interactive communication. In fact, the standard one-way communication protocol used prior to this work fails to attain the asymptotic results above. Our converse proofs rely on a recently established upper bound for secret key lengths. Both our lower and upper bounds are derived in a single-shot setup and the asymptotic results are obtained as corollaries.
- We establish an upper bound on the rate of codes for a wiretap channel with public feedback for a fixed probability of error and secrecy parameter. As a corollary, we obtain a strong converse for the capacity of a degraded wiretap channel with public feedback. Our converse proof is based on a reduction of active hypothesis testing for discriminating between two channels to coding for wiretap channel with feedback.
- The problem of distributed data compression for function computation is considered, where (i) the function to be computed is not necessarily symbol-wise function and (ii) the information source has memory and may not be stationary nor ergodic. We introduce the class of smooth sources and give a sufficient condition on functions so that the achievable rate region for computing coincides with the Slepian-Wolf region (i.e., the rate region for reproducing the entire source) for any smooth sources. Moreover, for symbol-wise functions, the necessary and sufficient condition for the coincidence is established. Our result for the full side-information case is a generalization of the result by Ahlswede and Csiszar to sources with memory; our dichotomy theorem is different from Han and Kobayashi's dichotomy theorem, which reveals an effect of memory in distributed function computation. All results are given not only for fixed-length coding but also for variable-length coding in a unified manner. Furthermore, for the full side-information case, the error probability in the moderate deviation regime is also investigated.
- We consider information theoretic secret key agreement and secure function computation by multiple parties observing correlated data, with access to an interactive public communication channel. Our main result is an upper bound on the secret key length, which is derived using a reduction of binary hypothesis testing to multiparty secret key agreement. Building on this basic result, we derive new converses for multiparty secret key agreement. Furthermore, we derive converse results for the oblivious transfer problem and the bit commitment problem by relating them to secret key agreement. Finally, we derive a necessary condition for the feasibility of secure computation by trusted parties that seek to compute a function of their collective data, using an interactive public communication that by itself does not give away the value of the function. In many cases, we strengthen and improve upon previously known converse bounds. Our results are single-shot and use only the given joint distribution of the correlated observations. For the case when the correlated observations consist of independent and identically distributed (in time) sequences, we derive strong versions of previously known converses.
- We study the problem of channel resolvability for fixed i.i.d. input distributions and discrete memoryless channels (DMCs), and derive the strong converse theorem for any DMCs that are not necessarily full rank. We also derive the optimal second-order rate under a condition. Furthermore, under the condition that a DMC has the unique capacity achieving input distribution, we derive the optimal second-order rate of channel resolvability for the worst input distribution.
- This paper studies variable-length (VL) source coding of general sources with side-information. Novel one-shot coding theorems for coding with common side-information available at the encoder and the decoder and Slepian- Wolf (SW) coding (i.e., with side-information only at the decoder) are given, and then, are applied to asymptotic analyses of these coding problems. Especially, a general formula for the infimum of the coding rate asymptotically achievable by weak VL-SW coding (i.e., VL-SW coding with vanishing error probability) is derived. Further, the general formula is applied to investigating weak VL-SW coding of mixed sources. Our results derive and extend several known results on SW coding and weak VL coding, e.g., the optimal achievable rate of VL-SW coding for mixture of i.i.d. sources is given for countably infinite alphabet case with mild condition. In addition, the usefulness of the encoder side-information is investigated. Our result shows that if the encoder side-information is useless in weak VL coding then it is also useless even in the case where the error probability may be positive asymptotically.
- Using terminologies of information geometry, we derive upper and lower bounds of the tail probability of the sample mean. Employing these bounds, we obtain upper and lower bounds of the minimum error probability of the 2nd kind of error under the exponential constraint for the error probability of the 1st kind of error in a simple hypothesis testing for a finite-length Markov chain, which yields the Hoeffding type bound. For these derivations, we derive upper and lower bounds of cumulant generating function for Markov chain. As a byproduct, we obtain another simple proof of central limit theorem for Markov chain.
- We consider the parameter estimation of Markov chain when the unknown transition matrix belongs to an exponential family of transition matrices. Then, we show that the sample mean of the generator of the exponential family is an asymptotically efficient estimator. Further, we also define a curved exponential family of transition matrices. Using a transition matrix version of the Pythagorean theorem, we give an asymptotically efficient estimator for a curved exponential family.
- We study finite-length bounds for source coding with side information for Markov sources and channel coding for channels with conditional Markovian additive noise. For this purpose, we propose two criteria for finite-length bounds. One is the asymptotic optimality and the other is the efficient computability of the bound. Then, we derive finite-length upper and lower bounds for coding length in both settings so that their computational complexity is efficient. To discuss the first criterion, we derive the large deviation bounds, the moderate deviation bounds, and second order bounds for these two topics, and show that these finite-length bounds achieves the asymptotic optimality in these senses. For this discussion, we introduce several kinds of information measure for transition matrices.
- Aug 07 2013 cs.CY arXiv:1308.1611v1Having developed greatly over millennium under its culture, the ancient buildings and old town atmospheres maintain a quality of comfort. However, people only appreciate the "good old" quality and do not think further about the rational reasons why they feel comfort in it. This keeps them from creating their own things and models with good old quality, relying on the imported western thinking and methods as a result of modernization. Since people are now struggling under the imbalanced and complex society, we believe that support is needed for generating things and frameworks with good old quality in modern time situations and scenes.
- We investigate the Wyner-Ziv coding in which the statistics of the principal source is known but the statistics of the channel generating the side-information is unknown except that it is in a certain class. The class consists of channels such that the distortion between the principal source and the side-information is smaller than a threshold, but channels may be neither stationary nor ergodic. In this situation, we define a new rate-distortion function as the minimum rate such that there exists a Wyner-Ziv code that is universal for every channel in the class. Then, we show an upper bound and a lower bound on the rate-distortion function, and derive a matching condition such that the upper and lower bounds coincide. The relation between the new rate-distortion function and the rate-distortion function of the Heegard-Berger problem is also discussed.
- We present novel non-asymptotic or finite blocklength achievability bounds for three side-information problems in network information theory. These include (i) the Wyner-Ahlswede-Korner (WAK) problem of almost-lossless source coding with rate-limited side-information, (ii) the Wyner-Ziv (WZ) problem of lossy source coding with side-information at the decoder and (iii) the Gel'fand-Pinsker (GP) problem of channel coding with noncausal state information available at the encoder. The bounds are proved using ideas from channel simulation and channel resolvability. Our bounds for all three problems improve on all previous non-asymptotic bounds on the error probability of the WAK, WZ and GP problems--in particular those derived by Verdu. Using our novel non-asymptotic bounds, we recover the general formulas for the optimal rates of these side-information problems. Finally, we also present achievable second-order coding rates by applying the multidimensional Berry-Esseen theorem to our new non-asymptotic bounds. Numerical results show that the second-order coding rates obtained using our non-asymptotic achievability bounds are superior to those obtained using existing finite blocklength bounds.
- This paper investigates the privacy amplification problem, and compares the existing two bounds: the exponential bound derived by one of the authors and the min-entropy bound derived by Renner. It turns out that the exponential bound is better than the min-entropy bound when a security parameter is rather small for a block length, and that the min-entropy bound is better than the exponential bound when a security parameter is rather large for a block length. Furthermore, we present another bound that interpolates the exponential bound and the min-entropy bound by a hybrid use of the Renyi entropy and the inf-spectral entropy.
- A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature $1/\log n$, where $n$ is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for and unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models.
- We investigate the privacy amplification problem in which Eve can observe the uniform binary source through a binary erasure channel (BEC) or a binary symmetric channel (BSC). For this problem, we derive the so-called expurgation exponent of the information leaked to Eve. The exponent is derived by relating the leaked information to the error probability of the linear code that is generated by the linear hash function used in the privacy amplification, which is also interesting in its own right. The derived exponent is larger than state-of-the-art exponent recently derived by Hayashi at low rate.
- The cognitive interference channel with confidential messages (CICC) proposed by Liang et. al. is investigated. When the security is considered in coding systems, it is well known that the sender needs to use a stochastic encoding to avoid the information about the transmitted confidential message to be leaked to an eavesdropper. For the CICC, the trade-off between the rate of the random number to realize the stochastic encoding and the communication rates is investigated, and the optimal trade-off is completely characterized.
- In coding schemes for the wire-tap channel or the broadcast channels with confidential messages, it is well known that the sender needs to use a stochastic encoding to avoid the information about the transmitted confidential message to be leaked to an eavesdropper. In this paper, it is investigated that the trade-off between the rate of the random number to realize the stochastic encoding and the rates of the common, private, and confidential messages. For the direct theorem, the superposition coding scheme for the wire-tap channel recently proposed by Chia and El Gamal is employed, and its strong security is proved. The matching converse theorem is also established. Our result clarifies that a combination of the ordinary stochastic encoding and the channel prefixing by the channel simulation is suboptimal.
- Two new classes of quantum channels, which we call more capable and less noisy, are introduced. The more capable class consists of channels such that the quantum capacities of the complementary channels to the environments are zero. The less noisy class consists of channels such that the private capacities of the complementary channels to the environment are zero. For the more capable class, it is clarified that the private capacity and quantum capacity coincide. For the less noisy class, it is clarified that the private capacity and quantum capacity can be single letter characterized.
- This paper investigates a lossy source coding problem in which two decoders can access their side-information respectively. The correlated sources are a product of two component correlated sources, and we exclusively investigate the case such that each component is degraded. We show the rate-distortion function for that case, and give the following observations. When the components are degraded in matched order, the rate distortion function of the product sources is equal to the sum of the component-wise rate distortion functions. On the otherhand, the former is strictly smaller than the latter when the component sources are degraded in mismatched order. The converse proof for the mismatched case is motivated by the enhancement technique used for broadcast channels. For binary Hamming and Gaussian examples, we evaluate the rate-distortion functions.
- We investigate the secret key agreement from correlated vector Gaussian sources in which the legitimate parties can use the public communication with limited rate. For the class of protocols with the one-way public communication, we show that the optimal trade-off between the rate of key generation and the rate of the public communication is characterized as an optimization problem of a Gaussian random variable. The characterization is derived by using the enhancement technique introduced by Weingarten et.al. for MIMO Gaussian broadcast channel.
- We consider a communication system where a relay helps transmission of messages from a sender to a receiver. The relay is considered not only as a helper but as a wire-tapper who can obtain some knowledge about transmitted messages. In this paper we study a relay channel with confidential messages(RCC), where a sender attempts to transmit common information to both a receiver and a relay and also has private information intended for the receiver and confidential to the relay. The level of secrecy of private information confidential to the relay is measured by the equivocation rate, i.e., the entropy rate of private information conditioned on channel outputs at the relay. The performance measure of interest for the RCC is the rate triple that includes the common rate, the private rate, and the equivocation rate as components. The rate-equivocation region is defined by the set that consists of all these achievable rate triples. In this paper we give two definitions of the rate-equivocation region. We first define the rate-equivocation region in the case of deterministic encoder and call it the deterministic rate-equivocation region. Next, we define the rate-equivocation region in the case of stochastic encoder and call it the stochastic rate-equivocation region. We derive explicit inner and outer bounds for the above two regions. On the deterministic/stochastic rate-equivocation region we present two classes of relay channels where inner and outer bounds match. We also evaluate the deterministic and stochastic rate-equivocation regions of the Gaussian RCC.
- Apr 15 2010 cs.LG arXiv:1004.2316v2In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to $2\lambda/n$, where $\lambda$ is the real log canonical threshold and $n$ is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.
- We investigate the secret key agreement from correlated Gaussian sources in which the legitimate parties can use the public communication with limited rate. For the class of protocols with the one-way public communication, we show a closed form expression of the optimal trade-off between the rate of key generation and the rate of the public communication. Our results clarify an essential difference between the key agreement from discrete sources and that from continuous sources.
- Jan 19 2010 cs.LG arXiv:1001.2957v2Bayes statistics and statistical physics have the common mathematical structure, where the log likelihood function corresponds to the random Hamiltonian. Recently, it was discovered that the asymptotic learning curves in Bayes estimation are subject to a universal law, even if the log likelihood function can not be approximated by any quadratic form. However, it is left unknown what mathematical property ensures such a universal law. In this paper, we define a renormalizable condition of the statistical estimation problem, and show that, under such a condition, the asymptotic learning curves are ensured to be subject to the universal law, even if the true distribution is unrealizable and singular for a statistical model. Also we study a nonrenormalizable case, in which the learning curves have the different asymptotic behaviors from the universal law.
- Sep 04 2009 cs.CE arXiv:0909.0611v2Coupled human balancing tasks are investigated based on both pseudo-neural controllers characterized by time-delayed feedback with random gain and natural human balancing tasks. It is shown numerically that, compared to single balancing tasks, balancing tasks coupled by mechanical structures exhibit enhanced stability against balancing errors in terms of both amplitude and velocity and also improve the tracking ability of the controllers. We then perform an experiment in which numerical pseudo-neural controllers are replaced with natural human balancing tasks carried out using computer mice. The results reveal that the coupling structure generates asymmetric tracking abilities in subjects whose tracking abilities are nearly symmetric in their single balancing tasks.
- The privacy amplification is a technique to distill a secret key from a random variable by a function so that the distilled key and eavesdropper's random variable are statistically independent. There are three kinds of security criteria for the key distilled by the privacy amplification: the normalized divergence criterion, which is also known as the weak security criterion, the variational distance criterion, and the divergence criterion, which is also known as the strong security criterion. As a technique to distill a secret key, it is known that the encoder of a Slepian-Wolf (the source coding with full side-information at the decoder) code can be used as a function for the privacy amplification if we employ the weak security criterion. In this paper, we show that the encoder of a Slepian-Wolf code cannot be used as a function for the privacy amplification if we employ the criteria other than the weak one.
- Jun 02 2009 cs.LG arXiv:0906.0211v2Many learning machines that have hierarchical structure or hidden variables are now being used in information science, artificial intelligence, and bioinformatics. However, several learning machines used in such fields are not regular but singular statistical models, hence their generalization performance is still left unknown. To overcome these problems, in the previous papers, we proved new equations in statistical learning, by which we can estimate the Bayes generalization loss from the Bayes training loss and the functional variance, on the condition that the true distribution is a singularity contained in a learning machine. In this paper, we prove that the same equations hold even if a true distribution is not contained in a parametric model. Also we prove that, the proposed equations in a regular case are asymptotically equivalent to the Takeuchi information criterion. Therefore, the proposed equations are always applicable without any condition on the unknown true distribution.
- Jan 19 2009 cs.LG arXiv:0901.2376v1In statistical problems, a set of parameterized probability distributions is used to estimate the true probability distribution. If Fisher information matrix at the true distribution is singular, then it has been left unknown what we can estimate about the true distribution from random samples. In this paper, we study a singular regression problem and prove a limit theorem which shows the relation between the singular regression problem and two birational invariants, a real log canonical threshold and a singular fluctuation. The obtained theorem has an important application to statistics, because it enables us to estimate the generalization error from the training error without any knowledge of the true probability distribution.
- We consider the problem of secret key agreement in Gaussian Maurer's Model. In Gaussian Maurer's model, legitimate receivers, Alice and Bob, and a wire-tapper, Eve, receive signals randomly generated by a satellite through three independent memoryless Gaussian channels respectively. Then Alice and Bob generate a common secret key from their received signals. In this model, we propose a protocol for generating a common secret key by using the result of soft-decision of Alice and Bob's received signals. Then, we calculate a lower bound on the secret key rate in our proposed protocol. As a result of comparison with the protocol that only uses hard-decision, we found that the higher rate is obtained by using our protocol.
- Dec 06 2007 cs.LG arXiv:0712.0653v2Learning machines which have hierarchical structures or hidden variables are singular statistical models because they are nonidentifiable and their Fisher information matrices are singular. In singular statistical models, neither the Bayes a posteriori distribution converges to the normal distribution nor the maximum likelihood estimator satisfies asymptotic normality. This is the main reason why it has been difficult to predict their generalization performances from trained states. In this paper, we study four errors, (1) Bayes generalization error, (2) Bayes training error, (3) Gibbs generalization error, and (4) Gibbs training error, and prove that there are mathematical relations among these errors. The formulas proved in this paper are equations of states in statistical estimation because they hold for any true distribution, any parametric model, and any a priori distribution. Also we show that Bayes and Gibbs generalization errors are estimated by Bayes and Gibbs training errors, and propose widely applicable information criteria which can be applied to both regular and singular statistical models.
- This paper deals with a universal coding problem for a certain kind of multiterminal source coding network called a generalized complementary delivery network. In this network, messages from multiple correlated sources are jointly encoded, and each decoder has access to some of the messages to enable it to reproduce the other messages. Both fixed-to-fixed length and fixed-to-variable length lossless coding schemes are considered. Explicit constructions of universal codes and the bounds of the error probabilities are clarified by using methods of types and graph-theoretical analysis.