# Computer Vision and Pattern Recognition (cs.CV)

• Convolutional Neural Networks (CNNs) are computationally intensive, which limits their application on mobile devices. Their energy is dominated by the number of multiplies needed to perform the convolutions. Winograd's minimal filtering algorithm (Lavin, 2015) and network pruning (Han et al., 2015) can reduce the operation count, but these two methods cannot be directly combined $-$ applying the Winograd transform fills in the sparsity in both the weights and the activations. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations. Second, we prune the weights in the Winograd domain to exploit static weight sparsity. For models on CIFAR-10, CIFAR-100 and ImageNet datasets, our method reduces the number of multiplications by $10.4\times$, $6.8\times$ and $10.8\times$ respectively with loss of accuracy less than $0.1\%$, outperforming previous baselines by $2.0\times$-$3.0\times$. We also show that moving ReLU to the Winograd domain allows more aggressive pruning.
• In this paper we present a novel video classification methodology that aims to recognize different categories of third-person videos efficiently. The idea is to tracking motion in videos and extracting both short-term and long-term features from motion time series by training a multi-channel one dimensional Convolutional Neural Network (1D-CNN). The positive point about our method is that we only try to learn representative temporal features along the temporal dimension. Spatial features are extracted using pre-trained networks that have already been trained on large scale image recognition datasets. Learning features toward only one dimension reduces the number of calculations significantly and makes our method applicable to even smaller datasets. Furthermore we show that not only our method could reach state-of-the-art results on two public datasets UCF11 and jHMDB, but also we could obtain a strong feature vector representation which in compare with other methods it is much efficient.
• Heatmap regression has been used for landmark localization for quite a while now. Most of the methods use a very deep stack of bottleneck modules for heatmap classification stage, followed by heatmap regression to extract the keypoints. In this paper, we present a single dendritic CNN, termed as Pose Conditioned Dendritic Convolution Neural Network (PCD-CNN), where a classification network is followed by a second and modular classification network, trained in an end to end fashion to obtain accurate landmark points. Following a Bayesian formulation, we disentangle the 3D pose of a face image explicitly by conditioning the landmark estimation on pose, making it different from multi-tasking approaches. Extensive experimentation shows that conditioning on pose reduces the localization error by making it agnostic to face pose. The proposed model can be extended to yield variable number of landmark points and hence broadening its applicability to other datasets. Instead of increasing depth or width of the network, we train the CNN efficiently with Mask-Softmax Loss and hard sample mining to achieve upto $15\%$ reduction in error compared to state-of-the-art methods for extreme and medium pose face images from challenging datasets including AFLW, AFW, COFW and IBUG.
• Automated emotion recognition in the wild from facial images remains a challenging problem. Although recent advances in Deep Learning have supposed a significant breakthrough in this topic, strong changes in pose, orientation and point of view severely harm current approaches. In addition, the acquisition of labeled datasets is costly, and current state-of-the-art deep learning algorithms cannot model all the aforementioned difficulties. In this paper, we propose to apply a multi-task learning loss function to share a common feature representation with other related tasks. Particularly we show that emotion recognition benefits from jointly learning a model with a detector of facial Action Units (collective muscle movements). The proposed loss function addresses the problem of learning multiple tasks with heterogeneously labeled data, improving previous multi-task approaches. We validate the proposal using two datasets acquired in non controlled environments, and an application to predict compound facial emotion expressions.
• The two most important requirements for unsupervised data-dependent hashing methods are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. Even though there are many hashing methods that have been proposed in the literature, there is room for improvement to address both requirements simultaneously and adequately. In this paper, we propose a novel approach, named Simultaneous Compression and Quantization (SCQ), to jointly learn to compress and binarize input data in a single formulation. A simple scale pre-processing is introduced to help to preserve data similarity. With this approach, we introduce a loss function and its relaxed version, termed Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE) respectively, which involve the challenging binary and orthogonal constraints. We then propose novel algorithms that can effectively handle these challenging constraints. Comprehensive experiments on unsupervised image retrieval show that our proposed methods consistently outperform other state-of-the-art hashing methods while still being very computationally-efficient.
• Deep neural networks have been shown to be vulnerable to adversarial examples: very small perturbations of the input having a dramatic impact on the predictions. A wealth of adversarial attacks and distance metrics to quantify the similarity between natural and adversarial images have been proposed, recently enlarging the scope of adversarial examples with geometric transformations beyond pixel-wise attacks. In this context, we investigate the robustness to adversarial attacks of new Convolutional Neural Network architectures providing equivariance to rotations. We found that rotation-equivariant networks are significantly less vulnerable to geometric-based attacks than regular networks on the MNIST, CIFAR-10, and ImageNet datasets.
• Osteoarthritis is a disease found in the world, including in Indonesia. The purpose of this study was to detect the disease Osteoarthritis using Self Organizing mapping (SOM), and to know the procedure of artificial intelligence on the methods of Self Organizing Mapping (SOM). In this system, there are several stages to preserve to detect disease Osteoarthritis using Self Organizing maps is the result of photographic images rontgen Ossa Manus normal and sick with the resolution (150 x 200 pixels) do the repair phase contrast, the Gray scale, thresholding process, Histogram of process , and do the last process, where the process of doing training (Training) and testing on images that have kept the shape data (.text). the conclusion is the result of testing by using a data image, where 42 of data have 12 Normal image data and image data 30 sick. On the results of the process of training data there are 8 X-ray image revealed normal right and 19 data x-ray image of pain expressed is correct. Then the accuracy on the process of training was 96.42%, and in the process of testing normal true image 4 obtained revealed Normal, 9 data pain stated true pain and 1 data imagery hurts stated incorrectly, then the accuracy gained from the results of testing are 92,8%.
• It is demonstrated that the classical Hough transform with shift-elevation parametrization of digital straight lines has additive complexity of at most $\mathcal{O}(n^3 / \log n)$ on a $n\times n$ image. The proof is constructive and uses ensemble computation approach to build summation circuits. The proposed method has similarities with the fast Hough transform (FHT) and may be considered a form of the "divide and conquer" technique. It is based on the fact that lines with close slopes can be decomposed into common components, allowing generalization for other pattern families. When applied to FHT patterns, the algorithm yields exactly the $\Theta(n^2\log n)$ FHT asymptotics which might suggest that the actual classical Hough transform circuits could smaller size than $\Theta(n^3/ \log n)$.
• In digital photography, two image restoration tasks have been studied extensively and resolved independently: demosaicing and super-resolution. Both these tasks are related to resolution limitations of the camera. Performing super-resolution on a demosaiced images simply exacerbates the artifacts introduced by demosaicing. In this paper, we show that such accumulation of errors can be easily averted by jointly performing demosaicing and super-resolution. To this end, we propose a deep residual network for learning an end-to-end mapping between Bayer images and high-resolution images. By training on high-quality samples, our deep residual demosaicing and super-resolution network is able to recover high-quality super-resolved images from low-resolution Bayer mosaics in a single step without producing the artifacts common to such processing when the two operations are done separately. We perform extensive experiments to show that our deep residual network achieves demosaiced and super-resolved images that are superior to the state-of-the-art both qualitatively and in terms of PSNR and SSIM metrics.
• In this paper, we propose a new variant of Linear Discriminant Analysis to overcome underlying drawbacks of traditional LDA and other LDA variants targeting problems involving imbalanced classes. Traditional LDA sets assumptions related to Gaussian class distribution and neglects influence of outlier classes, that might hurt in performance. We exploit intuitions coming from a probabilistic interpretation of visual saliency estimation in order to define saliency of a class in multi-class setting. Such information is then used to redefine the between-class and within-class scatters in a more robust manner. Compared to traditional LDA and other weight-based LDA variants, the proposed method has shown certain improvements on facial image classification problems in publicly available datasets.
• Salient object detection, which aims to identify and locate the most salient pixels or regions in images, has been attracting more and more interest due to its various real-world applications. However, this vision task is quite challenging, especially under complex image scenes. Inspired by the intrinsic reflection of natural images, in this paper we propose a novel feature learning framework for large-scale salient object detection. Specifically, we design a symmetrical fully convolutional network (SFCN) to learn complementary saliency features under the guidance of lossless feature reflection. The location information, together with contextual and semantic information, of salient objects are jointly utilized to supervise the proposed network for more accurate saliency predictions. In addition, to overcome the blurry boundary problem, we propose a new structural loss function to learn clear object boundaries and spatially consistent saliency. The coarse prediction results are effectively refined by these structural information for performance improvements. Extensive experiments on seven saliency detection datasets demonstrate that our approach achieves consistently superior performance and outperforms the very recent state-of-the-art methods.
• Manipulation and re-use of images in scientific publications is a concerning problem that currently lacks a scalable solution. Current tools for detecting image duplication are mostly manual or semi-automated, despite the availability of an overwhelming target dataset for a learning-based approach. This paper addresses the problem of determining if, given two images, one is a manipulated version of the other by means of copy, rotation, translation, scale, perspective transform, histogram adjustment, or partial erasing. We propose a data-driven solution based on a 3-branch Siamese Convolutional Neural Network. The ConvNet model is trained to map images into a 128-dimensional space, where the Euclidean distance between duplicate images is smaller than or equal to 1, and the distance between unique images is greater than 1. Our results suggest that such an approach has the potential to improve surveillance of the published and in-peer-review literature for image manipulation.
• Object detection is a major challenge in computer vision, involving both object classification and object localization within a scene. While deep neural networks have been shown in recent years to yield very powerful techniques for tackling the challenge of object detection, one of the biggest challenges with enabling such object detection networks for widespread deployment on embedded devices is high computational and memory requirements. Recently, there has been an increasing focus in exploring small deep neural network architectures for object detection that are more suitable for embedded devices, such as Tiny YOLO and SqueezeDet. Inspired by the efficiency of the Fire microarchitecture introduced in SqueezeNet and the object detection performance of the single-shot detection macroarchitecture introduced in SSD, this paper introduces Tiny SSD, a single-shot detection deep convolutional neural network for real-time embedded object detection that is composed of a highly optimized, non-uniform Fire sub-network stack and a non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers designed specifically to minimize model size while maintaining object detection performance. The resulting Tiny SSD possess a model size of 2.3MB (~26X smaller than Tiny YOLO) while still achieving an mAP of 61.3% on VOC 2007 (~4.2% higher than Tiny YOLO). These experimental results show that very small deep neural network architectures can be designed for real-time object detection that are well-suited for embedded scenarios.
• Photorealistic image style transfer algorithms aim at stylizing a content photo using the style of a reference photo with the constraint that the stylized photo should remains photorealistic. While several methods exist for this task, they tend to generate spatially inconsistent stylizations with noticeable artifacts. In addition, these methods are computationally expensive, requiring several minutes to stylize a VGA photo. In this paper, we present a novel algorithm to address the limitations. The proposed algorithm consists of a stylization step and a smoothing step. While the stylization step transfers the style of the reference photo to the content photo, the smoothing step encourages spatially consistent stylizations. Unlike existing algorithms that require iterative optimization, both steps in our algorithm have closed-form solutions. Experimental results show that the stylized photos generated by our algorithm are twice more preferred by human subjects in average. Moreover, our method runs 60 times faster than the state-of-the-art approach. Code and additional results are available at https://github.com/NVIDIA/FastPhotoStyle.
• Robust model fitting plays a vital role in computer vision, and research into algorithms for robust fitting continues to be active. Arguably the most popular paradigm for robust fitting in computer vision is consensus maximisation, which strives to find the model parameters that maximise the number of inliers. Despite the significant developments in algorithms for consensus maximisation, there has been a lack of fundamental analysis of the problem in the computer vision literature. In particular, whether consensus maximisation is "tractable" remains a question that has not been rigorously dealt with, thus making it difficult to assess and compare the performance of proposed algorithms, relative to what is theoretically achievable. To shed light on these issues, we present several computational hardness results for consensus maximisation. Our results underline the fundamental intractability of the problem, and resolve several ambiguities existing in the literature.
• Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content could be assigned with fine-grained labels describing major components, coarse-grained labels depicting high level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and Structured Inference Neural Network (SINN) for performing graph-based inference in label space and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression on untrimmed videos. The methods were evaluated on (i) the Animal with Attributes (AwA), Scene Understanding (SUN) and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements against baselines.
• Unsupervised image translation, which aims in translating two independent sets of images, is challenging in discovering the correct correspondences without paired data. Existing works build upon Generative Adversarial Network (GAN) such that the distribution of the translated images are indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn the instance-level correspondences (e.g. aligned semantic parts in object configuration task). This limitation often results in false positives (e.g. geometric or semantic artifacts), and further leads to mode collapse problem. To address the above issues, we propose a novel framework for instance-level image translation by Deep Attention GAN (DA-GAN). Such a design enables DA-GAN to decompose the task of translating samples from two sets into translating instances in a highly-structured latent space. Specifically, we jointly learn a deep attention encoder, and the instancelevel correspondences could be consequently discovered through attending on the learned instance pairs. Therefore, the constraints could be exploited on both set-level and instance-level. Comparisons against several state-ofthe- arts demonstrate the superiority of our approach, and the broad application capability, e.g, pose morphing, data augmentation, etc., pushes the margin of domain translation problem.
• Feb 20 2018 cs.CV arXiv:1802.06446v1
Purpose. Intraoperative Optical Coherence Tomography (iOCT) is an increasingly available imaging technique for ophthalmic microsurgery that provides high-resolution cross-sectional information of the surgical scene. We propose to build on its desirable qualities and present a method for tracking the orientation and location of a surgical needle. Thereby, we enable direct analysis of instrument-tissue interaction directly in OCT space without complex multimodal calibration that would be required with traditional instrument tracking methods. Method. The intersection of the needle with the iOCT scan is detected by a peculiar multi-step ellipse fitting that takes advantage of the directionality of the modality. The geometric modelling allows us to use the ellipse parameters and provide them into a latency aware estimator to infer the 5DOF pose during needle movement. Results. Experiments on phantom data and ex-vivo porcine eyes indicate that the algorithm retains angular precision especially during lateral needle movement and provides a more robust and consistent estimation than baseline methods. Conclusion. Using solely crosssectional iOCT information, we are able to successfully and robustly estimate a 5DOF pose of the instrument in less than 5.5 ms on a CPU.
• Sign recognition is an integral part of autonomous cars. Any misclassification of traffic signs can potentially lead to a multitude of disastrous consequences, ranging from a life-threatening accident to a large-scale interruption of transportation services relying on autonomous cars. In this paper, we propose and examine realistic security attacks against sign recognition systems for Deceiving Autonomous caRs with Toxic Signs (we call the proposed attacks DARTS). Leveraging the concept of adversarial examples, we modify innocuous signs/advertisements in the environment in such a way that they seem normal to human observers but are interpreted as the adversary's desired traffic sign by autonomous cars. Further, we pursue a fundamentally different perspective to attacking autonomous cars, motivated by the observation that the driver and vehicle-mounted camera see the environment from different angles (the camera commonly sees the road with a higher angle, e.g., from top of the car). We propose a novel attack against vehicular sign recognition systems: we create signs that change as they are viewed from different angles, and thus, can be interpreted differently by the driver and sign recognition. We extensively evaluate the proposed attacks under various conditions: different distances, lighting conditions, and camera angles. We first examine our attacks virtually, i.e., we check if the digital images of toxic signs can deceive the sign recognition system. Further, we investigate the effectiveness of attacks in real-world settings: we print toxic signs, install them in the environment, capture videos using a vehicle-mounted camera, and process them using our sign recognition pipeline.
• Feb 20 2018 cs.CV arXiv:1802.06424v1
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU and the fusion of multiple streams/modalities takes place via another 2-layer BGRU. A slight improvement in the classification rate over an end-to-end audio-only and MFCC-based model is reported in clean audio conditions and low levels of noise. In presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
• The campaign against drug abuse is fought by all countries, most notably on ATS drugs. The technical limitations of the current test kits to detect new brand of ATS drugs present a challenge to law enforcement authorities and forensic laboratories. Meanwhile, new molecular imaging devices which allowed mankind to characterize the physical 3D molecular structure have been recently introduced, and it can be used to remedy the limitations of existing drug test kits. Thus, a new type of 3D molecular structure representation technique should be developed to cater the 3D molecular structure acquired physically using these molecular imaging devices. One of the applications of image processing methods to represent a 3D image is 3D moments, and this study formulates a new 3D moments technique, namely 3D Hahn moments, to represent the 3D molecular structure of ATS drugs. The performance of the proposed technique was analysed using drug chemical structures obtained from UNODC for the ATS drugs, while non-ATS drugs are obtained randomly from ChemSpider database. The evaluation shows the technique is qualified to be further explored in the future works to be fully compatible with ATS drug identification domain.
• Silent speech interfaces have been recently proposed as a way to enable communication when the acoustic signal is not available. This introduces the need to build visual speech recognition systems for silent and whispered speech. However, almost all the recently proposed systems have been trained on vocalised data only. This is in contrast with evidence in the literature which suggests that lip movements change depending on the speech mode. In this work, we introduce a new audiovisual database which is publicly available and contains normal, whispered and silent speech. To the best of our knowledge, this is the first study which investigates the differences between the three speech modes using the visual modality only. We show that an absolute decrease in classification rate of up to 3.7% is observed when training and testing on normal and whispered, respectively, and vice versa. An even higher decrease of up to 8.5% is reported when the models are tested on silent speech. This reveals that there are indeed visual differences between the 3 speech modes and the common assumption that vocalized training data can be used directly to train a silent speech recognition system may not be true.
• Strong intelligent machines powered by deep neural networks are increasingly deployed as black boxes to make decisions in risk-sensitive domains, such as finance and medical. To reduce potential risk and build trust with users, it is critical to interpret how such machines make their decisions. Existing works interpret a pre-trained neural network by analyzing hidden neurons, mimicking pre-trained models or approximating local predictions. However, these methods do not provide a guarantee on the exactness and consistency of their interpretation. In this paper, we propose an elegant closed form solution named $OpenBox$ to compute exact and consistent interpretations for the family of Piecewise Linear Neural Networks (PLNN). The major idea is to first transform a PLNN into a mathematically equivalent set of linear classifiers, then interpret each linear classifier by the features that dominate its prediction. We further apply $OpenBox$ to demonstrate the effectiveness of non-negative and sparse constraints on improving the interpretability of PLNNs. The extensive experiments on both synthetic and real world data sets clearly demonstrate the exactness and consistency of our interpretation.
• Recognizing a license plate clearly while seeing a surveillance camera snapshot is often important in cases where the troublemaker vehicle(s) have to be identified. In many real world situations, these images are blurred due to fast motion of the vehicle and cannot be recognized by the human eye. For this kind of blurring, the kernel involved can be said to be a linear uniform convolution described by its angle and length. We propose a new de-blurring technique in this paper to parametrically estimate the kernel as accurately as possible with emphasis on the length estimation process. We use a technique which employs Hough transform in estimating the kernel angle. To accurately estimate the kernel length, a novel approach using the cepstral transform is introduced. We compare the de-blurred results obtained using our scheme with those of other recently introduced blind de-blurring techniques. The comparisons corroborate that our scheme can remove a large blur from the image captured by the camera to recover vital semantic information about the license plate.
• Major winning Convolutional Neural Networks (CNNs), such as VGGNet, ResNet, DenseNet, \etc, include tens to hundreds of millions of parameters, which impose considerable computation and memory overheads. This limits their practical usage in training and optimizing for real-world applications. On the contrary, light-weight architectures, such as SqueezeNet, are being proposed to address this issue. However, they mainly suffer from low accuracy, as they have compromised between the processing power and efficiency. These inefficiencies mostly stem from following an ad-hoc designing procedure. In this work, we discuss and propose several crucial design principles for an efficient architecture design and elaborate intuitions concerning different aspects of the design procedure. Furthermore, we introduce a new layer called \it SAF-pooling to improve the generalization power of the network while keeping it simple by choosing best features. Based on such principles, we propose a simple architecture called \it SimpNet. We empirically show that SimpNet provides a good trade-off between the computation/memory efficiency and the accuracy solely based on these primitive but crucial principles. SimpNet outperforms the deeper and more complex architectures such as VGGNet, ResNet, WideResidualNet \etc, on several well-known benchmarks, while having 2 to 25 times fewer number of parameters and operations. We obtain state-of-the-art results (in terms of a balance between the accuracy and the number of involved parameters) on standard datasets, such as CIFAR10, CIFAR100, MNIST and SVHN. The implementations are available at \hrefurlhttps://github.com/Coderx7/SimpNet.
• We present a framework for learning efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with traditional classification loss. The major strengths of our work lie in: (i) the efficient usage of synthetic data to pre-train a deep network, (ii) an adapted version of ResNet-34 architecture with region of interest pooling (referred as HWNet v2) which learns discriminative features with variable sized word images, and (iii) realistic augmentation of training data with multiple scales and elastic distortion which mimics the natural process of handwriting. We further investigate the process of fine-tuning at various layers to reduce the domain gap between synthetic and real domain and also analyze the in-variances learned at different layers using recent visualization techniques proposed in literature. Our representation leads to state of the art word spotting performance on standard handwritten datasets and historical manuscripts in different languages with minimal representation size. On the challenging IAM dataset, our method is first to report an mAP above 0.90 for word spotting with a representation size of just 32 dimensions. Further more, we also present results on printed document datasets in English and Indic scripts which validates the generic nature of the proposed framework for learning word image representation.
• Early detection of lung nodules is of great importance in lung cancer screening. Existing research recognizes the critical role played by CAD systems in early detection and diagnosis of lung nodules. However, many CAD systems, which are used as cancer detection tools, produce a lot of false positives (FP) and require a further FP reduction step. Furthermore, guidelines for early diagnosis and treatment of lung cancer are consist of different shape and volume measurements of abnormalities. Segmentation is at the heart of our understanding of nodules morphology making it a major area of interest within the field of computer aided diagnosis systems. This study set out to test the hypothesis that joint learning of false positive (FP) nodule reduction and nodule segmentation can improve the computer aided diagnosis (CAD) systems' performance on both tasks. To support this hypothesis we propose a 3D deep multi-task CNN to tackle these two problems jointly. We tested our system on LUNA16 dataset and achieved an average dice similarity coefficient (DSC) of 91% as segmentation accuracy and a score of nearly 92% for FP reduction. As a proof of our hypothesis, we showed improvements of segmentation and FP reduction tasks over two baselines. Our results support that joint training of these two tasks through a multi-task learning approach improves system performance on both. We also showed that a semi-supervised approach can be used to overcome the limitation of lack of labeled data for the 3D segmentation task.
• Feb 20 2018 cs.CV arXiv:1802.06140v1
Motivated by the growing demand for interactive environments, we propose an accurate real-time 3D shape reconstruction technique. To provide a reliable 3D reconstruction which is still a challenging task when dealing with real-world applications, we integrate several components including (i) Photometric Stereo (PS), (ii) perspective Cook-Torrance reflectance model that enables PS to deal with a broad range of possible real-world object reflections, (iii) realistic lightening situation, (iv) a Recurrent Optimization Network (RON) and finally (v) heuristic Dijkstra Gaussian Mean Curvature (DGMC) initialization approach. We demonstrate the potential benefits of our hybrid model by providing 3D shape with highly-detailed information from micro-prints for the first time. All real-world images are taken by a mobile phone camera under a simple setup as a consumer-level equipment. In addition, complementary synthetic experiments confirm the beneficial properties of our novel method and its superiority over the state-of-the-art approaches.
• Feb 20 2018 cs.CV arXiv:1802.06130v1
Denoising is a fundamental imaging problem. Versatile but fast filtering has been demanded for mobile camera systems. We present an approach to multiscale filtering which allows real-time applications on low-powered devices. The key idea is to learn a set of kernels that upscales, filters, and blends patches of different scales guided by local structure analysis. This approach is trainable so that learned filters are capable of treating diverse noise patterns and artifacts. Experimental results show that the presented approach produces comparable results to state-of-the-art algorithms while processing time is orders of magnitude faster.
• The ability for computational agents to reason about the high-level content of real world scene images is important for many applications. Existing attempts at addressing the problem of complex scene understanding lack representational power, efficiency, and the ability to create robust meta-knowledge about scenes. In this paper, we introduce scenarios as a new way of representing scenes. The scenario is a simple, low-dimensional, data-driven representation consisting of sets of frequently co-occurring objects and is useful for a wide range of scene understanding tasks. We learn scenarios from data using a novel matrix factorization method which we integrate into a new neural network architecture, the ScenarioNet. Using ScenarioNet, we can recover semantic information about real world scene images at three levels of granularity: 1) scene categories, 2) scenarios, and 3) objects. Training a single ScenarioNet model enables us to perform scene classification, scenario recognition, multi-object recognition, content-based scene image retrieval, and content-based image comparison. In addition to solving many tasks in a single, unified framework, ScenarioNet is more computationally efficient than other CNNs because it requires significantly fewer parameters while achieving similar performance on benchmark tasks and is more interpretable because it produces explanations when making decisions. We validate the utility of scenarios and ScenarioNet on a diverse set of scene understanding tasks on several benchmark datasets.
• While great advances are made in pattern recognition and machine learning, the successes of such fields remain restricted to narrow applications and seem to break down when training data is scarce, a shift in domain occurs, or when intelligent reasoning is required for rapid adaptation to new environments. In this work, we list several of the shortcomings of modern machine-learning solutions, specifically in the contexts of computer vision and in reinforcement learning and suggest directions to explore in order to try to ameliorate these weaknesses.
• Social media, as a major platform for communication and information exchange, is a rich repository of the opinions and sentiments of 2.3 billion users about a vast spectrum of topics. To sense the whys of certain social user's demands and cultural-driven interests, however, the knowledge embedded in the 1.8 billion pictures which are uploaded daily in public profiles has just started to be exploited since this process has been typically been text-based. Following this trend on visual-based social analysis, we present a novel methodology based on Deep Learning to build a combined image-and-text based personality trait model, trained with images posted together with words found highly correlated to specific personality traits. So the key contribution here is to explore whether OCEAN personality trait modeling can be addressed based on images, here called \emphMindPics, appearing with certain tags with psychological insights. We found that there is a correlation between those posted images and their accompanying texts, which can be successfully modeled using deep neural networks for personality estimation. The experimental results are consistent with previous cyber-psychology results based on texts or images. In addition, classification results on some traits show that some patterns emerge in the set of images corresponding to a specific text, in essence to those representing an abstract concept. These results open new avenues of research for further refining the proposed personality model under the supervision of psychology experts.

aaditya prakash Jan 29 2018 19:58 UTC

Eddie Smolansky May 26 2017 05:23 UTC

Updated summary [here](https://github.com/eddiesmo/papers).

# How they made the dataset
- automated filtering with yolo and landmark detection projects
- crowd source final filtering (AMT - give 50 face images to turks and ask which don't belong)
- quality control through s

...(continued)
Qian Wang Mar 07 2017 17:21 UTC

"To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics."