
Deep Learning

딜레이라마 2017. 2. 27. 21:23

DEEP LEARNING TUTORIALS

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms. Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. For more about deep learning algorithms, see for example:

• The monograph or review paper Learning Deep Architectures for AI (Foundations & Trends in Machine Learning, 2009). 

• The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references. 

• The LISA public wiki has a reading list and a bibliography. 

• Geoff Hinton has readings from 2009’s NIPS tutorial.

The tutorials presented here will introduce you to some of the most important deep learning algorithms and will also show you how to run them using Theano. Theano is a Python library that makes writing deep learning models easy and gives the option of training them on a GPU. The algorithm tutorials have some prerequisites: you should know some Python and be familiar with numpy. Since this tutorial is about using Theano, you should read over the Theano basic tutorial first. Once you’ve done that, read through our Getting Started chapter – it introduces the notation, the downloadable datasets used in the algorithm tutorials, and the way we do optimization by stochastic gradient descent.

The purely supervised learning algorithms are meant to be read in order: 

1. Logistic Regression - using Theano for something simple 

2. Multilayer perceptron - introduction to layers 

3. Deep Convolutional Network - a simplified version of LeNet5 

The unsupervised and semi-supervised learning algorithms can be read in any order (the auto-encoders can be read independently of the RBM/DBN thread):

• Auto Encoders, Denoising Autoencoders - description of autoencoders 

• Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets 

• Restricted Boltzmann Machines - single layer generative RBM model 

• Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning 

Datasets

The MNIST dataset consists of handwritten digit images and is divided into a training set of 60,000 examples and a test set of 10,000 examples. In many papers, as well as in this tutorial, the official training set of 60,000 examples is divided into an actual training set of 50,000 examples and 10,000 validation examples (for selecting hyper-parameters such as the learning rate and the size of the model). All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the original dataset each pixel of the image is represented by a value between 0 and 255, where 0 is black, 255 is white and anything in between is a different shade of grey.
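As a quick illustration of that raw encoding, converting 0–255 integer pixels to floats in [0, 1] is a one-line rescaling. The array below is a made-up 2 x 2 patch, not real MNIST data:

```python
import numpy

# A made-up 2x2 patch of 8-bit grayscale pixels (0 = black, 255 = white)
raw = numpy.array([[0, 128],
                   [64, 255]], dtype=numpy.uint8)

# Rescale to floats in [0, 1], the range the pickled dataset uses
scaled = raw.astype(numpy.float32) / 255.0
print(scaled.min(), scaled.max())  # 0.0 1.0
```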

For convenience we pickled the dataset to make it easier to use in Python. It is available for download here. The pickled file represents a tuple of 3 lists: the training set, the validation set and the testing set. Each of the three lists is a pair formed from a list of images and a list of class labels for each of the images. An image is represented as a numpy 1-dimensional array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). The labels are numbers between 0 and 9 indicating which digit the image represents. The code block below shows how to load the dataset.

import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
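The snippet above is Python 2 code (cPickle, and mnist.pkl.gz itself was pickled under Python 2). Under Python 3 the same file can be read with the pickle module, usually with encoding='latin1'. The sketch below builds a tiny stand-in file so it runs without the download; the real mnist.pkl.gz has 50,000/10,000/10,000 examples rather than 2:

```python
import gzip
import pickle
import numpy

# Build a tiny stand-in for mnist.pkl.gz: a (train, valid, test) tuple,
# each element a (images, labels) pair, mirroring the real file's layout.
images = numpy.zeros((2, 784), dtype=numpy.float32)
labels = numpy.array([3, 7])
fake = ((images, labels), (images, labels), (images, labels))
with gzip.open('mnist_standin.pkl.gz', 'wb') as f:
    pickle.dump(fake, f)

# Loading mirrors the Python 2 snippet; encoding='latin1' is what is
# commonly needed for the real, Python 2-pickled mnist.pkl.gz.
with gzip.open('mnist_standin.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

train_x, train_y = train_set
print(train_x.shape)  # (2, 784)
```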


When using the dataset, we usually divide it into minibatches (see Stochastic Gradient Descent). We encourage you to store the dataset in shared variables and to access it based on the minibatch index, given a fixed and known batch size. The reason behind shared variables is related to using the GPU. There is a large overhead when copying data into GPU memory. If you copied data on request (each minibatch individually, when needed), as the code will do if you do not use shared variables, the GPU code would not be much faster than the CPU code (it might even be slower) because of this overhead. If your data is in Theano shared variables, however, you give Theano the possibility of copying the entire data onto the GPU in a single call when the shared variables are constructed. Afterwards the GPU can access any minibatch by taking a slice of these shared variables, without copying any information from CPU memory, thereby bypassing the overhead. Because the datapoints and their labels are usually of different natures (labels are usually integers while datapoints are real numbers), we suggest using different variables for labels and data. We also recommend using different variables for the training set, validation set and testing set, to make the code more readable (resulting in 6 different shared variables).

Since the data is now in one variable, and a minibatch is defined as a slice of that variable, it is natural to identify a minibatch by its index and its size. In our setup the batch size stays constant throughout the execution of the code, so a function actually requires only the index to identify which datapoints to work on. The code below shows how to store your data and how to access a minibatch:
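The tutorial's listing for this step is not reproduced in this post. As a plain-numpy stand-in for the indexing scheme (the tutorial itself stores the arrays in Theano shared variables, but the slice arithmetic is identical), a minibatch is just a slice determined by the index and a fixed batch size:

```python
import numpy

batch_size = 500

# Stand-in training data: 2000 examples of 784 features, with integer labels.
train_x = numpy.random.rand(2000, 784).astype(numpy.float32)
train_y = numpy.random.randint(0, 10, size=2000)

def get_minibatch(index):
    """Return the index-th minibatch as (data, labels) slices."""
    lo = index * batch_size
    hi = (index + 1) * batch_size
    return train_x[lo:hi], train_y[lo:hi]

x, y = get_minibatch(2)   # third minibatch
print(x.shape)            # (500, 784)
```

With Theano shared variables, the same slice would be taken symbolically inside the compiled training function, so no data crosses the CPU/GPU boundary per minibatch.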

 Notation

Dataset notation

We label data sets as D. When the distinction is important, we indicate the train, validation, and test sets as D_train, D_valid and D_test. The validation set is used to perform model selection and hyper-parameter selection, whereas the test set is used to evaluate the final generalization error and to compare different algorithms in an unbiased way. The tutorials mostly deal with classification problems, where each data set D is an indexed set of pairs (x^(i), y^(i)). We use superscripts to distinguish training set examples: x^(i) ∈ R^D is thus the i-th training example of dimensionality D. Similarly, y^(i) ∈ {0, ..., L} is the i-th label assigned to input x^(i). It is straightforward to extend these examples to ones where y^(i) has other types (e.g. Gaussian for regression, or groups of multinomials for predicting multiple symbols).
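In code, a data set in this notation is typically a pair of arrays: a design matrix whose i-th row is x^(i), and a label vector whose i-th entry is y^(i). A minimal illustration with made-up numbers (N = 4 examples, D = 3, labels in {0, 1}):

```python
import numpy

# Design matrix: the i-th row is the training example x^(i) (here D = 3).
X = numpy.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 0.5],
                 [0.2, 0.2, 0.2],
                 [3.0, 1.0, 0.0]])
# Label vector: the i-th entry is y^(i).
y = numpy.array([0, 1, 1, 0])

# x^(3) in the text's 1-based superscript notation; arrays are 0-based.
x_3 = X[2]
print(x_3)  # [0.2 0.2 0.2]
```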

 Math Conventions

• W: upper-case symbols refer to a matrix unless specified otherwise 

• W_ij: element at the i-th row and j-th column of matrix W 

• W_i·, W_i: vector, i-th row of matrix W 

• W_·j: vector, j-th column of matrix W 

• b: lower-case symbols refer to a vector unless specified otherwise 

• b_i: i-th element of vector b


List of Symbols and acronyms 

• D: number of input dimensions. 

• D_h^(i): number of hidden units in the i-th layer. 

• f_θ(x), f(x): classification function associated with a model P(Y|x, θ), defined as argmax_k P(Y = k|x, θ). Note that we will often drop the θ subscript. 

• L: number of labels. 

• L(θ, D): log-likelihood of the model defined by parameters θ on the data set D. 

• ℓ(θ, D): empirical loss of the prediction function f parameterized by θ on data set D.

• NLL: negative log-likelihood 

• θ: set of all parameters for a given model 
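To make a few of these symbols concrete, here is a small numpy sketch of the classification function f(x) = argmax_k P(Y = k|x, θ), the NLL contribution of one labelled example, and a zero-one empirical loss. The probabilities are made up, not produced by a trained model:

```python
import numpy

# Made-up class-membership probabilities P(Y = k | x, theta) for one input x,
# over 3 labels; a trained model would produce these (e.g. via softmax).
p = numpy.array([0.1, 0.7, 0.2])

# f(x) = argmax_k P(Y = k | x, theta)
f_x = int(numpy.argmax(p))
print(f_x)  # 1

# Contribution of one example (x, y) to the NLL: -log P(Y = y | x, theta)
y = 1
nll = -numpy.log(p[y])

# Zero-one empirical loss for this single prediction
loss = int(f_x != y)
print(loss)  # 0
```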


Reference: LISA lab, University of Montreal
