Statistical Learning Theory / Vladimir N. Vapnik. The Nature of Statistical Learning Theory. An Overview of Statistical Learning Theory, Vladimir N. Vapnik. This article presents a very general overview of statistical learning theory.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second Edition. Springer (Statistics for Engineering and Information Science).
The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results. A statistical framework means that assumptions of a statistical nature are made about the underlying phenomena, that is, about the way the data are generated. As a motivation for the need of such a theory, let us just quote V.
Matrix factorization (also called matrix decomposition) splits a matrix into its constituent parts. It is analogous to the factoring of numbers, such as the factoring of 10 into 2 x 5, and it is used to solve systems of linear equations.
The following matrix factorization techniques are available. LU decomposition applies to square matrices and decomposes a matrix into a lower-triangular component L and an upper-triangular component U. QR decomposition applies to any m x n matrix, so unlike LU decomposition it is not limited to square matrices; it decomposes a matrix into an orthogonal component Q and an upper-triangular component R. Cholesky decomposition applies to symmetric positive-definite matrices and is used for solving linear least squares for linear regression, as well as in simulation and optimization methods.
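To make two of these factorizations concrete, here is a minimal pure-Python sketch of LU (Doolittle scheme, without pivoting) and Cholesky decomposition. The function names and the 2 x 2 example matrix are illustrative assumptions, not taken from the text. A production routine would add pivoting and error handling.

```python
import math

def lu_decompose(a):
    """Doolittle LU without pivoting: returns (L, U) with A = L U.
    Assumes `a` is square and that no zero pivot is encountered."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill row i of U
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):  # fill column i of L
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def cholesky(a):
    """Lower-triangular C with A = C C^T.
    Assumes `a` is symmetric positive definite."""
    n = len(a)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

# Hypothetical example matrix (symmetric positive definite).
A = [[4.0, 2.0], [2.0, 3.0]]
L, U = lu_decompose(A)
C = cholesky(A)
```

Multiplying the factors back together (`matmul(L, U)` and `matmul(C, transpose(C))`) should reproduce `A` up to rounding, which is a quick sanity check for either routine.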
Singular value decomposition is explained in the next section.

Singular Value Decomposition

In the previous section, we saw the eigendecomposition of a matrix, which decomposes it into eigenvectors and eigenvalues. Singular value decomposition is a matrix factorization method that decomposes a matrix into singular vectors and singular values.
It has useful applications in signal processing, psychology, sociology, climate, atmospheric science, statistics, and astronomy. It is also used in computing least-squares solutions.
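As a hedged illustration of the least-squares use, the sketch below solves a small overdetermined system with NumPy's `np.linalg.svd` and compares the result with `np.linalg.lstsq`. The data are made up for the example, and the matrix is assumed to have full column rank so that no singular value is zero.

```python
import numpy as np

# Hypothetical design matrix (intercept column plus one feature) and targets.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Least-squares solution x = V diag(1/s) U^T b (valid when all s > 0).
x_svd = Vt.T @ ((U.T @ b) / s)

# Reference solution from NumPy's least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

# The same solution is obtained by applying the pseudoinverse of A to b.
x_pinv = np.linalg.pinv(A) @ b
```

All three vectors should agree up to rounding. The SVD route is usually preferred when the matrix is ill-conditioned, because small singular values can be inspected or truncated explicitly.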
The Moore-Penrose inverse is the most popular form of matrix pseudoinverse.

The Hadamard product is also known as the element-wise product. It is simpler than the matrix product.
The Hadamard product is commutative, associative, and distributive over addition. Its inverse and powers are computed element by element (the inverse exists when every entry is nonzero), which simplifies the computation of power matrices. It is used in various fields such as code correction in satellite transmissions, information theory, cryptography, pattern recognition, neural networks, maximum likelihood estimation, JPEG lossy compression, multivariate statistical analysis, and linear modeling.
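The element-wise definition and the algebraic properties listed above can be checked directly. The following is a small pure-Python sketch; the helper names and example matrices are assumptions made for illustration.

```python
def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    """Element-wise sum, used to check distributivity."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]

AB = hadamard(A, B)  # [[5, 12], [21, 32]]

commutative = hadamard(A, B) == hadamard(B, A)
associative = hadamard(hadamard(A, B), C) == hadamard(A, hadamard(B, C))
distributive = hadamard(A, add(B, C)) == add(hadamard(A, B), hadamard(A, C))
```

Note that, unlike the matrix product, the Hadamard product requires both operands to have exactly the same shape.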
Model selection can be based on the principle of maximum entropy, which states that among the conflicting models, the one with the highest entropy is the best. If the logarithm is taken to be the natural logarithm, the entropy is expressed in nats. More commonly, the logarithm is taken to base 2 and the entropy is expressed in bits.

The Kullback-Leibler (KL) divergence measures the dissimilarity of one probability distribution P from another, reference probability distribution Q. For discrete distributions it can be expressed as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$

It does sound like a distance measure, but it is not: it is not symmetric in P and Q.
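A minimal sketch of these quantities for discrete distributions follows. It assumes the standard Shannon entropy and the discrete Kullback-Leibler divergence; the function names and example distributions are illustrative. The `base` argument switches between bits (base 2) and nats (base e).

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (zero terms are skipped)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl_divergence(p, q, base=2.0):
    """Divergence of p from the reference q.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]
biased = [0.9, 0.1]

h_bits = entropy(fair)            # entropy of a fair coin, in bits
h_nats = entropy(fair, math.e)    # the same entropy, in nats

d_pq = kl_divergence(fair, biased)
d_qp = kl_divergence(biased, fair)  # differs from d_pq: the divergence is not symmetric
```

Comparing `d_pq` with `d_qp` shows concretely why the divergence is not a distance: swapping the two arguments changes the value.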
Gradient descent is an iterative process that searches for a minimum of a given differentiable function. There are three types of the gradient descent algorithm: full batch, stochastic, and mini-batch gradient descent. Full batch gradient descent uses the whole dataset to compute the gradient, while stochastic gradient descent uses a single sample of the dataset.
Mini-batch gradient descent is a compromise between stochastic and batch gradient descent. The training set is split into small groups called batches.
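The three variants differ only in how many examples feed each gradient estimate, so one routine with a `batch_size` parameter can sketch all of them. The example below fits a line by minimizing mean squared error. The data, learning rate, and epoch count are assumptions chosen so that this toy problem converges, not recommended defaults.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.2, epochs=1000, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error.
    batch_size == len(xs): full batch; 1: stochastic; otherwise mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the examples in a new random order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient of the squared error over the current batch.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Hypothetical noise-free data on the line y = 2x + 1, with x scaled to [0, 1].
xs = [i / 7 for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]

full = gradient_descent(xs, ys, batch_size=len(xs))
sgd = gradient_descent(xs, ys, batch_size=1)
mini = gradient_descent(xs, ys, batch_size=4)
```

On this noise-free problem all three runs should approach w = 2 and b = 1. On noisy data, smaller batches give cheaper but noisier steps, which is the trade-off that mini-batches balance.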
In mini-batch gradient descent, the loss is computed on each batch in turn and the results are averaged.

It took at least two decades to understand this fact in full detail. We will talk about this in what follows. In classical statistics a problem analogous to the pattern recognition problem was considered by Ronald Fisher in the s, the so-called problem of discriminant analysis.
Fisher considered the following problem. Given the generative model (the model of how the data are generated), known up to the values of its parameters, estimate the discriminative rule.
1. Realism and Instrumentalism

The proposed solution was: first, using the data, estimate the parameters of the statistical laws; second, construct the optimal decision rule using the estimated parameters. To estimate the densities, Fisher suggested the maximum likelihood method. This scheme was later generalized to the case when the unknown density belonged to a nonparametric family.
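The two-step scheme can be sketched in a deliberately simplified setting. The code below assumes one-dimensional data, two Gaussian classes with a shared variance, and class priors estimated from class frequencies. These modeling choices, the function names, and the data are assumptions for illustration and are not taken from the text.

```python
import math

def fit_gaussian_classes(x0, x1):
    """Step 1: maximum likelihood estimates of the generative model."""
    m0 = sum(x0) / len(x0)
    m1 = sum(x1) / len(x1)
    n = len(x0) + len(x1)
    # Pooled ML variance, shared by both classes (an assumption of this sketch).
    var = (sum((x - m0) ** 2 for x in x0) + sum((x - m1) ** 2 for x in x1)) / n
    prior1 = len(x1) / n
    return m0, m1, var, prior1

def decision_rule(x, m0, m1, var, prior1):
    """Step 2: plug the estimates into the rule that picks the class
    with the larger (log) posterior."""
    def log_posterior(mean, prior):
        return math.log(prior) - (x - mean) ** 2 / (2 * var)
    return 1 if log_posterior(m1, prior1) > log_posterior(m0, 1 - prior1) else 0

# Hypothetical training samples for the two classes.
class0 = [-1.2, -0.8, -1.0, -0.6, -1.4]
class1 = [0.9, 1.1, 1.3, 0.7, 1.0]

params = fit_gaussian_classes(class0, class1)
label_a = decision_rule(-0.9, *params)
label_b = decision_rule(0.8, *params)
```

The point of the sketch is the order of operations: the decision rule is never fitted directly. It is derived entirely from the estimated densities.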
To estimate these generative models the methods of nonparametric statistics were used see example in Chapter 2 Section 2.
This model is based on an understanding of how the data are generated. By the time the Perceptron was introduced, classical discriminant analysis based on Gaussian distribution functions had been studied in great detail. One of the important results obtained for a particular model (two Gaussian distributions with the same covariance matrix) is the introduction of a concept called the Mahalanobis distance.
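For reference, the Mahalanobis distance of a point x from a distribution with mean μ and covariance Σ is the square root of (x − μ)ᵀ Σ⁻¹ (x − μ). The sketch below computes it for the two-dimensional case only, inverting the 2 x 2 covariance matrix in closed form. The function name and the test points are illustrative assumptions.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from `mean` under covariance `cov`.
    Assumes `cov` is a 2 x 2 symmetric positive-definite matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2 x 2 inverse
    dx = [x[0] - mean[0], x[1] - mean[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

# With the identity covariance the distance reduces to the Euclidean distance.
d_identity = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])

# A variance of 4 along the first axis halves distances measured along it.
d_scaled = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]])
```

The second call shows the key property: directions in which the data vary more contribute less to the distance.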
However, constructing this model using classical methods requires the estimation of about 0. Roughly speaking, estimating one parameter of the model requires C examples. The Perceptron used only This shocked theorists.
It looked as if the classical statistical approach failed to overcome the curse of dimensionality in a situation where a heuristic method that minimized the empirical loss easily overcame this curse.
Later, the methods based on the idea of minimizing different types of empirical loss were called the predictive (discriminative) models of induction, in contrast to the classical generative models. In a wide philosophical sense, predictive models do not necessarily connect prediction of an event with understanding of the law that governs the event; they just look for a function that explains the data best.
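As a hedged illustration of a discriminative method, the sketch below implements the classical Perceptron update rule, which adjusts a linear rule only when a training example is misclassified and makes no attempt to model how the data were generated. The text does not spell out the update rule, so the standard textbook form is assumed here. The data and names are hypothetical.

```python
def train_perceptron(samples, labels, epochs=100):
    """Classical Perceptron rule for labels in {-1, +1}.
    Stops early once an epoch passes with no training mistakes."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical linearly separable training set.
X = [(2.0, 1.0), (3.0, 2.0), (2.5, 3.0), (-1.0, -1.0), (-2.0, -0.5), (-1.5, -2.0)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(X, y)
training_errors = sum(predict(w, b, xi) != yi for xi, yi in zip(X, y))
```

No density is estimated anywhere in this procedure. The rule is judged only by how well it separates the training data.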
Instead they depend on the so-called capacity factors of the admissible set of functions (the VC entropy, the growth function, or the VC dimension), which can be much smaller than the dimensionality.
This function minimizes some empirical loss functional, whose construction is similar to the Mahalanobis distance. For a long time this heuristic of Fisher was not considered an important result (it was ignored in most classical statistics textbooks).
Why do the generative and discriminative approaches lead to different results? There are two answers to this very important question, which can be described from two different points of view: technical and philosophical (conceptual).
Two different concepts of what is meant by a good approximation are possible: (1) A good approximation of the black box (BB) rule is a function that is close, in a metric of functional space, to the function that the BB uses. In the classical setting we often assume that the BB uses the Bayesian rule.
In Figure 1. Suppose that the straight line is the function used by the black box. Then, from the point of view of function estimation, the polynomial curve shown in Figure 1. is very different from the line and therefore cannot be a good estimate of the true BB rule. From the other point of view, the polynomial rule separates the data well and (as we will show later) can belong to a set with small VC dimension, and therefore can be a good instrument for prediction. The lesson the Perceptron teaches us is that sometimes it is useful to give up the ambitious goal of estimating the rule the BB uses (the generative model of induction).
Before discussing this question let me make the following remark.