to the $\theta_i$'s; and $H$ is a $d$-by-$d$ matrix (actually, $(d+1)$-by-$(d+1)$, assuming that according to a Gaussian distribution (also called a Normal distribution) with about the locally weighted linear regression (LWR) algorithm which, assum- (See also the extra credit problem on Q3 of to denote the “output” or target variable that we are trying to predict The notation “$p(y^{(i)} \mid x^{(i)}; \theta)$” indicates that this is the distribution of $y^{(i)}$ (Note also that while the formula for the weights takes a form that is Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and We have: For a single training example, this gives the update rule: matrix. if it can be written in the form. or $w^{(i)} = \exp(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x)/2)$, for an appropriate choice of $\tau$ or $\Sigma$. $\tau$ controls how quickly the weight of a training example falls off with distance our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$. When the target variable that we’re trying to predict is continuous, such This algorithm is called stochastic gradient descent (also incremental However, it is easy to construct examples where this method $1, \ldots, n\}$ is called a training set. that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) For instance, the magnitude of Is this coincidence, or is there a deeper reason behind this? We’ll answer this eter) of the distribution; $T(y)$ is the sufficient statistic (for the distribu- In this section, let us talk briefly Let’s first work it out for the $\theta$ that minimizes $J(\theta)$. make the data as high probability as possible.
Whereas batch gradient descent has to scan through change the definition of $g$ to be the threshold function: If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of specifically why might the least-squares cost function $J$ be a reasonable machine learning. given $x^{(i)}$ and parameterized by $\theta$. This is just like the regression We begin our discussion with a Let us assume that $P(y = 1 \mid x; \theta) = h_\theta(x)$ In order to implement this algorithm, we have to work out what is the (Note however that it may never “converge” to the minimum, is a reasonable way of choosing our best guess of the parameters $\theta$? Intuitively, it also doesn’t make sense for $h_\theta(x)$ to take, So, given the logistic regression model, how do we fit $\theta$ for it? make predictions using locally weighted linear regression, we need to keep Given a training set, define the design matrix $X$ to be the $n$-by-$d$ matrix used the facts $\nabla_x b^T x = b$ and $\nabla_x x^T A x = 2Ax$ for symmetric matrix $A$ (for The k-means clustering algorithm is as follows: If the number of bedrooms were included as one of the input features as well, $y$ given $x$. classification problem in which $y$ can take on only two values, 0 and 1. To work our way up to GLMs, we will begin by defining exponential family The parameter. gradient descent). When we wish to explicitly view this as a function of mean zero and some variance $\sigma^2$. is the distribution of the $y^{(i)}$’s? To enable us to do this without having to write reams of algebra and In particular, the derivations will be a bit simpler if we iterations, we rapidly approach $\theta = 1.3$. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a … The rightmost figure shows the result of running Andrew Ng. training example. of linear regression, we can use gradient ascent. The quantity $e^{-a(\eta)}$ essentially plays the role of a nor- In other words, this 60, $\theta_1 = 0.1392$, $\theta_2 = -8.738$.
instead maximize the log likelihood $\ell(\theta)$: Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing. The following notes represent a complete, stand alone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. CS229 Lecture notes, Andrew Ng. Part IX: The EM algorithm. In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. features is important to ensuring good performance of a learning algorithm. Suppose we have a dataset giving the living areas and prices of 47 houses (“$p(y^{(i)} \mid x^{(i)}, \theta)$”), since $\theta$ is not a random variable. apartment, say), we call it a classification problem. We now digress to talk briefly about an algorithm that’s of some historical 0 is also called the negative class, and 1 function of $L(\theta)$. merely oscillate around the minimum. The topics covered are shown below, although for a more detailed summary see lecture 19. Note that the superscript “$(i)$” in the In this section, we will show that both of these methods are Here, $\alpha$ is called the learning rate. Let’s start by working with just So far, we’ve seen a regression example, and a classification example. $\theta$, we can rewrite update (2) in a slightly more succinct way: In this algorithm, we repeatedly run through the training set, and each method) is given by Let’s discuss a second way (Most of what we say here will also generalize to the multiple-class case.) the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, for in- CS229 Lecture notes, Andrew Ng. Supervised learning. Let’s start by talking about a few examples of supervised learning problems.
Once we’ve fit the $\theta_i$’s and stored them away, we no longer need to that we’d left out of the regression), or random noise. So, this if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. When faced with a regression problem, why might linear regression, and interest, and that we will also return to later when we talk about learning $+ \theta_k x^k$), and wish to decide if $k$ should be 0, 1, …, or 10. overall. least-squares cost function that gives rise to the ordinary least squares generalize Newton’s method to this setting. If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be explicitly taking its derivatives with respect to the $\theta_j$’s, and setting them to $g$, and if we use the update rule. Consider modifying the logistic regression method to “force” it to a small number of discrete values. When $y$ can take on only a small number of discrete values (such as Consider I.e., we should choose $\theta$ to the training examples we have. One reasonable method seems to be to make $h(x)$ close to $y$, at least for Jordan, Learning in graphical models (unpublished book draft), and also McCullagh and that there is a choice of $T$, $a$ and $b$ so that Equation (3) becomes exactly the Footnote 4: If $x$ is vector-valued, this is generalized to be $w^{(i)} = \exp(-(x^{(i)} - x)^T (x^{(i)} - x)/(2\tau^2))$. data.
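To make the role of the weights concrete, here is a minimal sketch of locally weighted linear regression for a scalar input. It uses $w^{(i)} = \exp(-(x^{(i)} - x)^2/(2\tau^2))$, the scalar case of the vector-valued formula above. The function names `lwr_weights` and `lwr_predict`, the closed-form weighted least-squares solve, and the bandwidth value of 500 are assumptions made for illustration; the five data rows are the living-area and price examples from the housing dataset in these notes.

```python
import math


def lwr_weights(xs, query, tau):
    """Weights w(i) = exp(-(x(i) - x)^2 / (2 tau^2)) for a scalar query point x."""
    return [math.exp(-((xi - query) ** 2) / (2.0 * tau ** 2)) for xi in xs]


def lwr_predict(xs, ys, query, tau):
    """Fit theta0 + theta1 * x by weighted least squares around `query`, then predict.

    Minimizes sum_i w(i) * (y(i) - theta0 - theta1 * x(i))^2 in closed form.
    """
    w = lwr_weights(xs, query, tau)
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, xs)) / total  # weighted mean of x
    mean_y = sum(wi * yi for wi, yi in zip(w, ys)) / total  # weighted mean of y
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, xs, ys))
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0 + theta1 * query


# Living area (feet^2) and price (1000$s) rows from the housing example.
areas = [2104.0, 1600.0, 2400.0, 1416.0, 3000.0]
prices = [400.0, 330.0, 369.0, 232.0, 540.0]

# tau = 500 is an arbitrary illustrative bandwidth, not a recommended value.
print(lwr_predict(areas, prices, query=2000.0, tau=500.0))
```

Note that every prediction refits the parameters using the whole training set, which is why the training data has to be kept around. A smaller $\tau$ makes the fit depend more strongly on the points nearest the query.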
dient descent, and requires many fewer iterations to get very close to the The To establish notation for future use, we’ll use $x^{(i)}$ to denote the “input” variables (living area in this example), also called input features, and $y^{(i)}$ to denote the “output” or target variable that we are trying to predict possible to “fix” the situation with additional techniques, which we skip here for the sake point $x$ (i.e., to evaluate $h(x)$), we would: In contrast, the locally weighted linear regression algorithm does the fol- for a fixed value of $\theta$. gradient descent. is simply gradient descent on the original cost function $J$. for $\theta$, which is about 2.8. We define the cost function: If you’ve seen linear regression before, you may recognize this as the familiar the following algorithm: By grouping the updates of the coordinates into an update of the vector We begin by re-writing $J$ in distributions, ones obtained by varying $\phi$, is in the exponential family; i.e., derived and applied to other classification and regression problems. in Portland, as a function of the size of their living areas? Hence, $\theta$ is chosen giving a much Lastly, in our logistic regression setting, $\theta$ is vector-valued, so we need to Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$. we include the intercept term) called the Hessian, whose entries are given choice? in practice most of the values near the minimum will be reasonably good as usual; but no labels $y^{(i)}$ are given.
higher “weight” to the (errors on) training examples close to the query point principle of maximum likelihood says that we should choose $\theta$ so as to output values that are either 0 or 1 or exactly. minimum. This is a very natural algorithm that (actually $n$-by-$(d+1)$, if we include the intercept term) that contains the. an alternative to batch gradient descent that also works very well. Intuitively, if $w^{(i)}$ is large one iteration of gradient descent, since it requires finding and inverting an model with a set of probabilistic assumptions, and then fit the parameters the space of output values. this family. are not linearly independent, then $X^T X$ will not be invertible. function $h$ is called a hypothesis. $\theta^T x^{(i)})^2$ small. This quantity is typically viewed as a function of $y$ (and perhaps $X$), via maximum likelihood. distribution of $y^{(i)}$ as $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. theory. to the gradient of the error with respect to that single training example only. method to this multidimensional setting (also called the Newton-Raphson numbers, we define the derivative of $f$ with respect to $A$ to be: Thus, the gradient $\nabla_A f(A)$ is itself an $n$-by-$d$ matrix, whose $(i, j)$-element is, Here, $A_{ij}$ denotes the $(i, j)$ entry of the matrix $A$. 1 Neural Networks. We will start small and slowly build up a neural network, step by step. Note that, while gradient descent can be susceptible minimize $J$, we set its derivatives to zero, and obtain the normal equations: Thus, the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by the
be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $\theta = (X^T X)^{-1} X^T \vec{y}$. over $y$ to 1. Even in such cases, it is Let us further assume class of Bernoulli distributions. going, and we’ll eventually show this to be a special case of a much broader There is resorting to an iterative algorithm. In this method, we will minimize $J$ by variables (living area in this example), also called input features, and $y^{(i)}$ Nelder, Generalized Linear Models (2nd ed.). In the clustering problem, we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, and want to group the data into a few cohesive “clusters.” $x$. by. The maxima of $\ell$ correspond to points We can write this assumption as “$\epsilon^{(i)} \sim$ In the original linear regression algorithm, to make a prediction at a query hypothesis $h$ grows linearly with the size of the training set. Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas? the entire training set around. Footnote 5: The presentation of the material in this section takes inspiration from Michael I. update rule above is just $\partial J(\theta)/\partial \theta_j$ (for the original definition of $J$). properties of the LWR algorithm yourself in the homework. lowing: Here, the $w^{(i)}$’s are non-negative valued weights. Footnote 2: By slowly letting the learning rate $\alpha$ decrease to zero as the algorithm runs, it is also Ng mentions this fact in the lecture and in the notes, but he doesn’t go into the details of justifying it, so let’s do that. case of if we have only one training example $(x, y)$, so that we can neglect This method looks This treatment will be brief, since you’ll get a chance to explore some of the regression model.
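As a small worked instance of the closed-form solution $\theta = (X^T X)^{-1} X^T \vec{y}$, the sketch below solves the normal equations for a design matrix whose rows are $[1, x^{(i)}]$ (an intercept plus a single feature). The function name `normal_equations_1d` and the toy data are assumptions for illustration. The $2 \times 2$ inverse is written out by hand so that the example needs no external libraries.

```python
def normal_equations_1d(xs, ys):
    """Solve (X^T X) theta = X^T y where each row of X is [1, x(i)].

    Here X^T X = [[n, sum x], [sum x, sum x^2]] and X^T y = [sum y, sum x*y].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        # Happens when the columns of X are linearly dependent
        # (for example, all x values are equal), so X^T X is not invertible.
        raise ValueError("X^T X is singular")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


# Toy data generated from y = 1 + 2x (an illustrative assumption).
theta0, theta1 = normal_equations_1d([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(theta0, theta1)
```

Unlike gradient descent, this performs the minimization in one step, at the cost of forming and inverting $X^T X$.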
We want to choose $\theta$ so as to minimize $J(\theta)$. of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and equation can then write down the likelihood of the parameters as. Part IV: Generative Learning algorithms. So far, we’ve mainly been talking about learning algorithms that model $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$. matrix-vectorial notation. (Note the positive maximize $L(\theta)$. label. Gradient descent gives one way of minimizing $J$. So, this is an unsupervised learning problem. We say that a class of distributions is in the exponential family one training example $(x, y)$, and take derivatives to derive the stochastic, Above, we used the fact that $g'(z) = g(z)(1 - g(z))$. Newton’s method to minimize rather than maximize a function?)
We’d derived the LMS rule for when there was only a single training As we vary $\phi$, we obtain Bernoulli $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$, Note that this can be written more compactly as, Assuming that the $n$ training examples were generated independently, we gradient descent gets $\theta$ “close” to the minimum much faster than batch gra- to the fact that the amount of stuff we need to keep in order to represent the $y^{(i)}$’s given the $x^{(i)}$’s), this can also be written. the training set is large, stochastic gradient descent is often preferred over dient descent. that we saw earlier is known as a parametric learning algorithm, because label. stance, if we are encountering a training example on which our prediction and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the In this set of notes, we give an overview of neural networks, discuss vectorization and discuss training neural networks with backpropagation. There are two ways to modify this method for a training set of that measures, for each value of the $\theta$’s, how close the $h(x^{(i)})$’s are to the the same update rule for a rather different algorithm and learning problem. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet^2)    Price (1000$s)
2104                    400
1600                    330
2400                    369
1416                    232
3000                    540
...                     ...
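The LMS rule and its stochastic variant discussed above can be sketched as follows. For each training example in turn, the parameters move by an amount proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$, that is, $\theta_j := \theta_j + \alpha\,(y^{(i)} - h_\theta(x^{(i)}))\,x_j^{(i)}$. The function names, the learning rate, the number of passes, and the small noiseless data set are assumptions chosen so that the example converges quickly. The raw housing values would need feature scaling or a much smaller $\alpha$ before this loop behaves well.

```python
def h(theta, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta[0] + theta[1] * x


def lms_sgd(xs, ys, alpha=0.05, epochs=2000):
    """Stochastic gradient descent with the LMS (Widrow-Hoff) update rule.

    The parameters are updated after every single example, rather than
    after a scan of the whole training set as in batch gradient descent.
    """
    theta = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - h(theta, x)
            theta[0] += alpha * error        # the intercept feature is x_0 = 1
            theta[1] += alpha * error * x
    return theta


# Toy data generated from y = 1 + 2x (an illustrative assumption).
theta = lms_sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta)
```

Because this toy data is noiseless, the updates settle at the exact line. With noisy data and a fixed $\alpha$, the parameters would instead keep oscillating around the minimum of $J(\theta)$.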
We can plot this data: We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ distributions. of $x$ and $\theta$. Part V: Support Vector Machines. This set of notes presents the … continues to make progress with each example it looks at. ically choosing a good set of features.) Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing Locally weighted linear regression is the first example we’re seeing of a $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$. the positive class, and they are sometimes also denoted by the symbols “-” The generalization of Newton’s and is also known as the Widrow-Hoff learning rule. rather than minimizing, a function now.) If either the number of When Newton’s method is applied to maximize the logistic regres- One iteration of Newton’s can, however, be more expensive than time we encounter a training example, we update the parameters according lihood estimator under a set of assumptions, let’s endow our classification (GLMs). pretty much ignored in the fit. and the parameters $\theta$ will keep oscillating around the minimum of $J(\theta)$; but As discussed previously, and as shown in the example above, the choice of keep the training data around to make future predictions. ing there is sufficient training data, makes the choice of features less critical. calculus with matrices. the stochastic gradient ascent rule, If we compare this to the LMS update rule, we see that it looks identical; but $p(y \mid X; \theta)$. equation example. like this: [Diagram: x → h → predicted y (predicted price)] For historical reasons, this sort. repeatedly takes a step in the direction of steepest decrease of $J$.
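The Newton update $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$ can be illustrated on a tiny logistic regression problem. The sketch below assumes a model with an intercept and one feature, so $H$ is $2 \times 2$ and can be inverted by hand. The function names, the starting point $\theta = 0$, the iteration count, and the six labeled points are all assumptions for illustration. The labels are deliberately not linearly separable, so that the maximum of $\ell(\theta)$ is finite.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def gradient_and_hessian(theta, xs, ys):
    """Gradient and Hessian of the logistic log likelihood for rows [1, x]."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta[0] + theta[1] * x)
        g0 += y - p              # d l / d theta0
        g1 += (y - p) * x        # d l / d theta1
        w = p * (1.0 - p)        # uses g'(z) = g(z)(1 - g(z))
        h00 -= w
        h01 -= w * x
        h11 -= w * x * x
    return (g0, g1), (h00, h01, h11)


def newton_logistic(xs, ys, iterations=10):
    """Maximize l(theta) with theta := theta - H^{-1} grad l(theta)."""
    theta = [0.0, 0.0]
    for _ in range(iterations):
        (g0, g1), (h00, h01, h11) = gradient_and_hessian(theta, xs, ys)
        det = h00 * h11 - h01 * h01
        # Solve H d = g for the symmetric 2x2 Hessian.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        theta[0] -= d0
        theta[1] -= d1
    return theta


# Illustrative, non-separable toy labels (an assumption, not data from the notes).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
theta = newton_logistic(xs, ys)
print(theta)
```

Each iteration here is more expensive than a gradient step because it forms and inverts the Hessian, but on a problem this small only a handful of iterations are needed.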
tions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log Whether or not you have seen it previously, let’s keep The term “non-parametric” (roughly) refers In this section, we will give a set of probabilistic assumptions, under $y^{(i)}$). The first is replace it with the following algorithm: goal is, given a training set, to learn a function $h : \mathcal{X} \to \mathcal{Y}$ so that $h(x)$ is a For these reasons, particularly when Seen pictorially, the process is therefore we get $\theta_0 = 89.$ The (unweighted) linear regression algorithm at every example in the entire training set on every step, and is called batch Let’s now talk about the classification problem. (When we talk about model selection, we’ll also see algorithms for automat- This set of notes presents the Support Vector Machine (SVM) learning algorithm. To formalize this, we will define a function one more iteration, which the updates $\theta$ to about 1.8.