hypothesis 假设
terminology 术语

## 2.2 Cost function

### 2.2.1 Definition of the cost function

$\theta_0$ and $\theta_1$ are the parameters of this model, i.e., the quantities that need to be learned.
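For reference, the hypothesis and the squared-error cost function used throughout this section are:

$$h_\theta(x) = \theta_0 + \theta_1 x, \qquad J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where $m$ is the number of training examples and $(x^{(i)}, y^{(i)})$ is the $i$-th example.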

### 2.2.2 Why we use the cost function

For a fixed $\theta$, $h_\theta(x)$ is a function of $x$, while $J(\theta_1)$ is a function of $\theta_1$.
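A minimal sketch of this distinction, using the simplified hypothesis $h_\theta(x) = \theta_1 x$ on a small made-up dataset (the data values here are illustrative, not from the course):

```python
import numpy as np

# Toy dataset (illustrative values, not from the course).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def h(theta1, x):
    """Hypothesis with theta_0 fixed at 0: a function of x once theta1 is fixed."""
    return theta1 * x

def J(theta1):
    """Squared-error cost: a function of theta1, with the data held fixed."""
    return np.sum((h(theta1, x) - y) ** 2) / (2 * m)

# J is minimized at theta1 = 1 for this data, since y = 1 * x exactly.
for t in [0.0, 0.5, 1.0, 1.5]:
    print(f"J({t}) = {J(t):.3f}")
```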
quiz:

intuition 直觉
contour plot 等高线

## 2.3 Parameter Learning

optimum 最佳
partial derivative 偏导

Outline:

• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum

repeat until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)$ (for $j=0$ and $j=1$)

Here $\alpha$ is the learning rate, which will be discussed later.

`:=` denotes assignment, as it does in the Pascal language.
simultaneous 同时的，并发的
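The point of "simultaneous" here: both parameters should be updated from the same old values, not one after the other. A minimal sketch of one correct update step (the names are my own, not from the course):

```python
def gradient_descent_step(theta0, theta1, d_theta0, d_theta1, alpha):
    """One simultaneous gradient descent update.

    d_theta0 and d_theta1 are the partial derivatives of J(theta0, theta1)
    evaluated at the *current* parameter values.
    """
    # Compute both new values from the old values first...
    temp0 = theta0 - alpha * d_theta0
    temp1 = theta1 - alpha * d_theta1
    # ...then assign, so updating theta1 never sees an already-updated theta0.
    return temp0, temp1
```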

quiz

Intuition 直觉

quiz

• Leave $\theta_1$ unchanged
• Change $\theta_1$ in a random direction
• Move $\theta_1$ in the direction of the global minimum of $J(\theta_1)$
• Decrease $\theta_1$
According to the formula, the partial derivative of $J$ with respect to $\theta_1$ is 0 at a local optimum, so $\theta_1$ does not change. This also shows that gradient descent can get stuck in a local optimum.
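Concretely, at a local optimum the update step reduces to:

$$\theta_1 := \theta_1 - \alpha \cdot \frac{d}{d\theta_1} J(\theta_1) = \theta_1 - \alpha \cdot 0 = \theta_1$$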

### 2.3.2 Gradient Descent for Linear Regression

convex function 凸函数: the definitions of convexity given in high-school math are quite muddled, with many different versions. Here we use the one consistent with the course, which is also Wikipedia's: a function whose second derivative is non-negative everywhere, i.e., a bowl-shaped function (bowl shape function 碗型函数).
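For the squared-error cost defined above, the partial derivatives work out to the standard update rules from the course:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right), \qquad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$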

‘Batch’: every iteration of gradient descent uses all of the training data, because computing the cost function requires plugging in every $x$, $y$ value. In this linear regression problem the batch is the entire dataset; in some other machine learning problems, a batch is a subset of the training set.
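A minimal sketch of batch gradient descent for univariate linear regression, putting the update rules above into code (the learning rate and iteration count are illustrative choices, not from the course):

```python
import numpy as np

# Training data taken from quiz question 2 in section 2.4 below;
# it lies exactly on the line y = 0.5 * x.
x = np.array([1.0, 2.0, 4.0, 0.0])
y = np.array([0.5, 1.0, 2.0, 0.0])
m = len(x)

alpha = 0.1                 # learning rate (illustrative choice)
theta0, theta1 = 0.0, 0.0   # initial parameters

for _ in range(1000):
    # Predictions with the current parameters.
    h = theta0 + theta1 * x
    # Batch gradients: every training example contributes to each sum.
    grad0 = np.sum(h - y) / m
    grad1 = np.sum((h - y) * x) / m
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)       # converges toward theta0 = 0, theta1 = 0.5
```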

quiz:

Which of the following are true statements? Select all that apply.

• To make gradient descent converge, we must slowly decrease $\alpha$ over time.
• Gradient descent is guaranteed to find the global minimum for any function $J(\theta_0,\theta_1)$
• Gradient descent can converge even if $\alpha$ is kept fixed. (But $\alpha$ cannot be too large, or else it may fail to converge.)
• For the specific choice of cost function $J(\theta_0,\theta_1)$ used in linear regression, there are no local optima (other than the global optimum).

## 2.4 Test

1. Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year.
Specifically, let $x$ be equal to the number of “A” grades (including A-, A, and A+ grades) that a student receives in their first year of college (freshman year). We would like to predict the value of $y$, which we define as the number of “A” grades they get in their second year (sophomore year).
Refer to the following training set of a small sample of different students’ performances (note that this training set may also be referenced in other questions in this quiz). Here each row is one training example. Recall that in linear regression, our hypothesis is $h_\theta(x) = \theta_0 + \theta_1x$, and we use $m$ to denote the number of training examples.

| x | y |
| --- | --- |
| 3 | 4 |
| 2 | 1 |
| 4 | 3 |
| 0 | 1 |

For the training set given above, what is the value of $m$? In the box below, please enter your answer (which should be a number between 0 and 10).

2. Consider the following training set of $m = 4$ training examples:

| x | y |
| --- | --- |
| 1 | 0.5 |
| 2 | 1 |
| 4 | 2 |
| 0 | 0 |

Consider the linear regression model $h_\theta(x) = \theta_0 + \theta_1x$. What are the values of $\theta_0$ and $\theta_1$ that you would expect to obtain upon running gradient descent on this model? (Linear regression will be able to fit this data perfectly; see the worked check after the options.)

• $\theta_0 = 0, \theta_1 = 0.5$
• $\theta_0 = 1, \theta_1 = 0.5$
• $\theta_0 = 0.5, \theta_1 = 0.5$
• $\theta_0 = 1, \theta_1 = 1$
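A quick check that $\theta_0 = 0, \theta_1 = 0.5$ fits the data perfectly:

$$0.5 \cdot 1 = 0.5,\quad 0.5 \cdot 2 = 1,\quad 0.5 \cdot 4 = 2,\quad 0.5 \cdot 0 = 0$$

so every training example lies exactly on the line $y = 0.5x$.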

3. Suppose we set $\theta_0 = -1, \theta_1 = 2$ in the linear regression hypothesis from Q1. What is $h_{\theta}(6)$?
$h_\theta(6) = -1 + 2 \times 6 = 11$

4. In the given figure, the cost function $J(\theta_0,\theta_1)$ has been plotted against $\theta_0$ and $\theta_1$, as shown in ‘Plot 2’. The contour plot for the same cost function is given in ‘Plot 1’. Based on the figure, choose the correct options (check all that apply).

• Point P (The global minimum of plot 2) corresponds to point C of Plot 1.
• If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of cost function $J(\theta_0,\theta_1)$ is minimum at A.
• If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of cost function $J(\theta_0,\theta_1)$ is maximum at point A.
• Point P (the global minimum of plot 2) corresponds to point A of Plot 1.
• If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point C, as the value of cost function $J(\theta_0,\theta_1)$ is minimum at point C.

5. Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some $\theta_0$, $\theta_1$ such that $J(\theta_0, \theta_1)=0$.
Which of the statements below must then be true? (Check all that apply.)

• For this to be true, we must have $\theta_0 = 0$ and $\theta_1 = 0$ so that $h_\theta(x) = 0$
• This is not possible: By the definition of $J(\theta_0, \theta_1)$, it is not possible for there to exist $\theta_0$ and $\theta_1$ so that $J(\theta_0, \theta_1) = 0$
• For these values of $\theta_0$ and $\theta_1$ that satisfy $J(\theta_0, \theta_1) = 0$, we have that $h_\theta(x^{(i)}) = y^{(i)}$ for every training example $(x^{(i)}, y^{(i)})$
• We can perfectly predict the value of $y$ even for new examples that we have not yet seen. (E.g., we can perfectly predict prices of even new houses that we have not yet seen.)
• Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum.
• For this to be true, we must have $y^{(i)} = 0$ for every value of $i = 1, 2, \ldots, m$.
• Our training set can be fit perfectly by a straight line, i.e., all of our training examples lie perfectly on some straight line.
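A short note on why $J = 0$ forces a perfect fit: the cost is a sum of squared terms, so it can only be zero if every term is zero:

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = 0 \;\Longrightarrow\; h_\theta(x^{(i)}) = y^{(i)} \text{ for every } i,$$

i.e., all training examples lie exactly on the line $h_\theta(x) = \theta_0 + \theta_1 x$.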