Midterm 02

Practice problems for topics on Midterm 02.

Problem #14

Tags: kernel ridge regression

Let \(\nvec{x}{1} = (1, 2, 0)^T\), \(\nvec{x}{2} = (-1, -1, -1)^T\), \(\nvec{x}{3} = (2, 2, 0)^T\), \(\nvec{x}{4} = (0, 2, 0)^T\).

Suppose a prediction function \(H(\vec x)\) is learned using kernel ridge regression on the above data set using the kernel \(\kappa(\vec x, \vec x') = (1 + \vec x \cdot\vec x')^2\) and regularization parameter \(\lambda = 3\). Suppose that \(\vec\alpha = (1, 0, -1, 2)^T\) is the solution of the dual problem.

Let \(\vec x = (0, 1, 0)^T\) be a new point. What is \(H(\vec x)\)?

Solution

18
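
One way to see this: by the dual form of the prediction function, \(H(\vec x) = \sum_{i=1}^4 \alpha_i \kappa(\nvec{x}{i}, \vec x)\). With \(\vec x = (0, 1, 0)^T\), the dot products \(\nvec{x}{i}\cdot\vec x\) are \(2, -1, 2, 2\), so

\[ H(\vec x) = 1\cdot(1+2)^2 + 0\cdot(1-1)^2 + (-1)\cdot(1+2)^2 + 2\cdot(1+2)^2 = 9 + 0 - 9 + 18 = 18. \]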

Problem #15

Tags: regularization

Let \(R(\vec w)\) be the unregularized empirical risk with respect to the square loss (that is, the mean squared error) on a data set.

The image below shows the contours of \(R(\vec w)\). The dashed lines show places where \(\|\vec w\|_2\) is 2, 4, 6, etc.

Part 1)

Assuming that one of the points below is the minimizer of the unregularized risk, \(R(\vec w)\), which could it possibly be?

Solution

A

Part 2)

Let the regularized risk \(\tilde R(\vec w) = R(\vec w) + \lambda\|\vec w \|_2^2\), where \(\lambda > 0\).

Assuming that one of the points below is the minimizer of the regularized risk, \(\tilde R(\vec w)\), which could it possibly be?

Solution

B

Problem #16

Tags: kernel ridge regression

Let \(\{\nvec{x}{i}, y_i\}\) be a data set of \(n\) points, with each \(\nvec{x}{i}\in\mathbb R^d\). Recall that the solution to the kernel ridge regression problem is \(\vec\alpha = (K + n \lambda I)^{-1}\vec y\), where \(K\) is the kernel matrix, \(I\) is the identity matrix, \(\lambda > 0\) is a regularization parameter, and \(\vec y = (y_1, \ldots, y_n)^T\).

Suppose kernel ridge regression is performed with a kernel \(\kappa\) that is a kernel for a feature map \(\vec\phi : \mathbb R^d \to\mathbb R^k\).

What is the size of the kernel matrix, \(K\)?
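
As an aside, here is a minimal Python sketch of how the dual solution is formed for a generic kernel; the function and variable names are only illustrative (they are not part of the problem), but the sketch makes the dimensions concrete.

```python
import numpy as np

def fit_kernel_ridge(X, y, kappa, lam):
    """Return the dual solution alpha = (K + n*lambda*I)^{-1} y."""
    n = len(X)
    # The kernel matrix has one row and one column per training point,
    # so it is n x n regardless of the dimension k of the feature map.
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + n * lam * np.eye(n), y)

# Example with the kernel used in these problems: kappa(x, x') = (1 + x . x')^2
kappa = lambda x, xp: (1 + np.dot(x, xp)) ** 2
```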

Problem #30

Tags: kernel ridge regression

Consider the data set: \(\nvec{x}{1} = (1, 0, 2)^T\), \(\nvec{x}{2} = (-1, 0, -1)^T\), \(\nvec{x}{3} = (1, 2, 1)^T\), \(\nvec{x}{4} = (1, 1, 0)^T\).

Suppose a prediction function \(H(\vec x)\) is learned using kernel ridge regression on the above data set using the kernel \(\kappa(\vec x, \vec x') = (1 + \vec x \cdot\vec x')^2\) and regularization parameter \(\lambda = 3\). Suppose that \(\vec\alpha = (-1, -2, 0, 2)^T\) is the solution of the dual problem.

Part 1)

What is the (2,3) entry of the kernel matrix?

\(K_{23} = \)

Part 2)

Let \(\vec x = (1, 1, 0)^T\) be a new point. What is \(H(\vec x)\)?
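
For checking your work, the computation is the same as in Problem #14: \(K_{23} = \kappa(\nvec{x}{2}, \nvec{x}{3}) = (1 + \nvec{x}{2}\cdot\nvec{x}{3})^2 = (1 - 2)^2 = 1\), and since the dot products \(\nvec{x}{i}\cdot\vec x\) are \(1, -1, 3, 2\),

\[ H(\vec x) = \sum_{i=1}^4 \alpha_i \kappa(\nvec{x}{i}, \vec x) = -1\cdot 4 - 2\cdot 0 + 0\cdot 16 + 2\cdot 9 = 14. \]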

Problem #31

Tags: regularization

Let \(R(\vec w)\) be an unregularized empirical risk function with respect to some data set.

The image below shows the contours of \(R(\vec w)\). The dashed lines show places where \(\|\vec w\|_2\) is 1, 2, 3, etc.

Part 1)

Assuming that one of the points below is the minimizer of the unregularized risk, \(R(\vec w)\), which could it possibly be?

Solution

A minimizer of the unregularized risk could be point A.

Part 2)

Let the regularized risk \(\tilde R(\vec w) = R(\vec w) + \lambda\|\vec w \|_2^2\), where \(\lambda > 0\).

Assuming that one of the points below is the minimizer of the regularized risk, \(\tilde R(\vec w)\), which could it possibly be?

Solution

A minimizer of the regularized risk could be point D.

Problem #32

Tags: bayes error, bayes classifier

Shown below are two conditional densities, \(p_1(x \given Y = 1)\) and \(p_0(x \given Y = 0)\), describing the distribution of a continuous random variable \(X\) for two classes: \(Y = 0\) (the solid line) and \(Y = 1\) (the dashed line). You may assume that both densities are piecewise constant.
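
Recall that with class priors and class-conditional densities as above, the Bayes classifier predicts whichever class \(k\) has the larger value of \(\pr(Y = k)\, p_k(x \given Y = k)\), and the Bayes error is

\[ \int\min\left\{\pr(Y = 1)\, p_1(x \given Y = 1),\ \pr(Y = 0)\, p_0(x \given Y = 0)\right\}\, dx. \]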

Part 1)

Suppose \(\pr(Y = 1) = 0.5\) and \(\pr(Y = 0) = 0.5\). What is the prediction of the Bayes classifier at \(x = 1.5\)?

Solution

Class 0

Part 2)

Suppose \(\pr(Y = 1) = 0.5\) and \(\pr(Y = 0) = 0.5\). What is the Bayes error with respect to this distribution?

Part 3)

Now suppose \(\pr(Y = 1) = 0.7\) and \(\pr(Y = 0) = 0.3\). What is the prediction of the Bayes classifier at \(x = 1.5\)?

Solution

Class 1

Part 4)

Now suppose \(\pr(Y = 1) = 0.7\) and \(\pr(Y = 0) = 0.3\). What is the Bayes error with respect to this distribution?

Problem #33

Tags: bayes classifier

Suppose the Bayes classifier achieves an error rate of 15\% on a particular data distribution. True or False: It is impossible for any classifier trained on data drawn from this distribution to achieve better than 85\% accuracy on a finite test set that is drawn from this distribution.

True False
Solution

False. The Bayes error is an expectation over the whole data distribution; on a particular finite test set, a classifier can achieve better than 85\% accuracy simply by chance.

Problem #34

Tags: histogram estimators

Consider the data set of ten points shown below:

Suppose this data is used to build a histogram density estimator, \(f\), with bins: \([0,2), [2, 6), [6, 10)\). Note that the bins are not evenly sized.
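
Recall that a histogram estimator is constant on each bin: if \(x\) falls in a bin \(B\), then

\[ f(x) = \frac{\#\{i : x_i \in B\}}{n \cdot\operatorname{width}(B)}, \]

where \(n\) is the total number of points, so the unequal bin widths affect the estimate.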

Part 1)

What is \(f(1.5)\)?

Part 2)

What is \(f(7)\)?

Problem #35

Tags: histogram estimators

Consider this data set of points \(x\) from two classes \(Y = 1\) and \(Y = 0\).

Suppose a histogram estimator with bins \([0,1)\), \([1, 2)\), \([2, 3)\) is used to estimate the densities \(p_1(x \given Y = 1)\) and \(p_0(x \given Y = 0)\), and these estimates are used in the Bayes classifier to make a prediction.

What will be the predicted class of a new point, \(x = 2.2\)?

Solution

Class 0.

Problem #36

Tags: density estimation, histogram estimators

Suppose a density estimate \(f : \mathbb R^3 \to\mathbb R^1\) is made using histogram estimators with bins having a length of 2 units, a width of 3 units, and a height of 1 unit.

What is the largest value that \(f(\vec x)\) can possibly have?
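
One way to reason about this (a sketch, not an official solution): each bin has volume \(2\times 3\times 1 = 6\), and the estimated density inside a bin is the fraction of the \(n\) points falling in that bin divided by the bin's volume, so

\[ f(\vec x) = \frac{\#\{i : \nvec{x}{i}\in\text{bin containing }\vec x\}}{6n}\leq\frac{n}{6n} = \frac{1}{6}. \]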

Problem #37

Tags: maximum likelihood

Suppose a discrete random variable \(X\) takes on values of either 0 or 1 and has the distribution:

\[\pr(X = x) = \theta^x (1 - \theta)^{1 - x}\]

where \(\theta\in[0, 1]\) is a parameter.

Given a data set \(x_1, \ldots, x_n\), what is the maximum likelihood estimate for the parameter \(\theta\)? Show your work.
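
One standard way to proceed (a sketch for checking your own work): the log-likelihood is

\[\log L(\theta) = \sum_{i=1}^n \left[ x_i \log\theta + (1 - x_i)\log(1 - \theta)\right], \]

and setting its derivative to zero gives

\[\frac{d}{d\theta}\log L(\theta) = \frac{\sum_{i=1}^n x_i}{\theta} - \frac{n - \sum_{i=1}^n x_i}{1 - \theta} = 0 \implies\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i. \]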

Problem #38

Tags: covariance

Consider a data set of \(n\) points in \(\mathbb R^d\), \(\nvec{x}{1}, \ldots, \nvec{x}{n}\). Suppose the data are standardized, creating a set of new points \(\nvec{z}{1}, \ldots, \nvec{z}{n}\). That is, if the new points are stacked into an \(n \times d\) matrix, \(Z\), the mean and variance of each column of \(Z\) would be zero and one, respectively.

True or False: the covariance matrix of the standardized data must be the \(d\times d\) identity matrix; that is, the \(d \times d\) matrix with ones along the diagonal and zeros off the diagonal.

True False
Solution

False. Standardizing makes each diagonal entry of the covariance matrix (each feature's variance) equal to one, but the off-diagonal entries are the covariances between different features, and these need not be zero.

Problem #39

Tags: density estimation, maximum likelihood

Suppose data points \(\nvec{x}{1}, \ldots, \nvec{x}{n}\) are drawn from an arbitrary, unknown distribution with density \(f\).

True or False: it is guaranteed that, given enough data (that is, \(n\) large enough), a Gaussian fit to the data using the method of maximum likelihood will approximate the true underlying density \(f\) arbitrarily closely.

True False
Solution

False. The true density \(f\) need not be Gaussian (for example, it could be bimodal), and a single Gaussian cannot approximate such a density arbitrarily closely, no matter how much data is available.

Problem #40

Tags: Gaussians, maximum likelihood

Suppose a Gaussian with a diagonal covariance matrix is fit to 200 points in \(\mathbb R^4\) using the maximum likelihood estimators. How many parameters are estimated? Count each entry of \(\vec\mu\) and the covariance matrix that must be estimated as its own parameter.

Problem #41

Tags: covariance

Suppose a data set consists of the following three measurements for each Saturday last year:

\(X_1\): The day's high temperature

\(X_2\): The number of people at Pacific Beach on that day

\(X_3\): The number of people wearing coats on that day

Suppose the covariance between these features is calculated and placed into a \(3 \times 3\) sample covariance matrix, \(C\). Which of the below options most likely shows the sign of each entry of the sample covariance matrix?

Solution

The second option.

Problem #42

Tags: covariance

Suppose we have two data sets, \(\mathcal{D}_1\) and \(\mathcal{D}_2\), each containing \(n/2\) points in \(\mathbb R^d\). Let \(\nvec{\mu}{1}\) and \(C^{(1)}\) be the mean and sample covariance matrix of \(\mathcal{D}_1\), and let \(\nvec{\mu}{2}\) and \(C^{(2)}\) be the mean and sample covariance matrix of \(\mathcal{D}_2\).

Suppose the two data sets are combined into a single data set \(\mathcal D\) containing \(n\) points.

Part 1)

True or False: the mean of the combined data \(\mathcal{D}\) is equal to \(\displaystyle\frac{\nvec{\mu}{1} + \nvec{\mu}{2}}{2}\).

True False
Solution

True.

Part 2)

True or False: the sample covariance matrix of the combined data \(\mathcal{D}\) is equal to \(\displaystyle\frac{C^{(1)} + C^{(2)}}{2}\).

True False
Solution

False. If the two means \(\nvec{\mu}{1}\) and \(\nvec{\mu}{2}\) differ, the combined data has additional spread coming from the separation between the two groups, so its sample covariance matrix is generally not the average of \(C^{(1)}\) and \(C^{(2)}\).

Problem #43

Tags: covariance

Suppose a random vector \(\vec X = (X_1, X_2)\) has a multivariate Gaussian distribution. Suppose it is known that \(X_1\) and \(X_2\) are independent.

Let \(C\) be the Gaussian distribution's covariance matrix.

Part 1)

True or False: \(C\) must be diagonal.

True False
Solution

True.

Part 2)

True or False: each entry of \(C\) must be the same.

True False
Solution

False.

Problem #44

Tags: naive bayes

Consider the below data set which collects information on 10 pets:

Suppose a new pet is friendly and sheds fur. What does a Na\"ive Bayes classifier predict for its species?

Solution

Cat.

Problem #45

Tags: naive bayes

Suppose a data set of 100 points in 10 dimensions is used in a binary classification task; that is, each point is labeled as either a 1 or a 0.

If Gaussian Naive Bayes is trained on this data, how many univariate Gaussians will be fit?

Problem #46

Tags: conditional independence

Recall that a deck of 52 cards contains:

Hearts: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Diamonds: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Clubs: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Spades: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Also recall that Hearts and Diamonds are red, while Clubs and Spades are black.

Part 1)

Suppose a single card is drawn at random.

Let \(A\) be the event that the card is a heart. Let \(B\) be the event that the card is a 5.

Are \(A\) and \(B\) independent events?

Solution

Yes, they are independent.
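
Indeed, \(\pr(A) = 13/52 = 1/4\), \(\pr(B) = 4/52 = 1/13\), and \(\pr(A \cap B) = 1/52\), so

\[\pr(A \cap B) = \frac{1}{52} = \frac{1}{4}\cdot\frac{1}{13} = \pr(A)\,\pr(B). \]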

Part 2)

Suppose two cards are drawn at random (without replacing them into the deck).

Let \(A\) be the event that the second card is a heart. Let \(B\) be the event that the first card is red.

Are \(A\) and \(B\) independent events?

Solution

No, they are not.

Part 3)

Suppose two cards are drawn at random (without replacing them into the deck).

Let \(A\) be the event that the second card is a heart. Let \(B\) be the event that the second card is a diamond. Let \(C\) be the event that the first card is a face card.

Are \(A\) and \(B\) conditionally independent given \(C\)?

Solution

No, they are not.

Problem #47

Tags: bayes error, bayes classifier

Part 1)

Suppose a particular probability distribution has the property that, whenever data are sampled from the distribution, the sampled data are guaranteed to be linearly separable. True or False: the Bayes error with respect to this distribution is 0\%.

True False
Solution

True.

Part 2)

Now consider a different probability distribution. Suppose the Bayes classifier achieves an error rate of 0\% on this distribution. True or False: given a finite data set sampled from this distribution, the data must be linearly separable.

True False
Solution

False. A Bayes error of 0\% means the two classes do not overlap, but non-overlapping classes need not be linearly separable (for instance, one class could surround the other).

Problem #48

Tags: histogram estimators

Consider this data set of points \(x\) from two classes \(Y = 1\) and \(Y = 0\).

Suppose a histogram estimator with bins \([0,2)\), \([2, 4)\), \([4, 6)\) is used to estimate the densities \(p_1(x \given Y = 1)\) and \(p_0(x \given Y = 0)\).

What will be the predicted class-conditional density for class 0 at a new point, \(x = 2.2\)? That is, what is the estimated \(p_0(2.2 \given Y = 0)\)?

Solution

1/6.

When estimating the conditional density, we look only at the six points in class zero. Two of these fall into the bin, and the bin width is 2, so the estimated density is:

\[\frac{2}{6 \times 2} = \frac{1}{6}. \]

Problem #49

Tags: histogram estimators

Suppose \(\mathcal D\) is a data set of 100 points. Suppose a density estimate \(f : \mathbb R^3 \to\mathbb R^1\) is constructed from \(\mathcal D\) using histogram estimators with bins having a length of 2 units, a width of 2 units, and a height of 2 units.

The density estimate within a particular bin of the histogram is 0.2. How many data points from \(\mathcal D\) fall within that histogram bin?

Problem #50

Tags: Gaussians

Suppose data points \(x_1, \ldots, x_n\) are independently drawn from a univariate Gaussian distribution with unknown parameters \(\mu\) and \(\sigma\).

True or False: it is guaranteed that, given enough data (that is, \(n\) large enough), a univariate Gaussian fit to the data using the method of maximum likelihood will approximate the true underlying density arbitrarily closely.

True False
Solution

True. As \(n\) grows, the maximum likelihood estimates of \(\mu\) and \(\sigma\) converge to the true parameters, so the fitted Gaussian gets arbitrarily close to the true underlying density.

Problem #51

Tags: maximum likelihood

Suppose a continuous random variable \(X\) has the density:

\[ p(x) = \theta e^{-\theta x}\]

where \(\theta\in(0, \infty)\) is a parameter, and where \(x > 0\).

Given a data set \(x_1, \ldots, x_n\), what is the maximum likelihood estimate for the parameter \(\theta\)? Show your work.
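
As in Problem #37, a sketch for checking your own work: the log-likelihood is

\[\log L(\theta) = \sum_{i=1}^n \left(\log\theta - \theta x_i\right) = n\log\theta - \theta\sum_{i=1}^n x_i, \]

and setting its derivative to zero gives

\[\frac{d}{d\theta}\log L(\theta) = \frac{n}{\theta} - \sum_{i=1}^n x_i = 0 \implies\hat\theta = \frac{n}{\sum_{i=1}^n x_i}. \]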

Problem #52

Tags: covariance

Let \(\mathcal D\) be a set of data points in \(\mathbb R^d\), and let \(C\) be the sample covariance matrix of \(\mathcal D\). Suppose each point in the data set is shifted in the same direction and by the same amount. That is, suppose there is a vector \(\vec\delta\) such that if \(\nvec{x}{i}\in\mathcal D\), then \(\nvec{x}{i} + \vec\delta\) is in the new data set.

True or False: the sample covariance matrix of the new data set is equal to \(C\)(the sample covariance matrix of the original data set).

True False
Solution

True. Shifting every point by \(\vec\delta\) shifts the sample mean by \(\vec\delta\) as well, so the deviations \(\nvec{x}{i} - \vec\mu\) that enter the covariance computation are unchanged.

Problem #53

Tags: Gaussians, maximum likelihood

Suppose a Gaussian with a diagonal covariance matrix is fit to 200 points in \(\mathbb R^4\) using the maximum likelihood estimators. How many parameters are estimated? Count each entry of \(\vec\mu\) and the covariance matrix that must be estimated as its own parameter (the off-diagonal elements of the covariance are zero, and shouldn't be included in your count).

Problem #54

Tags: Gaussians

Let \(f_1\) be a univariate Gaussian density with parameters \(\mu\) and \(\sigma_1\). And let \(f_2\) be a univariate Gaussian density with parameters \(\mu\) and \(\sigma_2 \neq\sigma_1\). That is, \(f_2\) is centered at the same place as \(f_1\), but with a different variance.

Consider the density \(f(x) = \frac{1}{2}(f_1(x) + f_2(x))\); the factor of \(1/2\) is a normalization factor which ensures that \(f\) integrates to one.

True or False: \(f\) must also be a Gaussian density.

True False
Solution

False. The sum of two Gaussian densities is not necessarily a Gaussian density, even if the two Gaussians have the same mean.

If you try adding two Gaussian densities with different variances, you will get:

\[ f(x) = \frac{1}{2}\left(\frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu)^2}{2\sigma_1^2}} + \frac{1}{\sqrt{2\pi}\sigma_2} e^{-\frac{(x-\mu)^2}{2\sigma_2^2}}\right)\]

For this to be a Gaussian, we'd need to be able to write it in the form:

\[ f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \]

but this is not possible when \(\sigma_1 \neq\sigma_2\).

Problem #55

Tags: covariance

Consider the data set \(\mathcal D\) shown below.

What will be the sign of the \((1,2)\) entry of the data's sample covariance matrix?

Solution

The sign will be negative.

Problem #56

Tags: linear and quadratic discriminant analysis

Suppose a data set of points in \(\mathbb R^2\) consists of points from two classes: Class 1 and Class 0. The mean of the points in Class 1 is \((3,0)^T\), and the mean of points in Class 0 is \((7,0)^T\). Suppose Linear Discriminant Analysis is performed using the same covariance matrix \(C = \sigma^2 I\) for both classes, where \(\sigma\) is some constant.

Suppose there were 50 points in Class 1 and 100 points in Class 0.

Consider a new point, \((5, 0)^T\), exactly halfway between the class means. What will LDA predict its label to be?

Solution

Class 0.
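
With the shared covariance \(\sigma^2 I\), LDA compares \(\pr(Y = k)\exp\left(-\|\vec x - \vec\mu_k\|^2 / (2\sigma^2)\right)\) across the two classes, where \(\vec\mu_k\) denotes the class mean. At the halfway point \((5, 0)^T\) the two exponential factors are equal, so the comparison reduces to the class priors:

\[\pr(Y = 0) = \frac{100}{150} > \frac{50}{150} = \pr(Y = 1), \]

so the prediction is Class 0.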

Problem #57

Tags: naive bayes

Consider the below data set which collects information on the weather on 10 days:

Suppose a new day is not sunny but is warm. What does a Na\"ive Bayes classifier predict for whether it rained?

Solution

Yes, it rained.

Problem #58

Tags: conditional independence

Suppose that a deck of cards has some cards missing. Namely, both the Ace of Spades and Ace of Clubs are missing, leaving 50 cards remaining.

Hearts: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Diamonds: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A

Clubs: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K

Spades: 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K

Also recall that Hearts and Diamonds are red, while Clubs and Spades are black.

Part 1)

Suppose a single card is drawn at random.

Let \(A\) be the event that the card is a heart. Let \(B\) be the event that the card is an Ace.

Are \(A\) and \(B\) independent events?

Solution

No, they are not.

Part 2)

Suppose a single card is drawn at random.

Let \(A\) be the event that the card is red. Let \(B\) be the event that the card is a heart. Let \(C\) be the event that the card is an ace.

Are \(A\) and \(B\) conditionally independent given \(C\)?

Solution

Yes, they are.

Part 3)

Suppose a single card is drawn at random.

Let \(A\) be the event that the card is a King. Let \(B\) be the event that the card is red. Let \(C\) be the event that the card is not a numbered card (that is, it is a J, Q, K, or A).

Are \(A\) and \(B\) conditionally independent given \(C\)?

Solution

No, they are not conditionally independent.

Problem #70

Tags: regularization

Recall that in ridge regression, we solve the following optimization problem:

\[\operatorname{arg\,min}_{\vec w}\sum_{i=1}^n (y_i - \vec w \cdot\operatorname{Aug}(\nvec{x}{i}))^2 + \lambda\|\vec w\|^2. \]

where \(\lambda > 0\) is a hyperparameter controlling the strength of regularization.

Suppose you solve the ridge regression problem with \(\lambda = 2\), and the resulting solution is the weight vector \(\vec w_\text{old}\). You then solve the ridge regression problem with \(\lambda = 4\) and find a weight vector \(\vec w_\text{new}\).

True or False: each component of the new solution, \(\vec w_\text{new}\), must be less than or equal to the corresponding component of the old solution, \(\vec w_\text{old}\).

True False
Solution

False.

While it is true that \(\|\vec w_\text{new}\|\leq\|\vec w_\text{old}\|\), this does not imply that each component of \(\vec w_\text{new}\) is less than or equal to the corresponding component of \(\vec w_\text{old}\).

The picture to have in mind is that of the contour lines of the mean squared error (which are ovals), along with the circles representing where \(\|\vec w\| = c\) for some constant \(c\). The question asked us to consider going from \(\lambda = 2\) to \(\lambda = 4\), but to gain an intuition we can think of going from no regularization (\(\lambda = 0\)) to some regularization (\(\lambda > 0\)); this won't affect the outcome, but will make the story easier to tell.

Consider the situation shown below:

When we had no regularization, the solution was \(\vec w_\text{old}\), as marked. Suppose we add regularization, and we're told that the regularization is such that when we solve the ridge regression problem, the norm of \(\vec w_\text{new}\) will be equal to \(c\), and that the radius of the circle we've drawn is \(c\). Then the solution \(\vec w_\text{new}\) will be the point marked, since that is the point on the circle that is on the lowest contour.

Notice that the point \(\vec w_\text{new}\) is closer to the origin, and its first component is much smaller than the first component of \(\vec w_\text{old}\). However, the second component of \(\vec w_\text{new}\) is actually larger than the second component of \(\vec w_\text{old}\).

Problem #75

Tags: regularization

The ``infinity norm'' of a vector \(\vec w \in\mathbb R^d\), written \(\|\vec w\|_\infty\), is defined as:

\[\|\vec w\|_\infty = \max_{i = 1, \ldots, d} |w_i|. \]

That is, it is the maximum absolute value of any entry of \(\vec w\).

Let \(R(\vec w)\) be an unregularized risk function on a data set. The solid curves in the plot below are the contours of \(R(\vec w)\). The dashed lines show where \(\|\vec w\|_\infty\) is equal to 1, 2, 3, and so on.

Let \(\tilde R(\vec w) = R(\vec w) + \lambda\|\vec w\|_\infty^2\), with \(\lambda > 0\). The point marked \(A\) is the minimizer of the unregularized risk. Suppose that it is known that one of the other points is the minimizer of the regularized risk, \(\tilde R(\vec w)\), for some unknown \(\lambda > 0\). Which point is it?

Solution

D.

Problem #86

Tags: regularization

Recall that in ridge regression, we solve the following optimization problem:

\[\operatorname{arg\,min}_{\vec w}\sum_{i=1}^n (y_i - \vec w \cdot\operatorname{Aug}(\nvec{x}{i}))^2 + \lambda\|\vec w\|^2. \]

where \(\lambda > 0\) is a hyperparameter controlling the strength of regularization.

Suppose you solve the ridge regression problem with \(\lambda = 2\), and the resulting solution has a mean squared error of \(10\).

Now suppose you increase the regularization strength to \(\lambda = 4\) and solve the ridge regression problem again. True or False: it is possible that the mean squared error of the new solution is less than \(10\).

By ``mean squared error,'' we mean \(\frac1n \sum_{i=1}^n (y_i - \vec w \cdot\operatorname{Aug}(\nvec{x}{i}))^2\)

True False
Solution

False. Increasing \(\lambda\) places more weight on the norm penalty, and the training mean squared error of the ridge solution never decreases as \(\lambda\) increases; the new solution's mean squared error must be at least \(10\).

Problem #91

Tags: regularization

The ``\(p = \frac12\)'' norm of a vector \(\vec w \in\mathbb R^d\), written \(\|\vec w\|_{\frac12}\), is defined as:

\[\|\vec w\|_{\frac12} = \left(\sum_{i = 1}^d \sqrt{|w_i|}\right)^2 \]

Let \(R(\vec w)\) be an unregularized risk function on a data set. The solid curves in the plot below are the contours of \(R(\vec w)\). The dashed lines show where \(\|\vec w\|_{\frac12}\) is equal to 1, 2, 3, and so on.

Let \(\tilde R(\vec w) = R(\vec w) + \lambda\|\vec w\|_{\frac12}\), with \(\lambda > 0\). The point marked \(A\) is the minimizer of the unregularized risk. Suppose it is known that one of the other points is the minimizer of the regularized risk, \(\tilde R(\vec w)\), for some unknown \(\lambda > 0\). Which point is it?

Solution

D.