Machine Learning, Week 6

      • Sometimes getting more training data simply does not help; you want to avoid wasting a lot of time collecting more training data when it will be of no use. The usual options to consider are:
      • Getting more training examples
      • Trying smaller sets of features
      • Trying additional features
      • Trying polynomial features
      • Increasing or decreasing λ

      • Sometimes you only discover after trying one of these approaches that it was a dead end.
      • Fortunately, there is a set of simple techniques that can save you a great deal of effort: they let you rule out at least half of the options on this list and keep only the ones that are actually promising. There is also a simple method which, as long as you use it, makes it easy to eliminate many of these choices and spares you a lot of wasted time. In the next two videos I will first explain how to evaluate the performance of a machine learning algorithm.

How to evaluate the performance of a machine learning algorithm:

These techniques are also known as "machine learning diagnostics".

  1. A "diagnostic" is a test you can run to gain insight into whether or not an algorithm is actually working, and it usually also tells you what kinds of changes are worth trying in order to improve its performance.

A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a training set and a test set. Typically, the training set consists of 70% of your data and the test set is the remaining 30%.

  1. The new procedure using these two sets is then:

  • Learn Θ by minimizing Jtrain(Θ) using the training set
  • Compute the test set error Jtest(Θ)
  1. The test set error
     1. For linear regression:
        $$J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test} \right)^2$$
     2. For classification - misclassification error (aka 0/1 misclassification error):
        $$err(h_\Theta(x), y) = \begin{cases} 1 & \text{if } h_\Theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:

$$\text{Test Error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err\left( h_\Theta(x^{(i)}_{test}), y^{(i)}_{test} \right)$$

This gives us the proportion of the test data that was misclassified.
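Below is a minimal sketch (my own illustration, not course code) of the two test-set errors defined above; it assumes X_test already contains a bias column and that theta was learned on the training set.

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """Squared-error test cost: (1 / (2*m_test)) * sum((h(x) - y)^2)."""
    m_test = len(y_test)
    predictions = X_test @ theta            # h_theta(x) for each test example
    return np.sum((predictions - y_test) ** 2) / (2 * m_test)

def misclassification_error(h_probs, y_test):
    """Average 0/1 error: fraction of test examples misclassified when
    h_theta(x) is thresholded at 0.5."""
    predicted = (h_probs >= 0.5).astype(int)
    return np.mean(predicted != y_test)

# tiny demo with made-up numbers: one of three examples is misclassified
print(misclassification_error(np.array([0.9, 0.2, 0.6]), np.array([1, 0, 0])))  # -> 0.333...
```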

Given many models with different polynomial degrees, we can use a systematic approach to identify the ‘best’ function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

One way to break down our dataset into the three sets is:

      • Training set: 60%
      • Cross validation set: 20%
      • Test set: 20%

We can now calculate three separate error values for the three different sets using the following method:

      1. Optimize the parameters in Θ using the training set for each polynomial degree.
      2. Find the polynomial degree d with the least error using the cross validation set.
      3. Estimate the generalization error using the test set with Jtest(Θ(d)), where d is the polynomial degree with the lowest cross validation error.

This way, the degree of the polynomial d has not been trained using the test set.
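The sketch below walks through steps 1-3 above on toy one-dimensional data, using NumPy's polyfit/polyval as the hypothesis; the data, split sizes, and degree range are all illustrative, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = 1.5 * x ** 2 - x + rng.normal(scale=2.0, size=x.size)   # toy data

# 60% / 20% / 20% split into training, cross validation, and test sets
idx = rng.permutation(x.size)
tr, cv, te = idx[:180], idx[180:240], idx[240:]

def j(coeffs, xs, ys):
    """Squared-error cost (1 / 2m) * sum((h(x) - y)^2)."""
    return np.sum((np.polyval(coeffs, xs) - ys) ** 2) / (2 * xs.size)

# 1. optimize the parameters on the training set for each polynomial degree
fits = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 9)}
# 2. pick the degree d with the lowest cross validation error
best_d = min(fits, key=lambda d: j(fits[d], x[cv], y[cv]))
# 3. estimate the generalization error on the test set only for that degree
print(best_d, j(fits[best_d], x[te], y[te]))
```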

From <https://www.coursera.org/learn/machine-learning/supplement/XHQqO/model-selection-and-train-validation-test-sets>

Diagnosing Bias vs. Variance

From <https://www.coursera.org/learn/machine-learning/supplement/81vp0/diagnosing-bias-vs-variance>

In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

      • We need to distinguish whether bias or variance is the problem contributing to bad predictions.
      • High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ) ≈ Jtrain(Θ).
High variance (overfitting): Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).

This is summarized in the figure below:

[Figure: Jtrain(Θ) and JCV(Θ) plotted against the polynomial degree d. The training error decreases as d grows; the cross-validation error is high for small d (underfitting, high bias), reaches a minimum at the optimal value of d, and rises again for large d (overfitting, high variance).]
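As a rough rule of thumb, the two error values can be read off mechanically; the helper below is my own sketch, and the thresholds are illustrative and problem-dependent.

```python
def diagnose(j_train, j_cv, acceptable_error):
    """Crude reading of the bias/variance symptoms described above."""
    if j_train > acceptable_error and j_cv >= j_train:
        return "high bias (underfitting): both errors are high"
    if j_train <= acceptable_error and j_cv > 2 * j_train:
        return "high variance (overfitting): J_cv >> J_train"
    return "roughly balanced"

print(diagnose(0.50, 0.55, 0.10))   # -> high bias
print(diagnose(0.02, 0.30, 0.10))   # -> high variance
```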

From <https://www.coursera.org/learn/machine-learning/supplement/81vp0/diagnosing-bias-vs-variance>

Regularization and Bias/Variance

From <https://www.coursera.org/learn/machine-learning/supplement/JPJJj/regularization-and-bias-variance>

[Figure: linear regression with regularization. Model: hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴ with regularization term (λ/2m)Σθj². Three fits of hθ(x) against Size: a large λ (e.g. λ = 10000, so θ1, ..., θ4 ≈ 0 and hθ(x) ≈ θ0) gives high bias (underfit); an intermediate λ is "just right"; a small λ gives high variance (overfit).]

In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? In order to choose the model and the regularization term λ, we need to do the following (a code sketch of this loop follows the list):

    1. Create a list of lambdas (i.e. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24});
    2. Create a set of models with different degrees or any other variants.
    3. Iterate through the λs and for each λ go through all the models to learn some Θ.
    4. Compute the cross validation error JCV(Θ) using the learned Θ (which was computed with λ), but without the regularization term, i.e. with λ = 0.
    5. Select the best combo that produces the lowest error on the cross validation set.
    6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.
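Here is a sketch of that loop for regularized linear regression, solved with the normal equation on made-up data; the split sizes and feature count are invented for the example, and only the λ list comes from the steps above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 8))])  # bias + 8 features
y = X @ rng.normal(size=9) + rng.normal(scale=0.5, size=200)

X_train, y_train = X[:120], y[:120]
X_cv,    y_cv    = X[120:160], y[120:160]
X_test,  y_test  = X[160:], y[160:]

def fit_regularized(X, y, lam):
    """theta = (X'X + lam*L)^-1 X'y, with L = identity except L[0,0] = 0
    so the bias term theta_0 is not regularized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def j(theta, X, y):
    """CV / test cost, computed WITHOUT the regularization term."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
thetas = {lam: fit_regularized(X_train, y_train, lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: j(thetas[lam], X_cv, y_cv))
print(best_lam, j(thetas[best_lam], X_test, y_test))   # generalization check
```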

Learning Curves

From <https://www.coursera.org/learn/machine-learning/supplement/79woL/learning-curves>

Training an algorithm on very few data points (such as 1, 2 or 3) will easily give 0 error, because we can always find, say, a quadratic curve that passes exactly through those points. Hence:

      • As the training set gets larger, the error for a quadratic function increases.
      • The error value will plateau out after a certain m, or training set size.

Experiencing high bias:

Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.

Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high, with Jtrain(Θ) ≈ JCV(Θ).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experiencing high variance:

Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.

Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ) < JCV(Θ) but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help.
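A small sketch of how such learning curves can be computed (without plotting): train a quadratic hypothesis on the first m training examples, then record Jtrain on those m examples and JCV on the full cross validation set. The data and model are toy choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
x_train, y_train, x_cv, y_cv = x[:150], y[:150], x[150:], y[150:]

def j(coeffs, xs, ys):
    return np.sum((np.polyval(coeffs, xs) - ys) ** 2) / (2 * xs.size)

for m in range(3, 151, 10):
    coeffs = np.polyfit(x_train[:m], y_train[:m], 2)   # quadratic hypothesis
    print(m, j(coeffs, x_train[:m], y_train[:m]), j(coeffs, x_cv, y_cv))
# J_train starts near 0 and grows with m; J_cv starts high and falls,
# matching the shapes described in the bullets above.
```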

Deciding What to Do Next Revisited

Our decision process can be broken down as follows:

      • Getting more training examples: Fixes high variance
      • Trying smaller sets of features: Fixes high variance
      • Adding features: Fixes high bias
      • Adding polynomial features: Fixes high bias
      • Decreasing λ: Fixes high bias
      • Increasing λ: Fixes high variance.

Diagnosing Neural Networks

      • A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
      • A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train your neural network with several different numbers of hidden layers (or hidden units) and then use your cross validation set to select the architecture that performs best.
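One possible way to carry out that selection, sketched here with scikit-learn's MLPClassifier standing in for the course's Octave implementation; the candidate layer sizes and the synthetic dataset are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for hidden_units in (5, 25, 50, 100):
    net = MLPClassifier(hidden_layer_sizes=(hidden_units,), max_iter=2000,
                        random_state=0).fit(X_train, y_train)
    cv_error = 1.0 - net.score(X_cv, y_cv)   # misclassification rate on the CV set
    if best is None or cv_error < best[1]:
        best = (hidden_units, cv_error)
print("selected hidden layer size:", best[0])
```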

Model Complexity Effects:

      • Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
      • Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
      • In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

From <https://www.coursera.org/learn/machine-learning/supplement/llc5g/deciding-what-to-do-next-revisited>

System Design Example:

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the most frequently used words in our data set. If a word is found in the email, we assign its respective entry a 1; if it is not found, that entry is a 0. Once we have all our x vectors ready, we train our algorithm and finally, we can use it to classify whether an email is spam or not.

[Slide: building a spam classifier. Supervised learning: x = features of the email, y = spam (1) or not spam (0). Features x: choose 100 words indicative of spam / not spam. Example spam email: From: cheapsales@buystufffromme.com, To: ang@cs.stanford.edu, Subject: Buy now!, body: "Deal of the week! Buy now!"]
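A minimal sketch of that feature construction: build a vocabulary of the most frequent words and turn each email into a 0/1 presence vector. The vocabulary size and example emails below are made up.

```python
import re
from collections import Counter

emails = ["Deal of the week! Buy now!", "Meeting notes attached", "Buy cheap meds now"]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# pick the most frequent words as the vocabulary (10,000-50,000 in practice)
counts = Counter(word for email in emails for word in tokenize(email))
vocab = [word for word, _ in counts.most_common(10)]

def email_to_vector(text):
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocab]   # x_j = 1 if word j appears

print(vocab)
print(email_to_vector("Buy now! Deal inside"))
```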


So how could you spend your time to improve the accuracy of this classifier?

      • Collect lots of data (for example “honeypot” project but doesn’t always work)
      • Develop sophisticated features (for example: using email header data in spam emails)
      • Develop algorithms to process your input in different ways (recognizing misspellings in spam).

It is difficult to tell which of the options will be most helpful.

From <https://www.coursera.org/learn/machine-learning/supplement/0uu7a/prioritizing-what-to-work-on>

Error Analysis

The recommended approach to solving machine learning problems is to:

Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.

Plot learning curves to decide if more data, more features, etc. are likely to help.

Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example, assume that we have 500 emails and our algorithm misclassifies 100 of them. We could manually analyze the 100 emails and categorize them based on what type of emails they are. We could then try to come up with new cues and features that would help us classify these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate:

[Slide: the importance of numerical evaluation. Should discount/discounts/discounted/discounting be treated as the same word? You can use "stemming" software (e.g. the "Porter stemmer"), though it may conflate words such as universe/university. Error analysis may not be helpful for deciding if this is likely to improve performance; the only solution is to try it and see if it works. You need a numerical evaluation (e.g. cross validation error) of the algorithm's performance with and without stemming, e.g. without stemming: 5% error; with stemming: 3% error; distinguishing upper vs. lower case (Mom/mom): 3.2% error.]

It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. For example, if we use stemming, which is the process of treating different forms of the same word (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on the result decide whether we want to keep the new feature or not.
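The same idea in miniature: run an identical pipeline with and without the candidate change and compare one error number. Everything below (the toy suffix-stripping "stemmer", the keyword rule, the tiny CV set) is a placeholder for your real model and data, not the Porter stemmer.

```python
import re

def toy_stem(word):
    """Crude suffix stripping, only for illustration."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def featurize(text, use_stemming):
    words = re.findall(r"[a-z]+", text.lower())
    return {toy_stem(w) if use_stemming else w for w in words}

def cv_error(dataset, use_stemming):
    """Fraction of CV examples a trivial keyword rule gets wrong."""
    spam_markers = {"discount", "buy", "cheap"}
    wrong = 0
    for text, is_spam in dataset:
        predicted_spam = bool(featurize(text, use_stemming) & spam_markers)
        wrong += int(predicted_spam != is_spam)
    return wrong / len(dataset)

cv_set = [("Discounts on discounted meds", True), ("Meeting notes attached", False),
          ("Huge discounts, buy now", True), ("Project notes", False)]
print("without stemming:", cv_error(cv_set, False))   # -> 0.25
print("with stemming:   ", cv_error(cv_set, True))    # -> 0.0
```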

If predicted class and actual class are both 1, then a test example is a True Positive. If predicted class and actual class are both 0, then a test example is a True Negative. If predicted class is 0 actual class is 1, then a test example is a False Negative. If predicted class is 1 and actual class is 0, then a test example is a False Positive.

$$\text{Precision} = \frac{\text{True positives}}{\#\text{ predicted as positive}} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$

$$\text{Recall} = \frac{\text{True positives}}{\#\text{ actual positives}} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$
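A small sketch computing precision and recall directly from these definitions; the example labels are illustrative.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from 0/1 predicted and actual labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))   # -> (0.666..., 0.666...)
```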
