Problem 1
Part (a)
The pseudocode of Hunt's algorithm is given below:
function HuntsAlgorithm(Dt, t):
  If all the records in Dt have the same class yt, then let node t be a leaf node labeled as yt;
  If Dt is an empty set, then t is a leaf node labeled by the default class yd;
  If Dt contains records from more than one class:
    Use an attribute test to split the data Dt into smaller subsets (child nodes);
    Recursively apply HuntsAlgorithm to each subset.
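A minimal Python sketch of this recursion is given below, assuming the records arrive as a pandas DataFrame with a named class column; the `best_split_attribute` callable, the dictionary node representation, and the majority-class guard are assumptions for illustration, not part of the original pseudocode.

```python
import pandas as pd

def hunts_algorithm(Dt: pd.DataFrame, target: str, default_class, best_split_attribute):
    """Grow a decision tree by recursively applying Hunt's algorithm to Dt."""
    # Case: Dt is empty -> leaf node labelled with the default class yd
    if Dt.empty:
        return {"type": "leaf", "label": default_class}

    # Case: all records in Dt have the same class yt -> leaf node labelled yt
    classes = Dt[target].unique()
    if len(classes) == 1:
        return {"type": "leaf", "label": classes[0]}

    # Guard (assumption): no attributes left to test -> majority-class leaf
    attributes = [c for c in Dt.columns if c != target]
    if not attributes:
        return {"type": "leaf", "label": Dt[target].mode()[0]}

    # Case: records from more than one class -> split and recurse on each subset
    attr = best_split_attribute(Dt, target)        # attribute test condition (assumed given)
    node = {"type": "internal", "attribute": attr, "children": {}}
    for value, subset in Dt.groupby(attr):
        child = subset.drop(columns=[attr])        # smaller subset for this child node
        node["children"][value] = hunts_algorithm(child, target, default_class, best_split_attribute)
    return node
```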
The 3 main design choices in the specific decision tree induction algorithm are:
- How to specify the attribute test condition
- How to determine the best split
- When to stop splitting data into smaller subsets
GINI index for a given node t is:

$\mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$

where $p(j \mid t)$ is the relative frequency of class j at node t. For a binary split, there will be two classes in this node. Let these two classes be denoted by a, b, then:

$\mathrm{GINI}(t) = 1 - p(a \mid t)^2 - p(b \mid t)^2$

where $p(a \mid t)$ and $p(b \mid t)$ are the conditional probabilities of a and b in node t, i.e. their frequencies divided by the total number of records in t.
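As a quick illustration, the helper below computes the GINI index from a list of class labels reaching a node; the two-class node at the bottom (3 records of class a, 5 of class b) is an assumed toy example.

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class labels reaching node t."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Toy two-class node: p(a|t) = 3/8, p(b|t) = 5/8
node_labels = ["a"] * 3 + ["b"] * 5
print(gini(node_labels))   # 1 - (3/8)^2 - (5/8)^2 = 0.46875
```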
Part (b)
Decision trees are classification models. To measure the performance of a decision tree, I will normally split the data into a training set and a testing set. The training set is used to fit the model, while the testing set is used to calculate evaluation metrics including:
- accuracy: the percentage of classifications that are correct
- recall: the proportion of relevant records that are actually retrieved (predicted)
- precision: the proportion of retrieved data that are actually relevant.
If the model performs very well and scores high on these metrics on the testing set, then the performance of the decision tree is considered good.
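A brief sketch of this evaluation workflow with scikit-learn, assuming a feature matrix `X` and a 0/1-encoded label vector `y` have already been prepared; the 75/25 split ratio and the random seed are arbitrary choices.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed inputs: feature matrix X and binary labels y (e.g. 1 = "Yes", 0 = "No")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # fit on the training set
y_pred = clf.predict(X_test)                           # evaluate on the held-out test set

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```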
The re-substitution error is the error rate on the training data. The generalization error is the error rate on the testing data. Both are calculated using the formula below, summing the errors across all leaf nodes of the decision tree T:

$e(T) = \frac{\sum_{i=1}^{N} e(t_i)}{N_t}$

where the function $e(t_i)$ gives the number of misclassified records at leaf node $t_i$, N is the number of leaf nodes, and $N_t$ is the total number of records.

If test data are not available, we may use different approaches to estimate the generalization error from the re-substitution error rate computed on the training data.
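As a small illustration of this formula, the snippet below computes the error rate of a tree from per-leaf misclassification counts; the leaf counts used here are made-up numbers, not from the trees in this problem.

```python
def tree_error_rate(leaf_errors, n_records):
    """e(T) = (sum of misclassified records over all leaves) / total number of records."""
    return sum(leaf_errors) / n_records

# Hypothetical tree with 3 leaves misclassifying 1, 0 and 2 records out of 20
print(tree_error_rate([1, 0, 2], 20))   # 0.15
```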
Part (c)
Data type
The label Play is binary, as it contains only {Yes, No}.
The features Outlook, Temperature, and Humidity are all ordinal, as they each contain more than two categories but are not continuous values.
The feature Windy is binary, as it contains only {Yes, No}.
Re-substitution error
Before calculating the re-substitution error rate, I made a table showing the prediction results of both trees. Wrong predictions are marked with an asterisk (*).
| ID | Play (ground truth) | Decision Tree 1.a | Decision Tree 1.b |
|----|---------------------|-------------------|-------------------|
| 1  | No  | No   | No   |
| 2  | Yes | Yes  | Yes  |
| 3  | No  | No   | No   |
| 4  | No  | No   | No   |
| 5  | No  | Yes* | No   |
| 6  | Yes | Yes  | Yes  |
| 7  | Yes | No*  | No*  |
| 8  | No  | Yes* | No   |
| 9  | No  | Yes* | No   |
| 10 | No  | No   | No   |
| 11 | Yes | Yes  | Yes  |
| 12 | No  | No   | No   |
| 13 | Yes | Yes  | Yes  |
| 14 | No  | No   | No   |
| 15 | Yes | Yes  | No*  |
| 16 | No  | No   | No   |
The optimistic approach estimates the generalization error to be the same as the re-substitution error, $e'(T) = e(T)$. The pessimistic approach instead uses

$e'(T) = \frac{\sum_{i=1}^{N} \left[ e(t_i) + \Omega(t_i) \right]}{N_t}$

where $\Omega(t_i) = 0.5$ and N = the number of leaves in the tree. Tree 1.a has 2 leaves and misclassifies 4 training records (IDs 5, 7, 8, 9); tree 1.b has 4 leaves and misclassifies 2 training records (IDs 7 and 15).
- For tree 1.a
  - The optimistic estimate gives $e'(T_{1.a}) = 4/16 = 0.25$
  - The pessimistic estimate gives $e'(T_{1.a}) = (4 + 2 \times 0.5)/16 = 5/16 = 0.3125$
- For tree 1.b
  - The optimistic estimate gives $e'(T_{1.b}) = 2/16 = 0.125$
  - The pessimistic estimate gives $e'(T_{1.b}) = (2 + 4 \times 0.5)/16 = 4/16 = 0.25$
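These numbers can be double-checked with a few lines of Python; the error counts and leaf counts below are read directly off the table and the statement above.

```python
def optimistic(errors, n_records):
    """Optimistic estimate: generalization error equals the re-substitution error."""
    return errors / n_records

def pessimistic(errors, n_leaves, n_records, penalty=0.5):
    """Pessimistic estimate: add a penalty of 0.5 per leaf before dividing by the record count."""
    return (errors + penalty * n_leaves) / n_records

N_RECORDS = 16
print(optimistic(4, N_RECORDS), pessimistic(4, 2, N_RECORDS))   # tree 1.a: 0.25, 0.3125
print(optimistic(2, N_RECORDS), pessimistic(2, 4, N_RECORDS))   # tree 1.b: 0.125, 0.25
```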
Part (d)
The meaning of the penalty term is to act as a regularization term on the model's complexity. A tree with more leaves is more complex than a tree with fewer leaves. While splitting the subsets finer and finer will usually lead to a lower re-substitution error rate, it may overfit the dataset and yield worse results on the test set. Therefore, the penalty term on the number of leaves regulates the model's training behavior and stops it from splitting subsets that are too small.
Let $\Omega(t_i) = k$. We solve $\frac{4 + 2k}{16} < \frac{2 + 4k}{16}$, which gives $k > 1$. When the penalty term $\Omega(t_i) > 1$, tree 1.a will have a smaller pessimistic estimate of generalization error than tree 1.b.
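This threshold can also be verified numerically by reusing the pessimistic formula with a variable per-leaf penalty k; the grid of k values below is chosen just for illustration.

```python
def pessimistic_k(errors, n_leaves, k, n_records=16):
    """Pessimistic estimate with a per-leaf penalty term k instead of 0.5."""
    return (errors + k * n_leaves) / n_records

for k in [0.5, 1.0, 1.5, 2.0]:
    e_a = pessimistic_k(4, 2, k)   # tree 1.a: 4 errors, 2 leaves
    e_b = pessimistic_k(2, 4, k)   # tree 1.b: 2 errors, 4 leaves
    print(f"k={k}: tree 1.a = {e_a:.4f}, tree 1.b = {e_b:.4f}, 1.a smaller: {e_a < e_b}")
```

The printout shows the two estimates are equal at k = 1 and tree 1.a only becomes the smaller one once k exceeds 1, matching the algebraic result above.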