
Data Science Algorithms and Tools



Problem 1


Part (a)

The pseudocode of Hunt's algorithm is given below:

function HuntsAlgorithm(Dt, t):
    if all records in Dt belong to the same class yt:
        then node t is a leaf node labeled yt
    else if Dt is an empty set:
        then node t is a leaf node labeled with the default class yd
    else:  # Dt contains records from more than one class
        use an attribute test condition to split Dt into smaller subsets (child nodes)
        recursively apply HuntsAlgorithm to each subset
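
To make this concrete, here is a minimal runnable Python sketch of the recursion. The best_split helper is a hypothetical function (not part of the pseudocode above) that picks an attribute test and partitions the records:

def hunts_algorithm(records, default_class, best_split):
    """Grow a decision tree from (features, label) pairs, Hunt-style."""
    # Case 2: Dt is empty -- leaf labeled with the default class yd.
    if not records:
        return {"leaf": default_class}
    labels = {label for _, label in records}
    # Case 1: all records share the same class yt -- leaf labeled yt.
    if len(labels) == 1:
        return {"leaf": labels.pop()}
    # Case 3: records from more than one class -- split and recurse.
    # best_split is a hypothetical helper returning (attribute, {value: subset}).
    attribute, subsets = best_split(records)
    # Use the majority class of Dt as the default class for empty children.
    majority = max(labels, key=lambda c: sum(1 for _, y in records if y == c))
    return {"split on": attribute,
            "children": {v: hunts_algorithm(s, majority, best_split)
                         for v, s in subsets.items()}}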

The three main design choices in a specific decision tree induction algorithm are:

  • How to specify the attribute test condition
  • How to determine the best split
  • When to stop splitting data into smaller subsets

The GINI index for a given node $t$ is:

$$\mathrm{GINI}(t) = 1 - \sum_{j} \left[ p(j \mid t) \right]^2$$

where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$. For a binary split, there will be two classes in this node. Let these two classes be denoted by $a$ and $b$; then:

$$\mathrm{GINI}(t) = 1 - [p(a \mid t)]^2 - [p(b \mid t)]^2$$

where $p(a \mid t)$ and $p(b \mid t)$ are the conditional probabilities of $a$ and $b$ in node $t$, i.e. their frequencies divided by the total number of records in $t$.
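
As a quick illustration, the node GINI can be computed from class counts in a few lines of Python (the counts below are made-up example numbers, not from the assignment data):

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) = count_j / total.
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Binary node with 3 records of class a and 1 of class b:
# p(a|t) = 0.75, p(b|t) = 0.25, so GINI(t) = 1 - 0.5625 - 0.0625 = 0.375.
print(gini([3, 1]))  # prints 0.375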


Part (b)

Decision trees are classification models. To measure the performance of a decision tree, I will normally split the data into a training set and a testing set. The training set is used to fit the model, while the testing set is used to calculate evaluation metrics including:

  • accuracy: the percentage of classifications that are correct
  • recall: the proportion of relevant records that are actually retrieved (predicted)
  • precision: the proportion of retrieved records that are actually relevant

If the model performs well and scores high on these metrics on the testing set, then the performance of the decision tree is considered good.
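
A minimal sketch of this workflow with scikit-learn is given below; the toy dataset from make_classification is a stand-in assumption used only to make the example runnable, not part of the original problem:

# Sketch: evaluating a decision tree with a train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy: ", accuracy_score(y_test, y_pred))   # fraction classified correctly
print("precision:", precision_score(y_test, y_pred))  # retrieved records that are relevant
print("recall:   ", recall_score(y_test, y_pred))     # relevant records actually retrieved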

The re-substitution error is the error rate on the training data. The generalization error is the error rate on the testing data. Both are calculated with the formula below, summing the errors across all leaf nodes of the decision tree $T$:

$$\mathrm{Err}(T) = \frac{1}{N_r} \sum_{t \in \mathrm{leaves}(T)} e(t)$$

where $e(t)$ is the number of misclassified records at leaf node $t$ and $N_r$ is the total number of records.

If test data are not available, we may use different approaches to estimate the generalization error from the re-substitution error on the training data.

Part (c)

Data types

The label Play is binary, as it contains only {Yes, No}.

The features Outlook, Temperature, and Humidity are all ordinal, as they each contain more than two categories but are not continuous values.

The feature Windy is binary, as it contains only {Yes, No}.

Re-substitution error

Before calculating the re-substitution error rate, I made a table showing the prediction results for both trees. The wrong predictions are marked with an asterisk (*).

| ID | Play (ground truth) | Decision Tree 1.a | Decision Tree 1.b |
|----|---------------------|-------------------|-------------------|
| 1  | No  | No   | No   |
| 2  | Yes | Yes  | Yes  |
| 3  | No  | No   | No   |
| 4  | No  | No   | No   |
| 5  | No  | Yes* | No   |
| 6  | Yes | Yes  | Yes  |
| 7  | Yes | No*  | No*  |
| 8  | No  | Yes* | No   |
| 9  | No  | Yes* | No   |
| 10 | No  | No   | No   |
| 11 | Yes | Yes  | Yes  |
| 12 | No  | No   | No   |
| 13 | Yes | Yes  | Yes  |
| 14 | No  | No   | No   |
| 15 | Yes | Yes  | No*  |
| 16 | No  | No   | No   |

The optimistic approach estimates the generalization error to be the same as the re-substitution error. The pessimistic approach lets

$$e'(T) = \frac{e(T) + \Omega \times N}{N_r}$$

where $e(T) = \sum_t e(t)$ is the total number of training errors, $\Omega = 0.5$, $N$ is the number of leaves in the tree, and $N_r$ is the total number of training records. Tree 1.a has 2 leaves; tree 1.b has 4 leaves.

  • For tree 1.a (4 errors out of 16 records, 2 leaves):
    • The optimistic estimate gives $4/16 = 0.25$.
    • The pessimistic estimate gives $(4 + 0.5 \times 2)/16 = 5/16 \approx 0.313$.
  • For tree 1.b (2 errors out of 16 records, 4 leaves):
    • The optimistic estimate gives $2/16 = 0.125$.
    • The pessimistic estimate gives $(2 + 0.5 \times 4)/16 = 4/16 = 0.25$.
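
The same arithmetic, checked in a few lines of Python (error counts read off the table above):

def estimates(errors, leaves, n_records=16, omega=0.5):
    # Optimistic: the re-substitution error; pessimistic: add omega per leaf.
    optimistic = errors / n_records
    pessimistic = (errors + omega * leaves) / n_records
    return optimistic, pessimistic

print(estimates(errors=4, leaves=2))  # tree 1.a: (0.25, 0.3125)
print(estimates(errors=2, leaves=4))  # tree 1.b: (0.125, 0.25)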

Part (d)

The penalty term acts as a regularization term on the model's complexity. A tree with more leaves is more complex than a tree with few leaves. While splitting the subsets finer and finer will usually lower the training error rate, it may overfit the dataset and yield worse results on the test set. Therefore, the penalty on the number of leaves regulates the model's training behavior and stops it from splitting the subsets too small.

Setting the two pessimistic estimates equal, $\frac{4 + 2\Omega}{16} = \frac{2 + 4\Omega}{16}$, we solve $\Omega = 1$. When the penalty term $\Omega > 1$, tree 1.a will have a smaller pessimistic estimate of the generalization error than tree 1.b.
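
A quick numeric check of this threshold, sweeping the penalty term with the same error counts as above:

# Pessimistic estimate as a function of the penalty term omega.
def pessimistic(errors, leaves, omega, n_records=16):
    return (errors + omega * leaves) / n_records

for omega in (0.5, 1.0, 1.5):
    a = pessimistic(4, 2, omega)  # tree 1.a: 4 errors, 2 leaves
    b = pessimistic(2, 4, omega)  # tree 1.b: 2 errors, 4 leaves
    print(omega, a, b)
# The two estimates tie at omega = 1; tree 1.a is smaller only for omega > 1.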


 
