Optimize the Portfolio of Soybean Varieties at Target Farm through Machine Learning
The world faces food shortages for three reasons. First, the world's population has grown rapidly. Second, urbanization has reduced the area available for food crops. Third, COVID-19 is seriously affecting production globally. It is therefore imperative to select the best soybean varieties to increase yields and help alleviate the shortage.
This project will use machine learning to optimize the soybean variety portfolio of the target farm by selecting 5 of 182 soybean varieties and allocating a share of land to each so as to optimize production. Descriptive, predictive and prescriptive analytics will all be used in this project.
Descriptive analytics mainly studies the influence of location on yield, the relationship between weather and location, the distribution of yields and varieties, and whether there are enough data to build a model.
The predictive analytics methods are Linear Regression, LASSO, Regression Tree, Bagging, Random Forest, Boosted Trees and Neural Network. These methods will be used to build models, predict yield and compare accuracy. Finally, in prescriptive analytics, the mean-risk heuristic is applied to produce the final result. This part considers both the predicted average yield given by the models and the risk of growing each variety, with the mean square error of each method determining how much weight its predictions receive.
The ultimate goal of the project is to identify the optimal soybean varieties through machine learning and thereby increase yield per unit area. The final result will help soybean farmers increase their yields and reduce hunger for people in certain areas. This research is therefore of great significance.
Keywords: Food Shortage, Optimization, Portfolio of Soybean Varieties, Machine Learning
Food shortages have now become a global problem. On the one hand, the world's population has grown tremendously: in the past 200 years it has grown roughly 7.5-fold, from 1 billion to about 7.5 billion. On the other hand, urbanization has reduced rural areas and cultivated land, making food shortages more serious. In addition, a special factor has emerged: the impact of COVID-19 is gradually deepening.
The World Food Programme pointed out that 135 million people around the world were already facing severe food shortages and that, as a result of the epidemic, this figure is expected to rise by about 130 million this year, to 265 million. In this situation, it is particularly important to increase grain production per unit area. This is the motivation for the project: to increase yield and help address the food shortage problem by selecting the best soybean varieties.
The goal of this project is to optimize the portfolio of soybean varieties at the target farm through machine learning.
The specific question is which 5 of the 182 soybean varieties should be selected for the target farm, and how much land should be allocated to each, so that yield is optimized. If the results are convincing and credible, many people could benefit, such as farmers who grow soybeans and people who are starving, especially children who are malnourished because of food shortages. This research is therefore significant.
The analysis includes three components: descriptive analytics, predictive analytics, and prescriptive analytics. Descriptive analytics mainly studies qualitative and descriptive questions to understand the general state of the data. In predictive analytics, 7 methods are used to build models for the different varieties and predict their yield at the target farm. The accuracy of the methods is compared using the mean square error. Finally, in prescriptive analytics, the mean-risk heuristic, which considers both average yield and risk, is applied to produce the final portfolio.
The important result of this project is the portfolio of 5 soybean varieties and the specific land allocation for the target farm. This indicates that through machine learning, the optimal varieties can be selected, and the purpose of increasing the yield per unit area could be achieved. Such experience may also be applied in other target areas, which may help solve the problem of food shortages on a larger scale.
Some articles have studied the problem of increasing soybean or other grain yields by optimizing variety selection. For example, Huang et al. (2020) applied an inter-comparison method in “Comparative Test Analysis and Evaluation of New Summer Soybean Varieties (lines) in Xinxiang Area”. “Effects of planting patterns on agronomic characters and yield of different soybean varieties” (Wang et al., 2020) applied algorithms to construct a reasonable group structure of soybeans and increase yields. The structure and methods of these studies serve as useful references.
“Machine-Learning-Based Simulation for Estimating Parameters in Portfolio Optimization: Empirical Application to Soybean Variety Selection” (Sundaramoorthi & Dong, 2019) is more instructive because its method is entirely machine-learning-based. It used Bagging, Random Forest and Regression Trees, which provides a theoretical basis for this project. Barkley, Peterson & Shroyer (2010) used portfolio theory from business investment analysis to find the portfolio of wheat varieties with maximum yield and minimum risk. That article inspired this project to consider risk and yield together.
However, these studies (Basnet, Mader & Nickell, 1974; Cui et al., 2020) usually use only 1 or 2 methods for modeling and prediction. In fact, each method has its own drawbacks. Combining more methods can make the results more robust and less biased. It also makes the approach more likely to transfer to other regions, widening the scope of the problems it can address. This project will try to fill this research gap.
Methodology and Analysis
There are three components of the methodology: descriptive analytics, predictive analytics, and prescriptive analytics. Before these three steps, data pre-processing is also important because it provides a solid foundation for the subsequent analysis.
Step 1: Data Pre-processing
In this step, the varieties are divided into two types according to how much data they have: sufficient and insufficient. Varieties with sufficient data are retained as they are, and varieties with insufficient data are merged. Taking fewer than 50 records as the criterion for insufficient data, 99 of the 182 varieties are insufficient and 83 are sufficient. The 99 insufficient varieties are combined into a single new variety, called “Vnew”, which is fitted together with the other 83, giving 84 varieties in total. In the subsequent fitting, each method therefore builds 84 models. Other operations are also performed to prepare the data for fitting.
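The merging rule above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the record layout (a list of dictionaries with a "variety" key), the function name `merge_sparse_varieties` and the toy data are all assumptions made for the example; only the threshold of 50 and the label “Vnew” come from the text.

```python
from collections import Counter

MIN_RECORDS = 50  # threshold from the text: fewer than 50 records is "insufficient"


def merge_sparse_varieties(records, min_records=MIN_RECORDS, merged_label="Vnew"):
    """Relabel every variety with fewer than `min_records` rows as `merged_label`.

    `records` is a list of dicts with a "variety" key (a hypothetical layout).
    Returns a new list; the input rows are left unchanged.
    """
    counts = Counter(row["variety"] for row in records)
    merged = []
    for row in records:
        new_row = dict(row)  # copy so the original record is not modified
        if counts[row["variety"]] < min_records:
            new_row["variety"] = merged_label
        merged.append(new_row)
    return merged


# Toy data: V1 has 60 rows (sufficient); V2 has 10 and V3 has 5 (both insufficient).
toy = (
    [{"variety": "V1", "yield": 50.0}] * 60
    + [{"variety": "V2", "yield": 48.0}] * 10
    + [{"variety": "V3", "yield": 52.0}] * 5
)
cleaned = merge_sparse_varieties(toy)
print(Counter(row["variety"] for row in cleaned))
```

In the toy data, the 15 rows of V2 and V3 end up under the single label “Vnew”, while V1 is kept as it is.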
Step 2: Descriptive Analytics
This part is mainly to have a general understanding of the data. Firstly, plot the latitudes and longitudes on a map to visualize the locations of farms in Figure 1.
Figure 1. Location Information of the farms in the given data.
Secondly, generate frequency distribution for varieties in Figure 2.
Figure 2. The Frequency of the Variety for all data.
It is apparent from Figure 2 that the amount of data varies greatly between varieties: some have sufficient data and others do not. This was already handled in Step 1, which will benefit the quality of the subsequent models.
Thirdly, check whether there is any relationship between location and the yield of the varieties. Linear regression is used here, with the p-values and coefficients indicating the relationship between location and yield. Both longitude and latitude have a significant effect on yield, because their p-values are close to 0. The coefficient of longitude is -0.33, which is small in magnitude, so the practical effect of longitude on production is hard to judge. The coefficient of latitude is -2.74, which indicates that the lower the latitude, the greater the yield.
Fourthly, explore the relationships between location and the weather-related variables.
Linear regression indicates that both Weather 1 and Weather 2 are positively associated with location: the two p-values are close to 0, and the coefficients of Weather 1 and Weather 2 are 1.09 and 1.97, respectively.
Fifthly, plot the distribution of the yield variable in Figure 3. Figure 3 shows an approximately normal distribution of variety yield and indicates that yield varies greatly between varieties. Therefore, the goal of this project should be to select the better varieties to improve the total yield.
Figure 3. Histogram of the variety yield.
Step 3: Predictive Analytics
The target variable in this project is Variety_Yield. The models are built for 84 varieties: the 83 varieties with sufficient data and 1 new variety, called “Vnew”, which consists of all varieties with insufficient data. Seven methods are applied here: Linear Regression, LASSO, Regression Tree, Bagging, Random Forest, Boosted Trees and Neural Network.
The steps to build a model are similar for each method. Firstly, divide all the data into a training set and a test set at a ratio of 8:2. Secondly, predict the yield of all 84 varieties using a “for” loop: for each variety, fit the model on the training data and predict the test set. Calculate the mean square error (MSE) to assess the accuracy of the method. Finally, predict the yield for the evaluation data of the target farm.
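The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: a single-feature least-squares line stands in for the seven methods, the data are synthetic, and the helper names (`train_test_split`, `fit_line`, `mse`, `evaluate_per_variety`) are hypothetical. Only the 8:2 split, the per-variety loop and the use of MSE come from the text.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows and split them 8:2 into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]


def fit_line(rows):
    """Closed-form least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def mse(model, rows):
    """Mean squared error of the fitted line on the given rows."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)


def evaluate_per_variety(data, target_x):
    """Fit one model per variety; return {variety: (test MSE, predicted yield)}."""
    results = {}
    for variety, rows in data.items():
        train, test = train_test_split(rows)
        model = fit_line(train)
        intercept, slope = model
        results[variety] = (mse(model, test), intercept + slope * target_x)
    return results


# Synthetic data: for each variety, yield is an exact linear function of one feature.
data = {
    "V1": [(x, 40.0 + 2.0 * x) for x in range(50)],
    "Vnew": [(x, 55.0 - 0.5 * x) for x in range(50)],
}
results = evaluate_per_variety(data, target_x=10.0)
for variety, (test_mse, prediction) in results.items():
    print(variety, round(test_mse, 6), round(prediction, 3))
```

Because the synthetic yields are exactly linear, the test MSE is essentially zero here; with real data, the MSE of each method would be averaged or otherwise summarized across varieties before methods are compared.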
There are several points to note among the seven methods. Firstly, LASSO differs from the other methods: it requires regression processing of the data frame before fitting. Secondly, even though the data had been pre-processed, an error occurred in the loop of the Boosted Trees method because the data were insufficient. This problem was solved by adjusting the loop range. Thirdly, the Neural Network method needs the data to be standardized before fitting. The formula for standardization is (X-min)/(max-min).
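The standardization formula (X-min)/(max-min) can be illustrated with a short function. This is a minimal sketch; the function name `min_max_scale` and the handling of a constant column are assumptions made for the example.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column would cause division by zero; map it to 0.0 instead.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


scaled = min_max_scale([10.0, 15.0, 20.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

A common practice, which the text does not confirm was followed here, is to compute the minimum and maximum on the training set and reuse them for the test and evaluation data.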
The prediction results are shown in Table 1.
Table 1. Predicted yield of the different varieties by method.
| Variety | Linear Regression | LASSO | Regression Tree | Bagging | Random Forest | Boosted Trees | Neural Network |
Step 4: Prescriptive Analytics
In this step, the accuracy of the different methods is summarized first. Table 2 shows the MSE of the 7 methods.
Table 2. MSE of 7 methods.
Then, assign weights to the methods according to their MSE. Because the MSE of linear regression is too large for its predictions to be informative, its weight is 0. The other methods are assigned weights of 6, 5, 4, 3, 2 and 1 in order of increasing MSE, so the most accurate method receives the largest weight. See Table 3 for the weights.
Table 3. Weights of 7 methods.
Next, calculate the weighted average of the yield predicted by different methods according to the weights above.
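The weighting scheme and the weighted average can be sketched as follows. The MSE values and predictions below are made up purely for illustration and are not the figures from Table 2 or Table 1; the function names `rank_weights` and `weighted_average` are likewise hypothetical.

```python
def rank_weights(mse_by_method, excluded=("Linear Regression",)):
    """Give weight 0 to excluded methods; rank the rest so the smallest MSE gets the largest weight."""
    ranked = sorted(
        (m for m in mse_by_method if m not in excluded), key=mse_by_method.get
    )
    weights = {m: 0 for m in excluded if m in mse_by_method}
    for position, method in enumerate(ranked):
        weights[method] = len(ranked) - position
    return weights


def weighted_average(predictions, weights):
    """Weighted mean of one variety's predicted yields across methods."""
    total_weight = sum(weights[m] for m in predictions)
    return sum(predictions[m] * weights[m] for m in predictions) / total_weight


# Made-up MSE values for illustration only; they are NOT the values in Table 2.
example_mse = {
    "Linear Regression": 900.0,
    "LASSO": 30.0,
    "Regression Tree": 40.0,
    "Bagging": 20.0,
    "Random Forest": 25.0,
    "Boosted Trees": 35.0,
    "Neural Network": 60.0,
}
weights = rank_weights(example_mse)

# Made-up predictions for a single variety, one value per method.
example_predictions = {method: 50.0 for method in example_mse}
example_predictions["Bagging"] = 71.0

print(weights)
print(weighted_average(example_predictions, weights))
```

With six ranked methods the weights sum to 21, so in this example the Bagging prediction contributes 6/21 of the weighted average and the linear regression prediction contributes nothing.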
The mean-risk heuristic, which considers both average yield and risk, is also applied here. Standard deviation is a good way to express risk.
The standard deviation (SD) of yield across all the data for each variety measures the uncertainty of planting that variety, which is the risk.
To combine the weighted average yield and the risk, the formula balance = (average yield)^2 / risk is used, because greater yield and lower risk are both desirable and yield should count for more than SD. The resulting balance values are sorted and the top 5 varieties are chosen as the final selection. The land is then allocated in proportion to balance. The five selected varieties are therefore V98, V39, V9, V90 and V31, with land allocation proportions of 20.95%, 20.59%, 20.07%, 19.47% and 18.92%, respectively. Table 4 lists the ordered balance values with the corresponding weighted average yield and risk, together with the final allocation proportions.
Table 4. Process and the final result of the land allocation.
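The balance score and the proportional land allocation described in this step can be sketched as follows. The variety names and numbers are made up for illustration and do not reproduce Table 4; the function name `select_portfolio` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population SD was used.

```python
from statistics import stdev


def select_portfolio(avg_yield, observed_yields, top_n=5):
    """Score each variety by yield**2 / SD, keep the top_n, and split land proportionally."""
    balance = {
        v: avg_yield[v] ** 2 / stdev(observed_yields[v]) for v in avg_yield
    }
    chosen = sorted(balance, key=balance.get, reverse=True)[:top_n]
    total = sum(balance[v] for v in chosen)
    return {v: balance[v] / total for v in chosen}


# Made-up numbers for illustration; variety names and values are hypothetical.
avg_yield = {"A": 60.0, "B": 58.0, "C": 55.0, "D": 50.0, "E": 62.0, "F": 40.0}
observed_yields = {
    "A": [55.0, 60.0, 65.0],
    "B": [56.0, 58.0, 60.0],
    "C": [50.0, 55.0, 60.0],
    "D": [45.0, 50.0, 55.0],
    "E": [52.0, 62.0, 72.0],
    "F": [30.0, 40.0, 50.0],
}
allocation = select_portfolio(avg_yield, observed_yields)
for variety, share in allocation.items():
    print(f"{variety}: {share:.2%}")
```

In this toy example, variety E has the highest average yield but also a large spread, so it ranks below the steadier varieties. This is the trade-off the balance formula is meant to capture.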
This project uses 7 machine learning methods, namely Linear Regression, LASSO, Regression Tree, Bagging, Random Forest, Boosted Trees and Neural Network, to build models that analyze and predict the yield of different soybean varieties. The predicted values are combined in a weighted average that reflects the accuracy of each method. Both risk and yield are then considered in determining the final optimal portfolio.
The final result of this project is that the five selected varieties are V98, V39, V9, V90 and V31.
The land allocation proportions are 20.95%, 20.59%, 20.07%, 19.47% and 18.92%, respectively. Among the 7 methods, Bagging, Random Forest and LASSO perform best, while Linear Regression and Neural Network are the least suitable for this case.
The first innovation of this project is using several methods and weighting them by accuracy, because this compensates for the shortcomings of any single method and gives less biased results. The second innovation is summarizing whether the different methods are suitable for this type of data. This experience can also be applied in other regions and can support future research on soybeans or other grain varieties, helping to address the food shortage problem more widely. The findings of this project can help both soybean growers and people facing hunger, and could also inform relevant government departments when they issue guiding policies on varieties.
- Huang Jinhua, Wang Lingyan, Tang Zhenhai, Dou Shishu, Li Mingwei, Ma Haitao, Zhang Suping, Li Junli, Zheng Qiudao, Fan Yongsheng. Analysis and evaluation of comparative experiment of new summer soybean varieties (lines) in Xinxiang area[J]. Anhui Agricultural Sciences, 2020, 48(17):21-23+27.
- Wang He, Sun Jiaxing, Mo Yan, Yang Shuang. Effects of Planting Patterns on Agronomic Characters and Yield of Different Soybean Varieties[J]. China Seed Industry, 2020(12):60-63.
- Sundaramoorthi D, & Dong L. Machine-Learning-Based Simulation for Estimating Parameters in Portfolio Optimization: Empirical Application to Soybean Variety Selection[J]. SSRN Electronic Journal, 2019.
- Barkley, A., Peterson, H., & Shroyer, J. (2010). Wheat Variety Selection to Maximize Returns and Minimize Risk: An Application of Portfolio Theory. Journal of Agricultural and Applied Economics, 42(1), 39-55.
- Basnet B., Mader E. L. & Nickell C. D., Influence of Altitude on Seed Yield and Other Characters of Soybeans Differing in Maturity in Sikkim (Himalayan Kingdom), Agronomy Journal, 1974.
- Cui Jihan, Li Shunguo, Liu Meng, Guo Shuai, Zhao Yu, Ma Junting, Xia Xueyan. The effect of millet and peanut/soybean intercropping on yield and the differences between varieties[J]. Journal of Anhui Agricultural Sciences, 2020, 48(17):35-40+45.