Contents
List of Figures v
List of Abbreviations vii
1 Introduction 1
1.1 Data Mining 1
1.2 R 1
1.3 Datasets 2
1.3.1 The Iris Dataset 2
1.3.2 The Bodyfat Dataset 3
2 Data Import and Export 5
2.1 Save and Load R Data 5
2.2 Import from and Export to .CSV Files 5
2.3 Import Data from SAS 6
2.4 Import/Export via ODBC 7
2.4.1 Read from Databases 7
2.4.2 Output to and Input from EXCEL Files 7
3 Data Exploration 9
3.1 Have a Look at Data 9
3.2 Explore Individual Variables 11
3.3 Explore Multiple Variables 15
3.4 More Explorations 19
3.5 Save Charts into Files 27
4 Decision Trees and Random Forest 29
4.1 Decision Trees with Package party 29
4.2 Decision Trees with Package rpart 32
4.3 Random Forest 36
5 Regression 41
5.1 Linear Regression 41
5.2 Logistic Regression 46
5.3 Generalized Linear Regression 47
5.4 Non-linear Regression 48
6 Clustering 49
6.1 The k-Means Clustering 49
6.2 The k-Medoids Clustering 51
6.3 Hierarchical Clustering 53
6.4 Density-based Clustering 54
7 Outlier Detection 59
7.1 Univariate Outlier Detection 59
7.2 Outlier Detection with LOF 62
7.3 Outlier Detection by Clustering 66
7.4 Outlier Detection from Time Series 67
7.5 Discussions 68
8 Time Series Analysis and Mining 71
8.1 Time Series Data in R 71
8.2 Time Series Decomposition 72
8.3 Time Series Forecasting 74
8.4 Time Series Clustering 75
8.4.1 Dynamic Time Warping 75
8.4.2 Synthetic Control Chart Time Series Data 76
8.4.3 Hierarchical Clustering with Euclidean Distance 77
8.4.4 Hierarchical Clustering with DTW Distance 79
8.5 Time Series Classification 81
8.5.1 Classification with Original Data 81
8.5.2 Classification with Extracted Features 82
8.5.3 k-NN Classification 84
8.6 Discussions 84
8.7 Further Readings 84
9 Association Rules 85
9.1 Basics of Association Rules 85
9.2 The Titanic Dataset 85
9.3 Association Rule Mining 87
9.4 Removing Redundancy 90
9.5 Interpreting Rules 91
9.6 Visualizing Association Rules 91
9.7 Discussions and Further Readings 96
10 Text Mining 97
10.1 Retrieving Text from Twitter 97
10.2 Transforming Text 98
10.3 Stemming Words 99
10.4 Building a Term-Document Matrix 100
10.5 Frequent Terms and Associations 101
10.6 Word Cloud 103
10.7 Clustering Words 104
10.8 Clustering Tweets 105
10.8.1 Clustering Tweets with the k-means Algorithm 106
10.8.2 Clustering Tweets with the k-medoids Algorithm 107
10.9 Packages, Further Readings and Discussions 109
11 Social Network Analysis 111
11.1 Network of Terms 111
11.2 Network of Tweets 114
11.3 Two-Mode Network 119
11.4 Discussions and Further Readings 122
12 Case Study I: Analysis and Forecasting of House Price Indices 125
13 Case Study II: Customer Response Prediction and Profit Optimization 127
14 Case Study III: Predictive Modeling of Big Data with Limited Memory 129
15 Online Resources 131
15.1 R Reference Cards 131
15.2 R 131
15.3 Data Mining 132
15.4 Data Mining with R 133
15.5 Classification/Prediction with R 133
15.6 Time Series Analysis with R 134
15.7 Association Rule Mining with R 134
15.8 Spatial Data Analysis with R 134
15.9 Text Mining with R 134
15.10 Social Network Analysis with R 134
15.11 Data Cleansing and Transformation with R 135
15.12 Big Data and Parallel Computing with R 135
Bibliography 137
General Index 143
Package Index 145
Function Index 147
New Book Promotion 149