欢迎光临
免费的PDF电子书下载网站

数据挖掘导论(英文版)/图灵原版计算机科学系列 PDF下载

编辑推荐

“这是一本全新的数据挖掘教材,值得大力推荐。”

——Jiawei Han,伊利诺伊大学教授

本书全面介绍了数据挖掘,涵盖了五个主题:数据、分类、关联分析、聚类和异常检测。除异常检测外,每个主题都有两章:前一章涵盖基本概念、代表性算法和评估技术,而后一章讨论高级概念和算法。这样读者在透彻地理解数据挖掘的基础的同时,还能够了解更多重要的高级主题。

本书是明尼苏达大学和密歇根州立大学数据挖掘课程的教材,由于独具特色,正式出版之前就已经被斯坦福大学、得克萨斯大学奥斯汀分校等众多名校采用。

本书特色:

·与许多其他同类图书不同,本书将重点放在如何用数据挖掘知识解决各种实际问题。

·只要求具备很少的预备知识——不需要数据库背景,只需要很少的统计学或数学背景知识。

·书中包含大量的图表、综合示例和丰富的习题,并且使用示例、关键算法的简洁描述和习题,尽可能直接地聚集于数据挖掘的主要概念。

·教辅内容极为丰富,包括课程幻灯片、学生课题建议、数据挖掘资源(如数据挖掘算法和数据集)、联机指南(使用实际的数据集和数据分析软件,为本书介绍的部分数据挖掘技术提供例子讲解)。

·为采用本书作为教材的教师提供习题解答。 ;

内容简介

  本书对数据挖掘进行了全面介绍,旨在为读者提供将数据挖掘应用于实际问题所必需的知识。本书涵盖五个主题:数据、分类、关联分析、聚类和异常检测。除异常检测外,每个主题都有两章:前面一章讲述基本概念、代表性算法和评估技术,而后面一章较深入地讨论高级概念和算法。目的是在使读者透彻地理解数据挖掘基础的同时,还能了解更多重要的高级主题。此外,书中还提供了大量例子、图表和习题。

  本书适合作为相关专业高年级本科生和研究生数据挖掘课程的教材,同时也可作为从事数据挖掘研究和应用开发工作的技术人员的参考书。

作者简介

Pang-Ning Tan 现为密歇根州立大学计算机与工程系助理教授,主要教授数据挖掘、数据库系统等课程。此前,他曾是明尼苏达大学美国陆军高性能计算研究中心副研究员(2002-2003)。 Michael Steinbach 明尼苏达大学计算机与工程系研究员,在读博士。 Vipin Kumar

数据挖掘导论(英文版)/图灵原版计算机科学系列 PDF下载

目录

1 Introduction 1
1.1 What Is Data Mining? 2
1.2 Motivating Challenges 3
1.3 The Origins of Data Mining 4
1.4 Data Mining Tasks 5
1.5 Scope and Organization of the Book 8
1.6 Bibliographic Notes 9
1.7 Exercises 12

2 Data 13
2.1 Types of Data 15
2.1.1 Attributes and Measurement 15
2.1.2 Types of Data Sets 20
2.2 Data Quality 25
2.2.1 Measurement and Data Collection Issues 26
2.2.2 Issues Related to Applications 31
2.3 Data Preprocessing 32
2.3.1 Aggregation 32
2.3.2 Sampling 34
2.3.3 Dimensionality Reduction 36
2.3.4 Feature Subset Selection 37
2.3.5 Feature Creation 39
2.3.6 Discretization and Binarization 41
2.3.7 Variable Transformation 45
2.4 Measures of Similarity and Dissimilarity 47
2.4.1 Basics 47
2.4.2 Similarity and Dissimilarity between Simple Attributes 49
2.4.3 Dissimilarities between Data Objects 50
2.4.4 Similarities between Data Objects 52
2.4.5 Examples of Proximity Measures 53
2.4.6 Issues in Proximity Calculation 58
2.4.7 Selecting the Right Proximity Measure 60
2.5 Bibliographic Notes 61
2.6 Exercises 64

3 Exploring Data 71
3.1 The Iris Data Set 71
3.2 Summary Statistics 72
3.2.1 Frequencies and the Mode 72
3.2.2 Percentiles 73
3.2.3 Measures of Location: Mean and Median 73
3.2.4 Measures of Spread: Range and Variance 75
3.2.5 Multivariate Summary Statistics 76
3.2.6 Other Ways to Summarize the Data 77
3.3 Visualization 77
3.3.1 Motivations for Visualization 77
3.3.2 General Concepts 78
3.3.3 Techniques 81
3.3.4 Visualizing Higher-Dimensional Data 90
3.3.5 Dos and Don s 94
3.4 OLAP and Multidimensional Data Analysis 95
3.4.1 Representing Iris Data as a Multidimensional Array 95
3.4.2 Multidimensional Data: The General Case 97
3.4.3 Analyzing Multidimensional Data 98
3.4.4 Final Comments on Multidimensional Data Analysis 101
3.5 Bibliographic Notes 102
3.6 Exercises 103

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 105
4.1 Preliminaries 105
4.2 General Approach to Solving a Classification Problem 107
4.3 Decision Tree Induction 108
4.3.1 How a Decision Tree Works 108
4.3.2 How to Build a Decision Tree 110
4.3.3 Methods for Expressing Attribute Test Conditions 112
4.3.4 Measures for Selecting the Best Split 114
4.3.5 Algorithm for Decision Tree Induction 119
4.3.6 An Example: Web Robot Detection 120
4.3.7 Characteristics of Decision Tree Induction 122
4.4 Model Overfitting 125
4.4.1 Overfitting Due to Presence of Noise 127
4.4.2 Overfitting Due to Lack of Representative Samples 129
4.4.3 Overfitting and the Multiple Comparison Procedure 129
4.4.4 Estimation of Generalization Errors 131
4.4.5 Handling Overfitting in Decision Tree Induction 134
4.5 Evaluating the Performance of a Classifier 135
4.5.1 Holdout Method 136
4.5.2 Random Subsampling 136
4.5.3 Cross-Validation 136
4.5.4 Bootstrap 137
4.6 Methods for Comparing Classifiers 137
4.6.1 Estimating a Confidence Interval for Accuracy 138
4.6.2 Comparing the Performance of Two Models 139
4.6.3 Comparing the Performance of Two Classifiers 140
4.7 Bibliographic Notes 141
4.8 Exercises 144

5 Classification: Alternative Techniques 151
5.1 Rule-Based Classifier 151
5.1.1 How a Rule-Based Classifier Works 153
5.1.2 Rule-Ordering Schemes 154
5.1.3 How to Build a Rule-Based Classifier 155
5.1.4 Direct Methods for Rule Extraction 155
5.1.5 Indirect Methods for Rule Extraction 161
5.1.6 Characteristics of Rule-Based Classifiers 163
5.2 Nearest-Neighbor classifiers 163
5.2.1 Algorithm 165
5.2.2 Characteristics of Nearest-Neighbor Classifiers 165
5.3 Bayesian Classifiers 166
5.3.1 Bayes Theorem 166
5.3.2 Using the Bayes Theorem for Classification 168
5.3.3 Na?ve Bayes Classifier 169
5.3.4 Bayes Error Rate 175
5.3.5 Bayesian Belief Networks 176
5.4 Artificial Neural Network (ANN) 181
5.4.1 Perceptron 181
5.4.2 Multilayer Artificial Neural Network 184
5.4.3 Characteristics of ANN 187
5.5 Support Vector Machine (SVM) 188
5.5.1 Maximum Margin Hyperplanes 188
5.5.2 Linear SVM: Separable Case 190
5.5.3 Linear SVM: Nonseparable Case 195
5.5.4 Nonlinear SVM 198
5.5.5 Characteristics of SVM 203
5.6 Ensemble Methods 203
5.6.1 Rationale for Ensemble Method 203
5.6.2 Methods for Constructing an Ensemble Classifier 204
5.6.3 Bias-Variance Decomposition 206
5.6.4 Bagging 209
5.6.5 Boosting 211
5.6.6 Random Forests 215
5.6.7 Empirical Comparison among Ensemble Methods 216
5.7 Class Imbalance Problem 217
5.7.1 Alternative Metrics 218
5.7.2 The Receiver Operating Characteristic Curve 220
5.7.3 Cost-Sensitive Learning 223
5.7.4 Sampling-Based Approaches 225
5.8 Multiclass Problem 226
5.9 Bibliographic Notes 228
5.10 Exercises 233

6 Association Analysis: Basic Concepts and Algorithms 241
6.1 Problem Definition 242
6.2 Frequent Itemset Generation 244
6.2.1 The Apriori Principle 246
6.2.2 Frequent Itemset Generation in the Apriori Algorithm 247
6.2.3 Candidate Generation and Pruning 249
6.2.4 Support Counting 252
6.2.5 Computational Complexity 255
6.3 Rule Generation 257
6.3.1 Confidence-Based Pruning 258
6.3.2 Rule Generation in Apriori Algorithm 258
6.3.3 An Example: Congressional Voting Records 259
6.4 Compact Representation of Frequent Itemsets 260
6.4.1 Maximal Frequent Itemsets 260
6.4.2 Closed Frequent Itemsets 262
6.5 Alternative Methods for Generating Frequent Itemsets 264
6.6 FP-Growth Algorithm 268
6.6.1 FP-Tree Representation 268
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm 270
6.7 Evaluation of Association Patterns 273
6.7.1 Objective Measures of Interestingness 274
6.7.2 Measures beyond Pairs of Binary Variables 282
6.7.3 Simpsons Paradox 283
6.8 Effect of Skewed Support Distribution 285
6.9 Bibliographic Notes 288
6.10 Exercises 298

7 Association Analysis: Advanced Concepts 307
7.1 Handling Categorical Attributes 307
7.2 Handling Continuous Attributes 309
7.2.1 Discretization-Based Methods 310
7.2.2 Statistics-Based Methods 312
7.2.3 Non-discretization Methods 314
7.3 Handling a Concept Hierarchy 316
7.4 Sequential Patterns 318
7.4.1 Problem Formulation 318
7.4.2 Sequential Pattern Discovery 320
7.4.3 Timing Constraints 323
7.4.4 Alternative Counting Schemes 327
7.5 Subgraph Patterns 328
7.5.1 Graphs and Subgraphs 329
7.5.2 Frequent Subgraph Mining 330
7.5.3 Apriori-like Method 332
7.5.4 Candidate Generation 333
7.5.5 Candidate Pruning 338
7.5.6 Support Counting 340
7.6 Infrequent Patterns 340
7.6.1 Negative Patterns 341
7.6.2 Negatively Correlated Patterns 342
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 343
7.6.4 Techniques for Mining Interesting Infrequent Patterns 344
7.6.5 Techniques Based on Mining Negative Patterns 345
7.6.6 Techniques Based on Support Expectation 347
7.7 Bibliographic Notes 350
7.8 Exercises 353

8 Cluster Analysis: Basic Concepts and Algorithms 363
8.1 Overview 365
8.1.1 What Is Cluster Analysis? 365
8.1.2 Different Types of Clusterings 366
8.1.3 Different Types of Clusters 368
8.2 K-means 370
8.2.1 The Basic K-means Algorithm 371
8.2.2 K-means: Additional Issues 378
8.2.3 Bisecting K-means 380
8.2.4 K-means and Different Types of Clusters 381
8.2.5 Strengths and Weaknesses 383
8.2.6 K-means as an Optimization Problem 383
8.3 Agglomerative Hierarchical Clustering 385
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 385
8.3.2 Specific Techniques 387
8.3.3 The Lance-Williams Formula for Cluster Proximity 391
8.3.4 Key Issues in Hierarchical Clustering 391
8.3.5 Strengths and Weaknesses 393
8.4 DBSCAN 393
8.4.1 Traditional Density: Center-Based Approach 393
8.4.2 The DBSCAN Algorithm 394
8.4.3 Strengths and Weaknesses 398
8.5 Cluster Evaluation 398
8.5.1 Overview 399
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 401
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 406
8.5.4 Unsupervised Evaluation of Hierarchical Clustering 408
8.5.5 Determining the Correct Number of Clusters 409
8.5.6 Clustering Tendency 410
8.5.7 Supervised Measures of Cluster Validity 411
8.5.8 Assessing the Significance of Cluster Validity Measures 414
8.6 Bibliographic Notes 416
8.7 Exercises 419

9 Cluster Analysis: Additional Issues and Algorithms 427
9.1 Characteristics of Data, Clusters, and Clustering Algorithms 427
9.1.1 Example: Comparing K-means and DBSCAN 428
9.1.2 Data Characteristics 429
9.1.3 Cluster Characteristics 430
9.1.4 General Characteristics of Clustering Algorithms 431
9.2 Prototype-Based Clustering 433
9.2.1 Fuzzy Clustering 433
9.2.2 Clustering Using Mixture Models 437
9.2.3 Self-Organizing Maps (SOM) 446
9.3 Density-Based Clustering 451
9.3.1 Grid-Based Clustering 451
9.3.2 Subspace Clustering 454
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 457
9.4 Graph-Based Clustering 460
9.4.1 Sparsification 461
9.4.2 Minimum Spanning Tree (MST) Clustering 462
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 463
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 464
9.4.5 Shared Nearest Neighbor Similarity 468
9.4.6 The Jarvis-Patrick Clustering Algorithm 471
9.4.7 SNN Density 472
9.4.8 SNN Density-Based Clustering 473
9.5 Scalable Clustering Algorithms 475
9.5.1 Scalability: General Issues and Approaches 476
9.5.2 BIRCH 477
9.5.3 CURE 479
9.6 Which Clustering Algorithm? 482
9.7 Bibliographic Notes 484
9.8 Exercises 488

10 Anomaly Detection 491
10.1 Preliminaries 492
10.1.1 Causes of Anomalies 492
10.1.2 Approaches to Anomaly Detection 493
10.1.3 The Use of Class Labels 494
10.1.4 Issues 495
10.2 Statistical Approaches 496
10.2.1 Detecting Outliers in a Univariate Normal Distribution 497
10.2.2 Outliers in a Multivariate Normal Distribution 499
10.2.3 A Mixture Model Approach for Anomaly Detection 500
10.2.4 Strengths and Weaknesses 502
10.3 Proximity-Based Outlier Detection 502
10.3.1 Strengths and Weaknesses 503
10.4 Density-Based Outlier Detection 504
10.4.1 Detection of Outliers Using Relative Density 505
10.4.2 Strengths and Weaknesses 506
10.5 Clustering-Based Techniques 506
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster 507
10.5.2 Impact of Outliers on the Initial Clustering 509
10.5.3 The Number of Clusters to Use 509
10.5.4 Strengths and Weaknesses 509
10.6 Bibliographic Notes 510
10.7 Exercises 513

数据挖掘导论(英文版)/图灵原版计算机科学系列 pdf下载声明

本pdf资料下载仅供个人学习和研究使用,不能用于商业用途,请在下载后24小时内删除。如果喜欢,请购买正版

pdf下载地址

版权归出版社和作者所有,下载链接已删除。如果喜欢,请购买正版!

链接地址:数据挖掘导论(英文版)/图灵原版计算机科学系列