Michael E. Byczek, Technical Consultant

Algorithms for Data Science

My expertise includes using the best algorithms to analyze data quickly and efficiently. The IEEE International Conference on Data Mining (http://www.cs.uvm.edu/~icdm/) lists the top 10 most influential data mining algorithms. These algorithms were chosen from 18 finalist nominations covering classification, clustering, statistical learning, association analysis, link mining, bagging & boosting, sequential patterns, integrated mining, rough sets, and graph mining. These categories were considered the most important topics in data mining research and development.

C4.5
(Classification Category)

Used to generate classifiers expressed either as decision trees or in the more comprehensible ruleset form.

Decision trees (tree-construction algorithm): Given a set S of cases, C4.5 uses a divide-and-conquer algorithm to grow an initial tree. A pruning algorithm is then applied to the initial tree to avoid overfitting.

Ruleset classifiers (a list of rules grouped together): A case is classified by finding the first rule whose conditions it satisfies. These rulesets are formed from the unpruned decision tree, and a hill-climbing algorithm is used to drop conditions from each rule.
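
As an illustration, the following is a minimal Python sketch of the gain-ratio criterion that C4.5 uses to choose splits. The function names and the toy weather-style data are assumptions made for this example, not part of Quinlan's original implementation.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(attribute_values, labels):
        """C4.5's splitting criterion: information gain divided by split info."""
        n = len(labels)
        subsets = {}
        for v, y in zip(attribute_values, labels):
            subsets.setdefault(v, []).append(y)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        gain = entropy(labels) - remainder
        split_info = entropy(attribute_values)  # entropy of the partition itself
        return gain / split_info if split_info > 0 else 0.0

    # Hypothetical toy data: an "outlook" attribute vs. play/don't-play labels.
    outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
    play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
    print(gain_ratio(outlook, play))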

CART (Classification and Regression Trees)
(Classification Category)

Binary recursive partitioning procedure capable of processing continuous and nominal attributes both as targets and predictors.

Data is used in raw form. The tree is grown to maximal size without a stopping rule and then pruned back toward the root with cost-complexity pruning, which repeatedly removes the split that contributes least to the overall performance of the tree on the training data.

The result is a sequence of nested pruned trees, all of which are candidate optimal trees. The “right sized” tree is selected by evaluating the predictive performance of every tree in the sequence. A possible conclusion is that no single tree is the “best”.
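
This grow-then-prune procedure can be sketched with scikit-learn, whose DecisionTreeClassifier is a CART-style implementation (the library and the iris dataset here are assumptions for illustration). Each alpha on the pruning path corresponds to one nested subtree, and cross-validated accuracy selects the “right sized” one.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Grow a maximal tree, then compute the cost-complexity pruning path:
    # each alpha corresponds to one nested pruned subtree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

    # Pick the "right sized" tree by cross-validated predictive performance.
    best_alpha, best_score = max(
        ((a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean())
         for a in path.ccp_alphas),
        key=lambda t: t[1],
    )
    print(f"alpha={best_alpha:.4f}, cv accuracy={best_score:.3f}")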

k-nearest neighbor classification (kNN)
(Classification Category)

Finds the group of k objects in the training set closest to the test object and assigns a label based on the predominant class in this neighborhood.

There are three elements: a set of stored records; a metric to compute the distance between objects; and the number of nearest neighbors (k).

Choosing a value of k that is too small makes the classifier sensitive to noise, while one that is too large allows distant objects from other classes to influence the vote. Another issue is the choice of distance measure.
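
A minimal from-scratch sketch of these three elements, Euclidean distance plus a majority vote among the k nearest stored records; the function name and toy points are illustrative assumptions.

    import math
    from collections import Counter

    def knn_predict(train_X, train_y, test_point, k=3):
        """Label the test point by majority vote among its k nearest neighbors."""
        dists = sorted(
            (math.dist(x, test_point), y) for x, y in zip(train_X, train_y)
        )
        votes = Counter(label for _, label in dists[:k])
        return votes.most_common(1)[0][0]

    train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.2)]
    train_y = ["a", "a", "b", "b"]
    print(knn_predict(train_X, train_y, (7.5, 7.9), k=3))  # -> "b"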

Naive Bayes
(Classification Category)

An important algorithm for supervised classification. It is used to construct a rule that assigns future objects to a class, given only the vector of variables describing those objects. This assumes that a set of objects with known class labels already exists.

Readily applied to huge data sets because it does not require any complicated iterative parameter estimation schemes. It is also easy to understand why a classification was made.

Often used for text classification and spam filtering.
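
A brief sketch of Naive Bayes spam filtering, assuming scikit-learn's MultinomialNB; the four-message corpus is invented purely for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts  = ["win a free prize now", "meeting agenda attached",
              "free money click now", "lunch tomorrow with the team"]
    labels = ["spam", "ham", "spam", "ham"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)        # word-count feature vectors
    model = MultinomialNB().fit(X, labels)     # learns per-class word probabilities

    print(model.predict(vectorizer.transform(["free prize tomorrow"])))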

SVM (Support vector machines)
(Statistical Learning Category)

One of the most robust and accurate methods in machine learning. It requires only a dozen examples for training and is insensitive to the number of dimensions.

In a two-class learning task, SVM finds the best classification function to distinguish between members of the two classes in the training data.

The maximum-margin hyperplane is an important part of the algorithm because it offers the best generalization ability: it achieves the best accuracy on the training data while leaving the most room for the correct classification of future data.
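
A short sketch of a two-class maximum-margin classifier, assuming scikit-learn's SVC with a linear kernel; the toy points are illustrative. Only the support vectors, the examples that define the margin, are retained by the model.

    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [4, 5], [5, 4]]    # two separable clusters
    y = [0, 0, 1, 1]

    clf = SVC(kernel="linear").fit(X, y)    # finds the maximum-margin hyperplane
    print(clf.support_vectors_)             # only the margin-defining examples
    print(clf.predict([[0.5, 0.5], [5, 5]]))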

EM (Expectation-Maximization)
(Statistical Learning Category)

Normal mixture models are used to cluster continuous data (random phenomena) and to estimate the underlying density function. These models are fitted by maximum likelihood using the EM algorithm, which alternates between an expectation (E) step and a maximization (M) step until the likelihood converges.
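
A sketch of fitting a two-component normal mixture by maximum likelihood, assuming scikit-learn's GaussianMixture, which runs EM internally; the sample data are drawn from two known normals so the recovered parameters can be checked.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Illustrative data drawn from two overlapping normal densities.
    data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1.5, 200)])

    gm = GaussianMixture(n_components=2).fit(data.reshape(-1, 1))
    print(gm.means_.ravel(), gm.covariances_.ravel())  # estimated parameters
    print(gm.predict([[0.2], [4.8]]))                  # cluster assignments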

Apriori
(Association Analysis Category)

Simple and easy to implement; data mining projects often use this algorithm as a first step.

A seminal algorithm for finding frequent itemsets using candidate generation. A common data mining procedure is to find the frequent itemsets in a transaction dataset and derive association rules from them.

The algorithm achieves good performance by reducing the size of candidate sets, but it may generate a huge number of candidate itemsets and repeatedly scan the database when there are many frequent itemsets, long itemsets, or a low minimum support threshold.
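
A minimal from-scratch sketch of the level-wise candidate-generation loop over a toy transaction list; the data and the absolute support threshold are illustrative assumptions, and the subset-based pruning of candidates is omitted for brevity.

    transactions = [{"milk", "bread"}, {"milk", "diapers", "beer"},
                    {"bread", "diapers", "beer"}, {"milk", "bread", "diapers"}]
    min_support = 2  # absolute count

    def frequent_itemsets(transactions, min_support):
        items = {frozenset([i]) for t in transactions for i in t}
        level = {c for c in items
                 if sum(c <= t for t in transactions) >= min_support}
        result = set(level)
        k = 2
        while level:
            # Join step: size-k candidates from unions of frequent (k-1)-itemsets.
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # Keep only candidates that meet the minimum support count.
            level = {c for c in candidates
                     if sum(c <= t for t in transactions) >= min_support}
            result |= level
            k += 1
        return result

    for itemset in sorted(frequent_itemsets(transactions, min_support), key=len):
        print(set(itemset))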

PageRank
(Link Mining Category)

A search ranking algorithm that uses Web hyperlinks. Google's search engine was originally based on this algorithm.

The algorithm produces a static ranking of web pages that does not depend on search queries. It uses the link structure of the Web to determine the quality of a particular page (a form of Internet democracy): the more often a page is linked to by other sites, the higher its rank, and the importance of each referring page determines the weight its link carries.
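
A minimal power-iteration sketch over a tiny hypothetical link graph; the damping factor of 0.85 follows the original paper, and the graph itself is invented for illustration (dangling pages with no outgoing links are not handled).

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                # A page's rank is the damped sum of the shares passed along
                # by every page that links to it.
                incoming = sum(rank[q] / len(links[q])
                               for q in pages if p in links[q])
                new_rank[p] = (1 - damping) / len(pages) + damping * incoming
            rank = new_rank
        return rank

    links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}  # who links to whom
    print(pagerank(links))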

k-means
(Clustering Category)

The most widely used partitional clustering algorithm. It is an iterative method to partition a given dataset into a user-specified number of clusters.

One disadvantage is that the algorithm falters when the data is not well described by reasonably separated spherical clusters (non-convex cluster shapes in the data). A solution is to rescale the data or use a different distance measure.

Another issue is the presence of outliers, which can be handled with a preprocessing step, along with postprocessing to eliminate small clusters or merge close ones.
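
A from-scratch sketch of the standard Lloyd iteration, alternating an assignment step and a centroid-update step, using NumPy; the 2-D toy points are illustrative, and the sketch does not guard against a cluster becoming empty.

    import numpy as np

    def kmeans(points, k, iterations=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids at k distinct random data points.
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iterations):
            # Assignment step: each point joins its nearest centroid.
            labels = np.linalg.norm(points[:, None] - centroids,
                                    axis=2).argmin(axis=1)
            # Update step: each centroid moves to the mean of its cluster.
            centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        return centroids, labels

    points = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], float)
    print(kmeans(points, k=2))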

AdaBoost
(Bagging and Boosting Category)

One of the most important ensemble methods, using multiple learners to solve a single problem. It has a solid theoretical foundation, makes very accurate predictions, and is remarkably simple (a basic implementation needs only about ten lines of code).

Based on the principle that a weak learning algorithm, one that performs only slightly better than random guessing, can be “boosted” into an arbitrarily accurate strong learner.
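
A sketch of this boosting principle, assuming scikit-learn: depth-1 decision stumps are individually weak, yet AdaBoostClassifier combines them into a markedly more accurate ensemble (the synthetic dataset is generated purely for illustration).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    stump = DecisionTreeClassifier(max_depth=1)     # a weak learner
    boosted = AdaBoostClassifier(stump, n_estimators=100)

    print("stump alone:", cross_val_score(stump, X, y, cv=5).mean())
    print("boosted    :", cross_val_score(boosted, X, y, cv=5).mean())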


Copyright © 2016. Michael E. Byczek. All Rights Reserved.