Michael E. Byczek
Software Engineer

Algorithms and Data Structures

A data structure represents the way that data is stored as a collection and the relationship between elements. How data is stored impacts how it can be accessed.

Common data structures:
An algorithm in computer science is the equivalent of a cooking recipe: instructions on how to solve a problem.

In software development terms, an algorithm is logic that takes input to generate output.

Big-O notation is used to classify algorithms based on efficiency as the input grows in size. In other words, how well does an algorithm perform, time-wise, when you keep increasing the amount of input to process.

An example of an algorithm is a set of instructions to perform a search. Common search algorithms include sequential, binary, bubble, merge, and sort.

Assume a list contains the numbers (1, 26, 2, 25, 3, 20, 15, 10, 6).

A sequential search goes through each element from first to last. Assume the target is 15. A sequential search will take 7 steps to determine that the number 15 is within that list. If the target was 21 (which is not in the list), a sequential search will take 9 steps to conclude that it does not appear within that list.

For that reason, a sequential list is referred to as O(n) in Big-O Notation because as the number of elements increase, this algorithm will take increasing longer to go through every single element, especially with the worst case scenario of each target being the last element in the list. A list with 1 million elements would take 1 million steps.

The binary search algorithm works by splitting the task of locating the target into smaller tasks.

1. Order a list, such as by numbers in ascending order. The above example would be (1, 2, 3, 6, 10, 15, 20, 25, 26).

2. Locate the middle element, in this example the element 10.

3. Since all numbers are in order, we know that the target (15) is bigger than the middle element (10).

4. Split the list in half containing only the elements larger than the middle: (15, 20, 25, 26).

5. Repeat by starting with the new middle: 25. The reason 25 is the middle is because with an even number of elements, the middle is determined by the mathematical test (length of list // 2), which returns the index corresponding to 25.

6. The target (15) is less than the midpoint (25), so the list is shortened to (15, 20) representing all numbers less than the target.

7. The new midpoint is 20 (length // 2).

8. The target (15) is less than the midpoint (20), so the new list is (15).

9. With only one element left, the algorithm returns "true" since the target (15) has been located.

This algorithm took 4 steps (Checked number 10, then 25, then 20, then 15).

The Big-O Notation for a binary search is referred to as O(log n) because it is more efficient especially as the number of elements increase.

Algorithms for Data Science

The IEEE International Conference on Data Mining (http://www.cs.uvm.edu/~icdm/) list the top 10 most influential data mining algorithms. These algorithms were chosen from 18 finalist nominations that covered classification, clustering, statistical learning, association analysis, link mining, bagging & boosting, sequential patterns, integrated mining, rough sets and graph mining. These categories were considered the most important topics in data mining research and development.

(Classification Category)

Used to generate classifiers expressed as decision trees or more comprehensive ruleset form.

Decision trees (tree-construction algorithm): Given a set S of cases, C4.5 uses the divide-and-conquer algorithm to grow an initial tree. A pruning algorithm is used on the initial tree to avoid overfitting.

Ruleset classifiers (list of rules grouped together): Classified by finding the first rule whose conditions are satisfied by the case. These rulesets are formed from the unpruned decision tree. A hill-climbing algorithm is used to drop conditions.

CART (Classification and Regression Trees)
(Classification Category)

Binary recursive partitioning procedure capable of processing continuous and nominal attributes both as targets and predictors.

Data is used in raw form. Trees grow to maximal size without a stopping rule, then pruned back to the root with cost-complexity pruning. This is based on the split that contributes the least to the overall performance of the tree from training data.

The result is a sequence of nested pruned trees, all of which are candidate optimal trees. Selected based on the predictive performance of every tree. A possible conclusion is that none of the trees are the best.

k-nearest neighbor classification (kNN)
(Classification Category)

Finds a group of k objects in the training set closest to the test object with the predominance of a particular class in this neighborhood.

There are three elements: set of stored records; metric to compute distance between objects; and number of nearest neighbors (k).

Choosing a k value too small or too large will affect performance. Another issue is the choice of distance measure.

Naive Bayes
(Classification Category)

An important algorithm for supervised classification. Used to construct a rule to assign future objects to a class given only the vector of variables that describe future objects. This assumes that there already exists a set of objects that belong to a known class.

Readily applied to huge data sets because it does not require any complicated iterative parameter estimation schemes. It is also easy to understand why a classification was made.

Often used for text classification and spam filtering.

SVM (Support vector machines)
(Statistical Learning Category)

One of the most robust and accurate methods for machine learning. Only requires a dozen examples and insensitive to the number of dimensions.

In a two-class learning task, SVM is used to find the best classification function to distinguish between members of the two classes from the training data.

Maximum margin hyperplanes is an important part of the algorithm because it offers the best generalization ability. This allows the best accuracy on the training data and opportunity for the correct classification of future data.

EM (Expectation-Maximization)
(Statistical Learning Category)

Normal mixture models are used to cluster continuous data (random phenomena) and estimate the underlying density function. These models are fitted by maximum likelihood using the EM algorithm.

(Association Analysis Category)

Simple and easy to implement. Data mining often uses this algorithm as the first step.

Seminal algorithm to find frequent itemsets using candidate generation. A common data mining procedure is to find frequent itemsets from a transaction dataset and to derive association rules.

The algorithm achieves good performance by reducing the size of candidate sets, but may generate a huge number of results and repeatedly scan the database if there are many frequent itemsets, large itemsets, or low minimum support.

(Link Mining Category)

Search ranking algorithm using Web hyperlinks. Google was originally based on this algorithm.

The algorithm produces a static ranking of web pages that does not depend on search queries. This is based on the link structure of the Internet to determine the quality of a particular page (Internet democracy). The more times that a website is linked to by other sites increases the rank. The importance of a referring page also determines the weight that link represents.

(Clustering Category)

The most widely used partition clustering algorithm. It is an iterative method to partition a given dataset into a user-specified number of clusters.

One disadvantage is that the algorithm falters when the data is not well described by reasonably separated spherical balls (non-covex shaped clusters in the data). A solution is to rescale the data or use a different distance measure.

Another issue is the presence of outliers, which can be fixed with a preprocessing step to eliminate small or merge close clusters.

(Bagging and Boosting Category)

One of the most important ensemble methods using multiple learners to solve a problem. It has a solid theoretical foundation, very accurate prediction, and great simplicity (only needs 10 lines of code).

Based on principles that a weak learning algorithm that performs only slightly better than a random guess can be "boosted" into an accurately strong algorithm.