Michael E. Byczek, Technical Consultant

Python Analytics

Expertise with all Python packages and modules for data science analysis.

Scientific Computation

SciPy

A collection of mathematical, scientific, and engineering packages that includes NumPy, SymPy, matplotlib, IPython, and pandas.

NumPy

A fundamental scientific computing package that offers an N-dimensional array object, broadcasting functions, integration with C/C++ and Fortran code, linear algebra, Fourier transforms, a multi-dimensional container for generic data, arbitrary data types, and database integration.
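
As a minimal sketch of the N-dimensional array, broadcasting, and linear algebra features listed above (the array values and variable names are illustrative assumptions):

```python
import numpy as np

# A 2-D array: 3 samples (rows) by 2 measurements (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all 3 rows.
col_means = data.mean(axis=0)
centered = data - col_means

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
```

Broadcasting avoids writing an explicit loop over rows: the subtraction applies the same column means to every row.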

pandas

Data structures and data analysis tools similar to what is offered by default in the R language. Features include reading and writing data in various formats (e.g. CSV, Excel, and SQL databases), data alignment, handling of missing data, pivoting of data sets, size mutability, slicing, subsetting large data sets, split-apply-combine operations, merging and joining data sets, and time-series functionality.
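
The missing-data, split-apply-combine, and merge features might look like the following sketch. The table contents, column names, and manager names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; the missing amount becomes NaN.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100.0, None, 50.0, 70.0],
})

# Handle missing data: replace NaN with 0.0.
sales["amount"] = sales["amount"].fillna(0.0)

# Split-apply-combine: split by region, sum each group, combine into a Series.
totals = sales.groupby("region")["amount"].sum()

# Merge/join: attach a manager name to each region total.
managers = pd.DataFrame({"region": ["east", "west"],
                         "manager": ["Ann", "Bob"]})
report = totals.reset_index().merge(managers, on="region")
```

Filling missing values with zero is only one possible policy; dropping the rows or interpolating may suit other data better.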

IPython

Provides an interactive shell, data visualization, GUI tools, and parallel computing. Used for advanced statistics and quantum mechanics. Also acts as a kernel for Jupyter.

Math and Statistics

SymPy

Symbolic mathematics and a full-featured computer algebra system. Statistics capabilities include probability, probability density, expected value/variance, and random variable types. The package is used for solving equations, calculus, matrices, and discrete math.
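
A short sketch of equation solving, calculus, and the statistics capabilities described above. The specific equation and distribution parameters are chosen only for illustration.

```python
import sympy as sp
from sympy.stats import Normal, E, variance

x = sp.symbols("x")

# Solve the equation x**2 - 5*x + 6 = 0 symbolically.
roots = sp.solve(x**2 - 5*x + 6, x)

# Calculus: differentiate and integrate symbolically.
derivative = sp.diff(x * sp.sin(x), x)
integral = sp.integrate(2 * x, (x, 0, 3))

# Statistics: a normal random variable with mean 1 and standard deviation 2.
Z = Normal("Z", 1, 2)
mean_Z = E(Z)
var_Z = variance(Z)
```

The results are exact symbolic values rather than floating-point approximations.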

Statsmodels

Explore data, estimate statistical models and perform statistical tests. This includes descriptive statistics, statistical tests, plotting, and result statistics. Features: linear regression, time series, nonparametric estimators, and unit tests for correctness of results.

Machine Learning

scikit-learn

Machine learning capabilities built on NumPy, SciPy, and matplotlib. Used for data mining and data analysis: classification (identifying which category an object belongs to), regression (predicting a continuous-valued attribute associated with an object), clustering (grouping similar items into sets), dimensionality reduction (reducing the number of random variables), model selection (comparing, validating, and choosing parameters/models), and preprocessing (feature extraction and normalization).
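
As a rough sketch of preprocessing followed by classification, assuming a hypothetical one-feature data set with two well-separated classes (logistic regression is chosen here only as an example classifier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two well-separated classes.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Preprocessing: scale the feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Classification: fit a model, then predict the class of new observations.
clf = LogisticRegression()
clf.fit(X_scaled, y)
new_points = scaler.transform(np.array([[1.5], [11.5]]))
predictions = clf.predict(new_points)
```

The same scaler fitted on the training data is reused for the new observations, so both pass through an identical transformation.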

SHOGUN

Designed for unified large-scale learning for classification, regression, and explorative data analysis. A primary feature is the unified interface from multiple languages, such as Python, R, Java, and C++. Other capabilities include clustering, metric learning, structured output learning, and online learning algorithms.

PyBrain

Offers flexible and easy-to-use algorithms and the ability to test and compare these methods. The software is designed for both entry-level students and state-of-the-art research. Algorithms include neural networks, reinforcement learning, unsupervised learning, black box optimization, and evolution.

PyMC

Implements the Metropolis-Hastings algorithm as a statistical package for Markov Chain Monte Carlo sampling. Includes methods for summarizing output, plotting, goodness-of-fit, and convergence diagnostics. Intended to provide efficient Bayesian analysis.

Plotting and Visualization

matplotlib

2D plotting for publication-quality figures in hardcopy formats. Used in Python scripts, the Python and IPython shells (in the manner of MATLAB or Mathematica), web application servers, and graphical user interfaces. Generates plots, histograms, power spectra, bar charts, error charts, and scatterplots.

Bokeh

Interactive visualization library for web browsers. Used to build elegant graphics with interactivity over very large or streaming data applications.

ggplot

Plotting system based on ggplot2, which is available for R. Used to make professional-quality plots with minimal code. Not intended for highly customized data visualizations. Multiple layers can be combined, such as points, lines, and trend lines.

Plotly

Used for dashboards, scatter plots, charts (line, bubble, bar, pie), time series, treemaps, and tables. Statistical features include error bars, box plots, histograms, 2D density plots, and distplots. 3D plots include wireframe, point clustering, parametric, scatter, surface, ribbon, and filled line.

prettyplotlib

Used to enhance matplotlib plots through color perception and information design.

Seaborn

Visualization library based on matplotlib for drawing attractive statistical graphics. Also supports NumPy and pandas data structures along with statistical routines from SciPy and statsmodels. Benefits include the ability to reveal patterns in data, compare subsets, discover structure in matrices, and represent the uncertainty of time series estimation.

Data Formatting and Storage

csvkit

Suite of tools for converting and working with CSV files, such as converting from Excel or JSON to CSV. Features include selecting a subset of columns, finding rows with matching cells, reordering columns, summary statistics, and SQL queries.

PyTables

Used to manage hierarchical datasets and handle extremely large amounts of data. Makes it possible to interactively browse, process, and search data while optimizing memory and disk resources. Features include table entities, multidimensional/nested table cells, indexing for columns of tables, numerical arrays, and variable-length arrays.

SQLite3

SQLite is a C library that provides a lightweight disk-based database that doesn't require a separate server process. SQLite can be used for internal data storage. Methods exist for added security, such as using parameter placeholders instead of Python string operations, which are vulnerable to SQL injection attacks.
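
The injection-safe approach can be sketched with Python's standard sqlite3 module. The table, column names, and the malicious input string are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users (name, role) VALUES (?, ?)",
    [("alice", "admin"), ("bob", "guest")],
)

# Hypothetical untrusted input that attempts an injection.
user_input = "bob' OR '1'='1"

# Building the SQL text with string formatting would let this input rewrite
# the query. The ? placeholder passes the value strictly as data instead.
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

# A legitimate lookup uses exactly the same pattern.
legit = conn.execute(
    "SELECT role FROM users WHERE name = ?", ("bob",)
).fetchall()
conn.close()
```

Because the malicious string is compared literally against the name column, it matches no rows.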

Other useful packages

mrjob

Used to write and run Hadoop Streaming jobs and supports the AWS Elastic MapReduce service (EMR). Used to run jobs on EMR, a private Hadoop cluster, or locally for testing.

PyParsing

An alternative approach to creating and executing simple grammars, compared with the traditional lex/yacc method or regular expressions.

dateutil

Extends the standard datetime Python module. Used for relative deltas, computing dates based on recurrence rules, parsing, timezones, and determining the date of Easter Sunday.
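
A brief sketch of the relative delta, recurrence rule, parsing, and Easter features. The dates are arbitrary examples.

```python
from datetime import date

from dateutil import parser
from dateutil.easter import easter
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

# Relative delta: one month after January 31 clamps to the end of February.
end_of_feb = date(2016, 1, 31) + relativedelta(months=1)

# Parsing plus a recurrence rule: three monthly occurrences from a start date.
start = parser.parse("2016-01-15")
monthly = list(rrule(MONTHLY, count=3, dtstart=start))

# Easter Sunday for a given year (Western calendar by default).
easter_2016 = easter(2016)
```

The recurrence rule yields datetime objects, so calling .date() on each item gives plain calendar dates.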


Copyright © 2016. Michael E. Byczek. All Rights Reserved.