In this course we have presented basic statistical data analysis with Python. However, Python has much more to offer: a number of Python packages allow you to significantly extend your statistical data analysis and modeling. In the following, I want to give a very brief overview of the most interesting and powerful ones that I have found so far:
- statsmodels
- PyMC
- scikit-learn
- A. Dobson: “An Introduction to Generalized Linear Models”
statsmodels
statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python. Features include:
- Linear Regression
- Generalized Linear Models
- Generalized Estimating Equations
- Robust Linear Models
- Linear Mixed Effects Models
- Regression with Discrete Dependent Variables
- ANOVA
- Time Series analysis
- Models for Survival and Duration Analysis
- Statistics (e.g. Multiple Tests, Sample Size Calculations etc.)
- Nonparametric Methods
- Generalized Method of Moments
- Empirical Likelihood
- Graphics functions
- A Datasets Package
A first introduction to statistical modeling, as well as some examples, is presented in the chapter “Statistical Models”.
PyMC: Bayesian Statistics and Markov Chain Monte Carlo Modeling
PyMC is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics.
PyMC provides functionalities to make Bayesian analysis as painless as possible. Here is a short list of some of its features:
- Fits Bayesian statistical models with Markov chain Monte Carlo and other algorithms.
- Includes a large suite of well-documented statistical distributions.
- Uses NumPy for numerics wherever possible.
- Includes a module for modeling Gaussian processes.
- Sampling loops can be paused and tuned manually, or saved and restarted later.
- Creates summaries including tables and plots.
- Traces can be saved to disk as plain text, Python pickles, SQLite or MySQL databases, or HDF5 archives.
- Several convergence diagnostics are available.
- Extensible: easily incorporates custom step methods and unusual probability distributions.
- MCMC loops can be embedded in larger programs, and results can be analyzed with the full power of Python.
A free ebook on Bayesian methods that I can warmly recommend, and which also provides a very good introduction to PyMC, is Probabilistic Programming & Bayesian Methods for Hackers (http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/).
An introduction to Bayesian statistics, and an example from the “Bayesian Methods for Hackers” book, are presented in the chapter “Bayesian Statistics”.
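To give a feeling for what "fitting a Bayesian model with Markov chain Monte Carlo" means, here is a small sketch of a random-walk Metropolis sampler, one of the simplest MCMC algorithms. It deliberately does not use PyMC, whose API differs between versions; it relies only on the Python standard library. The data, the Normal(0, 10) prior, the known standard deviation and all function names are assumptions chosen for illustration.

```python
import math
import random

random.seed(42)

# Synthetic observations (an assumption): 100 draws from Normal(mean=5, sd=2)
data = [random.gauss(5.0, 2.0) for _ in range(100)]
SIGMA = 2.0  # standard deviation of the data, treated as known


def log_posterior(mu):
    """Unnormalized log posterior of mu: Normal(0, 10) prior times Normal likelihood."""
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_likelihood = -0.5 * sum(((x - mu) / SIGMA) ** 2 for x in data)
    return log_prior + log_likelihood


def metropolis(n_samples, start=0.0, step=0.5):
    """Random-walk Metropolis sampler; returns the whole chain as a list."""
    chain = []
    mu = start
    logp = log_posterior(mu)
    for _ in range(n_samples):
        # Propose a new value by a symmetric Gaussian step
        proposal = mu + random.gauss(0.0, step)
        logp_new = log_posterior(proposal)
        # Accept with probability min(1, p_new / p_old)
        if random.random() < math.exp(min(0.0, logp_new - logp)):
            mu, logp = proposal, logp_new
        chain.append(mu)
    return chain


chain = metropolis(5000)
burned = chain[1000:]  # discard the first samples as burn-in
posterior_mean = sum(burned) / len(burned)
print(f"posterior mean of mu: {posterior_mean:.2f}")
```

Because the prior is very wide compared with the information in the data, the posterior mean ends up close to the sample mean. A package such as PyMC takes care of the sampler, the tuning of the step size and the convergence diagnostics, so that only the model has to be specified.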
scikit-learn
scikit-learn is arguably the most advanced open-source machine learning package available. It offers simple and efficient tools for data mining and data analysis, covering supervised as well as unsupervised learning.
It provides tools for:
- Classification: identifying which category a new observation belongs to.
- Regression: predicting a continuous value for a new example.
- Clustering: automatic grouping of similar objects into sets.
- Dimensionality reduction: reducing the number of random variables to consider.
- Model selection: comparing, validating and choosing parameters and models.
- Preprocessing: feature extraction and normalization.
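The following sketch combines three of the tasks listed above (preprocessing, classification, and a simple form of model validation) in a few lines. It uses the iris dataset that ships with scikit-learn; the choice of logistic regression as the classifier and the 75/25 split are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold back 25% of the data to validate the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (standardization) and classification chained in one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of correctly classified flowers in the held-out data
accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

All scikit-learn estimators share the same `fit` / `predict` / `score` interface, so the classifier in the pipeline can be exchanged for another one without changing the rest of the code.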
Generalized Linear Models
This is not really a Python package, but rather a book. However, this book by Annette Dobson has made Generalized Linear Models (GLMs) understandable and accessible for me (A. Dobson, An Introduction to Generalized Linear Models, John Wiley & Sons, 2008). While the book presents solutions for the models in R and Stata, I have developed Python solutions for almost all examples in the book (https://github.com/thomas-haslwanter/dobson).