Statistics API
This page provides API references for MinexPy's statistics functionality.
Statistical Analysis
minexpy.stats
Statistical analysis module for geoscience data.
This module provides comprehensive descriptive statistical metrics for analyzing geochemical and geological sample data. It includes both functional and class-based APIs for maximum flexibility.
Examples:
Basic usage with functions:
>>> import numpy as np
>>> import minexpy.stats as mstats
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> skew = mstats.skewness(data)
>>> kurt = mstats.kurtosis(data)
>>> summary = mstats.describe(data)
Class-based usage:
>>> from minexpy.stats import StatisticalAnalyzer
>>> import pandas as pd
>>>
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> results = analyzer.summary()
StatisticalAnalyzer
Comprehensive statistical analyzer for geoscience data.
This class provides a convenient, object-oriented interface for calculating multiple statistical metrics on geochemical data. It supports both single arrays and pandas DataFrames for batch analysis.
The class is designed to be intuitive and flexible, allowing researchers to quickly obtain comprehensive statistical summaries of their data with minimal code.
Attributes:
| Name | Type | Description |
|---|---|---|
| data | ndarray or DataFrame | The input data being analyzed. |
| is_dataframe | bool | True if the input is a DataFrame, False if it is an array. |
Examples:
Single array analysis:
>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)
DataFrame analysis (multiple columns):
>>> import pandas as pd
>>> from minexpy.stats import StatisticalAnalyzer
>>>
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> summary_df = analyzer.summary()
>>> print(summary_df)
Individual metric access:
>>> analyzer = StatisticalAnalyzer(data)
>>> skew = analyzer.skewness()
>>> kurt = analyzer.kurtosis()
>>> cv = analyzer.coefficient_of_variation()
Source code in minexpy/stats.py
__init__(data)
Initialize the statistical analyzer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like or DataFrame | Input data to analyze. Can be a 1D numpy array, a pandas Series, a pandas DataFrame (for multi-column analysis), or a Python list. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If input data is empty or contains only NaN values. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>>
>>> # Array input
>>> data = np.array([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(data)
>>>
>>> # Series input
>>> import pandas as pd
>>> series = pd.Series([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(series)
Source code in minexpy/stats.py
coefficient_of_variation()
Calculate coefficient of variation.
Returns:
| Type | Description |
|---|---|
| float or Series | CV value(s). For DataFrames, returns a Series. Returns NaN for columns/arrays with zero mean. |
Source code in minexpy/stats.py
iqr()
Calculate interquartile range.
Returns:
| Type | Description |
|---|---|
| float or Series | IQR value(s). For DataFrames, returns a Series. |
Source code in minexpy/stats.py
kurtosis(fisher=True)
Calculate kurtosis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fisher | bool | Use Fisher's definition (excess kurtosis) if True. | True |
Returns:
| Type | Description |
|---|---|
| float or Series | Kurtosis value(s). For DataFrames, returns a Series with the kurtosis of each column. |
Source code in minexpy/stats.py
mean()
Calculate mean.
Returns:
| Type | Description |
|---|---|
| float or Series | Mean value(s). For DataFrames, returns a Series. |
median()
Calculate median.
Returns:
| Type | Description |
|---|---|
| float or Series | Median value(s). For DataFrames, returns a Series. |
Source code in minexpy/stats.py
skewness()
Calculate skewness.
Returns:
| Type | Description |
|---|---|
| float or Series | Skewness value(s). For DataFrames, returns a Series with the skewness of each column. |
Source code in minexpy/stats.py
std(ddof=1)
Calculate standard deviation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ddof | int | Delta degrees of freedom. | 1 |
Returns:
| Type | Description |
|---|---|
| float or Series | Standard deviation value(s). For DataFrames, returns a Series. |
Source code in minexpy/stats.py
summary(as_dataframe=True, percentiles=None)
Calculate comprehensive statistical summary.
This method computes all major descriptive statistics for the data. For DataFrames, it calculates statistics for each column separately.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| as_dataframe | bool | If True, returns results as a pandas DataFrame (for DataFrame input) or Series (for array input). If False, returns a dictionary. | True |
| percentiles | list of float | Additional percentiles to include in the summary beyond the defaults. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame, Series, or dict | Statistical summary. For DataFrames, returns a DataFrame with the input columns as rows and statistics as columns. For arrays, returns a Series or dict with statistics as values. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)
Source code in minexpy/stats.py
variance(ddof=1)
Calculate variance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ddof | int | Delta degrees of freedom. | 1 |
Returns:
| Type | Description |
|---|---|
| float or Series | Variance value(s). For DataFrames, returns a Series. |
Source code in minexpy/stats.py
z_score(value=None)
Calculate z-scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | float | If provided, calculates the z-score for this specific value. Otherwise, returns z-scores for all data points. | None |
Returns:
| Type | Description |
|---|---|
| float, array, or DataFrame | Z-score(s). For DataFrames, returns a DataFrame with z-scores for each column. |
Source code in minexpy/stats.py
coefficient_of_variation(data)
Calculate the coefficient of variation (CV).
The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage or decimal. It is a normalized measure of dispersion that allows comparison of variability across different scales and units.
CV is particularly useful in geochemistry for:

- Comparing the variability of elements with different concentration ranges
- Identifying elements with high relative variability
- Assessing data quality and homogeneity
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| float | Coefficient of variation (std/mean). Returns NaN if the mean is zero. Typically expressed as a decimal (e.g., 0.25 = 25%). |
Examples:
>>> import numpy as np
>>> from minexpy.stats import coefficient_of_variation
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> cv = coefficient_of_variation(data)
>>> print(f"CV: {cv:.3f} ({cv*100:.1f}%)")
Notes
CV is dimensionless, making it ideal for comparing variability across elements or datasets measured in different units. A CV below 0.15 is often considered low variability, while a CV above 0.5 indicates high variability.
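As a sketch of how these rules of thumb can be applied in practice, the helper below computes the sample CV and labels it with the thresholds above. The function name `classify_variability` and the exact cutoffs are illustrative, not part of MinexPy:

```python
import numpy as np

def classify_variability(data):
    """Compute the sample CV and classify it with the rule-of-thumb
    thresholds (CV < 0.15 low, CV > 0.5 high); illustrative only."""
    arr = np.asarray(data, dtype=float)
    arr = arr[~np.isnan(arr)]          # drop NaN values, as minexpy does
    m = arr.mean()
    if m == 0:
        return float("nan"), "undefined (zero mean)"
    cv = arr.std(ddof=1) / m           # sample CV (std/mean)
    if cv < 0.15:
        label = "low"
    elif cv > 0.5:
        label = "high"
    else:
        label = "moderate"
    return cv, label

cv, label = classify_variability([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
print(f"CV: {cv:.3f} -> {label} variability")
```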
Source code in minexpy/stats.py
describe(data, percentiles=None)
Generate a comprehensive statistical summary of the data.
This function calculates all major descriptive statistics in a single call, providing a complete overview of the data distribution. It is particularly useful for initial data exploration in geochemical analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| percentiles | list of float | Additional percentiles to calculate beyond the default set of [25, 50, 75, 90, 95, 99]. Values should be between 0 and 100. | None |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary of all statistical metrics, keyed by: 'count' (number of observations), 'mean' (arithmetic mean), 'median' (50th percentile), 'std' (sample standard deviation), 'variance' (sample variance), 'min', 'max', 'range' (max - min), 'skewness', 'kurtosis' (Fisher's definition), 'coefficient_of_variation' (std/mean), 'q1' (25th percentile), 'q3' (75th percentile), 'iqr' (interquartile range), and 'percentile_X' for each additional percentile specified. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import describe
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])
>>> summary = describe(data)
>>> for key, value in summary.items():
... print(f"{key}: {value:.3f}")
>>>
>>> # With custom percentiles
>>> summary_custom = describe(data, percentiles=[10, 50, 90, 95])
Source code in minexpy/stats.py
iqr(data)
Calculate the interquartile range (IQR).
The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is a robust measure of spread that is not affected by outliers.
IQR is commonly used in geochemistry for:

- Identifying outliers (values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR)
- Describing data spread in skewed distributions
- Box plot construction
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| float | Interquartile range (Q3 - Q1). |
Examples:
>>> import numpy as np
>>> from minexpy.stats import iqr
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0])
>>> interquartile_range = iqr(data)
>>> print(f"IQR: {interquartile_range:.3f}")
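The 1.5×IQR fence rule listed above is easy to sketch with plain NumPy. The helper name `iqr_outliers` is illustrative, not part of MinexPy:

```python
import numpy as np

def iqr_outliers(data):
    """Flag values outside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    arr = np.asarray(data, dtype=float)
    arr = arr[~np.isnan(arr)]                  # exclude NaN, as minexpy does
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return arr[(arr < lower) | (arr > upper)]

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 85.0])  # 85.0 is anomalous
print(iqr_outliers(data))
```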
Source code in minexpy/stats.py
kurtosis(data, fisher=True)
Calculate the kurtosis of the data.
Kurtosis measures the "tailedness" of the probability distribution. It indicates the presence of outliers and the shape of the distribution tails compared to a normal distribution.
- High kurtosis (>0 with Fisher's definition): Heavy tails, more outliers
- Low kurtosis (<0 with Fisher's definition): Light tails, fewer outliers
- Normal distribution: Kurtosis = 0 (Fisher's) or 3 (Pearson's)
In geochemical analysis, high kurtosis often indicates the presence of anomalous values or multiple populations in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| fisher | bool | If True, uses Fisher's definition (excess kurtosis), under which a normal distribution has kurtosis 0. If False, uses Pearson's definition, under which a normal distribution has kurtosis 3. | True |
Returns:
| Type | Description |
|---|---|
| float | Kurtosis value. With Fisher's definition, typically ranges from -2 to +10, though extreme values are possible. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import kurtosis
>>>
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> kurt = kurtosis(data)
>>> print(f"Kurtosis (Fisher's): {kurt:.3f}")
>>>
>>> kurt_pearson = kurtosis(data, fisher=False)
>>> print(f"Kurtosis (Pearson's): {kurt_pearson:.3f}")
Notes
Fisher's definition (excess kurtosis) is more commonly used in modern statistics and is the default. It subtracts 3 from Pearson's definition so that a normal distribution has kurtosis = 0.
References
.. [1] DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological methods, 2(3), 292.
Source code in minexpy/stats.py
mean(data)
Calculate the arithmetic mean of the data.
The mean is the sum of all values divided by the number of values. It is the most common measure of central tendency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| float | Arithmetic mean value. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import mean
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> avg = mean(data)
>>> print(f"Mean: {avg:.3f}")
Source code in minexpy/stats.py
median(data)
Calculate the median of the data.
The median is the middle value when data is sorted. It is a robust measure of central tendency that is less affected by outliers than the mean.
For geochemical data, the median is often preferred when dealing with skewed distributions or datasets containing outliers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| float | Median value. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import median
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> med = median(data)
>>> print(f"Median: {med:.3f}")
Source code in minexpy/stats.py
mode(data)
Calculate the mode (most frequent value) of the data.
The mode is the value that appears most frequently in the dataset. For continuous data, this function returns the most common value after rounding to a reasonable precision.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| tuple | A (mode_value, count) tuple: the most frequent value and the number of times it appears. If multiple modes exist, the first one encountered is returned. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import mode
>>>
>>> data = np.array([1, 2, 2, 3, 3, 3, 4, 4])
>>> mode_val, count = mode(data)
>>> print(f"Mode: {mode_val} (appears {count} times)")
Source code in minexpy/stats.py
percentile(data, p)
Calculate a specific percentile of the data.
The percentile is the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data points lie.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| p | float | Percentile to calculate, between 0 and 100. | required |
Returns:
| Type | Description |
|---|---|
| float | The p-th percentile value. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import percentile
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> p90 = percentile(data, 90)
>>> print(f"90th percentile: {p90:.3f}")
Source code in minexpy/stats.py
skewness(data)
Calculate the skewness of the data.
Skewness measures the asymmetry of the data distribution around the mean. It indicates whether the data is skewed to the left or right.
- Positive skewness: Right tail is longer; mass is concentrated on the left
- Negative skewness: Left tail is longer; mass is concentrated on the right
- Zero skewness: Data is symmetric (normal distribution has zero skewness)
For geochemical data, skewness is particularly useful for identifying log-normal distributions, which are common in geochemical datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
Returns:
| Type | Description |
|---|---|
| float | Skewness value. Typically ranges from -3 to +3, though extreme values can occur with small sample sizes. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import skewness
>>>
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> skew = skewness(data)
>>> print(f"Skewness: {skew:.3f}")
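To illustrate the log-normal point above, the sketch below generates right-skewed synthetic concentrations and shows that a log transform brings skewness near zero. It calls scipy.stats.skew directly rather than MinexPy, and the simulated data are illustrative:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
conc = rng.lognormal(mean=2.0, sigma=0.8, size=2000)  # synthetic ppm values

raw_skew = skew(conc)          # strongly positive for log-normal data
log_skew = skew(np.log(conc))  # near zero once log-transformed

print(f"raw skewness: {raw_skew:.2f}, log-transformed: {log_skew:.2f}")
```

A large positive skewness that collapses toward zero under a log transform is a quick diagnostic that the variable is approximately log-normal.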
Notes
Uses Fisher's definition of skewness (third standardized moment). The calculation uses scipy.stats.skew with nan_policy='omit' to handle missing values.
References
.. [1] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D, 47(1), 183-189.
Source code in minexpy/stats.py
std(data, ddof=1)
Calculate the standard deviation of the data.
Standard deviation measures the amount of variation or dispersion in the dataset. It is the square root of the variance.
For geochemical data, standard deviation helps quantify the variability of element concentrations, which is crucial for understanding geochemical processes and identifying anomalies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| ddof | int | Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations. ddof=1 gives the sample standard deviation (default, unbiased estimator); ddof=0 gives the population standard deviation. | 1 |
Returns:
| Type | Description |
|---|---|
| float | Standard deviation value, in the same units as the input data. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import std
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> sample_std = std(data, ddof=1)
>>> print(f"Sample std: {sample_std:.3f}")
>>>
>>> pop_std = std(data, ddof=0)
>>> print(f"Population std: {pop_std:.3f}")
Notes
For sample data (most common case), use ddof=1. For population data, use ddof=0. The default (ddof=1) provides an unbiased estimate of the population standard deviation from a sample.
Source code in minexpy/stats.py
variance(data, ddof=1)
Calculate the variance of the data.
Variance measures the average squared deviation from the mean. It quantifies the spread or dispersion of the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| ddof | int | Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations. ddof=1 gives the sample variance (default, unbiased estimator); ddof=0 gives the population variance. | 1 |
Returns:
| Type | Description |
|---|---|
| float | Variance value. Units are the square of the input data units. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import variance
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> var = variance(data)
>>> print(f"Variance: {var:.3f}")
Source code in minexpy/stats.py
z_score(data, value=None)
Calculate z-scores (standardized values).
Z-scores measure how many standard deviations a value lies from the mean. They are useful for:

- Identifying outliers (typically |z| > 2 or 3)
- Standardizing data for comparison
- Detecting anomalies in geochemical data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | array-like | Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded. | required |
| value | float | If provided, calculates the z-score for this specific value. If None, returns z-scores for all values in the data. | None |
Returns:
| Type | Description |
|---|---|
| float or array | If value is provided, the z-score for that value; otherwise an array of z-scores for all data points. |
Examples:
>>> import numpy as np
>>> from minexpy.stats import z_score
>>>
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> z_scores = z_score(data)
>>> print(f"Z-scores: {z_scores}")
>>>
>>> z_specific = z_score(data, value=25.0)
>>> print(f"Z-score for 25.0: {z_specific:.3f}")
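A minimal sketch of the |z| > 2 outlier rule mentioned above, computing z-scores with plain NumPy rather than minexpy.stats.z_score; the threshold and data are illustrative:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 48.0])  # 48.0 is suspect
z = (data - data.mean()) / data.std(ddof=1)  # z-score for every sample

threshold = 2.0
anomalies = data[np.abs(z) > threshold]      # values more than 2 sd from mean
print(f"flagged: {anomalies}")
```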
Source code in minexpy/stats.py
Correlation Analysis
minexpy.correlation
Correlation analysis module for geoscience datasets.
This module provides pairwise and matrix-style correlation tools that are commonly used in geochemistry, environmental geoscience, and exploration analytics. It includes linear, rank-based, nonlinear, and robust measures of dependence, plus partial correlation for controlling confounding variables.
The functions are designed to be practical for real-world field/lab datasets:
- NaN and infinite values are excluded pairwise by default.
- Outputs include interpretable metadata such as sample size and p-values.
Examples:
Basic pairwise analysis:
>>> import numpy as np
>>> from minexpy.correlation import pearson_correlation, spearman_correlation
>>>
>>> x = np.array([45.2, 52.3, 38.7, 61.2, 49.8])
>>> y = np.array([12.5, 15.3, 11.2, 18.4, 14.1])
>>> pearson_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}
>>> spearman_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}
Matrix-based analysis:
>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>>
>>> df = pd.DataFrame({'Zn': x, 'Cu': y})
>>> correlation_matrix(df, method='pearson')
biweight_midcorrelation(x, y, c=9.0, epsilon=1e-12)
Compute robust biweight midcorrelation.
Biweight midcorrelation downweights extreme observations using a median/MAD-based weighting scheme. It is particularly useful in assay datasets where a few extreme values can dominate classical coefficients.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | First numeric variable. | required |
| y | array-like | Second numeric variable. | required |
| c | float | Tuning constant controlling outlier downweighting. Lower values increase robustness but reduce efficiency on near-normal data. | 9.0 |
| epsilon | float | Numerical stabilizer used to avoid division by near-zero quantities. | 1e-12 |
Returns:
| Type | Description |
|---|---|
| float | Robust correlation in [-1, 1]. |
Raises:
| Type | Description |
|---|---|
| ValueError | If inputs are invalid or contain too few valid pairs. |
Examples:
>>> import numpy as np
>>> from minexpy.correlation import biweight_midcorrelation
>>> x = np.array([1, 2, 3, 4, 5, 100])
>>> y = np.array([1, 2, 3, 4, 5, -100])
>>> round(biweight_midcorrelation(x, y), 3)
0.996
Notes
This implementation follows a Tukey biweight-style weighting of centered
values with cutoff |u| < 1 after MAD scaling.
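A condensed NumPy sketch of that weighting scheme follows. It is a simplified illustration of the general biweight midcorrelation recipe (median-center, MAD-scale, Tukey biweight weights), and it is not MinexPy's exact implementation, so its numeric output can differ from the example above:

```python
import numpy as np

def bicor(x, y, c=9.0, eps=1e-12):
    """Biweight midcorrelation sketch: median-center, MAD-scale, then
    apply Tukey biweight weights (1 - u^2)^2 with cutoff |u| < 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)

    def weighted(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))
        u = (v - med) / (c * mad + eps)
        w = (1 - u**2) ** 2 * (np.abs(u) < 1)  # zero weight outside cutoff
        return (v - med) * w

    a, b = weighted(x), weighted(y)
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)) + eps)

x = np.array([1, 2, 3, 4, 5, 100])   # one wild value in each variable
y = np.array([1, 2, 3, 4, 5, -100])
print(f"bicor: {bicor(x, y):.3f}  (Pearson: {np.corrcoef(x, y)[0, 1]:.3f})")
```

The wild pair (100, -100) drives Pearson strongly negative, while the biweight version zeroes it out and recovers the positive trend of the remaining points.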
Source code in minexpy/correlation.py
correlation_matrix(data, method='pearson', min_periods=2)
Compute a correlation matrix using geoscience-relevant methods.
Supported methods include:

- pearson
- spearman
- kendall
- distance
- biweight (or biweight_midcorrelation)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame or array-like | Tabular numeric data. If a NumPy array is provided, each column is treated as a separate variable. | required |
| method | str | Correlation method name. | 'pearson' |
| min_periods | int | Minimum number of pairwise valid observations required for each matrix entry. | 2 |
|
Returns:
| Type | Description |
|---|---|
| DataFrame | Symmetric correlation matrix with variable names as row/column labels. |
Raises:
| Type | Description |
|---|---|
| ValueError | If method is unsupported or input is invalid. |
Examples:
>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>> df = pd.DataFrame({'Zn': [45, 50, 47], 'Cu': [12, 14, 13]})
>>> correlation_matrix(df, method='pearson')
Zn Cu
Zn 1.000000 1.0
Cu 1.000000 1.0
Notes
Built-in pandas methods are used for Pearson/Spearman/Kendall. Distance and biweight matrices are computed pairwise with explicit finite-value filtering.
Source code in minexpy/correlation.py
distance_correlation(x, y)
Compute distance correlation for nonlinear dependence detection.
Distance correlation is zero if and only if variables are independent (under finite second moments). It therefore captures both linear and nonlinear dependence patterns that Pearson may miss.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | First numeric variable. | required |
| y | array-like | Second numeric variable. | required |
Returns:
| Type | Description |
|---|---|
| float | Distance correlation value in [0, 1]. |
Raises:
| Type | Description |
|---|---|
| ValueError | If inputs are invalid or contain too few valid pairs. |
Examples:
>>> import numpy as np
>>> from minexpy.correlation import distance_correlation
>>> x = np.linspace(-2, 2, 50)
>>> y = x ** 2
>>> round(distance_correlation(x, y), 3)
0.54
Notes
This implementation uses pairwise absolute-distance matrices and classical
double-centering. Complexity is O(n^2) in both time and memory.
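The parabola in the example above is exactly the case Pearson misses: by symmetry the linear correlation of x and x² is essentially zero, while distance correlation is clearly nonzero. Below is a simplified sketch of the double-centering computation described in these Notes, illustrative rather than MinexPy's exact code:

```python
import numpy as np

def dcor(x, y):
    """Distance correlation via pairwise distance matrices and double-centering."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise |xi - xj|
    b = np.abs(y[:, None] - y[None, :])
    # Double-center: subtract row and column means, add back the grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-2, 2, 50)
y = x ** 2
print(f"Pearson: {np.corrcoef(x, y)[0, 1]:.3f}, distance corr: {dcor(x, y):.3f}")
```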
Source code in minexpy/correlation.py
kendall_correlation(x, y, method='auto', alternative='two-sided')
Compute Kendall's tau rank correlation.
Kendall's tau is based on concordant and discordant pair comparisons. It is robust for ordinal structures, less sensitive to outliers than Pearson, and frequently used for geoscience trend analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | First numeric variable. | required |
| y | array-like | Second numeric variable. | required |
| method | {'auto', 'asymptotic', 'exact'} | Method forwarded to scipy.stats.kendalltau. | 'auto' |
| alternative | {'two-sided', 'greater', 'less'} | Alternative hypothesis used for p-value calculation. | 'two-sided' |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with 'correlation', 'p_value', and 'n' (number of valid pairs). |
Raises:
| Type | Description |
|---|---|
| ValueError | If inputs are invalid or contain too few valid pairs. |
Examples:
>>> from minexpy.correlation import kendall_correlation
>>> x = [2, 4, 6, 8, 10]
>>> y = [5, 7, 8, 11, 13]
>>> kendall_correlation(x, y)
{'correlation': 0.999..., 'p_value': 0.016..., 'n': 5}
Notes
Compared with Spearman's rho, Kendall's tau is typically smaller in magnitude but often easier to interpret probabilistically as concordance.
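This magnitude difference is easy to demonstrate with scipy.stats directly; on monotone data with a tied pair, tau comes out smaller than rho (the data here are illustrative):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

x = [1, 2, 3, 4, 5]
y = [1, 2, 2, 3, 4]   # monotone, but with one tied pair in y

tau, _ = kendalltau(x, y)   # tau-b, which corrects for ties
rho, _ = spearmanr(x, y)
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```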
Source code in minexpy/correlation.py
partial_correlation(x, y, controls, alternative='two-sided')
Compute linear partial correlation between x and y.
Partial correlation estimates the residual linear relationship between two variables after regressing out one or more control variables. This is useful when evaluating element-element associations while controlling for depth, lithology proxies, or compositional trends.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | First numeric variable. | required |
| y | array-like | Second numeric variable. | required |
| controls | array-like | One or more control variables, each the same length as x and y. | required |
| alternative | {'two-sided', 'greater', 'less'} | Alternative hypothesis for the residual-correlation significance test. | 'two-sided' |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with 'correlation', 'p_value', 'n' (number of valid observations), and 'df' (degrees of freedom). |
Raises:
| Type | Description |
|---|---|
| ValueError | If shapes are incompatible, too few observations are available, or degrees of freedom are insufficient. |
Examples:
>>> from minexpy.correlation import partial_correlation
>>> x = [10, 12, 13, 16, 20]
>>> y = [2, 3, 3.2, 4.1, 5.2]
>>> z = [100, 110, 105, 120, 130]
>>> partial_correlation(x, y, z)
{'correlation': 0.97..., 'p_value': 0.12..., 'n': 5, 'df': 2}
Notes
This implementation uses least-squares residualization and then applies Pearson correlation to residual vectors. It assumes a linear adjustment model for control effects.
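A minimal sketch of that residualize-then-correlate procedure with NumPy and scipy follows. It is illustrative only: the p-value here uses pearsonr's default degrees of freedom rather than MinexPy's control-adjusted df, so both numbers can differ from partial_correlation's output:

```python
import numpy as np
from scipy.stats import pearsonr

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for a single variable z:
    residualize both on z via least squares, then correlate the residuals."""
    x, y, z = (np.asarray(v, float) for v in (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])  # design matrix with intercept

    def residuals(v):
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    r, p = pearsonr(residuals(x), residuals(y))
    return r, p

x = [10, 12, 13, 16, 20]
y = [2, 3, 3.2, 4.1, 5.2]
z = [100, 110, 105, 120, 130]
r, p = partial_corr(x, y, z)
print(f"partial r: {r:.3f}")
```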
Source code in minexpy/correlation.py
pearson_correlation(x, y, alternative='two-sided')
Compute Pearson product-moment correlation coefficient.
Pearson correlation quantifies linear association between two variables. It is the standard first-pass measure when data are approximately linear, homoscedastic, and not dominated by outliers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | array-like | First numeric variable. | required |
| y | array-like | Second numeric variable. | required |
| alternative | {'two-sided', 'greater', 'less'} | Alternative hypothesis used for p-value calculation. | 'two-sided' |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with 'correlation', 'p_value', and 'n' (number of valid pairs). |
Raises:
| Type | Description |
|---|---|
| ValueError | If inputs are invalid or contain too few valid pairs. |
Examples:
>>> from minexpy.correlation import pearson_correlation
>>> x = [10, 12, 15, 20, 21]
>>> y = [3.1, 3.8, 4.2, 5.7, 6.0]
>>> pearson_correlation(x, y)
{'correlation': 0.985..., 'p_value': 0.002..., 'n': 5}
Notes
Pearson correlation is sensitive to extreme values. For geochemical data with heavy tails or outliers, compare results with Spearman, Kendall, or biweight midcorrelation before interpretation.
Source code in minexpy/correlation.py
spearman_correlation(x, y, alternative='two-sided')
Compute Spearman rank correlation coefficient.
Spearman correlation measures monotonic dependence by correlating ranked values instead of raw values. It is often preferred for skewed geochemical variables and monotonic but nonlinear relationships.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | array-like | First numeric variable. | *required* |
| `y` | array-like | Second numeric variable. | *required* |
| `alternative` | `{'two-sided', 'greater', 'less'}` | Alternative hypothesis used for p-value calculation. | `'two-sided'` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary with keys `'correlation'`, `'p_value'`, and `'n'`. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If inputs are invalid or contain too few valid pairs. |
Examples:
>>> from minexpy.correlation import spearman_correlation
>>> x = [1, 2, 3, 4, 5]
>>> y = [1, 4, 9, 16, 25]
>>> spearman_correlation(x, y)
{'correlation': 1.0, 'p_value': 0.0..., 'n': 5}
Notes
Spearman is robust to monotonic nonlinear scaling but can still be influenced by large numbers of tied values. In highly tied data, Kendall's tau is often more conservative.
Source code in minexpy/correlation.py
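Kendall's tau, mentioned in the note above as the more conservative choice under heavy ties, counts concordant versus discordant pairs. A minimal tau-a sketch (no tie correction; illustrative only, not MinexPy code):

```python
from itertools import combinations

def kendall_tau_a(x, y):
    # tau-a: (concordant - discordant) / total pairs; tied pairs count as neither.
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

tau = kendall_tau_a([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])  # perfectly monotonic
```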
Statistical Visualization
minexpy.statviz
Statistical visualization module for geoscience datasets.
This module provides practical plotting helpers for common statistical visual diagnostics used during geochemical and environmental data analysis:
- Histogram (linear and log-scale)
- Box plot / violin plot
- ECDF (empirical cumulative distribution function)
- Q-Q plot
- P-P plot
- Scatter plot with optional trend line
All public plotting functions return (figure, axis) so users can apply
additional Matplotlib customization (annotations, styling, export settings)
after MinexPy constructs the base diagnostic plot.
Examples:
Basic histogram:
>>> import numpy as np
>>> from minexpy.statviz import plot_histogram
>>> values = np.random.lognormal(mean=2.2, sigma=0.4, size=200)
>>> fig, ax = plot_histogram(values, bins=30, scale='log')
Scatter with trend line:
>>> from minexpy.statviz import plot_scatter
>>> x = np.array([1, 2, 3, 4, 5])
>>> y = np.array([2, 4, 6, 8, 10])
>>> fig, ax = plot_scatter(x, y, add_trendline=True)
plot_box_violin(data, kind='box', labels=None, ax=None, show_means=True, color='tab:blue', xlabel='Variables', ylabel='Value', title=None)
Plot box plot or violin plot for one or multiple datasets.
Box and violin plots are complementary diagnostics for comparing spread and shape across variables, lithological domains, or spatial groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | array-like, mapping, sequence of arrays, or DataFrame | One dataset or multiple datasets. | *required* |
| `kind` | `{'box', 'violin'}` | Plot type to generate. | `'box'` |
| `labels` | sequence of str | Custom labels replacing detected dataset names. | `None` |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `show_means` | bool | If `True`, mark the mean of each dataset. | `True` |
| `color` | str | Primary face color for box/violin glyphs. | `'tab:blue'` |
| `xlabel` | str | X-axis label. | `'Variables'` |
| `ylabel` | str | Y-axis label. | `'Value'` |
| `title` | str | Plot title. If omitted, a default title is used. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `kind` is not `'box'` or `'violin'`. |
Examples:
>>> from minexpy.statviz import plot_box_violin
>>> fig, ax = plot_box_violin({'Zn': [1, 2, 3], 'Cu': [2, 3, 4]}, kind='violin')
Notes
For DataFrame input, each numeric column is treated as one distribution. Non-finite values are removed independently per dataset.
Source code in minexpy/statviz.py
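The quantities a box plot encodes can be previewed numerically before plotting. A stdlib sketch of the quartiles and Tukey whisker limits (an illustration of the standard convention; MinexPy's internals may differ):

```python
from statistics import quantiles

zn = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]    # one high outlier
q1, q2, q3 = quantiles(zn, n=4)          # quartiles (default 'exclusive' method)
iqr = q3 - q1
whisker_lo = q1 - 1.5 * iqr              # Tukey fences
whisker_hi = q3 + 1.5 * iqr
outliers = [v for v in zn if v < whisker_lo or v > whisker_hi]
```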
plot_ecdf(data, labels=None, ax=None, xlabel='Value', ylabel='Empirical Cumulative Probability', title='ECDF')
Plot empirical cumulative distribution function (ECDF).
ECDF plots avoid histogram binning artifacts and are useful for direct comparison of distribution shifts and tail behavior across groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | array-like, mapping, sequence of arrays, or DataFrame | One dataset or multiple datasets. | *required* |
| `labels` | sequence of str | Custom labels replacing detected dataset names. | `None` |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `xlabel` | str | X-axis label. | `'Value'` |
| `ylabel` | str | Y-axis label. | `'Empirical Cumulative Probability'` |
| `title` | str | Plot title. | `'ECDF'` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Source code in minexpy/statviz.py
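The construction behind an ECDF curve is small enough to show directly. A pure-Python sketch of the step coordinates this kind of plot draws (illustrative only, not the library's code):

```python
def ecdf_points(values):
    # Sort the data; the ECDF steps up by 1/n at each sorted value.
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

xs, ys = ecdf_points([18.7, 12.5, 22.1, 15.3])
# xs -> [12.5, 15.3, 18.7, 22.1], ys -> [0.25, 0.5, 0.75, 1.0]
```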
plot_histogram(data, bins=30, scale='linear', density=False, ax=None, color='tab:blue', alpha=0.75, label=None, xlabel='Value', ylabel=None, title='Histogram')
Plot a histogram with linear or logarithmic x-axis scaling.
Histograms are often the first diagnostic for distribution shape, outlier concentration, and modal behavior. `scale='log'` is especially useful for right-skewed concentration data with multiplicative structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | array-like | One-dimensional numeric dataset. | *required* |
| `bins` | int or sequence | Number of bins or explicit bin edges. | `30` |
| `scale` | `{'linear', 'log'}` | X-axis scale and binning mode. | `'linear'` |
| `density` | bool | If `True`, plot normalized density instead of raw counts. | `False` |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `color` | str | Bar face color. | `'tab:blue'` |
| `alpha` | float | Bar opacity. | `0.75` |
| `label` | str | Legend label for plotted dataset. | `None` |
| `xlabel` | str | X-axis label. | `'Value'` |
| `ylabel` | str | Y-axis label. If `None`, a default is chosen. | `None` |
| `title` | str | Plot title. | `'Histogram'` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `scale` is not `'linear'` or `'log'`. |
Examples:
>>> from minexpy.statviz import plot_histogram
>>> fig, ax = plot_histogram([1, 2, 2, 3, 5], bins=5)
>>> fig, ax = plot_histogram([1, 2, 3, 4], scale='log')
Notes
For `scale='log'` and integer `bins`, logarithmically spaced bin edges are constructed from the finite min/max of the data.
Source code in minexpy/statviz.py
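The log-binning described in the note can be sketched as follows (a minimal illustration of the idea, assuming strictly positive data; not the library's code):

```python
import math

def log_bin_edges(data, bins):
    # Logarithmically spaced edges between the finite positive min and max.
    finite = [v for v in data if v > 0 and math.isfinite(v)]
    lo, hi = math.log10(min(finite)), math.log10(max(finite))
    return [10 ** (lo + (hi - lo) * i / bins) for i in range(bins + 1)]

edges = log_bin_edges([1.0, 7.0, 42.0, 1000.0], bins=3)
# edges -> [1.0, 10.0, 100.0, 1000.0] (up to floating-point rounding)
```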
plot_pp(data, distribution='norm', distribution_parameters=None, fit_distribution=True, ax=None, color='tab:blue', xlabel='Theoretical Cumulative Probability', ylabel='Empirical Cumulative Probability', title='P-P Plot')
Plot probability-probability (P-P) diagnostic against a distribution.
P-P plots compare empirical cumulative probabilities against theoretical CDF values. They are often more sensitive in the central part of the distribution than Q-Q plots.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | array-like | One-dimensional numeric dataset. | *required* |
| `distribution` | str or scipy.stats distribution | Reference distribution name or object. | `'norm'` |
| `distribution_parameters` | sequence of float | Fixed distribution parameters. If omitted and `fit_distribution=True`, parameters are estimated from the data. | `None` |
| `fit_distribution` | bool | Fit distribution parameters from data when not provided. | `True` |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `color` | str | Marker color. | `'tab:blue'` |
| `xlabel` | str | X-axis label. | `'Theoretical Cumulative Probability'` |
| `ylabel` | str | Y-axis label. | `'Empirical Cumulative Probability'` |
| `title` | str | Plot title. | `'P-P Plot'` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Source code in minexpy/statviz.py
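The coordinate pairs a P-P plot compares can be computed with the stdlib `NormalDist` for the normal case (a sketch of the idea only; the function above also supports other scipy.stats distributions):

```python
from statistics import NormalDist, mean, stdev

data = [12.5, 15.3, 18.7, 22.1, 19.4, 16.8]
xs = sorted(data)
n = len(xs)

# Fit a normal reference distribution from the sample (cf. fit_distribution=True).
ref = NormalDist(mean(xs), stdev(xs))

empirical = [(i + 0.5) / n for i in range(n)]   # plotting positions
theoretical = [ref.cdf(v) for v in xs]          # reference CDF at each sorted value
```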
plot_qq(data, distribution='norm', distribution_parameters=None, fit_distribution=True, show_fit_line=True, ax=None, marker='o', color='tab:blue', xlabel='Theoretical Quantiles', ylabel='Sample Quantiles', title='Q-Q Plot')
Plot quantile-quantile (Q-Q) diagnostic against a theoretical distribution.
Q-Q plots assess distributional fit by comparing sample quantiles to theoretical quantiles. Tail deviations are especially informative for heavy-tailed geochemical variables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | array-like | One-dimensional numeric dataset. | *required* |
| `distribution` | str or scipy.stats distribution | Reference distribution name or object. | `'norm'` |
| `distribution_parameters` | sequence of float | Fixed distribution parameters. If omitted and `fit_distribution=True`, parameters are estimated from the data. | `None` |
| `fit_distribution` | bool | Fit distribution parameters from data when not provided. | `True` |
| `show_fit_line` | bool | If `True`, draw a reference fit line through the quantile pairs. | `True` |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `marker` | str | Marker style for sample quantiles. | `'o'` |
| `color` | str | Marker color. | `'tab:blue'` |
| `xlabel` | str | X-axis label. | `'Theoretical Quantiles'` |
| `ylabel` | str | Y-axis label. | `'Sample Quantiles'` |
| `title` | str | Plot title. | `'Q-Q Plot'` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Notes
The dashed 1:1 line shows ideal agreement. Systematic curvature indicates mismatch between empirical and theoretical distributions.
Source code in minexpy/statviz.py
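For the normal case, the quantile pairs behind a Q-Q plot can be built with the stdlib `NormalDist.inv_cdf` (an illustrative sketch, not the library's implementation):

```python
from statistics import NormalDist

data = [12.5, 15.3, 18.7, 22.1, 19.4, 16.8]
xs = sorted(data)
n = len(xs)

probs = [(i + 0.5) / n for i in range(n)]               # plotting positions
theoretical = [NormalDist().inv_cdf(p) for p in probs]  # standard-normal quantiles

pairs = list(zip(theoretical, xs))   # points to scatter: (theoretical, sample)
```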
plot_scatter(x, y, ax=None, color='tab:blue', alpha=0.8, marker='o', label=None, add_trendline=False, trendline_color='tab:red', xlabel='X', ylabel='Y', title='Scatter Plot')
Plot scatter data with optional least-squares trend line.
Scatter plots are central to geoscience exploratory analysis because they reveal linear/nonlinear patterns, heteroscedasticity, clustering, and potential outliers before formal modeling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | array-like | X-variable values. | *required* |
| `y` | array-like | Y-variable values. | *required* |
| `ax` | Axes | Existing axis to draw on. | `None` |
| `color` | str | Marker color. | `'tab:blue'` |
| `alpha` | float | Marker opacity. | `0.8` |
| `marker` | str | Marker style. | `'o'` |
| `label` | str | Legend label for plotted points. | `None` |
| `add_trendline` | bool | If `True`, add a least-squares trend line. | `False` |
| `trendline_color` | str | Trend line color. | `'tab:red'` |
| `xlabel` | str | X-axis label. | `'X'` |
| `ylabel` | str | Y-axis label. | `'Y'` |
| `title` | str | Plot title. | `'Scatter Plot'` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(figure, axis)` Matplotlib objects. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If vector lengths differ or fewer than two valid paired values exist. |
Examples:
>>> from minexpy.statviz import plot_scatter
>>> fig, ax = plot_scatter([1, 2, 3], [2, 4, 6], add_trendline=True)
Source code in minexpy/statviz.py