Statistics API

This page provides API references for MinexPy's statistics functionality.

Statistical Analysis

minexpy.stats

Statistical analysis module for geoscience data.

This module provides comprehensive descriptive statistical metrics for analyzing geochemical and geological sample data. It includes both functional and class-based APIs for maximum flexibility.

Examples:

Basic usage with functions:

>>> import numpy as np
>>> import minexpy.stats as mstats
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> skew = mstats.skewness(data)
>>> kurt = mstats.kurtosis(data)
>>> summary = mstats.describe(data)

Class-based usage:

>>> from minexpy.stats import StatisticalAnalyzer
>>> import pandas as pd
>>> 
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> results = analyzer.summary()

StatisticalAnalyzer

Comprehensive statistical analyzer for geoscience data.

This class provides a convenient, object-oriented interface for calculating multiple statistical metrics on geochemical data. It supports both single arrays and pandas DataFrames for batch analysis.

The class is designed to be intuitive and flexible, allowing researchers to quickly obtain comprehensive statistical summaries of their data with minimal code.

Attributes:

- data (ndarray or DataFrame): The input data being analyzed.
- is_dataframe (bool): True if the input is a DataFrame, False if it is an array.

Examples:

Single array analysis:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)

DataFrame analysis (multiple columns):

>>> import pandas as pd
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> summary_df = analyzer.summary()
>>> print(summary_df)

Individual metric access:

>>> analyzer = StatisticalAnalyzer(data)
>>> skew = analyzer.skewness()
>>> kurt = analyzer.kurtosis()
>>> cv = analyzer.coefficient_of_variation()
Source code in minexpy/stats.py
class StatisticalAnalyzer:
    """
    Comprehensive statistical analyzer for geoscience data.

    This class provides a convenient, object-oriented interface for
    calculating multiple statistical metrics on geochemical data. It
    supports both single arrays and pandas DataFrames for batch analysis.

    The class is designed to be intuitive and flexible, allowing researchers
    to quickly obtain comprehensive statistical summaries of their data
    with minimal code.

    Attributes
    ----------
    data : numpy.ndarray or pandas.DataFrame
        The input data being analyzed.
    is_dataframe : bool
        True if input is a DataFrame, False if it's an array.

    Examples
    --------
    Single array analysis:

        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> summary = analyzer.summary()
        >>> print(summary)

    DataFrame analysis (multiple columns):

        >>> import pandas as pd
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> df = pd.read_csv('geochemical_data.csv')
        >>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
        >>> summary_df = analyzer.summary()
        >>> print(summary_df)

    Individual metric access:

        >>> analyzer = StatisticalAnalyzer(data)
        >>> skew = analyzer.skewness()
        >>> kurt = analyzer.kurtosis()
        >>> cv = analyzer.coefficient_of_variation()
    """

    def __init__(self, data: Union[np.ndarray, pd.Series, pd.DataFrame, List[float]]):
        """
        Initialize the statistical analyzer.

        Parameters
        ----------
        data : array-like or DataFrame
            Input data to analyze. Can be:
            - 1D numpy array
            - pandas Series
            - pandas DataFrame (for multi-column analysis)
            - Python list

        Raises
        ------
        ValueError
            If input data is empty or contains only NaN values.

        Examples
        --------
        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> # Array input
        >>> data = np.array([1, 2, 3, 4, 5])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> 
        >>> # Series input
        >>> import pandas as pd
        >>> series = pd.Series([1, 2, 3, 4, 5])
        >>> analyzer = StatisticalAnalyzer(series)
        """
        if isinstance(data, pd.DataFrame):
            self.data = data
            self.is_dataframe = True
        elif isinstance(data, pd.Series):
            self.data = data.values
            self.is_dataframe = False
        else:
            self.data = np.asarray(data)
            self.is_dataframe = False

        # Validate data
        if self.is_dataframe:
            if self.data.empty:
                raise ValueError("Input DataFrame is empty")
        else:
            valid_data = self.data[~np.isnan(self.data)]
            if len(valid_data) == 0:
                raise ValueError("Input data is empty or contains only NaN values")

    def summary(self, as_dataframe: bool = True, 
                percentiles: Optional[List[float]] = None) -> Union[pd.DataFrame, Dict]:
        """
        Calculate comprehensive statistical summary.

        This method computes all major descriptive statistics for the data.
        For DataFrames, it calculates statistics for each column separately.

        Parameters
        ----------
        as_dataframe : bool, default True
            If True, returns results as a pandas DataFrame (for DataFrames)
            or Series (for arrays). If False, returns a dictionary.
        percentiles : list of float, optional
            Additional percentiles to include in the summary beyond defaults.

        Returns
        -------
        DataFrame, Series, or dict
            Statistical summary. For DataFrames, returns a DataFrame with
            columns as rows and statistics as columns. For arrays, returns
            a Series or dict with statistics as values.

        Examples
        --------
        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> summary = analyzer.summary()
        >>> print(summary)
        """
        if self.is_dataframe:
            results = {}
            for col in self.data.columns:
                results[col] = describe(self.data[col].values, percentiles=percentiles)

            if as_dataframe:
                return pd.DataFrame(results).T
            return results
        else:
            results = describe(self.data, percentiles=percentiles)
            if as_dataframe:
                return pd.Series(results)
            return results

    def skewness(self) -> Union[float, pd.Series]:
        """
        Calculate skewness.

        Returns
        -------
        float or Series
            Skewness value(s). For DataFrames, returns a Series with
            skewness for each column.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: skewness(x))
        return skewness(self.data)

    def kurtosis(self, fisher: bool = True) -> Union[float, pd.Series]:
        """
        Calculate kurtosis.

        Parameters
        ----------
        fisher : bool, default True
            Use Fisher's definition (excess kurtosis) if True.

        Returns
        -------
        float or Series
            Kurtosis value(s). For DataFrames, returns a Series with
            kurtosis for each column.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: kurtosis(x, fisher=fisher))
        return kurtosis(self.data, fisher=fisher)

    def std(self, ddof: int = 1) -> Union[float, pd.Series]:
        """
        Calculate standard deviation.

        Parameters
        ----------
        ddof : int, default 1
            Delta degrees of freedom.

        Returns
        -------
        float or Series
            Standard deviation value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.std(ddof=ddof)
        return std(self.data, ddof=ddof)

    def variance(self, ddof: int = 1) -> Union[float, pd.Series]:
        """
        Calculate variance.

        Parameters
        ----------
        ddof : int, default 1
            Delta degrees of freedom.

        Returns
        -------
        float or Series
            Variance value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.var(ddof=ddof)
        return variance(self.data, ddof=ddof)

    def mean(self) -> Union[float, pd.Series]:
        """
        Calculate mean.

        Returns
        -------
        float or Series
            Mean value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.mean()
        return mean(self.data)

    def median(self) -> Union[float, pd.Series]:
        """
        Calculate median.

        Returns
        -------
        float or Series
            Median value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.median()
        return median(self.data)

    def coefficient_of_variation(self) -> Union[float, pd.Series]:
        """
        Calculate coefficient of variation.

        Returns
        -------
        float or Series
            CV value(s). For DataFrames, returns a Series.
            Returns NaN for columns/arrays with zero mean.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: coefficient_of_variation(x))
        return coefficient_of_variation(self.data)

    def iqr(self) -> Union[float, pd.Series]:
        """
        Calculate interquartile range.

        Returns
        -------
        float or Series
            IQR value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: iqr(x))
        return iqr(self.data)

    def z_score(self, value: Optional[float] = None) -> Union[float, np.ndarray, pd.DataFrame]:
        """
        Calculate z-scores.

        Parameters
        ----------
        value : float, optional
            If provided, calculates z-score for this specific value.
            Otherwise, returns z-scores for all data points.

        Returns
        -------
        float, array, or DataFrame
            Z-score(s). For DataFrames, returns a DataFrame with z-scores
            for each column.
        """
        if self.is_dataframe:
            if value is not None:
                return self.data.apply(lambda x: z_score(x, value=value))
            else:
                return self.data.apply(lambda x: z_score(x))
        return z_score(self.data, value=value)

__init__(data)

Initialize the statistical analyzer.

Parameters:

- data (array-like or DataFrame, required): Input data to analyze. Can be a 1D numpy array, a pandas Series, a pandas DataFrame (for multi-column analysis), or a Python list.

Raises:

- ValueError: If the input data is empty or contains only NaN values.

Examples:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> # Array input
>>> data = np.array([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(data)
>>> 
>>> # Series input
>>> import pandas as pd
>>> series = pd.Series([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(series)
Source code in minexpy/stats.py
def __init__(self, data: Union[np.ndarray, pd.Series, pd.DataFrame, List[float]]):
    """
    Initialize the statistical analyzer.

    Parameters
    ----------
    data : array-like or DataFrame
        Input data to analyze. Can be:
        - 1D numpy array
        - pandas Series
        - pandas DataFrame (for multi-column analysis)
        - Python list

    Raises
    ------
    ValueError
        If input data is empty or contains only NaN values.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import StatisticalAnalyzer
    >>> 
    >>> # Array input
    >>> data = np.array([1, 2, 3, 4, 5])
    >>> analyzer = StatisticalAnalyzer(data)
    >>> 
    >>> # Series input
    >>> import pandas as pd
    >>> series = pd.Series([1, 2, 3, 4, 5])
    >>> analyzer = StatisticalAnalyzer(series)
    """
    if isinstance(data, pd.DataFrame):
        self.data = data
        self.is_dataframe = True
    elif isinstance(data, pd.Series):
        self.data = data.values
        self.is_dataframe = False
    else:
        self.data = np.asarray(data)
        self.is_dataframe = False

    # Validate data
    if self.is_dataframe:
        if self.data.empty:
            raise ValueError("Input DataFrame is empty")
    else:
        valid_data = self.data[~np.isnan(self.data)]
        if len(valid_data) == 0:
            raise ValueError("Input data is empty or contains only NaN values")
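The NaN-validation step in the constructor above can be exercised with plain numpy. This is an independent sketch (numpy only, no minexpy import) showing why an all-NaN input fails validation:

```python
import numpy as np

def has_valid_data(data) -> bool:
    """Mirror of the constructor's check: drop NaNs, then test
    whether any finite observations remain."""
    arr = np.asarray(data, dtype=float)
    return len(arr[~np.isnan(arr)]) > 0

print(has_valid_data([1.0, 2.0, np.nan]))  # True: two valid observations remain
print(has_valid_data([np.nan, np.nan]))    # False: __init__ would raise ValueError
```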

coefficient_of_variation()

Calculate coefficient of variation.

Returns:

- float or Series: CV value(s). For DataFrames, returns a Series. Returns NaN for columns/arrays with zero mean.

Source code in minexpy/stats.py
def coefficient_of_variation(self) -> Union[float, pd.Series]:
    """
    Calculate coefficient of variation.

    Returns
    -------
    float or Series
        CV value(s). For DataFrames, returns a Series.
        Returns NaN for columns/arrays with zero mean.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: coefficient_of_variation(x))
    return coefficient_of_variation(self.data)

iqr()

Calculate interquartile range.

Returns:

- float or Series: IQR value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def iqr(self) -> Union[float, pd.Series]:
    """
    Calculate interquartile range.

    Returns
    -------
    float or Series
        IQR value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: iqr(x))
    return iqr(self.data)

kurtosis(fisher=True)

Calculate kurtosis.

Parameters:

- fisher (bool, default True): Use Fisher's definition (excess kurtosis) if True.

Returns:

- float or Series: Kurtosis value(s). For DataFrames, returns a Series with kurtosis for each column.

Source code in minexpy/stats.py
def kurtosis(self, fisher: bool = True) -> Union[float, pd.Series]:
    """
    Calculate kurtosis.

    Parameters
    ----------
    fisher : bool, default True
        Use Fisher's definition (excess kurtosis) if True.

    Returns
    -------
    float or Series
        Kurtosis value(s). For DataFrames, returns a Series with
        kurtosis for each column.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: kurtosis(x, fisher=fisher))
    return kurtosis(self.data, fisher=fisher)

mean()

Calculate mean.

Returns:

- float or Series: Mean value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def mean(self) -> Union[float, pd.Series]:
    """
    Calculate mean.

    Returns
    -------
    float or Series
        Mean value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.mean()
    return mean(self.data)

median()

Calculate median.

Returns:

- float or Series: Median value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def median(self) -> Union[float, pd.Series]:
    """
    Calculate median.

    Returns
    -------
    float or Series
        Median value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.median()
    return median(self.data)

skewness()

Calculate skewness.

Returns:

- float or Series: Skewness value(s). For DataFrames, returns a Series with skewness for each column.

Source code in minexpy/stats.py
def skewness(self) -> Union[float, pd.Series]:
    """
    Calculate skewness.

    Returns
    -------
    float or Series
        Skewness value(s). For DataFrames, returns a Series with
        skewness for each column.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: skewness(x))
    return skewness(self.data)

std(ddof=1)

Calculate standard deviation.

Parameters:

- ddof (int, default 1): Delta degrees of freedom.

Returns:

- float or Series: Standard deviation value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def std(self, ddof: int = 1) -> Union[float, pd.Series]:
    """
    Calculate standard deviation.

    Parameters
    ----------
    ddof : int, default 1
        Delta degrees of freedom.

    Returns
    -------
    float or Series
        Standard deviation value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.std(ddof=ddof)
    return std(self.data, ddof=ddof)

summary(as_dataframe=True, percentiles=None)

Calculate comprehensive statistical summary.

This method computes all major descriptive statistics for the data. For DataFrames, it calculates statistics for each column separately.

Parameters:

- as_dataframe (bool, default True): If True, returns results as a pandas DataFrame (for DataFrames) or Series (for arrays). If False, returns a dictionary.
- percentiles (list of float, optional, default None): Additional percentiles to include in the summary beyond the defaults.

Returns:

- DataFrame, Series, or dict: Statistical summary. For DataFrames, returns a DataFrame with columns as rows and statistics as columns. For arrays, returns a Series or dict with statistics as values.

Examples:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)
Source code in minexpy/stats.py
def summary(self, as_dataframe: bool = True, 
            percentiles: Optional[List[float]] = None) -> Union[pd.DataFrame, Dict]:
    """
    Calculate comprehensive statistical summary.

    This method computes all major descriptive statistics for the data.
    For DataFrames, it calculates statistics for each column separately.

    Parameters
    ----------
    as_dataframe : bool, default True
        If True, returns results as a pandas DataFrame (for DataFrames)
        or Series (for arrays). If False, returns a dictionary.
    percentiles : list of float, optional
        Additional percentiles to include in the summary beyond defaults.

    Returns
    -------
    DataFrame, Series, or dict
        Statistical summary. For DataFrames, returns a DataFrame with
        columns as rows and statistics as columns. For arrays, returns
        a Series or dict with statistics as values.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import StatisticalAnalyzer
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> analyzer = StatisticalAnalyzer(data)
    >>> summary = analyzer.summary()
    >>> print(summary)
    """
    if self.is_dataframe:
        results = {}
        for col in self.data.columns:
            results[col] = describe(self.data[col].values, percentiles=percentiles)

        if as_dataframe:
            return pd.DataFrame(results).T
        return results
    else:
        results = describe(self.data, percentiles=percentiles)
        if as_dataframe:
            return pd.Series(results)
        return results

variance(ddof=1)

Calculate variance.

Parameters:

- ddof (int, default 1): Delta degrees of freedom.

Returns:

- float or Series: Variance value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def variance(self, ddof: int = 1) -> Union[float, pd.Series]:
    """
    Calculate variance.

    Parameters
    ----------
    ddof : int, default 1
        Delta degrees of freedom.

    Returns
    -------
    float or Series
        Variance value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.var(ddof=ddof)
    return variance(self.data, ddof=ddof)

z_score(value=None)

Calculate z-scores.

Parameters:

- value (float, optional, default None): If provided, calculates the z-score for this specific value. Otherwise, returns z-scores for all data points.

Returns:

- float, array, or DataFrame: Z-score(s). For DataFrames, returns a DataFrame with z-scores for each column.

Source code in minexpy/stats.py
def z_score(self, value: Optional[float] = None) -> Union[float, np.ndarray, pd.DataFrame]:
    """
    Calculate z-scores.

    Parameters
    ----------
    value : float, optional
        If provided, calculates z-score for this specific value.
        Otherwise, returns z-scores for all data points.

    Returns
    -------
    float, array, or DataFrame
        Z-score(s). For DataFrames, returns a DataFrame with z-scores
        for each column.
    """
    if self.is_dataframe:
        if value is not None:
            return self.data.apply(lambda x: z_score(x, value=value))
        else:
            return self.data.apply(lambda x: z_score(x))
    return z_score(self.data, value=value)
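The computation behind z_score can be sketched with plain numpy. This is an independent reimplementation for illustration, not the minexpy source; it assumes the conventional (x - mean) / std formula with the sample standard deviation (ddof=1), consistent with the module's std default:

```python
import numpy as np

def z_scores_sketch(data, value=None):
    """Conventional z-score: (x - mean) / std. Independent sketch,
    assuming sample standard deviation (ddof=1)."""
    arr = np.asarray(data, dtype=float)
    arr = arr[~np.isnan(arr)]
    mu, sigma = arr.mean(), arr.std(ddof=1)
    if value is not None:
        return float((value - mu) / sigma)   # z-score of one external value
    return (arr - mu) / sigma                # z-scores for every data point

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
print(z_scores_sketch(data, value=25.0))  # how far 25.0 sits above the mean, in std units
```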

coefficient_of_variation(data)

Calculate the coefficient of variation (CV).

The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage or decimal. It is a normalized measure of dispersion that allows comparison of variability across different scales and units.

CV is particularly useful in geochemistry for:

- Comparing variability of different elements with different concentration ranges
- Identifying elements with high relative variability
- Assessing data quality and homogeneity

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.

Returns:

- float: Coefficient of variation (std/mean). Returns NaN if the mean is zero. Typically expressed as a decimal (e.g., 0.25 = 25%).

Examples:

>>> import numpy as np
>>> from minexpy.stats import coefficient_of_variation
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> cv = coefficient_of_variation(data)
>>> print(f"CV: {cv:.3f} ({cv*100:.1f}%)")
Notes

CV is dimensionless and unitless, making it ideal for comparing variability across different elements or datasets with different measurement units. A CV < 0.15 is often considered low variability, while CV > 0.5 indicates high variability.

Source code in minexpy/stats.py
def coefficient_of_variation(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the coefficient of variation (CV).

    The coefficient of variation is the ratio of the standard deviation
    to the mean, expressed as a percentage or decimal. It is a normalized
    measure of dispersion that allows comparison of variability across
    different scales and units.

    CV is particularly useful in geochemistry for:
    - Comparing variability of different elements with different
      concentration ranges
    - Identifying elements with high relative variability
    - Assessing data quality and homogeneity

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Coefficient of variation (std/mean). Returns NaN if mean is zero.
        Typically expressed as a decimal (e.g., 0.25 = 25%).

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import coefficient_of_variation
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> cv = coefficient_of_variation(data)
    >>> print(f"CV: {cv:.3f} ({cv*100:.1f}%)")

    Notes
    -----
    CV is dimensionless and unitless, making it ideal for comparing
    variability across different elements or datasets with different
    measurement units. A CV < 0.15 is often considered low variability,
    while CV > 0.5 indicates high variability.
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    mean = np.mean(data)
    if mean == 0:
        return np.nan
    return float(np.std(data, ddof=1) / mean)
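As the source above shows, the CV is simply the sample standard deviation divided by the mean. The same number can be checked directly with plain numpy:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])

# CV = sample std (ddof=1) / mean, exactly as in coefficient_of_variation above
cv = np.std(data, ddof=1) / np.mean(data)
print(f"CV: {cv:.3f} ({cv * 100:.1f}%)")
```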

describe(data, percentiles=None)

Generate a comprehensive statistical summary of the data.

This function calculates all major descriptive statistics in a single call, providing a complete overview of the data distribution. It is particularly useful for initial data exploration in geochemical analysis.

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.
- percentiles (list of float, optional, default None): Additional percentiles to calculate beyond the default set of [25, 50, 75, 90, 95, 99]. Values should be between 0 and 100.

Returns:

- dict: Dictionary containing all statistical metrics with the following keys:
  - 'count': Number of observations
  - 'mean': Arithmetic mean
  - 'median': Median (50th percentile)
  - 'std': Standard deviation (sample)
  - 'variance': Variance (sample)
  - 'min': Minimum value
  - 'max': Maximum value
  - 'range': Range (max - min)
  - 'skewness': Skewness
  - 'kurtosis': Kurtosis (Fisher's definition)
  - 'coefficient_of_variation': CV (std/mean)
  - 'q1': First quartile (25th percentile)
  - 'q3': Third quartile (75th percentile)
  - 'iqr': Interquartile range
  - 'percentile_X': Additional percentiles as specified

Examples:

>>> import numpy as np
>>> from minexpy.stats import describe
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])
>>> summary = describe(data)
>>> for key, value in summary.items():
...     print(f"{key}: {value:.3f}")
>>> 
>>> # With custom percentiles
>>> summary_custom = describe(data, percentiles=[10, 50, 90, 95])
Source code in minexpy/stats.py
def describe(data: Union[np.ndarray, pd.Series, List[float]], 
             percentiles: Optional[List[float]] = None) -> Dict[str, float]:
    """
    Generate a comprehensive statistical summary of the data.

    This function calculates all major descriptive statistics in a single
    call, providing a complete overview of the data distribution. It is
    particularly useful for initial data exploration in geochemical analysis.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    percentiles : list of float, optional
        Additional percentiles to calculate beyond the default set.
        Default percentiles include: [25, 50, 75, 90, 95, 99].
        Values should be between 0 and 100.

    Returns
    -------
    dict
        Dictionary containing all statistical metrics with the following keys:
        - 'count': Number of observations
        - 'mean': Arithmetic mean
        - 'median': Median (50th percentile)
        - 'std': Standard deviation (sample)
        - 'variance': Variance (sample)
        - 'min': Minimum value
        - 'max': Maximum value
        - 'range': Range (max - min)
        - 'skewness': Skewness
        - 'kurtosis': Kurtosis (Fisher's definition)
        - 'coefficient_of_variation': CV (std/mean)
        - 'q1': First quartile (25th percentile)
        - 'q3': Third quartile (75th percentile)
        - 'iqr': Interquartile range
        - 'percentile_X': Additional percentiles as specified

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import describe
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])
    >>> summary = describe(data)
    >>> for key, value in summary.items():
    ...     print(f"{key}: {value:.3f}")
    >>> 
    >>> # With custom percentiles
    >>> summary_custom = describe(data, percentiles=[10, 50, 90, 95])
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]

    if len(data) == 0:
        raise ValueError("Input data is empty or contains only NaN values")

    if percentiles is None:
        percentiles = [25, 50, 75, 90, 95, 99]

    results = {
        'count': len(data),
        'mean': mean(data),
        'median': median(data),
        'std': std(data),
        'variance': variance(data),
        'min': float(np.min(data)),
        'max': float(np.max(data)),
        'range': float(np.max(data) - np.min(data)),
        'skewness': skewness(data),
        'kurtosis': kurtosis(data),
        'coefficient_of_variation': coefficient_of_variation(data),
        'q1': percentile(data, 25),
        'q3': percentile(data, 75),
        'iqr': iqr(data),
    }

    # Add custom percentiles
    for p in percentiles:
        if 0 <= p <= 100:
            results[f'percentile_{p}'] = percentile(data, p)

    return results
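The 'percentile_X' key naming used by describe can be illustrated with plain numpy. This is an independent sketch of just the custom-percentile step shown above, not the full function; note how out-of-range values are silently skipped:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])

# Same key scheme as describe(): one 'percentile_<p>' entry per requested p,
# skipping any p outside [0, 100].
requested = [10, 50, 90, 150]
extra = {f"percentile_{p}": float(np.percentile(data, p))
         for p in requested if 0 <= p <= 100}
print(sorted(extra))  # ['percentile_10', 'percentile_50', 'percentile_90']
```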

iqr(data)

Calculate the interquartile range (IQR).

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is a robust measure of spread that is not affected by outliers.

IQR is commonly used in geochemistry for:

- Identifying outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
- Describing data spread in skewed distributions
- Box plot construction

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.

Returns:

- float: Interquartile range (Q3 - Q1).

Examples:

>>> import numpy as np
>>> from minexpy.stats import iqr
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0])
>>> interquartile_range = iqr(data)
>>> print(f"IQR: {interquartile_range:.3f}")
Source code in minexpy/stats.py
def iqr(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the interquartile range (IQR).

    The IQR is the difference between the 75th percentile (Q3) and
    the 25th percentile (Q1). It is a robust measure of spread that
    is not affected by outliers.

    IQR is commonly used in geochemistry for:
    - Identifying outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
    - Describing data spread in skewed distributions
    - Box plot construction

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Interquartile range (Q3 - Q1).

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import iqr
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0])
    >>> interquartile_range = iqr(data)
    >>> print(f"IQR: {interquartile_range:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    return float(q3 - q1)
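The Tukey outlier fences mentioned above (Q1 - 1.5*IQR and Q3 + 1.5*IQR) can be computed directly with numpy; this sketch flags values outside the fences:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 120.0])  # 120.0 is a planted outlier

q1, q3 = np.percentile(data, [25, 75])
iqr_val = q3 - q1
lower, upper = q1 - 1.5 * iqr_val, q3 + 1.5 * iqr_val

# Any value outside [lower, upper] is a candidate anomaly
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # only the planted 120.0 falls outside the fences
```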

kurtosis(data, fisher=True)

Calculate the kurtosis of the data.

Kurtosis measures the "tailedness" of the probability distribution. It indicates the presence of outliers and the shape of the distribution tails compared to a normal distribution.

  • High kurtosis (>0 with Fisher's definition): Heavy tails, more outliers
  • Low kurtosis (<0 with Fisher's definition): Light tails, fewer outliers
  • Normal distribution: Kurtosis = 0 (Fisher's) or 3 (Pearson's)

In geochemical analysis, high kurtosis often indicates the presence of anomalous values or multiple populations in the dataset.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
fisher bool

If True, uses Fisher's definition (excess kurtosis). Normal distribution has kurtosis = 0. If False, uses Pearson's definition. Normal distribution has kurtosis = 3.

True

Returns:

Type Description
float

Kurtosis value. With Fisher's definition, typically ranges from -2 to +10, though extreme values are possible.

Examples:

>>> import numpy as np
>>> from minexpy.stats import kurtosis
>>> 
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> kurt = kurtosis(data)
>>> print(f"Kurtosis (Fisher's): {kurt:.3f}")
>>> 
>>> kurt_pearson = kurtosis(data, fisher=False)
>>> print(f"Kurtosis (Pearson's): {kurt_pearson:.3f}")
Notes

Fisher's definition (excess kurtosis) is more commonly used in modern statistics and is the default. It subtracts 3 from Pearson's definition so that a normal distribution has kurtosis = 0.

References

.. [1] DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2(3), 292.

Source code in minexpy/stats.py
def kurtosis(data: Union[np.ndarray, pd.Series, List[float]], 
             fisher: bool = True) -> float:
    """
    Calculate the kurtosis of the data.

    Kurtosis measures the "tailedness" of the probability distribution.
    It indicates the presence of outliers and the shape of the distribution
    tails compared to a normal distribution.

    - High kurtosis (>0 with Fisher's definition): Heavy tails, more outliers
    - Low kurtosis (<0 with Fisher's definition): Light tails, fewer outliers
    - Normal distribution: Kurtosis = 0 (Fisher's) or 3 (Pearson's)

    In geochemical analysis, high kurtosis often indicates the presence of
    anomalous values or multiple populations in the dataset.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    fisher : bool, default True
        If True, uses Fisher's definition (excess kurtosis).
        Normal distribution has kurtosis = 0.
        If False, uses Pearson's definition.
        Normal distribution has kurtosis = 3.

    Returns
    -------
    float
        Kurtosis value. With Fisher's definition, typically ranges from
        -2 to +10, though extreme values are possible.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import kurtosis
    >>> 
    >>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
    >>> kurt = kurtosis(data)
    >>> print(f"Kurtosis (Fisher's): {kurt:.3f}")
    >>> 
    >>> kurt_pearson = kurtosis(data, fisher=False)
    >>> print(f"Kurtosis (Pearson's): {kurt_pearson:.3f}")

    Notes
    -----
    Fisher's definition (excess kurtosis) is more commonly used in
    modern statistics and is the default. It subtracts 3 from Pearson's
    definition so that a normal distribution has kurtosis = 0.

    References
    ----------
    .. [1] DeCarlo, L. T. (1997). On the meaning and use of kurtosis.
           Psychological Methods, 2(3), 292.
    """
    data = np.asarray(data)
    return float(scipy_stats.kurtosis(data, fisher=fisher, nan_policy='omit'))
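The Fisher/Pearson relationship described above is a fixed offset of 3, which can be verified with scipy directly (a sketch on synthetic normal data, not minexpy itself):

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)  # normal data: excess kurtosis near 0

fisher = float(scipy_stats.kurtosis(data, fisher=True))    # excess kurtosis
pearson = float(scipy_stats.kurtosis(data, fisher=False))  # Pearson's definition
# pearson == fisher + 3 by definition
```

For near-normal data the Fisher value hovers around 0 and the Pearson value around 3, which is why the Fisher convention is the more convenient default for spotting heavy tails.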

mean(data)

Calculate the arithmetic mean of the data.

The mean is the sum of all values divided by the number of values. It is the most common measure of central tendency.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Arithmetic mean value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import mean
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> avg = mean(data)
>>> print(f"Mean: {avg:.3f}")
Source code in minexpy/stats.py
def mean(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the arithmetic mean of the data.

    The mean is the sum of all values divided by the number of values.
    It is the most common measure of central tendency.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Arithmetic mean value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import mean
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> avg = mean(data)
    >>> print(f"Mean: {avg:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.mean(data))

median(data)

Calculate the median of the data.

The median is the middle value when data is sorted. It is a robust measure of central tendency that is less affected by outliers than the mean.

For geochemical data, the median is often preferred when dealing with skewed distributions or datasets containing outliers.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Median value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import median
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> med = median(data)
>>> print(f"Median: {med:.3f}")
Source code in minexpy/stats.py
def median(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the median of the data.

    The median is the middle value when data is sorted. It is a robust
    measure of central tendency that is less affected by outliers than
    the mean.

    For geochemical data, the median is often preferred when dealing
    with skewed distributions or datasets containing outliers.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Median value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import median
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> med = median(data)
    >>> print(f"Median: {med:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.median(data))
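The robustness claim is easy to see side by side: a single extreme assay value drags the mean far more than the median (plain numpy sketch with an illustrative 250.0 outlier):

```python
import numpy as np

clean = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
with_outlier = np.append(clean, 250.0)   # one anomalous sample

# Mean jumps from ~17.5 to ~50.7; median only moves from 17.75 to 18.7.
print(np.mean(clean), np.mean(with_outlier))
print(np.median(clean), np.median(with_outlier))
```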

mode(data)

Calculate the mode (most frequent value) of the data.

The mode is the value that appears most frequently in the dataset. Values are compared exactly (no rounding is applied), so for continuous data you may want to round to a common precision before calling this function.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
tuple

A tuple containing (mode_value, count). The mode_value is the most frequent value, and count is the number of times it appears. If multiple modes exist, returns the first one encountered.

Examples:

>>> import numpy as np
>>> from minexpy.stats import mode
>>> 
>>> data = np.array([1, 2, 2, 3, 3, 3, 4, 4])
>>> mode_val, count = mode(data)
>>> print(f"Mode: {mode_val} (appears {count} times)")
Source code in minexpy/stats.py
def mode(data: Union[np.ndarray, pd.Series, List[float]]) -> Tuple[float, int]:
    """
    Calculate the mode (most frequent value) of the data.

    The mode is the value that appears most frequently in the dataset.
    Values are compared exactly (no rounding is applied), so for
    continuous data you may want to round to a common precision first.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    tuple
        A tuple containing (mode_value, count). The mode_value is the
        most frequent value, and count is the number of times it appears.
        If multiple modes exist, returns the first one encountered.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import mode
    >>> 
    >>> data = np.array([1, 2, 2, 3, 3, 3, 4, 4])
    >>> mode_val, count = mode(data)
    >>> print(f"Mode: {mode_val} (appears {count} times)")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    values, counts = np.unique(data, return_counts=True)
    max_count_idx = np.argmax(counts)
    return (float(values[max_count_idx]), int(counts[max_count_idx]))

percentile(data, p)

Calculate a specific percentile of the data.

The percentile is the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data points lie.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
p float

Percentile to calculate, between 0 and 100.

required

Returns:

Type Description
float

The p-th percentile value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import percentile
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> p90 = percentile(data, 90)
>>> print(f"90th percentile: {p90:.3f}")
Source code in minexpy/stats.py
def percentile(data: Union[np.ndarray, pd.Series, List[float]], 
               p: float) -> float:
    """
    Calculate a specific percentile of the data.

    The percentile is the value below which a given percentage of
    observations fall. For example, the 90th percentile is the value
    below which 90% of the data points lie.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    p : float
        Percentile to calculate, between 0 and 100.

    Returns
    -------
    float
        The p-th percentile value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import percentile
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> p90 = percentile(data, 90)
    >>> print(f"90th percentile: {p90:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.percentile(data, p))

skewness(data)

Calculate the skewness of the data.

Skewness measures the asymmetry of the data distribution around the mean. It indicates whether the data is skewed to the left or right.

  • Positive skewness: Right tail is longer; mass is concentrated on the left
  • Negative skewness: Left tail is longer; mass is concentrated on the right
  • Zero skewness: Data is symmetric (normal distribution has zero skewness)

For geochemical data, skewness is particularly useful for identifying log-normal distributions, which are common in geochemical datasets.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Skewness value. Typically ranges from -3 to +3, though extreme values can occur with small sample sizes.

Examples:

>>> import numpy as np
>>> from minexpy.stats import skewness
>>> 
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> skew = skewness(data)
>>> print(f"Skewness: {skew:.3f}")
Notes

Uses Fisher's definition of skewness (third standardized moment). The calculation uses scipy.stats.skew with nan_policy='omit' to handle missing values.

References

.. [1] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D, 47(1), 183-189.

Source code in minexpy/stats.py
def skewness(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the skewness of the data.

    Skewness measures the asymmetry of the data distribution around the mean.
    It indicates whether the data is skewed to the left or right.

    - Positive skewness: Right tail is longer; mass is concentrated on the left
    - Negative skewness: Left tail is longer; mass is concentrated on the right
    - Zero skewness: Data is symmetric (normal distribution has zero skewness)

    For geochemical data, skewness is particularly useful for identifying
    log-normal distributions, which are common in geochemical datasets.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Skewness value. Typically ranges from -3 to +3, though extreme
        values can occur with small sample sizes.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import skewness
    >>> 
    >>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
    >>> skew = skewness(data)
    >>> print(f"Skewness: {skew:.3f}")

    Notes
    -----
    Uses Fisher's definition of skewness (third standardized moment).
    The calculation uses scipy.stats.skew with nan_policy='omit' to
    handle missing values.

    References
    ----------
    .. [1] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of
           sample skewness and kurtosis. Journal of the Royal Statistical
           Society: Series D, 47(1), 183-189.
    """
    data = np.asarray(data)
    return float(scipy_stats.skew(data, nan_policy='omit'))
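Because the notes highlight log-normal behaviour, here is a sketch (synthetic concentrations, scipy called directly) showing that log-transforming a log-normal variable removes most of the skew:

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(42)
conc = rng.lognormal(mean=2.0, sigma=0.8, size=5_000)  # ppm-like concentrations

raw_skew = float(scipy_stats.skew(conc))          # strongly positive
log_skew = float(scipy_stats.skew(np.log(conc)))  # near zero
```

A strongly positive skewness that collapses toward zero after a log transform is the classic signature of a log-normal geochemical population.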

std(data, ddof=1)

Calculate the standard deviation of the data.

Standard deviation measures the amount of variation or dispersion in the dataset. It is the square root of the variance.

For geochemical data, standard deviation helps quantify the variability of element concentrations, which is crucial for understanding geochemical processes and identifying anomalies.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
ddof int

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations.

  • ddof=1: Sample standard deviation (default, unbiased estimator)
  • ddof=0: Population standard deviation

1

Returns:

Type Description
float

Standard deviation value. Same units as the input data.

Examples:

>>> import numpy as np
>>> from minexpy.stats import std
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> sample_std = std(data, ddof=1)
>>> print(f"Sample std: {sample_std:.3f}")
>>> 
>>> pop_std = std(data, ddof=0)
>>> print(f"Population std: {pop_std:.3f}")
Notes

For sample data (most common case), use ddof=1. For population data, use ddof=0. The default (ddof=1) provides an unbiased estimate of the population standard deviation from a sample.

Source code in minexpy/stats.py
def std(data: Union[np.ndarray, pd.Series, List[float]], 
        ddof: int = 1) -> float:
    """
    Calculate the standard deviation of the data.

    Standard deviation measures the amount of variation or dispersion
    in the dataset. It is the square root of the variance.

    For geochemical data, standard deviation helps quantify the variability
    of element concentrations, which is crucial for understanding
    geochemical processes and identifying anomalies.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    ddof : int, default 1
        Delta degrees of freedom. The divisor used in calculations is
        N - ddof, where N is the number of observations.
        - ddof=1: Sample standard deviation (default, unbiased estimator)
        - ddof=0: Population standard deviation

    Returns
    -------
    float
        Standard deviation value. Same units as the input data.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import std
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> sample_std = std(data, ddof=1)
    >>> print(f"Sample std: {sample_std:.3f}")
    >>> 
    >>> pop_std = std(data, ddof=0)
    >>> print(f"Population std: {pop_std:.3f}")

    Notes
    -----
    For sample data (most common case), use ddof=1. For population data,
    use ddof=0. The default (ddof=1) provides an unbiased estimate of
    the population standard deviation from a sample.
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.std(data, ddof=ddof, axis=None))
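The ddof behaviour reduces to the divisor in the sum of squared deviations, which can be checked by hand with numpy alone:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
n = data.size
ss = np.sum((data - data.mean()) ** 2)   # sum of squared deviations

sample_std = np.sqrt(ss / (n - 1))       # ddof=1 (divisor N - 1)
pop_std = np.sqrt(ss / n)                # ddof=0 (divisor N)
```

The sample estimate is always the larger of the two, since it divides by a smaller number.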

variance(data, ddof=1)

Calculate the variance of the data.

Variance measures the average squared deviation from the mean. It quantifies the spread or dispersion of the data.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
ddof int

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations.

  • ddof=1: Sample variance (default, unbiased estimator)
  • ddof=0: Population variance

1

Returns:

Type Description
float

Variance value. Units are the square of the input data units.

Examples:

>>> import numpy as np
>>> from minexpy.stats import variance
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> var = variance(data)
>>> print(f"Variance: {var:.3f}")
Source code in minexpy/stats.py
def variance(data: Union[np.ndarray, pd.Series, List[float]], 
             ddof: int = 1) -> float:
    """
    Calculate the variance of the data.

    Variance measures the average squared deviation from the mean.
    It quantifies the spread or dispersion of the data.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    ddof : int, default 1
        Delta degrees of freedom. The divisor used in calculations is
        N - ddof, where N is the number of observations.
        - ddof=1: Sample variance (default, unbiased estimator)
        - ddof=0: Population variance

    Returns
    -------
    float
        Variance value. Units are the square of the input data units.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import variance
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> var = variance(data)
    >>> print(f"Variance: {var:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.var(data, ddof=ddof, axis=None))

z_score(data, value=None)

Calculate z-scores (standardized values).

Z-scores measure how many standard deviations a value is from the mean. They are useful for:

  • Identifying outliers (typically |z| > 2 or 3)
  • Standardizing data for comparison
  • Detecting anomalies in geochemical data

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
value float

If provided, calculates the z-score for this specific value. If None, returns z-scores for all values in the data.

None

Returns:

Type Description
float or array

If value is provided, returns the z-score for that value. Otherwise, returns an array of z-scores for all data points.

Examples:

>>> import numpy as np
>>> from minexpy.stats import z_score
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> z_scores = z_score(data)
>>> print(f"Z-scores: {z_scores}")
>>> 
>>> z_specific = z_score(data, value=25.0)
>>> print(f"Z-score for 25.0: {z_specific:.3f}")
Source code in minexpy/stats.py
def z_score(data: Union[np.ndarray, pd.Series, List[float]], 
            value: Optional[float] = None) -> Union[float, np.ndarray]:
    """
    Calculate z-scores (standardized values).

    Z-scores measure how many standard deviations a value is from the mean.
    They are useful for:
    - Identifying outliers (typically |z| > 2 or 3)
    - Standardizing data for comparison
    - Detecting anomalies in geochemical data

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    value : float, optional
        If provided, calculates the z-score for this specific value.
        If None, returns z-scores for all values in the data.

    Returns
    -------
    float or array
        If value is provided, returns the z-score for that value.
        Otherwise, returns an array of z-scores for all data points.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import z_score
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> z_scores = z_score(data)
    >>> print(f"Z-scores: {z_scores}")
    >>> 
    >>> z_specific = z_score(data, value=25.0)
    >>> print(f"Z-score for 25.0: {z_specific:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    mean_val = np.mean(data)
    std_val = np.std(data, ddof=1)

    if value is not None:
        return float((value - mean_val) / std_val)
    else:
        return (data - mean_val) / std_val
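A common use is flagging values with |z| > 2, sketched here with numpy (the threshold and the 40.0 sample value are illustrative):

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 40.0])
z = (data - data.mean()) / data.std(ddof=1)   # same ddof=1 convention as z_score()

flagged = data[np.abs(z) > 2]                 # [40.0]
```

Only the anomalous 40.0 exceeds two standard deviations from the mean; the routine background values do not.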

Correlation Analysis

minexpy.correlation

Correlation analysis module for geoscience datasets.

This module provides pairwise and matrix-style correlation tools that are commonly used in geochemistry, environmental geoscience, and exploration analytics. It includes linear, rank-based, nonlinear, and robust measures of dependence, plus partial correlation for controlling confounding variables.

The functions are designed to be practical for real-world field/lab datasets:

  • NaN and infinite values are excluded pairwise by default.
  • Outputs include interpretable metadata such as sample size and p-values.

Examples:

Basic pairwise analysis:

>>> import numpy as np
>>> from minexpy.correlation import pearson_correlation, spearman_correlation
>>>
>>> x = np.array([45.2, 52.3, 38.7, 61.2, 49.8])
>>> y = np.array([12.5, 15.3, 11.2, 18.4, 14.1])
>>> pearson_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}
>>> spearman_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}

Matrix-based analysis:

>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>>
>>> df = pd.DataFrame({'Zn': x, 'Cu': y})
>>> correlation_matrix(df, method='pearson')

biweight_midcorrelation(x, y, c=9.0, epsilon=1e-12)

Compute robust biweight midcorrelation.

Biweight midcorrelation downweights extreme observations using a median/MAD-based weighting scheme. It is particularly useful in assay datasets where a few extreme values can dominate classical coefficients.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
c float

Tuning constant controlling outlier downweighting. Lower values increase robustness but reduce efficiency in near-normal data.

9.0
epsilon float

Numerical stabilizer used to avoid division by near-zero quantities.

1e-12

Returns:

Type Description
float

Robust correlation in [-1, 1]. Returns np.nan when robust dispersion of either variable is effectively zero.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> import numpy as np
>>> from minexpy.correlation import biweight_midcorrelation
>>> x = np.array([1, 2, 3, 4, 5, 100])
>>> y = np.array([1, 2, 3, 4, 5, -100])
>>> round(biweight_midcorrelation(x, y), 3)
0.996
Notes

This implementation follows a Tukey biweight-style weighting of centered values with cutoff |u| < 1 after MAD scaling.

Source code in minexpy/correlation.py
def biweight_midcorrelation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    c: float = 9.0,
    epsilon: float = 1e-12,
) -> float:
    """
    Compute robust biweight midcorrelation.

    Biweight midcorrelation downweights extreme observations using a
    median/MAD-based weighting scheme. It is particularly useful in assay
    datasets where a few extreme values can dominate classical coefficients.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    c : float, default 9.0
        Tuning constant controlling outlier downweighting. Lower values
        increase robustness but reduce efficiency in near-normal data.
    epsilon : float, default 1e-12
        Numerical stabilizer used to avoid division by near-zero quantities.

    Returns
    -------
    float
        Robust correlation in ``[-1, 1]``. Returns ``np.nan`` when robust
        dispersion of either variable is effectively zero.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.correlation import biweight_midcorrelation
    >>> x = np.array([1, 2, 3, 4, 5, 100])
    >>> y = np.array([1, 2, 3, 4, 5, -100])
    >>> round(biweight_midcorrelation(x, y), 3)
    0.996

    Notes
    -----
    This implementation follows a Tukey biweight-style weighting of centered
    values with cutoff ``|u| < 1`` after MAD scaling.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=3)

    x_med = np.median(x_clean)
    y_med = np.median(y_clean)

    x_mad = np.median(np.abs(x_clean - x_med))
    y_mad = np.median(np.abs(y_clean - y_med))

    if x_mad <= epsilon or y_mad <= epsilon:
        return float(np.nan)

    x_u = (x_clean - x_med) / (c * x_mad)
    y_u = (y_clean - y_med) / (c * y_mad)

    x_w = np.zeros_like(x_u)
    y_w = np.zeros_like(y_u)

    x_mask = np.abs(x_u) < 1.0
    y_mask = np.abs(y_u) < 1.0

    x_w[x_mask] = (1.0 - x_u[x_mask] ** 2) ** 2
    y_w[y_mask] = (1.0 - y_u[y_mask] ** 2) ** 2

    x_centered = (x_clean - x_med) * x_w
    y_centered = (y_clean - y_med) * y_w

    denominator = np.sqrt(np.sum(x_centered ** 2) * np.sum(y_centered ** 2))
    if denominator <= epsilon:
        return float(np.nan)

    corr = float(np.sum(x_centered * y_centered) / denominator)
    return float(np.clip(corr, -1.0, 1.0))
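To see the robustness in action, the docstring data can be contrasted with Pearson. The `bicor` helper below is a compact restatement of the source listed above, for illustration only (it is not the minexpy API):

```python
import numpy as np

def bicor(x, y, c=9.0, eps=1e-12):
    """Biweight midcorrelation, mirroring the source listed above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = np.median(x), np.median(y)
    xmad = np.median(np.abs(x - xm))
    ymad = np.median(np.abs(y - ym))
    if xmad <= eps or ymad <= eps:
        return float("nan")
    xu, yu = (x - xm) / (c * xmad), (y - ym) / (c * ymad)
    xw = np.where(np.abs(xu) < 1.0, (1.0 - xu ** 2) ** 2, 0.0)  # outliers get weight 0
    yw = np.where(np.abs(yu) < 1.0, (1.0 - yu ** 2) ** 2, 0.0)
    xc, yc = (x - xm) * xw, (y - ym) * yw
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return float(np.clip(np.sum(xc * yc) / denom, -1.0, 1.0))

x = np.array([1, 2, 3, 4, 5, 100.0])
y = np.array([1, 2, 3, 4, 5, -100.0])
pearson = float(np.corrcoef(x, y)[0, 1])   # dominated by the outlier pair
robust = bicor(x, y)                       # remains strongly positive
```

One adversarial pair flips Pearson to roughly -1, while the biweight coefficient stays strongly positive, recovering the agreement of the inlier points.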

correlation_matrix(data, method='pearson', min_periods=2)

Compute a correlation matrix using geoscience-relevant methods.

Supported methods include:

  • pearson
  • spearman
  • kendall
  • distance
  • biweight or biweight_midcorrelation

Parameters:

Name Type Description Default
data DataFrame or array-like

Tabular numeric data. If a NumPy array is provided, each column is treated as a separate variable.

required
method str

Correlation method name.

'pearson'
min_periods int

Minimum number of pairwise valid observations required for each matrix entry.

2

Returns:

Type Description
DataFrame

Symmetric correlation matrix with variable names as row/column labels.

Raises:

Type Description
ValueError

If method is unsupported or input is invalid.

Examples:

>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>> df = pd.DataFrame({'Zn': [45, 50, 47], 'Cu': [12, 14, 13]})
>>> correlation_matrix(df, method='pearson')
          Zn        Cu
Zn  1.000000  0.993399
Cu  0.993399  1.000000
Notes

Built-in pandas methods are used for Pearson/Spearman/Kendall. Distance and biweight matrices are computed pairwise with explicit finite-value filtering.

Source code in minexpy/correlation.py
def correlation_matrix(
    data: Union[pd.DataFrame, np.ndarray, Sequence[Sequence[float]]],
    method: str = "pearson",
    min_periods: int = 2,
) -> pd.DataFrame:
    """
    Compute a correlation matrix using geoscience-relevant methods.

    Supported methods include:

    - ``pearson``
    - ``spearman``
    - ``kendall``
    - ``distance``
    - ``biweight`` or ``biweight_midcorrelation``

    Parameters
    ----------
    data : DataFrame or array-like
        Tabular numeric data. If a NumPy array is provided, each column is
        treated as a separate variable.
    method : str, default 'pearson'
        Correlation method name.
    min_periods : int, default 2
        Minimum number of pairwise valid observations required for each matrix
        entry.

    Returns
    -------
    pandas.DataFrame
        Symmetric correlation matrix with variable names as row/column labels.

    Raises
    ------
    ValueError
        If method is unsupported or input is invalid.

    Examples
    --------
    >>> import pandas as pd
    >>> from minexpy.correlation import correlation_matrix
    >>> df = pd.DataFrame({'Zn': [45, 50, 47], 'Cu': [12, 14, 13]})
    >>> correlation_matrix(df, method='pearson')
              Zn        Cu
    Zn  1.000000  0.993399
    Cu  0.993399  1.000000

    Notes
    -----
    Built-in pandas methods are used for Pearson/Spearman/Kendall. Distance
    and biweight matrices are computed pairwise with explicit finite-value
    filtering.
    """
    numeric_df = _to_numeric_dataframe(data)
    method_lower = method.lower()

    if min_periods < 1:
        raise ValueError("min_periods must be at least 1")

    pandas_methods = {"pearson", "spearman", "kendall"}
    if method_lower in pandas_methods:
        return numeric_df.corr(method=method_lower, min_periods=min_periods)

    custom_methods: Mapping[str, object] = {
        "distance": distance_correlation,
        "biweight": biweight_midcorrelation,
        "biweight_midcorrelation": biweight_midcorrelation,
    }

    if method_lower not in custom_methods:
        supported = sorted(list(pandas_methods) + list(custom_methods.keys()))
        raise ValueError(
            f"Unsupported method '{method}'. Supported methods: {supported}"
        )

    method_func = custom_methods[method_lower]
    columns = list(numeric_df.columns)
    matrix = pd.DataFrame(
        np.nan,
        index=columns,
        columns=columns,
        dtype=float,
    )

    for i, col_i in enumerate(columns):
        x_values = numeric_df[col_i].to_numpy(dtype=float)

        for j in range(i, len(columns)):
            col_j = columns[j]
            y_values = numeric_df[col_j].to_numpy(dtype=float)

            mask = np.isfinite(x_values) & np.isfinite(y_values)
            n_valid = int(mask.sum())

            if n_valid < min_periods:
                value = float(np.nan)
            elif i == j:
                value = 1.0
            else:
                value = float(method_func(x_values[mask], y_values[mask]))

            matrix.iloc[i, j] = value
            matrix.iloc[j, i] = value

    return matrix
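For the built-in methods, behaviour (including `min_periods` and pairwise NaN handling) matches pandas' own `DataFrame.corr`, which the function delegates to. A sketch with an incomplete column:

```python
import numpy as np
import pandas as pd

# Mirrors the pandas fast path of correlation_matrix for built-in methods.
df = pd.DataFrame({
    "Zn": [45.0, 50.0, 47.0, np.nan],   # one missing assay
    "Cu": [12.0, 14.0, 13.0, 15.0],
})
m = df.corr(method="pearson", min_periods=2)  # incomplete row dropped pairwise
```

The Zn/Cu entry is computed from the three complete rows only; if fewer than `min_periods` valid pairs remain, the entry is NaN instead.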

distance_correlation(x, y)

Compute distance correlation for nonlinear dependence detection.

Distance correlation is zero if and only if variables are independent (under finite second moments). It therefore captures both linear and nonlinear dependence patterns that Pearson may miss.

Parameters:

Name Type Description Default
x array - like

First numeric variable.

required
y array - like

Second numeric variable.

required

Returns:

Type Description
float

Distance correlation value in [0, 1].

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> import numpy as np
>>> from minexpy.correlation import distance_correlation
>>> x = np.linspace(-2, 2, 50)
>>> y = x ** 2
>>> round(distance_correlation(x, y), 3)
0.54
Notes

This implementation uses pairwise absolute-distance matrices and classical double-centering. Complexity is O(n^2) in both time and memory.

Source code in minexpy/correlation.py
def distance_correlation(x: ArrayLike1D, y: ArrayLike1D) -> float:
    """
    Compute distance correlation for nonlinear dependence detection.

    Distance correlation is zero if and only if variables are independent
    (under finite second moments). It therefore captures both linear and
    nonlinear dependence patterns that Pearson may miss.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.

    Returns
    -------
    float
        Distance correlation value in ``[0, 1]``.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.correlation import distance_correlation
    >>> x = np.linspace(-2, 2, 50)
    >>> y = x ** 2
    >>> round(distance_correlation(x, y), 3)
    0.54

    Notes
    -----
    This implementation uses pairwise absolute-distance matrices and classical
    double-centering. Complexity is ``O(n^2)`` in both time and memory.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)

    x_dist = np.abs(x_clean[:, None] - x_clean[None, :])
    y_dist = np.abs(y_clean[:, None] - y_clean[None, :])

    x_centered = (
        x_dist
        - x_dist.mean(axis=0, keepdims=True)
        - x_dist.mean(axis=1, keepdims=True)
        + x_dist.mean()
    )
    y_centered = (
        y_dist
        - y_dist.mean(axis=0, keepdims=True)
        - y_dist.mean(axis=1, keepdims=True)
        + y_dist.mean()
    )

    dcov2 = float(np.mean(x_centered * y_centered))
    dvarx = float(np.mean(x_centered * x_centered))
    dvary = float(np.mean(y_centered * y_centered))

    if dvarx <= 0.0 or dvary <= 0.0:
        return 0.0

    dcor2 = dcov2 / np.sqrt(dvarx * dvary)
    dcor = np.sqrt(max(dcor2, 0.0))
    return float(np.clip(dcor, 0.0, 1.0))
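The contrast with Pearson can be checked with a standalone NumPy sketch of the same double-centering recipe (illustrative only; it mirrors the implementation above rather than calling it): Pearson sees essentially nothing in `y = x ** 2`, while the distance-correlation statistic is clearly positive.

```python
import numpy as np

def dcor_sketch(x, y):
    # pairwise absolute-distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # classical double-centering: subtract row/column means, add grand mean
    ac = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    bc = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (ac * bc).mean()
    denom = np.sqrt((ac * ac).mean() * (bc * bc).mean())
    return float(np.sqrt(max(dcov2 / denom, 0.0))) if denom > 0 else 0.0

x = np.linspace(-2, 2, 50)
y = x ** 2
r_pearson = float(np.corrcoef(x, y)[0, 1])  # essentially zero: no linear trend
r_dcor = dcor_sketch(x, y)                  # clearly positive: dependence detected
```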

kendall_correlation(x, y, method='auto', alternative='two-sided')

Compute Kendall's tau rank correlation.

Kendall's tau is based on concordant and discordant pair comparisons. It is robust for ordinal structures, less sensitive to outliers than Pearson, and frequently used for geoscience trend analysis.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
method {'auto', 'asymptotic', 'exact'}

Method forwarded to scipy.stats.kendalltau.

'auto'
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Kendall's tau in [-1, 1].
  • p_value : p-value for testing zero rank association.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import kendall_correlation
>>> x = [2, 4, 6, 8, 10]
>>> y = [5, 7, 8, 11, 13]
>>> kendall_correlation(x, y)
{'correlation': 1.0, 'p_value': 0.016..., 'n': 5}
Notes

Compared with Spearman's rho, Kendall's tau is typically smaller in magnitude but often easier to interpret probabilistically as concordance.

Source code in minexpy/correlation.py
def kendall_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    method: str = "auto",
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Kendall's tau rank correlation.

    Kendall's tau is based on concordant and discordant pair comparisons.
    It is robust for ordinal structures, less sensitive to outliers than
    Pearson, and frequently used for geoscience trend analysis.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    method : {'auto', 'asymptotic', 'exact'}, default 'auto'
        Method forwarded to ``scipy.stats.kendalltau``.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Kendall's tau in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero rank association.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import kendall_correlation
    >>> x = [2, 4, 6, 8, 10]
    >>> y = [5, 7, 8, 11, 13]
    >>> kendall_correlation(x, y)
    {'correlation': 1.0, 'p_value': 0.016..., 'n': 5}

    Notes
    -----
    Compared with Spearman's rho, Kendall's tau is typically smaller in
    magnitude but often easier to interpret probabilistically as concordance.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.kendalltau(
        x_clean,
        y_clean,
        method=method,
        alternative=alternative,
    )
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
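The Notes' point that tau is usually smaller in magnitude than Spearman's rho is easy to confirm with scipy.stats directly (an illustrative comparison on synthetic data, not a minexpy call):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)  # noisy linear relationship

tau, _ = stats.kendalltau(x, y)   # concordance-based measure
rho, _ = stats.spearmanr(x, y)    # rank correlation
# tau and rho agree in sign, but tau is smaller in magnitude
```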

partial_correlation(x, y, controls, alternative='two-sided')

Compute linear partial correlation between x and y.

Partial correlation estimates the residual linear relationship between two variables after regressing out one or more control variables. This is useful when evaluating element-element associations while controlling for depth, lithology proxies, or compositional trends.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
controls array-like

One or more control variables, with shape (n,) or (n, k).

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis for residual-correlation significance test.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : partial correlation coefficient.
  • p_value : p-value from t-distribution approximation.
  • n : number of paired finite observations used.
  • df : residual degrees of freedom, n - k - 2.

Raises:

Type Description
ValueError

If shapes are incompatible, too few observations are available, or degrees of freedom are insufficient.

Examples:

>>> from minexpy.correlation import partial_correlation
>>> x = [10, 12, 13, 16, 20]
>>> y = [2, 3, 3.2, 4.1, 5.2]
>>> z = [100, 110, 105, 120, 130]
>>> partial_correlation(x, y, z)
{'correlation': 0.97..., 'p_value': 0.12..., 'n': 5, 'df': 2}
Notes

This implementation uses least-squares residualization and then applies Pearson correlation to residual vectors. It assumes a linear adjustment model for control effects.

Source code in minexpy/correlation.py
def partial_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    controls: Union[np.ndarray, pd.DataFrame, pd.Series, Sequence[float], Sequence[Sequence[float]]],
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute linear partial correlation between ``x`` and ``y``.

    Partial correlation estimates the residual linear relationship between two
    variables after regressing out one or more control variables. This is
    useful when evaluating element-element associations while controlling for
    depth, lithology proxies, or compositional trends.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    controls : array-like
        One or more control variables, with shape ``(n,)`` or ``(n, k)``.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis for residual-correlation significance test.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : partial correlation coefficient.
        - ``p_value`` : p-value from t-distribution approximation.
        - ``n`` : number of paired finite observations used.
        - ``df`` : residual degrees of freedom, ``n - k - 2``.

    Raises
    ------
    ValueError
        If shapes are incompatible, too few observations are available,
        or degrees of freedom are insufficient.

    Examples
    --------
    >>> from minexpy.correlation import partial_correlation
    >>> x = [10, 12, 13, 16, 20]
    >>> y = [2, 3, 3.2, 4.1, 5.2]
    >>> z = [100, 110, 105, 120, 130]
    >>> partial_correlation(x, y, z)
    {'correlation': 0.97..., 'p_value': 0.12..., 'n': 5, 'df': 2}

    Notes
    -----
    This implementation uses least-squares residualization and then applies
    Pearson correlation to residual vectors. It assumes a linear adjustment
    model for control effects.
    """
    x_values = _to_1d_float_array(x, "x")
    y_values = _to_1d_float_array(y, "y")

    if isinstance(controls, pd.DataFrame):
        control_array = controls.to_numpy(dtype=float)
    elif isinstance(controls, pd.Series):
        control_array = controls.to_numpy(dtype=float)
    else:
        control_array = np.asarray(controls, dtype=float)

    if control_array.ndim == 1:
        control_array = control_array.reshape(-1, 1)
    elif control_array.ndim != 2:
        raise ValueError("controls must be one-dimensional or two-dimensional")

    if x_values.size != y_values.size or x_values.size != control_array.shape[0]:
        raise ValueError("x, y, and controls must have the same number of rows")

    valid_mask = np.isfinite(x_values) & np.isfinite(y_values)
    valid_mask = valid_mask & np.all(np.isfinite(control_array), axis=1)

    x_clean = x_values[valid_mask]
    y_clean = y_values[valid_mask]
    z_clean = control_array[valid_mask]

    n_obs = int(x_clean.size)
    n_controls = int(z_clean.shape[1])
    df = n_obs - n_controls - 2

    if n_obs < 3:
        raise ValueError("At least 3 valid observations are required")
    if df <= 0:
        raise ValueError(
            "Not enough observations for partial correlation with the given controls; "
            "need n > number_of_controls + 2"
        )

    design = np.column_stack([np.ones(n_obs), z_clean])

    beta_x, _, _, _ = np.linalg.lstsq(design, x_clean, rcond=None)
    beta_y, _, _, _ = np.linalg.lstsq(design, y_clean, rcond=None)

    residual_x = x_clean - design @ beta_x
    residual_y = y_clean - design @ beta_y

    r, _ = _extract_stat_pvalue(scipy_stats.pearsonr(residual_x, residual_y))
    r = float(np.clip(r, -1.0, 1.0))

    if abs(r) >= 1.0:
        p_value = 0.0
    else:
        t_stat = r * np.sqrt(df / max(1.0 - r * r, np.finfo(float).eps))
        if alternative == "two-sided":
            p_value = float(2.0 * scipy_stats.t.sf(abs(t_stat), df=df))
        elif alternative == "greater":
            p_value = float(scipy_stats.t.sf(t_stat, df=df))
        elif alternative == "less":
            p_value = float(scipy_stats.t.cdf(t_stat, df=df))
        else:
            raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    return {
        "correlation": r,
        "p_value": p_value,
        "n": n_obs,
        "df": int(df),
    }
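The residualization idea can be demonstrated end to end with NumPy and SciPy alone: two variables that co-vary only through a shared depth trend show a strong raw correlation that largely vanishes once depth is regressed out. The data and the names `depth`, `zn`, `cu` below are synthetic and illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
depth = np.linspace(0.0, 100.0, 120)                 # shared driver (the control)
zn = 0.05 * depth + rng.normal(scale=0.5, size=120)  # element 1: depth trend + noise
cu = 0.04 * depth + rng.normal(scale=0.5, size=120)  # element 2: depth trend + noise

# raw correlation is inflated by the common depth trend
r_raw, _ = stats.pearsonr(zn, cu)

# residualize both variables on [1, depth], then correlate the residuals
design = np.column_stack([np.ones(depth.size), depth])
res_zn = zn - design @ np.linalg.lstsq(design, zn, rcond=None)[0]
res_cu = cu - design @ np.linalg.lstsq(design, cu, rcond=None)[0]
r_partial, _ = stats.pearsonr(res_zn, res_cu)  # near zero: no direct association
```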

pearson_correlation(x, y, alternative='two-sided')

Compute Pearson product-moment correlation coefficient.

Pearson correlation quantifies linear association between two variables. It is the standard first-pass measure when data are approximately linear, homoscedastic, and not dominated by outliers.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Pearson's r in [-1, 1].
  • p_value : p-value for testing zero linear correlation.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import pearson_correlation
>>> x = [10, 12, 15, 20, 21]
>>> y = [3.1, 3.8, 4.2, 5.7, 6.0]
>>> pearson_correlation(x, y)
{'correlation': 0.985..., 'p_value': 0.002..., 'n': 5}
Notes

Pearson correlation is sensitive to extreme values. For geochemical data with heavy tails or outliers, compare results with Spearman, Kendall, or biweight midcorrelation before interpretation.

Source code in minexpy/correlation.py
def pearson_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Pearson product-moment correlation coefficient.

    Pearson correlation quantifies linear association between two variables.
    It is the standard first-pass measure when data are approximately linear,
    homoscedastic, and not dominated by outliers.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Pearson's ``r`` in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero linear correlation.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import pearson_correlation
    >>> x = [10, 12, 15, 20, 21]
    >>> y = [3.1, 3.8, 4.2, 5.7, 6.0]
    >>> pearson_correlation(x, y)
    {'correlation': 0.985..., 'p_value': 0.002..., 'n': 5}

    Notes
    -----
    Pearson correlation is sensitive to extreme values. For geochemical data
    with heavy tails or outliers, compare results with Spearman, Kendall, or
    biweight midcorrelation before interpretation.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.pearsonr(x_clean, y_clean, alternative=alternative)
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
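The outlier sensitivity mentioned in the Notes can be made concrete: a single off-trend extreme value flips the sign of Pearson's r on otherwise perfectly correlated data, while rank-based Spearman stays positive (illustrative sketch using scipy.stats directly):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)  # 1, 2, ..., 10
y = x.copy()
y[-1] = -100.0            # one extreme off-trend value

r_pearson, _ = stats.pearsonr(x, y)  # sign flipped by the single outlier
rho, _ = stats.spearmanr(x, y)       # ranks limit the outlier's influence
```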

spearman_correlation(x, y, alternative='two-sided')

Compute Spearman rank correlation coefficient.

Spearman correlation measures monotonic dependence by correlating ranked values instead of raw values. It is often preferred for skewed geochemical variables and monotonic but nonlinear relationships.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Spearman's rho in [-1, 1].
  • p_value : p-value for testing zero rank association.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import spearman_correlation
>>> x = [1, 2, 3, 4, 5]
>>> y = [1, 4, 9, 16, 25]
>>> spearman_correlation(x, y)
{'correlation': 1.0, 'p_value': 0.0..., 'n': 5}
Notes

Spearman is robust to monotonic nonlinear scaling but can still be influenced by large numbers of tied values. In highly tied data, Kendall's tau is often more conservative.

Source code in minexpy/correlation.py
def spearman_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Spearman rank correlation coefficient.

    Spearman correlation measures monotonic dependence by correlating ranked
    values instead of raw values. It is often preferred for skewed geochemical
    variables and monotonic but nonlinear relationships.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Spearman's ``rho`` in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero rank association.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import spearman_correlation
    >>> x = [1, 2, 3, 4, 5]
    >>> y = [1, 4, 9, 16, 25]
    >>> spearman_correlation(x, y)
    {'correlation': 1.0, 'p_value': 0.0..., 'n': 5}

    Notes
    -----
    Spearman is robust to monotonic nonlinear scaling but can still be
    influenced by large numbers of tied values. In highly tied data,
    Kendall's tau is often more conservative.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.spearmanr(x_clean, y_clean, alternative=alternative)
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
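A monotone but strongly nonlinear relationship illustrates why Spearman is preferred for such data: rank correlation is (numerically) perfect while Pearson understates the association. A sketch using scipy.stats directly:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 5.0, 30)
y = np.exp(x)  # monotone increasing, strongly nonlinear

rho, _ = stats.spearmanr(x, y)       # ranks match exactly
r_pearson, _ = stats.pearsonr(x, y)  # depressed by the curvature
```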

Statistical Visualization

minexpy.statviz

Statistical visualization module for geoscience datasets.

This module provides practical plotting helpers for common statistical visual diagnostics used during geochemical and environmental data analysis:

  • Histogram (linear and log-scale)
  • Box plot / violin plot
  • ECDF (empirical cumulative distribution function)
  • Q-Q plot
  • P-P plot
  • Scatter plot with optional trend line

All public plotting functions return (figure, axis) so users can apply additional Matplotlib customization (annotations, styling, export settings) after MinexPy constructs the base diagnostic plot.

Examples:

Basic histogram:

>>> import numpy as np
>>> from minexpy.statviz import plot_histogram
>>> values = np.random.lognormal(mean=2.2, sigma=0.4, size=200)
>>> fig, ax = plot_histogram(values, bins=30, scale='log')

Scatter with trend line:

>>> from minexpy.statviz import plot_scatter
>>> x = np.array([1, 2, 3, 4, 5])
>>> y = np.array([2, 4, 6, 8, 10])
>>> fig, ax = plot_scatter(x, y, add_trendline=True)

plot_box_violin(data, kind='box', labels=None, ax=None, show_means=True, color='tab:blue', xlabel='Variables', ylabel='Value', title=None)

Plot box plot or violin plot for one or multiple datasets.

Box and violin plots are complementary diagnostics for comparing spread and shape across variables, lithological domains, or spatial groups.

Parameters:

Name Type Description Default
data array-like, mapping, sequence of arrays, or DataFrame

One dataset or multiple datasets.

required
kind {'box', 'violin'}

Plot type to generate.

'box'
labels sequence of str

Custom labels replacing detected dataset names.

None
ax Axes

Existing axis to draw on.

None
show_means bool

If True, display mean marker/line.

True
color str

Primary face color for box/violin glyphs.

'tab:blue'
xlabel str

X-axis label.

'Variables'
ylabel str

Y-axis label.

'Value'
title str

Plot title. If omitted, a default title is used.

None

Returns:

Type Description
tuple

(figure, axis).

Raises:

Type Description
ValueError

If kind is not 'box' or 'violin'.

Examples:

>>> from minexpy.statviz import plot_box_violin
>>> fig, ax = plot_box_violin({'Zn': [1, 2, 3], 'Cu': [2, 3, 4]}, kind='violin')
Notes

For DataFrame input, each numeric column is treated as one distribution. Non-finite values are removed independently per dataset.

Source code in minexpy/statviz.py
def plot_box_violin(
    data: CollectionLike,
    kind: str = "box",
    labels: Optional[Sequence[str]] = None,
    ax: Optional[plt.Axes] = None,
    show_means: bool = True,
    color: str = "tab:blue",
    xlabel: str = "Variables",
    ylabel: str = "Value",
    title: Optional[str] = None,
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot box plot or violin plot for one or multiple datasets.

    Box and violin plots are complementary diagnostics for comparing spread
    and shape across variables, lithological domains, or spatial groups.

    Parameters
    ----------
    data : array-like, mapping, sequence of arrays, or DataFrame
        One dataset or multiple datasets.
    kind : {'box', 'violin'}, default 'box'
        Plot type to generate.
    labels : sequence of str, optional
        Custom labels replacing detected dataset names.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    show_means : bool, default True
        If ``True``, display mean marker/line.
    color : str, default 'tab:blue'
        Primary face color for box/violin glyphs.
    xlabel : str, default 'Variables'
        X-axis label.
    ylabel : str, default 'Value'
        Y-axis label.
    title : str, optional
        Plot title. If omitted, a default title is used.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Raises
    ------
    ValueError
        If ``kind`` is not ``'box'`` or ``'violin'``.

    Examples
    --------
    >>> from minexpy.statviz import plot_box_violin
    >>> fig, ax = plot_box_violin({'Zn': [1, 2, 3], 'Cu': [2, 3, 4]}, kind='violin')

    Notes
    -----
    For DataFrame input, each numeric column is treated as one distribution.
    Non-finite values are removed independently per dataset.
    """
    names, arrays = _parse_collection(data, labels=labels)
    fig, axis = _resolve_axis(ax)

    kind_lower = kind.lower()
    if kind_lower == "box":
        box = axis.boxplot(arrays, labels=names, patch_artist=True, showmeans=show_means)
        for patch in box["boxes"]:
            patch.set_facecolor(color)
            patch.set_alpha(0.45)
        default_title = "Box Plot"
    elif kind_lower == "violin":
        violin = axis.violinplot(arrays, showmeans=show_means, showmedians=True)
        for body in violin["bodies"]:
            body.set_facecolor(color)
            body.set_edgecolor("black")
            body.set_alpha(0.45)
        axis.set_xticks(np.arange(1, len(names) + 1))
        axis.set_xticklabels(names)
        default_title = "Violin Plot"
    else:
        raise ValueError("kind must be either 'box' or 'violin'")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(default_title if title is None else title)

    return fig, axis
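The per-dataset cleaning described in the Notes (non-finite values removed independently) can be sketched with plain NumPy; `_parse_collection` is internal, so this only illustrates the described behavior, not the actual helper:

```python
import numpy as np

raw = {"Zn": [1.0, np.nan, 3.0], "Cu": [2.0, 3.0, np.inf, 4.0]}
cleaned = {}
for name, vals in raw.items():
    arr = np.asarray(vals, dtype=float)
    cleaned[name] = arr[np.isfinite(arr)]  # NaN/inf dropped per dataset
# datasets may end up with different lengths, which box/violin plots accept
```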

plot_ecdf(data, labels=None, ax=None, xlabel='Value', ylabel='Empirical Cumulative Probability', title='ECDF')

Plot empirical cumulative distribution function (ECDF).

ECDF plots avoid histogram binning artifacts and are useful for direct comparison of distribution shifts and tail behavior across groups.

Parameters:

Name Type Description Default
data array-like, mapping, sequence of arrays, or DataFrame

One dataset or multiple datasets.

required
labels sequence of str

Custom labels replacing detected dataset names.

None
ax Axes

Existing axis to draw on.

None
xlabel str

X-axis label.

'Value'
ylabel str

Y-axis label.

'Empirical Cumulative Probability'
title str

Plot title.

'ECDF'

Returns:

Type Description
tuple

(figure, axis).

Examples:

>>> from minexpy.statviz import plot_ecdf
>>> fig, ax = plot_ecdf([1, 3, 2, 4, 8])
Source code in minexpy/statviz.py
def plot_ecdf(
    data: CollectionLike,
    labels: Optional[Sequence[str]] = None,
    ax: Optional[plt.Axes] = None,
    xlabel: str = "Value",
    ylabel: str = "Empirical Cumulative Probability",
    title: str = "ECDF",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot empirical cumulative distribution function (ECDF).

    ECDF plots avoid histogram binning artifacts and are useful for direct
    comparison of distribution shifts and tail behavior across groups.

    Parameters
    ----------
    data : array-like, mapping, sequence of arrays, or DataFrame
        One dataset or multiple datasets.
    labels : sequence of str, optional
        Custom labels replacing detected dataset names.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    xlabel : str, default 'Value'
        X-axis label.
    ylabel : str, default 'Empirical Cumulative Probability'
        Y-axis label.
    title : str, default 'ECDF'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_ecdf
    >>> fig, ax = plot_ecdf([1, 3, 2, 4, 8])
    """
    names, arrays = _parse_collection(data, labels=labels)
    fig, axis = _resolve_axis(ax)

    for name, values in zip(names, arrays):
        sorted_values = np.sort(values)
        empirical_prob = np.arange(1, sorted_values.size + 1) / sorted_values.size
        axis.step(sorted_values, empirical_prob, where="post", label=name)

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)

    if len(arrays) > 1 or labels is not None:
        axis.legend()

    return fig, axis
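The ECDF values themselves are simple to compute without plotting, matching the `arange / n` step construction in the source above:

```python
import numpy as np

values = np.array([1.0, 3.0, 2.0, 4.0, 8.0])
sorted_values = np.sort(values)
# empirical probability jumps by 1/n at each observation
ecdf = np.arange(1, sorted_values.size + 1) / sorted_values.size
```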

plot_histogram(data, bins=30, scale='linear', density=False, ax=None, color='tab:blue', alpha=0.75, label=None, xlabel='Value', ylabel=None, title='Histogram')

Plot a histogram with linear or logarithmic x-axis scaling.

Histograms are often the first diagnostic for distribution shape, outlier concentration, and modal behavior. scale='log' is especially useful for right-skewed concentration data with multiplicative structure.

Parameters:

Name Type Description Default
data array-like

One-dimensional numeric dataset.

required
bins int or sequence

Number of bins or explicit bin edges.

30
scale {'linear', 'log'}

X-axis scale and binning mode.

'linear'
density bool

If True, normalize bar heights to represent density.

False
ax Axes

Existing axis to draw on.

None
color str

Bar face color.

'tab:blue'
alpha float

Bar opacity.

0.75
label str

Legend label for plotted dataset.

None
xlabel str

X-axis label.

'Value'
ylabel str

Y-axis label. If None, set automatically to Frequency or Density based on density.

None
title str

Plot title.

'Histogram'

Returns:

Type Description
tuple

(figure, axis) for additional customization.

Raises:

Type Description
ValueError

If scale is invalid or if scale='log' and data contain non-positive values.

Examples:

>>> from minexpy.statviz import plot_histogram
>>> fig, ax = plot_histogram([1, 2, 2, 3, 5], bins=5)
>>> fig, ax = plot_histogram([1, 2, 3, 4], scale='log')
Notes

For scale='log' and integer bins, logarithmically spaced bin edges are constructed from the finite min/max of the data.

Source code in minexpy/statviz.py
def plot_histogram(
    data: ArrayLike1D,
    bins: Union[int, Sequence[float]] = 30,
    scale: str = "linear",
    density: bool = False,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    alpha: float = 0.75,
    label: Optional[str] = None,
    xlabel: str = "Value",
    ylabel: Optional[str] = None,
    title: str = "Histogram",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot a histogram with linear or logarithmic x-axis scaling.

    Histograms are often the first diagnostic for distribution shape,
    outlier concentration, and modal behavior. ``scale='log'`` is especially
    useful for right-skewed concentration data with multiplicative structure.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    bins : int or sequence, default 30
        Number of bins or explicit bin edges.
    scale : {'linear', 'log'}, default 'linear'
        X-axis scale and binning mode.
    density : bool, default False
        If ``True``, normalize bar heights to represent density.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Bar face color.
    alpha : float, default 0.75
        Bar opacity.
    label : str, optional
        Legend label for plotted dataset.
    xlabel : str, default 'Value'
        X-axis label.
    ylabel : str, optional
        Y-axis label. If ``None``, set automatically to ``Frequency`` or
        ``Density`` based on ``density``.
    title : str, default 'Histogram'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)`` for additional customization.

    Raises
    ------
    ValueError
        If ``scale`` is invalid or if ``scale='log'`` and data contain
        non-positive values.

    Examples
    --------
    >>> from minexpy.statviz import plot_histogram
    >>> fig, ax = plot_histogram([1, 2, 2, 3, 5], bins=5)
    >>> fig, ax = plot_histogram([1, 2, 3, 4], scale='log')

    Notes
    -----
    For ``scale='log'`` and integer bins, logarithmically spaced bin edges are
    constructed from the finite min/max of the data.
    """
    values = _to_1d_float_array(data)
    fig, axis = _resolve_axis(ax)

    scale_lower = scale.lower()
    if scale_lower not in {"linear", "log"}:
        raise ValueError("scale must be either 'linear' or 'log'")

    bins_to_use: Union[int, np.ndarray, Sequence[float]] = bins
    if scale_lower == "log":
        if np.any(values <= 0):
            raise ValueError("Log-scale histogram requires strictly positive values")

        if isinstance(bins, int):
            min_value = float(np.min(values))
            max_value = float(np.max(values))
            bins_to_use = np.logspace(np.log10(min_value), np.log10(max_value), bins + 1)

    axis.hist(
        values,
        bins=bins_to_use,
        density=density,
        color=color,
        alpha=alpha,
        edgecolor="black",
        linewidth=0.8,
        label=label,
    )

    if scale_lower == "log":
        axis.set_xscale("log")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel if ylabel is not None else ("Density" if density else "Frequency"))
    axis.set_title(title)

    if label:
        axis.legend()

    return fig, axis
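The log-binning rule from the Notes (logarithmically spaced edges between the data min and max when `bins` is an integer) can be reproduced directly with `np.logspace`:

```python
import numpy as np

values = np.array([1.2, 3.5, 8.0, 22.0, 60.0, 140.0])
bins = 5
# bins + 1 edges, geometrically spaced from min to max
edges = np.logspace(np.log10(values.min()), np.log10(values.max()), bins + 1)
```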

plot_pp(data, distribution='norm', distribution_parameters=None, fit_distribution=True, ax=None, color='tab:blue', xlabel='Theoretical Cumulative Probability', ylabel='Empirical Cumulative Probability', title='P-P Plot')

Plot probability-probability (P-P) diagnostic against a distribution.

P-P plots compare empirical cumulative probabilities against theoretical CDF values. They are often more sensitive in the central part of the distribution than Q-Q plots.

Parameters:

Name Type Description Default
data array-like

One-dimensional numeric dataset.

required
distribution str or scipy.stats distribution

Reference distribution name or object.

'norm'
distribution_parameters sequence of float

Fixed distribution parameters. If omitted and fit_distribution=True, parameters are estimated from the sample via distribution.fit when available.

None
fit_distribution bool

Fit distribution parameters from data when not provided.

True
ax Axes

Existing axis to draw on.

None
color str

Marker color.

'tab:blue'
xlabel str

X-axis label.

'Theoretical Cumulative Probability'
ylabel str

Y-axis label.

'Empirical Cumulative Probability'
title str

Plot title.

'P-P Plot'

Returns:

Type Description
tuple

(figure, axis).

Examples:

>>> from minexpy.statviz import plot_pp
>>> fig, ax = plot_pp([1.2, 1.4, 1.8, 2.0, 2.1])
Source code in minexpy/statviz.py
def plot_pp(
    data: ArrayLike1D,
    distribution: DistributionLike = "norm",
    distribution_parameters: Optional[Sequence[float]] = None,
    fit_distribution: bool = True,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    xlabel: str = "Theoretical Cumulative Probability",
    ylabel: str = "Empirical Cumulative Probability",
    title: str = "P-P Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot probability-probability (P-P) diagnostic against a distribution.

    P-P plots compare empirical cumulative probabilities against theoretical
    CDF values. They are often more sensitive in the central part of the
    distribution than Q-Q plots.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    distribution : str or scipy.stats distribution, default 'norm'
        Reference distribution name or object.
    distribution_parameters : sequence of float, optional
        Fixed distribution parameters. If omitted and ``fit_distribution=True``,
        parameters are estimated from the sample via ``distribution.fit`` when
        available.
    fit_distribution : bool, default True
        Fit distribution parameters from data when not provided.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Marker color.
    xlabel : str, default 'Theoretical Cumulative Probability'
        X-axis label.
    ylabel : str, default 'Empirical Cumulative Probability'
        Y-axis label.
    title : str, default 'P-P Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_pp
    >>> fig, ax = plot_pp([1.2, 1.4, 1.8, 2.0, 2.1])
    """
    values = np.sort(_to_1d_float_array(data))
    dist_obj = _resolve_distribution(distribution)

    params: Tuple[float, ...]
    if distribution_parameters is not None:
        params = tuple(float(p) for p in distribution_parameters)
    elif fit_distribution and hasattr(dist_obj, "fit"):
        params = tuple(float(p) for p in dist_obj.fit(values))
    else:
        params = tuple()

    n = values.size
    empirical_prob = (np.arange(1, n + 1) - 0.5) / n
    theoretical_prob = dist_obj.cdf(values, *params)
    theoretical_prob = np.clip(theoretical_prob, 0.0, 1.0)

    fig, axis = _resolve_axis(ax)
    axis.scatter(theoretical_prob, empirical_prob, color=color, alpha=0.8, label="Observed")
    axis.plot([0.0, 1.0], [0.0, 1.0], linestyle="--", color="black", linewidth=1.2, label="1:1")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)
    axis.legend()

    return fig, axis
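The P-P construction above can be sketched without the plotting layer: sort the sample, compute the midpoint plotting positions `(i - 0.5) / n`, and evaluate the fitted reference CDF at each sorted value. A minimal numpy-only sketch assuming a normal reference (the CDF is written with `math.erf` so SciPy is not required; `plot_pp` itself delegates fitting to `distribution.fit`):

```python
import math
import numpy as np

def normal_cdf(x, mean, std):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
values = np.sort(data)
n = values.size

# Empirical plotting positions (i - 0.5) / n, as used by plot_pp
empirical = (np.arange(1, n + 1) - 0.5) / n

# MLE-style location/scale fit (ddof=0 matches scipy's norm.fit)
mean, std = values.mean(), values.std(ddof=0)
theoretical = np.array([normal_cdf(v, mean, std) for v in values])

# Both axes of a P-P plot live in [0, 1] and increase with the sorted sample
assert np.all((theoretical >= 0.0) & (theoretical <= 1.0))
assert np.all(np.diff(empirical) > 0)
```

Points from such a pair of arrays falling close to the dashed 1:1 line indicate good agreement in the body of the distribution.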

plot_qq(data, distribution='norm', distribution_parameters=None, fit_distribution=True, show_fit_line=True, ax=None, marker='o', color='tab:blue', xlabel='Theoretical Quantiles', ylabel='Sample Quantiles', title='Q-Q Plot')

Plot quantile-quantile (Q-Q) diagnostic against a theoretical distribution.

Q-Q plots assess distributional fit by comparing sample quantiles to theoretical quantiles. Tail deviations are especially informative for heavy-tailed geochemical variables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | array-like | One-dimensional numeric dataset. | required |
| `distribution` | str or scipy.stats distribution | Reference distribution name or object. | `'norm'` |
| `distribution_parameters` | sequence of float | Fixed distribution parameters. If omitted and `fit_distribution=True`, parameters are estimated from the sample via `distribution.fit` when available. | `None` |
| `fit_distribution` | bool | Fit distribution parameters from data when not provided. | `True` |
| `show_fit_line` | bool | If True, overlay least-squares line through plotted points. | `True` |
| `ax` | matplotlib.axes.Axes | Existing axis to draw on. | `None` |
| `marker` | str | Marker style for sample quantiles. | `'o'` |
| `color` | str | Marker color. | `'tab:blue'` |
| `xlabel` | str | X-axis label. | `'Theoretical Quantiles'` |
| `ylabel` | str | Y-axis label. | `'Sample Quantiles'` |
| `title` | str | Plot title. | `'Q-Q Plot'` |

Returns:

| Type | Description |
| --- | --- |
| tuple | `(figure, axis)`. |

Examples:

>>> from minexpy.statviz import plot_qq
>>> fig, ax = plot_qq([1.2, 1.4, 1.8, 2.0, 2.1])
Notes

The dashed 1:1 line shows ideal agreement. Systematic curvature indicates mismatch between empirical and theoretical distributions.

Source code in minexpy/statviz.py
def plot_qq(
    data: ArrayLike1D,
    distribution: DistributionLike = "norm",
    distribution_parameters: Optional[Sequence[float]] = None,
    fit_distribution: bool = True,
    show_fit_line: bool = True,
    ax: Optional[plt.Axes] = None,
    marker: str = "o",
    color: str = "tab:blue",
    xlabel: str = "Theoretical Quantiles",
    ylabel: str = "Sample Quantiles",
    title: str = "Q-Q Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot quantile-quantile (Q-Q) diagnostic against a theoretical distribution.

    Q-Q plots assess distributional fit by comparing sample quantiles to
    theoretical quantiles. Tail deviations are especially informative for
    heavy-tailed geochemical variables.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    distribution : str or scipy.stats distribution, default 'norm'
        Reference distribution name or object.
    distribution_parameters : sequence of float, optional
        Fixed distribution parameters. If omitted and ``fit_distribution=True``,
        parameters are estimated from the sample via ``distribution.fit`` when
        available.
    fit_distribution : bool, default True
        Fit distribution parameters from data when not provided.
    show_fit_line : bool, default True
        If ``True``, overlay least-squares line through plotted points.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    marker : str, default 'o'
        Marker style for sample quantiles.
    color : str, default 'tab:blue'
        Marker color.
    xlabel : str, default 'Theoretical Quantiles'
        X-axis label.
    ylabel : str, default 'Sample Quantiles'
        Y-axis label.
    title : str, default 'Q-Q Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_qq
    >>> fig, ax = plot_qq([1.2, 1.4, 1.8, 2.0, 2.1])

    Notes
    -----
    The dashed 1:1 line shows ideal agreement. Systematic curvature indicates
    mismatch between empirical and theoretical distributions.
    """
    values = _to_1d_float_array(data)
    dist_obj = _resolve_distribution(distribution)

    params: Tuple[float, ...]
    if distribution_parameters is not None:
        params = tuple(float(p) for p in distribution_parameters)
    elif fit_distribution and hasattr(dist_obj, "fit"):
        params = tuple(float(p) for p in dist_obj.fit(values))
    else:
        params = tuple()

    theoretical_quantiles, sample_quantiles = scipy_stats.probplot(
        values,
        dist=dist_obj,
        sparams=params,
        fit=False,
    )

    fig, axis = _resolve_axis(ax)
    axis.scatter(
        theoretical_quantiles,
        sample_quantiles,
        marker=marker,
        color=color,
        alpha=0.8,
        label="Sample quantiles",
    )

    line_min = min(float(np.min(theoretical_quantiles)), float(np.min(sample_quantiles)))
    line_max = max(float(np.max(theoretical_quantiles)), float(np.max(sample_quantiles)))
    line_x = np.linspace(line_min, line_max, 200)

    axis.plot(line_x, line_x, linestyle="--", color="black", linewidth=1.2, label="1:1")

    if show_fit_line:
        fit = scipy_stats.linregress(theoretical_quantiles, sample_quantiles)
        axis.plot(
            line_x,
            fit.slope * line_x + fit.intercept,
            color="tab:red",
            linewidth=1.4,
            label=f"Fit line (r={fit.rvalue:.3f})",
        )

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)
    axis.legend()

    return fig, axis
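The quantile pairing that `plot_qq` scatters comes from `scipy.stats.probplot` with `fit=False`, which returns the reference distribution's quantiles alongside the ordered sample. A short sketch of that step and of the overlay fit line (this version skips the parameter-fitting branch and uses a standard normal reference, so for a sample drawn from N(10, 2) the fitted slope lands near the sample scale and the intercept near the sample location):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Raw point pairs of the Q-Q plot: theoretical quantiles vs ordered sample
theoretical, ordered = stats.probplot(sample, dist="norm", fit=False)

# The optional overlay in plot_qq is a least-squares line through those pairs
fit = stats.linregress(theoretical, ordered)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.3f}")
```

A near-unit `r` with slope and intercept tracking the sample's scale and location is the signature of a good normal fit; heavy-tailed geochemical variables instead show points peeling away from the line in the tails.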

plot_scatter(x, y, ax=None, color='tab:blue', alpha=0.8, marker='o', label=None, add_trendline=False, trendline_color='tab:red', xlabel='X', ylabel='Y', title='Scatter Plot')

Plot scatter data with optional least-squares trend line.

Scatter plots are central to geoscience exploratory analysis because they reveal linear/nonlinear patterns, heteroscedasticity, clustering, and potential outliers before formal modeling.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | array-like | X-variable values. | required |
| `y` | array-like | Y-variable values. | required |
| `ax` | matplotlib.axes.Axes | Existing axis to draw on. | `None` |
| `color` | str | Marker color. | `'tab:blue'` |
| `alpha` | float | Marker opacity. | `0.8` |
| `marker` | str | Marker style. | `'o'` |
| `label` | str | Legend label for plotted points. | `None` |
| `add_trendline` | bool | If True, overlay least-squares regression line and annotate Pearson r in legend. | `False` |
| `trendline_color` | str | Trend line color. | `'tab:red'` |
| `xlabel` | str | X-axis label. | `'X'` |
| `ylabel` | str | Y-axis label. | `'Y'` |
| `title` | str | Plot title. | `'Scatter Plot'` |

Returns:

| Type | Description |
| --- | --- |
| tuple | `(figure, axis)`. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If vector lengths differ or fewer than two valid paired values exist. |

Examples:

>>> from minexpy.statviz import plot_scatter
>>> fig, ax = plot_scatter([1, 2, 3], [2, 4, 6], add_trendline=True)
Source code in minexpy/statviz.py
def plot_scatter(
    x: ArrayLike1D,
    y: ArrayLike1D,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    alpha: float = 0.8,
    marker: str = "o",
    label: Optional[str] = None,
    add_trendline: bool = False,
    trendline_color: str = "tab:red",
    xlabel: str = "X",
    ylabel: str = "Y",
    title: str = "Scatter Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot scatter data with optional least-squares trend line.

    Scatter plots are central to geoscience exploratory analysis because they
    reveal linear/nonlinear patterns, heteroscedasticity, clustering, and
    potential outliers before formal modeling.

    Parameters
    ----------
    x : array-like
        X-variable values.
    y : array-like
        Y-variable values.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Marker color.
    alpha : float, default 0.8
        Marker opacity.
    marker : str, default 'o'
        Marker style.
    label : str, optional
        Legend label for plotted points.
    add_trendline : bool, default False
        If ``True``, overlay least-squares regression line and annotate
        Pearson ``r`` in legend.
    trendline_color : str, default 'tab:red'
        Trend line color.
    xlabel : str, default 'X'
        X-axis label.
    ylabel : str, default 'Y'
        Y-axis label.
    title : str, default 'Scatter Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Raises
    ------
    ValueError
        If vector lengths differ or fewer than two valid paired values exist.

    Examples
    --------
    >>> from minexpy.statviz import plot_scatter
    >>> fig, ax = plot_scatter([1, 2, 3], [2, 4, 6], add_trendline=True)
    """
    x_values = _to_1d_float_array(x, name="x")
    y_values = _to_1d_float_array(y, name="y")

    if x_values.size != y_values.size:
        raise ValueError("x and y must have the same length")

    mask = np.isfinite(x_values) & np.isfinite(y_values)
    x_clean = x_values[mask]
    y_clean = y_values[mask]

    if x_clean.size < 2:
        raise ValueError("At least two valid paired observations are required")

    fig, axis = _resolve_axis(ax)
    point_label = "Samples" if label is None else label

    axis.scatter(
        x_clean,
        y_clean,
        color=color,
        alpha=alpha,
        marker=marker,
        label=point_label,
    )

    if add_trendline:
        fit = scipy_stats.linregress(x_clean, y_clean)
        line_x = np.linspace(float(np.min(x_clean)), float(np.max(x_clean)), 200)
        line_y = fit.slope * line_x + fit.intercept
        axis.plot(
            line_x,
            line_y,
            color=trendline_color,
            linewidth=1.6,
            label=f"Trend line (r={fit.rvalue:.3f})",
        )

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)

    if label is not None or add_trendline:
        axis.legend()

    return fig, axis
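The `add_trendline` path above is ordinary least squares on the finite paired values via `scipy.stats.linregress`. A minimal sketch of that computation on exactly linear data, where the fitted slope, intercept, and Pearson r are known in advance:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # exactly linear, so r should be 1

# Mirror plot_scatter's cleaning step: keep only finite pairs
mask = np.isfinite(x) & np.isfinite(y)
fit = stats.linregress(x[mask], y[mask])

assert abs(fit.slope - 2.0) < 1e-9
assert abs(fit.intercept - 1.0) < 1e-9
assert abs(fit.rvalue - 1.0) < 1e-9
```

The `fit.rvalue` recovered here is the Pearson correlation that `plot_scatter` annotates in the legend as `r=...`.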