Statistics API

This page provides API references for MinexPy's statistics functionality.

Statistical Analysis

minexpy.stats

Statistical analysis module for geoscience data.

This module provides comprehensive descriptive statistical metrics for analyzing geochemical and geological sample data. It includes both functional and class-based APIs for maximum flexibility.

Examples:

Basic usage with functions:

>>> import numpy as np
>>> import minexpy.stats as mstats
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> skew = mstats.skewness(data)
>>> kurt = mstats.kurtosis(data)
>>> summary = mstats.describe(data)

Class-based usage:

>>> from minexpy.stats import StatisticalAnalyzer
>>> import pandas as pd
>>> 
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> results = analyzer.summary()

StatisticalAnalyzer

Comprehensive statistical analyzer for geoscience data.

This class provides a convenient, object-oriented interface for calculating multiple statistical metrics on geochemical data. It supports both single arrays and pandas DataFrames for batch analysis.

The class is designed to be intuitive and flexible, allowing researchers to quickly obtain comprehensive statistical summaries of their data with minimal code.

Attributes:

- data (ndarray or DataFrame): The input data being analyzed.
- is_dataframe (bool): True if the input is a DataFrame, False if it is an array.

Examples:

Single array analysis:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)

DataFrame analysis (multiple columns):

>>> import pandas as pd
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> df = pd.read_csv('geochemical_data.csv')
>>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
>>> summary_df = analyzer.summary()
>>> print(summary_df)

Individual metric access:

>>> analyzer = StatisticalAnalyzer(data)
>>> skew = analyzer.skewness()
>>> kurt = analyzer.kurtosis()
>>> cv = analyzer.coefficient_of_variation()
Source code in minexpy/stats.py
class StatisticalAnalyzer:
    """
    Comprehensive statistical analyzer for geoscience data.

    This class provides a convenient, object-oriented interface for
    calculating multiple statistical metrics on geochemical data. It
    supports both single arrays and pandas DataFrames for batch analysis.

    The class is designed to be intuitive and flexible, allowing researchers
    to quickly obtain comprehensive statistical summaries of their data
    with minimal code.

    Attributes
    ----------
    data : numpy.ndarray or pandas.DataFrame
        The input data being analyzed.
    is_dataframe : bool
        True if input is a DataFrame, False if it's an array.

    Examples
    --------
    Single array analysis:

        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> summary = analyzer.summary()
        >>> print(summary)

    DataFrame analysis (multiple columns):

        >>> import pandas as pd
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> df = pd.read_csv('geochemical_data.csv')
        >>> analyzer = StatisticalAnalyzer(df[['Zn', 'Cu', 'Pb']])
        >>> summary_df = analyzer.summary()
        >>> print(summary_df)

    Individual metric access:

        >>> analyzer = StatisticalAnalyzer(data)
        >>> skew = analyzer.skewness()
        >>> kurt = analyzer.kurtosis()
        >>> cv = analyzer.coefficient_of_variation()
    """

    def __init__(self, data: Union[np.ndarray, pd.Series, pd.DataFrame, List[float]]):
        """
        Initialize the statistical analyzer.

        Parameters
        ----------
        data : array-like or DataFrame
            Input data to analyze. Can be:
            - 1D numpy array
            - pandas Series
            - pandas DataFrame (for multi-column analysis)
            - Python list

        Raises
        ------
        ValueError
            If input data is empty or contains only NaN values.

        Examples
        --------
        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> # Array input
        >>> data = np.array([1, 2, 3, 4, 5])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> 
        >>> # Series input
        >>> import pandas as pd
        >>> series = pd.Series([1, 2, 3, 4, 5])
        >>> analyzer = StatisticalAnalyzer(series)
        """
        if isinstance(data, pd.DataFrame):
            self.data = data
            self.is_dataframe = True
        elif isinstance(data, pd.Series):
            self.data = data.values
            self.is_dataframe = False
        else:
            self.data = np.asarray(data)
            self.is_dataframe = False

        # Validate data
        if self.is_dataframe:
            if self.data.empty:
                raise ValueError("Input DataFrame is empty")
        else:
            valid_data = self.data[~np.isnan(self.data)]
            if len(valid_data) == 0:
                raise ValueError("Input data is empty or contains only NaN values")

    def summary(self, as_dataframe: bool = True, 
                percentiles: Optional[List[float]] = None) -> Union[pd.DataFrame, Dict]:
        """
        Calculate comprehensive statistical summary.

        This method computes all major descriptive statistics for the data.
        For DataFrames, it calculates statistics for each column separately.

        Parameters
        ----------
        as_dataframe : bool, default True
            If True, returns results as a pandas DataFrame (for DataFrames)
            or Series (for arrays). If False, returns a dictionary.
        percentiles : list of float, optional
            Additional percentiles to include in the summary beyond defaults.

        Returns
        -------
        DataFrame, Series, or dict
            Statistical summary. For DataFrames, returns a DataFrame with
            columns as rows and statistics as columns. For arrays, returns
            a Series or dict with statistics as values.

        Examples
        --------
        >>> import numpy as np
        >>> from minexpy.stats import StatisticalAnalyzer
        >>> 
        >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
        >>> analyzer = StatisticalAnalyzer(data)
        >>> summary = analyzer.summary()
        >>> print(summary)
        """
        if self.is_dataframe:
            results = {}
            for col in self.data.columns:
                results[col] = describe(self.data[col].values, percentiles=percentiles)

            if as_dataframe:
                return pd.DataFrame(results).T
            return results
        else:
            results = describe(self.data, percentiles=percentiles)
            if as_dataframe:
                return pd.Series(results)
            return results

    def skewness(self) -> Union[float, pd.Series]:
        """
        Calculate skewness.

        Returns
        -------
        float or Series
            Skewness value(s). For DataFrames, returns a Series with
            skewness for each column.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: skewness(x))
        return skewness(self.data)

    def kurtosis(self, fisher: bool = True) -> Union[float, pd.Series]:
        """
        Calculate kurtosis.

        Parameters
        ----------
        fisher : bool, default True
            Use Fisher's definition (excess kurtosis) if True.

        Returns
        -------
        float or Series
            Kurtosis value(s). For DataFrames, returns a Series with
            kurtosis for each column.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: kurtosis(x, fisher=fisher))
        return kurtosis(self.data, fisher=fisher)

    def std(self, ddof: int = 1) -> Union[float, pd.Series]:
        """
        Calculate standard deviation.

        Parameters
        ----------
        ddof : int, default 1
            Delta degrees of freedom.

        Returns
        -------
        float or Series
            Standard deviation value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.std(ddof=ddof)
        return std(self.data, ddof=ddof)

    def variance(self, ddof: int = 1) -> Union[float, pd.Series]:
        """
        Calculate variance.

        Parameters
        ----------
        ddof : int, default 1
            Delta degrees of freedom.

        Returns
        -------
        float or Series
            Variance value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.var(ddof=ddof)
        return variance(self.data, ddof=ddof)

    def mean(self) -> Union[float, pd.Series]:
        """
        Calculate mean.

        Returns
        -------
        float or Series
            Mean value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.mean()
        return mean(self.data)

    def median(self) -> Union[float, pd.Series]:
        """
        Calculate median.

        Returns
        -------
        float or Series
            Median value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.median()
        return median(self.data)

    def coefficient_of_variation(self) -> Union[float, pd.Series]:
        """
        Calculate coefficient of variation.

        Returns
        -------
        float or Series
            CV value(s). For DataFrames, returns a Series.
            Returns NaN for columns/arrays with zero mean.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: coefficient_of_variation(x))
        return coefficient_of_variation(self.data)

    def iqr(self) -> Union[float, pd.Series]:
        """
        Calculate interquartile range.

        Returns
        -------
        float or Series
            IQR value(s). For DataFrames, returns a Series.
        """
        if self.is_dataframe:
            return self.data.apply(lambda x: iqr(x))
        return iqr(self.data)

    def z_score(self, value: Optional[float] = None) -> Union[float, np.ndarray, pd.DataFrame]:
        """
        Calculate z-scores.

        Parameters
        ----------
        value : float, optional
            If provided, calculates z-score for this specific value.
            Otherwise, returns z-scores for all data points.

        Returns
        -------
        float, array, or DataFrame
            Z-score(s). For DataFrames, returns a DataFrame with z-scores
            for each column.
        """
        if self.is_dataframe:
            if value is not None:
                return self.data.apply(lambda x: z_score(x, value=value))
            else:
                return self.data.apply(lambda x: z_score(x))
        return z_score(self.data, value=value)

__init__(data)

Initialize the statistical analyzer.

Parameters:

- data (array-like or DataFrame, required): Input data to analyze. Can be a 1D numpy array, a pandas Series, a pandas DataFrame (for multi-column analysis), or a Python list.

Raises:

- ValueError: If the input data is empty or contains only NaN values.

Examples:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> # Array input
>>> data = np.array([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(data)
>>> 
>>> # Series input
>>> import pandas as pd
>>> series = pd.Series([1, 2, 3, 4, 5])
>>> analyzer = StatisticalAnalyzer(series)
Source code in minexpy/stats.py
def __init__(self, data: Union[np.ndarray, pd.Series, pd.DataFrame, List[float]]):
    """
    Initialize the statistical analyzer.

    Parameters
    ----------
    data : array-like or DataFrame
        Input data to analyze. Can be:
        - 1D numpy array
        - pandas Series
        - pandas DataFrame (for multi-column analysis)
        - Python list

    Raises
    ------
    ValueError
        If input data is empty or contains only NaN values.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import StatisticalAnalyzer
    >>> 
    >>> # Array input
    >>> data = np.array([1, 2, 3, 4, 5])
    >>> analyzer = StatisticalAnalyzer(data)
    >>> 
    >>> # Series input
    >>> import pandas as pd
    >>> series = pd.Series([1, 2, 3, 4, 5])
    >>> analyzer = StatisticalAnalyzer(series)
    """
    if isinstance(data, pd.DataFrame):
        self.data = data
        self.is_dataframe = True
    elif isinstance(data, pd.Series):
        self.data = data.values
        self.is_dataframe = False
    else:
        self.data = np.asarray(data)
        self.is_dataframe = False

    # Validate data
    if self.is_dataframe:
        if self.data.empty:
            raise ValueError("Input DataFrame is empty")
    else:
        valid_data = self.data[~np.isnan(self.data)]
        if len(valid_data) == 0:
            raise ValueError("Input data is empty or contains only NaN values")
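The NaN-validation step in the constructor above can be exercised with plain numpy. This is an independent sketch (numpy only, no minexpy import) showing why an all-NaN input fails validation:

```python
import numpy as np

def has_valid_data(data) -> bool:
    """Mirror of the constructor's check: drop NaNs, then test
    whether any finite observations remain."""
    arr = np.asarray(data, dtype=float)
    return len(arr[~np.isnan(arr)]) > 0

print(has_valid_data([1.0, 2.0, np.nan]))  # True: two valid observations remain
print(has_valid_data([np.nan, np.nan]))    # False: __init__ would raise ValueError
```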

coefficient_of_variation()

Calculate coefficient of variation.

Returns:

- float or Series: CV value(s). For DataFrames, returns a Series. Returns NaN for columns/arrays with zero mean.

Source code in minexpy/stats.py
def coefficient_of_variation(self) -> Union[float, pd.Series]:
    """
    Calculate coefficient of variation.

    Returns
    -------
    float or Series
        CV value(s). For DataFrames, returns a Series.
        Returns NaN for columns/arrays with zero mean.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: coefficient_of_variation(x))
    return coefficient_of_variation(self.data)

iqr()

Calculate interquartile range.

Returns:

- float or Series: IQR value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def iqr(self) -> Union[float, pd.Series]:
    """
    Calculate interquartile range.

    Returns
    -------
    float or Series
        IQR value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: iqr(x))
    return iqr(self.data)

kurtosis(fisher=True)

Calculate kurtosis.

Parameters:

- fisher (bool, default True): Use Fisher's definition (excess kurtosis) if True.

Returns:

- float or Series: Kurtosis value(s). For DataFrames, returns a Series with kurtosis for each column.

Source code in minexpy/stats.py
def kurtosis(self, fisher: bool = True) -> Union[float, pd.Series]:
    """
    Calculate kurtosis.

    Parameters
    ----------
    fisher : bool, default True
        Use Fisher's definition (excess kurtosis) if True.

    Returns
    -------
    float or Series
        Kurtosis value(s). For DataFrames, returns a Series with
        kurtosis for each column.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: kurtosis(x, fisher=fisher))
    return kurtosis(self.data, fisher=fisher)

mean()

Calculate mean.

Returns:

- float or Series: Mean value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def mean(self) -> Union[float, pd.Series]:
    """
    Calculate mean.

    Returns
    -------
    float or Series
        Mean value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.mean()
    return mean(self.data)

median()

Calculate median.

Returns:

- float or Series: Median value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def median(self) -> Union[float, pd.Series]:
    """
    Calculate median.

    Returns
    -------
    float or Series
        Median value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.median()
    return median(self.data)

skewness()

Calculate skewness.

Returns:

- float or Series: Skewness value(s). For DataFrames, returns a Series with skewness for each column.

Source code in minexpy/stats.py
def skewness(self) -> Union[float, pd.Series]:
    """
    Calculate skewness.

    Returns
    -------
    float or Series
        Skewness value(s). For DataFrames, returns a Series with
        skewness for each column.
    """
    if self.is_dataframe:
        return self.data.apply(lambda x: skewness(x))
    return skewness(self.data)

std(ddof=1)

Calculate standard deviation.

Parameters:

- ddof (int, default 1): Delta degrees of freedom.

Returns:

- float or Series: Standard deviation value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def std(self, ddof: int = 1) -> Union[float, pd.Series]:
    """
    Calculate standard deviation.

    Parameters
    ----------
    ddof : int, default 1
        Delta degrees of freedom.

    Returns
    -------
    float or Series
        Standard deviation value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.std(ddof=ddof)
    return std(self.data, ddof=ddof)

summary(as_dataframe=True, percentiles=None)

Calculate comprehensive statistical summary.

This method computes all major descriptive statistics for the data. For DataFrames, it calculates statistics for each column separately.

Parameters:

- as_dataframe (bool, default True): If True, returns results as a pandas DataFrame (for DataFrames) or Series (for arrays). If False, returns a dictionary.
- percentiles (list of float, optional, default None): Additional percentiles to include in the summary beyond the defaults.

Returns:

- DataFrame, Series, or dict: Statistical summary. For DataFrames, returns a DataFrame with columns as rows and statistics as columns. For arrays, returns a Series or dict with statistics as values.

Examples:

>>> import numpy as np
>>> from minexpy.stats import StatisticalAnalyzer
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> analyzer = StatisticalAnalyzer(data)
>>> summary = analyzer.summary()
>>> print(summary)
Source code in minexpy/stats.py
def summary(self, as_dataframe: bool = True, 
            percentiles: Optional[List[float]] = None) -> Union[pd.DataFrame, Dict]:
    """
    Calculate comprehensive statistical summary.

    This method computes all major descriptive statistics for the data.
    For DataFrames, it calculates statistics for each column separately.

    Parameters
    ----------
    as_dataframe : bool, default True
        If True, returns results as a pandas DataFrame (for DataFrames)
        or Series (for arrays). If False, returns a dictionary.
    percentiles : list of float, optional
        Additional percentiles to include in the summary beyond defaults.

    Returns
    -------
    DataFrame, Series, or dict
        Statistical summary. For DataFrames, returns a DataFrame with
        columns as rows and statistics as columns. For arrays, returns
        a Series or dict with statistics as values.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import StatisticalAnalyzer
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> analyzer = StatisticalAnalyzer(data)
    >>> summary = analyzer.summary()
    >>> print(summary)
    """
    if self.is_dataframe:
        results = {}
        for col in self.data.columns:
            results[col] = describe(self.data[col].values, percentiles=percentiles)

        if as_dataframe:
            return pd.DataFrame(results).T
        return results
    else:
        results = describe(self.data, percentiles=percentiles)
        if as_dataframe:
            return pd.Series(results)
        return results

variance(ddof=1)

Calculate variance.

Parameters:

- ddof (int, default 1): Delta degrees of freedom.

Returns:

- float or Series: Variance value(s). For DataFrames, returns a Series.

Source code in minexpy/stats.py
def variance(self, ddof: int = 1) -> Union[float, pd.Series]:
    """
    Calculate variance.

    Parameters
    ----------
    ddof : int, default 1
        Delta degrees of freedom.

    Returns
    -------
    float or Series
        Variance value(s). For DataFrames, returns a Series.
    """
    if self.is_dataframe:
        return self.data.var(ddof=ddof)
    return variance(self.data, ddof=ddof)

z_score(value=None)

Calculate z-scores.

Parameters:

- value (float, optional, default None): If provided, calculates the z-score for this specific value. Otherwise, returns z-scores for all data points.

Returns:

- float, array, or DataFrame: Z-score(s). For DataFrames, returns a DataFrame with z-scores for each column.

Source code in minexpy/stats.py
def z_score(self, value: Optional[float] = None) -> Union[float, np.ndarray, pd.DataFrame]:
    """
    Calculate z-scores.

    Parameters
    ----------
    value : float, optional
        If provided, calculates z-score for this specific value.
        Otherwise, returns z-scores for all data points.

    Returns
    -------
    float, array, or DataFrame
        Z-score(s). For DataFrames, returns a DataFrame with z-scores
        for each column.
    """
    if self.is_dataframe:
        if value is not None:
            return self.data.apply(lambda x: z_score(x, value=value))
        else:
            return self.data.apply(lambda x: z_score(x))
    return z_score(self.data, value=value)
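The computation behind z_score can be sketched with plain numpy. This is an independent reimplementation for illustration, not the minexpy source; it assumes the conventional (x - mean) / std formula with the sample standard deviation (ddof=1), consistent with the module's std default:

```python
import numpy as np

def z_scores_sketch(data, value=None):
    """Conventional z-score: (x - mean) / std. Independent sketch,
    assuming sample standard deviation (ddof=1)."""
    arr = np.asarray(data, dtype=float)
    arr = arr[~np.isnan(arr)]
    mu, sigma = arr.mean(), arr.std(ddof=1)
    if value is not None:
        return float((value - mu) / sigma)   # z-score of one external value
    return (arr - mu) / sigma                # z-scores for every data point

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
print(z_scores_sketch(data, value=25.0))  # how far 25.0 sits above the mean, in std units
```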

coefficient_of_variation(data)

Calculate the coefficient of variation (CV).

The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage or decimal. It is a normalized measure of dispersion that allows comparison of variability across different scales and units.

CV is particularly useful in geochemistry for:

- Comparing variability of different elements with different concentration ranges
- Identifying elements with high relative variability
- Assessing data quality and homogeneity

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.

Returns:

- float: Coefficient of variation (std/mean). Returns NaN if the mean is zero. Typically expressed as a decimal (e.g., 0.25 = 25%).

Examples:

>>> import numpy as np
>>> from minexpy.stats import coefficient_of_variation
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> cv = coefficient_of_variation(data)
>>> print(f"CV: {cv:.3f} ({cv*100:.1f}%)")
Notes

CV is dimensionless and unitless, making it ideal for comparing variability across different elements or datasets with different measurement units. A CV < 0.15 is often considered low variability, while CV > 0.5 indicates high variability.

Source code in minexpy/stats.py
def coefficient_of_variation(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the coefficient of variation (CV).

    The coefficient of variation is the ratio of the standard deviation
    to the mean, expressed as a percentage or decimal. It is a normalized
    measure of dispersion that allows comparison of variability across
    different scales and units.

    CV is particularly useful in geochemistry for:
    - Comparing variability of different elements with different
      concentration ranges
    - Identifying elements with high relative variability
    - Assessing data quality and homogeneity

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Coefficient of variation (std/mean). Returns NaN if mean is zero.
        Typically expressed as a decimal (e.g., 0.25 = 25%).

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import coefficient_of_variation
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> cv = coefficient_of_variation(data)
    >>> print(f"CV: {cv:.3f} ({cv*100:.1f}%)")

    Notes
    -----
    CV is dimensionless and unitless, making it ideal for comparing
    variability across different elements or datasets with different
    measurement units. A CV < 0.15 is often considered low variability,
    while CV > 0.5 indicates high variability.
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    mean = np.mean(data)
    if mean == 0:
        return np.nan
    return float(np.std(data, ddof=1) / mean)
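As the source above shows, the CV is simply the sample standard deviation divided by the mean. The same number can be checked directly with plain numpy:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])

# CV = sample std (ddof=1) / mean, exactly as in coefficient_of_variation above
cv = np.std(data, ddof=1) / np.mean(data)
print(f"CV: {cv:.3f} ({cv * 100:.1f}%)")
```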

describe(data, percentiles=None)

Generate a comprehensive statistical summary of the data.

This function calculates all major descriptive statistics in a single call, providing a complete overview of the data distribution. It is particularly useful for initial data exploration in geochemical analysis.

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.
- percentiles (list of float, optional, default None): Additional percentiles to calculate beyond the default set of [25, 50, 75, 90, 95, 99]. Values should be between 0 and 100.

Returns:

- dict: Dictionary containing all statistical metrics with the following keys:
  - 'count': Number of observations
  - 'mean': Arithmetic mean
  - 'median': Median (50th percentile)
  - 'std': Standard deviation (sample)
  - 'variance': Variance (sample)
  - 'min': Minimum value
  - 'max': Maximum value
  - 'range': Range (max - min)
  - 'skewness': Skewness
  - 'kurtosis': Kurtosis (Fisher's definition)
  - 'coefficient_of_variation': CV (std/mean)
  - 'q1': First quartile (25th percentile)
  - 'q3': Third quartile (75th percentile)
  - 'iqr': Interquartile range
  - 'percentile_X': Additional percentiles as specified

Examples:

>>> import numpy as np
>>> from minexpy.stats import describe
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])
>>> summary = describe(data)
>>> for key, value in summary.items():
...     print(f"{key}: {value:.3f}")
>>> 
>>> # With custom percentiles
>>> summary_custom = describe(data, percentiles=[10, 50, 90, 95])
Source code in minexpy/stats.py
def describe(data: Union[np.ndarray, pd.Series, List[float]], 
             percentiles: Optional[List[float]] = None) -> Dict[str, float]:
    """
    Generate a comprehensive statistical summary of the data.

    This function calculates all major descriptive statistics in a single
    call, providing a complete overview of the data distribution. It is
    particularly useful for initial data exploration in geochemical analysis.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    percentiles : list of float, optional
        Additional percentiles to calculate beyond the default set.
        Default percentiles include: [25, 50, 75, 90, 95, 99].
        Values should be between 0 and 100.

    Returns
    -------
    dict
        Dictionary containing all statistical metrics with the following keys:
        - 'count': Number of observations
        - 'mean': Arithmetic mean
        - 'median': Median (50th percentile)
        - 'std': Standard deviation (sample)
        - 'variance': Variance (sample)
        - 'min': Minimum value
        - 'max': Maximum value
        - 'range': Range (max - min)
        - 'skewness': Skewness
        - 'kurtosis': Kurtosis (Fisher's definition)
        - 'coefficient_of_variation': CV (std/mean)
        - 'q1': First quartile (25th percentile)
        - 'q3': Third quartile (75th percentile)
        - 'iqr': Interquartile range
        - 'percentile_X': Additional percentiles as specified

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import describe
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])
    >>> summary = describe(data)
    >>> for key, value in summary.items():
    ...     print(f"{key}: {value:.3f}")
    >>> 
    >>> # With custom percentiles
    >>> summary_custom = describe(data, percentiles=[10, 50, 90, 95])
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]

    if len(data) == 0:
        raise ValueError("Input data is empty or contains only NaN values")

    if percentiles is None:
        percentiles = [25, 50, 75, 90, 95, 99]

    results = {
        'count': len(data),
        'mean': mean(data),
        'median': median(data),
        'std': std(data),
        'variance': variance(data),
        'min': float(np.min(data)),
        'max': float(np.max(data)),
        'range': float(np.max(data) - np.min(data)),
        'skewness': skewness(data),
        'kurtosis': kurtosis(data),
        'coefficient_of_variation': coefficient_of_variation(data),
        'q1': percentile(data, 25),
        'q3': percentile(data, 75),
        'iqr': iqr(data),
    }

    # Add custom percentiles
    for p in percentiles:
        if 0 <= p <= 100:
            results[f'percentile_{p}'] = percentile(data, p)

    return results
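The 'percentile_X' key naming used by describe can be illustrated with plain numpy. This is an independent sketch of just the custom-percentile step shown above, not the full function; note how out-of-range values are silently skipped:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])

# Same key scheme as describe(): one 'percentile_<p>' entry per requested p,
# skipping any p outside [0, 100].
requested = [10, 50, 90, 150]
extra = {f"percentile_{p}": float(np.percentile(data, p))
         for p in requested if 0 <= p <= 100}
print(sorted(extra))  # ['percentile_10', 'percentile_50', 'percentile_90']
```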

iqr(data)

Calculate the interquartile range (IQR).

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is a robust measure of spread that is not affected by outliers.

IQR is commonly used in geochemistry for:

- Identifying outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
- Describing data spread in skewed distributions
- Box plot construction

Parameters:

- data (array-like, required): Input data array. Can be a numpy array, pandas Series, or list. NaN values are automatically excluded.

Returns:

- float: Interquartile range (Q3 - Q1).

Examples:

>>> import numpy as np
>>> from minexpy.stats import iqr
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0])
>>> interquartile_range = iqr(data)
>>> print(f"IQR: {interquartile_range:.3f}")
Source code in minexpy/stats.py
def iqr(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the interquartile range (IQR).

    The IQR is the difference between the 75th percentile (Q3) and
    the 25th percentile (Q1). It is a robust measure of spread that
    is not affected by outliers.

    IQR is commonly used in geochemistry for:
    - Identifying outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
    - Describing data spread in skewed distributions
    - Box plot construction

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Interquartile range (Q3 - Q1).

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import iqr
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0])
    >>> interquartile_range = iqr(data)
    >>> print(f"IQR: {interquartile_range:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    return float(q3 - q1)
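The Tukey outlier fences mentioned above (Q1 - 1.5*IQR and Q3 + 1.5*IQR) can be computed directly with numpy; this sketch flags values outside the fences:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 120.0])  # 120.0 is a planted outlier

q1, q3 = np.percentile(data, [25, 75])
iqr_val = q3 - q1
lower, upper = q1 - 1.5 * iqr_val, q3 + 1.5 * iqr_val

# Any value outside [lower, upper] is a candidate anomaly
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # only the planted 120.0 falls outside the fences
```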

kurtosis(data, fisher=True)

Calculate the kurtosis of the data.

Kurtosis measures the "tailedness" of the probability distribution. It indicates the presence of outliers and the shape of the distribution tails compared to a normal distribution.

  • High kurtosis (>0 with Fisher's definition): Heavy tails, more outliers
  • Low kurtosis (<0 with Fisher's definition): Light tails, fewer outliers
  • Normal distribution: Kurtosis = 0 (Fisher's) or 3 (Pearson's)

In geochemical analysis, high kurtosis often indicates the presence of anomalous values or multiple populations in the dataset.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
fisher bool

If True, uses Fisher's definition (excess kurtosis). Normal distribution has kurtosis = 0. If False, uses Pearson's definition. Normal distribution has kurtosis = 3.

True

Returns:

Type Description
float

Kurtosis value. With Fisher's definition, typically ranges from -2 to +10, though extreme values are possible.

Examples:

>>> import numpy as np
>>> from minexpy.stats import kurtosis
>>> 
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> kurt = kurtosis(data)
>>> print(f"Kurtosis (Fisher's): {kurt:.3f}")
>>> 
>>> kurt_pearson = kurtosis(data, fisher=False)
>>> print(f"Kurtosis (Pearson's): {kurt_pearson:.3f}")
Notes

Fisher's definition (excess kurtosis) is more commonly used in modern statistics and is the default. It subtracts 3 from Pearson's definition so that a normal distribution has kurtosis = 0.

References

.. [1] DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2(3), 292.

Source code in minexpy/stats.py
def kurtosis(data: Union[np.ndarray, pd.Series, List[float]], 
             fisher: bool = True) -> float:
    """
    Calculate the kurtosis of the data.

    Kurtosis measures the "tailedness" of the probability distribution.
    It indicates the presence of outliers and the shape of the distribution
    tails compared to a normal distribution.

    - High kurtosis (>0 with Fisher's definition): Heavy tails, more outliers
    - Low kurtosis (<0 with Fisher's definition): Light tails, fewer outliers
    - Normal distribution: Kurtosis = 0 (Fisher's) or 3 (Pearson's)

    In geochemical analysis, high kurtosis often indicates the presence of
    anomalous values or multiple populations in the dataset.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    fisher : bool, default True
        If True, uses Fisher's definition (excess kurtosis).
        Normal distribution has kurtosis = 0.
        If False, uses Pearson's definition.
        Normal distribution has kurtosis = 3.

    Returns
    -------
    float
        Kurtosis value. With Fisher's definition, typically ranges from
        -2 to +10, though extreme values are possible.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import kurtosis
    >>> 
    >>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
    >>> kurt = kurtosis(data)
    >>> print(f"Kurtosis (Fisher's): {kurt:.3f}")
    >>> 
    >>> kurt_pearson = kurtosis(data, fisher=False)
    >>> print(f"Kurtosis (Pearson's): {kurt_pearson:.3f}")

    Notes
    -----
    Fisher's definition (excess kurtosis) is more commonly used in
    modern statistics and is the default. It subtracts 3 from Pearson's
    definition so that a normal distribution has kurtosis = 0.

    References
    ----------
    .. [1] DeCarlo, L. T. (1997). On the meaning and use of kurtosis.
           Psychological Methods, 2(3), 292.
    """
    data = np.asarray(data)
    return float(scipy_stats.kurtosis(data, fisher=fisher, nan_policy='omit'))
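The Fisher/Pearson relationship described above is a fixed offset of 3, which can be verified with scipy directly (a sketch on synthetic normal data, not minexpy itself):

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)  # normal data: excess kurtosis near 0

fisher = float(scipy_stats.kurtosis(data, fisher=True))    # excess kurtosis
pearson = float(scipy_stats.kurtosis(data, fisher=False))  # Pearson's definition
# pearson == fisher + 3 by definition
```

For near-normal data the Fisher value hovers around 0 and the Pearson value around 3, which is why the Fisher convention is the more convenient default for spotting heavy tails.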

mean(data)

Calculate the arithmetic mean of the data.

The mean is the sum of all values divided by the number of values. It is the most common measure of central tendency.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Arithmetic mean value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import mean
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> avg = mean(data)
>>> print(f"Mean: {avg:.3f}")
Source code in minexpy/stats.py
def mean(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the arithmetic mean of the data.

    The mean is the sum of all values divided by the number of values.
    It is the most common measure of central tendency.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Arithmetic mean value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import mean
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> avg = mean(data)
    >>> print(f"Mean: {avg:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.mean(data))

median(data)

Calculate the median of the data.

The median is the middle value when data is sorted. It is a robust measure of central tendency that is less affected by outliers than the mean.

For geochemical data, the median is often preferred when dealing with skewed distributions or datasets containing outliers.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Median value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import median
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> med = median(data)
>>> print(f"Median: {med:.3f}")
Source code in minexpy/stats.py
def median(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the median of the data.

    The median is the middle value when data is sorted. It is a robust
    measure of central tendency that is less affected by outliers than
    the mean.

    For geochemical data, the median is often preferred when dealing
    with skewed distributions or datasets containing outliers.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Median value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import median
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> med = median(data)
    >>> print(f"Median: {med:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.median(data))
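The robustness claim is easy to see side by side: a single extreme assay value drags the mean far more than the median (plain numpy sketch with an illustrative 250.0 outlier):

```python
import numpy as np

clean = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
with_outlier = np.append(clean, 250.0)   # one anomalous sample

# Mean jumps from ~17.5 to ~50.7; median only moves from 17.75 to 18.7.
print(np.mean(clean), np.mean(with_outlier))
print(np.median(clean), np.median(with_outlier))
```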

mode(data)

Calculate the mode (most frequent value) of the data.

The mode is the value that appears most frequently in the dataset. Values are compared exactly (no rounding is applied), so for continuous data you may want to round to a common precision before calling this function.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
tuple

A tuple containing (mode_value, count). The mode_value is the most frequent value, and count is the number of times it appears. If multiple modes exist, returns the first one encountered.

Examples:

>>> import numpy as np
>>> from minexpy.stats import mode
>>> 
>>> data = np.array([1, 2, 2, 3, 3, 3, 4, 4])
>>> mode_val, count = mode(data)
>>> print(f"Mode: {mode_val} (appears {count} times)")
Source code in minexpy/stats.py
def mode(data: Union[np.ndarray, pd.Series, List[float]]) -> Tuple[float, int]:
    """
    Calculate the mode (most frequent value) of the data.

    The mode is the value that appears most frequently in the dataset.
    Values are compared exactly (no rounding is applied), so for
    continuous data you may want to round to a common precision first.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    tuple
        A tuple containing (mode_value, count). The mode_value is the
        most frequent value, and count is the number of times it appears.
        If multiple modes exist, returns the first one encountered.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import mode
    >>> 
    >>> data = np.array([1, 2, 2, 3, 3, 3, 4, 4])
    >>> mode_val, count = mode(data)
    >>> print(f"Mode: {mode_val} (appears {count} times)")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    values, counts = np.unique(data, return_counts=True)
    max_count_idx = np.argmax(counts)
    return (float(values[max_count_idx]), int(counts[max_count_idx]))

percentile(data, p)

Calculate a specific percentile of the data.

The percentile is the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data points lie.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
p float

Percentile to calculate, between 0 and 100.

required

Returns:

Type Description
float

The p-th percentile value.

Examples:

>>> import numpy as np
>>> from minexpy.stats import percentile
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> p90 = percentile(data, 90)
>>> print(f"90th percentile: {p90:.3f}")
Source code in minexpy/stats.py
def percentile(data: Union[np.ndarray, pd.Series, List[float]], 
               p: float) -> float:
    """
    Calculate a specific percentile of the data.

    The percentile is the value below which a given percentage of
    observations fall. For example, the 90th percentile is the value
    below which 90% of the data points lie.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    p : float
        Percentile to calculate, between 0 and 100.

    Returns
    -------
    float
        The p-th percentile value.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import percentile
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> p90 = percentile(data, 90)
    >>> print(f"90th percentile: {p90:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.percentile(data, p))

skewness(data)

Calculate the skewness of the data.

Skewness measures the asymmetry of the data distribution around the mean. It indicates whether the data is skewed to the left or right.

  • Positive skewness: Right tail is longer; mass is concentrated on the left
  • Negative skewness: Left tail is longer; mass is concentrated on the right
  • Zero skewness: Data is symmetric (normal distribution has zero skewness)

For geochemical data, skewness is particularly useful for identifying log-normal distributions, which are common in geochemical datasets.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required

Returns:

Type Description
float

Skewness value. Typically ranges from -3 to +3, though extreme values can occur with small sample sizes.

Examples:

>>> import numpy as np
>>> from minexpy.stats import skewness
>>> 
>>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
>>> skew = skewness(data)
>>> print(f"Skewness: {skew:.3f}")
Notes

Uses Fisher's definition of skewness (third standardized moment). The calculation uses scipy.stats.skew with nan_policy='omit' to handle missing values.

References

.. [1] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D, 47(1), 183-189.

Source code in minexpy/stats.py
def skewness(data: Union[np.ndarray, pd.Series, List[float]]) -> float:
    """
    Calculate the skewness of the data.

    Skewness measures the asymmetry of the data distribution around the mean.
    It indicates whether the data is skewed to the left or right.

    - Positive skewness: Right tail is longer; mass is concentrated on the left
    - Negative skewness: Left tail is longer; mass is concentrated on the right
    - Zero skewness: Data is symmetric (normal distribution has zero skewness)

    For geochemical data, skewness is particularly useful for identifying
    log-normal distributions, which are common in geochemical datasets.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.

    Returns
    -------
    float
        Skewness value. Typically ranges from -3 to +3, though extreme
        values can occur with small sample sizes.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import skewness
    >>> 
    >>> data = np.array([1.2, 3.4, 5.6, 7.8, 9.0, 11.2, 13.4])
    >>> skew = skewness(data)
    >>> print(f"Skewness: {skew:.3f}")

    Notes
    -----
    Uses Fisher's definition of skewness (third standardized moment).
    The calculation uses scipy.stats.skew with nan_policy='omit' to
    handle missing values.

    References
    ----------
    .. [1] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of
           sample skewness and kurtosis. Journal of the Royal Statistical
           Society: Series D, 47(1), 183-189.
    """
    data = np.asarray(data)
    return float(scipy_stats.skew(data, nan_policy='omit'))
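Because the notes highlight log-normal behaviour, here is a sketch (synthetic concentrations, scipy called directly) showing that log-transforming a log-normal variable removes most of the skew:

```python
import numpy as np
from scipy import stats as scipy_stats

rng = np.random.default_rng(42)
conc = rng.lognormal(mean=2.0, sigma=0.8, size=5_000)  # ppm-like concentrations

raw_skew = float(scipy_stats.skew(conc))          # strongly positive
log_skew = float(scipy_stats.skew(np.log(conc)))  # near zero
```

A strongly positive skewness that collapses toward zero after a log transform is the classic signature of a log-normal geochemical population.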

std(data, ddof=1)

Calculate the standard deviation of the data.

Standard deviation measures the amount of variation or dispersion in the dataset. It is the square root of the variance.

For geochemical data, standard deviation helps quantify the variability of element concentrations, which is crucial for understanding geochemical processes and identifying anomalies.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
ddof int

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations.

  • ddof=1: Sample standard deviation (default, unbiased estimator)
  • ddof=0: Population standard deviation

1

Returns:

Type Description
float

Standard deviation value. Same units as the input data.

Examples:

>>> import numpy as np
>>> from minexpy.stats import std
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> sample_std = std(data, ddof=1)
>>> print(f"Sample std: {sample_std:.3f}")
>>> 
>>> pop_std = std(data, ddof=0)
>>> print(f"Population std: {pop_std:.3f}")
Notes

For sample data (most common case), use ddof=1. For population data, use ddof=0. The default (ddof=1) provides an unbiased estimate of the population standard deviation from a sample.

Source code in minexpy/stats.py
def std(data: Union[np.ndarray, pd.Series, List[float]], 
        ddof: int = 1) -> float:
    """
    Calculate the standard deviation of the data.

    Standard deviation measures the amount of variation or dispersion
    in the dataset. It is the square root of the variance.

    For geochemical data, standard deviation helps quantify the variability
    of element concentrations, which is crucial for understanding
    geochemical processes and identifying anomalies.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    ddof : int, default 1
        Delta degrees of freedom. The divisor used in calculations is
        N - ddof, where N is the number of observations.
        - ddof=1: Sample standard deviation (default, unbiased estimator)
        - ddof=0: Population standard deviation

    Returns
    -------
    float
        Standard deviation value. Same units as the input data.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import std
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> sample_std = std(data, ddof=1)
    >>> print(f"Sample std: {sample_std:.3f}")
    >>> 
    >>> pop_std = std(data, ddof=0)
    >>> print(f"Population std: {pop_std:.3f}")

    Notes
    -----
    For sample data (most common case), use ddof=1. For population data,
    use ddof=0. The default (ddof=1) provides an unbiased estimate of
    the population standard deviation from a sample.
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.std(data, ddof=ddof, axis=None))
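The ddof behaviour reduces to the divisor in the sum of squared deviations, which can be checked by hand with numpy alone:

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
n = data.size
ss = np.sum((data - data.mean()) ** 2)   # sum of squared deviations

sample_std = np.sqrt(ss / (n - 1))       # ddof=1 (divisor N - 1)
pop_std = np.sqrt(ss / n)                # ddof=0 (divisor N)
```

The sample estimate is always the larger of the two, since it divides by a smaller number.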

variance(data, ddof=1)

Calculate the variance of the data.

Variance measures the average squared deviation from the mean. It quantifies the spread or dispersion of the data.

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
ddof int

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of observations.

  • ddof=1: Sample variance (default, unbiased estimator)
  • ddof=0: Population variance

1

Returns:

Type Description
float

Variance value. Units are the square of the input data units.

Examples:

>>> import numpy as np
>>> from minexpy.stats import variance
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> var = variance(data)
>>> print(f"Variance: {var:.3f}")
Source code in minexpy/stats.py
def variance(data: Union[np.ndarray, pd.Series, List[float]], 
             ddof: int = 1) -> float:
    """
    Calculate the variance of the data.

    Variance measures the average squared deviation from the mean.
    It quantifies the spread or dispersion of the data.

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    ddof : int, default 1
        Delta degrees of freedom. The divisor used in calculations is
        N - ddof, where N is the number of observations.
        - ddof=1: Sample variance (default, unbiased estimator)
        - ddof=0: Population variance

    Returns
    -------
    float
        Variance value. Units are the square of the input data units.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import variance
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> var = variance(data)
    >>> print(f"Variance: {var:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    return float(np.var(data, ddof=ddof, axis=None))

z_score(data, value=None)

Calculate z-scores (standardized values).

Z-scores measure how many standard deviations a value is from the mean. They are useful for:

  • Identifying outliers (typically |z| > 2 or 3)
  • Standardizing data for comparison
  • Detecting anomalies in geochemical data

Parameters:

Name Type Description Default
data array-like

Input data array. Can be numpy array, pandas Series, or list. NaN values are automatically excluded.

required
value float

If provided, calculates the z-score for this specific value. If None, returns z-scores for all values in the data.

None

Returns:

Type Description
float or array

If value is provided, returns the z-score for that value. Otherwise, returns an array of z-scores for all data points.

Examples:

>>> import numpy as np
>>> from minexpy.stats import z_score
>>> 
>>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
>>> z_scores = z_score(data)
>>> print(f"Z-scores: {z_scores}")
>>> 
>>> z_specific = z_score(data, value=25.0)
>>> print(f"Z-score for 25.0: {z_specific:.3f}")
Source code in minexpy/stats.py
def z_score(data: Union[np.ndarray, pd.Series, List[float]], 
            value: Optional[float] = None) -> Union[float, np.ndarray]:
    """
    Calculate z-scores (standardized values).

    Z-scores measure how many standard deviations a value is from the mean.
    They are useful for:
    - Identifying outliers (typically |z| > 2 or 3)
    - Standardizing data for comparison
    - Detecting anomalies in geochemical data

    Parameters
    ----------
    data : array-like
        Input data array. Can be numpy array, pandas Series, or list.
        NaN values are automatically excluded.
    value : float, optional
        If provided, calculates the z-score for this specific value.
        If None, returns z-scores for all values in the data.

    Returns
    -------
    float or array
        If value is provided, returns the z-score for that value.
        Otherwise, returns an array of z-scores for all data points.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.stats import z_score
    >>> 
    >>> data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
    >>> z_scores = z_score(data)
    >>> print(f"Z-scores: {z_scores}")
    >>> 
    >>> z_specific = z_score(data, value=25.0)
    >>> print(f"Z-score for 25.0: {z_specific:.3f}")
    """
    data = np.asarray(data)
    data = data[~np.isnan(data)]
    mean_val = np.mean(data)
    std_val = np.std(data, ddof=1)

    if value is not None:
        return float((value - mean_val) / std_val)
    else:
        return (data - mean_val) / std_val
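A common use is flagging values with |z| > 2, sketched here with numpy (the threshold and the 40.0 sample value are illustrative):

```python
import numpy as np

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 40.0])
z = (data - data.mean()) / data.std(ddof=1)   # same ddof=1 convention as z_score()

flagged = data[np.abs(z) > 2]                 # [40.0]
```

Only the anomalous 40.0 exceeds two standard deviations from the mean; the routine background values do not.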

Correlation Analysis

minexpy.correlation

Correlation analysis module for geoscience datasets.

This module provides pairwise and matrix-style correlation tools that are commonly used in geochemistry, environmental geoscience, and exploration analytics. It includes linear, rank-based, nonlinear, and robust measures of dependence, plus partial correlation for controlling confounding variables.

The functions are designed to be practical for real-world field/lab datasets:

  • NaN and infinite values are excluded pairwise by default.
  • Outputs include interpretable metadata such as sample size and p-values.

Examples:

Basic pairwise analysis:

>>> import numpy as np
>>> from minexpy.correlation import pearson_correlation, spearman_correlation
>>>
>>> x = np.array([45.2, 52.3, 38.7, 61.2, 49.8])
>>> y = np.array([12.5, 15.3, 11.2, 18.4, 14.1])
>>> pearson_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}
>>> spearman_correlation(x, y)
{'correlation': 0.99..., 'p_value': 0.000..., 'n': 5}

Matrix-based analysis:

>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>>
>>> df = pd.DataFrame({'Zn': x, 'Cu': y})
>>> correlation_matrix(df, method='pearson')

biweight_midcorrelation(x, y, c=9.0, epsilon=1e-12)

Compute robust biweight midcorrelation.

Biweight midcorrelation downweights extreme observations using a median/MAD-based weighting scheme. It is particularly useful in assay datasets where a few extreme values can dominate classical coefficients.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
c float

Tuning constant controlling outlier downweighting. Lower values increase robustness but reduce efficiency in near-normal data.

9.0
epsilon float

Numerical stabilizer used to avoid division by near-zero quantities.

1e-12

Returns:

Type Description
float

Robust correlation in [-1, 1]. Returns np.nan when robust dispersion of either variable is effectively zero.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> import numpy as np
>>> from minexpy.correlation import biweight_midcorrelation
>>> x = np.array([1, 2, 3, 4, 5, 100])
>>> y = np.array([1, 2, 3, 4, 5, -100])
>>> round(biweight_midcorrelation(x, y), 3)
0.996
Notes

This implementation follows a Tukey biweight-style weighting of centered values with cutoff |u| < 1 after MAD scaling.

Source code in minexpy/correlation.py
def biweight_midcorrelation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    c: float = 9.0,
    epsilon: float = 1e-12,
) -> float:
    """
    Compute robust biweight midcorrelation.

    Biweight midcorrelation downweights extreme observations using a
    median/MAD-based weighting scheme. It is particularly useful in assay
    datasets where a few extreme values can dominate classical coefficients.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    c : float, default 9.0
        Tuning constant controlling outlier downweighting. Lower values
        increase robustness but reduce efficiency in near-normal data.
    epsilon : float, default 1e-12
        Numerical stabilizer used to avoid division by near-zero quantities.

    Returns
    -------
    float
        Robust correlation in ``[-1, 1]``. Returns ``np.nan`` when robust
        dispersion of either variable is effectively zero.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.correlation import biweight_midcorrelation
    >>> x = np.array([1, 2, 3, 4, 5, 100])
    >>> y = np.array([1, 2, 3, 4, 5, -100])
    >>> round(biweight_midcorrelation(x, y), 3)
    0.996

    Notes
    -----
    This implementation follows a Tukey biweight-style weighting of centered
    values with cutoff ``|u| < 1`` after MAD scaling.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=3)

    x_med = np.median(x_clean)
    y_med = np.median(y_clean)

    x_mad = np.median(np.abs(x_clean - x_med))
    y_mad = np.median(np.abs(y_clean - y_med))

    if x_mad <= epsilon or y_mad <= epsilon:
        return float(np.nan)

    x_u = (x_clean - x_med) / (c * x_mad)
    y_u = (y_clean - y_med) / (c * y_mad)

    x_w = np.zeros_like(x_u)
    y_w = np.zeros_like(y_u)

    x_mask = np.abs(x_u) < 1.0
    y_mask = np.abs(y_u) < 1.0

    x_w[x_mask] = (1.0 - x_u[x_mask] ** 2) ** 2
    y_w[y_mask] = (1.0 - y_u[y_mask] ** 2) ** 2

    x_centered = (x_clean - x_med) * x_w
    y_centered = (y_clean - y_med) * y_w

    denominator = np.sqrt(np.sum(x_centered ** 2) * np.sum(y_centered ** 2))
    if denominator <= epsilon:
        return float(np.nan)

    corr = float(np.sum(x_centered * y_centered) / denominator)
    return float(np.clip(corr, -1.0, 1.0))
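To see the robustness in action, the docstring data can be contrasted with Pearson. The `bicor` helper below is a compact restatement of the source listed above, for illustration only (it is not the minexpy API):

```python
import numpy as np

def bicor(x, y, c=9.0, eps=1e-12):
    """Biweight midcorrelation, mirroring the source listed above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = np.median(x), np.median(y)
    xmad = np.median(np.abs(x - xm))
    ymad = np.median(np.abs(y - ym))
    if xmad <= eps or ymad <= eps:
        return float("nan")
    xu, yu = (x - xm) / (c * xmad), (y - ym) / (c * ymad)
    xw = np.where(np.abs(xu) < 1.0, (1.0 - xu ** 2) ** 2, 0.0)  # outliers get weight 0
    yw = np.where(np.abs(yu) < 1.0, (1.0 - yu ** 2) ** 2, 0.0)
    xc, yc = (x - xm) * xw, (y - ym) * yw
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return float(np.clip(np.sum(xc * yc) / denom, -1.0, 1.0))

x = np.array([1, 2, 3, 4, 5, 100.0])
y = np.array([1, 2, 3, 4, 5, -100.0])
pearson = float(np.corrcoef(x, y)[0, 1])   # dominated by the outlier pair
robust = bicor(x, y)                       # remains strongly positive
```

One adversarial pair flips Pearson to roughly -1, while the biweight coefficient stays strongly positive, recovering the agreement of the inlier points.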

correlation_matrix(data, method='pearson', min_periods=2)

Compute a correlation matrix using geoscience-relevant methods.

Supported methods include:

  • pearson
  • spearman
  • kendall
  • distance
  • biweight or biweight_midcorrelation

Parameters:

Name Type Description Default
data DataFrame or array-like

Tabular numeric data. If a NumPy array is provided, each column is treated as a separate variable.

required
method str

Correlation method name.

'pearson'
min_periods int

Minimum number of pairwise valid observations required for each matrix entry.

2

Returns:

Type Description
DataFrame

Symmetric correlation matrix with variable names as row/column labels.

Raises:

Type Description
ValueError

If method is unsupported or input is invalid.

Examples:

>>> import pandas as pd
>>> from minexpy.correlation import correlation_matrix
>>> df = pd.DataFrame({'Zn': [45, 50, 47], 'Cu': [12, 14, 13]})
>>> correlation_matrix(df, method='pearson')
          Zn        Cu
Zn  1.000000  0.993399
Cu  0.993399  1.000000
Notes

Built-in pandas methods are used for Pearson/Spearman/Kendall. Distance and biweight matrices are computed pairwise with explicit finite-value filtering.

Source code in minexpy/correlation.py
def correlation_matrix(
    data: Union[pd.DataFrame, np.ndarray, Sequence[Sequence[float]]],
    method: str = "pearson",
    min_periods: int = 2,
) -> pd.DataFrame:
    """
    Compute a correlation matrix using geoscience-relevant methods.

    Supported methods include:

    - ``pearson``
    - ``spearman``
    - ``kendall``
    - ``distance``
    - ``biweight`` or ``biweight_midcorrelation``

    Parameters
    ----------
    data : DataFrame or array-like
        Tabular numeric data. If a NumPy array is provided, each column is
        treated as a separate variable.
    method : str, default 'pearson'
        Correlation method name.
    min_periods : int, default 2
        Minimum number of pairwise valid observations required for each matrix
        entry.

    Returns
    -------
    pandas.DataFrame
        Symmetric correlation matrix with variable names as row/column labels.

    Raises
    ------
    ValueError
        If method is unsupported or input is invalid.

    Examples
    --------
    >>> import pandas as pd
    >>> from minexpy.correlation import correlation_matrix
    >>> df = pd.DataFrame({'Zn': [45, 50, 47], 'Cu': [12, 14, 13]})
    >>> correlation_matrix(df, method='pearson')
              Zn        Cu
    Zn  1.000000  0.993399
    Cu  0.993399  1.000000

    Notes
    -----
    Built-in pandas methods are used for Pearson/Spearman/Kendall. Distance
    and biweight matrices are computed pairwise with explicit finite-value
    filtering.
    """
    numeric_df = _to_numeric_dataframe(data)
    method_lower = method.lower()

    if min_periods < 1:
        raise ValueError("min_periods must be at least 1")

    pandas_methods = {"pearson", "spearman", "kendall"}
    if method_lower in pandas_methods:
        return numeric_df.corr(method=method_lower, min_periods=min_periods)

    custom_methods: Mapping[str, object] = {
        "distance": distance_correlation,
        "biweight": biweight_midcorrelation,
        "biweight_midcorrelation": biweight_midcorrelation,
    }

    if method_lower not in custom_methods:
        supported = sorted(list(pandas_methods) + list(custom_methods.keys()))
        raise ValueError(
            f"Unsupported method '{method}'. Supported methods: {supported}"
        )

    method_func = custom_methods[method_lower]
    columns = list(numeric_df.columns)
    matrix = pd.DataFrame(
        np.nan,
        index=columns,
        columns=columns,
        dtype=float,
    )

    for i, col_i in enumerate(columns):
        x_values = numeric_df[col_i].to_numpy(dtype=float)

        for j in range(i, len(columns)):
            col_j = columns[j]
            y_values = numeric_df[col_j].to_numpy(dtype=float)

            mask = np.isfinite(x_values) & np.isfinite(y_values)
            n_valid = int(mask.sum())

            if n_valid < min_periods:
                value = float(np.nan)
            elif i == j:
                value = 1.0
            else:
                value = float(method_func(x_values[mask], y_values[mask]))

            matrix.iloc[i, j] = value
            matrix.iloc[j, i] = value

    return matrix
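For the built-in methods, behaviour (including `min_periods` and pairwise NaN handling) matches pandas' own `DataFrame.corr`, which the function delegates to. A sketch with an incomplete column:

```python
import numpy as np
import pandas as pd

# Mirrors the pandas fast path of correlation_matrix for built-in methods.
df = pd.DataFrame({
    "Zn": [45.0, 50.0, 47.0, np.nan],   # one missing assay
    "Cu": [12.0, 14.0, 13.0, 15.0],
})
m = df.corr(method="pearson", min_periods=2)  # incomplete row dropped pairwise
```

The Zn/Cu entry is computed from the three complete rows only; if fewer than `min_periods` valid pairs remain, the entry is NaN instead.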

distance_correlation(x, y)

Compute distance correlation for nonlinear dependence detection.

Distance correlation is zero if and only if variables are independent (under finite second moments). It therefore captures both linear and nonlinear dependence patterns that Pearson may miss.

Parameters:

Name Type Description Default
x array - like

First numeric variable.

required
y array - like

Second numeric variable.

required

Returns:

Type Description
float

Distance correlation value in [0, 1].

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> import numpy as np
>>> from minexpy.correlation import distance_correlation
>>> x = np.linspace(-2, 2, 50)
>>> y = x ** 2
>>> round(distance_correlation(x, y), 3)
0.54
Notes

This implementation uses pairwise absolute-distance matrices and classical double-centering. Complexity is O(n^2) in both time and memory.

Source code in minexpy/correlation.py
def distance_correlation(x: ArrayLike1D, y: ArrayLike1D) -> float:
    """
    Compute distance correlation for nonlinear dependence detection.

    Distance correlation is zero if and only if variables are independent
    (under finite second moments). It therefore captures both linear and
    nonlinear dependence patterns that Pearson may miss.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.

    Returns
    -------
    float
        Distance correlation value in ``[0, 1]``.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> import numpy as np
    >>> from minexpy.correlation import distance_correlation
    >>> x = np.linspace(-2, 2, 50)
    >>> y = x ** 2
    >>> round(distance_correlation(x, y), 3)
    0.54

    Notes
    -----
    This implementation uses pairwise absolute-distance matrices and classical
    double-centering. Complexity is ``O(n^2)`` in both time and memory.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)

    x_dist = np.abs(x_clean[:, None] - x_clean[None, :])
    y_dist = np.abs(y_clean[:, None] - y_clean[None, :])

    x_centered = (
        x_dist
        - x_dist.mean(axis=0, keepdims=True)
        - x_dist.mean(axis=1, keepdims=True)
        + x_dist.mean()
    )
    y_centered = (
        y_dist
        - y_dist.mean(axis=0, keepdims=True)
        - y_dist.mean(axis=1, keepdims=True)
        + y_dist.mean()
    )

    dcov2 = float(np.mean(x_centered * y_centered))
    dvarx = float(np.mean(x_centered * x_centered))
    dvary = float(np.mean(y_centered * y_centered))

    if dvarx <= 0.0 or dvary <= 0.0:
        return 0.0

    dcor2 = dcov2 / np.sqrt(dvarx * dvary)
    dcor = np.sqrt(max(dcor2, 0.0))
    return float(np.clip(dcor, 0.0, 1.0))
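The contrast with Pearson can be checked with a standalone NumPy sketch of the same double-centering recipe (illustrative only; it mirrors the implementation above rather than calling it): Pearson sees essentially nothing in `y = x ** 2`, while the distance-correlation statistic is clearly positive.

```python
import numpy as np

def dcor_sketch(x, y):
    # pairwise absolute-distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # classical double-centering: subtract row/column means, add grand mean
    ac = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    bc = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (ac * bc).mean()
    denom = np.sqrt((ac * ac).mean() * (bc * bc).mean())
    return float(np.sqrt(max(dcov2 / denom, 0.0))) if denom > 0 else 0.0

x = np.linspace(-2, 2, 50)
y = x ** 2
r_pearson = float(np.corrcoef(x, y)[0, 1])  # essentially zero: no linear trend
r_dcor = dcor_sketch(x, y)                  # clearly positive: dependence detected
```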

kendall_correlation(x, y, method='auto', alternative='two-sided')

Compute Kendall's tau rank correlation.

Kendall's tau is based on concordant and discordant pair comparisons. It is robust for ordinal structures, less sensitive to outliers than Pearson, and frequently used for geoscience trend analysis.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
method {'auto', 'asymptotic', 'exact'}

Method forwarded to scipy.stats.kendalltau.

'auto'
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Kendall's tau in [-1, 1].
  • p_value : p-value for testing zero rank association.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import kendall_correlation
>>> x = [2, 4, 6, 8, 10]
>>> y = [5, 7, 8, 11, 13]
>>> kendall_correlation(x, y)
{'correlation': 1.0, 'p_value': 0.016..., 'n': 5}
Notes

Compared with Spearman's rho, Kendall's tau is typically smaller in magnitude but often easier to interpret probabilistically as concordance.

Source code in minexpy/correlation.py
def kendall_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    method: str = "auto",
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Kendall's tau rank correlation.

    Kendall's tau is based on concordant and discordant pair comparisons.
    It is robust for ordinal structures, less sensitive to outliers than
    Pearson, and frequently used for geoscience trend analysis.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    method : {'auto', 'asymptotic', 'exact'}, default 'auto'
        Method forwarded to ``scipy.stats.kendalltau``.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Kendall's tau in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero rank association.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import kendall_correlation
    >>> x = [2, 4, 6, 8, 10]
    >>> y = [5, 7, 8, 11, 13]
    >>> kendall_correlation(x, y)
    {'correlation': 1.0, 'p_value': 0.016..., 'n': 5}

    Notes
    -----
    Compared with Spearman's rho, Kendall's tau is typically smaller in
    magnitude but often easier to interpret probabilistically as concordance.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.kendalltau(
        x_clean,
        y_clean,
        method=method,
        alternative=alternative,
    )
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
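The Notes' point that tau is usually smaller in magnitude than Spearman's rho is easy to confirm with scipy.stats directly (an illustrative comparison on synthetic data, not a minexpy call):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)  # noisy linear relationship

tau, _ = stats.kendalltau(x, y)   # concordance-based measure
rho, _ = stats.spearmanr(x, y)    # rank correlation
# tau and rho agree in sign, but tau is smaller in magnitude
```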

partial_correlation(x, y, controls, alternative='two-sided')

Compute linear partial correlation between x and y.

Partial correlation estimates the residual linear relationship between two variables after regressing out one or more control variables. This is useful when evaluating element-element associations while controlling for depth, lithology proxies, or compositional trends.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
controls array-like

One or more control variables, with shape (n,) or (n, k).

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis for residual-correlation significance test.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : partial correlation coefficient.
  • p_value : p-value from t-distribution approximation.
  • n : number of paired finite observations used.
  • df : residual degrees of freedom, n - k - 2.

Raises:

Type Description
ValueError

If shapes are incompatible, too few observations are available, or degrees of freedom are insufficient.

Examples:

>>> from minexpy.correlation import partial_correlation
>>> x = [10, 12, 13, 16, 20]
>>> y = [2, 3, 3.2, 4.1, 5.2]
>>> z = [100, 110, 105, 120, 130]
>>> partial_correlation(x, y, z)
{'correlation': 0.97..., 'p_value': 0.12..., 'n': 5, 'df': 2}
Notes

This implementation uses least-squares residualization and then applies Pearson correlation to residual vectors. It assumes a linear adjustment model for control effects.

Source code in minexpy/correlation.py
def partial_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    controls: Union[np.ndarray, pd.DataFrame, pd.Series, Sequence[float], Sequence[Sequence[float]]],
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute linear partial correlation between ``x`` and ``y``.

    Partial correlation estimates the residual linear relationship between two
    variables after regressing out one or more control variables. This is
    useful when evaluating element-element associations while controlling for
    depth, lithology proxies, or compositional trends.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    controls : array-like
        One or more control variables, with shape ``(n,)`` or ``(n, k)``.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis for residual-correlation significance test.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : partial correlation coefficient.
        - ``p_value`` : p-value from t-distribution approximation.
        - ``n`` : number of paired finite observations used.
        - ``df`` : residual degrees of freedom, ``n - k - 2``.

    Raises
    ------
    ValueError
        If shapes are incompatible, too few observations are available,
        or degrees of freedom are insufficient.

    Examples
    --------
    >>> from minexpy.correlation import partial_correlation
    >>> x = [10, 12, 13, 16, 20]
    >>> y = [2, 3, 3.2, 4.1, 5.2]
    >>> z = [100, 110, 105, 120, 130]
    >>> partial_correlation(x, y, z)
    {'correlation': 0.97..., 'p_value': 0.12..., 'n': 5, 'df': 2}

    Notes
    -----
    This implementation uses least-squares residualization and then applies
    Pearson correlation to residual vectors. It assumes a linear adjustment
    model for control effects.
    """
    x_values = _to_1d_float_array(x, "x")
    y_values = _to_1d_float_array(y, "y")

    if isinstance(controls, pd.DataFrame):
        control_array = controls.to_numpy(dtype=float)
    elif isinstance(controls, pd.Series):
        control_array = controls.to_numpy(dtype=float)
    else:
        control_array = np.asarray(controls, dtype=float)

    if control_array.ndim == 1:
        control_array = control_array.reshape(-1, 1)
    elif control_array.ndim != 2:
        raise ValueError("controls must be one-dimensional or two-dimensional")

    if x_values.size != y_values.size or x_values.size != control_array.shape[0]:
        raise ValueError("x, y, and controls must have the same number of rows")

    valid_mask = np.isfinite(x_values) & np.isfinite(y_values)
    valid_mask = valid_mask & np.all(np.isfinite(control_array), axis=1)

    x_clean = x_values[valid_mask]
    y_clean = y_values[valid_mask]
    z_clean = control_array[valid_mask]

    n_obs = int(x_clean.size)
    n_controls = int(z_clean.shape[1])
    df = n_obs - n_controls - 2

    if n_obs < 3:
        raise ValueError("At least 3 valid observations are required")
    if df <= 0:
        raise ValueError(
            "Not enough observations for partial correlation with the given controls; "
            "need n > number_of_controls + 2"
        )

    design = np.column_stack([np.ones(n_obs), z_clean])

    beta_x, _, _, _ = np.linalg.lstsq(design, x_clean, rcond=None)
    beta_y, _, _, _ = np.linalg.lstsq(design, y_clean, rcond=None)

    residual_x = x_clean - design @ beta_x
    residual_y = y_clean - design @ beta_y

    r, _ = _extract_stat_pvalue(scipy_stats.pearsonr(residual_x, residual_y))
    r = float(np.clip(r, -1.0, 1.0))

    if abs(r) >= 1.0:
        p_value = 0.0
    else:
        t_stat = r * np.sqrt(df / max(1.0 - r * r, np.finfo(float).eps))
        if alternative == "two-sided":
            p_value = float(2.0 * scipy_stats.t.sf(abs(t_stat), df=df))
        elif alternative == "greater":
            p_value = float(scipy_stats.t.sf(t_stat, df=df))
        elif alternative == "less":
            p_value = float(scipy_stats.t.cdf(t_stat, df=df))
        else:
            raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    return {
        "correlation": r,
        "p_value": p_value,
        "n": n_obs,
        "df": int(df),
    }
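The residualization idea can be demonstrated end to end with NumPy and SciPy alone: two variables that co-vary only through a shared depth trend show a strong raw correlation that largely vanishes once depth is regressed out. The data and the names `depth`, `zn`, `cu` below are synthetic and illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
depth = np.linspace(0.0, 100.0, 120)                 # shared driver (the control)
zn = 0.05 * depth + rng.normal(scale=0.5, size=120)  # element 1: depth trend + noise
cu = 0.04 * depth + rng.normal(scale=0.5, size=120)  # element 2: depth trend + noise

# raw correlation is inflated by the common depth trend
r_raw, _ = stats.pearsonr(zn, cu)

# residualize both variables on [1, depth], then correlate the residuals
design = np.column_stack([np.ones(depth.size), depth])
res_zn = zn - design @ np.linalg.lstsq(design, zn, rcond=None)[0]
res_cu = cu - design @ np.linalg.lstsq(design, cu, rcond=None)[0]
r_partial, _ = stats.pearsonr(res_zn, res_cu)  # near zero: no direct association
```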

pearson_correlation(x, y, alternative='two-sided')

Compute Pearson product-moment correlation coefficient.

Pearson correlation quantifies linear association between two variables. It is the standard first-pass measure when data are approximately linear, homoscedastic, and not dominated by outliers.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Pearson's r in [-1, 1].
  • p_value : p-value for testing zero linear correlation.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import pearson_correlation
>>> x = [10, 12, 15, 20, 21]
>>> y = [3.1, 3.8, 4.2, 5.7, 6.0]
>>> pearson_correlation(x, y)
{'correlation': 0.985..., 'p_value': 0.002..., 'n': 5}
Notes

Pearson correlation is sensitive to extreme values. For geochemical data with heavy tails or outliers, compare results with Spearman, Kendall, or biweight midcorrelation before interpretation.

Source code in minexpy/correlation.py
def pearson_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Pearson product-moment correlation coefficient.

    Pearson correlation quantifies linear association between two variables.
    It is the standard first-pass measure when data are approximately linear,
    homoscedastic, and not dominated by outliers.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Pearson's ``r`` in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero linear correlation.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import pearson_correlation
    >>> x = [10, 12, 15, 20, 21]
    >>> y = [3.1, 3.8, 4.2, 5.7, 6.0]
    >>> pearson_correlation(x, y)
    {'correlation': 0.985..., 'p_value': 0.002..., 'n': 5}

    Notes
    -----
    Pearson correlation is sensitive to extreme values. For geochemical data
    with heavy tails or outliers, compare results with Spearman, Kendall, or
    biweight midcorrelation before interpretation.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.pearsonr(x_clean, y_clean, alternative=alternative)
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
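The outlier sensitivity mentioned in the Notes can be made concrete: a single off-trend extreme value flips the sign of Pearson's r on otherwise perfectly correlated data, while rank-based Spearman stays positive (illustrative sketch using scipy.stats directly):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)  # 1, 2, ..., 10
y = x.copy()
y[-1] = -100.0            # one extreme off-trend value

r_pearson, _ = stats.pearsonr(x, y)  # sign flipped by the single outlier
rho, _ = stats.spearmanr(x, y)       # ranks limit the outlier's influence
```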

spearman_correlation(x, y, alternative='two-sided')

Compute Spearman rank correlation coefficient.

Spearman correlation measures monotonic dependence by correlating ranked values instead of raw values. It is often preferred for skewed geochemical variables and monotonic but nonlinear relationships.

Parameters:

Name Type Description Default
x array-like

First numeric variable.

required
y array-like

Second numeric variable.

required
alternative {'two-sided', 'greater', 'less'}

Alternative hypothesis used for p-value calculation.

'two-sided'

Returns:

Type Description
dict

Dictionary with:

  • correlation : Spearman's rho in [-1, 1].
  • p_value : p-value for testing zero rank association.
  • n : number of paired finite observations used.

Raises:

Type Description
ValueError

If inputs are invalid or contain too few valid pairs.

Examples:

>>> from minexpy.correlation import spearman_correlation
>>> x = [1, 2, 3, 4, 5]
>>> y = [1, 4, 9, 16, 25]
>>> spearman_correlation(x, y)
{'correlation': 1.0, 'p_value': 0.0..., 'n': 5}
Notes

Spearman is robust to monotonic nonlinear scaling but can still be influenced by large numbers of tied values. In highly tied data, Kendall's tau is often more conservative.

Source code in minexpy/correlation.py
def spearman_correlation(
    x: ArrayLike1D,
    y: ArrayLike1D,
    alternative: str = "two-sided",
) -> Dict[str, Union[float, int]]:
    """
    Compute Spearman rank correlation coefficient.

    Spearman correlation measures monotonic dependence by correlating ranked
    values instead of raw values. It is often preferred for skewed geochemical
    variables and monotonic but nonlinear relationships.

    Parameters
    ----------
    x : array-like
        First numeric variable.
    y : array-like
        Second numeric variable.
    alternative : {'two-sided', 'greater', 'less'}, default 'two-sided'
        Alternative hypothesis used for p-value calculation.

    Returns
    -------
    dict
        Dictionary with:

        - ``correlation`` : Spearman's ``rho`` in ``[-1, 1]``.
        - ``p_value`` : p-value for testing zero rank association.
        - ``n`` : number of paired finite observations used.

    Raises
    ------
    ValueError
        If inputs are invalid or contain too few valid pairs.

    Examples
    --------
    >>> from minexpy.correlation import spearman_correlation
    >>> x = [1, 2, 3, 4, 5]
    >>> y = [1, 4, 9, 16, 25]
    >>> spearman_correlation(x, y)
    {'correlation': 1.0, 'p_value': 0.0..., 'n': 5}

    Notes
    -----
    Spearman is robust to monotonic nonlinear scaling but can still be
    influenced by large numbers of tied values. In highly tied data,
    Kendall's tau is often more conservative.
    """
    x_clean, y_clean = _prepare_pair(x, y, min_n=2)
    result = scipy_stats.spearmanr(x_clean, y_clean, alternative=alternative)
    statistic, p_value = _extract_stat_pvalue(result)

    return {
        "correlation": statistic,
        "p_value": p_value,
        "n": int(x_clean.size),
    }
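A monotone but strongly nonlinear relationship illustrates why Spearman is preferred for such data: rank correlation is (numerically) perfect while Pearson understates the association. A sketch using scipy.stats directly:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 5.0, 30)
y = np.exp(x)  # monotone increasing, strongly nonlinear

rho, _ = stats.spearmanr(x, y)       # ranks match exactly
r_pearson, _ = stats.pearsonr(x, y)  # depressed by the curvature
```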

Statistical Visualization

minexpy.statviz

Statistical visualization module for geoscience datasets.

This module provides practical plotting helpers for common statistical visual diagnostics used during geochemical and environmental data analysis:

  • Histogram (linear and log-scale)
  • Box plot / violin plot
  • ECDF (empirical cumulative distribution function)
  • Q-Q plot
  • P-P plot
  • Scatter plot with optional trend line

All public plotting functions return (figure, axis) so users can apply additional Matplotlib customization (annotations, styling, export settings) after MinexPy constructs the base diagnostic plot.

Examples:

Basic histogram:

>>> import numpy as np
>>> from minexpy.statviz import plot_histogram
>>> values = np.random.lognormal(mean=2.2, sigma=0.4, size=200)
>>> fig, ax = plot_histogram(values, bins=30, scale='log')

Scatter with trend line:

>>> from minexpy.statviz import plot_scatter
>>> x = np.array([1, 2, 3, 4, 5])
>>> y = np.array([2, 4, 6, 8, 10])
>>> fig, ax = plot_scatter(x, y, add_trendline=True)

plot_box_violin(data, kind='box', labels=None, ax=None, show_means=True, color='tab:blue', xlabel='Variables', ylabel='Value', title=None)

Plot box plot or violin plot for one or multiple datasets.

Box and violin plots are complementary diagnostics for comparing spread and shape across variables, lithological domains, or spatial groups.

Parameters:

Name Type Description Default
data array-like, mapping, sequence of arrays, or DataFrame

One dataset or multiple datasets.

required
kind {'box', 'violin'}

Plot type to generate.

'box'
labels sequence of str

Custom labels replacing detected dataset names.

None
ax Axes

Existing axis to draw on.

None
show_means bool

If True, display mean marker/line.

True
color str

Primary face color for box/violin glyphs.

'tab:blue'
xlabel str

X-axis label.

'Variables'
ylabel str

Y-axis label.

'Value'
title str

Plot title. If omitted, a default title is used.

None

Returns:

Type Description
tuple

(figure, axis).

Raises:

Type Description
ValueError

If kind is not 'box' or 'violin'.

Examples:

>>> from minexpy.statviz import plot_box_violin
>>> fig, ax = plot_box_violin({'Zn': [1, 2, 3], 'Cu': [2, 3, 4]}, kind='violin')
Notes

For DataFrame input, each numeric column is treated as one distribution. Non-finite values are removed independently per dataset.

Source code in minexpy/statviz.py
def plot_box_violin(
    data: CollectionLike,
    kind: str = "box",
    labels: Optional[Sequence[str]] = None,
    ax: Optional[plt.Axes] = None,
    show_means: bool = True,
    color: str = "tab:blue",
    xlabel: str = "Variables",
    ylabel: str = "Value",
    title: Optional[str] = None,
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot box plot or violin plot for one or multiple datasets.

    Box and violin plots are complementary diagnostics for comparing spread
    and shape across variables, lithological domains, or spatial groups.

    Parameters
    ----------
    data : array-like, mapping, sequence of arrays, or DataFrame
        One dataset or multiple datasets.
    kind : {'box', 'violin'}, default 'box'
        Plot type to generate.
    labels : sequence of str, optional
        Custom labels replacing detected dataset names.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    show_means : bool, default True
        If ``True``, display mean marker/line.
    color : str, default 'tab:blue'
        Primary face color for box/violin glyphs.
    xlabel : str, default 'Variables'
        X-axis label.
    ylabel : str, default 'Value'
        Y-axis label.
    title : str, optional
        Plot title. If omitted, a default title is used.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Raises
    ------
    ValueError
        If ``kind`` is not ``'box'`` or ``'violin'``.

    Examples
    --------
    >>> from minexpy.statviz import plot_box_violin
    >>> fig, ax = plot_box_violin({'Zn': [1, 2, 3], 'Cu': [2, 3, 4]}, kind='violin')

    Notes
    -----
    For DataFrame input, each numeric column is treated as one distribution.
    Non-finite values are removed independently per dataset.
    """
    names, arrays = _parse_collection(data, labels=labels)
    fig, axis = _resolve_axis(ax)

    kind_lower = kind.lower()
    if kind_lower == "box":
        box = axis.boxplot(arrays, labels=names, patch_artist=True, showmeans=show_means)
        for patch in box["boxes"]:
            patch.set_facecolor(color)
            patch.set_alpha(0.45)
        default_title = "Box Plot"
    elif kind_lower == "violin":
        violin = axis.violinplot(arrays, showmeans=show_means, showmedians=True)
        for body in violin["bodies"]:
            body.set_facecolor(color)
            body.set_edgecolor("black")
            body.set_alpha(0.45)
        axis.set_xticks(np.arange(1, len(names) + 1))
        axis.set_xticklabels(names)
        default_title = "Violin Plot"
    else:
        raise ValueError("kind must be either 'box' or 'violin'")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(default_title if title is None else title)

    return fig, axis
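The per-dataset cleaning described in the Notes (non-finite values removed independently) can be sketched with plain NumPy; `_parse_collection` is internal, so this only illustrates the described behavior, not the actual helper:

```python
import numpy as np

raw = {"Zn": [1.0, np.nan, 3.0], "Cu": [2.0, 3.0, np.inf, 4.0]}
cleaned = {}
for name, vals in raw.items():
    arr = np.asarray(vals, dtype=float)
    cleaned[name] = arr[np.isfinite(arr)]  # NaN/inf dropped per dataset
# datasets may end up with different lengths, which box/violin plots accept
```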

plot_ecdf(data, labels=None, ax=None, xlabel='Value', ylabel='Empirical Cumulative Probability', title='ECDF')

Plot empirical cumulative distribution function (ECDF).

ECDF plots avoid histogram binning artifacts and are useful for direct comparison of distribution shifts and tail behavior across groups.

Parameters:

Name Type Description Default
data array-like, mapping, sequence of arrays, or DataFrame

One dataset or multiple datasets.

required
labels sequence of str

Custom labels replacing detected dataset names.

None
ax Axes

Existing axis to draw on.

None
xlabel str

X-axis label.

'Value'
ylabel str

Y-axis label.

'Empirical Cumulative Probability'
title str

Plot title.

'ECDF'

Returns:

Type Description
tuple

(figure, axis).

Examples:

>>> from minexpy.statviz import plot_ecdf
>>> fig, ax = plot_ecdf([1, 3, 2, 4, 8])
Source code in minexpy/statviz.py
def plot_ecdf(
    data: CollectionLike,
    labels: Optional[Sequence[str]] = None,
    ax: Optional[plt.Axes] = None,
    xlabel: str = "Value",
    ylabel: str = "Empirical Cumulative Probability",
    title: str = "ECDF",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot empirical cumulative distribution function (ECDF).

    ECDF plots avoid histogram binning artifacts and are useful for direct
    comparison of distribution shifts and tail behavior across groups.

    Parameters
    ----------
    data : array-like, mapping, sequence of arrays, or DataFrame
        One dataset or multiple datasets.
    labels : sequence of str, optional
        Custom labels replacing detected dataset names.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    xlabel : str, default 'Value'
        X-axis label.
    ylabel : str, default 'Empirical Cumulative Probability'
        Y-axis label.
    title : str, default 'ECDF'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_ecdf
    >>> fig, ax = plot_ecdf([1, 3, 2, 4, 8])
    """
    names, arrays = _parse_collection(data, labels=labels)
    fig, axis = _resolve_axis(ax)

    for name, values in zip(names, arrays):
        sorted_values = np.sort(values)
        empirical_prob = np.arange(1, sorted_values.size + 1) / sorted_values.size
        axis.step(sorted_values, empirical_prob, where="post", label=name)

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)

    if len(arrays) > 1 or labels is not None:
        axis.legend()

    return fig, axis
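The ECDF values themselves are simple to compute without plotting, matching the `arange / n` step construction in the source above:

```python
import numpy as np

values = np.array([1.0, 3.0, 2.0, 4.0, 8.0])
sorted_values = np.sort(values)
# empirical probability jumps by 1/n at each observation
ecdf = np.arange(1, sorted_values.size + 1) / sorted_values.size
```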

plot_histogram(data, bins=30, scale='linear', density=False, ax=None, color='tab:blue', alpha=0.75, label=None, xlabel='Value', ylabel=None, title='Histogram')

Plot a histogram with linear or logarithmic x-axis scaling.

Histograms are often the first diagnostic for distribution shape, outlier concentration, and modal behavior. scale='log' is especially useful for right-skewed concentration data with multiplicative structure.

Parameters:

Name Type Description Default
data array-like

One-dimensional numeric dataset.

required
bins int or sequence

Number of bins or explicit bin edges.

30
scale {'linear', 'log'}

X-axis scale and binning mode.

'linear'
density bool

If True, normalize bar heights to represent density.

False
ax Axes

Existing axis to draw on.

None
color str

Bar face color.

'tab:blue'
alpha float

Bar opacity.

0.75
label str

Legend label for plotted dataset.

None
xlabel str

X-axis label.

'Value'
ylabel str

Y-axis label. If None, set automatically to Frequency or Density based on density.

None
title str

Plot title.

'Histogram'

Returns:

Type Description
tuple

(figure, axis) for additional customization.

Raises:

Type Description
ValueError

If scale is invalid or if scale='log' and data contain non-positive values.

Examples:

>>> from minexpy.statviz import plot_histogram
>>> fig, ax = plot_histogram([1, 2, 2, 3, 5], bins=5)
>>> fig, ax = plot_histogram([1, 2, 3, 4], scale='log')
Notes

For scale='log' and integer bins, logarithmically spaced bin edges are constructed from the finite min/max of the data.

Source code in minexpy/statviz.py
def plot_histogram(
    data: ArrayLike1D,
    bins: Union[int, Sequence[float]] = 30,
    scale: str = "linear",
    density: bool = False,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    alpha: float = 0.75,
    label: Optional[str] = None,
    xlabel: str = "Value",
    ylabel: Optional[str] = None,
    title: str = "Histogram",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot a histogram with linear or logarithmic x-axis scaling.

    Histograms are often the first diagnostic for distribution shape,
    outlier concentration, and modal behavior. ``scale='log'`` is especially
    useful for right-skewed concentration data with multiplicative structure.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    bins : int or sequence, default 30
        Number of bins or explicit bin edges.
    scale : {'linear', 'log'}, default 'linear'
        X-axis scale and binning mode.
    density : bool, default False
        If ``True``, normalize bar heights to represent density.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Bar face color.
    alpha : float, default 0.75
        Bar opacity.
    label : str, optional
        Legend label for plotted dataset.
    xlabel : str, default 'Value'
        X-axis label.
    ylabel : str, optional
        Y-axis label. If ``None``, set automatically to ``Frequency`` or
        ``Density`` based on ``density``.
    title : str, default 'Histogram'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)`` for additional customization.

    Raises
    ------
    ValueError
        If ``scale`` is invalid or if ``scale='log'`` and data contain
        non-positive values.

    Examples
    --------
    >>> from minexpy.statviz import plot_histogram
    >>> fig, ax = plot_histogram([1, 2, 2, 3, 5], bins=5)
    >>> fig, ax = plot_histogram([1, 2, 3, 4], scale='log')

    Notes
    -----
    For ``scale='log'`` and integer bins, logarithmically spaced bin edges are
    constructed from the finite min/max of the data.
    """
    values = _to_1d_float_array(data)
    fig, axis = _resolve_axis(ax)

    scale_lower = scale.lower()
    if scale_lower not in {"linear", "log"}:
        raise ValueError("scale must be either 'linear' or 'log'")

    bins_to_use: Union[int, np.ndarray, Sequence[float]] = bins
    if scale_lower == "log":
        if np.any(values <= 0):
            raise ValueError("Log-scale histogram requires strictly positive values")

        if isinstance(bins, int):
            min_value = float(np.min(values))
            max_value = float(np.max(values))
            bins_to_use = np.logspace(np.log10(min_value), np.log10(max_value), bins + 1)

    axis.hist(
        values,
        bins=bins_to_use,
        density=density,
        color=color,
        alpha=alpha,
        edgecolor="black",
        linewidth=0.8,
        label=label,
    )

    if scale_lower == "log":
        axis.set_xscale("log")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel if ylabel is not None else ("Density" if density else "Frequency"))
    axis.set_title(title)

    if label:
        axis.legend()

    return fig, axis
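The log-binning rule from the Notes (logarithmically spaced edges between the data min and max when `bins` is an integer) can be reproduced directly with `np.logspace`:

```python
import numpy as np

values = np.array([1.2, 3.5, 8.0, 22.0, 60.0, 140.0])
bins = 5
# bins + 1 edges, geometrically spaced from min to max
edges = np.logspace(np.log10(values.min()), np.log10(values.max()), bins + 1)
```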

plot_pp(data, distribution='norm', distribution_parameters=None, fit_distribution=True, ax=None, color='tab:blue', xlabel='Theoretical Cumulative Probability', ylabel='Empirical Cumulative Probability', title='P-P Plot')

Plot probability-probability (P-P) diagnostic against a distribution.

P-P plots compare empirical cumulative probabilities against theoretical CDF values. They are often more sensitive in the central part of the distribution than Q-Q plots.

Parameters:

Name Type Description Default
data array-like

One-dimensional numeric dataset.

required
distribution str or scipy.stats distribution

Reference distribution name or object.

'norm'
distribution_parameters sequence of float

Fixed distribution parameters. If omitted and fit_distribution=True, parameters are estimated from the sample via distribution.fit when available.

None
fit_distribution bool

Fit distribution parameters from data when not provided.

True
ax Axes

Existing axis to draw on.

None
color str

Marker color.

'tab:blue'
xlabel str

X-axis label.

'Theoretical Cumulative Probability'
ylabel str

Y-axis label.

'Empirical Cumulative Probability'
title str

Plot title.

'P-P Plot'

Returns:

Type Description
tuple

(figure, axis).

Examples:

>>> from minexpy.statviz import plot_pp
>>> fig, ax = plot_pp([1.2, 1.4, 1.8, 2.0, 2.1])
Source code in minexpy/statviz.py
def plot_pp(
    data: ArrayLike1D,
    distribution: DistributionLike = "norm",
    distribution_parameters: Optional[Sequence[float]] = None,
    fit_distribution: bool = True,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    xlabel: str = "Theoretical Cumulative Probability",
    ylabel: str = "Empirical Cumulative Probability",
    title: str = "P-P Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot probability-probability (P-P) diagnostic against a distribution.

    P-P plots compare empirical cumulative probabilities against theoretical
    CDF values. They are often more sensitive in the central part of the
    distribution than Q-Q plots.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    distribution : str or scipy.stats distribution, default 'norm'
        Reference distribution name or object.
    distribution_parameters : sequence of float, optional
        Fixed distribution parameters. If omitted and ``fit_distribution=True``,
        parameters are estimated from the sample via ``distribution.fit`` when
        available.
    fit_distribution : bool, default True
        Fit distribution parameters from data when not provided.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Marker color.
    xlabel : str, default 'Theoretical Cumulative Probability'
        X-axis label.
    ylabel : str, default 'Empirical Cumulative Probability'
        Y-axis label.
    title : str, default 'P-P Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_pp
    >>> fig, ax = plot_pp([1.2, 1.4, 1.8, 2.0, 2.1])
    """
    values = np.sort(_to_1d_float_array(data))
    dist_obj = _resolve_distribution(distribution)

    params: Tuple[float, ...]
    if distribution_parameters is not None:
        params = tuple(float(p) for p in distribution_parameters)
    elif fit_distribution and hasattr(dist_obj, "fit"):
        params = tuple(float(p) for p in dist_obj.fit(values))
    else:
        params = tuple()

    n = values.size
    empirical_prob = (np.arange(1, n + 1) - 0.5) / n
    theoretical_prob = dist_obj.cdf(values, *params)
    theoretical_prob = np.clip(theoretical_prob, 0.0, 1.0)

    fig, axis = _resolve_axis(ax)
    axis.scatter(theoretical_prob, empirical_prob, color=color, alpha=0.8, label="Observed")
    axis.plot([0.0, 1.0], [0.0, 1.0], linestyle="--", color="black", linewidth=1.2, label="1:1")

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)
    axis.legend()

    return fig, axis
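The P-P construction above can be sketched without the plotting layer: sort the sample, compute the midpoint plotting positions `(i - 0.5) / n`, and evaluate the fitted reference CDF at each sorted value. A minimal numpy-only sketch assuming a normal reference (the CDF is written with `math.erf` so SciPy is not required; `plot_pp` itself delegates fitting to `distribution.fit`):

```python
import math
import numpy as np

def normal_cdf(x, mean, std):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8])
values = np.sort(data)
n = values.size

# Empirical plotting positions (i - 0.5) / n, as used by plot_pp
empirical = (np.arange(1, n + 1) - 0.5) / n

# MLE-style location/scale fit (ddof=0 matches scipy's norm.fit)
mean, std = values.mean(), values.std(ddof=0)
theoretical = np.array([normal_cdf(v, mean, std) for v in values])

# Both axes of a P-P plot live in [0, 1] and increase with the sorted sample
assert np.all((theoretical >= 0.0) & (theoretical <= 1.0))
assert np.all(np.diff(empirical) > 0)
```

Points from such a pair of arrays falling close to the dashed 1:1 line indicate good agreement in the body of the distribution.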

plot_qq(data, distribution='norm', distribution_parameters=None, fit_distribution=True, show_fit_line=True, ax=None, marker='o', color='tab:blue', xlabel='Theoretical Quantiles', ylabel='Sample Quantiles', title='Q-Q Plot')

Plot quantile-quantile (Q-Q) diagnostic against a theoretical distribution.

Q-Q plots assess distributional fit by comparing sample quantiles to theoretical quantiles. Tail deviations are especially informative for heavy-tailed geochemical variables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | array-like | One-dimensional numeric dataset. | required |
| `distribution` | str or scipy.stats distribution | Reference distribution name or object. | `'norm'` |
| `distribution_parameters` | sequence of float | Fixed distribution parameters. If omitted and `fit_distribution=True`, parameters are estimated from the sample via `distribution.fit` when available. | `None` |
| `fit_distribution` | bool | Fit distribution parameters from data when not provided. | `True` |
| `show_fit_line` | bool | If True, overlay least-squares line through plotted points. | `True` |
| `ax` | matplotlib.axes.Axes | Existing axis to draw on. | `None` |
| `marker` | str | Marker style for sample quantiles. | `'o'` |
| `color` | str | Marker color. | `'tab:blue'` |
| `xlabel` | str | X-axis label. | `'Theoretical Quantiles'` |
| `ylabel` | str | Y-axis label. | `'Sample Quantiles'` |
| `title` | str | Plot title. | `'Q-Q Plot'` |

Returns:

| Type | Description |
| --- | --- |
| tuple | `(figure, axis)`. |

Examples:

>>> from minexpy.statviz import plot_qq
>>> fig, ax = plot_qq([1.2, 1.4, 1.8, 2.0, 2.1])
Notes

The dashed 1:1 line shows ideal agreement. Systematic curvature indicates mismatch between empirical and theoretical distributions.

Source code in minexpy/statviz.py
def plot_qq(
    data: ArrayLike1D,
    distribution: DistributionLike = "norm",
    distribution_parameters: Optional[Sequence[float]] = None,
    fit_distribution: bool = True,
    show_fit_line: bool = True,
    ax: Optional[plt.Axes] = None,
    marker: str = "o",
    color: str = "tab:blue",
    xlabel: str = "Theoretical Quantiles",
    ylabel: str = "Sample Quantiles",
    title: str = "Q-Q Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot quantile-quantile (Q-Q) diagnostic against a theoretical distribution.

    Q-Q plots assess distributional fit by comparing sample quantiles to
    theoretical quantiles. Tail deviations are especially informative for
    heavy-tailed geochemical variables.

    Parameters
    ----------
    data : array-like
        One-dimensional numeric dataset.
    distribution : str or scipy.stats distribution, default 'norm'
        Reference distribution name or object.
    distribution_parameters : sequence of float, optional
        Fixed distribution parameters. If omitted and ``fit_distribution=True``,
        parameters are estimated from the sample via ``distribution.fit`` when
        available.
    fit_distribution : bool, default True
        Fit distribution parameters from data when not provided.
    show_fit_line : bool, default True
        If ``True``, overlay least-squares line through plotted points.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    marker : str, default 'o'
        Marker style for sample quantiles.
    color : str, default 'tab:blue'
        Marker color.
    xlabel : str, default 'Theoretical Quantiles'
        X-axis label.
    ylabel : str, default 'Sample Quantiles'
        Y-axis label.
    title : str, default 'Q-Q Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Examples
    --------
    >>> from minexpy.statviz import plot_qq
    >>> fig, ax = plot_qq([1.2, 1.4, 1.8, 2.0, 2.1])

    Notes
    -----
    The dashed 1:1 line shows ideal agreement. Systematic curvature indicates
    mismatch between empirical and theoretical distributions.
    """
    values = _to_1d_float_array(data)
    dist_obj = _resolve_distribution(distribution)

    params: Tuple[float, ...]
    if distribution_parameters is not None:
        params = tuple(float(p) for p in distribution_parameters)
    elif fit_distribution and hasattr(dist_obj, "fit"):
        params = tuple(float(p) for p in dist_obj.fit(values))
    else:
        params = tuple()

    theoretical_quantiles, sample_quantiles = scipy_stats.probplot(
        values,
        dist=dist_obj,
        sparams=params,
        fit=False,
    )

    fig, axis = _resolve_axis(ax)
    axis.scatter(
        theoretical_quantiles,
        sample_quantiles,
        marker=marker,
        color=color,
        alpha=0.8,
        label="Sample quantiles",
    )

    line_min = min(float(np.min(theoretical_quantiles)), float(np.min(sample_quantiles)))
    line_max = max(float(np.max(theoretical_quantiles)), float(np.max(sample_quantiles)))
    line_x = np.linspace(line_min, line_max, 200)

    axis.plot(line_x, line_x, linestyle="--", color="black", linewidth=1.2, label="1:1")

    if show_fit_line:
        fit = scipy_stats.linregress(theoretical_quantiles, sample_quantiles)
        axis.plot(
            line_x,
            fit.slope * line_x + fit.intercept,
            color="tab:red",
            linewidth=1.4,
            label=f"Fit line (r={fit.rvalue:.3f})",
        )

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)
    axis.legend()

    return fig, axis
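The quantile pairing that `plot_qq` scatters comes from `scipy.stats.probplot` with `fit=False`, which returns the reference distribution's quantiles alongside the ordered sample. A short sketch of that step and of the overlay fit line (this version skips the parameter-fitting branch and uses a standard normal reference, so for a sample drawn from N(10, 2) the fitted slope lands near the sample scale and the intercept near the sample location):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Raw point pairs of the Q-Q plot: theoretical quantiles vs ordered sample
theoretical, ordered = stats.probplot(sample, dist="norm", fit=False)

# The optional overlay in plot_qq is a least-squares line through those pairs
fit = stats.linregress(theoretical, ordered)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.3f}")
```

A near-unit `r` with slope and intercept tracking the sample's scale and location is the signature of a good normal fit; heavy-tailed geochemical variables instead show points peeling away from the line in the tails.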

plot_scatter(x, y, ax=None, color='tab:blue', alpha=0.8, marker='o', label=None, add_trendline=False, trendline_color='tab:red', xlabel='X', ylabel='Y', title='Scatter Plot')

Plot scatter data with optional least-squares trend line.

Scatter plots are central to geoscience exploratory analysis because they reveal linear/nonlinear patterns, heteroscedasticity, clustering, and potential outliers before formal modeling.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | array-like | X-variable values. | required |
| `y` | array-like | Y-variable values. | required |
| `ax` | matplotlib.axes.Axes | Existing axis to draw on. | `None` |
| `color` | str | Marker color. | `'tab:blue'` |
| `alpha` | float | Marker opacity. | `0.8` |
| `marker` | str | Marker style. | `'o'` |
| `label` | str | Legend label for plotted points. | `None` |
| `add_trendline` | bool | If True, overlay least-squares regression line and annotate Pearson r in legend. | `False` |
| `trendline_color` | str | Trend line color. | `'tab:red'` |
| `xlabel` | str | X-axis label. | `'X'` |
| `ylabel` | str | Y-axis label. | `'Y'` |
| `title` | str | Plot title. | `'Scatter Plot'` |

Returns:

| Type | Description |
| --- | --- |
| tuple | `(figure, axis)`. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If vector lengths differ or fewer than two valid paired values exist. |

Examples:

>>> from minexpy.statviz import plot_scatter
>>> fig, ax = plot_scatter([1, 2, 3], [2, 4, 6], add_trendline=True)
Source code in minexpy/statviz.py
def plot_scatter(
    x: ArrayLike1D,
    y: ArrayLike1D,
    ax: Optional[plt.Axes] = None,
    color: str = "tab:blue",
    alpha: float = 0.8,
    marker: str = "o",
    label: Optional[str] = None,
    add_trendline: bool = False,
    trendline_color: str = "tab:red",
    xlabel: str = "X",
    ylabel: str = "Y",
    title: str = "Scatter Plot",
) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot scatter data with optional least-squares trend line.

    Scatter plots are central to geoscience exploratory analysis because they
    reveal linear/nonlinear patterns, heteroscedasticity, clustering, and
    potential outliers before formal modeling.

    Parameters
    ----------
    x : array-like
        X-variable values.
    y : array-like
        Y-variable values.
    ax : matplotlib.axes.Axes, optional
        Existing axis to draw on.
    color : str, default 'tab:blue'
        Marker color.
    alpha : float, default 0.8
        Marker opacity.
    marker : str, default 'o'
        Marker style.
    label : str, optional
        Legend label for plotted points.
    add_trendline : bool, default False
        If ``True``, overlay least-squares regression line and annotate
        Pearson ``r`` in legend.
    trendline_color : str, default 'tab:red'
        Trend line color.
    xlabel : str, default 'X'
        X-axis label.
    ylabel : str, default 'Y'
        Y-axis label.
    title : str, default 'Scatter Plot'
        Plot title.

    Returns
    -------
    tuple
        ``(figure, axis)``.

    Raises
    ------
    ValueError
        If vector lengths differ or fewer than two valid paired values exist.

    Examples
    --------
    >>> from minexpy.statviz import plot_scatter
    >>> fig, ax = plot_scatter([1, 2, 3], [2, 4, 6], add_trendline=True)
    """
    x_values = _to_1d_float_array(x, name="x")
    y_values = _to_1d_float_array(y, name="y")

    if x_values.size != y_values.size:
        raise ValueError("x and y must have the same length")

    mask = np.isfinite(x_values) & np.isfinite(y_values)
    x_clean = x_values[mask]
    y_clean = y_values[mask]

    if x_clean.size < 2:
        raise ValueError("At least two valid paired observations are required")

    fig, axis = _resolve_axis(ax)
    point_label = "Samples" if label is None else label

    axis.scatter(
        x_clean,
        y_clean,
        color=color,
        alpha=alpha,
        marker=marker,
        label=point_label,
    )

    if add_trendline:
        fit = scipy_stats.linregress(x_clean, y_clean)
        line_x = np.linspace(float(np.min(x_clean)), float(np.max(x_clean)), 200)
        line_y = fit.slope * line_x + fit.intercept
        axis.plot(
            line_x,
            line_y,
            color=trendline_color,
            linewidth=1.6,
            label=f"Trend line (r={fit.rvalue:.3f})",
        )

    axis.set_xlabel(xlabel)
    axis.set_ylabel(ylabel)
    axis.set_title(title)

    if label is not None or add_trendline:
        axis.legend()

    return fig, axis
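The `add_trendline` path above is ordinary least squares on the finite paired values via `scipy.stats.linregress`. A minimal sketch of that computation on exactly linear data, where the fitted slope, intercept, and Pearson r are known in advance:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # exactly linear, so r should be 1

# Mirror plot_scatter's cleaning step: keep only finite pairs
mask = np.isfinite(x) & np.isfinite(y)
fit = stats.linregress(x[mask], y[mask])

assert abs(fit.slope - 2.0) < 1e-9
assert abs(fit.intercept - 1.0) < 1e-9
assert abs(fit.rvalue - 1.0) < 1e-9
```

The `fit.rvalue` recovered here is the Pearson correlation that `plot_scatter` annotates in the legend as `r=...`.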