Statistical Analysis

MinexPy provides descriptive statistics workflows for single-element and multi-element geochemical datasets. This page consolidates practical statistical analysis examples, starting with the original core workflows first.

Available Statistical Tools

describe: One-call summary statistics and percentiles.
StatisticalAnalyzer: Object-oriented API for arrays and DataFrames.
z_score: Standard score calculation for anomaly/outlier screening.
percentile: Percentile calculation for threshold-based interpretation.

Basic Statistical Analysis

Analyze a single geochemical element:

import numpy as np
from minexpy.stats import StatisticalAnalyzer

# Sample zinc (Zn) concentration data (ppm)
zn_data = np.array([45.2, 52.3, 38.7, 61.2, 49.8, 55.1, 42.3, 58.9, 47.6, 51.4])

# Create analyzer
analyzer = StatisticalAnalyzer(zn_data)

# Get comprehensive summary
summary = analyzer.summary()
print(summary)

Multi-Element Analysis

Compare statistics across multiple elements:

import pandas as pd
from minexpy.stats import StatisticalAnalyzer

# Create sample geochemical data
data = {
    "Zn": [45.2, 52.3, 38.7, 61.2, 49.8],
    "Cu": [12.5, 15.3, 18.7, 22.1, 19.4],
    "Pb": [8.3, 10.1, 9.5, 11.2, 9.8],
    "Ag": [0.5, 0.7, 0.6, 0.9, 0.8],
}
df = pd.DataFrame(data)

# Analyze all elements
analyzer = StatisticalAnalyzer(df)
summary_df = analyzer.summary()

print(summary_df)

Outlier Detection with Z-Scores

Identify anomalous values using z-scores:

import numpy as np
from minexpy.stats import z_score

# Geochemical data with potential outliers
data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 45.2, 14.2, 17.9])

# Calculate z-scores
z_scores = z_score(data)

# Identify outliers (|z| > 2)
outlier_mask = np.abs(z_scores) > 2
outliers = data[outlier_mask]
outlier_indices = np.where(outlier_mask)[0]

print(f"Outliers found: {outliers}")
print(f"At indices: {outlier_indices}")
print(f"Z-scores: {z_scores[outlier_mask]}")

Distribution Shape Diagnostics

Interpret skewness, kurtosis, and coefficient of variation:

import numpy as np
from minexpy.stats import StatisticalAnalyzer

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2])

analyzer = StatisticalAnalyzer(data)

# Get distribution metrics
skew = analyzer.skewness()
kurt = analyzer.kurtosis()
cv = analyzer.coefficient_of_variation()

print(f"Skewness: {skew:.3f}")
if skew > 0.5:
    print("  → Right-skewed distribution (positive tail)")
elif skew < -0.5:
    print("  → Left-skewed distribution (negative tail)")
else:
    print("  → Approximately symmetric distribution")

print(f"\nKurtosis: {kurt:.3f}")
if kurt > 0:
    print("  → Heavy tails (more outliers than normal)")
else:
    print("  → Light tails (fewer outliers than normal)")

print(f"\nCoefficient of Variation: {cv:.3f} ({cv * 100:.1f}%)")
if cv < 0.15:
    print("  → Low variability")
elif cv > 0.5:
    print("  → High variability")
else:
    print("  → Moderate variability")

Percentile Analysis

Inspect tail behavior and thresholds with custom percentiles:

import numpy as np
from minexpy.stats import describe

data = np.array([12.5, 15.3, 18.7, 22.1, 19.4, 16.8, 25.0, 14.2, 20.5, 17.6])

# Get summary with custom percentiles
summary = describe(data, percentiles=[10, 25, 50, 75, 90, 95, 99])

# Display key percentiles
print(f"10th percentile: {summary['percentile_10']:.2f}")
print(f"25th percentile (Q1): {summary['percentile_25']:.2f}")
print(f"50th percentile (Median): {summary['percentile_50']:.2f}")
print(f"75th percentile (Q3): {summary['percentile_75']:.2f}")
print(f"90th percentile: {summary['percentile_90']:.2f}")
print(f"95th percentile: {summary['percentile_95']:.2f}")
print(f"99th percentile: {summary['percentile_99']:.2f}")
print(f"\nIQR: {summary['iqr']:.2f}")

Working with CSV Data

Load and analyze selected elements from a CSV file:

import pandas as pd
from minexpy.stats import StatisticalAnalyzer

# Load geochemical data
df = pd.read_csv("geochemical_data.csv")

# Select elements of interest
elements = ["Zn", "Cu", "Pb", "Ag", "Mo"]

# Analyze each element
for element in elements:
    if element in df.columns:
        analyzer = StatisticalAnalyzer(df[element])
        summary = analyzer.summary()

        print(f"\n=== {element} Statistics ===")
        print(f"Mean: {summary['mean']:.2f}")
        print(f"Median: {summary['median']:.2f}")
        print(f"Std Dev: {summary['std']:.2f}")
        print(f"Skewness: {summary['skewness']:.3f}")
        print(f"Kurtosis: {summary['kurtosis']:.3f}")
        print(f"CV: {summary['coefficient_of_variation']:.3f}")

Comparing Variability Across Elements

Compare coefficient of variation (CV) across multiple elements:

import pandas as pd
from minexpy.stats import StatisticalAnalyzer

df = pd.read_csv("geochemical_data.csv")
elements = ["Zn", "Cu", "Pb"]

analyzer = StatisticalAnalyzer(df[elements])
summary = analyzer.summary()

# Compare coefficient of variation
print("Variability Comparison (CV):")
for element in elements:
    cv = summary.loc[element, "coefficient_of_variation"]
    print(f"{element}: {cv:.3f} ({cv * 100:.1f}%)")

# Find most variable element
most_variable = summary["coefficient_of_variation"].idxmax()
print(f"\nMost variable element: {most_variable}")