rfcorr
Software: Python library for Random Forest-based correlation measures, providing alternative approaches to traditional correlation analysis using tree-based ensemble methods
An innovative Python library that reimagines correlation measurement using tree-based ensemble methods, addressing limitations of traditional correlation approaches like Pearson and Spearman.
Motivation
“Countless tasks rely on conceptions and formalizations of ‘correlation.’ But in two decades of working in areas that utilize correlation… I have found that few are measuring what their words reveal is their intuition.”
This library implements an open research agenda for alternative correlation concepts based on random forests and other tree-based ensembles.
Key Features
Advanced Correlation Properties
- Asymmetric correlations: R(x,y) ≠ R(y,x)
- Mixed data types: Naturally handles categorical and continuous variables
- Non-linear relationships: Captures complex dependencies
- Lagged correlations: Time-series analysis support
- Uncertainty estimation: Built-in confidence measures
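The asymmetry property can be illustrated with a minimal sketch. It does not use rfcorr itself: it assumes scikit-learn's RandomForestRegressor and takes the out-of-bag R² of predicting one variable from the other as a stand-in score. The helper name oob_r2 and the synthetic data are hypothetical, and the library's own scoring may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def oob_r2(source, target, num_trees=200, seed=0):
    """Out-of-bag R^2 of a random forest predicting `target` from `source`.

    Hypothetical helper for illustration; not part of rfcorr.
    """
    model = RandomForestRegressor(
        n_estimators=num_trees,
        oob_score=True,        # score each sample with trees that did not see it
        min_samples_leaf=5,
        random_state=seed,
    )
    model.fit(np.asarray(source).reshape(-1, 1), np.asarray(target))
    return model.oob_score_


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = x ** 2 + rng.normal(scale=0.01, size=x.size)

r_xy = oob_r2(x, y)  # x determines y, so this score is high
r_yx = oob_r2(y, x)  # y cannot recover the sign of x, so this score is low

print(f"R(x,y) = {r_xy:.3f}, R(y,x) = {r_yx:.3f}")
```

A symmetric measure such as Pearson's coefficient would report the same value in both directions for this pair.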
Supported Algorithms
- Random Forest (fully implemented)
- Extra Trees (fully implemented)
- CatBoost (work in progress)
- XGBoost (work in progress)
Technical Advantages
Tree-based approaches offer several benefits over traditional methods:
- Support for categorical features without encoding
- Robust handling of missing data
- Resistance to overfitting through averaging over many trees
- Capture of interaction effects
- Generation of predictive models alongside correlation measures
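As a rough illustration of the interaction-effect point, the sketch below builds a target that depends only on the product of two features. Each feature has near-zero Pearson correlation with the target, yet a fitted forest assigns both a large permutation importance. This is an assumption-laden example using scikit-learn's RandomForestRegressor and permutation_importance, not rfcorr's internals, and the synthetic data is invented for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 3))
y = X[:, 0] * X[:, 1]  # pure interaction; column 2 is irrelevant noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Marginal linear correlation of each feature with the target
pearson = [np.corrcoef(X[:, j], y)[0, 1] for j in range(3)]

# Drop in held-out R^2 when each feature is shuffled
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = result.importances_mean

for j in range(3):
    print(f"feature {j}: pearson={pearson[j]:+.3f}, importance={importances[j]:.3f}")
```

The fitted model object is also available afterwards for prediction, which is the sense in which a predictive model comes alongside the dependence measure.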
Installation & Usage
pip install rfcorr
Basic usage example:
# df is assumed to be a pandas DataFrame with one column per variable
import rfcorr.random_forest

correlation_matrix = rfcorr.random_forest.get_pairwise_corr(
    df.values,
    num_trees=100,
    lag=0,
    method="regression",
    use_permutation=True
)
Applications
The library includes demonstration notebooks showing:
- Rolling correlation analysis for financial time series (SPDR Sector ETFs)
- Handling of periodic data where traditional correlations fail
- Mixed categorical/continuous feature analysis
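The periodic-data case can be reproduced on synthetic data with a short sketch. It is not taken from the demonstration notebooks: it assumes SciPy's pearsonr and spearmanr for the traditional measures and a scikit-learn random forest scored by cross-validated R² as a tree-based stand-in.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0 * np.pi, size=1000)  # two full periods
y = np.cos(x) + rng.normal(scale=0.05, size=x.size)

# Traditional measures: both are close to zero for this relationship
pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

# Tree-based stand-in: how well a forest predicts y from x on held-out folds
model = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=5, random_state=0
)
forest_r2 = cross_val_score(
    model, x.reshape(-1, 1), y, cv=5, scoring="r2"
).mean()

print(f"Pearson:  {pearson_r:+.3f}")
print(f"Spearman: {spearman_r:+.3f}")
print(f"Forest cross-validated R^2: {forest_r2:.3f}")
```

Here y is almost fully determined by x, but because the cosine rises and falls symmetrically over the sampled range, the linear and rank-based coefficients average out to roughly zero.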
Research Impact
This project challenges conventional thinking about correlation measurement and provides practitioners with more flexible tools for understanding relationships in complex, real-world data where traditional linear assumptions may not hold.