rfcorr

software

Python library for Random Forest-based correlation measures, providing alternative approaches to traditional correlation analysis using tree-based ensemble methods

period: 2022-present
tech: Machine Learning
══════════════════════════════════════════════════════════════════

rfcorr is a Python library that measures correlation with tree-based ensemble methods, addressing limitations of traditional measures such as Pearson and Spearman, which capture only linear and monotonic relationships respectively.

Motivation

“Countless tasks rely on conceptions and formalizations of ‘correlation.’ But in two decades of working in areas that utilize correlation… I have found that few are measuring what their words reveal is their intuition.”

This library implements an open research agenda for alternative correlation concepts based on random forests and other tree-based ensembles.

Key Features

Advanced Correlation Properties

  • Asymmetric correlations: R(x,y) ≠ R(y,x)
  • Mixed data types: Naturally handles categorical and continuous variables
  • Non-linear relationships: Captures complex dependencies
  • Lagged correlations: Time-series analysis support
  • Uncertainty estimation: Built-in confidence measures
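
To make the asymmetry property concrete, here is a minimal standard-library sketch. It is not rfcorr's implementation: it assumes a small hand-rolled ensemble of bagged regression trees and uses the out-of-bag R² of predicting one variable from the other as the directional score. The helper names (fit_tree, predict_tree, forest_score) and the choice of score are hypothetical, chosen only for illustration.

```python
import random
from statistics import fmean


def fit_tree(xs, ys, depth, min_leaf=5):
    """Fit a one-feature regression tree; a leaf is a float, a node is a tuple."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        return fmean(ys)
    order = sorted(range(len(xs)), key=xs.__getitem__)
    sx = [xs[i] for i in order]
    sy = [ys[i] for i in order]
    n, total = len(sx), sum(sy)
    best_gain, best_k, left_sum = None, None, 0.0
    for k in range(1, n):
        left_sum += sy[k - 1]
        if k < min_leaf or n - k < min_leaf or sx[k] == sx[k - 1]:
            continue
        # Maximizing this quantity is equivalent to minimizing squared error.
        gain = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
        if best_gain is None or gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None:
        return fmean(ys)
    threshold = (sx[best_k - 1] + sx[best_k]) / 2
    left = fit_tree(sx[:best_k], sy[:best_k], depth - 1, min_leaf)
    right = fit_tree(sx[best_k:], sy[best_k:], depth - 1, min_leaf)
    return (threshold, left, right)


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def forest_score(xs, ys, n_trees=30, depth=4, seed=0):
    """Out-of-bag R^2 of predicting ys from xs with bagged trees (illustrative score)."""
    rng = random.Random(seed)
    n = len(xs)
    oob_sum, oob_cnt = [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = fit_tree([xs[i] for i in idx], [ys[i] for i in idx], depth)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # score only points this tree never saw
                oob_sum[i] += predict_tree(tree, xs[i])
                oob_cnt[i] += 1
    pairs = [(ys[i], oob_sum[i] / oob_cnt[i]) for i in range(n) if oob_cnt[i]]
    mean_y = fmean(y for y, _ in pairs)
    ss_res = sum((y - p) ** 2 for y, p in pairs)
    ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)
    return 1 - ss_res / ss_tot


# y is a noisy function of x, but x cannot be recovered from y (its sign is lost).
data_rng = random.Random(42)
x = [data_rng.uniform(-1, 1) for _ in range(300)]
y = [v * v + data_rng.gauss(0, 0.05) for v in x]

score_xy = forest_score(x, y)  # how well x predicts y
score_yx = forest_score(y, x)  # how well y predicts x
print(f"x -> y: {score_xy:.2f}")
print(f"y -> x: {score_yx:.2f}")
```

In this setup x predicts y well while y says little about x, so the two directional scores differ, which is the sense in which a tree-based measure can be asymmetric. A symmetric measure such as Pearson's r returns a single value for the pair.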

Supported Algorithms

  • Random Forest (fully implemented)
  • Extra Trees (fully implemented)
  • CatBoost (work in progress)
  • XGBoost (work in progress)

Technical Advantages

Tree-based approaches offer several benefits over traditional methods:

  • Support for categorical features without encoding
  • Robust handling of missing data
  • Reduced overfitting through averaging across many trees
  • Capture of interaction effects
  • Generation of predictive models alongside correlation measures

Installation & Usage

pip install rfcorr

Basic usage example:

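import rfcorr.random_forest

# df is assumed to be a pandas DataFrame with one column per variable.
correlation_matrix = rfcorr.random_forest.get_pairwise_corr(
    df.values,             # underlying NumPy array, one column per variable
    num_trees=100,         # number of trees in each forest
    lag=0,                 # no time shift between variables
    method="regression",   # treat variables as continuous targets
    use_permutation=True   # score with permutation-based importance
)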
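
The call above takes a lag argument. How rfcorr applies it internally is not shown here. The sketch below illustrates one common convention, pairing x[t - lag] with y[t], and should be read as an assumption, not as the library's behavior. The lagged_pairs and pearson helpers are hypothetical, and Pearson's r is used only to keep the example short; a tree-based measure would be fed the same aligned pairs.

```python
import math
import random


def lagged_pairs(x, y, lag):
    """Align x[t - lag] with y[t]; lag=0 returns both series unchanged."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return list(x), list(y)
    return list(x[:-lag]), list(y[lag:])


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
# y repeats x three steps later, plus a little noise.
y = [(x[t - 3] if t >= 3 else 0.0) + rng.gauss(0, 0.1) for t in range(n)]

r_lag0 = pearson(*lagged_pairs(x, y, 0))  # misaligned: dependence is hidden
r_lag3 = pearson(*lagged_pairs(x, y, 3))  # aligned: dependence is visible
print(f"lag 0: {r_lag0:.2f}")
print(f"lag 3: {r_lag3:.2f}")
```

Scanning several lag values and keeping the score for each is a simple way to find the delay at which one series is most informative about another.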

Applications

The library includes demonstration notebooks showing:

  • Rolling correlation analysis for financial time series (SPDR Sector ETFs)
  • Handling of periodic data where traditional correlations fail
  • Mixed categorical/continuous feature analysis
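
The periodic-data case can be illustrated without the library. In the sketch below, y is a deterministic function of x, yet Pearson's r is close to zero because the relationship has no linear trend over whole periods. As a stand-in for tree splits, the example partitions x into equal-width bins and predicts the bin mean of y. This binning is a deliberate simplification, the R² is computed in-sample, and the helper names are hypothetical.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def binned_r2(a, b, bins=20):
    """In-sample R^2 of predicting b by its mean within equal-width bins of a."""
    lo, hi = min(a), max(a)
    width = (hi - lo) / bins
    index = [min(int((u - lo) / width), bins - 1) for u in a]
    sums, counts = [0.0] * bins, [0] * bins
    for i, v in zip(index, b):
        sums[i] += v
        counts[i] += 1
    mean_b = sum(b) / len(b)
    ss_res = sum((v - sums[i] / counts[i]) ** 2 for i, v in zip(index, b))
    ss_tot = sum((v - mean_b) ** 2 for v in b)
    return 1 - ss_res / ss_tot


n = 400
x = [4 * math.pi * k / n for k in range(n)]  # two full periods
y = [math.cos(v) for v in x]                 # fully determined by x

r = pearson(x, y)
r2 = binned_r2(x, y)
print(f"Pearson r: {r:.3f}")
print(f"Binned R^2: {r2:.3f}")
```

A piecewise-constant predictor picks up the dependence that the linear coefficient misses, which is the kind of relationship a tree-based measure is designed to capture.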

Research Impact

This project challenges conventional thinking about correlation measurement and provides practitioners with more flexible tools for understanding relationships in complex, real-world data where traditional linear assumptions may not hold.
