Data usage

August 10, 2023

imodels 🔍 data

Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.

See also https://huggingface.co/imodels

Includes the following datasets, among others (see the notebooks for details on each dataset).

To download, use the "Name" field as the key, e.g. `imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')`.

| Name | Samples | Features | Class 0 | Class 1 | Majority class % |
|---|---|---|---|---|---|
| heart | 270 | 15 | 150 | 120 | 55.6 |
| breast_cancer | 277 | 17 | 196 | 81 | 70.8 |
| haberman | 306 | 3 | 81 | 225 | 73.5 |
| credit_g | 1000 | 60 | 300 | 700 | 70 |
| csi_pecarn_prop | 3313 | 97 | 2773 | 540 | 83.7 |
| csi_pecarn_pred | 3313 | 39 | 2773 | 540 | 83.7 |
| juvenile_clean | 3640 | 286 | 3153 | 487 | 86.6 |
| compas_two_year_clean | 6172 | 20 | 3182 | 2990 | 51.6 |
| enhancer | 7809 | 80 | 7115 | 694 | 91.1 |
| fico | 10459 | 23 | 5000 | 5459 | 52.2 |
| iai_pecarn_prop | 12044 | 73 | 11841 | 203 | 98.3 |
| iai_pecarn_pred | 12044 | 58 | 11841 | 203 | 98.3 |
| credit_card_clean | 30000 | 33 | 23364 | 6636 | 77.9 |
| tbi_pecarn_prop | 42428 | 223 | 42052 | 376 | 99.1 |
| tbi_pecarn_pred | 42428 | 121 | 42052 | 376 | 99.1 |
| readmission_clean | 101763 | 150 | 54861 | 46902 | 53.9 |
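The "Majority class %" column is simply the larger class count divided by the total sample count. A quick sanity check on a few rows, with the class counts copied from the table above:

```python
def majority_pct(class0: int, class1: int) -> float:
    """Percent of samples in the larger class, rounded to one decimal place."""
    return round(100 * max(class0, class1) / (class0 + class1), 1)

# (class 0, class 1) counts copied from the table above
assert majority_pct(150, 120) == 55.6    # heart
assert majority_pct(3182, 2990) == 51.6  # compas_two_year_clean
assert majority_pct(42052, 376) == 99.1  # tbi_pecarn_prop
```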

Data usage

First, install the imodels package: `pip install imodels`. Then, use the `imodels.get_clean_dataset` function.

```python
imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources, including imodels, pmlb, openml, and sklearn.
If the data has not been downloaded yet, it is downloaded and cached; otherwise it is loaded locally.

Parameters
----------
dataset_name: str
    unique dataset identifier
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')

Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
    name of each feature column
"""
```

Example

```python
# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download ca housing from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
```
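The returned values follow scikit-learn conventions: `X` is a 2-D array of shape `(n_samples, n_features)`, `y` is a 1-D array of length `n_samples`, and `feature_names` has one entry per column of `X`. A minimal sanity check, using toy arrays in place of a real download (the shapes below mirror the `heart` row in the table above):

```python
import numpy as np

# Toy stand-ins for the arrays returned by imodels.get_clean_dataset;
# a real call would download the named dataset instead.
X = np.zeros((270, 15))                       # heart: 270 samples, 15 features
y = np.zeros(270)                             # one binary outcome per sample
feature_names = [f"f{i}" for i in range(15)]  # one name per feature column

assert X.shape[0] == y.shape[0]          # samples align between X and y
assert X.shape[1] == len(feature_names)  # columns align with feature names
```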

Data info

Data comes from various sources; please cite those sources appropriately.

`notebooks_fetch_data` contains notebooks which download and preprocess the data

`data_cleaned` contains the cleaned CSV file for each dataset

Clinical decision-rule (PECARN) datasets

To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.

There are two versions of each PECARN (TBI, IAI, and CSI) dataset.

  • prop: missing values have not been imputed
  • pred: missing values have been imputed

`csi_pecarn_pred.csv` note: unlike the rest of the datasets in this repo, which are fully cleaned, `csi_pecarn_pred.csv` contains a variable (`SITE`) that should be removed before fitting models.
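A minimal sketch of that cleanup step with pandas; the toy DataFrame and its column names (other than `SITE`) are hypothetical stand-ins for the real CSV, which has many more columns:

```python
import pandas as pd

# Toy frame standing in for csi_pecarn_pred.csv (hypothetical feature/outcome names).
df = pd.DataFrame({
    "SITE": [1, 2, 1],      # study-site identifier -- not a clinical feature
    "feat_a": [0, 1, 0],
    "outcome": [0, 1, 1],
})

# Drop the SITE column before fitting any model, per the note above.
df = df.drop(columns=["SITE"])
print(list(df.columns))  # -> ['feat_a', 'outcome']
```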

| Dataset | Task | Size | References |
|---|---|---|---|
| iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I | 📄, 🔗 |
| tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI | 📄, 🔗 |
| csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI | 📄, 🔗 |

Miscellaneous notes

The breast_cancer dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.

Some other cool datasets:

  • moleculenet - benchmarks for molecular datasets
  • srbench - benchmarking for symbolic regression
  • big-bench - language modeling benchmarks