Relational Dataset Archive

Overview

Relational Dataset Archive: An archive of standard, versioned benchmark relational datasets.


Basic Usage

This is a collection of datasets that have passed all checks set in the “Relational Data Linter.”

Many are split into standard cross-validation folds for benchmarking relational learning and inference algorithms.

One of these libraries can be used to manage these datasets locally: relational-datasets (Python) or RelationalDatasets.jl (Julia).

Screenshot of the GitHub assets for datasets 0.0.5. It shows a table of dataset names, version numbers, and their size in bytes.


Contributing a Dataset

I would love more datasets, and I would love any feedback on whether this is useful to your research!

I drew quite a bit of inspiration for this from Jonas Schouterden’s RelationalDatasets repository.


Data Versioning and Downloading

Specific Version: Each version of a data archive can be downloaded by sending a request to a URL with the following pattern, where {VERSION} is a release tag and {NAME} is the name of a dataset:

https://github.com/srlearn/datasets/releases/download/{VERSION}/{NAME}_{VERSION}.zip
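The pattern above can be filled in programmatically. The following is a minimal sketch, not part of either library: `BASE_URL` and `archive_url` are hypothetical names introduced here for illustration, and the function only substitutes the two placeholders into the pattern.

```python
# Hypothetical helper for illustration; not part of relational-datasets.
BASE_URL = "https://github.com/srlearn/datasets/releases/download"


def archive_url(name: str, version: str) -> str:
    """Fill the {NAME} and {VERSION} placeholders of the release URL pattern."""
    return f"{BASE_URL}/{version}/{name}_{version}.zip"


print(archive_url("toy_cancer", "v0.0.4"))
# https://github.com/srlearn/datasets/releases/download/v0.0.4/toy_cancer_v0.0.4.zip
```

Note that the version tag appears twice in the URL: once as the release directory and once in the archive file name.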

Examples

curl

Download version v0.0.4 of toy_cancer:

curl -L https://github.com/srlearn/datasets/releases/download/v0.0.4/toy_cancer_v0.0.4.zip > toy_cancer_v0.0.4.zip

Download version v0.0.4 of webkb:

curl -L https://github.com/srlearn/datasets/releases/download/v0.0.4/webkb_v0.0.4.zip > webkb_v0.0.4.zip
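Where curl is not available, the same download can be sketched with the Python standard library. This is an assumption-laden illustration rather than a supported interface: `download_archive` is a hypothetical helper, and it performs no checksum or error handling beyond what `urllib` raises on its own.

```python
import shutil
import urllib.request
from pathlib import Path


def download_archive(url: str, destination: str) -> Path:
    """Stream the content at `url` into the file `destination`.

    Hypothetical helper for illustration. `urlopen` follows HTTP
    redirects by default, which plays the role of curl's `-L` flag.
    """
    target = Path(destination)
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        # Copy in chunks so the archive is never held fully in memory.
        shutil.copyfileobj(response, out)
    return target
```

For example, calling `download_archive("https://github.com/srlearn/datasets/releases/download/v0.0.4/toy_cancer_v0.0.4.zip", "toy_cancer_v0.0.4.zip")` should write the same file as the first curl command above.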

relational-datasets

Load version v0.0.4 of toy_cancer:

from relational_datasets import load

train, test = load("toy_cancer", "v0.0.4")

RelationalDatasets.jl

Load version v0.0.4 of toy_cancer:

using RelationalDatasets

train, test = load("toy_cancer", "v0.0.4")