DECADES Datasets

This folder contains three groups of datasets for graph/sparse applications:

  1. small - datasets that contain fewer than 1 million nodes

  2. big - datasets that contain at least 1 million nodes

  3. bipartite - datasets that are bipartite networks (partitions contain fewer than 1 million nodes)

The "small" and "bipartite" folders contain a single edgelist file per dataset. The "big" folder contains four files for each dataset, one text file and three binary files, which together store the graph in Compressed Sparse Row (CSR) format:

  1. num_nodes_edges.txt - contains information about the size (nodes, edges) of the dataset

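  2. node_array.bin - binary representation of the node pointers in the dataset (in CSR, the per-node offsets into the edge array)

  3. edge_array.bin - binary representation of the edge pointers in the dataset (in CSR, the neighbor node at the end of each edge)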

  4. edge_values.bin - binary representation of the edge values in the dataset
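
To illustrate how the three arrays relate in CSR, here is a minimal in-memory sketch that builds them from a small edge list and walks one node's outgoing edges. The names CsrGraph, Edge, build_csr and out_value_sum are hypothetical, and the element types (64-bit offsets, 32-bit node IDs, 32-bit integer values) are assumptions; they are not taken from parse_bin_files.cpp and may not match the widths used in the .bin files.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical in-memory CSR container. Field names mirror the file names
// in this folder; the element types are assumptions.
struct CsrGraph {
    std::vector<std::uint64_t> node_array;   // size num_nodes + 1, offsets into edge_array
    std::vector<std::uint32_t> edge_array;   // destination node of each edge
    std::vector<std::int32_t>  edge_values;  // value of each edge
};

// Hypothetical edge record used only to feed the builder.
struct Edge {
    std::uint32_t src;
    std::uint32_t dst;
    std::int32_t  value;
};

CsrGraph build_csr(std::uint32_t num_nodes, const std::vector<Edge>& edges) {
    CsrGraph g;
    g.node_array.assign(static_cast<std::size_t>(num_nodes) + 1, 0);
    g.edge_array.resize(edges.size());
    g.edge_values.resize(edges.size());

    // Count the out-degree of each source node, stored one slot to the right.
    for (const Edge& e : edges) {
        assert(e.src < num_nodes && e.dst < num_nodes);
        ++g.node_array[e.src + 1];
    }
    // A prefix sum turns the counts into starting offsets.
    for (std::uint32_t v = 0; v < num_nodes; ++v) {
        g.node_array[v + 1] += g.node_array[v];
    }
    // Scatter each edge into its node's range using a moving cursor per node.
    std::vector<std::uint64_t> cursor(g.node_array.begin(), g.node_array.end() - 1);
    for (const Edge& e : edges) {
        const std::uint64_t pos = cursor[e.src]++;
        g.edge_array[pos] = e.dst;
        g.edge_values[pos] = e.value;
    }
    return g;
}

// Sum of the edge values leaving node v: the usual CSR neighbor loop.
std::int64_t out_value_sum(const CsrGraph& g, std::uint32_t v) {
    std::int64_t sum = 0;
    for (std::uint64_t i = g.node_array[v]; i < g.node_array[v + 1]; ++i) {
        sum += g.edge_values[i];
    }
    return sum;
}
```

The outgoing edges of node v occupy positions node_array[v] up to, but not including, node_array[v + 1] in both edge_array and edge_values, so a node with no outgoing edges has two equal consecutive offsets.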

We have provided a C++ program, "parse_bin_files.cpp", to parse these binary files. It can be compiled and run as follows:

    g++ -std=c++11 -o parse_bin_files parse_bin_files.cpp
    ./parse_bin_files [DATASET_DIRECTORY]
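
For readers who want to load one of the arrays from their own code, the following is a minimal sketch of reading a file of fixed-width binary elements into a vector. The helper read_binary_array is hypothetical and is not part of parse_bin_files.cpp. The element type T is an assumption the caller must supply; the widths and byte order actually used by the .bin files should be checked against parse_bin_files.cpp.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file as a flat array of T in the
// machine's native byte order. T must match the width used when the file
// was written, which this sketch cannot know.
template <typename T>
std::vector<T> read_binary_array(const std::string& path) {
    // Open at the end so tellg() reports the file size in bytes.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    const std::streamoff bytes = in.tellg();
    if (bytes < 0 || static_cast<std::size_t>(bytes) % sizeof(T) != 0) {
        throw std::runtime_error("file size is not a multiple of the element size: " + path);
    }
    in.seekg(0, std::ios::beg);

    std::vector<T> data(static_cast<std::size_t>(bytes) / sizeof(T));
    if (!data.empty() &&
        !in.read(reinterpret_cast<char*>(data.data()), static_cast<std::streamsize>(bytes))) {
        throw std::runtime_error("short read on " + path);
    }
    return data;
}
```

Under the assumption that an array is stored as 32-bit unsigned integers, a call would look like read_binary_array<unsigned int>("[DATASET_DIRECTORY]/edge_array.bin"), with [DATASET_DIRECTORY] replaced by the dataset's path.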

Permissions and Licenses

The real-world datasets were obtained from KONECT (the Koblenz Network Collection, http://konect.uni-koblenz.de/) and SNAP (the Stanford Network Analysis Project, http://snap.stanford.edu/index.html). KONECT networks are licensed under the Creative Commons Attribution-ShareAlike 2.0 Germany License.