DECADES Datasets
----------------

This folder contains three groups of datasets for graph/sparse applications:

	1. small - datasets that contain fewer than 1 million nodes
		* Kronecker_15.el (synthetic)
		* Amazon.tsv
		* Sinkhorn.tsv
		* YouTube.tsv
	2. big - datasets that contain at least 1 million nodes
		* Kronecker_21/ (synthetic)
		* Kronecker_25/ (synthetic)
		* LiveJournal/
		* Orkut/
		* Pokec/
		* Wiki/
		* Sd1_Arc/
		* Twitter/
		* Wikipedia/
	3. bipartite - datasets that are bipartite networks (partitions contain fewer than 1 million nodes)
		* Amazon.tsv
		* Dbpedia.tsv
		* Power.tsv (synthetic)
		* YouTube.tsv

The "small" and "bipartite" folders contain a single edgelist file per dataset.
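As a rough illustration of consuming one of these edgelist files, the sketch below reads edges from a text stream. It assumes each line holds whitespace-separated "source destination [weight]" fields and that lines starting with '%' or '#' are comments; neither is confirmed by this README, so check the actual files first. The names Edge and read_edge_list are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One directed edge. The weight defaults to 1 when a line has no third column.
struct Edge {
    std::uint64_t src;
    std::uint64_t dst;
    double weight;
};

// Reads "src dst [weight]" lines from `in` (assumed layout, see above).
// Blank lines and lines starting with '%' or '#' are skipped.
std::vector<Edge> read_edge_list(std::istream& in) {
    std::vector<Edge> edges;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;                  // blank line
        if (line[first] == '%' || line[first] == '#') continue;    // assumed comment markers
        std::istringstream fields(line);
        Edge e{0, 0, 1.0};
        if (!(fields >> e.src >> e.dst)) continue;                 // skip malformed lines
        double w;
        if (fields >> w) e.weight = w;                             // optional third column
        edges.push_back(e);
    }
    return edges;
}
```

To read a file, open it with std::ifstream and pass the stream to read_edge_list.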
The "big" folder contains four files for each dataset (one text file and three binary files), which together store the graph in Compressed Sparse Row (CSR) format:

	1. num_nodes_edges.txt - contains information about the size (nodes, edges) of the dataset
	2. node_array.bin - binary representation of the node pointers in the dataset
	3. edge_array.bin - binary representation of the edge pointers in the dataset
	4. edge_values.bin - binary representation of the edge values in the dataset
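
To show how the three arrays fit together in CSR, here is a minimal in-memory sketch. It assumes the usual CSR convention: node_array holds num_nodes + 1 offsets into edge_array, and edge_array holds the destination node of each edge. The element types, the CsrGraph struct and the out_edges function are illustrative assumptions; the actual on-disk layout is whatever parse_bin_files.cpp expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// In-memory CSR graph. Field names mirror the file names above; the element
// types are assumptions and may differ from what the .bin files store.
struct CsrGraph {
    std::vector<std::uint64_t> node_array;  // num_nodes + 1 offsets into edge_array
    std::vector<std::uint32_t> edge_array;  // destination node of each edge
    std::vector<float> edge_values;         // value (weight) of each edge
};

// Returns the (destination, value) pairs of all edges leaving `node`.
// The edges of `node` occupy the half-open index range
// [node_array[node], node_array[node + 1]).
std::vector<std::pair<std::uint32_t, float>> out_edges(const CsrGraph& g, std::size_t node) {
    std::vector<std::pair<std::uint32_t, float>> result;
    for (std::uint64_t i = g.node_array[node]; i < g.node_array[node + 1]; ++i) {
        result.emplace_back(g.edge_array[i], g.edge_values[i]);
    }
    return result;
}
```

For example, a graph with edges 0->1, 0->2 and 2->0 would have node_array = {0, 2, 2, 3} and edge_array = {1, 2, 0}: node 1 has no outgoing edges, so its two offsets are equal.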

We have provided a C++ program, "parse_bin_files.cpp", to parse these files. It can be compiled and run as follows:

	g++ -std=c++11 -o parse_bin_files parse_bin_files.cpp
	./parse_bin_files [DATASET_DIRECTORY]

Permissions and Licenses
------------------------

We have obtained real-world datasets from KONECT (the Koblenz Network Collection): http://konect.uni-koblenz.de/.
KONECT networks are licensed under the Creative Commons Attribution-ShareAlike 2.0 Germany License.