March 15, 2020


benedekrozemberczki/datasets

A repository of pretty cool datasets that I collected for network science and machine learning research.

repo name benedekrozemberczki/datasets
repo link https://github.com/benedekrozemberczki/datasets
size (curr.) 98949 kB
stars (curr.) 131
created 2019-04-07
license MIT License

Datasets

Datasets collected for network science and machine learning research.

Contents
  1. GitHub StarGazer Graphs
  2. Twitch Ego Nets
  3. Reddit Thread Graphs
  4. Deezer Ego Nets
  5. GitHub Social Network
  6. Deezer Social Networks
  7. Facebook Page-Page Networks
  8. Wikipedia Article Networks
  9. Twitch Social Networks
  10. Facebook Large Page-Page Network

GitHub StarGazer Graphs

Description

Properties

  • Number of graphs: 12,725
  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Graph labels: Yes. Binary-labeled.
  • Temporal: No.

            Min      Max
Nodes       10       957
Density     0.003    0.561
Diameter    2        18
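
The Min/Max rows above summarize per-graph statistics over the whole collection. As a rough illustration of how such a summary could be computed, here is a minimal pure-Python sketch. It assumes (this is not confirmed by the text above) that the collection is available as a mapping from a graph id to a list of undirected `[u, v]` edge pairs, and that every graph is connected with at least two nodes; the names `build_adjacency`, `diameter` and `summarize` are hypothetical helpers, not part of the repository.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, skipping self-loops."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def eccentricity(adjacency, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    if len(distance) != len(adjacency):
        raise ValueError("graph is not connected")
    return max(distance.values())


def diameter(adjacency):
    """Diameter of a connected graph: the largest eccentricity over all nodes."""
    return max(eccentricity(adjacency, node) for node in adjacency)


def summarize(graphs):
    """Return (min, max) of node count, density and diameter over all graphs.

    Assumed input layout: {graph_id: [[u, v], ...]} with connected graphs.
    """
    nodes, densities, diameters = [], [], []
    for edges in graphs.values():
        adjacency = build_adjacency(edges)
        n = len(adjacency)
        m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
        nodes.append(n)
        densities.append(2 * m / (n * (n - 1)))  # undirected density
        diameters.append(diameter(adjacency))
    return {
        "nodes": (min(nodes), max(nodes)),
        "density": (min(densities), max(densities)),
        "diameter": (min(diameters), max(diameters)),
    }
```

Calling `summarize({"0": [[0, 1], [1, 2], [2, 3]], "1": [[0, 1], [1, 2], [0, 2]]})` on a four-node path and a triangle gives node counts from 3 to 4, densities from 0.5 to 1.0 and diameters from 1 to 3. The all-pairs breadth-first search is quadratic in graph size, which is acceptable for graphs with at most a few hundred nodes like the ones listed here.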

Possible Tasks

  • Graph classification

Citing

@misc{karateclub2020,
       title={An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs},
       author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year={2020},
       eprint={2003.04819},
       archivePrefix={arXiv},
       primaryClass={cs.LG}
}

Twitch Ego Nets

Description

Properties

  • Number of graphs: 127,094
  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Graph labels: Yes. Binary-labeled.
  • Temporal: No.

            Min      Max
Nodes       14       52
Density     0.038    0.967
Diameter    1        2

Possible Tasks

  • Graph classification

Citing

@misc{karateclub2020,
       title={An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs},
       author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year={2020},
       eprint={2003.04819},
       archivePrefix={arXiv},
       primaryClass={cs.LG}
}

Reddit Thread Graphs

Description

Properties

  • Number of graphs: 203,088
  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Graph labels: Yes. Binary-labeled.
  • Temporal: No.

            Min      Max
Nodes       11       97
Density     0.021    0.382
Diameter    2        27

Possible Tasks

  • Graph classification

Citing

@misc{karateclub2020,
       title={An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs},
       author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year={2020},
       eprint={2003.04819},
       archivePrefix={arXiv},
       primaryClass={cs.LG}
}

Deezer Ego Nets

Description

Properties

  • Number of graphs: 9,629
  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Graph labels: Yes. Binary-labeled.
  • Temporal: No.

            Min      Max
Nodes       11       363
Density     0.015    0.909
Diameter    2        2

Possible Tasks

  • Graph classification

Citing

@misc{karateclub2020,
       title={An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs},
       author={Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year={2020},
       eprint={2003.04819},
       archivePrefix={arXiv},
       primaryClass={cs.LG}
}

GitHub Social Network

Description

Properties

  • Directed: No.
  • Node features: Yes.
  • Edge features: No.
  • Node labels: Yes. Binary-labeled.
  • Temporal: No.

                GitHub
Nodes           37,700
Edges           289,003
Density         0.001
Transitivity    0.013
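
Statistics of this kind (node count, edge count, density, transitivity) can be derived from an undirected edge list. The following is a minimal pure-Python sketch, not the author's own tooling: the function name `graph_statistics` is hypothetical, the input is assumed to be an iterable of `(u, v)` pairs, and isolated nodes are not counted because they never appear in an edge list. Density is taken as 2m / (n(n - 1)) and transitivity as closed triples divided by all connected triples.

```python
from itertools import combinations


def graph_statistics(edges):
    """Node count, edge count, density and transitivity of an undirected simple graph."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    n = len(neighbours)
    # Each edge appears in two neighbour sets, so halve the degree sum.
    m = sum(len(adjacent) for adjacent in neighbours.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    triples = 0  # paths of length two, counted once per centre node
    closed = 0   # triples whose end points are also connected
    for adjacent in neighbours.values():
        degree = len(adjacent)
        triples += degree * (degree - 1) // 2
        closed += sum(1 for a, b in combinations(adjacent, 2) if b in neighbours[a])
    transitivity = closed / triples if triples else 0.0

    return {"nodes": n, "edges": m, "density": density, "transitivity": transitivity}
```

For a triangle with one pendant node, `graph_statistics([(0, 1), (1, 2), (0, 2), (2, 3)])` reports 4 nodes, 4 edges and a transitivity of 0.6. The figures in the table above are rounded, so a recomputed value may differ in the last printed digit. The pairwise neighbour check is slow for very high-degree nodes; a graph library would be a more practical choice at the scale of this dataset.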

Possible Tasks

  • Binary node classification
  • Link prediction
  • Community detection
  • Network visualization

Citing

@misc{rozemberczki2019multiscale,
       title = {Multi-scale Attributed Node Embedding},   
       author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},   
       year = {2019},   
       eprint = {1909.13021},  
       archivePrefix = {arXiv},  
       primaryClass = {cs.LG}   
       }

Deezer Social Networks

Description

Properties

  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Node labels: Yes. Multi-labeled.
  • Temporal: No.

                RO         HR         HU
Nodes           41,773     54,573     47,538
Edges           125,826    498,202    222,887
Density         0.0001     0.0004     0.0002
Transitivity    0.0752     0.1146     0.0929

Possible Tasks

  • Node classification
  • Link prediction
  • Community detection
  • Network visualization

Citing

If you find these datasets useful in your research, please cite the following paper:

@inproceedings{rozemberczki2019gemsec,
                title={GEMSEC: Graph Embedding with Self Clustering},    
                author={Rozemberczki, Benedek and Davies, Ryan and Sarkar, Rik and Sutton, Charles},    
                booktitle={Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019},    
                pages={65-72},    
                year={2019},    
                organization={ACM}    
                }

Facebook Page-Page Networks

Description

Properties

  • Directed: No.
  • Node features: No.
  • Edge features: No.
  • Node labels: No.
  • Temporal: No.

                  Nodes     Edges      Density    Transitivity
Politicians       5,908     41,729     0.0024     0.3011
Companies         14,113    52,310     0.0005     0.1532
Athletes          13,866    86,858     0.0009     0.1292
News Sites        27,917    206,259    0.0005     0.1140
Public Figures    11,565    67,114     0.0010     0.1666
Artists           50,515    819,306    0.0006     0.1140
Government        7,057     89,455     0.0036     0.2238
TV Shows          3,892     17,262     0.0023     0.5906

Possible Tasks

  • Link prediction
  • Community detection
  • Network visualization

Citing

If you find these datasets useful in your research, please cite the following paper:

@inproceedings{rozemberczki2019gemsec,
                title={GEMSEC: Graph Embedding with Self Clustering},    
                author={Rozemberczki, Benedek and Davies, Ryan and Sarkar, Rik and Sutton, Charles},    
                booktitle={Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019},    
                pages={65-72},    
                year={2019},    
                organization={ACM}    
                }

Wikipedia Article Networks

Description

Properties

  • Directed: No.
  • Node features: Yes.
  • Edge features: No.
  • Node labels: Yes. Continuous target.
  • Temporal: No.

                Chameleon    Crocodile    Squirrel
Nodes           2,277        11,631       5,201
Edges           31,421       170,918      198,493
Density         0.012        0.003        0.015
Transitivity    0.314        0.026        0.348

Possible Tasks

  • Regression
  • Link prediction
  • Community detection
  • Network visualization

Citing

If you find these datasets useful in your research, please cite the following paper:

@misc{rozemberczki2019multiscale,
       title = {Multi-scale Attributed Node Embedding},   
       author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},   
       year = {2019},   
       eprint = {1909.13021},  
       archivePrefix = {arXiv},  
       primaryClass = {cs.LG}   
       }

Twitch Social Networks

Description

Properties

  • Directed: No.
  • Node features: Yes.
  • Edge features: No.
  • Node labels: Yes. Binary-labeled.
  • Temporal: No.

                DE         EN        ES        FR         PT        RU        TW
Nodes           9,498      7,126     4,648     6,549      1,912     4,385     2,772
Edges           153,138    35,324    59,382    112,666    31,299    37,304    63,462
Density         0.003      0.002     0.006     0.005      0.017     0.004     0.017
Transitivity    0.047      0.042     0.084     0.054      0.131     0.049     0.120

Possible Tasks

  • Binary node classification
  • Link prediction
  • Community detection
  • Network visualization

Citing

@misc{rozemberczki2019multiscale,
       title = {Multi-scale Attributed Node Embedding},   
       author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},   
       year = {2019},   
       eprint = {1909.13021},  
       archivePrefix = {arXiv},  
       primaryClass = {cs.LG}   
       }

Facebook Large Page-Page Network

Description

Properties

  • Directed: No.
  • Node features: Yes.
  • Edge features: No.
  • Node labels: Yes. Multi-class.
  • Temporal: No.

                Facebook
Nodes           22,470
Edges           171,002
Density         0.001
Transitivity    0.232

Possible Tasks

  • Multi-class node classification
  • Link prediction
  • Community detection
  • Network visualization

Citing

@misc{rozemberczki2019multiscale,
       title = {Multi-scale Attributed Node Embedding},   
       author = {Benedek Rozemberczki and Carl Allen and Rik Sarkar},   
       year = {2019},   
       eprint = {1909.13021},  
       archivePrefix = {arXiv},  
       primaryClass = {cs.LG}   
       }