Network Report: A Structured Description for Network Datasets
Abstract
The rapid development of network science and technologies depends on shareable datasets. Currently, there is no standard practice for reporting and sharing network datasets. Some network dataset providers only share links, while others provide some context or basic statistics. As a result, critical information may be unintentionally dropped, and network dataset consumers may misunderstand or overlook critical aspects. Using a network dataset inappropriately can lead to severe consequences (e.g., discrimination), especially when machine learning models on networks are deployed in high-stakes domains. Challenges arise because networks are often used across different domains (e.g., network science, physics, etc.) and have complex structures. To facilitate communication between network dataset providers and consumers, we propose the network report. A network report is a structured description that summarizes and contextualizes a network dataset. It extends the idea of dataset reports from prior work (e.g., Datasheets for Datasets) with network-specific descriptions of the non-i.i.d. nature, demographic information, network characteristics, etc. We hope network reports encourage transparency and accountability in network research and development across different fields.
Supplemental Materials
Case Study
1. High School Contact Network
We use the high school contact network as an example of a social network. The network was created in the interdisciplinary SocioPatterns [1] project to study human contact behavior. The network report is shown below. The network follows the general characteristics of social networks: large triangle counts and clustering coefficient, a power-law degree distribution, etc. Thus, methods for analyzing social networks can be applied to this network. Because the data is collected by wearable sensors that exchange ultra-low-power radio packets, the sensors may detect false-positive contacts when students are physically close to each other (e.g., deskmates). As a result, the network may be biased towards in-class contacts, and influence models and vaccination strategies developed for this dataset may not generalize to other datasets. Furthermore, there are only 327 nodes in the network, but the average degree is > 1000, which implies that repeated contacts between the same pair of students are counted as separate edges. Network dataset consumers may consider data structures other than sparse matrices to speed up certain computations (e.g., eigenvalue decomposition). Visualization researchers may be interested in developing specific techniques for such 'edge clutter' networks, as opposed to 'node clutter' networks such as Figure 1 (in the manuscript).
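The statistics mentioned above (average degree, triangle count, clustering coefficient) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the actual network report: the function name `summarize_contacts`, the toy contact list, and the choice to count repeated contacts as separate edges are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def summarize_contacts(contacts):
    """Summarize an undirected contact list given as (u, v) pairs.

    Assumption: a repeated pair is a repeated contact, so it adds to the
    multigraph degree but not to the simple-graph structure.
    """
    multi_degree = Counter()  # counts every contact event per node
    neighbors = {}            # simple graph: node -> set of distinct neighbors
    events = 0
    for u, v in contacts:
        if u == v:
            continue  # ignore self-contacts
        events += 1
        multi_degree[u] += 1
        multi_degree[v] += 1
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n = len(neighbors)
    if n == 0:
        return {"nodes": 0, "contact_events": 0}

    triangle_corners = 0
    local_cc = []
    for nbrs in neighbors.values():
        # Count pairs of neighbors that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        triangle_corners += links
        k = len(nbrs)
        local_cc.append(2 * links / (k * (k - 1)) if k > 1 else 0.0)

    return {
        "nodes": n,
        "contact_events": events,
        "avg_degree_multigraph": sum(multi_degree.values()) / n,
        "avg_degree_simple": sum(len(s) for s in neighbors.values()) / n,
        "triangles": triangle_corners // 3,  # each triangle has three corners
        "avg_clustering": sum(local_cc) / n,
    }


# Hypothetical toy data: "a" and "b" are in contact twice.
toy_contacts = [("a", "b"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
report = summarize_contacts(toy_contacts)
print(report)
```

Reporting both the multigraph and the simple-graph average degree makes it explicit whether a value such as "> 1000" counts repeated contacts, which matters when choosing between sparse and dense representations.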
2. MOOC Action Network
User Study
References
[1] http://www.sociopatterns.org/
[2] Kadam, P., Palve, J., Kusale, K., & Sankhe, N. (2016). KDD CUP 2015- Predicting Dropouts in MOOC’S. Imperial Journal of Interdisciplinary Research, 2.
[3] Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: online learning of social representations. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.