This paper has two main goals. First, we provide an overview of datasets available to researchers and where to find them. Second, we stress the importance of sharing datasets so that researchers can replicate results and improve the state of the art. To address the first goal, we analyzed 715 peer-reviewed research articles from 2010 to 2015 with a focus on and relevance to digital forensics, examining three major aspects: (1) the origin of the dataset (e.g., real world vs. synthetic), (2) whether datasets were released by researchers and (3) the types of datasets that exist.
Additionally, we broadened our results with the outcome of online searches. We also discuss what we think is missing. Overall, our results show that the majority of datasets are experiment generated (56.4%), followed by real-world data (36.7%). Furthermore, 54.4% of the articles use existing datasets while the rest created their own; of the latter, only 3.8% actually released their datasets. Finally, we conclude that many datasets are available, but finding them can be challenging.
All of our data analysis was performed by manual inspection. Human error may thus have been introduced, but we attempted to mitigate it by conducting multiple passes. Due to time constraints, our dataset of research articles includes only papers from 2010 to 2015 from selected venues and does not cover every paper published in the cyber forensics domain. We nevertheless believe that our research paper dataset is representative in both breadth and depth, and we argue that our findings paint an accurate picture of the state of the domain with regard to datasets.
Our study was inspired by Abt and Baier (2014), who published an article entitled "Availability of ground-truth in network security research." In it, the authors analyzed 106 network security papers over four years (2009-2013) and reported three main findings: (1) many researchers manually produced their datasets, (2) datasets are often not released after the work is completed and (3) there is a lack of standardized, labeled datasets that can be used in research.
While this work was influenced by Abt and Baier (2014), the difference between the two studies is that we do not focus exclusively on network traffic but on all kinds of datasets that may be useful for cybersecurity/forensics research, e.g., malware, disk images or memory dumps. Moreover, our study covers a broader set of articles, incorporates results from Google searches and provides an overview of existing datasets. To analyze the availability of datasets which we define in Sec.
AVAILABILITY OF DATASETS
The second part of our study analyzed the availability and re-use of datasets. A summary of our findings is depicted in Table 2 and will be discussed in the following subsections.
Research that requires datasets currently faces several challenges, as data is barely shared within the community. Our results show that less than 4% of the articles shared their datasets, while over half (54.4%) made use of existing datasets. In other words, whenever a repository or a sophisticated dataset is available, researchers appreciate and utilize it. Besides the lack of dataset sharing, maintenance and availability are major issues.
CONCLUSION & FUTURE WORK
For this article we analyzed 715 research articles and performed Google searches to summarize the availability of datasets for the community. While this study provides a comprehensive list of available datasets and repositories that can be leveraged by researchers, we also show that there is a lack of data sharing, which we believe is key to improving the quality and pace of research, especially in domains like digital forensics.
In the What Is Missing? section we highlight six points that we believe are needed to address these challenges: variety of datasets, updates & upgrades of repositories/datasets, a centralized repository, more research in de-identification, strategies to share complex data such as ‘cloud services’ and publisher support. On the other hand, we see first steps toward solutions, e.g., by DHS and its Impact Cyber Trust project.
Source: University of New Haven
Authors: Cinthya Grajeda | Frank Breitinger | Ibrahim Baggili