The increasing scale and impact of cyber events remain an enduring concern, yet information covering the range of threat actors, motives, industries, and classified impacts is scarce, fragmented, or available only from private organizations at significant cost.
As the private and public sectors grapple with the multi-faceted problem of cybersecurity, they lack the basic tools needed to make strategic decisions about prevention and response. Software solutions, organizational resilience, employee education, and improved system controls are among the many available options to enhance cybersecurity. Yet it is difficult to decide how to invest scarce resources without understanding which types of cyber threats are most common in a specific industry or critical infrastructure sector and what their effects might be.
A number of smaller niche repositories, news sites, and blogs catalog cyber events, yet their data is often not structured or coded consistently enough to support larger analytic insights. To address this gap, CISSM has launched the Cyber Events Database project to collect publicly available information on cyber events from 2014 through the present. The dataset contains structured information across several categories and is now available to researchers and industry partners. The CISSM Cyber Events Database pairs automated techniques with manual review and classification by researchers to acquire and structure data from a variety of open news sites, blogs, and other specialty sites that identify and discuss publicly attributed attacks. The data is updated monthly and yields information about the threat actor, motive, victim, industry, and end effects of each attack. CISSM has made descriptive information freely available to the public.
Researchers or public officials interested in the detailed records or access to the dataset in its entirety should contact Dr. Charles Harry at email@example.com
Researchers who plan on using the data for publication should cite the following: Harry, C., & Gallagher, N. (2018). Classifying Cyber Events. Journal of Information Warfare, 17(3), 17-31.
Data collection and data coding information:
Manually acquiring data from known web sources can be tedious, labor intensive, and difficult to apply consistently over long periods of time, yet purely automated classification methods can suffer from significant bias or misclassification. CISSM researchers have employed a mixed-methods approach that uses a Python application to “scrape” data from relevant cyber sources. The research team then reviews and codes the scraped information to (1) ensure that the events identified meet the definition of a cyber event, (2) apply a consistent approach to the categorization of threat actor and motive, and (3) accurately classify the industry and the specific effects the event achieved. We utilize a structured taxonomy developed at CISSM and published in the Journal of Information Warfare.
CISSM developed a script in the Python programming language to query a list of known websites, each of which links to individual entries, articles, and/or subpages that are candidates for inclusion in the CISSM cyber events database.
Figure 1 shows the high-level process: researchers execute the script, the script accesses each site via a predetermined URL for the website’s main/landing page, and the script finally writes data to both a daily run file and the master data file. The data for each website is returned to the script in HTML format for processing. Because website formats differ greatly among the sources, the script has a specific algorithm for each site so that the same relevant information is extracted consistently. The script parses out the date published, title, URL, and a short preview of the entry/article text. The script combines this information with the local date/time at which the page was accessed, as well as the title of the overarching website. All of this information is written as a single row in each of two comma-separated values (.csv) files that the script saves on the user’s machine.
The Python script makes use of the csv, datetime, and urllib modules from the Python standard library, as well as Beautiful Soup 4, an external library that handles the bulk of the data extraction. At this stage in development, the script is fully functional in an interpreter or development environment that runs Python 3.8.
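The per-site extraction described above can be illustrated with a minimal sketch. It is not the CISSM script: the page layout, the class name ExampleBlogParser, the function scrape_page, and the column names are all hypothetical, and the sketch swaps in the standard library's html.parser for Beautiful Soup 4 only so that it runs without extra dependencies. Fetching the page (for example with urllib.request.urlopen) is omitted, and a hard-coded HTML sample stands in for the response.

```python
from datetime import datetime
from html.parser import HTMLParser


class ExampleBlogParser(HTMLParser):
    """Extractor for one hypothetical site whose landing page lists entries as
    <article><time>DATE</time><a href="URL">TITLE</a><p>PREVIEW</p></article>."""

    TAG_TO_FIELD = {"time": "date_published", "a": "title", "p": "preview"}

    def __init__(self):
        super().__init__()
        self.entries = []
        self._entry = None  # entry currently being built, if any
        self._field = None  # field that incoming text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._entry = {"date_published": "", "title": "", "url": "", "preview": ""}
        elif self._entry is not None and tag in self.TAG_TO_FIELD:
            self._field = self.TAG_TO_FIELD[tag]
            if tag == "a":
                self._entry["url"] = dict(attrs).get("href") or ""

    def handle_data(self, data):
        if self._entry is not None and self._field is not None:
            self._entry[self._field] += data

    def handle_endtag(self, tag):
        if tag == "article" and self._entry is not None:
            self.entries.append({k: v.strip() for k, v in self._entry.items()})
            self._entry = None
            self._field = None
        elif tag in self.TAG_TO_FIELD:
            self._field = None


# One extractor per site, so every source yields the same columns (assumed design).
SITE_PARSERS = {"Example Security Blog": ExampleBlogParser}


def scrape_page(html, site_title, accessed=None):
    """Parse one landing page and attach the access time and the site title."""
    parser = SITE_PARSERS[site_title]()
    parser.feed(html)
    parser.close()
    stamp = (accessed or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return [dict(entry, accessed=stamp, site_title=site_title) for entry in parser.entries]


sample_html = """
<article><time>2021-03-04</time><a href="https://example.com/a1">Breach at Example Corp</a>
<p>Attackers accessed customer records.</p></article>
<article><time>2021-03-05</time><a href="https://example.com/a2">Ransomware hits clinic</a>
<p>Systems were taken offline.</p></article>
"""
rows = scrape_page(sample_html, "Example Security Blog")
for row in rows:
    print(row["date_published"], row["title"], row["url"])
```

Keeping one extractor per site behind a shared row format means a change in one website's layout only requires changing that site's extractor.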
To facilitate accurate recordkeeping, prevent duplication, and ensure recently added entries are recognized, the script utilizes two different comma-separated values (.csv) files. The Daily Table .csv lists the information scraped from the main/landing page of each known website on a single day. These files are named automatically in the format month-day-scrape.csv. If a website updates over the course of the day and the script is run again on that same day, the existing file for that date is overwritten. A run on a new date generates a separate file.
There may be some overlap in content between files from consecutive days, as entries/articles often stay on the front page of a website for more than 24 hours. Old files created by the script are not destroyed. This allows the user to keep a comprehensive log of scrapes from day to day, with the deduplicated Master Table serving as a reference that helps the researcher avoid duplication.
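The Daily Table behavior can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the helper names daily_filename, write_daily, and make_row are hypothetical, the column names are invented, and the zero-padded rendering of month-day-scrape.csv is a guess, since the source gives only the general pattern.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical column names for the scraped fields.
FIELDS = ["date_published", "title", "url", "preview", "accessed", "site_title"]


def daily_filename(day):
    # Assumed rendering of "month-day-scrape.csv"; the zero padding is a guess.
    return f"{day.month:02d}-{day.day:02d}-scrape.csv"


def write_daily(rows, day, folder):
    path = Path(folder) / daily_filename(day)
    # Mode "w" truncates an existing file, so a rerun on the same day overwrites it.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


def make_row(n):
    """Build a made-up scraped entry for the demonstration."""
    return {
        "date_published": "2021-03-04",
        "title": f"Entry {n}",
        "url": f"https://example.com/{n}",
        "preview": "Short preview text",
        "accessed": "2021-03-04 09:00:00",
        "site_title": "Example Security Blog",
    }


folder = tempfile.mkdtemp()
first = write_daily([make_row(1)], date(2021, 3, 4), folder)
second = write_daily([make_row(1), make_row(2)], date(2021, 3, 4), folder)  # same day: overwritten
third = write_daily([make_row(3)], date(2021, 3, 5), folder)  # new day: separate file
files = sorted(p.name for p in Path(folder).iterdir())
print(files)
```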
The Master Table .csv differs from the Daily Table .csv in that it accumulates entries across days without duplication. The first time the script is run, if the user does not already have a file named “master-scrape.csv,” the script creates one that is nearly identical to the Daily Table for that day. Over time, more and more non-duplicated entries are added to that same file as the script is run on different days. The file comprises all entries ever collected by the script, forming a cumulative list that updates every time the script is run. Note that previous versions of the Master Table file are not kept, though the data they contained remains available in the updated file.
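A minimal sketch of the cumulative, deduplicated update is shown below. The source does not say which field the script uses to detect duplicates; this sketch assumes the entry URL is the key. The function names update_master and make_row and the column names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names for the scraped fields.
FIELDS = ["date_published", "title", "url", "preview", "accessed", "site_title"]


def update_master(rows, master_path):
    """Append rows whose URL is not yet in the master file; return how many were added."""
    path = Path(master_path)
    seen = set()
    if path.exists():
        with open(path, newline="", encoding="utf-8") as fh:
            seen = {row["url"] for row in csv.DictReader(fh)}

    new_rows = []
    for row in rows:
        if row["url"] not in seen:  # assumed duplicate key: the entry URL
            seen.add(row["url"])
            new_rows.append(row)

    # Write the header only when the file is being created (or is empty).
    needs_header = not path.exists() or path.stat().st_size == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)


def make_row(n):
    """Build a made-up scraped entry for the demonstration."""
    return {
        "date_published": "2021-03-04",
        "title": f"Entry {n}",
        "url": f"https://example.com/{n}",
        "preview": "Short preview text",
        "accessed": "2021-03-04 09:00:00",
        "site_title": "Example Security Blog",
    }


master = Path(tempfile.mkdtemp()) / "master-scrape.csv"
added_day1 = update_master([make_row(1), make_row(2)], master)
added_day2 = update_master([make_row(2), make_row(3)], master)  # entry 2 is still on the front page
with open(master, newline="", encoding="utf-8") as fh:
    master_urls = [row["url"] for row in csv.DictReader(fh)]
print(added_day1, added_day2, len(master_urls))
```

Appending in place matches the description that earlier versions of the Master Table are not kept while their rows persist in the current file.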
The script makes no effort to determine suitability of the candidate cyber event. All linked entries/articles are included in a daily deduplicated file to be reviewed by a researcher, who makes final judgements as to whether events are valid members of the dataset. We define a cyber event as the end result of any single unauthorized effort, or the culmination of many such technical actions, that engineers, through use of computer technology and networks, a desired primary effect on a target. The dataset chiefly records individual cyber events where a discernible effect was achieved by the threat actor (e.g. hacker). To be included in the dataset, each event must be traced back to an underlying source describing details surrounding the event itself. Each record has 11 fields providing more context to the record. These fields include:
Source - The URL link to the published article or source describing the event.
Date - The date that the cyber event occurred. If no date is available, then the date the article was published is used as an approximate date. Dates are listed in DD-MM-YYYY form.
Type - Describes the intended end effect using the CISSM taxonomy of cyber effects. Effects are broadly defined as disruptive events, when the goal of the threat actor is to impact the availability or integrity of the victim's systems, and exploitive events, when the goal is the collection of information from the targeted organization. If a cyber event has both disruptive and exploitive impacts, it may be classified into both groups.
Sub-Type - Describes the specific end effect sub-group as defined in the CISSM taxonomy of cyber effects (Harry & Gallagher, 2018). If the event has multiple effects, it may be classified into multiple sub-groups.
Victim - Details the target of the threat actor. Usually indicates a specific organization, entity, or person who was primarily affected by the cyber event. Only cyber events with a specific identified victim are included. Larger cyber campaigns against whole populations are not included in the dataset.
Actor - Identifies a person, entity, or organization that takes credit for a cyber event, is named as responsible for the event by an official source, or is otherwise confirmed by private sector security vendors to have carried out the event. Given the difficulty in attributing cyber events, many events are marked as “Unknown.”
Actor Type - Identifies the broad classification of the cyber threat actor. Five classifications are used: Nation State (an individual or group that is part of or used by a sovereign nation), Criminal (an organized individual or group with the intent of committing a crime for clear personal gain), Hobbyist (a mostly unorganized individual or group working for exploration or amusement), Hacktivist (an individual or group that is working towards a political or social aim), and Terrorist (organized groups, who may or may not claim statehood, that pursue destruction for political aims). Actors that cannot be classified are coded as Undetermined.
Motive - Identifies a possible motive of the threat actor based on the underlying published source material. Threat actor motives are classified into five categories: Cyber Crime (acts that are considered criminal in the United States), Cyber Espionage (acts with the intent of uncovering sensitive information), Cyber War (acts intended to cause harm to an adversary, including nations and groups), Hacktivism (acts intended to draw attention to a social or political cause), or Demonstrative (acts that generally demonstrate personal capabilities of the actor).
Location - Details the country or countries in which the target resides based on the published source. Events that affect large populations spanning many countries may be considered widespread campaigns and removed from the data set.
Industry - Utilizes the 2-digit North American Industry Classification System (NAICS) code associated with the victim. Data is coded based on the details from the published source. For example, a telecommunications company is coded with NAICS code 51. Attacks on specific individuals, other than celebrities or other famous persons, are not included in the dataset. Organizations that cannot be categorized under a specific NAICS code are not included.
Description - A brief written summary of the cyber event, providing the reader with a general sense of the target, impact, and other details distilled from the published source.
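The 11 fields above can be sketched as a simple record type with a few consistency checks. This is an illustration only and not the schema of the CISSM database: the class and attribute names are hypothetical, the exact labels stored for Type are assumed to be "Disruptive" and "Exploitive", sub-type values are left as free text because the taxonomy's sub-groups are not listed here, and the example record is fictional.

```python
from dataclasses import dataclass, fields, replace
from datetime import datetime

# Category labels taken from the field descriptions; exact stored spellings are assumed.
EVENT_TYPES = {"Disruptive", "Exploitive"}
ACTOR_TYPES = {"Nation State", "Criminal", "Hobbyist", "Hacktivist", "Terrorist", "Undetermined"}
MOTIVES = {"Cyber Crime", "Cyber Espionage", "Cyber War", "Hacktivism", "Demonstrative"}


@dataclass
class CyberEvent:
    source: str        # URL of the article describing the event
    date: str          # DD-MM-YYYY
    event_type: list   # one or both of EVENT_TYPES
    sub_type: list     # taxonomy sub-groups, kept as free text in this sketch
    victim: str
    actor: str         # "Unknown" when the event is not attributed
    actor_type: str
    motive: str
    location: list     # country or countries of the target
    industry: str      # 2-digit NAICS code, e.g. "51"
    description: str


def validate(event):
    """Return a list of problems found in a coded record (empty when none are found)."""
    problems = []
    try:
        datetime.strptime(event.date, "%d-%m-%Y")
    except ValueError:
        problems.append("date is not in DD-MM-YYYY form")
    if not event.event_type or not set(event.event_type) <= EVENT_TYPES:
        problems.append("type must be Disruptive, Exploitive, or both")
    if event.actor_type not in ACTOR_TYPES:
        problems.append("unrecognized actor type")
    if event.motive not in MOTIVES:
        problems.append("unrecognized motive")
    if not (len(event.industry) == 2 and event.industry.isdigit()):
        problems.append("industry must be a 2-digit NAICS code")
    if not event.victim.strip():
        problems.append("a specific victim is required")
    return problems


# Fictional record for illustration only.
good = CyberEvent(
    source="https://example.com/a1",
    date="04-03-2021",
    event_type=["Exploitive"],
    sub_type=["Hypothetical sub-type"],
    victim="Example Telecom",
    actor="Unknown",
    actor_type="Undetermined",
    motive="Cyber Crime",
    location=["United States"],
    industry="51",
    description="Customer records were accessed at a fictional telecommunications company.",
)
bad = replace(good, date="2021-03-04", industry="Telecom")

field_names = [f.name for f in fields(CyberEvent)]
good_problems = validate(good)
bad_problems = validate(bad)
print(len(field_names), good_problems, bad_problems)
```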
Once data is identified, reviewed, cleaned, and classified, it is uploaded to the CISSM database at https://cissm.liquifiedapps.com. Each entry is reviewed by another member of the team and then accepted for addition to the deployed dataset.
 Harry, C., & Gallagher, N. (2018). Classifying Cyber Events. Journal of Information Warfare, 17(3), 17-31.
Special thanks to the dedicated team of researchers who have labored to collect, clean, and categorize this dataset. These researchers include: Renuka Pai, Alex Krylyuk, Chris Lidard, and Nicole Franiok.
For more information about the CISSM Cyber Events Database, please contact Dr. Charles Harry at firstname.lastname@example.org