Summary

The National Transportation Safety Board (NTSB) maintains a database of aviation accidents and incidents that can be accessed by the general public on the NTSB web site. That database contains information about accidents and selected incidents within the United States, its territories and possessions, and in international waters. The database also includes events involving US-registered aircraft that occur outside US territory.

While the database is publicly accessible on the NTSB web site, users are limited in how the information can be used. The records can be displayed on the web site, or the results of a search may be downloaded. However, because the downloads are in the form of a text or XML file, rather than a file type that can be used by common spreadsheet or database programs, users must first transform the data into a form that can be analyzed, but no resources are provided by the NTSB for that purpose.

While many spreadsheet programs, including Excel and LibreOffice, may be able to covert a text or an XML file into common formats such as a text CSV file, the data may still have to be transformed in other ways in order to be analyzed by R or other data analysis programs.

Using the data analysis program R, AirSafe.com has created two programs that can take either the text file or XML file version of the output and turn it into a CSV file and a text file that can be used by widely available spreadsheet and data analysis programs. The text file output would used the NTSB text file format, which uses vertical bar (|), with a space on either side of the vertical bar, as the delimiter between the varible fields on each accident and incident record.

AirSafe.com has made the following resources available to the public, the R programs that will create both a text and CSV file based on either the text file or XML file output from the online NTSB database, and both the text and CSV files representing the download of the entire database of just under 79,000 records (downloaded 4 September 2016).

Background

The National Transportation Safety Board is an independent US government agency that has as one of its missions the investigation of accidents involving air transportation. As part of its mission, it provides a wide range of aviation safety information to the general public, including access to a database of aircraft accidents and incidents. The database records, the vast majority of which contain records from 1948 to the present, can be accessed at http://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx. Each record has a range of data about each incident and accident, including the date and location of the event, and the number and type of injuries.

Users can choose to search for specific event characteristics, and can have the results either displayed in their browser or downloaded into either an XML file or a text. Users also have the option of downloading the entire database.

While the NTSB provides various data summaries elsewhere on the site and in reports released to the public, the organization does not provide either any resources for users to create their own summaries using resources on the site, nor does it provide any instructions on how to transform the data into a form that could be used by a spreadsheet program or other data analysis program.

Objective

AirSafe.com, which since 1996 has provided the public with extensive information and resources related to airline safety and security, sought to create a process that could both transform the NTSB database output into a form that could be analzyed by AirSafe.com, and to provide other researchers with the resources that could allow them to conduct their own analyses. Those resources include the following:

The last two resources are covered in detail in this report, and there are links to the other resources at the end of this document.

Who would find this information useful?

There are several kinds of groups or individuals who may find some or all of the above resources useful. Some of those individuals or groups include the following:

How to get the most out of this report

There are three ways that the information associated with this report will likely be used:

  1. The CSV data and the data dictionary associated with this report may be taken and used to perform analyses with a program other than R. To do this, the reader can simply go to the end of this report to find links to the CSV and text file data representing every event in the NTSB database for all dates on or before 31 August 2016 (download date was XXX 2016).

  2. The R programs developed for this report may be used to transform the NTSB database outputs for further analysis in R. Links to both programs are in the Resources section at the end of this report. If the user will only need to analyze NTSB data from earlier than 1 September 2016, the CSV and text files in the Resources section already have all of that information.

  3. The R programs associated with this report may be used as a template or stepping stone for a revised R program, or to develop similar programs for another data analysis programming environment. Links to a pair of programs, one to transform the XML NTSB data output and one to transform the text file output, are in the resources section. Both programs also contain sample summary analyses and data graphics from the records dated from January 2005 onwards to help illustrate what can be done with the processed data.

All three kinds of users would find the data dictionary useful, as it provides detailed descriptions of the variables used in the NTSB database, as well as the variables that were added by AirSafe.com to make it easier to create useful summary statistics from the NTSB data.

Transformation programs

Both the R program to transform a text NTSB output file and the one to transform an XML NTSB output file went through 10 general transformation steps (Step 0 to Step 10) to create downloadable CSV and text files. Seven of these steps were the same in both programs, and the differences in the other three steps (Step 0, Step 1, and Step 6) will be noted below.

Transformation procedures

The transformation process assumes three things:

  1. That the user has installed the R programming language in their computer,
  2. That the user has the appropriate R program in their working directory in that computer, and
  3. That the user has downloaded into the same working directory either the text file or XML file associated with a search on the NTSB online database.
Step 0: Load the NTSB output file

The text conversion program uploads the file “AviationData.txt” from the same working directory of the R program. The XML conversion program uploads the file “AviationData.xml” from the working directory. Note that when downloading from the NTSB web site, the download will go to your default download directory or to a directory that you designate. Also, if you download something less than the complete database, it may save the file under a name other than “AviationData.xml” or “AviationData.txt” into your directory.

The text file conversion program would removes the field separator character (a vertical bar) and transforms the variable names into ones appropriate for a data frame (replacing any spaces in the variable name with a period). For example, the variable name “Investigation Type” becomes “Investigation.Type”.

Step 1: Eliminate extra columns (text) or specify variable names (XML)

The text file conversion process adds a column that is devoid of data in the data frame, and in the text file program this column is removed.

The XML conversion program does not add any extra columns. Instead, Step 1 in this program ensures that the variable names used are the same as the type used in the text conversion program.

Step 2: Ensure character variables are of type character

Variables identified from the NTSB data dictionary as being of type character are specified as such in the R program. Also in this step, all data from character variables are checked, and any leading or trailing space characters are removed. Space characters include tab, newline, vertical tab, form feed, carriage return, space.

Step 3: Ensure numerical variables are of type numeric

Variables identified from the NTSB data dictionary as being of type numeric are specified as such in the R program. This step also replaces any missing or blank numeric values with the symbol NA, which stands for “Not Available”.

Step 4: Change date variables into a format suitable for R

The NTSB database has two date variables, one for the date of the event and a second for the date of the publication of the accident or incident report. Dates are in the date format dd/mm/YYYY, and are converted to the date format YYYY-mm-dd.

This step also takes the date of the event and adds three more variables: the year of the event, the month the event occurred, and the day of the week of the event’s occurrence. There is no transformation of the date of publication variable.

Step 5: Eliminate any row (record) which has no date

Any event with a missing value for the date of the event are removed and not used in any subsequent analysis.

Step 6: Change blank, “N/A”,“Unknown”, and similar responses to NA

All missing data is replaced with the R symbol NA. Missing data in an event’s record could be blank, which would occur for a numeric variable, or for character variables it could be a null character, one or more spaces, or a variety of character strings, such as “N/A”, “UNK”,“Unknown”,or “Unavailable.”

This step also puts the content for the airport name, air carrier, and aircraft make into title case. In the XML version of this program, airport names that contain an umlat are not capitalized by the function toTitleCase().

Step 7: Specify city name, state abbreviation, and full state name

The NTSB uses a location variable that typically includes the city and two-letter state abbreviation for any US location. The program adds three additional variables for the US city, two-letter state abbreviation, and full state name. It then scans the location variable for any character string that includes at least one comma, checks to see if the last comma is followed by a two-character sequence that matches a state code. If that is the case, the city name, state abbreviation, and full state name are added to those events.

A review of the full database revealed that for US locations, two-character abbreviations associated with standard US Postal Service codes were used for the 50 US states, the District of Columbia, and other US territories. Also, there were several dozen instances where non-standard two-character codes were used for locations in the Gulf of Mexico (GM), Atlantic Ocean (AO), and Pacific Ocean (PO).

There were a total of 62 two-character identifiers used in the program, with 59 representing standard US Postal Service identifiers for states, territories, and the District of Columbia; as well as three non-standard codes for bodies of water.

Step 8: Add variables for the magnitude and extent of injuries

This step creates additional variables (columns) that indicate the magnitude of the greatest injury casued by the event, as well as variables for the total number of injuries (minor plus serious injuries), as well as the total number of people involved in the event (total killed, total injured, and total uninjured).

Step 9: Arrange the new columns in logical groupings

The new variables (columns) associated with the date and location of the event, the number of people involved in the event, as well as the variables for the magnitude and extent of injuries and maximum were were regrouped so that they were adjacent to the appropriate NTSB variable.

Step 10: Save the processed data frame as a CSV file

The last step saves the output into the working directory used by the program in both CSV file named “ntsb_data.csv” and a text file named “ntsb_data.txt”. The difference betweeen the two is that the CSV file has “NA” entered for missing values and the text file has nothing entered. The text file is also created in the same style as the text file output from the NTSB database, with a vertical bar (with a space on either side) separating the variables in each record.

Sample summary statistics

The following are a sample of the kind of summary statistics and graphics that can be generated using the R programming language. They are based on a portion of the NTSB database that was downloaded on 4 September 2016, and includes only those records dated from 1 January 2005. The summary statistics below used a text file as its input.

Quick overview of data from the NTSB database

Note that although the online NTSB database was intended for events from 1962 onwards, there was at least one event from before 1962.

  • Total number of records - 78,879
  • Number of records excluded - 4
  • Number of records processed - 78,875
  • Date of most recent record - 2016-08-30
  • Number of processed records with a date before 2005 - 58,010
  • Number of records with a US location - 74,421
  • Number of processed records inovlving fatalities 15,765

Example summary histograms