April Crompton, Michelle Ballard, Bu Hyoung Lee, Ph.D.
Data Triage on Transcribed Tapes from Nixon Watergate Trials
Abstract:
The Watergate scandal was a major political scandal in the 1970s that led to President
Nixon's resignation. The source of the data for this project is a portion of the
approximately 60 hours of tape subpoenaed by the Watergate Special Prosecution Force
(WSPF). They include fully transcribed conversations in the White House that were
deemed evidentiary to the Watergate trial. President Nixon chose to record his meetings
because he was concerned that they were not always reported accurately and that his
private discussions might be misconstrued publicly to the benefit of others.
This project supports the area of data triage by improving searchability and exploration
methods of large text datasets, as well as economizing and exploiting large datasets
for analysis. It will explore methods to enable analysts to identify information of
interest more quickly within large datasets of transcribed voice collection. For example,
keyword searches on transcribed data return a great deal of noise, and on a dataset
this large they become impractical. We therefore aim to tag and posture the text
dataset so that analysts can retrieve relevant data rapidly.
The exploration leverages topic modeling, an unsupervised machine learning technique,
which will be used to segment the corpus by detecting word patterns or themes within
it. Topic modeling was chosen because it is a data-driven approach
to identify the hidden or latent topics that can be found within the data. Large segments
of the corpus will be modeled to determine these topics, which are sets of words associated
with defined segments. Topic modeling supports data triage by categorizing the data
into bins based on these topics, which allows analysts to explore specific bins of
data relevant to their interests.
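
The binning idea described above can be illustrated with a minimal sketch. The abstract does not name a specific topic model, so this example assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, written with the Python standard library only. The function name fit_lda, the hyperparameters alpha and beta, the number of topics, and the toy segments are all hypothetical choices made for illustration; the segments are invented placeholders, not text from the transcripts.

```python
import random
from collections import Counter, defaultdict

# Invented placeholder segments (NOT taken from the actual transcripts).
segments = [
    "tape subpoena court prosecutor tape court".split(),
    "court subpoena prosecutor evidence tape".split(),
    "budget schedule meeting staff budget".split(),
    "meeting staff schedule budget travel".split(),
]

def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (an assumed model, see lead-in)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # total tokens in topic k
    assign = []                                         # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

doc_topic, topic_word = fit_lda(segments)

# Triage step: file each segment under its dominant topic (its "bin").
bins = defaultdict(list)
for d, counts in enumerate(doc_topic):
    bins[counts.index(max(counts))].append(d)

for k, members in sorted(bins.items()):
    top_words = [w for w, c in topic_word[k].most_common(3) if c > 0]
    print(f"topic {k}: segments {members}, top words {top_words}")
```

An analyst would then open only the bin whose top words match the subject of interest, rather than running a keyword search over the whole corpus. On a real corpus, an established library implementation and a tuned number of topics would likely be preferable to this toy sampler.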
This project is of interest because it aims to build the tools to address existing
data triage challenges related to national security and technology, leading to better
solutions for real-world problems.