DATABASE TO SUPPORT PRINCIPAL COMPONENTS ANALYSIS

ABSTRACT: Our client had been accused of contaminating several miles of river sediments (including the upstream side of a dam that was itself upstream of the site) and Newark Bay, and was potentially liable for hundreds of millions of dollars in cleanup costs and natural resource damages. There were two goals. The first was to show that most of the contamination was distinctly different from that found on or offshore of the subject site. The second was to demonstrate that the contamination was in fact very similar to the products of various other processes, including automotive exhaust, PCB oil leakage, and municipal incineration. I developed a database to support an effort to "fingerprint" samples containing polychlorinated dibenzodioxins (PCDD) and polychlorinated dibenzofurans (PCDF) as a method of source identification. The project was successful and resulted in the publication of several journal articles.

SITUATION: The data on PCDD and PCDF concentrations in different media were taken from technical papers published in environmental journals. The data came in a wide variety of formats, with the isomers presented in different orders and non-detects represented inconsistently. The pattern recognition program required that the data be complete (all 25 isomers), natural-log transformed, arranged with samples as rows and isomers as columns, and saved in Lotus format. Dr. Michael Ungs had started the project by typing a little of the data into the required format. His progress was significantly hindered by the complexity of the data manipulation, which included taking natural logs of the absolute values of the non-detects. By the time I found out what he was trying to do, he was concerned because he had not made significant progress and the project timeline was very short. In addition, he was scheduled to be out of the office for a few days, after which the pattern recognition was to begin immediately.
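
The required input layout can be sketched in a few lines of Python. This is only an illustration, not the original Paradox or Lotus work. It assumes non-detects are recorded as negative numbers (the detection limit with its sign flipped), which is one common convention and would explain taking the log of the absolute value. The isomer names, the three-isomer list (the project used 25), and the function names are hypothetical.

```python
import math

# Hypothetical short isomer list; the project used 25 isomers.
ISOMERS = ["2378-TCDD", "12378-PeCDD", "OCDD"]

def log_abs(value):
    """Natural log of |value|. Assumes a negative value marks a non-detect
    stored as the negated detection limit."""
    if value == 0:
        raise ValueError("log undefined for zero concentration")
    return math.log(abs(value))

def crosstab(records, isomers=ISOMERS):
    """Pivot (sample, isomer, value) records into one row per sample and
    one column per isomer. Samples missing any isomer are rejected."""
    by_sample = {}
    for sample, isomer, value in records:
        by_sample.setdefault(sample, {})[isomer] = log_abs(value)
    rows = []
    for sample, values in by_sample.items():
        missing = [i for i in isomers if i not in values]
        if missing:
            raise ValueError(f"sample {sample!r} is missing {missing}")
        rows.append([sample] + [values[i] for i in isomers])
    return rows

# Illustrative values only, not project data.
records = [
    ("S1", "2378-TCDD", 1.0),
    ("S1", "12378-PeCDD", -0.5),  # non-detect at a detection limit of 0.5
    ("S1", "OCDD", 20.0),
]
matrix = crosstab(records)
```

Rejecting incomplete samples outright mirrors the completeness requirement; a real workflow might instead flag them for review.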

ACTION: I proposed developing a database that would give us better analytical flexibility and the ability to trace the origin of every data point. I decided that Paradox was the best program for the project, given the crosstabulation requirement. After analyzing the data, I found that it was more complex than it first appeared. Many of the papers covered more than one type of sample (a study), and each type included several samples. I designed the database to record the paper, study, sample, and isomer number for each data point.
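
A rough sketch of that paper, study, sample, and isomer structure follows, using Python's built-in sqlite3 module rather than Paradox. The table and column names are hypothetical, and the single inserted row is placeholder data, not a value from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paper  (paper_id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
CREATE TABLE study  (study_id INTEGER PRIMARY KEY,
                     paper_id INTEGER NOT NULL REFERENCES paper(paper_id),
                     sample_type TEXT NOT NULL);
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY,
                     study_id INTEGER NOT NULL REFERENCES study(study_id),
                     label TEXT NOT NULL);
CREATE TABLE result (sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
                     isomer_no INTEGER NOT NULL,
                     value REAL NOT NULL,
                     PRIMARY KEY (sample_id, isomer_no));
""")

# Placeholder rows for illustration only.
conn.execute("INSERT INTO paper VALUES (1, 'Placeholder citation')")
conn.execute("INSERT INTO study VALUES (1, 1, 'placeholder sample type')")
conn.execute("INSERT INTO sample VALUES (1, 1, 'S1')")
conn.execute("INSERT INTO result VALUES (1, 7, 0.5)")

def trace(conn, sample_id, isomer_no):
    """Return (citation, sample_type, label, value) for one data point."""
    return conn.execute(
        """SELECT p.citation, st.sample_type, sa.label, r.value
           FROM result r
           JOIN sample sa ON sa.sample_id = r.sample_id
           JOIN study  st ON st.study_id  = sa.study_id
           JOIN paper  p  ON p.paper_id   = st.paper_id
           WHERE r.sample_id = ? AND r.isomer_no = ?""",
        (sample_id, isomer_no),
    ).fetchone()

origin = trace(conn, 1, 7)
```

Storing one row per sample and isomer keeps every value linked to its source paper, and the wide samples-by-isomers layout can then be produced by a crosstab query when needed.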

RESULTS: The database more than met the needs of the project. It also allowed us to easily normalize individual isomers with respect to the isomer total and congener groups with respect to the group total. Although this did not help our analysis and received only a one-sentence mention in one of the papers, it was an important option, and one that would not have been available without the database. Several journal articles were published based on this project.
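
The normalization described above amounts to dividing each value by the relevant total. A minimal Python sketch follows, assuming the total is the plain sum over one sample and that normalization is applied to raw concentrations before any log transform. The function name and the numbers are illustrative only.

```python
def normalize(values):
    """Divide each concentration by the sample total so the fractions sum to 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("total must be positive to normalize")
    return {name: v / total for name, v in values.items()}

# Illustrative concentrations for one sample (not project data).
isomers = {"isomer_1": 2.0, "isomer_2": 6.0}
fractions = normalize(isomers)  # isomer_1 -> 0.25, isomer_2 -> 0.75
```

The same function would apply to congener groups by passing group sums instead of individual isomer values.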