Cross-dataset Analysis of Archaeological Remains at Olduvai Gorge

1 Introduction

Olduvai Gorge, Tanzania, offers a uniquely preserved glimpse into the Early Pleistocene (roughly 2.6 to 0.8 million years ago), where researchers can approach answers to common questions about early human evolution. Accumulations of animal bones and stone tools must be carefully excavated and entered into a detailed database; specimens are inspected individually on-site and later analyzed collectively using statistics.

We set out with the broad goal of addressing bottlenecks in the process of collecting and analyzing faunal data and the possible corollary of standardizing parts of the workflow. On-site experience with excavation and data entry at the 2019 UNCG Paleoanthropological Field School at Olduvai Gorge, Tanzania allowed us to identify a common procedure that needed improvement: generating frequency distributions of one or multiple variables, grouped by level or excavation unit, across multiple datasets, and arranged in a table. We learned that while spreadsheet software contains functions for this type of analysis, there isn’t a quick way to merge datasets in order to run functions across multiple datasets. Here we document the iterative process that led to our final software application, a custom program that quickly merges datasets in multiple formats (including open formats like CSV and ODS). We also include examples of the final workflow with our software in the form of simple exploratory data visualizations.

2 Initial goals (pre-trip)

Like most scientific fields, anthropology and archaeology have been evolving for decades to take advantage of modern computing. Total stations (tripod-mounted optical devices used to record exact 3D coordinates), GPS, handheld data entry computers, digital cameras, 3D scanning, DNA and particle analysis, and other technologies are integral to progress in the field. Excavation, however, is a uniquely difficult process to automate because of its tangible, destructive nature. State-of-the-art archaeological sites are still painstakingly excavated by hand, and the initial data about specimens is often recorded in a handwritten journal. Researchers eventually transcribe specimen data into a digital format[1], but this kind of manual data entry precludes the automatic, comparative data analysis available in some scientific fields.

CAPTION: A field journal. The specimen diagram is useful when entering the (digitally recorded) 3D coordinates later.

The goal of this project was to improve the process of collecting and analyzing data used by researchers in Olduvai Gorge, and to learn about software development by doing so. If possible, we wanted to apply our findings to real research questions like:

  • How important was animal tissue in the diets of early humans and, when they did acquire this resource, was it via active hunting, passive scavenging, or a combination of both?
  • Why did early humans concentrate their activities at particular locations on the landscape? What new evidence is there to resolve the “home base” debate?

Throughout the process, we tried to keep a focus on two workflow outcomes:

  1. Remove bottlenecks in the workflow used by researchers (such as data entry).
  2. Adapt (or create) standards for data storage. This can encourage collaboration between different teams of researchers and archaeological sites. Over the long-term, standards improve data-interoperability and lengthen the “lifespan” of datasets.

3 Methods

Achieving our goals required an iterative approach with periods of learning and field work (including a URCA-funded trip to Olduvai Gorge) followed by experimentation and software development.

  1. Prior to the trip, Nick (the student researcher) familiarized himself with a new programming language for the project, some concepts surrounding data entry and storage, and existing standards for biodiversity data formats. The product of this effort was a small command-line utility called DWCHelper.
  2. To experience the methods used by field researchers, and to gain a better understanding of the research questions surrounding Olduvai Gorge, Nick accompanied Dr. Charles Egeland (the faculty mentor) to UNCG’s Olduvai Gorge Field School in the summer of 2019. After the Field School we developed an unnamed prototype GUI application for cross-dataset analysis.
  3. We developed a cross-platform, open-source tool for merging datasets, with support for multiple data formats.

4 Pre-trip Research: DWCHelper

Early investigations into existing data standards led to a small (600 lines of code) command-line program called DWCHelper[2] (Darwin Core Helper). Darwin Core (more specifically, Simple Darwin Core[3]) is a simple specification for tabular biodiversity data which was developed by the TDWG non-profit[4]. DWCHelper only works with CSV files and has the following functionality:

  • formats the file according to RFC 4180, the common format specification for CSV (cleans up extra quotes, etc.)
  • detects and suggests aliases to Darwin Core terms
  • detects and suggests unused terms for removal
  • allows the user to rename or remove terms
  • saves the conversion settings for future runs (to accommodate changes to the dataset)
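The first item in the list above, cleaning up a CSV file so that it follows RFC 4180, can be sketched with Go's standard encoding/csv package. This is a minimal illustration and not DWCHelper's actual code: the function name normalizeCSV and the sample data are assumptions made for this example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// normalizeCSV (hypothetical name) parses loosely formatted CSV text and
// writes it back out with consistent quoting and CRLF line endings.
func normalizeCSV(input string) (string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = true       // accept stray quotes inside fields
	r.TrimLeadingSpace = true // ignore spaces after a delimiter
	r.FieldsPerRecord = -1    // tolerate rows of differing length
	records, err := r.ReadAll()
	if err != nil {
		return "", err
	}

	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.UseCRLF = true // RFC 4180 terminates records with CRLF
	if err := w.WriteAll(records); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Invented sample input with an unnecessary space and quotes.
	out, err := normalizeCSV("Level,Taxon\n1, \"Bovidae\"\n2,Equidae\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The writer only quotes fields that need it (those containing commas, quotes, or line breaks), which is what removes the extra quotes.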

The intuition behind DWCHelper came from initial comparisons between existing datasets from different teams in Olduvai Gorge. We noticed that the most obvious difference, in many cases, was that researchers used different terms for the same variable. For example, Spanish paleoanthropologists might use the term “Nivel” to refer to the stratigraphic layer of a specimen, but an American researcher might use the term “Level”.
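The alias idea can be sketched as a lookup table from known variants to one canonical term. The table below is hypothetical and holds only the “Nivel”/“Level” example from the text plus one invented entry. It is not DWCHelper's actual term list, and the canonical names are not claimed to be Darwin Core terms.

```go
package main

import (
	"fmt"
	"strings"
)

// aliases maps lowercase variants of a term to a canonical name.
// Illustrative entries only; a real table would be much larger.
var aliases = map[string]string{
	"nivel": "Level",
	"level": "Level",
	"layer": "Level", // invented synonym for the example
}

// suggestRenames returns a suggested canonical name for every header
// that matches a known alias but is not already in canonical form.
func suggestRenames(headers []string) map[string]string {
	suggestions := make(map[string]string)
	for _, h := range headers {
		key := strings.ToLower(strings.TrimSpace(h))
		if canonical, ok := aliases[key]; ok && canonical != h {
			suggestions[h] = canonical
		}
	}
	return suggestions
}

func main() {
	headers := []string{"Nivel", "Taxon", "Level"}
	for from, to := range suggestRenames(headers) {
		fmt.Printf("suggest renaming %q to %q\n", from, to)
	}
}
```

Even in this small form, the output is only a suggestion: the researcher still has to confirm each rename.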

We realized, however, that amending this is a trivial process in all spreadsheet software, and attempting to automate the task was a misplaced optimization. Even if we compile a large database of common terms and their synonyms (plus translations, case changes, and misspellings), researchers still have to manually check that they agree with DWCHelper’s suggestions. The potential time saved compared to manually renaming terms is minuscule, so we abandoned the project.

However, this stage of the project was not a waste. In order to create a program that works on multiple operating systems, we chose the Go programming language[5] (a compiled, garbage-collected, statically typed language developed by Google). The Go knowledge gained while developing DWCHelper helped with later versions of the software, including skills such as cross-compiling the application for Linux and Microsoft Windows, generating an installer for Windows, and usage of structs and common idiomatic interfaces in Go.

We used the following simple code to represent a single dataset in the computer’s memory, and it remains unchanged in both the second and third version of our software:

type Dataset struct {
        data   map[string][]string // maps terms to ordered data (columns)
        terms  []string            // ordered list of terms
        name   string              // name of the dataset
        height int                 // "height" of the spreadsheet (including the first row of terms)
}

DWCHelper is a command-line application, meaning that the user has to open a terminal or command prompt and type out the command (along with filenames as arguments). It became clear that future versions of our software should include a GUI (Graphical User Interface) to meet the average user’s expectations of computer programs.
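To make the Dataset struct and the command-line usage concrete, the sketch below fills a Dataset from the rows of a CSV file named as the first argument. The constructor newDataset and the padding of short rows with empty strings are assumptions for illustration. The source only gives the struct definition, not how it is populated.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

type Dataset struct {
	data   map[string][]string // maps terms to ordered data (columns)
	terms  []string            // ordered list of terms
	name   string              // name of the dataset
	height int                 // rows in the sheet, including the term row
}

// newDataset (hypothetical) treats rows[0] as the terms and stores the
// remaining rows column by column. Short rows are padded with "".
// Duplicate term names are not handled in this sketch.
func newDataset(name string, rows [][]string) (*Dataset, error) {
	if len(rows) == 0 {
		return nil, fmt.Errorf("dataset %q is empty", name)
	}
	ds := &Dataset{
		data:   make(map[string][]string),
		terms:  append([]string(nil), rows[0]...),
		name:   name,
		height: len(rows),
	}
	for _, row := range rows[1:] {
		for i, term := range ds.terms {
			value := ""
			if i < len(row) {
				value = row[i]
			}
			ds.data[term] = append(ds.data[term], value)
		}
	}
	return ds, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: loadcsv FILE.csv")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ds, err := newDataset(os.Args[1], rows)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d terms, %d rows\n", ds.name, len(ds.terms), ds.height)
}
```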

5 Olduvai Gorge Field School

With the help of URSCO URCA funding, Nick was able to travel to Tanzania over the summer of 2019 to attend UNCG’s Olduvai Gorge Field School.

UNCG’s Olduvai Gorge Field School is a summer program led by Dr. Charles Egeland and open to students from all universities. During a three-week period of intensive research, excavation, and activities, students work alongside international experts on research projects at the field camp in Olduvai Gorge.

Olduvai Gorge is a feature of northern Tanzania with rich layers of well-preserved sediment going back millions of years. The gorge is a shallow chasm created by natural erosion and made famous by the work of Mary and Louis Leakey, whose team discovered a Paranthropus boisei skull there in 1959.

CAPTION: Map of Tanzania

Mary Leakey described the simple, intentional stone technology used by proto-humans to butcher meat as Oldowan.

CAPTION: A Paranthropus boisei skull. Mary Leakey discovered a skull from this robust group of hominoids in Olduvai Gorge in 1959.

Through the study of fossil evidence, including hominoid fossils, lithics (stone tools that hominoids crafted and used), butchered animals, and other fossilized flora and fauna, paleoanthropologists seek to learn about the evolution of primates and early humans.

CAPTION: Students excavating at Olduvai Gorge.

In addition to the labs, excavation, and excursions that all students took part in, we set time aside to analyze animal bones excavated from the 1.4-million-year-old site of BK East.

5.1 The Excavation process

To understand the type of data that paleoanthropologists routinely use, some background knowledge is necessary.

A dig site in Olduvai Gorge like BK East is a grid of excavation units (usually one square meter) that excavators clear layer by layer. Hominoid bones are exciting but rare, and most specimens are zooarchaeological in nature. When a specimen is carefully removed on-site, we record its exact XYZ location, provenience, orientation within the sediment, taxonomic affinity, stratigraphic position, and other relevant details such as its relation to nearby specimens (for instance, in the case of a bone fractured in multiple pieces, or a mandible with separated teeth). We also draw the layout of the excavation unit and take pictures.

Later, specimens are more carefully analyzed for features that can give us clues about the taphonomic history of the animal. We look for things like the state of preservation, the presence of tooth marks (or cut marks!), the type of bone breakage, and evidence of scarring and healing.

Along with taphonomic clues, if possible, we record general information about the animal such as the genus, estimated body size, level of epiphyseal fusion (which can hint at the individual’s age), and which part of the skeleton the bone is from.

Although there is no uniform data format or program for this type of data, it can almost always be represented with a simple table in a spreadsheet: a list of variables (also known as terms, fields, or parameters) along the top, and a single row for each specimen.

Figure 1: Part of a dataset from the DK East site.

5.2 What can we improve?

Entering data during the Field School led to our goal of building a GUI application that supported the quick creation of multi-dataset, multi-variable tables and graphs. For instance, it should be easy to visualize the number of each animal type by stratigraphic level. We also decided to support the Microsoft Excel spreadsheet format. Despite its proprietary nature, Microsoft Excel is so popular that it is a de facto standard. After the Field School, we began building a software tool[6] to automate the process of generating these multi-variable, cross-dataset frequency distributions.
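The kind of frequency distribution described here, such as the number of each animal type per stratigraphic level, amounts to a cross-tabulation. The sketch below shows the idea under stated assumptions: crossTab is a hypothetical helper, each specimen is a simple map from term to value, and the specimen values are invented for illustration rather than taken from any Olduvai dataset.

```go
package main

import (
	"fmt"
	"sort"
)

// crossTab counts, for each value of rowVar, how often each value of
// colVar occurs. Each record maps a term to that specimen's value.
func crossTab(records []map[string]string, rowVar, colVar string) map[string]map[string]int {
	table := make(map[string]map[string]int)
	for _, rec := range records {
		r, c := rec[rowVar], rec[colVar]
		if table[r] == nil {
			table[r] = make(map[string]int)
		}
		table[r][c]++
	}
	return table
}

func main() {
	// Invented example specimens.
	specimens := []map[string]string{
		{"Level": "1", "Taxon": "Bovidae"},
		{"Level": "1", "Taxon": "Bovidae"},
		{"Level": "1", "Taxon": "Equidae"},
		{"Level": "2", "Taxon": "Suidae"},
	}
	table := crossTab(specimens, "Level", "Taxon")

	// Sort the levels so the printed table has a stable order.
	levels := make([]string, 0, len(table))
	for level := range table {
		levels = append(levels, level)
	}
	sort.Strings(levels)
	for _, level := range levels {
		fmt.Println("Level", level, table[level])
	}
}
```

Spreadsheet pivot tables compute the same counts. The sketch only makes explicit what such a table contains.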

Figure 2: A screenshot of our second application.

That application fell short of our original vision for a general-purpose tool.

Two factors contributed to our decision to abandon the second application and move on to a more general-purpose tool.

  1. It became clear that finishing the application would simply take too much time, with no real guarantee that the final product would be flexible enough for practical use. We already had over 1000 lines of code and needed hundreds more to meet our basic goals.
  2. We studied the manuals of popular spreadsheet software and discovered that much of the functionality we were writing already existed (and in battle-tested, enterprise-approved form).

With improved knowledge of statistical functions in spreadsheet software, we realized that the bottleneck was not the statistical visualizations, but the ability to quickly combine multiple datasets and perform the statistical visualizations on them as a unit. This led to our current application, Dataset Merge Tool[7].

Note that like DWCHelper, our work on the second application provided valuable experience that helped with the final iteration of our software. Most important was the addition of the GUI, written in GTK3[8], a popular FOSS GUI toolkit that runs on multiple platforms.

6 Final Product: Dataset Merge Tool

Dataset Merge Tool is a simple, cross-platform tool to quickly merge datasets in different formats for the purpose of statistical analysis. It merges ODS (Open Document Spreadsheet), XLSX (Microsoft Excel), and CSV files, and supports exporting the merged dataset to CSV and XLSX. The application currently runs on Microsoft Windows (with an installer) and Linux, and a port to macOS is in progress.
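One plausible way to merge tabular datasets whose terms only partly overlap is to take the union of all terms, leave cells blank where a dataset lacks a term, and record which dataset each row came from. The sketch below illustrates that approach. It is an assumption about the general technique, not a description of Dataset Merge Tool's internals, and the Table type, the merge function, and the "Dataset" column name are hypothetical.

```go
package main

import "fmt"

// Table is a simplified, hypothetical stand-in for a loaded spreadsheet.
type Table struct {
	Name  string
	Terms []string   // column headers
	Rows  [][]string // one row per specimen
}

// merge builds one table whose terms are the union of all input terms
// (in first-seen order), preceded by a "Dataset" column naming the
// source. Cells for terms a source lacks are left empty. An input term
// that is itself called "Dataset" is not handled in this sketch.
func merge(tables ...Table) Table {
	out := Table{Name: "merged", Terms: []string{"Dataset"}}
	index := map[string]int{"Dataset": 0}
	for _, t := range tables {
		for _, term := range t.Terms {
			if _, seen := index[term]; !seen {
				index[term] = len(out.Terms)
				out.Terms = append(out.Terms, term)
			}
		}
	}
	for _, t := range tables {
		for _, row := range t.Rows {
			merged := make([]string, len(out.Terms))
			merged[0] = t.Name
			for i, term := range t.Terms {
				if i < len(row) {
					merged[index[term]] = row[i]
				}
			}
			out.Rows = append(out.Rows, merged)
		}
	}
	return out
}

func main() {
	// Invented example data.
	a := Table{Name: "Site A", Terms: []string{"Level", "Taxon"},
		Rows: [][]string{{"1", "Bovidae"}}}
	b := Table{Name: "Site B", Terms: []string{"Taxon", "Element"},
		Rows: [][]string{{"Equidae", "femur"}}}

	m := merge(a, b)
	fmt.Println(m.Terms)
	for _, row := range m.Rows {
		fmt.Printf("%q\n", row)
	}
}
```

Keeping the source name as its own column is what allows a later pivot table to group results by dataset.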

Figure 3: Dataset Merge Tool

The application is simple in multiple senses of the word: it has fewer lines of code than DWCHelper, it performs only one function, and its user interface is designed to be intuitive for computer users at all levels.

We limited Dataset Merge Tool’s scope in order to isolate the step in the workflow that takes the longest: merging multiple datasets. We believe that this approach is the best optimization at this time. Better tools already exist for the data visualization (spreadsheet software), and changing the format of the raw data is best left to the researcher. If we compare the workflow of specimen excavation, specimen analysis, data entry, and data analysis to the workflow of a standard operating system user, one could say that Dataset Merge Tool follows the Unix Philosophy: write programs that do one thing and do it well.

7 The Program in Action

While our application has not yet been used for rigorous statistical analysis, some experimentation quickly yields ideas for further research. The following graphs were generated by using Dataset Merge Tool to merge datasets from two Olduvai Gorge sites, and then using the “pivot table” function in Microsoft Excel (this functionality exists in all major spreadsheet software, including LibreOffice Calc and Google Sheets).


The important progress here is not the ability to create these graphs, but the ability to quickly create them.

8 Insights

“In preparing for battle, I have always found that plans are useless but planning is indispensable.”

-Dwight D. Eisenhower, as quoted by Richard Nixon in Six Crises, 1962

Figure 5: Source:
  • maybe a broader statement about the role of computing and statistics in all fields of science
  • education - the trade off between custom software and training researchers to code (like in R)

9 References



[1] Sometimes this is a hybrid process in which coordinates and some other details can be imported automatically from a handheld computer, but many data still need to be entered manually.