Washington DC Tree Project

This project has been several months in the making and began with identifying a horticultural dataset backed by a legitimate source and a well-defined collection methodology. Pre-Cleaning exploratory analysis revealed a substantial number of duplicate records associated with the ingestion process, highlighting the importance of validation and consolidation steps when working with large, real-world datasets.

The cleaned dataset represents 198 unique tree species, identified by both scientific and common names, and includes attributes such as tree condition, overhead utility conflicts, diameter at breast height (DBH), and latitude and longitude coordinates—fields that proved essential in resolving data integrity issues.

This project remains a work in progress. Data cleaning and exploratory analysis have been completed in SQL, with geospatial mapping underway in QGIS and visualizations in development using Tableau. A future phase of the project includes designing a web application that will expose the cleaned dataset as a searchable database.

Data Cleaning and Validation

The primary goal of data cleaning and validation was to produce a clean, reliable dataset that could serve as the foundation for all subsequent phases of the project. To ensure reproducibility, the cleaning process began from an untouched copy of the raw source data.

Initial exploratory queries were used to assess overall structure, completeness, and record counts. These quickly revealed a large number of duplicate rows, despite the absence of an explicit unique identifier in the dataset. Because latitude and longitude values were recorded with meter-level precision, it was determined that the likelihood of two distinct trees sharing identical coordinates was extremely low. As a result, latitude–longitude pairs were selected as a practical surrogate key for identifying unique records.

A new working table was created using distinct coordinate pairs to eliminate duplicates. During this process, gaps in the scientific_name field were identified. Rather than relying on SQL joins—which proved unreliable due to the many-to-one relationship between common names and scientific names (e.g., multiple cultivars or common names mapping to the same species)—a small Python script was used to populate missing scientific names where confident matches could be inferred.

Date Cleaning and Validation Summary

  • Generated SQL UPDATE statements using Python to efficiently populate and correct fields at scale
  • Identified latitude–longitude coordinate pairs as a surrogate identifier and used them to extract a de-duplicated working table
  • Corrected formatting inconsistencies and resolved botanical naming issues across scientific and common name fields
  • Standardized categorical values across multiple columns to improve readability and downstream analysis
  • Converted date-related fields from VARCHAR to proper DATE data types

Exploratory Data Analysis

After cleaning, a series of targeted questions was formulated to better understand the structure and composition of the dataset. These questions were addressed using SQL queries of varying complexity, with an emphasis on producing core metrics such as counts, proportions, and categorical distributions. The goal of this phase was to generate numeric, percentage-based, and descriptive (categorical) outputs that could be directly visualized or used to inform subsequent lines of inquiry.

A significant portion of the exploratory analysis focused on investigating a potential relationship between the presence of overhead utilities and increased tree mortality. This analysis involved segmenting tree condition outcomes by overhead utility status and comparing relative frequencies to identify meaningful patterns. While this section does not attempt to establish causation, it provides a data-driven foundation for assessing whether utility conflicts may be associated with elevated risk, and it informs areas for deeper statistical or longitudinal analysis in future iterations of the project.

Exploratory Data Analysis Summary

  • Quantified the overall composition of the dataset by evaluating counts and proportions across multiple variables (e.g., condition, species, and categorical attributes)
  • Compared tree condition outcomes between native and non-native species to identify potential distributional differences
  • Identified contradictory values in the native column during exploratory analysis; validated native regions against horticultural reference sources and corrected inconsistencies using a JOIN
  • Investigated the relationship between the presence of overhead utilities and observed tree condition to assess possible associations
  • Generated minimum, maximum, and average statistics for diameter at breast height (DBH) across all tree species

QGIS Mapping and Analysis

Building interactive map for geographic analysis using cleaned dataset in QGIS


References & Citations