# key setups
library(knitr) # knitr for kable tables
library(kableExtra) # pretty tables
library(sf) # simple features (GIS)
library(tools) # md5sum
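library(stringi) # string processing
library(tidyverse) # data wrangling and plotting (dplyr, ggplot2, ...)
library(magrittr) # additional pipe operators
library(htmltools) # HTML generation helpers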
Introduction
Here, we present technical documentation for the creation of the UWED, answers to frequently asked questions, and appropriate use cases for our data products.
All data products are available under the “Data Sets” tab.
Data Sources and Methods
From the Secretary of State (SoS), we obtain annual voter registration files beginning in 2007 and ballot files for each election beginning in 2020. Precinct shapefiles are downloaded from the SoS website.
Citizen Voting Age Population (CVAP) by Race and Ethnicity data are obtained from the US Census website, using 5-year American Community Survey (ACS) estimates at the Census block group level.
Analysis
- Using all years of voter registry and ballot data, a list of all unique addresses is created. These addresses are geolocated to a spatial point (latitude, longitude) using Amazon Web Services.
- These addresses are then matched to the following administrative polygons:
  - All precinct shapefiles from 2007-2024
  - 2024 Legislative Districts
  - 2010 Census block groups/tracts/counties
  - 2020 Census block groups/tracts/counties
  - 2024 Washington Tribal Boundaries
- This spatial information is merged back to individual voters and ballots.
- For each voter and ballot in our database, we impute a probabilistic race/ethnicity using Bayesian Improved Surname Geocoding (BISG).
- The imputation yields, for each person and/or ballot, a probability of belonging to each racial/ethnic category.
- These probabilities are summed across geography (precinct) to obtain estimates of race/ethnicity by precinct and year. The imputed racial/ethnic categories are White, Black, Hispanic, Asian, and Other. Note: These categories do not match standard Census categories, and therefore do not match the full list of categories in the CVAP data available on our site. White, Black, and Asian are non-Hispanic populations equivalent to White alone, Black alone, and Asian alone in the CVAP data. The Other category is also non-Hispanic and should match the sum of AI/AN alone, NHOPI alone, and any combination of two or more races. Hispanic is equivalent in both data sets (Hispanic of any race).
- For the voter registration summaries, only voters listed as “Active” are counted. For the ballot summaries, only “Accepted” ballots are counted.
- In 2023, the SoS replaced Date of Birth (DOB) data with Year of Birth (YOB). Where we have a DOB for a given Voter ID, we carry it forward. When only YOB is available, we assume the DOB is January 1 of that year.
- To estimate Citizen Voting Age Population (CVAP) by precinct, we aggregate up from the Census block group level. We use the 5-year CVAP estimates for 2005-2009, 2006-2010, …, 2019-2023 (the most recent) and assign each to the middle year of its 5-year window: for example, the 2005-2009 block group estimates produce the 2007 precinct estimates, and the 2019-2023 block group estimates produce the 2021 precinct estimates. The 2019-2023 block group estimates are also used for 2022, 2023, and 2024. When a block group intersects more than one precinct, area-weighted interpolation (st_interpolate_aw from the sf package) apportions the block group population across precincts by the land area of the intersection.
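The imputation and summation steps above can be sketched in base R. This is an illustration only, not the pipeline's actual code: the function name `bisg_posterior`, the column names, and all probabilities below are made up for the example. BISG combines a surname-based prior with geography through Bayes' rule, and the per-person probabilities are then summed within each precinct and year.

```r
# Bayes update for one person: P(race | surname, geography) is proportional
# to P(race | surname) * P(geography | race). Inputs here are illustrative.
bisg_posterior <- function(p_race_given_surname, p_geo_given_race) {
  unnormalized <- p_race_given_surname * p_geo_given_race
  unnormalized / sum(unnormalized)
}

categories <- c("white", "black", "hispanic", "asian", "other")
p_surname <- setNames(c(0.60, 0.20, 0.10, 0.05, 0.05), categories)
p_geo     <- setNames(c(0.001, 0.004, 0.002, 0.001, 0.002), categories)
posterior <- bisg_posterior(p_surname, p_geo)  # sums to 1 across categories

# Hypothetical per-voter output: one row per voter, one probability column
# per imputed category (each row sums to 1).
voters <- data.frame(
  precinct   = c("P1", "P1", "P2"),
  year       = c(2024, 2024, 2024),
  p_white    = c(0.70, 0.10, 0.25),
  p_black    = c(0.05, 0.10, 0.25),
  p_hispanic = c(0.10, 0.60, 0.25),
  p_asian    = c(0.10, 0.15, 0.15),
  p_other    = c(0.05, 0.05, 0.10)
)

# Sum probabilities within precinct and year to get expected counts.
precinct_est <- aggregate(
  cbind(p_white, p_black, p_hispanic, p_asian, p_other) ~ precinct + year,
  data = voters,
  FUN  = sum
)
```

Because each voter's probabilities sum to 1, the category estimates for a precinct sum to the number of voters counted in that precinct.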
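The DOB rule described above (carry a known DOB forward by Voter ID, otherwise fall back to January 1 of the YOB) could look roughly like the following. The data frame and its column names (`voter_id`, `file_year`, `dob`, `yob`) are hypothetical and stand in for whatever the registration files actually contain.

```r
# Hypothetical registration records across annual files.
reg <- data.frame(
  voter_id  = c("A", "A", "B"),
  file_year = c(2022, 2023, 2023),
  dob       = as.Date(c("1980-06-15", NA, NA)),
  yob       = c(1980, 1980, 1995)
)

# First known DOB per voter ID (voters with no DOB on file are absent here).
known <- reg[!is.na(reg$dob), c("voter_id", "dob")]
known <- known[!duplicated(known$voter_id), ]

# Carry the known DOB to every record for that voter ID (NA if none exists).
reg$dob_filled <- known$dob[match(reg$voter_id, known$voter_id)]

# Where only YOB is available, assume January 1 of that year.
missing <- is.na(reg$dob_filled)
reg$dob_filled[missing] <- as.Date(paste0(reg$yob[missing], "-01-01"))
```

In this toy input, voter A keeps the DOB from the 2022 file in the 2023 record, while voter B, who has only a YOB, is assigned 1995-01-01.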
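As a rough sketch of the CVAP step, the code below first maps a precinct year to its 5-year ACS window and then apportions block group CVAP to precincts by area. The helper `acs_window_for`, the table layout, and the numbers are assumptions for illustration. The intersection areas are typed in by hand here; in practice they come from the geometries, which is the work that `st_interpolate_aw` with `extensive = TRUE` does in one call.

```r
# Map a precinct year to the 5-year ACS window used for it: the window is
# centered on the year, capped at the most recent window (2019-2023).
acs_window_for <- function(precinct_year) {
  stopifnot(all(precinct_year >= 2007 & precinct_year <= 2024))
  end_year <- pmin(precinct_year + 2, 2023)
  paste0(end_year - 4, "-", end_year)
}

# Hypothetical block groups with their total area and CVAP estimate.
bg <- data.frame(
  block_group = c("BG1", "BG2"),
  bg_area     = c(2, 2),
  cvap        = c(100, 200)
)

# Hypothetical block group x precinct intersections (areas in the same unit).
# BG1 lies entirely in precinct A; BG2 is split evenly between A and B.
pieces <- data.frame(
  block_group = c("BG1", "BG2", "BG2"),
  precinct    = c("A", "A", "B"),
  piece_area  = c(2, 1, 1)
)

# Each piece receives the block group CVAP times its share of the block
# group's area; shares are then summed within each precinct.
pieces <- merge(pieces, bg, by = "block_group")
pieces$cvap_share <- pieces$cvap * pieces$piece_area / pieces$bg_area
precinct_cvap <- aggregate(cvap_share ~ precinct, data = pieces, FUN = sum)
```

Area weighting treats population as spread evenly within each block group, so the total CVAP is preserved but the split between precincts is only as good as that assumption.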
Sources of Error and Uncertainty
- CVAP data are produced at the block group level using ACS 5-year data. At the block group level, each estimate of CVAP also comes with a margin of error that is a function of the ACS survey design. When we aggregate these estimates to the precinct level, we ignore this margin of error and report the aggregate as though there were no uncertainty. While it is possible to aggregate the CVAP uncertainty to the precinct level, doing so would assume that CVAP estimates for block groups within the same precinct have a covariance of 0, which is unlikely to hold.
- The BISG method for race imputation uses 2020 Census counts containing noise from differential privacy as inputs for both block group- and county-level race imputation. The amount of noise infused into the population counts increases as the Census geography gets smaller, with counties having the least amount of noise and blocks the most. So, while block group imputation allows for more accuracy in terms of the underlying race/ethnicity data, there is more uncertainty than in the underlying county imputation data.
- The BISG method for race imputation has some known limitations: it does not perform as well for Black Americans as for other races, errors can be induced by surname changes due to intermarriage, and significant shifts in the racial makeup of Census geographies in the decade following a decennial census (e.g. due to migration) can result in biased race imputations based on the last census’s racial compositions.
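For readers who want a rough precinct-level margin of error despite the caveat above, the zero-covariance approximation is the root sum of squares of the block group margins of error, with each margin scaled by its area weight when a block group is split. The function below is a sketch of that approximation, not something applied to our published estimates; `aggregate_moe` and its arguments are names chosen for this example.

```r
# Approximate margin of error for a weighted sum of block group estimates,
# assuming the estimates are uncorrelated (covariance of 0).
# moe:    block group margins of error
# weight: share of each block group assigned to the precinct (default 1)
aggregate_moe <- function(moe, weight = rep(1, length(moe))) {
  stopifnot(length(moe) == length(weight))
  sqrt(sum((weight * moe)^2))
}

# Two whole block groups with margins of error of 30 and 40.
whole <- aggregate_moe(c(30, 40))

# Same block groups, but only half of the second falls in the precinct.
split <- aggregate_moe(c(30, 40), weight = c(1, 0.5))
```

If the block group estimates are in fact positively correlated, this approximation understates the true uncertainty.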