I use a healthcare expenditure dataset with observations geographically coded at the 5-digit zipcode level, but I’d also like to know which county an observation ‘belongs’ to. Maybe I want to cluster standard errors by county, or control for county-specific trends. You’d imagine this would be straightforward, but I haven’t yet found a government crosswalk that is comprehensive in all the ZIP5s that appear in my data. What follows is the best solution I’m aware of, to ensure that I match as many ZIP5s as possible. While this only increases the number of ZIP5-county matches by about 110 over what HUD offers, it’s an improvement of more than 6,000 over the Census crosswalk.

Mapping ZIP5s to counties is slightly tricky because they,

  1. Refer to postal routes, and are therefore collections of lines, not polygons
  2. May straddle multiple counties, necessitating an apportionment decision

The US Census Bureau addresses 1.) by devising a Zip Code Tabulation Area (ZCTA) which approximates the area served in a ZIP using Census blocks. These are by and large mapped 1:1 such that, for example, the 31211 zip5 in Macon, Georgia also appears as 31211 in a ZCTA database.

As for 2.), Available crosswalks do provide several importance measures to aid the apportionment decision, under the assumption that your analysis requires that a singular county be mapped to a ZIP. HUD provides crosswalks updated quarterly for several administrative levels (CBSA, CBSA division, Census tract, etc.), and includes the share of that zipcode’s residential addresses, business addresses, other addresses, and total addresses that lie in any of the relevant counties.

So we’re done, right? Not yet. First, it’s not clear that an address count is ideal. Imagine a scenario where a zipcode broaches several adjacent counties and features two types of housing: senior housing and homes occupied by extended families. If the average number of residents per unit in the retirement homes is 1.5, but 6 in the latter, then the share of addresses is a very imperfect proxy for population shares.

The Census Bureau addresses this with their lookup table and includes 2010 Census population data as well as area (total and land-only). Here is a helpful guide they’ve created. The crosswalk includes some relevant information in the county –> ZIP5 direction as well, such as — COPOPPCT— the county population percentage residing in that ZIP5.

Let’s first get the Census listing of county FIPS codes and names.

The county variable is concatenated from the state FIPS and county FIPS codes, and will be used to link with the ZIP-County crosswalks.

Now let’s start with that HUD crosswalk. I’ve randomly selected the December 2017 version, but the following applies to any vintage. This file includes 39,455 unique zipcodes. There is substantial change over time, as an equally-randomly selected March 2011 file features 36,413 unique zipcodes. We’ll merge this in with our county FIPS file and identify the county for which a given zipcode has the largest share of total addresses, and residential addresses in.

Screen Shot 2018-03-28 at 11.52.48 AM

The vast majority of county matches with the greatest share of residential addresses also have the highest share of total addresses. We’ll opt for the former as our apportioning variable. And to give you a sense of how important apportionment is, more than 10,000 zip5’s reside in at least two counties.

Is this problematic in any way? Let’s first inspect the residential address share for those counties that were first in the gsort sorting above. As expected, the vast majority of them are above 50%, with the lion’s share at 100%. These “100% zipcodes” are interior to a single county, at least in regards to residential address location. ResOrderShare.png

What’s troubling is that mass at 0%. The sort function should not have loaded on these counties at all, yet eye-balling them suggests they are exclusively contained within single counties, so there’s not a problem. In fact, these zipcodes may simply contain only business/other addresses, in which case then they’re unlikely to further our healthcare expenditure data analysis.

We’ll now download that Census crosswalk, and perform some similar procedures, again apportioning to the county for which the largest share of a zipcode’s population resides.

I’ve suffixed several of the variables with “_rel” since a naive merge with our HUD crosswalk generates several merge conflicts. We can more carefully identify disagreements between the two crosswalks this way.

We now merge and get the following:

Screen Shot 2018-03-28 at 12.22.40 PM

so that there’s 33,014 zipcodes appearing in both crosswalks, while the HUD database adds 106 to those found in the Census file.

Here’s our first problem – major inconsistencies in the county results from our two apportionment procedures. Screen Shot 2018-03-28 at 12.26.58 PM

That’s pretty bad, but it’s even worse when we’re not even agreeing on the same state.

Screen Shot 2018-03-28 at 12.29.08 PM.png

In these cases, HUD tells us that 100% of residential addresses are located in the designated county, while Census gives values for these 3 of between 53% and 69%. Population apportionment would therefore direct us to the state_rel values and respective counties.

So a final decision to make is how these disagreements should be reconciled. The good news is that this need applies only to 357 zipcodes which have conflicting information coming from both crosswalks. In many cases, county information is only coming from one of the crosswalks.

I think the optimal approach is the following and is the algorithm used in producing the following dataset:

  1. Rely first on Census population-apportioned county matching
  2. Then fill missing values with HUD’s residential address apportioned matches
  3. When both are present but in conflict, rely on Census values which should be considered to have more integrity than HUD values derived from Census raw data

The final dataset will have 39,561 unique zipcodes, if you were to download the HUD crosswalk vintage referred to above. It marries the reliability of Census population data without the potential for errors percolating from another government agency’s analysis, with the timeliness of the quarterly HUD datasets which expands the scope of included ZIP5s.

You can download the final Stata .dta crosswalk file here, while the entire Stata script is available here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s