Getting Started¶
geosnap provides a set of tools for collecting data and constructing space-time datasets, identifying local neighborhoods or prototypical neighborhood types, modeling neighborhood change over time, and visualizing data at each step of the process.
geosnap works with data from anywhere in the world, but comes batteries-included with three decades of national US Census data, including boundaries for metropolitan statistical areas, states, counties, and tracts, and over 100 commonly used demographic and socioeconomic variables at the census-tract level. All of these data are stored as geopandas geodataframes in efficient Apache Parquet files and distributed through quilt.
These data are available when you first import geosnap by streaming from our quilt bucket into memory. That can be useful if you don't need US data or if you just want to kick the tires, but it also means you need an internet connection to work with census data, and things may slow down depending on your network performance. For that reason, you can also use the store_census function (demonstrated below) to cache the data on your local machine for faster querying. This takes only around 400 MB of disk space, speeds up data operations, and removes the need for an internet connection.
Using built-in data¶
You can access geosnap’s built-in data from the datasets module. It contains a variable codebook as well as state, county, and MSA boundaries, in addition to boundaries and social data for three decades of census tracts. If you have stored an existing longitudinal database such as LTDB or the Geolytics Neighborhood Change Database, it will be available in datasets as well.
from geosnap import datasets
dir(datasets)
['blocks_2000',
'blocks_2010',
'codebook',
'counties',
'ltdb',
'msa_definitions',
'msas',
'ncdb',
'states',
'tracts_1990',
'tracts_2000',
'tracts_2010']
Everything in datasets is a pandas (or geopandas) geo/dataframe. To access any of the data inside, just call the appropriate attribute/method (most datasets are methods). For example, to access the codebook, which outlines each variable in the data store, including its name, description, the original census sources/variable names, and the formula used to calculate it, you simply call datasets.codebook(). We support the same variable set as the Longitudinal Tract Database (LTDB).
datasets.codebook().tail()
| | variable | label | formula | ltdb | ncdb | census_1990_form | census_1990_table_column | census_2000_form | census_2000_table_column | acs | category | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | p_poverty_rate_black | percentage of blacks in poverty | p_poverty_rate_black=n_poverty_black / n_pover... | pbpov | BLKPR | NaN | NaN | NaN | NaN | NaN | Socioeconomic Status | NaN |
| 190 | p_poverty_rate_hispanic | percentage of Hispanics in poverty | p_poverty_rate_hispanic=n_poverty_hispanic / n... | phpov | NaN | NaN | NaN | NaN | NaN | NaN | Socioeconomic Status | NaN |
| 191 | p_poverty_rate_native | percentage of Native Americans in poverty | p_poverty_rate_native=n_poverty_native / n_pov... | pnapov | NaN | NaN | NaN | NaN | NaN | NaN | Socioeconomic Status | NaN |
| 192 | p_poverty_rate_asian | percentage of Asian and Pacific Islanders in p... | p_poverty_rate_asian=n_poverty_asian / n_pover... | papov | RASPR | NaN | NaN | NaN | NaN | NaN | Socioeconomic Status | NaN |
| 193 | n_total_pop | total population | NaN | pop | TRCTPOP | SF1 | P0010001 | SF1 | P001001 | B01001_001E | total population | NaN |
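Because the codebook is itself a dataframe, you can query it like any other. For example, here is a quick lookup of every variable tagged with the Socioeconomic Status category, using the column names shown above:

codebook = datasets.codebook()
# list the name and description of each variable in the Socioeconomic Status category
codebook[codebook.category == 'Socioeconomic Status'][['variable', 'label']]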
You can also take a look at the dataframes themselves or plot them as quick choropleth maps:
datasets.tracts_2000().head()
| | geoid | median_contract_rent | median_home_value | median_household_income | median_income_asianhh | median_income_blackhh | median_income_hispanichh | median_income_whitehh | n_age_5_older | n_asian_age_distribution | ... | p_vacant_housing_units | p_veterans | p_vietnamese_persons | p_white_over_60 | p_white_over_65 | p_white_under_15 | p_widowed_divorced | per_capita_income | year | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25009266400 | 662 | 172400 | 53314 | 48750 | 22500 | 0 | 53808 | 3065 | 7 | ... | 3.74 | 9.84 | 0.03 | 14.59 | 11.04 | 21.40 | 17.61 | 24288 | 2000 | POLYGON ((-70.91489900000001 42.886589, -70.90... |
| 1 | 25009267102 | 653 | 169200 | 50739 | 46250 | 0 | 0 | 51264 | 4311 | 23 | ... | 3.22 | 13.42 | 0.06 | 13.77 | 9.67 | 21.02 | 18.91 | 20946 | 2000 | POLYGON ((-70.91489900000001 42.886589, -70.91... |
| 2 | 25009266200 | 662 | 163200 | 49315 | 90457 | 101277 | 26250 | 48150 | 5131 | 37 | ... | 3.30 | 9.10 | 0.07 | 13.44 | 10.63 | 22.38 | 20.74 | 21817 | 2000 | POLYGON ((-70.93079899999999 42.884589, -70.92... |
| 3 | 25009267101 | 624 | 179200 | 45625 | 0 | 54545 | 38750 | 44750 | 3011 | 11 | ... | 42.08 | 9.96 | 0.00 | 19.89 | 14.78 | 16.97 | 27.61 | 22578 | 2000 | POLYGON ((-70.8246893731782 42.87092164133018,... |
| 4 | 25009266100 | 569 | 215200 | 60677 | 48750 | 43750 | 32500 | 61224 | 3643 | 20 | ... | 3.74 | 10.31 | 0.23 | 14.44 | 10.77 | 22.19 | 15.48 | 28030 | 2000 | POLYGON ((-70.97459559012734 42.86775028124355... |

5 rows × 192 columns
datasets.states().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x128b5bdd8>
from matplotlib import pyplot as plt

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
axs = axs.flatten()

# map median household income for Washington, DC (state FIPS code 11) in each census year
for ax, year, tracts in zip(
        axs,
        [1990, 2000, 2010],
        [datasets.tracts_1990(), datasets.tracts_2000(), datasets.tracts_2010()]):
    dc = tracts[tracts.geoid.str.startswith('11')].dropna(subset=['median_household_income'])
    dc.plot(column='median_household_income', cmap='YlOrBr', k=6, scheme='quantiles', ax=ax)
    ax.set_title(year)
    ax.axis('off')
As mentioned above, you can save these data locally for better performance using geosnap.io.store_census, which will download two quilt packages totaling just over 400 MB (a small footprint considering how much data those files contain). Once data are stored locally, you won’t need this function again unless you want to update your local package to the most recent version on quilt.
from geosnap.io import store_census
store_census()
Using geosnap’s built-in data, researchers can get a jumpstart on neighborhood analysis with US tract data, but census tracts are not without their drawbacks. Many of geosnap’s analytics require that neighborhood units remain consistent and stable in a study area over time (how can you analyze neighborhood change if your neighborhoods are different in each time period?), but with each new decennial census, tracts are redrawn according to population fluctuations. geosnap offers two methods for dealing with this challenge.
First, geosnap can create its own set of stable longitudinal units of analysis and convert raw census or other data into those units. Its harmonize module provides tools for researchers to define a set of geographic units and interpolate data into those units using modern spatial statistical methods. This is a good option for researchers who are interested in the ways that different interpolation methods can affect their analyses, or those who want to use state-of-the-art methods to create longitudinal datasets that are more accurate than those provided by existing databases.
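To make that concrete, here is a minimal sketch of the kind of areal interpolation that harmonization relies on, written against tobler (the PySAL interpolation library that geosnap builds on). The harmonize module wraps this workflow with additional options, so treat this as an illustration under stated assumptions rather than geosnap's exact interface:

from geosnap import datasets
from tobler.area_weighted import area_interpolate

# interpolate 1990 tract attributes into 2010 tract boundaries for Washington, DC;
# project to a planar CRS first so areal weights are meaningful
tracts90 = datasets.tracts_1990()
tracts90 = tracts90[tracts90.geoid.str.startswith('11')].to_crs(epsg=5070)
tracts10 = datasets.tracts_2010()
tracts10 = tracts10[tracts10.geoid.str.startswith('11')].to_crs(epsg=5070)

# counts (extensive variables) are allocated proportionally to overlapping area;
# rates and medians (intensive variables) are computed as area-weighted averages
harmonized = area_interpolate(
    source_df=tracts90,
    target_df=tracts10,
    extensive_variables=['n_total_pop'],
    intensive_variables=['median_household_income'],
)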
Second, geosnap can simply leverage existing data that have already been standardized into a set of consistent units. The io module provides tools for reading and storing existing longitudinal databases that, once ingested, will be available in the data store and can be queried and analyzed repeatedly. This is a good option for researchers who want to get started modeling neighborhood characteristics right away and are less interested in exploring how error propagates through spatial interpolation.
Storing Data from External Databases¶
The quickest way to get started with geosnap is by importing pre-harmonized census data from either the Longitudinal Tract Database (LTDB), created by researchers from Brown University, or the Neighborhood Change Database, created by the consulting company Geolytics. While licensing restrictions prevent either of these databases from being distributed inside geosnap, LTDB is nonetheless free. As such, we recommend importing LTDB data before getting started with geosnap.
Longitudinal Tract Database (LTDB)¶
The Longitudinal Tract Database (LTDB) is a freely available dataset developed by researchers at Brown University that provides 1970-2010 census data harmonized to 2010 boundaries.
To store LTDB data and make it available to geosnap, proceed with the following:
1. Download the raw data from the LTDB downloads page. Note that to construct the entire database you will need two archives: one containing the sample variables, and another containing the “full count” variables.
2. Use the dropdown menu called “select file type” and choose “full”; in the dropdown called “select a year”, choose “All Years”.
3. Click the button “Download Standard Data Files”.
4. Repeat the process, this time selecting “sample” in the “select file type” menu and “All years” in the “select a year” dropdown.
5. Note the location of the two zip archives you downloaded. By default they are called LTDB_Std_All_Sample.zip and LTDB_Std_All_fullcount.zip.
6. Start ipython/jupyter, import geosnap, and call the store_ltdb function with the paths of the two zip archives you downloaded from the LTDB project page:
from geosnap.io import store_ltdb
# if the archives were in my downloads folder, the paths might be something like this
sample = "/Users/knaaptime/Downloads/LTDB_Std_All_Sample.zip"
full = "/Users/knaaptime/Downloads/LTDB_Std_All_fullcount.zip"
# uncomment to run
#store_ltdb(sample=sample, fullcount=full)
That function will extract the necessary data from the archives, calculate additional variables using formulas from the codebook, create a new local quilt package for storing the data, and register the database with the datasets module. After the function has run, you will be able to access the LTDB data as a long-form geodataframe by calling the ltdb attribute from the data store. As with the store_census function above, this only needs to be run a single time to save the data as a local quilt package and register it with geosnap. You won’t need to store the data again unless there’s an update to the variable formulas in the codebook.
#datasets.ltdb.head()
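Once stored, the long-form frame can be queried like the tract data above. Here is a small usage sketch (assuming the geoid and year columns shown earlier; uncomment after running store_ltdb):

# ltdb = datasets.ltdb
# dc_1990 = ltdb[ltdb.geoid.str.startswith('11') & (ltdb.year == 1990)]
# dc_1990[['geoid', 'year', 'median_household_income']].head()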
Geolytics Neighborhood Change Database¶
The Neighborhood Change Database (NCDB) is a commercial database created by Geolytics and the Urban Institute. Like LTDB, it provides census data harmonized to 2010 tracts. NCDB data must be purchased from Geolytics prior to use. If you have a license, you can import NCDB into geosnap with the following:
1. Open the Geolytics application.
2. Choose “New Request”.
3. Select CSV or DBF.
4. Make the following selections:
   - year: all years in 2010 boundaries
   - area: all census tracts in the entire United States
   - counts: [right click] Check All Sibling Nodes
5. Click “Run Report”.
6. Note the name and location of the CSV you created.
7. Start ipython/jupyter, import geosnap, and call the store_ncdb function with the path of the CSV:
from geosnap.io import store_ncdb
ncdb_path = "~/Downloads/ncdb.csv"
# note this will raise several warnings since NCDB does not contain all the underlying data necessary to calculate all the variables in the codebook
# uncomment to run
# store_ncdb(ncdb_path)
As with above, you can access the Geolytics data through the ncdb attribute of the datasets module:
#datasets.ncdb.head()
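As with LTDB, the stored table can then be filtered like any other dataframe, for example (a hypothetical query, assuming the same geoid convention as the tract data above):

# dc_tracts = datasets.ncdb[datasets.ncdb.geoid.str.startswith('11')]
# dc_tracts.head()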