Loading Geographic Data in a Format You Can Actually Use

Written 2023-02-18 — Updated 2023-02-25

In the previous article we covered the GeoJSON format, which is the most common format encountered when working with geographic data in JavaScript.

But geographic data often comes in specialized formats which are not seen elsewhere. Like much else in this world, it can seem intimidating at first, but with a little bit of knowledge you can do quite a lot. Let's take a look.

Data Formats

Shapefiles

Shapefiles are the most common format for publishing geographic data. They were invented in the early 1990s by ESRI, a well-known maker of geographic information system (GIS) software. This format has since been adopted by most other publishers of geographic data.

A shapefile set consists of a few different files, which are differentiated by their extension.

The .shp file holds the actual geometries for each shape in the shapefile. The format is binary, but the shape types it supports roughly match those supported by GeoJSON. You'll find that this overlap occurs for most formats.
The .dbf file holds attributes for each feature, such as the name or a numeric identifier. This is a dBase database file, which is rarely encountered nowadays, but was popular in the early 1990s when the shapefile format was invented.
The .shx file provides a quick lookup to a location of a particular shape inside the .shp file. This can be useful for seeking to a particular shape in a shapefile, but is generally not needed for conversion purposes.
The optional .prj file contains information about what type of projection the data is in. This assists map renderers in displaying the data, and is crucial when converting it into another format. I'll discuss projections in a later section.
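
To make the file layout above a little more concrete, here is a minimal sketch that decodes the fixed 100-byte header at the start of a .shp file. The function and helper names are my own, and the sample header is built in memory with made-up values rather than read from a real download, so treat it as an illustration and not as a replacement for a real shapefile reader.

```python
import struct

# Shape type codes stored in the .shp header (2D types only).
SHAPE_TYPES = {0: "Null Shape", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def read_shp_header(data: bytes) -> dict:
    """Decode the fixed 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # The file code and file length are big-endian; the rest is little-endian.
    (file_code,) = struct.unpack(">i", data[0:4])
    if file_code != 9994:
        raise ValueError("unexpected file code, probably not a shapefile")
    (length_words,) = struct.unpack(">i", data[24:28])
    version, shape_code = struct.unpack("<ii", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    return {
        "file_bytes": length_words * 2,  # the length is stored in 16-bit words
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_code, f"code {shape_code}"),
        "bbox": (xmin, ymin, xmax, ymax),
    }


def build_sample_header() -> bytes:
    """Hypothetical helper: a header for an empty polygon file, for demonstration."""
    return (
        struct.pack(">i", 9994)        # file code
        + bytes(20)                    # five unused integers
        + struct.pack(">i", 50)        # file length: 50 words = 100 bytes
        + struct.pack("<ii", 1000, 5)  # version 1000, shape type 5 (Polygon)
        + struct.pack("<4d", -125.0, 24.0, -66.0, 50.0)  # made-up bounding box
        + bytes(32)                    # Z and M ranges, unused here
    )
```

With a real file you would pass in the first 100 bytes, for example `read_shp_header(open(path, "rb").read(100))`.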

Looking at a download from the US Census, we see these files.

$ ls -l
.rw-rw----    5 dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.cpg
.rw-rw---- 243k dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.dbf
.rw-rw----  165 dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.prj
.rw-rw---- 6.6M dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.shp
.rwxrwxrwx  16k dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.shp.ea.iso.xml
.rwxrwxrwx  33k dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.shp.iso.xml
.rw-rw---- 7.6k dimfeld  8 Apr  2022 cb_2021_us_cbsa_500k.shx

In addition to the file types mentioned above, you can see there are other auxiliary files which may be included with a shapefile set. These are mostly just other ways of indexing or describing the data, and are not useful when converting data away from the shapefile format, but the Wikipedia page describes them all if you're curious.

You probably will not have to deal with these individual files yourself. Shapefile importers do the work of combining the shapes from the .shp with the attributes in the .dbf and writing them out all together.
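
Conceptually, the combining step looks something like the sketch below: the Nth geometry from the .shp is paired with the Nth attribute row from the .dbf. The function name and the sample values are invented for illustration, and real importers also handle encodings, null shapes, and type conversion.

```python
def to_feature_collection(geometries, attribute_rows):
    """Pair each geometry with the attribute row at the same position."""
    if len(geometries) != len(attribute_rows):
        raise ValueError("expected one attribute row per geometry")
    features = [
        {"type": "Feature", "geometry": geometry, "properties": dict(row)}
        for geometry, row in zip(geometries, attribute_rows)
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up sample data standing in for the contents of a .shp and a .dbf.
sample_geometries = [
    {"type": "Point", "coordinates": [-74.0, 40.7]},
    {"type": "Point", "coordinates": [-87.6, 41.9]},
]
sample_rows = [
    {"geoid": "00001", "name": "Example Area A"},
    {"geoid": "00002", "name": "Example Area B"},
]
sample_collection = to_feature_collection(sample_geometries, sample_rows)
```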

Shapefiles for different regions can be found all over the internet. If you have a specific region in mind, government-related websites and universities are often a good place to look. For more general needs, here are a few of the best sources:

The World Bank provides shapefiles for every country.
ArcGIS Hub is one of the best sources for shapefiles for international use.
The US Census publishes detailed shapefiles for every region in the USA.

Many shapefile datasets require attribution, and some place limits on commercial use, so be sure to comply with the terms when publishing your maps.

US Census

If you're making maps in the USA, this is one of the best data sources. The US Census Bureau not only publishes demographic information on the USA, but also detailed mapping data covering the entire country.

TIGER/Line Shapefiles

The TIGER/Line Shapefiles dataset has detail down to the road and street level, and can be used to create just about any map of the USA that you want.

PostgreSQL/PostGIS also has facilities to use this data for geocoding, which is the process of converting between addresses and latitude/longitude coordinates. If you want to load this data, your best bet is to use the scripts provided by PostGIS.

For most applications, you don't actually need this level of data, and if you want to show streets in your map it's much simpler to use a tile layer — the background layer of the map — that includes them. We'll discuss doing this in a later article.

Cartographic Boundaries

If you don't need the TIGER/Line dataset's extreme level of detail, the Cartographic Boundary Shapefiles are for you. These are much smaller datasets which show region boundaries in the USA, and come in three different scales, or levels of detail.

These shapefiles cover many different types of regions that the census works with, from states down to block groups. Counties and states are a good place to start since people are familiar with them.

Core Based Statistical Areas (CBSAs) are less well-known, but also a good choice for some applications. They are formed around regions in which people actually live, as opposed to county boundaries which don't always relate to actual population distributions. CBSAs do not cover the whole country though. Each one is placed around an urban center with at least 10,000 people, so many rural areas will not be inside a CBSA.

Finally, ZIP Code Tabulation Areas (ZCTAs) are mostly — but not exactly — like ZIP codes. ZIP codes sometimes have weird shapes for mapping since they're just designed for use by the Postal Service, but they can be useful if you need more granular shapes for regions that cover the entire country, and also want to use something that the average non-GIS expert is familiar with.

I'll cover the various types of US regions and how each one is actually defined in a later article in this series.

KML

Keyhole Markup Language (KML) is an XML-based format which was invented for Google Earth. KML tends to compress a bit better than shapefiles, but the uncompressed files are much larger.

Most sources of KML also publish the same data in shapefile format, so you won't often need to work with it.

GPS Data

GPS data comes in many formats. GPX is one of the most common and is based on XML. Garmin's FIT format is also popular, especially in fitness-centered applications.

Geographic data is usually not published in these formats, but you are likely to encounter it if you create an application that lets users upload GPS data from their own activity.
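
Because GPX is plain XML, a small upload can be handled with a standard XML parser. The sketch below assumes a GPX 1.1 file containing track points and turns them into a GeoJSON LineString. The sample document and function name are made up, and a real application would also want to handle routes, waypoints, and timestamps.

```python
import xml.etree.ElementTree as ET

# Namespace used by GPX 1.1 documents.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# A tiny made-up track for demonstration.
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="40.7128" lon="-74.0060"><ele>10.0</ele></trkpt>
    <trkpt lat="40.7138" lon="-74.0050"><ele>12.0</ele></trkpt>
  </trkseg></trk>
</gpx>"""


def gpx_to_linestring(gpx_text: str) -> dict:
    """Collect every track point into a single GeoJSON LineString geometry."""
    root = ET.fromstring(gpx_text)
    coordinates = [
        # GeoJSON orders coordinates as [longitude, latitude].
        [float(point.get("lon")), float(point.get("lat"))]
        for point in root.iterfind(".//gpx:trkpt", GPX_NS)
    ]
    return {"type": "LineString", "coordinates": coordinates}
```

Note the coordinate order: GPX stores lat and lon as separate attributes, while GeoJSON expects longitude first.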

OpenStreetMap

The OpenStreetMap format is used by the organization of the same name. You generally won't encounter this data anywhere except from OpenStreetMap, but it's an excellent source if you need global mapping with features down to the street level.

Like the US Census TIGER/Line dataset, it is extremely large. But if you do need this level of detail in a smaller region than the whole Earth, there are ways to create smaller regional extracts.

I won't cover importing their data in this post, but the OpenStreetMap Wiki has a good guide.

Coordinate Systems and Projections

Before loading the data, it helps to have a basic understanding of projections.

Coordinate systems and projections are a complex topic worth an article (or more!) unto themselves, but put simply, a projection defines how to translate between latitude/longitude coordinates and actual points on the two-dimensional map.

Many projections have an "SRID" registered with the EPSG registry, and you'll need this ID when loading your data.

Fortunately, once you've identified the projection, you don't really have to deal with the details or math yourself — if you set the proper SRID, geography libraries such as PostGIS will do this hard work for you.

Coordinate Systems

Some SRIDs only reference coordinate systems. A coordinate system (or spatial reference system) indicates how latitude/longitude coordinates translate to exact spots on our not-quite-spherical Earth.

GeoJSON uses the WGS84 spatial reference system, which has SRID 4326. This is one of the most widely-used coordinate systems.

US Census shapefiles use SRID 4269, which indicates the NAD83 coordinate system. This is close to WGS84 but has a few small differences, and specifically accounts for North American tectonic plate movement over time. It differs from WGS84 by a few feet, which means that for most applications the difference is negligible.

Projections

Coordinate systems by themselves do not define a particular projection; that is, they don't specify a way to translate from coordinates to a 2D map.

This is where projections come in. Some attempt to depict the entire Earth while making reasonable tradeoffs about the distortions that necessarily come with that translation. Mercator, Albers, and Goode are common examples of these types of projections.

Other projections are designed for smaller areas. For example, the UTM18 projection (SRID 32618) is specifically designed to accurately portray areas in the northern hemisphere, between 72 and 78 degrees West.
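
The UTM SRIDs follow a regular pattern, so you can work out which one covers a given point. Here is a small sketch, assuming the WGS84-based UTM codes (326xx for the northern hemisphere, 327xx for the southern) and ignoring the special-case zones around Norway and Svalbard. The function names are my own.

```python
import math


def utm_zone(lon: float) -> int:
    """UTM zones are 6 degrees wide, numbered 1 to 60 starting at 180 degrees West."""
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1


def utm_srid(lon: float, lat: float) -> int:
    """SRID of the WGS84 UTM zone containing the point (standard zones only)."""
    base = 32600 if lat >= 0 else 32700
    return base + utm_zone(lon)
```

For example, a point at 75 degrees West in the northern hemisphere falls in zone 18, which gives SRID 32618.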

When you have data that uses one of these projections, the points are not in longitude/latitude coordinates, but have been pre-converted to the locations where they would show up on a map or screen.
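
To see what "pre-converted" coordinates look like, here is a sketch of the spherical Web Mercator formulas (SRID 3857). I'm using Web Mercator instead of UTM only because the math is short. In practice you would let PostGIS or a similar library do this, and the function names here are my own.

```python
import math

# Web Mercator treats the Earth as a sphere with this radius, in meters.
EARTH_RADIUS_M = 6378137.0


def lon_lat_to_web_mercator(lon: float, lat: float) -> tuple:
    """Project longitude/latitude in degrees to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def web_mercator_to_lon_lat(x: float, y: float) -> tuple:
    """Inverse of the projection above: meters back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat
```

Projected values are in meters and can run into the millions, which is one quick way to spot data that is not in longitude/latitude.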

This means that it's especially important to get the right SRID for the projection, since the data will look totally wrong otherwise. Fortunately, pasting the contents of the .prj file into Google usually brings up the answer if you aren't sure.
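
If you look up the same few projections over and over, a small helper can save the search. The sketch below pulls the top-level name out of the .prj text and checks it against a hand-maintained table. The table only contains three Esri-style names that I'm confident about, the sample string is illustrative and not copied from a specific download, and the helper is hypothetical, so a miss just means "go look it up".

```python
import re

# Hand-maintained lookup from Esri-style names to SRIDs; extend as needed.
KNOWN_NAMES = {
    "GCS_North_American_1983": 4269,
    "GCS_WGS_1984": 4326,
    "WGS_1984_UTM_Zone_18N": 32618,
}

# Illustrative .prj contents in the Esri WKT style.
SAMPLE_PRJ = (
    'GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",'
    'SPHEROID["GRS_1980",6378137.0,298.257222101]],'
    'PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]'
)


def guess_srid(prj_text: str):
    """Return a known SRID for the .prj text, or None if the name is not recognized."""
    match = re.match(r'\s*(?:GEOGCS|PROJCS)\["([^"]+)"', prj_text)
    if match is None:
        return None
    return KNOWN_NAMES.get(match.group(1))
```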

Loading Shapefiles

GIS professionals tend to use mapping software with native support for shapefiles. Web application developers usually do not, so we need to convert the data into a more suitable format.

PostgreSQL

PostGIS includes the shp2pgsql utility, which will read in a shapefile and dump out SQL to load it into a PostgreSQL database. It has various options on how to add the data, but the default is to create a new table with one column for each of the attributes in the .dbf file, and a column named geom to hold the shape itself.

First, you'll need to install PostGIS and enable it in your database, if you haven't already. The PostGIS website has installation instructions for both Windows and Mac. On Linux/Unix the system package manager should have everything you need.

On Mac you can also use the Homebrew package manager, or just install Postgres.app and set up its CLI tools.

Once you have PostgreSQL and PostGIS installed, you can create your database and enable the PostGIS extension.

$ psql -d postgres
psql (14.7 (Homebrew))
Type "help" for help.

postgres=# create database geodataapp;
CREATE DATABASE
Time: 189.309 ms
postgres=# \c geodataapp
You are now connected to database "geodataapp" as user "dimfeld".
geodataapp=# create extension postgis;
CREATE EXTENSION
Time: 1638.587 ms (00:01.639)

With this done, you can load your data.

$ shp2pgsql -h
# Shows all the options
# Omitted for brevity

# -s sets the SRID (4269 for US Census shapefiles)
# -D outputs in dump format, which is faster to insert
# -I generates an index on the geometry
# The input here is the US Census CBSA shapefile
$ shp2pgsql -s 4269 -D -I cb_2021_us_cbsa_500k.shp |
    psql -d geodataapp

If we look at the generated SQL instead of piping it directly into psql, we see something like this.

SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "cb_2021_us_cbsa_500k" (gid serial,
  "csafp" varchar(3),
  "cbsafp" varchar(5),
  "affgeoid" varchar(14),
  "geoid" varchar(5),
  "name" varchar(100),
  "namelsad" varchar(100),
  "lsad" varchar(2),
  "aland" float8,
  "awater" float8);
ALTER TABLE "cb_2021_us_cbsa_500k" ADD PRIMARY KEY (gid);
SELECT AddGeometryColumn('','cb_2021_us_cbsa_500k','geom','4269','MULTIPOLYGON',2);
COPY "cb_2021_us_cbsa_500k" ("csafp","cbsafp","affgeoid","geoid","name","namelsad","lsad","aland","awater",geom) FROM stdin;
   ... many lines of data here
CREATE INDEX ON "cb_2021_us_cbsa_500k" USING GIST ("geom");
COMMIT;
ANALYZE "cb_2021_us_cbsa_500k";

SQLite

Spatialite, a geography extension for SQLite, also includes tools for loading shapefiles. Its loader tool is called, appropriately enough, spatialite_tool.

# -i: import mode
# -shp: name of the shapefile (without extension)
# -d: database file to write into
# -t: table in the database to create
# -s: the SRID
# --charset: character encoding listed in the CPG file. Use utf8 if not sure.
$ spatialite_tool -i \
  -shp cb_2021_us_cbsa_500k \
  -d db.sqlite \
  -t cbsas \
  -s 4269 \
  --charset utf8

Spatialite seems sparsely documented and I've barely used it, but the cookbook is a good starting point. Once you have it set up, the SQL functions for querying the data have a lot of overlap with PostGIS.

GeoJSON

Depending on your needs, you may just convert your data to GeoJSON and use it directly in your application. This makes it harder to filter and aggregate the shapes inside a database, but it is definitely the simplest way, especially if you haven't set up a database.

The shapefile NPM package includes a command named shp2json that will read a shapefile and output a FeatureCollection containing all the data.

Wrapping everything in a FeatureCollection is nice because you can just parse the entire thing as a single JSON object. You can also pass the -n option to skip the FeatureCollection and write each Feature on its own line, as newline-delimited JSON.

$ npm -g install shapefile

# Write each Feature on its own line
$ shp2json -n cb_2021_us_county_20m.shp > counties.ndjson

# Write to a FeatureCollection
$ shp2json cb_2021_us_county_20m.shp > counties.json
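
On the reading side, the two output styles differ only slightly. As a minimal sketch, newline-delimited output can be parsed one Feature at a time and wrapped back into a FeatureCollection if needed. The sample text is made up and much smaller than real shp2json output, and the function names are my own.

```python
import json

# Made-up stand-in for the contents of a file like counties.ndjson.
SAMPLE_NDJSON = (
    '{"type":"Feature","properties":{"NAME":"Example A"},'
    '"geometry":{"type":"Point","coordinates":[-74.0,40.7]}}\n'
    '{"type":"Feature","properties":{"NAME":"Example B"},'
    '"geometry":{"type":"Point","coordinates":[-87.6,41.9]}}\n'
)


def read_ndjson_features(text: str) -> list:
    """Parse one GeoJSON Feature per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def wrap_features(features: list) -> dict:
    """Wrap a list of Features into a single FeatureCollection object."""
    return {"type": "FeatureCollection", "features": features}
```

The line-by-line form is handy for large files, since each Feature can be processed without holding the whole dataset in memory.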

Loading Other Formats

Using ogr2ogr

The open-source GDAL package provides a tool named ogr2ogr which can convert from and to almost any format. It's a little more awkward to use and has a lot of options, but if you encounter some unusual format it can probably convert it.

Generally it's worth using dedicated tools for the job though, as ogr2ogr's generalist nature can cause some quirks. One big exception is when your data is in an unusual projection and you're converting to GeoJSON.

Most shapefile conversion tools will not reproject the data, but ogr2ogr has options to do just that.

# -s_srs: projection to convert from (3857 is Web Mercator)
# -t_srs: projection to convert to
# -f: format to convert to
# Then the output destination, followed by the input file
$ ogr2ogr \
  -s_srs "EPSG:3857" \
  -t_srs "WGS84" \
  -f GeoJSON \
  output.json \
  shapes.shp

If you're using MySQL, ogr2ogr also appears to be the tool of choice for loading geographic data.

Loading GeoJSON into the Database

Sometimes you come across data that is already in GeoJSON format, but you want to load it into your database. While you can just use a jsonb column, this doesn't work well if you need to actually manipulate or filter the shapes with SQL.

This is one good use case for ogr2ogr, and since GeoJSON always uses the WGS84 coordinate system, you won't have to worry about projection issues. ogr2ogr can write directly into the database.

$ ogr2ogr \
  -f "PostgreSQL" \
  PG:"dbname=your-database-name user=your-username password=password" \
  data.json \
  -nln table-name-to-create

If you don't mind doing your load in two parts, you can also load the GeoJSON data into a temporary table, then use the ST_GeomFromGeoJSON function to convert the data and write it into another table. You'll have to unpack the Feature attributes into columns manually though.
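
A variation on this is to do the unpacking on the client and let ST_GeomFromGeoJSON convert each geometry as it is inserted. The sketch below only prepares the statement and parameter rows. The regions table, its name and geom columns, and the %s placeholder style (as used by the psycopg drivers) are all assumptions for illustration.

```python
import json

# Hypothetical target table: regions(name text, geom geometry).
INSERT_SQL = (
    "INSERT INTO regions (name, geom) "
    "VALUES (%s, ST_SetSRID(ST_GeomFromGeoJSON(%s), 4326))"
)


def feature_rows(collection: dict, name_property: str = "name") -> list:
    """Build one (name, geometry-as-JSON-text) parameter tuple per Feature."""
    rows = []
    for feature in collection["features"]:
        properties = feature.get("properties") or {}
        rows.append((properties.get(name_property), json.dumps(feature["geometry"])))
    return rows


# Made-up sample data.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Example Area"},
            "geometry": {"type": "Point", "coordinates": [-74.0, 40.7]},
        }
    ],
}
sample_rows = feature_rows(sample_geojson)
```

With a database driver you would then run something like `cursor.executemany(INSERT_SQL, sample_rows)`, adding one tuple element and one column per attribute you want to keep.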

KML

The easiest thing to do with KML is just convert it into a shapefile or GeoJSON, and then load the resulting data into your application.

As I mentioned above, most sources of KML also provide shapefiles, but if you have to deal with KML, ogr2ogr or the togeojson utility are good ways to handle it.
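
For a sense of what these converters do, here is a rough sketch that reads Point placemarks from a KML 2.2 document and emits GeoJSON Features. It is nowhere near a full converter: lines, polygons, styles, and folders are ignored. The sample document and function name are made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# A tiny made-up document for demonstration.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example Point</name>
      <Point><coordinates>-122.0822,37.4222,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def kml_points_to_features(kml_text: str) -> list:
    """Convert each Placemark that holds a Point into a GeoJSON Feature."""
    root = ET.fromstring(kml_text)
    features = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        coords = placemark.findtext("kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords is None:
            continue  # not a Point placemark; skipped in this sketch
        # KML writes coordinates as "lon,lat[,altitude]".
        lon, lat = (float(value) for value in coords.strip().split(",")[:2])
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        features.append({
            "type": "Feature",
            "properties": {"name": name},
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        })
    return features
```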

GPS Data

GPSBabel is a versatile tool for converting GPS data. It supports reading GPX, FIT, and many other formats, and can write to both shapefile and GeoJSON.

Next Steps

Now that you know how to load geographic data, how do you actually use it? The next article in this series will start to answer that question.

Also, many thanks to Sean Lynch for feedback on this article's initial draft, and to Bramus for the tip about ogr2ogr's ability to convert between projections.