Forum OpenACS Q&A: gps to zipcode query: postgres or postgresql + postgis or how?

I would like to map from GPS coordinates to US Zipcode for US Continental based needs.

(Yes, I am excluding perhaps Hawaii and Alaska for now, and I am also assuming that a simple 2D geometry, assuming the GPS coordinates are Euclidean and not spherically based will be good enough to find the closest zip code.)

So assuming something simple like

create table zipcode (
  zcode  integer primary key,
  loc    point
);

I have the zipcode data from the postoffice and it contains lat, lon.  (As I'm sure you guys know....)

I notice that Postgresql already supports the point type.  And I figured how to enter points into the db.

What I haven't figured out how is to make a query that selects other rows that are closed to my target.

select zcode, loc from zipcode where loc ## 'userx, usery';

That seems to fail.

Can someone tell me:

A)  Can I do this in postgresql without using postgis, and if so, how? (and what indices should I create?)
B)  Where a good reference to making queries with the postgres native geometric types can be found?

I'm happy to use postgis, I just don't want to toss an a-bomb after a fly.  If I implement postgis and do this in postgis, would it be faster than the "native" postgres query, and if so, why?

And how would I do this postgis and what sort of index would I need to create?

And hell, all I need is a mapping the fastest way possible.  Given that I have the zipcode => (lat, lon) data, anyone know of any C code to do a spatial search?

Thanks,

point_distance is the function I am looking for. Hmm, near as I can tell, point_distance and most of the functions listed by \df are not documented anywhere.... Well, the price is right.

Given the table above, the following non ANSI query will find the zipcode closest to your location ($X, $Y).

  select code from (select code, point_distance(loc, '$X, $Y) as dist
                      from zip
                  order by dist
                     limit 1) as tbl;
It's still not clear what the best indices to build are or how much faster or slower the typical C function to do this would be. Future exercise....
It may also be worth trying changing Jerry's inner select to something like this

select code, point_distance(loc, :X, :Y) as dist
  from zip
 where @ (:X - loc[0]) < x_delta_max
   and @ (:Y - loc[1]) < y_delta_max
 order by dist
 limit 1

to avoid doing an exact distance calculation unless the zipcode you're checking is relatively close to the point you're testing. [xy]_delta_max should be tuned depending on the maximum distance between centers of adjacent zipcode areas in the US.

This may of course have no noticable effect on performance, but it's an obvious first optimisation.

Ah, thank you, good point!

I was walking around tonight and got to thinking that perhaps the data is being stored all wrong. I want to know the closest zip code, but I don't care at all how far away I am from the center of that zip code.

Perhaps instead of storing X, Y, I should store X**2, Y**2, and then write my own distance function. Something like:

metric(zipx2, zipy2, x, y) {
   return (zipx2 - x**2) + (zipy2 - y**2)
}
This eliminates two multiplications and one square root for each measurement.

The question (that I will not be empirically answering until this thing becomes way more successful than it ever will) is if this function (as a pl/sql function) would be faster than the built-in point_distance function.

To further optimize the math, the lat/lon information can probably be represented by a Postgres INT2 representation (by scaling the actual values appropriately.) That should allow for all integer arithmetic to be used.

Jerry,
Just a FYI, acs-reference includes the zipcode package.
Yes, thanks Jon.

I've gone through all the threads I could find figuring out what the current state of zipcodes and OACS is.  It's not a question of my getting the data, it's more a matter of my understanding how to work with the data.  It's been very instructional though -- If AK is Alaska, it turns out that MH is Marshall Islands, PW is Palau, and a tour of various island territories....

I guess what really annoys me is that if we want current data we have to pay for it.  Haven't we already paid to have it created once?  And it's not like it's some supersecret or hard to determine or constantly changing information.  We have to pay for accurate zipcode information?  That's BS.

I'm still trying to figure out some postgres magic.  Given a table with x and y columns, I know how to generate a point.  point (x, y).  But what's the inverse?  Given a point, how do I generate x and y?

a point can be treated like a 2-element array, so somepoint[0] and somepoint[1] refer to the X and Y values.
Jerry,

I don't know if you are aware that there is a software package in the PostgreSQL contrib tree that might help you (or that do what you want already, not sure).

Description: Additional facilities for PostgreSQL
 The PostgreSQL contrib package provides several additional features for
 the PostgreSQL database. contrib often serves as a testbed for features
 before they are adopted into PostgreSQL proper:
.
  earthdistance  - Operator for computing the distance (in miles) 
                   between two points on the earth's surface

-Roberto

Russell, thanks for the tip.  That worked fine.

Roberto, I took a look at earthdistance and did some experimenting....  Thanks for pointing this out.  It actually helped me catch an error in my tables where I swapped lat and long.

The USPS gives little information about how much area zipcodes can represent.  The question here is if I need to use earthdistance (which uses a lot of float multiplication and division as well as 5 sin and cos and one acos operation for each measurement) or whether some bad assumptions about our flat planet would suffice and let me use point_distance which should require no trig and probably takes two mults and one square root per measurement.

Googling for a table of State sizes in square miles, and dividing that by the count(*) of zips in that state revealed that except for Alaska, the average size of a zip code is 100 sq. miles with a max of 492 sq miles and a std. deviation of 113 sq miles.

Alaska has an avg zip code size of 2120 sq. miles.

Also, if you examine the max and min latitudes of the zipcodes and you haphazardly guess that Washington or Maine have the highest in the continental US, then it turns out that the US is bounded by about 24 degrees in Florida and about 50 degrees somewhere in Washington.

In Washington, geodistance shows that one degree of latitude is about 44 miles.  In Florida, it's 60 miles, and at the top of Alaska, it's only 26 miles.

So in the lower 48, the average zip code area is 10 miles on a side and one degree of latitude covers 44 miles at the smallest.  Zip codes within the first std. deviation (is it right that that would be 68% of all zip codes?) are root(100 + 113) on a side (assuming square zip code shapes) and zip codes in the first three std. deviations (99%?) are root (100 + 113 + 113 + 113) = 21 miles.

The world IS flat!  As far as this application is concerned, it's safe to ignore geo_distance and just use point_distance, and, in the optimization above, a number like .5 is probably conservative.

And Alaska (a state I really love and I encourage you to visit)  Well as far as this app goes, if you're not within 22 miles of the center of a zip code, well you probably still are interested in the services in that zip code.

So I'll have to be concerned about that, uh, someday.  First thought at that solution, is that if the first query returns no rows, then check to see if you're in the Alaskan zone and rerun it without the the delta_x optimization.

This might be off-topic. Then again, it might be exactly related. It looks like some of the data you're looking for is available:

http://www.google.com/programming-contest/winner.html

Thanks.

I've come across Egnor's work before.  In fact I am almost positive I found a link to it here, but I cannot find it now (I'm also getting some sort of error when I search for zip (but not when I search for zipcode)).

But uh, I wasn't sure how to make use of Egnor's work.  My take was that it was more a work for Google to implement than one of us.  It involves, if I understood it correctly, crawling the page for patterns that make up addresses, and then associating the 3D location of that address to that page.

So if your query is "show me the stores within 20 miles of here", there is no way that you or I can implement that, that has to be done at google.  I believe.

As a smaller matter, he also acknowledged there was some work to be done on his address patterns, noting IIRC, that while Google lists three addresses on their contact form, his address parser only picked up one of them.

You do have to pay for the latest zip code information. I am not sure why.

The US Census made available a 1999 version of the zip code information. I am using it for a couple of things as it is close enough for my needs still.

http://www.census.gov/geo/www/tiger/zip1999.html
ref-us-zipcodes already includes the 1999 info.
I just stumbled across a new source of zip code, long lat (and population, housing, ...) data. The data is as of the 2000 Census with a vague impression that it will be kept current.

http://www.census.gov/geo/www/gazetteer/places2k.html

These aren't zip codes per se, these are zip code tabulating areas:

ZIP Code Tabulation Areas (ZCTAsTM) are a new statistical entity developed by the U.S. Census Bureau for tabulating summary statistics from Census 2000. This new entity was developed to overcome the difficulties in precisely defining the land area covered by each ZIP Code®. Defining the extent of an area is necessary in order to accurately tabulate census data for that area. ZCTAs are generalized area representations of U.S. Postal Service (USPS) ZIP Code service areas. Simply put, each one is built by aggregating the Census 2000 blocks, whose addresses use a given ZIP Code, into a ZCTA which gets that ZIP Code assigned as its ZCTA code. They represent the majority USPS five-digit ZIP Code found in a given area. For those areas where it is difficult to determine the prevailing five-digit ZIP Code, the higher-level three-digit ZIP Code is used for the ZCTA code. For more information, please refer to the ZCTA (FAQ) Frequently Asked Questions Web page.
If you are looking to associate a name with a zip code, that data comes from the USPS. If you would like to associate a lat, long pair with a zip code tabulating area, that can come from the Census.

If you are looking to join the two to find the name most likely associated with a lat/long that sounds like a webservice.

There is some interesting data in the new files. It reveals a whole slew of exciting new ZIP codes, such as the 351XX ZIP code for one. Secret government dachas? Witness Protection Program? Adults only zone? Eminem free zone? Where all the blue food is?
Jerry, you mentioned PostGIS at the very beginning of this thread, but you thought maybe it was overkill for this GPS to zip code mapping stuff you want to do. Any further thoughts on that? Did you end up playing with PostGIS at all? Any comments on whether or when it would be advantageous to use PostGIS or not, vs. other methods?