|Zip, City, State, County Data|
US data providers
I am looking around for a data provider that can provide me with an up to date list of zip, city, state, county, & lat/long to integrate into a local search function on a site. Basically so someone can ask for the site to list all locations within so many miles of a specified zip code. I have found a few providers that sell the information in a format that can easily be ported into MySQL but they seem to be getting their info from the 2000 Census. Is that really up to date enough? I know most of this information can't change tht often, but still 5 years old seems too old.
A great deal of information is avaiable for free from the US government census site. The age of the data is, as you state 2000, for cities and towns this is OK but for zipcode related items you would want a regularly updated dataset (which will cost money if you intend to use it on a commercial website).
If you don't need completely up to date data then the geography pages like [census.gov...] will help.
Getting good, up to date data can be expensive from a purchasing and maintenance point of view, if you can live with some changes then I'd stick with the free data.
As for how often data changes, I know that in the UK there are changes to postcodes almost constantly. New housing developments are the obvious ones but there are re-classifications too. It's a nightmare.
Give [melissadata.com...] a try. They have multiple tools that might help/
[edited by: Woz at 10:45 am (utc) on July 18, 2005]
[edit reason] made link live [/edit]
We bought zipcode data from a paid provider and still received many complaints about cities/towns not included.
I'm not sure if they're any better than free -- YMV.
I use [zipcodedownload.com...] for that particular need.
Yea, the free stuff might work, but I don't have the time to keep it even close to up to date. Just gonna have to pay.
voltrader, thats what I was afraid of.
Thank Rusty, that one seems like they do some decent updates and are cheap. As long as they are inexpensive I can afford to test the data. I just didn't want to pay $100-400 for data I can't test.
Now the hard part. If the info is in constant flux, how is everyone updating their records from a DB. If it was constant, each row of the table would have an id and each biz would be assigned one of those id's. But since this information will change, and the changed information will come in another bulk format, thats not really possible. I was thinking to have each biz have their location info in their own seperate table, with the zip/city/etc data in a table of it's own. The zip code data will basically only be used for distance searches. When I go to update the zip/city/etc data, write a small program to do a compare between the new data and old data and flag any rows that have changed or are new. From there it will compare and flag any biz that contains the same location info so they can be manually reviewed and updated if needed.
Anyone using any thing more efficient they could share?
A good data provider should supply information on changes, these are usually flagged well in advance so you can cope with large re-classifications. You should not have to do the comparisons yourself, all you should need to do is update the codes you are told are changing.
As for how to cope with change, you are in luck. If you are using zipcode data for locations then a re-classification does not change the physical location of the business significantly, you just need to keep on top of the zips that you use to lookup user location. You should still take care with your DB though, always try to keep it up to date (and backed up).
I'm not experienced enough with zipcodes vs postcodes to give you the best way to handle changes in zipcodes but I'm sure there will be someone here who can.
If you have an address for each location, there are plenty of free geocoding services.
a "local search" service implies you already have the locations but need a lat/long associated with them in order to generate a geospatial search.
See this for more info:
It is a google maps thread about someone asking how to do a geospatial search.
Geocoder.us is one of the free sites available to do free geocoding.
Yes but those only work for calculating distances from exact locations and knowing the lat/long of exact locations. For now, I am calculating distances from a zip code and listing all other businesses in that radius.
The more ideal solution would be to have people be able to search from a certain radius of a town since that is how people think about location, but that gets more complicated. It's on the list though.
Couple of things to add to the caveat about paid services:
- we did buy a service which included periodic updates, but the niche we were dealing with turned out to include many businesses in rural areas which tended to be undercovered. If you're starting with major metros, you probably won't have to worry about missing data.
- if I had to do it again, I would cross-ref the latest census data with the paid to try to plug holes.
- the big circle calc can be S - L - O - W if you're using a script to search thorugh the matrix of possibilities. Find ways to optimize it. A compiled plug-in?
Just thinking out loud...
but you can use the center points of the zipcodes you intend to use and define how far out you want to go from the center point. If it is those centerpoints you need, they come prepackaged with the Geo::PostalCode perl modules: [search.cpan.org...]
If you don't use perl, you can get just the text file with the data here:
[tjmather.com...] (url can change when postal codes are updated)
An easier option would be to simply have a postal code associated with each location and make it a simple "keyword"-based search (if user select "12111," show them results in "12111" only). However, if you add geospatial search, zipcode boundaries won't matter and it can give you the "true" closest locations. Keeping the geospatial search allows for building in the town-based searches you want to add later...towns will just have lat/longs associated with them and the search would happen the exact same way.
If you want some help building this, ping me with details.
Yea, but those are still really old. I ended up just purchasing a db from the one rusty recommended. For only a few bucks, there is no reason not to buy one that is up to date. I'll let you all know if I find it wasn't worth the cash.
Also on the slow aspect, I tried a few different distance calc functions and most where slow,but I found one here [micahcarrick.com...] that can pump out all zips within 100 miles of Dallas in half a second and all zips within 1000 miles in 2 secs.
I read some places that the speed can be increased even more by having the database itself do the calcs, but I haven't gotten that deep into MySQL to be able to do that yet.
The information I have gotten will also allow me to run the calcs by city name, but that needs to wait until my buisness database is properly populated because there will need to be some decent coding to catch mispellings of city names and the ability to suggest the correct ones. I know I can't spell so I can't expect my visitors to :D
The speed of distance calculations can be increase dramatically if you get rid of the need for great circle calculations.
If you take the UK as an example, it's not big enough for the curvature of the earth to make hugely significant differences to the calculations so I use the grid coordinate system adopted by many others. This means you can pre-populate x and y coords for all points of interest and then run a very simple database query.
Here is the query I use on a live site. It is able to sort and show the first 25 results from a 600 branch network by distance from the visitor in 0.02 seconds! (it is also matching only those branches that cater for a particular trade, so it's possible to use the $trade variable as a filter to suit your needs)
$sqlquery = "SELECT * FROM yourtable WHERE $trade='yes' ORDER BY (sqrt((Xcoord - $xcust)*(Xcoord - $xcust) + (Ycoord - $ycust)*(Ycoord - $ycust))) LIMIT 25";
The problem with large countries like the US is that the curvature of the earth means that grid coordinates tend not to work well. You can cheat if you are only looking at local results (maybe 100 miles) by splitting up the data into 400 mile segments which overlap by 200 miles. This allows you to create accurate grid coordinates in each area and use the above query. There would have to be a simple calculation to see which 400 mile square is the best to use for any query but run-time calculations would be reduced considerably.
There would be additional work in creating and maintaining the data but you could automate that. If you find your local calculations are loading your server too much because of great circle issues this seems to be a good solution.
I forgot to mention that the 400 mile squares would have to have coords referenced from their centre (quite obvious but could be missed). This means the lat and long for the centre of the area serves as the reference multiplier for x and y distances in miles from that point (don't do a mini version of the great circle within the area)