Viva Lost Data: Making All Data Easy to Search and Share

December 4, 2012 2:19 pm PST | Open Data

The more people use data, the more useful it becomes. To this end, Socrata and others want to liberate data from the shackles of legacy data systems and ossified data silos and get it into the hands of people who can really use it.

Who these people are will vary based on the data. It may be researchers at universities for one dataset, Yelp for the next, and boaters looking for a launch site with restrooms and a beach for another. In each of these cases, we need a data “platform” that lets data providers open up their data and lets data consumers easily find and use it.

The Ideal Data Platform

When I think about this platform, I break it apart by how it will serve both data consumers and data producers.

The three main goals are to make it:

1. Easy to find: Provide services to help users find appropriate data sets.
2. Easy to share: Provide synchronization standards to make data portable so it can be replicated across many data stores, referenced by many catalogs, and kept in sync.
3. Easy to use: Provide APIs and ways for other services to run interactive queries on top of the data to simplify the creation of mobile applications.
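To make the "easy to use" goal concrete, here is a minimal sketch of the kind of interactive query a data consumer might run, using the boat-launch example from earlier. The rows, the field names, and the `query` helper are all hypothetical; they are not part of any existing API.

```python
# Hypothetical rows for a boat-launch dataset; field names are illustrative only.
LAUNCH_SITES = [
    {"name": "North Cove", "restrooms": True, "beach": True},
    {"name": "Mill Pier", "restrooms": True, "beach": False},
    {"name": "Sandy Point", "restrooms": False, "beach": True},
]


def query(rows, **filters):
    """Return every row whose fields equal all of the given filter values."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in filters.items())
    ]


# A boater looking for a launch site with restrooms and a beach.
matches = query(LAUNCH_SITES, restrooms=True, beach=True)
names = [row["name"] for row in matches]
```

A real query API would also need paging, richer comparisons, and a wire format, but the idea is the same: the consumer describes what they want and the platform does the filtering.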

For some reason, whenever I wrap my head around these goals, the word “network” pops into my head. I imagine a world where products like CKAN, Socrata and Junar work together so that any data added to one can be transported seamlessly to any of the others.

When I think about how to break this problem apart, I like to think about it in the somewhat familiar terms of “catalogs” and “data sources”:

Catalogs: Catalogs serve to organize, categorize and search datasets. An example of a pure catalog would be a site like http://www.data.gov/communities/node/42391/data. It aggregates data from several other open data sites.

Data sources: Data sources serve to actually expose datasets. They can provide downloads, API access or visualizations. An example would be something like White House visitor records: https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut.
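One way to picture the split is that a catalog holds only metadata records that point at data sources, and searches over that metadata. The sketch below assumes a made-up `DatasetEntry` record and `Catalog` class; the second entry and its URL are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """Catalog-side metadata; it points at a data source but holds no rows."""
    title: str
    source_url: str
    tags: tuple = ()


@dataclass
class Catalog:
    entries: list = field(default_factory=list)

    def search(self, term):
        """Case-insensitive match against each entry's title and tags."""
        term = term.lower()
        return [
            entry for entry in self.entries
            if term in entry.title.lower()
            or any(term in tag.lower() for tag in entry.tags)
        ]


catalog = Catalog([
    DatasetEntry(
        "White House Visitor Records Requests",
        "https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut",
        ("government", "visitors"),
    ),
    # Hypothetical entry, for illustration only.
    DatasetEntry("Boat Launch Sites", "https://data.example.org/boat-launches", ("recreation",)),
])

hits = catalog.search("visitor")
```

The catalog never touches the rows themselves; it only answers "where is the data?" and leaves downloads, APIs and visualizations to the data source.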

Efforts Towards Standardization

Several efforts are currently under way to establish standards for open data, each focusing on a different area of the problem. Here is a list of what I think we need to standardize over the next year or two:

1. Catalog federation: If one catalog knows about a dataset, other catalogs should be able to learn about the dataset and advertise it as well. This will more easily enable “super-catalogs” that can aggregate content.

2. Federated search: Federated catalogs should be able to search inside datasets in a standard way, so that a federated catalog can search the contents of datasets, rather than just the metadata.

3. Synchronizing datasets: We should enable data sources to create “mirrors” of other data sources. This lets analytics or research applications that need to run expensive operations on open data work against a mirror that always stays current.

4. Retrieving data through an API: As more mobile devices and web applications use open data, it becomes more important to have a small number of APIs that devices can use to access the data, rather than a large number of APIs with small data sets.
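As a rough illustration of catalog federation, the sketch below merges the listings of several catalogs into one “super-catalog”, keyed by the data source URL so that a dataset advertised by more than one catalog appears only once. The `federate` function, the record layout, and the example URLs are assumptions made for this sketch, not part of any proposed standard.

```python
def federate(catalogs):
    """Merge entries from several catalogs, keyed by data source URL.

    `catalogs` maps a catalog name to a list of entries, where each entry
    is a dict with "title" and "source_url" keys (a hypothetical layout).
    """
    merged = {}
    for catalog_name, entries in catalogs.items():
        for entry in entries:
            record = merged.setdefault(
                entry["source_url"],
                {
                    "title": entry["title"],
                    "source_url": entry["source_url"],
                    "listed_in": [],
                },
            )
            # Remember every catalog that advertises this dataset.
            record["listed_in"].append(catalog_name)
    return merged


# Hypothetical catalogs that both advertise the same boat-launch dataset.
boat_launches = {"title": "Boat Launch Sites", "source_url": "https://data.example.org/boat-launches"}
park_trails = {"title": "Park Trails", "source_url": "https://data.example.org/park-trails"}

super_catalog = federate({
    "city": [boat_launches],
    "state": [boat_launches, park_trails],
})
```

Keying on the source URL is only one possible choice; a real standard would need a more robust notion of dataset identity.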

At Socrata, we feel that creating standards for open data is an important part of creating an Open Data Platform. We will work on implementing these standards and helping to push them forward where it makes sense.

Some of these efforts can be seen below and found here: http://open-data-standards.github.com/efforts.html

Current Standardization Efforts

Name               Goals                                                                               Relevant Standards

Data Catalog Schema
  1. Create a standard format for exposing resources to a catalog (or from a catalog).
  2. Define an apis.xml file to allow catalogs to aggregate APIs from data sources or other catalogs.
  3. Define a catalog.xml file to allow catalogs to aggregate resources from data sources or other catalogs.
Catalog Update API
  1. Create an API to more efficiently determine changes to a catalog.
Dataset Query API
  1. Standardize on an API for querying data within a dataset.
Dataset Federation
  1. Create a protocol for distributing the data within a dataset, and allowing other servers to sync the data.
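
To give a feel for what the update and federation efforts are after, here is a minimal sketch of a mirror that stays in sync by asking a data source only for the changes made since the last version it saw. This is one possible approach, not the protocol these efforts define; the `DataSource` and `Mirror` classes and the `changes_since` method are hypothetical names.

```python
class DataSource:
    """Holds rows plus an append-only change log with sequential versions."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (version, op, row_id, row)

    def upsert(self, row_id, row):
        self.rows[row_id] = dict(row)
        self.log.append((len(self.log) + 1, "upsert", row_id, dict(row)))

    def delete(self, row_id):
        self.rows.pop(row_id, None)
        self.log.append((len(self.log) + 1, "delete", row_id, None))

    def changes_since(self, version):
        """Return log entries newer than `version` (versions start at 1)."""
        return self.log[version:]


class Mirror:
    """Keeps a local copy current by replaying only the changes it has not seen."""

    def __init__(self):
        self.rows = {}
        self.version = 0

    def sync(self, source):
        changes = source.changes_since(self.version)
        for version, op, row_id, row in changes:
            if op == "upsert":
                self.rows[row_id] = dict(row)
            else:
                self.rows.pop(row_id, None)
            self.version = version
        return len(changes)


source = DataSource()
mirror = Mirror()
source.upsert("a", {"name": "North Cove"})
source.upsert("b", {"name": "Mill Pier"})
first_sync = mirror.sync(source)

source.delete("a")
source.upsert("c", {"name": "Sandy Point"})
second_sync = mirror.sync(source)
```

Because the mirror transfers only what changed, an analytics server can poll often without re-downloading the whole dataset each time.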

 

Written by Will Pugh, Chief Technology Officer, Socrata, Inc.

