Viva Lost Data: Making All Data Easy to Search and Share
The more people use data, the more useful it becomes. To this end, Socrata and others want to liberate data from the shackles of legacy data systems and ossified data silos and get it into the hands of people who can really use it.
Who these people are will vary based on the data. It may be researchers at universities for one dataset, Yelp for the next, and boaters looking for a launch site with restrooms and a beach for another. In each of these cases, we need a data “platform” that lets data providers open up their data and lets data consumers easily find and use it.
The Ideal Data Platform
When I think about this platform, I naturally break it apart by how it will serve both data consumers and data producers.
The three main goals are to make it:
1. Easy to find: Provide services to help users find appropriate data sets.
2. Easy to share: Provide synchronization standards to make data portable so it can be replicated across many data stores, referenced by many catalogs, and kept in sync.
3. Easy to use: Provide APIs and ways for other services to run interactive queries on top of the data to simplify the creation of mobile applications.
For some reason, whenever I wrap my head around these goals, the word “network” pops into my head. I imagine a world where products like CKAN, Socrata and Junar work together such that any data added to one can be transported to any of the others seamlessly.
When I think about how to break this problem apart, I like to think about it in the somewhat familiar terms of “catalogs” and “data sources”:
Catalogs: Catalogs serve to organize, categorize and search datasets. An example of a pure catalog would be a site like http://www.data.gov/communities/node/42391/data, which aggregates data from several other open data sites.
Data sources: Data sources serve to actually expose datasets. They can provide downloads, API access or visualizations. An example would be something like White House visitor records: https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut.
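To make the split concrete, here is a minimal sketch of how a catalog might hold only metadata while pointing at the data source that actually serves the dataset. The class names, field names and keywords are my own assumptions for illustration, not part of any existing standard or product API; only the dataset URL comes from the example above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata a catalog keeps about one dataset (hypothetical fields)."""
    identifier: str
    title: str
    keywords: tuple[str, ...]
    source_url: str  # the data source that actually exposes the dataset


@dataclass
class Catalog:
    """A pure catalog: it organizes and searches entries, it holds no data."""
    entries: list[CatalogEntry] = field(default_factory=list)

    def search(self, term: str) -> list[CatalogEntry]:
        # Case-insensitive match against title and keywords only.
        needle = term.lower()
        return [
            entry for entry in self.entries
            if needle in entry.title.lower()
            or any(needle in keyword.lower() for keyword in entry.keywords)
        ]


# Illustrative entry; the keywords are made up for this sketch.
catalog = Catalog([
    CatalogEntry(
        identifier="644b-gaut",
        title="White House Visitor Records Requests",
        keywords=("white house", "visitors"),
        source_url="https://explore.data.gov/dataset/"
                   "White-House-Visitor-Records-Requests/644b-gaut",
    ),
])

matches = catalog.search("visitor")
```

Note that the search here only sees metadata. Looking inside the rows of a dataset would require asking the data source itself.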
Efforts Towards Standardization
There are currently a few efforts towards getting standards together for open data and they all focus on different areas of the problem. Here is a list of what I think we need to standardize on over the next year or two:
1. Catalog federation: If one catalog knows about a dataset, other catalogs should be able to learn about the dataset and advertise it as well. This will more easily enable “super-catalogs” that can aggregate content.
2. Federated search: Federated catalogs should be able to search inside datasets in a standard way, so a federated catalog can search the contents of datasets, rather than just the metadata.
3. Synchronizing datasets: We should enable data sources to create “mirrors” of other data sources. This allows analytics or research applications that need to run expensive operations on open data to always stay current.
4. Retrieving data through an API: As more mobile devices and web applications use open data, it becomes more important to have a small number of APIs that devices can use to access the data, rather than a large number of APIs with small data sets.
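As a rough illustration of the first item, catalog federation can be thought of as a merge: one catalog learns the entries another catalog advertises and keeps whichever copy was modified most recently. The sketch below assumes each entry carries an identifier and an ISO 8601 `modified` date. Those field names, the `federate` function and the sample catalogs are hypothetical, not taken from any of the standards efforts.

```python
from __future__ import annotations

from datetime import datetime


def _modified(entry: dict) -> datetime:
    # Assumes every entry has an ISO 8601 "modified" value (hypothetical field).
    return datetime.fromisoformat(entry["modified"])


def federate(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, dict]:
    """Return a new catalog holding the union of both catalogs.

    When both catalogs know a dataset, the more recently modified entry wins.
    Neither input is changed.
    """
    merged = dict(local)
    for identifier, entry in remote.items():
        known = merged.get(identifier)
        if known is None or _modified(entry) > _modified(known):
            merged[identifier] = entry
    return merged


# Made-up catalogs, keyed by dataset identifier.
city_catalog = {
    "boat-launches": {"title": "Boat Launch Sites", "modified": "2013-01-10"},
}
state_catalog = {
    "boat-launches": {"title": "Boat Launch Sites", "modified": "2013-02-01"},
    "parks": {"title": "State Parks", "modified": "2012-11-20"},
}

super_catalog = federate(city_catalog, state_catalog)
```

A real federation standard would also need to settle how identifiers are made globally unique and how deletions propagate, which this sketch leaves out.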
At Socrata, we feel that creating standards for open data is an important component of creating an Open Data Platform. We will work on implementing these standards and helping to push them forward where it makes sense.
Some of these efforts can be seen below and found here: http://open-data-standards.github.com/efforts.html
Current Standardization Efforts
|Name|Goals|Relevant Standards|
|Data Catalog Schema|||
|Catalog Update API|||
|Dataset Query API|||
Written by Will Pugh, Chief Technology Officer, Socrata, Inc.