This document reviews existing web-oriented data formats.

HTML microformats

  • microformats overview: http://microformats.org/about and http://microformats.org/wiki
  • additional similar approaches: RDFa and microdata
  • microformats2: the latest iteration, based on lessons learned from the approaches above

  • relationships between data and documents, expressed using standard rel attribute values
  • semantic class names that re-use well-established vocabularies (e.g. vCard, iCalendar)
  • additional vocabularies developed through an open scientific process and community, released as CC0 (public domain)
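
A minimal sketch of how data marked up with semantic class names can be read back out of HTML. It assumes hCard-style class names (vcard, fn, org); the ClassTextCollector class and the sample fragment are made up for illustration and this is not a conforming microformats parser (void elements such as br are not handled, for instance).

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside elements that carry one of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.open_elements = []  # per open element: the wanted classes it carries
        self.found = {}          # class name -> collected text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.open_elements.append([c for c in classes if c in self.wanted])

    def handle_endtag(self, tag):
        # Simplification: assumes every start tag has a matching end tag.
        if self.open_elements:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for names in self.open_elements:
            for name in names:
                self.found[name] = (self.found.get(name, "") + " " + text).strip()


# Hand-written hCard-style fragment; the person and organisation are made up.
html = (
    '<div class="vcard">'
    '<a class="fn url" href="http://example.org/">Jane Doe</a> '
    '<span class="org">Example Org</span>'
    '</div>'
)

collector = ClassTextCollector(["fn", "org"])
collector.feed(html)
collector.close()
print(collector.found)
```

The point of the sketch is only that the class attribute carries the vocabulary: the same HTML stays readable for people while a consumer picks out fn and org by name.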

Dataset Publishing Language (Google)

Metadata Structure

Basic dataset metadata

  • Name
  • Description
  • URL
  • Provider

    • Name
    • URL
  • Topics (aka tags)

Metadata about the data itself is organized around Concepts. A Concept is used either as a Dimension (an attribute) or as a Metric (a value).

Concept:

  • Id
  • Info

    • Name
    • Description
  • Type

Slices = collections of concepts. Each slice defines which of its concepts are metrics and which are dimensions. Slices serialize one-to-one with tables.

Tables = definition of a CSV file.

  • Can define defaults for columns
  • Can define formats (e.g. for data columns …)

Comments

  • DSPL seems an excellent CSV-based data packaging system
  • Would be nice to have JSON instead of XML for metadata
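
To make the JSON wish concrete, here is a rough sketch of the metadata structure described above written as JSON rather than XML. This is not part of DSPL: every key name, id and file name below is an assumption chosen to mirror the outline (dataset, provider, topics, concepts, slices, tables), and undefined_concepts is a hypothetical helper.

```python
import json

# Hypothetical JSON rendering of DSPL-style metadata; all names are illustrative.
metadata = {
    "name": "Example population dataset",
    "description": "Illustrative only",
    "url": "http://example.org/dataset",
    "provider": {"name": "Example Provider", "url": "http://example.org/"},
    "topics": ["demographics"],
    "concepts": [
        {"id": "country", "info": {"name": "Country"}, "type": "string"},
        {"id": "year", "info": {"name": "Year"}, "type": "date"},
        {"id": "population", "info": {"name": "Population"}, "type": "integer"},
    ],
    # A slice says which concepts act as dimensions and which as metrics,
    # and points at exactly one table.
    "slices": [
        {
            "id": "population_by_country",
            "dimensions": ["country", "year"],
            "metrics": ["population"],
            "table": "population_table",
        }
    ],
    # A table describes one CSV file, including per-column formats.
    "tables": [
        {
            "id": "population_table",
            "file": "population.csv",
            "columns": [
                {"id": "country", "type": "string"},
                {"id": "year", "type": "date", "format": "yyyy"},
                {"id": "population", "type": "integer"},
            ],
        }
    ],
}


def undefined_concepts(meta):
    """Return concept ids that slices use but that are not declared."""
    defined = {concept["id"] for concept in meta["concepts"]}
    used = set()
    for slice_ in meta["slices"]:
        used.update(slice_["dimensions"])
        used.update(slice_["metrics"])
    return sorted(used - defined)


serialized = json.dumps(metadata, indent=2)
print(serialized)
print(undefined_concepts(metadata))
```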

RDF and Linked Data

Google Visualization API Data Format

Google BigQuery

https://developers.google.com/bigquery/

CKAN Data API

See http://docs.ckan.org/en/latest/datastore.html and http://docs.ckan.org/en/latest/datastore-api.html.

  • JSON based
  • Tabular oriented
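
A small sketch of what "JSON based, tabular oriented" looks like from the client side. It builds a datastore_search request URL and flattens a result into headers and rows without making any network call. The site URL, the resource id and the sample result are made up, and the helper names datastore_search_url and to_table are hypothetical; check the linked documentation for the exact parameters and response shape of a given CKAN version.

```python
from urllib.parse import urlencode


def datastore_search_url(site_url, resource_id, limit=5, offset=0):
    """Build a datastore_search URL for a CKAN site (no request is sent)."""
    query = urlencode({"resource_id": resource_id, "limit": limit, "offset": offset})
    return f"{site_url.rstrip('/')}/api/3/action/datastore_search?{query}"


def to_table(result):
    """Turn a result with 'fields' and 'records' into (headers, rows)."""
    headers = [field["id"] for field in result["fields"]]
    rows = [[record.get(h) for h in headers] for record in result["records"]]
    return headers, rows


# Hand-written stand-in for the "result" part of a response; values are made up.
sample_result = {
    "fields": [{"id": "year", "type": "int4"}, {"id": "population", "type": "int4"}],
    "records": [
        {"year": 2010, "population": 4729},
        {"year": 2011, "population": 4832},
    ],
}

url = datastore_search_url("http://demo.example.org/", "hypothetical-resource-id", limit=2)
headers, rows = to_table(sample_result)
print(url)
print(headers, rows)
```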

JSON-Stat

http://json-stat.org/ with the detailed specification at http://json-stat.org/doc/

  • “The ultimate goal of json-stat.org is to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.”
  • JSON based and cube oriented

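Example (reasonably complex; only the "metric" dimension is spelled out, the definitions of "time", "geo" and "sex" are omitted):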

{
   "dataset" : {
      "value" : [4729, 4832, 9561],
      "dimension" : {
         "id" : ["metric", "time", "geo", "sex"],
         "size" : [1, 1, 1, 3],
         "metric" : {
            "category" : {
               "label" : {
                  "pop" : "Population"
               },
               "unit" : {
                  "type" : {
                     "pop" : "count"
                  }, 
                  "base" : {
                     "pop" : "Person"
                  },
                  "symbol" : {
                     "pop" : null
                  },
                  "mult" : {
                     "pop" : 0
                  }
               }
            }
         }
      }
   }
}
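
"Cube oriented" here means that "value" is a flat array and the position of each cell is derived from the dimension sizes. The sketch below assumes row-major order (the last dimension in "id" varies fastest), which is how the JSON-stat documentation describes the layout; flat_index is a hypothetical helper and the category positions used are illustrative, since the example above does not define the "sex" categories.

```python
def flat_index(position, size):
    """Map one position per dimension to an index into the flat value array.

    Assumes row-major order: the last dimension varies fastest.
    """
    if len(position) != len(size):
        raise ValueError("need exactly one position per dimension")
    index = 0
    for pos, dim_size in zip(position, size):
        if not 0 <= pos < dim_size:
            raise IndexError(f"position {pos} out of range for size {dim_size}")
        index = index * dim_size + pos
    return index


# Taken from the example above: four dimensions, only the last has 3 categories.
size = [1, 1, 1, 3]
value = [4729, 4832, 9561]

# Third category of the last dimension, first (only) category of the others.
cell = value[flat_index([0, 0, 0, 2], size)]
print(cell)
```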

OData (Microsoft)

SQL

Standard ANSI SQL

SQLite

  • http://www.sqlite.org/
  • SQLite binary format, i.e. the database file itself and not just SQL. Not specified by anyone in particular, but suggested by several people and now used by ScraperWiki
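
A minimal sketch of using a SQLite file as the exchange format, with Python's standard sqlite3 module. The table name, columns and rows are made up, and write_table / read_table are hypothetical helpers; identifiers are interpolated into the SQL for brevity, so this is only suitable for trusted names.

```python
import os
import sqlite3
import tempfile


def write_table(path, name, columns, rows):
    """Create a table in the SQLite file at `path` and insert the rows."""
    conn = sqlite3.connect(path)
    try:
        cols = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE "{name}" ({cols})')
        conn.executemany(f'INSERT INTO "{name}" VALUES ({marks})', rows)
        conn.commit()
    finally:
        conn.close()


def read_table(path, name):
    """Read all rows of a table back from the SQLite file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(f'SELECT * FROM "{name}"').fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "data.sqlite")
    write_table(db_path, "population", ["year", "total"], [(2010, 4729), (2011, 4832)])
    rows_back = read_table(db_path, "population")
    # The file is a self-describing binary: it starts with a fixed header string.
    with open(db_path, "rb") as f:
        header = f.read(16)

print(rows_back)
print(header)
```

The single file carries both schema and data, which is what distinguishes it from shipping a SQL dump.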

SODA - Socrata Open Data API

Metaweb Object Model

Formats - Tabular

General characteristics

Most systems have a model that looks something like:

Dataset

  • headers: list of Columns
  • data: RowSet
  • total (total_rows in CouchDB, count in SQL-style systems): number of rows in RowSet

Column:

  • id
  • label

RowSet - list of rows:

  • getLength
  • getRow(i): returns row

Row:

  • list of cells
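
The common model above can be sketched in a few lines. This is an illustration of the outline only, not the API of any of the systems reviewed; the snake_case method names stand in for getLength and getRow, and the sample columns and rows are made up.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Column:
    id: str
    label: str


class RowSet:
    """A list of rows; each row is a list of cells."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def get_length(self) -> int:      # cf. getLength
        return len(self._rows)

    def get_row(self, i: int) -> List[Any]:  # cf. getRow(i)
        return self._rows[i]


@dataclass
class Dataset:
    headers: List[Column]
    data: RowSet

    @property
    def total(self) -> int:
        # Follows the outline above: the number of rows in the RowSet.
        return self.data.get_length()


dataset = Dataset(
    headers=[Column("year", "Year"), Column("total", "Total population")],
    data=RowSet([(2010, 4729), (2011, 4832)]),
)

print([column.label for column in dataset.headers])
print(dataset.total, dataset.data.get_row(1))
```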

R (Data Frames)

TODO: Need more info …

Tablib

Model:

  • Dataset - core object

    • dict: list of Rows (can instantiate with list of arrays/tuples)
    • headers: header fields
  • Row: list of fields
  • Databook: list of Datasets (e.g. spreadsheet workbook)

SlickGrid

JS tabular data presentation.

Model:

  • Two arguments: data, columns
  • Data: an array of dicts or a Model object

    • Model: an object implementing three methods; see the sample implementation SlickGrid.Data.DataView

      • model.getItem(i) // Returns the ith row
      • model.getLength() // Returns the number of items
      • model.getItemMetadata(i) // not sure about this …
  • Columns: at least id, name (label) and field attributes. See https://github.com/mleibman/SlickGrid/wiki/Column-Options
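
The data/columns split can be illustrated with a Python analogue. This is not SlickGrid's JavaScript API: ListModel and render are hypothetical, the method names are snake_case stand-ins for getItem, getLength and getItemMetadata, and what the per-row metadata contains is left open here, as in the notes above.

```python
class ListModel:
    """Python analogue of the three-method model described above."""

    def __init__(self, items, metadata=None):
        self._items = list(items)
        self._metadata = dict(metadata or {})  # row index -> metadata (assumption)

    def get_item(self, i):           # cf. model.getItem(i)
        return self._items[i]

    def get_length(self):            # cf. model.getLength()
        return len(self._items)

    def get_item_metadata(self, i):  # cf. model.getItemMetadata(i)
        return self._metadata.get(i)


def render(model, columns):
    """Produce rows of cell values by looking up each column's field in each item."""
    return [
        [model.get_item(i).get(column["field"]) for column in columns]
        for i in range(model.get_length())
    ]


# Columns carry at least id, name (label) and field, as listed above.
columns = [
    {"id": "year", "name": "Year", "field": "year"},
    {"id": "total", "name": "Total", "field": "total"},
]
model = ListModel([{"year": 2010, "total": 4729}, {"year": 2011, "total": 4832}])

grid = render(model, columns)
print(grid)
```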

JS Data

Model:

  • Data.Hash (A sortable Hash data-structure)
  • Data.Graph (A data abstraction for all kinds of linked data)
  • Data.Collection (A simplified interface for tabular data that uses a Data.Graph internally)
  • Persistence Layer for Data.Graphs