Governments and public administrations have recently started to publish large amounts of structured data on the Web, mostly as tabular data such as CSV files or Excel sheets. This Mapping Wiki enables crowd-sourcing of the large-scale semantic mapping of such tabular data. Default mappings are created automatically and can be revised by the community using this semantic wiki. The mappings are executed using a streaming RDB2RDF conversion.
- name: a string (required), which identifies the mapping and must be unique within the scope of one resource. It is possible to define several mappings for the same resource simply by adding several RelCSV2RDF template instances to the page, which will result in several RDF files being generated.
- header: an integer or an integer range (optional), which indicates the position of header row(s). The default is "1".
- omitRows and omitCols: integer ranges (optional), which determine rows and columns to be omitted from the conversion. The default is not to omit any rows or columns.
- delimiter: a symbol (optional), which defines the field separator for the tabular data file. The default is ",".
- col1, col2, col3 etc.: strings, which specify RDF properties to be used for the conversion of the respective column of the table.
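The effect of the header, omitRows, omitCols and delimiter parameters listed above can be illustrated with a minimal Python sketch. It is not the code used by the CSV2RDF server. The function names `parse_range` and `read_table`, the range syntax "2-4", and the 1-based counting of rows and columns are assumptions made for this illustration only.

```python
import csv
import io


def parse_range(spec):
    """Turn "3" or "2-4" into a set of 1-based positions (assumed syntax)."""
    if not spec:
        return set()
    start, _, end = str(spec).partition("-")
    return set(range(int(start), int(end or start) + 1))


def read_table(csv_text, header="1", omit_rows=None, omit_cols=None, delimiter=","):
    """Split a CSV text into header rows and data rows (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header_rows = parse_range(header)
    skip_rows = parse_range(omit_rows)
    skip_cols = parse_range(omit_cols)

    def keep_cols(row):
        # Drop every cell whose 1-based column position is omitted.
        return [cell for pos, cell in enumerate(row, start=1) if pos not in skip_cols]

    headers = [keep_cols(r) for i, r in enumerate(rows, start=1) if i in header_rows]
    data = [
        keep_cols(r)
        for i, r in enumerate(rows, start=1)
        if i not in header_rows and i not in skip_rows
    ]
    return headers, data


sample = "id;name;note\n1;Alice;x\n2;Bob;y\n3;Carol;z\n"
headers, data = read_table(sample, header="1", omit_rows="3", omit_cols="3", delimiter=";")
print(headers)
print(data)
```

In this example the third row of the file (the "Bob" record) and the third column ("note") are left out of the conversion input, while the first row is treated as the header.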
The default mapping, which our automatic conversion process generates, uses CSV column headers as identifiers for respective properties in the namespace
Alternatively, complete URIs or CURIEs using Prefix.cc prefixes can be used.
These properties are then instantiated for each of the respective column values.
A consequence of this approach is that CSV files using the same column header will produce RDF containing the same properties.
In the majority of cases this behavior is desirable, especially if multiple datasets were exported to CSV from the same backend system and share the same structure and headers.
However, this automatic mapping can also result in incorrect property identification when columns in different CSV files have the same header label but a different meaning.
The crowd-sourcing approach makes it possible to resolve such problems quickly once they are identified.
For example, see the resource "Spend over £25,000 in North East Strategic Health Authority".
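The default mapping and a per-column override described above can be sketched in Python as follows. This is only an illustration of the idea, not the Sparqlify-CSV implementation. The namespace `http://example.org/property/`, the row URI pattern, the `overrides` argument (standing in for col1, col2, ...) and the small prefix table (standing in for a Prefix.cc lookup) are all hypothetical.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical default namespace; the wiki's real namespace is not assumed here.
BASE_NS = "http://example.org/property/"
# Tiny stand-in for a Prefix.cc lookup.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def property_uri(label):
    """Resolve a full URI, a CURIE, or a plain column header to a property URI."""
    if label.startswith(("http://", "https://")):
        return label
    prefix, sep, local = label.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Default mapping: the header itself names the property in the base namespace.
    return BASE_NS + quote(label.strip().replace(" ", "_"))


def rows_to_triples(csv_text, row_base="http://example.org/row/", overrides=None):
    """Yield one N-Triples-style line per cell; overrides maps column number to property."""
    overrides = overrides or {}
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    props = [property_uri(overrides.get(i, h)) for i, h in enumerate(header, start=1)]
    for n, row in enumerate(reader, start=1):
        subject = f"<{row_base}{n}>"
        for prop, value in zip(props, row):
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            yield f'{subject} <{prop}> "{literal}" .'


default = list(rows_to_triples("Supplier,Amount\nAcme,100\n"))
revised = list(rows_to_triples("Supplier,Amount\nAcme,100\n", overrides={1: "dc:publisher"}))
for line in default + revised:
    print(line)
```

Two files that both contain a "Supplier" column receive the same property under the default mapping. If the column means something different in one of them, a community edit that sets an explicit property for that column (here via `overrides`) changes only that file's output.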
To support the automatic conversion of CSV resources to RDF, PublicData.eu is complemented by a CSV2RDF server. It downloads the CSV resources and automatically transforms them to RDF. Since an automatic transformation cannot properly reuse existing vocabularies, the transformation process can be configured through dataset transformation configuration pages, which are generated for each CSV resource in this wiki. Thus, the configuration of the transformation process can be crowd-sourced, and the quality of the transformed data improves gradually.
The RDF generation is performed by Sparqlify-CSV. The CSV2RDF server is written in Python and uses the PublicData.eu CKAN instance as well as this PublicData.eu Wiki as its main resources. The CSV2RDF server is an open-source project developed by the AKSW group. The source code can be downloaded from the GitHub repository
More detailed technical information can be found in the following publication:
- Ivan Ermilov, Sören Auer, Claus Stadler: User-driven Semantic Mapping of Tabular Data