Table2qb - CubiQL End-to-end example
November 19, 2018
Steps:
- Create RDF ttl files using Table2qb
- (Optional) Load the RDF files into a triple store
- Run CubiQL against the generated RDF data.
Step 1: Table2qb
table2qb is a command-line tool for generating RDF data cubes from tidy CSV data. See the table2qb repository for
installation instructions and an example of using it to generate data cubes.
Requirements
- Java 8
- Unix, Linux, Mac, Windows environments
- On Windows run table2qb with:
java -jar table2qb.jar exec ...
How to use
Create the codelists using the codelist-pipeline
Example:
table2qb exec codelist-pipeline --codelist-csv csv/gender.csv --codelist-name "Gender" --codelist-slug "gender" --base-uri http://statistics.gov.scot/ --output-file ttl/gender.ttl
- Codelists should exist for every dimension. If codelists already exist, there is no need to create new ones.
- A separate CSV is required for each codelist.
- The codelist CSV files should contain all the possible values for each codelist.
- An example CSV is: gender.csv
- At the CSV use the column Parent Notation to define hierarchies.
- The URIs created by the codelist-pipeline:
  - Concept scheme (codelist): {base-uri}/def/concept-scheme/{codelist-slug}
  - Concepts: {base-uri}/def/concept/{codelist-slug}/{CSV_column_Notation}
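As an illustration, a minimal codelist CSV could look like the sketch below. The column headers (Label, Notation, Parent Notation) follow the table2qb convention mentioned above; the rows themselves are assumed values, not taken from the original gender.csv:

```csv
Label,Notation,Parent Notation
All,all,
Female,female,
Male,male,
```

With the example command above, the row with Notation "female" would produce the concept URI http://statistics.gov.scot/def/concept/gender/female.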
Create the cube components using the components-pipeline
Example:
table2qb exec components-pipeline --input-csv csv/components.csv --base-uri http://statistics.gov.scot/ --output-file ttl/components.ttl
- The cube components are the qb:DimensionProperty, qb:MeasureProperty and qb:AttributeProperty.
- At the CSV use one row for each of the cube components. If some of the components have already been created (e.g. for another cube) do not include them in the CSV.
- At the CSV column Component Type the types are: Dimension, Measure and Attribute.
- At the CSV column Codelist use the URIs created by the codelist-pipeline.
- An example CSV is: components.csv
- The URIs created by the components-pipeline:
  - Dimension: {base-uri}/def/dimension/{CSV_column_Label}
  - Measure: {base-uri}/def/measure/{CSV_column_Label}
  - Attribute: {base-uri}/def/attribute/{CSV_column_Label}
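For illustration, a components CSV might look like the sketch below. The headers (Label, Description, Component Type, Codelist) follow the table2qb convention; the rows and the codelist URI are assumptions for this example:

```csv
Label,Description,Component Type,Codelist
Gender,The gender of a person,Dimension,http://statistics.gov.scot/def/concept-scheme/gender
Count,Number of people,Measure,
```

Note how the Codelist column of the Gender row reuses the concept scheme URI produced by the codelist-pipeline, while the Measure row leaves it empty.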
Create the cube DSD and observations using the cube-pipeline
Example:
table2qb exec cube-pipeline --input-csv csv/input.csv --dataset-name "Employment" --dataset-slug "employment" --column-config csv/columns.csv --base-uri http://statistics.gov.scot/ --output-file ttl/cube.ttl
- Use a CSV file to define the mappings between a CSV column and the relevant cube component.
- The CSV should contain one row per component, i.e. qb:DimensionProperty, qb:MeasureProperty and qb:AttributeProperty, including a row for the qb:measureType. It should also contain a row for the observation value.
- At the CSV, the column component_attachment can be: qb:dimension, qb:measure, qb:attribute.
- At the CSV, the column property_template should match the URIs created for the components by the components-pipeline. Use the CSV column name if required at the template (e.g. http://example.gr/def/measure/{measure_type}).
- At the CSV, the column value_template should match the URIs created for the concepts by the codelist-pipeline. Use the CSV column name if required at the template (e.g. http://example.gr/def/concept/stationid/{station_id}).
- If there are numbers at the dimension values (not at the value of the observation) use datatype string. Otherwise, if datatype number is used, the URIs will have the form e.g. http://example.gr/def/concept/year/2000.0.
- At the CSV row that has the mapping for the measure (i.e. component_attachment qb:measure), leave the value_template empty.
- At the CSV row for the value (i.e. with name value), leave the component_attachment and value_template empty.
- At each CSV row use a value_transformation if required. Possible values are: slugize, unitize or blank. slugize converts column values into URI components, e.g. (slugize "Some column value") is some-column-value. unitize translates the literal £ into GBP, e.g. (unitize "£ 100") is gbp-100. Be careful: slugize converts everything to lower case. The URIs of the dimension values should match the concept URIs created through the codelist-pipeline.
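To make the mapping concrete, a columns.csv for a cube with a gender dimension might contain rows along these lines. The header names follow the conventions described above; the URIs and rows are illustrative assumptions, not the original example file:

```csv
title,name,component_attachment,property_template,value_template,datatype,value_transformation
Gender,gender,qb:dimension,http://statistics.gov.scot/def/dimension/gender,http://statistics.gov.scot/def/concept/gender/{gender},string,slugize
Value,value,,,,number,
```

Note the Value row leaves component_attachment and value_template empty, as required above.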
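The slugize behaviour described above can be sketched in a few lines of Python. This is an approximation of the documented behaviour (lower-casing and replacing non-alphanumeric runs with hyphens), not table2qb's actual implementation:

```python
import re

def slugize(value: str) -> str:
    """Approximate table2qb's slugize: lower-case the value and
    collapse runs of non-alphanumeric characters into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower())
    return slug.strip("-")

print(slugize("Some column value"))  # some-column-value
```

This also shows why the warning matters: "Female" and "female" both slugize to "female", so the codelist Notation values must already be lower case for the URIs to line up.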
Advice:
- Use the same base URI at all pipelines. Although it is not mandatory, it will ease the transformation process.
- Be careful to use URIs that match between the pipelines. E.g. the property_template URI at the cube-pipeline should match the URIs created for the components by the components-pipeline.
A complete example can be found at Github.
Step 2: (Optional) Load RDF to triple store
CubiQL can be run directly against a local directory containing the generated RDF data, however this loads the data into memory so is unsuitable for large amounts of data. First loading the data into a dedicated triple store is therefore recommended.
Supported triple stores:
- Virtuoso
- Stardog
- Fuseki (untested)
What to load at the triple store:
- All the RDF ttl files created by Table2qb:
- RDF for each of the codelists
- RDF for the cube components
- RDF for the DSD and the observations
- RDF for the QB vocabulary
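If you use Virtuoso, the ttl files can be bulk loaded with the isql tool using its standard loader functions. The directory and graph URI below are assumptions for this example, and the directory must be listed in DirsAllowed in virtuoso.ini:

```sql
-- register every ttl file in the directory for loading into the given graph
ld_dir('/data/ttl', '*.ttl', 'http://statistics.gov.scot/graph/employment');
-- run the bulk loader over the registered files
rdf_loader_run();
```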
Configuration of the triple store:
- Set the max limit of the returned results, e.g. 100000 (depends on the size and number of cubes).
  - At Virtuoso the limit is defined by the parameter ResultSetMaxRows.
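In virtuoso.ini this parameter lives in the [SPARQL] section; the value below is just a suggestion:

```ini
[SPARQL]
ResultSetMaxRows = 100000
```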
Step 3: CubiQL
Run CubiQL using the default configuration:
java -jar cubiql-standalone.jar --port 8085 --endpoint http://myendpoint.com
If running against a local directory containing the data, the --endpoint parameter should specify the path to the directory:
java -jar cubiql-standalone.jar --port 8085 --endpoint ./ttl
The default configuration:
{:geo-dimension-uri nil
:time-dimension-uri nil
:codelist-source "component"
:codelist-predicate "http://publishmydata.com/def/qb/codesUsed"
:codelist-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
:dataset-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
:schema-label-language nil
:max-observations-page-size 2000}
If the default configuration does not match your data, then use another configuration file:
java -jar cubiql-standalone.jar --port 8085 --endpoint http://myendpoint.com/ --configuration myconfig.edn
Configuration parameters:
- :geo-dimension-uri defines the geo dimension. The values of the geo dimension should have a label.
- :time-dimension-uri defines the time dimension. The values of the time dimension should be defined by reference.data.gov.uk, e.g. http://reference.data.gov.uk/id/year/2016
- :codelist-source defines the source of the codelist that contains only the values used at the cube. The source can be: (i) "component" or (ii) "dimension". By default Table2qb uses "component".
- :codelist-predicate defines the predicate that connects the :codelist-source with the codelist that contains only the values used at the cube. By default Table2qb uses http://publishmydata.com/def/qb/codesUsed.
- :codelist-label-uri defines the label property that is used at the codelists. By default Table2qb uses http://www.w3.org/2000/01/rdf-schema#label
- :dataset-label-uri defines the label property that is used at the dataset, i.e. cube, DSD, components. By default Table2qb uses http://www.w3.org/2000/01/rdf-schema#label
- :schema-label-language: datasets, dimensions, measures and codelist members should all have a label with a language tag matching the :schema-label-language. nil can be specified to use strings without an associated language tag.
- :max-observations-page-size defines the maximum page size, e.g. if you need to get all the observations with one query.
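Once CubiQL is running, it serves a GraphQL endpoint you can query, e.g. to list the available datasets. The exact field names depend on the CubiQL version, so treat the following as a sketch rather than the definitive schema:

```graphql
{
  cubiql {
    datasets {
      uri
      title
      description
    }
  }
}
```

Queries can be sent with any GraphQL client (or via HTTP POST to the /graphql path of the running service, e.g. http://localhost:8085/graphql).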