Neurobagel data dictionaries

Overview

When you annotate a phenotypic TSV using the Neurobagel annotation tool (see also the section on the annotation tool), your annotations are automatically stored in a JSON data dictionary. A Neurobagel data dictionary essentially describes the meaning and properties of columns and column values using standardized vocabularies.

Example

A comprehensive example data dictionary containing all currently supported phenotypic attributes and annotations can be found here (corresponding phenotypic .tsv).

Importantly, Neurobagel uses a structure for these data dictionaries that is compatible with, and expands on, BIDS participants.json data dictionaries.

Info

The specification for how a Neurobagel data dictionary is structured is also called a schema. Because Neurobagel data dictionaries are stored as .json files, we write the specification in the jsonschema schema language.

Neurobagel data dictionaries uniquely include an Annotations attribute for each column entry to store user-provided semantic annotations.

Here is an example BIDS data dictionary (participants.json):

{
  "age": {
    "Description": "age of the participant",
    "Units": "years"
  },
  "sex": {
    "Description": "sex of the participant as reported by the participant",
    "Levels": {
      "M": "male",
      "F": "female"
    }
  }
}

And here is the same data dictionary augmented with Neurobagel annotations:

{
  "age": {
    "Description": "age of the participant",
    "Units": "years",
    "Annotations": {
      "IsAbout": {
        "TermURL": "http://neurobagel.org/vocab/Age",
        "Label": "Age"
      },
      "Transformation": {
        "TermURL": "http://neurobagel.org/vocab/FromInt",
        "Label": "Integer"
      }
    }
  },
  "sex": {
    "Description": "sex of the participant as reported by the participant",
    "Levels": {
      "M": "male",
      "F": "female"
    },
    "Annotations": {
      "IsAbout": {
        "TermURL": "http://neurobagel.org/vocab/Sex",
        "Label": "Sex"
      },
      "Levels": {
        "M": {
          "TermURL": "http://purl.bioontology.org/ontology/SNOMEDCT/248153007",
          "Label": "Male"
        },
        "F": {
          "TermURL": "http://purl.bioontology.org/ontology/SNOMEDCT/248152002",
          "Label": "Female"
        }
      },
      "MissingValues": [
        "",
        " "
      ]
    }
  }
}
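Because all Neurobagel-specific information sits under the Annotations key of each column, dropping that key should give back a plain BIDS data dictionary. The Python sketch below illustrates this with an abbreviated copy of the dictionary above. Note that strip_annotations is a hypothetical helper written for this example, not a function of any Neurobagel tool.

```python
import json

# Abbreviated copy of the augmented data dictionary shown above.
NEUROBAGEL_DICT = json.loads("""
{
  "age": {
    "Description": "age of the participant",
    "Units": "years",
    "Annotations": {
      "IsAbout": {"TermURL": "http://neurobagel.org/vocab/Age", "Label": "Age"}
    }
  },
  "sex": {
    "Description": "sex of the participant as reported by the participant",
    "Levels": {"M": "male", "F": "female"},
    "Annotations": {
      "IsAbout": {"TermURL": "http://neurobagel.org/vocab/Sex", "Label": "Sex"},
      "MissingValues": ["", " "]
    }
  }
}
""")


def strip_annotations(data_dict):
    """Return a copy of the dictionary without the "Annotations" key (hypothetical helper)."""
    return {
        column: {key: value for key, value in entry.items() if key != "Annotations"}
        for column, entry in data_dict.items()
    }


bids_dict = strip_annotations(NEUROBAGEL_DICT)
print(json.dumps(bids_dict, indent=2))
```

The printed result contains only the Description, Units, and Levels keys, matching the BIDS data dictionary shown first.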

A custom Neurobagel namespace (URI: http://neurobagel.org/vocab/) is currently used for controlled terms that represent attribute classes modelled by Neurobagel, such as "Age" and "Sex", even though these terms may have equivalents in other vocabularies used for annotation. For example, the following terms from the Neurobagel annotations above are conceptually equivalent to terms from the SNOMED CT namespace:

Neurobagel namespace term	Equivalent external controlled vocabulary term
http://neurobagel.org/vocab/Age	http://purl.bioontology.org/ontology/SNOMEDCT/397669002
http://neurobagel.org/vocab/Sex	http://purl.bioontology.org/ontology/SNOMEDCT/184100006

Phenotypic attributes

The Neurobagel annotation tool generates a data dictionary entry for a given column by augmenting the information recommended by BIDS with unambiguous semantic tags.

Below we'll outline several example annotations using the following example participants.tsv file:

participant_id	session_id	group	age	sex	updrs_1	updrs_2
sub-01	ses-01	PD	25	M	2	
sub-01	ses-02	PD	26	M	3	5
sub-02	ses-01	CTRL	28	F	1	1
sub-02	ses-02	CTRL	29	F	1	1

Controlled terms in the examples below are shortened using the RDF prefix/context syntax of JSON-LD:

{
  "@context": {
    "nb": "http://neurobagel.org/vocab/",
    "ncit": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#",
    "nidm": "http://purl.org/nidash/nidm#",
    "snomed": "http://purl.bioontology.org/ontology/SNOMEDCT/",
    "cogatlas": "https://www.cognitiveatlas.org/task/id/"
  }
}
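As a rough illustration of how a shortened term maps back to a full URL, the following Python sketch expands a prefixed term using the context above. The expand_term function is a hypothetical helper for this example only; a real JSON-LD processor handles many more cases.

```python
# Prefix-to-namespace mapping copied from the context above.
CONTEXT = {
    "nb": "http://neurobagel.org/vocab/",
    "ncit": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#",
    "nidm": "http://purl.org/nidash/nidm#",
    "snomed": "http://purl.bioontology.org/ontology/SNOMEDCT/",
    "cogatlas": "https://www.cognitiveatlas.org/task/id/",
}


def expand_term(term, context=CONTEXT):
    """Expand a prefixed term such as "nb:Age" into a full URL (hypothetical helper)."""
    prefix, separator, local_name = term.partition(":")
    if separator and prefix in context:
        return context[prefix] + local_name
    # Unknown prefix, or already a full URL: return the term unchanged.
    return term


print(expand_term("nb:Age"))
print(expand_term("snomed:248153007"))
```

Terms that already are full URLs pass through unchanged, because their "http" prefix is not a key of the context.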

Participant identifier

Term from the Neurobagel vocabulary.

{
  "participant_id": {
    "Description": "A participant ID",
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:ParticipantID",
        "Label": "Subject Unique Identifier"
      },
      "Identifies": "participant"
    }
  }
}

Note

participant_id is a reserved name in BIDS, so BIDS data dictionaries typically don't annotate this column. Neurobagel supports multiple subject ID columns for situations where a study uses more than one ID scheme.

Note

The Identifies annotation key is currently required to validate annotations for columns about unique observation identifiers (e.g., participant or session IDs). The "Identifies" key should only be used for these columns, and its value should be an informative string describing the type/level of observation identified. This required key is currently only used for validation and its value will not be processed by Neurobagel.
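The rule in this note can be sketched as a small check in Python. This is only an illustration under stated assumptions, not Neurobagel's actual validation: check_identifies is a hypothetical function, and it assumes identifier columns are recognized by the nb:ParticipantID and nb:SessionID terms.

```python
# Assumption: these two terms mark identifier columns.
IDENTIFIER_TERMS = {"nb:ParticipantID", "nb:SessionID"}


def check_identifies(data_dict):
    """Return one message per column that breaks the "Identifies" rule (hypothetical check)."""
    problems = []
    for column, entry in data_dict.items():
        annotations = entry.get("Annotations", {})
        term = annotations.get("IsAbout", {}).get("TermURL")
        identifies = annotations.get("Identifies")
        if term in IDENTIFIER_TERMS:
            # Identifier columns need an informative, non-empty string.
            if not isinstance(identifies, str) or not identifies.strip():
                problems.append(f"{column}: identifier column needs an 'Identifies' string")
        elif identifies is not None:
            problems.append(f"{column}: 'Identifies' is only meant for identifier columns")
    return problems


example = {
    "participant_id": {
        "Description": "A participant ID",
        "Annotations": {
            "IsAbout": {"TermURL": "nb:ParticipantID", "Label": "Subject Unique Identifier"},
            "Identifies": "participant",
        },
    },
    "session_id": {
        "Description": "A session ID",
        "Annotations": {
            # "Identifies" is deliberately left out here to trigger a message.
            "IsAbout": {"TermURL": "nb:SessionID", "Label": "Run Identifier"},
        },
    },
}

problems = check_identifies(example)
print(problems)
```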

Session identifier

Term from the Neurobagel vocabulary.

{
  "session_id": {
    "Description": "A session ID",
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:SessionID",
        "Label": "Run Identifier"
      },
      "Identifies": "session"
    }
  }
}

Note

Unlike the BIDS specification, Neurobagel supports a participants.tsv file with a session_id field.

Diagnosis

Terms from the SNOMED CT ontology for clinical diagnosis. Terms from the National Cancer Institute Thesaurus for healthy control status.

{
  "group": {
    "Description": "Group variable",
    "Levels": {
      "PD": "Parkinson's patient",
      "CTRL": "Control subject"
    },
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:Diagnosis",
        "Label": "Diagnosis"
      },
      "Levels": {
        "PD": {
          "TermURL": "snomed:49049000",
          "Label": "Parkinson's disease"
        },
        "CTRL": {
          "TermURL": "ncit:C94342",
          "Label": "Healthy Control"
        }
      }
    }
  }
}

The IsAbout relation uses a term from the Neurobagel namespace because "Diagnosis" is a standardized term.

Note

Columns with categorical values (e.g., study groups, diagnoses, sex) require a Levels key in their Neurobagel annotation. The Neurobagel "Levels" key is modeled after the BIDS "Levels" key for human readable descriptions.
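One way to see why Levels matters is to compare the values that actually occur in a categorical column against the keys of its annotation. The Python sketch below does this for a sex column. The unannotated_values function is a hypothetical helper, and treating MissingValues entries as covered is an assumption of this example.

```python
def unannotated_values(column_values, annotations):
    """Return unique values with no entry in Levels or MissingValues (hypothetical helper)."""
    known = set(annotations.get("Levels", {})) | set(annotations.get("MissingValues", []))
    return sorted(set(column_values) - known)


sex_annotations = {
    "IsAbout": {"TermURL": "nb:Sex", "Label": "Sex"},
    "Levels": {
        "M": {"TermURL": "snomed:248153007", "Label": "Male"},
        "F": {"TermURL": "snomed:248152002", "Label": "Female"},
    },
    "MissingValues": ["", " "],
}

# Every value is covered by Levels or MissingValues.
print(unannotated_values(["M", "M", "F", "F", ""], sex_annotations))
# "O" has no Levels entry and is reported.
print(unannotated_values(["M", "F", "O"], sex_annotations))
```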

Sex

Terms are from the SNOMED CT ontology, which has controlled terms that align with the BIDS participants.tsv descriptions for sex. Below are the SNOMED CT terms for the sex values allowed by BIDS:

Sex	Controlled term
Male	http://purl.bioontology.org/ontology/SNOMEDCT/248153007
Female	http://purl.bioontology.org/ontology/SNOMEDCT/248152002
Other	http://purl.bioontology.org/ontology/SNOMEDCT/32570681000036106

Here is what a sex annotation looks like in practice:

{
  "sex": {
    "Description": "Sex variable",
    "Levels": {
      "M": "Male",
      "F": "Female"
    },
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:Sex",
        "Label": "Sex"
      },
      "Levels": {
        "M": {
          "TermURL": "snomed:248153007",
          "Label": "Male"
        },
        "F": {
          "TermURL": "snomed:248152002",
          "Label": "Female"
        }
      }
    }
  }
}

The IsAbout relation uses a Neurobagel scoped term for "Sex" because this is a Neurobagel common data element.

Age

Neurobagel has a common data element for "Age" describing a continuous column. To ensure age values are represented as floats in Neurobagel graphs, Neurobagel encodes the relevant "heuristic" describing the value format for a given age column. This heuristic, stored in the Transformation annotation (required for continuous columns describing age), maps internally to a specific transformation that is used to convert the values to floats.

Possible heuristics:

TermURL	Label	Example
`nb:FromFloat`	float value	`31.5`
`nb:FromInt`	integer value	`31`
`nb:FromEuro`	european decimal value	`31,5`
`nb:FromBounded`	bounded value	`30+`
`nb:FromISO8061`	period of time defined according to the ISO8601 standard	`31Y6M`

{
  "age": {
    "Description": "Participant age",
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:Age",
        "Label": "Chronological age"
      },
      "Transformation": {
        "TermURL": "nb:FromEuro",
        "Label": "European value decimals"
      }
    }
  }
}
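To make the heuristics more concrete, here is a minimal Python sketch of how each one could turn a raw value into a float. It is not Neurobagel's internal implementation: transform_age is a hypothetical function, reading a bounded value such as 30+ as its bound is an assumption, and the ISO 8601 branch only handles year and month components like the example in the table.

```python
import re


def transform_age(value, heuristic):
    """Convert a raw age string to a float for the given heuristic (hypothetical sketch)."""
    value = value.strip()
    if heuristic in ("nb:FromFloat", "nb:FromInt"):
        return float(value)
    if heuristic == "nb:FromEuro":
        # European decimals use a comma as the decimal separator.
        return float(value.replace(",", "."))
    if heuristic == "nb:FromBounded":
        # Assumption: "30+" is read as its bound, 30.0.
        return float(value.rstrip("+"))
    if heuristic == "nb:FromISO8061":  # identifier spelled as in the table above
        # Accept an optional leading "P", then optional years and months.
        match = re.fullmatch(r"P?(?:(\d+)Y)?(?:(\d+)M)?", value)
        if match is None or not any(match.groups()):
            raise ValueError(f"Not a year/month period: {value!r}")
        years = int(match.group(1) or 0)
        months = int(match.group(2) or 0)
        return years + months / 12
    raise ValueError(f"Unknown heuristic: {heuristic}")


for raw, heuristic in [
    ("31.5", "nb:FromFloat"),
    ("31", "nb:FromInt"),
    ("31,5", "nb:FromEuro"),
    ("30+", "nb:FromBounded"),
    ("31Y6M", "nb:FromISO8061"),
]:
    print(raw, heuristic, transform_age(raw, heuristic))
```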

Assessment tool

For assessment tools like cognitive tests or rating scales, Neurobagel encodes whether a subject has a value/score for at least one item or subscale of the assessment. Because assessment tools often have several subscales or items that can be stored as separate columns in the tabular participants.tsv file, each assessment tool column receives a minimum of two annotations:

one to classify that the column IsAbout the generic category of assessment tools
one to classify that the column IsPartOf the specific assessment tool

An optional MissingValues annotation can list the value(s) in an assessment tool column that indicate the participant has no value/response for that subscale, when such missing values are present (see also the Missing values section).

{
  "updrs_1": {
    "Description": "item 1 scores for UPDRS",
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:Assessment",
        "Label": "Assessment tool"
      },
      "IsPartOf": {
        "TermURL": "cogatlas:tsk_4a57abb949ece",
        "Label": "Unified Parkinson's Disease Rating Scale"
      }
    }
  },
  "updrs_2": {
    "Description": "item 2 scores for UPDRS",
    "Annotations": {
      "IsAbout": {
        "TermURL": "nb:Assessment",
        "Label": "Assessment tool"
      },
      "IsPartOf": {
        "TermURL": "cogatlas:tsk_4a57abb949ece",
        "Label": "Unified Parkinson's Disease Rating Scale"
      },
      "MissingValues": [""]
    }
  }
}

To determine whether a specific assessment tool is available for a given participant, we consider all of the columns that were classified as IsPartOf that specific tool and apply a simple any() heuristic: the tool counts as available if at least one of these columns holds a value that is not one of its MissingValues.

For the above example, this would be:

participant_id	updrs_1	updrs_2
sub-01	2
sub-02	1	1
sub-03

Therefore:

participant_id	updrs_available
sub-01	True
sub-02	True
sub-03	False
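The any() heuristic can be sketched in a few lines of Python. This is an illustration rather than Neurobagel's own code: assessment_available is a hypothetical function, and the example assumes that both UPDRS columns list the empty string as a missing value.

```python
def assessment_available(row, tool_columns, missing_values):
    """True if at least one of the tool's columns holds a non-missing value (hypothetical)."""
    return any(
        row.get(column, "") not in missing_values.get(column, [])
        for column in tool_columns
    )


# Rows of the table above, with empty cells written as "".
rows = [
    {"participant_id": "sub-01", "updrs_1": "2", "updrs_2": ""},
    {"participant_id": "sub-02", "updrs_1": "1", "updrs_2": "1"},
    {"participant_id": "sub-03", "updrs_1": "", "updrs_2": ""},
]
# Assumption: the empty string is a MissingValues entry for both columns.
missing_values = {"updrs_1": [""], "updrs_2": [""]}

availability = {
    row["participant_id"]: assessment_available(row, ["updrs_1", "updrs_2"], missing_values)
    for row in rows
}
print(availability)
```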

Missing values

Missing values are allowed for any phenotypic variable (column) that does not describe a participant or session identifier (e.g., columns like participant_id or session_id). In a Neurobagel data dictionary, missing values for a given column are listed under the "MissingValues" annotation for the column (see the Assessment tool section or the comprehensive example data dictionary for examples).
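The restriction on identifier columns could be checked along the following lines. This Python sketch is only an assumption-laden illustration: find_missing_identifiers is a hypothetical function, identifier columns are recognized by the nb:ParticipantID and nb:SessionID terms, and an empty or whitespace-only cell is treated as missing.

```python
# Assumption: these two terms mark identifier columns.
IDENTIFIER_TERMS = {"nb:ParticipantID", "nb:SessionID"}


def find_missing_identifiers(rows, data_dict):
    """Return (row_number, column) pairs where an identifier cell is empty (hypothetical)."""
    id_columns = [
        column
        for column, entry in data_dict.items()
        if entry.get("Annotations", {}).get("IsAbout", {}).get("TermURL") in IDENTIFIER_TERMS
    ]
    return [
        (row_number, column)
        for row_number, row in enumerate(rows, start=1)
        for column in id_columns
        if row.get(column, "").strip() == ""
    ]


data_dict = {
    "participant_id": {"Annotations": {"IsAbout": {"TermURL": "nb:ParticipantID"}}},
    "session_id": {"Annotations": {"IsAbout": {"TermURL": "nb:SessionID"}}},
    "age": {"Annotations": {"IsAbout": {"TermURL": "nb:Age"}}},
}
rows = [
    {"participant_id": "sub-01", "session_id": "ses-01", "age": "25"},
    # The empty age is allowed; the empty session_id is not.
    {"participant_id": "sub-02", "session_id": "", "age": ""},
]

missing = find_missing_identifiers(rows, data_dict)
print(missing)
```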