Workflow
The following workflow defines all tasks necessary to upload, preprocess, share, and map a new data dictionary.
- Upload cohort data
- Open Google Sheet
- Run automated mapping for new data dictionary
- Share Google Sheet with submitter
- Prepare data dictionary for build
- Run automated validation
- Build data dictionary
- View results
- Add data dictionary to Version Control
- Prepare git commit (click Commit in the Version menu)
- Push changes to GitHub (click Push in the Version menu) and open a pull request
- Delete Google Sheet (caution: this cannot be undone)
IHCC Data Admin Tasks
Console
Exit status of last command unknown. The server may have restarted before it could complete.
$ make -f Makefile automated_mapping
make cogs_pull
make[1]: Entering directory '/workspace'
cogs fetch
cogs pull
make[1]: Leaving directory '/workspace'
cp build/terminology.tsv templates/cogs.tsv
make build/suggestions_cogs.tsv
make[1]: Entering directory '/workspace'
python3 src/mapping-suggest/id-generator-templates.py -t templates/cogs.tsv -m build/metadata.tsv
Generating IDs for data dictionary: EB
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -o build/intermediate/cogs_mapping_suggestions_zooma.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person GECKO:0000066 0.76
1 Person GECKO:0000055 0.76
2 Person GECKO:0000120 0.76
3 Sample GECKO:0000052 0.98
4 Nationality GECKO:0000064 1.00
5 Education GECKO:0000065 1.00
6 Smoking GECKO:0000068 1.00
7 Alcohol GECKO:0000069 1.00
8 Sleep GECKO:0000071 0.98
9 Health GECKO:0000126 0.98
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -o build/intermediate/cogs_mapping_suggestions_nlp.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 ObjectiveInformation.weight CMO:0000012 0.837803
1 ObjectiveInformation.bmi CMO:0000021 0.672504
2 ObjectiveInformation.waist CMO:0000021 0.267665
3 ObjectiveInformation.hip CMO:0000021 0.272348
4 ObjectiveInformation.height CMO:0000106 0.689633
5 Sample.vkood GECKO:0000052 0.199489
6 Sample.visitNumber GECKO:0000052 0.199489
7 PhysicalExercise.code GECKO:0000052 0.404730
8 ProfessionalSportPast.code GECKO:0000052 0.404730
9 ProfessionalSport.code GECKO:0000052 0.404730
10 HormonalContraceptiveUsed.code GECKO:0000052 0.404730
11 HormonalContraceptiveUsedV1.code GECKO:0000052 0.404730
12 HormonalMedicationMenopause.code GECKO:0000052 0.404730
13 HormonalMedicationMenopauseV1.code GECKO:0000052 0.404730
14 Health.movement GECKO:0000052 0.105289
15 Health.selfcare GECKO:0000052 0.105289
16 Health.commonActivities GECKO:0000052 0.105289
17 Health.painDiscomfort GECKO:0000052 0.105289
18 Health.anxietyDepression GECKO:0000052 0.105289
19 MedicationsForTroubledBreathing.code GECKO:0000052 0.404730
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person.gender GECKO:0000066 0.98
1 Nationality.nationality GECKO:0000064 0.98
2 Education.highestEducationLevel GECKO:0000065 0.76
3 Education.highestEducationLevel UBERON:0000105 0.76
4 Education.highestEducationLevel GECKO:0000065 0.76
5 EatingHabits.eatingHabit GECKO:0000072 0.98
6 Smoking.smokingStatus GECKO:0000068 0.98
7 OtherDrugs.usingOtherDrugs GECKO:0000094 0.98
8 OtherDrugs.otherDrugs GECKO:0000094 0.98
9 ObjectiveInformation.weight GECKO:0000114 0.98
10 Person GECKO:0000066 0.76
11 Person GECKO:0000055 0.76
12 Person GECKO:0000120 0.76
13 Person GECKO:0000066 0.76
14 Person GECKO:0000055 0.76
15 Person GECKO:0000120 0.76
16 Sample GECKO:0000052 0.98
17 Sample GECKO:0000052 0.98
18 Nationality GECKO:0000064 1.00
19 Nationality GECKO:0000064 1.00
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 ObjectiveInformation.weight CMO:0000012 0.837917
1 ObjectiveInformation.bmi CMO:0000021 0.672600
2 ObjectiveInformation.waist CMO:0000021 0.263734
3 ObjectiveInformation.hip CMO:0000021 0.269210
4 ObjectiveInformation.height CMO:0000106 0.689339
5 Sample.vkood GECKO:0000052 0.199805
6 Sample.visitNumber GECKO:0000052 0.199805
7 PhysicalExercise.code GECKO:0000052 0.400454
8 ProfessionalSportPast.code GECKO:0000052 0.400454
9 ProfessionalSport.code GECKO:0000052 0.400454
10 HormonalContraceptiveUsed.code GECKO:0000052 0.400454
11 HormonalContraceptiveUsedV1.code GECKO:0000052 0.400454
12 HormonalMedicationMenopause.code GECKO:0000052 0.400454
13 HormonalMedicationMenopauseV1.code GECKO:0000052 0.400454
14 Health.movement GECKO:0000052 0.105161
15 Health.selfcare GECKO:0000052 0.105161
16 Health.commonActivities GECKO:0000052 0.105161
17 Health.painDiscomfort GECKO:0000052 0.105161
18 Health.anxietyDepression GECKO:0000052 0.105161
19 MedicationsForTroubledBreathing.code GECKO:0000052 0.400454
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person.gender GECKO:0000060 1.00
1 Person.birthDate PATO:0000011 0.98
2 Person.birthYear GECKO:0000066 0.98
3 Person.deathDate STATO:0000093 1.00
4 PersonPortrait.nationality GECKO:0000064 1.00
5 PersonPortrait.lastEducation GECKO:0000065 0.98
6 PersonPortrait.residencyRegion GECKO:0000064 1.00
7 Nationality.nationality GECKO:0000064 1.00
8 SpareTimeActivities.shoppingPerWeek GECKO:0000131 0.76
9 SpareTimeActivities.shoppingPerWeek GECKO:0000104 0.76
10 SpareTimeActivities.cleaningPerWeek GECKO:0000052 0.76
11 SpareTimeActivities.cleaningPerWeek MONDO:0004992 0.76
12 SpareTimeActivities.cleaningPerWeek GECKO:0000060 0.76
13 SpareTimeActivities.physicalExercisePerWeek OGMS:0000020 0.98
14 SpareTimeActivities.readingPerWeek CMO:0000294 0.76
15 SpareTimeActivities.readingPerWeek CMO:0000003 0.76
16 TobaccoLast12Months.smokeProd GECKO:0000068 0.98
17 TobaccoUsually.smokeProd GECKO:0000068 0.98
18 TobaccoLastMonth.smokeProd GECKO:0000068 0.98
19 TobaccoYearBeforeQuitting.smokeProd GECKO:0000068 0.98
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 PersonPortrait.lastWeight CMO:0000012 0.236058
1 ObjectiveInformation.weight CMO:0000012 0.776128
2 PersonPortrait.lastBmi CMO:0000021 0.159147
3 PersonPortrait.bmiSource CMO:0000021 0.111170
4 ObjectiveInformation.bmi CMO:0000021 0.624578
5 ObjectiveInformation.waist CMO:0000021 0.246661
6 ObjectiveInformation.hip CMO:0000021 0.251470
7 PersonPortrait.lastHeight CMO:0000106 0.139162
8 ObjectiveInformation.height CMO:0000106 0.635535
9 PersonPortrait.bmiSource GECKO:0000052 0.102550
10 PersonPortrait.settlementRegionType GECKO:0000052 0.210465
11 Sample.vkood GECKO:0000052 0.184984
12 Sample.visitNumber GECKO:0000052 0.184984
13 Answerset.isFirst GECKO:0000052 0.118963
14 Answerset.visitNumber GECKO:0000052 0.121422
15 PhysicalExercise.code GECKO:0000052 0.375404
16 ProfessionalSportPast.code GECKO:0000052 0.375404
17 ProfessionalSport.code GECKO:0000052 0.375404
18 SpareTimeActivities.childcarePerWeek GECKO:0000052 0.155255
19 SpareTimeActivities.elderlyCarePerWeek GECKO:0000052 0.133835
python3 src/mapping-suggest/merge-mapping-suggestions.py -t templates/cogs.tsv -s build/intermediate/cogs_mapping_suggestions_zooma.tsv -s build/intermediate/cogs_mapping_suggestions_nlp.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv -o build/suggestions_cogs.tsv
['build/intermediate/cogs_mapping_suggestions_zooma.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv']
Mapping suggestions files concat:
term ... confidence
0 Person ... 0.760000
1 Person ... 0.760000
2 Person ... 0.760000
3 Sample ... 0.980000
4 Nationality ... 1.000000
5 Education ... 1.000000
6 Smoking ... 1.000000
7 Alcohol ... 1.000000
8 Sleep ... 0.980000
9 Health ... 0.980000
0 ObjectiveInformation.weight ... 0.837803
1 ObjectiveInformation.bmi ... 0.672504
2 ObjectiveInformation.waist ... 0.267665
3 ObjectiveInformation.hip ... 0.272348
4 ObjectiveInformation.height ... 0.689633
5 Sample.vkood ... 0.199489
6 Sample.visitNumber ... 0.199489
7 PhysicalExercise.code ... 0.404730
8 ProfessionalSportPast.code ... 0.404730
9 ProfessionalSport.code ... 0.404730
[20 rows x 4 columns]
Merging suggestions successful. First twenty results:
Term ID ... Comment
0 EB:0000001 ... NaN
1 EB:0000002 ... NaN
2 EB:0000003 ... NaN
3 EB:0000004 ... NaN
4 EB:0000005 ... NaN
5 EB:0000006 ... NaN
6 EB:0000007 ... NaN
7 EB:0000008 ... NaN
8 EB:0000009 ... NaN
9 EB:0000010 ... NaN
10 EB:0000011 ... Person's last measurements, smoking status and...
11 EB:0000012 ... Person's last measurements, smoking status and...
12 EB:0000013 ... Person's last measurements, smoking status and...
13 EB:0000014 ... Person's last measurements, smoking status and...
14 EB:0000015 ... Person's last measurements, smoking status and...
15 EB:0000016 ... Person's last measurements, smoking status and...
16 EB:0000017 ... Person's last measurements, smoking status and...
17 EB:0000018 ... Person's last measurements, smoking status and...
18 EB:0000019 ... Person's last measurements, smoking status and...
19 EB:0000020 ... Person's last measurements, smoking status and...
[20 rows x 8 columns]
make[1]: Leaving directory '/workspace'
cp build/suggestions_cogs.tsv build/terminology.tsv
rm -f templates/cogs.tsv
make cogs-apply-data-validation
make[1]: Entering directory '/workspace'
python3 src/mapping-suggest/create-data-validation.py build/terminology.tsv build/gecko_labels.tsv build/cogs-data-validation.tsv build/cogs-info-table.tsv
cogs apply build/cogs-data-validation.tsv
make[1]: Leaving directory '/workspace'
make cogs-apply-info-table
make[1]: Entering directory '/workspace'
cogs apply build/cogs-info-table.tsv
make[1]: Leaving directory '/workspace'
cogs push
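
The log above prints a Zooma config with `zooma_confidence_mappings`, `rescale_nlp_matches`, and `min_match_probability`, and then merges six suggestion files into one table. The following is a minimal sketch of how such settings could be applied. It is not the code in `src/mapping-suggest/`. The config values are copied from the log, but the function names, the linear rescaling rule, and the "keep the best score per term and match" merge rule are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Values copied from the "Zooma config" line printed in the console log.
CONFIG = {
    "min_match_probability": 0.1,
    "rescale_nlp_matches": {"low": 0, "high": 0.9},
    "zooma_confidence_mappings": {"LOW": 0.51, "MEDIUM": 0.76, "GOOD": 0.98, "HIGH": 1},
}

Suggestion = Tuple[str, str, float]  # (term, match, confidence)


def zooma_score(label: str) -> float:
    """Translate a Zooma confidence label into its configured numeric score."""
    return float(CONFIG["zooma_confidence_mappings"][label.upper()])


def rescale_nlp(probability: float) -> float:
    """ASSUMPTION: linear rescaling of a [0, 1] probability into [low, high]."""
    low = CONFIG["rescale_nlp_matches"]["low"]
    high = CONFIG["rescale_nlp_matches"]["high"]
    return low + probability * (high - low)


def merge_suggestions(
    sources: Iterable[Iterable[Suggestion]],
) -> Dict[str, List[Tuple[str, float]]]:
    """ASSUMPTION: keep the best confidence per (term, match), drop weak ones."""
    best: Dict[Tuple[str, str], float] = {}
    for source in sources:
        for term, match, confidence in source:
            # Skip suggestions below the configured minimum probability.
            if confidence < CONFIG["min_match_probability"]:
                continue
            key = (term, match)
            best[key] = max(confidence, best.get(key, 0.0))

    # Group the surviving matches by term, strongest match first.
    merged: Dict[str, List[Tuple[str, float]]] = {}
    for (term, match), confidence in best.items():
        merged.setdefault(term, []).append((match, confidence))
    for matches in merged.values():
        matches.sort(key=lambda pair: pair[1], reverse=True)
    return merged


if __name__ == "__main__":
    # The first two rows mirror entries in the log; the last row is hypothetical.
    zooma = [("Sample", "GECKO:0000052", zooma_score("GOOD"))]
    nlp = [
        ("Sample.vkood", "GECKO:0000052", 0.199489),
        ("Sample.vkood", "CMO:0000012", 0.05),  # hypothetical weak match
    ]
    print(merge_suggestions([zooma, nlp]))
```

Under these assumptions, the hypothetical 0.05 match is dropped because it falls below `min_match_probability`, and each remaining term keeps its matches ordered by confidence. This is consistent with the log, where no NLP confidence below 0.1 appears.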