DROID / IHCC-cohorts-data-harmonization-test / EBPERIOD
Workflow
The following workflow defines all tasks necessary to upload, preprocess, share, and map a new data dictionary.
- Upload cohort data
- Open Google Sheet
- Run automated mapping for new data dictionary
- Share Google Sheet with submitter
- Prepare data dictionary for build
- Run automated validation
- Build data dictionary
- View results
- Add data dictionary to Version Control
- Prepare Git commit (click Commit in the Version menu)
- Push changes to GitHub (click Push in the Version menu) and open a pull request
- Delete Google Sheet (caution: this cannot be undone)
IHCC Data Admin Tasks
Console
Action automated_mapping started at 2022-11-23T12:56:50.116Z
Success
$ make -f Makefile automated_mapping
make cogs_pull
make[1]: Entering directory '/workspace'
cogs fetch
cogs pull
make[1]: Leaving directory '/workspace'
cp build/terminology.tsv templates/cogs.tsv
make build/suggestions_cogs.tsv
make[1]: Entering directory '/workspace'
python3 src/mapping-suggest/id-generator-templates.py -t templates/cogs.tsv -m build/metadata.tsv
Generating IDs for data dictionary: EB
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -o build/intermediate/cogs_mapping_suggestions_zooma.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person GECKO:0000066 0.76
1 Person GECKO:0000055 0.76
2 Person GECKO:0000120 0.76
3 Sample GECKO:0000052 0.98
4 Nationality GECKO:0000064 1.00
5 Education GECKO:0000065 1.00
6 Smoking GECKO:0000068 1.00
7 Alcohol GECKO:0000069 1.00
8 Sleep GECKO:0000071 0.98
9 Health GECKO:0000126 0.98
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -o build/intermediate/cogs_mapping_suggestions_nlp.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 ObjectiveInformation.weight CMO:0000012 0.839915
1 ObjectiveInformation.bmi CMO:0000021 0.672958
2 ObjectiveInformation.waist CMO:0000021 0.267822
3 ObjectiveInformation.hip CMO:0000021 0.271177
4 ObjectiveInformation.height CMO:0000106 0.687485
5 Sample.vkood GECKO:0000052 0.196984
6 Sample.visitNumber GECKO:0000052 0.196984
7 PhysicalExercise.code GECKO:0000052 0.404677
8 ProfessionalSportPast.code GECKO:0000052 0.404677
9 ProfessionalSport.code GECKO:0000052 0.404677
10 HormonalContraceptiveUsed.code GECKO:0000052 0.404677
11 HormonalContraceptiveUsedV1.code GECKO:0000052 0.404677
12 HormonalMedicationMenopause.code GECKO:0000052 0.404677
13 HormonalMedicationMenopauseV1.code GECKO:0000052 0.404677
14 Health.movement GECKO:0000052 0.104620
15 Health.selfcare GECKO:0000052 0.104620
16 Health.commonActivities GECKO:0000052 0.104620
17 Health.painDiscomfort GECKO:0000052 0.104620
18 Health.anxietyDepression GECKO:0000052 0.104620
19 MedicationsForTroubledBreathing.code GECKO:0000052 0.404677
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person.gender GECKO:0000066 0.98
1 Nationality.nationality GECKO:0000064 0.98
2 Education.highestEducationLevel GECKO:0000065 0.76
3 Education.highestEducationLevel UBERON:0000105 0.76
4 Education.highestEducationLevel GECKO:0000065 0.76
5 EatingHabits.eatingHabit GECKO:0000072 0.98
6 Smoking.smokingStatus GECKO:0000068 0.98
7 OtherDrugs.usingOtherDrugs GECKO:0000094 0.98
8 OtherDrugs.otherDrugs GECKO:0000094 0.98
9 ObjectiveInformation.weight GECKO:0000114 0.98
10 Person GECKO:0000066 0.76
11 Person GECKO:0000055 0.76
12 Person GECKO:0000120 0.76
13 Person GECKO:0000066 0.76
14 Person GECKO:0000055 0.76
15 Person GECKO:0000120 0.76
16 Sample GECKO:0000052 0.98
17 Sample GECKO:0000052 0.98
18 Nationality GECKO:0000064 1.00
19 Nationality GECKO:0000064 1.00
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term ... confidence
0 VisualDecadence.leftEyeDioptry ... 0.113719
1 CardiovascularDiseasesAdditional.leftVentricle... ... 0.375941
2 PersonPortrait.lastWeight ... 0.286969
3 ObjectiveInformation.weight ... 0.166849
4 ObjectiveInformation.weightMeasurement ... 0.128142
5 PersonPortrait.lastBmi ... 0.180271
6 PersonPortrait.bmiDate ... 0.179665
7 PersonPortrait.bmiSource ... 0.163099
8 ObjectiveInformation.bmi ... 0.132098
9 ObjectiveInformation.armCircumference ... 0.129445
10 ObjectiveInformation.waistHipMeasurement ... 0.188727
11 PersonPortrait.lastHeight ... 0.172090
12 ObjectiveInformation.height ... 0.101991
13 PersonPortrait.bmiSource ... 0.122198
14 PersonPortrait.smokingSource ... 0.100377
15 PersonPortrait.educationSource ... 0.132969
16 PersonPortrait.settlementRegionType ... 0.171467
17 Sample.vkood ... 0.175970
18 Sample.visitNumber ... 0.177679
19 InformedConsent.icDate ... 0.132688
[20 rows x 3 columns]
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p HIERARCHY -o build/intermediate/cogs_mapping_suggestions_zooma_hierarchy.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person.gender GECKO:0000060 1.00
1 Person.birthDate PATO:0000011 0.98
2 PersonPortrait.nationality GECKO:0000064 1.00
3 DiagnosisConsolidated.icd10 MONDO:0004992 0.76
4 DiagnosisConsolidated.icd10 MONDO:0005084 0.76
5 DiagnosisConsolidated.icd10 MONDO:0000001 0.76
6 DiagnosisConsolidated.icd10 MONDO:0004995 0.76
7 DiagnosisConsolidated.icd10 GECKO:0000052 0.76
8 Nationality.nationality GECKO:0000064 1.00
9 PhysicalExercise.code GECKO:0000073 0.76
10 PhysicalExercise.code GECKO:0000064 0.76
11 PhysicalExercise.code GECKO:0000052 0.76
12 PhysicalExercise.code MONDO:0000001 0.76
13 PhysicalExercise.code GECKO:0000060 0.76
14 ProfessionalSportPast.code GECKO:0000073 0.76
15 ProfessionalSportPast.code GECKO:0000064 0.76
16 ProfessionalSportPast.code GECKO:0000052 0.76
17 ProfessionalSportPast.code MONDO:0000001 0.76
18 ProfessionalSportPast.code GECKO:0000060 0.76
19 ProfessionalSport.code GECKO:0000073 0.76
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p HIERARCHY -o build/intermediate/cogs_mapping_suggestions_nlp_hierarchy.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 ObjectiveInformation.weight CMO:0000012 0.838501
1 ObjectiveInformation.bmi CMO:0000021 0.673675
2 ObjectiveInformation.waist CMO:0000021 0.266746
3 ObjectiveInformation.hip CMO:0000021 0.276154
4 ObjectiveInformation.height CMO:0000106 0.689157
5 PhysicalExercise.code GECKO:0000052 0.403723
6 ProfessionalSportPast.code GECKO:0000052 0.403723
7 ProfessionalSport.code GECKO:0000052 0.403723
8 HormonalContraceptiveUsed.code GECKO:0000052 0.403723
9 HormonalContraceptiveUsedV1.code GECKO:0000052 0.403723
10 HormonalMedicationMenopause.code GECKO:0000052 0.403723
11 HormonalMedicationMenopauseV1.code GECKO:0000052 0.403723
12 MedicationsForTroubledBreathing.code GECKO:0000052 0.403723
13 DiseasesDiagnosed.code GECKO:0000052 0.403723
14 MedicationsUsedForDisease.code GECKO:0000052 0.403723
15 MedicationPackagesUsedForDisease.code GECKO:0000052 0.403723
16 ConcurrentDiagnoses.code GECKO:0000052 0.403723
17 RespiratoryDiseasesMedications.code GECKO:0000052 0.403723
18 DiabetesComplications.code GECKO:0000052 0.403723
19 DiabetesMedications.code GECKO:0000052 0.403723
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person.gender GECKO:0000060 1.00
1 Person.birthDate PATO:0000011 0.98
2 Person.birthYear GECKO:0000066 0.98
3 Person.deathDate STATO:0000093 1.00
4 PersonPortrait.nationality GECKO:0000064 1.00
5 PersonPortrait.lastEducation GECKO:0000065 0.98
6 PersonPortrait.residencyRegion GECKO:0000064 1.00
7 Nationality.nationality GECKO:0000064 1.00
8 SpareTimeActivities.shoppingPerWeek GECKO:0000131 0.76
9 SpareTimeActivities.shoppingPerWeek GECKO:0000104 0.76
10 SpareTimeActivities.cleaningPerWeek GECKO:0000052 0.76
11 SpareTimeActivities.cleaningPerWeek MONDO:0004992 0.76
12 SpareTimeActivities.cleaningPerWeek GECKO:0000060 0.76
13 SpareTimeActivities.physicalExercisePerWeek OGMS:0000020 0.98
14 SpareTimeActivities.readingPerWeek CMO:0000294 0.76
15 SpareTimeActivities.readingPerWeek CMO:0000003 0.76
16 TobaccoLast12Months.smokeProd GECKO:0000068 0.98
17 TobaccoUsually.smokeProd GECKO:0000068 0.98
18 TobaccoLastMonth.smokeProd GECKO:0000068 0.98
19 TobaccoYearBeforeQuitting.smokeProd GECKO:0000068 0.98
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EB:0000001 Person.skood
1 EB:0000002 Person.gender
2 EB:0000003 Person.birthDate
3 EB:0000004 Person.birthYear
4 EB:0000005 Person.agreementDate
NLP matching successful. First twenty results:
term match confidence
0 PersonPortrait.lastWeight CMO:0000012 0.237716
1 ObjectiveInformation.weight CMO:0000012 0.776403
2 PersonPortrait.lastBmi CMO:0000021 0.156668
3 PersonPortrait.bmiSource CMO:0000021 0.109631
4 ObjectiveInformation.bmi CMO:0000021 0.619491
5 ObjectiveInformation.waist CMO:0000021 0.245984
6 ObjectiveInformation.hip CMO:0000021 0.249478
7 PersonPortrait.lastHeight CMO:0000106 0.137774
8 ObjectiveInformation.height CMO:0000106 0.632686
9 PersonPortrait.bmiSource GECKO:0000052 0.101925
10 PersonPortrait.settlementRegionType GECKO:0000052 0.208364
11 Sample.vkood GECKO:0000052 0.182295
12 Sample.visitNumber GECKO:0000052 0.182295
13 Answerset.isFirst GECKO:0000052 0.118335
14 Answerset.visitNumber GECKO:0000052 0.120449
15 PhysicalExercise.code GECKO:0000052 0.370965
16 ProfessionalSportPast.code GECKO:0000052 0.370965
17 ProfessionalSport.code GECKO:0000052 0.370965
18 SpareTimeActivities.childcarePerWeek GECKO:0000052 0.153579
19 SpareTimeActivities.elderlyCarePerWeek GECKO:0000052 0.132282
python3 src/mapping-suggest/merge-mapping-suggestions.py -t templates/cogs.tsv -s build/intermediate/cogs_mapping_suggestions_zooma.tsv -s build/intermediate/cogs_mapping_suggestions_nlp.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_hierarchy.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_hierarchy.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv -o build/suggestions_cogs.tsv
['build/intermediate/cogs_mapping_suggestions_zooma.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_hierarchy.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_hierarchy.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv']
Mapping suggestions files concat:
term ... confidence
0 Person ... 0.760000
1 Person ... 0.760000
2 Person ... 0.760000
3 Sample ... 0.980000
4 Nationality ... 1.000000
5 Education ... 1.000000
6 Smoking ... 1.000000
7 Alcohol ... 1.000000
8 Sleep ... 0.980000
9 Health ... 0.980000
0 ObjectiveInformation.weight ... 0.839915
1 ObjectiveInformation.bmi ... 0.672958
2 ObjectiveInformation.waist ... 0.267822
3 ObjectiveInformation.hip ... 0.271177
4 ObjectiveInformation.height ... 0.687485
5 Sample.vkood ... 0.196984
6 Sample.visitNumber ... 0.196984
7 PhysicalExercise.code ... 0.404677
8 ProfessionalSportPast.code ... 0.404677
9 ProfessionalSport.code ... 0.404677
[20 rows x 4 columns]
Merging suggestions successful. First twenty results:
Term ID ... Comment
0 EB:0000001 ... NaN
1 EB:0000002 ... NaN
2 EB:0000003 ... NaN
3 EB:0000004 ... NaN
4 EB:0000005 ... NaN
5 EB:0000006 ... NaN
6 EB:0000007 ... NaN
7 EB:0000008 ... NaN
8 EB:0000009 ... NaN
9 EB:0000010 ... NaN
10 EB:0000011 ... NaN
11 EB:0000012 ... Person's last measurements, smoking status and...
12 EB:0000013 ... Person's last measurements, smoking status and...
13 EB:0000014 ... Person's last measurements, smoking status and...
14 EB:0000015 ... Person's last measurements, smoking status and...
15 EB:0000016 ... Person's last measurements, smoking status and...
16 EB:0000017 ... Person's last measurements, smoking status and...
17 EB:0000018 ... Person's last measurements, smoking status and...
18 EB:0000019 ... Person's last measurements, smoking status and...
19 EB:0000020 ... Person's last measurements, smoking status and...
[20 rows x 8 columns]
make[1]: Leaving directory '/workspace'
cp build/suggestions_cogs.tsv build/terminology.tsv
rm -f templates/cogs.tsv
make cogs-apply-data-validation
make[1]: Entering directory '/workspace'
python3 src/mapping-suggest/create-data-validation.py build/terminology.tsv build/gecko_labels.tsv build/cogs-data-validation.tsv build/cogs-info-table.tsv
ERROR:root:'disease or disorder' suggested on row 32 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 85 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 92 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 94 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 265 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 266 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 267 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 268 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 332 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 335 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 339 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 341 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 342 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 359 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 371 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 373 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 374 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 397 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 405 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 465 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 466 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 467 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 468 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 469 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 470 is not a GECKO term
ERROR:root:'disease or disorder' suggested on row 475 is not a GECKO term
cogs apply build/cogs-data-validation.tsv
make[1]: Leaving directory '/workspace'
make cogs-apply-info-table
make[1]: Entering directory '/workspace'
cogs apply build/cogs-info-table.tsv
make[1]: Leaving directory '/workspace'
cogs push
Success
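The "Zooma config" lines in the log above list a `zooma_confidence_mappings` table and a `min_match_probability` value. The Zooma result tables only ever show the scores 0.76, 0.98 and 1.00, and no NLP result falls below 0.1. The sketch below shows one way those two settings could be applied. It is an illustration only, not the code in `mapping-suggest-zooma.py`. The helper names `zooma_score` and `keep_suggestions`, the inclusive threshold, and the last sample row are assumptions. The `rescale_nlp_matches` setting is not modelled, because the log does not show how it is used.

```python
# Values copied from the "Zooma config" line printed in the console log.
CONFIG = {
    "min_match_probability": 0.1,
    "zooma_confidence_mappings": {"LOW": 0.51, "MEDIUM": 0.76, "GOOD": 0.98, "HIGH": 1},
}


def zooma_score(label, config=CONFIG):
    """Translate a Zooma confidence label into a numeric score (hypothetical helper)."""
    return float(config["zooma_confidence_mappings"][label.upper()])


def keep_suggestions(rows, config=CONFIG):
    """Keep (term, match, confidence) rows at or above the minimum probability.

    Treating the threshold as inclusive is an assumption.
    """
    threshold = config["min_match_probability"]
    return [row for row in rows if row[2] >= threshold]


rows = [
    ("Sample", "GECKO:0000052", zooma_score("GOOD")),
    ("Person", "GECKO:0000066", zooma_score("MEDIUM")),
    ("Example.term", "GECKO:0000052", 0.05),  # made-up row below the threshold
]
kept = keep_suggestions(rows)
for term, match, confidence in kept:
    print(f"{term}\t{match}\t{confidence:.2f}")
```

With these inputs the made-up `Example.term` row is dropped and the two Zooma rows are kept with scores 0.98 and 0.76, matching the values shown for `Sample` and `Person` in the first Zooma table.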