Workflow
The following workflow defines all tasks necessary to upload, preprocess, share, and map a new data dictionary.
- Upload cohort data
- Open Google Sheet
- Run automated mapping for new data dictionary
- Share Google Sheet with submitter
- Prepare data dictionary for build
- Run automated validation
- Build data dictionary
- View results
- Add data dictionary to Version Control
- Prepare git commit (click Commit in the Version menu)
- Push changes to GitHub (click Push in the Version menu) and open a pull request
- Delete Google Sheet (caution: this cannot be undone)
IHCC Data Admin Tasks
Console
Action automated_mapping started at 2022-11-21T12:46:03.960Z
Success
$ make -f Makefile automated_mapping
make cogs_pull
make[1]: Entering directory '/workspace'
cogs fetch
cogs pull
make[1]: Leaving directory '/workspace'
cp build/terminology.tsv templates/cogs.tsv
make build/suggestions_cogs.tsv
make[1]: Entering directory '/workspace'
python3 src/mapping-suggest/id-generator-templates.py -t templates/cogs.tsv -m build/metadata.tsv
Generating IDs for data dictionary: EBD
mkdir -p build/intermediate
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -o build/intermediate/cogs_mapping_suggestions_zooma.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person GECKO:0000066 0.76
1 Person GECKO:0000055 0.76
2 Person GECKO:0000120 0.76
3 Sample GECKO:0000052 0.98
4 Nationality GECKO:0000064 1.00
5 Education GECKO:0000065 1.00
6 Physical Activities GECKO:0000104 1.00
7 Eating Habits GECKO:0000072 0.98
8 Smoking GECKO:0000068 1.00
9 Alcohol GECKO:0000069 1.00
10 Sleep GECKO:0000071 0.98
11 Health GECKO:0000126 0.98
12 Objective Information GECKO:0000114 0.98
curl -L -o build/gecko.owl http://purl.obolibrary.org/obo/gecko/views/ihcc-gecko.owl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 349 100 349 0 0 1928 0 --:--:-- --:--:-- --:--:-- 1928
100 105k 100 105k 0 0 223k 0 --:--:-- --:--:-- --:--:-- 223k
curl -Lk -o build/robot.jar https://build.obolibrary.io/job/ontodev/job/robot/job/master/lastSuccessfulBuild/artifact/bin/robot.jar
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
1 77.1M 1 1456k 0 0 984k 0 0:01:20 0:00:01 0:01:19 983k
24 77.1M 24 18.8M 0 0 7720k 0 0:00:10 0:00:02 0:00:08 7717k
48 77.1M 48 37.6M 0 0 11.0M 0 0:00:07 0:00:03 0:00:04 11.0M
75 77.1M 75 58.0M 0 0 13.1M 0 0:00:05 0:00:04 0:00:01 13.1M
100 77.1M 100 77.1M 0 0 14.3M 0 0:00:05 0:00:05 --:--:-- 15.6M
java -jar build/robot.jar --prefixes src/prefixes.json query --input build/gecko.owl --query src/queries/ihcc-mapping-gecko.sparql build/intermediate/gecko-xrefs-sparql.csv
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -o build/intermediate/cogs_mapping_suggestions_nlp.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EBD:0000001 Person
1 EBD:0000002 Person Portrait
2 EBD:0000003 Diagnosis Consolidated
3 EBD:0000004 Sample
4 EBD:0000005 Informed Consent
NLP matching successful. First twenty results:
term match confidence
0 Sample GECKO:0000052 0.199025
1 Health GECKO:0000052 0.105093
2 Education GECKO:0000065 0.169910
3 Person GECKO:0000066 0.246904
4 Person Portrait GECKO:0000066 0.246904
5 Smoking GECKO:0000068 0.870167
6 Alcohol GECKO:0000069 0.900000
7 Sleep GECKO:0000071 0.252353
8 Eating Habits GECKO:0000072 0.278889
9 Other Drugs GECKO:0000072 0.136990
10 Regular Medications GECKO:0000072 0.156956
11 Other Drugs GECKO:0000093 0.105215
12 Medications Used For Disease GECKO:0000093 0.252179
13 Regular Medications GECKO:0000093 0.117252
14 Physical Activities GECKO:0000104 0.197033
15 Donor Birth Methods GECKO:0000114 0.147404
16 Place Of Birth GECKO:0000114 0.272022
17 Work GECKO:0000131 0.115732
18 Mother Health GECKO:0000132 0.476819
19 Diagnosis Consolidated MONDO:0000001 0.143600
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person GECKO:0000066 0.76
1 Person GECKO:0000055 0.76
2 Person GECKO:0000120 0.76
3 Person GECKO:0000066 0.76
4 Person GECKO:0000055 0.76
5 Person GECKO:0000120 0.76
6 Sample GECKO:0000052 0.98
7 Sample GECKO:0000052 0.98
8 Nationality GECKO:0000064 1.00
9 Nationality GECKO:0000064 1.00
10 Education GECKO:0000065 1.00
11 Education GECKO:0000065 1.00
12 Physical Activities GECKO:0000104 1.00
13 Physical Activities GECKO:0000104 1.00
14 Eating Habits GECKO:0000072 0.98
15 Eating Habits GECKO:0000072 0.98
16 Smoking GECKO:0000068 1.00
17 Smoking GECKO:0000068 1.00
18 Alcohol GECKO:0000069 1.00
19 Alcohol GECKO:0000069 1.00
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p WORD_BOUNDARY -o build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EBD:0000001 Person
1 EBD:0000002 Person Portrait
2 EBD:0000003 Diagnosis Consolidated
3 EBD:0000004 Sample
4 EBD:0000005 Informed Consent
NLP matching successful. First twenty results:
term match confidence
0 Sample GECKO:0000052 0.197785
1 Health GECKO:0000052 0.104552
2 Education GECKO:0000065 0.170096
3 Person GECKO:0000066 0.249028
4 Person Portrait GECKO:0000066 0.249028
5 Smoking GECKO:0000068 0.867390
6 Alcohol GECKO:0000069 0.900000
7 Sleep GECKO:0000071 0.253743
8 Informed Consent GECKO:0000072 0.100264
9 Eating Habits GECKO:0000072 0.278638
10 Other Drugs GECKO:0000072 0.136942
11 Regular Medications GECKO:0000072 0.157802
12 Other Drugs GECKO:0000093 0.106156
13 Medications Used For Disease GECKO:0000093 0.250168
14 Regular Medications GECKO:0000093 0.117184
15 Physical Activities GECKO:0000104 0.193678
16 Donor Birth Methods GECKO:0000114 0.147944
17 Place Of Birth GECKO:0000114 0.274689
18 Work GECKO:0000131 0.113861
19 Mother Health GECKO:0000132 0.473953
python3 src/mapping-suggest/mapping-suggest-zooma.py -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv
Zooma config: {'zooma_annotate': 'https://test.mapping.ihccglobal.app/zooma/v2/api/services/annotate?propertyValue=', 'oxo_mapping': 'https://test.mapping.ihccglobal.app/api/mappings?fromId=', 'ols_term': 'https://test.registry.ihccglobal.app/api/terms?iri=', 'ols_oboid': 'https://test.registry.ihccglobal.app/api/terms?obo_id=', 'min_match_probability': 0.1, 'rescale_nlp_matches': {'low': 0, 'high': 0.9}, 'zooma_confidence_mappings': {'LOW': 0.51, 'MEDIUM': 0.76, 'GOOD': 0.98, 'HIGH': 1}}
Zooma matching successful. First twenty results:
term match confidence
0 Person GECKO:0000066 0.76
1 Person GECKO:0000055 0.76
2 Person GECKO:0000120 0.76
3 Sample GECKO:0000052 0.98
4 Nationality GECKO:0000064 1.00
5 Education GECKO:0000065 1.00
6 Physical Activities GECKO:0000104 1.00
7 Eating Habits GECKO:0000072 0.98
8 Smoking GECKO:0000068 1.00
9 Alcohol GECKO:0000069 1.00
10 Sleep GECKO:0000071 0.98
11 Health GECKO:0000126 0.98
12 Objective Information GECKO:0000114 0.98
python3 src/mapping-suggest/mapping-suggest-nlp.py -z data/ihcc-mapping-suggestions-zooma.tsv -c src/mapping-suggest/mapping-suggest-config.yml -t templates/cogs.tsv -g build/intermediate/gecko-xrefs-sparql.csv -p DEFINITION -o build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Term ID Label
0 EBD:0000001 Person
1 EBD:0000002 Person Portrait
2 EBD:0000003 Diagnosis Consolidated
3 EBD:0000004 Sample
4 EBD:0000005 Informed Consent
NLP matching successful. First twenty results:
term match confidence
0 Sample GECKO:0000052 0.198378
1 Health GECKO:0000052 0.106021
2 Education GECKO:0000065 0.168180
3 Person GECKO:0000066 0.244266
4 Person Portrait GECKO:0000066 0.244266
5 Smoking GECKO:0000068 0.868231
6 Alcohol GECKO:0000069 0.900000
7 Sleep GECKO:0000071 0.253284
8 Eating Habits GECKO:0000072 0.274467
9 Other Drugs GECKO:0000072 0.136972
10 Regular Medications GECKO:0000072 0.156687
11 Other Drugs GECKO:0000093 0.105217
12 Medications Used For Disease GECKO:0000093 0.250133
13 Regular Medications GECKO:0000093 0.116194
14 Physical Activities GECKO:0000104 0.195562
15 Donor Birth Methods GECKO:0000114 0.147419
16 Place Of Birth GECKO:0000114 0.273065
17 Work GECKO:0000131 0.113717
18 Mother Health GECKO:0000132 0.477547
19 Diagnosis Consolidated MONDO:0000001 0.143923
python3 src/mapping-suggest/merge-mapping-suggestions.py -t templates/cogs.tsv -s build/intermediate/cogs_mapping_suggestions_zooma.tsv -s build/intermediate/cogs_mapping_suggestions_nlp.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv -s build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv -s build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv -o build/suggestions_cogs.tsv
['build/intermediate/cogs_mapping_suggestions_zooma.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_clean.tsv', 'build/intermediate/cogs_mapping_suggestions_zooma_definition.tsv', 'build/intermediate/cogs_mapping_suggestions_nlp_definition.tsv']
Mapping suggestions files concat:
term ... confidence
0 Person ... 0.760000
1 Person ... 0.760000
2 Person ... 0.760000
3 Sample ... 0.980000
4 Nationality ... 1.000000
5 Education ... 1.000000
6 Physical Activities ... 1.000000
7 Eating Habits ... 0.980000
8 Smoking ... 1.000000
9 Alcohol ... 1.000000
10 Sleep ... 0.980000
11 Health ... 0.980000
12 Objective Information ... 0.980000
0 Sample ... 0.199025
1 Health ... 0.105093
2 Education ... 0.169910
3 Person ... 0.246904
4 Person Portrait ... 0.246904
5 Smoking ... 0.870167
6 Alcohol ... 0.900000
[20 rows x 4 columns]
Merging suggestions successful. First twenty results:
Term ID ... Comment
0 EBD:0000001 ... NaN
1 EBD:0000002 ... NaN
2 EBD:0000003 ... NaN
3 EBD:0000004 ... NaN
4 EBD:0000005 ... NaN
5 EBD:0000006 ... NaN
6 EBD:0000007 ... NaN
7 EBD:0000008 ... NaN
8 EBD:0000009 ... NaN
9 EBD:0000010 ... NaN
10 EBD:0000011 ... NaN
11 EBD:0000012 ... NaN
12 EBD:0000013 ... NaN
13 EBD:0000014 ... NaN
14 EBD:0000015 ... NaN
15 EBD:0000016 ... NaN
16 EBD:0000017 ... NaN
17 EBD:0000018 ... NaN
18 EBD:0000019 ... NaN
19 EBD:0000020 ... NaN
[20 rows x 8 columns]
make[1]: Leaving directory '/workspace'
cp build/suggestions_cogs.tsv build/terminology.tsv
rm -f templates/cogs.tsv
make cogs-apply-data-validation
make[1]: Entering directory '/workspace'
java -jar build/robot.jar --prefixes src/prefixes.json export \
--input build/gecko.owl \
--header "LABEL" \
--export build/gecko_labels.tsv
python3 src/mapping-suggest/create-data-validation.py build/terminology.tsv build/gecko_labels.tsv build/cogs-data-validation.tsv build/cogs-info-table.tsv
cogs apply build/cogs-data-validation.tsv
make[1]: Leaving directory '/workspace'
make cogs-apply-info-table
make[1]: Entering directory '/workspace'
cogs apply build/cogs-info-table.tsv
make[1]: Leaving directory '/workspace'
cogs push
Success
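The "Zooma config" line printed in the log carries three settings that shape the confidence column of every suggestion table: zooma_confidence_mappings, rescale_nlp_matches, and min_match_probability. The sketch below illustrates one plausible reading of them. It is not the code in src/mapping-suggest; the function names are hypothetical, and the min-max rescaling is an assumption that merely fits the log, where the highest NLP confidence is 0.900000 and no listed match falls below 0.1.

```python
from typing import Dict, List, Tuple

# Values copied from the "Zooma config" line in the console output.
CONFIG: Dict = {
    "min_match_probability": 0.1,
    "rescale_nlp_matches": {"low": 0, "high": 0.9},
    "zooma_confidence_mappings": {"LOW": 0.51, "MEDIUM": 0.76, "GOOD": 0.98, "HIGH": 1},
}

Match = Tuple[str, str, float]  # (term, match, confidence)


def zooma_label_to_score(label: str, config: Dict = CONFIG) -> float:
    """Translate a Zooma confidence label (e.g. "GOOD") into a number."""
    return float(config["zooma_confidence_mappings"][label.upper()])


def rescale_scores(scores: List[float], config: Dict = CONFIG) -> List[float]:
    """Min-max rescale scores into [low, high].

    Assumption: the real NLP script may rescale differently; this only
    shows how a cap of 0.9 on the best match could arise.
    """
    low = float(config["rescale_nlp_matches"]["low"])
    high = float(config["rescale_nlp_matches"]["high"])
    if not scores:
        return []
    smallest, largest = min(scores), max(scores)
    if largest == smallest:
        # All scores identical: place them at the upper bound.
        return [high for _ in scores]
    span = largest - smallest
    return [low + (s - smallest) / span * (high - low) for s in scores]


def filter_matches(matches: List[Match], config: Dict = CONFIG) -> List[Match]:
    """Keep only matches at or above min_match_probability."""
    threshold = float(config["min_match_probability"])
    return [m for m in matches if m[2] >= threshold]


if __name__ == "__main__":
    # Made-up raw scores for illustration; they do not come from the log.
    raw = [
        ("Alcohol", "GECKO:0000069", 1.0),
        ("Sleep", "GECKO:0000071", 0.3),
        ("Work", "GECKO:0000131", 0.05),
    ]
    rescaled = rescale_scores([score for _, _, score in raw])
    matches = [(t, m, s) for (t, m, _), s in zip(raw, rescaled)]
    for term, match, confidence in filter_matches(matches):
        print(term, match, round(confidence, 6))
```

Under this reading, Zooma rows can only take the four mapped values (0.51, 0.76, 0.98, 1), which agrees with the Zooma tables in the log, while NLP rows vary continuously up to the configured high bound.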