Skip to main content

Load Athena Vocabularies Datasets

Download Datasets

  • Select DOWNLOAD button at top right of page
  • Default datasets are automatically selected
  • Confirm CDM version 5.x at top right
  • Enter a name
  • Click DOWNLOAD
  • Await email with download link (typically within ~1 hour)
  • Download ~174Mb zipfile to Downloads folder
  • Run the following commands in the d2e directory:
GIT_BASE_DIR=$(pwd)
VOCAB_DIR=$GIT_BASE_DIR/cache/vocab
ZIP_FILE="$(ls -tr ~/Downloads/vocabulary_download_v5*.zip | head -1)"
cd $VOCAB_DIR
unzip -o -d . "${ZIP_FILE}"
  • remove additional file
rm CONCEPT_CPT4.csv

Confirm Line Count Before Transformation

  • Run command to output line count for each CSV file
wc -l *.csv | sort -nr
  • Output should be:
  130981848 total
72754470 CONCEPT_ANCESTOR.csv
47212425 CONCEPT_RELATIONSHIP.csv
5975393 CONCEPT.csv
2980116 DRUG_STRENGTH.csv
2058224 CONCEPT_SYNONYM.csv
691 RELATIONSHIP.csv
418 CONCEPT_CLASS.csv
60 VOCABULARY.csv
51 DOMAIN.csv
  • Note that unzipped CSV file from Athena contains an extra new line that is empty.

Transform csv files to expected format

  • Escape existing double quotes
  • Add double quotes around each value
  • This transformation is done in case there is a literal string character e.g. \t or \n in the value.
  • Enclosing the values in double quotes would prevent interpretation as a tab or newline.
  • Move files to a folder named transformed
mkdir transformed
for CSV_FILE in *.csv; do sed "s/\"/\"\"/g;s/\t/\"\t\"/g;s/\(.*\)/\"\1\"/" $CSV_FILE >> ./transformed/$CSV_FILE; done

Confirm Line Count of Transformed CSVs before ingestion

  • Run command to output line count for each CSV file
wc -l *.csv | sort
  • Linecounts are similar to:
        51 DOMAIN.csv
60 VOCABULARY.csv
424 CONCEPT_CLASS.csv
697 RELATIONSHIP.csv
1669979 CONCEPT_SYNONYM.csv
2981808 DRUG_STRENGTH.csv
6071643 CONCEPT.csv
38570317 CONCEPT_RELATIONSHIP.csv
72374968 CONCEPT_ANCESTOR.csv
121669947 total
  • Note that unzipped CSV file from Athena contains an extra new line that is empty.

Load data to cdmvocab

  • Run the following command in terminal to stop an alp docker container and start another container to load data
docker exec -it alp-dataflow-gen-worker prefect deployment run data_load_plugin/data_load_plugin --param options='{"files":[{"name": "CONCEPT_ANCESTOR","path": "/app/vocab/CONCEPT_ANCESTOR.csv", "truncate": "True", "table_name": "concept_ancestor"},{"name": "CONCEPT_CLASS","path": "/app/vocab/CONCEPT_CLASS.csv", "truncate": "True", "table_name": "concept_class"},{"name": "CONCEPT_RELATIONSHIP","path": "/app/vocab/CONCEPT_RELATIONSHIP.csv", "truncate": "True", "table_name": "concept_relationship"},{"name": "CONCEPT_SYNONYM","path": "/app/vocab/CONCEPT_SYNONYM.csv", "truncate": "True", "table_name": "concept_synonym"},{"name": "CONCEPT","path": "/app/vocab/CONCEPT.csv", "truncate": "True", "table_name": "concept"},{"name": "DOMAIN","path": "/app/vocab/DOMAIN.csv", "truncate": "True", "table_name": "domain"},{"name": "DRUG_STRENGTH","path": "/app/vocab/DRUG_STRENGTH.csv", "truncate": "True", "table_name": "drug_strength"},{"name": "RELATIONSHIP","path": "/app/vocab/RELATIONSHIP.csv", "truncate": "True", "table_name": "relationship"},{"name": "VOCABULARY","path": "/app/vocab/VOCABULARY.csv", "truncate": "True", "table_name": "vocabulary"}],"schema_name":"cdmvocab","header":"true","delimiter":"\t","database_code": "alpdev_pg", "chunksize": "50000", "encoding": "utf_8"}'
  • Docker container logs can be checked with the bash command docker logs --tail 100 alp-dataflow-gen-worker
  • Once the flow is completed, the container logs the message "Finished in state Completed()"
  • expected output is

    COPY ${LINE_COUNT}

Validation

  • Confirm data loaded with by select count(*) from tables
PROJECT_NAME=$(grep -E '^PROJECT_NAME=' .env 2>/dev/null | awk -F'=' '{print $2}' | tr -d '"') 
PROJECT_NAME=${PROJECT_NAME:-"d2e"}
CONTAINER_NAME=$PROJECT_NAME-minerva-postgres-1
docker exec -it $CONTAINER_NAME psql -h localhost -U postgres -p 5432 -d alpdev_pg --command "SELECT table_name, row_count FROM (SELECT 'concept_relationship' AS table_name, COUNT(*) AS row_count FROM cdmvocab.concept_relationship UNION SELECT 'concept_ancestor' AS table_name, count(*) AS row_count FROM cdmvocab.concept_ancestor UNION SELECT 'concept_relationship' AS table_name, COUNT(*) AS row_count FROM cdmvocab.concept UNION SELECT 'relationship' AS table_name, COUNT(*) AS row_count FROM cdmvocab.relationship UNION SELECT 'concept_synonym' AS table_name, COUNT(*) AS row_count FROM cdmvocab.concept_synonym UNION SELECT 'vocabulary' AS table_name, COUNT(*) AS row_count FROM cdmvocab.vocabulary UNION SELECT 'domain' AS table_name, COUNT(*) AS row_count FROM cdmvocab.domain UNION SELECT 'drug_strength' AS table_name, COUNT(*) AS row_count FROM cdmvocab.drug_strength UNION SELECT 'concept_class' AS table_name, COUNT(*) AS row_count FROM cdmvocab.concept_class) temp ORDER BY row_count DESC;"
  • Expect output row_count similar to:
table_namerow_count
concept_ancestor72754469
concept_relationship47212424
concept5975392
drug_strength2980115
concept_synonym2058223
relationship690
concept_class417
vocabulary59
domain50

Troubleshooting

Repeat load

  • To repeat, run "Load data to cdmvocab" commands in the sequence given