Skip to main content

Download & transform SynPUF-1k

  • Initially, the D2E system does not contain any data
  • Synthetic Public Use Files (SynPUFs) provide sample patient data

Download Public SynPUF-1k

  • Open a terminal in the root of d2e repo
  • Run the following command to download synpuf1k54
wget https://caruscloud.uniklinikum-dresden.de/index.php/s/Qog8B5WCTHFHmjW/download -O ~/Downloads/synpuf1k54.tar.gz

Transform csv files to expected format

  • Open a terminal in the d2e directory.
  • Run the following commands to define directories:
GIT_BASE_DIR=$(pwd)
SYNPUF1K_DIR=$GIT_BASE_DIR/cache/synpuf1k
  • Create synpuf1k load dir
mkdir -p $SYNPUF1K_DIR
  • Change directory to synpuf1k load dir
cd $SYNPUF1K_DIR
  • Decompress to this dir
echo PWD=$PWD
tar xzf ~/Downloads/synpuf1k54.tar.gz --strip-components 1
  • Remove additional files
rm "drug_exposure (Kopie).csv"
rm README.md
  • Copy headers files
cp $GIT_BASE_DIR/docs/3-load/headers-synpuf1k/*.csv .
  • Confirm with linecounts
wc -l *
  • Run the following commands to replace tab with comma and append data to header files
sed 's/\t/,/g' care_site.csv >> 003_CARE_SITE.csv
sed 's/\t/,/g' condition_era.csv >> 009_CONDITION_ERA.csv
sed 's/\t/,/g' condition_occurrence.csv >> 008_CONDITION_OCCURRENCE.csv
sed 's/\t/,/g' cost.csv >> 005_COST.csv
sed 's/\t/,/g' death.csv >> 007_DEATH.csv
sed 's/\t/,/g' device_exposure.csv >> 010_DEVICE_EXPOSURE.csv
sed 's/\t/,/g' drug_era.csv >> 012_DRUG_ERA.csv
sed 's/\t/,/g' drug_exposure.csv >> 011_DRUG_EXPOSURE.csv
sed 's/\t/,/g' location.csv >> 002_LOCATION.csv
sed 's/\t/,/g' measurement.csv >> 013_MEASUREMENT.csv
sed 's/\t/,/g' observation.csv >> 014_OBSERVATION.csv
sed 's/\t/,/g' observation_period.csv >> 015_OBSERVATION_PERIOD.csv
sed 's/\t/,/g' payer_plan_period.csv >> 016_PAYER_PLAN_PERIOD.csv
sed 's/\t/,/g' person.csv >> 006_PERSON.csv
sed 's/\t/,/g' procedure_occurrence.csv >> 017_PROCEDURE_OCCURRENCE.csv
sed 's/\t/,/g' provider.csv >> 004_PROVIDER.csv
sed 's/\t/,/g' visit_occurrence.csv >> 018_VISIT_OCCURRENCE.csv
  • Remove blank line at end of each file
for i in 0*.csv; do perl -pi -e 'chomp if eof' $i; done
  • Remove downloaded files
ls | grep -v ^0 | xargs -n 1 rm
  • Confirm with linecounts
wc -l *

Mount cache directory to Prefect Docker container

  • Add to the PREFECT_DOCKER_VOLUMES environment variable list for trex service in docker-compose.yml
  • ${GIT_BASE_DIR}/cache/synpuf1k:/app/synpuf1k
  • ${GIT_BASE_DIR}/cache/transformed/vocab:/app/vocab

Create the database alpdev_pg and schema cdmdefault

Reminder: Before running the next steps

  • Navigate back to root folder d2e and Run the following command to create postgres schemas
cd $GIT_BASE_DIR
yarn create-postgres-cdm-schemas alpdev_pg cdmdefault cdmvocab
  • where cdmdefault is the default cdm schema name
  • Wait ~2 minutes

Load data to cdmdefault

  • Run the following commands to seed postgres cdm schemas with synpuf-1k
PROJECT_NAME=$(grep -E '^PROJECT_NAME=' .env 2>/dev/null | awk -F'=' '{print $2}' | tr -d '"') 
PROJECT_NAME=${PROJECT_NAME:-"d2e"}
CONTAINER_NAME=$PROJECT_NAME-dataflow-gen-worker
docker exec -it CONTAINER_NAME prefect deployment run data_load_plugin/data_load_plugin --param options='{"files":[{"name": "Location","path": "/app/synpuf1k/002_LOCATION.csv", "truncate": "True", "table_name": "location"},{"name": "CARE_SITE","path": "/app/synpuf1k/003_CARE_SITE.csv", "truncate": "True", "table_name": "care_site"},{"name": "Provider","path": "/app/synpuf1k/004_PROVIDER.csv", "truncate": "True", "table_name": "provider"},{"name": "Cost","path": "/app/synpuf1k/005_COST.csv", "truncate": "True", "table_name": "cost"},{"name": "Person","path": "/app/synpuf1k/006_PERSON.csv", "truncate": "True", "table_name": "person"},{"name": "Death","path": "/app/synpuf1k/007_DEATH.csv", "truncate": "True", "table_name": "death"},{"name": "Condition_Occirence","path": "/app/synpuf1k/008_CONDITION_OCCURRENCE.csv", "truncate": "True", "table_name": "condition_occurrence"},{"name": "Condition_Era","path": "/app/synpuf1k/009_CONDITION_ERA.csv", "truncate": "True", "table_name": "condition_era"},{"name": "Device_Exposure","path": "/app/synpuf1k/010_DEVICE_EXPOSURE.csv", "truncate": "True", "table_name": "device_exposure"},{"name": "Drug_Exposure","path": "/app/synpuf1k/011_DRUG_EXPOSURE.csv", "truncate": "True", "table_name": "drug_exposure"},{"name": "Drug_Era","path": "/app/synpuf1k/012_DRUG_ERA.csv", "truncate": "True", "table_name": "drug_era"},{"name": "Measurement","path": "/app/synpuf1k/013_MEASUREMENT.csv", "truncate": "True", "table_name": "measurement"},{"name": "Observation","path": "/app/synpuf1k/014_OBSERVATION.csv", "truncate": "True", "table_name": "observation"},{"name": "Observation_Period","path": "/app/synpuf1k/015_OBSERVATION_PERIOD.csv", "truncate": "True", "table_name": "observation_period"},{"name": "Payer_Plan_Period","path": "/app/synpuf1k/016_PAYER_PLAN_PERIOD.csv", "truncate": "True", "table_name": "payer_plan_period"},{"name": "Procedure_Occurrence","path": "/app/synpuf1k/017_PROCEDURE_OCCURRENCE.csv", "truncate": "True", "table_name": "procedure_occurrence"},{"name": "Visit_Occurrence","path": "/app/synpuf1k/018_VISIT_OCCURRENCE.csv", "truncate": "True", "table_name": "visit_occurrence"}],"schema_name":"cdmdefault","header":"true","delimiter":",","database_code": "alpdev_pg", "chunksize": "50000", "encoding": "utf_8"}'
  • Docker container logs can be checked with the bash command docker logs --tail 100 CONTAINER_NAME

  • Once the flow is completed, the container logs the message "Finished in state Completed()"

  • Confirm data loaded with

CONTAINER_NAME=$PROJECT_NAME-minerva-postgres-1
docker exec -it $CONTAINER_NAME psql -h localhost -U postgres -p 5432 -d alpdev_pg --command "SELECT schemaname as table_schema,relname as table_name,n_live_tup as table_rows FROM pg_stat_user_tables where schemaname='cdmdefault' ORDER BY n_live_tup DESC limit 17;"
  • Expect output to be similar to:
table_schematable_nametable_rows
cdmdefaultcost367378
cdmdefaultcondition_occurrence147186
cdmdefaultprocedure_occurrence137522
cdmdefaultcondition_era99855
cdmdefaultdrug_exposure57095
cdmdefaultdrug_era56257
cdmdefaultvisit_occurrence55261
cdmdefaultprovider38810
cdmdefaultmeasurement34556
cdmdefaultcare_site23259
cdmdefaultobservation19339
cdmdefaultpayer_plan_period3470
cdmdefaultdevice_exposure2262
cdmdefaultobservation_period1000
cdmdefaultperson1000
cdmdefaultlocation570
cdmdefaultdeath52

Troubleshooting

create-postgres-cdm-schemas fails to start

  • create-postgres-cdm-schemas requires the following containers:
    • d2e-minerva-postgres-1, d2e-data-flow-gen-1, d2e-data-flow-gen-agent-1
  • If not running then check container logs

data flows

Repeat seed

  • To repeat, run "Load data to cdmdefault" commands in the sequence given

see: