Data Sets

Machine learning methods are increasingly popular in health care research. This shift towards integrated data science approaches requires professional development for the existing health care data analyst workforce, and educational resources are needed to ease the transition. Access to real health care datasets, which are vital for training in health care data analysis methods, is hindered by financial, ethical and patient confidentiality concerns. Synthetic datasets that mimic real-world complexities offer a simpler solution.

On this page, we present a synthetic dataset that mirrors routinely collected primary care data on heart attack and stroke in the adult population. The data incorporates many of the practical challenges encountered in routinely collected primary care data, such as missing data, informative censoring, interactions, irrelevant variables, and noise, and can be used for training in methods that handle these difficulties. The intent is for users to build models of heart attack and stroke risk using survival-based methods.
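As a starting point for the survival-based modelling described above, here is a minimal sketch of a Kaplan-Meier estimator written with the Python standard library only. The column names `time` and `event` and the toy rows are hypothetical and are not taken from the dataset or its metadata file; substitute the real variable names before use. Rows with missing values are simply dropped here, which is an assumption made for brevity and not a recommended way of handling missing data.

```python
import csv
import io
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each person
    events:    1 if the outcome was observed, 0 if the person was censored
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    observed = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Product-limit step: multiply by the share surviving past time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def load_complete_cases(text):
    """Parse CSV text, keeping only rows where both columns are present."""
    durations, events = [], []
    for row in csv.DictReader(io.StringIO(text)):
        if row["time"] == "" or row["event"] == "":
            continue  # naive complete-case handling (assumption, see lead-in)
        durations.append(float(row["time"]))
        events.append(int(row["event"]))
    return durations, events


# Toy rows for illustration only; hypothetical column names, not real records.
TOY_CSV = """time,event
5,1
8,0
12,1
,1
12,0
20,1
"""

durations, events = load_complete_cases(TOY_CSV)
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t:g}  S(t)={s:.3f}")
```

Note that the Kaplan-Meier estimator assumes censoring is non-informative, and complete-case deletion can bias estimates when data are not missing completely at random. Because the dataset deliberately includes informative censoring and missing data, a sketch like this is best treated as a naive baseline to compare against methods that handle those difficulties.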

By sharing this synthetic dataset openly, we aim to contribute a transformative asset for professional training in health and social care data analysis. The dataset covers demographics, lifestyle variables, comorbidities, systolic blood pressure, hypertension treatment, family history of cardiovascular disease, respiratory function, and experience of heart attack and/or stroke. This initiative aims to address the shortage of sophisticated health care datasets available for training, and so to foster professional development of the health and social care research workforce.

Development of this dataset was funded by ARC Wessex and the National Centre for Research Methods (NCRM).

Synthetic Data set CSV

cvd_synthetic_dataset_v0.2_metadata.xlsx