Database Restricted Access
Flatten: COVID-19 Survey Data on Symptoms, Demographics and Mental Health in Canada
Shrey Jain, Marie Charpignon, Mathew Samuel, Jaydeep Mistry, Nicholas Frosst, Leo Anthony Celi, Marzyeh Ghassemi
Published: March 8, 2021. Version: 1.0
When using this resource, please cite:
Jain, S., Charpignon, M., Samuel, M., Mistry, J., Frosst, N., Celi, L. A., & Ghassemi, M. (2021). Flatten: COVID-19 Survey Data on Symptoms, Demographics and Mental Health in Canada (version 1.0). PhysioNet. https://doi.org/10.13026/v8eq-8v80.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
The expedited collection and analysis of pre-clinical syndromic data is delivering novel insights into public health conditions during the ongoing COVID-19 pandemic. We provide Canada’s first publicly available pre-clinical COVID-19 dataset, based on survey responses collected from 294,106 Canadians from March 23rd until July 30th 2020, using a platform developed by Flatten, a Canadian non-profit organization. We provide this data for academic and industry research.
Background
The health impact of, and governmental response to, the COVID-19 pandemic vary widely across countries. However, data has been a common denominator in informing the effective implementation of non-pharmaceutical interventions (NPIs) [2].
In the absence of high-volume testing and rapid data sharing, pre-clinical syndromic reporting platforms have been shown to provide a much-needed tool to fill the data gap. Research from North America, Africa, and Asia validates the role of syndromic surveillance data in supplementing public health reports and in delivering valuable understanding of infectious disease burdens, including assessments of the extent of current COVID-19 testing rates [3, 4, 5].
Digital surveillance health tools include contact tracing, syndromic surveillance (symptom tracking), quarantine compliance monitoring, and flow modelling (movement tracking) [6]. While these digital health tools are not meant to serve as a substitute for enhanced testing, syndromic reporting via surveys may bolster existing public health systems and provide an opportunity to improve digitally-based infectious disease surveillance.
The primary aim of syndromic surveillance is to identify illness clusters early, well before patient diagnoses are confirmed and reported to public health agencies. This may enable an early response that reduces morbidity and mortality at the population level.
Recognizing the value of digital syndromic surveillance for the COVID-19 pandemic, we launched a non-profit called Flatten [1] on March 23, 2020, focusing on building a robust self-reporting syndromic surveillance tool. The tool collects socio-demographic and clinical data, combined with COVID-19 exposure, symptom, and historical test result information, and applies real-time mapping methods (aggregating survey responses by postal code and visualizing them on an interactive map for users).
We launched the tool in the form of a web-based survey. The Flatten team received a significant amount of media attention in March of 2020, which helped to dramatically increase survey submissions (294,106 survey responses collected).
In recognition of the increasingly broad use of COVID-19 data as a mechanism for response efforts, Flatten is releasing Canada’s first freely accessible pre-clinical database to enable clinical research and facilitate greater understanding of how to effectively use data for public health interventions.
Methods
The COVID-19 pandemic has prompted fast-moving and pragmatic approaches to solving problems in public health. Recognizing that time was a major concern, we focused on releasing surveys as quickly as possible.
Our survey has evolved through three iterations (Schema 1, Schema 2, Schema 3) based on feedback received from public health experts. Survey participants were solicited through social media platforms and traditional news services.
Within three weeks of launch, Flatten had been featured on major news networks in Canada. This coverage through well-established communication channels benefited our data collection effort by increasing survey participation and reaching demographics that actively use media networks. Notable features of the platform in national news, including the public announcement from the City of Montreal public health, can be found on flatten.ca [7-13].
In addition to features on major news networks, the Flatten team ran social media ads on Twitter and Facebook to capture user attention. Given that the survey was mainly shared through news and social media, the ability to characterize less urbanized geographies and underrepresented population groups of Canada is limited in this dataset. As such, the dataset is primarily useful for characterizing the impact of the COVID-19 pandemic in urban areas rather than rural regions. Additionally, the ability to characterize specific minority subgroups (e.g., members of Indigenous communities) is limited due to their low participation rates.
The Flatten survey was created with advice from a team of public health experts at the Dalla Lana School of Public Health, University of Toronto. As research on COVID-19 evolved, the initial survey was amended, and subsequent iterations incorporated feedback from machine learning and clinical experts on the Flatten advisory team. The symptoms that we aimed to capture (e.g., fever, cough, shortness of breath) aligned with the symptoms listed by Canadian federal public health on their website [14].
The Flatten data was de-identified by Replica Analytics [18] using a common statistical disclosure control (SDC) protocol [15]. The SDC method manages the risk of identification from quasi-identifiers, which are variables in the data that an adversary could know and use in a re-identification attack. Here, the quasi-identifiers are the demographic and socio-economic variables in the data (i.e., date, FSA, age, comorbidities).
Because the data disclosure is a semi-public release, in that users must agree to specific terms of use including a prohibition on identification attempts, the risk metric used is a strict average risk, given a survey response. This means that the estimated number of sample uniques (survey responses whose combination of quasi-identifier values has an empirical frequency of one) that are also population uniques must be very small, and that the average risk must be below the commonly used threshold of 0.09 recommended by the European Medicines Agency [16] and Health Canada [17].
If the estimated risk is above the threshold, quasi-identifiers are generalized by decreasing their granularity or removed from the dataset for all users. In parallel, the Flatten data was linked with 2016 census data from Environics Canada [19], using the Forward Sortation Area (FSA). For each FSA, the number of inhabitants per age, race, and gender was extracted. If the size of a subgroup was less than 11 based on census data, the corresponding record was deemed to be high risk and the survey response was removed [15]. To protect user privacy, all free-form text fields were removed from the final dataset being released.
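The average-risk check and the small-cell rule described above can be illustrated with a minimal Python sketch. It is a simplified, sample-based illustration only: the actual protocol estimates population uniqueness, which is not reproduced here, and the field names (`week`, `fsa`, `age_band`), the helper functions, and the example records are hypothetical.

```python
from collections import Counter

# Hypothetical quasi-identifier field names; the released files may use others.
QUASI_IDENTIFIERS = ("week", "fsa", "age_band")
CENSUS_KEYS = ("fsa", "age_band")   # hypothetical census subgroup definition
RISK_THRESHOLD = 0.09               # average-risk threshold cited in the text
MIN_CELL_SIZE = 11                  # census subgroups below this size are high risk


def key_of(record, keys):
    """Tuple of the record's values for the given fields."""
    return tuple(record[k] for k in keys)


def class_sizes(records, keys=QUASI_IDENTIFIERS):
    """Number of records sharing each quasi-identifier combination."""
    return Counter(key_of(r, keys) for r in records)


def sample_uniques(records, keys=QUASI_IDENTIFIERS):
    """Records whose quasi-identifier combination occurs exactly once."""
    sizes = class_sizes(records, keys)
    return [r for r in records if sizes[key_of(r, keys)] == 1]


def average_risk(records, keys=QUASI_IDENTIFIERS):
    """Sample-based average risk: mean over records of 1 / class size."""
    sizes = class_sizes(records, keys)
    return sum(1 / sizes[key_of(r, keys)] for r in records) / len(records)


def drop_small_census_cells(records, census_counts, keys=CENSUS_KEYS):
    """Keep records whose census subgroup has at least MIN_CELL_SIZE inhabitants."""
    return [r for r in records
            if census_counts.get(key_of(r, keys), 0) >= MIN_CELL_SIZE]


# Made-up example records and census counts, for illustration only.
records = [
    {"week": 13, "fsa": "M5V", "age_band": "<60"},
    {"week": 13, "fsa": "M5V", "age_band": "<60"},
    {"week": 13, "fsa": "M5V", "age_band": "<60"},
    {"week": 13, "fsa": "K0A", "age_band": "60+"},
]
census_counts = {("M5V", "<60"): 5000, ("K0A", "60+"): 8}

risk = average_risk(records)
needs_generalization = risk > RISK_THRESHOLD
kept = drop_small_census_cells(records, census_counts)
```

In this toy example the single record from the small census subgroup is removed, and the remaining average risk would still need to be compared against the threshold after any generalization step.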
Flatten has a certification of institutional ethics approval through the Conjoint Health Research Ethics Board, University of Calgary (REB20-0608).
Data Description
The Flatten dataset consists of three versions of the survey referred to as Schema 1, Schema 2, and Schema 3. As compared to Schema 1, subsequent versions either include additional questions or refine existing question and answer options. Each survey response is stored as an individual record (row) in the dataset.
Across all survey versions, each record in the Flatten dataset consists of temporal, spatial, and survey response attributes. Temporal data include the week number or month during which the survey response was recorded. Survey response data primarily consists of a binary variable indicating whether the respondent is aged 60 or over, and their symptoms related to COVID-19 at the time of response (i.e., fever, cough, and shortness of breath). The survey also asks whether the survey participant has travelled outside of Canada in the past 14 days, or if they suspect they have been in contact with someone infected with COVID-19 in the past 14 days.
Survey responses associated with Schema 1 are the most numerous, consisting of 263,640 individual level records submitted in the early weeks of the pandemic (March 23rd to April 8th of 2020). These responses also correspond with the peak of Flatten’s presence in the media. Survey responses associated with Schema 2 consist of 14,932 records (April 8th to April 28th of 2020), and Schema 3 consists of 15,534 records (April 28th to July 30th of 2020). Although Schema 2 and Schema 3 contain far fewer records than Schema 1, they contain valuable information about the demographic profile of the survey participant that Schema 1 does not have, such as their race, ethnicity, sex, age, and pandemic-induced most pressing needs (i.e., food, medical, financial, emotional, other).
The raw Flatten dataset consists of 498,211 survey submissions (90.81% of responses from unique participants) with granular temporal, spatial, and survey participant socio-demographic and health factors data. Only a de-identified subset of survey responses is made available in this version’s release, for ethical as well as privacy and identity protection reasons. Furthermore, only the subset of survey responses in which the participant indicates living in Ontario is made available, given the predominance of the province in the collected data records (60.3% of the responses).
The dataset contains four variables related to the health status of a survey respondent, labelled as follows: “probable”, “vulnerable”, “is_most_recent”, and “any_medical_conditions”. The definitions of these variables were developed based on guidance from public health professors at the University of Toronto Dalla Lana School of Public Health at the time the data was collected.
A survey respondent was considered to be a “probable” COVID-19 case if they fit one of the following three combinations: have come in contact with illness; have travelled outside of Canada and have the symptoms (fever, cough, shortness of breath); have travelled outside of Canada and have the symptoms (cough, shortness of breath). The decision tree for this definition can be found in the following script: https://github.com/flatten-official/flatten-scripts/blob/staging/dags/sanitisation/sanitisation.py.
A survey respondent was considered to be “vulnerable” to COVID-19 disease if they were aged 65 or older or if they had at least one of the following comorbidities: diabetes, cancer, high blood pressure, heart disease, asthma or other breathing-related illness, immunocompromising condition, kidney disease, history of stroke. The code can also be found in the script linked above.
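As a rough illustration of the “probable” and “vulnerable” definitions above, the following Python sketch encodes them as two functions. It is not the code from the linked script: the function names, argument names, and comorbidity labels are hypothetical, and it assumes that every symptom listed in a combination must be present for that combination to apply.

```python
# Hypothetical labels for the comorbidities listed in the text.
COMORBIDITIES = frozenset({
    "diabetes",
    "cancer",
    "high_blood_pressure",
    "heart_disease",
    "breathing_illness",      # asthma or other breathing-related illness
    "immunocompromised",
    "kidney_disease",
    "stroke_history",
})
VULNERABLE_AGE = 65  # age cutoff stated in the "vulnerable" definition


def is_probable(contact, travelled, fever, cough, shortness_of_breath):
    """True if the respondent matches any of the three listed combinations.

    Assumption: all symptoms named in a combination must be present.
    Under that reading, the second combination is implied by the third.
    """
    combinations = (
        contact,
        travelled and fever and cough and shortness_of_breath,
        travelled and cough and shortness_of_breath,
    )
    return any(combinations)


def is_vulnerable(age, conditions):
    """True if the respondent is 65 or older or reports a listed comorbidity."""
    return age >= VULNERABLE_AGE or bool(COMORBIDITIES & set(conditions))


# Made-up respondent: travelled, with cough and shortness of breath but no fever.
example_probable = is_probable(
    contact=False, travelled=True, fever=False, cough=True, shortness_of_breath=True
)
example_vulnerable = is_vulnerable(age=42, conditions={"diabetes"})
```

The linked script remains the authoritative definition; the sketch only makes the stated rules concrete.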
A survey response is considered most recent if it is the most recent submission from the respondent based on their unique user identifier. This variable is used in the event a respondent makes multiple survey submissions across the survey data collection time period.
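A minimal Python sketch of how such a flag could be derived is shown below. The field names `user_id` and `submitted_on` are hypothetical, and the tie-handling choice (flagging every submission that shares a user's latest timestamp) is an assumption rather than documented behaviour.

```python
from datetime import date


def flag_most_recent(submissions):
    """Return copies of the submissions with an added is_most_recent flag."""
    latest = {}
    for s in submissions:
        uid = s["user_id"]
        # Track the latest submission time seen for each user.
        if uid not in latest or s["submitted_on"] > latest[uid]:
            latest[uid] = s["submitted_on"]
    # Assumption: submissions tied on the latest timestamp are all flagged.
    return [
        dict(s, is_most_recent=(s["submitted_on"] == latest[s["user_id"]]))
        for s in submissions
    ]


# Made-up submissions for illustration only.
submissions = [
    {"user_id": "a", "submitted_on": date(2020, 3, 25)},
    {"user_id": "a", "submitted_on": date(2020, 4, 10)},
    {"user_id": "b", "submitted_on": date(2020, 3, 30)},
]
flagged = flag_most_recent(submissions)
latest_only = [s for s in flagged if s["is_most_recent"]]
```

Filtering on the flag, as in `latest_only`, yields one record per respondent and avoids counting repeat participants more than once.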
Usage Notes
The Flatten dataset being released is intended to be used for population health research by teams of public health researchers, epidemiologists, clinicians, and data scientists with domain-specific knowledge to generate relevant insights. Flatten will also be used as coursework material at the Massachusetts Institute of Technology (HST.936: Leveraging data science for public health). It may also be leveraged as part of future data science events.
The code used to build the Flatten syndromic surveillance tool for the Canadian website is openly available at https://github.com/flatten-official, under the MIT license.
Release Notes
Flatten would like to acknowledge the entire development, legal, and design teams, as well as the supporting organizations (Mosaic, Osler Law, Cosette, and ESRI) for creating this dataset. The Flatten team thanks Dr. Khaled El Emam and Replica Analytics for de-identifying the dataset being released.
Acknowledgements
Flatten would like to acknowledge the entire development, legal, and design teams, as well as the supporting organizations (Mosaic, Osler, Cosette, ESRI, University of Toronto) for supporting Flatten in creating this dataset. The Flatten team thanks Dr. Khaled El Emam and Replica Analytics for de-identifying the dataset.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. Flatten, from https://www.flatten.ca/.
2. Askitas, N., Tatsiramos, K. & Verheyden, B. Estimating worldwide effects of non-pharmaceutical interventions on COVID-19 incidence and population mobility patterns using a multiple-event study. Sci. Rep. 11, 1972 (2021).
3. May, L., Chretien, J.-P. & Pavlin, J. A. Beyond traditional surveillance: applying syndromic surveillance to developing settings--opportunities and challenges. BMC Public Health 9, 242 (2009).
4. Lapointe-Shaw L, Rader B, Astley CM, Hawkins JB, Bhatia D, Schatten WJ, Lee TC, Liu JJ, Ivers NM, Stall NM, Gournis E. Syndromic Surveillance for COVID-19 in Canada. medRxiv. 2020.
5. Abat C, Colson P, Chaudet H, Rolain JM, Bassene H, Diallo A, Mediannikov O, Fenollar F, Raoult D, Sokhna C. Implementation of syndromic surveillance systems in two rural villages in Senegal. PLOS Neglected Tropical Diseases. 2016 Dec 7;10(12):e0005212.
6. Gasser, U., Ienca, M., Scheibner, J., Sleigh, J. & Vayena, E. Digital tools against COVID-19: taxonomy, ethical challenges, and navigation aid. Lancet Digit Health 2, e425–e434 (2020).
7. U of T startup leverages big data to fight COVID-19 in Mogadishu. https://www.utoronto.ca/news/u-t-startup-leverages-big-data-fight-covid-19-mogadishu.
8. This 18-year-old tracks COVID-19 hot spots using crowdsourcing. CBC News.
9. Kirkwood, I. 14 U of T students, alumni named recipients of social impact fund financing. https://betakit.com/14-u-of-t-students-alumni-named-recipients-of-social-impact-fund-financing/ (2020).
10. Winsa, P. Mississauga student helping to develop crowdsourcing COVID-19 website for Somalia. The Toronto Star (2020).
11. Semeniuk, I. How big data, population health and other scientists are trying to map COVID-19 in the community. The Globe and Mail (2020).
12. Jones, A. M. Canadian university students create map of self-reported potential COVID-19 cases. CTV News https://www.ctvnews.ca/sci-tech/canadian-university-students-create-map-of-self-reported-potential-covid-19-cases-1.4872245 (2020).
13. Luft, A. Montreal asks people to use new web app to track COVID-19 cases throughout the city. CTV News https://montreal.ctvnews.ca/montreal-asks-people-to-use-new-web-app-to-track-covid-19-cases-throughout-the-city-1.4881506 (2020).
14. Canada, P. (2020, November 20). Government of Canada. Retrieved January 15, 2021, from https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection/symptoms.html#s
15. K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.
16. European Medicines Agency. European Medicines Agency policy on publication of clinical data for medicinal products for human use. Technical report.
17. Guidance document on Public Release of Clinical Information: profile page - Canada.ca
18. https://www.replica-analytics.com/home
19. https://environicsanalytics.com/en-ca
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0):
https://doi.org/10.13026/v8eq-8v80
DOI (latest version):
https://doi.org/10.13026/3pqw-8411
Topics:
public health
population statistics
covid-19
Project Website:
https://www.flatten.ca/
Files
- sign the data use agreement for the project