Vaccine safety surveillance using social media data


Sedigheh Khademi Habibabadi


This dissertation is being published during an extraordinary time, under the shadow of the worldwide spread of COVID-19, with the world hoping for vaccines that will allow us to breathe freely again, give us back our freedom to move, and let us emerge from lockdowns, fear and frustration. Although vaccine manufacturers are being careful to put vaccines through well-established testing processes, new vaccine technologies are on trial and there is intense public health and political pressure to bring them to market. It is inevitable that, as vaccines are distributed to the general population, there will be reports of adverse health-related events associated with vaccination, some of which will be genuine vaccine reactions, either expected or untoward. It is vital that reports of adverse events are continually monitored to help ensure a rapid response to any emerging issues with vaccine safety and effectiveness.

Traditional monitoring for Adverse Events Following Immunisation (AEFI) relies on various established reporting systems, in which there is inevitably a lag between an adverse event following vaccination, the reporting of that event, and the subsequent processing of the report. It is therefore desirable to try to detect AEFI earlier, ideally close to real time, and monitoring social media data holds promise as a resource for this. However, posts in which social media users relay experiences of adverse events following vaccination are difficult to detect, because an overwhelming volume of other vaccine- and virus-related conversation swamps social media platforms. This research is dedicated to proving that useful Vaccine Adverse Event Mentions (VAEM) can be detected in social media, using Twitter as a data source and applying natural language processing techniques to successively filter out unwanted messages and bring VAEM to light.

This research has developed a VAEM-Mine method that combines two stages of topic modelling with classification to extract around 90% of all VAEM posts from a Twitter stream, with a high degree of confidence. This is a significant achievement, as VAEM posts constitute less than 2% of all vaccine-related Twitter posts. The research also presents a taxonomy of vaccine-related Twitter posts, datasets of VAEM Twitter posts, and detailed reporting on the most effective approaches to topic modelling and to classification of extracted posts in relation to varying data volumes. This work provides a methodological foundation for potential near-real-time monitoring of social media VAEM to augment existing signal detection systems, maximising the ability to detect unsafe vaccines rapidly.
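The idea of successively filtering a stream through two topic-modelling stages and then a classifier can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the sample posts are invented, and simple keyword rules stand in for the topic models and trained classifiers that the thesis actually evaluates. It shows only the staging of the filters, not the VAEM-Mine implementation.

```python
from typing import Callable, Iterable, List

# A stage is any predicate that decides whether a post is kept.
Stage = Callable[[str], bool]


def stage_one_topic_filter(post: str) -> bool:
    """Assumed stand-in for the first topic model: keep vaccine-related posts."""
    text = post.lower()
    return any(term in text for term in ("vaccine", "vaccinated", "flu shot", "jab"))


def stage_two_topic_filter(post: str) -> bool:
    """Assumed stand-in for the second topic model: keep personal accounts."""
    text = f" {post.lower()} "  # pad so whole-word checks work at the edges
    return any(term in text for term in (" i ", " my ", " me "))


def classify_vaem(post: str) -> bool:
    """Assumed stand-in for the classifier: keep posts naming a reaction."""
    text = post.lower()
    return any(term in text for term in ("sore", "fever", "headache", "swollen"))


def successive_filter(posts: Iterable[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order, passing on only the posts it keeps."""
    remaining = list(posts)
    for stage in stages:
        remaining = [post for post in remaining if stage(post)]
    return remaining


# Invented example posts, not taken from the thesis datasets.
posts = [
    "New vaccine rollout announced by the health minister",
    "Got my flu shot yesterday and my arm is so sore",
    "I love my new bike",
    "My kid had a fever after the vaccine, anyone else?",
]

candidates = successive_filter(
    posts, [stage_one_topic_filter, stage_two_topic_filter, classify_vaem]
)
for post in candidates:
    print(post)
```

Each stage sees only what the previous stage kept, so the later, more expensive step works on a much smaller and more relevant set of posts. That matters when the posts of interest are a small fraction of the stream.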

Table of Contents

List of Tables
List of Figures
List of Abbreviations

  1. Introduction
    • Vaccines and vaccine safety
    • Social media monitoring
    • Vaccine adverse event mentions
    • Problem Statement
    • Research aims and objectives
      • Research questions
      • Research design
    • Research contribution
    • Structure of the thesis
  2. Literature Review
    • Chapter overview
    • Vaccine safety
      • Surveillance definition
      • Vaccine safety assessment
      • Pre-licensure (clinical trials) surveillance
      • Post-licensure surveillance
    • Social media data sources for public health studies
    • Surveillance using social media
      • Disease surveillance
      • Adverse Drug Reaction detection
      • Vaccine Adverse Event detection
      • Personal health mention detection
      • Monitoring of vaccines and vaccinations
    • Social media data processing
      • Social media data collection
      • Text pre-processing
    • Machine learning methods in text processing
      • Topic modelling
      • Deep learning
    • Chapter 2 summary
  3. Research Design
    • Chapter overview
    • Research approach
    • Research process
    • Framework
      • Domain exploration
      • Data collection
      • Two stage topic modelling
      • Datasets and embeddings
      • Features development
      • VAEM classification
    • The VAEM-Mine method
    • Chapter 3 Summary
  4. Data collection and preparation
    • Chapter overview
      • Social media (Twitter) data collection
    • Data pre-processing
      • Removing unwanted tokens
      • Pre-processing and adding features
    • Topic modelling data preparation
    • Topic modelling datasets
    • Classification datasets
    • Phase One classification data
      • Experimenting with imbalanced datasets
      • Creating a balanced dataset
      • Imbalanced (Victorian) test dataset
      • Final Phase One datasets
    • Phase Two classification data
      • Additional data collection
      • Balancing the Phase Two data
      • Final Phase Two datasets
    • Chapter 4 summary
  5. Topic modelling
    • Chapter overview
      • Topic modelling algorithms
      • Topic modelling data
      • Labelling for topic model scoring
    • Topic modelling scoring method
      • Calculating F-Scores
      • Coherence
    • First stage of topic modelling
      • DMM model
      • MALLET model
      • Gensim model
      • Summary of the best scoring topic models
      • First Stage Topics keywords
    • Taxonomy
    • Second stage of topic modelling
      • Second Stage topic keywords
    • Summary of the two stages of topic modelling
      • Verification of the best topic model
      • “The Vaccines”
      • Final labels
    • Evaluation
    • Additional visualisation techniques
    • Chapter 5 summary
  6. Classification
    • Chapter overview
    • Classifiers
      • Calibrated Classifier Cross Validation
      • Ensemble
      • Neural network models
      • Transfer learning
      • Evaluation measures
    • Data preparation
    • Classification evaluation
    • Initial experimentation with imbalanced datasets
    • Experimentation with balanced training datasets
    • Phase One classifiers results
    • Phase Two classifiers results
    • Classifier performance over the two training phases
      • Imbalanced Test data with Phase-One models
      • Imbalanced Test data with Phase-Two models
      • Balanced Test data with Phase-One models
      • Balanced Test data with Phase-Two models
      • Phase-Two models vs Phase-One models
    • Baseline rule-based classification technique
    • Chapter 6 summary
  7. Evaluation
    • Chapter overview
    • Evaluating topic model effectiveness
      • Verifying topic models with samples
      • Verifying effectiveness with label distributions
      • Utilising topic model outputs
    • Evaluating classifiers effectiveness
      • Comparative charts
      • Detailed analysis of classifier scores
      • Classifier effectiveness
    • Evaluating effectiveness of the method
    • Word importance analysis
    • Chapter 7 summary
  8. Discussion and Conclusion
    • The Research Questions
      • Aim of the research
      • Research Question 1
      • Research Question 2
      • Research Question 3
    • The research contribution
    • Limitations
    • Future Research

