Report on Medical Data of Thrombosis Diagnosis
1 Introduction
“A Data Scientist’s responsibility is as much about pertinent skills as it is about how the learner can understand ‘Data’ through proper analysis and formulate insightful questions for the benefit of the world.”
This report walks through the journey of a data scientist who was tasked with exploring a data set thoroughly and presenting the relevant information so that non-technical stakeholders could make sense of highly technical data.
2 Chapter 1: Story
Raghav was working as a data scientist at a consulting firm. An external client had paid a significant amount of money for his firm to make sense of a medical data set. The client planned to use the findings to help guide the allocation of a multi-million dollar research grant and wanted the results in two weeks, after which the contract would terminate and his firm would move on to a new, unrelated project.
His job was to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.
3 Chapter 2: Domain Knowledge
Collagen diseases are autoimmune diseases in which patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies against the lungs will lose respiratory function over a chronic course and may eventually die. The disease mechanisms are only partially known and their classification is still fuzzy. Some patients generate many kinds of antibodies, and their manifestations may include all the characteristics of collagen diseases.
In these diseases, thrombosis is one of the most important and severe complications and one of the major causes of death. Thrombosis is a medical emergency, so it is important to detect it and to predict how likely it is to occur.
However, no immunology expert has yet analyzed a database of this kind. Domain experts are very interested in discovering regularities behind patients’ observations, which could amount to a genuinely new discovery.
4 Chapter 3: The Data
Data was collected from Georgetown University’s Hospital and contained information about patients who had been admitted to the hospital’s Collagen disease clinic.
Three CSV files made up the medical data:
TSUMOTO_A.CSV: Contains basic information about patients
TSUMOTO_B.CSV: Contains information about the ‘Special Laboratory Examinations’ of patients to detect Thrombosis.
TSUMOTO_C.CSV: Contains laboratory examinations stored in the Hospital Information System (stored from 1980 to 1999.3). This dataset is time-stamped.
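As a rough sketch of how the patient file and the special examination file could be linked, the snippet below joins two tiny in-memory stand-ins. The column names (`ID`, `SEX`, `ANA`, `Thrombosis`), the sample rows, and the idea of a shared patient identifier are assumptions made for illustration; they are not taken from the report.

```python
import csv
import io

# Hypothetical miniature stand-ins for TSUMOTO_A.CSV and TSUMOTO_B.CSV.
# Column names and values are assumptions for illustration only.
PATIENTS_CSV = """ID,SEX
1,F
2,M
3,F
"""
EXAMS_CSV = """ID,ANA,Thrombosis
1,256,1
3,16,0
4,64,0
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def left_join(left, right, key):
    """Attach the matching right-hand row to each left-hand row.

    Assumes at most one right-hand row per key; unmatched rows get None.
    """
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        merged = dict(row)
        merged["exam"] = index.get(row[key])
        joined.append(merged)
    return joined

patients = read_rows(PATIENTS_CSV)
exams = read_rows(EXAMS_CSV)
joined = left_join(patients, exams, "ID")

# Patients without a special examination are worth counting before any analysis.
unmatched = [row["ID"] for row in joined if row["exam"] is None]
print(unmatched)
```

A left join keeps every patient, so patients who never had a special examination stay visible instead of silently dropping out.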
5 Chapter 4: Goals of the project
Raghav aimed to thoroughly understand and analyze Thrombosis by achieving the following:
- Search for good patterns which detect and predict thrombosis.
- Search for temporal patterns specific/sensitive to thrombosis.
- Search for good features which classify collagen diseases correctly.
- Search for temporal patterns specific/sensitive to each collagen disease.
To achieve these goals, he carefully analyzed the data to identify any significant patterns and relationships that could provide insights into the causes and characteristics of Thrombosis. By doing so, he also hoped to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately to improve the diagnosis and treatment of these life-threatening conditions.
6 Chapter 5: Approach
Before doing any kind of analysis, it was important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.
Here are some of his findings on the nature of the data provided:
- The large number of missing values makes the dataset uncertain. Since this dataset contains information about lab examinations, it is possible that a patient simply did not appear for a particular lab test, which left the corresponding record empty.
- It is challenging to gauge how the missing values are distributed and how they might affect the data. Replacing these missing values can distort the data and introduce bias, thereby producing unrepresentative results.
- Some of the columns were of no clear use when deciding on the relevant predictors of thrombosis.
- The dataset contained temporal data, which provided insights into how thrombosis was diagnosed over time.
- The data contained few thrombosis cases, which made it difficult to understand thrombosis at a deeper level and to uncover patterns that could serve as predictors of this disease.
- The dataset was imbalanced, with the large majority of patients being female, which could lead to the inaccurate conclusion that Thrombosis occurs much more frequently in females. A proper analysis of gender could not be conducted because of this.
To summarize, the data provided for this analysis was much lower in quality than expected. However, it is important to mention that all the data was collected between the years 1989 and 1996, when data collection methods were less advanced and discrepancies in data were common in the industry. A better approach would be to conduct an analysis of thrombosis using more recent data, if available.
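One way to put numbers on the missing-value concern is to compute the share of missing entries per column before deciding whether to impute or drop anything. The sketch below is a minimal illustration on made-up rows; the column names and the set of markers treated as missing (`MISSING`) are assumptions, not details from the report.

```python
import csv
import io

# Hypothetical sample; real column names and missing-value markers may differ.
SAMPLE = """ID,ANA,KCT,Thrombosis
1,256,,1
2,,,0
3,16,-,0
4,64,+,0
"""

MISSING = {"", "NA", "?"}  # assumed markers for a missing entry

def missing_share(rows):
    """Return the fraction of missing entries for every column."""
    columns = rows[0].keys()
    return {
        col: sum(row[col].strip() in MISSING for row in rows) / len(rows)
        for col in columns
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = missing_share(rows)

# List the columns from most to least incomplete.
for col, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{col}: {share:.0%} missing")
```

Columns with a very high missing share are candidates for exclusion rather than imputation, which avoids the distortion described above.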
7 Chapter 6: Exploratory analysis
Raghav has provided five static visualizations detailing his most important insights.
7.1 Plot 1: Histogram of the temporal data from the first dataset (TSUMOTO_A.CSV)
From Figure 1, we infer two things:
1. There are more female patients than male patients.
2. The trend for both categories is similar. The highest patient count occurs in the 20-25 age group.
We also see that at the lower end of the age range the count increases irrespective of sex. This may be due to vaccinations and other medical checks of infants and children. People over 30 tend to visit the hospital less often for medical checkups.
From this plot alone, thrombosis may appear to be a female-predominant condition, but only because the dataset itself contains more female individuals. Since the data offers no sound way to determine whether gender matters for thrombosis, this exposes a serious weakness and imbalance in the data.
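The difference between raw counts and rates can be made concrete with a small worked example. The counts below are invented purely to illustrate the point and do not come from the dataset: the female group has four times as many positive cases, yet the rate of thrombosis is identical in both groups.

```python
from collections import Counter

# Hypothetical (sex, thrombosis_positive) records; not taken from the dataset.
records = (
    [("F", True)] * 8 + [("F", False)] * 72
    + [("M", True)] * 2 + [("M", False)] * 18
)

totals = Counter(sex for sex, _ in records)
positives = Counter(sex for sex, positive in records if positive)

# Rate = positive cases divided by group size, which removes the size imbalance.
rates = {sex: positives[sex] / totals[sex] for sex in totals}

print(dict(positives))
print(rates)
```

Comparing rates rather than counts is the minimum needed before drawing any conclusion about sex, and even then a small male group leaves the male rate highly uncertain.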
7.2 Plot 2: Bar plot explaining thrombosis diagnosis over time
Figure 2 highlights that the data may be biased. Because the records mark the majority of patients as negative for thrombosis, the data might lead to incorrect inferences. Among those who tested positive, levels 2 and 3 are very rare. For instance, if we were to study the average IGA of patients with thrombosis level 3, the result might differ substantially from the population average (over all patients with thrombosis level 3) because so few such patients are in the sample.
Before continuing with any analysis, it is crucial to emphasize this limitation because this is only a brief illustration of the potential problems that could arise.
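The small-sample problem can be illustrated with a short simulation. The sketch below draws repeated samples from a made-up population of lab values and measures how much the sample mean varies for a small group versus a large one. The population parameters and group sizes are assumptions chosen only for illustration; in theory the spread shrinks roughly with the square root of the sample size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of lab values; the mean and spread are made up.
population = [random.gauss(300, 100) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=2_000):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_sample_means(5)    # stands in for a rare thrombosis level
large = spread_of_sample_means(200)  # stands in for the large negative group

print(f"n=5: {small:.1f}   n=200: {large:.1f}")
```

An average computed from a handful of level 3 patients can therefore land far from the true group average purely by chance.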
7.3 Plot 3: Histogram of temporal data representing age distribution of patients
Figure 3 again shows the uncertainty in the data. The kernel density for the different levels of thrombosis follows a similar pattern, except that far more patients tested negative than positive. Notably, the most severe thrombosis appears in the 30-40 age group, while in the 15-30 age group the likelihood that thrombosis is detected is low.
7.4 Plot 4: Radar chart displaying extent of possible predictors
Figure 4 shows that the only lab results that show significant differences between different Thrombosis groups are:
In terms of the special lab tests conducted at the Laboratory of Collagen Diseases, Figure 4 strongly suggests that ANA is a good indicator of whether a person has Thrombosis or not. The plot shows that the higher a patient’s degree of Thrombosis, the higher that patient’s ANA values are (on average).
Thrombosis level 1 seems to have similar acl_iga and acl_igm values.
If more tests were provided in a proper dataset, maybe additional relevant features could have been found.
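A radar chart of this kind is typically built from per-group averages rescaled to a common range so that tests with different units can share the same axes. The sketch below shows one such preparation step on invented values. The numbers, and the choice of min-max scaling, are assumptions; the report does not state how the axes in Figure 4 were scaled.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (thrombosis_level, lab values) rows; numbers are illustrative only.
rows = [
    (0, {"ana": 16, "acl_iga": 4.0}),
    (0, {"ana": 64, "acl_iga": 6.0}),
    (1, {"ana": 256, "acl_iga": 5.0}),
    (1, {"ana": 64, "acl_iga": 9.0}),
    (2, {"ana": 1024, "acl_iga": 12.0}),
]

def group_means(rows):
    """Average every lab test within each thrombosis level."""
    grouped = defaultdict(lambda: defaultdict(list))
    for level, labs in rows:
        for name, value in labs.items():
            grouped[level][name].append(value)
    return {
        level: {name: fmean(values) for name, values in labs.items()}
        for level, labs in grouped.items()
    }

def scale_to_unit(means):
    """Min-max scale each test across groups so every radar axis spans 0 to 1."""
    names = list(next(iter(means.values())).keys())
    scaled = {level: {} for level in means}
    for name in names:
        values = [means[level][name] for level in means]
        low, high = min(values), max(values)
        for level in means:
            span = high - low
            scaled[level][name] = (means[level][name] - low) / span if span else 0.0
    return scaled

means = group_means(rows)
scaled = scale_to_unit(means)
print(scaled)
```

Min-max scaling exaggerates differences when groups are tiny, so the group sizes should be reported next to any such chart.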
7.5 Plot 5: Word cloud of symptoms
Figure 5 shows that different levels of Thrombosis are accompanied by different sets of symptoms that are prevalent for each group. Specifically:
- Brain infarction is the most dominant symptom for Thrombosis Level 1 patients.
- CNS lupus is the most dominant symptom for Thrombosis Level 2 patients.
- Thrombocytopenia is the most dominant symptom for Thrombosis Level 3 patients.
The word clouds also highlight a major issue in the data mentioned previously, the lack of Thrombosis patients, which is why there are fewer words as the level of Thrombosis becomes more severe.
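The frequencies behind a word cloud can be tabulated directly, which makes the shrinking group sizes explicit. The sketch below counts symptoms per thrombosis level on invented records. The record layout and the comma-separated symptom format are assumptions; only the symptom names come from the text above.

```python
from collections import Counter, defaultdict

# Hypothetical (thrombosis_level, symptoms) pairs; the records are illustrative.
exams = [
    (1, "brain infarction"),
    (1, "brain infarction, CNS lupus"),
    (2, "CNS lupus"),
    (2, "CNS lupus, thrombocytopenia"),
    (3, "thrombocytopenia"),
]

def symptom_counts(exams):
    """Count how often each symptom occurs within each thrombosis level."""
    counts = defaultdict(Counter)
    for level, text in exams:
        for symptom in text.split(","):
            symptom = symptom.strip().lower()
            if symptom:
                counts[level][symptom] += 1
    return counts

counts = symptom_counts(exams)
for level in sorted(counts):
    total = sum(counts[level].values())
    print(level, counts[level].most_common(1), f"total mentions: {total}")
```

Printing the total number of mentions next to the top symptom shows at a glance how little evidence supports the most severe levels.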
8 Chapter 7: Conclusion
In general, this research project has drawn attention to two key points. First, standard laboratory tests performed daily at hospitals may be able to anticipate thrombosis. If this claim is investigated further and found to be accurate, it may help prevent many deaths from this dangerous condition. Second, this study has advanced our understanding of thrombosis despite the limited and poor-quality data supplied, and there is clearly room for more research.
While Raghav’s research does not offer a conclusive method for predicting or preventing thrombosis, it does serve as a first step in securing adequate funding, gathering more data of higher quality, and assembling a larger team of scientists to further investigate the intriguing results of this study.