Report on Medical Data of Thrombosis Diagnosis

Author: Raghav Sharma

Affiliation: Georgetown University

1 Introduction

“A Data Scientist’s responsibility is as much about pertinent skills as it is about how the learner can understand ‘Data’ through proper analysis and formulate insightful questions for the benefit of the world.”

This report follows a data scientist who was tasked with exploring a dataset thoroughly and presenting the findings so that non-technical stakeholders can make sense of a highly technical data set.

2 Chapter 1: Story

Raghav works as a data scientist at a consulting firm. An external client paid the firm a significant amount of money to make sense of a medical data set. The client planned to use the findings to help guide the allocation of a multi-million dollar research grant and wanted the results in two weeks, at which point the contract would terminate and the firm would move on to a new, unrelated project.

His job was to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.

3 Chapter 2: Domain Knowledge

Collagen diseases are auto-immune diseases in which patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies against the lungs will progressively lose respiratory function and may eventually die. The disease mechanisms are only partially known and their classification is still fuzzy. Some patients generate many kinds of antibodies, and their manifestations may include all the characteristics of collagen diseases.

In these diseases, thrombosis is one of the most important and severe complications and one of the major causes of death. Thrombosis is a medical emergency, so it is important both to detect it and to predict the likelihood that it will occur.

However, no immunology experts have yet carried out this kind of database analysis. Domain experts are very interested in discovering regularities in patients’ observations, which could amount to genuinely new findings.

4 Chapter 3: The Data

Data was collected from Georgetown University’s Hospital and contained information about patients who had been admitted to the hospital’s Collagen disease clinic.

The medical data consisted of three CSV files:

1. TSUMOTO_A.CSV: Contains basic information about patients.
2. TSUMOTO_B.CSV: Contains information about the ‘Special Laboratory Examinations’ of patients to detect Thrombosis.
3. TSUMOTO_C.CSV: Contains Laboratory Examinations stored in Hospital Information Systems (stored from 1980 to 1999.3). This dataset has timestamps.
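
As a starting point, the three files can be read into plain Python structures. The sketch below is a minimal illustration using only the standard library. The column names (`ID`, `SEX`, `Birthday`) and the inline sample rows are assumptions made for the example, not the confirmed schema of the TSUMOTO files.

```python
import csv
import io


def read_table(handle):
    """Read an open CSV file into a list of dicts, one per row."""
    rows = []
    for row in csv.DictReader(handle):
        # Strip stray whitespace from headers and values; skip overflow cells,
        # which DictReader stores under the key None.
        rows.append(
            {key.strip(): (value or "").strip()
             for key, value in row.items() if key is not None}
        )
    return rows


# Hypothetical stand-in for TSUMOTO_A.CSV (made-up columns and values).
sample_a = io.StringIO("ID,SEX,Birthday\n1,F,1970-01-15\n2,M,1965-06-30\n")
patients = read_table(sample_a)

# Index patients by ID so rows from the other files can be matched to them.
patients_by_id = {row["ID"]: row for row in patients}
print(len(patients), sorted(patients_by_id))
```

With the real files, the same function could be called on an opened file such as `open("TSUMOTO_A.CSV", newline="")`; the encoding may need to be set explicitly.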

5 Chapter 4: Goals of the project

Raghav aimed to thoroughly understand and analyze Thrombosis by achieving the following:

1. Search for good patterns that detect and predict thrombosis.
2. Search for temporal patterns specific/sensitive to thrombosis.
3. Search for good features that classify collagen diseases correctly.
4. Search for temporal patterns specific/sensitive to each collagen disease.

To achieve these goals, he carefully analyzed the data to identify significant patterns and relationships that could provide insight into the causes and characteristics of Thrombosis. In doing so, he also aimed to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately to improve the diagnosis and treatment of these life-threatening conditions.

6 Chapter 5: Approach

Before doing any kind of analysis, it was important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.

Here are some of his findings on the nature of the data provided:

1. The dataset contains many missing values, which add uncertainty. Because the data records lab examinations, a value may be missing simply because the patient did not take a particular lab test.
2. It is challenging to gauge how the missing values are distributed and how they might affect the data. Replacing these missing values can distort the data and introduce bias, thereby producing unrepresentative results.
3. Some of the columns were not meaningful candidates when deciding on the relevant predictors of thrombosis.
4. The dataset contained temporal data, which provided insight into how thrombosis was diagnosed over time.
5. The data contained few thrombosis cases, which made it difficult to understand thrombosis at a deeper level and to uncover patterns that could serve as predictors of the disease.
6. The dataset was imbalanced, with the large majority of patients being female, which could lead to the inaccurate conclusion that Thrombosis occurs much more frequently in women. A proper analysis by sex could not be conducted because of this.
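
The extent of the missing-value problem described above can be quantified per column before deciding whether to impute or drop anything. The following is a minimal sketch, assuming rows are dicts of strings; the rows, the `missing_tokens` markers, and the choice of columns are made up for illustration.

```python
def missing_fraction(rows, missing_tokens=("", "NA", "NaN")):
    """Return {column: fraction of rows where the value is missing}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip() in missing_tokens
            counts[column] = counts.get(column, 0) + int(is_missing)
    total = len(rows)
    return {column: n / total for column, n in counts.items()} if total else {}


# Made-up lab rows; the values do not come from the real dataset.
lab_rows = [
    {"ID": "1", "ana": "256", "acl_iga": ""},
    {"ID": "2", "ana": "", "acl_iga": ""},
    {"ID": "3", "ana": "16", "acl_iga": "1.2"},
    {"ID": "4", "ana": "4", "acl_iga": ""},
]
fractions = missing_fraction(lab_rows)
print(fractions)
```

Reporting these fractions alongside any result makes it clearer how much of a column a conclusion actually rests on.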

To summarize, the data provided for this analysis was of much lower quality than expected. However, it is important to note that all the data was collected between 1989 and 1996, when data collection methods were less advanced and discrepancies in data were common in the industry. If more recent data is available, repeating the thrombosis analysis on it would be preferable.

7 Chapter 6: Exploratory analysis

Raghav has provided five static visualizations detailing his most important insights.

7.1 Plot 1: Histogram of the temporal data from the first dataset (TSUMOTO_A.CSV)

Figure 1: Age of patients when they visited the hospital for the first time.

From Figure 1, we infer two things:

1. There are more female patients than male patients.

2. The trend is broadly similar for both sexes. The patient count is highest for ages 20-25.

The count also rises through the youngest ages irrespective of sex, which may be due to vaccinations and other medical checks for infants and children. People older than 30 appear to visit the hospital less often for medical checkups.

Thrombosis may appear to predominantly affect women only because the dataset itself contains more female patients. Because this data offers no sound way to determine whether sex matters for thrombosis, this exposes a serious weakness and imbalance in the data.
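
The counts behind a histogram like Figure 1 can be derived from each patient's date of birth and date of first visit. The sketch below assumes both dates are available per patient and uses hypothetical five-year bins labelled by their inclusive bounds; the records are invented for illustration.

```python
from collections import Counter
from datetime import date


def age_in_years(birthday, on_date):
    """Whole years elapsed between birthday and on_date."""
    before_birthday = (on_date.month, on_date.day) < (birthday.month, birthday.day)
    return on_date.year - birthday.year - int(before_birthday)


def age_bin(age, width=5):
    """Label the bin containing age, e.g. 22 -> '20-24' for width 5."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Made-up (sex, birthday, first visit) records.
records = [
    ("F", date(1970, 3, 10), date(1992, 5, 1)),
    ("F", date(1968, 11, 2), date(1991, 1, 15)),
    ("M", date(1950, 7, 20), date(1993, 7, 19)),
    ("F", date(1985, 1, 1), date(1994, 6, 1)),
]

# Count patients per (sex, age bin); these counts are the histogram bars.
counts = Counter(
    (sex, age_bin(age_in_years(born, visit))) for sex, born, visit in records
)
for key in sorted(counts):
    print(key, counts[key])
```

Comparing the per-sex totals from such counts makes the imbalance explicit before any thrombosis rate is interpreted.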

7.2 Plot 2: Bar plot explaining thrombosis diagnosis over time

Figure 2: Thrombosis diagnosis over the years.

Figure 2 highlights that the data may be biased. Because the vast majority of patients are recorded as negative for thrombosis, inferences drawn from the few positive cases may be unreliable. Among the positive cases, levels 2 and 3 are very rare. For instance, if we were to compute the average IGA of the patients with thrombosis level 3 in this dataset, the result could differ substantially from the true population average (over all patients with thrombosis level 3) because the sample is so small.

It is crucial to emphasize this limitation before continuing with any analysis, and this example is only a brief illustration of the problems that could arise.
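
One way to make the small-sample concern concrete is to report the standard error of the mean next to each group average. The sketch below uses invented IGA values for a three-patient group and a twelve-patient group; it only illustrates how the uncertainty grows as the group shrinks and says nothing about the real measurements.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Made-up IGA values: a tiny group (like level 3) and a larger group.
iga_small_group = [300.0, 520.0, 410.0]
iga_large_group = [280.0, 310.0, 350.0, 390.0, 400.0, 410.0,
                   420.0, 450.0, 470.0, 500.0, 520.0, 540.0]

mean_small, se_small = mean_with_standard_error(iga_small_group)
mean_large, se_large = mean_with_standard_error(iga_large_group)
print(f"n=3:  mean={mean_small:.1f}, SE={se_small:.1f}")
print(f"n=12: mean={mean_large:.1f}, SE={se_large:.1f}")
```

A group mean with a large standard error should be read as a rough indication rather than an estimate of the population value.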

7.3 Plot 3: Histogram of temporal data representing age distribution of patients

Figure 3: Age of patients at the time they got tested.

Figure 3 again shows the uncertainty in the data. The kernel density curves for the different levels of thrombosis follow a similar pattern, except that far more patients tested negative than positive. Notably, the most severe thrombosis cases fall in the 30-40 age group, while for the 15-30 age group the likelihood of detecting thrombosis is low.
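
For readers unfamiliar with kernel density curves, the sketch below shows a basic Gaussian kernel density estimate written with the standard library. It is not necessarily how Figure 3 was produced: the fixed `bandwidth` of 5 years and the ages are assumptions chosen for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of samples at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up ages at testing for a small group of positive patients.
positive_ages = [29, 33, 36, 38, 41]
positive_density = gaussian_kde(positive_ages, bandwidth=5.0)

for age in (20, 35, 60):
    print(age, round(positive_density(age), 4))
```

Each curve integrates to one regardless of group size, so curves for groups of very different sizes are only comparable in shape; multiplying a density by its group size gives a count-scaled curve instead.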

7.4 Plot 4: Radar chart displaying extent of possible predictors

Figure 4: Effect of possible predictors on thrombosis in patients.

Figure 4 suggests the following about how the lab results differ between Thrombosis groups:

1. Among the special lab tests conducted at the Laboratory of Collagen Diseases, Figure 4 strongly suggests that ANA is a good indicator of whether a person has Thrombosis. The plot shows that the higher a patient’s degree of Thrombosis, the higher that patient’s ANA values are on average.
2. Thrombosis level 1 patients seem to have similar acl_iga and acl_igm values.
3. If results from more tests had been available in a more complete dataset, additional relevant features might have been found.
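
The comparison behind a chart like Figure 4 comes down to averaging each lab value within each Thrombosis group. Below is a minimal sketch of that grouping step; the field names `thrombosis` and `ana` and all values are made up for the example, and missing measurements are skipped rather than imputed.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each group, ignoring missing values."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_key)
        if value is None:
            continue  # skip missing measurements rather than imputing them
        groups[row[group_key]].append(float(value))
    return {group: mean(values) for group, values in sorted(groups.items())}


# Made-up rows; the numbers are not taken from the real dataset.
exam_rows = [
    {"thrombosis": 0, "ana": 4},
    {"thrombosis": 0, "ana": 16},
    {"thrombosis": 0, "ana": None},
    {"thrombosis": 1, "ana": 64},
    {"thrombosis": 1, "ana": 256},
    {"thrombosis": 2, "ana": 256},
]
ana_means = mean_by_group(exam_rows, "thrombosis", "ana")
print(ana_means)
```

Because the higher levels contain very few patients, each group mean is best reported together with its group size.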

7.5 Plot 5: Word cloud of symptoms

Figure 5: Word cloud of symptoms of thrombosis

Figure 5 shows that each level of Thrombosis is accompanied by a different set of prevalent symptoms. Specifically:

1. Brain infarction is the most dominant symptom for Thrombosis Level 1 patients.
2. CNS lupus is the most dominant symptom for Thrombosis Level 2 patients.
3. Thrombocytopenia is the most dominant symptom for Thrombosis Level 3 patients.

The word clouds also highlight a major issue mentioned earlier, the lack of Thrombosis patients, which is why fewer words appear as the level of Thrombosis becomes more severe.
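
A word cloud is driven by term frequencies, so the dominant symptom per level can also be checked directly from counts. The sketch below assumes each record pairs a Thrombosis level with a comma-separated symptom string; that format and the records themselves are assumptions for illustration.

```python
from collections import Counter, defaultdict


def symptom_counts(records):
    """Return {level: Counter of symptom phrases} from (level, text) pairs."""
    counts = defaultdict(Counter)
    for level, symptoms in records:
        for symptom in symptoms.split(","):
            symptom = symptom.strip().lower()
            if symptom:  # ignore empty entries from missing symptom text
                counts[level][symptom] += 1
    return counts


# Made-up records; they do not reproduce the real symptom column.
symptom_records = [
    (1, "brain infarction"),
    (1, "Brain infarction, CNS lupus"),
    (1, ""),
    (2, "CNS lupus"),
    (3, "thrombocytopenia"),
]
by_level = symptom_counts(symptom_records)
for level in sorted(by_level):
    print(level, by_level[level].most_common(1))
```

With only a handful of patients at a level, the most common symptom can change with a single additional record, which is the same sparsity the word clouds reveal.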

8 Chapter 7: Conclusion

In general, this project has drawn attention to two key points. First, standard laboratory tests performed daily at hospitals may be able to anticipate thrombosis. If this claim is investigated further and found to be accurate, it could help prevent thousands of deaths from this deadly condition. Second, this study has added to our understanding of thrombosis despite the limited and poor-quality data available, and there is clearly room for more research.

While Raghav’s research does not offer a conclusive method for predicting or preventing thrombosis, it does serve as a first step in securing adequate funding, gathering more data of higher quality, and assembling a larger team of scientists to further investigate the intriguing results of this study.

Contact: Raghav Sharma, rs2190@georgetown.edu, https://github.com/anly503/hw2-spring-2023-raghavSharmaCode