Imagine yourself as a data scientist working at a consulting firm. An external entity (client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm will move onto a new unrelated project.
Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.
1.2 Data
The data were collected from Georgetown University's hospital and contain information about patients admitted to the hospital's Collagen disease clinic.
Collagen diseases are auto-immune diseases in which patients generate antibodies that attack their own bodies. For example, if a patient generates antibodies against the lungs, he/she will gradually lose respiratory function in a chronic course and eventually lose their life. Thrombosis is the most severe form of Collagen diseases.
There are 3 CSV files that make up the medical data. These files are:
TSUMOTO_A.CSV: Contains basic information about patients
TSUMOTO_B.CSV: Contains information about the ‘Special Laboratory Examinations’ of patients to detect Thrombosis.
TSUMOTO_C.CSV: Contains Laboratory Examinations stored in Hospital Information Systems (stored from 1980 to 1999.3). This dataset has time stamps.
1.3 Goals of the project
This project aims to thoroughly understand and analyze Thrombosis, a Collagen disease, by achieving the following:
Search for good patterns which detect and predict thrombosis.
Search for temporal patterns specific/sensitive to thrombosis.
Search for good features which classify collagen diseases correctly.
Search for temporal patterns specific/sensitive to each collagen disease.
To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.
2 Initial Questions
We would like to investigate the following questions prior to our analyses:
How does thrombosis evolve over time?
In which age group is Thrombosis most prevalent?
Does it depend on the sex of the patient?
Is this dataset good enough to understand and visualize the factors causing, and the effects of, Thrombosis?
3 Data Ingestion, Cleaning and Munging
Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.
3.1 Ingestion
3.1.0.1 Importing the libraries
Code
# Importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import klib
import missingno as msno
from upsetplot import UpSet
from wordcloud import WordCloud, STOPWORDS
import random
import sys
import warnings
warnings.filterwarnings("ignore")
3.1.0.2 Importing the datasets
Code
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
TSUMOTO_A.CSV
Code
A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding='unicode_escape')
A.head()
| | ID | SEX | Birthday | Description | First Date | Admission | Diagnosis |
|---|---|---|---|---|---|---|---|
| 0 | 2110 | F | 1934/2/13 | 94.02.14 | 93.02.10 | + | RA susp. |
| 1 | 11408 | F | 1937/5/2 | 96.12.01 | 73.01.01 | + | PSS |
| 2 | 12052 | F | 1956/4/14 | 91.08.13 | NaN | + | SLE |
| 3 | 14872 | F | 1953/9/21 | 97.08.13 | NaN | + | MCTD |
| 4 | 27654 | F | 1936/3/25 | NaN | 92.02.03 | + | RA, SLE susp |
Code
"Shape of A: "+str(A.shape)
'Shape of A: (1240, 7)'
TSUMOTO_B.CSV
Code
B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding='unicode_escape')
B.head()
| | ID | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Diagnosis | KCT | RVVT | LAC | Symptoms | Thrombosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14872.0 | 1997/5/27 | 1.3 | 1.6 | 256 | P | 0.0 | MCTD, AMI | NaN | NaN | - | AMI | 1 |
| 1 | 48473.0 | 1992/12/21 | 4.3 | 4.6 | 256 | P,S | 3.3 | SLE | - | - | - | NaN | 0 |
| 2 | 102490.0 | 1995/4/20 | 2.3 | 2.5 | 0 | NaN | 3.5 | PSS | NaN | NaN | NaN | NaN | 0 |
| 3 | 108788.0 | 1997/5/6 | 0.0 | 0.0 | 16 | S | 0.0 | NaN | NaN | NaN | - | NaN | 0 |
| 4 | 122405.0 | 1998/4/2 | 0.0 | 4.0 | 4 | P | 0.0 | SLE, SjS, vertigo | NaN | NaN | NaN | NaN | 0 |
Code
"Shape of B: "+str(B.shape)
'Shape of B: (806, 13)'
TSUMOTO_C.CSV
Note: A couple of lines in this file caused parsing errors when reading the dataset. Passing the error_bad_lines=False argument resolved that. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.
Code
# A couple of bad (unreadable) lines in the 3rd dataset were dropped.
# The dropped lines account for 30 (out of 50k+), which makes this drop insignificant
C = pd.read_csv("data/TSUMOTO_C.CSV", encoding='unicode_escape', error_bad_lines=False)
C.head()
| | ID | Date | GOT | GPT | LDH | ALP | TP | ALB | UA | UN | ... | C3 | C4 | RNP | SM | SC170 | SSA | SSB | CENTROMEA | DNA | DNA-II |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | 860419 | 24 | 12 | 152 | 63 | 7.5 | 4.5 | 3.4 | 16.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2110 | 860430 | 25 | 12 | 162 | 76 | 7.9 | 4.6 | 4.7 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2110 | 860502 | 22 | 8 | 144 | 68 | 7 | 4.2 | 5 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2110 | 860506 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2110 | 860507 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 44 columns
Code
"Shape of C: "+str(C.shape)
'Shape of C: (57510, 44)'
C is by far the largest dataframe that has information about general laboratory examinations.
3.2 Cleaning and Munging
3.2.0.1 Analyzing Missing values
Searching for missing data is the first and most important stage in data cleaning. Checking for missing values for each column (per data set) would give a solid idea of which columns are necessary and which need to be adjusted or omitted as this project entails combining the datasets.
It is very important to note that this is medical data: if a column has a missing value, it means that the measurement wasn't recorded. We need to analyze carefully and then take appropriate steps to handle missing data.
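As a first pass, the per-column missing counts can be pulled straight from pandas (a minimal sketch; the fuller statistics function defined below is what we actually use):
Code
# quick first look: missing-value counts per column, largest first
A.isnull().sum().sort_values(ascending=False)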
We will use ‘klib’, ‘missingno’, and ‘UpSet’ libraries to visualize the missing data for all the datasets.
The klib library helps us visualize missing-data trends in a dataset. Using its missingval_plot, we can extract the necessary information about the missing data in every column.
missingno library will be helpful in plotting the correlation amongst the columns having missing values. This way we will be able to make a decision if a certain column needs to be dropped or not.
UpSet library looks at the multivariate relationships among categorical variables and is applied to missing data.
Let's create a function to analyze the missing data statistics of a dataset.
Code
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define columns of the data-frame
    df_stats = pd.DataFrame(columns=['column', 'unique_val', 'num_unique_val',
                                     'num_unique_val_nona', 'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    for c in df.columns:
        # column name
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum() / len(df)).round(3) * 100
        df_stats = df_stats.append(tmp)
    return df_stats
Let's take a look at the missing value statistics of A
Code
df_A_stats = missing_val_stats(A)
df_A_stats
| | column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|---|
| 0 | ID | [2110, 11408, 12052, 14872, 27654, 30609, 4300... | 1238 | 1238 | 0 | 0.0 |
| 0 | SEX | [F, M, nan] | 3 | 2 | 13 | 1.0 |
| 0 | Birthday | [1934/2/13, 1937/5/2, 1956/4/14, 1953/9/21, 19... | 1195 | 1194 | 1 | 0.1 |
| 0 | Description | [94.02.14, 96.12.01, 91.08.13, 97.08.13, nan, ... | 100 | 99 | 217 | 17.5 |
| 0 | First Date | [93.02.10, 73.01.01, nan, 92.02.03, 94.03.08, ... | 821 | 820 | 248 | 20.0 |
| 0 | Admission | [+, -, nan, üH, îýô€, +üiö‡òåë®üj, +(ë¦ÄuÆÈòaë... | 9 | 8 | 31 | 2.5 |
| 0 | Diagnosis | [RA susp., PSS, SLE, MCTD, RA, SLE susp, SLE, ... | 222 | 222 | 0 | 0.0 |
Visualizing missing values in A
Using klib
Code
"Missing Value Plot (TSUMOTO_A.CSV)"a_klib = klib.missingval_plot(A, figsize=(10,15))
A few columns have missing values, with 'First Date' reaching the maximum at around 20%. Missing values in these columns can be dealt with at later stages, when/if the columns are used.
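Using missingno
Code
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))
Nothing too significant for A.
Using UpSet
Code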
"UpSet plot for TSUMOTO_A.CSV"# check for nan valuesa_upset = pd.isna( A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission','Diagnosis']])# using groupby for the missing valuesa_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission','Diagnosis'], as_index=True).size()# plot the groupby objectUpSet(a_upset).plot(fig= plt.figure(figsize=(8, 5)));plt.show()
As stated earlier, there isn't much correlation amongst the missing values in A. The UpSet plot supports this: there are few cases where two or more values are missing together.
Let's take a look at the missing value statistics of B
Code
df_B_stats = missing_val_stats(B)
df_B_stats
| | column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|---|
| 0 | ID | [14872.0, 48473.0, 102490.0, 108788.0, 122405.... | 768 | 767 | 36 | 4.5 |
| 0 | Examination Date | [1997/5/27, 1992/12/21, 1995/4/20, 1997/5/6, 1... | 476 | 475 | 10 | 1.2 |
| 0 | aCL IgG | [1.3, 4.3, 2.3, 0.0, 1.0, 6.1, 3.7, 1.7, 5.1, ... | 115 | 115 | 0 | 0.0 |
| 0 | aCL IgM | [1.6, 4.6, 2.5, 0.0, 4.0, 2.4, 9.5, 3.0, 0.8, ... | 140 | 140 | 0 | 0.0 |
| 0 | ANA | [256, 0, 16, 4, 1024, 4096, 64, nan, 4094, >4096] | 10 | 9 | 38 | 4.7 |
| 0 | ANA Pattern | [P, P,S, nan, S, D,P,S, P.S, S,P, S,D, D,P, P,... | 16 | 15 | 261 | 32.4 |
| 0 | aCL IgA | [0.0, 3.3, 3.5, 8.7, 12.3, 22.5, 3.1, 9.1, 5.1... | 163 | 163 | 0 | 0.0 |
| 0 | Diagnosis | [MCTD, AMI, SLE, PSS, nan, SLE, SjS, vertigo, ... | 182 | 181 | 331 | 41.1 |
| 0 | KCT | [nan, -, +] | 3 | 2 | 660 | 81.9 |
| 0 | RVVT | [nan, -, +] | 3 | 2 | 660 | 81.9 |
| 0 | LAC | [-, nan, +] | 3 | 2 | 584 | 72.5 |
| 0 | Symptoms | [AMI, nan, CNS lupus, Apo, AP, thrombocytepeni... | 40 | 39 | 726 | 90.1 |
| 0 | Thrombosis | [1, 0, 2, 3] | 4 | 4 | 0 | 0.0 |
Visualizing missing values in B
Using klib
Code
"Missing Value Plot (TSUMOTO_B.CSV)"b_klib = klib.missingval_plot(B, figsize=(10,15))
This dataset is more concerning: it has some columns that are missing more than 70% of their data. 'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge in terms of replacing missing values.
If we take a look at the 'Symptoms' column, we can see that it has the most missing values (726). A missing value here can simply mean that the patient had no symptoms, so we should change these to 'None' for our analysis later (done below in the missing-data handling step).
It is challenging to gauge how the missing values are distributed and how they might affect the data. Imputing them would distort and bias the data, thereby producing unrepresentative results.
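Using missingno
Code
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8, 5))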
Looking at the plot above, we see that the missing values in 'KCT', 'RVVT', and 'LAC' are correlated. Two of these three columns can be dropped, considering they represent the same thing (a measure of the degree of coagulation); this is done below.
Using UpSet
Code
"UpSet plot for TSUMOTO_B.CSV"# check for missing valuesb_upset = pd.isna( B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern','aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']])# using groupby for the missing valuesb_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern','aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()# plot the groupby objectUpSet(b_upset).plot(fig= plt.figure(figsize=(8, 5)));plt.show()
We see the correlation quite clearly in the plot. The bar is highest when 'KCT', 'RVVT', 'LAC', and 'Symptoms' are missing together; a similar pattern can be seen for other columns.
Let's take a look at the missing value statistics of C
Code
df_C_stats = missing_val_stats(C)
df_C_stats
| | column | unique_val | num_unique_val | num_unique_val_nona | num_miss | pct_miss |
|---|---|---|---|---|---|---|
| 0 | ID | [2110, 11408, 12052, 14872, 27654, 30609, 4300... | 1236 | 1236 | 0 | 0.0 |
| 0 | Date | [860419, 860430, 860502, 860506, 860507, 86050... | 4960 | 4960 | 0 | 0.0 |
| 0 | GOT | [24, 25, 22, nan, 28, 21, 23, 38, 27, 32, 20, ... | 424 | 423 | 11341 | 19.7 |
| 0 | GPT | [12, 8, nan, 13, 9, 10, 22, 18, 17, 15, 14, 46... | 770 | 769 | 11356 | 19.7 |
| 0 | LDH | [152, 162, 144, nan, 156, 167, 185, 134, 165, ... | 1500 | 1499 | 11202 | 19.5 |
| 0 | ALP | [63, 76, 68, nan, 66, 64, 60, 56, 55, 75, 89, ... | 1335 | 1334 | 11989 | 20.8 |
| 0 | TP | [7.5, 7.9, 7, nan, 7.6, 8, 7.2, 7.1, 8.2, 7.7,... | 95 | 94 | 11874 | 20.6 |
| 0 | ALB | [4.5, 4.6, 4.2, nan, 4.8, 4.4, 4.7, 4, 4.3, 4.... | 95 | 94 | 12189 | 21.2 |
| 0 | UA | [3.4, 4.7, 5, nan, 4.4, 4.5, 4, 3.7, 3.6, 3.2,... | 275 | 274 | 12202 | 21.2 |
| 0 | UN | [16.0, 18.0, nan, 15.0, 14.0, 13.0, 12.0, 19.0... | 346 | 345 | 11387 | 19.8 |
| 0 | CRE | [0.6, nan, 0.7, 0.8, 0.66, 0.9, 1.2, 1, 1.1, 1... | 601 | 600 | 11291 | 19.6 |
| 0 | T-BIL | [0.4, nan, 0.5, 0.8, 0.6, 0.7, 0.2, 0.3, 0.9, ... | 340 | 339 | 18733 | 32.6 |
| 0 | T-CHO | [192, 187, 191, nan, 214, 231, 212, 204, 208, ... | 848 | 847 | 15090 | 26.2 |
| 0 | TG | [nan, 76, 65, 67, 113, 74, 63, 169, 155, 119, ... | 1008 | 1007 | 28252 | 49.1 |
| 0 | CPK | [nan, 31, 23, 25, 26, 28, 16, 37, 22, 18, 21, ... | 920 | 919 | 36946 | 64.2 |
| 0 | GLU | [nan, 114.0, 155.0, 88.0, 80.0, 84.0, 103.0, 2... | 514 | 513 | 49078 | 85.3 |
| 0 | WBC | [5.9, 6.9, nan, 7.2, 5, 4.9, 7.4, 6.5, 6.2, 7.... | 547 | 546 | 7905 | 13.7 |
| 0 | RBC | [4.69, 4.73, nan, 4.93, 4.62, 4.49, 4.71, 4.38... | 856 | 855 | 7905 | 13.7 |
| 0 | HGB | [9, 9.3, nan, 9.5, 8.9, 8.5, 9.2, 9.4, 9.1, 10... | 285 | 284 | 7904 | 13.7 |
| 0 | HCT | [31.6, 31.8, nan, 33, 30.9, 30, 31.7, 29.3, 31... | 766 | 765 | 7904 | 13.7 |
| 0 | PLT | [380, 323, nan, 367, 372, 313, 299, 312, 346, ... | 913 | 912 | 9605 | 16.7 |
| 0 | PT | [nan, 11.9, 12.4, 12.0, 11.4, 12.1, 11.5, 11.3... | 199 | 198 | 53507 | 93.0 |
| 0 | APTT | [nan, 68.0, 67.0, 71.0, 65.0, 57.0, 56.0, 80.0... | 129 | 128 | 56618 | 98.4 |
| 0 | FG | [nan, 27.8, 27.0, 31.8, 31.9, 43.2, 47.5, 31.0... | 876 | 875 | 54127 | 94.1 |
| 0 | PIC | [nan, 290.0, 315.0, 105.0, 175.0, 285.0, 530.0... | 350 | 349 | 56711 | 98.6 |
| 0 | TAT | [nan, 86, 72, 63, 142, 138, 92, 107, 111, 110,... | 231 | 230 | 56147 | 97.6 |
| 0 | TAT2 | [nan, 73, 96, 92, 120, 124, 112, 119, 128, 113... | 164 | 163 | 56518 | 98.3 |
| 0 | U-PRO | [0, nan, -, TR, 100, 30, 1, 2, 300, >=300, 3, ... | 21 | 20 | 18402 | 32.0 |
| 0 | IGG | [nan, 1265.0, 1485.0, 1530.0, 1149.0, 1240.0, ... | 4815 | 4814 | 45181 | 78.6 |
| 0 | IGA | [nan, 284, 328, 312, 327, 303, 270, 494, 390, ... | 969 | 968 | 45184 | 78.6 |
| 0 | IGM | [nan, 124, 140, 132, 39, 109, 115, 190, 160, 1... | 763 | 762 | 45189 | 78.6 |
| 0 | CRP | [nan, <0.002, -, <0.2, +, 2+, 3+, 6+, 4+, 0.3,... | 441 | 440 | 10409 | 18.1 |
| 0 | RA | [nan, -, +, 2+, +-, --, 22, 7-] | 8 | 7 | 48950 | 85.1 |
| 0 | RF | [nan, <40, 246, 286, 249, 162, 120, 104, 73.5,... | 2394 | 2393 | 44943 | 78.1 |
| 0 | C3 | [nan, 30, 26, 24, 45, 49, 48, 65, 56, 58, 64, ... | 373 | 372 | 30647 | 53.3 |
| 0 | C4 | [nan, 11, 8, 7, 15, 14, 13, 20, 19, 18, 16, 21... | 107 | 106 | 30646 | 53.3 |
| 0 | RNP | [nan, 0, 4, 64, 1, negative, 256, 16] | 8 | 7 | 56768 | 98.7 |
| 0 | SM | [nan, 0, negative, 2, 1, 8, 32] | 7 | 6 | 56855 | 98.9 |
| 0 | SC170 | [nan, 16, 1, 0, negative, 4] | 6 | 5 | 57372 | 99.8 |
| 0 | SSA | [nan, 4, 0, negative, 64, 16, 1, 256] | 8 | 7 | 56935 | 99.0 |
| 0 | SSB | [nan, 0, negative, 1, 32, 2, 8, 128] | 8 | 7 | 56966 | 99.1 |
| 0 | CENTROMEA | [nan, 0, negative, 320, 160, 0.0, 1280.0, 1280... | 11 | 10 | 57443 | 99.9 |
| 0 | DNA | [nan, 63.1, 59.9, 57.1, 66.0, 41.9, 18.0, 15.8... | 330 | 329 | 57078 | 99.2 |
| 0 | DNA-II | [nan] | 1 | 0 | 57510 | 100.0 |
Visualizing missing values in C
Using klib
Code
"Missing Value Plot (TSUMOTO_C.CSV)"c_klib = klib.missingval_plot(C, figsize=(10,15))
Our third dataset turns out to be the most problematic, with over half (19 out of 44) of the columns missing more than 60% of their values, including 15 columns missing more than 80%. The fundamental problem is that the majority of these columns are numerical and could have provided context for the analysis, but in their current state they are close to useless.
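Using missingno
Code
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20, 15))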
There is a high correlation between many of the columns that have missing data. Since this dataset contains general lab examinations, it is quite possible that a patient simply did not appear for a particular lab test; and since many of these results come from the same procedure, skipping one visit leaves numerous values empty at once.
Using UpSet
Note: Since C has too many columns, the image generated by an UpSet plot would be too large to render, so we skip the UpSet plot for C.
These hypotheses will become clearer once missing values are dropped and new heatmaps are plotted, which is part of the next steps.
3.2.0.2 Taking care of missing data
As previously discussed, missing values are a big obstacle in these data sets, so dealing with them is a priority before carrying out any analysis.
Missing data for dataset A can be handled during cleaning and munging. As for dataset B, we need to implement the following two steps:
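Dropping correlated columns
Modifying the 'Symptoms' column
Code
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify symptoms column: a missing value means no symptom was recorded
B[['Symptoms']] = B[['Symptoms']].fillna("None")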
Any columns with more than 70% missing values should also be dropped since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50–60% range will be preserved and individually assessed.
Code
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]
3.2.0.3 Cleaning data to make it tidier
It is a good idea to rename the columns of each dataset to keep things organized before continuing. All column names should be lowercase, with underscores ('_') used in place of spaces (' ').
Code
def rename_columns(df):
    for i in range(len(df.columns)):
        # check the number of words in the column name
        if len(df.columns[i].split()) != 1:
            # two-word name: lowercase it and join the words with an underscore
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0])
                               + '_' + (str(df.columns[i]).lower().split()[1])},
                      inplace=True)
        else:
            # one-word name: just lowercase it
            df.rename(columns={str(df.columns[i]): str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)
Some columns in dataset A store dates as strings (object dtype). Let's convert them to a consistent format.
Code
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]
        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]
        # form year
        year = '19' + year[-2:]
        # form month (zero-pad single digits)
        if int(month) < 10:
            month = '0' + str(int(month))
        # form day (zero-pad single digits)
        if int(day) < 10:
            day = '0' + str(int(day))
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)
Now that the date-conversion function is defined, we can calculate the age of each patient.
Code
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # parse the given strings into dates
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age in whole years
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
Code
# Calculate the age of every patient at the time they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))
# append the age column to the dataframe
A['age_first_date'] = age_first_date
Now that we have calculated the respective ages, we no longer need those columns. We should also drop negative age values, which are clearly erroneous.
Note: We won't remove the 'birthday' column, since it will be needed later together with the examination date. Dropping columns carries some risk, but we have already made the most use of these, so they are no longer relevant to our goal.
Code
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'], axis=1)
# Filter dataframe to drop absurd values of age
A = A[A.age_first_date > 0]
If we take a look at the 'admission' column in A, we see that apart from '+', '-', and NaN values it also contains some garbled (mis-encoded) ones. Let's clean that column so it does not cause any issues later on.
Code
# define a function for the admission column to be formatted appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' is in the value, replace the entire string with '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' is in the value, replace the entire string with '-'
            val = val[val.index('-')]
        else:
            # if neither literal is present, the value is irrelevant. It won't be dropped
            # since this is medical data, but it is replaced with nan since it wasn't recorded.
            val = np.nan
    return val
Code
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
| | id | sex | birthday | admission | diagnosis | age_first_date |
|---|---|---|---|---|---|---|
| 0 | 2110 | F | 1934/02/13 | + | RA susp. | 58.0 |
| 1 | 11408 | F | 1937/05/02 | + | PSS | 35.0 |
| 4 | 27654 | F | 1936/03/25 | + | RA, SLE susp | 55.0 |
| 6 | 43003 | M | 1937/11/24 | - | Raynaud's phenomenon | 56.0 |
| 9 | 57266 | M | 1923/07/25 | + | RA | 69.0 |
Moving on to dataset B.
B has 3 different anti-Cardiolipin antibody columns whose values are on different scales. We should normalize these, since they may be useful later for temporal analysis.
Code
# Normalize the values of the 3 different anti-Cardiolipin antibody columns (z-score)
B['acl_igg'] = (B['acl_igg'] - B['acl_igg'].mean()) / B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm'] - B['acl_igm'].mean()) / B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga'] - B['acl_iga'].mean()) / B['acl_iga'].std()
Also, notice that the 'id' column in B has missing values. Since this dataset contains our dependent variable and records the patients who were tested for it, rows without an id cannot be linked back to a patient and could generate misleading results.
These are just 36 rows out of 806, so dropping them won't create any significant issue.
Code
# Drop NAs in the id column of B since it is the main unit of analysis (36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')
Also, the 'ana' column has an out-of-range value of '>4096' for the measure of anti-nucleus antibodies in a patient. Let's fix that.
Code
# filter dataset B for correct ana values
B = B[B.ana != '>4096']
That concludes B. Let's move on to our third dataset.
In C, we again see a date column, but stored with integer dtype. Let's fix that.
Code
# Convert date column from int to date
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # zero-pad single-digit months
        if int(month) < 10:
            month = '0' + str(int(month))
        # zero-pad single-digit days
        if int(day) < 10:
            day = '0' + str(int(day))
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert dtype of date column to date
C['lab_date'] = C['date'].apply(int_to_date)
# Drop the original 'date' column now that it has been converted
C = C.drop(columns=['date'], axis=1)
C.head()
| | id | got | gpt | ldh | alp | tp | alb | ua | un | cre | ... | wbc | rbc | hgb | hct | plt | u-pro | crp | c3 | c4 | lab_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | 24 | 12 | 152 | 63 | 7.5 | 4.5 | 3.4 | 16.0 | 0.6 | ... | 5.9 | 4.69 | 9 | 31.6 | 380 | 0 | NaN | NaN | NaN | 1986/04/19 |
| 1 | 2110 | 25 | 12 | 162 | 76 | 7.9 | 4.6 | 4.7 | 18.0 | 0.6 | ... | 6.9 | 4.73 | 9.3 | 31.8 | 323 | NaN | <0.002 | NaN | NaN | 1986/04/30 |
| 2 | 2110 | 22 | 8 | 144 | 68 | 7 | 4.2 | 5 | 18.0 | 0.6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | 1986/05/02 |
| 3 | 2110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 1986/05/06 |
| 4 | 2110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 1986/05/07 |

5 rows × 24 columns
3.2.0.4 Check individual datasets before merging
Let's have one final look at each dataset before merging.
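A minimal sketch of that final check, re-running the same klib plot on the cleaned frames (the rendered plots are omitted here):
Code
# one last missing-value overview of each cleaned dataset
a_final = klib.missingval_plot(A, figsize=(10, 15))
b_final = klib.missingval_plot(B, figsize=(10, 15))
c_final = klib.missingval_plot(C, figsize=(10, 15))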
Too many missing values remain, and they account for most of the uncertainty in these datasets.
3.2.0.5 Merging the datasets
In terms of joining data sets, there are multiple ways of doing so depending on the question being answered.
Now that we have an idea, we will first merge A with C, since A contains basic information about the patients and C has the data from general laboratory examinations stored in hospital information systems. Merging these will filter out the ids of patients who did not get tested in these examinations.
Then the last step would be to merge the combined one with B.
Since B contains our target variable (Thrombosis), it has the information of those patients that got tested via the ‘Special Laboratory Examination’. Merging B at the end will give us the final dataset that we can use for EDA and strengthen our final conclusion.
Since ‘ID’ is the only common column amongst the 3 datasets, it will be used in the ‘left’ join for merging.
3.2.0.5.1 Step 1: Merge A with C
Code
# Merge A with C using a left join on the id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
| | id | got | gpt | ldh | alp | tp | alb | ua | un | cre | ... | u-pro | crp | c3 | c4 | lab_date | sex | birthday | admission | diagnosis | age_first_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | 24 | 12 | 152 | 63 | 7.5 | 4.5 | 3.4 | 16.0 | 0.6 | ... | 0 | NaN | NaN | NaN | 1986/04/19 | F | 1934/02/13 | + | RA susp. | 58.0 |
| 1 | 2110 | 25 | 12 | 162 | 76 | 7.9 | 4.6 | 4.7 | 18.0 | 0.6 | ... | NaN | <0.002 | NaN | NaN | 1986/04/30 | F | 1934/02/13 | + | RA susp. | 58.0 |
| 2 | 2110 | 22 | 8 | 144 | 68 | 7 | 4.2 | 5 | 18.0 | 0.6 | ... | NaN | - | NaN | NaN | 1986/05/02 | F | 1934/02/13 | + | RA susp. | 58.0 |
| 3 | 2110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | NaN | NaN | NaN | 1986/05/06 | F | 1934/02/13 | + | RA susp. | 58.0 |
| 4 | 2110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | NaN | NaN | NaN | 1986/05/07 | F | 1934/02/13 | + | RA susp. | 58.0 |

5 rows × 29 columns
Check the shape.
Code
C_A.shape
(57512, 29)
3.2.0.5.2 Step 2: Merge combined (C_A) with B
Code
# Merge C_A with B using a left join on the id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
| | id | examination_date | acl_igg | acl_igm | ana | ana_pattern | acl_iga | diagnosis_x | symptoms | thrombosis | ... | u-pro | crp | c3 | c4 | lab_date | sex | birthday | admission | diagnosis_y | age_first_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14872 | 1997/5/27 | -0.114128 | -0.035902 | 256 | P | -0.038639 | MCTD, AMI | AMI | 1 | ... | 0 | - | NaN | NaN | 1981/02/17 | NaN | NaN | NaN | NaN | NaN |
| 1 | 14872 | 1997/5/27 | -0.114128 | -0.035902 | 256 | P | -0.038639 | MCTD, AMI | AMI | 1 | ... | NaN | - | NaN | NaN | 1981/03/16 | NaN | NaN | NaN | NaN | NaN |
| 2 | 14872 | 1997/5/27 | -0.114128 | -0.035902 | 256 | P | -0.038639 | MCTD, AMI | AMI | 1 | ... | NaN | - | NaN | NaN | 1981/04/06 | NaN | NaN | NaN | NaN | NaN |
| 3 | 14872 | 1997/5/27 | -0.114128 | -0.035902 | 256 | P | -0.038639 | MCTD, AMI | AMI | 1 | ... | NaN | - | NaN | NaN | 1981/04/27 | NaN | NaN | NaN | NaN | NaN |
| 4 | 14872 | 1997/5/27 | -0.114128 | -0.035902 | 256 | P | -0.038639 | MCTD, AMI | AMI | 1 | ... | NaN | - | NaN | NaN | 1981/05/01 | NaN | NaN | NaN | NaN | NaN |

5 rows × 38 columns
Check the shape.
Code
df_final.shape
(20279, 38)
Since 'diagnosis' was present in both B and C_A, the final dataframe has two diagnosis columns. If you look carefully, their values largely overlap. Let's merge them into a single updated 'diagnosis' column for df_final.
Code
for i in range(len(df_final)):
    # check for nan in diagnosis_x; if present, split the value into a list
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # check for nan in diagnosis_y; if present, split the value into a list
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")

# define an empty column to store the combined diagnosis
df_final["diagnosis"] = ""
for i in range(len(df_final)):
    # fill the value with diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # fill the value with diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # if both are present, take the union of the two lists
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))
Recall that we didn't drop the 'birthday' column earlier. The main reason was to calculate each patient's age at the time of their Thrombosis lab test.
Now that the datasets have been merged, we can use the ‘birthday’ and ‘examination_date’ column to calculate the same.
Code
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)
# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))
# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination
Now that we have used these columns, we can drop them, as sketched below.
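A minimal sketch of that drop (assuming these intermediate columns are not needed again):
Code
# drop the intermediate date columns now that age_at_examination has been derived (sketch)
df_final = df_final.drop(columns=['birthday', 'examination_date'], axis=1)
The missing-data correlations described next can be visualized with the same missingno heatmap used earlier (again a sketch; the rendered plot is not shown here):
Code
# correlation amongst missing values in the merged dataframe
df_msno = msno.heatmap(df_final, figsize=(20, 15))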
Some columns are still highly correlated with one another in terms of missing data.
For example, 'GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', and 'CRE' have missing-value correlations of approximately 1.
The sets ('WBC', 'RBC', 'HGB', 'HCT', 'PLT') and ('C3', 'C4') are likewise highly correlated in their missingness.
We need to analyze and then handle this missing data. Since this is a medical dataset, we can't simply drop or modify such columns; we need to take precautions and then follow appropriate steps.
Using UpSet
Note: Since df_final has too many columns, the image generated by an UpSet plot would be too large to render.
4 Exploratory analysis
Let's plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot displays the counts of 'age_first_date', i.e. the ages of patients when they first visited the hospital.
Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# put a title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for the x and y axes
plt.margins(x=0.01, y=0.1)
plt.show()
From the above plot, we infer two things:
1. There are more female patients than male patients.
2. The trends for both categories are broadly similar, with the highest patient counts in the 20-25 age range.
We also see that the counts rise at the very young ages irrespective of sex, possibly due to vaccinations and other medical checks for infants and kids. People over 30 don't tend to visit the hospital as much for checkups.
The analysis might suggest that thrombosis is a sex-predominant condition, but that is largely because the dataset itself contains more female individuals. Since the data alone cannot determine whether sex matters for thrombosis, this exposes a serious weakness and imbalance in the data.
Let's do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis has evolved over time.
Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date', 'thrombosis']).sort_values(by=['examination_date'])
# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])
# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year
# count of thrombosis patients per year, one column per thrombosis level
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()
# plot a stacked bar chart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5),
                            color=['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axes
ax.set_xlabel('Year')
ax.set_ylabel('Count')
The chart above highlights that the data may be biased. Since most reports are negative for thrombosis, analyses can easily lead to incorrect inferences, and among positive cases, levels 2 and 3 are very rare. For instance, if we were to study the average IGA of patients with thrombosis level 3, the results might differ substantially from the true population value because the sample is so small.
Before continuing with any analysis, it is crucial to emphasize this limitation, as this is only a brief illustration of the potential problems that could arise.
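A quick check of the class balance makes this scarcity concrete (a minimal sketch):
Code
# counts per thrombosis level; levels 2 and 3 are rare
B['thrombosis'].value_counts()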
Let's look at the age distribution of patients when they got tested for thrombosis. This plot also contributes to the temporal analysis.
Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')
# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using the cleaned frame, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=age_exam_nona, x="age_at_examination", hue="thrombosis",
                  multiple="dodge", bins=20, kde=True)
# define labels of the axes
plt.xlabel("Age (years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axes
plt.margins(x=0.005, y=0.1)
plt.show()
The plot above again shows the uncertainty in the data. The kernel densities for the different thrombosis levels follow a similar pattern, except that the number of patients who tested negative far exceeds those who tested positive. Notably, the 30-40 age group shows the most severe thrombosis, while for the 15-30 age group the likelihood of thrombosis being detected is low.
Now we will conduct a comprehensive analysis of thrombosis since it is important to gain a deeper understanding of its associated side effects and prevalence among different age groups and genders within the population.
Below is a density plot that shows the age distribution of people that have thrombosis vs people who do not.
Code
# filter df to patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df to patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]
# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']
# Filter outliers
age_t = age_t[age_t > 2]
age_nt = age_nt[age_nt > 2]
# Age distribution plots
sns.kdeplot(age_t, color="#f44336", fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797", fill=True, label="Patients without Thrombosis")
# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age (years)")
# Display plot
plt.show()
The density plot above shows that Thrombosis does not seem to occur in people over 70. In fact, most Thrombosis patients are in the 20-60 range, which is much the same as for patients without Thrombosis. The only slight difference is that patients without thrombosis skew slightly younger within that range, but this does not seem to hold much relevance.
Next step is to see if there are any differences within the Thrombosis group, meaning between the different levels of the disease’s severity.
Code
# split the thrombosis levels into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]
# store the respective ages of patients for each level
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']
# only check for patients with age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]
# plot the kernel density plots using the ages computed above
sns.kdeplot(age_t1, color="#979797", fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black", fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336", fill=True, label="Level 3")
# Add legend
plt.legend()
# set title
plt.title("Age Distribution of different levels of Thrombosis")
# set x-label
plt.xlabel("Age (years)")
# Display plot
plt.show()
Different levels of Thrombosis turn out to have different age distributions: the age density of level 3 patients is the most concentrated, in the 30-40 range, while those of level 1 and level 2 patients are concentrated in 15-35 and 0-30 respectively.
Let's look at box plots to check for outliers.
Code
# Subset columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]
# drop nan values from subset columns
lab_results = lab_results.dropna(subset=['ana'])  # Only 4 missing values
# change ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')
# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)
# plot each boxplot and set its title
axs[0, 0].boxplot(boxp['acl_iga'])
axs[0, 0].set_title('acl_iga')
axs[0, 1].boxplot(boxp['acl_igg'])
axs[0, 1].set_title('acl_igg')
axs[1, 0].boxplot(boxp['acl_igm'])
axs[1, 0].set_title('acl_igm')
axs[1, 1].boxplot(boxp['ana'])
axs[1, 1].set_title('ana')
# add axis labels
for ax in axs.flat:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Value')
# add overall title
plt.suptitle('Box plots of features', fontsize=16)
# adjust layout
plt.tight_layout()
# show plot
plt.show()
The data has outlier issues.
Setting the threshold at 100 for IgA, IgG, and IgM makes sense, while a tighter cutoff of 50 for ANA removes its extreme values.
Code
lab_results['ana'] = lab_results['ana'].astype('int')
# set the threshold at 100 for the anti-Cardiolipin antibodies and 50 for ana
lab_results = lab_results[(lab_results['acl_iga'] < 100) & (lab_results['acl_igg'] < 100) &
                          (lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]
# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)
# plot each boxplot and set its title
axs[0, 0].boxplot(boxp['acl_iga'])
axs[0, 0].set_title('acl_iga')
axs[0, 1].boxplot(boxp['acl_igg'])
axs[0, 1].set_title('acl_igg')
axs[1, 0].boxplot(boxp['acl_igm'])
axs[1, 0].set_title('acl_igm')
axs[1, 1].boxplot(boxp['ana'])
axs[1, 1].set_title('ana')
# add axis labels
for ax in axs.flat:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Value')
# add overall title
plt.suptitle('Box plots of features', fontsize=16)
# adjust layout
plt.tight_layout()
# show plot
plt.show()
The data is now in a better state to analyze (a handful of outliers remain).
Below is a radar chart that compares patients without Thrombosis to those with it (at each level) in terms of averaged lab results.
Code
# Normalize columns to have them all on the same scale
from sklearn.preprocessing import StandardScaler

# use a StandardScaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put them in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]),
                                columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']
# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()
categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
# store each group's means in a separate variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))
# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']
# define figure size
plt.figure(figsize=(8.5, 8.5))
# add a polar subplot
plt.subplot(polar=True)
# plot the graph for each group's mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1, color=colors[0], label='No Thrombosis', linewidth=2)
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2, color=colors[1], label='Thrombosis 1', linewidth=2)
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3, color=colors[2], label='Thrombosis 2', linewidth=2)
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4, color=colors[3], label='Thrombosis 3', linewidth=2)
# set title
plt.title('Lab results averages of patients', size=24)
lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[''] + categories, color='gray')
# set legend
plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
The radar chart shows that the only lab result with significant differences between the Thrombosis groups is:
ANA: the higher the ANA value, the more severe the Thrombosis diagnosis. Thrombosis level 1 seems to have similar acl_iga and acl_igm values.
A possible predictor of Thrombosis could be the symptoms a patient is experiencing.
Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]
# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))
# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):
    thrombosis_p = B[B['thrombosis'] == thrombosis_category]
    # Concatenate all the strings in the symptoms column into a single string
    text = ' '.join(thrombosis_p['symptoms'])
    # Create a set of stopwords; exclude 'Thrombosis' (an obvious word) plus some mis-encoded tokens
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ', 'ë¦ÄêÉÈû¼îîÉ ÅÌ ',
                      'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])
    # Define the color palette
    custom_colors = ['#121111', '#3646f4']

    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)
    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")
# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
The main symptoms of each thrombosis level are the following:
Level 1: Brain infarction
Level 2: CNS lupus
Level 3: Thrombocytopenia
The word clouds also highlight a major issue in the data mentioned previously, the lack of Thrombosis patients, which is why there are fewer words as the Thrombosis level becomes more severe.
5 Final Plots
Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# put a title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for the x and y axes
plt.margins(x=0.01, y=0.1)
plt.show()
Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date', 'thrombosis']).sort_values(by=['examination_date'])
# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])
# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year
# count of thrombosis patients per year, one column per thrombosis level
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()
# plot a stacked bar chart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5),
                            color=['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axes
ax.set_xlabel('Year')
ax.set_ylabel('Count')
Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')
# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using the cleaned frame, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=age_exam_nona, x="age_at_examination", hue="thrombosis",
                  multiple="dodge", bins=20, kde=True)
# define labels of the axes
plt.xlabel("Age (years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axes
plt.margins(x=0.005, y=0.1)
plt.show()
Code
# Normalize columns to have them all on the same scale
from sklearn.preprocessing import StandardScaler

# use a StandardScaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put them in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]),
                                columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']
# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()
categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
# store each group's means in a separate variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))
# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']
# define figure size
plt.figure(figsize=(8.5, 8.5))
# add a polar subplot
plt.subplot(polar=True)
# plot the graph for each group's mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1, color=colors[0], label='No Thrombosis', linewidth=2)
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2, color=colors[1], label='Thrombosis 1', linewidth=2)
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3, color=colors[2], label='Thrombosis 2', linewidth=2)
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4, color=colors[3], label='Thrombosis 3', linewidth=2)
# set title
plt.title('Lab results averages of patients', size=24)
lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[''] + categories, color='gray')
# set legend
plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]
# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))
# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):
    thrombosis_p = B[B['thrombosis'] == thrombosis_category]
    # Concatenate all the strings in the symptoms column into a single string
    text = ' '.join(thrombosis_p['symptoms'])
    # Create a set of stopwords; exclude 'Thrombosis' (an obvious word) plus some mis-encoded tokens
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ', 'ë¦ÄêÉÈû¼îîÉ ÅÌ ',
                      'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])
    # Define the color palette
    custom_colors = ['#121111', '#3646f4']

    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)
    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")
# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
6 Technical summary
The quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are listed below:
Missing Data: Missing values are a major problem in this data, particularly in TSUMOTO_C.CSV, where many columns are missing more than 80% of their values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests, and some significant connections and patterns were likely missed because of it.
Lack of Thrombosis cases: The major goals of this study were to gain a deeper understanding of thrombosis and identify data patterns that might act as predictors of the condition. The data, however, contains extremely few thrombosis cases, particularly the more severe ones (levels 2 and 3), making any analysis very sensitive to the addition of more cases.
Data not balanced: When sampling for thrombosis-positive cases exclusively, TSUMOTO_B.CSV is significantly unbalanced, with nearly all patients (about 85%) being female. This could lead to the incorrect conclusion that thrombosis occurs considerably more commonly in females, and it prevented an accurate sex-based analysis.
Dates: In some cases, there is a discrepancy of more than 5 years between the dates of patients' lab tests and their thrombosis testing. Given that lab results taken closer to the Thrombosis assessment could have been drastically different, this weakens any analysis of the lab data (TSUMOTO_C.CSV).
Yet it's vital to note that all of the data was gathered between 1989 and 1996, a period in which industry-wide data discrepancies were frequent and data collection methods weren't the most sophisticated. An examination of thrombosis using more recent data, if available, would be a better idea.
Below are some of the main takeaways (elaborated further in results.html):
Level 3 Thrombosis appears to happen with people on the higher end of the age distribution.
ANA seems to be the best predictor among the lab results.
There are various stages of thrombosis, and each level has its own set of symptoms.
Source Code
---title: "EDA on Medical Data of Thrombosis Diagnosis"author: - name: Raghav Sharma affiliations: - name: Georgetown University url: https://github.com/anly503/hw2-spring-2023-raghavSharmaCode email: rs2190@georgetown.edu ---# Introduction## ObjectiveImagine yourself as a data scientist working at a consulting firm. An external entity (client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm will move onto a new unrelated project. <br><br>Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.## DataData has been collected from Georgetown University's Hospital and contain information about patients who have been admitted to the hospital's <b>Collagen</b> disease clinic. <br><br>Collagen diseases are auto-immune diseases whose patients generate antibodies attacking to their bodies. For example, if a patient generates antibodies to lung, he/she will lose the respiratory function in a chronic course and finally lose their lives. <b>Thrombosis</b> is the most severe form of Collagen diseases. <br><br> There are 3 CSV files that make up for the medical data. These files are:<br><br>1. TSUMOTO_A.CSV: Contains basic information about patients<br><br>2. TSUMOTO_B.CSV: Contains information about the 'Special Laboratory Examinations' of patients to detect Thrombosis.<br><br>3. TSUMOTO_C.CSV: Contains Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3) This dataset has temporal stamps.<br><br>## Goals of the projectThis project aims to thoroughly understand and analyze Thrombosis which is a Collagen disease by achieving the following:<br><br>1. Search for good patterns which detect and predict thrombosis.<br><br>2. Search for temporal patterns specific/sensitive to thrombosis.<br><br>3. Search for good features which classifies collagen diseases correctly.<br><br>4. Search for temporal patterns specific/sensitive to each collagen diseases.<br><br>To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.# Initial QuestionsWe would like to investigate the following questions prior to our analyses:<br><br>1. How does thrombosis evolve over time?<br><br>2. What age group is Thrombosis, the most prevalent in?<br><br>3. Does it depends upon the sex of the patient?<br><br>4. Is this dataset good enough to understand and visualize the factors causing and effects of Thrombosis?<br><br># Data Ingestion, Cleaning and MungingBefore doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc. 
<br><br>## Ingestion#### Importing the libraries```{python}# Importing all necessary packagesimport pandas as pdimport numpy as npimport matplotlib.pyplot as plt%matplotlib inlineimport seaborn as snsimport klibimport missingno as msnofrom upsetplot import UpSetfrom wordcloud import WordCloud, STOPWORDSimport randomimport sysimport warningswarnings.filterwarnings("ignore")```#### Importing the datasets```{python}# Redirect stdout and stderr to null devicesys.stdout =open('/dev/null', 'w')sys.stderr =open('/dev/null', 'w')```1. <b>TSUMOTO_A.CSV</b>```{python}A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding='unicode_escape')A.head()``````{python}"Shape of A: "+str(A.shape)```2. <b>TSUMOTO_B.CSV</b>```{python}B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding='unicode_escape')B.head()``````{python}"Shape of B: "+str(B.shape)```3. <b>TSUMOTO_C.CSV</b> <br><br>Note: Couple of lines in this file were causing issues when reading the dataset. Using "error_bad_lines=False" argument resolved that. A total of 30 lines (out of 50k + where dropped), so any impact of this drop would be very minimal.```{python}# A couple of bad (unreadable) lines in the 3rd dataset were dropped. The dropped lines account for 30 (out of 50k+), # which makes this drop insignificantC = pd.read_csv("data/TSUMOTO_C.CSV" , encoding='unicode_escape', error_bad_lines=False) C.head()``````{python}"Shape of C: "+str(C.shape)```C is by far the largest dataframe that has information about general laboratory examinations.## Cleaning and Munging#### Analyzing Missing valuesSearching for missing data is the first and most important stage in data cleaning. Checking for missing values for each column (per data set) would give a solid idea of which columns are necessary and which need to be adjusted or omitted as this project entails combining the datasets.<br><br>It is very important to note that this is medical data. So if any of the dataset columns has a missing value, then it means that it wasn't recorded. We need to carefully analyze and then take appropriate steps to handle missing data.We will use 'klib', 'missingno', and 'UpSet' libraries to visualize the missing data for all the datasets.<br><br>1. klib library helps us to visualize missing data trends in the dataset. Using the 'missing_val' plot, we will be able to extract necessary information of the missing data in every column. <br><br>2. missingno library will be helpful in plotting the correlation amongst the columns having missing values. This way we will be able to make a decision if a certain column needs to be dropped or not.<br><br>3. 
Lets create a function to compute missing-data statistics for a dataset.

```{python}
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define the columns of the statistics data-frame
    df_stats = pd.DataFrame(columns=['column', 'unique_val', 'num_unique_val',
                                     'num_unique_val_nona', 'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    for c in df.columns:
        # column name
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum() / len(df)).round(3) * 100
        # append this column's row of statistics (use pd.concat in pandas >= 2.0)
        df_stats = df_stats.append(tmp)
    return df_stats
```

Lets take a look at the missing-value statistics of A

```{python}
df_A_stats = missing_val_stats(A)
df_A_stats
```

Visualizing missing values in A <br><br>
1. Using klib

```{python}
"Missing Value Plot (TSUMOTO_A.CSV)"
a_klib = klib.missingval_plot(A, figsize=(10,15))
```

A few columns have a small share of missing values, with 'First Date' reaching a maximum of 20%. These can be dealt with at later stages, when/if those columns are used.

2. Using missingno

```{python}
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))
```

Nothing too significant for A.

3. Using UpSet

```{python}
"UpSet plot for TSUMOTO_A.CSV"
# check for nan values
a_upset = pd.isna(A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission', 'Diagnosis']])
# group by the missing-value indicators
a_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission', 'Diagnosis'], as_index=True).size()
# plot the groupby object
UpSet(a_upset).plot(fig=plt.figure(figsize=(8, 5)))
plt.show()
```

As stated earlier, there isn't much correlation amongst the missing values in A. The UpSet plot confirms this: there are few cases of two or more values missing together.<br><br>
Lets take a look at the missing-value statistics of B

```{python}
df_B_stats = missing_val_stats(B)
df_B_stats
```

Visualizing missing values in B <br><br>
1. Using klib

```{python}
"Missing Value Plot (TSUMOTO_B.CSV)"
b_klib = klib.missingval_plot(B, figsize=(10,15))
```

This dataset is more concerning because some of its columns are missing more than 70% of their data. 'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge in terms of replacing missing values, for several reasons.<br><br>
The 'Symptoms' column has the most missing values (726). A plausible scenario is that a patient simply had no symptoms, recorded as NaN; we should change these to 'None' for our later analysis. <br><br>
It is hard to gauge how the missing values are distributed and how they might affect the data. Imputing them would distort the data and introduce bias, thereby producing unrepresentative results.

2. Using missingno

```{python}
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8,5))
```

Looking at the plot above, the missing values of the 'KCT', 'RVVT', and 'LAC' columns are correlated. Two of these three columns can be dropped, considering they represent the same thing (a measure of the degree of coagulation).
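The heatmap only suggests this relationship; a quick numeric check makes it concrete. Below is a minimal sketch, assuming B still has its original (pre-rename) column names; it correlates the missingness indicators of the three tests:

```{python}
# sketch: pairwise correlation of missingness indicators for the three coagulation tests;
# values near 1 mean the tests tend to be missing together
coag_missing = B[['KCT', 'RVVT', 'LAC']].isna().astype(float)
coag_missing.corr()
```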
3. Using UpSet

```{python}
"UpSet plot for TSUMOTO_B.CSV"
# check for missing values
b_upset = pd.isna(B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern', 'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']])
# group by the missing-value indicators
b_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern', 'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()
# plot the groupby object
UpSet(b_upset).plot(fig=plt.figure(figsize=(8, 5)))
plt.show()
```

We see the correlation quite clearly in the plot: the tallest bar is the one where 'KCT', 'RVVT', 'LAC', and 'Symptoms' are all missing together. A similar pattern can be seen for other columns. <br><br>
Lets take a look at the missing-value statistics of C

```{python}
df_C_stats = missing_val_stats(C)
df_C_stats
```

Visualizing missing values in C <br><br>
1. Using klib

```{python}
"Missing Value Plot (TSUMOTO_C.CSV)"
c_klib = klib.missingval_plot(C, figsize=(10,15))
```

Our third dataset turns out to be the most problematic: over half of its columns (19 out of 44) are missing more than 60% of their values, including 15 columns missing more than 80%. The fundamental problem is that most of these columns are numerical and could have provided context for the analysis, but in their current state they are close to unusable.

2. Using missingno

```{python}
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20,15))
```

There is high correlation between many of the columns with missing data. Since this dataset covers general lab examinations, a patient may simply not have taken a particular lab test; and because many of these results come from the same procedure, skipping one procedure leaves numerous results missing at once.<br><br>
3. Using UpSet <br>
Note: Since C has too many columns, the image generated by the UpSet plot would be too large to render, so we skip it for C.<br><br>
The hypotheses above will be better understood once missing values are dropped and new heatmaps are plotted, which is part of the next steps.<br><br>

#### Taking care of missing data

As previously discussed, missing values are a big obstacle in these datasets, so dealing with them is a priority before carrying out any analysis. <br><br>
Missing data in dataset A can be handled during cleaning and munging. For dataset B, we need to do the following:<br><br>
1. Drop the correlated columns <br><br>
2. Modify the 'Symptoms' column<br><br>

```{python}
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify the symptoms column
B[['Symptoms']] = B[['Symptoms']].fillna("None")
```

Any columns with more than 70% missing values should also be dropped, since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50-60% range will be preserved and individually assessed.

```{python}
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]
```

#### Cleaning data to make it tidier

It is a good idea to rename the columns of each dataset to keep things organized before continuing with the analysis. All column names should be lowercase, with underscores ('_') in place of spaces (' ').
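As a side note before the loop-based helper below: pandas string methods can rewrite all the names in one pass. A sketch, not applied here; unlike the helper, it keeps every word of a multi-word name, which is equivalent for these tables since their multi-word names have exactly two words:

```{python}
# sketch: vectorized rename; lowercase all names, underscores for spaces
def rename_columns_vectorized(df):
    df.columns = df.columns.str.lower().str.replace(' ', '_')
```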
```{python}
def rename_columns(df):
    for i in range(len(df.columns)):
        # check the number of words in the column name
        if len(df.columns[i].split()) != 1:
            # multi-word name: lowercase and join the first two words with '_'
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0]) + '_' + (str(df.columns[i]).lower().split()[1])}, inplace=True)
        else:
            # single-word name: just lowercase it
            df.rename(columns={str(df.columns[i]): str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)
```

Some columns in dataset A store dates as plain strings in inconsistent formats. Lets convert them to a consistent format.

```{python}
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]
        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]
        # form year
        year = '19' + year[-2:]
        # zero-pad single-digit months
        if int(month) < 10:
            month = '0' + str(int(month))
        # zero-pad single-digit days
        if int(day) < 10:
            day = '0' + str(int(day))
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)
```

Now that the dates are in a consistent format, we can calculate the age of each patient.

```{python}
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # parse the given dates
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age in whole years
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
```

```{python}
# Calculate the age of every patient at the time they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))
# append the age column to the dataframe
A['age_first_date'] = age_first_date
```
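For reference, the same ages can be computed without a Python-level loop by letting pandas parse the formatted strings. A sketch (the variable names are illustrative, and the day-based floor division approximates whole years, so it can be off by a day around leap years):

```{python}
# sketch: vectorized age at first visit via pandas datetime arithmetic
bd = pd.to_datetime(A['birthday'], format='%Y/%m/%d', errors='coerce')
fd = pd.to_datetime(A['first_date'], format='%Y/%m/%d', errors='coerce')
age_vectorized = (fd - bd).dt.days // 365
```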
Now that we have calculated the ages, we no longer need some of these columns. We should also drop negative values of age, which are clearly data errors.

Note: We won't remove the 'birthday' column, since it will be used later together with the examination date. Dropping columns is always a risk, but we have already extracted what we need from these, so they are now irrelevant to our goal.

```{python}
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'], axis=1)
# Filter the dataframe to drop absurd (negative) values of age
A = A[A.age_first_date > 0]
```

If we take a look at the 'admission' column in A, we see that apart from '+', '-' and NaN values, it also has some absurd ones. Lets update that column so it does not cause issues later on.

```{python}
# define a function to format the admission column appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' appears anywhere in the value, reduce the value to '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' appears anywhere in the value, reduce the value to '-'
            val = val[val.index('-')]
        else:
            # neither marker is present, so the value is irrelevant. It won't be dropped
            # (this is medical data) but it is replaced with nan since it wasn't recorded.
            val = np.nan
    return val
```

```{python}
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
```

Moving on to dataset B. <br><br>
B has 3 different anti-Cardiolipin antibody columns whose values are on different scales. We should standardize them, since they may be useful later for temporal analysis.

```{python}
# Standardize (z-score) the 3 anti-Cardiolipin antibody columns
B['acl_igg'] = (B['acl_igg'] - B['acl_igg'].mean()) / B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm'] - B['acl_igm'].mean()) / B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga'] - B['acl_iga'].mean()) / B['acl_iga'].std()
```

Also notice that the 'id' column in B has missing values. Since this dataset holds our dependent variable and describes the patients who were tested for it, rows without an id could generate misleading results.<br><br>
These are just 36 rows out of 806, so dropping them won't create any significant issue.

```{python}
# Drop NAs in the ID column of B since ids are the main unit of analysis (36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')
```

The 'ana' column also has an inappropriate value of '>4096' for the measure of anti-nucleus antibodies. Lets fix that.

```{python}
# filter dataset B for valid ana values
B = B[B.ana != '>4096']
```

That concludes B. Lets move on to our third dataset. <br><br>
In C, we again see a date column, but with dtype integer. Lets fix that.

```{python}
# Convert the date column from int to a date string
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # zero-pad single-digit months
        if int(month) < 10:
            month = '0' + str(int(month))
        # zero-pad single-digit days
        if int(day) < 10:
            day = '0' + str(int(day))
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert the date column
C['lab_date'] = C['date'].apply(int_to_date)
# Drop the original 'date' column now that it has been converted
C = C.drop(columns=['date'], axis=1)
C.head()
```

#### Check individual datasets before merging

Lets have one final look at each dataset before merging.<br><br>
1. TSUMOTO_A.CSV

```{python}
A.head()
```

Check the missing-data statistics

```{python}
df_A_stats_new = missing_val_stats(A)
df_A_stats_new
```

There are very few missing values left (30 in total, across 'sex' and 'admission'). Given such a small number, we can drop those rows.

```{python}
# drop nan values from the subset columns
A = A.dropna(subset=['sex', 'admission'])
```

After removing NAs from those two columns, our first dataset has no missing values. <br><br>
2. TSUMOTO_B.CSV

```{python}
B.head()
```

Check the missing-data statistics

```{python}
df_B_stats_new = missing_val_stats(B)
df_B_stats_new
```

There is still some missing data that can't simply be dropped; this will be handled after merging.

3. TSUMOTO_C.CSV

```{python}
C.head()
```

Check the missing-data statistics

```{python}
df_C_stats_new = missing_val_stats(C)
df_C_stats_new
```

The many remaining missing values account for the uncertainty in this dataset.

#### Merging the datasets

There are multiple ways of joining these datasets, depending on the question being answered.<br><br>
We will first merge A with C: A contains basic information about the patients, and C has the data from general laboratory examinations stored in the hospital information systems. Merging these filters out the ids of patients who never took those examinations. <br><br>
The last step is to merge the combined table with B. Since B contains our target variable (thrombosis), it covers the patients who were tested via the 'Special Laboratory Examination'. Merging B last gives us the final dataset we can use for EDA and for strengthening our final conclusions.<br><br>
Since 'id' is the only column common to all 3 datasets, it will be used as the key in the 'left' joins below.<br><br>
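Before running the joins, it is worth checking how the patient ids actually overlap across the three tables. A quick sanity check (a sketch; the variable names are illustrative):

```{python}
# sketch: id overlap between the three tables
ids_a, ids_b, ids_c = set(A['id']), set(B['id']), set(C['id'])
print("B ids also in A:", len(ids_b & ids_a), "of", len(ids_b))
print("B ids also in C:", len(ids_b & ids_c), "of", len(ids_b))
```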
##### Step 1: Merge A with C

```{python}
# Merge A into C using a left join on the id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
```

Check the shape.

```{python}
C_A.shape
```

##### Step 2: Merge the combined table (C_A) with B

```{python}
# Merge C_A into B using a left join on the id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
```

Check the shape.

```{python}
df_final.shape
```

Since 'diagnosis' was present in both B and C_A, the final dataframe has two diagnosis columns. If you look carefully, their values are similar. Lets merge them into a single updated 'diagnosis' column for df_final.

```{python}
for i in range(len(df_final)):
    # if diagnosis_x is present, split it into a list
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # if diagnosis_y is present, split it into a list
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")

# define an empty column to hold the merged diagnosis
df_final["diagnosis"] = ""
for i in range(len(df_final)):
    # use diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # use diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # if both are present, use the deduplicated union of the two lists
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))
```
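The row-wise loop above works, but the same logic can be written more compactly with `apply`. A sketch (the helper name is illustrative, and the application line is left commented out so it does not alter the pipeline):

```{python}
# sketch: merge the two diagnosis columns into one deduplicated, sorted list per row
def merge_diagnoses(row):
    parts = []
    for col in ('diagnosis_x', 'diagnosis_y'):
        val = row[col]
        if isinstance(val, list):
            # already split into a list by the loop above
            parts.extend(v.strip() for v in val)
        elif pd.notna(val):
            # raw comma-separated string
            parts.extend(v.strip() for v in str(val).split(','))
    return sorted(set(parts))

# df_final['diagnosis'] = df_final.apply(merge_diagnoses, axis=1)
```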
Recall that we did not drop the 'birthday' column earlier; the main reason was to calculate each patient's age at the time of their thrombosis lab test. Now that the datasets have been merged, we can use the 'birthday' and 'examination_date' columns to do exactly that.

```{python}
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)

# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))
# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination
```

Now that we have used these columns, we can drop them.

```{python}
# drop irrelevant columns
df_final = df_final.drop(columns=['examination_date', 'diagnosis_x', 'lab_date', 'birthday', 'diagnosis_y'], axis=1)
```

A tidy dataframe keeps its columns organized, with the dependent variable at the end. Lets rearrange our merged dataset accordingly.

```{python}
# rearrange the columns
df_final = df_final[['id', 'sex', 'admission', 'age_first_date', 'age_at_examination', 'diagnosis',
                     'acl_igg', 'acl_igm', 'acl_iga', 'ana', 'ana_pattern', 'got', 'gpt', 'ldh', 'alp',
                     'tp', 'alb', 'ua', 'un', 'cre', 't-bil', 't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb',
                     'hct', 'plt', 'u-pro', 'crp', 'c3', 'c4', 'symptoms', 'thrombosis']]
df_final.head()
```

#### Missing data in the merged dataset

Lets visualize the missing data of the merged dataset.<br><br>
1. Using klib

```{python}
print('Missing Value Plot of df_final')
df_final_klib = klib.missingval_plot(df_final, figsize=(10,15))
```

In the plot above, a few columns show a similar trend in their missing values. Lets verify this using the missingno library.

2. Using missingno

```{python}
print('Correlation amongst missing values (df_final)')
df_final_msno = msno.heatmap(df_final, figsize=(20,15))
```

Some columns are still highly correlated with one another in terms of missing data. For example, 'got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', and 'cre' have missingness correlations of approximately 1.<br><br>
The column sets ('wbc', 'rbc', 'hgb', 'hct', 'plt') and ('c3', 'c4') are also highly correlated in their missing values.<br><br>
We need to analyze and then handle this missing data. Since this is a medical dataset, we can't simply drop or modify such columns; we need to take precautions and follow appropriate steps.<br><br>
3. Using UpSet <br>
Note: Since df_final has too many columns, the image generated by the UpSet plot would be too large to render, so we skip it.<br><br>

# Exploratory analysis

Lets plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot displays the count of 'age_first_date', i.e. the age of patients when they first visited the hospital.

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# put a title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for the x and y axes
plt.margins(x=0.01, y=0.1)
plt.show()
```

From the above plot, we infer two things:<br><br>
1. There are more female patients than male patients.<br><br>
2. The trend for both categories is broadly similar, with the highest count of patients around ages 20-25.<br><br>
We also see that the counts rise at the very beginning regardless of sex; this may be due to vaccinations and other medical checks for infants and kids. People older than 30 tend to visit the hospital less often for medical checkups. <br><br>
Note the imbalance: the dataset itself has far more female individuals, so thrombosis will appear female-predominant here regardless of any true effect. Since there is no way to determine from this data alone whether sex matters for thrombosis, this exposes a serious weakness in the data.
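The first inference is easy to verify directly; a one-line check (sketch):

```{python}
# sketch: sex balance behind the histogram above
A['sex'].value_counts(normalize=True).round(2)
```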
Lets do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis diagnoses have evolved over time.

```{python}
# drop nan values from the subset columns, then sort by examination_date
temporal_B = B.dropna(subset=['examination_date', 'thrombosis']).sort_values(by=['examination_date'])
# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])
# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year
# count of thrombosis patients per year, one column per thrombosis level
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()
# plot a stacked bar chart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set the title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define the axis labels
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

The chart above highlights that the data may be biased. Most reports are negative for thrombosis, which can lead to incorrect inferences, and among positive cases levels 2 and 3 are very rare. For instance, if we studied the average IgA of patients with thrombosis level 3, the result could differ substantially from the true population average simply because the sample is so small.<br><br>
It is crucial to emphasize this limitation before continuing with any analysis; this is only a brief illustration of the problems that could arise.<br><br>
Lets look at the age distribution of patients when they got tested for thrombosis. This plot is also temporal in nature.

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')
# define figure size
plt.figure(figsize=(8, 5))
# plot a histogram with age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=age_exam_nona, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# set the title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axes
plt.margins(x=0.005, y=0.1)
plt.show()
```

The plot above again shows the uncertainty in the data. The kernel densities for the different thrombosis levels follow a similar pattern, except that patients who tested negative vastly outnumber those who tested positive. Notably, the most severe thrombosis is concentrated in the 30-40 age group, while for the 15-30 age group the likelihood of a positive thrombosis finding is lower.
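Since these readings depend heavily on the class balance, it helps to put numbers on it. A minimal check (sketch), counting each thrombosis level in the merged data:

```{python}
# sketch: tally the thrombosis target, including missing values
df_final['thrombosis'].value_counts(dropna=False)
```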
<br><br>
Now we will conduct a more comprehensive analysis of thrombosis, to better understand its associated side effects and its prevalence across age groups and sexes within the population.

Below is a density plot that shows the age distribution of people who have thrombosis vs people who do not.

```{python}
# filter df for patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df for patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]
# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']
# Filter outliers
age_t = age_t[age_t > 2]
age_nt = age_nt[age_nt > 2]
# Age distribution plots
sns.kdeplot(age_t, color="#f44336", fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797", fill=True, label="Patients without Thrombosis")
# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age (years)")
# Display plot
plt.show()
```

The density plot above shows that thrombosis does not seem to occur in people over 70. In fact, most thrombosis patients are in the 20-60 age range, much like the patients without thrombosis. The only slight difference is that patients without thrombosis skew slightly younger within that range, but this difference is probably not meaningful.

Next, lets see whether there are differences within the thrombosis group, i.e. between the different levels of the disease's severity.

```{python}
# split the thrombosis patients by disease level into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]
# store the respective ages of patients for each level
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']
# only keep patients with age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]
# plot the kernel density of the ages computed above
sns.kdeplot(age_t1, color="#979797", fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black", fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336", fill=True, label="Level 3")
# Add legend
plt.legend()
# set title
plt.title("Age Distribution of different levels of Thrombosis")
# set x-label
plt.xlabel("Age (Years)")
# Display plot
plt.show()
```

Different levels of thrombosis turn out to have different age distributions: the age density of level 3 patients is the most concentrated, in the 30-40 range, while level 1 and level 2 patients concentrate around 15-35 and 0-30 respectively.<br><br>
Lets look at box plots to check for outliers.
```{python}
# Subset the columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]
# drop nan values from the subset columns (only 4 missing values in 'ana')
lab_results = lab_results.dropna(subset=['ana'])
# change the ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')
# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)
# plot each boxplot and set its title
axs[0, 0].boxplot(boxp['acl_iga'])
axs[0, 0].set_title('acl_iga')
axs[0, 1].boxplot(boxp['acl_igg'])
axs[0, 1].set_title('acl_igg')
axs[1, 0].boxplot(boxp['acl_igm'])
axs[1, 0].set_title('acl_igm')
axs[1, 1].boxplot(boxp['ana'])
axs[1, 1].set_title('ana')
# add axis labels
for ax in axs.flat:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Value')
# add an overall title
plt.suptitle('Box plots of features', fontsize=16)
# adjust layout
plt.tight_layout()
# show plot
plt.show()
```

The data has outlier issues. Setting the threshold at 100 for IgA, IgG, and IgM makes sense, as does a threshold of 50 for ANA (the cutoffs used in the filter below).

```{python}
lab_results['ana'] = lab_results['ana'].astype('int')
# set the threshold at 100 for the anti-Cardiolipin antibodies and 50 for ana
lab_results = lab_results[(lab_results['acl_iga'] < 100) & (lab_results['acl_igg'] < 100) &
                          (lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]
# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)
# plot each boxplot and set its title
axs[0, 0].boxplot(boxp['acl_iga'])
axs[0, 0].set_title('acl_iga')
axs[0, 1].boxplot(boxp['acl_igg'])
axs[0, 1].set_title('acl_igg')
axs[1, 0].boxplot(boxp['acl_igm'])
axs[1, 0].set_title('acl_igm')
axs[1, 1].boxplot(boxp['ana'])
axs[1, 1].set_title('ana')
# add axis labels
for ax in axs.flat:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Value')
# add an overall title
plt.suptitle('Box plots of features', fontsize=16)
# adjust layout
plt.tight_layout()
# show plot
plt.show()
```

The data is now in a better state to analyze (a handful of outliers remain).
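The cutoffs above were eyeballed from the box plots. For reference, a quantile-based rule is a more reusable alternative; the sketch below (with an illustrative helper name, not what was used in this analysis) keeps values within 1.5 IQR of the quartiles:

```{python}
# sketch: IQR-based outlier mask, per column
def iqr_mask(s, k=1.5):
    # quartiles and interquartile range of the series
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # True where the value lies within k * IQR of the quartiles
    return s.between(q1 - k * iqr, q3 + k * iqr)

# keep = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].apply(iqr_mask).all(axis=1)
# lab_results = lab_results[keep]
```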
Below is a radar chart comparing patients without thrombosis to those with it (at each level) in terms of averaged lab results.

```{python}
# Normalize columns to put them all on the same scale
from sklearn.preprocessing import StandardScaler

# use a StandardScaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put them in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]),
                                columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()
categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
# store each group's mean vector in its own variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']
# define figure size
plt.figure(figsize=(8.5, 8.5))
# add a polar subplot
plt.subplot(polar=True)
# plot the graph for the first group mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1, color=colors[0], label='No Thrombosis', linewidth=2)
# plot the graph for the second group mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2, color=colors[1], label='Thrombosis 1', linewidth=2)
# plot the graph for the third group mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3, color=colors[2], label='Thrombosis 2', linewidth=2)
# plot the graph for the fourth group mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4, color=colors[3], label='Thrombosis 3', linewidth=2)
# set title
plt.title('Lab results averages of patients', size=24)
lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[''] + categories, color='gray')
# set legend
plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

The radar chart shows that the only lab result with a clear difference between thrombosis groups is ANA: the higher the ANA value, the more severe the thrombosis diagnosis. Thrombosis level 1 also seems to have similar acl_iga and acl_igm values.

A possible predictor of thrombosis could be the symptoms a patient is experiencing.
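Before drawing word clouds, a plain frequency table gives a more precise (if less visual) view of the same idea. A minimal sketch; note that the 'None' placeholder we filled in earlier will dominate the counts:

```{python}
# sketch: most frequent recorded symptom strings per thrombosis level
for level in (1, 2, 3):
    top = B.loc[B['thrombosis'] == level, 'symptoms'].value_counts().head(5)
    print(f"thrombosis {level}:\n{top}\n")
```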
```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):
    thrombosis_p = B[B['thrombosis'] == thrombosis_category]
    # Concatenate all the strings in the symptoms column into a single string
    text = ' '.join(thrombosis_p['symptoms'])
    # Create a set of stopwords, excluding 'Thrombosis' (an obvious word)
    # plus some mis-encoded tokens that don't make sense
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ', 'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari'])
    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)
    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)
    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```

The main symptoms at each thrombosis level are the following:<br><br>
Level 1: Brain infarction<br><br>
Level 2: CNS lupus<br><br>
Level 3: Thrombocytopenia<br><br>
The word clouds also highlight a major issue mentioned previously, the lack of thrombosis patients: there are fewer words as the level of thrombosis becomes more severe.

# Final Plots

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# put a title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for the x and y axes
plt.margins(x=0.01, y=0.1)
plt.show()
```

<br><br>

```{python}
# drop nan values from the subset columns, then sort by examination_date
temporal_B = B.dropna(subset=['examination_date', 'thrombosis']).sort_values(by=['examination_date'])
# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])
# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year
# count of thrombosis patients per year, one column per thrombosis level
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()
# plot a stacked bar chart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set the title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define the axis labels
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

<br><br>

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')
# define figure size
plt.figure(figsize=(8, 5))
# plot a histogram with age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=age_exam_nona, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)
# define axis labels
plt.xlabel("Age (years)")
plt.ylabel("Count")
# set the title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axes
plt.margins(x=0.005, y=0.1)
plt.show()
```

<br><br>
```{python}
# Normalize columns to put them all on the same scale
from sklearn.preprocessing import StandardScaler

# use a StandardScaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put them in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]),
                                columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()
categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']
# store each group's mean vector in its own variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']
# define figure size
plt.figure(figsize=(8.5, 8.5))
# add a polar subplot
plt.subplot(polar=True)
# plot the graph for the first group mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1, color=colors[0], label='No Thrombosis', linewidth=2)
# plot the graph for the second group mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2, color=colors[1], label='Thrombosis 1', linewidth=2)
# plot the graph for the third group mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3, color=colors[2], label='Thrombosis 2', linewidth=2)
# plot the graph for the fourth group mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4, color=colors[3], label='Thrombosis 3', linewidth=2)
# set title
plt.title('Lab results averages of patients', size=24)
lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[''] + categories, color='gray')
# set legend
plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

<br><br>

```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):
    thrombosis_p = B[B['thrombosis'] == thrombosis_category]
    # Concatenate all the strings in the symptoms column into a single string
    text = ' '.join(thrombosis_p['symptoms'])
    # Create a set of stopwords, excluding 'Thrombosis' (an obvious word)
    # plus some mis-encoded tokens that don't make sense
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ', 'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari'])
    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)
    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)
    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```

# Technical summary

Unexpectedly, the quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are listed below:<br><br>
1. Missing Data: Missing values are a major problem in this set of data, particularly in TSUMOTO_C.CSV, where many columns are missing more than 80% of their values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests; because of it, some significant connections and patterns were likely missed.<br><br>
2. Lack of thrombosis cases: The major goal of this study was to gain a deeper understanding of thrombosis and identify data patterns that might act as predictors of the condition. The data provided, however, contains extremely few thrombosis instances, particularly the more severe ones (levels 2 and 3), making any analysis very sensitive to the addition of new cases.<br><br>
3. Unbalanced data: When sampling for thrombosis-positive cases exclusively, the dataset TSUMOTO_B.CSV is significantly unbalanced, with nearly all patients (about 85%) being female. This could lead to the incorrect conclusion that thrombosis occurs considerably more commonly in females, and it prevented an accurate analysis by sex.<br><br>
4. Dates: In some cases, there is a discrepancy of more than 5 years between the dates of patients' general lab tests and their thrombosis testing. Given that the results could have been drastically different had the lab tests been taken closer to the thrombosis assessment, this weakens any analysis of the lab data (TSUMOTO_C.CSV) considerably.<br><br>
It is also worth noting that all of the data was gathered between 1989 and 1996, a period in which industry-wide data discrepancies were frequent and data collection methods weren't the most sophisticated. An examination of thrombosis using more recent data, if available, would be a better idea.<br><br>
Below are some of the main key takeaways (elaborated further in results.html):<br><br>
1. Level 3 thrombosis appears to occur in people on the higher end of the age distribution.<br><br>
2. ANA seems to be the best predictor amongst all the lab measurements.<br><br>
3. There are various levels of thrombosis, and each level has its own set of symptoms.