Thrombosis Analysis

On this page

  • 1 Introduction
    • 1.1 Objective
    • 1.2 Data
    • 1.3 Goals of the project
  • 2 Initial Questions
  • 3 Data Ingestion, Cleaning and Munging
    • 3.1 Ingestion
    • 3.2 Cleaning and Munging
  • 4 Exploratory analysis
  • 5 Final Plots
  • 6 Technical summary

EDA on Medical Data of Thrombosis Diagnosis

Author: Raghav Sharma
Affiliation: Georgetown University

1 Introduction

1.1 Objective

Imagine yourself as a data scientist working at a consulting firm. An external entity (the client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm moves on to a new, unrelated project.

Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.

1.2 Data

The data were collected at Georgetown University's Hospital and contain information about patients who have been admitted to the hospital's Collagen disease clinic.

Collagen diseases are auto-immune diseases in which patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies against lung tissue will progressively lose respiratory function and eventually die. Thrombosis is the most severe form of collagen disease.

There are three CSV files that make up the medical data. These files are:

  1. TSUMOTO_A.CSV: Contains basic information about patients

  2. TSUMOTO_B.CSV: Contains information about the ‘Special Laboratory Examinations’ of patients to detect Thrombosis.

  3. TSUMOTO_C.CSV: Contains laboratory examinations stored in the hospital information system (recorded from 1980 to March 1999). This dataset has time stamps.

1.3 Goals of the project

This project aims to thoroughly understand and analyze Thrombosis, a collagen disease, by achieving the following:

  1. Search for patterns that detect and predict thrombosis.

  2. Search for temporal patterns specific/sensitive to thrombosis.

  3. Search for features that classify collagen diseases correctly.

  4. Search for temporal patterns specific/sensitive to each collagen disease.

To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.

2 Initial Questions

We would like to investigate the following questions prior to our analyses:

  1. How does thrombosis evolve over time?

  2. In which age group is thrombosis most prevalent?

  3. Does it depend on the sex of the patient?

  4. Is this dataset good enough to understand and visualize the causes and effects of thrombosis?

3 Data Ingestion, Cleaning and Munging

Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.

3.1 Ingestion

3.1.0.1 Importing the libraries

Code
# Importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import klib
import missingno as msno
from upsetplot import UpSet

from wordcloud import WordCloud, STOPWORDS
import random

import sys

import warnings
warnings.filterwarnings("ignore")

3.1.0.2 Importing the datasets

Code
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
  1. TSUMOTO_A.CSV
Code
A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding= 'unicode_escape')
A.head()
ID SEX Birthday Description First Date Admission Diagnosis
0 2110 F 1934/2/13 94.02.14 93.02.10 + RA susp.
1 11408 F 1937/5/2 96.12.01 73.01.01 + PSS
2 12052 F 1956/4/14 91.08.13 NaN + SLE
3 14872 F 1953/9/21 97.08.13 NaN + MCTD
4 27654 F 1936/3/25 NaN 92.02.03 + RA, SLE susp
Code
"Shape of A: " + str(A.shape)
'Shape of A: (1240, 7)'
  2. TSUMOTO_B.CSV
Code
B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding= 'unicode_escape')
B.head()
ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA Diagnosis KCT RVVT LAC Symptoms Thrombosis
0 14872.0 1997/5/27 1.3 1.6 256 P 0.0 MCTD, AMI NaN NaN - AMI 1
1 48473.0 1992/12/21 4.3 4.6 256 P,S 3.3 SLE - - - NaN 0
2 102490.0 1995/4/20 2.3 2.5 0 NaN 3.5 PSS NaN NaN NaN NaN 0
3 108788.0 1997/5/6 0.0 0.0 16 S 0.0 NaN NaN NaN - NaN 0
4 122405.0 1998/4/2 0.0 4.0 4 P 0.0 SLE, SjS, vertigo NaN NaN NaN NaN 0
Code
"Shape of B: " + str(B.shape)
'Shape of B: (806, 13)'
  3. TSUMOTO_C.CSV

Note: A couple of lines in this file caused parsing errors when reading the dataset. Using the "error_bad_lines=False" argument resolved that. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.
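As a side note, the "error_bad_lines" argument was deprecated in pandas 1.3 and removed in pandas 2.0; an equivalent read on newer pandas versions (an alternative sketch, not the call used in this report) would be:

Code
# equivalent read on pandas >= 1.3, where on_bad_lines="skip" replaces error_bad_lines=False
C = pd.read_csv("data/TSUMOTO_C.CSV", encoding="unicode_escape", on_bad_lines="skip")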

Code
# A couple of bad (unreadable) lines in the 3rd dataset were dropped. The dropped lines account for 30 (out of 50k+), 
# which makes this drop insignificant
C = pd.read_csv("data/TSUMOTO_C.CSV" , encoding= 'unicode_escape', error_bad_lines=False) 
C.head()
ID Date GOT GPT LDH ALP TP ALB UA UN ... C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II
0 2110 860419 24 12 152 63 7.5 4.5 3.4 16.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2110 860430 25 12 162 76 7.9 4.6 4.7 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2110 860502 22 8 144 68 7 4.2 5 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2110 860506 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2110 860507 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

Code
"Shape of C: " + str(C.shape)
'Shape of C: (57510, 44)'

C is by far the largest dataframe that has information about general laboratory examinations.

3.2 Cleaning and Munging

3.2.0.1 Analyzing Missing values

Searching for missing data is the first and most important stage of data cleaning. Since this project entails combining the datasets, checking the missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted.

It is important to keep in mind that this is medical data, so a missing value simply means that the measurement was not recorded. We need to analyze the missingness carefully before taking any steps to handle it.

We will use ‘klib’, ‘missingno’, and ‘UpSet’ libraries to visualize the missing data for all the datasets.

  1. The klib library helps us visualize missing-data trends in a dataset. Using its 'missingval_plot', we can extract the necessary information about the missing data in every column.

  2. The missingno library is helpful for plotting the correlation among columns with missing values. This helps us decide whether a certain column needs to be dropped.

  3. The UpSet plot looks at multivariate relationships, here applied to the patterns of jointly missing values across columns.

Let's create a function that returns the missing-data statistics of a dataset.

Code
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # column
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        df_stats = pd.concat([df_stats, tmp])  # collect this column's stats (pd.concat replaces the deprecated DataFrame.append)
    
    return df_stats

Let's take a look at the missing-value statistics of A

Code
df_A_stats = missing_val_stats(A)
df_A_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1238 1238 0 0.0
0 SEX [F, M, nan] 3 2 13 1.0
0 Birthday [1934/2/13, 1937/5/2, 1956/4/14, 1953/9/21, 19... 1195 1194 1 0.1
0 Description [94.02.14, 96.12.01, 91.08.13, 97.08.13, nan, ... 100 99 217 17.5
0 First Date [93.02.10, 73.01.01, nan, 92.02.03, 94.03.08, ... 821 820 248 20.0
0 Admission [+, -, nan, üH, îýô€, +üiö‡òåë®üj, +(ë¦ÄuÆÈòaë... 9 8 31 2.5
0 Diagnosis [RA susp., PSS, SLE, MCTD, RA, SLE susp, SLE, ... 222 222 0 0.0

Visualizing missing values in A

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_A.CSV)"
a_klib = klib.missingval_plot(A, figsize=(10,15))

A few columns have missing values, with 'First Date' reaching a maximum of 20%. These can be dealt with at later stages, when/if those columns are used.

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))

Nothing too significant for A.

  3. Using UpSet
Code
"UpSet plot for TSUMOTO_A.CSV"
# check for nan values
a_upset = pd.isna(
  A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis']]
)

# using groupby for the missing values
a_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis'], as_index=True).size()

# plot the groupby object
UpSet(a_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()

As stated earlier, there isn't much correlation among the missing values in A. The UpSet plot confirms this: there are few cases where two or more values are missing together.

Let's take a look at the missing-value statistics of B

Code
df_B_stats = missing_val_stats(B)
df_B_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [14872.0, 48473.0, 102490.0, 108788.0, 122405.... 768 767 36 4.5
0 Examination Date [1997/5/27, 1992/12/21, 1995/4/20, 1997/5/6, 1... 476 475 10 1.2
0 aCL IgG [1.3, 4.3, 2.3, 0.0, 1.0, 6.1, 3.7, 1.7, 5.1, ... 115 115 0 0.0
0 aCL IgM [1.6, 4.6, 2.5, 0.0, 4.0, 2.4, 9.5, 3.0, 0.8, ... 140 140 0 0.0
0 ANA [256, 0, 16, 4, 1024, 4096, 64, nan, 4094, >4096] 10 9 38 4.7
0 ANA Pattern [P, P,S, nan, S, D,P,S, P.S, S,P, S,D, D,P, P,... 16 15 261 32.4
0 aCL IgA [0.0, 3.3, 3.5, 8.7, 12.3, 22.5, 3.1, 9.1, 5.1... 163 163 0 0.0
0 Diagnosis [MCTD, AMI, SLE, PSS, nan, SLE, SjS, vertigo, ... 182 181 331 41.1
0 KCT [nan, -, +] 3 2 660 81.9
0 RVVT [nan, -, +] 3 2 660 81.9
0 LAC [-, nan, +] 3 2 584 72.5
0 Symptoms [AMI, nan, CNS lupus, Apo, AP, thrombocytepeni... 40 39 726 90.1
0 Thrombosis [1, 0, 2, 3] 4 4 0 0.0

Visualizing missing values in B

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_B.CSV)"
b_klib = klib.missingval_plot(B, figsize=(10,15))

This dataset is more concerning because some of its columns are missing more than 70% of their data. 'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge when it comes to replacing missing values.

The 'Symptoms' column has the most missing values (726). A missing entry here may simply mean that the patient had no symptoms, so we will change these NaN values to 'None' for the analysis later.

For the remaining columns it is challenging to gauge how the missing values are distributed and how they might affect the data. Imputing them would distort and bias the data, producing unrepresentative results.

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8,5))

Looking at the plot above, the missing values in 'KCT', 'RVVT', and 'LAC' are correlated. Two of these three columns can be dropped, since they represent the same thing (a measure of the degree of coagulation).

  3. Using UpSet
Code
"UpSet plot for TSUMOTO_B.CSV"
# check for missing values
b_upset = pd.isna(
  B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']]
)

# using groupby for the missing values
b_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()

# plot the groupby object
UpSet(b_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()

We see the correlation quite clearly in the plot. The bar is tallest when 'KCT', 'RVVT', 'LAC', and 'Symptoms' are missing together. A similar pattern can be seen for other columns.

Let's take a look at the missing-value statistics of C

Code
df_C_stats = missing_val_stats(C)
df_C_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1236 1236 0 0.0
0 Date [860419, 860430, 860502, 860506, 860507, 86050... 4960 4960 0 0.0
0 GOT [24, 25, 22, nan, 28, 21, 23, 38, 27, 32, 20, ... 424 423 11341 19.7
0 GPT [12, 8, nan, 13, 9, 10, 22, 18, 17, 15, 14, 46... 770 769 11356 19.7
0 LDH [152, 162, 144, nan, 156, 167, 185, 134, 165, ... 1500 1499 11202 19.5
0 ALP [63, 76, 68, nan, 66, 64, 60, 56, 55, 75, 89, ... 1335 1334 11989 20.8
0 TP [7.5, 7.9, 7, nan, 7.6, 8, 7.2, 7.1, 8.2, 7.7,... 95 94 11874 20.6
0 ALB [4.5, 4.6, 4.2, nan, 4.8, 4.4, 4.7, 4, 4.3, 4.... 95 94 12189 21.2
0 UA [3.4, 4.7, 5, nan, 4.4, 4.5, 4, 3.7, 3.6, 3.2,... 275 274 12202 21.2
0 UN [16.0, 18.0, nan, 15.0, 14.0, 13.0, 12.0, 19.0... 346 345 11387 19.8
0 CRE [0.6, nan, 0.7, 0.8, 0.66, 0.9, 1.2, 1, 1.1, 1... 601 600 11291 19.6
0 T-BIL [0.4, nan, 0.5, 0.8, 0.6, 0.7, 0.2, 0.3, 0.9, ... 340 339 18733 32.6
0 T-CHO [192, 187, 191, nan, 214, 231, 212, 204, 208, ... 848 847 15090 26.2
0 TG [nan, 76, 65, 67, 113, 74, 63, 169, 155, 119, ... 1008 1007 28252 49.1
0 CPK [nan, 31, 23, 25, 26, 28, 16, 37, 22, 18, 21, ... 920 919 36946 64.2
0 GLU [nan, 114.0, 155.0, 88.0, 80.0, 84.0, 103.0, 2... 514 513 49078 85.3
0 WBC [5.9, 6.9, nan, 7.2, 5, 4.9, 7.4, 6.5, 6.2, 7.... 547 546 7905 13.7
0 RBC [4.69, 4.73, nan, 4.93, 4.62, 4.49, 4.71, 4.38... 856 855 7905 13.7
0 HGB [9, 9.3, nan, 9.5, 8.9, 8.5, 9.2, 9.4, 9.1, 10... 285 284 7904 13.7
0 HCT [31.6, 31.8, nan, 33, 30.9, 30, 31.7, 29.3, 31... 766 765 7904 13.7
0 PLT [380, 323, nan, 367, 372, 313, 299, 312, 346, ... 913 912 9605 16.7
0 PT [nan, 11.9, 12.4, 12.0, 11.4, 12.1, 11.5, 11.3... 199 198 53507 93.0
0 APTT [nan, 68.0, 67.0, 71.0, 65.0, 57.0, 56.0, 80.0... 129 128 56618 98.4
0 FG [nan, 27.8, 27.0, 31.8, 31.9, 43.2, 47.5, 31.0... 876 875 54127 94.1
0 PIC [nan, 290.0, 315.0, 105.0, 175.0, 285.0, 530.0... 350 349 56711 98.6
0 TAT [nan, 86, 72, 63, 142, 138, 92, 107, 111, 110,... 231 230 56147 97.6
0 TAT2 [nan, 73, 96, 92, 120, 124, 112, 119, 128, 113... 164 163 56518 98.3
0 U-PRO [0, nan, -, TR, 100, 30, 1, 2, 300, >=300, 3, ... 21 20 18402 32.0
0 IGG [nan, 1265.0, 1485.0, 1530.0, 1149.0, 1240.0, ... 4815 4814 45181 78.6
0 IGA [nan, 284, 328, 312, 327, 303, 270, 494, 390, ... 969 968 45184 78.6
0 IGM [nan, 124, 140, 132, 39, 109, 115, 190, 160, 1... 763 762 45189 78.6
0 CRP [nan, <0.002, -, <0.2, +, 2+, 3+, 6+, 4+, 0.3,... 441 440 10409 18.1
0 RA [nan, -, +, 2+, +-, --, 22, 7-] 8 7 48950 85.1
0 RF [nan, <40, 246, 286, 249, 162, 120, 104, 73.5,... 2394 2393 44943 78.1
0 C3 [nan, 30, 26, 24, 45, 49, 48, 65, 56, 58, 64, ... 373 372 30647 53.3
0 C4 [nan, 11, 8, 7, 15, 14, 13, 20, 19, 18, 16, 21... 107 106 30646 53.3
0 RNP [nan, 0, 4, 64, 1, negative, 256, 16] 8 7 56768 98.7
0 SM [nan, 0, negative, 2, 1, 8, 32] 7 6 56855 98.9
0 SC170 [nan, 16, 1, 0, negative, 4] 6 5 57372 99.8
0 SSA [nan, 4, 0, negative, 64, 16, 1, 256] 8 7 56935 99.0
0 SSB [nan, 0, negative, 1, 32, 2, 8, 128] 8 7 56966 99.1
0 CENTROMEA [nan, 0, negative, 320, 160, 0.0, 1280.0, 1280... 11 10 57443 99.9
0 DNA [nan, 63.1, 59.9, 57.1, 66.0, 41.9, 18.0, 15.8... 330 329 57078 99.2
0 DNA-II [nan] 1 0 57510 100.0

Visualizing missing values in C

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_C.CSV)"
c_klib = klib.missingval_plot(C, figsize=(10,15))

Our third dataset turns out to be the most problematic, with almost half of its columns (19 out of 44) missing more than 60% of their values, including 15 columns missing more than 80%. The fundamental problem is that most of these columns are numerical and could have provided useful context for the analysis, but in their current state they are essentially unusable.
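As a quick sanity check on these counts (a minimal sketch using the C dataframe loaded above):

Code
# fraction of missing values per column in C
missing_frac = C.isna().mean()
# number of columns above the 60% and 80% thresholds discussed above
print("Columns > 60% missing:", (missing_frac > 0.6).sum())
print("Columns > 80% missing:", (missing_frac > 0.8).sum())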

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20,15))

There is a high correlation between many columns with missing data. Since this dataset contains general laboratory examinations, a patient may simply not have taken a particular lab test; and because many of these results come from the same procedure, skipping one test leaves several related columns empty at once.

  3. Using UpSet
    Note: Since C has too many columns, the UpSet plot would be too large to be readable, so we skip generating it for C.

These hypotheses will be easier to assess once missing values are handled and new heatmaps are plotted, which is part of the next steps.

3.2.0.2 Taking care of missing data

As previously discussed, missing values are a big obstacle to the data sets, so dealing with them is a priority before carrying on any analysis.

Missing data for dataset A can be handled during cleaning and munging. As for Dataset B, we need to implement the following:

  1. Dropping correlated columns

  2. Modifying ‘Symptoms’ column

Code
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify symptoms column
B[['Symptoms']] = B[['Symptoms']].fillna("None")

Any columns with more than 70% missing values should also be dropped since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50–60% range will be preserved and individually assessed.

Code
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]

3.2.0.3 Cleaning data to make it tidier

It is a good idea to rename the columns to keep the datasets organized before continuing with the analysis. All column names should be lowercase, with underscores ('_') in place of spaces (' ').

Code
def rename_columns(df):
    for i in range(len(df.columns)):
        # check for number of words per column name
        if len(df.columns[i].split()) != 1:
            # rename column as mentioned above
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0]) + '_' + (str(df.columns[i]).lower().split()[1])}, inplace=True)
        # in case of only word, follow the same idea
        else:
            df.rename(columns={str(df.columns[i]):str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)

Some columns in dataset A store dates as plain strings (object dtype). Let's convert them to a consistent format.

Code
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # form year
        year = '19' + year[-2:]
        # form month
        if int(month) < 10:
            month = '0'+ str(int(month))
        # form day
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)

Now that we have defined the date-conversion function, we can calculate the age of each patient.

Code
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # Identify given date as date, month and year
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
Code
# Calculating age of every patient at the time when they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))

# appending the age column to the dataframe
A['age_first_date'] = age_first_date

Now that we have calculated the ages, we no longer need those columns. We should also drop negative age values, since they are clearly erroneous.

Note: We won't remove the 'birthday' column since it can be used later with the examination date. Dropping columns carries some risk, but we have already extracted what we need from these, so they are no longer relevant to our goal.

Code
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'],  axis=1)

# Filter dataframe to drop absurd values of age
A = A[A.age_first_date > 0]

If we take a look at the 'admission' column in A, we see that apart from '+', '-' and NaN values, it also contains some garbled ones. Let's clean up that column so it does not cause issues later on.

Code
# define a function for the admission column to be formatted appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' in value, then update the entire string value with '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' in value, then update the entire string value with '-'
            val = val[val.index('-')]
        else:
            # if either of the above literals are not present, then that means it's irrelevant. This value won't be dropped since this is medical data, but it will be replaced with nan since it wasn't recorded.
            val = np.nan
    return val
Code
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
id sex birthday admission diagnosis age_first_date
0 2110 F 1934/02/13 + RA susp. 58.0
1 11408 F 1937/05/02 + PSS 35.0
4 27654 F 1936/03/25 + RA, SLE susp 55.0
6 43003 M 1937/11/24 - Raynaud's phenomenon 56.0
9 57266 M 1923/07/25 + RA 69.0

Moving on to dataset B.

B has three anti-Cardiolipin antibody columns whose values are on different scales. We should standardize them, since they may be useful later for temporal analysis.

Code
# Normalizing values of the 3 different anti-Cardiolipin antibody column
B['acl_igg'] = (B['acl_igg']-B['acl_igg'].mean())/B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm']-B['acl_igm'].mean())/B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga']-B['acl_iga'].mean())/B['acl_iga'].std()

Also, notice that the 'id' column in B has missing values. Since this dataset holds our dependent variable and records the patients who were tested for it, rows without an id cannot be linked back to a patient and could produce misleading results.

These are only 36 rows out of 806, so dropping them won't create any significant issue.

Code
# Drop NAs in the ID column of B since they are the main unit of analysis (total 36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')

Also, the 'ana' column contains the non-numeric value '>4096' for the anti-nucleus antibody measure. Let's fix that.

Code
# filter dataset B for correct ana values
B = B[B.ana != '>4096']

That concludes the cleaning for B. Let's move on to our third dataset.

In C, we again see a date column, but stored as an integer. Let's fix that.

Code
# Converting date column from int to date
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # add a 0 to month to denote months before October
        if int(month) < 10:
            month = '0'+ str(int(month))
        # add a 0 to day to denote days before the 10th of every month
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert dtype of date column to date
C['lab_date'] = C['date'].apply(int_to_date)

# Drop 'date' column since now it has been updated 
C = C.drop(columns=['date'], axis=1)
C.head()
id got gpt ldh alp tp alb ua un cre ... wbc rbc hgb hct plt u-pro crp c3 c4 lab_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 5.9 4.69 9 31.6 380 0 NaN NaN NaN 1986/04/19
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... 6.9 4.73 9.3 31.8 323 NaN <0.002 NaN NaN 1986/04/30
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN NaN NaN NaN NaN NaN - NaN NaN 1986/05/02
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/06
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/07

5 rows × 24 columns

3.2.0.4 Check individual datasets before merging

Let's take one final look at each dataset before merging.

  1. TSUMOTO_A.CSV
Code
A.head()
id sex birthday admission diagnosis age_first_date
0 2110 F 1934/02/13 + RA susp. 58.0
1 11408 F 1937/05/02 + PSS 35.0
4 27654 F 1936/03/25 + RA, SLE susp 55.0
6 43003 M 1937/11/24 - Raynaud's phenomenon 56.0
9 57266 M 1923/07/25 + RA 69.0

Check for missing data statistics

Code
df_A_stats_new = missing_val_stats(A)
df_A_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [2110, 11408, 27654, 43003, 57266, 71417, 1152... 980 980 0 0.0
0 sex [F, M, nan] 3 2 7 0.7
0 birthday [1934/02/13, 1937/05/02, 1936/03/25, 1937/11/2... 950 950 0 0.0
0 admission [+, -, nan] 3 2 23 2.3
0 diagnosis [RA susp., PSS, RA, SLE susp, Raynaud's phenom... 189 189 0 0.0
0 age_first_date [58.0, 35.0, 55.0, 56.0, 69.0, 65.0, 20.0, 26.... 75 75 0 0.0

We see that there are very few missing values in total (30, across 'sex' and 'admission'). Given such a small number, we can drop these rows.

Code
# drop nan values from the subset columns
A = A.dropna(subset=['sex', 'admission'])

After removing NAs from the above two columns, we have no missing values for our first dataset.

  2. TSUMOTO_B.CSV
Code
B.head()
id examination_date acl_igg acl_igm ana ana_pattern acl_iga diagnosis symptoms thrombosis
0 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1
1 48473 1992/12/21 -0.088135 -0.035447 256 P,S -0.036709 SLE None 0
2 102490 1995/4/20 -0.105464 -0.035765 0 NaN -0.036592 PSS None 0
3 108788 1997/5/6 -0.125392 -0.036144 16 S -0.038639 NaN None 0
4 122405 1998/4/2 -0.125392 -0.035538 4 P -0.038639 SLE, SjS, vertigo None 0

Check for missing data statistics

Code
df_B_stats_new = missing_val_stats(B)
df_B_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [14872, 48473, 102490, 108788, 122405, 133382,... 766 766 0 0.0
0 examination_date [1997/5/27, 1992/12/21, 1995/4/20, 1997/5/6, 1... 453 452 7 0.9
0 acl_igg [-0.11412841735362349, -0.08813455752655686, -... 111 111 0 0.0
0 acl_igm [-0.03590169890063771, -0.03544652764211938, -... 137 137 0 0.0
0 ana [256, 0, 16, 4, 1024, 4096, 64, nan, 4094] 9 8 20 2.6
0 ana_pattern [P, P,S, nan, S, D,P,S, P.S, S,P, S,D, D,P, P,... 16 15 236 30.7
0 acl_iga [-0.038639334734420175, -0.036709439525941974,... 155 155 0 0.0
0 diagnosis [MCTD, AMI, SLE, PSS, nan, SLE, SjS, vertigo, ... 177 176 312 40.6
0 symptoms [AMI, None, CNS lupus, Apo, AP, thrombocytepen... 40 40 0 0.0
0 thrombosis [1, 0, 2, 3] 4 4 0 0.0

We see that there is still some missing data that can’t be just dropped. This can be handled after merging.

  3. TSUMOTO_C.CSV
Code
C.head()
id got gpt ldh alp tp alb ua un cre ... wbc rbc hgb hct plt u-pro crp c3 c4 lab_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 5.9 4.69 9 31.6 380 0 NaN NaN NaN 1986/04/19
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... 6.9 4.73 9.3 31.8 323 NaN <0.002 NaN NaN 1986/04/30
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN NaN NaN NaN NaN NaN - NaN NaN 1986/05/02
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/06
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/07

5 rows × 24 columns

Check for missing data statistics

Code
df_C_stats_new = missing_val_stats(C)
df_C_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1236 1236 0 0.0
0 got [24, 25, 22, nan, 28, 21, 23, 38, 27, 32, 20, ... 424 423 11341 19.7
0 gpt [12, 8, nan, 13, 9, 10, 22, 18, 17, 15, 14, 46... 770 769 11356 19.7
0 ldh [152, 162, 144, nan, 156, 167, 185, 134, 165, ... 1500 1499 11202 19.5
0 alp [63, 76, 68, nan, 66, 64, 60, 56, 55, 75, 89, ... 1335 1334 11989 20.8
0 tp [7.5, 7.9, 7, nan, 7.6, 8, 7.2, 7.1, 8.2, 7.7,... 95 94 11874 20.6
0 alb [4.5, 4.6, 4.2, nan, 4.8, 4.4, 4.7, 4, 4.3, 4.... 95 94 12189 21.2
0 ua [3.4, 4.7, 5, nan, 4.4, 4.5, 4, 3.7, 3.6, 3.2,... 275 274 12202 21.2
0 un [16.0, 18.0, nan, 15.0, 14.0, 13.0, 12.0, 19.0... 346 345 11387 19.8
0 cre [0.6, nan, 0.7, 0.8, 0.66, 0.9, 1.2, 1, 1.1, 1... 601 600 11291 19.6
0 t-bil [0.4, nan, 0.5, 0.8, 0.6, 0.7, 0.2, 0.3, 0.9, ... 340 339 18733 32.6
0 t-cho [192, 187, 191, nan, 214, 231, 212, 204, 208, ... 848 847 15090 26.2
0 tg [nan, 76, 65, 67, 113, 74, 63, 169, 155, 119, ... 1008 1007 28252 49.1
0 cpk [nan, 31, 23, 25, 26, 28, 16, 37, 22, 18, 21, ... 920 919 36946 64.2
0 wbc [5.9, 6.9, nan, 7.2, 5, 4.9, 7.4, 6.5, 6.2, 7.... 547 546 7905 13.7
0 rbc [4.69, 4.73, nan, 4.93, 4.62, 4.49, 4.71, 4.38... 856 855 7905 13.7
0 hgb [9, 9.3, nan, 9.5, 8.9, 8.5, 9.2, 9.4, 9.1, 10... 285 284 7904 13.7
0 hct [31.6, 31.8, nan, 33, 30.9, 30, 31.7, 29.3, 31... 766 765 7904 13.7
0 plt [380, 323, nan, 367, 372, 313, 299, 312, 346, ... 913 912 9605 16.7
0 u-pro [0, nan, -, TR, 100, 30, 1, 2, 300, >=300, 3, ... 21 20 18402 32.0
0 crp [nan, <0.002, -, <0.2, +, 2+, 3+, 6+, 4+, 0.3,... 441 440 10409 18.1
0 c3 [nan, 30, 26, 24, 45, 49, 48, 65, 56, 58, 64, ... 373 372 30647 53.3
0 c4 [nan, 11, 8, 7, 15, 14, 13, 20, 19, 18, 16, 21... 107 106 30646 53.3
0 lab_date [1986/04/19, 1986/04/30, 1986/05/02, 1986/05/0... 4960 4960 0 0.0

The large number of missing values still leaves considerable uncertainty in this dataset.

3.2.0.5 Merging the datasets

In terms of joining data sets, there are multiple ways of doing so depending on the question being answered.

Now that we have an idea of the data, we will first merge A with C: A contains basic information about the patients, and C holds the general laboratory examinations stored in the hospital information systems. Merging these filters out the ids of patients who never took these examinations.

Then the last step would be to merge the combined one with B.

Since B contains our target variable (thrombosis), it holds the information on patients who were tested via the 'Special Laboratory Examination'. Merging B at the end gives us the final dataset that we can use for EDA and to strengthen our final conclusions.

Since ‘ID’ is the only common column amongst the 3 datasets, it will be used in the ‘left’ join for merging.

3.2.0.5.1 Step 1: Merge A with C
Code
# Merging A with C using left join on id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
id got gpt ldh alp tp alb ua un cre ... u-pro crp c3 c4 lab_date sex birthday admission diagnosis age_first_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 0 NaN NaN NaN 1986/04/19 F 1934/02/13 + RA susp. 58.0
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... NaN <0.002 NaN NaN 1986/04/30 F 1934/02/13 + RA susp. 58.0
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN - NaN NaN 1986/05/02 F 1934/02/13 + RA susp. 58.0
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 NaN NaN NaN 1986/05/06 F 1934/02/13 + RA susp. 58.0
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 NaN NaN NaN 1986/05/07 F 1934/02/13 + RA susp. 58.0

5 rows × 29 columns

Check the shape.

Code
C_A.shape
(57512, 29)
3.2.0.5.2 Step 2: Merge combined (C_A) with B
Code
# Merging C_A with B using left join on id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
id examination_date acl_igg acl_igm ana ana_pattern acl_iga diagnosis_x symptoms thrombosis ... u-pro crp c3 c4 lab_date sex birthday admission diagnosis_y age_first_date
0 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... 0 - NaN NaN 1981/02/17 NaN NaN NaN NaN NaN
1 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/03/16 NaN NaN NaN NaN NaN
2 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/04/06 NaN NaN NaN NaN NaN
3 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/04/27 NaN NaN NaN NaN NaN
4 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/05/01 NaN NaN NaN NaN NaN

5 rows × 38 columns

Check the shape.

Code
df_final.shape
(20279, 38)

Since 'diagnosis' was present in both B and C_A, the final dataframe has two diagnosis columns. Looking carefully, their values are largely the same, so let's combine them into a single updated 'diagnosis' column in df_final.

Code
for i in range(len(df_final)):
    # check for nan in diagnosis_x, if false then split the values
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # check for nan in diagnosis_y, if false then split the values
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")
        
# defining an empty list to store diagnosis as column in the dataframe later on
df_final["diagnosis"] = ""

for i in range(len(df_final)):
    # fill the value with diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # fill the value with diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # fill the value with the list of the set of values in diagnosis_x and diagnosis_y if both are not nan
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))

Recall that we did not drop the 'birthday' column earlier; the main reason was to calculate each patient's age at the time of their thrombosis lab test.

Now that the datasets have been merged, we can use the 'birthday' and 'examination_date' columns to do exactly that.

Code
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)

# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))

# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination

Now that we have used these columns, we can drop them.

Code
# drop irrelevant columns
df_final = df_final.drop(columns=['examination_date', 'diagnosis_x', 'lab_date', 'birthday', 'diagnosis_y'], axis=1)

A tidy convention is to organize the dataframe's columns with the dependent variable at the end. Let's apply the same to our merged dataset.

Code
# rearrange the columns
df_final = df_final[['id', 'sex', 'admission', 'age_first_date', 'age_at_examination',
                    'diagnosis', 'acl_igg', 'acl_igm', 'acl_iga','ana', 'ana_pattern', 
                    'got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre', 't-bil', 
                    't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb', 'hct', 'plt', 'u-pro', 'crp', 
                    'c3', 'c4', 'symptoms','thrombosis']]
df_final.head()
id sex admission age_first_date age_at_examination diagnosis acl_igg acl_igm acl_iga ana ... rbc hgb hct plt u-pro crp c3 c4 symptoms thrombosis
0 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.44 12.5 38.2 256 0 - NaN NaN AMI 1
1 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.5 11.8 37.7 199 NaN - NaN NaN AMI 1
2 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.76 12.8 39.4 199 NaN - NaN NaN AMI 1
3 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.76 12.3 39.4 183 NaN - NaN NaN AMI 1
4 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.43 11.7 36 215 NaN - NaN NaN AMI 1

5 rows × 35 columns

3.2.0.6 Missing data in the merged dataset

Let's visualize the missing data in the merged dataset.

  1. Using klib
Code
print('Missing Value Plot of df_final')
df_final_klib = klib.missingval_plot(df_final, figsize=(10,15))

In the plot above, a few columns show a similar trend in their missing values. Let's verify this using the 'missingno' library.

  2. Using missingno

Code
print('Correlation amongst missing values (df_final))')
df_final_msno = msno.heatmap(df_final, figsize=(20,15))

Some columns are still highly correlated with one another in terms of missing data.

For example, 'GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', and 'CRE' have missing-value correlations of approximately 1.

The column groups ('WBC', 'RBC', 'HGB', 'HCT', 'PLT') and ('C3', 'C4') are also highly correlated in terms of their missing values.

We need to analyze the missing data before handling it. Since this is a medical dataset, we cannot simply drop or modify such columns; we need to take precautions and then follow appropriate steps.
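One supplementary view (a sketch on top of the libraries already imported, not part of the original pipeline) is the missingno dendrogram, which clusters the columns of df_final by how similarly their values go missing and can help decide which blocks of columns to treat together:

Code
# cluster df_final's columns by the similarity of their missing-value patterns
df_final_dendro = msno.dendrogram(df_final, figsize=(20, 8))
plt.show()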

  3. Using UpSet
    Note: Since df_final has too many columns, the UpSet plot would be too large to be readable, so we skip it.

4 Exploratory analysis

Let's plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot displays the distribution of 'age_first_date', i.e. the age of patients when they first visited the hospital.

Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()

From the above plot, we infer two things:

1. There are more female patients than male patients.

2. The trend for both categories is broadly similar, with the highest count of patients in the 20-25 age range.

We also see that counts rise at the start of the age range regardless of sex, possibly due to vaccinations and other medical checks for infants and children. People older than 30 tend to visit the hospital less often for routine check-ups.

Based on this data alone, thrombosis appears to be a sex-predominant condition simply because the dataset contains far more female patients. Since we cannot determine from this data whether sex genuinely matters for thrombosis, this exposes a serious weakness and imbalance in the data.
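To quantify that imbalance, here is a quick tabulation of dataset A (a minimal check using the columns already cleaned above):

Code
# absolute and relative counts of patients by sex in A
print(A['sex'].value_counts())
print((A['sex'].value_counts(normalize=True) * 100).round(1))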

Let's do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis diagnoses have evolved over time.

Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts()
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')

The bar chart above highlights that the data may be biased. Most examinations are reported as negative for thrombosis, and among the positive cases, levels 2 and 3 are very rare. For instance, if we were to study the average IgA of patients with thrombosis level 3, the result might differ substantially from the true population average simply because so few such patients are present in the data.

It is crucial to emphasize this limitation before continuing with any analysis, as this is only a brief illustration of the potential problems that could arise.
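To put a concrete number on this imbalance, a minimal sketch counting examinations per thrombosis level in B:

Code
# number of examinations recorded for each thrombosis level (0 = negative)
print(B['thrombosis'].value_counts().sort_index())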

Let's look at the age distribution of patients when they were tested for thrombosis. This plot also serves as a temporal analysis.

Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()

The plot above again shows the uncertainty in the data. The kernel densities for the different thrombosis levels follow a similar pattern, except that far more patients tested negative than positive. Notably, the most severe thrombosis is concentrated in the 30-40 age group, while for the 15-30 age group the likelihood of a positive thrombosis result is low.

Now we will conduct a comprehensive analysis of thrombosis since it is important to gain a deeper understanding of its associated side effects and prevalence among different age groups and genders within the population.

Below is a density plot that shows the age distribution of people that have thrombosis vs people who do not.

Code
# filter df with patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df with patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]

# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']

# Filter outliers
age_t = age_t[age_t >2]
age_nt = age_nt[age_nt >2]

# Age distribution plot
sns.kdeplot(age_t, color="#f44336",fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797",fill=True, label="Patients without Thrombosis")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age(years)")

# Display plot
plt.show()

The density plot above shows that thrombosis does not seem to occur in people over 70. In fact, most thrombosis patients fall in the 20-60 age range, which is roughly the same as for patients without thrombosis. The only slight difference is that patients without thrombosis skew slightly younger within that range, but this does not appear to hold much relevance.
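A quick numeric summary of the two groups, reusing the age_t and age_nt series from the cell above, supports this reading:

Code
# summary statistics of age for patients with and without thrombosis
print(age_t.describe())
print(age_nt.describe())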

The next step is to see whether there are any differences within the thrombosis group, i.e. between the different severity levels of the disease.

Code
# filter the level of thrombosis disease into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]

# store the respective ages of patients as per the variables chosen above
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']

# only check for patients having age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]

# plot the kernel density plot using the age calculated above
sns.kdeplot(age_t1, color="#979797",fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black",fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336",fill=True, label="Level 3")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of differnet levels of Thrombosis")
# set x-label
plt.xlabel("Age(Years)")

# Display plot
plt.show()

Different levels of thrombosis turn out to have different age distributions: the age density of level 3 patients is most concentrated in the 30-40 range, while level 1 and level 2 patients are concentrated in the 15-35 and 0-30 ranges, respectively.

Let's look at box plots to check for outliers.

Code
# Subset columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]

# drop nan values from subset columns
lab_results = lab_results.dropna(subset=['ana']) # Only 4 missing values 
# change ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')

# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()

The data has outlier issues.

Setting the threshold at 100 for IgA, IgG, and IgM seems reasonable, and we use a threshold of 50 for ANA.

Code
lab_results['ana'] = lab_results['ana'].astype('int')
# keep anti-Cardiolipin antibody values below 100 and ana values below 50
lab_results = lab_results[(lab_results['acl_iga'] < 100) &  (lab_results['acl_igg'] < 100) &  
(lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]


# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()

The data is now in a better state to analyze (a handful of outliers remain).

Below is a radar chart comparing patients without thrombosis to those who have it (at its different levels) in terms of their averaged lab results.

Code
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()

The radar chart shows that the only lab results that show significant differences between different Thrombosis groups are:

ANA: the higher the ANA value, the more severe the thrombosis diagnosis. Thrombosis level 1 seems to have similar acl_iga and acl_igm values.
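A small check on this reading, reusing the outlier-filtered lab_results dataframe from above:

Code
# mean ANA value per thrombosis level on the filtered lab results
print(lab_results.groupby('thrombosis')['ana'].mean().round(2))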

A possible predictor of Thrombosis could be the symptoms a patient is experiencing.

Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombsosis (obvious word) + some words that dont make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()

The main symptoms of each thrombosis level are the following:

Level 1: Brain infarction

Level 2: CNS lupus

Level 3: Thrombocytopenia

The word clouds also highlight a major issue in the data mentioned previously, the lack of thrombosis patients, which is why there are fewer words as the thrombosis level becomes more severe.
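A minimal count of how many positive examinations actually carry a recorded symptom makes this shrinkage explicit:

Code
# examinations and recorded (non-'None') symptom entries per positive thrombosis level
for level in [1, 2, 3]:
    subset = B[B['thrombosis'] == level]
    n_symptoms = (subset['symptoms'] != 'None').sum()
    print(f"Thrombosis {level}: {len(subset)} examinations, {n_symptoms} with recorded symptoms")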

5 Final Plots

Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()



Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts()
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')



Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()



Code
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()



Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()

6 Technical summary

The quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are provided below:

  1. Missing Data: Missing values are a major problem in this set of data, particularly in TSUMOTO_C.CSV, where more than 80% of the columns have missing values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests, and some significant connections and patterns were likely missed because of it.

  2. Lack of Thrombosis cases: The major goals of this study were to gain a deeper understanding of thrombosis and to identify patterns in the data that might act as predictors of the condition. The data provided, however, contains extremely few thrombosis cases, particularly of the more severe levels (2 and 3), making any conclusions highly sensitive to the addition of even a few more cases.

  3. Data not balanced: When restricted to thrombosis-positive cases, the dataset TSUMOTO_B.CSV is heavily imbalanced, with about 85% of patients being female. This could lead to the misleading conclusion that thrombosis occurs considerably more often in females, and it prevented an accurate sex-based analysis.

  4. Dates: In some cases, there is a gap of more than 5 years between the dates of a patient's routine lab tests and their thrombosis testing. Because the lab results could have been drastically different had the tests been run closer to the thrombosis assessment, this weakens any analysis based on the lab data (TSUMOTO_C.CSV).

It is worth noting that all of the data was gathered between 1989 and 1996, a period in which data discrepancies were common across the industry and data-collection methods were less sophisticated. If more recent data is available, repeating this examination of thrombosis on it would be worthwhile.

Below are some of the main takeaways (elaborated further in results.html):

  1. Level 3 Thrombosis appears to occur in people on the higher end of the age distribution.

  2. ANA seems to be the best predictor among the lab measurements examined.

  3. There are various stages of thrombosis, and each level has its own set of symptoms.
Source Code
---
title: "EDA on Medical Data of Thrombosis Diagnosis"
author:
  - name: Raghav Sharma
    affiliations:
      - name: Georgetown University      
    url: https://github.com/anly503/hw2-spring-2023-raghavSharmaCode
    email: rs2190@georgetown.edu  
---

# Introduction

## Objective

Imagine yourself as a data scientist working at a consulting firm. An external entity (client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm will move onto a new unrelated project. <br><br>


Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.

## Data

Data has been collected from Georgetown University's Hospital and contain information about patients who have been admitted to the hospital's <b>Collagen</b> disease clinic. <br><br>

Collagen diseases are auto-immune diseases whose patients generate antibodies attacking to their bodies. For example, if a patient generates antibodies to lung, he/she will lose the respiratory function in a chronic course and finally lose their lives. <b>Thrombosis</b> is the most severe form of Collagen diseases. <br><br> 


There are 3 CSV files that make up for the medical data. These files are:<br><br>

1. TSUMOTO_A.CSV: Contains basic information about patients<br><br>

2. TSUMOTO_B.CSV: Contains information about the 'Special Laboratory Examinations' of patients to detect Thrombosis.<br><br>

3. TSUMOTO_C.CSV: Contains Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3) This dataset has temporal stamps.<br><br>


## Goals of the project

This project aims to thoroughly understand and analyze Thrombosis which is a Collagen disease by achieving the following:<br><br>

1. Search for good patterns which detect and predict thrombosis.<br><br>
2. Search for temporal patterns specific/sensitive to thrombosis.<br><br>
3. Search for good features which classifies collagen diseases correctly.<br><br>
4. Search for temporal patterns specific/sensitive to each collagen diseases.<br><br>

To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.

# Initial Questions

We would like to investigate the following questions prior to our analyses:<br><br>

1. How does thrombosis evolve over time?<br><br>
2. What age group is Thrombosis, the most prevalent in?<br><br>
3. Does it depends upon the sex of the patient?<br><br>
4. Is this dataset good enough to understand and visualize the factors causing and effects of Thrombosis?<br><br>


# Data Ingestion, Cleaning and Munging

Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc. <br><br>

## Ingestion

#### Importing the libraries
```{python}
# Importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import klib
import missingno as msno
from upsetplot import UpSet

from wordcloud import WordCloud, STOPWORDS
import random

import sys

import warnings
warnings.filterwarnings("ignore")
```


#### Importing the datasets

```{python}
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
```

1. <b>TSUMOTO_A.CSV</b>

```{python}
A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding= 'unicode_escape')
A.head()
```

```{python}
"Shape of A: " + str(A.shape)
```

2. <b>TSUMOTO_B.CSV</b>

```{python}
B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding= 'unicode_escape')
B.head()
```

```{python}
"Shape of B: " + str(B.shape)
```

3. <b>TSUMOTO_C.CSV</b> <br><br>

Note: A couple of lines in this file were causing issues when reading the dataset; using the "error_bad_lines=False" argument resolved that. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.

```{python}
# A couple of bad (unreadable) lines in the 3rd dataset were dropped. The dropped lines account for 30 (out of 50k+), 
# which makes this drop insignificant
C = pd.read_csv("data/TSUMOTO_C.CSV" , encoding= 'unicode_escape', error_bad_lines=False) 
C.head()
```

```{python}
"Shape of C: " + str(C.shape)
```


C is by far the largest dataframe that has information about general laboratory examinations.

## Cleaning and Munging

#### Analyzing Missing values

Searching for missing data is the first and most important stage in data cleaning. Checking for missing values for each column (per data set) would give a solid idea of which columns are necessary and which need to be adjusted or omitted as this project entails combining the datasets.<br><br>

It is very important to note that this is medical data. So if any of the dataset columns has a missing value, then it means that it wasn't recorded. We need to carefully analyze and then take appropriate steps to handle missing data.

We will use 'klib', 'missingno', and 'UpSet' libraries to visualize the missing data for all the datasets.<br><br>

1. The klib library helps us visualize missing-data trends in a dataset. Using its 'missingval_plot', we can extract the necessary information about the missing data in every column. <br><br>

2. missingno library will be helpful in plotting the correlation amongst the columns having missing values. This way we will be able to make a decision if a certain column needs to be dropped or not.<br><br>

3. UpSet library looks at the multivariate relationships among categorical variables and is applied to missing data.<br><br>


Let's create a function to compute missing-data statistics for a dataset.

```{python}
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # column
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        df_stats = df_stats.append(tmp)
    
    return df_stats
```

Let's take a look at the missing-value statistics of A.


```{python}
df_A_stats = missing_val_stats(A)
df_A_stats
```

Visualizing missing values in A <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_A.CSV)"
a_klib = klib.missingval_plot(A, figsize=(10,15))
```

A few columns have missing values, with 'First Date' reaching a maximum of about 20%. Missing values in these columns can be dealt with at later stages, when/if those columns are used.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))
```

Nothing too significant for A.

3. Using UpSet
```{python}
"UpSet plot for TSUMOTO_A.CSV"
# check for nan values
a_upset = pd.isna(
  A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis']]
)

# using groupby for the missing values
a_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis'], as_index=True).size()

# plot the groupby object
UpSet(a_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()
```

As stated earlier, there isn't much correlation amongst the missing values in A. The UpSet plot supports this: the figure shows few cases of two or more values missing together.<br><br>

Let's take a look at the missing-value statistics of B.

```{python}
df_B_stats = missing_val_stats(B)
df_B_stats
```

Visualizing missing values in B <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_B.CSV)"
b_klib = klib.missingval_plot(B, figsize=(10,15))
```

This dataset is a bit concerning because some of its columns are missing more than 70% of their data.
'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge in terms of replacing missing values.<br><br>

The 'Symptoms' column has the most missing values (726). In many cases a missing value may simply mean that the patient had no symptoms, so we will change these to 'None' for the later analysis. <br><br>

For the remaining columns it is challenging to gauge how the missing values are distributed and how they might affect the data. Replacing them would distort and bias the data, producing unrepresentative results.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8,5))
```

Looking at the plot above, the missing values in 'KCT', 'RVVT', and 'LAC' are clearly correlated. Two of these three columns can be dropped, since they measure the same thing (degree of coagulation).
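To put a number on this, a quick hedged check of the missingness correlation for just these three columns could look as follows (computed directly rather than read off the heatmap):

```{python}
# Quick numeric check of the claim above: correlation between the
# missing-value indicators for KCT, RVVT and LAC.
B[['KCT', 'RVVT', 'LAC']].isna().astype(int).corr().round(2)
```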

3. Using UpSet
```{python}
"UpSet plot for TSUMOTO_B.CSV"
# check for missing values
b_upset = pd.isna(
  B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']]
)

# using groupby for the missing values
b_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()

# plot the groupby object
UpSet(b_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()
```

We see the correlation quite clearly in the plot: the bar is highest when 'KCT', 'RVVT', 'LAC', and 'Symptoms' are missing together. A similar pattern can be seen for other columns. <br><br>

Let's take a look at the missing-value statistics of C.

```{python}
df_C_stats = missing_val_stats(C)
df_C_stats
```

Visualizing missing values in C <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_C.CSV)"
c_klib = klib.missingval_plot(C, figsize=(10,15))
```

Our third dataset turns out to be the most problematic, with over half (19 out of 44) of the columns having more than 60% of their values missing, including 15 columns with more than 80%. The fundamental problem is that most of these columns are numerical and could have provided context for the analysis, but in their current state they are of little use.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20,15))
```

There is a high correlation between many of the columns with missing data. Since this dataset contains general laboratory examinations, a likely explanation is that a patient simply did not take a particular lab panel; because many of these results come from the same procedure, skipping one affects numerous measurements at once.<br><br>

3. Using UpSet <br>
Note: Since C has too many columns, a full UpSet plot would be too large to be useful, so we skip it here.<br><br>
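If an UpSet view of C is still wanted, one hedged workaround is to restrict the plot to a small subset of columns, for example the eight with the largest share of missing values. This is only a sketch and is not part of the analysis above; the choice of eight columns is arbitrary.

```{python}
# Hedged workaround (illustrative only, not used in the analysis): restrict the
# UpSet plot for C to the 8 columns with the largest share of missing values.
top_missing_cols = C.isna().mean().sort_values(ascending=False).head(8).index.tolist()

# same pattern as the UpSet plots above, but on the selected columns only
c_upset = pd.isna(C[top_missing_cols])
c_upset = c_upset.groupby(top_missing_cols, as_index=True).size()

UpSet(c_upset).plot(fig=plt.figure(figsize=(8, 5)));
plt.show()
```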

These hypotheses will become clearer once missing values are handled and new heatmaps are plotted, which is part of the next steps.<br><br>


#### Taking care of missing data 

As previously discussed, missing values are a big obstacle in these datasets, so dealing with them is a priority before carrying out any analysis. <br><br>


Missing data for dataset A can be handled during cleaning and munging. As for Dataset B, we need to implement the following:<br><br>

1. Dropping correlated columns <br><br>

2. Modifying 'Symptoms' column<br><br>

```{python}
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify symptoms column
B[['Symptoms']] = B[['Symptoms']].fillna("None")
```

Any columns with more than 70% missing values should also be dropped since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50–60% range will be preserved and individually assessed.

```{python}
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]
```

#### Cleaning data to make it tidier

It is a good idea to rename the columns to keep the datasets organized before continuing with the analysis. All column names should be lowercase, with underscores ('_') in place of spaces (' ').

```{python}
def rename_columns(df):
    for i in range(len(df.columns)):
        # check for number of words per column name
        if len(df.columns[i].split()) != 1:
            # rename column as mentioned above
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0]) + '_' + (str(df.columns[i]).lower().split()[1])}, inplace=True)
        # in case of only word, follow the same idea
        else:
            df.rename(columns={str(df.columns[i]):str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)
```
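As a side note, for these particular files the same renaming can be written more compactly. This is just an equivalent hedged sketch, under the assumption (based on the tables above) that column names contain at most a single space; it is not a replacement for the function above.

```{python}
# Hedged, more compact equivalent of rename_columns() for these files.
# The rename has already been applied above, so re-running this is a no-op;
# shown purely for illustration.
A.columns = [c.lower().replace(' ', '_') for c in A.columns]
A.columns.tolist()
```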


Some columns in dataset A store dates as plain strings (object dtype). Let's convert them to a consistent format.

```{python}
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # form year
        year = '19' + year[-2:]
        # form month
        if int(month) < 10:
            month = '0'+ str(int(month))
        # form day
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)
```

Now that the date-conversion function is defined, we can calculate each patient's age.

```{python}
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # Identify given date as date, month and year
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
```

```{python}
# Calculating age of every patient at the time when they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))

# appending the age column to the dataframe
A['age_first_date'] = age_first_date
```
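As a hedged cross-check (purely illustrative; the loop above is what the analysis uses), the same ages can also be computed in a vectorized way:

```{python}
# Hedged vectorized cross-check of the loop above (illustrative only, not used
# later): parse the formatted dates with pandas and take the year difference,
# subtracting 1 where the birthday has not yet occurred in that year.
born = pd.to_datetime(A['birthday'], format='%Y/%m/%d', errors='coerce')
first = pd.to_datetime(A['first_date'], format='%Y/%m/%d', errors='coerce')
not_yet_birthday = (first.dt.month < born.dt.month) | (
    (first.dt.month == born.dt.month) & (first.dt.day < born.dt.day)
)
age_check = first.dt.year - born.dt.year - not_yet_birthday.astype(int)
age_check.head()
```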

Now that we have calculated the ages, we no longer need the 'description' and 'first_date' columns. We should also drop negative age values, since they are clearly erroneous.

Note: We keep the 'birthday' column, since it will be needed later together with the examination date. 
Dropping columns always carries some risk, but these two have served their purpose and are no longer relevant to our goal.

```{python}
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'],  axis=1)

# Filter dataframe to drop absurd values of age
A = A[A.age_first_date > 0]
```

If we take a look at the 'admission' column in A, we see that apart from '+', '-' and NaN values, it also contains some nonsensical entries. Let's clean that column so it does not cause issues later on.

```{python}
# define a function for the admission column to be formatted appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' in value, then update the entire string value with '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' in value, then update the entire string value with '-'
            val = val[val.index('-')]
        else:
            # if either of the above literals are not present, then that means it's irrelevant. This value won't be dropped since this is medical data, but it will be replaced with nan since it wasn't recorded.
            val = np.nan
    return val
```

```{python}
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
```

Moving on to dataset B. <br><br>
B has 3 anti-Cardiolipin antibody columns whose values are on different scales. We should normalize (z-score) them, since they may be useful later for temporal analysis.

```{python}
# Normalizing values of the 3 different anti-Cardiolipin antibody column
B['acl_igg'] = (B['acl_igg']-B['acl_igg'].mean())/B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm']-B['acl_igm'].mean())/B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga']-B['acl_iga'].mean())/B['acl_iga'].std()
```

Also notice that the 'id' column in B has missing values. Since this dataset contains our dependent variable and covers the patients tested for it, rows without an id cannot be linked to the other tables and could produce misleading results.<br><br>

These are only 36 rows out of 806, so dropping them won't create any significant issue.

```{python}
# Drop NAs in the ID column of B since they are the main unit of analysis (total 36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')
```

Also, the 'ana' column has a non-numeric value of '>4096' for the anti-nucleus antibody measure. Let's fix that.
```{python}
# filter dataset B for correct ana values
B = B[B.ana != '>4096']
```

That concludes B; let's move on to our third dataset. <br><br>

In C, the date column is stored as an integer. Let's fix that.


```{python}
# Converting date column from int to date
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # add a 0 to month to denote months before October
        if int(month) < 10:
            month = '0'+ str(int(month))
        # add a 0 to day to denote days before the 10th of every month
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert dtype of date column to date
C['lab_date'] = C['date'].apply(int_to_date)

# Drop 'date' column since now it has been updated 
C = C.drop(columns=['date'], axis=1)
C.head()
```

#### Check individual datasets before merging

Let's have one final look at each of our datasets before merging.<br><br>

1. TSUMOTO_A.CSV
```{python}
A.head()
```

Check for missing data statistics
```{python}
df_A_stats_new = missing_val_stats(A)
df_A_stats_new
```

We see that there are very few missing values left in total (30, including those in 'sex' and 'admission'). Given such a small number, we can drop these rows.

```{python}
# drop nan values from the subset columns
A = A.dropna(subset=['sex', 'admission'])
```

After removing NAs from the above two columns, we have no missing values for our first dataset. <br><br>


2. TSUMOTO_B.CSV
```{python}
B.head()
```

Check for missing data statistics

```{python}
df_B_stats_new = missing_val_stats(B)
df_B_stats_new
```

We see that there is still some missing data that can't be just dropped. This can be handled after merging.

3. TSUMOTO_C.CSV
```{python}
C.head()
```

Check for missing data statistics

```{python}
df_C_stats_new = missing_val_stats(C)
df_C_stats_new
```

The large number of missing values still accounts for considerable uncertainty in this dataset.

#### Merging the datasets

In terms of joining data sets, there are multiple ways of doing so depending on the question being answered.<br><br>

We will first merge A with C, since A contains basic information about the patients and C holds the general laboratory examinations stored in the hospital information system. Merging these filters out the ids of patients who did not undergo those examinations. <br><br>

The last step is then to merge the combined dataframe with B.<br><br>

Since B contains our target variable (Thrombosis), it covers the patients that were tested via the 'Special Laboratory Examination'. Merging B at the end gives us the final dataset that we can use for EDA and to strengthen our final conclusion.<br><br>

Since 'ID' is the only common column amongst the 3 datasets, it will be used in the 'left' join for merging.<br><br>


##### Step 1: Merge A with C

```{python}
# Merging A with C using left join on id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
```

Check the shape.
```{python}
C_A.shape
```

##### Step 2: Merge combined (C_A) with B

```{python}
# Merging C_A with B using left join on id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
```

Check the shape.
```{python}
df_final.shape
```

Since 'diagnosis' was present in both B and C_A, the merged dataframe has two diagnosis columns. Looking carefully, their values largely overlap, so let's combine them into a single updated 'diagnosis' column for df_final.

```{python}
for i in range(len(df_final)):
    # check for nan in diagnosis_x, if false then split the values
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # check for nan in diagnosis_y, if false then split the values
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")
        
# defining an empty list to store diagnosis as column in the dataframe later on
df_final["diagnosis"] = ""

for i in range(len(df_final)):
    # fill the value with diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # fill the value with diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # fill the value with the list of the set of values in diagnosis_x and diagnosis_y if both are not nan
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))
```
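The row-wise loop above is hard to read at a glance. A roughly equivalent, hedged sketch of the same combination logic, written against raw comma-separated strings, is shown below; it is not applied to df_final, whose diagnosis columns have already been split in place by the loop.

```{python}
# Roughly equivalent combination logic (illustrative only): take the union of
# the comma-separated diagnoses from both columns, treating NaN as empty.
def combine_diagnoses(x, y):
    x_list = x.split(',') if isinstance(x, str) else []
    y_list = y.split(',') if isinstance(y, str) else []
    merged = list(set(x_list + y_list))
    return merged if merged else ""

# Example: combine_diagnoses("a,b", "b") -> ['a', 'b'] (order may vary)
```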

Recall that we kept the 'birthday' column earlier precisely so we could calculate each patient's age at the time of their thrombosis lab test. 

Now that the datasets have been merged, we can use the 'birthday' and 'examination_date' columns to do so.

```{python}
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)

# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))

# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination
```

Now that we have used these columns, we can drop them.

```{python}
# drop irrelevant columns
df_final = df_final.drop(columns=['examination_date', 'diagnosis_x', 'lab_date', 'birthday', 'diagnosis_y'], axis=1)
```

It is good practice to organize a dataframe's columns with the dependent variable at the end. Let's do that for our merged dataset.

```{python}
# rearrange the columns
df_final = df_final[['id', 'sex', 'admission', 'age_first_date', 'age_at_examination',
                    'diagnosis', 'acl_igg', 'acl_igm', 'acl_iga','ana', 'ana_pattern', 
                    'got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre', 't-bil', 
                    't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb', 'hct', 'plt', 'u-pro', 'crp', 
                    'c3', 'c4', 'symptoms','thrombosis']]
df_final.head()
```


#### Missing data in the merged dataset

Let's visualize the missing data in the merged dataset.<br><br>

1. Using klib
```{python}
print('Missing Value Plot of df_final')
df_final_klib = klib.missingval_plot(df_final, figsize=(10,15))
```

The plot above shows a few columns with a similar trend in their missing values. Let's verify this using the 'missingno' library.

2. Using missingno
```{python}
print('Correlation amongst missing values (df_final)')
df_final_msno = msno.heatmap(df_final, figsize=(20,15))
```

Some columns are still strongly correlated with one another in terms of missing data.

For example, 'GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', and 'CRE' have missingness correlations of approximately 1.<br><br>

The column groups ('WBC', 'RBC', 'HGB', 'HCT', 'PLT') and ('C3', 'C4') are likewise highly correlated in their missing values.<br><br>

We need to analyze this missing data carefully before handling it. Since this is a medical dataset, we cannot simply drop or impute such columns; we need to take precautions and follow appropriate steps (one conservative option is sketched after the note below).<br><br>

3. Using UpSet<br>
Note: Since df_final has too many columns, a full UpSet plot would be too large to be useful, so we skip it here.<br><br>
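As a concrete example of the cautious option mentioned above, one could keep explicit missingness indicators for the correlated lab columns instead of imputing them. This is only a hedged sketch and is not applied to df_final.

```{python}
# Hedged sketch (not applied): per correlated lab column, flag whether the test
# was recorded at all, so later analyses can separate "not recorded" from a
# measured value. The column names follow the merged dataframe above.
corr_lab_cols = ['got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre']
missing_flags = df_final[corr_lab_cols].isna().add_suffix('_missing')
missing_flags.mean().round(2)  # share of rows where each test was not recorded
```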


# Exploratory analysis

Let's plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot shows the distribution of 'age_first_date', i.e. the age of patients when they first visited the hospital.

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()
```

From the above plot, we infer two things:<br><br>
1. There are more female patients than male patients.<br><br>
2. The trend for both sexes is broadly similar, with the highest count of patients in the 20-25 age range.<br><br>


We also see that the count rises at the very young end of the age range irrespective of sex, possibly because of vaccinations and other routine medical checks for infants and children. Patients older than about 30 visit the hospital less often for check-ups. <br><br>

At first glance, thrombosis appears to be a sex-predominant condition, but only because the dataset itself contains far more female individuals. Since we cannot determine from this data alone whether sex actually matters for thrombosis, this exposes a serious weakness and imbalance in the data.


Let's do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis diagnoses have evolved over time.

```{python}
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

The chart above highlights that the data may be biased. The majority of examinations are negative for thrombosis, which can lead to incorrect inferences, and among the positive cases levels 2 and 3 are very rare. For instance, if we were to study the average IgA of patients with thrombosis level 3, the result could differ substantially from the true population average simply because the sample is so small.<br><br>

Before continuing with any analysis, it is crucial to emphasize this limitation because this is only a brief illustration of the potential problems that could arise.<br><br>
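To make the imbalance concrete, the brief check below simply counts how many examinations fall into each thrombosis level:

```{python}
# Count of examinations per thrombosis level, quantifying how rare the positive
# cases (especially levels 2 and 3) are.
temporal_B['thrombosis'].value_counts().sort_index()
```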

Let's look at the age distribution of patients when they were tested for thrombosis. This plot is also part of the temporal analysis.

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()
```

The plot above again reflects the uncertainty in the data. The kernel densities for the different thrombosis levels show similar patterns, except that far more patients tested negative than positive. Notably, the most severe thrombosis appears in the 30-40 age group, while for the 15-30 age group the likelihood of a positive thrombosis test is low. <br><br>

Next we analyze thrombosis more closely, since it is important to understand its associated symptoms and its prevalence across age groups and sexes in this population.

Below is a density plot comparing the age distribution of patients with thrombosis and patients without it.

```{python}
# filter df with patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df with patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]

# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']

# Filter outliers
age_t = age_t[age_t >2]
age_nt = age_nt[age_nt >2]

# Age distribution plot
sns.kdeplot(age_t, color="#f44336",fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797",fill=True, label="Patients without Thrombosis")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age(years)")

# Display plot
plt.show()
```

The density plot above shows that thrombosis does not seem to occur in people older than about 70. In fact, most thrombosis patients are aged 20-60, roughly the same range as patients without thrombosis. The only slight difference is that patients without thrombosis skew a little younger within that range, but this does not appear to be meaningful.

The next step is to see whether there are any differences within the thrombosis group, i.e. between the different levels of the disease's severity.

```{python}
# filter the level of thrombosis disease into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]

# store the respective ages of patients as per the variables chosen above
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']

# only check for patients having age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]

# plot the kernel density plot using the age calculated above
sns.kdeplot(age_t1, color="#979797",fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black",fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336",fill=True, label="Level 3")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of different levels of Thrombosis")
# set x-label
plt.xlabel("Age(Years)")

# Display plot
plt.show()
```

Different levels of thrombosis turn out to have different age distributions: the density for level 3 patients is concentrated in the 30-40 age range, while levels 1 and 2 are concentrated around 15-35 and 0-30 respectively.<br><br>
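As a small hedged numeric companion to the density plots, the median examination age per thrombosis level (excluding the same very young ages filtered above) can be computed directly:

```{python}
# Median age at examination per thrombosis level, with ages <= 2 excluded to
# match the filter used for the density plots above.
ages = pd.to_numeric(df_final['age_at_examination'], errors='coerce')
df_ages = df_final.assign(age_num=ages)
df_ages.loc[df_ages['age_num'] > 2].groupby('thrombosis')['age_num'].median()
```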


Let's look at box plots to check for outliers.
```{python}
# Subset columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]

# drop nan values from subset columns
lab_results = lab_results.dropna(subset=['ana']) # Only 4 missing values 
# change ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')

# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()
```

The data clearly has outlier issues.

Setting the threshold at 100 for IgA, IgG, and IgM makes sense, and ANA is capped at 50 (the filter applied below).

```{python}
lab_results['ana'] = lab_results['ana'].astype('int')
# setting the threshold at 100 for the anti-Cardiolipin antibodies and 50 for ana
lab_results = lab_results[(lab_results['acl_iga'] < 100) &  (lab_results['acl_igg'] < 100) &  
(lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]


# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()
```

The data is now in a better state to analyze, though a handful of outliers remain. 
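As an alternative to the fixed cutoffs chosen above, an interquartile-range rule could be used to pick thresholds. The sketch below only reports the bounds such a rule would imply; it is not applied to the data.

```{python}
# Hedged alternative to the fixed cutoffs: bounds implied by a 1.5*IQR rule per
# column, shown for comparison only (not applied to lab_results).
def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

bounds = {col: iqr_bounds(lab_results[col].astype(float))
          for col in ['acl_igg', 'acl_igm', 'acl_iga', 'ana']}
pd.DataFrame(bounds, index=['lower', 'upper']).round(2)
```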

Below is a radar chart comparing patients without thrombosis to those with it (across the different severity levels) in terms of their averaged lab results.

```{python}
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

The radar chart shows that the only lab result with a significant difference between the thrombosis groups is:

ANA: the higher the ANA value, the more severe the thrombosis diagnosis.
Thrombosis level 1 seems to have similar acl_iga and acl_igm values.

A possible predictor of Thrombosis could be the symptoms a patient is experiencing.

```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```

The main symptoms of each thrombosis level are the following:

Level 1: Brain infarction

Level 2: CNS lupus

Level 3: Thrombocytopenia

The word clouds also highlight a major issue in the data mentioned previously: the lack of Thrombosis patients, which is why there are fewer words as the level of Thrombosis becomes more severe.

# Final Plots

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()
```

<br><br>

```{python}
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

<br><br>

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()
```

<br><br>

```{python}
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

<br><br>

```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```



# Technical summary

The quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are provided below:<br><br>

1. Missing Data: Missing values are a major problem in this set of data, particularly in TSUMOTO_C.CSV, where more than 80% of the columns have missing values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests, and some significant connections and patterns were likely missed because of it.<br><br>

2. Lack of Thrombosis cases: The major goals of this study were to gain a deeper understanding of thrombosis and to identify patterns in the data that might act as predictors of the condition. The data provided, however, contains extremely few thrombosis cases, particularly of the more severe levels (2 and 3), making any conclusions highly sensitive to the addition of even a few more cases.<br><br>

3. Data not balanced: When restricted to thrombosis-positive cases, the dataset TSUMOTO_B.CSV is heavily imbalanced, with about 85% of patients being female. This could lead to the misleading conclusion that thrombosis occurs considerably more often in females, and it prevented an accurate sex-based analysis (a quick check of this split is sketched after this list).<br><br>

4. Dates: In some cases, there is a gap of more than 5 years between the dates of a patient's routine lab tests and their thrombosis testing. Because the lab results could have been drastically different had the tests been run closer to the thrombosis assessment, this weakens any analysis based on the lab data (TSUMOTO_C.CSV).<br><br>
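As a quick hedged check of the imbalance described in point 3, the sex split among unique thrombosis-positive patients in the merged data can be computed as follows; this is one simple way to approximate the split, not a definitive figure.

```{python}
# Sex split among unique thrombosis-positive patients in the merged data,
# illustrating the imbalance described in point 3.
positives = df_final[df_final['thrombosis'] > 0].drop_duplicates(subset='id')
positives['sex'].value_counts(normalize=True).round(2)
```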

It is worth noting that all of the data was gathered between 1989 and 1996, a period in which data discrepancies were common across the industry and data-collection methods were less sophisticated. If more recent data is available, repeating this examination of thrombosis on it would be worthwhile.<br><br>

Below are some of the main takeaways (elaborated further in results.html):<br><br>

1. Level 3 Thrombosis appears to occur in people on the higher end of the age distribution.<br><br>
2. ANA seems to be the best predictor among the lab measurements examined.<br><br>
3. There are various stages of thrombosis, and each level has its own set of symptoms.