Thrombosis Analysis

On this page

  • 1 Introduction
    • 1.1 Objective
    • 1.2 Data
    • 1.3 Goals of the project
  • 2 Initial Questions
  • 3 Data Ingestion, Cleaning and Munging
    • 3.1 Ingestion
    • 3.2 Cleaning and Munging
  • 4 Exploratory analysis
  • 5 Final Plots
  • 6 Technical summary

EDA on Medical Data of Thrombosis Diagnosis

Author: Raghav Sharma
Affiliation: Georgetown University

1 Introduction

1.1 Objective

Imagine yourself as a data scientist working at a consulting firm. An external entity (the client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm moves on to a new, unrelated project.

Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.

1.2 Data

The data were collected at Georgetown University's Hospital and contain information about patients who have been admitted to the hospital's Collagen disease clinic.

Collagen diseases are auto-immune diseases in which patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies against lung tissue will progressively lose respiratory function and eventually die. Thrombosis is the most severe form of collagen disease.

There are three CSV files that make up the medical data. These files are:

  1. TSUMOTO_A.CSV: Contains basic information about patients

  2. TSUMOTO_B.CSV: Contains information about the ‘Special Laboratory Examinations’ of patients to detect Thrombosis.

  3. TSUMOTO_C.CSV: Contains laboratory examinations stored in the hospital information system (recorded from 1980 to March 1999). This dataset has time stamps.

1.3 Goals of the project

This project aims to thoroughly understand and analyze Thrombosis, a collagen disease, by achieving the following:

  1. Search for patterns that detect and predict thrombosis.

  2. Search for temporal patterns specific/sensitive to thrombosis.

  3. Search for features that classify collagen diseases correctly.

  4. Search for temporal patterns specific/sensitive to each collagen disease.

To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.

2 Initial Questions

We would like to investigate the following questions prior to our analyses:

  1. How does thrombosis evolve over time?

  2. In which age group is thrombosis most prevalent?

  3. Does it depend on the sex of the patient?

  4. Is this dataset good enough to understand and visualize the causes and effects of thrombosis?

3 Data Ingestion, Cleaning and Munging

Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc.

3.1 Ingestion

3.1.0.1 Importing the libraries

Code
# Importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import klib
import missingno as msno
from upsetplot import UpSet

from wordcloud import WordCloud, STOPWORDS
import random

import sys

import warnings
warnings.filterwarnings("ignore")

3.1.0.2 Importing the datasets

Code
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
  1. TSUMOTO_A.CSV
Code
A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding= 'unicode_escape')
A.head()
ID SEX Birthday Description First Date Admission Diagnosis
0 2110 F 1934/2/13 94.02.14 93.02.10 + RA susp.
1 11408 F 1937/5/2 96.12.01 73.01.01 + PSS
2 12052 F 1956/4/14 91.08.13 NaN + SLE
3 14872 F 1953/9/21 97.08.13 NaN + MCTD
4 27654 F 1936/3/25 NaN 92.02.03 + RA, SLE susp
Code
"Shape of A: " + str(A.shape)
'Shape of A: (1240, 7)'
  2. TSUMOTO_B.CSV
Code
B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding= 'unicode_escape')
B.head()
ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA Diagnosis KCT RVVT LAC Symptoms Thrombosis
0 14872.0 1997/5/27 1.3 1.6 256 P 0.0 MCTD, AMI NaN NaN - AMI 1
1 48473.0 1992/12/21 4.3 4.6 256 P,S 3.3 SLE - - - NaN 0
2 102490.0 1995/4/20 2.3 2.5 0 NaN 3.5 PSS NaN NaN NaN NaN 0
3 108788.0 1997/5/6 0.0 0.0 16 S 0.0 NaN NaN NaN - NaN 0
4 122405.0 1998/4/2 0.0 4.0 4 P 0.0 SLE, SjS, vertigo NaN NaN NaN NaN 0
Code
"Shape of B: " + str(B.shape)
'Shape of B: (806, 13)'
  3. TSUMOTO_C.CSV

Note: A couple of lines in this file caused parsing errors when reading the dataset. Using the "error_bad_lines=False" argument resolved that. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.
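As a side note, the "error_bad_lines" argument was deprecated in pandas 1.3 and removed in pandas 2.0; an equivalent read on newer pandas versions (an alternative sketch, not the call used in this report) would be:

Code
# equivalent read on pandas >= 1.3, where on_bad_lines="skip" replaces error_bad_lines=False
C = pd.read_csv("data/TSUMOTO_C.CSV", encoding="unicode_escape", on_bad_lines="skip")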

Code
# A couple of bad (unreadable) lines in the 3rd dataset were dropped. The dropped lines account for 30 (out of 50k+), 
# which makes this drop insignificant
C = pd.read_csv("data/TSUMOTO_C.CSV" , encoding= 'unicode_escape', error_bad_lines=False) 
C.head()
ID Date GOT GPT LDH ALP TP ALB UA UN ... C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II
0 2110 860419 24 12 152 63 7.5 4.5 3.4 16.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2110 860430 25 12 162 76 7.9 4.6 4.7 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2110 860502 22 8 144 68 7 4.2 5 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2110 860506 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2110 860507 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

Code
"Shape of C: " + str(C.shape)
'Shape of C: (57510, 44)'

C is by far the largest dataframe that has information about general laboratory examinations.

3.2 Cleaning and Munging

3.2.0.1 Analyzing Missing values

Searching for missing data is the first and most important stage of data cleaning. Since this project entails combining the datasets, checking the missing values in each column (per dataset) gives a solid idea of which columns are usable and which need to be adjusted or omitted.

It is important to keep in mind that this is medical data, so a missing value simply means that the measurement was not recorded. We need to analyze the missingness carefully before taking any steps to handle it.

We will use ‘klib’, ‘missingno’, and ‘UpSet’ libraries to visualize the missing data for all the datasets.

  1. The klib library helps us visualize missing-data trends in a dataset. Using its 'missingval_plot', we can extract the necessary information about the missing data in every column.

  2. The missingno library is helpful for plotting the correlation among columns with missing values. This helps us decide whether a certain column needs to be dropped.

  3. The UpSet plot looks at multivariate relationships, here applied to the patterns of jointly missing values across columns.

Let's create a function that returns the missing-data statistics of a dataset.

Code
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # column
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        df_stats = pd.concat([df_stats, tmp])  # collect this column's stats (pd.concat replaces the deprecated DataFrame.append)
    
    return df_stats

Let's take a look at the missing-value statistics of A

Code
df_A_stats = missing_val_stats(A)
df_A_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1238 1238 0 0.0
0 SEX [F, M, nan] 3 2 13 1.0
0 Birthday [1934/2/13, 1937/5/2, 1956/4/14, 1953/9/21, 19... 1195 1194 1 0.1
0 Description [94.02.14, 96.12.01, 91.08.13, 97.08.13, nan, ... 100 99 217 17.5
0 First Date [93.02.10, 73.01.01, nan, 92.02.03, 94.03.08, ... 821 820 248 20.0
0 Admission [+, -, nan, üH, îýô€, +üiö‡òåë®üj, +(ë¦ÄuÆÈòaë... 9 8 31 2.5
0 Diagnosis [RA susp., PSS, SLE, MCTD, RA, SLE susp, SLE, ... 222 222 0 0.0

Visualizing missing values in A

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_A.CSV)"
a_klib = klib.missingval_plot(A, figsize=(10,15))

A few columns have missing values, with 'First Date' reaching a maximum of 20%. These can be dealt with at later stages, when/if those columns are used.

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))

Nothing too significant for A.

  3. Using UpSet
Code
"UpSet plot for TSUMOTO_A.CSV"
# check for nan values
a_upset = pd.isna(
  A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis']]
)

# using groupby for the missing values
a_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis'], as_index=True).size()

# plot the groupby object
UpSet(a_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()

As stated earlier, there isn't much correlation among the missing values in A. The UpSet plot confirms this: there are few cases where two or more values are missing together.

Let's take a look at the missing-value statistics of B

Code
df_B_stats = missing_val_stats(B)
df_B_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [14872.0, 48473.0, 102490.0, 108788.0, 122405.... 768 767 36 4.5
0 Examination Date [1997/5/27, 1992/12/21, 1995/4/20, 1997/5/6, 1... 476 475 10 1.2
0 aCL IgG [1.3, 4.3, 2.3, 0.0, 1.0, 6.1, 3.7, 1.7, 5.1, ... 115 115 0 0.0
0 aCL IgM [1.6, 4.6, 2.5, 0.0, 4.0, 2.4, 9.5, 3.0, 0.8, ... 140 140 0 0.0
0 ANA [256, 0, 16, 4, 1024, 4096, 64, nan, 4094, >4096] 10 9 38 4.7
0 ANA Pattern [P, P,S, nan, S, D,P,S, P.S, S,P, S,D, D,P, P,... 16 15 261 32.4
0 aCL IgA [0.0, 3.3, 3.5, 8.7, 12.3, 22.5, 3.1, 9.1, 5.1... 163 163 0 0.0
0 Diagnosis [MCTD, AMI, SLE, PSS, nan, SLE, SjS, vertigo, ... 182 181 331 41.1
0 KCT [nan, -, +] 3 2 660 81.9
0 RVVT [nan, -, +] 3 2 660 81.9
0 LAC [-, nan, +] 3 2 584 72.5
0 Symptoms [AMI, nan, CNS lupus, Apo, AP, thrombocytepeni... 40 39 726 90.1
0 Thrombosis [1, 0, 2, 3] 4 4 0 0.0

Visualizing missing values in B

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_B.CSV)"
b_klib = klib.missingval_plot(B, figsize=(10,15))

This dataset is more concerning because some of its columns are missing more than 70% of their data. 'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge when it comes to replacing missing values.

The 'Symptoms' column has the most missing values (726). A missing entry here may simply mean that the patient had no symptoms, so we will change these NaN values to 'None' for the analysis later.

For the remaining columns it is challenging to gauge how the missing values are distributed and how they might affect the data. Imputing them would distort and bias the data, producing unrepresentative results.

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8,5))

Looking at the plot above, the missing values in 'KCT', 'RVVT', and 'LAC' are correlated. Two of these three columns can be dropped, since they represent the same thing (a measure of the degree of coagulation).

  3. Using UpSet
Code
"UpSet plot for TSUMOTO_B.CSV"
# check for missing values
b_upset = pd.isna(
  B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']]
)

# using groupby for the missing values
b_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()

# plot the groupby object
UpSet(b_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()

We see the correlation quite clearly in the plot. The bar is tallest when 'KCT', 'RVVT', 'LAC', and 'Symptoms' are missing together. A similar pattern can be seen for other columns.

Let's take a look at the missing-value statistics of C

Code
df_C_stats = missing_val_stats(C)
df_C_stats
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 ID [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1236 1236 0 0.0
0 Date [860419, 860430, 860502, 860506, 860507, 86050... 4960 4960 0 0.0
0 GOT [24, 25, 22, nan, 28, 21, 23, 38, 27, 32, 20, ... 424 423 11341 19.7
0 GPT [12, 8, nan, 13, 9, 10, 22, 18, 17, 15, 14, 46... 770 769 11356 19.7
0 LDH [152, 162, 144, nan, 156, 167, 185, 134, 165, ... 1500 1499 11202 19.5
0 ALP [63, 76, 68, nan, 66, 64, 60, 56, 55, 75, 89, ... 1335 1334 11989 20.8
0 TP [7.5, 7.9, 7, nan, 7.6, 8, 7.2, 7.1, 8.2, 7.7,... 95 94 11874 20.6
0 ALB [4.5, 4.6, 4.2, nan, 4.8, 4.4, 4.7, 4, 4.3, 4.... 95 94 12189 21.2
0 UA [3.4, 4.7, 5, nan, 4.4, 4.5, 4, 3.7, 3.6, 3.2,... 275 274 12202 21.2
0 UN [16.0, 18.0, nan, 15.0, 14.0, 13.0, 12.0, 19.0... 346 345 11387 19.8
0 CRE [0.6, nan, 0.7, 0.8, 0.66, 0.9, 1.2, 1, 1.1, 1... 601 600 11291 19.6
0 T-BIL [0.4, nan, 0.5, 0.8, 0.6, 0.7, 0.2, 0.3, 0.9, ... 340 339 18733 32.6
0 T-CHO [192, 187, 191, nan, 214, 231, 212, 204, 208, ... 848 847 15090 26.2
0 TG [nan, 76, 65, 67, 113, 74, 63, 169, 155, 119, ... 1008 1007 28252 49.1
0 CPK [nan, 31, 23, 25, 26, 28, 16, 37, 22, 18, 21, ... 920 919 36946 64.2
0 GLU [nan, 114.0, 155.0, 88.0, 80.0, 84.0, 103.0, 2... 514 513 49078 85.3
0 WBC [5.9, 6.9, nan, 7.2, 5, 4.9, 7.4, 6.5, 6.2, 7.... 547 546 7905 13.7
0 RBC [4.69, 4.73, nan, 4.93, 4.62, 4.49, 4.71, 4.38... 856 855 7905 13.7
0 HGB [9, 9.3, nan, 9.5, 8.9, 8.5, 9.2, 9.4, 9.1, 10... 285 284 7904 13.7
0 HCT [31.6, 31.8, nan, 33, 30.9, 30, 31.7, 29.3, 31... 766 765 7904 13.7
0 PLT [380, 323, nan, 367, 372, 313, 299, 312, 346, ... 913 912 9605 16.7
0 PT [nan, 11.9, 12.4, 12.0, 11.4, 12.1, 11.5, 11.3... 199 198 53507 93.0
0 APTT [nan, 68.0, 67.0, 71.0, 65.0, 57.0, 56.0, 80.0... 129 128 56618 98.4
0 FG [nan, 27.8, 27.0, 31.8, 31.9, 43.2, 47.5, 31.0... 876 875 54127 94.1
0 PIC [nan, 290.0, 315.0, 105.0, 175.0, 285.0, 530.0... 350 349 56711 98.6
0 TAT [nan, 86, 72, 63, 142, 138, 92, 107, 111, 110,... 231 230 56147 97.6
0 TAT2 [nan, 73, 96, 92, 120, 124, 112, 119, 128, 113... 164 163 56518 98.3
0 U-PRO [0, nan, -, TR, 100, 30, 1, 2, 300, >=300, 3, ... 21 20 18402 32.0
0 IGG [nan, 1265.0, 1485.0, 1530.0, 1149.0, 1240.0, ... 4815 4814 45181 78.6
0 IGA [nan, 284, 328, 312, 327, 303, 270, 494, 390, ... 969 968 45184 78.6
0 IGM [nan, 124, 140, 132, 39, 109, 115, 190, 160, 1... 763 762 45189 78.6
0 CRP [nan, <0.002, -, <0.2, +, 2+, 3+, 6+, 4+, 0.3,... 441 440 10409 18.1
0 RA [nan, -, +, 2+, +-, --, 22, 7-] 8 7 48950 85.1
0 RF [nan, <40, 246, 286, 249, 162, 120, 104, 73.5,... 2394 2393 44943 78.1
0 C3 [nan, 30, 26, 24, 45, 49, 48, 65, 56, 58, 64, ... 373 372 30647 53.3
0 C4 [nan, 11, 8, 7, 15, 14, 13, 20, 19, 18, 16, 21... 107 106 30646 53.3
0 RNP [nan, 0, 4, 64, 1, negative, 256, 16] 8 7 56768 98.7
0 SM [nan, 0, negative, 2, 1, 8, 32] 7 6 56855 98.9
0 SC170 [nan, 16, 1, 0, negative, 4] 6 5 57372 99.8
0 SSA [nan, 4, 0, negative, 64, 16, 1, 256] 8 7 56935 99.0
0 SSB [nan, 0, negative, 1, 32, 2, 8, 128] 8 7 56966 99.1
0 CENTROMEA [nan, 0, negative, 320, 160, 0.0, 1280.0, 1280... 11 10 57443 99.9
0 DNA [nan, 63.1, 59.9, 57.1, 66.0, 41.9, 18.0, 15.8... 330 329 57078 99.2
0 DNA-II [nan] 1 0 57510 100.0

Visualizing missing values in C

  1. Using klib
Code
"Missing Value Plot (TSUMOTO_C.CSV)"
c_klib = klib.missingval_plot(C, figsize=(10,15))

Our third dataset turns out to be the most problematic, with almost half of its columns (19 out of 44) missing more than 60% of their values, including 15 columns missing more than 80%. The fundamental problem is that most of these columns are numerical and could have provided useful context for the analysis, but in their current state they are essentially unusable.
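As a quick sanity check on these counts (a minimal sketch using the C dataframe loaded above):

Code
# fraction of missing values per column in C
missing_frac = C.isna().mean()
# number of columns above the 60% and 80% thresholds discussed above
print("Columns > 60% missing:", (missing_frac > 0.6).sum())
print("Columns > 80% missing:", (missing_frac > 0.8).sum())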

  2. Using missingno
Code
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20,15))

There is a high correlation between many columns with missing data. Since this dataset contains general laboratory examinations, a patient may simply not have taken a particular lab test; and because many of these results come from the same procedure, skipping one test leaves several related columns empty at once.

  3. Using UpSet
    Note: Since C has too many columns, the UpSet plot would be too large to be readable, so we skip generating it for C.

These hypotheses will be easier to assess once missing values are handled and new heatmaps are plotted, which is part of the next steps.

3.2.0.2 Taking care of missing data

As previously discussed, missing values are a big obstacle to the data sets, so dealing with them is a priority before carrying on any analysis.

Missing data for dataset A can be handled during cleaning and munging. As for Dataset B, we need to implement the following:

  1. Dropping correlated columns

  2. Modifying ‘Symptoms’ column

Code
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify symptoms column
B[['Symptoms']] = B[['Symptoms']].fillna("None")

Any columns with more than 70% missing values should also be dropped since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50–60% range will be preserved and individually assessed.

Code
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]

3.2.0.3 Cleaning data to make it tidier

It is a good idea to rename the columns to keep the datasets organized before continuing with the analysis. All column names should be lowercase, with underscores ('_') in place of spaces (' ').

Code
def rename_columns(df):
    for i in range(len(df.columns)):
        # check for number of words per column name
        if len(df.columns[i].split()) != 1:
            # rename column as mentioned above
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0]) + '_' + (str(df.columns[i]).lower().split()[1])}, inplace=True)
        # in case of only word, follow the same idea
        else:
            df.rename(columns={str(df.columns[i]):str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)

Some columns in dataset A store dates as plain strings (object dtype). Let's convert them to a consistent format.

Code
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # form year
        year = '19' + year[-2:]
        # form month
        if int(month) < 10:
            month = '0'+ str(int(month))
        # form day
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)

Now that we have defined the date-conversion function, we can calculate the age of each patient.

Code
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # Identify given date as date, month and year
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
Code
# Calculating age of every patient at the time when they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))

# appending the age column to the dataframe
A['age_first_date'] = age_first_date

Now that we have calculated the ages, we no longer need those columns. We should also drop negative age values, since they are clearly erroneous.

Note: We won't remove the 'birthday' column since it can be used later with the examination date. Dropping columns carries some risk, but we have already extracted what we need from these, so they are no longer relevant to our goal.

Code
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'],  axis=1)

# Filter dataframe to drop absurd values of age
A = A[A.age_first_date > 0]

If we take a look at the 'admission' column in A, we see that apart from '+', '-' and NaN values, it also contains some garbled ones. Let's clean up that column so it does not cause issues later on.

Code
# define a function for the admission column to be formatted appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' in value, then update the entire string value with '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' in value, then update the entire string value with '-'
            val = val[val.index('-')]
        else:
            # if either of the above literals are not present, then that means it's irrelevant. This value won't be dropped since this is medical data, but it will be replaced with nan since it wasn't recorded.
            val = np.nan
    return val
Code
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
id sex birthday admission diagnosis age_first_date
0 2110 F 1934/02/13 + RA susp. 58.0
1 11408 F 1937/05/02 + PSS 35.0
4 27654 F 1936/03/25 + RA, SLE susp 55.0
6 43003 M 1937/11/24 - Raynaud's phenomenon 56.0
9 57266 M 1923/07/25 + RA 69.0

Moving on to dataset B.

B has three anti-Cardiolipin antibody columns whose values are on different scales. We should standardize them, since they may be useful later for temporal analysis.

Code
# Normalizing values of the 3 different anti-Cardiolipin antibody column
B['acl_igg'] = (B['acl_igg']-B['acl_igg'].mean())/B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm']-B['acl_igm'].mean())/B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga']-B['acl_iga'].mean())/B['acl_iga'].std()

Also, notice that the 'id' column in B has missing values. Since this dataset holds our dependent variable and records the patients who were tested for it, rows without an id cannot be linked back to a patient and could produce misleading results.

These are only 36 rows out of 806, so dropping them won't create any significant issue.

Code
# Drop NAs in the ID column of B since they are the main unit of analysis (total 36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')

Also, the 'ana' column contains the non-numeric value '>4096' for the anti-nucleus antibody measure. Let's fix that.

Code
# filter dataset B for correct ana values
B = B[B.ana != '>4096']

That concludes the cleaning for B. Let's move on to our third dataset.

In C, we again see a date column, but stored as an integer. Let's fix that.

Code
# Converting date column from int to date
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # add a 0 to month to denote months before October
        if int(month) < 10:
            month = '0'+ str(int(month))
        # add a 0 to day to denote days before the 10th of every month
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
Code
# Convert dtype of date column to date
C['lab_date'] = C['date'].apply(int_to_date)

# Drop 'date' column since now it has been updated 
C = C.drop(columns=['date'], axis=1)
C.head()
id got gpt ldh alp tp alb ua un cre ... wbc rbc hgb hct plt u-pro crp c3 c4 lab_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 5.9 4.69 9 31.6 380 0 NaN NaN NaN 1986/04/19
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... 6.9 4.73 9.3 31.8 323 NaN <0.002 NaN NaN 1986/04/30
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN NaN NaN NaN NaN NaN - NaN NaN 1986/05/02
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/06
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/07

5 rows × 24 columns

3.2.0.4 Check individual datasets before merging

Let's take one final look at each dataset before merging.

  1. TSUMOTO_A.CSV
Code
A.head()
id sex birthday admission diagnosis age_first_date
0 2110 F 1934/02/13 + RA susp. 58.0
1 11408 F 1937/05/02 + PSS 35.0
4 27654 F 1936/03/25 + RA, SLE susp 55.0
6 43003 M 1937/11/24 - Raynaud's phenomenon 56.0
9 57266 M 1923/07/25 + RA 69.0

Check for missing data statistics

Code
df_A_stats_new = missing_val_stats(A)
df_A_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [2110, 11408, 27654, 43003, 57266, 71417, 1152... 980 980 0 0.0
0 sex [F, M, nan] 3 2 7 0.7
0 birthday [1934/02/13, 1937/05/02, 1936/03/25, 1937/11/2... 950 950 0 0.0
0 admission [+, -, nan] 3 2 23 2.3
0 diagnosis [RA susp., PSS, RA, SLE susp, Raynaud's phenom... 189 189 0 0.0
0 age_first_date [58.0, 35.0, 55.0, 56.0, 69.0, 65.0, 20.0, 26.... 75 75 0 0.0

We see that there are very few missing values in total (30, across 'sex' and 'admission'). Given such a small number, we can drop these rows.

Code
# drop nan values from the subset columns
A = A.dropna(subset=['sex', 'admission'])

After removing NAs from the above two columns, we have no missing values for our first dataset.

  2. TSUMOTO_B.CSV
Code
B.head()
id examination_date acl_igg acl_igm ana ana_pattern acl_iga diagnosis symptoms thrombosis
0 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1
1 48473 1992/12/21 -0.088135 -0.035447 256 P,S -0.036709 SLE None 0
2 102490 1995/4/20 -0.105464 -0.035765 0 NaN -0.036592 PSS None 0
3 108788 1997/5/6 -0.125392 -0.036144 16 S -0.038639 NaN None 0
4 122405 1998/4/2 -0.125392 -0.035538 4 P -0.038639 SLE, SjS, vertigo None 0

Check for missing data statistics

Code
df_B_stats_new = missing_val_stats(B)
df_B_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [14872, 48473, 102490, 108788, 122405, 133382,... 766 766 0 0.0
0 examination_date [1997/5/27, 1992/12/21, 1995/4/20, 1997/5/6, 1... 453 452 7 0.9
0 acl_igg [-0.11412841735362349, -0.08813455752655686, -... 111 111 0 0.0
0 acl_igm [-0.03590169890063771, -0.03544652764211938, -... 137 137 0 0.0
0 ana [256, 0, 16, 4, 1024, 4096, 64, nan, 4094] 9 8 20 2.6
0 ana_pattern [P, P,S, nan, S, D,P,S, P.S, S,P, S,D, D,P, P,... 16 15 236 30.7
0 acl_iga [-0.038639334734420175, -0.036709439525941974,... 155 155 0 0.0
0 diagnosis [MCTD, AMI, SLE, PSS, nan, SLE, SjS, vertigo, ... 177 176 312 40.6
0 symptoms [AMI, None, CNS lupus, Apo, AP, thrombocytepen... 40 40 0 0.0
0 thrombosis [1, 0, 2, 3] 4 4 0 0.0

We see that there is still some missing data that can’t be just dropped. This can be handled after merging.

  3. TSUMOTO_C.CSV
Code
C.head()
id got gpt ldh alp tp alb ua un cre ... wbc rbc hgb hct plt u-pro crp c3 c4 lab_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 5.9 4.69 9 31.6 380 0 NaN NaN NaN 1986/04/19
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... 6.9 4.73 9.3 31.8 323 NaN <0.002 NaN NaN 1986/04/30
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN NaN NaN NaN NaN NaN - NaN NaN 1986/05/02
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/06
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0 NaN NaN NaN 1986/05/07

5 rows × 24 columns

Check for missing data statistics

Code
df_C_stats_new = missing_val_stats(C)
df_C_stats_new
column unique_val num_unique_val num_unique_val_nona num_miss pct_miss
0 id [2110, 11408, 12052, 14872, 27654, 30609, 4300... 1236 1236 0 0.0
0 got [24, 25, 22, nan, 28, 21, 23, 38, 27, 32, 20, ... 424 423 11341 19.7
0 gpt [12, 8, nan, 13, 9, 10, 22, 18, 17, 15, 14, 46... 770 769 11356 19.7
0 ldh [152, 162, 144, nan, 156, 167, 185, 134, 165, ... 1500 1499 11202 19.5
0 alp [63, 76, 68, nan, 66, 64, 60, 56, 55, 75, 89, ... 1335 1334 11989 20.8
0 tp [7.5, 7.9, 7, nan, 7.6, 8, 7.2, 7.1, 8.2, 7.7,... 95 94 11874 20.6
0 alb [4.5, 4.6, 4.2, nan, 4.8, 4.4, 4.7, 4, 4.3, 4.... 95 94 12189 21.2
0 ua [3.4, 4.7, 5, nan, 4.4, 4.5, 4, 3.7, 3.6, 3.2,... 275 274 12202 21.2
0 un [16.0, 18.0, nan, 15.0, 14.0, 13.0, 12.0, 19.0... 346 345 11387 19.8
0 cre [0.6, nan, 0.7, 0.8, 0.66, 0.9, 1.2, 1, 1.1, 1... 601 600 11291 19.6
0 t-bil [0.4, nan, 0.5, 0.8, 0.6, 0.7, 0.2, 0.3, 0.9, ... 340 339 18733 32.6
0 t-cho [192, 187, 191, nan, 214, 231, 212, 204, 208, ... 848 847 15090 26.2
0 tg [nan, 76, 65, 67, 113, 74, 63, 169, 155, 119, ... 1008 1007 28252 49.1
0 cpk [nan, 31, 23, 25, 26, 28, 16, 37, 22, 18, 21, ... 920 919 36946 64.2
0 wbc [5.9, 6.9, nan, 7.2, 5, 4.9, 7.4, 6.5, 6.2, 7.... 547 546 7905 13.7
0 rbc [4.69, 4.73, nan, 4.93, 4.62, 4.49, 4.71, 4.38... 856 855 7905 13.7
0 hgb [9, 9.3, nan, 9.5, 8.9, 8.5, 9.2, 9.4, 9.1, 10... 285 284 7904 13.7
0 hct [31.6, 31.8, nan, 33, 30.9, 30, 31.7, 29.3, 31... 766 765 7904 13.7
0 plt [380, 323, nan, 367, 372, 313, 299, 312, 346, ... 913 912 9605 16.7
0 u-pro [0, nan, -, TR, 100, 30, 1, 2, 300, >=300, 3, ... 21 20 18402 32.0
0 crp [nan, <0.002, -, <0.2, +, 2+, 3+, 6+, 4+, 0.3,... 441 440 10409 18.1
0 c3 [nan, 30, 26, 24, 45, 49, 48, 65, 56, 58, 64, ... 373 372 30647 53.3
0 c4 [nan, 11, 8, 7, 15, 14, 13, 20, 19, 18, 16, 21... 107 106 30646 53.3
0 lab_date [1986/04/19, 1986/04/30, 1986/05/02, 1986/05/0... 4960 4960 0 0.0

The large number of missing values still leaves considerable uncertainty in this dataset.

3.2.0.5 Merging the datasets

In terms of joining data sets, there are multiple ways of doing so depending on the question being answered.

Now that we have an idea of the data, we will first merge A with C: A contains basic information about the patients, and C holds the general laboratory examinations stored in the hospital information systems. Merging these filters out the ids of patients who never took these examinations.

Then the last step would be to merge the combined one with B.

Since B contains our target variable (thrombosis), it holds the information on patients who were tested via the 'Special Laboratory Examination'. Merging B at the end gives us the final dataset that we can use for EDA and to strengthen our final conclusions.

Since ‘ID’ is the only common column amongst the 3 datasets, it will be used in the ‘left’ join for merging.

3.2.0.5.1 Step 1: Merge A with C
Code
# Merging A with C using left join on id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
id got gpt ldh alp tp alb ua un cre ... u-pro crp c3 c4 lab_date sex birthday admission diagnosis age_first_date
0 2110 24 12 152 63 7.5 4.5 3.4 16.0 0.6 ... 0 NaN NaN NaN 1986/04/19 F 1934/02/13 + RA susp. 58.0
1 2110 25 12 162 76 7.9 4.6 4.7 18.0 0.6 ... NaN <0.002 NaN NaN 1986/04/30 F 1934/02/13 + RA susp. 58.0
2 2110 22 8 144 68 7 4.2 5 18.0 0.6 ... NaN - NaN NaN 1986/05/02 F 1934/02/13 + RA susp. 58.0
3 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 NaN NaN NaN 1986/05/06 F 1934/02/13 + RA susp. 58.0
4 2110 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 NaN NaN NaN 1986/05/07 F 1934/02/13 + RA susp. 58.0

5 rows × 29 columns

Check the shape.

Code
C_A.shape
(57512, 29)
3.2.0.5.2 Step 2: Merge combined (C_A) with B
Code
# Merging C_A with B using left join on id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
id examination_date acl_igg acl_igm ana ana_pattern acl_iga diagnosis_x symptoms thrombosis ... u-pro crp c3 c4 lab_date sex birthday admission diagnosis_y age_first_date
0 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... 0 - NaN NaN 1981/02/17 NaN NaN NaN NaN NaN
1 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/03/16 NaN NaN NaN NaN NaN
2 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/04/06 NaN NaN NaN NaN NaN
3 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/04/27 NaN NaN NaN NaN NaN
4 14872 1997/5/27 -0.114128 -0.035902 256 P -0.038639 MCTD, AMI AMI 1 ... NaN - NaN NaN 1981/05/01 NaN NaN NaN NaN NaN

5 rows × 38 columns

Check the shape.

Code
df_final.shape
(20279, 38)

Since 'diagnosis' was present in both B and C_A, the final dataframe has two diagnosis columns. Looking carefully, their values are largely the same, so let's combine them into a single updated 'diagnosis' column in df_final.

Code
for i in range(len(df_final)):
    # check for nan in diagnosis_x, if false then split the values
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # check for nan in diagnosis_y, if false then split the values
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")
        
# defining an empty list to store diagnosis as column in the dataframe later on
df_final["diagnosis"] = ""

for i in range(len(df_final)):
    # fill the value with diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # fill the value with diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # fill the value with the list of the set of values in diagnosis_x and diagnosis_y if both are not nan
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))

Recall that we did not drop the 'birthday' column earlier; the main reason was to calculate each patient's age at the time of their thrombosis lab test.

Now that the datasets have been merged, we can use the 'birthday' and 'examination_date' columns to do exactly that.

Code
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)

# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))

# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination

Now that we have used these columns, we can drop them.

Code
# drop irrelevant columns
df_final = df_final.drop(columns=['examination_date', 'diagnosis_x', 'lab_date', 'birthday', 'diagnosis_y'], axis=1)

A tidy convention is to organize the dataframe's columns with the dependent variable at the end. Let's apply the same to our merged dataset.

Code
# rearrange the columns
df_final = df_final[['id', 'sex', 'admission', 'age_first_date', 'age_at_examination',
                    'diagnosis', 'acl_igg', 'acl_igm', 'acl_iga','ana', 'ana_pattern', 
                    'got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre', 't-bil', 
                    't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb', 'hct', 'plt', 'u-pro', 'crp', 
                    'c3', 'c4', 'symptoms','thrombosis']]
df_final.head()
id sex admission age_first_date age_at_examination diagnosis acl_igg acl_igm acl_iga ana ... rbc hgb hct plt u-pro crp c3 c4 symptoms thrombosis
0 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.44 12.5 38.2 256 0 - NaN NaN AMI 1
1 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.5 11.8 37.7 199 NaN - NaN NaN AMI 1
2 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.76 12.8 39.4 199 NaN - NaN NaN AMI 1
3 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.76 12.3 39.4 183 NaN - NaN NaN AMI 1
4 14872 NaN NaN NaN NaN [MCTD, AMI] -0.114128 -0.035902 -0.038639 256 ... 4.43 11.7 36 215 NaN - NaN NaN AMI 1

5 rows × 35 columns

3.2.0.6 Missing data in the merged dataset

Let's visualize the missing data in the merged dataset.

  1. Using klib
Code
print('Missing Value Plot of df_final')
df_final_klib = klib.missingval_plot(df_final, figsize=(10,15))

In the plot above, a few columns show a similar trend in their missing values. Let's verify this using the 'missingno' library.

  2. Using missingno

Code
print('Correlation amongst missing values (df_final))')
df_final_msno = msno.heatmap(df_final, figsize=(20,15))

Some columns are still highly correlated with one another in terms of missing data.

For example, 'GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', and 'CRE' have missing-value correlations of approximately 1.

The column groups ('WBC', 'RBC', 'HGB', 'HCT', 'PLT') and ('C3', 'C4') are also highly correlated in terms of their missing values.

We need to analyze the missing data before handling it. Since this is a medical dataset, we cannot simply drop or modify such columns; we need to take precautions and then follow appropriate steps.
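One supplementary view (a sketch on top of the libraries already imported, not part of the original pipeline) is the missingno dendrogram, which clusters the columns of df_final by how similarly their values go missing and can help decide which blocks of columns to treat together:

Code
# cluster df_final's columns by the similarity of their missing-value patterns
df_final_dendro = msno.dendrogram(df_final, figsize=(20, 8))
plt.show()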

  3. Using UpSet
    Note: Since df_final has too many columns, the UpSet plot would be too large to be readable, so we skip it.

4 Exploratory analysis

Let's plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot displays the distribution of 'age_first_date', i.e. the age of patients when they first visited the hospital.

Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()

From the above plot, we infer two things:

1. There are more female patients than male patients.

2. The trend for both categories is broadly similar, with the highest count of patients in the 20-25 age range.

We also see that counts rise at the start of the age range regardless of sex, possibly due to vaccinations and other medical checks for infants and children. People older than 30 tend to visit the hospital less often for routine check-ups.

Based on this data alone, thrombosis appears to be a sex-predominant condition simply because the dataset contains far more female patients. Since we cannot determine from this data whether sex genuinely matters for thrombosis, this exposes a serious weakness and imbalance in the data.
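To quantify that imbalance, here is a quick tabulation of dataset A (a minimal check using the columns already cleaned above):

Code
# absolute and relative counts of patients by sex in A
print(A['sex'].value_counts())
print((A['sex'].value_counts(normalize=True) * 100).round(1))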

Let's do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis diagnoses have evolved over time.

Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts()
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')

The bar chart above highlights that the data may be biased. Most examinations are reported as negative for thrombosis, and among the positive cases, levels 2 and 3 are very rare. For instance, if we were to study the average IgA of patients with thrombosis level 3, the result might differ substantially from the true population average simply because so few such patients are present in the data.

It is crucial to emphasize this limitation before continuing with any analysis, as this is only a brief illustration of the potential problems that could arise.
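To put a concrete number on this imbalance, a minimal sketch counting examinations per thrombosis level in B:

Code
# number of examinations recorded for each thrombosis level (0 = negative)
print(B['thrombosis'].value_counts().sort_index())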

Let's look at the age distribution of patients when they were tested for thrombosis. This plot also serves as a temporal analysis.

Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()

The plot above again shows the uncertainty in the data. The kernel densities for the different thrombosis levels follow a similar pattern, except that far more patients tested negative than positive. Notably, the most severe thrombosis is concentrated in the 30-40 age group, while for the 15-30 age group the likelihood of a positive thrombosis result is low.

Now we will conduct a comprehensive analysis of thrombosis since it is important to gain a deeper understanding of its associated side effects and prevalence among different age groups and genders within the population.

Below is a density plot that shows the age distribution of people that have thrombosis vs people who do not.

Code
# filter df with patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df with patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]

# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']

# Filter outliers
age_t = age_t[age_t >2]
age_nt = age_nt[age_nt >2]

# Age distribution plot
sns.kdeplot(age_t, color="#f44336",fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797",fill=True, label="Patients without Thrombosis")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age(years)")

# Display plot
plt.show()

The density plot above shows that thrombosis does not seem to occur in people over 70. In fact, most thrombosis patients fall in the 20-60 age range, which is roughly the same as for patients without thrombosis. The only slight difference is that patients without thrombosis skew slightly younger within that range, but this does not appear to hold much relevance.
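A quick numeric summary of the two groups, reusing the age_t and age_nt series from the cell above, supports this reading:

Code
# summary statistics of age for patients with and without thrombosis
print(age_t.describe())
print(age_nt.describe())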

The next step is to see whether there are any differences within the thrombosis group, i.e. between the different severity levels of the disease.

Code
# filter the level of thrombosis disease into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]

# store the respective ages of patients as per the variables chosen above
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']

# only check for patients having age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]

# plot the kernel density plot using the age calculated above
sns.kdeplot(age_t1, color="#979797",fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black",fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336",fill=True, label="Level 3")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of differnet levels of Thrombosis")
# set x-label
plt.xlabel("Age(Years)")

# Display plot
plt.show()

Different levels of thrombosis turn out to have different age distributions: the age density of level 3 patients is most concentrated in the 30-40 range, while level 1 and level 2 patients are concentrated in the 15-35 and 0-30 ranges, respectively.

Let's look at box plots to check for outliers.

Code
# Subset columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]

# drop nan values from subset columns
lab_results = lab_results.dropna(subset=['ana']) # Only 4 missing values 
# change ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')

# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()

The data has outlier issues.

Setting the threshold at 100 for IgA, IgG, and IgM seems reasonable, and we use a threshold of 50 for ANA.

Code
lab_results['ana'] = lab_results['ana'].astype('int')
# keep anti-Cardiolipin antibody values below 100 and ana values below 50
lab_results = lab_results[(lab_results['acl_iga'] < 100) &  (lab_results['acl_igg'] < 100) &  
(lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]


# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()

The data is now in a better state to analyze (a handful of outliers remain).

Below is a radar chart comparing patients without thrombosis to those who have it (at its different levels) in terms of their averaged lab results.

Code
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()

The radar chart shows that the only lab results that show significant differences between different Thrombosis groups are:

ANA: the higher the ANA value, the more severe the thrombosis diagnosis. Thrombosis level 1 seems to have similar acl_iga and acl_igm values.
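A small check on this reading, reusing the outlier-filtered lab_results dataframe from above:

Code
# mean ANA value per thrombosis level on the filtered lab results
print(lab_results.groupby('thrombosis')['ana'].mean().round(2))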

A possible predictor of Thrombosis could be the symptoms a patient is experiencing.

Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombsosis (obvious word) + some words that dont make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()

The main symptoms of each thrombosis level are the following:

Level 1: Brain infarction

Level 2: CNS lupus

Level 3: Thrombocytopenia

The word clouds also highlight a major issue in the data mentioned previously, the lack of thrombosis patients, which is why there are fewer words as the thrombosis level becomes more severe.
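A minimal count of how many positive examinations actually carry a recorded symptom makes this shrinkage explicit:

Code
# examinations and recorded (non-'None') symptom entries per positive thrombosis level
for level in [1, 2, 3]:
    subset = B[B['thrombosis'] == level]
    n_symptoms = (subset['symptoms'] != 'None').sum()
    print(f"Thrombosis {level}: {len(subset)} examinations, {n_symptoms} with recorded symptoms")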

5 Final Plots

Code
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()



Code
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts()
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')



Code
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()



Code
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()



Code
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()

6 Technical summary

The quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are provided below:

  1. Missing Data: Missing values are a major problem in this set of data, particularly in TSUMOTO_C.CSV, where more than 80% of the columns have missing values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests, and some significant connections and patterns were likely missed because of it.

  2. Lack of Thrombosis cases: The major goals of this study were to gain a deeper understanding of thrombosis and to identify patterns in the data that might act as predictors of the condition. The data provided, however, contains extremely few thrombosis cases, particularly of the more severe levels (2 and 3), making any conclusions highly sensitive to the addition of even a few more cases.

  3. Data not balanced: When restricted to thrombosis-positive cases, the dataset TSUMOTO_B.CSV is heavily imbalanced, with about 85% of patients being female. This could lead to the misleading conclusion that thrombosis occurs considerably more often in females, and it prevented an accurate sex-based analysis.

  4. Dates: In some cases, there is a gap of more than 5 years between the dates of a patient's routine lab tests and their thrombosis testing. Because the lab results could have been drastically different had the tests been run closer to the thrombosis assessment, this weakens any analysis based on the lab data (TSUMOTO_C.CSV).

It is worth noting that all of the data was gathered between 1989 and 1996, a period in which data discrepancies were common across the industry and data-collection methods were less sophisticated. If more recent data is available, repeating this examination of thrombosis on it would be worthwhile.

Below are some of the main takeaways (elaborated further in results.html):

  1. Level 3 Thrombosis appears to occur in people on the higher end of the age distribution.

  2. ANA seems to be the best predictor among the lab measurements examined.

  3. There are various stages of thrombosis, and each level has its own set of symptoms.
Source Code
---
title: "EDA on Medical Data of Thrombosis Diagnosis"
author:
  - name: Raghav Sharma
    affiliations:
      - name: Georgetown University      
    url: https://github.com/anly503/hw2-spring-2023-raghavSharmaCode
    email: rs2190@georgetown.edu  
---

# Introduction

## Objective

Imagine yourself as a data scientist working at a consulting firm. An external entity (client) has paid a significant amount of money for your firm to make sense of a medical data set. They plan to incorporate your findings to help guide the allocation of a multi-million dollar research grant. They want the results in two weeks, after which point the contract terminates and your consulting firm will move onto a new unrelated project. <br><br>


Our job is to perform visual and non-visual exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses for the client.

## Data

Data has been collected from Georgetown University's Hospital and contain information about patients who have been admitted to the hospital's <b>Collagen</b> disease clinic. <br><br>

Collagen diseases are auto-immune diseases whose patients generate antibodies attacking to their bodies. For example, if a patient generates antibodies to lung, he/she will lose the respiratory function in a chronic course and finally lose their lives. <b>Thrombosis</b> is the most severe form of Collagen diseases. <br><br> 


There are 3 CSV files that make up for the medical data. These files are:<br><br>

1. TSUMOTO_A.CSV: Contains basic information about patients<br><br>

2. TSUMOTO_B.CSV: Contains information about the 'Special Laboratory Examinations' of patients to detect Thrombosis.<br><br>

3. TSUMOTO_C.CSV: Contains Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3) This dataset has temporal stamps.<br><br>


## Goals of the project

This project aims to thoroughly understand and analyze Thrombosis which is a Collagen disease by achieving the following:<br><br>

1. Search for good patterns which detect and predict thrombosis.<br><br>
2. Search for temporal patterns specific/sensitive to thrombosis.<br><br>
3. Search for good features which classifies collagen diseases correctly.<br><br>
4. Search for temporal patterns specific/sensitive to each collagen diseases.<br><br>

To achieve these goals, the data will be carefully analyzed to identify any significant patterns and relationships that can provide insights into the causes and characteristics of Thrombosis. By doing so, this project aims to contribute to a better understanding of Collagen diseases, particularly Thrombosis, and ultimately improve the diagnosis and treatment of these life-threatening conditions.

# Initial Questions

We would like to investigate the following questions prior to our analyses:<br><br>

1. How does thrombosis evolve over time?<br><br>
2. What age group is Thrombosis, the most prevalent in?<br><br>
3. Does it depends upon the sex of the patient?<br><br>
4. Is this dataset good enough to understand and visualize the factors causing and effects of Thrombosis?<br><br>


# Data Ingestion, Cleaning and Munging

Before doing any kind of analysis, it is important to properly understand the datasets in terms of shape, strengths, weaknesses, etc. <br><br>

## Ingestion

#### Importing the libraries
```{python}
# Importing all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import klib
import missingno as msno
from upsetplot import UpSet

from wordcloud import WordCloud, STOPWORDS
import random

import sys

import warnings
warnings.filterwarnings("ignore")
```


#### Importing the datasets

```{python}
# Redirect stdout and stderr to null device
sys.stdout = open('/dev/null', 'w')
sys.stderr = open('/dev/null', 'w')
```

1. <b>TSUMOTO_A.CSV</b>

```{python}
A = pd.read_csv("./data/TSUMOTO_A.CSV", encoding= 'unicode_escape')
A.head()
```

```{python}
"Shape of A: " + str(A.shape)
```

2. <b>TSUMOTO_B.CSV</b>

```{python}
B = pd.read_csv("./data/TSUMOTO_B.CSV", encoding= 'unicode_escape')
B.head()
```

```{python}
"Shape of B: " + str(B.shape)
```

3. <b>TSUMOTO_C.CSV</b> <br><br>

Note: A couple of lines in this file were causing issues when reading the dataset; using the "error_bad_lines=False" argument resolved that. A total of 30 lines (out of 50k+) were dropped, so the impact of this drop is minimal.

```{python}
# A couple of bad (unreadable) lines in the 3rd dataset were dropped. The dropped lines account for 30 (out of 50k+), 
# which makes this drop insignificant
C = pd.read_csv("data/TSUMOTO_C.CSV" , encoding= 'unicode_escape', error_bad_lines=False) 
C.head()
```

```{python}
"Shape of C: " + str(C.shape)
```


C is by far the largest dataframe that has information about general laboratory examinations.

## Cleaning and Munging

#### Analyzing Missing values

Searching for missing data is the first and most important stage in data cleaning. Checking for missing values for each column (per data set) would give a solid idea of which columns are necessary and which need to be adjusted or omitted as this project entails combining the datasets.<br><br>

It is very important to note that this is medical data. So if any of the dataset columns has a missing value, then it means that it wasn't recorded. We need to carefully analyze and then take appropriate steps to handle missing data.

We will use 'klib', 'missingno', and 'UpSet' libraries to visualize the missing data for all the datasets.<br><br>

1. The klib library helps us visualize missing-data trends in a dataset. Using its 'missingval_plot', we can extract the necessary information about the missing data in every column. <br><br>

2. missingno library will be helpful in plotting the correlation amongst the columns having missing values. This way we will be able to make a decision if a certain column needs to be dropped or not.<br><br>

3. UpSet library looks at the multivariate relationships among categorical variables and is applied to missing data.<br><br>


Let's create a function to compute missing-data statistics for a dataset.

```{python}
# define a function that returns a data-frame of missing data statistics
def missing_val_stats(df):
    # define columns of the data-frame
    df_stats = pd.DataFrame(columns = ['column', 'unique_val', 'num_unique_val', 'num_unique_val_nona', 
                                       'num_miss', 'pct_miss'])
    tmp = pd.DataFrame()
    
    for c in df.columns:
        # column
        tmp['column'] = [c]
        # unique values in the column
        tmp['unique_val'] = [df[c].unique()]
        # number of unique values in the column
        tmp['num_unique_val'] = len(list(df[c].unique()))
        # number of unique values in the column without nan
        tmp['num_unique_val_nona'] = int(df[c].nunique())
        # number of missing values in the column
        tmp['num_miss'] = df[c].isnull().sum()
        # percentage of missing values in the column
        tmp['pct_miss'] = (df[c].isnull().sum()/ len(df)).round(3)*100
        df_stats = df_stats.append(tmp)
    
    return df_stats
```

Let's take a look at the missing-value statistics of A.


```{python}
df_A_stats = missing_val_stats(A)
df_A_stats
```

Visualizing missing values in A <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_A.CSV)"
a_klib = klib.missingval_plot(A, figsize=(10,15))
```

A few columns have missing values, with 'First Date' reaching a maximum of about 20%. Missing values in these columns can be dealt with at later stages, when/if those columns are used.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_A.CSV)"
a_msno = msno.heatmap(A, figsize=(8, 5))
```

Nothing too significant for A.

3. Using UpSet
```{python}
"UpSet plot for TSUMOTO_A.CSV"
# check for nan values
a_upset = pd.isna(
  A[['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis']]
)

# using groupby for the missing values
a_upset = a_upset.groupby(['ID', 'SEX', 'Birthday', 'Description', 'First Date', 'Admission',
       'Diagnosis'], as_index=True).size()

# plot the groupby object
UpSet(a_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()
```

As stated earlier, there isn't much correlation amongst the missing values in A. The UpSet plot supports this: the figure shows few cases of two or more values missing together.<br><br>

Let's take a look at the missing-value statistics of B.

```{python}
df_B_stats = missing_val_stats(B)
df_B_stats
```

Visualizing missing values in B <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_B.CSV)"
b_klib = klib.missingval_plot(B, figsize=(10,15))
```

This dataset is a bit concerning because some of its columns are missing more than 70% of their data.
'KCT', 'RVVT', 'LAC', and 'Symptoms' pose a significant challenge in terms of replacing missing values.<br><br>

The 'Symptoms' column has the most missing values (726). In many cases a missing value may simply mean that the patient had no symptoms, so we will change these to 'None' for the later analysis. <br><br>

For the remaining columns it is challenging to gauge how the missing values are distributed and how they might affect the data. Replacing them would distort and bias the data, producing unrepresentative results.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_B.CSV)"
b_msno = msno.heatmap(B, figsize=(8,5))
```

Looking at the plot above, the missing values in 'KCT', 'RVVT', and 'LAC' are clearly correlated. Two of these three columns can be dropped, since they measure the same thing (degree of coagulation).
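To put a number on this, a quick hedged check of the missingness correlation for just these three columns could look as follows (computed directly rather than read off the heatmap):

```{python}
# Quick numeric check of the claim above: correlation between the
# missing-value indicators for KCT, RVVT and LAC.
B[['KCT', 'RVVT', 'LAC']].isna().astype(int).corr().round(2)
```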

3. Using UpSet
```{python}
"UpSet plot for TSUMOTO_B.CSV"
# check for missing values
b_upset = pd.isna(
  B[['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis']]
)

# using groupby for the missing values
b_upset = b_upset.groupby(['ID', 'Examination Date', 'aCL IgG', 'aCL IgM', 'ANA', 'ANA Pattern',
       'aCL IgA', 'Diagnosis', 'KCT', 'RVVT', 'LAC', 'Symptoms', 'Thrombosis'], as_index=True).size()

# plot the groupby object
UpSet(b_upset).plot(fig= plt.figure(figsize=(8, 5)));
plt.show()
```

We see the correlation quite clearly in the plot: the bar is highest when 'KCT', 'RVVT', 'LAC', and 'Symptoms' are missing together. A similar pattern can be seen for other columns. <br><br>

Let's take a look at the missing-value statistics of C.

```{python}
df_C_stats = missing_val_stats(C)
df_C_stats
```

Visualizing missing values in C <br><br>

1. Using klib
```{python}
"Missing Value Plot (TSUMOTO_C.CSV)"
c_klib = klib.missingval_plot(C, figsize=(10,15))
```

Our third dataset turns out to be the most problematic, with over half (19 out of 44) of the columns having more than 60% of their values missing, including 15 columns with more than 80%. The fundamental problem is that most of these columns are numerical and could have provided context for the analysis, but in their current state they are of little use.

2. Using missingno
```{python}
"Correlation amongst missing values (TSUMOTO_C.CSV)"
c_msno = msno.heatmap(C, figsize=(20,15))
```

There is a high correlation between many of the columns with missing data. Since this dataset contains general laboratory examinations, a likely explanation is that a patient simply did not take a particular lab panel; because many of these results come from the same procedure, skipping one affects numerous measurements at once.<br><br>

3. Using UpSet <br>
Note: Since C has too many columns, a full UpSet plot would be too large to be useful, so we skip it here.<br><br>
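If an UpSet view of C is still wanted, one hedged workaround is to restrict the plot to a small subset of columns, for example the eight with the largest share of missing values. This is only a sketch and is not part of the analysis above; the choice of eight columns is arbitrary.

```{python}
# Hedged workaround (illustrative only, not used in the analysis): restrict the
# UpSet plot for C to the 8 columns with the largest share of missing values.
top_missing_cols = C.isna().mean().sort_values(ascending=False).head(8).index.tolist()

# same pattern as the UpSet plots above, but on the selected columns only
c_upset = pd.isna(C[top_missing_cols])
c_upset = c_upset.groupby(top_missing_cols, as_index=True).size()

UpSet(c_upset).plot(fig=plt.figure(figsize=(8, 5)));
plt.show()
```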

These hypotheses will become clearer once missing values are handled and new heatmaps are plotted, which is part of the next steps.<br><br>


#### Taking care of missing data 

As previously discussed, missing values are a big obstacle in these datasets, so dealing with them is a priority before carrying out any analysis. <br><br>


Missing data for dataset A can be handled during cleaning and munging. As for Dataset B, we need to implement the following:<br><br>

1. Dropping correlated columns <br><br>

2. Modifying 'Symptoms' column<br><br>

```{python}
# Drop correlated columns
B = B.drop(columns=['RVVT', 'LAC'], axis=1)
# Modify symptoms column
B[['Symptoms']] = B[['Symptoms']].fillna("None")
```

Any columns with more than 70% missing values should also be dropped since the probable bias and skewness caused by data imputation is not worth it. Columns with missing values in the 50–60% range will be preserved and individually assessed.

```{python}
# drop the columns that have more than 70% of their data missing
A = A.loc[:, A.isna().mean() < 0.7]
B = B.loc[:, B.isna().mean() < 0.7]
C = C.loc[:, C.isna().mean() < 0.7]
```

#### Cleaning data to make it tidier

It is a good idea to rename the columns to keep the datasets organized before continuing with the analysis. All column names should be lowercase, with underscores ('_') in place of spaces (' ').

```{python}
def rename_columns(df):
    for i in range(len(df.columns)):
        # check for number of words per column name
        if len(df.columns[i].split()) != 1:
            # rename column as mentioned above
            df.rename(columns={str(df.columns[i]): (str(df.columns[i]).lower().split()[0]) + '_' + (str(df.columns[i]).lower().split()[1])}, inplace=True)
        # in case of only word, follow the same idea
        else:
            df.rename(columns={str(df.columns[i]):str(df.columns[i]).lower()}, inplace=True)

rename_columns(A)
rename_columns(B)
rename_columns(C)
```
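As a side note, for these particular files the same renaming can be written more compactly. This is just an equivalent hedged sketch, under the assumption (based on the tables above) that column names contain at most a single space; it is not a replacement for the function above.

```{python}
# Hedged, more compact equivalent of rename_columns() for these files.
# The rename has already been applied above, so re-running this is a no-op;
# shown purely for illustration.
A.columns = [c.lower().replace(' ', '_') for c in A.columns]
A.columns.tolist()
```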


Some columns in dataset A store dates as plain strings (object dtype). Let's convert them to a consistent format.

```{python}
def form_date(val):
    # check for nan values
    if not pd.isna(val):
        # Case 1, separator = '.'
        if '.' in val:
            date_list = val.split(".")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # Case 2, separator = '/'
        elif '/' in val:
            date_list = val.split("/")
            year = date_list[0]
            month = date_list[1]
            day = date_list[2]

        # form year
        year = '19' + year[-2:]
        # form month
        if int(month) < 10:
            month = '0'+ str(int(month))
        # form day
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert relevant and required date columns
A['birthday'] = A['birthday'].apply(form_date)
A['first_date'] = A['first_date'].apply(form_date)
```

Now that the date-conversion function is defined, we can calculate each patient's age.

```{python}
from datetime import datetime, date

# Function to calculate age
def calculate_age(val1, val2):
    # check for nan values
    if not pd.isna(val1) and not pd.isna(val2):
        # Identify given date as date, month and year
        born = datetime.strptime(val1, "%Y/%m/%d").date()
        curr_date = datetime.strptime(val2, "%Y/%m/%d").date()
        # return age
        return curr_date.year - born.year - ((curr_date.month, curr_date.day) < (born.month, born.day))
    else:
        return np.nan
```

```{python}
# Calculating age of every patient at the time when they first came to the hospital
age_first_date = []
for val1, val2 in zip(A['birthday'], A['first_date']):
    age_first_date.append(calculate_age(val1, val2))

# appending the age column to the dataframe
A['age_first_date'] = age_first_date
```
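As a hedged cross-check (purely illustrative; the loop above is what the analysis uses), the same ages can also be computed in a vectorized way:

```{python}
# Hedged vectorized cross-check of the loop above (illustrative only, not used
# later): parse the formatted dates with pandas and take the year difference,
# subtracting 1 where the birthday has not yet occurred in that year.
born = pd.to_datetime(A['birthday'], format='%Y/%m/%d', errors='coerce')
first = pd.to_datetime(A['first_date'], format='%Y/%m/%d', errors='coerce')
not_yet_birthday = (first.dt.month < born.dt.month) | (
    (first.dt.month == born.dt.month) & (first.dt.day < born.dt.day)
)
age_check = first.dt.year - born.dt.year - not_yet_birthday.astype(int)
age_check.head()
```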

Now that we have calculated the ages, we no longer need the 'description' and 'first_date' columns. We should also drop negative age values, since they are clearly erroneous.

Note: We keep the 'birthday' column, since it will be needed later together with the examination date. 
Dropping columns always carries some risk, but these two have served their purpose and are no longer relevant to our goal.

```{python}
# Drop irrelevant columns from A
A = A.drop(columns=['description', 'first_date'],  axis=1)

# Filter dataframe to drop absurd values of age
A = A[A.age_first_date > 0]
```

If we take a look at the 'admission' column in A, we see that apart from '+', '-' and NaN values, it also contains some nonsensical entries. Let's clean that column so it does not cause issues later on.

```{python}
# define a function for the admission column to be formatted appropriately
def update_admit(val):
    # check for nan values
    if not pd.isna(val):
        if '+' in val:
            # if '+' in value, then update the entire string value with '+'
            val = val[val.index('+')]
        elif '-' in val:
            # if '-' in value, then update the entire string value with '-'
            val = val[val.index('-')]
        else:
            # if either of the above literals are not present, then that means it's irrelevant. This value won't be dropped since this is medical data, but it will be replaced with nan since it wasn't recorded.
            val = np.nan
    return val
```

```{python}
# Update the admission column
A['admission'] = A['admission'].apply(update_admit)
A.head()
```

Moving on to dataset B. <br><br>
B has 3 anti-Cardiolipin antibody columns whose values are on different scales. We should normalize (z-score) them, since they may be useful later for temporal analysis.

```{python}
# Normalizing values of the 3 different anti-Cardiolipin antibody column
B['acl_igg'] = (B['acl_igg']-B['acl_igg'].mean())/B['acl_igg'].std()
B['acl_igm'] = (B['acl_igm']-B['acl_igm'].mean())/B['acl_igm'].std()
B['acl_iga'] = (B['acl_iga']-B['acl_iga'].mean())/B['acl_iga'].std()
```

Also notice that the 'id' column in B has missing values. Since this dataset contains our dependent variable and covers the patients tested for it, rows without an id cannot be linked to the other tables and could produce misleading results.<br><br>

These are only 36 rows out of 806, so dropping them won't create any significant issue.

```{python}
# Drop NAs in the ID column of B since they are the main unit of analysis (total 36 rows dropped out of 806)
B = B.dropna(subset=['id'])
# change dtype of B.id to int for merging
B['id'] = B['id'].astype('int')
```

Also, the 'ana' column has a non-numeric value of '>4096' for the anti-nucleus antibody measure. Let's fix that.
```{python}
# filter dataset B for correct ana values
B = B[B.ana != '>4096']
```

That concludes B; let's move on to our third dataset. <br><br>

In C, the date column is stored as an integer. Let's fix that.


```{python}
# Converting date column from int to date
def int_to_date(val):
    # check for nan values
    if not pd.isna(val):
        val = str(val)
        # form year
        year = '19' + val[:2]
        # form month
        month = val[2:4]
        # form day
        day = val[4:]
        # add a 0 to month to denote months before October
        if int(month) < 10:
            month = '0'+ str(int(month))
        # add a 0 to day to denote days before the 10th of every month
        if int(day) < 10:
            day = '0' + str(int(day))
    
        return year + '/' + month + '/' + day
    else:
        return val
```

```{python}
# Convert dtype of date column to date
C['lab_date'] = C['date'].apply(int_to_date)

# Drop 'date' column since now it has been updated 
C = C.drop(columns=['date'], axis=1)
C.head()
```

#### Check individual datasets before merging

Let's have one final look at each of our datasets before merging.<br><br>

1. TSUMOTO_A.CSV
```{python}
A.head()
```

Check for missing data statistics
```{python}
df_A_stats_new = missing_val_stats(A)
df_A_stats_new
```

We see that there are very few missing values left in total (30, including those in 'sex' and 'admission'). Given such a small number, we can drop these rows.

```{python}
# drop nan values from the subset columns
A = A.dropna(subset=['sex', 'admission'])
```

After removing NAs from the above two columns, we have no missing values for our first dataset. <br><br>


2. TSUMOTO_B.CSV
```{python}
B.head()
```

Check for missing data statistics

```{python}
df_B_stats_new = missing_val_stats(B)
df_B_stats_new
```

We see that there is still some missing data that can't be just dropped. This can be handled after merging.

3. TSUMOTO_C.CSV
```{python}
C.head()
```

Check for missing data statistics

```{python}
df_C_stats_new = missing_val_stats(C)
df_C_stats_new
```

The large number of missing values still accounts for considerable uncertainty in this dataset.

#### Merging the datasets

In terms of joining data sets, there are multiple ways of doing so depending on the question being answered.<br><br>

We will first merge A with C, since A contains basic information about the patients and C holds the general laboratory examinations stored in the hospital information system. Merging these filters out the ids of patients who did not undergo those examinations. <br><br>

The last step is then to merge the combined dataframe with B.<br><br>

Since B contains our target variable (Thrombosis), it covers the patients that were tested via the 'Special Laboratory Examination'. Merging B at the end gives us the final dataset that we can use for EDA and to strengthen our final conclusion.<br><br>

Since 'ID' is the only common column amongst the 3 datasets, it will be used in the 'left' join for merging.<br><br>


##### Step 1: Merge A with C

```{python}
# Merging A with C using left join on id column
C_A = C.merge(A, on='id', how='left')
C_A.head()
```

Check the shape.
```{python}
C_A.shape
```

##### Step 2: Merge combined (C_A) with B

```{python}
# Merging C_A with B using left join on id column
df_final = B.merge(C_A, on='id', how='left')
df_final.head()
```

Check the shape.
```{python}
df_final.shape
```

Since 'diagnosis' was present in both B and C_A, the merged dataframe has two diagnosis columns. Looking carefully, their values largely overlap, so let's combine them into a single updated 'diagnosis' column for df_final.

```{python}
for i in range(len(df_final)):
    # check for nan in diagnosis_x, if false then split the values
    if df_final["diagnosis_x"].isna()[i] == False:
        df_final["diagnosis_x"][i] = df_final["diagnosis_x"][i].split(",")
    # check for nan in diagnosis_y, if false then split the values
    if df_final["diagnosis_y"].isna()[i] == False:
        df_final["diagnosis_y"][i] = df_final["diagnosis_y"][i].split(",")
        
# defining an empty list to store diagnosis as column in the dataframe later on
df_final["diagnosis"] = ""

for i in range(len(df_final)):
    # fill the value with diagnosis_x if diagnosis_y is nan
    if (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == True):
        df_final["diagnosis"][i] = df_final["diagnosis_x"][i]
    # fill the value with diagnosis_y if diagnosis_x is nan
    elif (df_final["diagnosis_x"].isna()[i] == True) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = df_final["diagnosis_y"][i]
    # fill the value with the list of the set of values in diagnosis_x and diagnosis_y if both are not nan
    elif (df_final["diagnosis_x"].isna()[i] == False) & (df_final["diagnosis_y"].isna()[i] == False):
        df_final["diagnosis"][i] = list(set(df_final["diagnosis_x"][i] + df_final["diagnosis_y"][i]))
```
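The row-wise loop above is hard to read at a glance. A roughly equivalent, hedged sketch of the same combination logic, written against raw comma-separated strings, is shown below; it is not applied to df_final, whose diagnosis columns have already been split in place by the loop.

```{python}
# Roughly equivalent combination logic (illustrative only): take the union of
# the comma-separated diagnoses from both columns, treating NaN as empty.
def combine_diagnoses(x, y):
    x_list = x.split(',') if isinstance(x, str) else []
    y_list = y.split(',') if isinstance(y, str) else []
    merged = list(set(x_list + y_list))
    return merged if merged else ""

# Example: combine_diagnoses("a,b", "b") -> ['a', 'b'] (order may vary)
```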

Recall that we kept the 'birthday' column earlier precisely so we could calculate each patient's age at the time of their thrombosis lab test. 

Now that the datasets have been merged, we can use the 'birthday' and 'examination_date' columns to do so.

```{python}
# convert examination_date to date format
df_final['examination_date'] = df_final['examination_date'].apply(form_date)

# Calculate age at the time of examination
age_at_examination = []
for val1, val2 in zip(df_final['birthday'], df_final['examination_date']):
    age_at_examination.append(calculate_age(val1, val2))

# add the column to the dataframe
df_final['age_at_examination'] = age_at_examination
```

Now that we have used these columns, we can drop them.

```{python}
# drop irrelevant columns
df_final = df_final.drop(columns=['examination_date', 'diagnosis_x', 'lab_date', 'birthday', 'diagnosis_y'], axis=1)
```

It is good practice to organize a dataframe's columns with the dependent variable at the end. Let's do that for our merged dataset.

```{python}
# rearrange the columns
df_final = df_final[['id', 'sex', 'admission', 'age_first_date', 'age_at_examination',
                    'diagnosis', 'acl_igg', 'acl_igm', 'acl_iga','ana', 'ana_pattern', 
                    'got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre', 't-bil', 
                    't-cho', 'tg', 'cpk', 'wbc', 'rbc', 'hgb', 'hct', 'plt', 'u-pro', 'crp', 
                    'c3', 'c4', 'symptoms','thrombosis']]
df_final.head()
```


#### Missing data in the merged dataset

Let's visualize the missing data in the merged dataset.<br><br>

1. Using klib
```{python}
print('Missing Value Plot of df_final')
df_final_klib = klib.missingval_plot(df_final, figsize=(10,15))
```

The plot above shows a few columns with a similar trend in their missing values. Let's verify this using the 'missingno' library.

2. Using missingno
```{python}
print('Correlation amongst missing values (df_final)')
df_final_msno = msno.heatmap(df_final, figsize=(20,15))
```

Some columns are still strongly correlated with one another in terms of missing data.

For example, 'GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', and 'CRE' have missingness correlations of approximately 1.<br><br>

The column groups ('WBC', 'RBC', 'HGB', 'HCT', 'PLT') and ('C3', 'C4') are likewise highly correlated in their missing values.<br><br>

We need to analyze this missing data carefully before handling it. Since this is a medical dataset, we cannot simply drop or impute such columns; we need to take precautions and follow appropriate steps (one conservative option is sketched after the note below).<br><br>

3. Using UpSet<br>
Note: Since df_final has too many columns, a full UpSet plot would be too large to be useful, so we skip it here.<br><br>
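As a concrete example of the cautious option mentioned above, one could keep explicit missingness indicators for the correlated lab columns instead of imputing them. This is only a hedged sketch and is not applied to df_final.

```{python}
# Hedged sketch (not applied): per correlated lab column, flag whether the test
# was recorded at all, so later analyses can separate "not recorded" from a
# measured value. The column names follow the merged dataframe above.
corr_lab_cols = ['got', 'gpt', 'ldh', 'alp', 'tp', 'alb', 'ua', 'un', 'cre']
missing_flags = df_final[corr_lab_cols].isna().add_suffix('_missing')
missing_flags.mean().round(2)  # share of rows where each test was not recorded
```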


# Exploratory analysis

Let's plot a histogram of the temporal data from the first dataset (TSUMOTO_A.CSV). This plot shows the distribution of 'age_first_date', i.e. the age of patients when they first visited the hospital.

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()
```

From the above plot, we infer two things:<br><br>
1. There are more female patients than male patients.<br><br>
2. The trend for both sexes is broadly similar, with the highest count of patients in the 20-25 age range.<br><br>


We also see that the count rises at the very young end of the age range irrespective of sex, possibly because of vaccinations and other routine medical checks for infants and children. Patients older than about 30 visit the hospital less often for check-ups. <br><br>

At first glance, thrombosis appears to be a sex-predominant condition, but only because the dataset itself contains far more female individuals. Since we cannot determine from this data alone whether sex actually matters for thrombosis, this exposes a serious weakness and imbalance in the data.


Let's do another temporal analysis, this time using the second dataset (TSUMOTO_B.CSV), to see how thrombosis diagnoses have evolved over time.

```{python}
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

The chart above highlights that the data may be biased. The majority of examinations are negative for thrombosis, which can lead to incorrect inferences, and among the positive cases levels 2 and 3 are very rare. For instance, if we were to study the average IgA of patients with thrombosis level 3, the result could differ substantially from the true population average simply because the sample is so small.<br><br>

Before continuing with any analysis, it is crucial to emphasize this limitation because this is only a brief illustration of the potential problems that could arise.<br><br>
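To make the imbalance concrete, the brief check below simply counts how many examinations fall into each thrombosis level:

```{python}
# Count of examinations per thrombosis level, quantifying how rare the positive
# cases (especially levels 2 and 3) are.
temporal_B['thrombosis'].value_counts().sort_index()
```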

Let's look at the age distribution of patients when they were tested for thrombosis. This plot is also part of the temporal analysis.

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()
```

The plot above again reflects the uncertainty in the data. The kernel densities for the different thrombosis levels show similar patterns, except that far more patients tested negative than positive. Notably, the most severe thrombosis appears in the 30-40 age group, while for the 15-30 age group the likelihood of a positive thrombosis test is low. <br><br>

Next we analyze thrombosis more closely, since it is important to understand its associated symptoms and its prevalence across age groups and sexes in this population.

Below is a density plot comparing the age distribution of patients with thrombosis and patients without it.

```{python}
# filter df with patients that have thrombosis
thrombosis_patients = df_final[df_final['thrombosis'] != 0]
# filter df with patients that don't have thrombosis
no_thrombosis_patients = df_final[df_final['thrombosis'] == 0]

# age at examination of patients with thrombosis
age_t = thrombosis_patients['age_at_examination']
# age at examination of patients without thrombosis
age_nt = no_thrombosis_patients['age_at_examination']

# Filter outliers
age_t = age_t[age_t >2]
age_nt = age_nt[age_nt >2]

# Age distribution plot
sns.kdeplot(age_t, color="#f44336",fill=True, label="Patients with Thrombosis")
sns.kdeplot(age_nt, color="#979797",fill=True, label="Patients without Thrombosis")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of Patients with respect to Thrombosis")
# set x-axis label
plt.xlabel("Age(years)")

# Display plot
plt.show()
```

The density plot above shows that thrombosis does not seem to occur in people older than about 70. In fact, most thrombosis patients are aged 20-60, roughly the same range as patients without thrombosis. The only slight difference is that patients without thrombosis skew a little younger within that range, but this does not appear to be meaningful.

The next step is to see whether there are any differences within the thrombosis group, i.e. between the different levels of the disease's severity.

```{python}
# filter the level of thrombosis disease into 3 separate variables
t1 = df_final[df_final['thrombosis'] == 1]
t2 = df_final[df_final['thrombosis'] == 2]
t3 = df_final[df_final['thrombosis'] == 3]

# store the respective ages of patients as per the variables chosen above
age_t1 = t1['age_at_examination']
age_t2 = t2['age_at_examination']
age_t3 = t3['age_at_examination']

# only check for patients having age > 2
age_t1 = age_t1[age_t1 > 2]
age_t2 = age_t2[age_t2 > 2]
age_t3 = age_t3[age_t3 > 2]

# plot the kernel density plot using the age calculated above
sns.kdeplot(age_t1, color="#979797",fill=True, label="Level 1")
sns.kdeplot(age_t2, color="black",fill=True, label="Level 2")
sns.kdeplot(age_t3, color="#f44336",fill=True, label="Level 3")

# Add legend
plt.legend()
# set title
plt.title("Age Distribution of different levels of Thrombosis")
# set x-label
plt.xlabel("Age(Years)")

# Display plot
plt.show()
```

Different levels of thrombosis turn out to have different age distributions: the density for level 3 patients is concentrated in the 30-40 age range, while levels 1 and 2 are concentrated around 15-35 and 0-30 respectively.<br><br>
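As a small hedged numeric companion to the density plots, the median examination age per thrombosis level (excluding the same very young ages filtered above) can be computed directly:

```{python}
# Median age at examination per thrombosis level, with ages <= 2 excluded to
# match the filter used for the density plots above.
ages = pd.to_numeric(df_final['age_at_examination'], errors='coerce')
df_ages = df_final.assign(age_num=ages)
df_ages.loc[df_ages['age_num'] > 2].groupby('thrombosis')['age_num'].median()
```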


Let's look at box plots to check for outliers.
```{python}
# Subset columns of interest
lab_results = df_final[['acl_igg', 'acl_igm', 'acl_iga', 'ana', 'thrombosis']]

# drop nan values from subset columns
lab_results = lab_results.dropna(subset=['ana']) # Only 4 missing values 
# change ana column dtype to int
lab_results['ana'] = lab_results['ana'].astype('int')

# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()
```

The data clearly has outlier issues.

Setting the threshold at 100 for IgA, IgG, and IgM makes sense, and ANA is capped at 50 (the filter applied below).

```{python}
lab_results['ana'] = lab_results['ana'].astype('int')
# setting the threshold at 100 for the anti-Cardiolipin antibodies and 50 for ana
lab_results = lab_results[(lab_results['acl_iga'] < 100) &  (lab_results['acl_igg'] < 100) &  
(lab_results['acl_igm'] < 100) & (lab_results['ana'] < 50)]


# create subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# create box plots
boxp = lab_results[['acl_igg', 'acl_igm', 'acl_iga', 'ana']].astype(float)

# plot boxplot
axs[0, 0].boxplot(boxp['acl_iga'])
# set title of boxplot
axs[0, 0].set_title('acl_iga')

# plot boxplot
axs[0, 1].boxplot(boxp['acl_igg'])
# set title of boxplot
axs[0, 1].set_title('acl_igg')

# plot boxplot
axs[1, 0].boxplot(boxp['acl_igm'])
# set title of boxplot
axs[1, 0].set_title('acl_igm')

# plot boxplot
axs[1, 1].boxplot(boxp['ana'])
# set title of boxplot
axs[1, 1].set_title('ana')

# add x-axis label
for ax in axs.flat:
    ax.set_xlabel('Feature')

# add y-axis label
for ax in axs.flat:
    ax.set_ylabel('Value')

# add overall title
plt.suptitle('Box plots of features', fontsize=16)

# adjust layout
plt.tight_layout()

# show plot
plt.show()
```

The data is now in a better state to analyze, though a handful of outliers remain. 
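As an alternative to the fixed cutoffs chosen above, an interquartile-range rule could be used to pick thresholds. The sketch below only reports the bounds such a rule would imply; it is not applied to the data.

```{python}
# Hedged alternative to the fixed cutoffs: bounds implied by a 1.5*IQR rule per
# column, shown for comparison only (not applied to lab_results).
def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

bounds = {col: iqr_bounds(lab_results[col].astype(float))
          for col in ['acl_igg', 'acl_igm', 'acl_iga', 'ana']}
pd.DataFrame(bounds, index=['lower', 'upper']).round(2)
```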

Below is a radar chart comparing patients without thrombosis to those with it (across the different severity levels) in terms of their averaged lab results.

```{python}
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

The radar chart shows that the only lab result with a significant difference between the thrombosis groups is:

ANA: the higher the ANA value, the more severe the thrombosis diagnosis.
Thrombosis level 1 seems to have similar acl_iga and acl_igm values.

A possible predictor of Thrombosis could be the symptoms a patient is experiencing.

```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```

The main symptoms of each thrombosis level are the following:

Level 1: Brain infarction

Level 2: CNS lupus

Level 3: Thrombocytopenia

The word clouds also highlight a major issue in the data mentioned previously: the lack of Thrombosis patients, which is why there are fewer words as the level of Thrombosis becomes more severe.

# Final Plots

```{python}
# define figure size
plt.figure(figsize=(8, 5))
# plot the histogram using dataset A, age_first_date as x and sex as hue
sns.histplot(data=A, x="age_first_date", hue="sex", multiple="dodge", bins=30, kde=True)
# define axis labels
plt.xlabel("Age(years)")
plt.ylabel("Count")
# put title on the plot
plt.title('Age Distribution of patients when they first visited the hospital')
# set margins for x and y axis
plt.margins(x=0.01, y=0.1)
plt.show()
```

<br><br>

```{python}
# drop nan values from subset columns and then sort by examination_date
temporal_B = B.dropna(subset=['examination_date' , 'thrombosis']).sort_values(by=['examination_date'])

# convert the examination_date column dtype to pandas datetime
temporal_B['examination_date'] = pd.to_datetime(temporal_B['examination_date'])

# Extract the year from the datetime and create a new column
temporal_B['year'] = temporal_B['examination_date'].dt.year

# count of thrombosis patients per year
thrombosis_counts = temporal_B.groupby('year')['thrombosis'].value_counts().unstack()

# plot barchart of the thrombosis counts per year
ax = thrombosis_counts.plot(kind='bar', stacked=True, figsize=(8, 5), color=
['#979797', '#f44336', "#C2CD23", "#918BC3"])
# set title of the plot
ax.set_title("Thrombosis diagnosis over the years")
# define labels of the axis
ax.set_xlabel('Year')
ax.set_ylabel('Count')
```

<br><br>

```{python}
# drop nan values from the subset column
age_exam_nona = df_final.dropna(subset=['age_at_examination'])
# convert age_at_examination dtype to int
age_exam_nona['age_at_examination'] = age_exam_nona['age_at_examination'].astype('int')

# define figure size
plt.figure(figsize=(8, 5))
# plot histogram using df_final as data, age_at_examination as x and thrombosis as hue
ax = sns.histplot(data=df_final, x="age_at_examination", hue="thrombosis", multiple="dodge", bins=20, kde=True)

# define labels of axis
plt.xlabel("Age(years)")
plt.ylabel("Count")
# set title of the plot
plt.title('Age distribution of patients when they got tested for thrombosis')
# define margins of the x and y axis
plt.margins(x=0.005, y=0.1)
plt.show()
```

<br><br>

```{python}
# Normalize columns to have them all in the same scale 
from sklearn.preprocessing import StandardScaler

# using standard scaler object for feature scaling
scaler = StandardScaler()
# choose all columns except thrombosis
cols_to_normalize = [col for col in lab_results.columns if col != 'thrombosis']
# normalize the column values and put those in a new dataframe
lab_results_norm = pd.DataFrame(scaler.fit_transform(lab_results.loc[:, cols_to_normalize]), columns=cols_to_normalize, index=lab_results.index)
lab_results_norm['thrombosis'] = lab_results['thrombosis']

# Get the means of each Thrombosis group
means = lab_results_norm.groupby('thrombosis').mean().reset_index()

categories = ['acl_iga', 'acl_igg', 'acl_igm', 'ana']

# store each respective mean in a different variable
variable1 = means.iloc[0, :].values
variable2 = means.iloc[1, :].values
variable3 = means.iloc[2, :].values
variable4 = means.iloc[3, :].values

label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(variable1))

# Set colors and fill
colors = ['#4DBBD5', '#00A087', '#E64B35', '#3C5488']
fill_colors = ['#8ED4EB', '#4DD0C1', '#FF987D', '#7081BB']

# define figure size
plt.figure(figsize=(8.5, 8.5))
# add subplots
plt.subplot(polar=True)

# plot the graph for the first mean
plt.fill_between(label_loc, variable1, 0, alpha=0.2, color=fill_colors[0])
plt.plot(label_loc, variable1,
         color=colors[0], label='No Thrombosis', linewidth=2)

# plot the graph for the second mean
plt.fill_between(label_loc, variable2, 0, alpha=0.2, color=fill_colors[1])
plt.plot(label_loc, variable2,
         color=colors[1], label='Thrombosis 1', linewidth=2)

# plot the graph for the third mean
plt.fill_between(label_loc, variable3, 0, alpha=0.2, color=fill_colors[2])
plt.plot(label_loc, variable3,
         color=colors[2], label='Thrombosis 2', linewidth=2)

# plot the graph for the fourth mean
plt.fill_between(label_loc, variable4, 0, alpha=0.2, color=fill_colors[3])
plt.plot(label_loc, variable4,
         color=colors[3], label='Thrombosis 3', linewidth=2)

# set title
plt.title('Lab results averages of patients', size=24)

lines, labels = plt.thetagrids(np.degrees(label_loc), labels=[
                               ''] + categories, color='gray')

# set legend
plt.legend(loc='lower right' , bbox_to_anchor=(0.5, -0.3), ncol=2)
plt.tight_layout()
plt.show()
```

<br><br>

```{python}
from wordcloud import WordCloud, STOPWORDS
import random
import matplotlib.pyplot as plt

# Define the thrombosis categories to iterate over
thrombosis_categories = [1, 2, 3]

# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(9, 6))

# Iterate over the thrombosis categories and plot a word cloud in each column
for i, thrombosis_category in enumerate(thrombosis_categories):

    thrombosis_p = B[B['thrombosis'] == thrombosis_category]

    # Concatenate all the strings in the "Symptoms" column into a single string
    text = ' '.join(thrombosis_p['symptoms'])

    # Create a set of stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(['Thrombosis', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'öxîîÉ',
                      'ë¦ÄêÉÈû¼îîÉ ÅÌ ', 'ÅKè´ù¼ÄY', 'Budd-Chiari', 'ë¦ÄêÉÈû¼îîÉÅÌ', 'ÅKè´ù¼ÄY'])  # Excluding thrombosis (the obvious word) plus some garbled words that don't make sense

    # Define the color palette
    custom_colors = ['#121111', '#3646f4']
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return random.choice(custom_colors)

    # Create a WordCloud object with the custom color palette
    wordcloud = WordCloud(width=400, height=400, background_color='white',
                          color_func=color_func, stopwords=stopwords).generate(text)

    # Plot the word cloud in the corresponding subplot
    ax = axes[i]
    ax.imshow(wordcloud)
    ax.axis("off")
    ax.set_title(f"Thrombosis {thrombosis_category}")

# Adjust the layout of the figure and display it
plt.tight_layout()
plt.show()
```



# Technical summary

The quality of the data provided for this investigation was far lower than anticipated, especially considering that it came from such a prestigious institution. The main issues with the data are provided below:<br><br>

1. Missing Data: Missing values are a major problem in this set of data, particularly in TSUMOTO_C.CSV, where more than 80% of the columns have missing values. This was the biggest barrier to better understanding thrombosis and its connection to routine lab tests, and some significant connections and patterns were likely missed because of it.<br><br>

2. Lack of Thrombosis cases: The major goals of this study were to gain a deeper understanding of thrombosis and to identify patterns in the data that might act as predictors of the condition. The data provided, however, contains extremely few thrombosis cases, particularly of the more severe levels (2 and 3), making any conclusions highly sensitive to the addition of even a few more cases.<br><br>

3. Data not balanced: When restricted to thrombosis-positive cases, the dataset TSUMOTO_B.CSV is heavily imbalanced, with about 85% of patients being female. This could lead to the misleading conclusion that thrombosis occurs considerably more often in females, and it prevented an accurate sex-based analysis (a quick check of this split is sketched after this list).<br><br>

4. Dates: In some cases, there is a gap of more than 5 years between the dates of a patient's routine lab tests and their thrombosis testing. Because the lab results could have been drastically different had the tests been run closer to the thrombosis assessment, this weakens any analysis based on the lab data (TSUMOTO_C.CSV).<br><br>
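As a quick hedged check of the imbalance described in point 3, the sex split among unique thrombosis-positive patients in the merged data can be computed as follows; this is one simple way to approximate the split, not a definitive figure.

```{python}
# Sex split among unique thrombosis-positive patients in the merged data,
# illustrating the imbalance described in point 3.
positives = df_final[df_final['thrombosis'] > 0].drop_duplicates(subset='id')
positives['sex'].value_counts(normalize=True).round(2)
```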

It is worth noting that all of the data was gathered between 1989 and 1996, a period in which data discrepancies were common across the industry and data-collection methods were less sophisticated. If more recent data is available, repeating this examination of thrombosis on it would be worthwhile.<br><br>

Below are some of the main takeaways (elaborated further in results.html):<br><br>

1. Level 3 Thrombosis appears to occur in people on the higher end of the age distribution.<br><br>
2. ANA seems to be the best predictor among the lab measurements examined.<br><br>
3. There are various stages of thrombosis, and each level has its own set of symptoms.