

ARM and Networking


To view the HTML version of the complete ARM Python code, click here: ARM: Employment Analysis

This page focuses on ARM and networking using Python.

1 Introduction

Association rule mining (ARM) is a technique for identifying underlying relations between different items. Take the example of a supermarket where customers can buy a variety of items. Usually there is a pattern in what customers buy: parents with babies buy baby products such as milk and diapers, while other shoppers may buy cosmetics, or beer and chips. In short, transactions follow patterns, and more profit can be generated if the relationships between the items purchased in different transactions can be identified.

For instance, if items A and B are frequently bought together, several steps can be taken to increase profit. For example:

  1. A and B can be placed together so that when a customer buys one of the products, they do not have to go far to find the other.
  2. People who buy one of the products can be targeted through an advertisement campaign to buy the other.
  3. Collective discounts can be offered on these products if the customer buys both of them.
  4. Both A and B can be packaged together.

The process of identifying associations between products is called association rule mining. This rule-based machine learning method is used to uncover interesting relationships between variables in large databases. Its goal is to use interestingness measures to detect strong rules in the data; it evaluates "transactions" for correlations and associations. This page performs ARM and builds networks, with a deep dive into measures such as lift, support, and confidence to determine and visualize the best networks possible.

2 Dataset Used

Since the original dataset was not meant to predict transactions, and its largely unique values make it a poor fit for ARM, we will instead use text data (tweets) from Twitter.

Tweepy is a Python library for accessing the Twitter Developer API. You need to sign up for a Twitter Developer Account (the Twitter setup page will guide you through the process) in order to extract tweets from Twitter and use that data for your purposes. You will need to create an app to get access to the API. Once you have access, use the step-by-step guide to create an app and project. Remember to copy the keys into a txt file on your local machine.

Screenshot of extracted tweets:

3 Data Cleaning

Next, the extracted data has been cleaned and prepped into the specific format our model requires. First, the text has been cleaned, stemmed, and lemmatized. Then stopwords have been removed to get a clean vectorizer that gives word counts. For this, a toolbox (Text.ipynb) has been implemented that makes use of polymorphism and inheritance. The toolbox has a parent class called Datasets and a subclass called Texdataset. This toolbox can clean any sort of text data and generates a wordcloud for it.

Screenshot of the Cleaned tweets:

4 Apriori Algorithm for Association Rule Mining

Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this section, we will study the theory behind the Apriori algorithm and later implement it in Python.

There are three major components of Apriori algorithm:

  1. Support
  2. Confidence
  3. Lift

Suppose we have a record of 1,000 customer transactions, and we want to find the support, confidence, and lift for two items, e.g. burgers and ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Of the 150 transactions containing a burger, 50 also contain ketchup. Using this data, we want to find the support, confidence, and lift.

4.1 Support

Support refers to the baseline popularity of an item and is calculated as the number of transactions containing a particular item divided by the total number of transactions. Suppose we want to find the support for item B. This can be calculated as:

Support(B) = (Transactions containing (B))/(Total Transactions)

For instance, if out of 1,000 transactions, 100 contain Ketchup, then the support for Ketchup can be calculated as:

Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

Support(Ketchup) = 100/1000 = 10%

4.2 Confidence

Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought. Mathematically, it can be represented as:

Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)

Coming back to our problem: 50 transactions contain both Burger and Ketchup, while 150 transactions contain a burger. The likelihood of buying ketchup when a burger is bought is the confidence of Burger → Ketchup, written as:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150 = 33.3%

You may notice that this is similar to what you’d see in the Naive Bayes Algorithm, however, the two algorithms are meant for different types of problems.

4.3 Lift

Lift(A→B) refers to the increase in the ratio of the sale of B when A is sold. It is calculated by dividing Confidence(A→B) by Support(B). Mathematically, it can be represented as:

Lift(A→B) = (Confidence (A→B))/(Support (B))

Coming back to our Burger and Ketchup problem, Lift(Burger→Ketchup) can be calculated as:

Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))

Lift(Burger→Ketchup) = 33.3/10 = 3.33

Lift basically tells us that ketchup is 3.33 times more likely to be bought when a burger is bought than it is on its own. A lift of 1 means there is no association between products A and B; a lift greater than 1 means A and B are more likely to be bought together; and a lift of less than 1 means the two products are unlikely to be bought together.
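The three measures above can be reproduced with a few lines of arithmetic. The sketch below simply plugs in the toy numbers from the burger-and-ketchup example:

```python
# Toy numbers from the example: 1,000 transactions,
# 150 contain a burger, 100 contain ketchup, 50 contain both.
total = 1000
burger = 150
ketchup = 100
both = 50

support_ketchup = ketchup / total      # 100/1000 = 10%
confidence = both / burger             # 50/150 ≈ 33.3%
lift = confidence / support_ketchup    # 0.333/0.10 ≈ 3.33

print(f"Support(Ketchup)           = {support_ketchup:.0%}")
print(f"Confidence(Burger→Ketchup) = {confidence:.1%}")
print(f"Lift(Burger→Ketchup)       = {lift:.2f}")
```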

5 Steps Involved in Apriori Algorithm

For large datasets, there can be hundreds of items across hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, then item 2 and item 3, item 2 and item 4, and then for larger combinations, e.g. item 1, item 2, and item 3; similarly item 1, item 2, and item 4; and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:

  1. Set a minimum value for support and confidence. This means we are only interested in finding rules for items that have a certain baseline frequency (support) and a minimum co-occurrence with other items (confidence).
  2. Extract all the subsets having higher value of support than minimum threshold.
  3. Select all the rules from the subsets with confidence value higher than minimum threshold.
  4. Order the rules by descending order of Lift.
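As an illustration of these pruning steps, here is a minimal and deliberately naive frequent-itemset sketch in plain Python (`frequent_itemsets` is a hypothetical helper written for this page, not part of the apyori library used below, which implements the same idea far more efficiently):

```python
def frequent_itemsets(transactions, min_support):
    """Minimal Apriori sketch: grow candidate itemsets level by level,
    keeping only those whose support meets the threshold."""
    n = len(transactions)
    tx = [set(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in tx) / n

    # Level 1: frequent single items
    items = sorted({i for t in tx for i in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        # (Classic Apriori joins on shared prefixes; this naive version
        # just takes all pairwise unions.)
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

demo = [['burger', 'ketchup'], ['burger'],
        ['burger', 'ketchup', 'cola'], ['cola']]
print(frequent_itemsets(demo, min_support=0.5))
```

With the demo transactions, {burger}, {ketchup}, {cola}, and {burger, ketchup} survive the 0.5 support threshold, while pairs involving cola are pruned and therefore never extended to larger candidates.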

6 Code

Please note that for this report, the code for running ARM has been showcased only for the cleaned text data using the toolbox (Text.ipynb) as described above. In case you want the entire code, please click on the link provided at the beginning of the report.

6.1 Importing the libraries

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import nltk
import string
import networkx as nx
import warnings

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from apyori import apriori

warnings.filterwarnings("ignore")

6.2 Import the clean dataset

Code
df = pd.read_csv('./tweets_text.csv')

# Renaming the columns
df.rename(columns={df.columns[0]: 'tweets', df.columns[1]: 'tweets_clean', 
df.columns[2]: 'tweets_tokenized', df.columns[3]: 'tweets_without_stop', df.columns[4]: 'tweets_stemmed', 
df.columns[5]: 'tweets_lemmatized'}, inplace=True)

# Drop redundant columns and generate a new dataframe
clean_text_df=df.drop(['tweets','tweets_clean','tweets_tokenized','tweets_without_stop','tweets_stemmed'],axis=1)

# remove punctuation
final_tweets=[i.replace(",","").replace("[","").replace("]","").replace("'","") for i in clean_text_df['tweets_lemmatized']]

# add the column of clean text to the dataframe
clean_text_df['final_tweets'] = final_tweets
df=clean_text_df.drop('tweets_lemmatized',axis=1)
df.head()
final_tweets
0 one hand receive salary hand go look like paym...
1 fact today origin word salary signup amp post ...
2 grow salary morale good fight good company oth...
3 negotiation az inside secret master negotiator...
4 follow johnsfinancetips daily finance tip agre...

6.3 Function to compute average sentiment

Code
# USER PARAM
input_path          =   df
compute_sentiment   =   True        
sentiment           =   []          # average sentiment of each chunk of text 
ave_window_size     =   250         # size of scanning window for moving average
                    

# OUTPUT FILE
output='transactions.txt'
if os.path.exists(output): os.remove(output)

# INITIALIZE
lemmatizer  =   WordNetLemmatizer()
ps          =   PorterStemmer()
sia         =   SentimentIntensityAnalyzer()

# ADD MORE
stopwords   =   stopwords.words('english')
add=['mr','mrs','wa','dr','said','back','could','one','looked','like','know','around','dont']
for sp in add: stopwords.append(sp)

def read_and_clean(path,START=0,STOP=-1):
    
    global sentiment

    sentences =  []
    for sentence in path['final_tweets']:
        sentences.append(sentence) 

    print("NUMBER OF SENTENCES FOUND:",len(sentences));

    # CLEAN AND LEMMATIZE
    keep='0123456789abcdefghijklmnopqrstuvwxyz'

    new_sentences=[];vocabulary=[]
    
    for sentence in sentences:
        new_sentence=''

        # REBUILD LEMMATIZED SENTENCE
        for word in sentence.split():
            
            # ONLY KEEP CHAR IN "keep"
            tmp2=''
            for char in word: 
                if(char in keep): 
                    tmp2=tmp2+char
                else:
                    tmp2=tmp2+' '
            word=tmp2

            #-----------------------
            # LEMMATIZE THE WORDS
            #-----------------------

            new_word = lemmatizer.lemmatize(word)

            # REMOVE WHITE SPACES
            new_word=new_word.replace(' ', '')

            # BUILD NEW SENTENCE BACK UP
            if new_word not in stopwords:
                if new_sentence=='':
                    new_sentence=new_word
                else:
                    new_sentence=new_sentence+','+new_word
                if new_word not in vocabulary: vocabulary.append(new_word)

        # SAVE (LIST OF LISTS)      
        new_sentences.append(new_sentence.split(","))
        
        # SIA
        if(compute_sentiment):
            #-----------------------
            # USE NLTK TO DO SENTIMENT ANALYSIS 
            #-----------------------

            text1=new_sentence.replace(',', ' ')
            ss = sia.polarity_scores(text1)
            sentiment.append([ss['neg'], ss['neu'], ss['pos'], ss['compound']])
            
        # SAVE SENTENCE TO OUTPUT FILE
        if(len(new_sentence.split(','))>2):
            f = open(output, "a")
            f.write(new_sentence+"\n")
            f.close()

    sentiment=np.array(sentiment)
    print("TOTAL AVERAGE SENTIMENT: ",np.mean(sentiment,axis=0))
    print("VOCAB LENGTH: ",len(vocabulary))
    return new_sentences
Code
transactions=read_and_clean(input_path,400,-400)
texT = pd.DataFrame(transactions)
texT.head()
NUMBER OF SENTENCES FOUND: 538
TOTAL AVERAGE SENTIMENT:  [0.04886059 0.81501859 0.13611338 0.22220242]
VOCAB LENGTH:  3584
0 1 2 3 4 5 6 7 8 9 ... 25 26 27 28 29 30 31 32 33 34
0 hand receive salary hand go look payment gateway source somewhere ... None None None None None None None None None None
1 fact today origin word salary signup amp post httpstcodgchla download ... None None None None None None None None None None
2 grow salary morale good fight good company otherwise relatively good ... None None None None None None None None None None
3 negotiation inside secret master negotiator hour student november release link ... None None None None None None None None None None
4 follow johnsfinancetips daily finance tip agree seen frommessi fifaworldcup alabama ... salary httpstcosluhgllr None None None None None None None None

5 rows × 35 columns

6.4 Visualize sentiment

Code
def moving_ave(y,w=100):
    #-----------------------
    # COMPUTE THE MOVING AVERAGE OF A SIGNAL Y
    #-----------------------
    mask=np.ones((1,w))/w; mask=mask[0,:]
    return np.convolve(y,mask,'same')


# VISUALIZE THE SENTIMENT ANALYSIS AS A TIME-SERIES
# this is activated by compute_sentiment = True in the first cell
if (compute_sentiment):
    # take sentiment moving ave and renormalize
    neg=moving_ave(sentiment[:,0], ave_window_size)
    neg=(neg-np.mean(neg))/np.std(neg)

    neu=moving_ave(sentiment[:,1], ave_window_size)
    neu=(neu-np.mean(neu))/np.std(neu)

    pos=moving_ave(sentiment[:,2], ave_window_size)
    pos=(pos-np.mean(pos))/np.std(pos)

    cmpd=moving_ave(sentiment[:,3], ave_window_size)
    cmpd=(cmpd-np.mean(cmpd))/np.std(cmpd)

    # Plot sentiment
    indx = np.linspace(0,len(sentiment), len(sentiment))
    plt.plot(indx, neg, label="negative")
    plt.plot(indx, neu, label="neutral")
    plt.plot(indx, pos, label="positive")
    plt.plot(indx, cmpd, label="combined")

    plt.legend(loc="upper left")
    plt.xlabel("text chunks: tweet progression")
    plt.ylabel("sentiment")
    plt.show()

6.5 Re-format output

Code
# RE-FORMAT THE APRIORI OUTPUT INTO A PANDAS DATA-FRAME WITH COLUMNS "rhs","lhs","supp","conf","supp x conf","lift"
def reformat_results(results):

    # CLEAN-UP RESULTS 
    keep=[]
    for i in range(0,len(results)):
        for j in range(0,len(list(results[i]))):
            if(j>1):
                for k in range(0,len(list(results[i][j]))):
                    if(len(results[i][j][k][0])!=0):
                        rhs=list(results[i][j][k][0])
                        lhs=list(results[i][j][k][1])
                        conf=float(results[i][j][k][2])
                        lift=float(results[i][j][k][3])
                        keep.append([rhs,lhs,supp,conf,supp*conf,lift])
            if(j==1):
                supp=results[i][j]

    return pd.DataFrame(keep, columns =["rhs","lhs","supp","conf","supp x conf","lift"])

6.6 Convert to NetworkX object

Code
def convert_to_network(df):
    print(df)

    # BUILD GRAPH
    G = nx.DiGraph()  # DIRECTED
    for row in df.iterrows():
        # for column in df.columns:
        lhs="_".join(row[1][0])
        rhs="_".join(row[1][1])
        conf=row[1][3]; #print(conf)
        if(lhs not in G.nodes): 
            G.add_node(lhs)
        if(rhs not in G.nodes): 
            G.add_node(rhs)

        edge=(lhs,rhs)
        if edge not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)

    # print(G.nodes)
    # print(G.edges)
    return G

6.7 Plot NetworkX object

Code
def plot_network(G):
    # SPECIFIY X-Y POSITIONS FOR PLOTTING
    pos=nx.random_layout(G)

    # GENERATE PLOT
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)

    # assign colors based on attributes
    weights_e   = [G[u][v]['weight'] for u,v in G.edges()]

    # SAMPLE CMAP FOR COLORS 
    cmap=plt.cm.get_cmap('Blues')
    colors_e    = [cmap(G[u][v]['weight']*10) for u,v in G.edges()]

    # PLOT
    nx.draw(
    G,
    edgecolors="black",
    edge_color=colors_e,
    node_size=2000,
    linewidths=2,
    font_size=8,
    font_color="white",
    font_weight="bold",
    width=weights_e,
    with_labels=True,
    pos=pos,
    ax=ax
    )
    ax.set(title='ARM on Text Data (Tweets)')
    plt.show()

6.8 Train ARM model by applying Apriori

The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apyori library.

The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second parameter is min_support, which selects the items whose support values are greater than the specified value. Next, min_confidence filters out rules whose confidence falls below the specified threshold. Similarly, min_lift specifies the minimum lift value for the shortlisted rules. Finally, min_length specifies the minimum number of items you want in your rules.

Code
# TRAIN THE ARM MODEL USING THE "apriori" PACKAGE
print("Transactions:",texT)
# Run Apriori algorithm
results = list(apriori(transactions, min_support=0.099, min_confidence=0.02, min_length=2, max_length=5))
Transactions:               0                 1             2        3           4       5   \
0           hand           receive        salary     hand          go    look   
1           fact             today        origin     word      salary  signup   
2           grow            salary        morale     good       fight    good   
3    negotiation            inside        secret   master  negotiator    hour   
4         follow  johnsfinancetips         daily  finance         tip   agree   
..           ...               ...           ...      ...         ...     ...   
533      biggest               lie         earth  tadaaaa         ctc   funny   
534     computer           science  everevolving  subject       scope     amp   
535     thinking               ooh          work  company     without  salary   
536         read                uk       ireland  inhouse       legal  market   
537         ever          wondered        salary  compare    industry    peer   

             6            7               8          9   ...  \
0       payment      gateway          source  somewhere  ...   
1           amp         post  httpstcodgchla   download  ...   
2       company    otherwise      relatively       good  ...   
3       student     november         release       link  ...   
4          seen    frommessi    fifaworldcup    alabama  ...   
..          ...          ...             ...        ...  ...   
533      office         meme           radio       city  ...   
534      career  opportunity      constantly    growing  ...   
535  management          say           happy       work  ...   
536      report       salary           guide     discus  ...   
537        time       annual          salary     survey  ...   

                     25                26    27    28    29    30    31    32  \
0                  None              None  None  None  None  None  None  None   
1                  None              None  None  None  None  None  None  None   
2                  None              None  None  None  None  None  None  None   
3                  None              None  None  None  None  None  None  None   
4                salary  httpstcosluhgllr  None  None  None  None  None  None   
..                  ...               ...   ...   ...   ...   ...   ...   ...   
533                None              None  None  None  None  None  None  None   
534  httpstcoojoictopop              None  None  None  None  None  None  None   
535                None              None  None  None  None  None  None  None   
536                None              None  None  None  None  None  None  None   
537                None              None  None  None  None  None  None  None   

       33    34  
0    None  None  
1    None  None  
2    None  None  
3    None  None  
4    None  None  
..    ...   ...  
533  None  None  
534  None  None  
535  None  None  
536  None  None  
537  None  None  

[538 rows x 35 columns]

6.9 Results

Let’s first find the total number of rules mined by the apriori class.

Code
print(len(results))
11

The script above should return 11. Each item corresponds to one rule.

Let’s print one item set from the association rules list to see its rule.

Code
print(results[7])
RelationRecord(items=frozenset({'salary', 'job'}), support=0.26579925650557623, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'salary', 'job'}), confidence=0.26579925650557623, lift=1.0), OrderedStatistic(items_base=frozenset({'job'}), items_add=frozenset({'salary'}), confidence=1.0, lift=1.0037313432835822), OrderedStatistic(items_base=frozenset({'salary'}), items_add=frozenset({'job'}), confidence=0.2667910447761194, lift=1.0037313432835822)])

The items in the item set are ‘salary’ and ‘job’.

The support value for this rule is 0.26579925650557623. This number is calculated by dividing the number of transactions containing both ‘salary’ and ‘job’ by the total number of transactions.

The confidence for the rule salary → job is 0.2667910447761194, which shows that out of all the transactions containing ‘salary’, roughly 27% also contain ‘job’.

Finally, the lift of approximately 1.0 tells us that ‘job’ is about as likely to occur in a transaction containing ‘salary’ as it is in general, i.e., there is only a very weak association between the two terms.
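These statistics are internally consistent and can be checked by hand. Since Confidence(job → salary) = 1.0, every transaction containing ‘job’ also contains ‘salary’, so Support(job) equals the support of the full itemset; dividing Confidence(salary → job) by it recovers the reported lift (all numbers below are taken from the output above):

```python
supp_pair = 0.26579925650557623           # support({'salary', 'job'})
conf_salary_to_job = 0.2667910447761194   # Confidence(salary → job)

# Since every 'job' transaction also contains 'salary',
# Support(job) equals supp_pair.
lift = conf_salary_to_job / supp_pair
print(lift)  # ≈ 1.0037313432835822, matching the reported value
```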

Code
# Use the utility function to reformat the output
result_df = reformat_results(results)

6.10 Visualize the results

Code
# PLOT THE RESULTS AS A NETWORK-X OBJECT
G = convert_to_network(result_df)
print(G)
plot_network(G)
          rhs         lhs      supp      conf  supp x conf      lift
0  [employee]    [salary]  0.100372  1.000000     0.100372  1.003731
1    [salary]  [employee]  0.100372  0.100746     0.010112  1.003731
2       [job]    [salary]  0.265799  1.000000     0.265799  1.003731
3    [salary]       [job]  0.265799  0.266791     0.070913  1.003731
4     [money]    [salary]  0.141264  1.000000     0.141264  1.003731
5    [salary]     [money]  0.141264  0.141791     0.020030  1.003731
6       [pay]    [salary]  0.131970  1.000000     0.131970  1.003731
7    [salary]       [pay]  0.131970  0.132463     0.017481  1.003731
8    [salary]      [work]  0.137546  0.138060     0.018990  1.003731
9      [work]    [salary]  0.137546  1.000000     0.137546  1.003731
DiGraph with 6 nodes and 10 edges

7 Conclusions

Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and highly explainable. For more advanced insights, such as those used by Google or Amazon, more complex algorithms, such as recommender systems, are needed. Still, this method is a very simple way to get basic associations if that is all your use case requires.

8 References

  1. Association Rule Mining
  2. Wikipedia
Source Code
---
title: ARM and Networking
---

To view the HTML version of the complete ARM python code click here: <a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/ARM/Python/ARM-python.ipynb" target="_blank">ARM: Employment Analysis</a>

<b>This page focuses on ARM and networking using Python.</b>

# Introduction

Association rule mining (ARM) is a technique to identify underlying relations between different items. Take an example of a Super Market where customers can buy variety of items. Usually, there is a pattern in what the customers buy. For instance, mothers with babies buy baby products such as milk and diapers. Damsels may buy makeup items whereas bachelors may buy beers and chips etc. In short, transactions involve a pattern. More profit can be generated if the relationship between the items purchased in different transactions can be identified.

For instance, if item A and B are bought together more frequently then several steps can be taken to increase the profit. For example:

1. A and B can be placed together so that when a customer buys one of the product he doesn't have to go far away to buy the other product.
2. People who buy one of the products can be targeted through an advertisement campaign to buy the other.
3. Collective discounts can be offered on these products if the customer buys both of them.
4. Both A and B can be packaged together.


The process of identifying an associations between products is called association rule mining. The rule-based machine learning method of association rule learning is used to uncover interesting relationships between variables in huge databases. It's goal is to use some interesting measures to detect strong rules identified in databases. It evaluates “transactions” for correlations/associations. This page will look at performing ARM and making networks where a deep dive will be done in rules like lift, support and confidence to determine and visualize the best networks possible.

# Dataset Used

Since the original dataset wasn't meant to predict transactions and given the fact the uniqueness of data restricts from working with ARM, we will use text data (tweets) from Twitter.

Tweepy is a Python library for accessing the Twitter Developer API. You need to sign up for a Twitter Developer Account (<u><a href="https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api" target="_blank">Twitter setup page</a></u>  will guide you in the process) in order to extract tweets from Twitter and use that data for your purpose. You will need to create an app to get access to the API. Once you have access to the API use the <u><a href="https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2" target="_blank"> step-by-step guide</a></u> to create an app and project. Remember to copy the keys in a txt file on your local machine.

Screenshot of extracted tweets:

<img src="./images/tweets_initial.png">

# Data Cleaning

Next, the extracted data has been cleaned and prepped to be set in a specific way to run our model. Firstly, the text has been cleaned, stemmed and lemmatized. Then stopwords have been removed to get a clean vectorizer that gives a count of news words. For this, a toolbox (  <u><a href="https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/Text.ipynb" target="_blank">Text.ipynb</a></u>) has been implemented which makes use of polymorphism and inheritance. The toolbox has a parent class called Datasets and a sub-class called Texdataset. This toolbox can clean any sort of text data and makes a wordcloud for it.

Screenshot of the Cleaned tweets:

<img src="./images/tweets_cleaned.png" style="width:1000px;" align="center">

# Apriori Algorithm for Association Rule Mining

Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this section, we will study the theory behind the <u><a href="https://en.wikipedia.org/wiki/Apriori_algorithm" target="_blank">Apriori algorithm</a></u> and will later implement Apriori algorithm in Python.

There are three major components of Apriori algorithm:

1. Support
2. Confidence
3. Lift

Suppose we have a record of 1 thousand customer transactions, and we want to find the Support, Confidence, and Lift for two items e.g. burgers and ketchup. Out of one thousand transactions, 100 contain ketchup while 150 contain a burger. Out of 150 transactions where a burger is purchased, 50 transactions contain ketchup as well. Using this data, we want to find the support, confidence, and lift.

## Support

Support refers to the default popularity of an item and can be calculated by finding number of transactions containing a particular item divided by total number of transactions. Suppose we want to find support for item B. This can be calculated as:

Support(B) = (Transactions containing (B))/(Total Transactions)

For instance if out of 1000 transactions, 100 transactions contain Ketchup then the support for item Ketchup can be calculated as:

Support(Ketchup) = (Transactions containingKetchup)/(Total Transactions)

Support(Ketchup) = 100/1000
                 = 10%

## Confidence

Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought. Mathematically, it can be represented as:

Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)
Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought together. While in 150 transactions, burgers are bought. Then we can find likelihood of buying ketchup when a burger is bought can be represented as confidence of Burger -> Ketchup and can be mathematically written as:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing A)

Confidence(Burger→Ketchup) = 50/150
                           = 33.3%

You may notice that this is similar to what you'd see in the <u><a href="https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/" target="_blank">Naive Bayes Algorithm</a></u>, however, the two algorithms are meant for different types of problems.

## Lift

Lift(A -> B) refers to the increase in the ratio of sale of B when A is sold. Lift(A –> B) can be calculated by dividing Confidence(A -> B) divided by Support(B). Mathematically it can be represented as:

Lift(A→B) = (Confidence (A→B))/(Support (B))
Coming back to our Burger and Ketchup problem, the Lift(Burger -> Ketchup) can be calculated as:

Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))

Lift(Burger→Ketchup) = 33.3/10
                     = 3.33

Lift basically tells us that the likelihood of buying a Burger and Ketchup together is 3.33 times more than the likelihood of just buying the ketchup. A Lift of 1 means there is no association between products A and B. Lift of greater than 1 means products A and B are more likely to be bought together. Finally, Lift of less than 1 refers to the case where two products are unlikely to be bought together.

# Steps Involved in Apriori Algorithm

For large sets of data, there can be hundreds of items in hundreds of thousands transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, Lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4 and then item 2 and item 3, item 2 and item 4 and then combinations of items e.g. item 1, item 2 and item 3; similarly item 1, item2, and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:

1. Set a minimum value for support and confidence. This means that we are only interested in rules for items that occur with a certain minimum frequency (support) and that co-occur with other items at a minimum rate (confidence).
2. Extract all the itemsets with a support value higher than the minimum threshold.
3. Select all the rules from those itemsets with a confidence value higher than the minimum threshold.
4. Order the rules in descending order of lift.
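The itemset-growing part of these steps can be sketched in plain Python. This is a minimal illustration of the idea, not the apyori implementation used later; the transaction data is invented for the example.

```{python}
def apriori_sketch(transactions, min_support=0.5):
    # Minimal sketch of the steps above: count support, keep itemsets above
    # the threshold, and grow candidate itemsets one item at a time from
    # the survivors (the Apriori pruning idea).
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate k-itemsets built from frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent += current
        k += 1
    return frequent

# Toy transactions, invented for illustration
txns = [{'burger', 'ketchup'}, {'burger'}, {'burger', 'ketchup', 'cola'}, {'cola'}]
frequent_sets = apriori_sketch(txns, min_support=0.5)
print(len(frequent_sets))  # 4 itemsets clear the 50% support threshold
```

Only itemsets whose support clears the threshold are extended, which is what keeps the search tractable for large transaction sets.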

# Code

Please note that for this report, the code for running ARM has been showcased only for the cleaned text data using the toolbox (Text.ipynb) as described above. In case you want the entire code, please click on the link provided at the beginning of the report.

## Importing the libraries

```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import nltk
import string
import networkx as nx
import warnings

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from apyori import apriori

warnings.filterwarnings("ignore")
```

## Import the clean dataset

```{python}
df = pd.read_csv('./tweets_text.csv')

# Renaming the columns
df.rename(columns={df.columns[0]: 'tweets', df.columns[1]: 'tweets_clean', 
df.columns[2]: 'tweets_tokenized', df.columns[3]: 'tweets_without_stop', df.columns[4]: 'tweets_stemmed', 
df.columns[5]: 'tweets_lemmatized'}, inplace=True)

# Drop redundant columns and generate a new dataframe
clean_text_df=df.drop(['tweets','tweets_clean','tweets_tokenized','tweets_without_stop','tweets_stemmed'],axis=1)

# remove punctuation
final_tweets=[i.replace(",","").replace("[","").replace("]","").replace("'","") for i in clean_text_df['tweets_lemmatized']]

# add the column of clean text to the dataframe
clean_text_df['final_tweets'] = final_tweets
df=clean_text_df.drop('tweets_lemmatized',axis=1)
df.head()
```

## Function to compute average sentiment

```{python}
# USER PARAM
input_path          =   df          # the cleaned dataframe (passed directly, not a file path)
compute_sentiment   =   True        
sentiment           =   []          # average sentiment of each chunk of text 
ave_window_size     =   250         # size of scanning window for moving average
                    

# OUTPUT FILE
output='transactions.txt'
if os.path.exists(output): os.remove(output)

# INITIALIZE
lemmatizer  =   WordNetLemmatizer()
ps          =   PorterStemmer()
sia         =   SentimentIntensityAnalyzer()

# ADD MORE
stopwords   =   stopwords.words('english')   # note: rebinds the name, shadowing the imported module
add=['mr','mrs','wa','dr','said','back','could','one','looked','like','know','around','dont']
for sp in add: stopwords.append(sp)

def read_and_clean(path,START=0,STOP=-1):
    
    global sentiment

    sentences =  []
    for sentence in path['final_tweets']:
        sentences.append(sentence) 

    print("NUMBER OF SENTENCES FOUND:",len(sentences));

    # CLEAN AND LEMMATIZE
    keep='0123456789abcdefghijklmnopqrstuvwxyz'   # characters to keep (digits and lowercase letters)

    new_sentences=[];vocabulary=[]
    
    for sentence in sentences:
        new_sentence=''

        # REBUILD LEMMATIZED SENTENCE
        for word in sentence.split():
            
            # ONLY KEEP CHAR IN "keep"
            tmp2=''
            for char in word: 
                if(char in keep): 
                    tmp2=tmp2+char
                else:
                    tmp2=tmp2+' '
            word=tmp2

            #-----------------------
            # LEMMATIZE THE WORDS
            #-----------------------

            new_word = lemmatizer.lemmatize(word)

            # REMOVE WHITE SPACES
            new_word=new_word.replace(' ', '')

            # BUILD NEW SENTENCE BACK UP
            if new_word not in stopwords:
                if new_sentence=='':
                    new_sentence=new_word
                else:
                    new_sentence=new_sentence+','+new_word
                if new_word not in vocabulary: vocabulary.append(new_word)

        # SAVE (LIST OF LISTS)      
        new_sentences.append(new_sentence.split(","))
        
        # SIA
        if(compute_sentiment):
            #-----------------------
            # USE NLTK TO DO SENTIMENT ANALYSIS 
            #-----------------------

            text1=new_sentence.replace(',', ' ')
            ss = sia.polarity_scores(text1)
            sentiment.append([ss['neg'], ss['neu'], ss['pos'], ss['compound']])
            
        # SAVE SENTENCE TO OUTPUT FILE
        if(len(new_sentence.split(','))>2):
            f = open(output, "a")
            f.write(new_sentence+"\n")
            f.close()

    sentiment=np.array(sentiment)
    print("TOTAL AVERAGE SENTIMENT: ",np.mean(sentiment,axis=0))
    print("VOCAB LENGTH: ",len(vocabulary))
    return new_sentences
```

```{python}
transactions=read_and_clean(input_path,400,-400)
texT = pd.DataFrame(transactions)
texT.head()
```

## Visualize sentiment

```{python}
def moving_ave(y,w=100):
    #-----------------------
    # COMPUTE THE MOVING AVERAGE OF A SIGNAL Y
    #-----------------------
    mask=np.ones((1,w))/w; mask=mask[0,:]
    return np.convolve(y,mask,'same')


# VISUALIZE THE SENTIMENT ANALYSIS AS A TIME-SERIES
# this is activated by compute_sentiment = True in the first cell
if (compute_sentiment):
    # take sentiment moving ave and renormalize
    neg=moving_ave(sentiment[:,0], ave_window_size)
    neg=(neg-np.mean(neg))/np.std(neg)

    neu=moving_ave(sentiment[:,1], ave_window_size)
    neu=(neu-np.mean(neu))/np.std(neu)

    pos=moving_ave(sentiment[:,2], ave_window_size)
    pos=(pos-np.mean(pos))/np.std(pos)

    cmpd=moving_ave(sentiment[:,3], ave_window_size)
    cmpd=(cmpd-np.mean(cmpd))/np.std(cmpd)

    # Plot sentiment
    indx = np.linspace(0,len(sentiment), len(sentiment))
    plt.plot(indx, neg, label="negative")
    plt.plot(indx, neu, label="neutral")
    plt.plot(indx, pos, label="positive")
    plt.plot(indx, cmpd, label="combined")

    plt.legend(loc="upper left")
    plt.xlabel("text chunks: tweet progression")
    plt.ylabel("sentiment")
    plt.show()
```

## Re-format output

```{python}
# RE-FORMAT THE APRIORI OUTPUT INTO A PANDAS DATA-FRAME WITH COLUMNS "rhs","lhs","supp","conf","supp x conf","lift"
def reformat_results(results):

    # CLEAN-UP RESULTS 
    keep=[]
    for i in range(0,len(results)):
        for j in range(0,len(list(results[i]))):
            if(j>1):
                for k in range(0,len(list(results[i][j]))):
                    if(len(results[i][j][k][0])!=0):
                        rhs=list(results[i][j][k][0])
                        lhs=list(results[i][j][k][1])
                        conf=float(results[i][j][k][2])
                        lift=float(results[i][j][k][3])
                        keep.append([rhs,lhs,supp,conf,supp*conf,lift])
            if(j==1):
                supp=results[i][j]

    return pd.DataFrame(keep, columns =["rhs","lhs","supp","conf","supp x conf","lift"])
```

## Convert to NetworkX object

```{python}
def convert_to_network(df):
    print(df)

    # BUILD GRAPH
    G = nx.DiGraph()  # DIRECTED
    for row in df.iterrows():
        # for column in df.columns:
        lhs="_".join(row[1][0])
        rhs="_".join(row[1][1])
        conf=row[1][3]; #print(conf)
        if(lhs not in G.nodes): 
            G.add_node(lhs)
        if(rhs not in G.nodes): 
            G.add_node(rhs)

        edge=(lhs,rhs)
        if edge not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)

    # print(G.nodes)
    # print(G.edges)
    return G
```

## Plot NetworkX object

```{python}
def plot_network(G):
    # SPECIFY X-Y POSITIONS FOR PLOTTING
    pos=nx.random_layout(G)

    # GENERATE PLOT
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)

    # assign colors based on attributes
    weights_e   = [G[u][v]['weight'] for u,v in G.edges()]

    # SAMPLE CMAP FOR COLORS 
    cmap=plt.cm.get_cmap('Blues')
    colors_e    = [cmap(G[u][v]['weight']*10) for u,v in G.edges()]

    # PLOT
    nx.draw(
    G,
    edgecolors="black",
    edge_color=colors_e,
    node_size=2000,
    linewidths=2,
    font_size=8,
    font_color="white",
    font_weight="bold",
    width=weights_e,
    with_labels=True,
    pos=pos,
    ax=ax
    )
    ax.set(title='ARM on Text Data(Tweets)')
    plt.show()
```

## Train ARM model by applying Apriori

The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apyori library.

The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second is the min_support parameter, which selects items with support values greater than the specified threshold. Next, the min_confidence parameter filters out rules whose confidence falls below its threshold. Similarly, the min_lift parameter specifies the minimum lift value for the shortlisted rules. Finally, the min_length parameter specifies the minimum number of items that you want in your rules.
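To illustrate how the support and confidence thresholds interact, here is a small sketch over hypothetical pre-computed rules; the tuples and their numbers are invented for illustration, not output of the model.

```{python}
# Hypothetical (rhs, lhs, support, confidence, lift) tuples, invented
# purely to show how the thresholds filter candidate rules.
candidate_rules = [
    ("salary", "job", 0.27, 0.27, 1.0),
    ("pay", "raise", 0.05, 0.40, 2.1),
    ("work", "life", 0.12, 0.01, 0.4),
]

min_support, min_confidence = 0.099, 0.02
kept = [r for r in candidate_rules
        if r[2] >= min_support and r[3] >= min_confidence]
print(len(kept))  # only the first rule clears both thresholds
```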

```{python}
# TRAIN THE ARM MODEL USING THE "apriori" PACKAGE
print("Transactions:",texT)
# Run Apriori algorithm
results = list(apriori(transactions, min_support=0.099, min_confidence=0.02, min_length=2, max_length=5))
```

## Results

Let's first find the total number of rules mined by the apriori class.

```{python}
print(len(results))
```

The script above should return 11. Each item corresponds to one rule.

Let's print a random item set in the association_rules list to see its rule.

```{python}
print(results[7])
```

The items in the item set are 'salary' and 'job'.

The support value for this rule is 0.26579925650557623. This number is calculated by dividing the number of transactions containing both 'salary' and 'job' by the total number of transactions.

The confidence level for the rule is 0.26579925650557623, which shows that out of all the transactions that contain 'salary', 27% (approx.) also contain 'job'.

Finally, the lift of 1.0 tells us that 'job' is about as likely to occur in a transaction containing 'salary' as it is to occur in general; in other words, there is little to no association between the two items.
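As a quick arithmetic check of the lift reported above: for a rule A → B, lift = confidence(A → B) / support(B). The reported confidence happens to match the reported support (≈ 0.2658), and assuming support('job') takes that same value, the lift comes out at roughly 1.0.

```{python}
# Sanity check on the numbers above, assuming support('job') equals the
# reported confidence (which is what a lift of 1.0 implies).
reported_confidence = 0.26579925650557623
assumed_support_job = 0.26579925650557623
lift_check = reported_confidence / assumed_support_job
print(round(lift_check, 2))  # 1.0
```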

```{python}
# Use the utility function to reformat the output
result_df = reformat_results(results)
```

## Visualize the results

```{python}
# PLOT THE RESULTS AS A NETWORK-X OBJECT
pd_results = reformat_results(results)
G = convert_to_network(pd_results)
print(G)
plot_network(G)
```

# Conclusions

Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and highly explainable. For more advanced insights, such as those used by Google or Amazon, more complex algorithms, such as <u><a href="https://en.wikipedia.org/wiki/Recommender_system" target="_blank">recommender systems</a></u>, are used. Still, this method is a very simple way to get basic associations if that's all your use case needs.

# References

1. <a href="https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/">Association Rule Mining</a> <br>
2. <a href="https://www.wikipedia.org/">Wikipedia</a> <br>