To view the HTML version of the complete ARM Python code, see ARM: Employment Analysis (https://github.com/anly501/anly-501-project-raghavSharmaCode/blob/main/501-project-website/501/codes/Techniques/ARM/Python/ARM-python.ipynb).
This page focuses on ARM and networking using Python.
1 Introduction
Association rule mining (ARM) is a technique for identifying underlying relations between items. Take the example of a supermarket where customers can buy a variety of items. Usually, there is a pattern in what customers buy. For instance, mothers with babies buy baby products such as milk and diapers; other shoppers may buy makeup items, or beer and chips. In short, transactions follow patterns. More profit can be generated if the relationships between the items purchased in different transactions can be identified.
For instance, if items A and B are frequently bought together, several steps can be taken to increase profit. For example:
A and B can be placed together, so that when a customer buys one of the products they don’t have to go far to find the other.
People who buy one of the products can be targeted through an advertisement campaign to buy the other.
Collective discounts can be offered on these products if the customer buys both of them.
Both A and B can be packaged together.
The process of identifying associations between products is called association rule mining. This rule-based machine learning method is used to uncover interesting relationships between variables in large databases. Its goal is to use interestingness measures to detect strong rules in the data; it evaluates “transactions” for correlations and associations. This page will look at performing ARM and building networks, taking a deep dive into measures such as support, confidence, and lift to determine and visualize the best networks possible.
2 Dataset Used
Since the original dataset wasn’t meant to capture transactions, and the uniqueness of its values makes it unsuitable for ARM, we will use text data (tweets) from Twitter.
Tweepy is a Python library for accessing the Twitter API. You need to sign up for a Twitter Developer Account (the Twitter setup page will guide you through the process) in order to extract tweets from Twitter and use that data for your own purposes. You will need to create an app to get access to the API; once you have access, use the step-by-step guide to create an app and project. Remember to copy the keys into a txt file on your local machine.
Screenshot of extracted tweets:
3 Data Cleaning
Next, the extracted data has been cleaned and prepped into the specific format our model needs. First, the text has been cleaned, stemmed, and lemmatized. Then stopwords have been removed to get a clean vectorizer that counts the remaining words. For this, a toolbox (Text.ipynb) has been implemented that makes use of polymorphism and inheritance: it has a parent class called Datasets and a sub-class called Texdataset. The toolbox can clean any sort of text data and build a wordcloud from it.
Screenshot of the Cleaned tweets:
4 Apriori Algorithm for Association Rule Mining
Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this section, we will study the theory behind the Apriori algorithm and later implement it in Python.
There are three major components of the Apriori algorithm:
Support
Confidence
Lift
Suppose we have a record of 1,000 customer transactions, and we want to find the Support, Confidence, and Lift for two items, e.g. burgers and ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Out of the 150 transactions where a burger is purchased, 50 also contain ketchup. Using this data, we want to find the support, confidence, and lift.
4.1 Support
Support refers to the default popularity of an item and can be calculated by finding the number of transactions containing a particular item divided by the total number of transactions. Suppose we want to find the support for item B. This can be calculated as:
Support(B) = (Transactions containing (B))/(Total Transactions)
For instance, if out of 1,000 transactions, 100 transactions contain Ketchup, then the support for Ketchup can be calculated as:
Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)
Support(Ketchup) = 100/1000 = 10%
4.2 Confidence
Confidence refers to the likelihood that item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought. Mathematically, it can be represented as:
Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)
Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought together, while Burgers were bought in 150 transactions. The likelihood of buying Ketchup when a Burger is bought is the confidence of Burger → Ketchup, and can be mathematically written as:
Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)
Confidence(Burger→Ketchup) = 50/150 = 33.3%
You may notice that this is similar to what you’d see in the Naive Bayes Algorithm, however, the two algorithms are meant for different types of problems.
4.3 Lift
Lift(A→B) refers to the increase in the ratio of the sale of B when A is sold. It can be calculated by dividing Confidence(A→B) by Support(B). Mathematically it can be represented as:
Lift(A→B) = (Confidence (A→B))/(Support (B))
Coming back to our Burger and Ketchup problem, Lift(Burger→Ketchup) can be calculated as:
Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))
Lift(Burger→Ketchup) = 33.3%/10% = 3.33
Lift basically tells us that the likelihood of buying a Burger and Ketchup together is 3.33 times more than the likelihood of just buying the ketchup. A Lift of 1 means there is no association between products A and B. Lift of greater than 1 means products A and B are more likely to be bought together. Finally, Lift of less than 1 refers to the case where two products are unlikely to be bought together.
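The three measures above can be checked with a few lines of arithmetic. The sketch below plugs in the hypothetical counts from the burger/ketchup example (1,000 transactions, 150 with a burger, 100 with ketchup, 50 with both); it is illustrative only and not part of the report's pipeline.

```python
# Hypothetical counts from the burger/ketchup example above
n_transactions = 1000
n_burger = 150
n_ketchup = 100
n_both = 50

# Support: how common an item is overall
support_ketchup = n_ketchup / n_transactions   # 100/1000 = 0.10

# Confidence: fraction of burger transactions that also contain ketchup
confidence = n_both / n_burger                 # 50/150 ≈ 0.333

# Lift: confidence relative to ketchup's baseline popularity
lift = confidence / support_ketchup            # ≈ 3.33

print(round(support_ketchup, 2), round(confidence, 3), round(lift, 2))
# → 0.1 0.333 3.33
```

A lift well above 1, as here, matches the interpretation in the text: the pairing occurs far more often than the items' individual popularities would predict.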
5 Steps Involved in Apriori Algorithm
For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, Lift can be calculated for items 1 and 2, items 1 and 3, items 1 and 4, then items 2 and 3, items 2 and 4, and then for larger combinations, e.g. items 1, 2, and 3; items 1, 2, and 4; and so on.
As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:
Set a minimum value for support and confidence. This means that we are only interested in rules for items that have a certain baseline presence (support) and a minimum co-occurrence with other items (confidence).
Extract all the itemsets with a support value above the minimum threshold.
Select all the rules from these itemsets with a confidence value above the minimum threshold.
Order the rules by descending order of Lift.
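The pruning idea behind these steps can be sketched in plain Python. The snippet below is a minimal illustrative implementation of the frequent-itemset stage only (the names `support`, `frequent_itemsets`, and `min_support` are my own, not from the report's code): itemsets that miss the support threshold are discarded, and larger candidates are built only from the survivors.

```python
# Minimal sketch of the Apriori frequent-itemset search described above.
def support(itemset, tx):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in tx) / len(tx)

def frequent_itemsets(transactions, min_support):
    tx = [set(t) for t in transactions]
    # Step 1: 1-itemsets that meet the minimum support threshold
    items = {i for t in tx for i in t}
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), tx) >= min_support]
    result = {s: support(s, tx) for s in current}
    k = 2
    # Steps 2-3: grow candidate k-itemsets only from surviving itemsets
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c, tx) >= min_support]
        result.update({c: support(c, tx) for c in current})
        k += 1
    return result

# Toy transaction list
tx = [["burger", "ketchup"], ["burger"], ["burger", "ketchup", "cola"], ["cola"]]
freq = frequent_itemsets(tx, min_support=0.5)
# Step 4 (for rules) would sort by lift; here we just sort itemsets by support
for s, sup in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(s), sup)
```

With min_support = 0.5, the pair {burger, cola} is never even considered at k = 2 because {cola, burger} pairs appear in only one of four transactions; this early pruning is what makes Apriori tractable on large data.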
6 Code
Please note that for this report, the code for running ARM has been showcased only for the cleaned text data using the toolbox (Text.ipynb) as described above. In case you want the entire code, please click on the link provided at the beginning of the report.
6.1 Importing the libraries
```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import nltk
import string
import networkx as nx
import warnings
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from apyori import apriori

warnings.filterwarnings("ignore")
```
6.2 Import the clean dataset
```{python}
df = pd.read_csv('./tweets_text.csv')

# Renaming the columns
df.rename(columns={df.columns[0]: 'tweets', df.columns[1]: 'tweets_clean',
                   df.columns[2]: 'tweets_tokenized', df.columns[3]: 'tweets_without_stop',
                   df.columns[4]: 'tweets_stemmed', df.columns[5]: 'tweets_lemmatized'},
          inplace=True)

# Drop redundant columns and generate a new dataframe
clean_text_df = df.drop(['tweets', 'tweets_clean', 'tweets_tokenized',
                         'tweets_without_stop', 'tweets_stemmed'], axis=1)

# remove punctuation
final_tweets = [i.replace(",", "").replace("[", "").replace("]", "").replace("'", "")
                for i in clean_text_df['tweets_lemmatized']]

# add the column of clean text to the dataframe
clean_text_df['final_tweets'] = final_tweets
df = clean_text_df.drop('tweets_lemmatized', axis=1)
df.head()
```
                                        final_tweets
0  one hand receive salary hand go look like paym...
1  fact today origin word salary signup amp post ...
2  grow salary morale good fight good company oth...
3  negotiation az inside secret master negotiator...
4  follow johnsfinancetips daily finance tip agre...
6.3 Function to compute average sentiment
```{python}
# USER PARAM
input_path = df
compute_sentiment = True
sentiment = []          # average sentiment of each chunk of text
ave_window_size = 250   # size of scanning window for moving average

# OUTPUT FILE
output = 'transactions.txt'
if os.path.exists(output):
    os.remove(output)

# INITIALIZE
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()
sia = SentimentIntensityAnalyzer()

# ADD MORE STOPWORDS
stopwords = stopwords.words('english')
add = ['mr', 'mrs', 'wa', 'dr', 'said', 'back', 'could', 'one', 'looked',
       'like', 'know', 'around', 'dont']
for sp in add:
    stopwords.append(sp)

def read_and_clean(path, START=0, STOP=-1):
    global sentiment
    sentences = []
    for sentence in path['final_tweets']:
        sentences.append(sentence)
    print("NUMBER OF SENTENCES FOUND:", len(sentences))

    # CLEAN AND LEMMATIZE
    keep = '0123456789abcdefghijklmnopqrstuvwxyz'
    new_sentences = []
    vocabulary = []
    for sentence in sentences:
        new_sentence = ''
        # REBUILD LEMMATIZED SENTENCE
        for word in sentence.split():
            # ONLY KEEP CHARACTERS IN "keep"
            tmp2 = ''
            for char in word:
                if char in keep:
                    tmp2 = tmp2 + char
                else:
                    tmp2 = tmp2 + ' '
            word = tmp2
            # LEMMATIZE THE WORD
            new_word = lemmatizer.lemmatize(word)
            # REMOVE WHITE SPACES
            new_word = new_word.replace(' ', '')
            # BUILD NEW SENTENCE BACK UP
            if new_word not in stopwords:
                if new_sentence == '':
                    new_sentence = new_word
                else:
                    new_sentence = new_sentence + ',' + new_word
                if new_word not in vocabulary:
                    vocabulary.append(new_word)
        # SAVE (LIST OF LISTS)
        new_sentences.append(new_sentence.split(","))
        # USE NLTK TO DO SENTIMENT ANALYSIS
        if compute_sentiment:
            text1 = new_sentence.replace(',', ' ')
            ss = sia.polarity_scores(text1)
            sentiment.append([ss['neg'], ss['neu'], ss['pos'], ss['compound']])
        # SAVE SENTENCE TO OUTPUT FILE
        if len(new_sentence.split(',')) > 2:
            f = open(output, "a")
            f.write(new_sentence + "\n")
            f.close()
    sentiment = np.array(sentiment)
    print("TOTAL AVERAGE SENTIMENT: ", np.mean(sentiment, axis=0))
    print("VOCAB LENGTH: ", len(vocabulary))
    return new_sentences
```

```{python}
transactions = read_and_clean(input_path, 400, -400)
texT = pd.DataFrame(transactions)
texT.head()
```
TOTAL AVERAGE SENTIMENT: [0.04886059 0.81501859 0.13611338 0.22220242]
VOCAB LENGTH: 3584
   0            1                 2       3        4           ...  33    34
0  hand         receive           salary  hand     go          ...  None  None
1  fact         today             origin  word     salary      ...  None  None
2  grow         salary            morale  good     fight       ...  None  None
3  negotiation  inside            secret  master   negotiator  ...  None  None
4  follow       johnsfinancetips  daily   finance  tip         ...  None  None

5 rows × 35 columns
6.4 Visualize sentiment
```{python}
def moving_ave(y, w=100):
    # COMPUTE THE MOVING AVERAGE OF A SIGNAL Y
    mask = np.ones((1, w)) / w
    mask = mask[0, :]
    return np.convolve(y, mask, 'same')

# VISUALIZE THE SENTIMENT ANALYSIS AS A TIME-SERIES
# this is activated by compute_sentiment = True in the first cell
if compute_sentiment:
    # take sentiment moving ave and renormalize
    neg = moving_ave(sentiment[:, 0], ave_window_size)
    neg = (neg - np.mean(neg)) / np.std(neg)
    neu = moving_ave(sentiment[:, 1], ave_window_size)
    neu = (neu - np.mean(neu)) / np.std(neu)
    pos = moving_ave(sentiment[:, 2], ave_window_size)
    pos = (pos - np.mean(pos)) / np.std(pos)
    cmpd = moving_ave(sentiment[:, 3], ave_window_size)
    cmpd = (cmpd - np.mean(cmpd)) / np.std(cmpd)

    # Plot sentiment
    indx = np.linspace(0, len(sentiment), len(sentiment))
    plt.plot(indx, neg, label="negative")
    plt.plot(indx, neu, label="neutral")
    plt.plot(indx, pos, label="positive")
    plt.plot(indx, cmpd, label="combined")
    plt.legend(loc="upper left")
    plt.xlabel("text chunks: tweet progression")
    plt.ylabel("sentiment")
    plt.show()
```
6.5 Re-format output
```{python}
# RE-FORMAT THE APRIORI OUTPUT INTO A PANDAS DATA-FRAME WITH COLUMNS
# "rhs", "lhs", "supp", "conf", "supp x conf", "lift"
def reformat_results(results):
    # CLEAN-UP RESULTS
    keep = []
    for i in range(0, len(results)):
        for j in range(0, len(list(results[i]))):
            if j == 1:
                supp = results[i][j]
            if j > 1:
                for k in range(0, len(list(results[i][j]))):
                    if len(results[i][j][k][0]) != 0:
                        rhs = list(results[i][j][k][0])
                        lhs = list(results[i][j][k][1])
                        conf = float(results[i][j][k][2])
                        lift = float(results[i][j][k][3])
                        keep.append([rhs, lhs, supp, conf, supp * conf, lift])
    return pd.DataFrame(keep, columns=["rhs", "lhs", "supp", "conf",
                                       "supp x conf", "lift"])
```
6.6 Convert to NetworkX object
```{python}
def convert_to_network(df):
    print(df)
    # BUILD GRAPH
    G = nx.DiGraph()  # DIRECTED
    for row in df.iterrows():
        lhs = "_".join(row[1][0])
        rhs = "_".join(row[1][1])
        conf = row[1][3]
        if lhs not in G.nodes:
            G.add_node(lhs)
        if rhs not in G.nodes:
            G.add_node(rhs)
        edge = (lhs, rhs)
        if edge not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)
    return G
```
6.7 Plot NetworkX object
```{python}
def plot_network(G):
    # SPECIFY X-Y POSITIONS FOR PLOTTING
    pos = nx.random_layout(G)

    # GENERATE PLOT
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)

    # assign edge widths and colors based on the confidence attribute
    weights_e = [G[u][v]['weight'] for u, v in G.edges()]
    cmap = plt.cm.get_cmap('Blues')
    colors_e = [cmap(G[u][v]['weight'] * 10) for u, v in G.edges()]

    # PLOT
    nx.draw(G,
            edgecolors="black",
            edge_color=colors_e,
            node_size=2000,
            linewidths=2,
            font_size=8,
            font_color="white",
            font_weight="bold",
            width=weights_e,
            with_labels=True,
            pos=pos,
            ax=ax)
    ax.set(title='ARM on Text Data (Tweets)')
    plt.show()
```
6.8 Train ARM model by applying Apriori
The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apyori library.
The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second parameter is min_support, which selects the items with support values greater than the specified value. Next, min_confidence filters out rules whose confidence falls below the specified threshold. Similarly, min_lift specifies the minimum lift value for the shortlisted rules. Finally, min_length specifies the minimum number of items you want in your rules.
```{python}
# TRAIN THE ARM MODEL USING THE "apriori" PACKAGE
print("Transactions:", texT)

# Run Apriori algorithm
results = list(apriori(transactions, min_support=0.099, min_confidence=0.02,
                       min_length=2, max_length=5))
```
6.9 Results
Printing len(results) shows that 11 rules were mined; each item in the results list corresponds to one rule. Printing one of them, e.g. results[7], shows an item set containing ‘salary’ and ‘job’.
The support value for this rule is 0.26579925650557623. This number is calculated by dividing the number of transactions containing ‘salary’ by the total number of transactions.
The confidence level for the rule is 0.26579925650557623, which shows that approximately 27% of the transactions that contain ‘salary’ also contain ‘job’.
Finally, the lift of 1.0 tells us that ‘job’ is about as likely to occur in a transaction containing ‘salary’ as it is to occur overall, i.e. there is little to no association between the two items.
```{python}
# Use the utility function to reformat the output
result_df = reformat_results(results)
```
6.10 Visualize the results
```{python}
# PLOT THE RESULTS AS A NETWORK-X OBJECT
result_df = reformat_results(results)
G = convert_to_network(result_df)
print(G)
plot_network(G)
```
7 Conclusions
Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and highly explainable. For more advanced insights, such as those used by Google or Amazon, more complex algorithms such as recommender systems are used. Still, this method is a very simple way to get basic associations if that’s all your use case needs.