Python - Packages: Gensim, Matplotlib
One of the primary applications of natural language processing is automatically extracting the topics people are discussing from large volumes of text. This calls for an algorithm that can read through a collection of documents and output the topics they discuss. In this lab, we will use Latent Dirichlet Allocation (LDA), a popular algorithm for topic modeling. LDA is a Bayesian network that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar.
Preprocessing
# Installing nltk to obtain its list of stop words. Stop words are commonly used words (such as "the", "a", "an", "in")
# that we want our topic models to ignore (as they are not suggestive of "topics")
! pip install nltk
Requirement already satisfied: nltk in c:\users\colto\anaconda3\lib\site-packages (3.8.1) Requirement already satisfied: click in c:\users\colto\anaconda3\lib\site-packages (from nltk) (8.0.4) Requirement already satisfied: joblib in c:\users\colto\anaconda3\lib\site-packages (from nltk) (1.2.0) Requirement already satisfied: regex>=2021.8.3 in c:\users\colto\anaconda3\lib\site-packages (from nltk) (2022.7.9) Requirement already satisfied: tqdm in c:\users\colto\anaconda3\lib\site-packages (from nltk) (4.65.0) Requirement already satisfied: colorama in c:\users\colto\anaconda3\lib\site-packages (from click->nltk) (0.4.6)
# Loading nltk stop words
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\colto\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
# Import other packages
import re  # regular expressions
import numpy as np  # numerical operations on arrays
import pandas as pd  # working with dataframes
# Gensim, for topic modeling
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline
# Prepare stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]
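Note that stop_words is an ordinary Python list, so it can be extended with corpus-specific words. As a hypothetical illustration (not part of the original lab), newsgroup header tokens could be appended so they do not dominate the learned topics; the line is left commented out so the outputs below are unchanged:
# Optional: extend the stop word list with corpus-specific tokens (illustrative word choices)
# stop_words.extend(['subject', 'lines', 'organization', 'edu'])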
# Load dataset
# The dataset is available at https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json
df = pd.read_json("Data/newsgroups.json")
df
  | content | target | target_names
---|---|---|---
0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
... | ... | ... | ... |
11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 | sci.med |
11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 | comp.sys.mac.hardware |
11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 | comp.sys.ibm.pc.hardware |
11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 | comp.graphics |
11313 | From: gunning@cco.caltech.edu (Kevin J. Gunnin... | 8 | rec.motorcycles |
11314 rows × 3 columns
# We are only using the first 1,000 documents because model training can be time-consuming
df = df.iloc[:1000,:]
df
  | content | target | target_names
---|---|---|---
0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
... | ... | ... | ... |
995 | From: ibeshir@nyx.cs.du.edu (Ibrahim)\nSubject... | 6 | misc.forsale |
996 | From: anwar+@cs.cmu.edu (Anwar Mohammed)\nSubj... | 17 | talk.politics.mideast |
997 | From: DJCOHEN@YaleVM.YCC.Yale.Edu (Daniel Cohe... | 4 | comp.sys.mac.hardware |
998 | From: geb@cs.pitt.edu (Gordon Banks)\nSubject:... | 13 | sci.med |
999 | From: dxf12@po.cwru.edu (Douglas Fowler)\nSubj... | 15 | soc.religion.christian |
1000 rows × 3 columns
Data cleaning
# Convert the content column into a list of documents, then inspect the first one
data = df.content.values.tolist()
data[0]
"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
# Removing emails
email_pattern = re.compile(r"\S*@\S*\s?")  # \S matches a non-whitespace character, \s matches a whitespace character
# * means matching 0 or more times, ? means matching 0 or 1 times
for i in range(len(data)):
    data[i] = re.sub(email_pattern, '', data[i])
print(data[0])
From: (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
 ---- brought to you by your neighborhood Lerxst ----
# Removing newline characters
new_line_pattern = re.compile(r"\s+")  # + means matching 1 or more times
for i in range(len(data)):
    data[i] = re.sub(new_line_pattern, ' ', data[i])  # converting line breaks and runs of whitespace into a single space
print(data[0])
From: (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
# Tokenize words using gensim's simple_preprocess
for i in range(len(data)):
    data[i] = gensim.utils.simple_preprocess(str(data[i]), deacc=True)  # deacc=True removes accent marks; punctuation is dropped during tokenization
print(data[0])
['from', 'where', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
# Removing stop words
for i in range(len(data)):
    data[i] = [word for word in data[i] if word not in stop_words]
print(data[0])
['thing', 'subject', 'car', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
Including bigrams
# The basic idea here is to group words that frequently appear together into a single phrase
bigram = gensim.models.Phrases(data, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
print(bigram_mod[data[0]])
['thing', 'subject', 'car', 'nntp_posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
Task 1
What would be the impact of changing the 'min_count' to 1 when generating bigrams? Please redefine the bigrams with the new 'min_count' setting and rerun the experiment. Print the updated word list of data[0] with these new bigrams.
Are there more or fewer words grouped together compared to the original result obtained with a 'min_count' setting of 5? Are these additional or removed bigrams meaningful? Provide an assessment of the advantages and disadvantages of this change and suggest whether the change is advisable.
Include your code in the following cell (code) and your discussion in the next cell (markdown).
bigram_new = gensim.models.Phrases(data, min_count=1, threshold=100)
bigram_mod_new = gensim.models.phrases.Phraser(bigram_new)
print(bigram_mod_new[data[0]])
['thing', 'subject', 'car', 'nntp_posting', 'host', 'rac', 'wam', 'umd_edu', 'organization', 'university', 'maryland_college', 'park', 'lines', 'wondering', 'anyone', 'could_enlighten', 'car', 'saw', 'day', 'door_sports', 'car', 'looked_late', 'early', 'called_bricklin', 'doors_really', 'small', 'addition_front', 'bumper_separate', 'rest', 'body', 'know', 'anyone_tellme', 'model_name', 'engine_specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info_funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought_neighborhood', 'lerxst']
Creating models
# Convert each document to a list of words including generated bigrams
bigram = gensim.models.Phrases(data, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
data_words_bigrams = []
for i in range(len(data)):
    data_words_bigrams.append(bigram_mod[data[i]])
print(data_words_bigrams[0])
['thing', 'subject', 'car', 'nntp_posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'maryland', 'college', 'park', 'lines', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'door', 'sports', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
# Create the Dictionary for topic modeling
id2word = corpora.Dictionary(data_words_bigrams)
id2word
<gensim.corpora.dictionary.Dictionary at 0x22b10749650>
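The Dictionary object maps every unique token to an integer id. As a quick sanity check (a small sketch added here, not part of the original lab), its vocabulary size and a few of its mappings can be inspected:
# Inspect the Dictionary: vocabulary size and a sample of token-to-id mappings
print(len(id2word))  # number of unique tokens
print(list(id2word.token2id.items())[:5])  # a few (token, id) pairs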
# Create the Corpus for topic modeling
corpus = [id2word.doc2bow(text) for text in data_words_bigrams]
print(corpus[0])
[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 5), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1)]
# Let's see what these words are
for word in corpus[0]:
    print(id2word[word[0]], word[1])
addition 1 anyone 2 body 1 bricklin 1 brought 1 bumper 1 called 1 car 5 college 1 could 1 day 1 door 1 doors 1 early 1 edu 1 engine 1 enlighten 1 front 1 funky 1 history 1 host 1 il 1 info 1 know 1 late 1 lerxst 1 lines 1 looked 1 looking 1 made 1 mail 1 maryland 1 model 1 name 1 neighborhood 1 nntp_posting 1 organization 1 park 1 please 1 production 1 rac 1 really 1 rest 1 saw 1 separate 1 small 1 specs 1 sports 1 subject 1 tellme 1 thanks 1 thing 1 umd 1 university 1 wam 1 whatever 1 wondering 1 years 1
# Build the topic model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=20, # specify the number of topics
random_state=100, # set a random seed so that each output will be the same
update_every=1, # determine how often the model parameters should be updated
chunksize=100, # number of documents to be used in each training chunk
passes=10, # total number of training passes
per_word_topics=True)
Analyzing the results
# Print the keywords in the topics
for topic in lda_model.print_topics():
    print(topic)
(0, '0.029*"death" + 0.026*"moral" + 0.016*"morality" + 0.015*"punishment" + 0.014*"criminal" + 0.012*"trial" + 0.012*"uk" + 0.011*"fair" + 0.011*"al" + 0.010*"islam"')
(1, '0.022*"software" + 0.020*"mb" + 0.018*"dos" + 0.018*"files" + 0.017*"system" + 0.017*"disk" + 0.015*"windows" + 0.014*"pc" + 0.013*"memory" + 0.013*"mac"')
(2, '0.012*"health" + 0.011*"april" + 0.010*"department" + 0.009*"happened" + 0.009*"years" + 0.009*"key" + 0.008*"president" + 0.008*"context" + 0.007*"information" + 0.007*"national"')
(3, '0.032*"lines" + 0.031*"subject" + 0.030*"organization" + 0.027*"nntp_posting" + 0.026*"host" + 0.025*"matthew" + 0.019*"edu" + 0.012*"university" + 0.012*"david" + 0.011*"sale"')
(4, '0.018*"would" + 0.014*"one" + 0.013*"writes" + 0.012*"subject" + 0.012*"article" + 0.012*"lines" + 0.011*"people" + 0.011*"like" + 0.011*"organization" + 0.009*"know"')
(5, '0.021*"january" + 0.018*"ps" + 0.013*"mailing_list" + 0.009*"hole" + 0.008*"motorcycle" + 0.007*"cd_rom" + 0.007*"ericsson" + 0.006*"trick" + 0.005*"hammer" + 0.004*"integer"')
(6, '0.046*"pp" + 0.025*"season" + 0.024*"pittsburgh" + 0.020*"buffalo" + 0.018*"nhl" + 0.017*"lemieux" + 0.015*"chicago" + 0.014*"player" + 0.013*"boston" + 0.012*"first"')
(7, '0.017*"self" + 0.016*"thanks_advance" + 0.012*"printer" + 0.012*"added" + 0.011*"flight" + 0.010*"electric" + 0.010*"installing" + 0.009*"bell" + 0.009*"hm" + 0.009*"labs"')
(8, '0.025*"please" + 0.015*"subject" + 0.015*"organization" + 0.014*"lines" + 0.014*"mail" + 0.013*"windows" + 0.011*"program" + 0.011*"university" + 0.009*"help" + 0.009*"thanks"')
(9, '0.011*"rings" + 0.006*"precisely" + 0.005*"signing" + 0.005*"adding" + 0.004*"starters" + 0.003*"stadium" + 0.002*"yankee" + 0.002*"seldom" + 0.001*"toll" + 0.001*"lefthanded"')
(10, '0.023*"tape" + 0.020*"percent" + 0.009*"angle" + 0.009*"band" + 0.005*"japan" + 0.003*"fm" + 0.003*"mccarthy" + 0.000*"weaver" + 0.000*"photo_radar" + 0.000*"gm"')
(11, '0.000*"zg" + 0.000*"zfxz" + 0.000*"zjm" + 0.000*"zjk" + 0.000*"zj" + 0.000*"zilos" + 0.000*"zhd" + 0.000*"zh" + 0.000*"zgp" + 0.000*"zgos"')
(12, '0.013*"organization" + 0.013*"lines" + 0.013*"subject" + 0.012*"article" + 0.010*"writes" + 0.009*"government" + 0.009*"year" + 0.009*"car" + 0.009*"game" + 0.008*"host"')
(13, '0.020*"physical" + 0.018*"comp" + 0.014*"direct" + 0.011*"ho" + 0.009*"kit" + 0.008*"york" + 0.006*"sys" + 0.006*"adam" + 0.006*"genesis" + 0.004*"beings"')
(14, '0.022*"andrew" + 0.020*"cable" + 0.013*"reserve" + 0.012*"modem" + 0.011*"hp" + 0.011*"hewlett_packard" + 0.010*"case_western" + 0.010*"strip" + 0.010*"wa" + 0.010*"nj"')
(15, '0.034*"armenians" + 0.025*"armenian" + 0.022*"people" + 0.020*"turkish" + 0.019*"greek" + 0.014*"serdar_argic" + 0.014*"murder" + 0.014*"genocide" + 0.013*"government" + 0.011*"killed"')
(16, '0.050*"israel" + 0.034*"jews" + 0.023*"land" + 0.016*"united_states" + 0.014*"committee" + 0.014*"israeli" + 0.013*"died" + 0.012*"brother" + 0.011*"attacks" + 0.011*"peace"')
(17, '0.041*"jesus" + 0.030*"god" + 0.014*"christian" + 0.010*"may" + 0.010*"argument" + 0.009*"father" + 0.009*"bible" + 0.008*"certainly" + 0.008*"faith" + 0.008*"example"')
(18, '0.022*"max" + 0.012*"science" + 0.011*"dr" + 0.009*"passage" + 0.009*"values" + 0.009*"mother" + 0.008*"christ" + 0.008*"family" + 0.007*"manual" + 0.007*"fbi"')
(19, '0.053*"space" + 0.043*"nasa" + 0.015*"moon" + 0.014*"launch" + 0.013*"shuttle" + 0.012*"objective" + 0.010*"mission" + 0.008*"orbit" + 0.007*"possibility" + 0.007*"apollo"')
Topic 0 is represented as 0.029*"death" + 0.026*"moral" + 0.016*"morality" + 0.015*"punishment" + 0.014*"criminal" + 0.012*"trial" + 0.012*"uk" + 0.011*"fair" + 0.011*"al" + 0.010*"islam". This means the top 10 keywords contributing to this topic are "death", "moral", "morality", and so on, and that the weight of "death" on topic 0 is 0.029. The weights reflect how important a keyword is to that topic. On this basis, it is reasonable to summarize the topic as "religion".
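The same information can be retrieved programmatically. A minimal sketch (using the lda_model trained above) that pulls out topic 0's top 10 words as (word, weight) pairs:
# Print the top-10 (word, weight) pairs for topic 0
for word, weight in lda_model.show_topic(0, topn=10):
    print(word, round(weight, 3))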
Task 2
Summarize two additional topics (excluding topic 0) based on the top words in each topic (see the output above). Analyze the common theme or subject that each topic conveys through the frequent use of these top words. Include both your summarization (a single word or phrase) for each of the two topics and your rationale for choosing it.
# Compute model coherence score, which measures how good a given topic model is
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print("Coherence LDA score:", coherence_lda)
Coherence LDA score: 0.5123042740644081
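Coherence scores are most useful for comparing models. As a hedged sketch (not part of the original lab, and slow to run since it trains several models), one could vary the number of topics and keep the most coherent setting:
# Compare coherence across candidate numbers of topics (illustrative values)
for k in [10, 20, 30]:
    model_k = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                              num_topics=k, random_state=100, passes=10)
    cm = CoherenceModel(model=model_k, texts=data_words_bigrams,
                        dictionary=id2word, coherence='c_v')
    print(k, cm.get_coherence())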
Task 3
Rebuild the topic model using a chunksize of 50 and 5 passes. Then print the topics generated by this updated model, and compute the new Coherence LDA score.
How has this new model affected the composition of the top words in topic 0, and do you find the top words in topic 0 in the old or the new model to be more coherent? Has the new Coherence LDA score increased or decreased compared with the original score, and what implications can be drawn from this change? Discuss the connection between the change in the Coherence LDA score and the changes in chunksize and passes.
Include your code in the following cell (code) and your discussion in the next cell (markdown).
lda_model_new = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=20, # specify the number of topics
random_state=100, # setting a random seed so that each output will be the same
update_every=1, # determines how often the model parameters should be updated
chunksize=50, # number of documents to be used in each training chunk
passes=5, # total number of training passes
per_word_topics=True)
for topic in lda_model_new.print_topics():
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model_new, texts=data_words_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print("Coherence LDA score:", coherence_lda)
(0, '0.048*"death" + 0.025*"punishment" + 0.025*"died" + 0.020*"dave" + 0.020*"follow" + 0.018*"trial" + 0.017*"ass" + 0.016*"turbo" + 0.016*"al" + 0.015*"rose"')
(1, '0.025*"making" + 0.024*"doctor" + 0.023*"contact" + 0.022*"groups" + 0.020*"condition" + 0.020*"head" + 0.017*"kids" + 0.017*"constitution" + 0.017*"merely" + 0.016*"fair"')
(2, '0.091*"jesus" + 0.040*"god" + 0.027*"christian" + 0.019*"example" + 0.019*"father" + 0.016*"david" + 0.016*"bible" + 0.016*"faith" + 0.016*"saying" + 0.015*"children"')
(3, '0.031*"able" + 0.022*"values" + 0.021*"offer" + 0.021*"user" + 0.015*"avoid" + 0.015*"names" + 0.015*"charge" + 0.014*"needed" + 0.014*"clearly" + 0.013*"knowing"')
(4, '0.018*"would" + 0.015*"one" + 0.014*"writes" + 0.013*"people" + 0.012*"article" + 0.011*"like" + 0.010*"subject" + 0.010*"lines" + 0.009*"know" + 0.009*"organization"')
(5, '0.036*"mb" + 0.033*"pc" + 0.032*"software" + 0.028*"disk" + 0.027*"ibm" + 0.025*"advice" + 0.025*"apple" + 0.024*"mac" + 0.023*"scsi" + 0.018*"ftp"')
(6, '0.038*"pittsburgh" + 0.031*"lemieux" + 0.028*"buffalo" + 0.017*"player" + 0.013*"penguins" + 0.011*"traded" + 0.007*"pens" + 0.004*"ub" + 0.003*"nhl" + 0.003*"team"')
(7, '0.048*"program" + 0.038*"original" + 0.037*"words" + 0.028*"files" + 0.028*"images" + 0.017*"nature" + 0.016*"success" + 0.015*"via" + 0.014*"pat" + 0.014*"highly"')
(8, '0.023*"excellent" + 0.021*"american" + 0.016*"posting" + 0.016*"encryption" + 0.015*"school" + 0.015*"talk" + 0.015*"gun" + 0.013*"series" + 0.013*"brian" + 0.013*"request"')
(9, '0.041*"page" + 0.021*"led" + 0.018*"created" + 0.015*"allen" + 0.014*"animals" + 0.012*"chen" + 0.008*"younger" + 0.008*"mankind" + 0.007*"catalog" + 0.006*"popular"')
(10, '0.028*"lee" + 0.015*"portable" + 0.009*"forsale" + 0.008*"angle" + 0.007*"band" + 0.006*"brand_new" + 0.004*"iron" + 0.001*"fm" + 0.000*"dryer" + 0.000*"dc"')
(11, '0.000*"zg" + 0.000*"zfxz" + 0.000*"zjm" + 0.000*"zjk" + 0.000*"zj" + 0.000*"zilos" + 0.000*"zhd" + 0.000*"zh" + 0.000*"zgp" + 0.000*"zgos"')
(12, '0.016*"organization" + 0.015*"subject" + 0.015*"lines" + 0.013*"game" + 0.012*"article" + 0.010*"reference" + 0.008*"host" + 0.008*"university" + 0.008*"else" + 0.008*"college"')
(13, '0.052*"dept" + 0.017*"comp" + 0.010*"kit" + 0.004*"sys" + 0.000*"vince" + 0.000*"smokeless_tobacco" + 0.000*"among" + 0.000*"coli" + 0.000*"vt" + 0.000*"outbreak"')
(14, '0.044*"lines" + 0.043*"subject" + 0.042*"organization" + 0.026*"host" + 0.025*"nntp_posting" + 0.020*"university" + 0.018*"please" + 0.017*"edu" + 0.015*"windows" + 0.014*"thanks"')
(15, '0.011*"operator" + 0.010*"verify" + 0.008*"captain" + 0.003*"rat" + 0.001*"rn" + 0.000*"zod" + 0.000*"yalanci" + 0.000*"photo_radar" + 0.000*"armenians" + 0.000*"radar"')
(16, '0.043*"israel" + 0.031*"nasa" + 0.022*"space" + 0.018*"north" + 0.016*"jews" + 0.015*"army" + 0.013*"standards" + 0.012*"israeli" + 0.012*"library" + 0.011*"attacks"')
(17, '0.073*"certainly" + 0.033*"graphics" + 0.017*"reasoning" + 0.009*"load" + 0.006*"significance" + 0.003*"spec" + 0.003*"expense" + 0.003*"tiff" + 0.001*"philosophical" + 0.001*"gilbert"')
(18, '0.034*"dos" + 0.029*"include" + 0.022*"mother" + 0.022*"ie" + 0.021*"mention" + 0.020*"dr" + 0.018*"student" + 0.017*"size" + 0.016*"detroit" + 0.015*"max"')
(19, '0.000*"address" + 0.000*"zgp" + 0.000*"zgos" + 0.000*"zk" + 0.000*"zjm" + 0.000*"zjk" + 0.000*"zj" + 0.000*"zilos" + 0.000*"zhd" + 0.000*"zh"')
Coherence LDA score: 0.4830067294508546
# Finding the dominant topic in each document
count = 0
doc_id_list = []
dominant_topic_list = []
dominant_topic_weight = []
keywords_list = []
text_list = []
texts = df.content.values.tolist()  # full text of every document, computed once outside the loop
for doc in corpus:
    doc_id_list.append(count)  # id of the document
    doc_topics = lda_model.get_document_topics(doc)  # getting the distribution of topics
    doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)  # sorting the topics by weight, descending
    dominant_topic_list.append(doc_topics[0][0])  # getting the most important topic's id
    dominant_topic_weight.append(doc_topics[0][1])  # getting the most important topic's weight
    topic_keywords = []
    for keyword in lda_model.show_topic(doc_topics[0][0]):  # each keyword in that topic
        topic_keywords.append(keyword[0])  # adding each keyword to the topic keyword list
    keywords_list.append(topic_keywords)
    text_list.append(texts[count])  # getting the full text of the document
    count += 1
dominant_topic_df = pd.DataFrame({'Document ID': doc_id_list, 'Dominant Topic': dominant_topic_list,
'Dominant Topic Weight': dominant_topic_weight, 'Keywords': keywords_list,
'Text': text_list})
dominant_topic_df
  | Document ID | Dominant Topic | Dominant Topic Weight | Keywords | Text
---|---|---|---|---|---
0 | 0 | 12 | 0.918188 | [organization, lines, subject, article, writes... | From: lerxst@wam.umd.edu (where's my thing)\nS... |
1 | 1 | 14 | 0.406496 | [andrew, cable, reserve, modem, hp, hewlett_pa... | From: guykuo@carson.u.washington.edu (Guy Kuo)... |
2 | 2 | 4 | 0.770112 | [would, one, writes, subject, article, lines, ... | From: twillis@ec.ecn.purdue.edu (Thomas E Will... |
3 | 3 | 4 | 0.422536 | [would, one, writes, subject, article, lines, ... | From: jgreen@amber (Joe Green)\nSubject: Re: W... |
4 | 4 | 12 | 0.460902 | [organization, lines, subject, article, writes... | From: jcm@head-cfa.harvard.edu (Jonathan McDow... |
... | ... | ... | ... | ... | ... |
995 | 995 | 3 | 0.379162 | [lines, subject, organization, nntp_posting, h... | From: ibeshir@nyx.cs.du.edu (Ibrahim)\nSubject... |
996 | 996 | 4 | 0.581008 | [would, one, writes, subject, article, lines, ... | From: anwar+@cs.cmu.edu (Anwar Mohammed)\nSubj... |
997 | 997 | 4 | 0.609946 | [would, one, writes, subject, article, lines, ... | From: DJCOHEN@YaleVM.YCC.Yale.Edu (Daniel Cohe... |
998 | 998 | 4 | 0.387053 | [would, one, writes, subject, article, lines, ... | From: geb@cs.pitt.edu (Gordon Banks)\nSubject:... |
999 | 999 | 4 | 0.568791 | [would, one, writes, subject, article, lines, ... | From: dxf12@po.cwru.edu (Douglas Fowler)\nSubj... |
1000 rows × 5 columns
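Because each document's dominant topic is now in a single DataFrame, aggregate views are easy to produce. For example (an illustrative follow-up, not part of the original lab), counting how many documents each topic dominates:
# Count how many documents each topic dominates
print(dominant_topic_df['Dominant Topic'].value_counts())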
Task 4
Print the keywords of the document with index 51, along with the text of the document.
Provide a summary of the topic based on these keywords (in a single word or phrase). After reading the text, discuss whether the dominant topic truly reflects the main subject matter discussed in this document.
Include your code in the following cell (code) and your discussion in the next cell (markdown).
print(dominant_topic_df['Keywords'][51])
print(dominant_topic_df['Text'][51])
['jesus', 'god', 'christian', 'may', 'argument', 'father', 'bible', 'certainly', 'faith', 'example']
From: dlecoint@garnet.acns.fsu.edu (Darius_Lecointe) Subject: Re: Sabbath Admissions 5of5 Organization: Florida State University Lines: 27 I have been following this thread on talk.religion, soc.religion.christian.bible-study and here with interest. I am amazed at the different non-biblical argument those who oppose the Sabbath present. One question comes to mind, especially since my last one was not answered from Scripture. Maybe clh may wish to provide the first response. There is a lot of talk about the Sabbath of the TC being ceremonial. Answer this: Since the TC commandments is one law with ten parts on what biblical basis have you decided that only the Sabbath portion is ceremonial? OR You say that the seventh-day is the Sabbath but not applicable to Gentile Christians. Does that mean the Sabbath commandment has been annulled? References please. If God did not intend His requirements on the Jews to be applicable to Gentile Christians why did He make it plain that the Gentiles were now grafted into the commonwealth of Israel? Darius [Acts 15, Rom 14:5, Col 2:16, Gal 4:10. I believe we've gotten into a loop at this point. This is one of those classic situations where both sides think they have clear Scriptural support, and there's no obvious argument that is going to change anybody's mind. I don't think we're going anything but repeating ourselves. --clh]
# Finding the most dominant document in each topic
topic_id_list = list(range(20)) # a list of topic IDs from 0 to 19
highest_weight_list = [0] * 20 # a list with 20 "0"s, will be replaced with larger values
highest_weight_document = [0] * 20
keywords_list = [0] * 20
text_list = [0] * 20
count = 0
for doc in corpus:
    doc_topics = lda_model.get_document_topics(doc)  # getting the distribution of topics
    for topic in doc_topics:
        # for each topic, if the topic weight is larger than the current largest weight for that topic
        if topic[1] > highest_weight_list[topic[0]]:
            highest_weight_list[topic[0]] = topic[1]  # update that value to this topic weight
            highest_weight_document[topic[0]] = count  # update the most dominant document of that topic to the current document
            topic_keywords = []
            for keyword in lda_model.show_topic(topic[0]):
                topic_keywords.append(keyword[0])
            keywords_list[topic[0]] = topic_keywords  # update the keywords to the keywords of that topic
            text_list[topic[0]] = texts[count]  # update the text to the current document (texts was built in the previous section)
    count += 1
dominant_document_df = pd.DataFrame({'Topic ID': topic_id_list, 'Keywords': keywords_list,
'Dominant Document': highest_weight_document,
'Dominant Document Weight': highest_weight_list,
'Text': text_list})
dominant_document_df
  | Topic ID | Keywords | Dominant Document | Dominant Document Weight | Text
---|---|---|---|---|---
0 | 0 | [death, moral, morality, punishment, criminal,... | 10 | 0.424875 | From: irwin@cmptrc.lonestar.org (Irwin Arnstei... |
1 | 1 | [software, mb, dos, files, system, disk, windo... | 675 | 0.922028 | From: sp@odin.NoSubdomain.NoDomain (Svein Pede... |
2 | 2 | [health, april, department, happened, years, k... | 57 | 0.713098 | From: d_jaracz@oz.plymouth.edu (David R. Jarac... |
3 | 3 | [lines, subject, organization, nntp_posting, h... | 69 | 0.953684 | From: suresh@pa.dec.com (Suresh Balasubramania... |
4 | 4 | [would, one, writes, subject, article, lines, ... | 38 | 0.957335 | From: behanna@syl.nj.nec.com (Chris BeHanna)\n... |
5 | 5 | [january, ps, mailing_list, hole, motorcycle, ... | 8 | 0.251854 | From: holmes7000@iscsvax.uni.edu\nSubject: WIn... |
6 | 6 | [pp, season, pittsburgh, buffalo, nhl, lemieux... | 583 | 0.576145 | Subject: NHL Summary parse results for games p... |
7 | 7 | [self, thanks_advance, printer, added, flight,... | 73 | 0.481259 | From: rdb1@cbnewsj.cb.att.com (ronald.j.debloc... |
8 | 8 | [please, subject, organization, lines, mail, w... | 78 | 0.803162 | From: jfb@cci632.cci.com (John Bruno)\nSubject... |
9 | 9 | [rings, precisely, signing, adding, starters, ... | 44 | 0.262216 | From: cmeyer@bloch.Stanford.EDU (Craig Meyer)\... |
10 | 10 | [tape, percent, angle, band, japan, fm, mccart... | 169 | 0.107919 | Organization: Central Michigan University\nFro... |
11 | 11 | 0 | 0 | 0.000000 | 0 |
12 | 12 | [organization, lines, subject, article, writes... | 0 | 0.918144 | From: lerxst@wam.umd.edu (where's my thing)\nS... |
13 | 13 | [physical, comp, direct, ho, kit, york, sys, a... | 98 | 0.440307 | From: cjhs@minster.york.ac.uk\nSubject: Re: fr... |
14 | 14 | [andrew, cable, reserve, modem, hp, hewlett_pa... | 12 | 0.632761 | From: rodc@fc.hp.com (Rod Cerkoney)\nSubject: ... |
15 | 15 | [armenians, armenian, people, turkish, greek, ... | 378 | 0.717784 | From: sera@zuma.UUCP (Serdar Argic)\nSubject: ... |
16 | 16 | [israel, jews, land, united_states, committee,... | 33 | 0.401612 | From: ayr1@cunixa.cc.columbia.edu (Amir Y Rose... |
17 | 17 | [jesus, god, christian, may, argument, father,... | 51 | 0.664635 | From: dlecoint@garnet.acns.fsu.edu (Darius_Lec... |
18 | 18 | [max, science, dr, passage, values, mother, ch... | 144 | 0.978154 | Subject: roman.bmp 09/14------------ Part 9 of... |
19 | 19 | [space, nasa, moon, launch, shuttle, objective... | 59 | 0.270125 | From: leech@cs.unc.edu (Jon Leech)\nSubject: S... |
Task 5
Based on what we have learned about topic modeling, can you suggest potential applications for this method? Please provide an example from either academic research or a real-life situation, and explain the advantages that topic modeling could offer in that context.