Questions tagged [topic-modeling]
Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.
984
questions
44
votes
6
answers
33k
views
Remove empty documents from DocumentTermMatrix in R topicmodels?
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
corpus <- Corpus(VectorSource(...
44
votes
2
answers
27k
views
LDA topic modeling - Training and testing
I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.
References say that LDA is an algorithm which, given a collection of ...
32
votes
2
answers
4k
views
Simple Python implementation of collaborative topic modeling?
I came across these 2 papers which combined collaborative filtering (Matrix factorization) and Topic modelling (LDA) to recommend users similar articles/posts based on topic terms of post/articles ...
29
votes
5
answers
31k
views
Understanding LDA implementation using gensim
I am trying to understand how gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:
Define the dataset
documents = ["Apple is releasing a new product",
...
29
votes
2
answers
34k
views
Topic models: cross validation with loglikelihood or perplexity
I'm clustering documents using topic modeling. I need to come up with the optimal number of topics. So, I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ...
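For the cross-validation setup described in this question, held-out perplexity is conventionally the exponential of the negative per-token log-likelihood. The sketch below is only an illustration: the helper names `perplexity` and `k_fold_indices` are hypothetical, and it assumes the model reports a total log-likelihood (natural log) and a token count for each held-out fold.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Return exp(-log-likelihood per token); lower values are better."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)


def k_fold_indices(n_docs, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_docs // k + (1 if i < n_docs % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_docs))
        yield train_idx, test_idx
        start += size
```

For each candidate number of topics, one would train on `train_idx`, score `test_idx`, and average the perplexities over the folds before comparing candidates.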
26
votes
10
answers
48k
views
How to print the LDA topics models from gensim? Python
Using gensim I was able to extract topics from a set of documents in LSA but how do I access the topics generated from the LDA models?
When printing the lda.print_topics(10) the code gave the ...
26
votes
2
answers
46k
views
Gensim: KeyError: "word not in vocabulary"
I have a trained Word2vec model using Python's Gensim library. I have a tokenized list as below. The vocab size is 34 but I am just giving a few out of 34:
b = ['let',
'know',
'buy',
'someth',
'...
26
votes
2
answers
15k
views
What's the disadvantage of LDA for short texts?
I am trying to understand why Latent Dirichlet Allocation (LDA) performs poorly in short text environments like Twitter. I've read the paper 'A biterm topic model for short text', however, I still do ...
24
votes
1
answer
21k
views
Export pyLDAvis graphs as standalone webpage
I am analysing text with topic modelling, using Gensim and pyLDAvis for that. I would like to share the results with distant colleagues, without a need for them to install Python and all required ...
21
votes
1
answer
18k
views
Predicting LDA topics for new data
It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as ...
21
votes
6
answers
10k
views
Using scikit-learn vectorizers and vocabularies with gensim
I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a great deal of vectorized data; second, I prefer the interface ...
21
votes
3
answers
23k
views
Using Word2Vec for topic modeling
I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA).
However, I am interested in whether it is a good idea to try ...
19
votes
4
answers
19k
views
LDA model generates different topics every time I train on the same corpus
I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.
Why does ...
19
votes
3
answers
27k
views
LDA with topicmodels, how can I see which topics different documents belong to?
I am using LDA from the topicmodels package. I have run it on about 30,000 documents, acquired 30 topics, and got the top 10 words for each topic; they look very good. But I would like to see ...
17
votes
2
answers
26k
views
get_document_topics and get_term_topics in gensim
The ldamodel in gensim has the two methods: get_document_topics and get_term_topics.
Despite their use in this gensim tutorial notebook, I do not fully understand how to interpret the output of ...
15
votes
1
answer
7k
views
How to interpret LDA components (using sklearn)?
I used Latent Dirichlet Allocation (sklearn implementation) to analyse about 500 scientific article abstracts and I got topics containing the most important words (in German). My problem is to ...
14
votes
3
answers
34k
views
Evaluation of topic modeling: How to understand a coherence value / c_v of 0.4, is it good or bad? [closed]
I need to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modelling algorithm.
What is the average coherence score in this context?
14
votes
1
answer
5k
views
Spark MLlib LDA, how to infer the topics distribution of a new unseen document?
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here but I couldn't find how to then use the model to find the topic distribution in ...
14
votes
1
answer
4k
views
R Supervised Latent Dirichlet Allocation Package
I'm using this LDA package for R. Specifically I am trying to do supervised latent dirichlet allocation (slda). In the linked package, there's an slda.em function. However what confuses me is that it ...
12
votes
2
answers
13k
views
__init__() got an unexpected keyword argument 'cachedir' when importing top2vec
I keep getting this error when importing top2vec.
TypeError Traceback (most recent call last)
Cell In [1], line 1
----> 1 from top2vec import Top2Vec
File ~\AppData\...
12
votes
2
answers
17k
views
What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?
I am trying to obtain the optimal number of topics for an LDA-model within Gensim. One method I found is to calculate the log likelihood for each model and compare each against each other, e.g. at The ...
12
votes
2
answers
3k
views
Gensim LDA topic assignment
I am hoping to assign each document to one topic using LDA. Now I realise that what you get is a distribution over topics from LDA. However as you see from the last line below I assign it to the most ...
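The "assign to the most probable topic" step mentioned in this question can be sketched without any library. This is only an illustration: it assumes the per-document output is a list of `(topic_id, probability)` pairs, which is the shape gensim's `get_document_topics` returns, and the helper name `dominant_topic` and the sample numbers are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability, or None if empty."""
    if not doc_topics:
        return None
    # max() keeps the first pair on ties, so earlier topic ids win.
    best_id, _best_prob = max(doc_topics, key=lambda pair: pair[1])
    return best_id


# Made-up distribution for one document.
doc_topics = [(0, 0.12), (3, 0.61), (7, 0.27)]
print(dominant_topic(doc_topics))
```

A hard assignment like this discards the rest of the distribution, so a document split almost evenly between two topics is labelled as confidently as one dominated by a single topic.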
11
votes
1
answer
9k
views
Understanding LDA / topic modelling -- too much topic overlap
I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach).
I have a small number of ...
11
votes
2
answers
12k
views
Making gsub only replace entire words?
(I'm using R.) For a list of words that's called "goodwords.corpus", I am looping through the documents in a corpus, and replacing each of the words on the list "goodwords.corpus" with the word + a ...
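The question is about R's `gsub`, but the underlying idea, anchoring the pattern with word boundaries so that only whole words match, is the same in most regex engines. A minimal sketch in Python, where the function name `tag_whole_words` and the `_good` suffix are hypothetical stand-ins for whatever the asker appends:

```python
import re


def tag_whole_words(text, words, suffix="_good"):
    """Append `suffix` to each whole-word occurrence of any word in `words`."""
    if not words:
        return text
    # \b on both sides keeps "good" from matching inside "goodness".
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: m.group(1) + suffix, text)


print(tag_whole_words("good goodness is good", ["good"]))
```

Building one alternation for the whole word list avoids looping over the text once per word.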
11
votes
5
answers
15k
views
Visualizing an LDA model, using Python
I have an LDA model with the 10 most common topics in 10K documents. Now it's just an overview of the words with corresponding probability distribution for each topic.
I was wondering if there is ...
11
votes
3
answers
17k
views
How to predict the topic of a new query using a trained LDA model using gensim?
I have trained a corpus for LDA topic modelling using gensim.
Going through the tutorial on the gensim website (this is not the whole code):
question = 'Changelog generation from Github issues?';
...
10
votes
6
answers
9k
views
How to access topic words only in gensim
I built an LDA model using Gensim and I want to get the topic words only. How can I get just the words of the topics, with no probabilities and no IDs?
I tried print_topics() and show_topics() ...
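One way to get "words only" is to drop the probabilities from the per-topic term list. The sketch below assumes the terms arrive as `(word, probability)` pairs, which is the shape gensim's `show_topic` returns for a single topic; the helper name `words_only` and the sample data are hypothetical.

```python
def words_only(topic_terms):
    """Keep just the words from a list of (word, probability) pairs."""
    return [word for word, _prob in topic_terms]


# Made-up terms for one topic.
topic_terms = [("apple", 0.12), ("product", 0.08), ("release", 0.05)]
print(words_only(topic_terms))
```

The input order is preserved, so if the pairs arrive sorted by probability the words stay in that order.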
10
votes
2
answers
4k
views
What is the relation between topic modeling and document clustering?
Topic modeling identifies distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique to do ...
10
votes
1
answer
7k
views
How to get document_topics distribution of all of the document in gensim LDA?
I'm new to Python and I need to build an LDA project. After doing some preprocessing steps, here is my code:
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
from ...
10
votes
3
answers
7k
views
How to understand the output of Topic Model class in Mallet?
As I'm trying out the example code in the topic modeling developer's guide, I really want to understand the meaning of the output of that code.
First during the running process, it gives out:
Coded LDA:...
10
votes
1
answer
10k
views
LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn
I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn.
Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic ...
9
votes
2
answers
7k
views
How to get all documents per topic in BERTopic modeling
I have a dataset and am trying to convert it to topics using BERTopic modeling, but the problem is that I can't get all the documents of a topic. BERTopic only returns 3 documents per topic.
topic_model = ...
9
votes
1
answer
6k
views
Topic modelling - Assign a document with top 2 topics as category label - sklearn Latent Dirichlet Allocation
I am now going through the LDA (Latent Dirichlet Allocation) topic modelling method to help in the extraction of topics from a set of documents. As from what I have understood from the link below, this is an ...
9
votes
4
answers
8k
views
pyLDAvis: Validation error on trying to visualize topics
I tried generating topics using gensim for 300000 records. On trying to visualize the topics, I get a validation error. I can print the topics after model training, but it fails on using pyLDAvis
# ...
9
votes
2
answers
12k
views
How do I print lda topic model and the word cloud of each of the topics
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from gensim import corpora, models
import gensim
import os
from os import path
from time import sleep
import matplotlib....
8
votes
2
answers
7k
views
python scikit learn, get documents per topic in LDA
I am doing LDA on text data, using the example here:
My question is:
How can I know which documents correspond to which topic?
In other words, what are the documents talking about topic 1 for ...
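Grouping documents by their strongest topic is one way to answer "which documents correspond to which topic". The sketch below assumes a document-topic matrix with one row per document and one column per topic, which is the layout scikit-learn's `LatentDirichletAllocation.transform` produces; plain lists stand in for the array, and the function name and numbers are hypothetical.

```python
from collections import defaultdict


def documents_per_topic(doc_topic):
    """Map each topic index to the ids of documents where it has the top weight."""
    groups = defaultdict(list)
    for doc_id, row in enumerate(doc_topic):
        # Index of the largest weight in this document's row.
        best_topic = max(range(len(row)), key=row.__getitem__)
        groups[best_topic].append(doc_id)
    return dict(groups)


# Made-up weights for three documents over three topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]
print(documents_per_topic(doc_topic))
```

Topics that are never the strongest for any document do not appear in the result.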
8
votes
2
answers
6k
views
Gensim LDA Coherence Score Nan
I created a Gensim LDA Model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=...
8
votes
3
answers
8k
views
How to print out the full distribution of words in an LDA topic in gensim?
The lda.show_topics method in the following code only prints the distribution of the top 10 words for each topic. How do I print out the full distribution of all the words in the corpus?
from ...
8
votes
2
answers
48k
views
How to avoid decoding to str: need a bytes-like object error in pandas?
Here is my code :
data = pd.read_csv('asscsv2.csv', encoding = "ISO-8859-1", error_bad_lines=False);
data_text = data[['content']]
data_text['index'] = data_text.index
documents = data_text
It looks ...
8
votes
1
answer
5k
views
Why am I getting different results with MALLET topic inference for a single document and a batch of documents?
I'm trying to perform LDA topic modeling with Mallet 2.0.7. I can train an LDA model and get good results, judging by the output from the training session. Also, I can use the inferencer built in ...
8
votes
1
answer
2k
views
Is there any way to match Gensim LDA output with topics in pyLDAvis graph?
I need to process the topics in the LDA output (lda.show_topics(num_topics=-1, num_words=100...) and then compare what I do with the pyLDAvis graph, but the topics are numbered differently. Is ...
8
votes
2
answers
4k
views
Topic modelling, but with known topics?
Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as ...
8
votes
1
answer
11k
views
Pickle AttributeError: Can't get attribute 'Wishart' on <module '__main__' from 'app.py'>
I ran my code to load a variable saved with pickle. This is my code:
import pickle
last_priors_file = open('simpanan/priors', 'rb')
priors = pickle.load(last_priors_file)
and I get an error like ...
7
votes
1
answer
8k
views
ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize on topic coherence
I'm following this tutorial https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0 and ran into a problem. My purpose with this code is to iterate it ...
7
votes
3
answers
7k
views
pyLDAvis with Mallet LDA implementation : LdaMallet object has no attribute 'inference'
Is it possible to plot a pyLDAvis visualization with a Mallet implementation of LDA? I have no trouble with LDA_Model, but when I use Mallet I get:
'LdaMallet' object has no attribute 'inference'
My code :
...
7
votes
1
answer
3k
views
Error installing topicmodels in R on Ubuntu
I am getting an error while installing the topicmodels package in R.
On running install.packages("topicmodels",dependencies=TRUE), the following are the last few lines I am getting. Please help. My R version is ...
7
votes
3
answers
4k
views
Meaning of bar width for pyLDAvis for lambda = 0
Not sure if this is the right forum but I was wondering if anyone understands how to interpret the width of the red vs. blue bars on the right-hand side of pyLDAvis plots when lambda = 0 (see http://...
7
votes
1
answer
3k
views
What is the difference between LDA and NTM in Amazon Sagemaker for Topic Modeling?
I am looking for the difference between LDA and NTM. What are some use cases where you would use LDA over NTM?
As per the AWS docs:
LDA : The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an ...
7
votes
5
answers
2k
views
Mallet topic model example can not compile
I want to use MALLET from my Java code (instead of using the command line), so I included the jar in my project and used the example code from: http://mallet.cs.umass.edu/topics-devel.php, however, ...
7
votes
3
answers
9k
views
Text Clustering and topic extraction
I'm doing some text mining using the excellent scikit-learn module. I'm trying to cluster and classify scientific abstracts.
I'm looking for a way to cluster my set of tf-idf representations, without ...