9

I have a dataset that I am trying to convert into topics using BERTopic, but I can't get all the documents of a topic. BERTopic only returns 3 documents per topic.

from bertopic import BERTopic

topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    nr_topics='auto',
    n_gram_range=(3, 3),
    top_n_words=10,
    calculate_probabilities=True,
    seed_topic_list=topic_list,
)
topics, probs = topic_model.fit_transform(docs_test)
representative_doc = topic_model.get_representative_docs(1)  # topic 1
representative_doc

This topic contains more than 300 documents, but BERTopic only shows 3 of them with .get_representative_docs

2 Answers

8

There are probably solutions that are more elegant because I am not an expert, but I can share what worked for me:

topics, probs = topic_model.fit_transform(docs_test)

returns the topic assigned to each document, in the same order as docs_test.

Therefore, you can combine this output and the documents. For example, combine them into a Pandas dataframe using:

import pandas as pd

df = pd.DataFrame({'topic': topics, 'document': docs_test})

Now you can filter this dataframe by topic to find the corresponding documents:

topic_0 = df[df.topic == 0]
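
If you would rather not go through pandas, the same pairing can be sketched with a plain dictionary. This is only a minimal sketch: the `topics` and `docs_test` values below are made-up stand-ins for the real `fit_transform` output and input documents, and `group_docs_by_topic` is a hypothetical helper name, not part of BERTopic.

```python
from collections import defaultdict

# Hypothetical stand-ins: in practice `topics` comes from
# topic_model.fit_transform(docs_test) and `docs_test` is your own data.
topics = [0, 1, 0, -1, 1, 0]
docs_test = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]


def group_docs_by_topic(topics, docs):
    """Map each topic id to the list of all documents assigned to it."""
    grouped = defaultdict(list)
    # topics and docs are aligned by position, so zip pairs them up.
    for topic, doc in zip(topics, docs):
        grouped[topic].append(doc)
    return dict(grouped)


docs_by_topic = group_docs_by_topic(topics, docs_test)
print(docs_by_topic[0])  # every document assigned to topic 0
```

This relies on the topic list and the document list being in the same order, which is the same assumption the dataframe approach makes.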
4

BERTopic provides a method, get_document_info(), which returns a dataframe with one row per document and the topic assigned to it. https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_document_info

The returned dataframe looks like this:

index  Document   Topic  Name      ...
0      doc1_text  241    kw1_kw2_  ...
1      doc2_text  -1     kw1_kw2_  ...

You can use this dataframe to get all the documents associated with a particular topic, using pandas groupby or any other method you prefer.

T = topic_model.get_document_info(docs)
docs_per_topics = T.groupby(["Topic"]).apply(lambda x: x.index).to_dict()

The code returns a dictionary that maps each topic to the row indices of its documents, like the one shown below:

{
    -1: Int64Index([3,10,11,12,15,16,18,19,20,22,...365000], dtype='int64',length=149232),
    0: Int64Index([907,1281,1335,1337,...308420,308560,308645],dtype='int64',length=5127),
    ...
}
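
Since the dictionary holds row indices rather than text, one more step is needed to get the documents themselves. The following is a rough sketch under two assumptions: that the index values are positions into the `docs` list passed to get_document_info(), and that plain lists can stand in for the pandas index objects shown above. `texts_for_topic` is a hypothetical helper name, and the data is made up.

```python
# Hypothetical stand-ins: in practice `docs` is the list passed to
# get_document_info(), and the dict values are pandas index objects
# rather than plain lists (both can be iterated the same way).
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
docs_per_topics = {-1: [3], 0: [0, 2], 1: [1, 4]}


def texts_for_topic(docs_per_topics, docs, topic):
    """Return the text of every document whose index is listed under `topic`."""
    # Assumes each index is a position into `docs`; unknown topics give [].
    return [docs[i] for i in docs_per_topics.get(topic, [])]


topic_0_docs = texts_for_topic(docs_per_topics, docs, 0)
print(topic_0_docs)
```

If the dataframe was filtered or re-indexed before grouping, the positions may no longer line up with `docs`, in which case looking up the Document column directly is the safer route.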
1
  • That was back when BERTopic did not have an API for this. Earlier versions had this problem, but now it is quite easy.
    – Kaleem
    Aug 7, 2023 at 6:24
