Questions tagged [hdbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.
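
As a rough illustration of the description above, here is a minimal pure-Python sketch of the density-based idea: points with enough neighbors within a radius grow a cluster, and points in sparse regions are marked as outliers. The function name `dbscan`, the parameters `eps` and `min_pts`, and the sample points are assumptions made for this example. It is not the implementation used by scikit-learn or the hdbscan package, and it uses a brute-force neighbor search that only suits small inputs.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Two tight groups of four points each, plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbors within `eps`, so it ends up labeled as noise (-1), while each dense group gets its own cluster id.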

13 votes
4 answers
20k views

How do I solve "Failed building wheel for hdbscan"?

I tried to install hdbscan using pip install hdbscan and got this: ERROR: Failed building wheel for hdbscan ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed ...
Omar Hossam's user avatar
9 votes
4 answers
32k views

How to resolve ERROR: Could not build wheels for hdbscan, which is required to install pyproject.toml-based projects

I am trying to install bertopic and I got this error: pip install bertopic Collecting bertopic > Using cached bertopic-0.11.0-py2.py3-none-any.whl (76 kB) > Collecting ...
DorothyK's user avatar
8 votes
1 answer
8k views

HDBSCAN difference between parameters

I'm confused about the difference between the following parameters in HDBSCAN min_cluster_size min_samples cluster_selection_epsilon Correct me if I'm wrong. For min_samples, if it is set to 7, then ...
HR1's user avatar
  • 507
7 votes
1 answer
3k views

Issue with hdbscan (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject)

I know a number of people have posted about this before but I still can't resolve my error. I'm trying to import hdbscan but it keeps returning the following error -------------------------------------...
code_learner93's user avatar
6 votes
1 answer
10k views

Is DBSCAN or HDBSCAN the better option, and why?

Which clustering method is considered the better of DBSCAN and HDBSCAN, and what is the reason behind that?
Mahnaz Rafia Islam's user avatar
5 votes
1 answer
17k views

How do I use sklearn.metrics.pairwise pairwise_distances with callable metric?

I'm doing some behavior analysis where I track behaviors over time and then create n-grams of those behaviors. sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'], ...
not-bob's user avatar
  • 835
5 votes
2 answers
2k views

What is the appropriate distance metric when clustering paragraph/doc2vec vectors?

My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters where there are semantical and textual duplicates. To do this I am using gensim to generate document ...
fluffet's user avatar
  • 53
4 votes
1 answer
1k views

hdbscan error: TypeError: 'numpy.float64' object cannot be interpreted as an integer

I ran the hdbscan function code on both Linux and Google Colab and got the same error: TypeError: 'numpy.float64' object cannot be interpreted as an integer. The error seems to happen when applying data to the '...
Sotiris's user avatar
  • 41
4 votes
1 answer
4k views

Problems with HDBSCAN and approximate predict

I would like to use the HDBSCAN clustering technique to predict outliers. I have trained my model to optimize the parameters, but then, when I apply approximate_predict on new data, I get different ...
Ariadna Fernández's user avatar
4 votes
0 answers
832 views

Problem with hdbscan used with bertopic: OSError: [Errno 22] Invalid argument

I am writing because I have a problem (silly and obvious introduction, I know). I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate extension: Python 3....
Francis's user avatar
  • 41
3 votes
1 answer
2k views

HDBSCAN handling of large datasets

I am trying to implement a clustering on a large dataset consisting of 146,000 observations, using the HDBSCAN algorithm. When I cluster these observations with the (default) Minkowski/Euclidean ...
statsguy96's user avatar
3 votes
1 answer
9k views

How to install the HDBSCAN module, Python 3.7, Windows 10

I need to use the HDBSCAN algorithm on my data but the module is not installed. I use Python 3.7. I am not very familiar with this kind of tricky installation; please, can anyone give me a clear and ...
Artashes's user avatar
  • 112
3 votes
1 answer
521 views

HDBSCAN for R Crashed with large dataset

I tried to apply the HDBSCAN algorithm to my dataset (50000 GPS points). However, every time I run the code, the R session crashes. Here is the basic info about my PC: processor: Intel i7 7820x 3.6 ...
Yunzhe Liu's user avatar
3 votes
0 answers
2k views

Clustering with UMAP and HDBScan

I have a somewhat large amount of textual data, input by approximately 5000 people. I've assigned each person a vector using Doc2vec, reduced to two dimensions using UMAP and highlighted groups ...
Jacob's user avatar
  • 53
2 votes
4 answers
59k views

ERROR: You must give at least one requirement to install -- when running: pip install --upgrade --no-binary hdbscan

I am trying to install hdbscan in my PC which runs Windows 10 and has installed Python 3.6. My first attempt failed: (base) C:\WINDOWS\system32>pip install hdbscan --user Collecting hdbscan ...
user8270077's user avatar
  • 4,891
2 votes
3 answers
5k views

How to evaluate HDBSCAN text clusters?

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc) and then apply LDA to each ...
J.Doe's user avatar
  • 539
2 votes
1 answer
744 views

TypeError issue importing hdbscan

Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" ...
Nathan Luo's user avatar
2 votes
1 answer
664 views

Scikit HDBSCAN *tree* labeling (not single-slice labeling)

BLUF: For a specific epsilon (or for HDBSCAN's 'favorite' epsilon), I can extract the mapping of my data in that epsilon's partition. But how can I see my data's full tree membership? I've gotten a ...
Sam Greenberg's user avatar
2 votes
1 answer
2k views

HDBSCAN won't utilize all available cpus. Processes just sleep

For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns at ...
Marc Frankel's user avatar
2 votes
2 answers
11k views

Trouble installing hdbscan package for python : "no module named 'hdbscan'" error

I want to run an algorithm written in Python on my Ubuntu virtual machine. It needs to import the hdbscan module. I thus want to install it on my virtual machine. Following the documentation from PyPI....
Lalastro's user avatar
  • 181
2 votes
1 answer
501 views

Can the results of UMAP for HDBScan clustering be made more consistent?

I have a set of ~40K phrases which I'm clustering with HDBScan after using UMAP for dimensionality reduction. The steps are: Generate embeddings using a fine-tuned BERT model Reduce dimensions with ...
TKR's user avatar
  • 155
2 votes
0 answers
680 views

HDBSCAN approximate_predict always returning probability of 0

I am using HDBSCAN to generate prediction data for a given cluster model. I then attempt to classify new points using the approximate_predict function to find the correct cluster for a new point. The ...
James's user avatar
  • 469
2 votes
0 answers
394 views

Reduce spatial data set size using HDBSCAN

I am trying to reduce the spatial data set size by clustering the points and finding the center point of each cluster. I referred to this article (which uses DBSCAN) and it kind of helped except that now ...
M_S_N's user avatar
  • 2,800
2 votes
0 answers
2k views

Difference Between OPTICS and HDBSCAN clustering techniques

As part of my assignment, I have to work on both the HDBSCAN and OPTICS clustering techniques. I have searched many sites to identify the difference between these algorithms. All I got was OPTICS ...
Minu's user avatar
  • 33
1 vote
2 answers
4k views

dealing with noise in hdbscan

I have been testing hdbscan from the scikit learn package with a small instance of (x,y) points "point_coord" and the resulting clusters do not really make sense to me. Given the small size of the ...
Mike's user avatar
  • 375
1 vote
1 answer
2k views

Explain Behavior of HDBSCAN Clustering

I have a dataset of 6 elements. I computed the distance matrix using Gower distance, which resulted in the following matrix: By just looking at this matrix, I can tell that element #0 is similar to ...
HR1's user avatar
  • 507
1 vote
1 answer
775 views

HDBSCAN: clustering, persistence and approximate_predict()

I want to cache my model results in order to make predictions without redoing the clustering. I read that I can do that with memory parameter in HDBSCAN. I did that instead because I wanted to save ...
tonythestark's user avatar
1 vote
1 answer
399 views

Plot a single cluster

I am working with HDBSCAN and I want to plot only one cluster of the data. This is my current code: import hdbscan import pandas as pd from sklearn.datasets import make_blobs blobs, labels = ...
Cruz's user avatar
  • 133
1 vote
1 answer
1k views

How to properly cluster with HDBSCAN for 1D dataset?

My dataset below shows product sales per price (link to download dataset csv): price quantity 0 5098.0 20 1 5098.5 40 2 5099.0 10 3 5100.0 90 4 ...
Eduardo Gomes's user avatar
1 vote
1 answer
1k views

HDBSCAN Shouldn't any object in a cluster have a probability value > 0? And producing inconsistent results

I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook. import pandas as pandas import numpy as np data = pandas.read_csv('data.csv') That data looks something like this: ...
Glen Pierce's user avatar
  • 4,661
1 vote
1 answer
206 views

Python HDBScan class always fails on second iteration before even entering first function

I am attempting to look at conglomerated outlier information, utilizing several different SKLearn, HDBScan, and custom outlier detection classes. However, for some reason I am consistently running ...
WolVes's user avatar
  • 1,316
1 vote
1 answer
1k views

Anomalies Detection by DBSCAN

I am using DBSCAN on my training dataset in order to find outliers and remove those outliers from the dataset before training the model. I am using DBSCAN on my training set of 7697 rows with 8 columns. Here is my ...
user172500's user avatar
1 vote
2 answers
2k views

Cluster a list of geographic points by distance and constraints

I have a delivery app, and I want to group orders (each order has lat and lng coordinates) by location proximity (linear distance) and constraints like max orders and max total products (each order ...
Alex's user avatar
  • 1,043
1 vote
2 answers
2k views

How to visualise top terms on each HDBSCAN cluster

I'm currently trying to use HDBSCAN to cluster a bunch of movie data, in order to group similar content together and be able to come up with 'topics' that describe those clusters. I'm interested in ...
J.Doe's user avatar
  • 539
1 vote
1 answer
306 views

How to know to which matrix row corresponds each cluster label?

After doing clustering I end up with an object which stores all the cluster labels, something like this: clusterer.labels_ The above is typically a list or an array. Then I always assign the labels ...
tumbleweed's user avatar
  • 4,572
1 vote
0 answers
285 views

Fine-tuning UMAP parameters for clustering using HDBSCAN relative_validity (DBCV) scores

I am using UMAP and HDBSCAN to cluster similar embedded text data (https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e). There are multiple ...
Jackie's user avatar
  • 11
1 vote
1 answer
404 views

HDBSCAN doesn't work anymore - 'float' object cannot be interpreted as an integer

I'm running HDBSCAN for weeks now on gene expression datasets and everything went perfectly well, but lately it refuses to run : clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1)....
Nozelar's user avatar
  • 11
1 vote
1 answer
117 views

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float' - hdbscan validity_index

I'm using the validity index in the hdbscan package, which implements DBCV score according to the following paper: https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf I'm working on a face ...
Faisal Aldhuwayhi's user avatar
1 vote
0 answers
222 views

Creating clusters from 3D data through HDBSCAN

I have a big data set of 15000 points; those points represent airplanes over Europe, and I have latitudes, longitudes and altitudes. I am trying to create a program that will take ...
Martin Kavka's user avatar
1 vote
0 answers
46 views

Serving "Frankenstein" (combined) models at scale

I have a TensorFlow model that's combined with a clustering algorithm (HDBSCAN). Both have been trained/fitted separately but they work together (tf -> hdbscan). I'm looking to serve predictions ...
bli00's user avatar
  • 2,407
1 vote
0 answers
287 views

HDBSCAN on Movielens Latent embeddings does not cluster well

I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job. Data The data I'm working with is the MovieLens 100K dataset, from ...
Mhaexym's user avatar
  • 11
1 vote
0 answers
39 views

Measuring "single strongest peak" in a distribution

I'd like to automatically detect whether data have a very strongly discernable peak, with any particular distribution. The data can otherwise be quite noisy, or there might be several 'false' peaks. ...
L Fischman's user avatar
1 vote
1 answer
456 views

HDBSCAN Cluster choice

I have been working with HDBSCAN and have a few hundreds of clusters based on my data. I am trying to select some cluster groups for further analysis. Looking for the clusters which have high inter-...
Jazz's user avatar
  • 465
1 vote
1 answer
770 views

How to extract clusters from HDBSCAN algorithm

I'd like to extract the original points that form each cluster. I know that HDBSCAN doesn't have cluster centers, so I thought that in case each label corresponds to the original point in the same order, I ...
1 vote
0 answers
359 views

How to find top terms in dbscan or hdbscan clusters?

I'm using dbscan from sklearn and HDBSCAN to cluster some documents. vectorizer = TfidfVectorizer(stop_words=mystopwords) X = vectorizer.fit_transform(y) dbscan = DBSCAN(eps=0.75, min_samples = 9) ...
user3400567's user avatar
1 vote
0 answers
117 views

Printing a Python-generated plot in R

I am working on performing a HDBSCAN, and am performing the analysis using the hdbscan python module within R. I have the following code: library(reticulate) hdb <- import("hdbscan") # Import ...
kneijenhuijs's user avatar
  • 1,199
0 votes
1 answer
255 views

Clustering issue, can't find good params for HDBSCAN

I made a torch model which says whether two cropped anime face images are similar or not (trained using cosine similarity and contrastive loss on pairs of faces). I get the embeddings from my model for each ...
Maximax67's user avatar
0 votes
1 answer
501 views

How to import hdbscan in VScode (anaconda installed)

Based on existing information, I've successfully installed the HDBSCAN package in my conda virtual environment using conda install -c conda-forge hdbscan. However, when I try to run this code import ...
Gillian's user avatar
  • 49
0 votes
1 answer
1k views

Using callable metric for HDBSCAN*

I want to cluster some data with HDBSCAN*. The distance is calculated as a function of some parameters from both values so if the data look like: label1 | label2 | label3 0 32 18.5 ...
Roy Ancri's user avatar
  • 119
0 votes
1 answer
352 views

Clustering for a single time series

I have a single NumPy array (x) and I want to cluster it in an unsupervised way using DBSCAN and hierarchical clustering with scikit-learn. Is clustering possible for single-array data? ...
pro's user avatar
  • 113