Questions tagged [hdbscan]
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.
82
questions
13
votes
4
answers
20k
views
How do I solve "Failed building wheel for hdbscan"?
I tried to install hdbscan using pip install hdbscan, but I get this:
ERROR: Failed building wheel for hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed ...
9
votes
4
answers
32k
views
How to resolve ERROR: Could not build wheels for hdbscan, which is required to install pyproject.toml-based projects
I am trying to install bertopic and I got this error:
pip install bertopic
Collecting bertopic
> Using cached bertopic-0.11.0-py2.py3-none-any.whl (76 kB)
> Collecting ...
8
votes
1
answer
8k
views
HDBSCAN difference between parameters
I'm confused about the difference between the following parameters in HDBSCAN
min_cluster_size
min_samples
cluster_selection_epsilon
Correct me if I'm wrong.
For min_samples, if it is set to 7, then ...
7
votes
1
answer
3k
views
Issue with hdbscan (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject)
I know a number of people have posted about this before but I still can't resolve my error. I'm trying to import hdbscan but it keeps returning the following error
-------------------------------------...
6
votes
1
answer
10k
views
Is DBSCAN or HDBSCAN the better option, and why?
Which clustering method is considered the better of DBSCAN and HDBSCAN, and what is the reason behind that?
5
votes
1
answer
17k
views
How do I use sklearn.metrics.pairwise pairwise_distances with callable metric?
I'm doing some behavior analysis where I track behaviors over time and then create n-grams of those behaviors.
sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
...
5
votes
2
answers
2k
views
What is the appropriate distance metric when clustering paragraph/doc2vec vectors?
My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters where there are semantic and textual duplicates.
To do this I am using gensim to generate document ...
4
votes
1
answer
1k
views
hdbscan error: TypeError: 'numpy.float64' object cannot be interpreted as an integer
I ran the hdbscan function both on Linux and on Google Colab and got the same error:
TypeError: 'numpy.float64' object cannot be interpreted as an integer
The error seems to happen when applying data to the '...
4
votes
1
answer
4k
views
Problems with HDBSCAN and approximate predict
I would like to use the HDBSCAN clustering technique to predict outliers. I have trained my model to optimize the parameters, but then, when I apply approximate_predict on new data, I get different ...
4
votes
0
answers
832
views
Problem with hdbscan used with bertopic: OSError: [Errno 22] Invalid argument
I am writing because I have a problem (silly and obvious introduction, I know).
I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate extension:
Python 3....
3
votes
1
answer
2k
views
HDBSCAN handling of large datasets
I am trying to implement a clustering on a large dataset consisting of 146,000 observations, using the HDBSCAN algorithm. When I cluster these observations with the (default) Minkowski/Euclidean ...
3
votes
1
answer
9k
views
How to install the HDBSCAN module, Python 3.7, Windows 10
I need to use the HDBSCAN algorithm on my data but the module is not installed. I use Python 3.7. I am not very familiar with this kind of tricky installation, so please, can anyone give me a clear and ...
3
votes
1
answer
521
views
HDBSCAN for R crashes with large dataset
I tried to apply the HDBSCAN algorithm to my dataset (50000 GPS points).
However, every time I run the code, the R session crashes.
Here is the basic info. about my PC:
processor: Intel i7 7820x 3.6 ...
3
votes
0
answers
2k
views
Clustering with UMAP and HDBScan
I have a somewhat large amount of textual data, input by approximately 5000 people. I've assigned each person a vector using Doc2vec, reduced to two dimensions using UMAP and highlighted groups ...
2
votes
4
answers
59k
views
ERROR: You must give at least one requirement to install -- when running: pip install --upgrade --no-binary hdbscan
I am trying to install hdbscan on my PC, which runs Windows 10 and has Python 3.6 installed.
My first attempt failed:
(base) C:\WINDOWS\system32>pip install hdbscan --user
Collecting hdbscan
...
2
votes
3
answers
5k
views
How to evaluate HDBSCAN text clusters?
I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc) and then apply LDA to each ...
2
votes
1
answer
744
views
TypeError issue importing hdbscan
Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" ...
2
votes
1
answer
664
views
Scikit HDBSCAN *tree* labeling (not single-slice labeling)
BLUF: For a specific epsilon (or for HDBSCAN's 'favorite' epsilon), I can extract the mapping of my data in that epsilon's partition. But how can I see my data's full tree membership?
I've gotten a ...
2
votes
1
answer
2k
views
HDBSCAN won't utilize all available cpus. Processes just sleep
For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns at ...
2
votes
2
answers
11k
views
Trouble installing hdbscan package for python : "no module named 'hdbscan'" error
I want to run an algorithm written in Python on my Ubuntu virtual machine. It needs to import the hdbscan module. I thus want to install it on my virtual machine.
Following the documentation from PyPI....
2
votes
1
answer
501
views
Can the results of UMAP for HDBScan clustering be made more consistent?
I have a set of ~40K phrases which I'm clustering with HDBScan after using UMAP for dimensionality reduction. The steps are:
Generate embeddings using a fine-tuned BERT model
Reduce dimensions with ...
2
votes
0
answers
680
views
HDBSCAN approximate_predict always returning probability of 0
I am using HDBSCAN to generate prediction data for a given cluster model. I then attempt to classify new points using the approximate_predict function to find the correct cluster for a new point. The ...
2
votes
0
answers
394
views
Reduce spatial data set size using HDBSCAN
I am trying to reduce the spatial data set size by clustering the points and finding the center point of each cluster. I referred to this article (which uses DBSCAN) and it kind of helped except that now ...
2
votes
0
answers
2k
views
Difference Between OPTICS and HDBSCAN clustering techniques
As part of my assignment, I have to work on both the HDBSCAN and OPTICS clustering techniques. I have searched many sites to identify the differences between these algorithms. All I got was OPTICS ...
1
vote
2
answers
4k
views
dealing with noise in hdbscan
I have been testing hdbscan from the scikit-learn package with a small instance of (x,y) points "point_coord" and the resulting clusters do not really make sense to me. Given the small size of the ...
1
vote
1
answer
2k
views
Explain Behavior of HDBSCAN Clustering
I have a dataset of 6 elements. I computed the distance matrix using Gower distance, which resulted in the following matrix:
By just looking at this matrix, I can tell that element #0 is similar to ...
1
vote
1
answer
775
views
HDBSCAN: clustering, persistence and approximate_predict()
I want to cache my model results in order to make predictions without redoing the clustering.
I read that I can do that with the memory parameter in HDBSCAN.
I did that instead because I wanted to save ...
1
vote
1
answer
399
views
Plot a single cluster
I am working with HDBSCAN and I want to plot only one cluster of the data.
This is my current code:
import hdbscan
import pandas as pd
from sklearn.datasets import make_blobs
blobs, labels = ...
1
vote
1
answer
1k
views
How to properly cluster with HDBSCAN for 1D dataset?
My dataset below shows product sales per price (link to download dataset csv):
price quantity
0 5098.0 20
1 5098.5 40
2 5099.0 10
3 5100.0 90
4 ...
1
vote
1
answer
1k
views
HDBSCAN: shouldn't any object in a cluster have a probability value > 0? Also producing inconsistent results
I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook.
import pandas as pandas
import numpy as np
data = pandas.read_csv('data.csv')
That data looks something like this:
...
1
vote
1
answer
206
views
Python HDBScan class always fails on second iteration before even entering first function
I am attempting to look at conglomerated outlier information, utilizing several different SKLearn, HDBScan, and custom outlier detection classes. However, for some reason I am consistently running ...
1
vote
1
answer
1k
views
Anomalies Detection by DBSCAN
I am using DBSCAN on my training dataset in order to find outliers and remove them from the dataset before training the model. I am using DBSCAN on my training set of 7697 rows with 8 columns. Here is my ...
1
vote
2
answers
2k
views
Cluster a list of geographic points by distance and constraints
I have a delivery app, and I want to group orders (each order has lat and lng coordinates) by location proximity (linear distance) and constraints like max orders and max total products (each order ...
1
vote
2
answers
2k
views
How to visualise top terms on each HDBSCAN cluster
I'm currently trying to use HDBSCAN to cluster a bunch of movie data, in order to group similar content together and be able to come up with 'topics' that describe those clusters. I'm interested in ...
1
vote
1
answer
306
views
How to know which matrix row each cluster label corresponds to?
After doing clustering I end up with an object which stores all the cluster labels, something like this:
clusterer.labels_
The above is typically a list or an array. Then I always assign the labels ...
1
vote
0
answers
285
views
Fine-tuning UMAP parameters for clustering using HDBSCAN relative_validity (DBCV) scores
I am using UMAP and HDBSCAN to cluster similar embedded text data (https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e). There are multiple ...
1
vote
1
answer
404
views
HDBSCAN doesn't work anymore - 'float' object cannot be interpreted as an integer
I've been running HDBSCAN for weeks now on gene expression datasets and everything went perfectly well, but lately it refuses to run:
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1)....
1
vote
1
answer
117
views
ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float' - hdbscan validity_index
I'm using the validity index in the hdbscan package, which implements DBCV score according to the following paper:
https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
I'm working on a face ...
1
vote
0
answers
222
views
Creating clusters from 3D data through HDBSCAN
I have a problem: I have a big data set of 15000 points that represent airplanes over Europe, and I have latitudes, longitudes and altitudes. I am trying to create a program that will take ...
1
vote
0
answers
46
views
Serving "Frankenstein" (combined) models at scale
I have a TensorFlow model that's combined with a clustering algorithm (HDBSCAN). Both have been trained/fitted separately but they work together (tf -> hdbscan). I'm looking to serve predictions ...
1
vote
0
answers
287
views
HDBSCAN on Movielens Latent embeddings does not cluster well
I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job.
Data
The data I'm working with is the MovieLens 100K dataset, from ...
1
vote
0
answers
39
views
Measuring "single strongest peak" in a distribution
I'd like to automatically detect whether data have a very strongly discernible peak, with any particular distribution. The data can otherwise be quite noisy, or there might be several 'false' peaks. ...
1
vote
1
answer
456
views
HDBSCAN Cluster choice
I have been working with HDBSCAN and have a few hundred clusters based on my data. I am trying to select some cluster groups for further analysis. Looking for the clusters which have high inter-...
1
vote
1
answer
770
views
How to extract clusters from HDBSCAN algorithm
I'd like to extract the original points that form each cluster. I know that HDBSCAN doesn't have cluster centers, so I thought that if each label corresponds to the original point in the same order, I ...
1
vote
0
answers
359
views
How to find top terms in dbscan or hdbscan clusters?
I'm using dbscan from sklearn and HDBSCAN to cluster some documents.
vectorizer = TfidfVectorizer(stop_words=mystopwords)
X = vectorizer.fit_transform(y)
dbscan = DBSCAN(eps=0.75, min_samples = 9)
...
1
vote
0
answers
117
views
Printing a Python-generated plot in R
I am working on an HDBSCAN analysis, and am performing it using the hdbscan Python module within R. I have the following code:
library(reticulate)
hdb <- import("hdbscan") # Import ...
0
votes
1
answer
255
views
Clustering issue, can't find good params for HDBSCAN
I made a torch model which says whether two cropped anime face images are similar or not (trained using cosine similarity and contrastive loss on pairs of faces). I get the embeddings from my model for each ...
0
votes
1
answer
501
views
How to import hdbscan in VScode (anaconda installed)
Based on existing information, I've successfully installed the HDBSCAN package in my conda virtual environment using conda install -c conda-forge hdbscan
However, when I try to run this code import ...
0
votes
1
answer
1k
views
Using callable metric for HDBSCAN*
I want to cluster some data with HDBSCAN*.
The distance is calculated as a function of some parameters from both values so if the data look like:
label1 | label2 | label3
0 32 18.5 ...
0
votes
1
answer
352
views
Clustering for a single time series
I have a single NumPy array (x) and I want to cluster it in an unsupervised way using DBSCAN and hierarchical clustering with scikit-learn. Is clustering possible for single-array data? ...