esda.adbscan.ADBSCAN
- class esda.adbscan.ADBSCAN(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]
A-DBSCAN, as introduced in [ABGLopezVM21].
A-DBSCAN is an extension of the original DBSCAN algorithm that creates an ensemble of solutions generated by running DBSCAN on a random subset and “extending” the solution to the rest of the sample through nearest-neighbor regression.
See the original reference ([ABGLopezVM21]) for more details or the notebook guide for an illustration. …
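The subsample-and-extend step described above can be sketched with standard scikit-learn pieces. This is an illustrative reading of the idea, not the esda implementation; the data and the eps, min_samples, and pct_exact values are made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
xys = rng.random((200, 2))  # toy point pattern

# 1. Draw a random subset of the data (a pct_exact-sized fraction)
pct_exact = 0.5
idx = rng.choice(len(xys), size=int(pct_exact * len(xys)), replace=False)

# 2. Run plain DBSCAN on the subset only
sub_labels = DBSCAN(eps=0.1, min_samples=5).fit(xys[idx]).labels_

# 3. "Extend" the subset solution to the full sample through
#    nearest-neighbor classification (one label per observation)
knn = KNeighborsClassifier(n_neighbors=1).fit(xys[idx], sub_labels)
full_labels = knn.predict(xys)
```

A-DBSCAN repeats a draw like this reps times and aggregates the resulting ensemble of label vectors by voting.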
- Parameters:
  - eps : float
    The maximum distance between two samples for them to be considered
    as in the same neighborhood.
  - min_samples : int
    The number of samples (or total weight) in a neighborhood
    for a point to be considered as a core point. This includes the
    point itself.
  - algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
    The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
  - n_jobs : int
    [Optional. Default=1] The number of parallel jobs to run. If -1,
    the number of jobs is set to the number of CPU cores.
  - pct_exact : float
    [Optional. Default=0.1] Proportion of the entire dataset
    used to calculate DBSCAN in each draw.
  - reps : int
    [Optional. Default=100] Number of random samples to draw in order to
    build final solution.
  - keep_solus : bool
    [Optional. Default=False] If True, the solus and solus_relabelled
    objects are kept, else they are deleted to save memory.
  - pct_thr : float
    [Optional. Default=0.9] Minimum proportion of replications in which
    a non-noise label needs to be assigned to an observation for that
    observation to be labelled as such.
Examples
>>> import pandas
>>> from esda.adbscan import ADBSCAN
>>> import numpy as np
>>> np.random.seed(10)
>>> db = pandas.DataFrame({'X': np.random.random(25), 'Y': np.random.random(25) })
ADBSCAN can be run following a scikit-learn-like API:
>>> np.random.seed(10)
>>> clusterer = ADBSCAN(0.03, 3, reps=10, keep_solus=True)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['-1', '-1', '-1', '0', '-1', '-1', '-1', '0', '-1', '-1', '-1',
'-1', '-1', '-1', '0', '0', '0', '-1', '0', '-1', '0', '-1', '-1',
'-1', '-1'], dtype=object)
We can inspect the winning label for each observation, as well as the proportion of votes:
>>> print(clusterer.votes.head().to_string())
lbls pct
0 -1 0.7
1 -1 0.5
2 -1 0.7
3 0 1.0
4 -1 0.7
If you have set the option to keep them, you can even inspect each solution that makes up the ensemble:
>>> print(clusterer.solus.head().to_string())
rep-00 rep-01 rep-02 rep-03 rep-04 rep-05 rep-06 rep-07 rep-08 rep-09
0 0 1 1 0 1 0 0 0 1 0
1 1 1 1 1 0 1 0 1 1 1
2 0 1 1 0 0 1 0 0 1 0
3 0 1 1 0 0 1 1 1 0 0
4 0 1 1 1 0 1 0 1 0 1
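The relationship between the ensemble and the vote table can be illustrated by hand: given a solus-style table (one row per observation, one column per replication), the winning label is the row-wise mode and pct is the share of replications that agree with it. This is a sketch of the idea on made-up data, not the internal esda code:

```python
import pandas as pd

# Toy stand-in for a (relabelled) solus table: 3 observations, 3 draws
solus = pd.DataFrame({
    "rep-00": ["0", "-1", "0"],
    "rep-01": ["0", "-1", "-1"],
    "rep-02": ["-1", "-1", "0"],
})

lbls = solus.mode(axis=1)[0]               # most common label per observation
pct = solus.eq(lbls, axis=0).mean(axis=1)  # share of draws agreeing with it
votes = pd.DataFrame({"lbls": lbls, "pct": pct})
```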
If we select only one replication and set the proportion of the dataset that is sampled to 100%, we obtain a traditional DBSCAN:
>>> clusterer = ADBSCAN(0.2, 5, reps=1, pct_exact=1)
>>> np.random.seed(10)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['0', '-1', '0', '0', '0', '-1', '-1', '0', '-1', '-1', '0', '-1',
'-1', '-1', '0', '0', '0', '-1', '0', '0', '0', '-1', '-1', '0',
'-1'], dtype=object)
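As a sanity check of that equivalence, the same configuration can be run through scikit-learn's own DBSCAN on the same coordinates. Labels come back as integers rather than strings, and this is a sketch of the comparison rather than a guaranteed bit-for-bit match:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Rebuild the same X, Y coordinates as the db table above
np.random.seed(10)
xys = np.column_stack([np.random.random(25), np.random.random(25)])

# Same eps and min_samples as the pct_exact=1, reps=1 run; -1 marks noise
labels = DBSCAN(eps=0.2, min_samples=5).fit(xys).labels_
```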
- Attributes:
  - labels_ : array, shape = [n_samples]
    Cluster labels for each point in the dataset given to fit().
    Noisy samples (if the proportion of the most common label is < pct_thr)
    are given the label -1.
  - votes : DataFrame
    [Only available after fit] Table indexed on X.index with labels_ under the lbls column, and the frequency across draws of that label under pct.
  - solus : DataFrame, shape = [n, reps]
    [Only available after fit] Each solution of labels for every draw.
  - solus_relabelled : DataFrame, shape = [n, reps]
    [Only available after fit] Each solution of labels for every draw, relabelled to be consistent across solutions.
- __init__(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]
Methods
- fit(X[, y, sample_weight, xy])
  Perform ADBSCAN clustering from features.
- fit_predict(X[, y, sample_weight])
  Perform clustering on X and returns cluster labels.
- get_metadata_routing()
  Get metadata routing of this object.
- get_params([deep])
  Get parameters for this estimator.
- set_fit_request(*[, sample_weight])
  Request metadata passed to the fit method.
- set_params(**params)
  Set the parameters of this estimator.
- fit(X, y=None, sample_weight=None, xy=['X', 'Y'])[source]
Perform ADBSCAN clustering from features.
- Parameters:
  - X : DataFrame
    Features.
  - y : Ignored
  - sample_weight : Series, shape (n_samples,)
    [Optional. Default=None] Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
  - xy : list
    [Optional. Default=['X', 'Y']] Ordered pair of names for the XY
    coordinates in X.
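The sample_weight semantics described above follow scikit-learn's DBSCAN, which can stand in to show the effect directly (the points, weights, and eps/min_samples values here are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0, 0.0], [10.0, 10.0]])  # two isolated points

# Unweighted: neither eps-neighborhood reaches min_samples, so both
# points are labelled noise
noise = DBSCAN(eps=1, min_samples=5).fit(pts).labels_  # [-1 -1]

# Weight >= min_samples: each point is a core sample by itself, so each
# forms its own cluster
cores = DBSCAN(eps=1, min_samples=5).fit(pts, sample_weight=[5, 5]).labels_  # [0 1]
```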
- set_fit_request(*, sample_weight=...)
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
  - True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to fit.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline.
- Parameters:
  - sample_weight : str, True, False, or None
    [Default=sklearn.utils.metadata_routing.UNCHANGED] Metadata routing for the sample_weight parameter in fit.