DistanceClusteringGraphID

Specialized Graph ID generator using DBSCAN clustering on interatomic distances.

API Reference

graph_id.core.distance_clustering_graph_id.DistanceClusteringGraphID

Bases: GraphIDGenerator

Graph ID generator using DBSCAN distance clustering for neighbor detection.

This variant uses DBSCAN clustering on interatomic distances to identify distinct bond length populations. This is useful for structures where standard neighbor detection methods may fail, such as:

  • Structures with unusual bonding patterns
  • MOFs and zeolites with multiple bond length scales
  • Systems where simple distance cutoffs are insufficient
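
Separating interatomic distances into bond length populations can be illustrated with a small, library-free sketch. It is not the DistanceClusteringNN implementation. It assumes one-dimensional data and DBSCAN with min_samples=1, in which case the clusters are the runs of sorted distances whose consecutive gaps do not exceed eps. The name cluster_distances, the eps value, and the sample distances are hypothetical.

```python
def cluster_distances(distances, eps=0.1):
    """Group 1D distances into clusters separated by gaps larger than eps.

    For one-dimensional data this matches DBSCAN with min_samples=1.
    Illustration only; this is not the code used by DistanceClusteringNN.
    """
    clusters = []
    for d in sorted(distances):
        # Extend the current cluster while the gap to its last member is small.
        if clusters and d - clusters[-1][-1] <= eps:
            clusters[-1].append(d)
        else:
            clusters.append([d])
    return clusters


# Hypothetical distances (in Angstroms) from one site to its neighbors.
sample = [1.96, 2.01, 1.98, 2.85, 2.90, 4.10]
shells = cluster_distances(sample, eps=0.1)
```

Each inner list in shells is one distance population, ordered from shortest to longest. The rank_k parameter described on this page limits how many of these populations are considered.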

The algorithm iterates over the first rank_k distance clusters, computing separate compositional sequences for each, then combines them into a final ID.
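
The combination step can be sketched with hashlib alone, following the pattern in the get_id source shown on this page: one BLAKE2b digest per cluster, then a digest of the concatenated hex strings. The helper name combine_cluster_strings and the input strings are placeholders, not part of the library.

```python
from hashlib import blake2b


def combine_cluster_strings(cluster_strings, digest_size=8):
    """Hash each per-cluster string, then hash the concatenated hex digests."""
    per_cluster = [
        blake2b(s.encode("ascii"), digest_size=digest_size).hexdigest()
        for s in cluster_strings
    ]
    joined = "".join(per_cluster)
    return blake2b(joined.encode("ascii"), digest_size=digest_size).hexdigest()


# Placeholder strings standing in for the sorted, ":"-joined per-site strings.
final_id = combine_cluster_strings(["siteA:siteB", "siteA:siteC", "siteB:siteC"])
```

The per-cluster digests are concatenated in cluster order rather than sorted, so the same strings supplied in a different cluster order give a different final ID.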

Examples:

>>> from graph_id.core.distance_clustering_graph_id import DistanceClusteringGraphID
>>> gen = DistanceClusteringGraphID(rank_k=3, cutoff=6.0)
>>> gen.get_id(complex_structure)

See Also

GraphIDGenerator : Standard Graph ID generator
DistanceClusteringNN : The underlying neighbor detection class

Source code in graph_id/core/distance_clustering_graph_id.py
class DistanceClusteringGraphID(GraphIDGenerator):

    """Graph ID generator using DBSCAN distance clustering for neighbor detection.

    This variant uses DBSCAN clustering on interatomic distances to identify
    distinct bond length populations. This is useful for structures where
    standard neighbor detection methods may fail, such as:

    - Structures with unusual bonding patterns
    - MOFs and zeolites with multiple bond length scales
    - Systems where simple distance cutoffs are insufficient

    The algorithm iterates over the first ``rank_k`` distance clusters,
    computing separate compositional sequences for each, then combines
    them into a final ID.

    Examples
    --------
    >>> from graph_id.core.distance_clustering_graph_id import DistanceClusteringGraphID
    >>> gen = DistanceClusteringGraphID(rank_k=3, cutoff=6.0)
    >>> gen.get_id(complex_structure)

    See Also
    --------
    GraphIDGenerator : Standard Graph ID generator
    DistanceClusteringNN : The underlying neighbor detection class
    """

    def __init__(  # noqa: PLR0913
        self,
        nn=None,
        wyckoff=False,
        diameter_factor=2,
        additional_depth=1,
        symmetry_tol=0.1,
        topology_only=False,
        loop=False,
        rank_k=3,
        cutoff=6.0,
        digest_size=8,
    ) -> None:
        """Initialize the DistanceClusteringGraphID generator.

        Parameters
        ----------
        nn : NearNeighbors, optional
            A neighbor-finding strategy. If None, defaults to DistanceClusteringNN().
        wyckoff : bool, default False
            If True, include Wyckoff position information in the ID.
        diameter_factor : int, default 2
            Multiplier for graph diameter to determine traversal depth.
        additional_depth : int, default 1
            Extra depth added to the calculated traversal depth.
        symmetry_tol : float, default 0.1
            Tolerance for symmetry operations (used with wyckoff=True).
        topology_only : bool, default False
            If True, generate topology-only IDs ignoring element types.
        loop : bool, default False
            If True, use loop-based identification algorithm.
        rank_k : int, default 3
            Number of distance clusters to consider. Higher values capture
            more neighbor shells but increase computation time.
        cutoff : float, default 6.0
            Maximum distance cutoff in Angstroms for neighbor search.
        digest_size : int, default 8
            Size of the BLAKE2b hash digest in bytes.

        Examples
        --------
        >>> gen = DistanceClusteringGraphID()  # Default settings
        >>> gen = DistanceClusteringGraphID(rank_k=5, cutoff=8.0)  # More clusters
        """
        super().__init__(
            nn,
            wyckoff,
            diameter_factor,
            additional_depth,
            symmetry_tol,
            topology_only,
            loop,
            digest_size,
        )

        self.rank_k = rank_k
        self.cutoff = cutoff
        self.digest_size = digest_size

        if nn is None:
            self.nn = DistanceClusteringNN()
        else:
            self.nn = nn

    def get_id(self, structure):
        """Generate a Graph ID using distance clustering.

        Parameters
        ----------
        structure : Structure
            A pymatgen Structure object.

        Returns
        -------
        str
            The Graph ID hash (16 hexadecimal characters by default).

        Notes
        -----
        Unlike the base class, this does not prepend composition or
        dimensionality. The returned ID is the raw hash only.
        """
        gid_list = []
        _sg = StructureGraph.from_local_env_strategy(structure, MinimumDistanceNN())
        for cluster_idx in range(self.rank_k):
            long_str_list = []
            # _sg = StructureGraph.from_local_env_strategy(structure, MinimumDistanceNN())
            for idx in range(len(structure)):
                copied_sg = deepcopy(_sg)
                # First, remove the bonds that involve atom idx
                for from_index, to_index, dct in _sg.graph.edges(keys=False, data=True):
                    if idx in (from_index, to_index):
                        copied_sg.break_edge(from_index, to_index, dct["to_jimage"], allow_reverse=True)
                sg = self.prepare_structure_graph(structure, copied_sg, idx, cluster_idx)
                n = len(sg.cc_cs)
                array = np.empty(
                    [
                        n,
                    ],
                    dtype=object,
                )
                for i, component in enumerate(sg.cc_cs):
                    array[i] = blake("-".join(sorted(component["cs_list"])))

                long_str_tmp = ":".join(np.sort(array))

                long_str_list.append(long_str_tmp)
            long_str = ":".join(np.sort(long_str_list))
            gid = blake2b(long_str.encode("ascii"), digest_size=self.digest_size).hexdigest()
            gid_list.append(gid)

        long_gid = "".join(gid_list)

        return blake2b(long_gid.encode("ascii"), digest_size=self.digest_size).hexdigest()

    def prepare_structure_graph(self, structure, _sg, n, rank_k):
        """Prepare the structure graph for a specific site and distance cluster.

        Parameters
        ----------
        structure : Structure
            The pymatgen Structure object.
        _sg : StructureGraph
            The base structure graph with bonds from previous processing.
        n : int
            The site index being processed.
        rank_k : int
            The current distance cluster index (0-based).

        Returns
        -------
        StructureGraph
            The prepared structure graph with compositional sequences
            computed for the specified site and cluster.
        """
        sg = StructureGraph.with_indivisual_state_comp_strategy(
            structure=structure,
            strategy=self.nn,
            _sg=_sg,
            n=n,
            rank_k=rank_k,
            cutoff=self.cutoff,
        )

        use_previous_cs = False

        compound = sg.structure
        prev_num_uniq = len(compound.composition)

        if self.topology_only:
            for site_i in range(len(sg.structure)):
                sg.structure.replace(site_i, Element("H"))

        if self.wyckoff:
            sg.set_wyckoffs(symmetry_tol=self.symmetry_tol)
            prev_num_uniq = len(list(set(nx.get_node_attributes(sg.graph, "compositional_sequence").values())))

        elif self.loop:
            sg.set_loops(
                diameter_factor=self.diameter_factor,
                additional_depth=self.additional_depth,
            )

        else:
            sg.set_elemental_labels()

        while True:
            sg.set_indivisual_compositional_sequence_node_attr(
                n=n,
                hash_cs=False,
                wyckoff=self.wyckoff,
                additional_depth=self.additional_depth,
                diameter_factor=self.diameter_factor,
                use_previous_cs=use_previous_cs or self.wyckoff,
            )

            num_unique_nodes = len(list(set(nx.get_node_attributes(sg.graph, "compositional_sequence").values())))
            use_previous_cs = True

            if prev_num_uniq == num_unique_nodes:
                break

            prev_num_uniq = num_unique_nodes

        return sg

Methods:

__init__(nn=None, wyckoff=False, diameter_factor=2, additional_depth=1, symmetry_tol=0.1, topology_only=False, loop=False, rank_k=3, cutoff=6.0, digest_size=8) -> None

Initialize the DistanceClusteringGraphID generator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| nn | NearNeighbors | A neighbor-finding strategy. If None, defaults to DistanceClusteringNN(). | None |
| wyckoff | bool | If True, include Wyckoff position information in the ID. | False |
| diameter_factor | int | Multiplier for graph diameter to determine traversal depth. | 2 |
| additional_depth | int | Extra depth added to the calculated traversal depth. | 1 |
| symmetry_tol | float | Tolerance for symmetry operations (used with wyckoff=True). | 0.1 |
| topology_only | bool | If True, generate topology-only IDs ignoring element types. | False |
| loop | bool | If True, use loop-based identification algorithm. | False |
| rank_k | int | Number of distance clusters to consider. Higher values capture more neighbor shells but increase computation time. | 3 |
| cutoff | float | Maximum distance cutoff in Angstroms for neighbor search. | 6.0 |
| digest_size | int | Size of the BLAKE2b hash digest in bytes. | 8 |

Examples:

>>> gen = DistanceClusteringGraphID()  # Default settings
>>> gen = DistanceClusteringGraphID(rank_k=5, cutoff=8.0)  # More clusters
Source code in graph_id/core/distance_clustering_graph_id.py
def __init__(  # noqa: PLR0913
    self,
    nn=None,
    wyckoff=False,
    diameter_factor=2,
    additional_depth=1,
    symmetry_tol=0.1,
    topology_only=False,
    loop=False,
    rank_k=3,
    cutoff=6.0,
    digest_size=8,
) -> None:
    """Initialize the DistanceClusteringGraphID generator.

    Parameters
    ----------
    nn : NearNeighbors, optional
        A neighbor-finding strategy. If None, defaults to DistanceClusteringNN().
    wyckoff : bool, default False
        If True, include Wyckoff position information in the ID.
    diameter_factor : int, default 2
        Multiplier for graph diameter to determine traversal depth.
    additional_depth : int, default 1
        Extra depth added to the calculated traversal depth.
    symmetry_tol : float, default 0.1
        Tolerance for symmetry operations (used with wyckoff=True).
    topology_only : bool, default False
        If True, generate topology-only IDs ignoring element types.
    loop : bool, default False
        If True, use loop-based identification algorithm.
    rank_k : int, default 3
        Number of distance clusters to consider. Higher values capture
        more neighbor shells but increase computation time.
    cutoff : float, default 6.0
        Maximum distance cutoff in Angstroms for neighbor search.
    digest_size : int, default 8
        Size of the BLAKE2b hash digest in bytes.

    Examples
    --------
    >>> gen = DistanceClusteringGraphID()  # Default settings
    >>> gen = DistanceClusteringGraphID(rank_k=5, cutoff=8.0)  # More clusters
    """
    super().__init__(
        nn,
        wyckoff,
        diameter_factor,
        additional_depth,
        symmetry_tol,
        topology_only,
        loop,
        digest_size,
    )

    self.rank_k = rank_k
    self.cutoff = cutoff
    self.digest_size = digest_size

    if nn is None:
        self.nn = DistanceClusteringNN()
    else:
        self.nn = nn

get_id(structure)

Generate a Graph ID using distance clustering.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| structure | Structure | A pymatgen Structure object. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The Graph ID hash (16 hexadecimal characters by default). |
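
The length of the returned string follows from digest_size, since each digest byte is written as two hexadecimal characters. A quick check with hashlib, where hex_length is a hypothetical helper and the hashed bytes are arbitrary:

```python
from hashlib import blake2b


def hex_length(digest_size):
    """Return the length of a BLAKE2b hex digest for a digest size in bytes."""
    return len(blake2b(b"example", digest_size=digest_size).hexdigest())


default_length = hex_length(8)   # digest_size=8 is the documented default
longer_length = hex_length(16)   # doubling the digest size doubles the length
```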

Notes

Unlike the base class, this does not prepend composition or dimensionality. The returned ID is the raw hash only.

Source code in graph_id/core/distance_clustering_graph_id.py
def get_id(self, structure):
    """Generate a Graph ID using distance clustering.

    Parameters
    ----------
    structure : Structure
        A pymatgen Structure object.

    Returns
    -------
    str
        The Graph ID hash (16 hexadecimal characters by default).

    Notes
    -----
    Unlike the base class, this does not prepend composition or
    dimensionality. The returned ID is the raw hash only.
    """
    gid_list = []
    _sg = StructureGraph.from_local_env_strategy(structure, MinimumDistanceNN())
    for cluster_idx in range(self.rank_k):
        long_str_list = []
        # _sg = StructureGraph.from_local_env_strategy(structure, MinimumDistanceNN())
        for idx in range(len(structure)):
            copied_sg = deepcopy(_sg)
            # First, remove the bonds that involve atom idx
            for from_index, to_index, dct in _sg.graph.edges(keys=False, data=True):
                if idx in (from_index, to_index):
                    copied_sg.break_edge(from_index, to_index, dct["to_jimage"], allow_reverse=True)
            sg = self.prepare_structure_graph(structure, copied_sg, idx, cluster_idx)
            n = len(sg.cc_cs)
            array = np.empty(
                [
                    n,
                ],
                dtype=object,
            )
            for i, component in enumerate(sg.cc_cs):
                array[i] = blake("-".join(sorted(component["cs_list"])))

            long_str_tmp = ":".join(np.sort(array))

            long_str_list.append(long_str_tmp)
        long_str = ":".join(np.sort(long_str_list))
        gid = blake2b(long_str.encode("ascii"), digest_size=self.digest_size).hexdigest()
        gid_list.append(gid)

    long_gid = "".join(gid_list)

    return blake2b(long_gid.encode("ascii"), digest_size=self.digest_size).hexdigest()

prepare_structure_graph(structure, _sg, n, rank_k)

Prepare the structure graph for a specific site and distance cluster.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| structure | Structure | The pymatgen Structure object. | required |
| _sg | StructureGraph | The base structure graph with bonds from previous processing. | required |
| n | int | The site index being processed. | required |
| rank_k | int | The current distance cluster index (0-based). | required |

Returns:

| Type | Description |
| --- | --- |
| StructureGraph | The prepared structure graph with compositional sequences computed for the specified site and cluster. |
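
The method's source ends with a loop that recomputes the node labels until the number of distinct labels stops changing. The toy sketch below shows that stopping rule on a plain adjacency dictionary. It is an analogy only: the real compositional sequences are computed inside StructureGraph, and the name refine_labels and the label format used here are assumptions made for illustration.

```python
def refine_labels(adjacency, labels):
    """Relabel nodes from neighbor labels until the unique-label count is stable."""
    prev_num_unique = len(set(labels.values()))
    while True:
        # New label = own label plus the sorted labels of the neighbors.
        labels = {
            node: labels[node] + "(" + ",".join(sorted(labels[nb] for nb in nbrs)) + ")"
            for node, nbrs in adjacency.items()
        }
        num_unique = len(set(labels.values()))
        if num_unique == prev_num_unique:
            return labels
        prev_num_unique = num_unique


# Toy chain of four identical atoms: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
result = refine_labels(chain, {i: "H" for i in chain})
```

On this chain the two end atoms end up sharing one label and the two inner atoms another, so the loop stops once the count of distinct labels stays at two.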

Source code in graph_id/core/distance_clustering_graph_id.py
def prepare_structure_graph(self, structure, _sg, n, rank_k):
    """Prepare the structure graph for a specific site and distance cluster.

    Parameters
    ----------
    structure : Structure
        The pymatgen Structure object.
    _sg : StructureGraph
        The base structure graph with bonds from previous processing.
    n : int
        The site index being processed.
    rank_k : int
        The current distance cluster index (0-based).

    Returns
    -------
    StructureGraph
        The prepared structure graph with compositional sequences
        computed for the specified site and cluster.
    """
    sg = StructureGraph.with_indivisual_state_comp_strategy(
        structure=structure,
        strategy=self.nn,
        _sg=_sg,
        n=n,
        rank_k=rank_k,
        cutoff=self.cutoff,
    )

    use_previous_cs = False

    compound = sg.structure
    prev_num_uniq = len(compound.composition)

    if self.topology_only:
        for site_i in range(len(sg.structure)):
            sg.structure.replace(site_i, Element("H"))

    if self.wyckoff:
        sg.set_wyckoffs(symmetry_tol=self.symmetry_tol)
        prev_num_uniq = len(list(set(nx.get_node_attributes(sg.graph, "compositional_sequence").values())))

    elif self.loop:
        sg.set_loops(
            diameter_factor=self.diameter_factor,
            additional_depth=self.additional_depth,
        )

    else:
        sg.set_elemental_labels()

    while True:
        sg.set_indivisual_compositional_sequence_node_attr(
            n=n,
            hash_cs=False,
            wyckoff=self.wyckoff,
            additional_depth=self.additional_depth,
            diameter_factor=self.diameter_factor,
            use_previous_cs=use_previous_cs or self.wyckoff,
        )

        num_unique_nodes = len(list(set(nx.get_node_attributes(sg.graph, "compositional_sequence").values())))
        use_previous_cs = True

        if prev_num_uniq == num_unique_nodes:
            break

        prev_num_uniq = num_unique_nodes

    return sg

Quick Example

from graph_id.core.distance_clustering_graph_id import DistanceClusteringGraphID

gen = DistanceClusteringGraphID()
graph_id = gen.get_id(structure)

When to Use

Use this variant when:

  • Standard neighbor detection gives unexpected results
  • The structure has multiple distinct bond length scales
  • You are working with MOFs, zeolites, or other complex frameworks

Configuration Examples

# For complex structures (MOFs, zeolites)
gen = DistanceClusteringGraphID(
    rank_k=5,      # More distance clusters
    cutoff=10.0    # Larger search radius
)

Performance

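Distance clustering is slower than the standard GraphIDGenerator: get_id prepares a separate structure graph for every site in every distance cluster, so the work grows with rank_k times the number of sites. Use it only when the standard methods do not work.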
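
As a rough guide to cost, the get_id source on this page calls prepare_structure_graph once per site for each distance cluster, with one deep copy of the base graph per call. The helper below only counts those calls. Its name is hypothetical and it says nothing about the time each call takes.

```python
def count_graph_preparations(num_sites, rank_k=3):
    """Number of prepare_structure_graph calls made by a single get_id call."""
    return rank_k * num_sites


default_calls = count_graph_preparations(8)              # 8-site cell, rank_k=3
larger_calls = count_graph_preparations(100, rank_k=5)   # 100-site cell, rank_k=5
```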

See Also