ragraph.analysis.similarity
Similarity analyses
Graph similarity is often expressed as a metric, where nodes and edges are scanned for similar patterns, properties, or other aspects. There are three levels of equivalence: structural, automorphic, and regular. Each level implies all the levels that follow it, so structural equivalence implies automorphic equivalence, which in turn implies regular equivalence.
Available analyses
The following algorithms are directly accessible after importing ragraph.analysis.similarity:
jaccard_index: Jaccard Similarity Index of two objects, based on the number of properties they both possess divided by the number of properties either of them has.
jaccard_matrix: Jaccard Similarity Index between a set of objects, stored in a square matrix.
Note
Both Jaccard methods require a callable that takes an object and returns a list of booleans representing the possession of a property (the on argument). Some examples are included in the ragraph.analysis.similarity.utils module, like ragraph.analysis.similarity.utils.on_hasattrs.
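As a minimal sketch of what such a description function can look like, the snippet below builds one for plain dictionaries. The helper name on_tags and the example data are hypothetical and not part of ragraph; the only point is that every position in the returned list answers the same yes/no question for every object.

```python
from typing import Any, Callable, List


def on_tags(tags: List[str]) -> Callable[[Any], List[bool]]:
    """Build a description function for objects that carry a "tags" collection.

    Hypothetical helper for illustration; it is not part of ragraph.
    """

    def describe(obj: Any) -> List[bool]:
        # One boolean per tag, always in the same order as `tags`.
        return [tag in obj["tags"] for tag in tags]

    return describe


pump = {"name": "pump", "tags": {"hydraulic", "electrical"}}
valve = {"name": "valve", "tags": {"hydraulic"}}

describe = on_tags(["hydraulic", "electrical", "thermal"])
pump_description = describe(pump)    # [True, True, False]
valve_description = describe(valve)  # [True, False, False]
```

Because both lists use the same property order, they can be compared entry by entry.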
SimilarityAnalysis
Similarity analysis of nodes based upon mutual mapping relations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cols | List[Node] | List of column nodes. | required |
rows | List[Node] | List of row nodes. | required |
edges | List[Edge] | List of edges from column nodes to row nodes to be used in similarity analysis. | required |
col_sim_threshold | float | Column similarity threshold. Values below this threshold are pruned from the column similarity matrix and the corresponding edges are removed. Defaults to 0.0 (no threshold). | 0.0 |
row_sim_threshold | float | Row similarity threshold. Values below this threshold are pruned from the row similarity matrix and the corresponding edges are removed. Defaults to 0.0 (no threshold). | 0.0 |
Class Attributes
Note
A mapping matrix relating M column nodes to N row nodes is used as input for the similarity analysis.
Source code in ragraph/analysis/similarity/_similarity.py
col_sim_threshold (writable property)
Similarity threshold. Values below this threshold are pruned from the column similarity matrix and the corresponding edges are removed.
row_sim_threshold (writable property)
Similarity threshold. Values below this threshold are pruned from the row similarity matrix and the corresponding edges are removed.
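The effect of a threshold can be illustrated on a plain nested list. This is only a sketch: the function name prune_below is hypothetical, and it assumes that "pruned" means the entry is reset to 0.0, which the documentation above does not spell out.

```python
from typing import List


def prune_below(matrix: List[List[float]], threshold: float) -> List[List[float]]:
    """Return a copy of a similarity matrix with values below `threshold` set to 0.0.

    Hypothetical helper; resetting to 0.0 is an assumption about what pruning means.
    """
    return [[value if value >= threshold else 0.0 for value in row] for row in matrix]


similarity = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.0],
    [0.2, 0.0, 1.0],
]
pruned = prune_below(similarity, threshold=0.3)
# Only the 0.2 entries fall below the threshold and are reset.
```

With the default threshold of 0.0 no entry is below the threshold, so the matrix is left unchanged.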
_cluster
Cluster row or column nodes based on their similarity. Updates Graph in-place.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
leafs | List[Node] | List of row or column nodes to be clustered. | required |
algo | Callable[[Graph, Any], Tuple[List[Node]]] | Clustering algorithm. Should take a graph as first argument and cluster it in-place. Defaults to markov. | markov |
**algo_args | Any | Algorithm arguments. See | {} |
Source code in ragraph/analysis/similarity/_similarity.py
_update_similarity
Update Jaccard Similarity Index edges between (clustered) nodes.
Source code in ragraph/analysis/similarity/_similarity.py
check_mapping
Check whether a column node maps to a row node.
cluster_cols
Cluster column nodes based on their similarity. Updates Graph in-place.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
algo | Callable[[Graph, Any], Tuple[List[Node]]] | Clustering algorithm. Should take a graph as first argument and cluster it in-place. Defaults to markov. | markov |
**algo_args | Any | Algorithm arguments. See | {} |
Source code in ragraph/analysis/similarity/_similarity.py
cluster_rows
Cluster row nodes based on their similarity. Updates Graph in-place.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
algo | Callable[[Graph, Any], Tuple[List[Node]]] | Clustering algorithm. Should take a graph as first argument and cluster it in-place. Defaults to markov. | markov |
**algo_args | Any | Algorithm arguments. See | {} |
Source code in ragraph/analysis/similarity/_similarity.py
col_mapping
Boolean possession checklist for a column node w.r.t. self.rows.
Source code in ragraph/analysis/similarity/_similarity.py
row_mapping
Boolean possession checklist for a row node w.r.t. self.cols.
Source code in ragraph/analysis/similarity/_similarity.py
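The two checklists can be pictured with plain strings standing in for nodes. The names col_checklist and row_checklist and the edge set below are hypothetical; the sketch only assumes that an edge points from a column node to a row node, as described for the edges parameter.

```python
from typing import List, Set, Tuple

cols = ["c1", "c2"]
rows = ["r1", "r2", "r3"]
# Hypothetical edges, each pointing from a column node to a row node.
edges: Set[Tuple[str, str]] = {("c1", "r1"), ("c1", "r3"), ("c2", "r3")}


def col_checklist(col: str) -> List[bool]:
    """One boolean per row: does `col` map to that row?"""
    return [(col, row) in edges for row in rows]


def row_checklist(row: str) -> List[bool]:
    """One boolean per column: does that column map to `row`?"""
    return [(col, row) in edges for col in cols]


c1_rows = col_checklist("c1")  # [True, False, True]
r3_cols = row_checklist("r3")  # [True, True]
```

Checklists of this shape are what a Jaccard comparison between two columns, or between two rows, operates on.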
jaccard_index
Calculate the Jaccard Similarity Index between two objects based on an object description function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
obj1 | Any | First object to compare. | required |
obj2 | Any | Second object to compare. | required |
on | Callable[[Any], List[bool]] | Callable that takes an object and describes it with a list of booleans. Each entry indicates the possession of a property. | required |
Returns:
Type | Description |
---|---|
float | Jaccard Similarity between two objects, which is calculated as the size of the overlap in properties divided by the total size of properties they possess. |
Source code in ragraph/analysis/similarity/_jaccard.py
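The calculation described above can be sketched in a few lines of plain Python. This is not the library's implementation: jaccard_index_sketch and on_letters are hypothetical names, and returning 0.0 when neither object has any property is an assumption made here.

```python
from typing import Any, Callable, List


def jaccard_index_sketch(obj1: Any, obj2: Any, on: Callable[[Any], List[bool]]) -> float:
    """Size of the property overlap divided by the number of properties either object has."""
    d1, d2 = on(obj1), on(obj2)
    both = sum(a and b for a, b in zip(d1, d2))
    either = sum(a or b for a, b in zip(d1, d2))
    # Assumption: two objects without any properties get similarity 0.0.
    return both / either if either else 0.0


def on_letters(word: str) -> List[bool]:
    # Hypothetical description function: which of these letters occur in the word?
    return [letter in word for letter in "abcd"]


# "abc" and "bcd" share b and c, and together cover a, b, c and d.
score = jaccard_index_sketch("abc", "bcd", on_letters)
```

Here the overlap is 2 and the combined property count is 4, so the score is 0.5.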
jaccard_matrix
Calculate the Jaccard Similarity Index for a set of objects based on an object description function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
objects | List[Any] | List of objects to generate a similarity matrix for. | required |
on | Callable[[Any], List[bool]] | Callable that takes an object and describes it with a list of booleans. Each entry indicates the possession of a property. | required |
Source code in ragraph/analysis/similarity/_jaccard.py
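A pairwise version can be sketched the same way, using nested lists in place of an array type. The names jaccard_matrix_sketch and on_letters are hypothetical, the return type is an assumption, and objects without any property are again assumed to score 0.0.

```python
from typing import Any, Callable, List


def jaccard_matrix_sketch(
    objects: List[Any], on: Callable[[Any], List[bool]]
) -> List[List[float]]:
    """Pairwise Jaccard Similarity Index of `objects` as a square, symmetric matrix."""
    descriptions = [on(obj) for obj in objects]

    def index(d1: List[bool], d2: List[bool]) -> float:
        both = sum(a and b for a, b in zip(d1, d2))
        either = sum(a or b for a, b in zip(d1, d2))
        # Assumption: objects without any properties get similarity 0.0.
        return both / either if either else 0.0

    return [[index(d1, d2) for d2 in descriptions] for d1 in descriptions]


def on_letters(word: str) -> List[bool]:
    # Hypothetical description function: which of these letters occur in the word?
    return [letter in word for letter in "abcd"]


matrix = jaccard_matrix_sketch(["ab", "abcd", "cd"], on_letters)
```

Entry (i, j) compares object i with object j, so the result is symmetric, and any object with at least one property scores 1.0 against itself.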
_jaccard
Jaccard Similarity Index
The index compares two objects, and is calculated as the size of the overlap in properties divided by the total size of properties they possess.
For examples of 'object description functions', please refer to the similarity utilities.
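Written as a formula, with A and B denoting the sets of properties possessed by the first and second object (symbols introduced here for illustration only), the definition reads:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

For example, if A = {p1, p2, p4} and B = {p1, p3, p4}, the overlap contains 2 properties and the union contains 4, so J(A, B) = 2 / 4 = 0.5. The Jaccard distance discussed in the Kosub reference is 1 - J(A, B).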
References
Kosub, S. (2016). A note on the triangle inequality for the Jaccard distance. Retrieved from arXiv.org
Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de La Société Vaudoise Des Sciences Naturelles. DOI: 10.5169/seals-266450
_calculate
Calculate the Jaccard Index by the boolean object description arrays.
Source code in ragraph/analysis/similarity/_jaccard.py
mapping_matrix
Calculate an object-property mapping matrix where each entry (i,j) indicates the possession of property j by object i.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
objects | List[Any] | List of objects to describe. | required |
on | Callable[[Any], List[bool]] | Callable that takes an object and describes it with a list of booleans. Each entry indicates the possession of a property. | required |
Source code in ragraph/analysis/similarity/_jaccard.py
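As a rough sketch, such a matrix is the description of every object stacked row by row. The names mapping_matrix_sketch and on_divisors are hypothetical, and nested lists are used here as an assumption in place of whatever array type the library returns.

```python
from typing import Any, Callable, List


def mapping_matrix_sketch(
    objects: List[Any], on: Callable[[Any], List[bool]]
) -> List[List[bool]]:
    """Entry (i, j) is True when object i possesses property j."""
    return [on(obj) for obj in objects]


def on_divisors(number: int) -> List[bool]:
    # Hypothetical description function: is the number divisible by 2, 3 and 5?
    return [number % divisor == 0 for divisor in (2, 3, 5)]


matrix = mapping_matrix_sketch([6, 10, 7], on_divisors)
# Row 0 describes 6, row 1 describes 10, row 2 describes 7.
```

Each row belongs to one object and each column to one property, matching the (i,j) convention stated above.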
_similarity
Similarity analysis
utils
Similarity analysis utilities
on_checks
Get an object description function that runs a predefined set of checks (which should be in a fixed order) and returns their boolean results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
checks | List[Callable[[Any], bool]] | Checks to perform. | required |
Returns:
Type | Description |
---|---|
Callable[[Any], List[bool]] | Object description function indicating check passings. |
Source code in ragraph/analysis/similarity/utils.py
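A factory of this kind might look roughly as follows. The name on_checks_sketch and the example checks are hypothetical; the sketch only mirrors the behaviour described above, not the library's source.

```python
from typing import Any, Callable, List


def on_checks_sketch(checks: List[Callable[[Any], bool]]) -> Callable[[Any], List[bool]]:
    """Return a description function that runs `checks` in their given order."""

    def describe(obj: Any) -> List[bool]:
        return [bool(check(obj)) for check in checks]

    return describe


describe_number = on_checks_sketch([
    lambda n: n > 0,       # is it positive?
    lambda n: n % 2 == 0,  # is it even?
])
four = describe_number(4)          # [True, True]
minus_three = describe_number(-3)  # [False, False]
```

Keeping the checks in a fixed order matters because the resulting lists are compared position by position.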
on_contains
Check whether an object contains certain contents.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
contents | List[Any] | Contents to check for with | required |
Returns:
Type | Description |
---|---|
Callable[[Any], List[bool]] | Object description function indicating content presence. |
Source code in ragraph/analysis/similarity/utils.py
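One plausible reading of this helper is a membership test per content item. The sketch below assumes Python's in operator is the check being applied, which the truncated description above does not confirm; on_contains_sketch is a hypothetical name.

```python
from typing import Any, Callable, List


def on_contains_sketch(contents: List[Any]) -> Callable[[Any], List[bool]]:
    """Return a description function that tests membership of each content item.

    Assumption: membership is tested with the `in` operator.
    """

    def describe(obj: Any) -> List[bool]:
        return [content in obj for content in contents]

    return describe


describe_parts = on_contains_sketch(["bolt", "nut", "washer"])
result = describe_parts({"bolt", "washer", "spring"})  # [True, False, True]
```

Any container that supports in, such as a set, list, or dictionary, can be described this way.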
on_hasattrs
Get an object description function that checks whether an instance possesses certain attributes. It does not check the values thereof!
Parameters:
Name | Type | Description | Default |
---|---|---|---|
attrs | List[str] | List of attributes to check the existence of. | required |
Returns:
Type | Description |
---|---|
Callable[[Any], List[bool]] | Object description function indicating attribute possession. |
Source code in ragraph/analysis/similarity/utils.py
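The remark about values can be illustrated with the built-in hasattr. The name on_hasattrs_sketch and the example object are hypothetical, and the sketch is not taken from the library's source.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def on_hasattrs_sketch(attrs: List[str]) -> Callable[[Any], List[bool]]:
    """Return a description function that reports which attributes exist on an object."""

    def describe(obj: Any) -> List[bool]:
        # Only existence is checked; the attribute value is never inspected.
        return [hasattr(obj, attr) for attr in attrs]

    return describe


motor = SimpleNamespace(power=None, voltage=230)
describe_motor = on_hasattrs_sketch(["power", "voltage", "pressure"])
result = describe_motor(motor)  # [True, True, False]
```

The power attribute counts as present even though its value is None.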
on_hasweights
Check whether an object has certain weights above a threshold in its weights dictionary property.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
weights | List[str] | Keys to the | required |
threshold | float | Threshold to verify against. | 0.0 |
Returns:
Type | Description |
---|---|
Callable[[Any], List[bool]] | Object description function indicating weights exceeding a threshold. |
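A sketch of such a description function is given below. It assumes the object exposes a weights dictionary, that "above" means strictly greater than the threshold, and that a missing key counts as not exceeding it; on_hasweights_sketch and the example object are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def on_hasweights_sketch(
    weights: List[str], threshold: float = 0.0
) -> Callable[[Any], List[bool]]:
    """Return a description function that reports which weights exceed `threshold`."""

    def describe(obj: Any) -> List[bool]:
        # Assumption: a missing key counts as not exceeding the threshold.
        return [key in obj.weights and obj.weights[key] > threshold for key in weights]

    return describe


cable = SimpleNamespace(weights={"length": 3.0, "cost": 0.0})
describe_cable = on_hasweights_sketch(["length", "cost", "mass"])
result = describe_cable(cable)  # [True, False, False]
```

With the default threshold of 0.0, a weight that is present but equal to zero is reported as False under the strict comparison assumed here.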