evaluation_clustering module

evaluation_clustering.py. Main script for running the clustering process and evaluating its results.

Usage:

python evaluation_clustering.py [1] -i [2] -p [3] -t [4] -cfg [5] --mcl --help

where:

  • [1] : input similarity matrix (unnormalized similarities or pre-treated MCL format). The script expects an ‘exp_configuration.ini’ file in the same folder, usually generated when using main.py.
  • [2] -i: MCL inflation parameter. Defaults to 1.4.
  • [3] -p: MCL pre-inflation parameter. Defaults to 1.0.
  • [4] -t: number of cores to use for MCL.
  • [5] -cfg: provide a custom configuration file to replace ‘exp_configuration.ini’.
  • -m, --mcl: if present, the script expects an input matrix in MCL label format.
  • -h, --help: show the help message and exit.

This outputs the results of the MCL clustering with the given inflation and pre-inflation parameters.
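As an illustration of the options listed above, the following is a minimal sketch of how such a command line could be parsed with argparse. It is not taken from the script itself: the destination names (input_matrix, inflation, pre_inflation, cores, config) are assumptions made for this example, and since no default is documented for -t, the sketch leaves it unset.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="evaluation_clustering.py",
        description="Run MCL clustering on a similarity matrix and evaluate it.",
    )
    # [1] positional argument: the input similarity matrix.
    parser.add_argument("input_matrix", help="path to the input similarity matrix")
    # [2] MCL inflation parameter, documented default 1.4.
    parser.add_argument("-i", dest="inflation", type=float, default=1.4,
                        help="MCL inflation parameter")
    # [3] MCL pre-inflation parameter, documented default 1.0.
    parser.add_argument("-p", dest="pre_inflation", type=float, default=1.0,
                        help="MCL pre-inflation parameter")
    # [4] number of cores; the default is not documented, so None is an assumption.
    parser.add_argument("-t", dest="cores", type=int, default=None,
                        help="number of cores to use for MCL")
    # [5] custom configuration file replacing exp_configuration.ini.
    parser.add_argument("-cfg", dest="config", default=None,
                        help="custom configuration file")
    # Flag: the input matrix is already in MCL label format.
    parser.add_argument("-m", "--mcl", action="store_true",
                        help="input matrix is in MCL label format")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["similarities.txt", "-i", "2.0", "-t", "4", "--mcl"]
)
print(args.inflation, args.pre_inflation, args.cores, args.mcl)
```

Options that are not given on the command line keep their defaults, so here the pre-inflation stays at 1.0 and no custom configuration file is set.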

evaluation_clustering.cluster(co_occ, output_folder, index_to_label, cores, task_params, **kwargs)

Returns the clustering obtained by applying the chosen algorithm to the co-occurrence matrix.

Args:
  • co_occ (ndarray): Co-occurrence matrix.
  • output_folder (str): path to the output folder.
  • index_to_label (list): list mapping an index to the corresponding named entity.
  • cores (int): Number of cores to use for the clustering algorithm (if it supports multithreading).
  • task_params (dict): additional clustering parameters.
  • formated (bool, optional): if True, the co-occurrence matrix is expected to be already formatted for MCL input.
  • verbose (int, optional): controls verbosity level.
Returns:
  • clustering (list): Resulting clustering (as a list mapping a sample’s index to the index of its cluster).
  • n_clusters (int): number of retrieved clusters.
  • step_id (str): optional step identifier, used for output.
  • summary (str): string representation of the execution (for display purposes).
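To make the returned representation concrete, here is a minimal sketch of how a clustering list (sample index to cluster index) can be combined with index_to_label to obtain readable clusters. The helper readable_clusters and the sample data are hypothetical and not part of the module.

```python
from collections import defaultdict


def readable_clusters(clustering, index_to_label):
    """Group entity labels by cluster index (hypothetical helper).

    clustering[i] is the cluster index of sample i, and
    index_to_label[i] is the named entity for sample i.
    """
    clusters = defaultdict(list)
    for sample_index, cluster_index in enumerate(clustering):
        clusters[cluster_index].append(index_to_label[sample_index])
    return dict(clusters)


# Made-up example data: five entities assigned to three clusters.
clustering = [0, 1, 0, 2, 1]
index_to_label = ["Paris", "Einstein", "London", "NASA", "Curie"]

clusters = readable_clusters(clustering, index_to_label)
n_clusters = len(clusters)  # number of distinct cluster indices

for cluster_index, labels in sorted(clusters.items()):
    print(cluster_index, labels)
```
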
evaluation_clustering.evaluate(co_occ, output_folder, temp_folder, ground_truth, index_to_label, cores, task_params, **kwargs)

Evaluate a clustering method given a similarity matrix and various clustering parameters.

Args:
  • co_occ (ndarray): co-occurrence matrix.
  • output_folder (str): path to the output folder.
  • temp_folder (str): path to temporary folder.
  • ground_truth (dict): ground truth clustering to compare against.
  • index_to_label (list): list mapping an index to the corresponding named entity. Used to generate a readable clustering.
  • cores (int): number of cores to use.
  • task_params (list): additional clustering parameters.
  • formated (bool, optional): if True, the co-occurrence matrix is expected to be already formatted for MCL input.