SIC variants

run_basic Module

run_basic. Computing the similarity matrix in the basic setting (BIN, WBIN, WUBIN).

run_basic.compute_similarity_basic(n_iter, n_cores, n_locks, data_type, n_samples, data, data_occurrences, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, writing_steps, convergence_step, convergence_criterion)

Computes the similarity matrix in the basic setting.

Args:
  • n_iter (int): number of iterations.
  • n_cores (int): number of cores to use.
  • n_locks (int): number of locks to use on the full shared matrix.
  • data_type (str): data set to use.
  • n_samples (int): number of samples.
  • data (struct): initial data.
  • data_occurrences (struct): indicates where each sample occurs in the database.
  • temp_folder (str): path to the temporary folder.
  • output_folder (str): path to the output folder.
  • annotation_params (list): parameters for synthetic annotation.
  • classification_params (list): parameters for classification.
  • verbose (int): sets the verbosity level.
  • with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  • with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
  • writing_steps (list): steps at which to save the partial matrix.
  • convergence_step (int): convergence is checked every ‘convergence_step’ iterations.
Returns:
  • n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  • synthetic_labels (list): distribution of synthetic labels for each iteration.
  • co_occ (ndarray): full similarity matrix.
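
The interplay of n_iter, writing_steps and convergence_step can be pictured with a minimal sketch. It is not the module's implementation: run_iteration, has_converged and save_partial are hypothetical callbacks introduced only for illustration, and the actual test selected by convergence_criterion is not documented here.

```python
def iterate(n_iter, writing_steps, convergence_step,
            run_iteration, has_converged, save_partial):
    """Run up to n_iter iterations and return how many were executed.

    The three callbacks are hypothetical stand-ins, not run_basic functions.
    """
    for step in range(1, n_iter + 1):
        run_iteration(step)
        if step in writing_steps:
            save_partial(step)  # save the partial matrix at this step
        # Convergence is tested only every `convergence_step` iterations.
        if convergence_step > 0 and step % convergence_step == 0:
            if has_converged(step):
                return step
    return n_iter


saved = []
done = iterate(
    n_iter=100,
    writing_steps=[10, 50],
    convergence_step=20,
    run_iteration=lambda step: None,        # toy iteration that does nothing
    has_converged=lambda step: step >= 40,  # toy criterion (assumption)
    save_partial=saved.append,
)
print(done, saved)
```

With this toy criterion the loop stops at the second convergence check, so the save scheduled for step 50 is never reached.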

run_wem Module

run_wem. Computing the similarity matrix in the WEM setting.

run_wem.compute_similarity_wem(n_iter, n_cores, n_locks, data_type, n_samples, data, ground_truth_file, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering)

Computes the similarity matrix in the WEM setting.

Args:
  • n_iter (int): number of iterations.
  • n_cores (int): number of cores to use.
  • n_locks (int): number of locks to use on the full shared matrix.
  • data_type (str): data set to use.
  • n_samples (int): number of samples.
  • data (struct): initial data.
  • ground_truth_file (str): path to ground-truth.
  • temp_folder (str): path to the temporary folder.
  • output_folder (str): path to the output folder.
  • annotation_params (list): parameters for synthetic annotation.
  • classification_params (list): parameters for classification.
  • verbose (int): sets the verbosity level.
  • with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  • with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
Returns:
  • n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  • synthetic_labels (list): distribution of synthetic labels for each iteration.
  • co_occ (ndarray): full similarity matrix.
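
As a rough illustration of how n_samples_occurrences and co_occ could relate, the sketch below counts, for every pair of samples, how often both receive the same predicted label in an iteration. This is an assumption about the general idea rather than the module's code: the function name accumulate_co_occurrences and the per-iteration input format are hypothetical, and plain nested lists stand in for the ndarray.

```python
def accumulate_co_occurrences(n_samples, predictions_per_iteration):
    """Accumulate test-set counts and same-label co-occurrence counts.

    predictions_per_iteration: one dict {sample_index: predicted_label} per
    iteration, covering only that iteration's test samples (hypothetical
    input format).
    """
    co_occ = [[0] * n_samples for _ in range(n_samples)]
    n_samples_occurrences = [0] * n_samples
    for predictions in predictions_per_iteration:
        for i in predictions:
            n_samples_occurrences[i] += 1
        items = sorted(predictions.items())
        # Compare every unordered pair of test samples once.
        for a, (i, label_i) in enumerate(items):
            for j, label_j in items[a + 1:]:
                if label_i == label_j:
                    co_occ[i][j] += 1
                    co_occ[j][i] += 1  # keep the matrix symmetric
    return n_samples_occurrences, co_occ


counts, matrix = accumulate_co_occurrences(
    3, [{0: "A", 1: "A", 2: "B"}, {0: "B", 2: "B"}]
)
```

Dividing each entry by the number of iterations in which both samples were tested would turn these counts into a similarity score; whether the module normalises this way is not stated in this reference.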

run_ova Module

run_ova. Computing the similarity matrix in the OVA setting.

run_ova.compute_similarity_ova(n_iter, n_cores, n_locks, input_file, ground_truth_file, data_type, n_samples, data, data_occurrences, index_to_label, label_to_index, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, gtonly=True)

Computes the similarity matrix in the OVA setting.

Args:
  • n_iter (int): number of iterations.
  • n_cores (int): number of cores to use.
  • n_locks (int): number of locks to use on the full shared matrix.
  • data_type (str): data set to use.
  • n_samples (int): number of samples.
  • data (struct): initial data.
  • data_occurrences (struct): indicates where each sample occurs in the database.
  • index_to_label (list): maps a sample’s index to a string representation.
  • label_to_index (dict): reverse index_to_label mapping.
  • temp_folder (str): path to the temporary folder.
  • output_folder (str): path to the output folder.
  • annotation_params (list): parameters for synthetic annotation.
  • classification_params (list): parameters for classification.
  • verbose (int): sets the verbosity level.
  • with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  • with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
  • writing_steps (list): steps at which to save the partial matrix.
  • gtonly (bool, optional): if True, experiments are only conducted on query samples from the ground-truth. Defaults to True.
Returns:
  • n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  • synthetic_labels (list): distribution of synthetic labels for each iteration.
  • co_occ (ndarray): full similarity matrix.

run_ova.load_file(file, input_file, train_docs, label_to_index, classification_params)

Parses an Aquaint file to retrieve all occurrences of a single word. Used for threaded execution in a Pool.

Args:
  • file (str): name of the file to parse.
  • input_file (str): directory containing the file.
  • train_docs (dict): dict mapping a file to the documents containing the considered word.
  • label_to_index (dict): maps a label to the corresponding entity index.
  • classification_params (dict): classification parameters.
Returns:
  • samples (list): list of tagged sentences containing the considered word in the file.
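
The pattern of mapping a per-file loader over a pool can be sketched as follows. This is only an illustration under stated assumptions: load_file_stub is a hypothetical stand-in that does no parsing, load_all is an invented helper, and a ThreadPool is used here for simplicity, whereas the pool type used by run_ova is not specified in this reference.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def load_file_stub(file, input_file, train_docs):
    """Hypothetical stand-in for run_ova.load_file.

    Returns one tuple per document registered for `file` instead of
    parsing anything.
    """
    return [(input_file, file, doc) for doc in train_docs.get(file, [])]


def load_all(files, input_file, train_docs, n_cores):
    # Fix the arguments shared by all files; only `file` varies per task.
    worker = partial(load_file_stub, input_file=input_file, train_docs=train_docs)
    with ThreadPool(n_cores) as pool:
        per_file = pool.map(worker, files)  # results keep the order of `files`
    # Flatten the per-file lists into a single list of samples.
    return [sample for samples in per_file for sample in samples]


train_docs = {"a.txt": ["doc1", "doc2"], "b.txt": ["doc3"]}
samples = load_all(["a.txt", "b.txt", "c.txt"], "corpus", train_docs, n_cores=2)
```

Files with no registered documents, such as "c.txt" above, simply contribute an empty list.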