SIC variants

`run_basic` Module

run_basic. Computing the similarity matrix in the basic setting (BIN, WBIN, WUBIN).

run_basic.compute_similarity_basic(n_iter, n_cores, n_locks, data_type, n_samples, data, data_occurrences, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, writing_steps, convergence_step, convergence_criterion)

Computes the similarity matrix in the basic setting.

Args:

n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
data_occurrences (struct): indicates where each sample occurs in the database.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether each sample occurs only once in the data set.
with_common_label_wordform (bool): indicates whether samples with an identical wordform are given the same label.
writing_steps (list): steps at which to save the partial matrix.
convergence_step (int): interval between convergence checks; every `convergence_step` iterations, the script checks whether convergence has been reached.

Returns:

n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
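
The way `n_iter`, `writing_steps`, `convergence_step` and `convergence_criterion` interact can be pictured with the minimal sketch below. It is not the module's implementation: the `run_iterations` and `normalise` helpers, the random binary labelling that stands in for synthetic annotation and classification, and the mean-absolute-change stopping rule are all assumptions made for illustration. Plain lists are used instead of an ndarray to keep the sketch dependency-free.

```python
import random


def normalise(co_occ, n_done):
    """Divide each accumulated count by the number of iterations done so far."""
    n = len(co_occ)
    return [[co_occ[i][j] / n_done for j in range(n)] for i in range(n)]


def run_iterations(n_iter, n_samples, writing_steps, convergence_step,
                   convergence_criterion, seed=0):
    """Hypothetical loop: accumulate co-occurrences, snapshot, test convergence."""
    rng = random.Random(seed)
    co_occ = [[0.0] * n_samples for _ in range(n_samples)]
    snapshots = {}   # stands in for the partial matrices written to disk
    previous = None
    for it in range(1, n_iter + 1):
        # Stand-in for annotation + classification: a random binary labelling.
        labels = [rng.randint(0, 1) for _ in range(n_samples)]
        for i in range(n_samples):
            for j in range(n_samples):
                if labels[i] == labels[j]:
                    co_occ[i][j] += 1.0
        # Save a partial matrix at the requested steps.
        if it in writing_steps:
            snapshots[it] = normalise(co_occ, it)
        # Every convergence_step iterations, compare with the last checkpoint.
        if it % convergence_step == 0:
            current = normalise(co_occ, it)
            if previous is not None:
                change = sum(abs(current[i][j] - previous[i][j])
                             for i in range(n_samples)
                             for j in range(n_samples)) / n_samples ** 2
                if change < convergence_criterion:
                    return current, snapshots, it
            previous = current
    return normalise(co_occ, n_iter), snapshots, n_iter


matrix, snapshots, stopped_at = run_iterations(
    n_iter=200, n_samples=5, writing_steps=[10, 50],
    convergence_step=20, convergence_criterion=0.05)
```

In this sketch a smaller `convergence_step` means more frequent checks, at the cost of comparing matrices that differ by fewer iterations.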

`run_wem` Module

run_wem. Computing the similarity matrix in the WEM setting.

run_wem.compute_similarity_wem(n_iter, n_cores, n_locks, data_type, n_samples, data, ground_truth_file, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering)

Computes the similarity matrix in the WEM setting.

Args:

n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
ground_truth_file (str): path to the ground-truth file.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether each sample occurs only once in the data set.
with_common_label_wordform (bool): indicates whether samples with an identical wordform are given the same label.

Returns:

n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
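
`n_locks` is documented only as the number of locks used on the full shared matrix. One common way to realise such a scheme is lock striping, where row `i` is guarded by lock `i % n_locks`; whether the module partitions the matrix this way is an assumption. The sketch below uses threads and a plain list of lists, and the `StripedMatrix` class and `worker` function are hypothetical names, not part of `run_wem`.

```python
import threading


class StripedMatrix:
    """Square matrix whose rows are guarded by a fixed number of locks."""

    def __init__(self, n_samples, n_locks):
        self.values = [[0] * n_samples for _ in range(n_samples)]
        self.locks = [threading.Lock() for _ in range(n_locks)]

    def add(self, i, j, amount=1):
        # Row i always maps to the same lock, so updates to rows sharing a
        # lock are serialised while rows under other locks can proceed.
        with self.locks[i % len(self.locks)]:
            self.values[i][j] += amount


def worker(matrix, pairs):
    """Increment every (i, j) cell listed in pairs once."""
    for i, j in pairs:
        matrix.add(i, j)


shared = StripedMatrix(n_samples=4, n_locks=2)
pairs = [(i, j) for i in range(4) for j in range(4)]
threads = [threading.Thread(target=worker, args=(shared, pairs))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a single lock every update is serialised; with more locks, contention drops as long as workers tend to touch different rows.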

`run_ova` Module

run_ova. Computing the similarity matrix in the OVA setting.

run_ova.compute_similarity_ova(n_iter, n_cores, n_locks, input_file, ground_truth_file, data_type, n_samples, data, data_occurrences, index_to_label, label_to_index, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, gtonly=True)

Computes the similarity matrix in the OVA setting.

Args:

n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
data_occurrences (struct): indicates where each sample occurs in the database.
index_to_label (list): maps a sample’s index to a string representation.
label_to_index (dict): reverse of the index_to_label mapping.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether each sample occurs only once in the data set.
with_common_label_wordform (bool): indicates whether samples with an identical wordform are given the same label.
writing_steps (list): steps at which to save the partial matrix.
gtonly (bool, optional): if True, experiments are conducted only on query samples from the ground truth. Defaults to True.

Returns:

n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
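
`index_to_label` and `label_to_index` are documented as inverse mappings. A minimal sketch of deriving one from the other is given below; the labels are made up for illustration, and how the module itself builds these structures is not stated here.

```python
# Hypothetical labels; real ones come from the data set being processed.
index_to_label = ["bank", "plant", "bass"]

# Reverse mapping: label string -> sample index.
label_to_index = {label: index for index, label in enumerate(index_to_label)}

# Going index -> label -> index should return the starting index.
round_trip_ok = all(label_to_index[index_to_label[i]] == i
                    for i in range(len(index_to_label)))
```

The inversion is only well defined when every label in `index_to_label` is unique; with duplicates, later indices would overwrite earlier ones in the dict.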

run_ova.load_file(file, input_file, train_docs, label_to_index, classification_params)

Parses an Aquaint file to retrieve all occurrences of a single word. Used for threaded execution in a Pool.

Args:

file (str): name of the file to parse.
input_file (str): directory containing the file.
train_docs (dict): dict mapping a file to the documents containing the considered word.
label_to_index (dict): maps a label to the corresponding entity index.
classification_params (dict): classification parameters.

Returns:

samples (list): list of tagged sentences containing the considered word in the file.
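
The Pool usage mentioned above generally follows a map-over-files pattern. The sketch below illustrates it with `multiprocessing.pool.ThreadPool` and `functools.partial`; `load_one` is a hypothetical stand-in that filters in-memory strings, and it reproduces neither the signature nor the Aquaint parsing logic of `run_ova.load_file`.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def load_one(file, corpus, word):
    """Hypothetical stand-in: return the sentences of `file` containing `word`."""
    return [s for s in corpus[file] if word in s.split()]


# Toy in-memory corpus keyed by file name (made-up data).
corpus = {
    "a.txt": ["the bank was closed", "we sat down"],
    "b.txt": ["a river bank", "the bank reopened"],
}

with ThreadPool(2) as pool:
    # partial() fixes the shared arguments so the pool maps over file names only.
    per_file = pool.map(partial(load_one, corpus=corpus, word="bank"),
                        sorted(corpus))

# Flatten the per-file lists into a single list of samples.
samples = [sentence for sentences in per_file for sentence in sentences]
```

`pool.map` returns results in the order of its input, so `samples` keeps the file order regardless of which worker finishes first.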
