SIC variants
run_basic Module
run_basic. Computing the similarity matrix in the basic setting (BIN, WBIN, WUBIN).
- run_basic.compute_similarity_basic(n_iter, n_cores, n_locks, data_type, n_samples, data, data_occurrences, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, writing_steps, convergence_step, convergence_criterion)
Computes the similarity matrix in the basic setting.
- Args:
n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
data_occurrences (struct): indicates where each of the samples occurs in the database.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
writing_steps (list): steps at which to save the partial matrix.
convergence_step (int): every ‘convergence_step’ iterations, the script checks whether convergence is reached.
- Returns:
n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
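How `writing_steps` and `convergence_step` interact with the iteration loop is not spelled out above. The following is a minimal sketch of one plausible reading, not the module's actual code: `run_one_iteration`, `has_converged`, the tolerance `tol`, and the `.npy` file naming are all hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np


def run_one_iteration(rng, n_samples):
    """Hypothetical stand-in: one symmetric 0/1 co-occurrence update."""
    labels = rng.integers(0, 2, size=n_samples)
    return (labels[:, None] == labels[None, :]).astype(float)


def has_converged(previous, current, tol):
    """Hypothetical criterion: largest entrywise change of the averaged matrix."""
    return np.max(np.abs(current - previous)) < tol


def iterate(n_iter, n_samples, writing_steps, convergence_step, output_folder,
            tol=0.05, seed=0):
    rng = np.random.default_rng(seed)
    co_occ = np.zeros((n_samples, n_samples))
    previous = None
    written = []
    for step in range(1, n_iter + 1):
        co_occ += run_one_iteration(rng, n_samples)
        # Save the partial matrix at the requested steps.
        if step in writing_steps:
            path = os.path.join(output_folder, "co_occ_%d.npy" % step)
            np.save(path, co_occ)
            written.append(path)
        # Every `convergence_step` iterations, compare with the last checkpoint.
        if step % convergence_step == 0:
            current = co_occ / step
            if previous is not None and has_converged(previous, current, tol):
                return co_occ, step, written
            previous = current
    return co_occ, n_iter, written


with tempfile.TemporaryDirectory() as folder:
    co_occ, last_step, written = iterate(20, 5, [5, 10], 5, folder)
    print(last_step, [os.path.basename(p) for p in written])
```

In this sketch the partial matrix is written before the convergence test, so a run that stops early still leaves the snapshot for that step on disk.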
run_wem Module
run_wem. Computing the similarity matrix in the WEM setting.
- run_wem.compute_similarity_wem(n_iter, n_cores, n_locks, data_type, n_samples, data, ground_truth_file, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering)
Computes the similarity matrix in the WEM setting.
- Args:
n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
ground_truth_file (str): path to the ground truth.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
- Returns:
n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
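The `n_locks` argument suggests that the full shared matrix is protected by several locks rather than a single one. A minimal sketch of that idea (lock striping over rows) follows. It assumes threads and a NumPy array; the `StripedMatrix` class and the row-to-lock mapping are illustrative assumptions, not the module's implementation.

```python
import threading

import numpy as np


class StripedMatrix:
    """Shared matrix whose rows are guarded by n_locks locks (hypothetical helper)."""

    def __init__(self, n_samples, n_locks):
        self.values = np.zeros((n_samples, n_samples))
        self.locks = [threading.Lock() for _ in range(n_locks)]

    def add(self, row, cols, amount=1.0):
        # Row i is guarded by lock i % n_locks, so writers touching rows in
        # different stripes do not block each other.
        with self.locks[row % len(self.locks)]:
            self.values[row, cols] += amount


def worker(matrix, n_updates):
    n = matrix.values.shape[0]
    all_cols = np.arange(n)
    for k in range(n_updates):
        matrix.add(k % n, all_cols)


matrix = StripedMatrix(n_samples=8, n_locks=4)
threads = [threading.Thread(target=worker, args=(matrix, 80)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With more locks, fewer writers contend for the same stripe, at the cost of holding more lock objects; a single lock would serialise every update.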
run_ova Module
run_ova. Computing the similarity matrix in the OVA setting.
- run_ova.compute_similarity_ova(n_iter, n_cores, n_locks, input_file, ground_truth_file, data_type, n_samples, data, data_occurrences, index_to_label, label_to_index, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, gtonly=True)
Computes the similarity matrix in the OVA setting.
- Args:
n_iter (int): number of iterations.
n_cores (int): number of cores to use.
n_locks (int): number of locks to use on the full shared matrix.
data_type (str): data set to use.
n_samples (int): number of samples.
data (struct): initial data.
data_occurrences (struct): indicates where each of the samples occurs in the database.
index_to_label (list): maps a sample’s index to a string representation.
label_to_index (dict): reverse index_to_label mapping.
temp_folder (str): path to the temporary folder.
output_folder (str): path to the output folder.
annotation_params (list): parameters for synthetic annotation.
classification_params (list): parameters for classification.
verbose (int): sets the verbosity level.
with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
writing_steps (list): steps at which to save the partial matrix.
gt_only (boolean, optional): if True, experiments are conducted only for query samples from the ground truth. Defaults to True.
- Returns:
n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
synthetic_labels (list): distribution of synthetic labels for each iteration.
co_occ (ndarray): full similarity matrix.
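The relation between `index_to_label` (a list) and `label_to_index` (its reverse mapping, a dict) can be illustrated with a short sketch. The labels below are made up, and this is only one common way to build such a reverse index, not necessarily how the package does it.

```python
# Made-up string representations, one per sample index.
index_to_label = ["bank", "plant", "spring"]

# Reverse mapping: string representation -> sample index.
label_to_index = {label: index for index, label in enumerate(index_to_label)}

# Round trip: every index maps to a label that maps back to the same index.
round_trip_ok = all(label_to_index[index_to_label[i]] == i
                    for i in range(len(index_to_label)))
```

This construction assumes the labels are unique; with duplicates, later entries would overwrite earlier ones in the dict.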
- run_ova.load_file(file, input_file, train_docs, label_to_index, classification_params)
Parses an Aquaint file to retrieve all occurrences of a single word. Used for threaded execution in a Pool.
- Args:
file (str): name of the file to parse.
input_file (str): directory of file.
train_docs (dict): maps a file to the documents containing the considered word.
label_to_index (dict): maps a label to the corresponding entity index.
classification_params (dict): classification parameters.
- Returns:
samples (list): list of tagged sentences containing the considered word in the file.
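Since `load_file` is meant to be mapped over files in a Pool, a call pattern along the following lines is plausible. This is a sketch under assumptions: `load_file_stub` is a hypothetical stand-in with the same argument order, the file names and documents are made up, and a `ThreadPool` is used here for simplicity where the original may use a process pool.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def load_file_stub(file, input_file, train_docs, label_to_index, classification_params):
    """Hypothetical stand-in with the same argument order as load_file."""
    return [(file, doc) for doc in train_docs.get(file, [])]


# Made-up data: file name -> documents containing the considered word.
train_docs = {"a.xml": ["doc1", "doc2"], "b.xml": ["doc3"]}

# Fix every argument except `file`, so the pool only has to supply file names.
parse = partial(load_file_stub, input_file="corpus/", train_docs=train_docs,
                label_to_index={}, classification_params={})

with ThreadPool(2) as pool:
    per_file = pool.map(parse, sorted(train_docs))

# Flatten the per-file lists into a single list of samples.
samples = [sample for file_samples in per_file for sample in file_samples]
```

Binding the shared arguments with `functools.partial` keeps the mapped function unary, which is what `Pool.map` expects.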