SIC variants
run_basic Module
run_basic: Computing the similarity matrix in the basic setting (BIN, WBIN, WUBIN).
run_basic.compute_similarity_basic(n_iter, n_cores, n_locks, data_type, n_samples, data, data_occurrences, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, writing_steps, convergence_step, convergence_criterion)
Computes the similarity matrix in the basic setting.
- Args:
  - n_iter (int): number of iterations.
  - n_cores (int): number of cores to use.
  - n_locks (int): number of locks to use on the full shared matrix.
  - data_type (str): data set to use.
  - n_samples (int): number of samples.
  - data (struct): initial data.
  - data_occurrences (struct): indicates where each of the samples occurs in the database.
  - temp_folder (str): path to the temporary folder.
  - output_folder (str): path to the output folder.
  - annotation_params (list): parameters for synthetic annotation.
  - classification_params (list): parameters for classification.
  - verbose (int): sets the verbosity level.
  - with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  - with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
  - writing_steps (list): steps at which to save the partial matrix.
  - convergence_step (int): the script checks whether convergence is reached every ‘convergence_step’ iterations.
- Returns:
  - n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  - synthetic_labels (list): distribution of synthetic labels for each iteration.
  - co_occ (ndarray): full similarity matrix.
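As an illustration of how this entry point might be invoked, the sketch below wires together placeholder values. The data structures (data, data_occurrences), the parameter lists, the folder paths, and the load_dataset helper are assumptions standing in for whatever the project's own loading step produces; they are not part of this reference.

```python
# Hedged usage sketch: every value below is a placeholder. `load_dataset`
# is a hypothetical helper, not part of the documented API.
import run_basic

data, data_occurrences = load_dataset("NYT")        # hypothetical loader
annotation_params = ["random", 10]                  # illustrative synthetic-annotation settings
classification_params = ["crf", {}]                 # illustrative classifier settings

n_samples_occurrences, synthetic_labels, co_occ = run_basic.compute_similarity_basic(
    n_iter=100,                      # number of classification iterations
    n_cores=4,                       # worker processes
    n_locks=50,                      # locks guarding the full shared matrix
    data_type="NYT",
    n_samples=len(data),
    data=data,
    data_occurrences=data_occurrences,
    temp_folder="tmp/",
    output_folder="results/",
    annotation_params=annotation_params,
    classification_params=classification_params,
    verbose=1,
    debug=False,
    with_unique_occurrences=True,
    preclustering=None,
    writing_steps=[25, 50, 75],      # save the partial matrix at these iterations
    convergence_step=10,             # check for convergence every 10 iterations
    convergence_criterion=1e-3,
)
```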
run_wem Module
run_wem: Computing the similarity matrix in the WEM setting.
run_wem.compute_similarity_wem(n_iter, n_cores, n_locks, data_type, n_samples, data, ground_truth_file, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering)
Computes the similarity matrix in the WEM setting.
- Args:
  - n_iter (int): number of iterations.
  - n_cores (int): number of cores to use.
  - n_locks (int): number of locks to use on the full shared matrix.
  - data_type (str): data set to use.
  - n_samples (int): number of samples.
  - data (struct): initial data.
  - ground_truth_file (str): path to the ground-truth file.
  - temp_folder (str): path to the temporary folder.
  - output_folder (str): path to the output folder.
  - annotation_params (list): parameters for synthetic annotation.
  - classification_params (list): parameters for classification.
  - verbose (int): sets the verbosity level.
  - with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  - with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
- Returns:
  - n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  - synthetic_labels (list): distribution of synthetic labels for each iteration.
  - co_occ (ndarray): full similarity matrix.
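The WEM variant is called the same way; the main difference visible in the signature is that it takes a ground-truth file path instead of the data_occurrences structure. The sketch below reuses the placeholder objects from the basic example above, and the ground-truth path is likewise an assumption.

```python
# Hedged sketch, reusing the placeholder data / parameter objects from the
# basic example above; only the ground_truth_file argument is new here.
import run_wem

n_samples_occurrences, synthetic_labels, co_occ = run_wem.compute_similarity_wem(
    n_iter=100, n_cores=4, n_locks=50,
    data_type="NYT", n_samples=len(data), data=data,
    ground_truth_file="data/ground_truth.txt",   # assumed path
    temp_folder="tmp/", output_folder="results/",
    annotation_params=annotation_params,
    classification_params=classification_params,
    verbose=1, debug=False,
    with_unique_occurrences=True,
    preclustering=None,
)
```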
run_ova Module
run_ova: Computing the similarity matrix in the OvA setting.
run_ova.compute_similarity_ova(n_iter, n_cores, n_locks, input_file, ground_truth_file, data_type, n_samples, data, data_occurrences, index_to_label, label_to_index, temp_folder, output_folder, annotation_params, classification_params, verbose, debug, with_unique_occurrences, preclustering, gtonly=True)
Computes the similarity matrix in the OvA setting.
- Args:
  - n_iter (int): number of iterations.
  - n_cores (int): number of cores to use.
  - n_locks (int): number of locks to use on the full shared matrix.
  - data_type (str): data set to use.
  - n_samples (int): number of samples.
  - data (struct): initial data.
  - data_occurrences (struct): indicates where each of the samples occurs in the database.
  - index_to_label (list): maps a sample's index to a string representation.
  - label_to_index (dict): reverse index_to_label mapping.
  - temp_folder (str): path to the temporary folder.
  - output_folder (str): path to the output folder.
  - annotation_params (list): parameters for synthetic annotation.
  - classification_params (list): parameters for classification.
  - verbose (int): sets the verbosity level.
  - with_unique_occurrences (bool): indicates whether samples occur only once in the data set or not.
  - with_common_label_wordform (bool): indicates whether to give the same label to samples with identical wordform or not.
  - writing_steps (list): steps at which to save the partial matrix.
  - gtonly (bool, optional): if True, experiments are only conducted with query samples from the ground truth. Defaults to True.
- Returns:
  - n_samples_occurrences (list): number of occurrences of each sample in a test set over all iterations.
  - synthetic_labels (list): distribution of synthetic labels for each iteration.
  - co_occ (ndarray): full similarity matrix.
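A sketch of how the OvA entry point might be called, again with placeholder inputs; index_to_label and label_to_index are assumed to come from the same loading step as data and data_occurrences, and gtonly is left at its documented default.

```python
# Hedged sketch with placeholder inputs; index_to_label / label_to_index
# are assumed to map sample indices to surface forms and back.
import run_ova

n_samples_occurrences, synthetic_labels, co_occ = run_ova.compute_similarity_ova(
    n_iter=100, n_cores=4, n_locks=50,
    input_file="data/aquaint/",                  # assumed corpus directory
    ground_truth_file="data/ground_truth.txt",   # assumed path
    data_type="AQUAINT", n_samples=len(data), data=data,
    data_occurrences=data_occurrences,
    index_to_label=index_to_label,
    label_to_index=label_to_index,
    temp_folder="tmp/", output_folder="results/",
    annotation_params=annotation_params,
    classification_params=classification_params,
    verbose=1, debug=False,
    with_unique_occurrences=True,
    preclustering=None,
)   # gtonly keeps its default of True: only ground-truth query samples are used
```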
run_ova.load_file(file, input_file, train_docs, label_to_index, classification_params)
Parses an Aquaint file to retrieve all occurrences of a single word. Used for threaded execution in a Pool.
- Args:
  - file (str): name of the file to parse.
  - input_file (str): directory containing file.
  - train_docs (dict): dict mapping a file to the documents containing the considered word.
  - label_to_index (dict): maps a label to the corresponding entity index.
  - classification_params (dict): classification parameters.
- Returns:
  - samples (list): list of tagged sentences containing the considered word in the file.
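Since load_file is described as a per-file worker for a Pool, the sketch below shows that dispatch pattern. The file names, the train_docs mapping, label_to_index, and classification_params are placeholders for project-specific inputs, and a multiprocessing Pool is used here purely for illustration; the project may dispatch it differently.

```python
# Hedged sketch of the per-file dispatch pattern described above.
# File names, train_docs, label_to_index and classification_params are
# placeholders for project-specific inputs.
from functools import partial
from multiprocessing import Pool

import run_ova

files = ["APW19980601.0001", "APW19980601.0002"]   # illustrative Aquaint file names
worker = partial(
    run_ova.load_file,
    input_file="data/aquaint/",                    # directory containing the files
    train_docs=train_docs,
    label_to_index=label_to_index,
    classification_params=classification_params,
)

with Pool(processes=4) as pool:
    # one list of tagged sentences per parsed file
    per_file_samples = pool.map(worker, files)
```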