utils package¶
Submodules¶
utils.annotation module¶
annotation.py. Generation of synthetic annotations for the data samples.
-
utils.annotation.annotate(n, temp_folder, classifier_type, (train, test), annotation_params, **kwargs)¶ Return a synthetic annotation of both the training and the testing set.
- Args:
n(int): step identifier.
temp_folder: directory for temporary files.
classifier_type(str): type of the classifier that will take the annotation as input.
train(list): initial train data structure.
test(list): initial test data structure.
annotation_params(dict): additional annotation parameters.
with_common_label_wordform(bool, optional): if True, each occurrence of an entity wordform receives the same label. Defaults to False.
verbose(int, optional): controls verbosity level.
debug(bool, optional): enable/disable debug mode.
- Returns:
N(int): random max number of synthetic labels for this step.
n_unique_labels_used(int): number of synthetic labels that were actually used.
train_file(str): path to the formatted train data.
test_file(str): path to the formatted test data.
training_size(int): number of sequences in the training database.
testing_size(int): number of sequences in the testing database.
n_entities_train(int): number of entities in the training database.
entites_indices(list): identify the entities of interest (tag B) in the testing set; this will be used to filter the classifier's output.
-
utils.annotation.choose_features(pattern, distrib, output_pattern_file=None)¶ Given a set of features and their probability of occurrence (pattern), choose features at random for the current training step.
- Args:
pattern(list): features organized by importance category.
distrib(list): probability of sampling a feature for each category.
output_pattern_file(str, optional): if given, path to the file where the selected pattern is written. Defaults to None.
- Returns:
to_keep(list): the selected features.
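A minimal sketch of one plausible implementation, assuming each feature of a category is kept independently with that category's sampling probability (the exact sampling scheme is an assumption):

    import random

    def choose_features(pattern, distrib):
        # pattern[i] lists the features of importance category i;
        # distrib[i] is the probability of sampling a feature from it.
        to_keep = []
        for features, proba in zip(pattern, distrib):
            to_keep.extend(f for f in features if random.random() < proba)
        return to_keep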
-
utils.annotation.random() → x in the interval [0, 1).¶
utils.classify module¶
classify.py. For training and applying a classifier on the artificially annotated data set.
-
utils.classify.do_classify_step(n, temp_folder, train, test, test_entities_indices, coocc_matrix, unit_length, count_lab, classification_params, **kwargs)¶ Builds the similarity for the current step given a classifier and synthetic annotation.
- Args:
n(int): step number.
temp_folder(str): path to the directory for storing temporary files.
train: annotated training set (structure may depend on the classifier).
test: formatted testing set.
test_entities_indices(list): indices of test entities and their position in the classifier output.
coocc_matrix(ndarray): shared similarity matrix.
unit_length(int): length of a locked cell in the matrix.
count_lab(ndarray): counts how many times an entity has been classified as non-null in the test set.
classification_params(dict): additional classification parameters.
clean(bool, optional): if True, removes the temporary files that were created. Defaults to True.
with_unique_occurrences(bool, optional): if True, each entity occurrence is considered unique and can receive a different label. Defaults to True.
verbose(int, optional): controls verbosity level. Defaults to 1.
debug(bool, optional): runs in debug mode.
pretrained(optional): pretrained model.
-
utils.classify.update_similarity(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)¶ Optimizes the similarity matrix update in the case where multiple occurrences correspond to the same entity (e.g. the Aquaint2 case).
- Args:
n(int): step number.
result_iter(str): output of the classification algorithm (Wapiti CRF).
coocc_matrix(ndarray): similarity matrix of the current thread.
unit_cell_length(int): length of one locked cell in the matrix.
label_occurrences(ndarray): counts how many times an entity has been classified as non-null in the test set.
similarity_type(str): type of the similarity to use.
verbose(int, optional): controls verbosity level. Defaults to 1.
- Returns:
n_test_labels: number of distinct annotated labels in the testing base.
weights: distribution of the samples in the test classifications (only with weighted similarity).
b(float): penalty to be added at the end of construction.
-
utils.classify.update_similarity_unique(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)¶ Updates the similarity matrix in the case where each occurrence is a unique named entity (i.e. the default case).
- Args:
n(int): step number.
result_iter(str): generator expression on the output of the classifier.
coocc_matrix(ndarray): similarity matrix of the current thread.
unit_cell_length(int): length of one locked cell in the matrix.
label_occurrences(ndarray): counts how many times an entity has been classified as non-null in the test set.
similarity_type(str): type of the similarity to use.
verbose(int, optional): controls verbosity level. Defaults to 1.
- Returns:
n_test_labels: number of distinct annotated labels in the testing base.
weights: distribution of the samples in the test classifications (only with weighted similarity).
b(float): penalty to be added at the end of construction.
utils.em_analysis module¶
em_analysis.py. EM estimation of the mixture parameters for each iteration.
-
utils.em_analysis.em_step(pi0, p0, p1, x, n_obs, N, cores)¶ A single expectation-maximization step, run with multiple processes (each process computes its part of the summations for one set of observations).
- Args:
pi0(float): current estimate of the pi0 parameter (P(x ~ y)).
p1(float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
p0(float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
x(list): observations (as a dict mapping unique observation -> number of occurrences).
n_obs(int): total number of observations.
N(int): dimension of the multivariate Bernoulli.
cores(int): number of cores to use.
- Returns:
nz1(float): estimate of the z1 hidden variables.
npi0(float): estimate of the pi0 parameter (P(x ~ y)).
np0(float): estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
np1(float): estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
-
utils.em_analysis.em_step_threaded(pi0, p0, p1, obs, N, res_queue)¶ Function for parameter estimation on one thread.
- Args:
pi0(float): current estimate of the pi0 parameter (P(x ~ y)).
p0(float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
p1(float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
obs(list): observations given to this thread (as a dict mapping unique observation -> number of occurrences).
N(int): dimension of the multivariate Bernoulli.
res_queue(Queue): output queue.
-
utils.em_analysis.estimate_parameters_em(co_occ, N, p1i=0.9, p0i=0.1, pi0i=0.8, n_iter=20, cores=20)¶ Expectation-Maximization on a 2-component Bernoulli mixture.
- Args:
co_occ(list): list of observations.
N(int): dimension of an observation (here, corresponds to the number of considered iterations).
p1i(float, optional): initial estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)). Defaults to 0.9.
p0i(float, optional): initial estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)). Defaults to 0.1.
pi0i(float, optional): initial estimate of the pi0 parameter (P(x ~ y)). Defaults to 0.8.
n_iter(int, optional): number of iterations. Defaults to 20.
cores(int, optional): number of cores to use. Defaults to 20.
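For illustration, a minimal single-process sketch of the estimation loop. It assumes each observation is summarized by its number of agreeing classifications k out of N, so that both mixture components reduce to Binomial distributions; em_binomial_mixture is a hypothetical name, not this module's API:

    import numpy as np
    from scipy.stats import binom

    def em_binomial_mixture(counts, N, p1i=0.9, p0i=0.1, pi0i=0.8, n_iter=20):
        counts = np.asarray(counts, dtype=float)
        pi0, p1, p0 = pi0i, p1i, p0i
        for _ in range(n_iter):
            # E-step: posterior probability that a pair is "same class" (x ~ y)
            l1 = pi0 * binom.pmf(counts, N, p1)
            l0 = (1.0 - pi0) * binom.pmf(counts, N, p0)
            z1 = l1 / (l1 + l0)
            # M-step: re-estimate the mixture weight and both Bernoulli parameters
            pi0 = z1.mean()
            p1 = (z1 * counts).sum() / (N * z1.sum())
            p0 = ((1.0 - z1) * counts).sum() / (N * (1.0 - z1).sum())
        return pi0, p1, p0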
utils.error module¶
error.py. Custom exceptions and error-handling helpers.
-
exception
utils.error.ConfigError¶ Bases:
exceptions.Exception. Exception raised when a compatibility error is found in the configuration options.
-
exception
utils.error.InputError¶ Bases:
exceptions.Exception. Exception raised when an error is found in the input data.
-
exception
utils.error.ParsingError¶ Bases:
exceptions.Exception. Exception raised when a parsing error is found in the configuration options.
-
utils.error.signal_handler(signal, frame)¶ Handles the Keyboard interrupt signal in case of multi-process execution.
-
utils.error.warning(obj)¶ Print a warning on the error stream.
- Args:
obj(str): warning message.
utils.eval module¶
-
utils.eval.APP(predicted, labels)¶
-
utils.eval.F(predicted, labels)¶
-
utils.eval.P(predicted, labels)¶
-
utils.eval.R(predicted, labels)¶
-
utils.eval.Read_file(file_cl)¶
-
utils.eval.V(predicted, labels)¶
-
utils.eval.combination(a, k)¶
-
utils.eval.grp2idx(labels)¶
-
utils.eval.mi(predicted, labels)¶
-
utils.eval.mutual_info(x, y)¶
-
utils.eval.nmi(x, y)¶
-
utils.eval.rand(predicted, labels)¶
utils.matrix_op module¶
matrix_op.py. Functions operating on the full similarity matrix (normalization and distribution analysis).
-
utils.matrix_op.ROC_analysis(line, name, output_folder, ground_truth)¶ Plots a ROC curve for one given sample (one line of the matrix).
- Args:
line(ndarray): similarities for one sample.
name(str): prefix for naming the plots.
output_folder(str): path to the directory to output the plots.
ground_truth(list): indices of the samples belonging to the same class as the current sample.
-
utils.matrix_op.ROC_mean_analysis(lines, key, output_folder, gt)¶ Plots all ROC curves and their horizontal/vertical means for several samples of the same class (lines of the matrix).
- Args:
lines(ndarray): similarities for the considered samples.
key(str): prefix for naming the plots.
output_folder(str): path to the directory to output the plots.
gt(list): indices of the samples belonging to the currently considered class.
-
utils.matrix_op.distribution_analysis(line, name, output_folder, temp_folder, ground_truth, kbest=[2000, 1000, 500, 200], mode='matlab')¶ Plots the similarities histogram and densities (+ ground-truth display) for a line (sample) of the similarity matrix, at various scales.
- Args:
line(ndarray): similarities for one sample, sorted in increasing order.
name(str): prefix for naming the plots.
output_folder(str): path to the directory to output the plots.
temp_folder(str): path to the directory to output temporary plots (before concatenation).
ground_truth: indices of samples belonging to the same class as the current sample.
kbest(list, optional): indices for zoom-ins (keep the k best values, for each k in kbest).
mode(str): if 'matlab', plots the distribution on the whole interval using matplotlib. If 'R', plots the distribution for all zoom values in kbest using R (requires the ggplot2 library).
-
utils.matrix_op.keep_k_best(co_occ, k=200)¶ Keep the k best values in the matrix and set the rest to 0. Relies on the bottleneck library for fast sorting.
- Args:
co_occ(ndarray): input matrix.
k(int, optional): number of values to keep. Defaults to 200.
Returns:
normalized(ndarray): normalized matrix.
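A numpy-only sketch of the same idea (the module itself relies on bottleneck for speed; applying the threshold line by line is an assumption):

    import numpy as np

    def keep_k_best(co_occ, k=200):
        # Zero out everything below the k-th largest value of each line
        # (assumes each line has at least k entries).
        out = co_occ.copy()
        for line in out:
            threshold = np.partition(line, -k)[-k]
            line[line < threshold] = 0
        return out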
-
utils.matrix_op.normalize(co_occ)¶ Returns a normalized version of the input matrix.
Args:
co_occ(ndarray): co-occurence matrix.
Returns:
normalized(ndarray): a normalized version of the co-occurence matrix.
-
utils.matrix_op.normalize_gauss_global(co_occ)¶ Normalize the full matrix with respect to its global standard deviation and mean (X <- (X - mean) / std).
Args:
co_occ(ndarray): input matrix.
Returns:
normalized(ndarray): normalized matrix.
-
utils.matrix_op.normalize_gauss_local(co_occ)¶ Normalize the full matrix line by line, each line with respect to its own standard deviation and mean (X <- (X - mean) / std).
Args:
co_occ(ndarray): input matrix.
Returns:
normalized(ndarray): normalized matrix.
-
utils.matrix_op.normalize_min_max(co_occ)¶ Normalize the full matrix globally with respect to its minimum and maximum value (X <- (X - min) / (max - min)).
Args:
co_occ(ndarray): input matrix.
Returns:
normalized(ndarray): normalized matrix.
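The three normalizations above follow directly from their formulas; a compact sketch:

    import numpy as np

    def normalize_gauss_global(co_occ):
        # X <- (X - mean) / std, computed over the whole matrix
        return (co_occ - co_occ.mean()) / co_occ.std()

    def normalize_gauss_local(co_occ):
        # Same transform, but each line uses its own mean and std
        mean = co_occ.mean(axis=1, keepdims=True)
        std = co_occ.std(axis=1, keepdims=True)
        return (co_occ - mean) / std

    def normalize_min_max(co_occ):
        # X <- (X - min) / (max - min), computed over the whole matrix
        low, high = co_occ.min(), co_occ.max()
        return (co_occ - low) / (high - low)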
-
utils.matrix_op.statistical_analysis_binary(line, rnd_distrib, ground_truth, temp_folder, output_folder, name, suffix=None)¶ For binary SIC: plots the similarity distribution and its given theoretical model (Poisson Binomial).
- Args:
line(list): similarities for the considered sample x.
rnd_distrib(list): values of the discrete random-case distribution (k -> P(X = k)).
ground_truth(list): indices of the samples in the same class as x.
temp_folder(str): path to the folder containing the temporary files.
output_folder(str): path to the output folder.
name(str): string representation of the considered sample.
suffix(str, optional): additional suffix for the output file. Defaults to None (no suffix).
-
utils.matrix_op.statistical_analysis_weighted(line, N, ground_truth, temp_folder, output_folder, name, step=1.0)¶ Plots the similarity distribution and the theoretical Gaussian distribution in the case of a weighted similarity (negative samples).
- Args:
line(list): similarities for the considered sample x.
N(int): standard deviation of the Gaussian.
ground_truth(list): indices of the samples in the same class as x.
temp_folder(str): path to the folder containing the temporary files.
output_folder(str): path to the output folder.
name(str): string representation of the considered sample.
step(float, optional): step between ticks on the x-axis.
utils.one_step module¶
one_step.py. One classification step for building the similarity. Designed for threaded execution.
-
utils.one_step.split_data(n, data, data_type, train_frac, classifier_type, annotation_params, temp_folder, with_full_test=False)¶ Splits the given database into a training and testing set.
- Args:
n(int): iteration identifier.
data(list): initial data structure.
data_type(str): data set used for the experiments.
train_frac(float): proportion of the database to keep for training.
classifier_type(str): type of classifier (for on-the-fly parsing format in AQUAINT).
annotation_params(str): annotation parameters for OVA.
temp_folder(str): path to temporary folder.
with_full_test(bool, optional): if True, use the whole dataset (including training) for testing.
- Returns:
train(list): the data kept for training (generator).
test(list): the data kept for testing (generator).
test_indices(list): indices of the test samples in the whole data; used to compute the number of test occurrences afterwards (except for AQUAINT, where test_indices directly returns the occurrences of each sample in the test set).
-
utils.one_step.thread_step(n, coocc_matrix, unit_cell_length, n_samples, sim_queue, data, temp_folder, data_type, annotation_params, classification_params, verbose=1, debug=False, with_unique_occurrences=False, preclustering=[])¶ One classification iteration. Results are output in a queue.
- Args:
n(int): index of the current iteration.
coocc_matrix(array): similarity matrix (shared in memory).
unit_cell_length(int): length of one locked cell in the matrix.
n_samples(int): number of samples in the data set.
sim_queue(Queue): output queue.
data(list): initial data structure.
temp_folder(str): path to temporary folder.
data_type(str): data set used.
annotation_params(dict): parameters for the synthetic annotation.
classification_params(dict): parameters for the supervised classification algorithm.
verbose(int, optional): controls the verbosity level. Defaults to 1.
debug(bool, optional): runs in debugging mode. Defaults to False.
with_unique_occurrences(bool, optional): True when occurrences of a same entity are distinct items in the database (e.g. NER). Defaults to False.
preclustering(list, optional): entity index to class mapping. This clustering is used to give the same annotation to entities in the same class.
utils.opt module¶
opt.py. Functions designed for vectorized updates of the similarity matrix instead of cell-by-cell updates.
-
utils.opt.cartesian(arrays, out=None, numpy=True)¶ Generate a cartesian product of input arrays.
- Parameters:
- arrays : list of array-like. 1-D arrays to form the cartesian product of.
- out : ndarray. Array to place the cartesian product in.
- Returns:
- out : ndarray. 2-D array of shape (M, len(arrays)) containing cartesian products formed of input arrays.
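A self-contained version of the classic numpy recipe this function appears to follow (a sketch, not necessarily the exact implementation):

    import numpy as np

    def cartesian(arrays, out=None):
        # The first column repeats each element of arrays[0]; the remaining
        # columns tile the cartesian product of the other arrays.
        arrays = [np.asarray(a) for a in arrays]
        n = int(np.prod([a.size for a in arrays]))
        if out is None:
            out = np.zeros((n, len(arrays)), dtype=arrays[0].dtype)
        m = n // arrays[0].size
        out[:, 0] = np.repeat(arrays[0], m)
        if arrays[1:]:
            cartesian(arrays[1:], out=out[0:m, 1:])
            for j in range(1, arrays[0].size):
                out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
        return out

    # cartesian(([1, 2], [4, 5])) -> [[1, 4], [1, 5], [2, 4], [2, 5]]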
-
utils.opt.cartesian_prod(arrays, full=None, out=None, numpy=True)¶ Generates the products for all possible cartesian combinations of the arrays components.
- Args:
arrays: list of array-like.
full(int): length of the base array. Starting value should be None.
out(ndarray): array where to put the result.
numpy(bool, optional): if False, convert the arrays to numpy type. Defaults to True.
- Returns:
out(ndarray): array containing the products for all possible cartesian combinations.
-
utils.opt.pairs_combination(a, numpy=True)¶ Returns all the possible pair combinations from an index array.
- Args:
a: array-like.
numpy(bool, optional): if False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing all the possible pair combinations from an index array.
-
utils.opt.pairs_combination_indices(a, n_samples, numpy=True)¶ Returns all the possible pair combinations from an index array, in their index form (upper-triangle matrix).
- Args:
a: array-like.
n_samples: size of the 2D matrix (i.e. max possible value of the index + 1).
numpy(bool, optional): if False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing the indices of all possible pairs combinations.
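A sketch of the underlying upper-triangle index arithmetic, assuming the usual condensed-matrix convention (diagonal excluded):

    import numpy as np

    def pairs_combination_indices(a, n_samples):
        # Map every pair (i, j), i < j, of entries of a to its index in the
        # flattened upper triangle of an n_samples x n_samples matrix.
        a = np.sort(np.asarray(a))
        r, c = np.triu_indices(len(a), k=1)
        i, j = a[r], a[c]
        return i * n_samples - i * (i + 1) // 2 + (j - i - 1)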
-
utils.opt.product_combination(a, numpy=True)¶ Returns all the possible pair product combinations from an index array.
- Args:
a: array-like.
numpy(bool, optional): if False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing all the possible pair product combinations from an index array.
utils.output_format module¶
output_format.py. Functions related to the format of the program outputs.
-
class
utils.output_format.Tee(*files)¶ Bases:
object. Tee object used for linking several files (e.g. linking stdout to a log file).
-
flush()¶
-
write(obj)¶
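A minimal sketch of such a Tee object and how it is typically hooked up to stdout:

    import sys

    class Tee(object):
        """Duplicates every write to all the wrapped file objects."""
        def __init__(self, *files):
            self.files = files

        def write(self, obj):
            for f in self.files:
                f.write(obj)

        def flush(self):
            for f in self.files:
                f.flush()

    # Typical usage: everything printed also lands in the log file.
    # sys.stdout = Tee(sys.stdout, open('experiment.log', 'w'))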
-
-
utils.output_format.clustering_to_file(output_folder, clustering, suffix=None)¶ Writes a clustering in a file.
- Args:
output_folder(str): path to the output folder.
clustering(dict): clustering represented as a dictionary mapping a cluster to its elements.
suffix(str, optional): if given, added as a suffix to the name of the output file. Defaults to None.
- Returns:
output_path(str): Path to the file in which the clustering was written.
-
utils.output_format.clustering_to_string(clustering)¶ Return a clustering as a string with entities separated by tabs and clusters separated by newlines ('\n').
- Args:
clustering(dict): clustering represented as a dictionary mapping a cluster to its elements.
- Returns:
clustering_to_string(str): string representation of the clustering.
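A sketch matching the described format:

    def clustering_to_string(clustering):
        # One cluster per line; entities inside a cluster separated by tabs.
        return '\n'.join('\t'.join(str(e) for e in elements)
                         for elements in clustering.values())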
-
utils.output_format.init_folder(path)¶ Create the directory at path if it does not already exist.
- Args:
path(str): path of the directory to initialize.
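A minimal sketch (returning the path is an assumption):

    import os

    def init_folder(path):
        # Create the directory (and any missing parents) only if needed.
        if not os.path.isdir(path):
            os.makedirs(path)
        return path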
-
utils.output_format.load_cooc(mat)¶ Load and format, if needed, the given similarity matrix.
- Args:
mat(str): path to the similarity matrix (txt, npy or pickle).
-
utils.output_format.log(output_folder)¶ Creates a file for redirecting output.
- Args:
output_folder(str): path to root of directory for outputs.
-
utils.output_format.readable_clustering(output_folder, clustering, index_to_label, suffix=None)¶ Outputs a clustering in a more readable format.
- Args:
output_folder(str): path to the output folder.
clustering(dict): clustering represented as a dictionary mapping a cluster to the list of its elements.
index_to_label(list): mapping from an entity index to a string label.
suffix(str, optional): if given, added as a suffix to the name of the output file. Defaults to None.
- Returns:
output_file(str): path to the file now containing the clustering under readable format.
-
utils.output_format.save_coocc(output_folder, coocc, suffix=None, type='binary')¶ Saves the current co-occurrence matrix.
- Args:
output_folder(str): path to the root of the output folder.
coocc(ndarray): co-occurrence matrix.
suffix(str, optional): if given, added as a suffix to the name of the output file. Defaults to None.
type(str): output file format (text, binary or pickle).
- Returns:
output_file(str): path to the file now containing the matrix.
-
utils.output_format.save_coocc_mcl(output_folder, coocc, index_to_label)¶ Save the current co-occurrence matrix (only non-zero entries) in the label format of MCL.
- Args:
output_folder(str): path to the root of the output folder.
coocc(ndarray): co-occurrence matrix.
index_to_label(list): mapping from an entity index to a readable label.
- Returns:
output_file(str): path to the file now containing the matrix.
utils.parse module¶
parse.py. Functions for parsing data files and ground-truth file for a given data set. The module can also be used as a script to generate .qrel version of the ground-truth for the information retrieval evaluation procedure.
-
utils.parse.ground_truth_AQUA_qrel(ground_truth_file, output_file, aqua_entities_file)¶ Builds the qrel file for the AQUAINT ground-truth to be used for the nearest neighbour evaluation.
- Args:
ground_truth_file(str): path to the AQUAINT ground-truth file.
output_file(str): path to the qrel output file.
aqua_entities_file(str): path to the file listing all AQUAINT entities.
-
utils.parse.ground_truth_indices(data_type, ground_truth_file, label_to_index=None)¶ Return indices of the entities that are in the ground truth. Only used for AQUAINT since for NER, samples = ground truth.
- Args:
data_type(str): dataset identifier.
ground_truth_file(str): path to the ground-truth file.
label_to_index(list): maps a label to the corresponding entity index. Required for Aquaint.
- Returns:
indices(list): indices of samples in ground-truth.
-
utils.parse.ground_truth_pairs(data_type, ground_truth_file, n_samples)¶ Retrieves indices of sample pairs that have the same ground-truth class. Indices are computed as if using the upper triangle of a square n_samples x n_samples matrix.
- Args:
data_type(str): data set.
ground_truth_file(str): ground-truth file for the given data set.
n_samples(int): number of samples for the given data set.
- Return:
indices(int list): list of indices of sample pairs with the same ground-truth class.
-
utils.parse.ground_truth_qrel(ground_truth_file, output_file, index_to_label)¶ Builds the qrel file for NER and AUDIO ground-truth to be used for the nearest neighbour evaluation.
- Args:
ground_truth_file(str): path to the ground-truth file.
output_file(str): path to the qrel output file.
index_to_label(list): maps an entity index to the corresponding string label.
-
utils.parse.parse_AQUA_entities(entities_file)¶ Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.
- Args:
entities_file(str): file containing the retrieved entities and their number of occurrences.
- Returns:
index_to_label(list): list associating a sample's index with its string representation.
label_to_index(list): reversed mapping of index_to_label.
-
utils.parse.parse_AQUA_single_partial(classifier_type, data_file, label_to_index, training_size, testing_size, train_acc, test_acc, test_occurences, train_included_test=False)¶ Read and partially parse the given Aquaint2 data file. Contrary to parse_AQUA_single, this function does not load the whole data, but directly builds the required training and testing sets.
- Args:
classifier_type(str): type of the classifier that will be used for the experiment.
data_file(str): path to the data file.
label_to_index(str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
training_size(int): number of docs for training from this document, or a list of doc indices + sentences to keep.
testing_size(int): number of docs for testing from this document, or a list of doc indices + sentences to keep.
train_acc(iterator): accumulator for training sentences.
test_acc(iterator): accumulator for testing sentences.
test_occurences(array): accumulator for the number of occurrences of each word in the test database.
train_included_test(bool, optional): if True, retrieved training sentences will also be included in the testing set.
- Returns:
train_acc(iterator): updated training sentences accumulator.
test_acc(iterator): updated testing sentences accumulator.
-
utils.parse.parse_NER(classifier_type, data_file)¶ Reads and parses the given data file for the Named Entity Recognition (NER) task.
- Args:
classifier_type(str): type of the classifier that will be used for the experiments.
data_file(str): path to the data file.
- Returns:
n_samples(int): number of samples in the database.
data(list): structure containing the data. A word is usually represented by a tuple (index of the sample if it is an entity of interest, word representative of the sample, additional tags, B(I)O tag).
data_occurrences(list): number of occurrences of each word in each sentence.
index_to_label(list): list associating a sample's index with its string representation.
summary(str): additional information on the dataset.
-
utils.parse.parse_audio(data_folder, selected_features)¶ Parse features selection for the Audio task.
- Args:
data_folder(str): if the features are precomputed, then data_folder is the path to the directory containing the features. Otherwise, it is a file containing the samples folder as its first line and all the possible HTK feature markers (one per line) to consider.
selected_features(list): list of the feature types to use in the experiments (given in the configuration file).
- Returns:
n_samples(int): number of samples in the dataset.
data(list): maps a feature identifier to the corresponding HTK-generated features.
index_to_label(list): maps an entity index to a string label.
-
utils.parse.parse_data(data_type, classification_params, data_file)¶ Parse data file(s) depending on the chosen options.
- Args:
data_type(str): dataset identifier.
classification_params(str): classifier parameters.
data_file(str): path to data file.
- Returns:
n_samples(int): number of samples in the database.
data(list): structure containing the data.
data_occurrences(list): number of entity occurrences in each sentence/doc of the data (only for AQUAINT and NER).
index_to_label(list): list associating a sample's index with its string representation.
label_to_index(list): reversed mapping of index_to_label.
-
utils.parse.parse_ground_truth(data_type, ground_truth_file, label_to_index=None)¶ Reads and parses the given ground-truth file.
- Args:
data_type(str): dataset identifier.
ground_truth_file(str): path to the ground-truth file.
label_to_index(dict, optional): maps an AQUAINT word to its index. Required when outputting the AQUAINT ground truth with the entity indices rather than the string representation.
- Returns:
ground_truth(list): list associating a sample with its ground-truth cluster:
- for NER, ground_truth: cluster (str) -> entity indices (int list)
- for AQUA, ground_truth: entity (str) -> entity indices (int list) if label_to_index, else (str list)
-
utils.parse.parse_pattern(classifier_type, pattern_file)¶ Reads and parses the given pattern file (Wapiti/CRF++ expected format).
- Args:
classifier_type(str): type of the classifier that will be used for the experiment.
pattern_file(str): path to the pattern file.
- Returns:
features(list): features organized by category.distrib(list): probability of sampling a feature for each category.
-
utils.parse.split_on_ground_truth_no_indices(data_type, ground_truth_file, numb=6, keys=None)¶ Given the ground-truth data file, returns a random set of entities for each class (usually used to visualize similarity distributions).
- Args:
data_type(str): dataset identifier.
ground_truth_file(str): path to the ground truth file.
numb(int): number of entities to return for each class.
keys(list, optional): if given, the algorithm returns a list of samples whose ground-truth classes form the keys list.
- Returns:
ground_truth(list): list associating a sample with its ground-truth cluster.
selected_entities(list): selected entities to be plotted.
utils.parse_stat module¶
parse_stat.py. Additional functions for parsing some statistics and precomputing information on the data sets (mostly Aquaint).
-
utils.parse_stat.count_aqua_docs(directory, output_folder, aqua_entities_file)¶ Counts the number of occurrences of each word in each document as well as in the whole data set.
- Args:
directory(str): directory containing all xml documents of the dataset.
output_folder(str): path to the output directory.
aqua_entities_file(str): path to the file listing all AQUAINT entities.
-
utils.parse_stat.count_aqua_docs_score(directory, output_file, aqua_entities_file)¶ Computes a score for each document based on how rare the words occurring in the document are.
- Args:
directory(str): path to the directory containing the Aquaint data files.
output_file(str): path to the file to write the output scores.
aqua_entities_file(str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.parse_AQUA(classifier_type, data_folder, label_to_index)¶ Read and parse all data files from the AQUAINT folder.
- Args:
classifier_type(str): type of the classifier that will be used for the experiment.
data_folder(str): path to the folder containing all data files.
label_to_index(str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
- Returns:
data(list): structure containing the data (file -> docs -> sentence -> words).
data_occurrences(list): number of entity occurrences in each sentence/doc of the data.
summary(str): additional information on the dataset.
-
utils.parse_stat.parse_AQUA_entities(entities_file)¶ Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.
- Args:
entities_file(str): file containing the retrieved entities and their number of occurrences.
- Returns:
index_to_label(list): list associating a sample's index with its string representation.
label_to_index(list): reversed mapping of index_to_label.
-
utils.parse_stat.parse_AQUA_single(classifier_type, data_file, label_to_index)¶ Read and parse the full given AQUAINT data file.
- Args:
classifier_type(str): type of the classifier that will be used for the experiment.
data_file(str): path to the data file.
label_to_index(str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
- Returns:
data(list): structure containing the data (file -> docs -> sentence -> words).
data_occurrences(list): number of entity occurrences in each sentence/doc of the data.
-
utils.parse_stat.retrieve_aqua_entities(directory, output_file)¶ Retrieves all interesting entities for the AQUAINT2 dataset (common names with strictly more than 10 occurrences).
- Args:
directory(str): directory containing all xml documents of the dataset.
output_file(str): path where to output the retrieved entities and their number of occurrences.
-
utils.parse_stat.retrieve_aqua_occurrences(directory, output_file, aqua_entities_file)¶ Retrieves all occurrences of each word in the dataset (position of the occurrences given as a tuple file -> doc -> sentence).
- Args:
directory(str): path to the directory containing the Aquaint data files.
output_file(str): path to the directory to write the output files (1 file = 1 word).
aqua_entities_file(str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.retrieve_aqua_occurrences_sentences(directory, output_file, aqua_entities_file)¶ Same as parse_stat.retrieve_aqua_occurrences, but outputs the sentence of each occurrence rather than its position.
- Args:
directory(str): path to the directory containing the Aquaint data files.
output_file(str): path to the file to write the output.
aqua_entities_file(str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.stat_aqua(directory)¶ Computes some statistics about the AQUAINT dataset.
- Args:
directory(str): directory containing all xml documents of the dataset.
utils.plot module¶
plot.py. Functions related to plotting and data visualization.
-
utils.plot.compare_params(true_params, em_params, ip0, output_folder)¶ Plots several visual comparisons of the parameters estimation.
- Args:
true_params(list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
em_params(dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
ip0(list): p0 parameter estimates under the independence assumption.
output_folder(str): path to the output folder.
-
utils.plot.convergence_curve(output_folder, log_file)¶ Plots the evolution of the correlation coefficients for the convergence experiments.
- Args:
output_folder(str): path to the directory for the outputs.
log_file(str): path to the log file output during the experiments (convergence_analysis.py script).
-
utils.plot.expected_binary(true_params, em_params, tpi0, epi0, output_folder)¶ Plots the 2-component Poisson Binomial mixture model given its parameters.
- Args:
true_params(list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
em_params(dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
tpi0(float): ground-truth pi0 estimate.
epi0(float): EM pi0 estimate.
output_folder(str): path to the output folder.
-
utils.plot.fraction_plot(clustering, ground_truth, output_name)¶ Plots a histogram representation of a clustering with colors representing the ground-truth clustering.
- Args:
clustering((cluster -> values) dict): a dict representation of a clustering.
ground_truth((cluster -> values) dict): a dict representation of the ground-truth clustering.
output_name(str): prefix of the file in which to output the figure.
-
utils.plot.get_Aquaint_graph(start, synonyms, nodes, edges, level, maxlevel)¶ Returns an excerpt networkx graph from the Aquaint ground-truth. Recursive function.
- Args:
start(list): list of nodes to build edges from.
synonyms(dict): Aquaint ground-truth neighbour relations.
nodes(dict): list of nodes already built, organized by depth (minimum depth relative to one of the starting nodes).
edges(list): list of edges already built.
level(int): current depth (starting at 0).
maxlevel(int): max depth to consider.
-
utils.plot.heatmap(similarity_matrix, ground_truth, output_folder)¶ Plots a heat map of the similarity matrix.
- Args:
similarity_matrix(ndarray): similarity matrix.
ground_truth(dict): ground-truth clustering.
output_folder(str): path to the output folder.
-
utils.plot.histo_cluster(clustering, output_name)¶ Plots a histogram representation of a clustering (number of samples per cluster).
- Args:
clustering((cluster -> values) dict): a dict representation of a clustering.
output_name(str): prefix of the file in which to output the figure.
-
utils.plot.is_in_upper_level(nodes, word, level)¶ Determines whether a node has already been seen as a closer neighbour (at a lower level).
- Args:
nodes(dict): dict mapping a level to the nodes it contains.
word(str): label of the node to consider.
level(int): current level.
-
utils.plot.mds_representation(sim_matrix, ground_truth, index_to_label, colors, output_folder, dim=2, cores=20, mode='mds')¶ Computes Euclidean distances (MDS) from the computed similarity matrix.
- Args:
sim_matrix(ndarray): similarity matrix.
ground_truth(dict): ground-truth clustering.
index_to_label(list): maps an entity index to a string label.
colors(dict): mapping from a class to a color.
output_folder(str): path to the output folder.
dim(int, optional): number of dimensions in the metric space. Defaults to 2.
cores(int, optional): number of cores to use (threaded MDS).
mode(str, optional): projection algorithm ('mds' or 'tsne').
-
utils.plot.on_the_fly_cvg(file)¶ Plot some statistics on the on-the-fly convergence criterion.
- Args:
file(str): path to the file containing the measurements of the on-the-fly criterion.
-
utils.plot.pie_chart(clusters, output_folder)¶ Plots a pie-chart representation of a clustering.
- Args:
clusters((cluster -> values) dict): a dict representation of a clustering.
output_folder(str): path to the output folder.
-
utils.plot.plot_Aquaint_graph(words, aqua_gt, level=1)¶ Plot an excerpt graph from the Aquaint ground-truth using the networkx library.
- Args:
words(list): nodes to consider as origins.
aqua_gt(str): path to the Aquaint ground-truth.
level(int): max depth to consider (starting at 0).
utils.probability_fit module¶
probability_fit.py. Functions related to estimating probability distributions.
-
utils.probability_fit.estimate_poisson_binomial(N, p_values)¶ Estimate the values of a Poisson Binomial distribution given its p-parameters.
- Args:
N(int): number of independent Bernoulli experiments.
p_values(list): Bernoulli parameter for each experiment.
- Returns:
values(list): values taken by the distribution (k -> P(X = k)).
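The distribution can be computed exactly with the standard dynamic program over the experiments; a sketch:

    def estimate_poisson_binomial(N, p_values):
        # pmf[k] = P(X = k) after folding in each Bernoulli experiment in turn.
        pmf = [1.0]
        for p in p_values[:N]:
            nxt = [0.0] * (len(pmf) + 1)
            for k, prob in enumerate(pmf):
                nxt[k] += prob * (1.0 - p)   # experiment fails
                nxt[k + 1] += prob * p       # experiment succeeds
            pmf = nxt
        return pmf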
-
utils.probability_fit.sample_from_pdf(N, values)¶ Returns N samples from a given discrete distribution P.
- Args:
N(int): number of samples to draw.
values(list): values taken by the discrete distribution (k -> P(X = k)).
- Returns:
samples(list): N samples drawn from the P distribution.
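Inverse-CDF sampling is sufficient here; a sketch:

    import bisect
    import random

    def sample_from_pdf(N, values):
        # Build the CDF once, then invert it for each uniform draw.
        cdf = []
        total = 0.0
        for v in values:
            total += v
            cdf.append(total)
        return [bisect.bisect_left(cdf, random.random() * total)
                for _ in range(N)]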
utils.read_config module¶
read_config.py. Functions for setting the main parameters. Reads the configuration file in configuration.ini and the command line options.
-
utils.read_config.get_config_option(config, section, name, arg, type='str')¶ Returns the default configuration value if arg (the command line argument) is None, else returns arg.
- Args:
config: current configuration object returned by ConfigParser.
section(str): section of the considered argument.
name(str): name of the considered argument.
arg: value passed through the command line for the considered argument.
type(str, optional): type of the considered argument (str, int or float). Defaults to str.
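A sketch of the fallback logic using the standard ConfigParser getters:

    def get_config_option(config, section, name, arg, type='str'):
        # The command line value wins; otherwise fall back to the config file.
        if arg is not None:
            return arg
        getters = {'str': config.get,
                   'int': config.getint,
                   'float': config.getfloat}
        return getters[type](section, name)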
-
utils.read_config.parse_cmd_line()¶ Parses the command line for options to override the default configuration parameters.
- Returns:
opt(list): specified command line options to override the default config options:
N(int): number of iterations (-N, --iter).
cores(int): number of cores (-t, --threads).
data_type(str): chosen dataset (-d, --dataset).
input(str): chosen data file (-in, --input).
groundtruth(str): chosen ground-truth file (-g, --groundtruth).
n_distrib(str): type of annotation (-di, --distrib).
training_size(float): training percentage, strictly between 0 and 1 (-ts, --trainsize).
nmin(int): minimum number of synthetic labels (-nmin).
nmax(int): maximum number of synthetic labels (-nmax).
cfg(str): path to the default configuration file.
verbose(int): controls verbosity level (0 to 4).
debug(bool): runs in debugging mode.
-
utils.read_config.read_config_file(f, N=None, cores=None, data_type=None, input_file=None, ground_truth_file=None, n_distrib=None, training_size=None, nmin=None, nmax=None, classifier_type=None, similarity_type=None, task_type=None, cvg_step=None, cvg_criterion=None, output_folder=None, temp_folder=None, oar=False)¶ Reads and parses the default arguments in the configuration file.
- Args:
f(str): path to the configuration file.
opt(list): specified command line options (see the result of read_config.parse_cmd_line).
- Returns:
N(int): number of iterations.
cores(int): number of cores to use.
locks(int): number of independent locks to add on the similarity matrix.
data_type(str): name of the dataset chosen for the experiments.
input_file(str): path to the default input file.
ground_truth_file(str): path to the default ground-truth file.
temp_folder(str): path to the folder for temporary files (e.g. the MCL input format file).
output_folder(str): path to the folder for output files.
annotation_params(dict): parameters for the synthetic annotation.
classification_params(dict): parameters for the supervised classification algorithm (classifier type, training percentage, similarity type, additional parameters).
classifier_binary(str): path to the binary for the classifier.
task_params(list): parameters for the post-processing task (task type, algorithm binary, additional parameters).
cvg_step(int): if cvg_step > 2, the convergence criterion is evaluated every cvg_step iterations.
cvg_criterion(float): value of the criterion on the mean entropies used to stop the algorithm.
config(ConfigParser): configuration object updated with the command line values.