utils package

Submodules

utils.annotation module

annotation.py. Generation of synthetic annotations for the data samples.

utils.annotation.annotate(n, temp_folder, classifier_type, (train, test), annotation_params, **kwargs)

Return a synthetic annotation of both the training and the testing set.

Args:
  • n (int): step identifier
  • temp_folder: directory for temporary files
  • classifier_type (str): type of the classifier that will take the annotation as input.
  • train (list): initial train data structure.
  • test (list): initial test data structure.
  • annotation_params (dict): additional annotation parameters.
  • with_common_label_wordform (bool, optional): if True, each occurrence of the same entity wordform receives the same label. Defaults to False.
  • verbose (int, optional): controls verbosity level.
  • debug (bool, optional): enable/disable debug mode.
Returns:
  • N (int): random max number of synthetic labels for this step.
  • n_unique_labels_used (int): number of synthetic labels that were actually used.
  • train_file (str): path to the formatted train data.
  • test_file (str): path to the formatted test data.
  • training_size (int): number of sequences in the training database.
  • testing_size (int): number of sequences in the testing database.
  • n_entities_train (int): number of entities in the training database.
  • entities_indices (list): identifies the entities of interest (tag B) in the testing set; this is used to filter the classifier’s output.
utils.annotation.choose_features(pattern, distrib, output_pattern_file=None)

Given a set of features and their probability of occurrence (pattern), choose features at random for the current training step.

Args:
  • pattern (list): features organized by importance category.
  • distrib (list): probability of sampling a feature for each category.
  • output_pattern_file (str, optional): if given, path to the file where the selected pattern is written. Defaults to None.
Returns:
  • to_keep (list): the selected features.
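
For illustration, a minimal sketch of the selection logic (an assumption about the sampling scheme, not the module's code), where distrib[i] is the probability of keeping each feature of category i:

    import random

    def choose_features_sketch(pattern, distrib):
        # Keep each feature of category i with probability distrib[i].
        to_keep = []
        for features, p in zip(pattern, distrib):
            to_keep.extend(f for f in features if random.random() < p)
        return to_keep

    # Two importance categories with different keep-probabilities.
    print(choose_features_sketch([["U00:%x[-1,0]", "U01:%x[0,0]"], ["U10:%x[1,0]"]],
                                 distrib=[0.9, 0.3]))
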
utils.annotation.random() → x in the interval [0, 1).

utils.classify module

classify.py. For training and applying a classifier on the artificially annotated data set.

utils.classify.do_classify_step(n, temp_folder, train, test, test_entities_indices, coocc_matrix, unit_length, count_lab, classification_params, **kwargs)

Builds the similarity for the current step given a classifier and synthetic annotation.

Args:
  • n (int): step number.
  • temp_folder (str): path to the directory for storing temporary files.
  • train: annotated training set (structure may depend on the classifier).
  • test: formatted testing set.
  • test_entities_indices (list): indices of test entities and their position in the classifier output.
  • coocc_matrix (ndarray): shared similarity matrix.
  • unit_length (int): length of a locked cell in the matrix.
  • count_lab (ndarray): count how many times an entity has been classified as non-null in the test set.
  • classification_params (dict): additional classification parameters.
  • clean (bool, optional): if True, removes the temporary files that were created. Defaults to True.
  • with_unique_occurrences (bool, optional): if True, each entity occurrence is considered unique and can receive a different label. Defaults to True.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
  • debug (bool, optional): runs in debug mode.
  • pretrained (optional): pretrained model.
utils.classify.update_similarity(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)

Optimizes the similarity matrix update in the case where multiple occurrences correspond to the same entity (e.g. the Aquaint2 case).

Args:
  • n (int): step number.
  • result_iter (str): output of the classification algorithm (Wapiti CRF).
  • coocc_matrix (ndarray): similarity matrix of the current thread.
  • unit_cell_length (int): length of one locked cell in the matrix.
  • label_occurrences (ndarray): count of how many times an entity has been classified as non-null in the test set.
  • similarity_type (str): type of the similarity to use.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
Returns:
  • n_test_labels: number of distinct annotated labels in the testing base
  • weights: repartition of the samples in the test classifications (only with weighted similarity).
  • b (float): penalty to be added at the end of construction.
utils.classify.update_similarity_unique(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)

Updates the similarity matrix in the case where each occurrence is a unique named entity (i.e. the default case).

Args:
  • n (int): step number.
  • result_iter (str): generator expression on the output of the classifier.
  • coocc_matrix (ndarray): similarity matrix of the current thread.
  • unit_cell_length (int): length of one locked cell in the matrix.
  • label_occurrences (ndarray): count how many times an entity has been classified as non-null in the test set.
  • similarity_type (str): type of the similarity to use.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
Returns:
  • n_test_labels: number of distinct annotated labels in the testing base.
  • weights: repartition of the samples in the test classifications (only with weighted similarity).
  • b (float): penalty to be added at the end of the construction.
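
The real update works on a lock-protected shared matrix and supports several similarity types; as a rough illustration of the binary case only, a toy dense update could look like this (hypothetical helper, not the module's API):

    import numpy as np
    from itertools import combinations

    def toy_binary_update(coocc_matrix, predictions):
        # predictions: list of (entity_index, predicted_label) pairs.
        by_label = {}
        for idx, label in predictions:
            if label != 'O':  # skip the null label
                by_label.setdefault(label, []).append(idx)
        # Entities that received the same synthetic label in this
        # iteration co-occur: increment their upper-triangle cells.
        for members in by_label.values():
            for i, j in combinations(sorted(members), 2):
                coocc_matrix[i, j] += 1

    m = np.zeros((4, 4))
    toy_binary_update(m, [(0, 'A'), (2, 'A'), (3, 'B'), (1, 'O')])
    # m[0, 2] is now 1.
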

utils.em_analysis module

em_analysis.py. EM estimation of the mixture parameters for each iteration.

utils.em_analysis.em_step(pi0, p0, p1, x, n_obs, N, cores)

A single expectation-maximization step. Multiprocess execution (each process computes its part of the summations for one set of observations).

Args:
  • pi0 (float): current estimate of the pi0 parameter (P(x ~ y)).
  • p0 (float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • p1 (float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
  • x (list): observations (as a dict mapping each unique observation to its number of occurrences).
  • n_obs (int): total number of observations.
  • N (int): dimension of the multivariate Bernoulli.
  • cores (int): number of cores to use.
Returns:
  • nz1 (float): estimate of the z1 hidden variables.
  • npi0 (float): estimate of the pi0 parameter (P(x ~ y)).
  • np0 (float): estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • np1 (float): estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
utils.em_analysis.em_step_threaded(pi0, p0, p1, obs, N, res_queue)

Function for parameter estimation on one thread.

Args:
  • pi0 (float): current estimate of the pi0 parameter (P(x ~ y)).
  • p0 (float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • p1 (float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
  • obs (list): observations given to this thread (as a dict mapping each unique observation to its number of occurrences).
  • N (int): dimension of the multivariate Bernoulli.
  • res_queue (Queue): output queue.
utils.em_analysis.estimate_parameters_em(co_occ, N, p1i=0.9, p0i=0.1, pi0i=0.8, n_iter=20, cores=20)

Expectation-Maximization for a 2-component Bernoulli mixture.

Args:
  • co_occ (list): list of observations.
  • N (int): dimension of an observation (here, corresponds to the number of considered iterations).
  • p1i (float, optional): initial estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)). Defaults to 0.9.
  • p0i (float, optional): initial estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)). Defaults to 0.1.
  • pi0i (float, optional): initial estimate of the pi0 parameter (P(x ~ y)). Defaults to 0.8.
  • n_iter (int, optional): number of iterations. Defaults to 20.
  • cores (int, optional): number of cores to use. Defaults to 20.
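
The module's own routines are multiprocess and operate on full binary observation vectors; as a rough single-process illustration of the underlying model, EM for a two-component binomial mixture over agreement counts can be sketched as follows (all names hypothetical):

    import numpy as np
    from scipy.stats import binom

    def em_binomial_mixture(counts, N, pi0=0.8, p0=0.1, p1=0.9, n_iter=20):
        # counts: dict mapping an agreement count k (out of N) to the
        # number of pairs observed with that count.
        ks = np.arange(N + 1)
        w = np.array([counts.get(k, 0) for k in ks], dtype=float)
        for _ in range(n_iter):
            # E-step: posterior that a pair matches (x ~ y), given k.
            l1 = pi0 * binom.pmf(ks, N, p1)          # matching pairs
            l0 = (1.0 - pi0) * binom.pmf(ks, N, p0)  # non-matching pairs
            z1 = l1 / (l0 + l1)
            # M-step: re-estimate the mixture weight and both parameters.
            pi0 = np.sum(w * z1) / np.sum(w)
            p1 = np.sum(w * z1 * ks) / (N * np.sum(w * z1))
            p0 = np.sum(w * (1.0 - z1) * ks) / (N * np.sum(w * (1.0 - z1)))
        return pi0, p0, p1

    print(em_binomial_mixture({0: 800, 1: 150, 9: 20, 10: 30}, N=10))
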

utils.error module

error.py. Custom exceptions and error handling.

exception utils.error.ConfigError

Bases: exceptions.Exception

Exception raised when a compatibility error is found in the configuration options.

exception utils.error.InputError

Bases: exceptions.Exception

Exception raised when an error is found in the input data.

exception utils.error.ParsingError

Bases: exceptions.Exception

Exception raised when a parsing error is found in the configuration options.

utils.error.signal_handler(signal, frame)

Handles the keyboard interrupt signal in case of multiprocess execution.

utils.error.warning(obj)

Print a warning on the error stream.

Args:
  • obj (str): warning message
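
A function of this kind is likely along the following lines (a sketch; the exact message format is an assumption):

    import sys

    def warning(obj):
        # Write to stderr so warnings do not pollute redirected stdout.
        sys.stderr.write('Warning: %s\n' % obj)
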

utils.eval module

utils.eval.APP(predicted, labels)
utils.eval.F(predicted, labels)
utils.eval.P(predicted, labels)
utils.eval.R(predicted, labels)
utils.eval.Read_file(file_cl)
utils.eval.V(predicted, labels)
utils.eval.combination(a, k)
utils.eval.grp2idx(labels)
utils.eval.mi(predicted, labels)
utils.eval.mutual_info(x, y)
utils.eval.nmi(x, y)
utils.eval.rand(predicted, labels)

utils.matrix_op module

matrix_op.py. Functions operating on the full similarity matrix (normalization and distribution analysis).

utils.matrix_op.ROC_analysis(line, name, output_folder, ground_truth)

Plots a ROC curve for one given sample (one line of the matrix).

Args:
  • line (ndarray): similarities for one sample.
  • name (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • ground_truth (list): indices of the samples belonging to the same class as the current sample.
utils.matrix_op.ROC_mean_analysis(lines, key, output_folder, gt)

Plots all ROC curves and their horizontal/vertical means for several samples of the same class (lines of the matrix).

Args:
  • lines (ndarray): similarities for the samples of the considered class.
  • key (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • gt (list): indices of the samples belonging to the currently considered class.
utils.matrix_op.distribution_analysis(line, name, output_folder, temp_folder, ground_truth, kbest=[2000, 1000, 500, 200], mode='matlab')

Plots the similarities histogram and densities (+ ground-truth display) for a line (sample) of the similarity matrix, at various scales.

Args:
  • line (ndarray): sorted similarities in increasing order for one sample.
  • name (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • temp_folder (str): path to directory to output temporary plots (before concatenation).
  • ground_truth: indices of samples belonging to the same class as current sample.
  • kbest (list, optional): indices for zoom ins (keep the k best values, for all k in kbest).
  • mode (str): if ‘matlab’, then plots the distribution on the whole interval using matplotlib. If ‘R’, plots the distribution for all zoom-values in kbest using R (requires ggplot2 library).
utils.matrix_op.keep_k_best(co_occ, k=200)

Keep the k best values in the matrix and set the rest to 0. Relies on the bottleneck library for fast sort.

Args:

  • co_occ (ndarray): input matrix.
  • k (int, optional): number of values to keep. Defaults to 200.

Returns:

  • normalized (ndarray): the matrix where only the k best values are kept (all others set to 0).
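
A NumPy-only equivalent for reference (the module relies on bottleneck for speed; ties around the threshold may keep slightly more than k values):

    import numpy as np

    def keep_k_best_sketch(co_occ, k=200):
        # k-th largest value over the whole matrix (requires k <= size).
        threshold = np.partition(co_occ.ravel(), -k)[-k]
        return np.where(co_occ >= threshold, co_occ, 0)
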
utils.matrix_op.normalize(co_occ)

Returns a normalized version of the input matrix.

Args:

  • co_occ (ndarray): co-occurrence matrix.

Returns:

  • normalized (ndarray): a normalized version of the co-occurrence matrix.
utils.matrix_op.normalize_gauss_global(co_occ)

Normalize the full matrix with respect to its global standard deviation and mean (X <- (X - mean) / std).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
utils.matrix_op.normalize_gauss_local(co_occ)

Normalize the full matrix line by line, with respect to each line's standard deviation and mean (X <- (X - mean) / std).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
utils.matrix_op.normalize_min_max(co_occ)

Normalize the full matrix globally with respect to its minimum and maximum value (X <- (X - min) / (max - min)).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
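
The three normalizations above (global Gaussian, line-wise Gaussian, min-max) translate directly to NumPy; a minimal sketch, not the module's exact code:

    import numpy as np

    def gauss_global(co_occ):
        return (co_occ - co_occ.mean()) / co_occ.std()

    def gauss_local(co_occ):
        # Standardize each line with its own mean and standard deviation.
        return ((co_occ - co_occ.mean(axis=1, keepdims=True))
                / co_occ.std(axis=1, keepdims=True))

    def min_max(co_occ):
        return (co_occ - co_occ.min()) / (co_occ.max() - co_occ.min())
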
utils.matrix_op.statistical_analysis_binary(line, rnd_distrib, ground_truth, temp_folder, output_folder, name, suffix=None)

For binary SIC: plots the similarity distribution and its given theoretical model (Poisson Binomial).

Args:
  • line (list): similarities for the considered sample x.
  • rnd_distrib (list): values of the discrete random case distribution (k -> P(X = k))
  • ground_truth (list): indices of the samples in the same class as x.
  • temp_folder (str): path to the folder containing the temporary files.
  • output_folder (str): path to the output folder.
  • name (str): string representation of the considered sample.
  • suffix (str, optional): additional suffix for the output file. Defaults to None (no suffix).
utils.matrix_op.statistical_analysis_weighted(line, N, ground_truth, temp_folder, output_folder, name, step=1.0)

Plots the similarity distribution and gaussian theoretical distribution in the case of a weighted similarity (negative samples).

Args:
  • line (list): similarities for the considered sample x.
  • N (int): standard deviation of the gaussian.
  • ground_truth (list): indices of the samples in the same class as x.
  • temp_folder (str): path to the folder containing the temporary files.
  • output_folder (str): path to the output folder.
  • name (str): string representation of the considered sample.
  • step (float, optional): step between x-axis’ ticks.

utils.one_step module

one_step.py. One classification step for building the similarity. Designed for threaded execution.

utils.one_step.split_data(n, data, data_type, train_frac, classifier_type, annotation_params, temp_folder, with_full_test=False)

Splits the given database into a training and testing set.

Args:
  • n (int): iteration identifier.
  • data (list): initial data structure.
  • data_type (str): data set used for the experiments.
  • train_frac (float): proportion of the database to keep for training.
  • classifier_type (str): type of classifier (for on-the-fly parsing format in AQUAINT).
  • annotation_params (dict): annotation parameters (for OVA annotation).
  • temp_folder (str): path to temporary folder.
  • with_full_test (bool, optional): if True, use the whole dataset (including training) for testing.
Returns:
  • train (list): the data kept for training (generator).
  • test (list): the data kept for testing (generator).
  • test_indices (list): indices of the test samples in the whole data; used to compute the number of test occurrences afterwards. (Except for AQUAINT, where test_indices directly contains the occurrences of each sample in the test set.)
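
Stripped of the data-set-specific handling, the core splitting logic might be sketched like this (a hypothetical simplification):

    import random

    def split_data_sketch(data, train_frac):
        # Shuffle indices and split according to the training fraction.
        indices = list(range(len(data)))
        random.shuffle(indices)
        cut = int(train_frac * len(data))
        test_indices = sorted(indices[cut:])
        train = [data[i] for i in indices[:cut]]
        test = [data[i] for i in test_indices]
        return train, test, test_indices
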
utils.one_step.thread_step(n, coocc_matrix, unit_cell_length, n_samples, sim_queue, data, temp_folder, data_type, annotation_params, classification_params, verbose=1, debug=False, with_unique_occurrences=False, preclustering=[])

One classification iteration. Results are output in a queue.

Args:
  • n (int): index of current iteration.
  • coocc_matrix (array): similarity matrix (shared in memory).
  • unit_cell_length (int): length of one locked cell in the matrix.
  • n_samples (int): number of samples in the data set.
  • sim_queue (Queue): output queue.
  • data (list): initial data structure.
  • temp_folder (str): path to temporary folder.
  • data_type (str): data set used.
  • annotation_params (dict): parameters for the synthetic annotation.
  • classification_params (dict): parameters for the supervised classification algorithm.
  • verbose (int, optional): controls the verbosity level. Defaults to 1.
  • debug (bool, optional): runs in debugging mode. Defaults to False.
  • with_unique_occurrences (bool, optional): True when occurrences of a same entity are distinct items in the database (e.g. NER). Defaults to False.
  • preclustering (list, optional): entity index to class mapping. This clustering is used to give the same annotation to entities in the same class.

utils.opt module

opt.py. Functions designed for vectorized updates of the similarity matrix, instead of cell-by-cell ones.

utils.opt.cartesian(arrays, out=None, numpy=True)

Generate a cartesian product of input arrays.

Parameters:
  • arrays : list of array-like. 1-D arrays to form the cartesian product of.
  • out : ndarray. Array to place the cartesian product in.
Returns:
  • out : ndarray. 2-D array of shape (M, len(arrays)) containing cartesian products formed of input arrays.
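An equivalent construction with np.meshgrid, shown only to illustrate the expected output (the module's implementation differs):

    import numpy as np

    def cartesian_sketch(arrays):
        grids = np.meshgrid(*arrays, indexing='ij')
        return np.stack([g.ravel() for g in grids], axis=-1)

    print(cartesian_sketch([np.array([1, 2]), np.array([10, 20, 30])]))
    # [[ 1 10] [ 1 20] [ 1 30] [ 2 10] [ 2 20] [ 2 30]]
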
utils.opt.cartesian_prod(arrays, full=None, out=None, numpy=True)

Generates the products for all possible cartesian combinations of the arrays' components.

Args:
  • arrays: list of array-like.
  • full (int): length of the base array. Starting value should be None.
  • out (ndarray): array where to put the result.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out (ndarray): array containing the products for all possible cartesian combinations.
utils.opt.pairs_combination(a, numpy=True)

Returns all the possible pair combinations from an index array.

Args:
  • a: array-like.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing all the possible pair combinations from an index array.
utils.opt.pairs_combination_indices(a, n_samples, numpy=True)

Returns all the possible pair combinations from an index array, in index form (upper-triangle matrix).

Args:
  • a: array-like.
  • n_samples: size of the 2D matrix (i.e. the maximum possible index value + 1).
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing the indices of all possible pairs combinations.
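
A sketch of one common indexing convention (the usual scipy.spatial.distance.squareform layout; the module may use a different one):

    import numpy as np
    from itertools import combinations

    def pairs_indices_sketch(a, n_samples):
        # Condensed index of pair (i, j), i < j, in the upper triangle:
        # k = i * n - i * (i + 1) / 2 + (j - i - 1)
        pairs = np.array(list(combinations(sorted(a), 2)))
        i, j = pairs[:, 0], pairs[:, 1]
        return i * n_samples - i * (i + 1) // 2 + (j - i - 1)

    print(pairs_indices_sketch([0, 2, 3], n_samples=4))  # [1 2 5]
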
utils.opt.product_combination(a, numpy=True)

Returns all the possible pair product combinations from an index array.

Args:
  • a: array-like.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing all the possible pair product combinations from an index array.

utils.output_format module

output_format.py. Functions related to the formatting of the program outputs.

class utils.output_format.Tee(*files)

Bases: object

Tee object used for writing to several files at once (used for linking stdout to a log file).

flush()
write(obj)
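
A Tee of this kind is conventionally implemented as below (a sketch consistent with the interface above, not necessarily the module's code):

    import sys

    class Tee(object):
        def __init__(self, *files):
            self.files = files

        def write(self, obj):
            # Forward every write to all underlying files.
            for f in self.files:
                f.write(obj)

        def flush(self):
            for f in self.files:
                f.flush()

    # Typical use: mirror stdout into a log file.
    # sys.stdout = Tee(sys.stdout, open('run.log', 'w'))
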
utils.output_format.clustering_to_file(output_folder, clustering, suffix=None)

Writes a clustering in a file.

Args:
  • output_folder (str): Path to the output folder.
  • clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
  • suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
Returns:
  • output_path (str): Path to the file in which the clustering was written.
utils.output_format.clustering_to_string(clustering)

Returns a clustering as a string, with entities separated by tabulations and clusters separated by newlines ('\n').

Args:
  • clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
Returns:
  • clustering_to_string (str): string representation of the clustering.
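
With the format described above, the function amounts to something like this hypothetical one-liner:

    def clustering_to_string_sketch(clustering):
        # One line per cluster, members separated by tabs.
        return '\n'.join('\t'.join(str(e) for e in members)
                         for members in clustering.values())

    print(clustering_to_string_sketch({0: ['alice', 'bob'], 1: ['carol']}))
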
utils.output_format.init_folder(path)

Create directory in path if not already existing.

Args:
  • path (str): path of the directory to initialize.
utils.output_format.load_cooc(mat)

Load and format, if needed, the given similarity matrix.

Args:
  • mat (str): path to the similarity matrix (txt, npy or pickle)
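
A plausible dispatch on the file extension (an illustration, not necessarily the module's logic):

    import pickle
    import numpy as np

    def load_cooc_sketch(mat):
        # Dispatch on the file extension (txt, npy or pickle).
        if mat.endswith('.npy'):
            return np.load(mat)
        if mat.endswith('.txt'):
            return np.loadtxt(mat)
        with open(mat, 'rb') as f:
            return pickle.load(f)
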
utils.output_format.log(output_folder)

Creates a file for redirecting the output.

Args:
  • output_folder (str): path to root of directory for outputs.
utils.output_format.readable_clustering(output_folder, clustering, index_to_label, suffix=None)

Outputs a clustering in a more readable format.

Args:
  • output_folder (str): Path to the output folder.
  • clustering (dict): clustering represented as a dictionary mapping a cluster to the list of its elements.
  • index_to_label (list): mapping from an entity index to a string label.
  • suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
Returns:
  • output_file (str): path to the file now containing the clustering under readable format.
utils.output_format.save_coocc(output_folder, coocc, suffix=None, type='binary')

Saves the current co-occurrence matrix.

Args:
  • output_folder (str): path to root of output folder.
  • coocc (ndarray): co-occurrence matrix.
  • suffix (str, optional): if given, added as a suffix to the name of the output file. Defaults to None.
  • type (str): output file format (text, binary or pickle).
Returns:
  • output_file (str): path to the file now containing the matrix.
utils.output_format.save_coocc_mcl(output_folder, coocc, index_to_label)

Saves the current co-occurrence matrix (only non-zero entries) in the label format of MCL.

Args:
  • output_folder (str): path to root of output folder.
  • coocc (ndarray): co-occurrence matrix.
  • index_to_label (list): mapping from an entity index to a readable label.
Returns:
  • output_file (str): path to the file now containing the matrix.

utils.parse module

parse.py. Functions for parsing data files and ground-truth file for a given data set. The module can also be used as a script to generate .qrel version of the ground-truth for the information retrieval evaluation procedure.

utils.parse.ground_truth_AQUA_qrel(ground_truth_file, output_file, aqua_entities_file)

Builds the qrel file for the AQUAINT ground-truth to be used for the nearest neighbour evaluation.

Args:
  • ground_truth_file (str): path to the AQUAINT ground-truth file.
  • output_file (str): path to the qrel output file.
  • aqua_entities_file (str): path to the file listing all AQUAINT entities.
utils.parse.ground_truth_indices(data_type, ground_truth_file, label_to_index=None)

Returns indices of the entities that are in the ground truth. Only used for AQUAINT since, for NER, the samples coincide with the ground truth.

Args:
  • data_type (str): dataset identifier
  • ground_truth_file (str): path to the ground_truth_file
  • label_to_index (list): maps a label to the corresponding entity index. Required for Aquaint.
Returns:
  • indices (list): indices of samples in ground-truth.
utils.parse.ground_truth_pairs(data_type, ground_truth_file, n_samples)

Retrieves indices of sample pairs that have the same ground-truth class. Indices are computed as if using the upper triangle of a square n_samples x n_samples matrix.

Args:
  • data_type (str): data set.
  • ground_truth_file (str): ground-truth file for the given data set.
  • n_samples (str): number of samples for the given data set.
Returns:
  • indices (int list): list of indices of sample pairs with the same ground-truth class.
utils.parse.ground_truth_qrel(ground_truth_file, output_file, index_to_label)

Builds the qrel file for NER and AUDIO ground-truth to be used for the nearest neighbour evaluation.

Args:
  • ground_truth_file (str): path to the AQUAINT ground-truth file.
  • output_file (str): path to the qrel output file.
  • index_to_label (list): maps an entity index to the corresponding string label.
utils.parse.parse_AQUA_entities(entities_file)

Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.

Args:
  • entities_file (str): file containing the retrieved entities and their number of occurrences.
Returns:
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse.parse_AQUA_single_partial(classifier_type, data_file, label_to_index, training_size, testing_size, train_acc, test_acc, test_occurences, train_included_test=False)

Reads and partially parses the given Aquaint2 data file. Contrary to parse_AQUA_single, this function does not load the whole data but directly builds the required training and testing sets.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_file (str): path to the data file.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
  • training_size (int): number of docs for training from this document, or a list of docs indices + sentences to keep.
  • testing_size (int): number of docs for testing from this document, or a list of docs indices + sentences to keep.
  • train_acc (iterator): accumulator for training sentences.
  • test_acc (iterator): accumulator for testing sentences.
  • test_occurences (array): accumulator for the number of occurrences of each word in the test database.
  • train_included_test (bool, optional): If True, retrieved training sentences will also be included in the testing set.
Returns:
  • train_acc (iterator): updated training sentences accumulator.
  • test_acc (iterator): updated testing sentences accumulator.
utils.parse.parse_NER(classifier_type, data_file)

Reads and parses the given data file for the Named Entity Recognition (NER) task.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiments.
  • data_file (str): path to data file.
Returns:
  • n_samples (int): number of samples in the database.
  • data (list): structure containing the data. A word is usually represented by a tuple (index of the sample if it is an entity of interest, word representative of the sample, additional tags, B(I)O tag).
  • data_occurrences (list): number of occurrences of each word in each sentence.
  • index_to_label (list): list associating a sample’s index with its string representation.
  • summary (str): additional information on the dataset.
utils.parse.parse_audio(data_folder, selected_features)

Parses the feature selection for the Audio task.

Args:
  • data_folder (str): If the features are precomputed, then data_folder is the path to the directory containing the features. Otherwise, it is a file containing the samples folder as its first line and all the possible HTK features markers (one per line) to consider.
  • selected_features (list): list of the feature types to use in the experiments (given in the configuration file).
Returns:
  • n_samples (int): number of samples in the dataset.
  • data (list): maps a feature identifier to the corresponding HTK generated features.
  • index_to_label (list): maps an entity index to a string label.
utils.parse.parse_data(data_type, classification_params, data_file)

Parse data file(s) depending on the chosen options.

Args:
  • data_type (str): dataset identifier.
  • classification_params (dict): classifier parameters.
  • data_file (str): path to data file.
Returns:
  • n_samples (int): number of samples in the database.
  • data (list): structure containing the data.
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data (only for AQUAINT and NER).
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse.parse_ground_truth(data_type, ground_truth_file, label_to_index=None)

Reads and parses the given ground-truth file.

Args:
  • data_type (str): dataset identifier.
  • ground_truth_file (str): path to the ground truth file.
  • label_to_index (dict, optional): maps an AQUAINT word to its index. Required when outputting the AQUAINT groundtruth with the entity indices rather than string representation.
Returns:
  • ground_truth (list): list associating a sample with its ground-truth cluster:
    • for NER, ground_truth: cluster (str) -> entity indices (int list)
    • for AQUA, ground_truth: entity (str) -> entity indices (int list) if label_to_index, else str list
utils.parse.parse_pattern(classifier_type, pattern_file)

Reads and parses the given pattern file (Wapiti/CRF++ expected format).

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • pattern_file (str): path to the pattern file.
Returns:
  • features (list): features organized by category.
  • distrib (list): probability of sampling a feature for each category.
utils.parse.split_on_ground_truth_no_indices(data_type, ground_truth_file, numb=6, keys=None)

Given the ground-truth data file, returns a random set of entities for each class (usually used to visualize similarity distributions).

Args:
  • data_type (str): dataset identifier.
  • ground_truth_file (str): path to the ground truth file.
  • numb (int, optional): number of entities to select for each class. Defaults to 6.
  • keys (list, optional): if given, the algorithm returns a list of samples whose ground-truth classes form the keys list.
Returns:
  • ground_truth (list): list associating a sample with its ground-truth cluster.
  • selected_entities (list): selected entities to be plotted.

utils.parse_stat module

parse_stat.py. Additional functions for parsing statistics and precomputing information on the data sets (mostly Aquaint).

utils.parse_stat.count_aqua_docs(directory, output_folder, aqua_entities_file)

Counts the number of occurrences of each word in each document, as well as in the whole data set.

Args:
  • directory (str): directory containing all xml documents of the dataset.
  • output_folder (str): path to output directory.
  • aqua_entities_file (str): path to the file listing all AQUAINT entities.
utils.parse_stat.count_aqua_docs_score(directory, output_file, aqua_entities_file)

Computes a score for each document, based on how rare the words occurring in the document are.

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the file to write the output scores.
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.parse_AQUA(classifier_type, data_folder, label_to_index)

Read and parse all data files from the AQUAINT folder.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_folder (str): path to folder containing all data files.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
Returns:
  • data (list): structure containing the data (file -> docs -> sentence -> words).
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
  • summary (str): additional information on the dataset.
utils.parse_stat.parse_AQUA_entities(entities_file)

Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.

Args:
  • entities_file (str): file containing the retrieved entities and their number of occurrences.
Returns:
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse_stat.parse_AQUA_single(classifier_type, data_file, label_to_index)

Read and parse the full given AQUAINT data file.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_file (str): path to the data file.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
Returns:
  • data (list): structure containing the data (file -> docs -> sentence -> words).
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
utils.parse_stat.retrieve_aqua_entities(directory, output_file)

Retrieves all interesting entities for the AQUAINT2 dataset (common names with strictly more than 10 occurrences).

Args:
  • directory (str): directory containing all xml documents of the dataset.
  • output_file (str): path where to output the retrieved entities and their number of occurrences.
utils.parse_stat.retrieve_aqua_occurrences(directory, output_file, aqua_entities_file)

Retrieves all occurrences of each word in the dataset (the position of each occurrence is given as a tuple file -> doc -> sentence).

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the directory to write the output files (1 file = 1 word).
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.retrieve_aqua_occurrences_sentences(directory, output_file, aqua_entities_file)

Same as parse_stat.retrieve_aqua_occurrences, but outputs the sentences of the occurrences rather than their positions.

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the file to write the output scores.
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.stat_aqua(directory)

Computes some statistics about the AQUAINT dataset.

Args:
  • directory (str): directory containing all xml documents of the dataset.

utils.plot module

plot.py. Functions related to plotting and data visualization.

utils.plot.compare_params(true_params, em_params, ip0, output_folder)

Plots several visual comparisons of the parameters estimation.

Args:
  • true_params (list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
  • em_params (dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
  • ip0 (list): p0 parameter estimates under the independence assumption.
  • output_folder (str): path to the output folder.
utils.plot.convergence_curve(output_folder, log_file)

Plots the evolution of the correlation coefficients for the convergence experiments.

Args:
  • output_folder (str): path to the directory for the outputs.
  • log_file (str): path to the log file output during the experiments (convergence_analysis.py script)
utils.plot.expected_binary(true_params, em_params, tpi0, epi0, output_folder)

Plot the 2-components Poisson Binomial mixture model given its parameters.

Args:
  • true_params (list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
  • em_params (dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
  • tpi0 (float): ground-truth pi0 estimate.
  • epi0 (float): EM pi0 estimate.
  • output_folder (str): path to the output folder.
utils.plot.fraction_plot(clustering, ground_truth, output_name)

Plots a histogram representation of a clustering, with colors representing the ground-truth clustering.

Args:
  • clustering ((cluster -> values) dict): a dict representation of a clustering.
  • ground_truth ((cluster -> values) dict): a dict representation of the ground-truth clustering.
  • output_name (str): prefix of the file in which to output the figure.
utils.plot.get_Aquaint_graph(start, synonyms, nodes, edges, level, maxlevel)

Returns an excerpt networkx graph from the Aquaint ground-truth. Recursive function.

Args:
  • start (list): list of nodes to build edges from.
  • synonyms (dict): Aquaint ground-truth neighbour relations.
  • nodes (dict): list of nodes already built, organized by depth (minimum depth relative to one of the starting nodes).
  • edges (list): list of edges already built.
  • level (int): current depth (starting at 0).
  • maxlevel (int): max depth to consider.
utils.plot.heatmap(similarity_matrix, ground_truth, output_folder)

Plots a heat map of the similarity matrix.

Args:
  • similarity_matrix (ndarray): similarity matrix.
  • ground_truth (dict): ground-truth clustering.
  • output_folder (str): path to the output folder.
utils.plot.histo_cluster(clustering, output_name)

Plots a histogram representation of a clustering (number of samples per cluster).

Args:
  • clustering ((cluster -> values) dict): a dict representation of a clustering.
  • output_name (str): prefix of the file in which to output the figure.
utils.plot.is_in_upper_level(nodes, word, level)

Determines whether a node has already been seen as a closer neighbour (level).

Args:
  • nodes (dict): dict mapping a level to the nodes it contains.
  • word (str): label of the node to consider.
  • level (int): current level.
utils.plot.mds_representation(sim_matrix, ground_truth, index_to_label, colors, output_folder, dim=2, cores=20, mode='mds')

Computes Euclidean distances (MDS) from the computed similarity matrix.

Args:
  • sim_matrix (ndarray): similarity matrix.
  • ground_truth (dict): ground-truth clustering.
  • index_to_label (list): mapping from an entity index to a string label.
  • colors (dict): mapping from a class to a color.
  • output_folder (str): path to the output folder.
  • dim (int, optional): number of dimensions in the metric space. Defaults to 2.
  • cores (int, optional): number of cores to use (threaded MDS).
  • mode (str, optional): Projection algorithm ('mds' or 'tsne').
utils.plot.on_the_fly_cvg(file)

Plot some statistics on the on-the-fly convergence criterion.

Args:
  • file (str): path to the file containing the measurements of the on-the-fly criterion
utils.plot.pie_chart(clusters, output_folder)

Plots a pie-chart representation of a clustering.

Args:
  • clusters ((cluster -> values) dict): a dict representation of a clustering.
  • output_folder (str): path to output folder.
utils.plot.plot_Aquaint_graph(words, aqua_gt, level=1)

Plot an excerpt graph from the Aquaint ground-truth using the networkx library.

Args:
  • words (list): nodes to consider as origin.
  • aqua_gt (str): path to the Aquaint ground-truth.
  • level (int): max depth to consider (starting at 0).

utils.probability_fit module

probability_fit.py. Functions related to estimating probability distributions.

utils.probability_fit.estimate_poisson_binomial(N, p_values)

Estimate the values of a Poisson Binomial distribution given its p-parameters.

Args:
  • N (int): number of independent Bernoulli experiments.
  • p_values (list): Bernoulli parameter for each experiment.
Returns:
  • values (list): values taken by the distribution (k -> P(X = k))
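
One standard way to compute these values is the convolution recurrence below (a sketch, not necessarily the module's method):

    import numpy as np

    def poisson_binomial_pmf(p_values):
        # pmf[k] = P(X = k), where X is the number of successes among
        # independent Bernoulli trials with probabilities p_values.
        pmf = np.zeros(len(p_values) + 1)
        pmf[0] = 1.0
        for p in p_values:
            # Convolve the current pmf with one Bernoulli(p) trial.
            pmf[1:] = pmf[1:] * (1.0 - p) + pmf[:-1] * p
            pmf[0] *= (1.0 - p)
        return pmf

    print(poisson_binomial_pmf([0.5, 0.5]))  # [0.25 0.5 0.25]
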
utils.probability_fit.sample_from_pdf(N, values)

Returns N samples from a given discrete distribution P.

Args:
  • N (int): number of samples to draw.
  • values (list): values taken by the discrete distribution (k -> P(X = k))
Returns:
  • samples (list): N samples drawn from the P distribution
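
With NumPy, such sampling reduces to a one-liner (a sketch; values must sum to 1):

    import numpy as np

    def sample_from_pdf_sketch(N, values):
        # Draw N samples from the discrete distribution k -> values[k].
        return np.random.choice(len(values), size=N, p=values)
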

utils.read_config module

read_config.py. Functions for setting the main parameters. Reads the configuration file in configuration.ini and the command line options.

utils.read_config.get_config_option(config, section, name, arg, type='str')

Returns the default configuration if arg (command line argument) is None, else returns arg.

Args:
  • config: current configuration object returned by ConfigParser.
  • section (str): section of the considered argument.
  • name (str): name of the considered argument.
  • arg: value passed through command line for the considered argument.
  • type (str, optional): type of the considered argument (str, int or float). Defaults to str.
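
The described precedence (command line over configuration file) amounts to something like this sketch, using the standard ConfigParser getters:

    def get_config_option_sketch(config, section, name, arg, type='str'):
        # A value given on the command line overrides the config file.
        if arg is not None:
            return arg
        getters = {'str': config.get, 'int': config.getint,
                   'float': config.getfloat}
        return getters[type](section, name)
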
utils.read_config.parse_cmd_line()

Parses the command line for options to override the default configuration parameters.

Returns:
  • opt (list): specified command line options to override the default config options:
    • N (int): number of iterations (-N, --iter).
    • cores (int): number of cores (-t, --threads).
    • data_type (str): chosen dataset (-d, --dataset).
    • input (str): chosen data file (-in, --input).
    • groundtruth (str): chosen ground-truth file (-g, --groundtruth).
    • n_distrib (str): type of annotation (-di, --distrib).
    • training_size (float): training percentage, strictly between 0 and 1 (-ts, --trainsize).
    • nmin (int): minimum number of synthetic labels (-nmin).
    • nmax (int): maximum number of synthetic labels (-nmax).
  • cfg (str): path to default configuration file.
  • verbose (int): controls verbosity level (0 to 4).
  • debug (bool): runs in debugging mode.
utils.read_config.read_config_file(f, N=None, cores=None, data_type=None, input_file=None, ground_truth_file=None, n_distrib=None, training_size=None, nmin=None, nmax=None, classifier_type=None, similarity_type=None, task_type=None, cvg_step=None, cvg_criterion=None, output_folder=None, temp_folder=None, oar=False)

Reads and parses the default arguments in the configuration file.

Args:
  • f (str): path to the configuration file.
  • opt (list): specified command line options (see result of read_config.parse_cmd_line).
Returns:
  • N (int): number of iterations.
  • cores (int): number of cores to use.
  • locks (int): number of independent locks to add on the similarity matrix.
  • data_type (str): name of the dataset chosen for the experiments.
  • input_file (str): path to default input file.
  • ground_truth_file (str): path to the default ground-truth file.
  • temp_folder (str): path to the folder for temporary files (e.g. MCL input format file).
  • output_folder (str): path to the folder for output files.
  • annotation_params (dict): parameters for the synthetic annotation.
  • classification_params (dict): parameters for the supervised classification algorithm (classifier type, training percentage, similarity type, additional parameters).
  • classifier_binary (str): path to the binary for the classifier.
  • task_params (list): parameters for the post-processing task (task type, algorithm binary, additional parameters).
  • cvg_step (int): if greater than 2, the convergence criterion will be evaluated every cvg_step iterations.
  • cvg_criterion (float): value of the criterion on the mean entropies to stop the algorithm.
  • config (ConfigParser): configuration object updated with the values in command line.

Module contents