utils package

Submodules

utils.annotation module

annotation.py. Generation of synthetic annotations for the data samples.

utils.annotation.annotate(n, temp_folder, classifier_type, (train, test), annotation_params, **kwargs)

Return a synthetic annotation of both the training and the testing set.

Args:
  • n (int): step identifier
  • temp_folder: directory for temporary files
  • classifier_type (str): type of the classifier that will take the annotation as input.
  • train (list): initial train data structure.
  • test (list): initial test data structure.
  • annotation_params (dict): additional annotation parameters.
  • with_common_label_wordform (bool, optional): if True, each occurrence of the same entity wordform receives the same label. Defaults to False.
  • verbose (int, optional): controls verbosity level.
  • debug (bool, optional): enable/disable debug mode.
Returns:
  • N (int): random max number of synthetic labels for this step.
  • n_unique_labels_used (int): number of synthetic labels that were actually used.
  • train_file (str): path to the formatted train data.
  • test_file (str): path to the formatted test data.
  • training_size (int): number of sequences in the training database.
  • testing_size (int): number of sequences in the testing database.
  • n_entities_train (int): number of entities in the training database.
  • entities_indices (list): identifies the entities of interest (tag B) in the testing set; this is used to filter the classifier’s output.
utils.annotation.choose_features(pattern, distrib, output_pattern_file=None)

Given a set of features and their probability of occurrence (pattern), choose features at random for the current training step.

Args:
  • pattern (list): features organized by importance category.
  • distrib (list): probability of sampling a feature for each category.
  • output_pattern_file (str, optional): if given, path to the file where the selected pattern is written. Defaults to None.
Returns:
  • to_keep (list): the selected features.
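
For illustration, a minimal sketch of the selection logic (an assumption about the sampling scheme, not the module's code), where distrib[i] is the probability of keeping each feature of category i:

    import random

    def choose_features_sketch(pattern, distrib):
        # Keep each feature of category i with probability distrib[i].
        to_keep = []
        for features, p in zip(pattern, distrib):
            to_keep.extend(f for f in features if random.random() < p)
        return to_keep

    # Two importance categories with different keep-probabilities.
    print(choose_features_sketch([["U00:%x[-1,0]", "U01:%x[0,0]"], ["U10:%x[1,0]"]],
                                 distrib=[0.9, 0.3]))
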
utils.annotation.random() → x in the interval [0, 1).

utils.classify module

classify.py. For training and applying a classifier on the artificially annotated data set.

utils.classify.do_classify_step(n, temp_folder, train, test, test_entities_indices, coocc_matrix, unit_length, count_lab, classification_params, **kwargs)

Builds the similarity for the current step given a classifier and synthetic annotation.

Args:
  • n (int): step number.
  • temp_folder (str): path to the directory for storing temporary files.
  • train: annotated training set (structure may depend on the classifier).
  • test: formatted testing set.
  • test_entities_indices (list): indices of test entities and their position in the classifier output.
  • coocc_matrix (ndarray): shared similarity matrix.
  • unit_length (int): length of a locked cell in the matrix.
  • count_lab (ndarray): count how many times an entity has been classified as non-null in the test set.
  • classification_params (dict): additional classification parameters.
  • clean (bool, optional): if True, removes the temporary files that were created. Defaults to True.
  • with_unique_occurrences (bool, optional): if True, each entity occurrence is considered unique and can receive a different label. Defaults to True.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
  • debug (bool, optional): runs in debug mode.
  • pretrained (optional): pretrained model.
utils.classify.update_similarity(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)

Optimizes the similarity matrix update in the case where multiple occurrences correspond to the same entity (e.g. the Aquaint2 case).

Args:
  • n (int): step number.
  • result_iter (str): output of the classification algorithm (Wapiti CRF).
  • coocc_matrix (ndarray): similarity matrix of the current thread.
  • unit_cell_length (int): length of one locked cell in the matrix.
  • label_occurrences (ndarray): count of how many times an entity has been classified as non-null in the test set.
  • similarity_type (str): type of the similarity to use.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
Returns:
  • n_test_labels: number of distinct annotated labels in the testing base
  • weights: repartition of the samples in the test classifications (only with weighted similarity).
  • b (float): penalty to be added at the end of construction.
utils.classify.update_similarity_unique(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)

Updates the similarity matrix in the case where each occurrence is a unique named entity (i.e. the default case).

Args:
  • n (int): step number.
  • result_iter (str): generator expression on the output of the classifier.
  • coocc_matrix (ndarray): similarity matrix of the current thread.
  • unit_cell_length (int): length of one locked cell in the matrix.
  • label_occurrences (ndarray): count how many times an entity has been classified as non-null in the test set.
  • similarity_type (str): type of the similarity to use.
  • verbose (int, optional): controls verbosity level. Defaults to 1.
Returns:
  • n_test_labels: number of distinct annotated labels in the testing base.
  • weights: repartition of the samples in the test classifications (only with weighted similarity).
  • b (float): penalty to be added at the end of the construction.
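
The real update works on a lock-protected shared matrix and supports several similarity types; as a rough illustration of the binary case only, a toy dense update could look like this (hypothetical helper, not the module's API):

    import numpy as np
    from itertools import combinations

    def toy_binary_update(coocc_matrix, predictions):
        # predictions: list of (entity_index, predicted_label) pairs.
        by_label = {}
        for idx, label in predictions:
            if label != 'O':  # skip the null label
                by_label.setdefault(label, []).append(idx)
        # Entities that received the same synthetic label in this
        # iteration co-occur: increment their upper-triangle cells.
        for members in by_label.values():
            for i, j in combinations(sorted(members), 2):
                coocc_matrix[i, j] += 1

    m = np.zeros((4, 4))
    toy_binary_update(m, [(0, 'A'), (2, 'A'), (3, 'B'), (1, 'O')])
    # m[0, 2] is now 1.
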

utils.em_analysis module

em_analysis.py. EM estimation of the mixture parameters for each iteration.

utils.em_analysis.em_step(pi0, p0, p1, x, n_obs, N, cores)

A single expectation-maximization step. Multiprocess execution (each process computes its part of the summations for one set of observations).

Args:
  • pi0 (float): current estimate of the pi0 parameter (P(x ~ y)).
  • p0 (float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • p1 (float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
  • x (list): observations (as a dict mapping each unique observation to its number of occurrences).
  • n_obs (int): total number of observations.
  • N (int): dimension of the multivariate Bernoulli.
  • cores (int): number of cores to use.
Returns:
  • nz1 (float): estimate of the z1 hidden variables.
  • npi0 (float): estimate of the pi0 parameter (P(x ~ y)).
  • np0 (float): estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • np1 (float): estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
utils.em_analysis.em_step_threaded(pi0, p0, p1, obs, N, res_queue)

Function for parameter estimation on one thread.

Args:
  • pi0 (float): current estimate of the pi0 parameter (P(x ~ y)).
  • p0 (float): current estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)).
  • p1 (float): current estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)).
  • obs (list): observations given to this thread (as a dict mapping each unique observation to its number of occurrences).
  • N (int): dimension of the multivariate Bernoulli.
  • res_queue (Queue): output queue.
utils.em_analysis.estimate_parameters_em(co_occ, N, p1i=0.9, p0i=0.1, pi0i=0.8, n_iter=20, cores=20)

Expectation-Maximization for a 2-component Bernoulli mixture.

Args:
  • co_occ (list): list of observations.
  • N (int): dimension of an observation (here, corresponds to the number of considered iterations).
  • p1i (float, optional): initial estimate of the p1 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x ~ y)). Defaults to 0.9.
  • p0i (float, optional): initial estimate of the p0 parameter (P(c1(x)=c1(y), ..., cN(x)=cN(y) | x <> y)). Defaults to 0.1.
  • pi0i (float, optional): initial estimate of the pi0 parameter (P(x ~ y)). Defaults to 0.8.
  • n_iter (int, optional): number of iterations. Defaults to 20.
  • cores (int, optional): number of cores to use. Defaults to 20.
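
The module's own routines are multiprocess and operate on full binary observation vectors; as a rough single-process illustration of the underlying model, EM for a two-component binomial mixture over agreement counts can be sketched as follows (all names hypothetical):

    import numpy as np
    from scipy.stats import binom

    def em_binomial_mixture(counts, N, pi0=0.8, p0=0.1, p1=0.9, n_iter=20):
        # counts: dict mapping an agreement count k (out of N) to the
        # number of pairs observed with that count.
        ks = np.arange(N + 1)
        w = np.array([counts.get(k, 0) for k in ks], dtype=float)
        for _ in range(n_iter):
            # E-step: posterior that a pair matches (x ~ y), given k.
            l1 = pi0 * binom.pmf(ks, N, p1)          # matching pairs
            l0 = (1.0 - pi0) * binom.pmf(ks, N, p0)  # non-matching pairs
            z1 = l1 / (l0 + l1)
            # M-step: re-estimate the mixture weight and both parameters.
            pi0 = np.sum(w * z1) / np.sum(w)
            p1 = np.sum(w * z1 * ks) / (N * np.sum(w * z1))
            p0 = np.sum(w * (1.0 - z1) * ks) / (N * np.sum(w * (1.0 - z1)))
        return pi0, p0, p1

    print(em_binomial_mixture({0: 800, 1: 150, 9: 20, 10: 30}, N=10))
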

utils.error module

error.py. Custom exceptions and error handling.

exception utils.error.ConfigError

Bases: exceptions.Exception

Exception raised when a compatibility error is found in the configuration options.

exception utils.error.InputError

Bases: exceptions.Exception

Exception raised when an error is found in the input data.

exception utils.error.ParsingError

Bases: exceptions.Exception

Exception raised when a parsing error is found in the configuration options.

utils.error.signal_handler(signal, frame)

Handles the keyboard interrupt signal in case of multiprocess execution.

utils.error.warning(obj)

Print a warning on the error stream.

Args:
  • obj (str): warning message
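
A function of this kind is likely along the following lines (a sketch; the exact message format is an assumption):

    import sys

    def warning(obj):
        # Write to stderr so warnings do not pollute redirected stdout.
        sys.stderr.write('Warning: %s\n' % obj)
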

utils.eval module

utils.eval.APP(predicted, labels)
utils.eval.F(predicted, labels)
utils.eval.P(predicted, labels)
utils.eval.R(predicted, labels)
utils.eval.Read_file(file_cl)
utils.eval.V(predicted, labels)
utils.eval.combination(a, k)
utils.eval.grp2idx(labels)
utils.eval.mi(predicted, labels)
utils.eval.mutual_info(x, y)
utils.eval.nmi(x, y)
utils.eval.rand(predicted, labels)

utils.matrix_op module

matrix_op.py. Functions operating on the full similarity matrix (normalization and distribution analysis).

utils.matrix_op.ROC_analysis(line, name, output_folder, ground_truth)

Plots a ROC curve for one given sample (one line of the matrix).

Args:
  • line (ndarray): similarities for one sample.
  • name (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • ground_truth (list): indices of the samples belonging to the same class as the current sample.
utils.matrix_op.ROC_mean_analysis(lines, key, output_folder, gt)

Plots all ROC curves and their horizontal/vertical means for several samples of the same class (lines of the matrix).

Args:
  • lines (ndarray): similarities for the samples of the considered class.
  • key (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • gt (list): indices of the samples belonging to the currently considered class.
utils.matrix_op.distribution_analysis(line, name, output_folder, temp_folder, ground_truth, kbest=[2000, 1000, 500, 200], mode='matlab')

Plots the similarities histogram and densities (+ ground-truth display) for a line (sample) of the similarity matrix, at various scales.

Args:
  • line (ndarray): sorted similarities in increasing order for one sample.
  • name (str): prefix for naming the plots.
  • output_folder (str): path to directory to output the plots.
  • temp_folder (str): path to directory to output temporary plots (before concatenation).
  • ground_truth: indices of samples belonging to the same class as current sample.
  • kbest (list, optional): indices for zoom ins (keep the k best values, for all k in kbest).
  • mode (str): if ‘matlab’, then plots the distribution on the whole interval using matplotlib. If ‘R’, plots the distribution for all zoom-values in kbest using R (requires ggplot2 library).
utils.matrix_op.keep_k_best(co_occ, k=200)

Keep the k best values in the matrix and set the rest to 0. Relies on the bottleneck library for fast sort.

Args:

  • co_occ (ndarray): input matrix.
  • k (int, optional): number of values to keep. Defaults to 200.

Returns:

  • normalized (ndarray): the matrix where only the k best values are kept (all others set to 0).
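
A NumPy-only equivalent for reference (the module relies on bottleneck for speed; ties around the threshold may keep slightly more than k values):

    import numpy as np

    def keep_k_best_sketch(co_occ, k=200):
        # k-th largest value over the whole matrix (requires k <= size).
        threshold = np.partition(co_occ.ravel(), -k)[-k]
        return np.where(co_occ >= threshold, co_occ, 0)
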
utils.matrix_op.normalize(co_occ)

Returns a normalized version of the input matrix.

Args:

  • co_occ (ndarray): co-occurrence matrix.

Returns:

  • normalized (ndarray): a normalized version of the co-occurrence matrix.
utils.matrix_op.normalize_gauss_global(co_occ)

Normalize the full matrix with respect to its global standard deviation and mean (X <- (X - mean) / std).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
utils.matrix_op.normalize_gauss_local(co_occ)

Normalize the full matrix line by line, with respect to each line's standard deviation and mean (X <- (X - mean) / std).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
utils.matrix_op.normalize_min_max(co_occ)

Normalize the full matrix globally with respect to its minimum and maximum value (X <- (X - min) / (max - min)).

Args:

  • co_occ (ndarray): input matrix.

Returns:

  • normalized (ndarray): normalized matrix.
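
The three normalizations above (global Gaussian, line-wise Gaussian, min-max) translate directly to NumPy; a minimal sketch, not the module's exact code:

    import numpy as np

    def gauss_global(co_occ):
        return (co_occ - co_occ.mean()) / co_occ.std()

    def gauss_local(co_occ):
        # Standardize each line with its own mean and standard deviation.
        return ((co_occ - co_occ.mean(axis=1, keepdims=True))
                / co_occ.std(axis=1, keepdims=True))

    def min_max(co_occ):
        return (co_occ - co_occ.min()) / (co_occ.max() - co_occ.min())
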
utils.matrix_op.statistical_analysis_binary(line, rnd_distrib, ground_truth, temp_folder, output_folder, name, suffix=None)

For binary SIC: plots the similarity distribution and its given theoretical model (Poisson Binomial).

Args:
  • line (list): similarities for the considered sample x.
  • rnd_distrib (list): values of the discrete random case distribution (k -> P(X = k))
  • ground_truth (list): indices of the samples in the same class as x.
  • temp_folder (str): path to the folder containing the temporary files.
  • output_folder (str): path to the output folder.
  • name (str): string representation of the considered sample.
  • suffix (str, optional): additional suffix for the output file. Defaults to None (no suffix).
utils.matrix_op.statistical_analysis_weighted(line, N, ground_truth, temp_folder, output_folder, name, step=1.0)

Plots the similarity distribution and gaussian theoretical distribution in the case of a weighted similarity (negative samples).

Args:
  • line (list): similarities for the considered sample x.
  • N (int): standard deviation of the gaussian.
  • ground_truth (list): indices of the samples in the same class as x.
  • temp_folder (str): path to the folder containing the temporary files.
  • output_folder (str): path to the output folder.
  • name (str): string representation of the considered sample.
  • step (float, optional): step between x-axis’ ticks.

utils.one_step module

one_step.py. One classification step for building the similarity. Designed for threaded execution.

utils.one_step.split_data(n, data, data_type, train_frac, classifier_type, annotation_params, temp_folder, with_full_test=False)

Splits the given database into a training and testing set.

Args:
  • n (int): iteration identifier.
  • data (list): initial data structure.
  • data_type (str): data set used for the experiments.
  • train_frac (float): proportion of the database to keep for training.
  • classifier_type (str): type of classifier (for on-the-fly parsing format in AQUAINT).
  • annotation_params (dict): annotation parameters (for OVA annotation).
  • temp_folder (str): path to temporary folder.
  • with_full_test (bool, optional): if True, use the whole dataset (including training) for testing.
Returns:
  • train (list): the data kept for training (generator).
  • test (list): the data kept for testing (generator).
  • test_indices (list): indices of the test samples in the whole data; used to compute the number of test occurrences afterwards. (Except for AQUAINT, where test_indices directly contains the occurrences of each sample in the test set.)
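
Stripped of the data-set-specific handling, the core splitting logic might be sketched like this (a hypothetical simplification):

    import random

    def split_data_sketch(data, train_frac):
        # Shuffle indices and split according to the training fraction.
        indices = list(range(len(data)))
        random.shuffle(indices)
        cut = int(train_frac * len(data))
        test_indices = sorted(indices[cut:])
        train = [data[i] for i in indices[:cut]]
        test = [data[i] for i in test_indices]
        return train, test, test_indices
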
utils.one_step.thread_step(n, coocc_matrix, unit_cell_length, n_samples, sim_queue, data, temp_folder, data_type, annotation_params, classification_params, verbose=1, debug=False, with_unique_occurrences=False, preclustering=[])

One classification iteration. Results are output in a queue.

Args:
  • n (int): index of current iteration.
  • coocc_matrix (array): similarity matrix (shared in memory).
  • unit_cell_length (int): length of one locked cell in the matrix.
  • n_samples (int): number of samples in the data set.
  • sim_queue (Queue): output queue.
  • data (list): initial data structure.
  • temp_folder (str): path to temporary folder.
  • data_type (str): data set used.
  • annotation_params (dict): parameters for the synthetic annotation.
  • classification_params (dict): parameters for the supervised classification algorithm.
  • verbose (int, optional): controls the verbosity level. Defaults to 1.
  • debug (bool, optional): runs in debugging mode. Defaults to False.
  • with_unique_occurrences (bool, optional): True when occurrences of a same entity are distinct items in the database (e.g. NER). Defaults to False.
  • preclustering (list, optional): entity index to class mapping. This clustering is used to give the same annotation to entities in the same class.

utils.opt module

opt.py. Functions designed for vectorized updates of the similarity matrix, instead of cell-by-cell ones.

utils.opt.cartesian(arrays, out=None, numpy=True)

Generate a cartesian product of input arrays.

Parameters:
  • arrays : list of array-like. 1-D arrays to form the cartesian product of.
  • out : ndarray. Array to place the cartesian product in.
Returns:
  • out : ndarray. 2-D array of shape (M, len(arrays)) containing cartesian products formed of input arrays.
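An equivalent construction with np.meshgrid, shown only to illustrate the expected output (the module's implementation differs):

    import numpy as np

    def cartesian_sketch(arrays):
        grids = np.meshgrid(*arrays, indexing='ij')
        return np.stack([g.ravel() for g in grids], axis=-1)

    print(cartesian_sketch([np.array([1, 2]), np.array([10, 20, 30])]))
    # [[ 1 10] [ 1 20] [ 1 30] [ 2 10] [ 2 20] [ 2 30]]
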
utils.opt.cartesian_prod(arrays, full=None, out=None, numpy=True)

Generates the products for all possible cartesian combinations of the arrays' components.

Args:
  • arrays: list of array-like.
  • full (int): length of the base array. Starting value should be None.
  • out (ndarray): array where to put the result.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out (ndarray): array containing the products for all possible cartesian combinations.
utils.opt.pairs_combination(a, numpy=True)

Returns all the possible pair combinations from an index array.

Args:
  • a: array-like.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing all the possible pair combinations from an index array.
utils.opt.pairs_combination_indices(a, n_samples, numpy=True)

Returns all the possible pair combinations from an index array, in index form (upper-triangle matrix).

Args:
  • a: array-like.
  • n_samples: size of the 2D matrix (i.e. the maximum possible index value + 1).
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing the indices of all possible pairs combinations.
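
A sketch of one common indexing convention (the usual scipy.spatial.distance.squareform layout; the module may use a different one):

    import numpy as np
    from itertools import combinations

    def pairs_indices_sketch(a, n_samples):
        # Condensed index of pair (i, j), i < j, in the upper triangle:
        # k = i * n - i * (i + 1) / 2 + (j - i - 1)
        pairs = np.array(list(combinations(sorted(a), 2)))
        i, j = pairs[:, 0], pairs[:, 1]
        return i * n_samples - i * (i + 1) // 2 + (j - i - 1)

    print(pairs_indices_sketch([0, 2, 3], n_samples=4))  # [1 2 5]
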
utils.opt.product_combination(a, numpy=True)

Returns all the possible pair product combinations from an index array.

Args:
  • a: array-like.
  • numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
Returns:
  • out: array containing all the possible pair product combinations from an index array.

utils.output_format module

output_format.py. Functions related to the formatting of the program outputs.

class utils.output_format.Tee(*files)

Bases: object

Tee object used for writing to several files at once (used for linking stdout to a log file).

flush()
write(obj)
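
A Tee of this kind is conventionally implemented as below (a sketch consistent with the interface above, not necessarily the module's code):

    import sys

    class Tee(object):
        def __init__(self, *files):
            self.files = files

        def write(self, obj):
            # Forward every write to all underlying files.
            for f in self.files:
                f.write(obj)

        def flush(self):
            for f in self.files:
                f.flush()

    # Typical use: mirror stdout into a log file.
    # sys.stdout = Tee(sys.stdout, open('run.log', 'w'))
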
utils.output_format.clustering_to_file(output_folder, clustering, suffix=None)

Writes a clustering in a file.

Args:
  • output_folder (str): Path to the output folder.
  • clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
  • suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
Returns:
  • output_path (str): Path to the file in which the clustering was written.
utils.output_format.clustering_to_string(clustering)

Returns a clustering as a string, with entities separated by tabulations and clusters separated by newlines ('\n').

Args:
  • clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
Returns:
  • clustering_to_string (str): string representation of the clustering.
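
With the format described above, the function amounts to something like this hypothetical one-liner:

    def clustering_to_string_sketch(clustering):
        # One line per cluster, members separated by tabs.
        return '\n'.join('\t'.join(str(e) for e in members)
                         for members in clustering.values())

    print(clustering_to_string_sketch({0: ['alice', 'bob'], 1: ['carol']}))
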
utils.output_format.init_folder(path)

Create directory in path if not already existing.

Args:
  • path (str): path of the directory to initialize.
utils.output_format.load_cooc(mat)

Load and format, if needed, the given similarity matrix.

Args:
  • mat (str): path to the similarity matrix (txt, npy or pickle)
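
A plausible dispatch on the file extension (an illustration, not necessarily the module's logic):

    import pickle
    import numpy as np

    def load_cooc_sketch(mat):
        # Dispatch on the file extension (txt, npy or pickle).
        if mat.endswith('.npy'):
            return np.load(mat)
        if mat.endswith('.txt'):
            return np.loadtxt(mat)
        with open(mat, 'rb') as f:
            return pickle.load(f)
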
utils.output_format.log(output_folder)

Creates a file for redirecting the output.

Args:
  • output_folder (str): path to root of directory for outputs.
utils.output_format.readable_clustering(output_folder, clustering, index_to_label, suffix=None)

Outputs a clustering in a more readable format.

Args:
  • output_folder (str): Path to the output folder.
  • clustering (dict): clustering represented as a dictionary mapping a cluster to the list of its elements.
  • index_to_label (list): mapping from an entity index to a string label.
  • suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
Returns:
  • output_file (str): path to the file now containing the clustering under readable format.
utils.output_format.save_coocc(output_folder, coocc, suffix=None, type='binary')

Saves the current co-occurrence matrix.

Args:
  • output_folder (str): path to root of output folder.
  • coocc (ndarray): co-occurrence matrix.
  • suffix (str, optional): if given, added as a suffix to the name of the output file. Defaults to None.
  • type (str): output file format (text, binary or pickle).
Returns:
  • output_file (str): path to the file now containing the matrix.
utils.output_format.save_coocc_mcl(output_folder, coocc, index_to_label)

Saves the current co-occurrence matrix (only non-zero entries) in the label format of MCL.

Args:
  • output_folder (str): path to root of output folder.
  • coocc (ndarray): co-occurrence matrix.
  • index_to_label (list): mapping from an entity index to a readable label.
Returns:
  • output_file (str): path to the file now containing the matrix.

utils.parse module

parse.py. Functions for parsing data files and ground-truth file for a given data set. The module can also be used as a script to generate .qrel version of the ground-truth for the information retrieval evaluation procedure.

utils.parse.ground_truth_AQUA_qrel(ground_truth_file, output_file, aqua_entities_file)

Builds the qrel file for the AQUAINT ground-truth to be used for the nearest neighbour evaluation.

Args:
  • ground_truth_file (str): path to the AQUAINT ground-truth file.
  • output_file (str): path to the qrel output file.
  • aqua_entities_file (str): path to the file listing all AQUAINT entities.
utils.parse.ground_truth_indices(data_type, ground_truth_file, label_to_index=None)

Returns indices of the entities that are in the ground truth. Only used for AQUAINT since, for NER, the samples coincide with the ground truth.

Args:
  • data_type (str): dataset identifier
  • ground_truth_file (str): path to the ground_truth_file
  • label_to_index (list): maps a label to the corresponding entity index. Required for Aquaint.
Returns:
  • indices (list): indices of samples in ground-truth.
utils.parse.ground_truth_pairs(data_type, ground_truth_file, n_samples)

Retrieves indices of sample pairs that have the same ground-truth class. Indices are computed as if using the upper triangle of a square n_samples x n_samples matrix.

Args:
  • data_type (str): data set.
  • ground_truth_file (str): ground-truth file for the given data set.
  • n_samples (str): number of samples for the given data set.
Returns:
  • indices (int list): list of indices of sample pairs with the same ground-truth class.
utils.parse.ground_truth_qrel(ground_truth_file, output_file, index_to_label)

Builds the qrel file for NER and AUDIO ground-truth to be used for the nearest neighbour evaluation.

Args:
  • ground_truth_file (str): path to the AQUAINT ground-truth file.
  • output_file (str): path to the qrel output file.
  • index_to_label (list): maps an entity index to the corresponding string label.
utils.parse.parse_AQUA_entities(entities_file)

Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.

Args:
  • entities_file (str): file containing the retrieved entities and their number of occurrences.
Returns:
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse.parse_AQUA_single_partial(classifier_type, data_file, label_to_index, training_size, testing_size, train_acc, test_acc, test_occurences, train_included_test=False)

Reads and partially parses the given Aquaint2 data file. Contrary to parse_AQUA_single, this function does not load the whole data but directly builds the required training and testing sets.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_file (str): path to the data file.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
  • training_size (int): number of docs for training from this document, or a list of docs indices + sentences to keep.
  • testing_size (int): number of docs for testing from this document, or a list of docs indices + sentences to keep.
  • train_acc (iterator): accumulator for training sentences.
  • test_acc (iterator): accumulator for testing sentences.
  • test_occurences (array): accumulator for the number of occurrences of each word in the test database.
  • train_included_test (bool, optional): If True, retrieved training sentences will also be included in the testing set.
Returns:
  • train_acc (iterator): updated training sentences accumulator.
  • test_acc (iterator): updated testing sentences accumulator.
utils.parse.parse_NER(classifier_type, data_file)

Reads and parses the given data file for the Named Entity Recognition (NER) task.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiments.
  • data_file (str): path to data file.
Returns:
  • n_samples (int): number of samples in the database.
  • data (list): structure containing the data. A word is usually represented by a tuple (index of the sample if it is an entity of interest, word representative of the sample, additional tags, B(I)O tag).
  • data_occurrences (list): number of occurrences of each word in each sentence.
  • index_to_label (list): list associating a sample’s index with its string representation.
  • summary (str): additional information on the dataset.
utils.parse.parse_audio(data_folder, selected_features)

Parses the feature selection for the Audio task.

Args:
  • data_folder (str): If the features are precomputed, then data_folder is the path to the directory containing the features. Otherwise, it is a file containing the samples folder as its first line and all the possible HTK features markers (one per line) to consider.
  • selected_features (list): list of the feature types to use in the experiments (given in the configuration file).
Returns:
  • n_samples (int): number of samples in the dataset.
  • data (list): maps a feature identifier to the corresponding HTK generated features.
  • index_to_label (list): maps an entity index to a string label.
utils.parse.parse_data(data_type, classification_params, data_file)

Parse data file(s) depending on the chosen options.

Args:
  • data_type (str): dataset identifier.
  • classification_params (dict): classifier parameters.
  • data_file (str): path to data file.
Returns:
  • n_samples (int): number of samples in the database.
  • data (list): structure containing the data.
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data (only for AQUAINT and NER).
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse.parse_ground_truth(data_type, ground_truth_file, label_to_index=None)

Reads and parses the given ground-truth file.

Args:
  • data_type (str): dataset identifier.
  • ground_truth_file (str): path to the ground truth file.
  • label_to_index (dict, optional): maps an AQUAINT word to its index. Required when outputting the AQUAINT groundtruth with the entity indices rather than string representation.
Returns:
  • ground_truth (list): list associating a sample with its ground-truth cluster:
    • for NER, ground_truth: cluster (str) -> entity indices (int list)
    • for AQUA, ground_truth: entity (str) -> entity indices (int list) if label_to_index, else str list
utils.parse.parse_pattern(classifier_type, pattern_file)

Reads and parses the given pattern file (Wapiti/CRF++ expected format).

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • pattern_file (str): path to the pattern file.
Returns:
  • features (list): features organized by category.
  • distrib (list): probability of sampling a feature for each category.
utils.parse.split_on_ground_truth_no_indices(data_type, ground_truth_file, numb=6, keys=None)

Given the ground-truth data file, returns a random set of entities for each class (usually used to visualize similarity distributions).

Args:
  • data_type (str): dataset identifier.
  • ground_truth_file (str): path to the ground truth file.
  • numb (int, optional): number of entities to select for each class. Defaults to 6.
  • keys (list, optional): if given, the algorithm returns a list of samples whose ground-truth classes form the keys list.
Returns:
  • ground_truth (list): list associating a sample with its ground-truth cluster.
  • selected_entities (list): selected entities to be plotted.

utils.parse_stat module

parse_stat.py. Additional functions for parsing statistics and precomputing information on the data sets (mostly Aquaint).

utils.parse_stat.count_aqua_docs(directory, output_folder, aqua_entities_file)

Counts the number of occurrences of each word in each document, as well as in the whole data set.

Args:
  • directory (str): directory containing all xml documents of the dataset.
  • output_folder (str): path to output directory.
  • aqua_entities_file (str): path to the file listing all AQUAINT entities.
utils.parse_stat.count_aqua_docs_score(directory, output_file, aqua_entities_file)

Computes a score for each document, based on how rare the words occurring in the document are.

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the file to write the output scores.
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.parse_AQUA(classifier_type, data_folder, label_to_index)

Read and parse all data files from the AQUAINT folder.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_folder (str): path to folder containing all data files.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
Returns:
  • data (list): structure containing the data (file -> docs -> sentence -> words).
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
  • summary (str): additional information on the dataset.
utils.parse_stat.parse_AQUA_entities(entities_file)

Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.

Args:
  • entities_file (str): file containing the retrieved entities and their number of occurrences.
Returns:
  • index_to_label (list): list associating a sample’s index with its string representation.
  • label_to_index (list): reversed mapping of index_to_label.
utils.parse_stat.parse_AQUA_single(classifier_type, data_file, label_to_index)

Read and parse the full given AQUAINT data file.

Args:
  • classifier_type (str): type of the classifier that will be used for the experiment.
  • data_file (str): path to the data file.
  • label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of the same word to the same index.
Returns:
  • data (list): structure containing the data (file -> docs -> sentence -> words).
  • data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
utils.parse_stat.retrieve_aqua_entities(directory, output_file)

Retrieves all interesting entities for the AQUAINT2 dataset (common names with strictly more than 10 occurrences).

Args:
  • directory (str): directory containing all xml documents of the dataset.
  • output_file (str): path where to output the retrieved entities and their number of occurrences.
utils.parse_stat.retrieve_aqua_occurrences(directory, output_file, aqua_entities_file)

Retrieves all occurrences of each word in the dataset (the position of each occurrence is given as a tuple file -> doc -> sentence).

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the directory to write the output files (1 file = 1 word).
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.retrieve_aqua_occurrences_sentences(directory, output_file, aqua_entities_file)

Same as parse_stat.retrieve_aqua_occurrences, but outputs the sentences of the occurrences rather than their positions.

Args:
  • directory (str): path to the directory containing the Aquaint data files.
  • output_file (str): path to the file to write the output scores.
  • aqua_entities_file (str): path to the file containing all aquaint entities and their number of occurrences.
utils.parse_stat.stat_aqua(directory)

Computes some statistics about the AQUAINT dataset.

Args:
  • directory (str): directory containing all xml documents of the dataset.

utils.plot module

plot.py. Functions related to plotting and data visualization.

utils.plot.compare_params(true_params, em_params, ip0, output_folder)

Plots several visual comparisons of the parameters estimation.

Args:
  • true_params (list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
  • em_params (dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
  • ip0 (list): p0 parameter estimates under the independence assumption.
  • output_folder (str): path to the output folder.
utils.plot.convergence_curve(output_folder, log_file)

Plots the evolution of the correlation coefficients for the convergence experiments.

Args:
  • output_folder (str): path to the directory for the outputs.
  • log_file (str): path to the log file output during the experiments (convergence_analysis.py script)
utils.plot.expected_binary(true_params, em_params, tpi0, epi0, output_folder)

Plot the 2-components Poisson Binomial mixture model given its parameters.

Args:
  • true_params (list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
  • em_params (dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
  • tpi0 (float): ground-truth pi0 estimate.
  • epi0 (float): EM pi0 estimate.
  • output_folder (str): path to the output folder.
utils.plot.fraction_plot(clustering, ground_truth, output_name)

Plots a histogram representation of a clustering, with colors representing the ground-truth clustering.

Args:
  • clustering ((cluster -> values) dict): a dict representation of a clustering.
  • ground_truth ((cluster -> values) dict): a dict representation of the ground-truth clustering.
  • output_name (str): prefix of the file in which to output the figure.
utils.plot.get_Aquaint_graph(start, synonyms, nodes, edges, level, maxlevel)

Returns an excerpt networkx graph from the Aquaint ground-truth. Recursive function.

Args:
  • start (list): list of nodes to build edges from.
  • synonyms (dict): Aquaint ground-truth neighbour relations.
  • nodes (dict): list of nodes already built, organized by depth (minimum depth relative to one of the starting nodes).
  • edges (list): list of edges already built.
  • level (int): current depth (starting at 0).
  • maxlevel (int): max depth to consider.
utils.plot.heatmap(similarity_matrix, ground_truth, output_folder)

Plots a heat map of the similarity matrix.

Args:
  • similarity_matrix (ndarray): similarity matrix.
  • ground_truth (dict): ground-truth clustering.
  • output_folder (str): path to the output folder.
utils.plot.histo_cluster(clustering, output_name)

Plots a histogram representation of a clustering (number of samples per cluster).

Args:
  • clustering ((cluster -> values) dict): a dict representation of a clustering.
  • output_name (str): prefix of the file in which to output the figure.
utils.plot.is_in_upper_level(nodes, word, level)

Determines whether a node has already been seen as a closer neighbour (level).

Args:
  • nodes (dict): dict mapping a level to the nodes it contains.
  • word (str): label of the node to consider.
  • level (int): current level.
utils.plot.mds_representation(sim_matrix, ground_truth, index_to_label, colors, output_folder, dim=2, cores=20, mode='mds')

Computes Euclidean distances (MDS) from the computed similarity matrix.

Args:
  • sim_matrix (ndarray): similarity matrix.
  • ground_truth (dict): ground-truth clustering.
  • index_to_label (list): mapping from an entity index to a string label.
  • colors (dict): mapping from a class to a color.
  • output_folder (str): path to the output folder.
  • dim (int, optional): number of dimensions in the metric space. Defaults to 2.
  • cores (int, optional): number of cores to use (threaded MDS).
  • mode (str, optional): Projection algorithm ('mds' or 'tsne').
utils.plot.on_the_fly_cvg(file)

Plot some statistics on the on-the-fly convergence criterion.

Args:
  • file (str): path to the file containing the measurements of the on-the-fly criterion
utils.plot.pie_chart(clusters, output_folder)

Plots a pie-chart representation of a clustering.

Args:
  • clusters ((cluster -> values) dict): a dict representation of a clustering.
  • output_folder (str): path to output folder.
utils.plot.plot_Aquaint_graph(words, aqua_gt, level=1)

Plot an excerpt graph from the Aquaint ground-truth using the networkx library.

Args:
  • words (list): nodes to consider as origin.
  • aqua_gt (str): path to the Aquaint ground-truth.
  • level (int): max depth to consider (starting at 0).

utils.probability_fit module

probability_fit.py. Functions related to estimating probability distributions.

utils.probability_fit.estimate_poisson_binomial(N, p_values)

Estimate the values of a Poisson Binomial distribution given its p-parameters.

Args:
  • N (int): number of independent Bernoulli experiments.
  • p_values (list): Bernoulli parameter for each experiment.
Returns:
  • values (list): values taken by the distribution (k -> P(X = k))
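
One standard way to compute these values is the convolution recurrence below (a sketch, not necessarily the module's method):

    import numpy as np

    def poisson_binomial_pmf(p_values):
        # pmf[k] = P(X = k), where X is the number of successes among
        # independent Bernoulli trials with probabilities p_values.
        pmf = np.zeros(len(p_values) + 1)
        pmf[0] = 1.0
        for p in p_values:
            # Convolve the current pmf with one Bernoulli(p) trial.
            pmf[1:] = pmf[1:] * (1.0 - p) + pmf[:-1] * p
            pmf[0] *= (1.0 - p)
        return pmf

    print(poisson_binomial_pmf([0.5, 0.5]))  # [0.25 0.5 0.25]
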
utils.probability_fit.sample_from_pdf(N, values)

Returns N samples from a given discrete distribution P.

Args:
  • N (int): number of samples to draw.
  • values (list): values taken by the discrete distribution (k -> P(X = k))
Returns:
  • samples (list): N samples drawn from the P distribution
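
With NumPy, such sampling reduces to a one-liner (a sketch; values must sum to 1):

    import numpy as np

    def sample_from_pdf_sketch(N, values):
        # Draw N samples from the discrete distribution k -> values[k].
        return np.random.choice(len(values), size=N, p=values)
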

utils.read_config module

read_config.py. Functions for setting the main parameters. Reads the configuration file in configuration.ini and the command line options.

utils.read_config.get_config_option(config, section, name, arg, type='str')

Returns the default configuration if arg (command line argument) is None, else returns arg.

Args:
  • config: current configuration object returned by ConfigParser.
  • section (str): section of the considered argument.
  • name (str): name of the considered argument.
  • arg: value passed through command line for the considered argument.
  • type (str, optional): type of the considered argument (str, int or float). Defaults to str.
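
The described precedence (command line over configuration file) amounts to something like this sketch, using the standard ConfigParser getters:

    def get_config_option_sketch(config, section, name, arg, type='str'):
        # A value given on the command line overrides the config file.
        if arg is not None:
            return arg
        getters = {'str': config.get, 'int': config.getint,
                   'float': config.getfloat}
        return getters[type](section, name)
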
utils.read_config.parse_cmd_line()

Parses the command line for options to override the default configuration parameters.

Returns:
  • opt (list): specified command line options to override the default config options:
    • N (int): number of iterations (-N, --iter).
    • cores (int): number of cores (-t, --threads).
    • data_type (str): chosen dataset (-d, --dataset).
    • input (str): chosen data file (-in, --input).
    • groundtruth (str): chosen ground-truth file (-g, --groundtruth).
    • n_distrib (str): type of annotation (-di, --distrib).
    • training_size (float): training percentage, strictly between 0 and 1 (-ts, --trainsize).
    • nmin (int): minimum number of synthetic labels (-nmin).
    • nmax (int): maximum number of synthetic labels (-nmax).
  • cfg (str): path to default configuration file.
  • verbose (int): controls verbosity level (0 to 4).
  • debug (bool): runs in debugging mode.
utils.read_config.read_config_file(f, N=None, cores=None, data_type=None, input_file=None, ground_truth_file=None, n_distrib=None, training_size=None, nmin=None, nmax=None, classifier_type=None, similarity_type=None, task_type=None, cvg_step=None, cvg_criterion=None, output_folder=None, temp_folder=None, oar=False)

Reads and parses the default arguments in the configuration file.

Args:
  • f (str): path to the configuration file.
  • opt (list): specified command line options (see result of read_config.parse_cmd_line).
Returns:
  • N (int): number of iterations.
  • cores (int): number of cores to use.
  • locks (int): number of independent locks to add on the similarity matrix.
  • data_type (str): name of the dataset chosen for the experiments.
  • input_file (str): path to default input file.
  • ground_truth_file (str): path to the default ground-truth file.
  • temp_folder (str): path to the folder for temporary files (e.g. MCL input format file).
  • output_folder (str): path to the folder for output files.
  • annotation_params (dict): parameters for the synthetic annotation.
  • classification_params (dict): parameters for the supervised classification algorithm (classifier type, training percentage, similarity type, additional parameters).
  • classifier_binary (str): path to the binary for the classifier.
  • task_params (list): parameters for the post-processing task (task type, algorithm binary, additional parameters).
  • cvg_step (int): if greater than 2, the convergence criterion will be evaluated every cvg_step iterations.
  • cvg_criterion (float): value of the criterion on the mean entropies to stop the algorithm.
  • config (ConfigParser): configuration object updated with the values in command line.

Module contents