utils package
Submodules
utils.annotation module
annotation.py. Generation of synthetic annotations for the data samples.
-
utils.annotation.annotate(n, temp_folder, classifier_type, (train, test), annotation_params, **kwargs)
Return a synthetic annotation of both the training and the testing set.
- Args:
n (int): step identifier.
temp_folder: directory for temporary files.
classifier_type (str): type of the classifier that will take the annotation as input.
train (list): initial train data structure.
test (list): initial test data structure.
annotation_params (dict): additional annotation parameters.
with_common_label_wordform (bool, optional): if True, each entity occurrence wordform receives the same label. Defaults to False.
verbose (int, optional): controls verbosity level.
debug (bool, optional): enable/disable debug mode.
- Returns:
N (int): random max number of synthetic labels for this step.
n_unique_labels_used (int): number of synthetic labels that were actually used.
train_file (str): path to the formatted train data.
test_file (str): path to the formatted test data.
training_size (int): number of sequences in the training database.
testing_size (int): number of sequences in the testing database.
n_entities_train (int): number of entities in the training database.
entites_indices (list): identifies the entities of interest (tag B) in the testing set; this is used to filter the classifier's output.
-
utils.annotation.choose_features(pattern, distrib, output_pattern_file=None)
Given a set of features and their probability of occurrence (pattern), choose features at random for the current training step.
- Args:
pattern (list): features organized by importance category.
distrib (list): probability of sampling a feature for each category.
output_pattern_file (str): testing set.
- Returns:
to_keep (list): the selected features.
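As a rough illustration of the sampling described for choose_features, the sketch below keeps each feature independently with the probability attached to its category. The independent-draw rule, the name `sample_features`, the `rng` argument and the example templates are assumptions made for this example; the selection rule actually used in annotation.py may differ.

```python
import random


def sample_features(pattern, distrib, rng=None):
    """Keep each feature of category i with probability distrib[i] (assumed rule)."""
    rng = rng or random.Random()
    to_keep = []
    for features, proba in zip(pattern, distrib):
        # One independent Bernoulli draw per feature of the category.
        to_keep.extend(f for f in features if rng.random() < proba)
    return to_keep


# Hypothetical pattern: two categories of CRF-style feature templates.
pattern = [["U00:%x[0,0]", "U01:%x[-1,0]"], ["U02:%x[1,0]", "B"]]
# Probability 1.0 keeps the whole first category, 0.0 drops the second one.
selected = sample_features(pattern, [1.0, 0.0], rng=random.Random(0))
```

Passing a seeded `random.Random` makes a given training step reproducible, which can help when debugging.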
-
utils.annotation.random() → x in the interval [0, 1).
utils.classify module
classify.py. For training and applying a classifier on the artificially annotated data set.
-
utils.classify.do_classify_step(n, temp_folder, train, test, test_entities_indices, coocc_matrix, unit_length, count_lab, classification_params, **kwargs)
Builds the similarity for the current step given a classifier and synthetic annotation.
- Args:
n (int): step number.
temp_folder (str): path to the directory for storing temporary files.
train: annotated training set (structure may depend on the classifier).
test: formatted testing set.
test_entities_indices (list): indices of test entities and their position in the classifier output.
coocc_matrix (ndarray): shared similarity matrix.
unit_length (int): length of a locked cell in the matrix.
count_lab (ndarray): counts how many times an entity has been classified as non-null in the test set.
classification_params (dict): additional classification parameters.
clean (bool, optional): if True, removes the temporary files that were created. Defaults to True.
with_unique_occurrences (bool, optional): if True, each entity occurrence is considered unique and can receive a different label. Defaults to True.
verbose (int, optional): controls verbosity level. Defaults to 1.
debug (bool, optional): runs in debug mode.
pretrained (optional): pretrained model.
-
utils.classify.update_similarity(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)
Optimizes the similarity matrix update in the case where multiple occurrences correspond to the same entity (e.g. Aquaint2 case).
- Args:
n (int): step number.
result_iter (str): output of the classification algorithm (Wapiti CRF).
coocc_matrix (ndarray): similarity matrix of the current thread.
unit_cell_length (int): length of one locked cell in the matrix.
label_occurrences (ndarray): counts how many times an entity has been classified as non-null in the test set.
similarity_type (str): type of the similarity to use.
verbose (int, optional): controls verbosity level. Defaults to 1.
- Returns:
n_test_labels: number of distinct annotated labels in the testing base.
weights: distribution of the samples in the test classifications (only with weighted similarity).
b (float): penalty to be added at the end of construction.
-
utils.classify.update_similarity_unique(n, result_iter, coocc_matrix, unit_cell_length, label_occurrences, similarity_type, verbose)
Updates the similarity matrix in the case where each occurrence is a unique named entity (i.e. default case).
- Args:
n (int): step number.
result_iter (str): generator expression on the output of the classifier.
coocc_matrix (ndarray): similarity matrix of the current thread.
unit_cell_length (int): length of one locked cell in the matrix.
label_occurrences (ndarray): counts how many times an entity has been classified as non-null in the test set.
similarity_type (str): type of the similarity to use.
verbose (int, optional): controls verbosity level. Defaults to 1.
- Returns:
n_test_labels: number of distinct annotated labels in the testing base.
weights: distribution of the samples in the test classifications (only with weighted similarity).
b (float): penalty to be added at the end of construction.
utils.em_analysis module
em_analysis.py. EM estimation of the mixture parameters for each iteration.
-
utils.em_analysis.em_step(pi0, p0, p1, x, n_obs, N, cores)
One single expectation maximization step. Multiprocess execution (each process computes the part of the summations for one set of observations).
- Args:
pi0 (float): current estimate of the pi0 parameter, P(x ~ y).
p0 (float): current estimate of the p0 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x <> y).
p1 (float): current estimate of the p1 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x ~ y).
x (list): observations (as dict mapping unique observation -> number of occurrences).
n_obs (int): total number of observations.
N (int): dimension of the multivariate Bernoulli.
cores (int): number of cores to use.
- Returns:
nz1 (float): estimate of the z1 hidden variables.
npi0 (float): estimate of the pi0 parameter, P(x ~ y).
np0 (float): estimate of the p0 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x <> y).
np1 (float): estimate of the p1 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x ~ y).
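For readers unfamiliar with the update, here is a minimal single-process sketch of one EM step on a 2-component multivariate Bernoulli mixture. It assumes that p0 and p1 are length-N vectors (one Bernoulli parameter per dimension) and that observations arrive as an array of unique binary rows with their counts; these choices, and the names `em_step_sketch`, `bernoulli_likelihood` and `log_likelihood`, are illustrative and not taken from em_analysis.py.

```python
import numpy as np


def bernoulli_likelihood(obs, p):
    """P(obs | component) for independent Bernoulli dimensions, one value per row."""
    return np.prod(np.where(obs == 1, p, 1.0 - p), axis=1)


def em_step_sketch(pi0, p0, p1, obs, counts):
    """One EM update; obs is (M, N) binary, counts is (M,) occurrences of each row."""
    # E-step: posterior probability z1 that each observation comes from "x ~ y".
    like1 = pi0 * bernoulli_likelihood(obs, p1)
    like0 = (1.0 - pi0) * bernoulli_likelihood(obs, p0)
    z1 = like1 / (like1 + like0)
    # M-step: averages weighted by the counts and the posteriors.
    w1 = counts * z1
    w0 = counts * (1.0 - z1)
    npi0 = w1.sum() / counts.sum()
    np1 = (w1[:, None] * obs).sum(axis=0) / w1.sum()
    np0 = (w0[:, None] * obs).sum(axis=0) / w0.sum()
    return z1, npi0, np0, np1


def log_likelihood(pi0, p0, p1, obs, counts):
    """Log-likelihood of the counted observations under the mixture."""
    mix = pi0 * bernoulli_likelihood(obs, p1)
    mix = mix + (1.0 - pi0) * bernoulli_likelihood(obs, p0)
    return float((counts * np.log(mix)).sum())


# Toy data: 5 unique observations of dimension N = 3 with their counts.
obs = np.array([[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]])
counts = np.array([5.0, 3.0, 2.0, 20.0, 4.0])
params = (0.5, np.full(3, 0.3), np.full(3, 0.7))  # initial (pi0, p0, p1)
before = log_likelihood(*params, obs, counts)
z1, npi0, np0, np1 = em_step_sketch(*params, obs, counts)
after = log_likelihood(npi0, np0, np1, obs, counts)
```

Each EM step cannot decrease the log-likelihood, so comparing `before` and `after` is a cheap sanity check when iterating the update.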
-
utils.em_analysis.em_step_threaded(pi0, p0, p1, obs, N, res_queue)
Function for parameter estimation on one thread.
- Args:
pi0 (float): current estimate of the pi0 parameter, P(x ~ y).
p0 (float): current estimate of the p0 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x <> y).
p1 (float): current estimate of the p1 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x ~ y).
obs (list): observations given to this thread (as dict mapping unique observation -> number of occurrences).
N (int): dimension of the multivariate Bernoulli.
res_queue (Queue): output queue.
-
utils.em_analysis.estimate_parameters_em(co_occ, N, p1i=0.9, p0i=0.1, pi0i=0.8, n_iter=20, cores=20)
Expectation Maximization on a 2-component Bernoulli mixture.
- Args:
co_occ (list): list of observations.
N (int): dimension of an observation (here, corresponds to the number of considered iterations).
p1i (float, optional): initial estimate of the p1 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x ~ y). Defaults to 0.9.
p0i (float, optional): initial estimate of the p0 parameter, P(c1(x)=c1(y)...cN(x)=cN(y) | x <> y). Defaults to 0.1.
pi0i (float, optional): initial estimate of the pi0 parameter, P(x ~ y). Defaults to 0.8.
n_iter (int, optional): number of iterations. Defaults to 20.
cores (int, optional): number of cores to use. Defaults to 20.
utils.error module
error.py. Error module.
-
exception utils.error.ConfigError
Bases: exceptions.Exception
Exception raised when a compatibility error is found in the configuration options.
-
exception utils.error.InputError
Bases: exceptions.Exception
Exception raised when an error is found in the input data.
-
exception utils.error.ParsingError
Bases: exceptions.Exception
Exception raised when a parsing error is found in the configuration options.
-
utils.error.signal_handler(signal, frame)
Handles the keyboard interrupt signal in case of multi-process execution.
-
utils.error.warning(obj)
Print a warning on the error stream.
- Args:
obj (str): warning message.
utils.eval module
-
utils.eval.APP(predicted, labels)
-
utils.eval.F(predicted, labels)
-
utils.eval.P(predicted, labels)
-
utils.eval.R(predicted, labels)
-
utils.eval.Read_file(file_cl)
-
utils.eval.V(predicted, labels)
-
utils.eval.combination(a, k)
-
utils.eval.grp2idx(labels)
-
utils.eval.mi(predicted, labels)
-
utils.eval.mutual_info(x, y)
-
utils.eval.nmi(x, y)
-
utils.eval.rand(predicted, labels)
utils.matrix_op module
matrix_op.py. Functions operating on the full similarity matrix (normalization and distribution analysis).
-
utils.matrix_op.ROC_analysis(line, name, output_folder, ground_truth)
Plots a ROC curve for one given sample (one line of the matrix).
- Args:
line (ndarray): similarities for one sample.
name (str): prefix for naming the plots.
output_folder (str): path to directory to output the plots.
ground_truth (str): indices of the samples belonging to the same class as the current sample.
-
utils.matrix_op.ROC_mean_analysis(lines, key, output_folder, gt)
Plots all ROC curves and their horizontal/vertical means for several samples of the same class (lines of the matrix).
- Args:
lines (ndarray): similarities for the considered samples (one line of the matrix per sample).
key (str): prefix for naming the plots.
output_folder (str): path to directory to output the plots.
gt (list): indices of the samples belonging to the currently considered class.
-
utils.matrix_op.distribution_analysis(line, name, output_folder, temp_folder, ground_truth, kbest=[2000, 1000, 500, 200], mode='matlab')
Plots the similarities histogram and densities (+ ground-truth display) for a line (sample) of the similarity matrix, at various scales.
- Args:
line (ndarray): sorted similarities in increasing order for one sample.
name (str): prefix for naming the plots.
output_folder (str): path to directory to output the plots.
temp_folder (str): path to directory to output temporary plots (before concatenation).
ground_truth: indices of samples belonging to the same class as current sample.
kbest (list, optional): indices for zoom-ins (keep the k best values, for all k in kbest).
mode (str): if 'matlab', plots the distribution on the whole interval using matplotlib. If 'R', plots the distribution for all zoom values in kbest using R (requires the ggplot2 library).
-
utils.matrix_op.keep_k_best(co_occ, k=200)
Keep the k best values in the matrix and set the rest to 0. Relies on the bottleneck library for fast sort.
Args:
co_occ (ndarray): input matrix.
k (int, optional): number of values to keep. Defaults to 200.
Returns:
normalized (ndarray): matrix in which only the k best values are kept.
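The idea behind keep_k_best can be sketched with plain NumPy. The version below uses `numpy.argpartition` in place of the bottleneck library, and assumes that "best" means "largest" and that the selection is made over the whole matrix rather than line by line; the name `keep_k_best_sketch` is hypothetical.

```python
import numpy as np


def keep_k_best_sketch(co_occ, k=200):
    """Return a copy of co_occ where all but the k largest entries are set to 0."""
    flat = co_occ.ravel()
    out = np.zeros(co_occ.shape, dtype=co_occ.dtype)
    if k <= 0:
        return out
    if k >= flat.size:
        return co_occ.copy()
    # argpartition puts the indices of the k largest values in the last k slots
    # without fully sorting the array.
    top = np.argpartition(flat, flat.size - k)[flat.size - k:]
    out.flat[top] = flat[top]
    return out


m = np.array([[1, 5, 3], [9, 2, 7]])
print(keep_k_best_sketch(m, k=2))
```

A partial sort of this kind avoids the cost of a full sort, which matters when the similarity matrix is large. Ties at the threshold are broken arbitrarily.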
-
utils.matrix_op.normalize(co_occ)
Returns a normalized version of the input matrix.
Args:
co_occ (ndarray): co-occurrence matrix.
Returns:
normalized (ndarray): a normalized version of the co-occurrence matrix.
-
utils.matrix_op.normalize_gauss_global(co_occ)
Normalize the full matrix with respect to its global standard deviation and mean (X <- (X - mean) / std).
Args:
co_occ (ndarray): input matrix.
Returns:
normalized (ndarray): normalized matrix.
-
utils.matrix_op.normalize_gauss_local(co_occ)
Normalize the matrix line by line, each line with respect to its own standard deviation and mean (X <- (X - mean) / std).
Args:
co_occ (ndarray): input matrix.
Returns:
normalized (ndarray): normalized matrix.
-
utils.matrix_op.normalize_min_max(co_occ)
Normalize the full matrix globally with respect to its minimum and maximum value (X <- (X - min) / (max - min)).
Args:
co_occ (ndarray): input matrix.
Returns:
normalized (ndarray): normalized matrix.
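The three formulas above translate directly into NumPy. The functions below are illustrative re-implementations written from the formulas in the descriptions (hence the `_sketch` suffix); they do not handle degenerate inputs such as a constant matrix, where the denominator is zero.

```python
import numpy as np


def normalize_gauss_global_sketch(co_occ):
    # X <- (X - mean) / std, using the statistics of the whole matrix.
    return (co_occ - co_occ.mean()) / co_occ.std()


def normalize_gauss_local_sketch(co_occ):
    # Same formula, with one mean and one standard deviation per line (row).
    mean = co_occ.mean(axis=1, keepdims=True)
    std = co_occ.std(axis=1, keepdims=True)
    return (co_occ - mean) / std


def normalize_min_max_sketch(co_occ):
    # X <- (X - min) / (max - min), using the global extrema.
    lowest, highest = co_occ.min(), co_occ.max()
    return (co_occ - lowest) / (highest - lowest)


m = np.array([[0.0, 2.0, 4.0], [1.0, 1.5, 7.0]])
scaled = normalize_min_max_sketch(m)
global_z = normalize_gauss_global_sketch(m)
local_z = normalize_gauss_local_sketch(m)
```

After min-max scaling all values lie in [0, 1]; after the Gaussian variants the matrix (or each line) has zero mean and unit standard deviation.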
-
utils.matrix_op.statistical_analysis_binary(line, rnd_distrib, ground_truth, temp_folder, output_folder, name, suffix=None)
For binary SIC: plots the similarity distribution and its given theoretical model (Poisson Binomial).
- Args:
line (list): similarities for the considered sample x.
rnd_distrib (list): values of the discrete random case distribution (k -> P(X = k)).
ground_truth (list): indices of the samples in the same class as x.
temp_folder (str): path to the folder containing the temporary files.
output_folder (str): path to the output folder.
name (str): string representation of the considered sample.
suffix (str, optional): additional suffix for the output file. Defaults to None (no suffix).
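To make the shape of rnd_distrib concrete, the sketch below computes a Poisson Binomial probability mass function, i.e. the list k -> P(X = k) where X is a sum of independent Bernoulli trials with possibly different success probabilities. How the project derives those per-trial probabilities is not stated here, so the `probs` argument and the name `poisson_binomial_pmf` are assumptions.

```python
import numpy as np


def poisson_binomial_pmf(probs):
    """Return [P(X = 0), ..., P(X = n)] for X = sum of independent Bernoulli(probs[i])."""
    pmf = np.array([1.0])
    for p in probs:
        # Adding one Bernoulli trial convolves the current pmf with [1 - p, p].
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf


# Two fair trials: P(X=0) = 0.25, P(X=1) = 0.5, P(X=2) = 0.25.
rnd_distrib = poisson_binomial_pmf([0.5, 0.5])
```

The convolution approach costs O(n^2) for n trials, which is usually acceptable for a number of trials equal to the number of classification steps.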
-
utils.matrix_op.statistical_analysis_weighted(line, N, ground_truth, temp_folder, output_folder, name, step=1.0)
Plots the similarity distribution and Gaussian theoretical distribution in the case of a weighted similarity (negative samples).
- Args:
line (list): similarities for the considered sample x.
N (int): standard deviation of the Gaussian.
ground_truth (list): indices of the samples in the same class as x.
temp_folder (str): path to the folder containing the temporary files.
output_folder (str): path to the output folder.
name (str): string representation of the considered sample.
step (float, optional): step between the x-axis ticks.
utils.one_step module
one_step.py. One classification step for building the similarity. Designed for a threaded execution.
-
utils.one_step.split_data(n, data, data_type, train_frac, classifier_type, annotation_params, temp_folder, with_full_test=False)
Splits the given database into a training and testing set.
- Args:
n (int): iteration identifier.
data (list): initial data structure.
data_type (str): data set used for the experiments.
train_frac (float): proportion of the database to keep for training.
classifier_type (str): type of classifier (for on-the-fly parsing format in AQUAINT).
annotation_params (str): annotation parameter for OVA.
temp_folder (str): path to temporary folder.
with_full_test (bool, optional): if True, use the whole dataset (including training) for testing.
- Returns:
train (list): the data kept for training (generator).
test (list): the data kept for testing (generator).
test_indices (list): indices of the test samples in the whole data; used to compute the number of test occurrences afterwards (except for AQUAINT, where test_indices directly returns the occurrences of each sample in the test set).
-
utils.one_step.thread_step(n, coocc_matrix, unit_cell_length, n_samples, sim_queue, data, temp_folder, data_type, annotation_params, classification_params, verbose=1, debug=False, with_unique_occurrences=False, preclustering=[])
One classification iteration. Results are output in a queue.
- Args:
n (int): index of current iteration.
coocc_matrix (array): similarity matrix (shared in memory).
unit_cell_length (int): length of one locked cell in the matrix.
n_samples (int): number of samples in the data set.
sim_queue (Queue): output queue.
data (list): initial data structure.
temp_folder (str): path to temporary folder.
data_type (str): data set used.
annotation_params (dict): parameters for the synthetic annotation.
classification_params (dict): parameters for the supervised classification algorithm.
verbose (int, optional): controls the verbosity level. Defaults to 1.
debug (bool, optional): runs in debugging mode. Defaults to False.
with_unique_occurrences (bool, optional): True when occurrences of a same entity are distinct items in the database (e.g. NER). Defaults to False.
preclustering (list, optional): entity index to class mapping. This clustering is used to give the same annotation to entities in the same class.
utils.opt module
opt.py. Functions designed for vectorial updates of the similarity matrix instead of cell by cell.
-
utils.opt.cartesian(arrays, out=None, numpy=True)
Generate a cartesian product of input arrays.
- Parameters:
- arrays : list of array-like. 1-D arrays to form the cartesian product of.
- out : ndarray. Array to place the cartesian product in.
- Returns:
- out : ndarray. 2-D array of shape (M, len(arrays)) containing cartesian products formed of input arrays.
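A cartesian product with the output shape described above, (M, len(arrays)), can be obtained with `numpy.meshgrid`. This is only one possible way to build it, shown as a sketch under the name `cartesian_sketch`; it ignores the `out` and `numpy` parameters of the documented function.

```python
import numpy as np


def cartesian_sketch(arrays):
    """Return an (M, len(arrays)) array with one row per combination."""
    # With indexing="ij", grid number d varies along axis d only.
    grids = np.meshgrid(*arrays, indexing="ij")
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))


print(cartesian_sketch(([1, 2, 3], [4, 5])))
```

Rows are ordered with the last input array varying fastest, so the example starts with (1, 4), (1, 5), (2, 4).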
-
utils.opt.cartesian_prod(arrays, full=None, out=None, numpy=True)
Generates the products for all possible cartesian combinations of the arrays components.
- Args:
arrays: list of array-like.
full (int): length of the base array. Starting value should be None.
out (ndarray): array where to put the result.
numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
- Returns:
out (ndarray): array containing the products for all possible cartesian combinations.
-
utils.opt.pairs_combination(a, numpy=True)
Returns all the possible pair combinations from an index array.
- Args:
a: array-like.
numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing all the possible pair combinations from an index array.
-
utils.opt.pairs_combination_indices(a, n_samples, numpy=True)
Returns all the possible pair combinations from an index array, under their index form (upper triangle matrix).
- Args:
a: array-like.
n_samples: size of the 2D matrix (i.e. max possible value of the index + 1).
numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing the indices of all possible pair combinations.
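The two pair helpers can be pictured as follows. The sketch enumerates unordered pairs with `numpy.triu_indices`, then encodes each pair (r, c), r < c, as the flat row-major index r * n_samples + c of the upper triangle. This encoding is an assumption made for illustration (the module may use a different index form), and the names `pair_rows` and `pair_flat_indices` are hypothetical.

```python
import numpy as np


def pair_rows(a):
    """All unordered pairs (a[i], a[j]) with i < j, one pair per row."""
    a = np.asarray(a)
    i, j = np.triu_indices(len(a), k=1)
    return np.stack([a[i], a[j]], axis=1)


def pair_flat_indices(a, n_samples):
    """Flat index r * n_samples + c of each pair, with r < c (assumed encoding)."""
    # Sorting first guarantees r < c, i.e. a cell of the upper triangle.
    pairs = pair_rows(np.sort(a))
    return np.ravel_multi_index((pairs[:, 0], pairs[:, 1]), (n_samples, n_samples))


print(pair_rows([0, 2, 3]))
print(pair_flat_indices([0, 2, 3], n_samples=4))
```

Working with flat indices lets all cells of the similarity matrix that correspond to co-classified entities be updated in one vectorized operation instead of cell by cell.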
-
utils.opt.product_combination(a, numpy=True)
Returns all the possible pair product combinations from an index array.
- Args:
a: array-like.
numpy (bool, optional): If False, convert the arrays to numpy type. Defaults to True.
- Returns:
out: array containing all the possible pair product combinations from an index array.
utils.output_format module
output_format.py. Functions related to the format of the program outputs.
-
class utils.output_format.Tee(*files)
Bases: object
Tee object used for linking several files (used for linking stdout to log file).
-
flush()
-
write(obj)
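A tee object of this kind is typically a thin wrapper that forwards each call to every underlying file. The class below is a minimal sketch of that pattern with the same constructor and method names as listed above; its body is an assumption, and `TeeSketch` and `log_buffer` are illustrative names.

```python
import io
import sys


class TeeSketch(object):
    """Duplicate every write to several file-like objects."""

    def __init__(self, *files):
        self.files = files

    def write(self, obj):
        for f in self.files:
            f.write(obj)

    def flush(self):
        for f in self.files:
            f.flush()


log_buffer = io.StringIO()  # stands in for an open log file
tee = TeeSketch(sys.stdout, log_buffer)
tee.write("step 1 done\n")
tee.flush()
```

Assigning such an object to `sys.stdout` makes every subsequent `print` appear both on the console and in the log.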
-
-
utils.output_format.clustering_to_file(output_folder, clustering, suffix=None)
Writes a clustering in a file.
- Args:
output_folder (str): path to the output folder.
clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
- Returns:
output_path (str): path to the file in which the clustering was written.
-
utils.output_format.clustering_to_string(clustering)
Return a clustering as a string with entities separated by tabulations and clusters by newlines ('\n').
- Args:
clustering (dict): clustering represented as a dictionary mapping a cluster to its elements.
- Returns:
clustering_to_string (str): string representation of the clustering.
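The string layout described above (one cluster per line, entities separated by tabs) can be sketched in a few lines. The function name `clustering_to_string_sketch` and the example clustering are hypothetical, and the order of the lines simply follows the iteration order of the dictionary.

```python
def clustering_to_string_sketch(clustering):
    """Join the entities of each cluster with tabs and the clusters with newlines."""
    return "\n".join(
        "\t".join(str(entity) for entity in entities)
        for entities in clustering.values()
    )


# Two clusters holding entity indices.
text = clustering_to_string_sketch({"c0": [0, 3], "c1": [1, 2, 4]})
```

Here `text` contains two lines: the first with entities 0 and 3, the second with entities 1, 2 and 4.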
-
utils.output_format.init_folder(path)
Create the directory at path if it does not already exist.
- Args:
path (str): path of the directory to initialize.
-
utils.output_format.load_cooc(mat)
Load and format, if needed, the given similarity matrix.
- Args:
mat (str): path to the similarity matrix (txt, npy or pickle).
-
utils.output_format.log(output_folder)
Create a file for redirecting the output.
- Args:
output_folder (str): path to root of directory for outputs.
-
utils.output_format.readable_clustering(output_folder, clustering, index_to_label, suffix=None)
Outputs a clustering in a more readable format.
- Args:
output_folder (str): path to the output folder.
clustering (dict): clustering represented as a dictionary mapping a cluster to the list of its elements.
index_to_label (list): mapping from an entity index to a string label.
suffix (str, optional): if given, this is added as a suffix to the name of the output file. Defaults to None.
- Returns:
output_file (str): path to the file now containing the clustering under readable format.
-
utils.output_format.save_coocc(output_folder, coocc, suffix=None, type='binary')
Saves the current co-occurrence matrix.
- Args:
output_folder (str): path to root of output folder.
n (int): step of the matrix.
coocc (ndarray): co-occurrence matrix.
type (str): output file format (text, binary or pickle).
- Returns:
output_file (str): path to the file now containing the matrix.
-
utils.output_format.save_coocc_mcl(output_folder, coocc, index_to_label)
Save current co-occurrence matrix (only non-zero entries) in the label format of MCL.
- Args:
output_folder (str): path to root of output folder.
coocc (ndarray): co-occurrence matrix.
index_to_label (list): mapping from an entity index to a readable label.
- Returns:
output_file (str): path to the file now containing the matrix.
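As an illustration of the export, the sketch below produces one line per non-zero entry, with the two entity labels followed by the similarity value. The three-column, tab-separated layout is an assumption about what "label format" means here, and `mcl_label_lines` and the example labels are hypothetical.

```python
import numpy as np


def mcl_label_lines(coocc, index_to_label):
    """Yield 'label_i<TAB>label_j<TAB>value' for every non-zero entry of coocc."""
    rows, cols = np.nonzero(coocc)
    for i, j in zip(rows, cols):
        yield "%s\t%s\t%s" % (index_to_label[i], index_to_label[j], coocc[i, j])


coocc = np.array([[0.0, 0.5], [0.5, 0.0]])
lines = list(mcl_label_lines(coocc, ["Paris", "London"]))
```

Skipping zero entries keeps the output proportional to the number of actual links rather than to the square of the number of entities.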
utils.parse module
parse.py. Functions for parsing data files and ground-truth file for a given data set. The module can also be used as a script to generate a .qrel version of the ground-truth for the information retrieval evaluation procedure.
-
utils.parse.ground_truth_AQUA_qrel(ground_truth_file, output_file, aqua_entities_file)
Builds the qrel file for the AQUAINT ground-truth to be used for the nearest neighbour evaluation.
- Args:
ground_truth_file (str): path to the AQUAINT ground-truth file.
output_file (str): path to the qrel output file.
aqua_entities_file (str): path to the file listing all AQUAINT entities.
-
utils.parse.ground_truth_indices(data_type, ground_truth_file, label_to_index=None)
Return indices of the entities that are in the ground truth. Only used for AQUAINT as for NER, samples = ground_truth.
- Args:
data_type (str): dataset identifier.
ground_truth_file (str): path to the ground-truth file.
label_to_index (list): maps a label to the corresponding entity index. Required for Aquaint.
- Returns:
indices (list): indices of samples in ground-truth.
-
utils.parse.ground_truth_pairs(data_type, ground_truth_file, n_samples)
Retrieves indices of sample pairs that have the same ground-truth class. Indices are computed as if using the upper triangle of a square n_samples x n_samples matrix.
- Args:
data_type (str): data set.
ground_truth_file (str): ground-truth file for the given data set.
n_samples (int): number of samples for the given data set.
- Returns:
indices (int list): list of indices of sample pairs with the same ground-truth class.
-
utils.parse.ground_truth_qrel(ground_truth_file, output_file, index_to_label)
Builds the qrel file for NER and AUDIO ground-truth to be used for the nearest neighbour evaluation.
- Args:
ground_truth_file (str): path to the AQUAINT ground-truth file.
output_file (str): path to the qrel output file.
index_to_label (list): maps an entity index to the corresponding string label.
-
utils.parse.parse_AQUA_entities(entities_file)
Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.
- Args:
entities_file (str): file containing the retrieved entities and number of occurrences.
- Returns:
index_to_label (list): list associating a sample's index with its string representation.
label_to_index (list): reversed mapping of index_to_label.
-
utils.parse.parse_AQUA_single_partial(classifier_type, data_file, label_to_index, training_size, testing_size, train_acc, test_acc, test_occurences, train_included_test=False)
Read and partially parse the given Aquaint2 data file. Contrary to parse_AQUA_single, this function does not load the whole data, but directly builds the required training and testing sets.
- Args:
classifier_type (str): type of the classifier that will be used for the experiment.
data_file (str): path to the data file.
label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
training_size (int): number of docs for training from this document, or a list of docs indices + sentences to keep.
testing_size (int): number of docs for testing from this document, or a list of docs indices + sentences to keep.
train_acc (iterator): accumulator for training sentences.
test_acc (iterator): accumulator for testing sentences.
test_occurences (array): accumulator for the number of occurrences of each word in the test database.
train_included_test (bool, optional): if True, retrieved training sentences will also be included in the testing set.
- Returns:
train_acc (iterator): updated training sentences accumulator.
test_acc (iterator): updated testing sentences accumulator.
-
utils.parse.parse_NER(classifier_type, data_file)
Reads and parses the given data file for the Named Entity Recognition (NER) task.
- Args:
classifier_type (str): type of the classifier that will be used for the experiments.
data_file (str): path to data file.
- Returns:
n_samples (int): number of samples in the database.
data (list): structure containing the data. A word is usually represented by a tuple (index of the sample if interesting entity, word representative of the sample, additional tags, B(I)O tag).
data_occurrences (list): number of occurrences of each word in each sentence.
index_to_label (list): list associating a sample's index with its string representation.
summary (str): additional information on the dataset.
-
utils.parse.parse_audio(data_folder, selected_features)
Parse features selection for the Audio task.
- Args:
data_folder (str): if the features are precomputed, then data_folder is the path to the directory containing the features. Otherwise, it is a file containing the samples folder as its first line and all the possible HTK features markers (one per line) to consider.
selected_features (list): list of the feature types to use in the experiments (given in the configuration file).
- Returns:
n_samples (int): number of samples in the dataset.
data (list): maps a feature identifier to the corresponding HTK generated features.
index_to_label (list): maps an entity index to a string label.
-
utils.parse.parse_data(data_type, classification_params, data_file)
Parse data file(s) depending on the chosen options.
- Args:
data_type (str): dataset identifier.
classification_params (str): classifier parameters.
data_file (str): path to data file.
- Returns:
n_samples (int): number of samples in the database.
data (list): structure containing the data.
data_occurrences (list): number of entity occurrences in each sentence/docs of the data (only for AQUAINT and NER).
index_to_label (list): list associating a sample's index with its string representation.
label_to_index (list): reversed mapping of index_to_label.
-
utils.parse.parse_ground_truth(data_type, ground_truth_file, label_to_index=None)
Reads and parses the given ground-truth file.
- Args:
data_type (str): dataset identifier.
ground_truth_file (str): path to the ground truth file.
label_to_index (dict, optional): maps an AQUAINT word to its index. Required when outputting the AQUAINT ground truth with the entity indices rather than string representation.
- Returns:
ground_truth (list): list associating a sample with its ground-truth cluster:
- for NER, ground_truth: cluster (str) -> entity indices (int list)
- for AQUA, ground_truth: entity (str) -> entity indices (int list) if label_to_index, else str list
-
utils.parse.parse_pattern(classifier_type, pattern_file)
Reads and parses the given pattern file (Wapiti/CRF++ expected format).
- Args:
classifier_type (str): type of the classifier that will be used for the experiment.
pattern_file (str): path to the pattern file.
- Returns:
features (list): features organized by category.
distrib (list): probability of sampling a feature for each category.
-
utils.parse.split_on_ground_truth_no_indices(data_type, ground_truth_file, numb=6, keys=None)
Given the ground-truth data file, returns a random set of entities for each class (usually used to visualize similarity distributions).
- Args:
data_type (str): dataset identifier.
ground_truth_file (str): path to the ground truth file.
numb (int): number of entities to plot for each class.
keys (list, optional): if given, the algorithm returns a list of samples whose ground-truth classes form the keys list.
- Returns:
ground_truth (list): list associating a sample with its ground-truth cluster.
selected_entities (list): selected entities to be plotted.
utils.parse_stat module
parse_stat.py. Additional functions for parsing some statistics and precomputing information on the data sets (mostly Aquaint).
-
utils.parse_stat.count_aqua_docs(directory, output_folder, aqua_entities_file)
Counts the number of occurrences of each word in each document as well as for the whole data set.
- Args:
directory (str): directory containing all xml documents of the dataset.
output_folder (str): path to output directory.
aqua_entities_file (str): path to the file listing all AQUAINT entities.
-
utils.parse_stat.count_aqua_docs_score(directory, output_file, aqua_entities_file)
Computes a score for each document based on how rare the words occurring in the document are.
- Args:
directory (str): path to the directory containing the Aquaint data files.
output_file (str): path to the file to write the output scores.
aqua_entities_file (str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.parse_AQUA(classifier_type, data_folder, label_to_index)
Read and parse all data files from the AQUAINT folder.
- Args:
classifier_type (str): type of the classifier that will be used for the experiment.
data_folder (str): path to folder containing all data files.
label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
- Returns:
data (list): structure containing the data (file -> docs -> sentence -> words).
data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
summary (str): additional information on the dataset.
-
utils.parse_stat.parse_AQUA_entities(entities_file)
Parse the file containing all entities of the AQUAINT2 dataset to build the index_to_label and label_to_index mappings.
- Args:
entities_file (str): file containing the retrieved entities and number of occurrences.
- Returns:
index_to_label (list): list associating a sample's index with its string representation.
label_to_index (list): reversed mapping of index_to_label.
-
utils.parse_stat.parse_AQUA_single(classifier_type, data_file, label_to_index)
Read and parse the full given AQUAINT data file.
- Args:
classifier_type (str): type of the classifier that will be used for the experiment.
data_file (str): path to the data file.
label_to_index (str): maps a word to an integer index (alphabetical order). Used to map multiple occurrences of a same word to the same index.
- Returns:
data (list): structure containing the data (file -> docs -> sentence -> words).
data_occurrences (list): number of entity occurrences in each sentence/docs of the data.
-
utils.parse_stat.retrieve_aqua_entities(directory, output_file)
Retrieves all interesting entities for the AQUAINT2 dataset (common names with strictly more than 10 occurrences).
- Args:
directory (str): directory containing all xml documents of the dataset.
output_file (str): path where to output the retrieved entities and number of occurrences.
-
utils.parse_stat.retrieve_aqua_occurrences(directory, output_file, aqua_entities_file)
Retrieve all occurrences of each word in the dataset (position of each occurrence given as a tuple file -> doc -> sentence).
- Args:
directory (str): path to the directory containing the Aquaint data files.
output_file (str): path to the directory to write the output files (1 file = 1 word).
aqua_entities_file (str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.retrieve_aqua_occurrences_sentences(directory, output_file, aqua_entities_file)
Same as parse_stat.retrieve_aqua_occurrences, but outputs the sentence of each occurrence rather than its position.
- Args:
directory (str): path to the directory containing the Aquaint data files.
output_file (str): path to the file to write the output scores.
aqua_entities_file (str): path to the file containing all Aquaint entities and their number of occurrences.
-
utils.parse_stat.stat_aqua(directory)
Computes some statistics about the AQUAINT dataset.
- Args:
directory (str): directory containing all xml documents of the dataset.
utils.plot module¶
plot.py. Functions related to plotting and data visualization.
-
utils.plot.
compare_params
(true_params, em_params, ip0, output_folder)¶ Plots several visual comparisons of the parameter estimates.
- Args:
true_params
(list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
em_params
(dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
ip0
(list): estimates of the p0 parameters under the independence assumption.
output_folder
(str): path to the output folder.
-
utils.plot.
convergence_curve
(output_folder, log_file)¶ Plots the evolution of the correlation coefficients for the convergence experiments.
- Args:
output_folder
(str): path to the directory for the outputs.
log_file
(str): path to the log file output during the experiments (convergence_analysis.py script).
-
utils.plot.
expected_binary
(true_params, em_params, tpi0, epi0, output_folder)¶ Plots the 2-component Poisson Binomial mixture model given its parameters.
- Args:
true_params
(list): ground-truth p0 (true_params[0]) and p1 (true_params[1]) parameters.
em_params
(dict): EM estimates of p0 (em_params[0]) and p1 (em_params[1]) parameters.
tpi0
(float): ground-truth pi0 value.
epi0
(float): EM pi0 estimate.
output_folder
(str): path to the output folder.
-
utils.plot.
fraction_plot
(clustering, ground_truth, output_name)¶ Plots a histogram representation of a clustering with colors representing the ground-truth clustering.
- Args:
clustering
((cluster -> values) dict): a dict representation of a clustering.
ground_truth
((cluster -> values) dict): a dict representation of the ground-truth clustering.
output_name
(str): prefix of the file in which to output the figure.
-
utils.plot.
get_Aquaint_graph
(start, synonyms, nodes, edges, level, maxlevel)¶ Returns a networkx graph built from an excerpt of the Aquaint ground-truth. Recursive function.
- Args:
start
(list): list of nodes to build edges from.
synonyms
(dict): Aquaint ground-truth neighbour relations.
nodes
(dict): nodes already built, organized by depth (minimum depth relative to one of the starting nodes).
edges
(list): list of edges already built.
level
(int): current depth (starting at 0).
maxlevel
(int): max depth to consider.
-
utils.plot.
heatmap
(similarity_matrix, ground_truth, output_folder)¶ Plots a heat map of the similarity matrix.
- Args:
sim_matrix
(ndarray): similarity matrix.
ground_truth
(dict): ground-truth clustering.
output_folder
(str): path to the output folder.
-
utils.plot.
histo_cluster
(clustering, output_name)¶ Plots a histogram representation of a clustering (number of samples per cluster).
- Args:
clustering
((cluster -> values) dict): a dict representation of a clustering.
output_name
(str): prefix of the file in which to output the figure.
-
utils.plot.
is_in_upper_level
(nodes, word, level)¶ Determines whether a node has already been seen as a closer neighbour (level).
- Args:
nodes
(dict): dict mapping a level to the nodes it contains.
word
(str): label of the node to consider.
level
(int): current level.
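A check of this kind can be sketched in a few lines. The version below is an assumption about the behaviour, not the library's code: it supposes that levels are consecutive integers starting at 0 and that "closer" means strictly below the current level. The name `seen_at_closer_level` is hypothetical.

```python
def seen_at_closer_level(nodes, word, level):
    """Hypothetical sketch: True if `word` appears at any depth strictly below `level`."""
    # `nodes` maps a depth (int) to the collection of node labels found at that depth.
    return any(word in nodes.get(depth, ()) for depth in range(level))


# "bank" was reached at depth 0, "river" at depth 1.
nodes = {0: ["bank"], 1: ["river"]}
already_seen = seen_at_closer_level(nodes, "bank", 1)
```

Such a test lets a recursive graph construction skip a word that was already attached closer to a starting node, so each node keeps its minimum depth.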
-
utils.plot.
mds_representation
(sim_matrix, ground_truth, index_to_label, colors, output_folder, dim=2, cores=20, mode='mds')¶ Compute Euclidean distances (MDS) from the computed similarity matrix.
- Args:
sim_matrix
(ndarray): similarity matrix.
ground_truth
(dict): ground-truth clustering.
colors
(dict): mapping from a class to a color.
output_folder
(str): path to the output folder.
dim
(int, optional): number of dimensions in the metric space. Defaults to 2.
cores
(int, optional): number of cores to use (threaded MDS).
mode
(str, optional): projection algorithm ('mds' or 'tsne').
-
utils.plot.
on_the_fly_cvg
(file)¶ Plot some statistics on the on-the-fly convergence criterion.
- Args:
file
(str): path to the file containing the measurements of the on-the-fly criterion
-
utils.plot.
pie_chart
(clusters, output_folder)¶ Plots a pie-chart representation of a clustering.
- Args:
clustering
((cluster -> values) dict): a dict representation of a clustering.
output_folder
(str): path to the output folder.
-
utils.plot.
plot_Aquaint_graph
(words, aqua_gt, level=1)¶ Plot an excerpt graph from the Aquaint ground-truth using the networkx library.
- Args:
words
(list): nodes to consider as origin.
aqua_gt
(str): path to the Aquaint ground-truth.
level
(int): max depth to consider (starting at 0).
utils.probability_fit module¶
probability_fit.py. Functions for estimating probability distributions.
-
utils.probability_fit.
estimate_poisson_binomial
(N, p_values)¶ Estimate the values of a Poisson Binomial distribution given its p-parameters.
- Args:
N
(int): number of independent Bernoulli experiments.
p_values
(list): Bernoulli parameter for each experiment.
- Returns:
values
(list): values taken by the distribution (k -> P(X = k)).
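One standard way to obtain these values is a dynamic-programming pass that adds one Bernoulli experiment at a time. The sketch below illustrates that technique only and is not necessarily what `estimate_poisson_binomial` does internally. The name `poisson_binomial_pmf` is hypothetical, and N is taken to be `len(p_values)` rather than passed as a separate argument.

```python
def poisson_binomial_pmf(p_values):
    """Hypothetical sketch: return [P(X = 0), ..., P(X = N)] for N = len(p_values)."""
    pmf = [1.0]  # With zero experiments, X = 0 with probability 1.
    for p in p_values:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - p)  # this experiment fails: count stays at k
            updated[k + 1] += mass * p      # this experiment succeeds: count becomes k + 1
        pmf = updated
    return pmf


# Two fair coins: the number of heads follows Binomial(2, 0.5).
values = poisson_binomial_pmf([0.5, 0.5])
```

The pass runs in O(N^2) time and returns a list indexed by k, which matches the `k -> P(X = k)` layout described for `values`.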
-
utils.probability_fit.
sample_from_pdf
(N, values)¶ Returns N samples from a given discrete distribution P.
- Args:
N
(int): number of samples to draw.
values
(list): values taken by the discrete distribution (k -> P(X = k)).
- Returns:
samples
(list): N samples drawn from the P distribution.
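Drawing from a discrete distribution given as a list of probabilities can be done with the standard library alone. The sketch below is one possible approach, not the module's implementation: it assumes that `values[k]` holds P(X = k) and that the samples are the integer outcomes k. The name `draw_samples` is hypothetical.

```python
import random


def draw_samples(n, values, rng=random):
    """Hypothetical sketch: draw `n` outcomes k with probability values[k]."""
    # random.choices performs weighted sampling with replacement.
    return rng.choices(range(len(values)), weights=values, k=n)


# Degenerate distribution: all the mass is on k = 1.
samples = draw_samples(5, [0.0, 1.0, 0.0])
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.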
utils.read_config module¶
read_config.py. Functions for setting the main parameters. Reads the configuration file in configuration.ini and the command line options.
-
utils.read_config.
get_config_option
(config, section, name, arg, type='str')¶ Returns the default configuration if arg (command line argument) is None, else returns arg.
- Args:
config
: current configuration object returned by ConfigParser.
section
(str): section of the considered argument.
name
(str): name of the considered argument.
arg
: value passed through command line for the considered argument.
type
(str, optional): type of the considered argument (str, int or float). Defaults to str.
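The fallback rule can be sketched with the standard `configparser` module. This is an illustration under assumptions, not the module's code: `option_or_default` is a hypothetical name, and the `General` section and its `N` key are invented for the example.

```python
from configparser import ConfigParser


def option_or_default(config, section, name, arg, type="str"):
    """Hypothetical sketch: prefer the command line value, else read the config file."""
    if arg is not None:
        return arg
    if type == "int":
        return config.getint(section, name)
    if type == "float":
        return config.getfloat(section, name)
    return config.get(section, name)


# Example configuration; the section and key names are assumptions.
config = ConfigParser()
config.read_string("[General]\nN = 100\n")

from_file = option_or_default(config, "General", "N", None, type="int")
from_cli = option_or_default(config, "General", "N", 5, type="int")
```

Here `from_file` falls back to the value stored in the configuration, while `from_cli` keeps the value given on the command line.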
-
utils.read_config.
parse_cmd_line
()¶ Parses the command line for options to override the default configuration parameters.
- Returns:
opt
(list): specified command line options to override the default config options:
N
(int): number of iterations (-N, --iter).
cores
(int): number of cores (-t, --threads).
data_type
(str): chosen dataset (-d, --dataset).
input
(str): chosen data file (-in, --input).
groundtruth
(str): chosen ground-truth file (-g, --groundtruth).
n_distrib
(str): type of annotation (-di, --distrib).
training_size
(float): training percentage, strictly between 0 and 1 (-ts, --trainsize).
nmin
(int): minimum number of synthetic labels (-nmin).
nmax
(int): maximum number of synthetic labels (-nmax).
cfg
(str): path to default configuration file.
verbose
(int): controls verbosity level (0 to 4).
debug
(bool): runs in debugging mode.
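A subset of these options can be declared with `argparse` as sketched below. This is an assumption about how such a parser might look, not the actual body of `parse_cmd_line`: `build_parser` is a hypothetical name, the long options are assumed to use a double dash, and only options whose flags are listed above are included. Every default is `None`, so that a missing option can later fall back to the configuration file.

```python
import argparse


def build_parser():
    """Hypothetical sketch of a parser for some of the documented options."""
    parser = argparse.ArgumentParser(description="Override default configuration values.")
    parser.add_argument("-N", "--iter", dest="N", type=int, default=None,
                        help="number of iterations")
    parser.add_argument("-t", "--threads", dest="cores", type=int, default=None,
                        help="number of cores")
    parser.add_argument("-ts", "--trainsize", dest="training_size", type=float,
                        default=None, help="training percentage, strictly between 0 and 1")
    parser.add_argument("-nmin", type=int, default=None,
                        help="minimum number of synthetic labels")
    parser.add_argument("-nmax", type=int, default=None,
                        help="maximum number of synthetic labels")
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
opts = build_parser().parse_args(["-N", "50", "-ts", "0.8", "-nmin", "2"])
```

Options that are absent from the command line (here `cores` and `nmax`) stay at `None`.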
-
utils.read_config.
read_config_file
(f, N=None, cores=None, data_type=None, input_file=None, ground_truth_file=None, n_distrib=None, training_size=None, nmin=None, nmax=None, classifier_type=None, similarity_type=None, task_type=None, cvg_step=None, cvg_criterion=None, output_folder=None, temp_folder=None, oar=False)¶ Reads and parses the default arguments in the configuration file.
- Args:
f
(str): path to the configuration file.
opt
(list): specified command line options (see result of read_config.parse_cmd_line).
- Returns:
N
(int): number of iterations.
cores
(int): number of cores to use.
locks
(int): number of independent locks to add on the similarity matrix.
data_type
(str): name of the dataset chosen for the experiments.
input_file
(str): path to default input file.
ground_truth_file
(str): path to default ground-truth file.
temp_folder
(str): path to the folder for temporary files (e.g. MCL input format file).
output_folder
(str): path to the folder for output files.
annotation_params
(dict): parameters for the synthetic annotation.
classification_params
(dict): parameters for the supervised classification algorithm (classifier type, training percentage, similarity type, additional parameters).
classifier_binary
(str): path to the binary for the classifier.
task_params
(list): parameters for the post-processing task (task type, algorithm binary, additional parameters).
cvg_step
(int): if step > 2, the convergence criterion will be evaluated every step iterations.
cvg_criterion
(float): value of the criterion on the mean entropies to stop the algorithm.
config
(ConfigParser): configuration object updated with the values in command line.