utils.annotation_scripts package
Submodules
utils.annotation_scripts.annotation_CRF module
annotation_CRF.py. Generation of synthetic annotations for wapiti CRF classifier.
utils.annotation_scripts.annotation_CRF.annotate_CRF_test(n, test, temp_folder)
Returns a wapiti-formatted version of the test entities for the CRF classifier.
- Args:
  n (int): step identifier.
  test (list): test entities.
  temp_folder: directory for temporary files.
- Returns:
  n_sentences_test (int): number of sequences in the test database.
  test_entities_indices (list): indices and identifiers of the entities of interest in the test database.
  test_file (str): path to the formatted test data.
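Wapiti reads sequence data as one token per line, with whitespace-separated columns and a blank line between sequences. The following is a minimal sketch of that layout, not this module's implementation. The helper name `to_wapiti` and the input shape (each sequence is a list of (token, label) pairs) are assumptions made for illustration.

```python
def to_wapiti(sequences, with_labels=True):
    """Serialise sequences in a token-per-line, blank-line-separated layout.

    Hypothetical helper: `sequences` is a list of sequences, each a list
    of (token, label) pairs. This input shape is an assumption.
    """
    blocks = []
    for sequence in sequences:
        lines = []
        for token, label in sequence:
            # The label, when present, goes in the last column.
            columns = [token, label] if with_labels else [token]
            lines.append(" ".join(columns))
        blocks.append("\n".join(lines))
    # A blank line separates consecutive sequences.
    return "\n\n".join(blocks) + "\n"


example = [[("Paris", "L1"), ("is", "O")], [("Hello", "O")]]
print(to_wapiti(example))
```

Passing `with_labels=False` drops the label column, which is one way to write data that has no reference annotation.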
utils.annotation_scripts.annotation_CRF.annotate_CRF_train(n, train, distrib, n_labels, to_annotate, temp_folder, preclustering)
Returns a synthetic annotation of the data for the (wapiti) CRF classifier.
- Args:
  n (int): iteration identifier.
  train (list): training entities.
  distrib (str): type of the synthetic annotation.
  n_min (int): minimum number of synthetic labels to use.
  n_max (int): maximum number of synthetic labels to use.
  to_annotate (list): for UNI annotation, list of indices of the entities that receive their own class.
  with_common_label_wordform (bool, optional): if True, all occurrences of the same entity wordform receive the same label. Defaults to False.
  temp_folder: path to the directory for temporary files.
- Returns:
  n_unique_labels_used (int): number of synthetic labels that were actually used.
  n_sentences_train (int): number of sequences in the training database.
  n_entities_train (int): number of entities in the training database.
  train_file (str): path to the formatted train data.
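The number of labels actually used can be lower than the number requested, for instance when labels are drawn at random or shared between occurrences of one wordform. The sketch below illustrates both effects. It assumes a uniform random draw, which the source does not confirm, and `assign_synthetic_labels` with all of its parameters is a hypothetical helper rather than part of this module.

```python
import random


def assign_synthetic_labels(wordforms, n_labels,
                            common_label_per_wordform=False, seed=0):
    """Draw one synthetic label index per entity occurrence.

    Hypothetical helper: `wordforms` holds the surface form of each
    entity occurrence in corpus order. The uniform draw is an assumption.
    """
    rng = random.Random(seed)
    label_of_wordform = {}
    labels = []
    for wordform in wordforms:
        if common_label_per_wordform:
            # Reuse the label drawn the first time this wordform was seen.
            if wordform not in label_of_wordform:
                label_of_wordform[wordform] = rng.randrange(n_labels)
            labels.append(label_of_wordform[wordform])
        else:
            labels.append(rng.randrange(n_labels))
    # Second value: how many distinct labels were really used.
    return labels, len(set(labels))


occurrences = ["Paris", "Obama", "Paris", "IBM", "Obama"]
labels, n_unique_labels_used = assign_synthetic_labels(
    occurrences, n_labels=4, common_label_per_wordform=True
)
print(labels, n_unique_labels_used)
```

With the common-label option on, the two occurrences of "Paris" always share a label, so at most three distinct labels can appear here even though four were requested.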
utils.annotation_scripts.annotation_DT module
annotation_DT.py. Generation of synthetic annotations for weka Decision Tree J48 classifiers.
utils.annotation_scripts.annotation_DT.annotate_DT_test(n, test, features, fake_class)
Returns a weka-formatted version of the test entities for the DT classifier.
- Args:
  n (int): step identifier.
  test (list): test entities.
  features (list): pattern for the feature selection.
  fake_class: fake weka class to give to all test entities for the weka format.
- Returns:
  n_sentences_test (int): number of sequences in the test database.
  test_entities (list): indices and identifiers of the entities of interest in the test database.
  test_file (str): path to the formatted test data.
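Weka's ARFF format declares every attribute, including the class, in a header, and each data row then carries one value per declared attribute. Giving unlabeled test entities a placeholder class is one way to satisfy this, which appears to be the role of `fake_class`. The sketch below shows the idea. `to_arff`, the attribute names, and the label values are hypothetical.

```python
def to_arff(relation, feature_names, class_values, rows):
    """Build ARFF text with nominal attributes and a final nominal class.

    Hypothetical helper: `rows` is a list of (feature_values, class_value)
    pairs. Attribute domains are collected from the rows themselves.
    """
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(feature_names):
        domain = sorted({features[i] for features, _ in rows})
        lines.append("@attribute %s {%s}" % (name, ",".join(domain)))
    lines.append("@attribute class {%s}" % ",".join(class_values))
    lines += ["", "@data"]
    for features, class_value in rows:
        lines.append(",".join(list(features) + [class_value]))
    return "\n".join(lines) + "\n"


FAKE_CLASS = "L0"  # placeholder label; the value is an assumption
test_features = [("cap", "noun"), ("low", "verb")]
# Every test entity receives the same placeholder class.
test_rows = [(features, FAKE_CLASS) for features in test_features]
arff_text = to_arff("entities", ["shape", "pos"], ["L0", "L1"], test_rows)
print(arff_text)
```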
utils.annotation_scripts.annotation_DT.annotate_DT_train(n, train, distrib, n_labels, features, to_annotate, temp_folder, preclustering)
Returns a synthetic annotation of the data (train + test) for the (weka) DT classifier.
- Args:
  n (int): step identifier.
  train (list): training entities.
  distrib (str): type of the synthetic annotation.
  n_min (int): minimum number of synthetic labels to use.
  n_max (int): maximum number of synthetic labels to use.
  features (list): pattern for the feature selection.
  to_annotate (list): for UNI annotation, list of indices of the entities that receive their own class.
  with_common_label_wordform (bool, optional): if True, all occurrences of the same entity wordform receive the same label. Defaults to False.
  temp_folder: directory for temporary files.
- Returns:
  N (int): random max number of synthetic labels for this step.
  n_unique_labels_used (int): number of synthetic labels that were actually used.
  n_sentences_train (int): number of sequences in the training database.
  n_entities_train (int): number of entities in the training database.
  train_file (str): path to the formatted train data.
utils.annotation_scripts.annotation_DT.weka_compatible_string(s)
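The source gives no description for this function. As background only: in ARFF, a value that contains spaces, commas, braces, or quotes has to be quoted to be read as a single token. The sketch below shows a simplified version of that quoting. `arff_safe` is a hypothetical name, and nothing here is known to match what `weka_compatible_string` actually does.

```python
# Characters that force quoting in this simplified sketch.
ARFF_SPECIAL = set(" ,{}%'\"\\\t")


def arff_safe(s):
    """Hypothetical helper: quote a value so it parses as one ARFF token."""
    if s and not any(ch in ARFF_SPECIAL for ch in s):
        return s  # plain values are left untouched
    # Escape backslashes first, then single quotes, then wrap in quotes.
    escaped = s.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


print(arff_safe("Paris"))
print(arff_safe("New York"))
```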
utils.annotation_scripts.annotation_DT.weka_format(n, wekadata, train_length, test_length, temp_folder, verbose)
Formats the input weka file to be compatible with the J48 tree and splits it into a train file and a test file.
- Args:
  n (int): iteration identifier.
  wekadata (str): input weka file.
  train_length (int): number of attributes in train.
  test_length (int): number of attributes in test.
  temp_folder (str): path to temporary folder.
  verbose (int): verbosity level.
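Splitting one ARFF file into a train file and a test file can be done by copying the header into both and dividing the data rows. The sketch below works on text rather than on files and treats `train_length` and `test_length` as row counts, which is an assumption about their meaning. `split_arff` is a hypothetical helper, not the function documented above.

```python
def split_arff(arff_text, train_length, test_length):
    """Split ARFF text into train and test texts that share one header.

    Assumes the first `train_length` data rows are training instances
    and the next `test_length` rows are test instances.
    """
    lines = arff_text.splitlines()
    # Locate the @data marker (ARFF keywords are case-insensitive).
    data_start = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    ) + 1
    header = lines[:data_start]
    rows = [line for line in lines[data_start:] if line.strip()]
    if len(rows) != train_length + test_length:
        raise ValueError("row count does not match train_length + test_length")
    train = header + rows[:train_length]
    test = header + rows[train_length:]
    return "\n".join(train) + "\n", "\n".join(test) + "\n"


combined = """@relation entities
@attribute shape {cap,low}
@attribute class {L0,L1}
@data
cap,L1
low,L0
cap,L0
"""
train_text, test_text = split_arff(combined, train_length=2, test_length=1)
print(test_text)
```

Keeping an identical header in both outputs means the train and test files declare the same attributes and class values.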
utils.annotation_scripts.annotation_HTK module
annotation_HTK.py. Generation of synthetic annotations for HTK HMM classifiers.
utils.annotation_scripts.annotation_HTK.annotate_HTK_test(n, test)
Returns an HTK-formatted version of the test entities for the HTK classifier.
- Args:
  n (int): step identifier.
  test (list): test entities.
  features (list): pattern for the feature selection.
  fake_class: fake weka class to give to all test entities for the weka format.
- Returns:
  n_sentences_test (int): number of sequences in the test database.
  test_entities (list): indices and identifiers of the entities of interest in the test database.
  test_file (str): path to the formatted test data.
utils.annotation_scripts.annotation_HTK.annotate_HTK_train(n, train, distrib, n_labels, temp_folder, preclustering)
Returns a synthetic annotation of the data (train + test) for the (HTK) HMM classifier.
- Args:
  n (int): step identifier.
  train (list): training entities.
  distrib (str): type of the synthetic annotation.
  n_labels (int): random max number of synthetic labels for this step.
- Returns:
  n_unique_labels_used (int): number of synthetic labels that were actually used.
  n_entities_train (int): number of entities in the training database.
  train_file (str): path to the formatted train data.
  mlf (str): HTK master label file.
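An HTK master label file starts with the line `#!MLF!#`, then lists each label file as a quoted pattern followed by one label per line and a closing line holding a single period. The sketch below builds such text. `to_mlf` and the sequence names are hypothetical, and the layout this module writes may differ in detail.

```python
def to_mlf(transcriptions):
    """Build the text of an HTK master label file.

    Hypothetical helper: `transcriptions` maps a sequence name to its
    list of labels, in order.
    """
    lines = ["#!MLF!#"]
    for name, labels in transcriptions.items():
        lines.append('"*/%s.lab"' % name)  # pattern naming the label file
        lines.extend(labels)               # one label per line
        lines.append(".")                  # terminates this entry
    return "\n".join(lines) + "\n"


mlf_text = to_mlf({"seq0": ["L2", "L0"], "seq1": ["L1"]})
print(mlf_text)
```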