utils.annotation_scripts package
Submodules
utils.annotation_scripts.annotation_CRF module
annotation_CRF.py. Generation of synthetic annotations for wapiti CRF classifier.
-
utils.annotation_scripts.annotation_CRF.annotate_CRF_test(n, test, temp_folder)
Returns a wapiti-formatted version of the test entities for the CRF classifier.
- Args:
  - n (int): step identifier.
  - test (list): test entities.
  - temp_folder: directory for temporary files.
- Returns:
  - n_sentences_test (int): number of sequences in the test database.
  - test_entities_indices (list): indices and identifiers of the entities of interest in the test database.
  - test_file (str): path to the formatted test data.
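As a rough illustration of what "wapiti-formatted" means here, the sketch below writes one token per line, with the label in the last column and an empty line between sequences, which is the column layout Wapiti reads. The helper name `to_wapiti`, the `ENT`/`O` labels, and the toy data are hypothetical and are not taken from `annotation_CRF.py`.

```python
def to_wapiti(sequences):
    """Serialize sequences of (token, label) pairs in a Wapiti-style layout.

    One token per line, columns separated by a space, label in the last
    column, and an empty line between consecutive sequences.
    """
    blocks = []
    for sequence in sequences:
        lines = [f"{token} {label}" for token, label in sequence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical toy data: two sequences, entities marked with a placeholder label.
toy_test = [
    [("John", "ENT"), ("sleeps", "O")],
    [("Paris", "ENT")],
]

# Positions (sequence index, token index) of the entities of interest.
entity_indices = [
    (s, t)
    for s, sequence in enumerate(toy_test)
    for t, (_, label) in enumerate(sequence)
    if label == "ENT"
]

print(to_wapiti(toy_test))
print(len(toy_test), entity_indices)
```

The sequence count and the entity positions computed at the end play the role of n_sentences_test and test_entities_indices in this toy setting.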
-
utils.annotation_scripts.annotation_CRF.annotate_CRF_train(n, train, distrib, n_labels, to_annotate, temp_folder, preclustering)
Returns a synthetic annotation of the data for the (wapiti) CRF classifier.
- Args:
  - n (int): iteration identifier.
  - train (list): training entities.
  - distrib (str): type of the synthetic annotation.
  - n_min (int): minimum number of synthetic labels to use.
  - n_max (int): maximum number of synthetic labels to use.
  - to_annotate (list): in case of UNI annotation, list of indices of the entities to have their own class.
  - with_common_label_wordform (bool, optional): if True, every occurrence of an entity wordform receives the same label. Defaults to False.
  - temp_folder: path to the directory for temporary files.
- Returns:
  - n_unique_labels_used (int): number of synthetic labels that were actually used.
  - n_sentences_train (int): number of sequences in the training database.
  - n_entities_train (int): number of entities in the training database.
  - train_file (str): path to the formatted train data.
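The idea of a synthetic annotation with a capped number of labels can be sketched as follows. This is only a minimal sketch: it assumes labels are drawn uniformly at random, which the documentation above does not state, and the function `synthetic_labels`, its `seed` argument, and the toy wordforms are all hypothetical.

```python
import random


def synthetic_labels(wordforms, n_labels, common_label_per_wordform=False, seed=0):
    """Draw one synthetic label per entity occurrence (assumed uniform draw).

    Returns the list of labels and the number of distinct labels used.
    """
    rng = random.Random(seed)
    if common_label_per_wordform:
        # Every occurrence of the same wordform shares a single label.
        per_wordform = {w: rng.randrange(n_labels) for w in sorted(set(wordforms))}
        labels = [per_wordform[w] for w in wordforms]
    else:
        # Each occurrence is labelled independently of the others.
        labels = [rng.randrange(n_labels) for _ in wordforms]
    return labels, len(set(labels))


# Hypothetical entity occurrences; "paris" appears twice.
occurrences = ["paris", "john", "paris", "london"]
labels, n_unique_labels_used = synthetic_labels(
    occurrences, n_labels=5, common_label_per_wordform=True
)
print(labels, n_unique_labels_used)
```

Because the draw is random, fewer than n_labels distinct labels may end up being used, which is why a count such as n_unique_labels_used is worth returning separately.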
utils.annotation_scripts.annotation_DT module
annotation_DT.py. Generation of synthetic annotations for weka Decision Tree J48 classifiers.
-
utils.annotation_scripts.annotation_DT.annotate_DT_test(n, test, features, fake_class)
Returns a weka-formatted version of the test entities for the DT classifier.
- Args:
  - n (int): step identifier.
  - test (list): test entities.
  - features (list): pattern for the feature selection.
  - fake_class: fake weka class to give to all test entities for the weka format.
- Returns:
  - n_sentences_test (int): number of sequences in the test database.
  - test_entities (list): indices and identifiers of the entities of interest in the test database.
  - test_file (str): path to the formatted test data.
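The role of fake_class can be illustrated with a small ARFF writer: Weka expects every data row to carry a value for the class attribute, so unlabelled test rows need a placeholder. The sketch below is an assumption about how such a file could be built, not the code of `annotation_DT.py`; `to_arff`, the attribute names, and the class values are hypothetical.

```python
def to_arff(relation, feature_names, rows, class_values, fake_class):
    """Build a minimal ARFF document in which every row gets `fake_class`."""
    lines = ["@relation " + relation, ""]
    for column, name in enumerate(feature_names):
        # Declare each feature as nominal, listing the values seen in `rows`.
        values = sorted({row[column] for row in rows})
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for row in rows:
        # The placeholder class fills the last column of every test row.
        lines.append(",".join(list(row) + [fake_class]))
    return "\n".join(lines) + "\n"


# Hypothetical test entities described by two nominal features.
arff_text = to_arff(
    relation="test",
    feature_names=["word", "pos"],
    rows=[("john", "NNP"), ("paris", "NNP")],
    class_values=["c0", "c1"],
    fake_class="c0",
)
print(arff_text)
```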
-
utils.annotation_scripts.annotation_DT.annotate_DT_train(n, train, distrib, n_labels, features, to_annotate, temp_folder, preclustering)
Returns a synthetic annotation of the data (train + test) for the (weka) DT classifier.
- Args:
  - n (int): step identifier.
  - train (list): training entities.
  - distrib (str): type of the synthetic annotation.
  - n_min (int): minimum number of synthetic labels to use.
  - n_max (int): maximum number of synthetic labels to use.
  - features (list): pattern for the feature selection.
  - to_annotate (list): in case of UNI annotation, list of indices of the entities to have their own class.
  - with_common_label_wordform (bool, optional): if True, every occurrence of an entity wordform receives the same label. Defaults to False.
  - temp_folder: directory for temporary files.
- Returns:
  - N (int): random max number of synthetic labels for this step.
  - n_unique_labels_used (int): number of synthetic labels that were actually used.
  - n_sentences_train (int): number of sequences in the training database.
  - n_entities_train (int): number of entities in the training database.
  - train_file (str): path to the formatted train data.
-
utils.annotation_scripts.annotation_DT.weka_compatible_string(s)
-
utils.annotation_scripts.annotation_DT.weka_format(n, wekadata, train_length, test_length, temp_folder, verbose)
Formats the input weka file to be compatible with the J48 tree and splits it into a train file and a test file.
- Args:
  - n (int): iteration identifier.
  - wekadata (str): input weka file.
  - train_length (int): number of attributes in train.
  - test_length (int): number of attributes in test.
  - temp_folder (str): path to temporary folder.
  - verbose (int): verbosity level.
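The split into a train file and a test file can be sketched as below. The sketch makes two assumptions that the documentation does not confirm: that train_length and test_length count data rows, and that the training rows come first in the combined file. `split_arff` and the toy ARFF content are hypothetical.

```python
def split_arff(arff_text, train_length, test_length):
    """Split one ARFF document into train and test documents sharing a header."""
    lines = arff_text.splitlines()
    # Locate the @data marker (ARFF keywords are case-insensitive).
    data_start = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    ) + 1
    header = lines[:data_start]
    rows = [line for line in lines[data_start:] if line.strip()]
    if len(rows) != train_length + test_length:
        raise ValueError("row count does not match train_length + test_length")
    # Assumption: the first train_length rows are the training rows.
    train = "\n".join(header + rows[:train_length]) + "\n"
    test = "\n".join(header + rows[train_length:]) + "\n"
    return train, test


# Hypothetical combined file: two training rows followed by one test row.
combined = (
    "@relation demo\n"
    "\n"
    "@attribute word {a,b}\n"
    "@attribute class {c0,c1}\n"
    "\n"
    "@data\n"
    "a,c0\n"
    "b,c1\n"
    "a,c0\n"
)
train_text, test_text = split_arff(combined, train_length=2, test_length=1)
print(train_text)
print(test_text)
```

Keeping the same header in both files means the train and test sets declare identical attributes, which Weka requires when a model trained on one is evaluated on the other.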
utils.annotation_scripts.annotation_HTK module
annotation_HTK.py. Generation of synthetic annotations for HTK HMM classifiers.
-
utils.annotation_scripts.annotation_HTK.annotate_HTK_test(n, test)
Returns an HTK-formatted version of the test entities for the HTK classifier.
- Args:
  - n (int): step identifier.
  - test (list): test entities.
  - features (list): pattern for the feature selection.
  - fake_class: fake weka class to give to all test entities for the weka format.
- Returns:
  - n_sentences_test (int): number of sequences in the test database.
  - test_entities (list): indices and identifiers of the entities of interest in the test database.
  - test_file (str): path to the formatted test data.
-
utils.annotation_scripts.annotation_HTK.annotate_HTK_train(n, train, distrib, n_labels, temp_folder, preclustering)
Returns a synthetic annotation of the data (train + test) for the (HTK) HMM classifier.
- Args:
  - n (int): step identifier.
  - train (list): training entities.
  - distrib (str): type of the synthetic annotation.
  - n_labels (int): random max number of synthetic labels for this step.
- Returns:
  - n_unique_labels_used (int): number of synthetic labels that were actually used.
  - n_entities_train (int): number of entities in the training database.
  - train_file (str): path to the formatted train data.
  - mlf (str): HTK master label file.
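For readers unfamiliar with HTK master label files, the sketch below shows their general shape: a `#!MLF!#` header, then for each sequence a quoted label-file pattern, one label per line, and a terminating line containing a single period. The helper `to_mlf`, the sequence names, and the label names are hypothetical; how `annotation_HTK.py` actually names its sequences is not documented here.

```python
def to_mlf(labelled_sequences):
    """Serialize {sequence name: list of labels} as an HTK master label file."""
    lines = ["#!MLF!#"]
    for name, labels in labelled_sequences.items():
        # Pattern matching the label file of this sequence in any directory.
        lines.append(f'"*/{name}.lab"')
        lines.extend(labels)
        # A single period closes the entry for this sequence.
        lines.append(".")
    return "\n".join(lines) + "\n"


# Hypothetical synthetic labels for two training sequences.
mlf_text = to_mlf({"seq0": ["L3", "L1"], "seq1": ["L2"]})
print(mlf_text)
```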