utils.annotation_scripts package

Submodules

utils.annotation_scripts.annotation_CRF module

annotation_CRF.py. Generation of synthetic annotations for the wapiti CRF classifier.

utils.annotation_scripts.annotation_CRF.annotate_CRF_test(n, test, temp_folder)

Returns a wapiti-formatted version of the test entities for the CRF classifier.

Args:
  • n (int): step identifier.
  • test (list): test entities.
  • temp_folder: directory for temporary files.
Returns:
  • n_sentences_test (int): number of sequences in the test database.
  • test_entities_indices (list): indices and identifiers of the entities of interest in the test database.
  • test_file (str): path to the formatted test data.
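
For orientation, wapiti reads column data: one token per line, columns separated by whitespace, and an empty line between sequences. The sketch below writes such a file. It is only an illustration of that layout, not this module's code: the helper `write_wapiti_file`, the file name `test_0.txt`, and the two example columns are assumptions, and the features actually emitted by `annotate_CRF_test` are not documented here.

```python
import os
import tempfile


def write_wapiti_file(sequences, path):
    """Write sequences in a wapiti-style column layout (hypothetical helper).

    Each sequence is a list of rows and each row is a list of string columns.
    Rows are written one per line, columns are separated by a single space,
    and an empty line closes each sequence.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for rows in sequences:
            for columns in rows:
                handle.write(" ".join(columns) + "\n")
            handle.write("\n")
    return len(sequences)


# Made-up test data: two sequences, columns are (token, capitalisation flag).
sequences = [
    [["Paris", "CAP"], ["is", "low"], ["big", "low"]],
    [["Hello", "CAP"], ["world", "low"]],
]
temp_folder = tempfile.mkdtemp()
test_file = os.path.join(temp_folder, "test_0.txt")
n_sentences_test = write_wapiti_file(sequences, test_file)
```

Returning the sequence count alongside the file path mirrors the documented return values, so a caller can check that the classifier later labels the same number of sequences.
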
utils.annotation_scripts.annotation_CRF.annotate_CRF_train(n, train, distrib, n_labels, to_annotate, temp_folder, preclustering)

Returns a synthetic annotation of the data for the (wapiti) CRF classifier.

Args:
  • n (int): iteration identifier.
  • train (list): training entities.
  • distrib (str): type of the synthetic annotation.
  • n_min (int): minimum number of synthetic labels to use.
  • n_max (int): maximum number of synthetic labels to use.
  • to_annotate (list): in case of UNI annotation, list of indices of the entities to have their own class.
  • with_common_label_wordform (bool, optional): if True, all occurrences of the same entity wordform receive the same label. Defaults to False.
  • temp_folder: path to the directory for temporary files.
Returns:
  • n_unique_labels_used (int): number of synthetic labels that were actually used.
  • n_sentences_train (int): number of sequences in the training database.
  • n_entities_train (int): number of entities in the training database.
  • train_file (str): path to the formatted train data.
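
The idea of a synthetic annotation can be sketched as follows: every training entity receives a label drawn from a pool of at most `n_labels` artificial classes, except the entities listed in `to_annotate`, which each get a class of their own. This is a minimal sketch under assumptions, not the module's implementation: the function name `assign_synthetic_labels`, the uniform draw, the label spellings (`L0`, `UNI1`, ...) and the `seed` parameter are all hypothetical.

```python
import random


def assign_synthetic_labels(n_entities, n_labels, to_annotate=(), seed=None):
    """Give each entity a synthetic label (hypothetical helper).

    Entities whose index is in `to_annotate` receive a dedicated class;
    all others draw uniformly from `n_labels` shared synthetic classes.
    Returns the label list and the number of distinct labels actually used.
    """
    rng = random.Random(seed)
    own_class = set(to_annotate)
    labels = []
    for index in range(n_entities):
        if index in own_class:
            # Assumed naming scheme for "own class" entities.
            labels.append(f"UNI{index}")
        else:
            labels.append(f"L{rng.randrange(n_labels)}")
    return labels, len(set(labels))


labels, n_unique_labels_used = assign_synthetic_labels(
    6, n_labels=3, to_annotate=[1, 4], seed=0
)
```

Because the draw is random, fewer than `n_labels` shared classes may end up being used, which is presumably why the number of labels actually used is reported separately from the requested maximum.
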

utils.annotation_scripts.annotation_DT module

annotation_DT.py. Generation of synthetic annotations for weka Decision Tree J48 classifiers.

utils.annotation_scripts.annotation_DT.annotate_DT_test(n, test, features, fake_class)

Returns a weka-formatted version of the test entities for the DT classifier.

Args:
  • n (int): step identifier.
  • test (list): test entities.
  • features (list): pattern for the feature selection.
  • fake_class: fake weka class to give to all test entities for the weka format.
Returns:
  • n_sentences_test (int): number of sequences in the test database.
  • test_entities (list): indices and identifiers of the entities of interest in the test database.
  • test_file (str): path to the formatted test data.
utils.annotation_scripts.annotation_DT.annotate_DT_train(n, train, distrib, n_labels, features, to_annotate, temp_folder, preclustering)

Returns a synthetic annotation of the data (train + test) for the (weka) DT classifier.

Args:
  • n (int): step identifier.
  • train (list): training entities.
  • distrib (str): type of the synthetic annotation.
  • n_min (int): minimum number of synthetic labels to use.
  • n_max (int): maximum number of synthetic labels to use.
  • features (list): pattern for the feature selection.
  • to_annotate (list): in case of UNI annotation, list of indices of the entities to have their own class.
  • with_common_label_wordform (bool, optional): if True, all occurrences of the same entity wordform receive the same label. Defaults to False.
  • temp_folder: directory for temporary files.
Returns:
  • N (int): random max number of synthetic labels for this step.
  • n_unique_labels_used (int): number of synthetic labels that were actually used.
  • n_sentences_train (int): number of sequences in the training database.
  • n_entities_train (int): number of entities in the training database.
  • train_file (str): path to the formatted train data.
utils.annotation_scripts.annotation_DT.weka_compatible_string(s)
utils.annotation_scripts.annotation_DT.weka_format(n, wekadata, train_length, test_length, temp_folder, verbose)

Formats the input weka file to be compatible with the J48 tree and splits it into a train file and a test file.

Args:
  • n (int): iteration identifier.
  • wekadata (str): input weka file.
  • train_length (int): number of attributes in train.
  • test_length (int): number of attributes in test.
  • temp_folder (str): path to temporary folder.
  • verbose (int): verbosity level.
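
A rough sketch of the splitting step, assuming the input is an ARFF file whose data rows list the train instances first and the test instances after them. Both output parts keep the full header so that weka sees identical attribute declarations for train and test. The function `split_arff`, the example attributes, and the reading of `train_length` and `test_length` as row counts are assumptions made for illustration; the real `weka_format` also writes files under `temp_folder`, which is omitted here.

```python
def split_arff(arff_text, train_length, test_length):
    """Split ARFF text into train and test parts sharing one header.

    Hypothetical helper: assumes the first `train_length` data rows are the
    train instances and the next `test_length` rows are the test instances.
    """
    lines = arff_text.splitlines()
    # Locate the @data marker (ARFF keywords are case-insensitive).
    data_start = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    ) + 1
    header = lines[:data_start]
    # Keep only real data rows: skip blank lines and % comments.
    rows = [
        line for line in lines[data_start:]
        if line.strip() and not line.lstrip().startswith("%")
    ]
    if len(rows) < train_length + test_length:
        raise ValueError("not enough data rows for the requested split")
    train_rows = rows[:train_length]
    test_rows = rows[train_length:train_length + test_length]
    train_text = "\n".join(header + train_rows) + "\n"
    test_text = "\n".join(header + test_rows) + "\n"
    return train_text, test_text


# Made-up input: two train rows, then one test row carrying a fake class.
arff_text = "\n".join([
    "@relation entities",
    "@attribute length numeric",
    "@attribute is_capitalised {yes,no}",
    "@attribute class {L0,L1,FAKE}",
    "@data",
    "5,yes,L0",
    "2,no,L1",
    "6,yes,FAKE",
])
train_text, test_text = split_arff(arff_text, train_length=2, test_length=1)
```
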

utils.annotation_scripts.annotation_HTK module

annotation_HTK.py. Generation of synthetic annotations for HTK HMM classifiers.

utils.annotation_scripts.annotation_HTK.annotate_HTK_test(n, test)

Returns an HTK-formatted version of the test entities for the HTK classifier.

Args:
  • n (int): step identifier.
  • test (list): test entities.
Returns:
  • n_sentences_test (int): number of sequences in the test database.
  • test_entities (list): indices and identifiers of the entities of interest in the test database.
  • test_file (str): path to the formatted test data.
utils.annotation_scripts.annotation_HTK.annotate_HTK_train(n, train, distrib, n_labels, temp_folder, preclustering)

Returns a synthetic annotation of the data (train + test) for the (HTK) HMM classifier.

Args:
  • n (int): step identifier.
  • train (list): training entities.
  • distrib (str): type of the synthetic annotation.
  • n_labels (int): random max number of synthetic labels for this step.
Returns:
  • n_unique_labels_used (int): number of synthetic labels that were actually used.
  • n_entities_train (int): number of entities in the training database.
  • train_file (str): path to the formatted train data.
  • mlf (str): HTK master label file.
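
For readers unfamiliar with the master label file returned above: an HTK MLF starts with the line `#!MLF!#`, then lists one entry per sequence, each made of a quoted label-file pattern, one label per line, and a closing line containing a single period. The sketch below builds such a file as a string. The helper `build_mlf`, the sequence names, and the label names are hypothetical, and how `annotate_HTK_train` names its sequences is not documented here.

```python
def build_mlf(labelled_sequences):
    """Build the text of an HTK master label file (hypothetical helper).

    `labelled_sequences` is a list of (name, labels) pairs, where `labels`
    is the list of label strings for that sequence.
    """
    lines = ["#!MLF!#"]
    for name, labels in labelled_sequences:
        # The "*/" prefix lets HTK match the label file in any directory.
        lines.append(f'"*/{name}.lab"')
        lines.extend(labels)
        # A single period terminates each entry.
        lines.append(".")
    return "\n".join(lines) + "\n"


# Made-up sequences annotated with synthetic labels.
mlf_text = build_mlf([
    ("seq0", ["L0", "L2", "L0"]),
    ("seq1", ["L1"]),
])
```
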

Module contents