Annotation Algorithms

Annotate with Sentence Transformers

Path: flexiconc/algorithms/annotate_sentence_transformers.py

Description:

Generates an embedding for each concordance line (or a part of it) using a Sentence Transformer model. Tokens can be restricted to a specified window and drawn from a specified token attribute.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| tokens_attribute | string | The positional attribute to extract tokens from (e.g., 'word'). |
| window_start | integer | The lower bound of the window (inclusive). If None, uses the entire line. |
| window_end | integer | The upper bound of the window (inclusive). If None, uses the entire line. |
| model_name | string | The name of the pretrained Sentence Transformer model. |
Full JSON schema:
{
  "type": "object",
  "properties": {
    "tokens_attribute": {
      "type": "string",
      "description": "The positional attribute to extract tokens from (e.g., 'word').",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the window (inclusive). If None, uses the entire line.",
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the window (inclusive). If None, uses the entire line.",
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "model_name": {
      "type": "string",
      "description": "The name of the pretrained Sentence Transformer model.",
      "default": "all-MiniLM-L6-v2"
    }
  },
  "required": []
}
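
Example:

As a rough illustration of what this algorithm computes (not the FlexiConc entry point itself), the sketch below joins the selected token attribute within the window into one string per line and encodes it with the sentence-transformers package. The embed_lines helper and its (offset, token) input format are assumptions made for the example.

from sentence_transformers import SentenceTransformer

def embed_lines(lines, model_name="all-MiniLM-L6-v2", window_start=None, window_end=None):
    # 'lines' is a hypothetical input format: one list of (offset, token) pairs per concordance line.
    model = SentenceTransformer(model_name)
    texts = []
    for line in lines:
        kept = [tok for off, tok in line
                if (window_start is None or off >= window_start)
                and (window_end is None or off <= window_end)]
        texts.append(" ".join(kept))
    return model.encode(texts)  # one embedding vector per concordance line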

Annotate with spaCy Embeddings

Path: flexiconc/algorithms/annotate_spacy_embeddings.py

Description:

Generates averaged spaCy word embeddings for tokens within a specified window.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| spacy_model | string | The spaCy model to use. |
| tokens_attribute | string | The token attribute to use for creating line texts. |
| exclude_values_attribute | string | The attribute to filter out specific values. |
| exclude_values_list | array of strings | The list of values to exclude. |
| window_start | integer | The lower bound of the token window (inclusive). |
| window_end | integer | The upper bound of the token window (inclusive). |
| include_node | boolean | Whether to include the node token (offset 0). |
Full JSON schema:
{
  "type": "object",
  "properties": {
    "spacy_model": {
      "type": "string",
      "description": "The spaCy model to use.",
      "default": "en_core_web_md"
    },
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for creating line texts.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "exclude_values_attribute": {
      "type": "string",
      "description": "The attribute to filter out specific values."
    },
    "exclude_values_list": {
      "type": [
        "array"
      ],
      "items": {
        "type": "string"
      },
      "description": "The list of values to exclude."
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the token window (inclusive).",
      "default": -5,
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the token window (inclusive).",
      "default": 5,
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "include_node": {
      "type": "boolean",
      "description": "Whether to include the node token (offset 0).",
      "default": true
    }
  },
  "required": [
    "spacy_model"
  ]
}
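
Example:

The sketch below illustrates the underlying computation rather than the library's implementation; the average_window_vectors helper and its (offset, word) input format are assumptions made for the example. It relies on the fact that, for models shipping word vectors (such as en_core_web_md), spaCy's Doc.vector is the average of the token vectors.

import spacy

def average_window_vectors(line_tokens, nlp=None, window_start=-5, window_end=5,
                           include_node=True, exclude=frozenset()):
    # 'line_tokens' is a hypothetical input format: (offset, word) pairs for a single line.
    nlp = nlp or spacy.load("en_core_web_md")
    words = [w for off, w in line_tokens
             if window_start <= off <= window_end
             and (include_node or off != 0)
             and w not in exclude]
    doc = nlp(" ".join(words))
    return doc.vector  # spaCy's Doc.vector averages the token vectors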

Annotate with spaCy POS tags

Path: flexiconc/algorithms/annotate_spacy_pos.py

Description:

Annotates tokens with spaCy part-of-speech tags or related tag information using a specified spaCy model. Because the spacy_attributes parameter is a list, several annotations can be retrieved in a single pass.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| spacy_model | string | The spaCy model to use for POS tagging. |
| tokens_attribute | string | The token attribute to use for POS tagging. |
| spacy_attributes | array of strings | A list of spaCy token attributes to retrieve for annotation. |
Full JSON schema:
{
  "type": "object",
  "properties": {
    "spacy_model": {
      "type": "string",
      "description": "The spaCy model to use for POS tagging.",
      "default": "en_core_web_sm"
    },
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for POS tagging.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "spacy_attributes": {
      "type": "array",
      "items": {
        "type": "string",
        "enum": [
          "pos_",
          "tag_",
          "morph",
          "dep_",
          "ent_type_"
        ]
      },
      "description": "A list of spaCy token attributes to retrieve for annotation.",
      "default": [
        "pos_"
      ]
    }
  },
  "required": [
    "spacy_model",
    "spacy_attributes"
  ]
}
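
Example:

The values allowed in spacy_attributes correspond directly to attributes of spaCy's Token objects, so the annotation step essentially reads those attributes off each processed token. A minimal sketch (not the FlexiConc code path; the sentence is arbitrary):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
# One dict of requested attributes per token; str() keeps the values uniform
# and also covers non-string attributes such as morph.
annotations = [{attr: str(getattr(token, attr)) for attr in ("pos_", "tag_", "dep_")}
               for token in doc]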

Annotate with TF-IDF

Path: flexiconc/algorithms/annotate_tf_idf.py

Description:

Computes TF-IDF vectors for each line based on tokens in a specified window.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| tokens_attribute | string | The token attribute to use for creating line texts. |
| exclude_values_attribute | string | The attribute to filter out specific values. |
| exclude_values_list | array of strings | The list of values to exclude. |
| window_start | integer | The lower bound of the token window (inclusive). |
| window_end | integer | The upper bound of the token window (inclusive). |
| include_node | boolean | Whether to include the node token (offset 0). |
Full JSON schema:
{
  "type": "object",
  "properties": {
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for creating line texts.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "exclude_values_attribute": {
      "type": [
        "string"
      ],
      "description": "The attribute to filter out specific values."
    },
    "exclude_values_list": {
      "type": [
        "array"
      ],
      "items": {
        "type": "string"
      },
      "description": "The list of values to exclude."
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the token window (inclusive).",
      "default": -5,
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the token window (inclusive).",
      "default": 5,
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "include_node": {
      "type": "boolean",
      "description": "Whether to include the node token (offset 0).",
      "default": true
    }
  },
  "required": [
    "tokens_attribute"
  ]
}
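
Example:

Conceptually, the windowed tokens of each line are joined into a small document and vectorized with a standard TF-IDF model, as in the scikit-learn sketch below. The tfidf_for_lines helper and its (offset, word) input format are assumptions made for the example; the actual implementation may tokenize and weight differently.

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_for_lines(lines, window_start=-5, window_end=5, include_node=True):
    # 'lines' is a hypothetical input format: one list of (offset, word) pairs per concordance line.
    texts = [" ".join(w for off, w in line
                      if window_start <= off <= window_end
                      and (include_node or off != 0))
             for line in lines]
    vectorizer = TfidfVectorizer()
    return vectorizer.fit_transform(texts)  # sparse matrix: one TF-IDF row per line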