Annotation Algorithms¶
Annotate with Sentence Transformers¶
Path: flexiconc/algorithms/annotate_sentence_transformers.py
Description:
Generates embeddings for each concordance line (or part of it) using a Sentence Transformer model. Token selection can be restricted to a specified window around the node and drawn from a specified token attribute.
Arguments:
| Name | Type | Description |
|---|---|---|
| tokens_attribute | string | The positional attribute to extract tokens from (e.g., 'word'). |
| window_start | integer | The lower bound of the window (inclusive). If None, uses the entire line. |
| window_end | integer | The upper bound of the window (inclusive). If None, uses the entire line. |
| model_name | string | The name of the pretrained Sentence Transformer model. |
Full JSON schema:

```json
{
  "type": "object",
  "properties": {
    "tokens_attribute": {
      "type": "string",
      "description": "The positional attribute to extract tokens from (e.g., 'word').",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the window (inclusive). If None, uses the entire line.",
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the window (inclusive). If None, uses the entire line.",
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "model_name": {
      "type": "string",
      "description": "The name of the pretrained Sentence Transformer model.",
      "default": "all-MiniLM-L6-v2"
    }
  },
  "required": []
}
```
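To illustrate the window semantics, here is a minimal pure-Python sketch (not the flexiconc implementation): the `tokens` list and `line_text` helper are hypothetical stand-ins for one concordance line, where each token carries an `offset` relative to the node token (offset 0) plus positional attributes such as `word`.

```python
# One concordance line, as a stand-in for flexiconc's token table.
tokens = [
    {"offset": -2, "word": "the"},
    {"offset": -1, "word": "quick"},
    {"offset": 0,  "word": "fox"},
    {"offset": 1,  "word": "jumps"},
    {"offset": 2,  "word": "high"},
]

def line_text(tokens, tokens_attribute="word", window_start=None, window_end=None):
    """Join the selected attribute of all tokens inside the window.
    A window bound of None leaves that side of the line unrestricted."""
    selected = [
        t[tokens_attribute]
        for t in tokens
        if (window_start is None or t["offset"] >= window_start)
        and (window_end is None or t["offset"] <= window_end)
    ]
    return " ".join(selected)

print(line_text(tokens))                                 # -> the quick fox jumps high
print(line_text(tokens, window_start=-1, window_end=1))  # -> quick fox jumps
```

The resulting strings would then be encoded with the chosen Sentence Transformer model (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`) to obtain one embedding per line.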
Annotate with spaCy Embeddings¶
Path: flexiconc/algorithms/annotate_spacy_embeddings.py
Description:
Generates averaged spaCy word embeddings for tokens within a specified window.
Arguments:
| Name | Type | Description |
|---|---|---|
| spacy_model | string | The spaCy model to use. |
| tokens_attribute | string | The token attribute to use for creating line texts. |
| exclude_values_attribute | string | The attribute to filter out specific values. |
| exclude_values_list | array | The list of values to exclude. |
| window_start | integer | The lower bound of the token window (inclusive). |
| window_end | integer | The upper bound of the token window (inclusive). |
| include_node | boolean | Whether to include the node token (offset 0). |
Full JSON schema:

```json
{
  "type": "object",
  "properties": {
    "spacy_model": {
      "type": "string",
      "description": "The spaCy model to use.",
      "default": "en_core_web_md"
    },
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for creating line texts.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "exclude_values_attribute": {
      "type": "string",
      "description": "The attribute to filter out specific values."
    },
    "exclude_values_list": {
      "type": ["array"],
      "items": {
        "type": "string"
      },
      "description": "The list of values to exclude."
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the token window (inclusive).",
      "default": -5,
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the token window (inclusive).",
      "default": 5,
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "include_node": {
      "type": "boolean",
      "description": "Whether to include the node token (offset 0).",
      "default": true
    }
  },
  "required": [
    "spacy_model"
  ]
}
```
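The averaging and the exclusion filter can be sketched in pure Python; this is an illustrative stand-in, with toy 2-dimensional tuples in place of real spaCy word vectors, showing how `window_start`/`window_end`, `include_node`, and the `exclude_values_*` arguments interact:

```python
# Toy tokens: each carries an offset, attributes, and a stand-in word vector.
tokens = [
    {"offset": -1, "word": "the", "pos": "DET",  "vec": (1.0, 0.0)},
    {"offset": 0,  "word": "fox", "pos": "NOUN", "vec": (0.0, 1.0)},
    {"offset": 1,  "word": "ran", "pos": "VERB", "vec": (1.0, 1.0)},
]

def avg_embedding(tokens, window_start=-5, window_end=5, include_node=True,
                  exclude_values_attribute=None, exclude_values_list=()):
    """Average the vectors of window tokens, optionally dropping the node
    token and any token whose attribute value is in the exclusion list."""
    selected = [
        t["vec"] for t in tokens
        if window_start <= t["offset"] <= window_end
        and (include_node or t["offset"] != 0)
        and (exclude_values_attribute is None
             or t[exclude_values_attribute] not in exclude_values_list)
    ]
    n = len(selected)
    # Component-wise mean of the selected vectors
    return tuple(sum(v[i] for v in selected) / n for i in range(len(selected[0])))

print(avg_embedding(tokens))  # mean of all three vectors
print(avg_embedding(tokens, exclude_values_attribute="pos",
                    exclude_values_list=["DET"], include_node=False))
```

In a real run, the vectors would come from a spaCy model such as `en_core_web_md` (which, unlike the `_sm` models, ships with word vectors).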
Annotate with spaCy POS tags¶
Path: flexiconc/algorithms/annotate_spacy_pos.py
Description:
Annotates tokens with spaCy part-of-speech tags or related tag information using a specified spaCy model. The spacy_attributes parameter is always a list, so multiple annotations can be retrieved simultaneously.
Arguments:
| Name | Type | Description |
|---|---|---|
| spacy_model | string | The spaCy model to use for POS tagging. |
| tokens_attribute | string | The token attribute to use for POS tagging. |
| spacy_attributes | array | A list of spaCy token attributes to retrieve for annotation. |
Full JSON schema:

```json
{
  "type": "object",
  "properties": {
    "spacy_model": {
      "type": "string",
      "description": "The spaCy model to use for POS tagging.",
      "default": "en_core_web_sm"
    },
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for POS tagging.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "spacy_attributes": {
      "type": "array",
      "items": {
        "type": "string",
        "enum": [
          "pos_",
          "tag_",
          "morph",
          "dep_",
          "ent_type_"
        ]
      },
      "description": "A list of spaCy token attributes to retrieve for annotation.",
      "default": [
        "pos_"
      ]
    }
  },
  "required": [
    "spacy_model",
    "spacy_attributes"
  ]
}
```
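Conceptually, each entry in spacy_attributes yields one annotation column per token. A minimal sketch of that list semantics, using `SimpleNamespace` objects as stand-ins for spaCy `Token` objects (a real run would iterate over `nlp("...")` instead):

```python
from types import SimpleNamespace

# Stand-ins for spaCy Token objects, which expose attributes like pos_ and tag_.
doc = [
    SimpleNamespace(text="fox",  pos_="NOUN", tag_="NN"),
    SimpleNamespace(text="runs", pos_="VERB", tag_="VBZ"),
]

def annotate(doc, spacy_attributes=("pos_",)):
    """Return one annotation column per requested spaCy attribute."""
    return {attr: [getattr(tok, attr) for tok in doc] for attr in spacy_attributes}

print(annotate(doc, spacy_attributes=["pos_", "tag_"]))
# {'pos_': ['NOUN', 'VERB'], 'tag_': ['NN', 'VBZ']}
```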
Annotate with TF-IDF¶
Path: flexiconc/algorithms/annotate_tf_idf.py
Description:
Computes TF-IDF vectors for each line based on tokens in a specified window.
Arguments:
| Name | Type | Description |
|---|---|---|
| tokens_attribute | string | The token attribute to use for creating line texts. |
| exclude_values_attribute | string | The attribute to filter out specific values. |
| exclude_values_list | array | The list of values to exclude. |
| window_start | integer | The lower bound of the token window (inclusive). |
| window_end | integer | The upper bound of the token window (inclusive). |
| include_node | boolean | Whether to include the node token (offset 0). |
Full JSON schema:

```json
{
  "type": "object",
  "properties": {
    "tokens_attribute": {
      "type": "string",
      "description": "The token attribute to use for creating line texts.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "exclude_values_attribute": {
      "type": ["string"],
      "description": "The attribute to filter out specific values."
    },
    "exclude_values_list": {
      "type": ["array"],
      "items": {
        "type": "string"
      },
      "description": "The list of values to exclude."
    },
    "window_start": {
      "type": "integer",
      "description": "The lower bound of the token window (inclusive).",
      "default": -5,
      "x-eval": "dict(minimum=min(conc.tokens['offset']))"
    },
    "window_end": {
      "type": "integer",
      "description": "The upper bound of the token window (inclusive).",
      "default": 5,
      "x-eval": "dict(maximum=max(conc.tokens['offset']))"
    },
    "include_node": {
      "type": "boolean",
      "description": "Whether to include the node token (offset 0).",
      "default": true
    }
  },
  "required": [
    "tokens_attribute"
  ]
}
```
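A minimal stdlib-only sketch of the TF-IDF idea (not the flexiconc implementation, whose exact weighting scheme may differ, e.g. a smoothed scikit-learn-style variant): raw term frequency within each windowed line, multiplied by log(N / document frequency) across lines.

```python
import math

# Windowed token lists, one per concordance line (already restricted to the
# window and filtered, as configured via the arguments above).
lines = [["quick", "fox"], ["lazy", "fox"], ["quick", "dog"]]

def tf_idf(lines):
    """Plain tf-idf: term frequency times log(N / document frequency)."""
    n = len(lines)
    vocab = {w for line in lines for w in line}
    df = {w: sum(w in line for line in lines) for w in vocab}
    return [
        {w: line.count(w) * math.log(n / df[w]) for w in set(line)}
        for line in lines
    ]

vectors = tf_idf(lines)
print(vectors[0]["fox"])  # "fox" occurs in 2 of 3 lines -> 1 * log(3/2)
```

Terms that occur in every line get a weight of log(1) = 0, so the resulting vectors emphasize tokens that distinguish one concordance line's context from the others.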