# Partitioning Algorithms
## Flat Clustering by Embeddings

**Path:** `flexiconc/algorithms/partition_by_embeddings.py`

**Description:** Partitions lines by clustering the embeddings stored in a concordance metadata column, using either Agglomerative Clustering or K-Means. The distance metric and linkage criterion are configurable for Agglomerative Clustering; a minimal clustering sketch follows the schema below.

**Arguments:**
| Name | Type | Description |
|---|---|---|
| `embeddings_column` | string | The metadata column containing embeddings for each line. |
| `n_partitions` | integer | The number of partitions/clusters to create. |
| `metric` | string | The metric used to compute distances between embeddings (Agglomerative Clustering only). |
| `linkage` | string | The linkage criterion for Agglomerative Clustering (used only when `method` is 'agglomerative'). |
| `method` | string | The clustering method to use ('agglomerative' or 'kmeans'). |
Full JSON schema:

```json
{
"type": "object",
"properties": {
"embeddings_column": {
"type": "string",
"description": "The metadata column containing embeddings for each line.",
"x-eval": "dict(enum=[col for col in list(conc.metadata.columns) if (hasattr(conc.metadata[col].iloc[0], '__iter__') and not isinstance(conc.metadata[col].iloc[0], str) and all(isinstance(x, __import__('numbers').Number) for x in conc.metadata[col].iloc[0]))])"
},
"n_partitions": {
"type": "integer",
"description": "The number of partitions/clusters to create.",
"default": 5,
"x-eval": "dict(maximum=node.line_count)"
},
"metric": {
"type": "string",
"description": "The metric to compute distances between embeddings (used for Agglomerative Clustering only).",
"default": "cosine"
},
"linkage": {
"type": "string",
"description": "The linkage criterion for Agglomerative Clustering (used only when method is 'agglomerative').",
"default": "average"
},
"method": {
"type": "string",
"enum": [
"agglomerative",
"kmeans"
],
"description": "The clustering method to use ('agglomerative' or 'kmeans'). Default is 'agglomerative'.",
"default": "kmeans"
}
},
"required": [
"embeddings_column"
]
}
```
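The clustering step itself corresponds to standard scikit-learn estimators. Below is a minimal, self-contained sketch of that step under assumed data (the random embeddings stand in for a real embeddings column; this is not FlexiConc's internal code). Note that K-Means always clusters in Euclidean space, which is why `metric` and `linkage` apply to Agglomerative Clustering only:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Stand-in for per-line embeddings (one vector per concordance line).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 32))

n_partitions = 5

# Agglomerative Clustering with the schema defaults: cosine metric, average linkage.
# (scikit-learn >= 1.2 calls this parameter `metric`; it was formerly `affinity`.)
agg = AgglomerativeClustering(
    n_clusters=n_partitions, metric="cosine", linkage="average"
)
agg_labels = agg.fit_predict(embeddings)

# K-Means ignores `metric` and `linkage`; it always uses Euclidean distance.
km = KMeans(n_clusters=n_partitions, n_init=10, random_state=0)
km_labels = km.fit_predict(embeddings)

# Each labels array maps line index -> partition id, which defines the partitions.
```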
## Partition by Metadata Attribute

**Path:** `flexiconc/algorithms/partition_by_metadata_attribute.py`

**Description:** Partitions the concordance lines by a specified metadata attribute, grouping together lines that share the same value of that attribute (see the sketch after the schema below).

**Arguments:**
| Name | Type | Description |
|---|---|---|
| `metadata_attribute` | string | The metadata attribute to partition by (e.g., 'text_id', 'speaker'). |
| `sort_by_partition_size` | boolean | If True, partitions will be sorted by size in descending order. |
| `sorted_values` | array | If provided, partitions will be sorted by these specific values. |
Full JSON schema:

```json
{
"type": "object",
"properties": {
"metadata_attribute": {
"type": "string",
"description": "The metadata attribute to partition by (e.g., 'text_id', 'speaker').",
"x-eval": "dict(enum=list(set(conc.metadata.columns) - {'line_id'}))"
},
"sort_by_partition_size": {
"type": "boolean",
"description": "If True, partitions will be sorted by size in descending order.",
"default": true
},
"sorted_values": {
"type": [
"array"
],
"items": {
"type": [
"string",
"number"
]
},
"description": "If provided, partitions will be sorted by these specific values."
}
},
"required": [
"metadata_attribute"
]
}
```
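Conceptually, this algorithm is a group-by over the concordance's metadata table. A minimal pandas sketch with made-up data (the `speaker` column is illustrative), showing both sorting modes:

```python
import pandas as pd

# Illustrative metadata table; in FlexiConc, conc.metadata holds one row per line.
metadata = pd.DataFrame({
    "line_id": range(6),
    "speaker": ["A", "B", "A", "C", "B", "A"],
})

# Group line ids by the attribute's values.
partitions = metadata.groupby("speaker")["line_id"].apply(list).to_dict()

# sort_by_partition_size=True: largest partitions first.
by_size = sorted(partitions.items(), key=lambda kv: len(kv[1]), reverse=True)
print(by_size)  # [('A', [0, 2, 5]), ('B', [1, 4]), ('C', [3])]

# sorted_values=[...]: follow an explicit ordering of attribute values instead.
sorted_values = ["C", "A", "B"]
explicit = [(v, partitions[v]) for v in sorted_values if v in partitions]
print(explicit)  # [('C', [3]), ('A', [0, 2, 5]), ('B', [1, 4])]
```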
## Partition by Ngrams

**Path:** `flexiconc/algorithms/partition_ngrams.py`

**Description:** Extracts n-gram patterns from the specified token positions and partitions the concordance lines according to the frequency of these patterns; a toy extraction sketch follows the schema below. See Anthony (2018) and subsequent work for more information.

**Arguments:**
| Name | Type | Description |
|---|---|---|
| `positions` | array | The list of positions (offsets) to extract for the ngram pattern. |
| `tokens_attribute` | string | The positional attribute to search within (e.g., 'word'). |
| `case_sensitive` | boolean | If True, the search is case-sensitive. |
Full JSON schema:

```json
{
"type": "object",
"properties": {
"positions": {
"type": "array",
"items": {
"type": "integer"
},
"description": "The list of positions (offsets) to extract for the ngram pattern."
},
"tokens_attribute": {
"type": "string",
"description": "The positional attribute to search within (e.g., 'word').",
"default": "word",
"x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
},
"case_sensitive": {
"type": "boolean",
"description": "If True, the search is case-sensitive.",
"default": false
}
},
"required": [
"positions"
]
}
```
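A toy sketch of the extraction step, using a token table shaped like the `conc.tokens` layout referenced by the schema's `x-eval` expressions (offset 0 is the node token; negative and positive offsets are left and right context). The data is made up:

```python
import pandas as pd

tokens = pd.DataFrame({
    "line_id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "offset":  [-1, 0, 1, -1, 0, 1, -1, 0, 1],
    "word":    ["The", "cat", "sat", "a", "cat", "ran", "the", "cat", "sat"],
})

positions = [-1, 1]        # tokens immediately left and right of the node
case_sensitive = False

sel = tokens[tokens["offset"].isin(positions)].copy()
if not case_sensitive:
    sel["word"] = sel["word"].str.lower()

# One pattern per line (tokens ordered by offset), then count pattern frequency.
patterns = (
    sel.sort_values("offset")
       .groupby("line_id")["word"]
       .agg(tuple)
)
freq = patterns.value_counts()
print(freq)
# ('the', 'sat')    2
# ('a', 'ran')      1
```

Lines are then partitioned by which pattern they exhibit, with partitions ordered by pattern frequency.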
## Partition with OpenAI

**Path:** `flexiconc/algorithms/partition_openai_semantic.py`

**Description:** Sends a list of lines to OpenAI and requests clustering into n groups with labels, using structured outputs to guarantee adherence to a JSON response schema (see the sketch after the schema below).

**Arguments:**
| Name | Type | Description |
|---|---|---|
| `openai_api_key` | string | The API key for OpenAI. |
| `n_partitions` | integer | The number of partitions/clusters to create. |
| `token_attr` | string | The token attribute to use for creating line texts. |
| `model` | string | The OpenAI model to use. |
| `introduction_line` | string | Customizable prompt for the clustering task. |
Full JSON schema:

```json
{
"type": "object",
"properties": {
"openai_api_key": {
"type": "string",
"description": "The API key for OpenAI."
},
"n_partitions": {
"type": "integer",
"description": "The number of partitions/clusters to create.",
"default": 5,
"x-eval": "dict(maximum=node.line_count)"
},
"token_attr": {
"type": "string",
"description": "The token attribute to use for creating line texts.",
"default": "word",
"x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
},
"model": {
"type": "string",
"description": "The OpenAI model to use.",
"default": "gpt-4o-2024-11-20"
},
"introduction_line": {
"type": "string",
"description": "Customizable prompt for the clustering task.",
"default": "You are given a list of lines of text. Cluster them into {n_partitions} clusters by the pattern in which the node word occurs. Ensure that none of the {n_partitions} clusters is empty."
}
},
"required": [
"openai_api_key"
]
}
```
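The structured-outputs pattern the description refers to can be sketched with the OpenAI Python SDK's `parse` helper. The response schema below (`Clustering`) is a hypothetical stand-in, not FlexiConc's actual schema, and the system prompt reuses the default `introduction_line`:

```python
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical response schema: a name per cluster plus a cluster index per line.
class Clustering(BaseModel):
    cluster_labels: list[str]  # one label per cluster
    assignments: list[int]     # cluster index for each input line, in input order

lines = ["he ran the risk", "she ran a marathon", "they ran the company"]
n_partitions = 2
intro = (
    "You are given a list of lines of text. Cluster them into "
    f"{n_partitions} clusters by the pattern in which the node word occurs. "
    f"Ensure that none of the {n_partitions} clusters is empty."
)

client = OpenAI(api_key="sk-...")  # the openai_api_key argument
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": intro},
        {"role": "user", "content": "\n".join(f"{i}: {s}" for i, s in enumerate(lines))},
    ],
    response_format=Clustering,  # structured outputs: reply must conform to the schema
)
result = completion.choices[0].message.parsed
print(result.cluster_labels, result.assignments)
```

Because the model is constrained to the schema, the reply parses directly into `Clustering` instead of requiring free-form JSON handling.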