Partitioning Algorithms

Flat Clustering by Embeddings

Path: flexiconc/algorithms/partition_by_embeddings.py

Description:

Partitions lines by clustering the embeddings stored in a concordance metadata column, using either Agglomerative Clustering or K-Means. For Agglomerative Clustering, the distance metric and linkage criterion are configurable.
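To make the agglomerative option concrete, here is a minimal pure-Python sketch of average-linkage clustering with cosine distance. It is an illustration only, not FlexiConc's implementation; the function names `cosine_distance` and `agglomerative_partition` are hypothetical, and the code assumes that no embedding is a zero vector.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_partition(embeddings, n_partitions):
    """Average-linkage agglomerative clustering; returns one label per line."""
    # Start with every line in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_partitions:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a, b in combinations(range(len(clusters)), 2):
            dists = [
                cosine_distance(embeddings[i], embeddings[j])
                for i in clusters[a]
                for j in clusters[b]
            ]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_partition(vectors, 2))  # [0, 0, 1, 1]
```

The brute-force search is far too slow for large concordances; it is only meant to show what "average linkage" and "cosine metric" refer to.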

Arguments:

Name               Type     Description
embeddings_column  string   The metadata column containing embeddings for each line. Required.
n_partitions       integer  The number of partitions/clusters to create. Default: 5; at most the number of lines in the node.
metric             string   The metric used to compute distances between embeddings (Agglomerative Clustering only). Default: 'cosine'.
linkage            string   The linkage criterion (Agglomerative Clustering only). Default: 'average'.
method             string   The clustering method to use: 'agglomerative' or 'kmeans'. The schema's default field is 'kmeans', although its description text says 'agglomerative'.
Full JSON schema:
{
  "type": "object",
  "properties": {
    "embeddings_column": {
      "type": "string",
      "description": "The metadata column containing embeddings for each line.",
      "x-eval": "dict(enum=[col for col in list(conc.metadata.columns) if (hasattr(conc.metadata[col].iloc[0], '__iter__') and not isinstance(conc.metadata[col].iloc[0], str) and all(isinstance(x, __import__('numbers').Number) for x in conc.metadata[col].iloc[0]))])"
    },
    "n_partitions": {
      "type": "integer",
      "description": "The number of partitions/clusters to create.",
      "default": 5,
      "x-eval": "dict(maximum=node.line_count)"
    },
    "metric": {
      "type": "string",
      "description": "The metric to compute distances between embeddings (used for Agglomerative Clustering only).",
      "default": "cosine"
    },
    "linkage": {
      "type": "string",
      "description": "The linkage criterion for Agglomerative Clustering (used only when method is 'agglomerative').",
      "default": "average"
    },
    "method": {
      "type": "string",
      "enum": [
        "agglomerative",
        "kmeans"
      ],
      "description": "The clustering method to use ('agglomerative' or 'kmeans'). Default is 'agglomerative'.",
      "default": "kmeans"
    }
  },
  "required": [
    "embeddings_column"
  ]
}
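The `x-eval` expression for `embeddings_column` is dense, so the following sketch unrolls its test into a named function. It mirrors the expression in the schema: a column qualifies when the value in its first row is a non-string iterable whose items are all numbers. The helper names and the dictionary standing in for the first metadata row are hypothetical.

```python
from numbers import Number


def looks_like_embedding(value):
    """Mirror the x-eval test: a non-string iterable whose items are all numbers."""
    return (
        hasattr(value, "__iter__")
        and not isinstance(value, str)
        and all(isinstance(x, Number) for x in value)
    )


def candidate_embedding_columns(first_row):
    """first_row: hypothetical dict mapping column name -> value in the first line."""
    return [name for name, value in first_row.items() if looks_like_embedding(value)]


first_row = {"text_id": "t01", "year": 2019, "embeddings": [0.12, -0.4, 0.88]}
print(candidate_embedding_columns(first_row))  # ['embeddings']
```

Two consequences of the expression are worth knowing: only the first row is inspected, and an empty sequence also passes because `all()` over no items is true.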

Partition by Metadata Attribute

Path: flexiconc/algorithms/partition_by_metadata_attribute.py

Description:

Partitions the concordance lines by the values of a specified metadata attribute: lines that share a value are grouped into the same partition.

Arguments:

Name                    Type     Description
metadata_attribute      string   The metadata attribute to partition by (e.g., 'text_id', 'speaker'). Required.
sort_by_partition_size  boolean  If True, partitions will be sorted by size in descending order. Default: true.
sorted_values           array    If provided, partitions will be sorted by these specific values (strings or numbers).
Full JSON schema:
{
  "type": "object",
  "properties": {
    "metadata_attribute": {
      "type": "string",
      "description": "The metadata attribute to partition by (e.g., 'text_id', 'speaker').",
      "x-eval": "dict(enum=list(set(conc.metadata.columns) - {'line_id'}))"
    },
    "sort_by_partition_size": {
      "type": "boolean",
      "description": "If True, partitions will be sorted by size in descending order.",
      "default": true
    },
    "sorted_values": {
      "type": [
        "array"
      ],
      "items": {
        "type": [
          "string",
          "number"
        ]
      },
      "description": "If provided, partitions will be sorted by these specific values."
    }
  },
  "required": [
    "metadata_attribute"
  ]
}
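The grouping and ordering behaviour can be sketched in a few lines of plain Python. This is not the library's code: the row format is hypothetical, and two points are assumptions made for the illustration, namely that `sorted_values` takes precedence over `sort_by_partition_size` and that values missing from `sorted_values` are placed last.

```python
from collections import defaultdict


def partition_by_attribute(rows, metadata_attribute,
                           sort_by_partition_size=True, sorted_values=None):
    """rows: hypothetical list of dicts, each with 'line_id' plus metadata fields.
    Returns (value, [line_ids]) pairs in display order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[metadata_attribute]].append(row["line_id"])

    if sorted_values is not None:
        # Assumption: an explicit order wins; unlisted values follow in first-seen order.
        rank = {value: i for i, value in enumerate(sorted_values)}
        ordered = sorted(groups, key=lambda v: rank.get(v, len(rank)))
    elif sort_by_partition_size:
        # Largest partition first; ties keep first-seen order (sorted is stable).
        ordered = sorted(groups, key=lambda v: len(groups[v]), reverse=True)
    else:
        ordered = list(groups)
    return [(value, groups[value]) for value in ordered]


rows = [
    {"line_id": 0, "speaker": "A"},
    {"line_id": 1, "speaker": "B"},
    {"line_id": 2, "speaker": "B"},
    {"line_id": 3, "speaker": "C"},
]
print(partition_by_attribute(rows, "speaker"))
# [('B', [1, 2]), ('A', [0]), ('C', [3])]
print(partition_by_attribute(rows, "speaker", sorted_values=["C", "A"]))
# [('C', [3]), ('A', [0]), ('B', [1, 2])]
```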

Partition by Ngrams

Path: flexiconc/algorithms/partition_ngrams.py

Description:

Extracts ngram patterns from specified positions and partitions the concordance according to their frequency in the concordance lines. See Anthony (2018) and subsequent work for more information.

Arguments:

Name              Type     Description
positions         array    The list of positions (integer offsets) to extract for the ngram pattern. Required.
tokens_attribute  string   The positional attribute to search within (e.g., 'word'). Default: 'word'.
case_sensitive    boolean  If True, the search is case-sensitive. Default: false.
Full JSON schema:
{
  "type": "object",
  "properties": {
    "positions": {
      "type": "array",
      "items": {
        "type": "integer"
      },
      "description": "The list of positions (offsets) to extract for the ngram pattern."
    },
    "tokens_attribute": {
      "type": "string",
      "description": "The positional attribute to search within (e.g., 'word').",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "case_sensitive": {
      "type": "boolean",
      "description": "If True, the search is case-sensitive.",
      "default": false
    }
  },
  "required": [
    "positions"
  ]
}
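As a rough illustration of the idea, the sketch below reads the token at each requested offset, joins the tokens into a pattern, and groups lines by pattern with the most frequent pattern first. The input format (a dict of offset-to-token mappings per line) is hypothetical, as are the treatment of missing offsets as empty strings and the frequency-descending order; none of this is taken from the library's source.

```python
from collections import Counter, defaultdict


def partition_by_ngrams(lines, positions, case_sensitive=False):
    """lines: hypothetical dict line_id -> {offset: token} for one tokens attribute.
    Returns (pattern, [line_ids]) pairs, most frequent pattern first."""
    patterns = {}
    for line_id, tokens_by_offset in lines.items():
        # Assumption: an offset with no token contributes an empty string.
        tokens = [tokens_by_offset.get(offset, "") for offset in positions]
        if not case_sensitive:
            tokens = [token.lower() for token in tokens]
        patterns[line_id] = tuple(tokens)

    counts = Counter(patterns.values())
    groups = defaultdict(list)
    for line_id, pattern in patterns.items():
        groups[pattern].append(line_id)
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [(pattern, groups[pattern]) for pattern, _ in counts.most_common()]


lines = {
    0: {-1: "the", 0: "bank", 1: "of"},
    1: {-1: "The", 0: "bank", 1: "of"},
    2: {-1: "a", 0: "bank", 1: "account"},
}
print(partition_by_ngrams(lines, [-1, 1]))
# [(('the', 'of'), [0, 1]), (('a', 'account'), [2])]
```

With `case_sensitive=True`, "the" and "The" would yield two distinct patterns, so the same three lines would fall into three partitions of one line each.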

Partition with OpenAI

Path: flexiconc/algorithms/partition_openai_semantic.py

Description:

Sends a list of lines to OpenAI and requests clustering into n groups with labels, using structured outputs for guaranteed JSON schema adherence.

Arguments:

Name               Type     Description
openai_api_key     string   The API key for OpenAI. Required.
n_partitions       integer  The number of partitions/clusters to create. Default: 5; at most the number of lines in the node.
token_attr         string   The token attribute to use for creating line texts. Default: 'word'.
model              string   The OpenAI model to use. Default: 'gpt-4o-2024-11-20'.
introduction_line  string   Customizable prompt for the clustering task. May contain the placeholder {n_partitions}.
Full JSON schema:
{
  "type": "object",
  "properties": {
    "openai_api_key": {
      "type": "string",
      "description": "The API key for OpenAI."
    },
    "n_partitions": {
      "type": "integer",
      "description": "The number of partitions/clusters to create.",
      "default": 5,
      "x-eval": "dict(maximum=node.line_count)"
    },
    "token_attr": {
      "type": "string",
      "description": "The token attribute to use for creating line texts.",
      "default": "word",
      "x-eval": "dict(enum=list(set(conc.tokens.columns) - {'id_in_line', 'line_id', 'offset'}))"
    },
    "model": {
      "type": "string",
      "description": "The OpenAI model to use.",
      "default": "gpt-4o-2024-11-20"
    },
    "introduction_line": {
      "type": "string",
      "description": "Customizable prompt for the clustering task.",
      "default": "You are given a list of lines of text. Cluster them into {n_partitions} clusters by the pattern in which the node word occurs. Ensure that none of the {n_partitions} clusters is empty."
    }
  },
  "required": [
    "openai_api_key"
  ]
}
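The default `introduction_line` contains the placeholder `{n_partitions}`. The sketch below shows one plausible way to fill it and to sanity-check a clustering reply, without calling the OpenAI API. It is hedged throughout: filling the placeholder with `str.format`, numbering the lines in the prompt, and the shape of the parsed reply (a list of objects with `label` and `lines`) are all assumptions, and `build_prompt` and `check_clusters` are hypothetical names.

```python
DEFAULT_INTRODUCTION = (
    "You are given a list of lines of text. Cluster them into {n_partitions} "
    "clusters by the pattern in which the node word occurs. Ensure that none "
    "of the {n_partitions} clusters is empty."
)


def build_prompt(line_texts, n_partitions, introduction_line=DEFAULT_INTRODUCTION):
    """Combine the filled-in introduction with one numbered entry per line."""
    # Assumption: the placeholder is filled with str.format.
    intro = introduction_line.format(n_partitions=n_partitions)
    numbered = [f"{i}: {text}" for i, text in enumerate(line_texts)]
    return "\n".join([intro, *numbered])


def check_clusters(clusters, n_lines, n_partitions):
    """clusters: hypothetical parsed reply, a list of {'label': str, 'lines': [int]}.
    True when there are n_partitions non-empty clusters covering each line once."""
    assigned = sorted(i for cluster in clusters for i in cluster["lines"])
    return (
        len(clusters) == n_partitions
        and all(cluster["lines"] for cluster in clusters)
        and assigned == list(range(n_lines))
    )


prompt = build_prompt(["a bank of the river", "the bank account"], 2)
reply = [
    {"label": "riverside", "lines": [0]},
    {"label": "finance", "lines": [1]},
]
print(check_clusters(reply, n_lines=2, n_partitions=2))  # True
```

A custom `introduction_line` that contains other literal braces would need them doubled under this `str.format` assumption.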