Representing Concordances¶

FlexiConc is designed to standardize the way concordance data is stored and processed. Internally, a Concordance object always contains three Pandas DataFrames—metadata, tokens, and matches—which together form its internal structure.

1. Internal Structure of the Concordance Object¶

The Concordance object maintains three DataFrames that hold different aspects of the concordance data:

A. Metadata DataFrame¶

The metadata DataFrame stores information about each concordance line. It must include a unique identifier, line_id, for each row. Additional columns may include details such as the source text identifier, chapter, paragraph, and sentence.

For example, the metadata table might be structured as follows:

line_id	text_id	chapter	paragraph	sentence
0	ED	10	35	63
1	ED	10	36	71
2	ED	14	115	262
3	ED	23	7	78
4	LD	45	85	192

B. Tokens DataFrame¶

The tokens DataFrame holds token-level details for each concordance line. It must include line_id to link tokens to their corresponding metadata entry. In addition, the tokens DataFrame typically contains:

offset: Indicates the token’s relative position to the matched (node) token(s):
- Negative values represent tokens in the left context.
- Zero marks the matched (node) token(s).
- Positive values represent tokens in the right context.
id_in_line: (Optional) The token’s sequential index within the line. If omitted, FlexiConc will reconstruct it.
word: The token text.
Other attributes such as part-of-speech, lemma, etc., may also be provided.

A tokens table might look like this:

cpos	offset	word	pos	lemma	id_in_line
445643	-20	to	IN	to	0
445644	-19	the	DT	the	1
445645	-18	toll-keeper	JJ	toll-keeper	2
445646	-17	keeper	NN	keeper	3
445647	-16	.	.	.	4
445648	-15	then	RB	then	5
445649	-14	he	PRP	he	6

Notes: - The cpos column (optional) represents the corpus position in the original corpus when available.

C. Matches DataFrame¶

The matches DataFrame specifies the match location within each of the concordance lines. In simple cases, it contains the same information as the offset column in tokens. It must include:

line_id: To associate the match with its corresponding metadata row.
match_start and match_end: Indicate the token indices that mark the beginning and end of the match.
slot: An integer that allows for multiple matches per line.

An example matches table might look like this:

line_id	match_start	match_end
0	20	20
1	61	61
2	102	102
3	143	143

The matches DataFrame is essential when there are multiple matching slots per line:

line_id	match_start	match_end	slot
0	20	20	0
0	22	22	1
1	61	61	0
1	62	62	1

The slot column allows FlexiConc to focus on different matches within the same line, enabling more complex analyses.