spaCy NLP Module

The spacy_module standard library module exposes spaCy's NLP pipeline as Clausal predicates. It provides model management, tokenisation, linguistic annotations, named-entity recognition, sentence segmentation, noun chunks, and vector similarity — all accessible from .clausal files via a relational interface.

Requires: pip install spacy and at least one downloaded spaCy model (e.g. python -m spacy download en_core_web_sm).

-import_from(spacy_module, [LoadModel, Process, Token, Lemma, Entity])

Nouns(DOC, TOK) <- (
    LoadModel("en_core_web_sm", "nlp"),
    Process("nlp", DOC, DOC_OBJ),
    Token(DOC_OBJ, TOK),
    Pos(TOK, "NOUN")
)

Or via the py.spacy alias:

-import_from(py.spacy, [LoadModel, Process, Entity, EntityList])

Import

-import_from(spacy_module, [
    LoadModel, UnloadModel, CurrentModel,
    Process,
    Token, TokenText, TokenList,
    Pos, Tag, Lemma, Dep, Head, Shape, IsAlpha, IsStop,
    Entity, EntityList,
    Sentence, SentenceList,
    Similarity,
    NounChunk
])

Layer 1 — Model management

Models are loaded once and kept in a module-level registry under string aliases. All registry access is thread-safe.
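A minimal Python sketch of this registry pattern — the names (`_MODELS`, `_LOCK`, `load_model`, `unload_model`) are illustrative assumptions, not the module's actual internals:

```python
import threading

# Illustrative sketch of the module-level registry described above.
_MODELS = {}
_LOCK = threading.Lock()

def load_model(name, alias=None, loader=None):
    """Idempotent, thread-safe load: reloading an existing alias is a no-op."""
    alias = alias or name
    with _LOCK:
        if alias in _MODELS:          # already loaded -> succeed immediately
            return _MODELS[alias]
        # In the real module this would be spacy.load(name).
        model = (loader or (lambda n: object()))(name)
        _MODELS[alias] = model
        return model

def unload_model(alias):
    """Remove a model; returns False (logical failure) if not registered."""
    with _LOCK:
        return _MODELS.pop(alias, None) is not None
```

Holding the lock for both the membership check and the insert is what makes concurrent `LoadModel` calls safe: two goals loading the same alias race on the lock, and the loser sees the winner's entry.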

LoadModel/1

# skip
LoadModel(+Name)

Load a spaCy model by name; the model name is used as the alias. Idempotent — if the alias is already loaded, succeeds immediately.

# skip
LoadModel("en_core_web_sm")

LoadModel/2

# skip
LoadModel(+Name, +Alias)

Load Name under a custom Alias. Useful for loading the same model under multiple names or for shorter identifiers.

# skip
LoadModel("en_core_web_sm", "en")
LoadModel("en_core_web_lg", "en_lg")

UnloadModel/1

# skip
UnloadModel(+Alias)

Remove the model from the registry. Fails if the alias is not registered.

CurrentModel/1

# skip
CurrentModel(?Alias)

When Alias is unbound, nondeterministically enumerates all registered aliases. When ground, succeeds if that alias is currently loaded.

ListModels(A) <- CurrentModel(A)

Layer 2 — Document processing

Process/3

# skip
Process(+Alias, +Text, -Doc)

Run Text through the model registered as Alias and unify Doc with the resulting spaCy Doc object. The Doc object is an opaque handle passed to all downstream predicates.

setup(DOC) <- (
    LoadModel("en_core_web_sm", "nlp"),
    Process("nlp", "The quick brown fox jumps.", DOC)
)

Layer 3 — Tokens

A token is represented as a plain Python dict with keys:

Key         Type   Description
text        str    Surface form
lemma       str    Lemmatised form
pos         str    Coarse POS tag (Universal Dependencies)
tag         str    Fine-grained POS tag
dep         str    Dependency label
head_text   str    Surface form of the syntactic head
head_i      int    Index of the syntactic head token
i           int    Token index within the document
is_alpha    bool   True if the token consists of alphabetic characters
is_stop     bool   True if the token is a stop word
shape       str    Orthographic shape (e.g. "Xxxxx", "dd")

Token/2

# skip
Token(+Doc, -Tok)

Nondeterministic. Yields one solution per token in Doc, binding Tok to the token dict.

AllTokens(DOC, TOK) <- Token(DOC, TOK)

Token/3

# skip
Token(+Doc, ?Index, -Tok)

When Index is ground, retrieves the token at that position (fails if out of range). When Index is unbound, iterates all tokens and binds Index to each token's position.

FirstToken(DOC, TOK) <- Token(DOC, 0, TOK)
IndexedTokens(DOC, I, TOK) <- Token(DOC, I, TOK)
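The two modes of Token/3 can be sketched in Python as a single generator (illustrative only; `None` stands in for an unbound Index):

```python
def token_at(tokens, index=None):
    """Sketch of Token/3 over a list of token dicts.
    Ground index: yield the token at that position, or nothing if out of
    range (logical failure). Unbound index (None here): enumerate all
    (index, token) pairs, one per solution."""
    if index is not None:
        if 0 <= index < len(tokens):
            yield index, tokens[index]
    else:
        yield from enumerate(tokens)
```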

TokenText/2

# skip
TokenText(+Tok, ?Text)

Unify Text with the surface form of a token dict. Equivalent to T is ++Tok["text"] but more readable.

IsApple(TOK) <- TokenText(TOK, "Apple")

TokenList/2

# skip
TokenList(+Doc, -Tokens)

Unify Tokens with a list of all token dicts in the document. Deterministic.

Toks(DOC, TOKENS) <- TokenList(DOC, TOKENS)

Layer 4 — Linguistic annotations

All annotation predicates take a token dict as their first argument and unify the second argument with the annotation value.

Pos/2

# skip
Pos(+Tok, ?Tag)

Coarse-grained Universal Dependencies POS tag: "NOUN", "VERB", "PROPN", "ADJ", etc.

Why Pos not POS?

POS is all-uppercase, which the term transformer would interpret as a logic variable. The predicate is therefore named Pos.

Tag/2

# skip
Tag(+Tok, ?FineTag)

Fine-grained POS tag specific to the language model (e.g. "NNS", "VBZ" for English Penn Treebank).

Lemma/2

# skip
Lemma(+Tok, ?Lem)

Lemmatised form of the token (e.g. "run" for "running").

Dep/2

# skip
Dep(+Tok, ?Label)

Dependency relation to the syntactic head: "nsubj", "dobj", "ROOT", etc.

Head/2

# skip
Head(+Tok, ?HeadText)

Surface form of the syntactic head token.

Shape/2

# skip
Shape(+Tok, ?Shp)

Orthographic shape string: "Xxxxx" for "Apple", "dd" for "42", etc.

IsAlpha/1

# skip
IsAlpha(+Tok)

Succeeds if the token consists entirely of alphabetic characters. Fails otherwise.

IsStop/1

# skip
IsStop(+Tok)

Succeeds if the token is a stop word in the model's language. Fails otherwise.


Layer 5 — Named entity recognition

An entity is a dict with keys: text, label, start, end, start_char, end_char.
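For illustration, a plausible entity dict is shown below (the label value depends on the model's NER output; what is structural is that start/end are token indices while start_char/end_char are character offsets into the original text, so slicing recovers the entity's surface form):

```python
# Illustrative entity dict for "Apple" in the text below.
text = "Apple is looking at buying a startup."
ent = {"text": "Apple", "label": "ORG",
       "start": 0, "end": 1,            # token span [start, end)
       "start_char": 0, "end_char": 5}  # character span [start_char, end_char)
assert text[ent["start_char"]:ent["end_char"]] == ent["text"]
```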

Entity/2

# skip
Entity(+Doc, -Ent)

Nondeterministic. Yields one solution per entity in the document.

Orgs(DOC, ENT) <- (Entity(DOC, ENT), T is ++ENT["label"], T == "ORG")

Entity/3

# skip
Entity(+Doc, +Label, -Ent)

Filtered iteration — only yields entities whose label matches Label.

People(DOC, ENT) <- Entity(DOC, "PERSON", ENT)
Orgs(DOC, ENT) <- Entity(DOC, "ORG", ENT)

EntityList/2

# skip
EntityList(+Doc, -Ents)

Unify Ents with a list of all entity dicts. Deterministic.


Layer 6 — Sentences

Sentences are plain strings (the .text of each spaCy Span).

Sentence/2

# skip
Sentence(+Doc, -Sent)

Nondeterministic. Yields one solution per sentence.

SentenceList/2

# skip
SentenceList(+Doc, -Sents)

Unify Sents with a list of all sentence strings. Deterministic.

Note

Sentence segmentation requires the senter or sentencizer component in the model pipeline. It is enabled by default in en_core_web_sm and other standard models.


Layer 7 — Similarity

Similarity/4

# skip
Similarity(+Alias, +Text1, +Text2, -Score)

Process both texts through the model and unify Score with their cosine similarity — a float typically in [0.0, 1.0] (cosine similarity can in principle be negative for opposing vectors).

Close(T1, T2) <- (
    Similarity("en", T1, T2, S),
    S > 0.8
)
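The Score is ordinary cosine similarity over the two documents' vectors. A sketch (spaCy computes this internally over each Doc.vector):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors -- the quantity Similarity/4
    binds to Score. Returns 0.0 when either vector is all zeros (as happens
    with models that lack word vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```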

Note

Similarity requires word vectors in the model. Use en_core_web_md or en_core_web_lg instead of _sm for meaningful scores.


Layer 8 — Noun chunks

A noun chunk is a dict with keys: text, root_text, root_dep, root_head_text.

NounChunk/2

# skip
NounChunk(+Doc, -Chunk)

Nondeterministic. Yields one solution per noun chunk.

Subjects(DOC, CHUNK) <- (
    NounChunk(DOC, CHUNK),
    D is ++CHUNK["root_dep"],
    D == "nsubj"
)

Working example

-import_from(spacy_module, [LoadModel, Process, Token, Pos, Lemma, Entity, Dep])

# Find all noun subjects in a sentence
NounSubjects(TEXT, LEMMA) <- (
    LoadModel("en_core_web_sm", "nlp"),
    Process("nlp", TEXT, DOC),
    Token(DOC, TOK),
    Pos(TOK, "NOUN"),
    Dep(TOK, "nsubj"),
    Lemma(TOK, LEMMA)
)

# Extract all organisation entities
Orgs(TEXT, ORG_TEXT) <- (
    LoadModel("en_core_web_sm", "nlp"),
    Process("nlp", TEXT, DOC),
    Entity(DOC, "ORG", ENT),
    ORG_TEXT is ++ENT["text"]
)

# Filter tokens by POS and collect as list
NounLemmas(TEXT, LEMMAS) <- (
    LoadModel("en_core_web_sm", "nlp"),
    Process("nlp", TEXT, DOC),
    FindAll(L, (Token(DOC, TOK), Pos(TOK, "NOUN"), Lemma(TOK, L)), LEMMAS)
)

Test coverage

Tests are in tests/test_spacy_module.py (50+ tests, skipped if spaCy is unavailable).

  • Helpers: _token_to_dict, _ent_to_dict, _chunk_to_dict key sets and values
  • Model registry: load/unload, idempotent load, CurrentModel enumerate/check
  • Process: returns Doc, error on unknown alias
  • Token/2,3: iteration count, first token, by index, out-of-range, iterate with index
  • TokenText, TokenList: extraction, list length and contents
  • Annotation predicates: Pos (PROPN), Lemma (look), Dep, Shape, IsAlpha, IsStop
  • NER: entity iteration, label filter, empty filter, EntityList
  • Sentences: Sentence/2 iteration, SentenceList
  • Similarity: identical texts (≈1.0), score is float in [0,1]
  • NounChunk: chunk count, dict keys
  • Adapter: single-arity dispatch, multi-arity dispatch, repr, unknown arity → DONE
  • py.spacy alias: re-exports are identical objects
  • Fixture integration: spacy_basic.clausal (17 Test predicates)

Implementation

  • Module: clausal/modules/spacy_module.py
  • Alias: clausal/modules/py/spacy.py
  • Adapter class: _SpacyPredicate — same pattern as _SQLitePredicate and _RegexPredicate
  • Simple → trampoline: _simple_to_trampoline() wraps deterministic generators
  • Nondeterministic predicates: native trampoline protocol with trail.mark()/trail.undo(mark) per solution
  • Lazy import: import spacy is deferred to first use via _get_spacy() so the module loads cleanly even when spaCy is not installed
  • Thread-safe registry: model dict protected by threading.Lock
  • Token representation: plain Python dicts (not opaque handles) — easy to inspect, log, and use with ++() interop

Design decisions

  1. Doc as opaque handle — the spaCy Doc object is passed directly as a logic term. It can be unified, stored, and passed around, but its internal structure is accessed only via the provided predicates.
  2. Token as dict — tokens are converted to plain Python dicts. This makes them easy to access with ++TOK["text"] and compatible with dict-handling builtins. Dicts are ground (no logic variables inside), so they unify structurally.
  3. Filtered iteration — Entity/3 and similar predicates filter at iteration time rather than via a separate filter predicate, following the pattern of SQLiteQuery/4 with SQL WHERE clauses.
  4. Model aliases — models are referenced by string aliases throughout, making predicates composable without carrying model references. The same pattern is used in the SQLite module.
  5. Pos not POS — POS is all-uppercase and would be treated as a logic variable by the term transformer. Pos (title-case) avoids the collision.
  6. Lazy spaCy import — import spacy is deferred to first use so that .clausal files importing this module compile correctly even when spaCy is not installed. Errors are reported at predicate call time with a clear message.