Generator Validation

Generator validation in pydantic-core handles generators and other iterators without consuming them up front. Items are validated one at a time as they are requested, which keeps memory use low for large or unbounded streams of data while still validating every item that is yielded.

The Generator Schema

The GeneratorSchema is a TypedDict that defines how a generator or iterable should be validated and serialized. It is typically constructed using the generator_schema helper function.

from pydantic_core import core_schema

schema = core_schema.generator_schema(
    items_schema=core_schema.int_schema(),
    min_length=2,
    max_length=5,
)

The schema includes several key fields:

  • items_schema: A CoreSchema used to validate every item yielded by the generator.
  • min_length / max_length: Constraints on the number of items the generator must yield.
  • serialization: Configuration for how the generator should be serialized (e.g., using filter_seq_schema for include/exclude logic).
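
Because GeneratorSchema is a TypedDict, a schema value is an ordinary dictionary. The literal below is a hand-written sketch of the shape such a schema is expected to have for the fields listed above. It assumes that the helper simply omits keys that were not set, and the variable name is illustrative only.

```python
# Hand-written sketch of a generator schema as a plain dict.
# Assumption: this mirrors what generator_schema(...) builds for these arguments.
generator_schema_dict = {
    "type": "generator",
    "items_schema": {"type": "int"},
    "min_length": 2,
    "max_length": 5,
}

# Fields that were not set (for example "serialization") are simply absent.
print(sorted(generator_schema_dict))
```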

Lazy Validation Mechanics

Unlike list or set validation, which processes all items immediately, generator validation is lazy. When you call validate_python on a generator, pydantic-core does not iterate over the input. Instead, it returns a ValidatorIterator object.

The ValidatorIterator

The ValidatorIterator wraps the original iterable and performs validation on-the-fly as items are requested. This means that a ValidationError is not raised when validate_python is called, but rather when next() is called on the resulting iterator if an item fails validation.

from pydantic_core import SchemaValidator, core_schema, ValidationError

def my_generator():
    yield 1
    yield "not an int"

v = SchemaValidator(core_schema.generator_schema(items_schema=core_schema.int_schema()))
validated_gen = v.validate_python(my_generator())

# The first item is valid
assert next(validated_gen) == 1

# The second item fails validation only when accessed
try:
    next(validated_gen)
except ValidationError as e:
    print(e.errors())
    # Output: [{'type': 'int_parsing', 'loc': (1,), ...}]

As seen in pydantic-core/tests/validators/test_generator.py, the ValidatorIterator exposes an index attribute that counts the items consumed from the input so far, including an item that failed validation. The zero-based position of a failing item is used as the loc (location) of the ValidationError raised for it.
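
The mechanism described in this section can be modelled in a few lines of plain Python. This is a minimal sketch, not pydantic-core's implementation: LazyValidatingIterator, ItemError, and strict_int are hypothetical names, and strict_int is a much cruder check than int_schema.

```python
from typing import Any, Callable, Iterable


class ItemError(ValueError):
    """Hypothetical error type carrying the failing item's position."""

    def __init__(self, loc: tuple, value: Any) -> None:
        super().__init__(f"invalid item at {loc}: {value!r}")
        self.loc = loc
        self.value = value


class LazyValidatingIterator:
    """Illustrative stand-in for a validating iterator (not pydantic-core's code)."""

    def __init__(self, source: Iterable[Any], validate: Callable[[Any], Any]) -> None:
        self._source = iter(source)
        self._validate = validate
        self.index = 0  # number of items pulled from the source so far

    def __iter__(self) -> "LazyValidatingIterator":
        return self

    def __next__(self) -> Any:
        item = next(self._source)  # StopIteration propagates when exhausted
        position = self.index
        self.index += 1  # advance even if validation fails below
        try:
            return self._validate(item)
        except (TypeError, ValueError) as exc:
            raise ItemError((position,), item) from exc


def strict_int(value: Any) -> int:
    """Accept only real ints; a simplification of integer validation."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("not an int")
    return value


it = LazyValidatingIterator([1, "not an int", 3], strict_int)
first = next(it)  # nothing was validated until this call
try:
    next(it)
    failed_loc = None
except ItemError as err:
    failed_loc = err.loc  # zero-based position of the failing item
third = next(it)  # in this sketch, iteration continues past a failed item
```

Constructing the iterator does no work; each next() call pulls and validates exactly one item, which is why the error for the second item only appears on the second call.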

Length Constraints

Length constraints (min_length and max_length) are also enforced lazily:

  1. max_length: If the generator yields more items than allowed, a too_long error is raised by the ValidatorIterator as soon as the limit is exceeded.
  2. min_length: This constraint can only be checked once the underlying generator is exhausted. If the total number of items yielded is less than min_length, a too_short error is raised on the next() call that follows the final item, where iteration would otherwise simply end.

This behavior is verified in test_generator_too_long within the test suite, where the error is raised exactly at the step where the third item is attempted to be read from a generator with max_length=2.
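
The timing of the two checks can be illustrated with a plain generator function. This is a sketch under stated assumptions: enforce_length and LengthError are hypothetical names, and the code only models when each bound can be decided, not how pydantic-core reports the error.

```python
from typing import Any, Iterable, Iterator, Optional


class LengthError(ValueError):
    """Hypothetical error type for this sketch; kind is 'too_long' or 'too_short'."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


def enforce_length(
    source: Iterable[Any],
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
) -> Iterator[Any]:
    """Yield items lazily, checking each bound only when it can be decided."""
    count = 0
    for item in source:
        count += 1
        # The upper bound is violated the moment one extra item arrives.
        if max_length is not None and count > max_length:
            raise LengthError("too_long")
        yield item
    # The lower bound can only be judged once the source is exhausted.
    if min_length is not None and count < min_length:
        raise LengthError("too_short")


def three_items():
    yield 1
    yield 2
    yield 3


bounded = enforce_length(three_items(), max_length=2)
seen = [next(bounded), next(bounded)]  # the first two items pass through
try:
    next(bounded)  # the third read exceeds max_length
    long_kind = None
except LengthError as err:
    long_kind = err.kind

short = enforce_length(iter([1]), min_length=2)
only = next(short)
try:
    next(short)  # the source is exhausted after a single item
    short_kind = None
except LengthError as err:
    short_kind = err.kind
```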

Serialization

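Generators are serialized as JSON arrays. Similar to validation, serialization to Python is lazy: to_python returns an iterator rather than a materialized list. Serialization with to_json is not lazy, because the whole generator has to be consumed to produce the output.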

SerializationIterator

When serializing a generator to Python, pydantic-core returns a SerializationIterator. This iterator performs any necessary transformations (like converting objects to dicts) as you iterate over it.

from pydantic_core import SchemaSerializer, core_schema

def gen():
    yield 1
    yield 2

s = SchemaSerializer(core_schema.generator_schema(core_schema.int_schema()))
ser_gen = s.to_python(gen())

assert next(ser_gen) == 1
assert ser_gen.index == 1
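
The same idea can be sketched without pydantic-core to show a transformation being applied per item. LazySerializingIterator and Point are hypothetical names, and dataclasses.asdict stands in for whatever conversion the real serializer would perform.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Iterable


@dataclass
class Point:
    x: int
    y: int


class LazySerializingIterator:
    """Illustrative stand-in; not pydantic-core's SerializationIterator."""

    def __init__(self, source: Iterable[Any], serialize: Callable[[Any], Any]) -> None:
        self._source = iter(source)
        self._serialize = serialize
        self.index = 0  # number of items serialized so far

    def __iter__(self) -> "LazySerializingIterator":
        return self

    def __next__(self) -> Any:
        item = next(self._source)  # StopIteration ends the iteration
        self.index += 1
        return self._serialize(item)  # convert only the item being requested


def points():
    yield Point(1, 2)
    yield Point(3, 4)


ser = LazySerializingIterator(points(), asdict)
first_dict = next(ser)  # only the first Point has been converted
index_after_first = ser.index
rest = list(ser)  # the remaining items are converted here
```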

JSON Serialization

When serializing to JSON via to_json(), the generator is fully exhausted and represented as a standard JSON array. If the generator raises an exception during this process (e.g., a ValueError inside the generator function), the serializer will propagate that error.
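
A stdlib-only sketch of the eager behaviour described above follows. dump_iterable is a hypothetical helper built on json.dumps, used here only to show that producing a JSON array drains the source and lets an exception raised inside the generator propagate to the caller.

```python
import json
from typing import Any, Iterable


def dump_iterable(source: Iterable[Any]) -> bytes:
    """Consume the whole iterable and encode it as a compact JSON array."""
    # list(...) drains the source; any exception raised inside it propagates.
    return json.dumps(list(source), separators=(",", ":")).encode()


def gen_ok():
    yield 1
    yield 2


def gen_error():
    yield 1
    raise ValueError("oops")


encoded = dump_iterable(gen_ok())

source = gen_ok()
dump_iterable(source)
leftover = list(source)  # nothing remains: the generator was fully consumed

try:
    dump_iterable(gen_error())
    message = None
except ValueError as err:
    message = str(err)
```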

Design Tradeoffs

The implementation of GeneratorSchema reflects a specific set of design choices:

  • Memory vs. Eagerness: By choosing lazy validation, pydantic-core prioritizes memory efficiency. You can validate a generator yielding millions of rows from a database without loading them all into memory. The tradeoff is that you cannot know if the entire sequence is valid without consuming it.
  • Validator State: The ValidatorIterator is stateful and single-pass. Once an item has been consumed, whether it validated successfully or raised an error, the same iterator cannot be rewound to validate it again. Revalidating requires a fresh input and a new validate_python call.
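  • Error Context: Because errors surface during iteration rather than at the validate_python call, the code that consumes the iterator must be prepared to handle ValidationError. For length errors (too_long, too_short), no single item is at fault, so the input reported in the ValidationError is the generator object itself (shown by its repr). For an item that fails items_schema, the error reports the failing value and identifies its position through the loc index.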
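
The memory and state tradeoffs can be seen with an unbounded source. This sketch uses only the standard library; tracked_source and validated are hypothetical helpers, and the pulled list exists solely to record how many items were actually read.

```python
from itertools import count, islice
from typing import Any, Iterable, Iterator, List

pulled: List[int] = []  # records every item actually read from the source


def tracked_source() -> Iterator[int]:
    """An infinite stream that logs each item as it is produced."""
    for n in count():
        pulled.append(n)
        yield n


def validated(source: Iterable[Any]) -> Iterator[int]:
    """Lazily check each item; nothing is read until the caller asks for it."""
    for item in source:
        if not isinstance(item, int):
            raise TypeError(f"not an int: {item!r}")
        yield item


stream = validated(tracked_source())
first_three = list(islice(stream, 3))  # reads exactly three items
pulled_after_three = list(pulled)

# The iterator is single-pass: it resumes where it stopped and cannot rewind.
resumed = next(stream)
```

Only the items that were requested are ever produced, so the infinite source is safe to wrap. The cost is that the validity of items further down the stream remains unknown until they are read.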