Global String and Regex Behavior

In this codebase, string validation behavior is managed through a dual-layer configuration system. Global defaults are defined in CoreConfig, while specific field behaviors are defined in StringSchema. This allows for consistent string handling across an entire schema while permitting granular overrides where necessary.

Global Configuration with CoreConfig

CoreConfig, a TypedDict defined in pydantic_core.core_schema, provides several keys that establish default validation rules for every string schema handled by a SchemaValidator. These settings are useful for enforcing organization-wide or application-wide standards, such as maximum input lengths or automatic whitespace trimming.

Key global string attributes include:

  • str_max_length and str_min_length: Set global bounds on string length.
  • str_strip_whitespace: When True, leading and trailing whitespace is removed from all strings.
  • str_to_lower and str_to_upper: Automatically transform string casing.

As shown in pydantic-core/tests/test_config.py, these global settings are applied when the StringSchema (via cs.str_schema()) does not specify its own constraints:

from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Global constraint applied via CoreConfig
v = SchemaValidator(
    cs.str_schema(),
    config=CoreConfig(str_max_length=5)
)

assert v.isinstance_python('test') is True
assert v.isinstance_python('test long') is False
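
The other global keys can be exercised the same way. The following is a minimal sketch rather than code from the test suite; the validator names (normalizing, bounded) are illustrative only.

```python
from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Global transformations: every string is trimmed and lower-cased.
normalizing = SchemaValidator(
    cs.str_schema(),
    config=CoreConfig(str_strip_whitespace=True, str_to_lower=True),
)
cleaned = normalizing.validate_python('  Hello World  ')
assert cleaned == 'hello world'

# Global lower bound on string length.
bounded = SchemaValidator(
    cs.str_schema(),
    config=CoreConfig(str_min_length=3),
)
assert bounded.isinstance_python('abc') is True
assert bounded.isinstance_python('ab') is False
```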

Field-Level Overrides

Individual fields defined via StringSchema can override any global setting. If a constraint is defined in both CoreConfig and StringSchema, the field-level definition takes precedence.

This hierarchy is demonstrated in the project's test suite:

from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Field-level max_length (5) overrides global str_max_length (10)
v = SchemaValidator(
    cs.str_schema(max_length=5),
    config=CoreConfig(str_max_length=10)
)

assert v.isinstance_python('test') is True
assert v.isinstance_python('test long') is False
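
Precedence is not the same as "the stricter limit wins". As a sketch under that reading of the hierarchy (the name relaxed is illustrative), a field can also loosen a tighter global bound:

```python
from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Field-level max_length (10) replaces the tighter global str_max_length (5).
relaxed = SchemaValidator(
    cs.str_schema(max_length=10),
    config=CoreConfig(str_max_length=5),
)

# 9 characters: too long for the global limit, fine for the field limit.
assert relaxed.isinstance_python('test long') is True
# 16 characters: too long for the field limit as well.
assert relaxed.isinstance_python('test much longer') is False
```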

Regex Engine Selection

The codebase supports two different engines for validating string patterns: rust-regex and python-re. This can be configured globally in CoreConfig.regex_engine or locally in StringSchema.regex_engine.

Rust Regex Engine (rust-regex)

This is the default engine. It is implemented using the Rust regex crate, which guarantees linear-time searching and protects against ReDoS (Regular Expression Denial of Service) attacks. However, it does not support certain advanced features like look-around or backreferences.

Python Regex Engine (python-re)

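The Python engine uses the standard library re module. Select it when a pattern needs features that the Rust engine does not support, such as look-around or backreferences. Because re is a backtracking engine, it does not carry the Rust engine's linear-time guarantee.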

An example from pydantic-core/tests/validators/test_string.py shows the Python engine being used for backreferences (which rust-regex would reject):

from pydantic_core import SchemaValidator, core_schema

# Using Python regex engine for backreference support (\1)
pattern = r'r(#*)".*?"\1'
v = SchemaValidator(
    core_schema.str_schema(pattern=pattern, regex_engine='python-re')
)

assert v.validate_python('r#""#') == 'r#""#'
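
For contrast, the sketch below builds the same pattern with the default engine and then switches engines globally through CoreConfig. It assumes that an unsupported pattern is reported as a SchemaError when the validator is constructed; the variable names are illustrative.

```python
from pydantic_core import (
    CoreConfig,
    SchemaError,
    SchemaValidator,
    ValidationError,
    core_schema as cs,
)

pattern = r'r(#*)".*?"\1'

# Default engine (rust-regex): the backreference cannot be compiled,
# so building the validator fails.
try:
    SchemaValidator(cs.str_schema(pattern=pattern))
    rust_accepts_pattern = True
except SchemaError:
    rust_accepts_pattern = False
assert rust_accepts_pattern is False

# Selecting python-re globally instead of on the individual schema.
v_global = SchemaValidator(
    cs.str_schema(pattern=pattern),
    config=CoreConfig(regex_engine='python-re'),
)
assert v_global.validate_python('r#""#') == 'r#""#'

# A string with no closing quote does not match the pattern.
try:
    v_global.validate_python('r#"#')
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True
assert mismatch_rejected is True
```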

Coercion and Strict Mode

The coerce_numbers_to_str setting determines whether numeric inputs (such as int or float) are automatically converted to strings during validation. It is off by default, so numbers are rejected by a string schema unless the setting is enabled.

  • Non-Strict Mode: If coerce_numbers_to_str is True, an input like 123 will be validated as "123".
  • Strict Mode: If strict=True is set in either CoreConfig or StringSchema, coercion is disabled regardless of the coerce_numbers_to_str setting.
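
The two modes can be compared side by side. This is a minimal sketch; the helper accepts() is hypothetical and exists only to keep the example short.

```python
from pydantic_core import CoreConfig, SchemaValidator, ValidationError, core_schema as cs


def accepts(validator: SchemaValidator, value: object) -> bool:
    """Hypothetical helper: True if validate_python succeeds for value."""
    try:
        validator.validate_python(value)
        return True
    except ValidationError:
        return False


coercing = CoreConfig(coerce_numbers_to_str=True)

# Non-strict mode with coercion enabled: 123 becomes '123'.
lax = SchemaValidator(cs.str_schema(), config=coercing)
assert lax.validate_python(123) == '123'

# Strict mode on the field disables coercion despite the global setting.
strict = SchemaValidator(cs.str_schema(strict=True), config=coercing)
assert accepts(strict, 123) is False
assert accepts(strict, '123') is True

# Without the setting, numbers are rejected even in non-strict mode.
plain = SchemaValidator(cs.str_schema())
assert accepts(plain, 123) is False
```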

Order of Operations

When multiple transformations and validations are applied to a string, the internal pipeline follows a specific sequence:

  1. Coercion: Numbers are converted to strings (if enabled and not in strict mode).
  2. Whitespace Stripping: strip_whitespace removes leading and trailing whitespace.
  3. Length Validation: min_length and max_length are checked against the stripped string.
  4. Regex Matching: The pattern is checked against the stripped string, before any case change.
  5. Case Transformation: to_lower or to_upper is applied to produce the final value.

This order means that length and pattern constraints see the trimmed input in its original casing. A pattern should therefore be written against the casing that callers supply, not the casing that is ultimately returned.
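
The sketch below probes the sequence by combining all of these constraints in one schema. It assumes the behavior of recent pydantic-core releases, in which whitespace is stripped first, length and pattern are checked next, and case conversion happens last; the schema itself is illustrative and not taken from the test suite.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema as cs

v = SchemaValidator(
    cs.str_schema(
        strip_whitespace=True,
        to_lower=True,
        max_length=3,
        pattern='^[A-Z]+$',
    )
)

# Seven raw characters, but the stripped value 'ABC' satisfies both
# max_length=3 and the upper-case pattern; lower-casing is applied last.
result = v.validate_python('  ABC  ')
assert result == 'abc'

# The pattern is matched before lower-casing, so input that is already
# lower-case is rejected even though the output would be lower-case.
try:
    v.validate_python('  abc  ')
    lowercase_rejected = False
except ValidationError:
    lowercase_rejected = True
assert lowercase_rejected is True
```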