Text and Binary Sequences

Validation of text and binary sequences in pydantic-core is handled primarily through StringSchema and BytesSchema. These schemas provide fine-grained control over length constraints, pattern matching, and data transformations, while offering performance-oriented defaults like the Rust-based regex engine.

String Validation

The StringSchema (created via the str_schema() helper) defines how string data is validated and transformed. It supports both strict type checking and lax coercion from other types.

Length and Transformation Constraints

Basic validation often involves enforcing length limits and normalizing input text.

from pydantic_core import SchemaValidator, core_schema as cs

# Schema with length constraints and normalization
v = SchemaValidator(cs.str_schema(
    min_length=2,
    max_length=10,
    strip_whitespace=True,
    to_lower=True,
))

assert v.validate_python(' HELLO ') == 'hello'
# v.validate_python('a') -> ValidationError: String should have at least 2 characters

Available transformation options include:

  • strip_whitespace: Removes leading and trailing whitespace.
  • to_lower: Converts the string to lowercase.
  • to_upper: Converts the string to uppercase.
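The options above can be combined, and a failing input can be inspected through the raised exception. The following is a minimal sketch; the variable names are illustrative only, and it assumes the error type code for a too-short string is string_too_short.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema as cs

# Strip surrounding whitespace and upper-case the result
v = SchemaValidator(cs.str_schema(min_length=2, strip_whitespace=True, to_upper=True))
normalized = v.validate_python('  hello ')

# A failing input raises ValidationError; errors() returns one dict per problem
try:
    v.validate_python('a')
    error_types = []
except ValidationError as exc:
    error_types = [e['type'] for e in exc.errors()]
```

Reading the type field rather than the message text is usually more robust when code needs to react to a specific failure.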

Pattern Matching and Regex Engines

pydantic-core supports regex validation through the pattern field and lets you choose between two regex engines via the regex_engine parameter:

  1. rust-regex (Default): Uses the Rust regex crate. It is fast and matches in time linear in the input length, which makes it resistant to regular-expression denial of service (ReDoS). However, it does not support look-around or backreferences.
  2. python-re: Uses Python's standard re module. This supports the full range of Python regex features but may be susceptible to catastrophic backtracking if patterns are not carefully authored.

# Using the Python regex engine for backreferences
v = SchemaValidator(
    cs.str_schema(
        pattern=r'r(#*)".*?"\1',
        regex_engine='python-re',
    )
)
assert v.validate_python('r#""#') == 'r#""#'
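To make the difference between the two engines concrete, the sketch below tries to build the same backreference pattern with each engine. It assumes that an unsupported pattern is reported as a SchemaError when the validator is constructed, rather than at validation time; the variable names are illustrative.

```python
from pydantic_core import SchemaError, SchemaValidator, core_schema as cs

backref_pattern = r'r(#*)".*?"\1'

# The default rust-regex engine cannot compile a backreference, so the
# schema is expected to be rejected while the validator is being built
try:
    SchemaValidator(cs.str_schema(pattern=backref_pattern))
    rust_accepted = True
except SchemaError:
    rust_accepted = False

# The same pattern builds and validates under python-re
py_validator = SchemaValidator(
    cs.str_schema(pattern=backref_pattern, regex_engine='python-re')
)
```

Because the failure surfaces at construction, an unsupported pattern is caught before any input is validated.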

Coercion and Strict Mode

By default, string validation is "lax," meaning it may accept non-string types if they can be safely converted.

  • strict: If set to True, only actual str instances are accepted.
  • coerce_numbers_to_str: When True (and strict is False), numeric types (int, float) are converted to strings. Booleans are deliberately excluded from this coercion, so True does not become "True".

v = SchemaValidator(cs.str_schema(coerce_numbers_to_str=True))
assert v.validate_python(123) == '123'

# Booleans are not coerced to strings
# v.validate_python(True) -> ValidationError: Input should be a valid string
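The effect of strict can be seen by passing the same non-str input to a lax and a strict validator. This is a small sketch: error_types is a hypothetical helper written for this example, and it assumes that lax mode decodes bytes input as UTF-8 and that a type mismatch is reported as string_type.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema as cs

lax_v = SchemaValidator(cs.str_schema())
strict_v = SchemaValidator(cs.str_schema(strict=True))


def error_types(validator, value):
    """Return the error type codes raised for value, or [] if it validates."""
    try:
        validator.validate_python(value)
    except ValidationError as exc:
        return [e['type'] for e in exc.errors()]
    return []


# Lax mode converts bytes to str; strict mode rejects anything that is not a str
lax_result = lax_v.validate_python(b'abc')
strict_errors = error_types(strict_v, b'abc')

# Without coerce_numbers_to_str, numbers are rejected even in lax mode
lax_int_errors = error_types(lax_v, 123)
```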

Binary Sequence Validation

The BytesSchema (created via bytes_schema()) handles validation for bytes objects.

Lax vs Strict Validation

In strict mode, only bytes objects are accepted. In lax mode (the default), the validator also accepts:

  • str: Converted to bytes using UTF-8 encoding.
  • bytearray: Converted directly to bytes.

v = SchemaValidator(cs.bytes_schema())
assert v.validate_python('hello') == b'hello'
assert v.validate_python(bytearray(b'foo')) == b'foo'
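For contrast, the strict variant can be sketched as follows. The variable names are illustrative, and the example assumes that a rejected input is reported with the bytes_type error code.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema as cs

strict_bytes = SchemaValidator(cs.bytes_schema(strict=True))

# Real bytes objects pass through unchanged
accepted = strict_bytes.validate_python(b'hello')

# A str is no longer encoded to bytes in strict mode
try:
    strict_bytes.validate_python('hello')
    bytes_error = None
except ValidationError as exc:
    bytes_error = exc.errors()[0]['type']
```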

JSON Integration and Encodings

Since JSON does not have a native binary type, pydantic-core provides configuration to interpret JSON strings as bytes using different encodings. This is controlled by the val_json_bytes setting in CoreConfig.

Supported modes:

  • utf8 (Default): Interprets the JSON string as a standard UTF-8 string.
  • base64: Decodes the JSON string as Base64 data.
  • hex: Decodes the JSON string as a Hexadecimal string.

from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Configure validator to expect base64 in JSON
v = SchemaValidator(
    cs.bytes_schema(),
    config=CoreConfig(val_json_bytes='base64')
)

# 'bm8tcGFkZGluZw' is base64 for 'no-padding'
assert v.validate_json('"bm8tcGFkZGluZw"') == b'no-padding'
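The hex mode works the same way. The sketch below validates one JSON payload under both the hex setting and the default utf8 setting to show how the interpretation changes; the variable names are illustrative.

```python
from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

hex_v = SchemaValidator(cs.bytes_schema(), config=CoreConfig(val_json_bytes='hex'))
utf8_v = SchemaValidator(cs.bytes_schema())

# '68656c6c6f' is the hexadecimal encoding of b'hello'
payload = '"68656c6c6f"'

decoded_as_hex = hex_v.validate_json(payload)    # hex digits are decoded to raw bytes
decoded_as_utf8 = utf8_v.validate_json(payload)  # the text itself is encoded as UTF-8
```

The same JSON string therefore yields different bytes depending on val_json_bytes, so producer and consumer need to agree on the mode.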

Edge Cases and Errors

Unicode Surrogates

pydantic-core enforces valid Unicode. Strings containing unpaired surrogates, which cannot be encoded as UTF-8, result in a string_unicode error.
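A lone surrogate can be written in Python with a \ud800 escape. The sketch below is an illustration under assumptions: it adds a min_length constraint so that the validator has to read the text as UTF-8, and it has not been checked against a schema with no constraints at all.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema as cs

# Assumption: a constraint is present so the validator must inspect the text
v = SchemaValidator(cs.str_schema(min_length=1))

# U+D800 is a high surrogate with no low surrogate following it
lone_surrogate = 'Hello \ud800World'

# Python itself cannot encode such a string as UTF-8
try:
    lone_surrogate.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False

try:
    v.validate_python(lone_surrogate)
    surrogate_error = None
except ValidationError as exc:
    surrogate_error = exc.errors()[0]['type']
```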

Global Configuration

Many string and bytes constraints can be set globally via CoreConfig rather than on individual schemas. For example, setting str_max_length in CoreConfig applies a default maximum length to all string fields within that validator's scope.

v = SchemaValidator(
    cs.str_schema(),
    config=CoreConfig(str_max_length=5)
)
# v.validate_python('too long') -> ValidationError: String should have at most 5 characters
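Transformations can be configured globally in the same way. The following sketch assumes the CoreConfig keys str_strip_whitespace and str_to_upper, which mirror the per-schema strip_whitespace and to_upper options; the variable names are illustrative.

```python
from pydantic_core import CoreConfig, SchemaValidator, core_schema as cs

# Global defaults: strip whitespace, upper-case, and cap the length at 5
config = CoreConfig(str_strip_whitespace=True, str_to_upper=True, str_max_length=5)

# The schema itself carries no options; they all come from the config
v = SchemaValidator(cs.str_schema(), config=config)
result = v.validate_python(' abc ')
```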