String and Binary Data

In pydantic-core, string and binary data validation is handled through the StringSchema and BytesSchema definitions. These schemas provide a high-performance validation layer that supports length constraints, pattern matching, and data transformations.

String Validation

The StringSchema (accessible via core_schema.str_schema()) is used to validate and optionally transform text data. It supports a variety of constraints that are enforced during the validation process.

Length and Pattern Constraints

Basic validation often involves ensuring a string meets specific length requirements or matches a regular expression.

from pydantic_core import SchemaValidator, core_schema

# Define a schema with length and pattern constraints
v = SchemaValidator(core_schema.str_schema(
    min_length=3, 
    max_length=10, 
    pattern=r'^[a-z]+$'
))

assert v.validate_python('abc') == 'abc'
# v.validate_python('ab') -> ValidationError: String should have at least 3 characters
# v.validate_python('ABC') -> ValidationError: String should match pattern '^[a-z]+$'

Regex Engines

pydantic-core allows you to choose between two regex engines for the pattern constraint:

rust-regex (Default): Uses the Rust regex crate. It is extremely fast and guaranteed to run in linear time, but it does not support features like backreferences or look-around.
python-re: Uses Python's standard re module. This is useful if your pattern requires advanced features not supported by the Rust engine.

# Using the Python regex engine for backreferences
v = SchemaValidator(core_schema.str_schema(
    pattern=r'r(#*)".*?"\1', 
    regex_engine='python-re'
))
assert v.validate_python('r#""#') == 'r#""#'
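
As a rough illustration of the trade-off, the sketch below builds the same backreference pattern with each engine. It assumes the pattern is compiled when the validator is constructed, so the default rust-regex engine is expected to reject it with a SchemaError at that point. The exact error text can differ between versions and is not shown.

```python
from pydantic_core import SchemaError, SchemaValidator, core_schema

# A pattern with a backreference (\1), which the Rust regex crate does not support
pattern = r'r(#*)".*?"\1'

# Assumption: unsupported syntax surfaces as a SchemaError at construction time
rust_engine_rejected = False
try:
    SchemaValidator(core_schema.str_schema(pattern=pattern, regex_engine='rust-regex'))
except SchemaError:
    rust_engine_rejected = True

# The same pattern builds and validates with Python's re module
v_re = SchemaValidator(core_schema.str_schema(pattern=pattern, regex_engine='python-re'))
matched = v_re.validate_python('r##"text"##')
```

If a pattern can be written without backreferences or look-around, staying on the default engine keeps the linear-time guarantee.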

Transformations

Strings can be transformed during validation using strip_whitespace, to_lower, and to_upper. These transformations run after the initial type check, and whitespace stripping in particular happens before the constraints are evaluated.

v = SchemaValidator(core_schema.str_schema(
    strip_whitespace=True, 
    to_upper=True
))

assert v.validate_python('  hello  ') == 'HELLO'

In pydantic-core, if strip_whitespace is enabled, the length constraints are checked against the stripped version of the string.
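
A minimal sketch of that interaction, comparing two validators that share the same max_length but differ in strip_whitespace. The variable names are illustrative only.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

# Same max_length, with and without whitespace stripping
v_strip = SchemaValidator(core_schema.str_schema(strip_whitespace=True, max_length=5))
v_plain = SchemaValidator(core_schema.str_schema(max_length=5))

# '  hello  ' has 9 characters, but only 5 once stripped
stripped = v_strip.validate_python('  hello  ')

# Without stripping, the surrounding spaces count toward the limit
try:
    v_plain.validate_python('  hello  ')
    plain_error_type = None
except ValidationError as exc:
    plain_error_type = exc.errors()[0]['type']
```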

Coercion and Strictness

By default, StringSchema operates in a "lax" mode where it can accept bytes or bytearray and decode them as UTF-8. You can also enable coerce_numbers_to_str to allow numeric types to be converted to strings.

# Lax mode (default)
v = SchemaValidator(core_schema.str_schema())
assert v.validate_python(b'foobar') == 'foobar'

# Coercing numbers
v_coerce = SchemaValidator(
    core_schema.str_schema(), 
    config=core_schema.CoreConfig(coerce_numbers_to_str=True)
)
assert v_coerce.validate_python(42) == '42'

# Strict mode
v_strict = SchemaValidator(core_schema.str_schema(strict=True))
# v_strict.validate_python(b'foobar') -> ValidationError: Input should be a valid string

Binary Data Validation

The BytesSchema (accessible via core_schema.bytes_schema()) handles raw byte sequences. Like strings, it supports min_length and max_length constraints, but these refer to the number of bytes rather than the number of characters.
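
The difference between counting characters and counting bytes shows up with non-ASCII text. The sketch below uses a single accented character, which is one character long but two bytes long in UTF-8. The variable names are illustrative only.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

text = '\u00e9'              # one character (e with acute accent)
raw = text.encode('utf-8')   # b'\xc3\xa9', two bytes

str_v = SchemaValidator(core_schema.str_schema(max_length=1))
bytes_v = SchemaValidator(core_schema.bytes_schema(max_length=1))

# One character fits within the string limit
str_result = str_v.validate_python(text)

# Two bytes exceed the bytes limit
try:
    bytes_v.validate_python(raw)
    bytes_error_type = None
except ValidationError as exc:
    bytes_error_type = exc.errors()[0]['type']
```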

Input Types in Lax Mode

In lax mode, BytesSchema is flexible about its input:

bytes: Accepted as-is.
bytearray: Converted to bytes.
str: Encoded to UTF-8 bytes.

v = SchemaValidator(core_schema.bytes_schema())

assert v.validate_python(b'foo') == b'foo'
assert v.validate_python(bytearray(b'foo')) == b'foo'
assert v.validate_python('foo') == b'foo'  # Encoded to UTF-8

In strict mode (strict=True), only bytes objects are accepted.
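
A short sketch of strict mode, assuming Python input. It feeds the validator the two input types that lax mode would convert and records the error type reported for each.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

v_strict = SchemaValidator(core_schema.bytes_schema(strict=True))

# bytes still pass through unchanged
accepted = v_strict.validate_python(b'foo')

# str and bytearray are accepted in lax mode but rejected here
rejected = []
for value in ('foo', bytearray(b'foo')):
    try:
        v_strict.validate_python(value)
    except ValidationError as exc:
        rejected.append((type(value).__name__, exc.errors()[0]['type']))
```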

JSON Integration

Handling binary data in JSON is a common challenge since JSON does not have a native bytes type. pydantic-core provides configuration options to handle this via CoreConfig.

Validating Bytes from JSON

You can configure how strings in JSON are interpreted as bytes using the val_json_bytes setting:

'utf8' (Default): Encodes the JSON string as UTF-8 to produce the bytes.
'base64': Decodes the string from Base64.
'hex': Decodes the string from a hex representation.

# Configuring bytes validation from JSON
v = SchemaValidator(
    core_schema.bytes_schema(),
    config=core_schema.CoreConfig(val_json_bytes='base64')
)

# Validates base64 encoded string into raw bytes
assert v.validate_json('"bm8tcGFkZGluZw"') == b'no-padding'
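
For comparison, here is a sketch of the other two modes, assuming a pydantic-core version that supports the 'hex' option. Both validators receive a JSON string and return the same three bytes.

```python
from pydantic_core import SchemaValidator, core_schema

# Default mode: the JSON string is encoded as UTF-8
v_utf8 = SchemaValidator(core_schema.bytes_schema())
from_utf8 = v_utf8.validate_json('"foo"')

# Hex mode: each pair of hex digits becomes one byte
v_hex = SchemaValidator(
    core_schema.bytes_schema(),
    config=core_schema.CoreConfig(val_json_bytes='hex'),
)
from_hex = v_hex.validate_json('"666f6f"')
```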

Serialization

Similarly, the ser_json_bytes configuration determines how bytes objects are serialized back to JSON, supporting the same 'utf8', 'base64', and 'hex' options.
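
A minimal sketch of the three serialization modes using SchemaSerializer. The helper to_json_with is a hypothetical name introduced for this example, not part of pydantic-core.

```python
from pydantic_core import SchemaSerializer, core_schema

def to_json_with(mode):
    """Serialize b'foobar' to JSON with the given ser_json_bytes mode (illustrative helper)."""
    serializer = SchemaSerializer(
        core_schema.bytes_schema(),
        config=core_schema.CoreConfig(ser_json_bytes=mode),
    )
    return serializer.to_json(b'foobar')

as_utf8 = to_json_with('utf8')      # b'"foobar"'
as_base64 = to_json_with('base64')  # b'"Zm9vYmFy"'
as_hex = to_json_with('hex')        # b'"666f6f626172"'
```

Matching ser_json_bytes with val_json_bytes on the validating side keeps a JSON round trip symmetric.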

Error Handling

Validation failures for strings and bytes produce specific error types:

string_too_short / bytes_too_short: Triggered when min_length is not met.
string_too_long / bytes_too_long: Triggered when max_length is exceeded.
string_pattern_mismatch: Triggered when a pattern does not match.
string_unicode: Triggered when the input cannot be read as a valid Unicode string (for example, when it contains unpaired surrogates).

These errors include context such as the limit that was violated (e.g., ctx={'min_length': 2}).
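
The sketch below shows one way to inspect these details by catching ValidationError and reading the list returned by errors(). include_url=False is passed only to keep the output compact.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

v = SchemaValidator(core_schema.str_schema(min_length=2))

try:
    v.validate_python('a')
    errors = []
except ValidationError as exc:
    # Each entry is a dict with keys such as 'type', 'msg', 'input' and 'ctx'
    errors = exc.errors(include_url=False)

first = errors[0]
error_type = first['type']   # 'string_too_short'
error_ctx = first['ctx']     # {'min_length': 2}
```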
