String and Binary Data
In pydantic-core, string and binary data validation is handled through the StringSchema and BytesSchema definitions. These schemas provide a high-performance validation layer that supports length constraints, pattern matching, and data transformations.
String Validation
The StringSchema (accessible via core_schema.str_schema()) is used to validate and optionally transform text data. It supports a variety of constraints that are enforced during the validation process.
Length and Pattern Constraints
Basic validation often involves ensuring a string meets specific length requirements or matches a regular expression.
from pydantic_core import SchemaValidator, core_schema
# Define a schema with length and pattern constraints
v = SchemaValidator(core_schema.str_schema(
    min_length=3,
    max_length=10,
    pattern=r'^[a-z]+$'
))
assert v.validate_python('abc') == 'abc'
# v.validate_python('ab') -> ValidationError: String should have at least 3 characters
# v.validate_python('ABC') -> ValidationError: String should match pattern '^[a-z]+$'
Regex Engines
pydantic-core allows you to choose between two regex engines for the pattern constraint:
rust-regex (default): Uses the Rust regex crate. It is extremely fast and guaranteed to run in linear time, but it does not support features like backreferences or look-around.
python-re: Uses Python's standard re module. This is useful if your pattern requires advanced features not supported by the Rust engine.
# Using the Python regex engine for backreferences
v = SchemaValidator(core_schema.str_schema(
    pattern=r'r(#*)".*?"\1',
    regex_engine='python-re'
))
assert v.validate_python('r#""#') == 'r#""#'
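The difference between the two engines can be sketched by building the same backreference pattern with each one. This is a minimal illustration that assumes the pattern is compiled when the validator is constructed, so an unsupported pattern surfaces as a SchemaError at build time; build_error is a hypothetical helper, not part of pydantic-core.

```python
from pydantic_core import SchemaError, SchemaValidator, core_schema

# Same backreference pattern as in the example above.
BACKREF_PATTERN = r'r(#*)".*?"\1'


def build_error(engine):
    """Hypothetical helper: return the SchemaError raised while building, or None."""
    try:
        SchemaValidator(
            core_schema.str_schema(pattern=BACKREF_PATTERN, regex_engine=engine)
        )
    except SchemaError as exc:
        return exc
    return None


# The Rust engine is expected to reject the backreference when the schema is built.
rust_error = build_error('rust-regex')
# The Python engine is expected to accept the same pattern.
python_error = build_error('python-re')

print('rust-regex rejected pattern:', rust_error is not None)
print('python-re rejected pattern:', python_error is not None)
```

Failing at construction time means an unsupported pattern is caught before any input is validated.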
Transformations
Strings can be transformed during validation using strip_whitespace, to_lower, and to_upper. These transformations are applied after the initial type check, so the validator returns the transformed value rather than the original input.
v = SchemaValidator(core_schema.str_schema(
    strip_whitespace=True,
    to_upper=True
))
assert v.validate_python(' hello ') == 'HELLO'
In pydantic-core, if strip_whitespace is enabled, the length constraints are checked against the stripped version of the string.
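A short sketch of that interaction, with illustrative length limits: the first input is longer than max_length before stripping but passes, while the second is long enough before stripping but fails.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

v = SchemaValidator(core_schema.str_schema(
    strip_whitespace=True,
    min_length=3,
    max_length=5,
))

# 11 characters before stripping, 5 after: passes the max_length check.
padded_ok = v.validate_python('   hello   ')

# 6 characters before stripping, 2 after: fails the min_length check.
try:
    v.validate_python('  ab  ')
    short_error_type = None
except ValidationError as exc:
    short_error_type = exc.errors()[0]['type']

print(repr(padded_ok), short_error_type)
```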
Coercion and Strictness
By default, StringSchema operates in a "lax" mode where it can accept bytes or bytearray and decode them as UTF-8. You can also enable coerce_numbers_to_str to allow numeric types to be converted to strings.
# Lax mode (default)
v = SchemaValidator(core_schema.str_schema())
assert v.validate_python(b'foobar') == 'foobar'
# Coercing numbers
v_coerce = SchemaValidator(
    core_schema.str_schema(),
    config=core_schema.CoreConfig(coerce_numbers_to_str=True)
)
assert v_coerce.validate_python(42) == '42'
# Strict mode
v_strict = SchemaValidator(core_schema.str_schema(strict=True))
# v_strict.validate_python(b'foobar') -> ValidationError: Input should be a valid string
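Strictness does not have to be fixed in the schema. As a sketch, the same lax validator can be asked to validate strictly for a single call through the strict keyword argument of validate_python. The error_type helper is hypothetical and exists only to keep the example short.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

v = SchemaValidator(core_schema.str_schema())


def error_type(validator, value, **kwargs):
    """Hypothetical helper: return the first error type, or None on success."""
    try:
        validator.validate_python(value, **kwargs)
    except ValidationError as exc:
        return exc.errors()[0]['type']
    return None


# Lax by default: bytes are decoded as UTF-8.
lax_result = v.validate_python(b'foobar')

# Strict for this call only: bytes are rejected.
strict_error = error_type(v, b'foobar', strict=True)

print(repr(lax_result), strict_error)
```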
Binary Data Validation
The BytesSchema (accessible via core_schema.bytes_schema()) handles raw byte sequences. Like strings, it supports min_length and max_length constraints, but these refer to the number of bytes rather than the number of characters.
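The byte-versus-character distinction matters for non-ASCII text. The sketch below applies the same max_length to a three-character string and to its six-byte UTF-8 encoding; the sample text and the limit are illustrative.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

text = '\u00e9\u00e9\u00e9'   # three characters
raw = text.encode('utf-8')    # six bytes, two per character

str_v = SchemaValidator(core_schema.str_schema(max_length=3))
bytes_v = SchemaValidator(core_schema.bytes_schema(max_length=3))

# Three characters fit within max_length=3 for a string schema.
str_result = str_v.validate_python(text)

# Six bytes exceed max_length=3 for a bytes schema.
try:
    bytes_v.validate_python(raw)
    bytes_error_type = None
except ValidationError as exc:
    bytes_error_type = exc.errors()[0]['type']

print(len(text), len(raw), bytes_error_type)
```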
Input Types in Lax Mode
In lax mode, BytesSchema is flexible about its input:
bytes: Accepted as-is.
bytearray: Converted to bytes.
str: Encoded to UTF-8 bytes.
v = SchemaValidator(core_schema.bytes_schema())
assert v.validate_python(b'foo') == b'foo'
assert v.validate_python(bytearray(b'foo')) == b'foo'
assert v.validate_python('foo') == b'foo' # Encoded to UTF-8
In strict mode (strict=True), only bytes objects are accepted.
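A minimal sketch of the strict behaviour, showing that bytes still pass while a str input is rejected:

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema

v_strict = SchemaValidator(core_schema.bytes_schema(strict=True))

# bytes objects are still accepted unchanged.
strict_ok = v_strict.validate_python(b'foo')

# A str is no longer encoded to bytes; validation fails instead.
try:
    v_strict.validate_python('foo')
    strict_bytes_error = None
except ValidationError as exc:
    strict_bytes_error = exc.errors()[0]['type']

print(strict_ok, strict_bytes_error)
```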
JSON Integration
Handling binary data in JSON is a common challenge since JSON does not have a native bytes type. pydantic-core provides configuration options to handle this via CoreConfig.
Validating Bytes from JSON
You can configure how strings in JSON are interpreted as bytes using the val_json_bytes setting:
'utf8' (default): Interprets the string as UTF-8 encoded bytes.
'base64': Decodes the string from Base64.
'hex': Decodes the string from a hex representation.
# Configuring bytes validation from JSON
v = SchemaValidator(
    core_schema.bytes_schema(),
    config=core_schema.CoreConfig(val_json_bytes='base64')
)
# Validates base64 encoded string into raw bytes
assert v.validate_json('"bm8tcGFkZGluZw"') == b'no-padding'
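For comparison, here is a sketch of the 'hex' setting next to the default. It assumes a pydantic-core release recent enough to support 'hex'; the sample payloads are illustrative.

```python
from pydantic_core import SchemaValidator, core_schema

# Default ('utf8'): the JSON string is taken as UTF-8 text and encoded to bytes.
v_utf8 = SchemaValidator(core_schema.bytes_schema())
utf8_result = v_utf8.validate_json('"hello"')

# 'hex': the JSON string is decoded from its hexadecimal representation.
v_hex = SchemaValidator(
    core_schema.bytes_schema(),
    config=core_schema.CoreConfig(val_json_bytes='hex'),
)
hex_result = v_hex.validate_json('"68656c6c6f"')

print(utf8_result, hex_result)
```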
Serialization
Similarly, the ser_json_bytes configuration determines how bytes objects are serialized back to JSON, supporting the same 'utf8', 'base64', and 'hex' options.
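The serialization side can be sketched with SchemaSerializer, passing ser_json_bytes through the config. The payload is chosen so that its Base64 form needs no padding.

```python
from pydantic_core import SchemaSerializer, core_schema

payload = b'foobar'


def to_json_with(mode):
    """Serialize the payload to JSON using the given ser_json_bytes mode."""
    serializer = SchemaSerializer(
        core_schema.bytes_schema(),
        config=core_schema.CoreConfig(ser_json_bytes=mode),
    )
    return serializer.to_json(payload)


utf8_json = to_json_with('utf8')
base64_json = to_json_with('base64')
hex_json = to_json_with('hex')

print(utf8_json, base64_json, hex_json)
```

Using matching val_json_bytes and ser_json_bytes settings lets binary values survive a JSON round trip.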
Error Handling
Validation failures for strings and bytes produce specific error types:
string_too_short / bytes_too_short: Triggered when min_length is not met.
string_too_long / bytes_too_long: Triggered when max_length is exceeded.
string_pattern_mismatch: Triggered when a pattern does not match.
string_unicode: Triggered when a string contains invalid unicode (like unpaired surrogates) that cannot be processed.
These errors include context such as the limit that was violated (e.g., ctx={'min_length': 2}).
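These error types and their context can be read from ValidationError.errors(), as in the sketch below. The first_error helper is hypothetical and only keeps the example compact.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema


def first_error(schema, value):
    """Hypothetical helper: return the first error dict, or None on success."""
    try:
        SchemaValidator(schema).validate_python(value)
    except ValidationError as exc:
        return exc.errors()[0]
    return None


too_short = first_error(core_schema.str_schema(min_length=2), 'a')
too_long = first_error(core_schema.bytes_schema(max_length=2), b'abc')
mismatch = first_error(core_schema.str_schema(pattern=r'^[a-z]+$'), 'ABC')

for err in (too_short, too_long, mismatch):
    # Each entry carries a machine-readable type, a message, and a ctx dict.
    print(err['type'], err.get('ctx'))
```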