Multimodal data¶
One of the most important skills when using Draive is understanding how to handle multimodal data and the model context.
Any data can be converted into the Multimodal alias: text, audio, image, or an artifact (any DataModel). This multimodal representation is the input and output type of most Draive functions.
Multimodal data makes up the model context (user inputs and model outputs), so understanding it is crucial.
Multimodal¶
The Multimodal alias is a union of MultimodalContent, MultimodalTag, and other content types. In many places, input parameters are typed as Multimodal so that any accepted value can be transformed automatically.
Note
Constructors and helpers such as MultimodalContent.of(*elements: "Multimodal") use the
Multimodal alias to normalize any mix of multimodal parts into one consistent MultimodalContent.
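For example, plain strings and existing MultimodalContent values can be mixed in one call and still come out as a single flat container (a minimal sketch; the top-level import path is an assumption):

```python
from draive import MultimodalContent  # import path assumed

greeting = MultimodalContent.of("Hello")
# Mixing a ready MultimodalContent with a plain string still yields
# one flat MultimodalContent with merged text parts.
combined = MultimodalContent.of(greeting, " world!")
```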
Multimodal Content¶
MultimodalContent is the central container bundling the different MultimodalContentPart blocks.
Each concrete part carries a specific payload: plain text (TextContent), references or embedded resources
(ResourceReference, ResourceContent), artifacts (ArtifactContent), and tags (MultimodalTag).
As a result, any span of content can be examined as a set of selected parts or as an entire tag tree.
MultimodalContent serves as the input or output in multiple places in the Draive framework. It
can be used for filtering, splitting, replacing, and other operations.
A few examples of MultimodalContent usage in Draive:
- The @tool decorator decorates functions that return MultimodalContent (see Basic tools use)
- The ModelInput and ModelOutput classes use MultimodalContent
- TextGeneration.generate(...) accepts MultimodalContent as input (see Basic usage; a sketch follows this list)
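As a hedged sketch of that last point (it assumes an enclosing async context with a configured model scope, as set up in Basic usage; the instruction/input keyword names are assumptions):

```python
from draive import MultimodalContent, TextGeneration

# Sketch only: assumes an active model scope (see Basic usage) and that
# generate() accepts these keyword arguments; run inside an async function.
result = await TextGeneration.generate(
    instruction="Summarize the provided content.",
    input=MultimodalContent.of("A long document..."),
)
```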
MultimodalContent is an easy-to-use and intelligent class that does a lot for you under the
hood:
- It avoids extra nesting:

```python
inner_multimodal = MultimodalContent.of("Hello world!")
print(inner_multimodal)
# {'type': 'content', 'parts': [{'text': 'Hello world!', 'meta': {}}]}

outer_multimodal = MultimodalContent.of(inner_multimodal)
print(outer_multimodal)
# {'type': 'content', 'parts': [{'text': 'Hello world!', 'meta': {}}]}
# (Same as the first one despite the nesting)
```
- It merges multiple parts if the types match:

```python
class User(DataModel):
    first_name: str
    last_name: str

content = MultimodalContent.of(
    MultimodalTag.of(
        MultimodalTag.of(
            "Hello",
            name="inner",
        ),
        ArtifactContent.of(
            User(
                first_name="James",
                last_name="Smith",
            )
        ),
        name="outer",
    )
)
print(content)
# {
#     'type': 'content',
#     'parts': [
#         {
#             'text': '<outer><inner>Hello</inner>',
#             'meta': {}
#         },
#         {
#             'category': 'User',
#             'artifact': {
#                 'first_name': 'James',
#                 'last_name': 'Smith'
#             },
#             'hidden': False,
#             'meta': {}
#         },
#         {
#             'text': '</outer>',
#             'meta': {}
#         }
#     ]
# }
```

Note

MultimodalTag produces text parts, so they merge with TextContent.
- It comes with a set of helper functions to speed up your work. Examples:

```python
print(content.texts())
# (
#     {'text': '<outer><inner>Hello</inner>', 'meta': {}},
#     {'text': '</outer>', 'meta': {}}
# )

print(content.tags())
# (
#     {'name': 'outer', 'content': {...}, 'meta': {}},
#     {'name': 'inner', 'content': {...}, 'meta': {}}
# )

print(content.artifacts())
# (
#     {
#         'category': 'User',
#         'artifact': {
#             'first_name': 'James',
#             'last_name': 'Smith'
#         },
#         'hidden': False,
#         'meta': {}
#     },
# )
```

Tip

MultimodalContent has more ready-to-use methods for filtering, such as matching_meta(), split_by_meta(), without_resources() or audio(). This is another argument for using MultimodalContent rather than other data models.
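For instance, the no-argument helpers from the tip can be called directly on the content built above (a sketch; what each returns is assumed from its name):

```python
# Assumptions: without_resources() drops resource parts and audio() keeps
# only audio resources; only the method names come from the tip above.
text_only = content.without_resources()
audio_parts = content.audio()
```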
Model Input¶
ModelInput groups the user-provided blocks (ModelInputBlock). Each block is backed by
MultimodalContent, so it preserves every part type described above. The input stream may also
embed tool responses (ModelToolResponse), which return their payload as MultimodalContent,
letting you treat tool output the same way as regular text-and-media blocks.
Note
ModelToolResponse has a content attribute of type MultimodalContent. This is why functions decorated with @tool must return MultimodalContent.
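A minimal sketch of such a tool (assuming @tool is importable from the top-level package and that tools are asynchronous functions):

```python
from draive import MultimodalContent, tool  # import path assumed

@tool  # hypothetical example tool
async def current_weather(city: str) -> MultimodalContent:
    # The returned content becomes the ModelToolResponse payload.
    return MultimodalContent.of(f"The weather in {city} is sunny.")
```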
Model Output¶
ModelOutput mirrors the same structure on the response side. Output blocks (ModelOutputBlock)
expose MultimodalContent, and both the model's reasoning trail (ModelReasoning) and its tool
requests (ModelToolRequest) rely on the same container. This means the full generation flow - from
visible content to internal thinking and tool invocations - can be analysed with one coherent set of
helpers.
Tip
There are ready-to-use methods like without_tools() to get the model output without blocks related to tool requests and responses, or reasoning() to get the model's reasoning blocks. These can make implementing your own features easier.
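As a rough sketch (only the method names come from the tip; the variable and return shapes are assumptions):

```python
# `output` is a ModelOutput obtained from a generation call (assumed).
user_visible = output.without_tools()  # no tool request/response blocks
thinking = output.reasoning()          # only the reasoning blocks
```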