Part 8/15:
GPT-4's modalities remain speculative, but some clues point toward multimodal capabilities:
OpenAI has developed projects like DALL·E (images) and Whisper (audio transcription).
These suggest a growing interest in multimodal models that integrate text, images, and speech.
However:
Currently, GPT models are primarily text-based.
Full multimodal integration, where models seamlessly process and generate across different media, might require architecture redesigns.
Some research hints at models that treat all data as raw bits and bytes, bypassing tokenization altogether—potentially allowing for unified handling of multiple modalities.
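The byte-level idea above can be sketched concretely. This is my own minimal illustration (not from the article, and not any specific model's implementation): if every input, whether text or binary media, is treated as a sequence of raw bytes, the model sees a single fixed vocabulary of 256 symbols and no tokenizer is needed.

```python
def to_byte_sequence(data) -> list[int]:
    """Convert text or raw binary data into a list of byte values (0-255)."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # text becomes bytes, like any other modality
    return list(data)

text_ids = to_byte_sequence("Hi!")        # text input
image_ids = to_byte_sequence(b"\x89PNG")  # first bytes of a PNG file

# Both modalities now share one integer vocabulary of size 256.
assert all(0 <= b < 256 for b in text_ids + image_ids)
print(text_ids)  # [72, 105, 33]
```

The trade-off is sequence length: byte sequences are several times longer than subword-token sequences for the same text, which is one reason tokenization remains the default despite the appeal of a unified representation.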