OpenAI’s New Model GPT-4o Is Here With Improved Voice, Vision & More

OpenAI’s Spring Update - OpenAI launched GPT-4o  (“o” for “omni”) for ChatGPT, a smarter AI buddy. It's faster and better understands text, voice, and images, ideal for teams and businesses. Now, it gets emotions and speaks more languages, making chats more fun and helpful! 


On May 13, OpenAI unveiled its GPT-4o, an AI model for ChatGPT that offers GPT-4-level intelligence. GPT-4o responds to audio inputs in as little as 232 milliseconds, averaging 320 milliseconds, mirroring human conversational response times. It matches GPT-4 Turbo's performance in English text and coding, with notable improvements in non-English languages. Additionally, it is significantly faster and 50% cheaper via API. GPT-4o excels in vision and audio understanding, outperforming existing models in these areas. 


GPT-4o represents a significant advancement over its predecessors, GPT-4 and GPT-3.5. According to Mira Murati


“GPT-4o provides GPT-4 level intelligence but is much faster”


“We think GPT-4o is shifting that paradigm into the future of collaboration, where this interaction becomes much more natural and far easier”



“GPT-4o is twice as fast as, half the price of and has higher rate limits than GPT-4 Turbo,” the company says.


1. GPT-4o ability to understand Images & Videos

GPT-4o excels in understanding images and videos, offering advanced capabilities in visual perception and audio recognition. It can analyze and interpret visual content with a higher accuracy compared to previous models, making it ideal for applications in multimedia analysis, content creation, and real-time video processing. By leveraging its enhanced vision and audio understanding, GPT-4o provides a comprehensive and sophisticated approach to multimodal AI interactions, significantly improving user experience and expanding the potential use cases for AI technology.



2. GPT-4o real-time translation into 50+ Languages

GPT-4o offers real-time translation capabilities for over 50 languages. This advanced feature allows for seamless communication across diverse linguistic barriers, providing instant and accurate translations. By integrating this functionality, GPT-4o enhances global connectivity and supports multilingual interactions in various applications, such as customer service, content creation, and international collaboration. Its ability to quickly and accurately translate spoken or written content in multiple languages significantly improves user experience and accessibility.



3. GPT-4o makes coding Easier & Much Faster

GPT-4o revolutionizes the coding experience by making it easier and significantly faster. It excels in understanding and generating code, enhancing productivity and reducing development time. Whether it's writing, debugging, or optimizing code, GPT-4o's advanced capabilities streamline the coding process, allowing developers to focus on more complex tasks. Its efficiency and accuracy in handling code-related tasks make it an invaluable tool for programmers and software engineers.



4. GPT-4o real-time Voice Variation

GPT-4o introduces real-time voice variation, enhancing the naturalness and expressiveness of AI-generated speech. This feature allows the model to modulate its tone, pitch, and inflection, adapting to various contexts and emotions. Whether for customer service, interactive applications, or multimedia content, GPT-4o's ability to dynamically vary its voice provides a more engaging and human-like experience, making interactions with AI more relatable and effective.



Improved Capabilities of GPT-4o 



  • Multimodal Capabilities: While GPT-4 focused primarily on images and text-based tasks, GPT-4o introduces multimodal capabilities. This means besides text and speech, it can also handle audio and video inputs and outputs. This allows for more natural and versatile interactions with the AI model.


  • Unified Processing: Unlike the previous model, which used separate models for transcribing audio to text and generating text-based responses, GPT-4o employs a single end-to-end model for processing all modalities. This unified approach ensures more seamless communication and reduces information loss.


  • Enhanced Voice Mode: GPT-4o significantly improves the Voice Mode experience in ChatGPT by reducing latency and enabling real-time responsiveness just like AI Assistant. Users can now interact with ChatGPT more fluidly, interrupting the AI while it's responding and receiving nuanced vocal outputs, including different emotive styles and even singing.


  • Upgraded Vision Capabilities: With GPT-4o, ChatGPT's vision capabilities receive a substantial boost. The model can quickly analyze images and desktop screens to provide relevant answers to user queries, such as identifying objects or interpreting text within images.


  • Improved Multilingual Support: GPT-4o boasts enhanced performance across approximately 50 languages, making it more accessible and effective for users worldwide. This improvement extends to both text-based interactions and vision tasks involving multilingual content.


  • Desktop App:  With GPT-4o's latest update, ChatGPT now extends its reach beyond just web browsers. The new Desktop App brings the power of AI directly to your desktop, making it more convenient than ever to access intelligent assistance.


  • Safety Measures: GPT-4o incorporates built-in safety mechanisms designed to mitigate risks across all modalities. These measures include filtering training data, refining model behavior post-training, and implementing guardrails on voice outputs to ensure responsible use of the technology.


  • Enhanced UI: With ChatGPT's revamped UI, conversations become seamless. The new interface enhances user interaction, offering a more intuitive and conversational experience. Users can navigate effortlessly, accessing features like real-time information search and advanced data analysis. The UI's user-friendly design ensures smoother communication and engagement, making interactions with ChatGPT more natural and efficient.


  • Video & Screenshots: GPT-4o transforms ChatGPT with multimedia support. Users can now share videos for instant help or upload screenshots, photos, and documents for richer discussions. This update democratizes advanced AI features that were previously limited to premium users and makes human-AI interactions more inclusive and collaborative.


Model Evaluations

GPT-4o undergoes rigorous evaluations to ensure top-tier performance. It matches GPT-4 Turbo's capabilities in English text and coding tasks while significantly improving in non-English languages. This model excels in real-time translation and voice variation, making interactions more natural and human-like. Additionally, GPT-4o demonstrates superior understanding and interpretation of images and videos, outperforming previous models in these areas.


Text Evaluation Metrics

GPT-4o showcase superior performance across several text evaluation metrics compared to other models. It excels in MMLU, GPoQA, MATH, HumanEval, MGSM, and DROP, demonstrating particularly high accuracy in HumanEval (90.2%) and MMLU (88.7%). These results underscore GPT-4o's advanced capabilities in text understanding and generation, making it a leading model in the AI landscape.



Audio ASR Metrics

GPT-4o demonstrates lower latency in most cases, particularly in audio processing and real-time applications. The throughput for GPT-4 Turbo is higher in scenarios involving large data sets and complex computations. This indicates that GPT-4o is optimized for faster, real-time interactions, while GPT-4 Turbo is better suited for high-volume, intensive processing tasks.



Audio Translation Performance

GPT-4o shows strong performance in Arithmetic and Causal Inference tasks, outperforming Claude 3 and other models in these areas. In the Commonsense and Spatial tasks, GPT-4o also performs competitively, highlighting its balanced capabilities across different types of reasoning tasks. This demonstrates GPT-4o's robust reasoning abilities and its suitability for applications requiring high-level cognitive functions.



M3 Exam Zeroshot Results

GPT-4o consistently scores higher, especially in languages like French, German, and Spanish, demonstrating its superior capability in non-English language understanding and generation. This highlights GPT-4o's robustness in handling diverse linguistic tasks, making it a valuable tool for global applications requiring multilingual support.