As AI technology continues to evolve rapidly, OpenAI, a leading company in the field, has accumulated patents across a wide range of areas, including natural language processing, multimodal processing, code generation and editing, image generation, and speech recognition. In this column, we analyze OpenAI's major registered patents and look in depth at the problems each one addresses and the solutions it claims, covering the key details of each patent, its market applicability, and its technical and legal implications.
1. Overview of OpenAI's AI Patents
The major AI patents OpenAI has filed and registered are listed below. They cover technologies such as large language models (LLMs), multimodal AI, automatic coding, speech recognition, image processing, and external API integration. As of February 3, 2025, OpenAI holds a total of 14 US patents.
| Title | Application Number | Registration Number | IPC |
| --- | --- | --- | --- |
| Adaptive UI for Rich Output Rendering of Assistant Messages | 18-606435 | 12164548 | G06F-040/30 |
| Systems and Methods for Interacting with a Large Language Model | 18-475722 | 12051205 | G06T-007/10 |
| Systems and Methods for Interacting with a Multimodal Machine Learning Model | 18-475588 | 12039431 | G06N-003/0455 |
| Schema-based Integration of External APIs with Natural Language Applications | 18-474063 | 12124823 | G06F-008/35 |
| Systems and Methods for Image Generation with Machine Learning Models | 18-458907 | 11983806 | G06K-009/36 |
| Systems and Methods for Generating Code Using Language Models Trained on Computer Code | 18-321852 | 12061880 | G06F-008/30 |
| Systems and Methods for Generating Natural Language Using Language Models Trained on Computer Code | 18-321921 | 12008341 | G06F-008/30 |
| Using Machine Learning to Train and Use a Model to Perform Automatic Interface Actions Based on Video and Input Datasets | 18-303552 | 11887367 | G06V-020/40 |
| Multi-Task Automatic Speech Recognition System | 18-302289 | 12079587 | G06F-040/58 |
| Systems and Methods for Hierarchical Text-Conditional Image Generation | 18-193427 | 11922550 | G06T-011/60 |
| Schema-based Integration of External APIs with Natural Language Applications | 18-186712 | 11922144 | G06F-008/35 |
| Systems and Methods for Language Model-based Text Editing | 18-183902 | 11983488 | G06F-040/166 |
| Systems and Methods for Language Model-based Text Insertion | 18-183898 | 11886826 | G06F-017/00 |
| Systems and Methods for Using Contrastive Pre-training to Generate Text and Code Embeddings | 18-158166 | 12073299 | G06N-020/00 |
2. Key Content and Representative Claims of Each Major Patent
The following patents cover core AI technologies directly related to OpenAI's representative services, such as ChatGPT, DALL·E, Codex, and Whisper. ChatGPT and DALL·E, which already enjoy strong market recognition, operate on top of these patented technologies and offer features that differentiate them from competitors. Let's take a detailed look at what problem each patent solves and what technical characteristics it has.
(1) Adaptive UI for Rich Output Rendering of Assistant Messages
Purpose
Build a UI system that can freely present visual and structured information beyond a plain text chat environment
Conversational models such as OpenAI's ChatGPT can deliver information to users more intuitively by presenting rich graphs, tables, and images
Core solutions
Generative language model (LLM) + front-end rendering architecture
The LLM structures answers or data-analysis results using "primitives"
The front end (UI) automatically converts them into visual elements (graphs, dashboards, etc.)
This lets the ChatGPT interface show visual answers such as charts and tables directly, rather than plain text only (a minimal sketch of the idea follows below)
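As a rough illustration of this primitive-based flow, the sketch below assumes a hypothetical PRIMITIVES registry and a toy render function; none of these names come from the patent, and a real front end would render far richer components.

```python
import json

# Hypothetical collection of rendering primitives the model may invoke.
PRIMITIVES = {
    "table": {"description": "rows of labelled values rendered as an HTML table"},
    "bar_chart": {"description": "category/value pairs rendered as a bar chart"},
}

def render(primitive: str, payload: dict) -> str:
    """Toy front-end renderer: turns the model's structured output into HTML."""
    if primitive == "table":
        header = "".join(f"<th>{c}</th>" for c in payload["columns"])
        rows = "".join(
            "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
            for row in payload["rows"]
        )
        return f"<table><tr>{header}</tr>{rows}</table>"
    return f"<pre>{json.dumps(payload, indent=2)}</pre>"  # fallback rendering

# Pretend the language model chose the "table" primitive and produced this payload.
model_output = {
    "primitive": "table",
    "payload": {"columns": ["month", "revenue"], "rows": [["Jan", 120], ["Feb", 140]]},
}
print(render(model_output["primitive"], model_output["payload"]))
```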
Examples of use
Data analysis: real-time display of charts or statistical results in ChatGPT Enterprise
Work-assistant tools: interactive reports such as customer-support and sales dashboards
Representative Claim 1
A method comprising: receiving, by a language model generative response engine, a first prompt in natural language to generate a response within a chat thread between a user account and the language model generative response engine presented by a front end, wherein the language model generative response engine has been trained to determine when the response should use a first primitive from a collection of primitives to generate the response, wherein the first primitive instructs the language model generative response engine to output an initial response that includes structured data as defined by the first primitive; invoking, by the language model generative response engine, the first primitive; generating, by the language model generative response engine, the initial response by predicting a next word in a sequence of words based on the first prompt and the first primitive to result in the initial response made up of the sequence of words formatted as the structured data defined by the first primitive; rendering the initial response as the structured data in a visual format using the front end; and outputting a completed response, the completed response including the rendered structured data in the visual format.
(2) Systems and Methods for Interacting with a Large Language Model
Purpose
Create a richer user experience by adding multimodal inputs such as images to OpenAI's large language models (the GPT family)
For example, after uploading an image to ChatGPT, the user can ask the model to describe the image or answer questions about a specific area within it
Core solutions
Graphical user interface (GUI) interaction
User: provides a text query plus an image
Model: jointly analyzes the text and image → highlights specific locations within the image
Supports advanced features such as "ChatGPT recognizes objects in an image and zooms in on or explains only those parts" (a rough sketch of this interaction follows below)
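A minimal sketch of this interaction, assuming a hypothetical locate_object call that stands in for the multimodal model and returns normalized coordinates; Pillow is used only to draw the indicator.

```python
from PIL import Image, ImageDraw

def locate_object(image: Image.Image, prompt: str) -> tuple[float, float]:
    """Stand-in for the multimodal model: returns a normalized (x, y) location
    in the image that answers the prompt. A real system would call the model here."""
    return (0.5, 0.5)  # dummy: centre of the image

def highlight(image: Image.Image, prompt: str, radius: int = 20) -> Image.Image:
    """Draw an indicator (circle) at the location the model points to."""
    x_norm, y_norm = locate_object(image, prompt)
    x, y = int(x_norm * image.width), int(y_norm * image.height)
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.ellipse([x - radius, y - radius, x + radius, y + radius], outline="red", width=3)
    return annotated

demo = Image.new("RGB", (640, 480), "white")
highlight(demo, "Where is the logo?").save("highlighted.png")
```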
Examples of use
Visual tutorials: automatically generate guides or manuals based on product photos uploaded by users
Multimodal ChatGPT: build a powerful all-round conversational agent by integrating image, text, and voice input (Whisper)
Representative Claim 1
A method of interacting with a multimodal machine learning model, the method comprising: providing a graphical user interface associated with a multimodal machine learning model; presenting an image to a user in the graphical user interface; receiving a prompt from the user; generating input data using the image and the received prompt; generating an output at least in part by providing the input data to the multimodal machine learning model, the multimodal machine learning model configured using prompt engineering to identify a location in the image based on the image and the prompt, wherein the output indicates a first location; and displaying, in the graphical user interface, an indicator at the first location in the image, wherein displaying the indicator at the first location in the image comprises positioning a cursor of the graphical user interface at the first location in the image.
(3) Systems and Methods for Interacting with a Multimodal Machine Learning Model
Purpose
Improve visual communication between users and AI by accepting text and images simultaneously
Core technology that enhances the interaction interface of OpenAI multimodal models such as DALL·E, CLIP, and Whisper
Core solutions
Image highlighting: an image is displayed in the GUI, and the model finds the specific point (coordinate) matching a text query and highlights it with a cursor or box
Increase reliability and accessibility by visualizing the model's recognition results directly for the user
Examples of use
Interactive design tools: "What looks off about this logo?" → the model highlights the problem areas in the logo
Medical/manufacturing: analyze X-ray and CT images to mark suspicious areas, or highlight defect locations on a process-inspection screen
Representative Claim 1
A method of interacting with a pre-trained multimodal machine learning model, the method comprising: providing a graphical user interface configured to enable a user to interact with an image to generate a prompt that indicates a region of interest in the image; receiving the prompt; generating input data using the image and the prompt; generating a response to the image by providing the input data to a multimodal machine learning model configured to condition the generated response to the image on the prompt; and presenting the generated response to the user, wherein the generated response includes a suggested prompt and a selectable control in the graphical user interface configured to enable the user to select the suggested prompt.
(4) Schema-based Integration of External APIs with Natural Language Applications (registration number: 12124823)
Purpose
Simplify the integration between OpenAI's conversational models (such as ChatGPT) and external APIs so the AI can perform a variety of functions, plug-in style
In fact, the ChatGPT Plugin system is based on a similar concept: it connects to various third-party apps, calls their APIs automatically, and returns the results to users
Core solutions
Define the API schema in a manifest file
Provide the API's usage and request/response formats in a form the model can understand
The LLM analyzes the user's natural-language request to decide which API to call and how
As in the actual ChatGPT Plugins system, once a developer provides a manifest, ChatGPT can use the plugin to perform a variety of functions (a simplified routing sketch follows below)
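The sketch below illustrates the manifest-plus-routing idea with a made-up calendar manifest and a keyword-based router standing in for the LLM's decision; the field names and functions are illustrative, not the actual plugin manifest format.

```python
import json

# Illustrative manifest describing one external API; the real plugin manifest
# format differs, this only shows the idea of a machine-readable schema.
MANIFEST = json.loads("""
{
  "name_for_model": "calendar",
  "description_for_model": "Create calendar events. Call add_event(title, start_iso).",
  "functions": [
    {"name": "add_event",
     "parameters": {"title": {"type": "string"}, "start_iso": {"type": "string"}}}
  ]
}
""")

def call_external_api(function_name: str, arguments: dict) -> dict:
    """Stand-in for the HTTP call to the third-party API."""
    return {"status": "created", "function": function_name, "arguments": arguments}

def handle_user_request(user_text: str) -> dict:
    """Toy router: a real system would let the LLM read MANIFEST and decide
    which function to call and with which arguments."""
    if "meeting" in user_text.lower():
        args = {"title": "Team meeting", "start_iso": "2025-03-01T10:00:00"}
        return call_external_api("add_event", args)
    return {"status": "no_api_needed", "answer": "..."}

print(handle_user_request("Add a meeting to my schedule"))
```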
Examples of use
Scheduling plugin: when the user says "Add a meeting to my schedule," the calendar API is called automatically
Shopping/ordering: "Order pizza" → call the restaurant API and return the results (menu, payment options, etc.)
Representative Claim 1
A computer-implemented method comprising: accessing a first manifest file stored in a first location, the first manifest file including first training data associated with a first web application programming interface (API)... training a model based on the first training data and the first description of the first web API; accessing a second manifest file stored in a third location, the second manifest file including second training data associated with a second web API... receiving an input at a user interface of the model; determining whether the input includes a request to integrate the first web API or the second web API with the user interface; generating one or more function calls to transmit to the first web API or the second web API based on the analysis of the received input... retraining the model based on at least one change made to one or more of the first training data, the first description of the first web API, the second training data, or the second description of the second web API.
(5) Systems and Methods for Image Generation with Machine Learning Models
Purpose
Technology in the DALL·E family of text-based image generation models for freely replacing, correcting (inpainting), or enhancing specific parts of an existing image
By masking, you can remove part of the image, describe new elements with text, and reconstruct the desired result
Core solutions
Masking & regeneration
Original image → mask a specific area → give a text instruction such as "Add a rainbow here" → the model generates natural-looking pixels
Noise removal and resolution correction (e.g., via blind super resolution) are also provided (see the compositing sketch below)
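A minimal sketch of the compositing step in masking-and-regeneration, using NumPy: pixels outside the mask are kept from the original, while the masked region is filled from a stand-in "generated" segment (a real system would obtain that segment from the image-generation model).

```python
import numpy as np

def compose_inpainted(original: np.ndarray, mask: np.ndarray,
                      generated_segment: np.ndarray) -> np.ndarray:
    """Combine an original image with a model-generated segment: pixels outside
    the mask are copied from the original, pixels inside the mask come from the
    generated segment."""
    mask3 = mask[..., None].astype(bool)                  # H x W -> H x W x 1
    return np.where(mask3, generated_segment, original)

h, w = 64, 64
original = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
mask = np.zeros((h, w), dtype=np.uint8)
mask[16:48, 16:48] = 1                                    # region to regenerate
generated = np.full((h, w, 3), 127, dtype=np.uint8)       # stand-in model output
result = compose_inpainted(original, mask, generated)
print(result.shape, result.dtype)
```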
Examples of use
DALL·E-based photo editing: Change the background of a portrait photo, add/delete items, automatically adjust color and brightness, etc.
Marketing/design: Fast and creative visualization of ideas when creating images for product advertisements
Representative Claim 1
A system comprising: ... generating a masked image by removing a masked region from an input image, wherein the masked image excludes pixel values corresponding to the masked region; receiving a text input comprising an image prompt; providing at least one of the input image, the masked region, or the text input to a machine learning model configured to generate an enhanced image... generating, with the machine learning model, the enhanced image based on at least one of the input image, the masked region, or the text input, wherein the generation of the enhanced image comprises: copying pixel values from the input image or the masked image to the enhanced image; generating, with the machine learning model, an image segment based on at least one of the text input or the pixel values from the masked image; and inserting the image segment into the enhanced image by replacing the masked region.
(6) Systems and Methods for Generating Code Using Language Models Trained on Computer Code
Purpose
OpenAI models such as Codex (or the ChatGPT coding assistant) take a developer's natural-language requirements and generate appropriate code
The goal is to improve code quality and maximize development productivity by verifying the results of test execution
Core solutions
Code + comment training: uses an LLM trained on large volumes of open-source code and comments (docstrings)
The model generates and tests multiple code candidates, then selects and suggests the one that works correctly (a toy version of this loop follows below)
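A toy version of the generate-test-select loop is sketched below; the hard-coded candidate strings stand in for model outputs, and exec is used only for illustration.

```python
# Toy version of "generate several candidates, keep the one that passes tests".
# The candidate strings stand in for model outputs; a real system would sample
# them from a code-generation model.
CANDIDATES = [
    "def add(a, b):\n    return a - b",   # wrong on purpose
    "def add(a, b):\n    return a + b",   # correct
]

TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

def passes_tests(source: str) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)            # never do this with untrusted code
        return all(namespace["add"](*args) == expected for args, expected in TESTS)
    except Exception:
        return False

selected = next((c for c in CANDIDATES if passes_tests(c)), None)
print("selected candidate:\n", selected)
```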
Examples of use
ChatGPT coding assistant: If you say “Create a QuickSort function in Python”, the code will be written and tested automatically
IDE integration: delivered as a plug-in for development environments such as Visual Studio Code and JetBrains IDEs to make suggestions automatically
Representative Claim 1
A computer-implemented method comprising: receiving a docstring comprising natural language text describing a desired programming result; generating, using a machine learning model and based on the docstring, one or more computer code samples configured to produce candidate results; providing each of the one or more computer code samples to be tested in a testing environment... identifying, based on a result of the testing environment, at least one of the computer code samples that produced a particular candidate result... generating, using the machine learning model, natural language text associated with the at least one identified computer code sample; verifying each of the one or more computer code samples; and outputting the at least one identified computer code sample...
(7) Systems and Methods for Generating Natural Language Using Language Models Trained on Computer Code
Purpose
Technology that automatically generates documentation (comments/explanations) from source code, so Codex or ChatGPT can answer "What does this function do?" on their own
Aims to address missing comments and maintenance difficulties in large projects and to improve code comprehension
Core solutions
Reverse documentation: analyze function signatures and code flow and generate natural-language summaries (docstrings)
After writing the code, the developer asks the model to "write a description of this function" → it automatically produces an easy-to-read summary (a minimal prompt sketch follows below)
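A minimal sketch of this reverse-documentation flow, with a made-up prompt template and a stand-in for the model call; neither reflects OpenAI's actual prompts or API.

```python
def build_docstring_prompt(source_code: str) -> str:
    """Illustrative prompt asking a language model to document a function."""
    return ("Write a concise docstring explaining what this function does, "
            "its parameters, and its return value:\n\n" + source_code)

def model_generate_docstring(prompt: str) -> str:
    """Stand-in for the language-model call."""
    return '"""Return the n-th Fibonacci number using simple iteration."""'

code = """def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""
print(build_docstring_prompt(code))
print(model_generate_docstring(build_docstring_prompt(code)))
```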
Examples of use
Automatic comment generation: ChatGPT helps document code every time new code is written
Education platform: When beginner developers learn open source code, they automatically receive explanations for each function to improve understanding
Representative Claim 1
A computer-implemented method comprising: training a machine learning model to generate natural language docstrings from computer code; receiving one or more computer code samples at the machine learning model; generating, via the machine learning model and based on the received one or more computer code samples, one or more candidate natural language docstrings... identifying at least one of the one or more candidate natural language docstrings that describes an intent of the at least a portion of the one or more computer code samples; and outputting, from the trained machine learning model, the at least one identified natural language docstring with the at least a portion of the one or more computer code samples...
(8) Using Machine Learning to Train and Use a Model to Perform Automatic Interface Actions Based on Video and Input Datasets
Purpose
Train a model that performs automated actions (clicks, drags, etc.) on a UI using large amounts of unlabeled video data
Connects to "visual information + automated task" systems (such as automated browser navigation and RPA) that OpenAI could implement in the future
Core solutions
Generate pseudo-labels with an inverse dynamics model (IDM)
The IDM looks at past and future video frames and infers what action (mouse movement, keyboard input) occurred in between
Iterative training: use the inferred action labels to improve the automated-interface-action model's performance (a toy pseudo-labeling sketch follows below)
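A toy sketch of IDM-based pseudo-labeling: a stand-in inverse_dynamics_model (here just a brightness comparison, not a trained network) assigns an inferred action to each frame transition of unlabeled video.

```python
import numpy as np

def inverse_dynamics_model(frame_t: np.ndarray, frame_t1: np.ndarray) -> str:
    """Stand-in IDM: infers which action happened between two frames.
    A real IDM is a trained network; here we just compare mean brightness."""
    return "click" if frame_t1.mean() > frame_t.mean() else "no_op"

def pseudo_label_video(frames: list[np.ndarray]) -> list[tuple[int, str]]:
    """Assign an inferred action to every frame transition in unlabeled video."""
    return [(t, inverse_dynamics_model(frames[t], frames[t + 1]))
            for t in range(len(frames) - 1)]

video = [np.random.rand(32, 32) for _ in range(5)]   # toy unlabeled video
labels = pseudo_label_video(video)
print(labels)  # these pseudo-labels would then be used to train the action model
```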
Examples of use
Browser automation: from screen recordings, the model learns form filling and file uploads and repeats them automatically
Software testing: learn regression tests or UI verification from video to minimize manual effort
Representative Claim 1
A method for training a machine learning model to perform automated actions, comprising: receiving unlabeled digital video data; generating pseudo-labels for the unlabeled digital video data, the generating comprising: receiving labeled digital video data; training a first machine learning model including an inverse dynamics model (IDM) using the labeled digital video data; and generating at least one pseudo-label for the unlabeled digital video data... adding the at least one pseudo-label to the unlabeled digital video data to form pseudo-labeled digital video data; and further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data...
(9) Multi-Task Automatic Speech Recognition System
Purpose
Like OpenAI's Whisper model, perform multilingual, multi-task speech recognition with a single transformer, improving the efficiency of transcribing and translating large volumes of audio
Covers multiple languages such as English, Spanish, and Korean with a single model, and processes transcription & translation tasks simultaneously
Core solutions
Multi-task training: a single model learns from various kinds of labeled audio, such as same-language transcription and translation into other languages
Language tokens, task tokens, and similar special tokens are added to the decoder input to indicate, for example, "transcribe this Korean input" versus "translate from English to French" (see the token sketch below)
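A rough sketch of steering a single decoder with special tokens; the token strings and the decode_step stub are illustrative, not the actual Whisper vocabulary or decoding code.

```python
# Toy illustration of steering one decoder with special tokens, in the spirit of
# Whisper-style multi-task decoding. Token names and the decode step are made up.
def build_decoder_prefix(language: str, task: str) -> list[str]:
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>", "<|0.00|>"]

def decode_step(tokens: list[str]) -> str:
    """Stand-in for one autoregressive step of the transformer decoder."""
    vocabulary = ["hello", "world", "<|endoftext|>"]
    return vocabulary[len(tokens) % len(vocabulary)]

tokens = build_decoder_prefix(language="ko", task="translate")
for _ in range(10):                      # autoregressive loop
    nxt = decode_step(tokens)
    tokens.append(nxt)
    if nxt == "<|endoftext|>":
        break
print(tokens)
```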
Examples of use
Global services: when ChatGPT receives voice input from around the world, it automatically transcribes or translates it according to the language
Automatic meeting notes: real-time transcript generation with a single model, even in multilingual meetings
Representative Claim 1
A system comprising: at least one memory storing instructions; and at least one processor configured to execute the instructions to perform operations for multi-language, multi-task speech recognition, the operations comprising: accessing a transformer model... generating an output transcript from an input audio segment using the transformer model, the generating including: initializing a decoder input with a language token corresponding to a first language; updating the decoder input with a task token; and autoregressively updating the decoder input with a first timestamp token...
(10) Systems and Methods for Hierarchical Text-Conditional Image Generation
Purpose
Text-conditional image generation models such as DALL·E use a hierarchical structure to generate high-resolution images in stages
Text input → low-resolution base image → sequential upsampling to obtain a detailed result
Core solutions
Multiple sub-models
Low-resolution image generation (first sub-model)
High-resolution upsampling (second sub-model)
Noise and low-quality inputs are corrected precisely with blind super resolution (BSR) (a toy two-stage pipeline appears below)
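A toy two-stage pipeline illustrating the hierarchy; both "models" are stand-ins (a seeded random draft and nearest-neighbour upsampling), so only the structure, not the quality, reflects the approach described here.

```python
import numpy as np

def base_model(text: str, size: int = 64) -> np.ndarray:
    """Stand-in for the low-resolution base generator conditioned on text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.integers(0, 256, (size, size, 3), dtype=np.uint8)

def upsample(image: np.ndarray, factor: int = 4) -> np.ndarray:
    """Nearest-neighbour upsampling as a placeholder for a learned upsampler."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

draft = base_model("a cat in the style of Van Gogh")      # 64 x 64 draft
stage1 = upsample(draft)                                   # 256 x 256
stage2 = upsample(stage1)                                  # 1024 x 1024 final
print(draft.shape, stage1.shape, stage2.shape)
```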
Examples of use
DALL·E: "Draw a picture of a cat in the style of Van Gogh" → generate a low-resolution draft → complete a sharp image in the upsampling phase
High-quality advertising/marketing images: turn text ideas alone into professional-level, detailed images for brand marketing
Representative Claim 1
A system comprising: ... inputting at least one of the text description or the text embedding into a first sub-model configured to generate, based on at least one of the text description or the text embedding, a corresponding image embedding; inputting at least one of the text description or the corresponding image embedding... into a second sub-model configured to generate an output image, wherein the second sub-model includes a first upsampler model and a second upsampler model... the second upsampler model being trained on images processed with blind super resolution (BSR); and making the output image accessible to a device...
(11) Schema-based Integration of External APIs with Natural Language Applications (registration number: 11922144)
Purpose
Technology for easily integrating external APIs in the ChatGPT Plugin style, similar to patent (4) above
The model automatically understands the API calling procedure from the manifest and uses it, extending functionality
Core solutions
Manifest customization: third-party API providers define authentication, endpoints, and response formats as JSON schemas
The model calls the API based on the user's request → summarizes the response back to the user in natural language (an illustrative manifest appears below)
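An illustrative plugin-style manifest showing where authentication, endpoint, and response-schema information might live; the field names are simplified and should not be read as the actual manifest specification.

```python
import json

# Illustrative manifest with authentication, endpoint, and response-schema
# fields; the structure is simplified for explanation only.
manifest = {
    "schema_version": "v1",
    "name_for_model": "flight_booker",
    "auth": {"type": "service_http", "authorization_type": "bearer"},
    "api": {"type": "openapi", "url": "https://example.com/openapi.yaml"},
    "endpoints": {
        "search_flights": {
            "method": "GET",
            "path": "/flights",
            "response_schema": {"flights": [{"id": "string", "price": "number"}]},
        }
    },
}
print(json.dumps(manifest, indent=2))
```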
Examples of use
ChatGPT plugin ecosystem: plug-ins for itinerary management, travel reservations, banking, online shopping, and more
Enterprise internal system integration: Connect to in-house ERP and CRM to interactively query specific in-house data
Representative Claim 1
A computer-implemented method for integrating a particular external application programming interface (API) with a natural language model user interface, comprising: receiving a first input at the natural language model user interface communicably connected to a natural language model that is configured to call one or more functions based on a manifest... determining that the first input includes a request to integrate the particular external API... identifying the particular external API based on the received first input; integrating the particular external API with the natural language model user interface, the integrating using the particular external API, a description of the web API, and the manifest... calling the particular external API based on the first input or a second input... providing, based on the calling, a response message to the natural language model user interface, the response message including a result of the calling...
(12) Systems and Methods for Language Model-based Text Editing
Purpose
Streamline the content creation and editing process by using OpenAI language models such as ChatGPT to automatically correct and edit document drafts
Apply changes immediately when the user says "make this paragraph more concise" or "rewrite it in a business style"
Core solutions
Prompt-based editing: specify a target passage and an editing style (formal, friendly, etc.), and the model rewrites just that portion of the text
If the result isn't satisfactory, the user can issue a follow-up instruction to re-edit (iterative editing; a small sketch of this loop appears below)
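A small sketch of the iterative editing loop, with a stand-in edit_text function in place of the language-model call.

```python
def edit_text(text: str, instruction: str) -> str:
    """Stand-in for a language-model edit call: a real system would send the
    selected text plus the instruction to the model and return its rewrite."""
    if "concise" in instruction:
        return " ".join(text.split()[:12]) + "..."
    return text

draft = ("Our quarterly results were, generally speaking and all things "
         "considered, somewhat better than what we had initially expected.")
v1 = edit_text(draft, "make this paragraph more concise")
v2 = edit_text(v1, "rewrite it in a business style")     # second, iterative pass
print(v1)
print(v2)
```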
Examples of use
Blog posts: ChatGPT condenses long drafts or automatically inserts SEO keywords
Business email: the model generates, revises, and polishes sentences without a template, ready to send immediately
Representative Claim 1
A system comprising: ... receiving an input text prompt, the input text prompt comprising a null set; receiving one or more user instructions; determining a set of model parameters based on the one or more user instructions; accessing a language model... generating, using the language model, an output text based on the input text prompt, the one or more user instructions... editing the output text based on the language model and one or more new user instructions by replacing at least a portion of the output text; and optimizing the accessed language model...
(13) Systems and Methods for Language Model-based Text Insertion
Purpose
The model automatically inserts sentences into the middle of an existing document or conversation to improve its completeness naturally
ChatGPT recognizes specific sections and adds the necessary information or example sentences to make the text richer
Core solutions
Context analysis: the model examines the surrounding sentences (prefix and suffix) so the inserted content matches the overall flow
"Give me an example here" → the model creates a relevant example or analogy on the fly, strengthening the whole article (a fill-in-the-middle prompt sketch follows below)
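A minimal fill-in-the-middle sketch: the prompt is assembled from the prefix and suffix, and a stand-in model_insert plays the role of the language model; the [insert] marker and prompt wording are illustrative only.

```python
def build_insertion_prompt(prefix: str, suffix: str) -> str:
    """Compose a fill-in-the-middle style prompt asking the model for text
    that fits between the prefix and the suffix."""
    return (f"{prefix}\n[insert]\n{suffix}\n\n"
            "Write the text that should replace [insert] so the passage reads naturally.")

def model_insert(prompt: str) -> str:
    """Stand-in for the language-model call that returns the inserted text."""
    return "For example, a retailer might use embeddings to match queries to products."

prefix = "Embeddings map text to vectors so that similar meanings end up close together."
suffix = "This property is what makes semantic search practical at scale."
print(build_insertion_prompt(prefix, suffix))
print("inserted:", model_insert(build_insertion_prompt(prefix, suffix)))
```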
Examples of use
Writing a thesis/report: Inserting additional explanations or cases between chapters.
Novel/scenario: The model actively fills in necessary parts such as dialogue or setting explanations.
Representative Claim 1
A system comprising:
receiving an input text prompt comprising a prefix portion and a suffix portion; determining a set of model parameters based on the input text prompt; accessing a language model... determining a set of context parameters based on the input text prompt and the language model, the set of context parameters comprising at least one of location, person, time, or event; generating language model output text based on the set of context parameters and the language model; inserting the language model output text into the input text prompt... optimizing the accessed language model...
(14) Systems and Methods for Using Contrastive Pre-training to Generate Text and Code Embeddings
Purpose
Create embeddings that efficiently calculate semantic similarity between text and code for use in search, recommendations, and document classification. For instance, the system can find code similar to a given function description or match two texts whose meanings are closely related.
Core solutions
Contrastive Learning: Trains by pulling vectors for similar samples closer and pushing vectors for dissimilar samples apart.
OpenAI leverages models that learn text and code simultaneously (e.g., Codex, CLIP) to enhance search and recommendation systems.
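A toy contrastive-loss computation over a batch of paired text/code embeddings, using NumPy; this is a generic symmetric InfoNCE-style loss for illustration, not OpenAI's training code.

```python
import numpy as np

def log_softmax_rows(m: np.ndarray) -> np.ndarray:
    m = m - m.max(axis=1, keepdims=True)            # for numerical stability
    return m - np.log(np.exp(m).sum(axis=1, keepdims=True))

def contrastive_loss(text_vecs: np.ndarray, code_vecs: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE-style loss: matching text/code pairs (the diagonal)
    are pulled together, every other in-batch combination acts as a negative."""
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    logits = t @ c.T / temperature                  # scaled cosine similarities
    diag = np.arange(len(t))
    loss_t2c = -log_softmax_rows(logits)[diag, diag].mean()
    loss_c2t = -log_softmax_rows(logits.T)[diag, diag].mean()
    return float((loss_t2c + loss_c2t) / 2)

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```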
Examples of use
ChatGPT Plugin: When a user describes a specific piece of code, the model compares embeddings and recommends the best-matching library function.
Document Search Engine: Converts FAQ data, customer queries, etc. into meaningful vector representations for faster retrieval.
Representative Claim 1
A computer-implemented method for generating a semantic similarity based on a vector representation, the method comprising:
receiving a training data set extracted from unlabeled data, the training data set including a plurality of paired data samples corresponding to positive example pairs…
converting the paired data samples corresponding to the positive example pairs into at least one first vector of a vector representation;
accessing one or more negative example pairs…
converting the one or more negative example pairs into one or more second vectors…
training a machine learning model to generate additional vectors of the vector representation, wherein the training comprises:
initializing the machine learning model with one or more pre-trained models…
training the machine learning model using contrastive training…
receiving a query for semantic similarity…
generating, with the machine learning model and according to an embedding space, a semantic similarity result in response to the query.
These patents serve as the foundational technologies behind most of OpenAI’s services, including ChatGPT, DALL·E, Whisper, and Codex. Spanning a broad spectrum—from conversational UIs to image generation, speech recognition, code automation, and external API integration—each patent interconnects seamlessly to form what can be described as an “AI platform ecosystem.”
3. Patent Trends and Implications
Multimodal AI and UI/UX Innovation: Technology capable of simultaneously processing multiple forms of data (images, text, audio, and video) is becoming widespread, and UI/UX is growing more intuitive and enriched.
Expansion of LLMs into Code Generation, Translation, and Editing: Automatic code generation, document editing, and annotation can greatly automate and lighten the workload of developers and documentation teams. This leads to improved productivity, reduced labor costs, and shorter time-to-market for businesses.
Platformization via External API Integration: AI models now directly call various third-party APIs to expand functionality. As conversational AI services evolve into platforms, third-party APIs can be integrated seamlessly, like plug-ins.
Enhanced Training Efficiency and Precision: Contrastive learning, pseudo-labeling, and other techniques make optimal use of unlabeled data, boosting accuracy and versatility. Large-scale language and vision models are becoming more refined, enabling strong performance even in zero-shot or few-shot scenarios.
4. Conclusion
OpenAI's patent portfolio illustrates the future direction of AI, encompassing multimodal systems, large language models (LLMs), external API integration, code automation, high-resolution image generation, and speech recognition. Leading companies already run extensive patent strategies to protect and expand their proprietary technologies. Pine IP Firm offers professional services in AI and software patent strategy, including patent specification drafting, filing, infringement response, and dispute resolution. As competition in AI intensifies, intellectual property rights are essential for safeguarding technological innovation. As OpenAI's example shows, where an early patent acquisition strategy underpins large-scale investment and exclusive technology, your company should also establish a robust patent strategy. If you want to firmly protect your unique AI technology, we invite you to collaborate with Pine IP Firm to develop a step-by-step plan.