View OpenAI's patents

Pine IP
February 3, 2025

As AI technology evolves rapidly, OpenAI, a leading company in the field, has secured patents across a wide range of areas, including natural language processing, multimodal processing, code generation and editing, image generation, and speech recognition. In this column, we carefully analyze OpenAI's major registered patents and take an in-depth look at the problems and solutions each one addresses, covering the key details of each patent, its market applicability, and its technical and legal implications.

1. Overview of OpenAI's AI Patents

The major AI patents applied for and registered by OpenAI are listed below; they cover technologies such as large language models (LLMs), multimodal AI, automatic coding, speech recognition, image processing, and external API integration. As of February 3, 2025, OpenAI holds a total of 14 US patents.

Title | Application Number | Registration Number | IPC
Adaptive UI for Rich Output Rendering of Assistant Messages | 18-606435 | 12164548 | G06F-040/30
Systems and Methods for Interacting with a Large Language Model | 18-475722 | 12051205 | G06T-007/10
Systems and Methods for Interacting with a Multimodal Machine Learning Model | 18-475588 | 12039431 | G06N-003/0455
Schema-based Integration of External APIs with Natural Language Applications | 18-474063 | 12124823 | G06F-008/35
Systems and Methods for Image Generation with Machine Learning Models | 18-458907 | 11983806 | G06K-009/36
Systems and Methods for Generating Code Using Language Models Trained on Computer Code | 18-321852 | 12061880 | G06F-008/30
Systems and Methods for Generating Natural Language Using Language Models Trained on Computer Code | 18-321921 | 12008341 | G06F-008/30
Using Machine Learning to Train and Use a Model to Perform Automatic Interface Actions Based on Video and Input Datasets | 18-303552 | 11887367 | G06V-020/40
Multi-Task Automatic Speech Recognition System | 18-302289 | 12079587 | G06F-040/58
Systems and Methods for Hierarchical Text-Conditional Image Generation | 18-193427 | 11922550 | G06T-011/60
Schema-based Integration of External APIs with Natural Language Applications | 18-186712 | 11922144 | G06F-008/35
Systems and Methods for Language Model-based Text Editing | 18-183902 | 11983488 | G06F-040/166
Systems and Methods for Language Model-based Text Insertion | 18-183898 | 11886826 | G06F-017/00
Systems and Methods for Using Contrastive Pre-training to Generate Text and Code Embeddings | 18-158166 | 12073299 | G06N-020/00

2. Key content analysis & representative claims for each major patent

The following patents cover core AI technologies directly related to OpenAI's flagship services (ChatGPT, DALL·E, Codex, Whisper, etc.). ChatGPT and DALL·E, which have already gained strong recognition in the market, operate on these patented technologies and offer features that differentiate them from competitors. Let's take a detailed look at what problem each patent solves and what its technical characteristics are.

(1) Adaptive UI for Rich Output Rendering of Assistant Messages

Purpose

  • Build a UI system that can freely express visual and structured information beyond a plain text chat environment
  • Interactive models such as OpenAI's ChatGPT can deliver information to users more intuitively by providing rich graphs, tables, and images

Core solutions

  • Generative Language Model (LLM) + front-end rendering structure
    • The LLM structures answers or data-analysis results using "primitives"
    • The front end (UI) automatically converts them into visual elements (graphs, dashboards, etc.)
  • This lets the ChatGPT interface directly show visual answers such as charts and tables rather than plain text only
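The primitive-plus-renderer split above can be sketched as follows. This is a minimal illustration, not the patented implementation: the primitive names, response schema, and renderers are all hypothetical.

```python
# Hypothetical sketch of primitive-based rendering: the model emits
# structured data tagged with a primitive name, and the front end
# dispatches it to a matching renderer. All names are illustrative.

def render_table(data):
    """Render a 'table' primitive as plain text rows."""
    header = " | ".join(data["columns"])
    rows = [" | ".join(str(v) for v in row) for row in data["rows"]]
    return "\n".join([header] + rows)

def render_text(data):
    """Fallback renderer for plain text output."""
    return data["text"]

RENDERERS = {"table": render_table, "text": render_text}

def render_response(response):
    """Dispatch a model response to the renderer for its primitive."""
    return RENDERERS[response["primitive"]](response["data"])

# A model response that selected the 'table' primitive:
response = {
    "primitive": "table",
    "data": {"columns": ["region", "sales"], "rows": [["EU", 120], ["US", 200]]},
}
print(render_response(response))
```

The key design point is that the model only chooses the primitive and fills in structured data; the visual presentation is left entirely to the front end.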

Examples of use

  • Data analysis: real-time display of charts or statistical results in ChatGPT Enterprise
  • Assistive work tools: interactive reports such as customer-support and sales dashboards

Representative Claim 1

  1. A method comprising:
    receiving, by a language model generative response engine, a first prompt in natural language to generate a response within a chat thread between a user account and the language model generative response engine presented by a front end,
    wherein the language model generative response engine has been trained to determine when the response should include a first primitive from a collection of primitives to generate the response, wherein the first primitive causes the language model generative response engine to output an initial response that includes structured data as defined by the first primitive;
    invoking, by the language model generative response engine, the first primitive;
    generating, by the language model generative response engine, the initial response by predicting a next word in a sequence of words based on the first prompt and the first primitive, resulting in the initial response made up of the sequence of words organized as the structured data defined by the first primitive;
    rendering the initial response as the structured data in a visual format using the front end; and
    outputting a completed response, the completed response including the rendered structured data in the visual format.

(2) Systems and Methods for Interacting with a Large Language Model

Purpose

  • Create a richer user experience by incorporating multimodal inputs such as images into OpenAI's large language models (the GPT family)
  • For example, after uploading an image to ChatGPT, implement functions that describe the image or answer questions about a specific area of it

Core solutions

  • Graphical user interface (GUI) interaction
    • User: provides a text query plus an image
    • Model: jointly analyzes the text and image → highlights a specific location within the image
  • Supports advanced features such as "ChatGPT recognizes objects in an image and zooms in on or explains only those parts"

Examples of use

  • Visual tutorials: automatically generate manuals and guides based on product photos uploaded by users
  • Multimodal support ChatGPT: Build a powerful all-round interactive agent by integrating images, text, and voice input (Whisper)

Representative Claim 1

  1. A method of interacting with a multimodal machine learning model, the method comprising:
    providing a graphical user interface associated with a multimodal machine learning model;
    presenting an image to a user in the graphical user interface;
    receiving a prompt from the user;
    generating input data using the image and the received prompt;
    generating an output at least in part by providing the input data to the multimodal machine learning model, the multimodal machine learning model configured using prompt engineering to identify a location in the image based on the image and the prompt, wherein the output indicates a first location; and
    displaying, in the graphical user interface, an indicator at the first location in the image, wherein displaying the indicator at the first location in the image comprises positioning a cursor of the graphical user interface at the first location in the image.

(3) Systems and Methods for Interacting with a Multimodal Machine Learning Model

Purpose

  • Improve visual communication between users and AI by accepting text and images simultaneously
  • A core technology that enhances the interaction interface of OpenAI multimodal models such as DALL·E, CLIP, and Whisper

Core solutions

  • Image highlighting: An image is displayed on the GUI, and the model finds a specific point (coordinate) according to a text query and highlights it with a cursor or box
  • Increase reliability and accessibility by directly visualizing the results recognized by the model to the user
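The locate-then-highlight flow can be sketched with a stubbed model. This is a toy illustration under assumptions: the model call, the character-grid "image," and the box-drawing logic are all hypothetical stand-ins for a real multimodal model and GUI.

```python
# Illustrative sketch of coordinate-based highlighting: a (hypothetical)
# multimodal model returns an (x, y) location for a text query, and the
# GUI layer draws a box around it. The model call is stubbed out.

def locate(image, prompt):
    """Stub for a multimodal model that maps (image, prompt) -> (x, y)."""
    # A real model would attend over image regions conditioned on the prompt.
    return (2, 1)

def highlight(image, center, radius=1):
    """Mark a square region around `center` in a 2D character grid."""
    cx, cy = center
    out = [row[:] for row in image]
    for y in range(max(0, cy - radius), min(len(out), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(out[y]), cx + radius + 1)):
            out[y][x] = "#"
    return out

image = [["." for _ in range(5)] for _ in range(3)]
marked = highlight(image, locate(image, "find the defect"))
print("\n".join("".join(row) for row in marked))
```

The point is the division of labor: the model only returns coordinates, and the GUI is responsible for rendering the indicator at that location.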

Examples of use

  • Interactive design tools: “What's so awkward about this logo?” → The model highlights problem areas in the logo
  • Medical/manufacturing: Analyze X-Ray and CT images to show suspicious areas or highlight the location of defects on the process inspection screen

Representative Claim 1

  1. A method of studying with a pre-trained multimodal machine learning model, the method studying:
    describe a graphical user interface configured to enable a user to interact with an image to generate a contested prompt that explains an area of interaction in the image;
    Confirmation the Conquest Prompt;
    discussion input data using the image and the conquest prompt;
    describe a response to the image by evaluate the input data to a multimodal machine learning model to configured condition the measured response to the image on the conjunct prompt; and
    Refrain the Refrain Response to the User, where in the unlikely response presented a prompt and the Choose a selectable control in the graphical user interface configured to enable the user to select the prompt

(4) Schema-based Integration of External APIs with Natural Language Applications (registration number: 12124823)

Purpose

  • Simplify the integration between OpenAI's interactive models (such as ChatGPT) and external APIs, enabling the AI to perform various functions plug-in style
  • In fact, the ChatGPT plugin system is based on a similar concept: it connects to various third-party apps, automatically calls their APIs, and returns results to users

Core solutions

  • Defining an API schema through a Manifest file
    • Provides API usage and request/response formats in a form that the model can understand
    • LLM analyzes users' natural language requests to determine which APIs to call and how
  • As in the actual ChatGPT Plugins system, once a developer provides a manifest, ChatGPT can use the plugin to perform various functions
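The manifest-to-function-call routing can be sketched as follows. Note that the manifest schema, function names, and the routing heuristic here are illustrative assumptions, not OpenAI's actual plugin manifest format (a real system would let the LLM itself choose the function).

```python
# A minimal sketch of manifest-driven API integration. The manifest schema
# and the word-overlap "router" are hypothetical simplifications.
import json

MANIFEST = json.loads("""
{
  "name": "calendar",
  "functions": [
    {"name": "add_event", "description": "Add an event to the calendar",
     "parameters": {"title": "string", "date": "string"}}
  ]
}
""")

def choose_function(manifest, user_request):
    """Naive router: pick the function whose description words overlap the request."""
    words = set(user_request.lower().split())
    best = max(manifest["functions"],
               key=lambda f: len(words & set(f["description"].lower().split())))
    return best["name"]

print(choose_function(MANIFEST, "please add a meeting to the calendar"))
```

In the patented approach the manifest's descriptions are what make the API legible to the model, so the quality of those natural-language descriptions directly affects routing accuracy.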

Examples of use

  • Scheduling Plugins: When the user says “Add a meeting to my schedule”, the calendar API is automatically called
  • Shop/order function: “Order pizza” → return results (menu, payment options, etc.) after calling the restaurant API

Representative Claim 1

  1. A computer-implemented method comprising:
    receiving a first manifest file stored in a first location, the first manifest file including first training data associated with a first web application programming interface (API)...
    training a model based on the first training data and the first description of the first web API;
    receiving a second manifest file stored in a third location, the second manifest file including second training data associated with a second web API...
    receiving an input at a user interface of the model;
    determining whether the input includes a request to integrate the first web API or the second web API with the user interface;
    generating one or more function calls to transmit to the first web API or the second web API based on the analysis of the received input...
    re-training the model based on at least one change made to one or more of the first training data, the first description of the first web API, the second training data, or the second description of the second web API.

(5) Systems and Methods for Image Generation with Machine Learning Models

Purpose

  • Technology for freely replacing, correcting (inpainting), or enhancing specific parts of an existing image in the DALL·E family of text-based image generation models
  • By masking, you can delete part of the image, specify new elements with text, and reconstruct the desired result

Core solutions

  • Masking & regeneration
    • Original image → mask a specific area → give a text instruction such as "Add a rainbow here" → the model generates natural-looking pixels
    • Noise removal and resolution-correction functions, for example via Blind Super Resolution, are also provided
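The compositing step of mask-and-regenerate can be sketched in a few lines. This is a toy under assumptions: the "generator" is a stub constant fill rather than a text-conditioned model, so only the keep-outside/replace-inside logic is shown.

```python
# A toy sketch of the mask-and-regenerate idea: pixels inside the mask are
# replaced by a "generator" (here a stub fill value), pixels outside the
# mask are copied from the original. Real inpainting conditions the fill
# on the text prompt; this only shows the compositing step.

def inpaint(image, mask, fill_value):
    """Composite: keep original pixels where mask is 0, regenerate where 1."""
    return [
        [fill_value if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

image = [[10, 10, 10], [10, 10, 10]]
mask = [[0, 1, 1], [0, 0, 1]]   # 1 marks the region to regenerate
print(inpaint(image, mask, 99))
```

The design point carried over from the claim is that unmasked pixel values are replicated verbatim from the input, which is what keeps the untouched parts of the photo stable across edits.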

Examples of use

  • DALL·E-based photo editing: Change the background of a portrait photo, add/delete items, automatically adjust color and brightness, etc.
  • Marketing/design: Fast and creative visualization of ideas when creating images for product advertisements

Representative Claim 1

  1. A system comprising:
    ...
    generating a masked image by removing a masked region from the input image, wherein the masked image excludes pixel values corresponding to the masked region;
    receiving a text input comprising an image prompt;
    providing at least one of the input image, the masked region, or the text input to a machine learning model configured to generate an enhanced image...
    generating, with the machine learning model, the enhanced image based on at least one of the input image, the masked region, or the text input;
    wherein the generation of the enhanced image comprises:
    replicating pixel values from the input image or the masked image to the enhanced image;
    generating, with the machine learning model, an image segment based on at least one of the text input or the pixel values from the masked image; and
    inserting the image segment into the enhanced image by replacing the masked region.

(6) Systems and Methods for Generating Code Using Language Models Trained on Computer Code

Purpose

  • Models such as Codex (or the ChatGPT code assistant) provided by OpenAI take a developer's natural language requirements and generate appropriate code
  • The goal is to improve code quality and maximize development productivity by verifying test execution results

Core solutions

  • Code + annotation training: uses an LLM trained on large-scale open source code and its comments (docstrings)
  • The model generates and tests multiple code candidates, then selects and suggests the code that works correctly
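The generate-then-verify loop can be sketched as follows. The candidate strings here stand in for model outputs, and the test harness is a hypothetical simplification; a real system would sandbox execution.

```python
# Sketch of the generate-then-verify loop: several candidate code samples
# are executed against a test, and the first one that passes is selected.
# The candidates stand in for model outputs; names are illustrative.

CANDIDATES = [
    "def add(a, b): return a - b",   # buggy candidate
    "def add(a, b): return a + b",   # correct candidate
]

def passes_test(source):
    """Run a candidate in an isolated namespace and check its behavior."""
    ns = {}
    try:
        exec(source, ns)
        return ns["add"](2, 3) == 5
    except Exception:
        return False

selected = next(c for c in CANDIDATES if passes_test(c))
print(selected)
```

The selection-by-testing step is what distinguishes this approach from plain one-shot code generation: incorrect samples are filtered out before anything is shown to the developer.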

Examples of use

  • ChatGPT coding assistant: If you say “Create a QuickSort function in Python”, the code will be written and tested automatically
  • IDE integration: code suggestions delivered as a plug-in inside development environments such as Visual Studio Code and JetBrains

Representative Claim 1

  1. A computer-implemented method, comprising:
    receiving a docstring comprising natural language text describing a desired programming result;
    generating, using a machine learning model and based on the docstring, one or more computer code samples configured to produce candidate results;
    causing each of the one or more computer code samples to be tested in a testing environment...
    identifying, based on a result of the testing environment, at least one of the computer code samples which produced a particular candidate result...
    generating, using the machine learning model, natural language text associated with the at least one identified computer code sample;
    verifying each of the one or more computer code samples; and
    outputting the at least one identified computer code sample...

(7) Systems and Methods for Generating Natural Language Using Language Models Trained on Computer Code

Purpose

  • A technology that automatically generates documentation (comments/explanations) from source code, letting Codex or ChatGPT answer "What does this function do?" on their own
  • Aims to solve the problem of missing comments and maintenance difficulties in large projects and to improve code comprehension

Core solutions

  • Reverse documentation: analyzes function signatures and code flow and generates natural language summaries (docstrings)
  • After writing code, the developer asks the model to "write a description of this function" → it automatically produces an easy-to-read explanation
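The signature-to-summary direction can be sketched with a template stub. To be clear, a real system uses a language model to produce the prose; this stub only illustrates the input (code object) and output (natural language) shapes, and all names are hypothetical.

```python
# A toy "reverse documentation" sketch: derive a natural-language summary
# from a function signature. A real system would use a language model;
# this template-based stub only shows the input/output shape.
import inspect

def summarize(func):
    """Produce a one-line description from a function's name and parameters."""
    params = ", ".join(inspect.signature(func).parameters)
    return f"{func.__name__} takes ({params}) and returns a value."

def quicksort(items, reverse=False): ...

print(summarize(quicksort))
```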

Examples of use

  • Automatic annotation generation tool: ChatGPT supports documentation every time you write new code
  • Education platform: When beginner developers learn open source code, they automatically receive explanations for each function to improve understanding

Representative Claim 1

  1. A computer-implemented method, comprising:
    training a machine learning model to generate natural language docstrings from computer code;
    receiving one or more computer code samples at the machine learning model;
    generating, via the machine learning model and based on the received one or more computer code samples, one or more candidate natural language docstrings...
    identifying at least one of the one or more candidate natural language docstrings that describes an intent of the at least a portion of the one or more computer code samples;
    outputting, from the trained machine learning model, the at least one identified natural language docstring with the at least a portion of the one or more computer code samples...

(8) Using Machine Learning to Train and Use a Model to Perform Automatic Interface Actions Based on Video and Input Datasets

Purpose

  • Train a model that performs automatic actions (click, drag, etc.) on a UI using large amounts of unlabeled video data
  • Connects to "visual information + automated task" systems (such as automatic browser navigation and RPA) that OpenAI could implement in the future

Core solutions

  • Create pseudo-labels with the Inverse Dynamics Model
    • Look at the past and future frames of the video and infer what actions (mouse movement, keyboard input) occurred in the middle
    • Progressive learning: Use these inferred action labels to improve automatic interface work model performance
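The pseudo-labeling step can be sketched with a toy inverse dynamics model. The 1-D "frames" (cursor positions) and the three-action set are illustrative assumptions; a real IDM is a trained network operating on pixel frames.

```python
# Sketch of inverse-dynamics pseudo-labeling: given consecutive frames
# (here 1-D cursor positions), an "IDM" infers the action between them,
# and those inferred actions become labels for otherwise unlabeled video.

def inverse_dynamics(frame_before, frame_after):
    """Infer the action that moved the cursor between two frames."""
    delta = frame_after - frame_before
    if delta > 0:
        return "move_right"
    if delta < 0:
        return "move_left"
    return "no_op"

frames = [0, 1, 2, 2, 1]   # unlabeled video: cursor x-position per frame
pseudo_labels = [
    inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])
]
print(pseudo_labels)
```

The labeled (frame, action) pairs produced this way are what the downstream behavior model trains on, which is how a small amount of hand-labeled data can bootstrap learning from large unlabeled video corpora.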

Examples of use

  • Browser automation: Through screen recording, the model learns and automatically repeats form input and file upload
  • software testing: Minimize manpower consumption by learning regression tests or UI verification at the video level

Representative Claim 1

  1. A method for training a machine learning model to perform automated actions, comprising:
    receiving unlabeled digital video data;
    generating pseudo-labels for the unlabeled digital video data, the generating comprising:
    receiving labeled digital video data;
    training a first machine learning model including an inverse dynamics model (IDM) using the labeled digital video data; and
    generating at least one pseudo-label for the unlabeled digital video data...
    adding the at least one pseudo-label to the unlabeled digital video data to form pseudo-labeled digital video data; and
    further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data...

(9) Multi-Task Automatic Speech Recognition System

Purpose

  • Like OpenAI's Whisper model, performs multilingual, multi-task speech recognition with a single transformer, improving the efficiency of transcribing and translating large-scale audio data
  • Covers multiple languages such as English, Spanish, and Korean with a single model, and handles transcription and translation tasks simultaneously

Core solutions

  • Multi-task training: a single model learns from various types of labeled audio, such as same-language transcription and translation into other languages
  • Language tokens, task tokens, etc. are added to the decoder input to distinguish, for example, "transcribe this Korean input" from "translate English to French"
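The token-based task selection can be sketched as a prefix builder. The token spellings follow Whisper's publicly documented convention (`<|startoftranscript|>`, language and task tokens), but the helper function itself is a hypothetical illustration, not Whisper's API.

```python
# Sketch of Whisper-style decoder conditioning: special tokens prepended
# to the decoder input select the language and task before any text is
# autoregressively generated.

def decoder_prefix(language, task, with_timestamps=True):
    """Build the special-token prefix that tells the decoder what to do."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not with_timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Transcribe Korean audio in Korean:
print(decoder_prefix("ko", "transcribe"))
# Translate Korean audio into English text, without timestamps:
print(decoder_prefix("ko", "translate", with_timestamps=False))
```

Because the task is expressed as input tokens rather than separate model heads, one set of weights serves every language/task combination.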

Examples of use

  • Global services: when ChatGPT receives voice input from around the world, it automatically transcribes or translates according to the language
  • Automatic meeting notes: real-time transcript generation with a single model, even in multilingual meetings

Representative Claim 1

  1. A system comprising:
    at least one memory storing instructions; and
    at least one processor configured to execute the instructions to perform operations for multi-language, multi-task speech recognition, the operations comprising:
    accessing a transformer model...
    generating an output transcript from an input audio segment using the transformer model, the generating including:
    initializing a decoder input with a language token corresponding to a first language;
    augmenting the decoder input with a task token; and
    autoregressively extending the decoder input with a first timestamp token...

(10) Systems and Methods for Hierarchical Text-Conditional Image Generation

Purpose

  • Text-conditional image generation models such as DALL·E use a hierarchical structure that enables the gradual generation of high-resolution images
  • Text input → low-resolution base image → upsampling is performed sequentially to obtain detailed results

Core solutions

  • Multiple submodels
    • Low resolution image generation (1st submodel)
    • High resolution upsampling (2nd submodel)
  • Precisely correct noise or low-quality images with Blind Super Resolution (BSR)
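The staged low-resolution-to-high-resolution flow above can be sketched with stub sub-models. Both stages are hypothetical stand-ins (a constant-fill "base model" and nearest-neighbor upsampling); only the pipeline shape, not the generative modeling, is illustrated.

```python
# Toy sketch of the hierarchical pipeline: a base model produces a small
# image, then upsampler stages grow its resolution step by step.

def base_model(prompt, size=4):
    """Stub 'first sub-model': produce a low-resolution image."""
    return [[len(prompt) % 10] * size for _ in range(size)]

def upsample(image):
    """Stub upsampler sub-model: nearest-neighbor 2x upsampling."""
    return [
        [px for px in row for _ in (0, 1)]
        for row in image for _ in (0, 1)
    ]

draft = base_model("a cat in the style of Van Gogh")
final = upsample(upsample(draft))   # two upsampler stages, as in the claim
print(len(draft), len(final))       # side length grows 4 -> 16
```

Splitting generation this way lets each sub-model specialize: the base model handles semantics at low resolution, while the upsamplers only have to add plausible detail.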

Examples of use

  • DALL·E: "Draw a cat in the style of Van Gogh" → creates a low-resolution draft → completes a crisp image during the upsampling phase
  • High quality advertising/marketing images: Use only text ideas to extract professional-level detailed images and use them in brand marketing

Representative Claim 1

  1. A system comprising:
    ...
    inputting at least one of the text description or the text embedding into a first sub-model configured to generate, based on at least one of the text description or the text embedding, an image embedding;
    inputting at least one of the text description or the image embedding... into a second sub-model configured to generate an output image;
    wherein the second sub-model includes a first upsampler model and a second upsampler model...
    the second upsampler model having been trained on images processed with blind super resolution (BSR) degradation; and
    making the output image accessible to a device...

(11) Schema-based Integration of External APIs with Natural Language Applications (registration number: 11922144)

Purpose

  • Technology that easily integrates external APIs in the ChatGPT plugin style, similar to patent (4)
  • The model automatically understands and performs the API call process via the manifest, enhancing functional scalability

Core solutions

  • Manifest customization: Third-party API providers define authentication, endpoints, and response formats as JSON schemas
  • The model calls the API based on the user's request → summarizes the response back in natural language

Examples of use

  • ChatGPT Plugin Ecosystem: Added various plug-ins such as itinerary management, travel reservations, bank transactions, and online shopping
  • Enterprise internal system integration: Connect to in-house ERP and CRM to interactively query specific in-house data

Representative Claim 1

  1. A computer-implemented method for integrating a particular external application programming interface (API) with a natural language model user interface, comprising:
    receiving a first input at the natural language model user interface communicably connected to a natural language model that is configured to call one or more functions based on a manifest...
    determining that the first input includes a request to integrate the particular external API...
    identifying the particular external API based on the received first input;
    integrating the particular external API with the natural language model user interface, the integrating comprising accessing the particular external API, a description of the web API, and the manifest...
    calling the particular external API based on the first input or a second input...
    providing, based on the calling, a response message to the natural language model user interface, the response message including a result of the calling...

(12) Systems and Methods for Language Model-based Text Editing

Purpose

  • Streamline the content creation and editing process by using OpenAI language models such as ChatGPT to automatically revise and edit document drafts
  • Changes are applied immediately when the user says "make this paragraph more concise" or "change it to a business style"

Core solutions

  • Prompt-based editing: specify a region and an editing style (formal, friendly, etc.) and the model replaces the selected text
  • If the user is unhappy with the result, they can issue a follow-up command to re-edit (iterative editing)
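The iterative-editing loop can be sketched with a stub model. The named styles and string transforms are hypothetical stand-ins for a language model's rewrite; only the select-edit-re-edit loop structure is illustrated.

```python
# Sketch of the iterative-editing loop: a stub "model" applies a named
# style transform to a selected span, and the user can issue follow-up
# instructions on the previous result. Styles are illustrative.

STYLES = {
    "concise": lambda s: s.split(",")[0] + ".",  # keep the first clause
    "upper": lambda s: s.upper(),
}

def edit(text, span, instruction):
    """Replace `span` within `text` according to `instruction`."""
    return text.replace(span, STYLES[instruction](span))

draft = "Our product is great, and also it is very good and nice"
v1 = edit(draft, draft, "concise")   # first instruction
v2 = edit(v1, v1, "upper")           # follow-up instruction on the result
print(v2)
```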

Examples of use

  • Blog posts: ChatGPT compresses long drafts into concise form or automatically inserts SEO keywords
  • Business email: the model generates and polishes sentences to a send-ready level without a template

Representative Claim 1

  1. A system comprising:
    ...
    receiving an input text prompt, wherein the input text prompt may comprise a null set;
    receiving one or more user instructions;
    determining a set of model parameters based on the one or more user instructions;
    accessing a language model...
    generating, using the language model, an output text based on the input text prompt and the one or more user instructions...
    editing the output text based on the language model and one or more new user instructions by regenerating at least a portion of the output text; and
    optimizing the language model...

(13) Systems and Methods for Language Model-based Text Insertion

Purpose

  • The model automatically inserts sentences into the middle of an existing document or conversation, naturally improving its completeness
  • ChatGPT recognizes a specific section and adds the necessary information or example sentences to enrich the text

Core solutions

  • Context analysis: the model examines the sentences before and after the insertion point so that the inserted content matches the overall flow
  • "Give me an example here" → the model reinforces the completeness of the whole article by creating relevant examples or analogies on the fly
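The prefix/suffix structure in the claim can be sketched as a fill-in-the-middle splice. The proposal function is a stub that returns fixed text; a real model would condition on both the prefix and the suffix to generate the middle.

```python
# Sketch of fill-in-the-middle insertion: the document is split into a
# prefix and suffix around the cursor, a (stubbed) model proposes the
# middle, and the result is spliced back together.

def propose_insertion(prefix, suffix):
    """Stub model: a real one conditions on both sides of the gap."""
    return " For example, consider a simple case."

def insert_at(text, cursor):
    """Split at `cursor`, generate the middle, and reassemble the text."""
    prefix, suffix = text[:cursor], text[cursor:]
    return prefix + propose_insertion(prefix, suffix) + suffix

doc = "This holds in general. The proof follows."
cursor = doc.index(" The proof")
print(insert_at(doc, cursor))
```

Conditioning on the suffix as well as the prefix is what separates insertion from ordinary left-to-right completion: the generated middle has to flow into the text that already follows it.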

Examples of use

  • Writing a thesis/report: Inserting additional explanations or cases between chapters.
  • Novel/scenario: The model actively fills in necessary parts such as dialogue or setting explanations.

Representative Claim 1

  1. A system comprising:

    receiving an input text prompt comprising a prefix portion and a suffix portion;
    determining a set of model parameters based on the input text prompt;
    accessing a language model...
    determining a set of context parameters based on the input text prompt and the language model, the set of context parameters comprising at least one of location, person, time, or event;
    generating language model output text based on the set of context parameters and the language model;
    inserting the language model output text into the input text prompt...
    optimizing the accessed language model...

(14) Systems and Methods for Using Contrastive Pre-training to Generate Text and Code Embeddings

Purpose

Create embeddings that efficiently calculate semantic similarity between text and code for use in search, recommendations, and document classification. For instance, the system can find code similar to a given function description or match two texts whose meanings are closely related.

Core solutions

  • Contrastive Learning: Trains by pulling vectors for similar samples closer and pushing vectors for dissimilar samples apart.
  • OpenAI leverages models that learn text and code simultaneously (e.g., Codex, CLIP) to enhance search and recommendation systems.
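The retrieval side of embedding-based search can be sketched as follows. The "embedder" here is a trivial character-frequency stub, not a contrastively trained model; it only illustrates how queries and documents are compared in a shared vector space by cosine similarity.

```python
# A toy sketch of embedding-based retrieval: texts are embedded (by a
# deliberately trivial bag-of-characters stub) and ranked by cosine
# similarity. Contrastive pre-training would replace the stub with a
# model trained so positive pairs score high and negatives score low.
import math
from collections import Counter

def embed(text):
    """Stub embedder: character-frequency vector over a-z."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = "sort a list"
docs = ["def sort_list(xs): return sorted(xs)", "def add(a, b): return a + b"]
scores = [cosine(embed(query), embed(d)) for d in docs]
print(docs[scores.index(max(scores))])
```

Because both text and code are mapped into the same space, a natural-language query can retrieve a matching code snippet directly, which is the use case the patent targets.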

Examples of use

  • ChatGPT Plugin: When a user describes a specific piece of code, the model compares embeddings and recommends the best-matching library function.
  • Document Search Engine: Converts FAQ data, customer queries, etc. into meaningful vector representations for faster retrieval.

Representative Claim 1
A computer-implemented method for generating a semantic similarity based on a vector representation, the method comprising:

  • receiving a training data set extracted from unlabeled data, the training data set including a plurality of paired data samples corresponding to positive example pairs…
  • converting the paired data samples corresponding to the positive example pairs into at least one first vector of a vector representation;
  • accessing one or more negative example pairs…
  • converting the one or more negative example pairs into one or more second vectors…
  • training a machine learning model to generate additional vectors of the vector representation, wherein the training comprises:
    • initializing the machine learning model with one or more pre-trained models…
    • training the machine learning model using contrastive training…
  • receiving a query for semantic similarity…
  • generating, with the machine learning model and according to an embedding space, a semantic similarity result in response to the query.

These patents serve as the foundational technologies behind most of OpenAI’s services, including ChatGPT, DALL·E, Whisper, and Codex. Spanning a broad spectrum—from conversational UIs to image generation, speech recognition, code automation, and external API integration—each patent interconnects seamlessly to form what can be described as an “AI platform ecosystem.”

3. Patent Trends and Implications

  1. Multimodal AI and UI/UX Innovation
    Technology capable of simultaneously processing multiple forms of data—images, text, audio, and video—is becoming widespread, and UI/UX is growing more intuitive and enriched.
  2. Expansion of LLMs: Code Generation, Translation, and Editing
    Automatic code generation, document editing, and annotation can greatly automate and enhance the workload of developers and documentation teams. This leads to improved productivity, reduced labor costs, and shortened time-to-market for businesses.
  3. Platformization via External API Integration
    AI models now directly call various third-party APIs to expand functionality. As conversational AI services evolve into platforms, these third-party APIs can be seamlessly integrated like plug-ins.
  4. Enhanced Training Efficiency and Precision
    Contrastive learning, pseudo-labeling, and other techniques help make optimal use of unlabeled data, thereby boosting accuracy and versatility. Large-scale language and vision models are becoming more refined, enabling high performance even in zero-shot or few-shot scenarios.

4. Conclusion

OpenAI's patent portfolio illustrates the future direction of AI, encompassing multimodal systems, large language models (LLMs), external API integration, code automation, high-resolution image generation, and speech recognition. Leading companies already have extensive patent strategies to protect and expand their proprietary technologies. Pine IP Firm offers professional services in AI and software patent strategy, including patent specification drafting, filing, infringement response, and dispute resolution. As competition in AI intensifies, intellectual property rights are essential for safeguarding technological innovation. Like OpenAI, whose early patent acquisition strategy underpins large-scale investment and exclusive technology, your company should also establish a robust patent strategy. If you want to firmly protect your unique AI technology, we invite you to collaborate with Pine IP Firm to develop a step-by-step plan.