Creating a custom model for GitHub Copilot

Note

Custom models for GitHub Copilot Enterprise are in public preview and are subject to change. During the public preview, there is no additional cost to Copilot Enterprise customers enrolled in the public preview for creating or using a custom model.

Prerequisite

The code on which you want to train a custom model must be hosted in repositories owned by your organization on GitHub.

Limitations

For the public preview, an enterprise can deploy one custom model in a single organization.
Code completion suggestions based on the custom model are only available to managed users who get a Copilot Enterprise subscription from the organization in which the custom model is deployed. For more information, see About Enterprise Managed Users.
The custom model is not used for code suggested in responses by GitHub Copilot Chat.

About Copilot custom models

By default, GitHub Copilot uses a large language model that has been trained on a large number of public code repositories, so that it can provide code completion for a wide range of programming languages in many different contexts. You can use this model as the basis for creating a custom large language model that you train specifically on your own code. This process is often known as fine-tuning.

By creating a custom model, you enable GitHub Copilot to show you code completion suggestions that are:

Based on code in your own designated repositories.
Created for proprietary or less publicly represented programming languages.
Tailored according to your organization's coding style and guidelines.

This provides:

Personalization - Copilot has a detailed knowledge of your codebase, including available modules, functions, and internal libraries. A custom model may be particularly beneficial if your code is not typical of the wide range of code used to train the base model.
Efficiency and quality - Copilot is better equipped to help you write code faster and with fewer errors.
Privacy - The custom model’s training process, hosting and inferencing are secure and private to your organization. Your data always remains yours, is never used to train another customer’s model, and your custom model is never shared.

About model creation

Currently, in the public preview, only one organization in an enterprise is permitted to create a custom model.

As an owner of the organization that's permitted to create a custom model, you can choose which of your organization's repositories to use to train the model. You can train the model on one, several, or all of the repositories in the organization. The model is trained on the content of the default branches of the selected repositories. Optionally, you can specify that only code written in certain programming languages should be used for training. The custom model will be used for generating code completion suggestions in all file types, irrespective of whether that type of file was used for training.

You can also choose whether telemetry data (such as the prompts entered by users and the suggestions generated by Copilot) should be used when training the model. For more information, see Telemetry data collection and usage for custom models, later in this article.

Once initiated, custom model creation will take many hours to complete. You can check the progress of the training in your organization's settings. When model creation completes, or if it fails to complete, the person who initiated the model training will be notified by email.

If model creation fails, Copilot will continue to use the current model for generating code completion suggestions.

About model usage

As soon as the custom model is successfully created, all managed users in your enterprise who get Copilot Enterprise access from the organization in which the custom model is deployed will start to see Copilot code completion suggestions that are generated using the custom model. The custom model will always be used for any code these users edit, irrespective of where the code resides. Users cannot choose which model is used to generate the code completion suggestions they see.

When you can benefit from a custom model

The value of a custom model is most pronounced in environments with:

Proprietary or less publicly represented programming languages
Internal libraries or custom frameworks
Custom standards and company-specific coding practices

However, even in standardized environments, fine-tuning offers an opportunity to align Copilot code completion more closely with your organization’s established coding practices and standards.

Assess the effectiveness of a custom model

While some coding environments are more likely to benefit from fine-tuning, there is no guaranteed correlation between specific behaviors in a codebase and the quality of the results you get from a custom model. It is advisable to assess the use and satisfaction levels of GitHub Copilot code completion suggestions before and after the implementation of a custom model.

Use the GitHub API to assess the usage of GitHub Copilot. See REST API endpoints for GitHub Copilot usage metrics.
Survey developers to assess their level of satisfaction with GitHub Copilot code completion suggestions.

Comparing results from the API and developer survey, from before and after the implementation of a custom model, will give you an indication of the effectiveness of the custom model.
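
As a rough illustration of the before-and-after comparison, the following Python sketch computes an acceptance rate from two sets of daily usage records. The field names `total_suggestions_count` and `total_acceptances_count` are assumptions about the shape of the records, and the function names are hypothetical. Check the REST API documentation for the fields that the usage metrics endpoint actually returns.

```python
from typing import Iterable, Mapping, Optional


def acceptance_rate(records: Iterable[Mapping[str, int]]) -> Optional[float]:
    """Return accepted / suggested across all records, or None if nothing was suggested.

    The two field names below are assumed, not confirmed by this article.
    """
    suggested = 0
    accepted = 0
    for record in records:
        suggested += record.get("total_suggestions_count", 0)
        accepted += record.get("total_acceptances_count", 0)
    if suggested == 0:
        return None
    return accepted / suggested


def compare_periods(
    before: Iterable[Mapping[str, int]],
    after: Iterable[Mapping[str, int]],
) -> dict:
    """Compare acceptance rates from before and after a custom model was deployed."""
    rate_before = acceptance_rate(before)
    rate_after = acceptance_rate(after)
    if rate_before is None or rate_after is None:
        change = None  # not enough data in one of the periods
    else:
        change = rate_after - rate_before
    return {"before": rate_before, "after": rate_after, "change": change}
```

A positive `change` value suggests that developers accepted a larger share of suggestions after the custom model was deployed. Treat it as one signal alongside the developer survey, not as proof of effectiveness.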

Creating a custom model

You can use your organization settings to create a custom large language model.

In the upper-right corner of GitHub, select your profile photo, then click Your organizations.
Next to the organization, click Settings.
In the left sidebar, click Copilot then click Custom model.
On the "Custom models" page, click Train a new custom model.
Under "Select repositories," choose either Selected repositories or All repositories.
If you chose Selected repositories, select the repositories you want to use for training then click Apply.
Optionally, if you want to train your model only on code written in certain programming languages, under "Specify languages," start typing the name of a language you want to include. Select the required language from the list that's displayed. Repeat the process for each language you want to include.
To improve the performance of your model, select the checkbox labeled Include data from prompts and suggestions.

Note

If the checkbox isn't available to select, this indicates that the Telemetry data collection policy for custom models has been disabled in your organization's settings. For information on how to change policies for your organization, see Managing policies for Copilot in your organization.

By selecting this option, you allow Copilot to collect data for prompts that users submitted and the code completion suggestions that were generated. Once sufficient data has been collected, Copilot will use this as part of the model training process, allowing it to produce a more effective model.

For more information, see Telemetry data collection and usage for custom models, later in this article.
Click Create new custom model.

Checking the progress of model creation

You can check in your organization settings for an indication of how model creation is progressing.

Go to your organization's settings for Copilot custom models. See Creating a custom model above.
The first time you train a model, the page that's displayed shows the training results.

If this is not the first training, the current and previous training attempts are listed. To see details of the current training process, click the first ellipsis button (...), then click Training details.

Reasons for training failure

Model training may fail for a variety of reasons, including:

Not enough data or non-representative data. Lack of data provided for training, or too much replication in the data, may make the fine-tuning unstable.
Non-differentiated data. If the data is not sufficiently different from the public data on which the base model was trained, training may fail or the quality of code completion suggestions from the custom model may be only marginally improved.
Unexpected file types or formats. A data preprocessing step may encounter unexpected file types and formats, which can cause it to fail. A solution may be to specify only certain file types for training.
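
Before restricting training to certain languages, it can help to see which file types a repository actually contains. The following Python sketch counts files by extension in a local clone. It is not part of Copilot, and the function name `count_file_types` is hypothetical. It assumes you have already cloned the default branch of the repository to a local directory.

```python
from collections import Counter
from pathlib import Path


def count_file_types(repo_root: str) -> Counter:
    """Count files by lowercase extension under repo_root, skipping the .git directory."""
    root = Path(repo_root)
    counts: Counter = Counter()
    for path in root.rglob("*"):
        # Skip Git's internal metadata, which is not part of the source code.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Calling `count_file_types("path/to/clone").most_common()` lists the most frequent extensions first. Rare or binary extensions in the output may point to files worth excluding by selecting specific languages under "Specify languages."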

Retraining or deleting the custom model

As an organization owner, you can update or delete the custom model from your organization's settings page.

Retraining the model updates it to include any new code that has been added to the repositories you selected for training. You can retrain the model once a week.

Go to your organization's settings for Copilot custom models. See Creating a custom model above.
On the model training page, click the first ellipsis button (...), then click either Retrain model or Delete model.

If you retrain the model, Copilot will continue to use the current model to generate code completion suggestions until the new model is ready. Once the new model is ready, it will automatically be used for code completion suggestions for all managed users who get a Copilot Enterprise subscription from the organization.

If you delete the custom model, Copilot will use the base model for generating code completion suggestions for all users who get a Copilot subscription from the organization.
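
If you plan retraining on a schedule, the weekly limit can be sketched as simple date arithmetic. This Python example assumes that "once a week" means at least seven days since the previous training. The article does not say exactly how the limit is enforced, and the names used here are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "once a week" is read as a minimum gap of seven days.
RETRAIN_INTERVAL = timedelta(days=7)


def next_retrain_date(last_trained: date) -> date:
    """Return the earliest date on which another retraining could be started."""
    return last_trained + RETRAIN_INTERVAL


def can_retrain(last_trained: date, today: date) -> bool:
    """Return True if the assumed weekly interval has passed."""
    return today >= next_retrain_date(last_trained)
```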

Telemetry data collection and usage for custom models

When you create a custom model, you can choose to allow GitHub to collect telemetry data for the purposes of training the model. This data is used to improve the quality of the code completion suggestions the model can generate.

What telemetry data is collected?

Prompts: This includes all the information sent to the GitHub Copilot language model by the Copilot extension, including context from your open files.
Suggestions: The code completion suggestions that Copilot generates.
Code snippet: A snapshot of the code 30 seconds after a suggestion is accepted, capturing how the suggestion was integrated into the codebase. This helps determine whether the suggestion was accepted as is or modified by the user before final integration.
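
To make the idea of comparing a suggestion with a later snapshot more concrete, here is a minimal Python sketch. It is an illustration only and does not describe how GitHub performs this comparison. The function names, the `not_found` label, and the 0.5 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def suggestion_coverage(suggestion: str, snapshot: str) -> float:
    """Return the fraction of the suggestion's characters that also appear, in order, in the snapshot."""
    if not suggestion:
        return 0.0
    matcher = SequenceMatcher(None, suggestion, snapshot, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(suggestion)


def classify_snapshot(suggestion: str, snapshot: str, threshold: float = 0.5) -> str:
    """Classify a snapshot as 'as_is', 'modified', or 'not_found' (hypothetical labels)."""
    if suggestion and suggestion in snapshot:
        return "as_is"  # the suggestion text survives unchanged
    if suggestion_coverage(suggestion, snapshot) >= threshold:
        return "modified"  # most of the suggestion is still present, with edits
    return "not_found"  # little or none of the suggestion remains
```

For example, a suggestion of `total = price * qty` compared with a snapshot containing `total = price * quantity` is classified as `modified`, because the text is no longer an exact match but most of it is still present.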

How is telemetry data used?

Telemetry data is primarily used to fine-tune the Copilot custom model to better understand and predict your organization’s coding patterns. Specifically, it helps:

Enhance model accuracy: By analyzing the collected telemetry, Copilot refines your custom model to increase the relevance and accuracy of future coding suggestions.
Monitor performance: Telemetry data allows GitHub to monitor how well custom models are performing compared to the base model, enabling ongoing improvements.
Feedback loops: The data helps GitHub create feedback loops where the model learns from real-world usage, adapting to your specific coding environment over time.

Data storage and retention

Data storage: All telemetry data collected is stored in the Copilot Data Store, a secure and restricted environment. The data is encrypted and isolated to prevent unauthorized access.
Retention period: Telemetry data is retained for a rolling 28-day period. After this period, the data is automatically deleted from GitHub's systems, ensuring that only recent and relevant data is used for model training and improvement.
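
The rolling retention window can be illustrated with a short Python sketch. It assumes that data collected exactly 28 days ago has already been deleted. The article does not specify this boundary, and the function name is hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 28  # rolling retention period stated above


def is_within_retention(
    collected_on: date,
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> bool:
    """Return True if data collected on collected_on would still be retained on today.

    Assumption: data is kept for ages from 0 up to, but not including, retention_days.
    """
    age = today - collected_on
    return timedelta(0) <= age < timedelta(days=retention_days)
```

Under this assumption, on October 29 data collected on October 2 (27 days earlier) is still retained, while data collected on October 1 (28 days earlier) is not.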

Privacy and data security

GitHub is committed to ensuring that your organization’s data remains private and secure.

Exclusive use: The telemetry data collected from your organization is used exclusively for training your custom model and is never shared with other organizations or used to train other customers’ models.
Data leakage prevention: GitHub implements strict data isolation protocols to prevent cross-contamination between different organizations’ data. This means that your proprietary code and information are protected from exposure to other organizations or individuals.

Important considerations

Opt-in for telemetry: Participation in telemetry data collection is optional and controlled via your organization’s admin policies. Telemetry data is only collected when explicitly enabled for training custom models.
Potential risks: Although GitHub takes extensive measures to prevent data leakage, there are scenarios where sensitive data, such as internal links or names, could be included in the telemetry and subsequently used in training. We recommend reviewing and filtering the data you submit for training to minimize these risks.

For more details about our data-handling practices, see the GitHub Copilot Trust Center or review GitHub’s data protection agreement.
