
How To Fine-Tune Segment Anything

April 13, 2023
|
10 mins

Computer vision is having its ChatGPT moment with the release of the Segment Anything Model (SAM) by Meta last week. Trained on over 1 billion segmentation masks, SAM is a foundation model for predictive AI use cases rather than generative AI. While it has shown an incredible amount of flexibility in its ability to segment across wide-ranging image modalities and problem spaces, it was released without “fine-tuning” functionality.

This tutorial will outline some of the key steps to fine-tune SAM by training its mask decoder, particularly describing which functions from SAM to use to pre- and post-process the data so that it's in good shape for fine-tuning.


What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a segmentation model developed by Meta AI. It is considered the first foundation model for Computer Vision. SAM was trained on a huge corpus of data containing 11 million images and over 1 billion masks, making it extremely powerful. As its name suggests, SAM is able to produce accurate segmentation masks for a wide variety of images. SAM’s design allows it to take human prompts into account, making it particularly powerful for human-in-the-loop annotation. These prompts can be multi-modal: they can be points on the area to be segmented, a bounding box around the object to be segmented, or a text prompt about what should be segmented.

The model is structured into 3 components: an image encoder, a prompt encoder, and a mask decoder.

Image displaying the foundation model architecture for the Segment Anything (SA) model

Source

The image encoder generates an embedding for the image being segmented, whilst the prompt encoder generates an embedding for the prompts. The image encoder is a particularly large component of the model. This is in contrast to the lightweight mask decoder, which predicts segmentation masks based on the embeddings. Meta AI has made the weights and biases of the model trained on the Segment Anything 1 Billion Mask (SA-1B) dataset available as a model checkpoint.

Learn more about how Segment Anything works in our explainer blog post Segment Anything Model (SAM) Explained.

What is Model Fine-Tuning?

Publicly available state-of-the-art models have a custom architecture and are typically supplied with pre-trained model weights. If these architectures were supplied without weights, users would have to train the models from scratch on massive datasets to obtain state-of-the-art performance.

Model fine-tuning is the process of taking a pre-trained model (architecture+weights) and showing it data for a particular use case. This will typically be data that the model hasn’t seen before, or that is underrepresented in its original training dataset.

The difference between fine-tuning the model and starting from scratch is the starting value of the weights and biases. If we were training from scratch, these would be randomly initialized according to some strategy. In such a starting configuration, the model would ‘know nothing’ of the task at hand and perform poorly. By using pre-existing weights and biases as a starting point we can ‘fine tune’ the weights and biases so that our model works better on our custom dataset. For example, the information learned to recognize cats (edge detection, counting paws) will be useful for recognizing dogs.

Why Would I Fine-Tune a Model?

The purpose of fine-tuning a model is to obtain higher performance on data that the pre-trained model has not seen before. For example, an image segmentation model trained on a broad corpus of data gathered from phone cameras will have mostly seen images from a horizontal perspective.

If we tried to use this model for satellite imagery taken from a vertical perspective, it may not perform as well. If we were trying to segment rooftops, the model may not yield the best results. The pre-training is useful because the model will have learned how to segment objects in general, so we want to take advantage of this starting point to build a model that can accurately segment rooftops. Furthermore, it is likely that our custom dataset would not have millions of examples, so we want to fine-tune instead of training the model from scratch.

Fine tuning is desirable so that we can obtain better performance on our specific use case, without having to incur the computational cost of training a model from scratch.

How to Fine-Tune Segment Anything Model [With Code]

Background & Architecture

We gave an overview of the SAM architecture in the introduction section. The image encoder has a complex architecture with many parameters. In order to fine-tune the model, it makes sense for us to focus on the mask decoder which is lightweight and therefore easier, faster, and more memory efficient to fine-tune.

In order to fine-tune SAM, we need to extract the underlying pieces of its architecture (image and prompt encoders, mask decoder). We cannot use SamPredictor.predict (link) for two reasons:

  • We want to fine-tune only the mask decoder
  • This function calls SamPredictor.predict_torch, which has the @torch.no_grad() decorator (link) and therefore prevents us from computing gradients

Thus, we need to examine the SamPredictor.predict function and call the appropriate functions with gradient calculation enabled on the part we want to fine-tune (the mask decoder). Doing this is also a good way to learn more about how SAM works.
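
Since only the mask decoder's parameters will be handed to the optimizer (and the encoders will later be run under torch.no_grad()), nothing extra is strictly required, but an explicit safeguard is to freeze the encoder weights. A minimal sketch, assuming sam_model is the SAM model loaded in the Training Setup section below:

# Optional: freeze everything except the mask decoder so only it can receive gradient updates
for param in sam_model.image_encoder.parameters():
    param.requires_grad_(False)
for param in sam_model.prompt_encoder.parameters():
    param.requires_grad_(False)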

Creating a Custom Dataset

We need three things to fine-tune our model:

  • Images on which to draw segmentations
  • Segmentation ground truth masks
  • Prompts to feed into the model

We chose the stamp verification dataset (link) since it has data that SAM may not have seen in its training (i.e., stamps on documents). We can verify that it performs well, but not perfectly, on this dataset by running inference with the pre-trained weights. The ground truth masks are also extremely precise, which will allow us to calculate accurate losses. Finally, this dataset contains bounding boxes around the segmentation masks, which we can use as prompts to SAM. An example image is shown below. These bounding boxes align well with the workflow that a human annotator would go through when looking to generate segmentations.

Image of stamp with bounding box
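
The stamp dataset already ships with these bounding boxes. If your own dataset only provides ground truth masks, a box prompt can be derived from each mask with a few lines of numpy. A minimal sketch, assuming gt_mask is a binary HxW array:

import numpy as np

def bbox_from_mask(gt_mask: np.ndarray) -> np.ndarray:
    # Return an [x_min, y_min, x_max, y_max] box around the non-zero region of the mask
    ys, xs = np.nonzero(gt_mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])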

Input Data Preprocessing

We need to preprocess the scans from numpy arrays to pytorch tensors. To do this, we can follow what happens inside SamPredictor.set_image (link) and SamPredictor.set_torch_image (link), which preprocess the image. First, we can use utils.transform.ResizeLongestSide to resize the image, as this is the transform used inside the predictor (link). We can then convert the image to a pytorch tensor and use the SAM preprocess method (link) to finish preprocessing.
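
As a sketch of those steps, assuming image is an HxWx3 RGB numpy array, device is the training device set up below, and sam_model is the SAM model loaded in the next section:

import torch
from segment_anything.utils.transforms import ResizeLongestSide

transform = ResizeLongestSide(sam_model.image_encoder.img_size)

input_image = transform.apply_image(image)  # resize so the longest side matches the encoder input
input_image_torch = torch.as_tensor(input_image, device=device)
transformed_image = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :]

input_image = sam_model.preprocess(transformed_image)  # normalize pixel values and pad to a square
original_image_size = image.shape[:2]
input_size = tuple(transformed_image.shape[-2:])

We will reuse input_image, input_size, and original_image_size in the training loop below.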

Training Setup

We download the model checkpoint for the vit_b model and load it in:

sam_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')

We can set up an Adam optimizer with defaults and specify that the parameters to tune are those of the mask decoder:

optimizer = torch.optim.Adam(sam_model.mask_decoder.parameters()) 

At the same time, we can set up our loss function, for example, Mean Squared Error:

loss_fn = torch.nn.MSELoss()

Training Loop

In the main training loop, we will be iterating through our data items, generating masks, and comparing them to our ground truth masks so that we can optimize the model parameters based on the loss function.

In this example, we used a GPU for training since it is much faster than using a CPU. It is important to use .to(device) on the appropriate tensors to make sure that we don’t have certain tensors on the CPU and others on the GPU.
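
A minimal device setup, assuming a CUDA GPU when one is available, might look like this:

device = "cuda" if torch.cuda.is_available() else "cpu"
sam_model.to(device)  # move all model weights to the chosen device
sam_model.train()     # put the model in training mode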

We embed the images by wrapping the encoder in the torch.no_grad() context manager: we are not looking to fine-tune the image encoder, and computing gradients for it would otherwise cause memory issues.

with torch.no_grad():
    image_embedding = sam_model.image_encoder(input_image)

We can also generate the prompt embeddings within the no_grad context manager. We use our bounding box coordinates, converted to pytorch tensors.
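
As a sketch of that conversion, assuming bbox is the ground truth [x_min, y_min, x_max, y_max] box as a numpy array, and that transform, original_image_size, and device come from the preprocessing and device setup above:

box = transform.apply_boxes(bbox, original_image_size)  # rescale the box to the resized image
box_torch = torch.as_tensor(box, dtype=torch.float, device=device)
box_torch = box_torch[None, :]  # add a batch dimension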

with torch.no_grad():
    sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
        points=None,
        boxes=box_torch,
        masks=None,
    )

Finally, we can generate the masks. Note that here we are in single mask generation mode (in contrast to the 3 masks that are normally output).

low_res_masks, iou_predictions = sam_model.mask_decoder(
  image_embeddings=image_embedding,
  image_pe=sam_model.prompt_encoder.get_dense_pe(),
  sparse_prompt_embeddings=sparse_embeddings,
  dense_prompt_embeddings=dense_embeddings,
  multimask_output=False,
)

The final step here is to upscale the masks back to the original image size since they are low resolution. We can use Sam.postprocess_masks to achieve this. We will also want to generate binary masks from the predicted masks so that we can compare these to our ground truths. It is important to use torch functionals in order to not break backpropagation.

upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)

from torch.nn.functional import threshold, normalize

# threshold sets non-positive mask logits to zero; normalize over the singleton mask
# channel then maps the remaining positive values to one, producing a binary mask
binary_mask = normalize(threshold(upscaled_masks, 0.0, 0)).to(device)

Finally, we can calculate the loss and run an optimization step:

loss = loss_fn(binary_mask, gt_binary_mask)
optimizer.zero_grad()
loss.backward()
optimizer.step()

By repeating this over a number of epochs and batches we can fine-tune the SAM decoder.
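
Putting the pieces together, the outer loop might look like the following sketch. The train_loader here is hypothetical; it stands in for whatever iterable yields the preprocessed image tensors, box prompts, ground truth binary masks, and size metadata prepared earlier:

num_epochs = 10  # illustrative value; tune for your dataset

for epoch in range(num_epochs):
    epoch_losses = []
    for input_image, box_torch, gt_binary_mask, input_size, original_image_size in train_loader:
        # encoders run without gradients; only the mask decoder is being tuned
        with torch.no_grad():
            image_embedding = sam_model.image_encoder(input_image)
            sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
                points=None, boxes=box_torch, masks=None,
            )

        low_res_masks, iou_predictions = sam_model.mask_decoder(
            image_embeddings=image_embedding,
            image_pe=sam_model.prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse_embeddings,
            dense_prompt_embeddings=dense_embeddings,
            multimask_output=False,
        )

        upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)
        binary_mask = normalize(threshold(upscaled_masks, 0.0, 0))

        loss = loss_fn(binary_mask, gt_binary_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())

    print(f'Epoch {epoch}: mean loss = {sum(epoch_losses) / len(epoch_losses):.4f}')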

Saving Checkpoints and Starting a Model from it

Once we are done with training and satisfied with the performance uplift, we can save the state dict of the tuned model using:

torch.save(sam_model.state_dict(), PATH)

We can then load this state dict when we want to perform inference on data that is similar to the data we used to fine-tune the model.
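
A sketch of that loading step, assuming PATH is the file saved above:

from segment_anything import sam_model_registry, SamPredictor

# rebuild the vit_b architecture, then load the fine-tuned weights
sam_model_tuned = sam_model_registry['vit_b']()
sam_model_tuned.load_state_dict(torch.load(PATH))
sam_model_tuned.to(device)

# the tuned model can be used through the usual SamPredictor interface
predictor = SamPredictor(sam_model_tuned)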

You can find the Colab Notebook with all the code you need to fine-tune SAM here. Keep reading if you want a fully working solution out of the box!

Fine-Tuning for Downstream Applications

While SAM does not currently offer fine-tuning out of the box, we are building a custom fine-tuner integrated with the Encord platform. As shown in this post, we fine-tune the decoder in order to achieve this. This is available as an out-of-the-box one-click procedure in the web app, where the hyperparameters are automatically set.

Image displaying training the Segment Anything Model (SAM) in the Encord platform

Original vanilla SAM mask:

Image of the original vanilla SAM mask

Mask generated by fine-tuned version of the model:

Image of the mask generated by the fine tuned version of the model

We can see that this mask is tighter than the original mask. This was the result of fine-tuning on a small subset of images from the stamp verification dataset, and then running the tuned model on a previously unseen example. With further training and more examples, we could obtain even better results.

🔥 NEW RELEASE: We released TTI-Eval (text-to-image evaluation), an open-source library for evaluating zero-shot classification models like CLIP and domain-specific ones like BioCLIP against your (or HF) datasets to estimate how well the model will perform. Get started with it on GitHub, and do ⭐️ the repo if it's awesome. 🔥

Conclusion

That's all, folks!

You have now learned how to fine-tune the Segment Anything Model (SAM). If you're looking to fine-tune SAM out of the box, you might also be interested to learn that we have recently released the Segment Anything Model in Encord, allowing you to fine-tune the model without writing any code.

Image of a frog segmented by the Segment Anything Model (SAM) inside the Encord platform

Written by Alexandre Bonnet
Frequently asked questions
  • Yes, SAM can be fine-tuned to adapt its performance to specific tasks or datasets, enhancing its capabilities and accuracy.

  • Fine-tuning in generative AI involves adjusting pre-trained models on new data, refining their parameters to improve performance for specific tasks.

  • Fine-tuning can be challenging due to the need for carefully chosen hyperparameters and potential overfitting to the training data.

  • Implementing a Segment Anything model involves preparing data (the SA-1B dataset used to train SAM is also publicly available), selecting one of the model checkpoints available on GitHub, fine-tuning on specific tasks, and evaluating performance metrics.

  • SAM (Segment Anything Model) is a versatile deep learning model designed for image segmentation tasks. Trained on the SA-1B dataset, featuring over 1 billion masks on 11M images, SAM excels in zero-shot transfers to new tasks and distributions.

  • Segmentation provides detailed object boundaries, offering precise localization compared to detection, which identifies objects but lacks boundary information, making segmentation more suitable for tasks requiring accuracy in object delineation.

  • Utilize the SAM model by fine-tuning it on relevant data for specific segmentation tasks, ensuring proper evaluation and adjustment to achieve desired performance levels.