
4 Reasons Why Computer Vision Models Fail in Production

April 24, 2024
|
8 mins

Here’s a scenario you’ve likely encountered: You spent months building your model, increased your F1 score above 90%, convinced all stakeholders to launch it, and... poof! As soon as your model sees real-world data, its performance drops below what you expected.

This is a common production machine learning (ML) problem for many teams—not just yours. It can also be a very frustrating experience for computer vision (CV) engineers, ML teams, and data scientists.
There are many potential factors behind these failures: the quality of the production data, the design of the production pipelines, the model itself, or operational hurdles the system faces once it is live.

In this article, you will learn four reasons why computer vision models fail in production and examine the ML lifecycle stages where those failures occur. These are the most common production CV and data science problems, and knowing their causes can help you prevent, mitigate, or fix them.

You’ll also see the various strategies for addressing these problems at each step. Let’s jump right into it!


Why do Models Fail in Production?

The ML lifecycle governs how ML models are developed and shipped; it involves sourcing data, data exploration and preparation (data cleaning and EDA), model training, and model deployment, where users can consume the model predictions.

These processes are interdependent, as an error in one stage could affect the corresponding stages, resulting in a model that doesn’t perform well—or completely fails—in production.

Models can fail in production | Encord

Organizations develop machine learning (ML) and artificial intelligence (AI) models to add value to their businesses. When errors occur at any ML development stage, they can lead to production models failing, costing businesses capital, human resources, and opportunities to satisfy customer expectations. 

Consider the implications of poorly labeled data for a CV model after data collection, or of a model with an inherent bias: either can skew the results it produces in a production environment.

Notably, the problem can start even earlier: when businesses do not have precise reasons or objectives for developing and deploying machine learning models, the process can be crippled before it begins.

Assuming the organization has passed all stages and deployed its model, the errors we often see that lead to models failing in production include:

  • Mislabeling data, which trains models on incorrect information.
  • Prioritizing data quality only at later stages rather than treating it as a foundational practice.
  • Ignoring drift in the data distribution over time, which makes models outdated or irrelevant.
  • Implementing minimal or no validation (quality assurance) steps, which lets errors progress unnoticed into production.
  • Viewing model deployment as the final goal and neglecting ongoing monitoring and adjustments.

Let’s look deeper at these errors and why they are the top reasons we see production models fail.

4 reasons your computer vision models fail in production.

Reason #1: Data Labeling Errors

Data labeling is the foundation for training machine learning models, particularly supervised learning, where models learn patterns directly from labeled data. This involves humans or AI systems assigning informative labels to raw data—whether it be images, videos, or DICOM—to provide context that enables models to learn.


AI algorithms also synthesize labeled data. Check out our guide on synthetic data and why it is useful.

Despite its importance, data labeling is prone to errors, primarily because it often relies on human annotators. These errors can compromise a model's accuracy by teaching it incorrect patterns.

Consider a computer vision project that identifies objects in images. Even a small percentage of mislabeled images can lead the model to associate incorrect features with an object, which can mean wrong predictions in production.

Potential Solution: Automated Labeling Error Detection

A potential solution is adopting tools and frameworks that automatically detect labeling errors. These tools analyze labeling patterns to identify outliers or inconsistent labels, helping annotators revise and refine the data. An example is Encord Active.
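
If you want a feel for how automated detection can work in principle, one common heuristic is to flag samples where a trained model confidently disagrees with the assigned label and route them back for review. The sketch below is a generic illustration of that idea (the function name and threshold are made up for this example), not how Encord Active implements it.

```python
import numpy as np

def flag_suspect_labels(pred_probs: np.ndarray, labels: np.ndarray, threshold: float = 0.9):
    """Return indices where the model confidently predicts a class that
    differs from the annotator-assigned label (candidate labeling errors).

    pred_probs: (n_samples, n_classes) predicted class probabilities
    labels:     (n_samples,) integer labels from annotators
    """
    predicted = pred_probs.argmax(axis=1)
    confidence = pred_probs.max(axis=1)
    return np.where((predicted != labels) & (confidence >= threshold))[0]

# Example usage: send the flagged indices back to annotators for review
# suspects = flag_suspect_labels(probs, annotated_labels)
```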

Encord Active is one of three products in the Encord platform (the others are Annotate and Index) that includes features to find failure modes in your data, labels, and model predictions.

A common data labeling issue is border closeness: annotations that lie very close to the image edge. Training data with many border-proximate annotations can lead to poor model generalization.

If a model is frequently exposed to partially visible objects during training, it might not perform well when presented with fully visible objects in a deployment scenario. This can affect the model's accuracy and reliability in production.
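
To make the idea concrete, a bounding box can be treated as border-proximate when any of its edges falls within a small margin of the image boundary. This is a simplified sketch of that check (the margin value is an arbitrary assumption), not Encord Active's exact Border Closeness metric.

```python
def is_border_proximate(box, img_w, img_h, margin_frac=0.01):
    """Return True if a bounding box lies within `margin_frac` of the image border.

    box: (x_min, y_min, x_max, y_max) in pixels
    """
    x_min, y_min, x_max, y_max = box
    margin_x, margin_y = margin_frac * img_w, margin_frac * img_h
    return (
        x_min <= margin_x
        or y_min <= margin_y
        or x_max >= img_w - margin_x
        or y_max >= img_h - margin_y
    )

# Example: flag annotations whose boxes nearly touch the frame edge
# flagged = [a for a in annotations if is_border_proximate(a["bbox"], a["img_w"], a["img_h"])]
```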

Let’s see how Encord Active can help you, for instance, identify border-proximate annotations.

Step 1: Select your Project.

Step 2: Under the “Explorer” dashboard, find the “Labels” tab.

Encord Active automatically finds patterns in the data and labels to surface potential labeling issues.

Step 3: On the right pane, click one of the issues Encord Active (EA) found to filter your data and labels by it; in this case, click “Border Closeness.”

The Border Closeness metric identifies annotations that are too close to image borders; annotations with a Border Proximity score of 1 are flagged as too close to the border.

Step 4: Select one of the images to inspect and validate the issue. Here’s a GIF with the steps:

Encord Active automatically detects label errors.

You will notice that EA also shows you the model’s predictions alongside the annotations, so you can visually inspect the annotation issue and resulting prediction.

Step 5: Visually inspect the top images EA flags and use the Collections feature to curate them.

Encord Active automatically detects label error issues that you can inspect and curate.

There are a few approaches you could take after creating the Collections:

  • Exclude the images that are border-proximate from the training data if the complete structure of the object is crucial for your application (see the sketch after this list). This prevents the model from learning from incomplete data, which could lead to inaccuracies in object detection.
  • Send the Collection to annotators for review.
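
For the first option, once you have flagged the border-proximate items (exported from a Collection or computed with a check like the earlier sketch), excluding them from the training split can be a simple filter. The record fields below are hypothetical.

```python
# Hypothetical annotation records:
# annotations = [{"image_id": "img_001", "bbox": (...), "border_proximate": True}, ...]

flagged_ids = {a["image_id"] for a in annotations if a["border_proximate"]}

# Keep only images that have no border-proximate annotations in the training set
train_annotations = [a for a in annotations if a["image_id"] not in flagged_ids]
```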

Reason #2: Poor Data Quality

The foundation of any ML model's success lies in the quality of the data it's trained on. High-quality data is characterized by its accuracy, completeness, timeliness, and relevance to the business problem ("fit for purpose").

Several common issues can compromise data quality:

  • Duplicate Images: They can artificially increase the frequency of particular features or patterns in the training data. This gives the model a false impression of these features' importance, causing overfitting.
  • Noise in Images: Blur, distortion, poor lighting, or irrelevant background objects can mask important image features, hindering the model's ability to learn and recognize relevant patterns.
  • Unrepresentative Data: When the training dataset doesn't accurately reflect the diversity of real-world scenarios, the model can develop biases. For example, a facial recognition system trained mainly on images of people with lighter skin tones may perform poorly on individuals with darker skin tones.
  • Limited Data Variation: A model trained on insufficiently diverse data (including duplicates and near-duplicates) will struggle to adapt to new or slightly different images in production. For example, if a self-driving car system is trained on images taken in sunny weather, it might fail in rainy or snowy conditions.

Potential Solution: Data Curation

One way to tackle poor data quality, especially after collection, is to curate high-quality data. Here is how to use Encord Active to automatically detect and flag duplicates in your dataset.

Curate Duplicate Images

Your testing and validation sets might contain duplicate training images that inflate the performance metrics. This makes the model appear better than it is, which could lead to false confidence about its real-world capabilities.
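
Before (or alongside) the UI workflow below, you can sanity-check for this kind of leakage in code. The sketch below uses the Pillow and imagehash libraries to find training images whose perceptual hash is identical or nearly identical to a validation image; the folder paths and distance threshold are assumptions, and this is a generic approach rather than the uniqueness metric Encord Active computes.

```python
from pathlib import Path
from PIL import Image
import imagehash

def phash_dir(folder: str):
    """Compute a perceptual hash for every JPEG in a folder."""
    return {p: imagehash.phash(Image.open(p)) for p in Path(folder).glob("*.jpg")}

train_hashes = phash_dir("data/train")  # assumed dataset layout
val_hashes = phash_dir("data/val")

# A Hamming distance <= 4 is a common (but arbitrary) near-duplicate threshold
leaks = [
    (train_path, val_path)
    for train_path, t_hash in train_hashes.items()
    for val_path, v_hash in val_hashes.items()
    if t_hash - v_hash <= 4
]
print(f"{len(leaks)} potential train/val near-duplicates found")
```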

Step 1: Navigate to the Explorer dashboard → Data tab

On the right-hand pane, you will notice Encord Active has automatically detected common data quality issues based on the metrics it computed from the data. See an overview of the issues EA can detect on this documentation page.

Step 2: Under the issues found, click on Duplicates to see the images EA flags as duplicates and near-duplicates with uniqueness scores of 0.0 to 0.00001.


There are two steps you could take to solve this issue:

  • Carefully remove duplicates, especially when dealing with imbalanced datasets, to avoid skewing the class distribution further.
  • If duplicates cannot be fully removed (e.g., to maintain the original distribution of rare cases), use data augmentation techniques to introduce variations within the set of duplicates themselves. This can help mitigate some of the overfitting effects.
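
For the second point, one way to reduce memorization of retained duplicates is to apply stochastic augmentations so the model never sees the repeated image in exactly the same form twice. A minimal sketch with torchvision (the specific transforms and parameters are illustrative):

```python
from torchvision import transforms

# Random crops, flips, and color jitter introduce variation among duplicates
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

# augmented = augment(pil_image)  # apply per sample inside your Dataset / DataLoader
```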


Step 3: Under the Data tab, select the duplicates you want to remove (or improve with augmentation techniques). Click Add to a Collection → name the collection ‘Duplicates’ and add a description.

See the complete steps:

How to curate Duplicate Images with Encord Active

Once the duplicates are in the Collection, you can use the tag to filter them out of your training or validation data. If relevant, you can also create a new dataset to apply the data augmentation techniques.

See collections options in Encord Active.

Other solutions could include:

  • Implement Robust Data Validation Checks: Use automated tools that continuously validate data accuracy, consistency, and completeness at the entry point (ingestion) and throughout the data pipeline (a minimal sketch follows this list).
  • Adopt a Centralized Data Management Platform: A unified view of data across sources (e.g., data lakes) can help identify discrepancies early and simplify access for CV engineers (or DataOps teams) to maintain data integrity.
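
As a starting point for the first suggestion, even a small ingestion-time script can catch corrupt or out-of-spec images before they reach the pipeline. A minimal sketch using Pillow; the resolution and mode checks are arbitrary examples of rules you might enforce.

```python
from pathlib import Path
from PIL import Image

MIN_SIDE = 224  # example minimum resolution

def validate_image(path: Path) -> list:
    """Return a list of problems found for a single image file."""
    problems = []
    try:
        with Image.open(path) as img:
            img.verify()               # raises on truncated or corrupt files
        with Image.open(path) as img:  # reopen, since verify() invalidates the file
            if min(img.size) < MIN_SIDE:
                problems.append(f"resolution too low: {img.size}")
            if img.mode not in ("RGB", "L"):
                problems.append(f"unexpected mode: {img.mode}")
    except Exception as exc:
        problems.append(f"unreadable: {exc}")
    return problems

# report = {p: validate_image(p) for p in Path("incoming/").glob("*.jpg")}  # assumed path
```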

Reason #3: Data Drift

Data drift occurs when the statistical properties of the real-world images a model encounters in production change over time, diverging from the samples it was trained on. Drift can happen due to various factors, including:

  • Concept Drift: The underlying relationships between features and the target variable change. For example, imagine a model trained to detect spam emails. The features that characterize spam (certain keywords, sender domains) can evolve over time.
  • Covariate Shift: The input feature distribution changes while the relationship to the target variable remains unchanged. For instance, a self-driving car vision system trained in summer might see a different distribution of images (snowy roads, different leaf colors) in winter.
  • Prior Probability Shift: The overall frequency of different classes changes. For example, a medical image classification model trained for a certain rare disease may encounter it more frequently as its prevalence changes in the population.
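
A lightweight way to watch for shifts like these is to compare the distribution of a simple per-image statistic (or an embedding feature) between a training reference set and a recent production window, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses SciPy; the brightness statistic and significance threshold are illustrative choices, not a complete drift-monitoring system.

```python
import numpy as np
from scipy.stats import ks_2samp

def mean_brightness(images: np.ndarray) -> np.ndarray:
    """Per-image mean pixel intensity for an array shaped (n, H, W, C)."""
    return images.reshape(len(images), -1).mean(axis=1)

def drift_detected(reference_imgs, production_imgs, alpha=0.01) -> bool:
    """Flag drift when the two brightness distributions differ significantly."""
    _, p_value = ks_2samp(mean_brightness(reference_imgs),
                          mean_brightness(production_imgs))
    return p_value < alpha

# if drift_detected(train_sample, last_week_sample):
#     ...  # alert the team or trigger a retraining pipeline
```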

If you want to dig deeper into the causes of drift, check out the “Data Distribution Shifts and Monitoring” article.

Potential Solution: Monitoring Data Drift

There are two steps you could take to address data drift:

  • Use tools that monitor the model's performance and the input data distribution. Look for shifts in metrics and statistical properties over time.
  • Collect new data representing current conditions and retrain the model at appropriate intervals. This can be done regularly or triggered by alerts when significant drift is detected.

You can achieve both within Encord:

Step 1: Create the Dataset on Annotate to log your input data for training or production. If your data is on a cloud platform, check out one of the data integrations to see if it works with your stack.

Step 2: Create an Ontology to define the structure of the dataset.

Step 3: Create an Annotate Project based on your dataset and the ontology. Ensure the project also includes Workflows because some features in Encord Active only support projects that include workflows.

Step 4: Import your Annotate Project to Active. This will allow you to import the data, ground truth, and any custom metrics to evaluate your data quality. See how it’s done in the video tutorial in the documentation.

Step 5: Select the Project → Import your Model Predictions.


There are two steps to inspect the issues with the input data:

  • Use the analytics view to get a statistical summary of the data.
  • Use the issues found by Encord Active to manually inspect where your model is struggling.

Step 6: On the Explorer dashboard → Data tab → Analytics View.

Step 7: Under the Metric Distribution chart, select a quality metric to assess the distribution of your input data. In this example, “Diversity” applies algorithms to rank images from easy to hard to annotate. Easy samples have lower scores, while hard samples have higher scores.

Metric Distribution chart in Encord Active

Step 8: On the right-hand pane, click on Dark. Navigate back to Grid View → Click on one of the images to inspect the ground truth (if available) vs. model predictions.

Encord Active - Explorer to compare ground truth vs. model predictions.

Observe that the poor lighting could have caused the model to misidentify the toy bear as a person. (Of course, other reasons, such as class imbalance, could cause the model to misclassify the object.)


You can inspect the class balance on the Analytics View → Class Distribution chart.

Analytics view to view the class distribution chart.

Nice!

There are other ways to manage data drift, including the following approaches:

  • Adaptive Learning: Consider online learning techniques where the model continuously updates itself based on new data without full retraining. Note that this is still an active area of research with challenges in computer vision.
  • Domain Adaptation: If collecting substantial amounts of labeled data from the new environment is not feasible, use domain adaptation techniques to bridge the gap between the old and new domains.

Reason #4: Thinking Deployment is the Final Step (No Observability)

Many teams mistakenly treat deployment as the finish line, which is one reason machine learning projects fail in production. However, it's crucial to remember that this is simply one stage in a continuous cycle. Models in production often degrade over time due to factors such as data drift (changes in input data distribution) or model drift (changes in the underlying relationships the model was trained on).

Neglecting post-deployment maintenance invites model staleness and eventual failure. This is where MLOps (Machine Learning Operations) becomes essential. MLOps provides practices and technologies to monitor, maintain, and govern ML systems in production.

Potential Solution: Machine Learning Operations (MLOps)

The core principle of MLOps is ensuring your model provides continuous business value while in production. How teams operationalize ML varies, but some key practices include:

  • Model Monitoring: Implement monitoring tools to track performance metrics (accuracy, precision, etc.) and automatically alert you to degradation. Consider a feedback loop to trigger retraining processes where necessary, either for real-time or batch deployment.
  • Logging: Even if full MLOps tools aren't initially feasible, start by logging model predictions and comparing them against ground truth, as we showed above with Encord (a minimal sketch follows this list). This offers early detection of potential issues.
  • Management and Governance: Establish reproducible ML pipelines for continuous training (CT) and automate model deployment. From the start, consider regulatory compliance issues in your industry.
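
To make the logging point concrete, here is a minimal sketch that appends each prediction to a CSV file and computes a rolling accuracy over the rows that have ground truth filled in. The file path, schema, and window size are assumptions for illustration.

```python
import csv
import time
from pathlib import Path

LOG_PATH = Path("prediction_log.csv")  # assumed location

def log_prediction(image_id, predicted, confidence, ground_truth=None):
    """Append one prediction (and its ground truth, when known) to the log."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "image_id", "predicted", "confidence", "ground_truth"])
        writer.writerow([time.time(), image_id, predicted, confidence, ground_truth])

def rolling_accuracy(window=500):
    """Accuracy over the most recent `window` logged rows with ground truth."""
    if not LOG_PATH.exists():
        return None
    with LOG_PATH.open() as f:
        rows = [r for r in csv.DictReader(f) if r["ground_truth"]][-window:]
    if not rows:
        return None
    return sum(r["predicted"] == r["ground_truth"] for r in rows) / len(rows)
```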


Key Takeaways: 4 Reasons Computer Vision Models Fail in Production

Remember that model deployment is not the last step. Do not waste time on a model only to have it fail a few days, weeks, or months later. ML systems differ across teams and organizations, but most failure modes are shared. If you study your ML system, you’ll likely find that the reasons your model fails in production are similar to those listed in this article:

1. Data labeling errors

2. Poor data quality

3. Data drift in production

4. Thinking deployment is the final step

The goal is for you to understand these failures and learn the best practices to solve or avoid them. You’ll also notice that while most failure modes are data-centric, others are technology-related or involve team practices, culture, and available resources.

Written by Stephen Oladele
Frequently asked questions
  • Why do machine learning projects fail? ML projects often fail due to poor data, unclear goals, wrong model complexity, unrealistic expectations, or lack of team collaboration.

  • Why do models degrade in production? Models degrade because real-world data changes over time (data drift), the underlying patterns they were trained on shift (concept drift), or model outputs create feedback loops that alter future data.

  • What does it mean when a model performs well on training data but poorly on new data? This usually indicates overfitting, meaning the model has memorized the training data too closely and doesn't generalize to new examples.

  • What determines an AI model's performance? An AI model's performance depends on the quality and quantity of its training data, its architecture, how it's trained, the features it uses, and potential biases.