https://arxiv.org/pdf/2203.03884.pdf
Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels
The key to semi-supervised semantic segmentation is assigning adequate pseudo-labels to the pixels of unlabeled images. A common practice is to select high-confidence predictions as pseudo ground truth, but this leaves most pixels unused due to their unreliability. We argue that every pixel matters to model training, even when its prediction is ambiguous. Intuitively, an unreliable prediction may get confused among the top categories (i.e., those with the highest probabilities), but it should be confident that the pixel does not belong to the remaining categories. Such a pixel can therefore be convincingly treated as a negative sample for those most unlikely categories. Based on this insight, we develop an effective pipeline to make full use of unlabeled data. Concretely, we separate reliable and unreliable pixels via the entropy of predictions, push each unreliable pixel into a category-wise queue of negative samples, and manage to train the model with all candidate pixels. As the predictions become more accurate over the course of training, we adaptively adjust the threshold for the reliable-unreliable partition. Experimental results on various benchmarks and training settings demonstrate the superiority of our approach over state-of-the-art alternatives.
1. Introduction
Semantic segmentation is a fundamental task in computer vision and has been greatly advanced by the rise of deep neural networks [5,29,35,46]. Existing supervised methods rely on large-scale annotated data, which is too expensive to obtain in practice. To alleviate this problem, many attempts [1,4,9,15,21,33,43,48] have been made toward semi-supervised semantic segmentation, i.e., learning a model with only a few labeled samples and a large number of unlabeled ones. In this setting, it becomes crucial to fully utilize the unlabeled data.
A typical solution is to assign pseudo-labels to unannotated pixels. Specifically, given an unlabeled image, prior art [27,41] borrows predictions from a model trained on labeled data and uses the pixel-wise predictions as "ground truth". To mitigate confirmation bias [2], where the model may suffer from erroneous pseudo-labels, existing approaches propose filtering predictions by their confidence scores [42,43,50,51]. In other words, only high-confidence predictions are used as pseudo-labels, while ambiguous ones are discarded.
However, a potential problem of using only reliable predictions is that certain pixels may never be learned throughout the training process. For example, if the model does not predict a particular class satisfactorily (e.g., the chair in Figure 1), it is difficult to assign accurate pseudo-labels to pixels of that class, which can lead to insufficient and unbalanced training. From this perspective, we argue that, to make full use of unlabeled data, every pixel should be utilized appropriately.
As mentioned above, directly using unreliable predictions as pseudo-labels causes performance degradation [2]. In this paper, we propose an alternative way of using unreliable pseudo-labels, and we call our framework U2PL. First, we observe that an unreliable prediction is usually confused among only a few classes rather than all of them. For example, in Figure 2, the pixel marked with the white cross receives similar probabilities for the classes motorbike and person, but the model is quite sure that the pixel does not belong to the classes car and train. Based on this observation, we reconsider the confusing pixels as negative samples for those unlikely classes. Specifically, after obtaining the prediction for an unlabeled image, we use per-pixel entropy as a metric (see Figure 2a) to classify all pixels into two groups, reliable and unreliable. All reliable predictions are used to derive positive pseudo-labels, while pixels with unreliable predictions are pushed into a memory bank of negative samples. To avoid all negative pseudo-labels coming from only a subset of the classes, we employ a queue for each class. This design ensures a balanced number of negative samples per class. Moreover, considering that the quality of pseudo-labels improves as the model becomes more accurate, we propose a strategy to adaptively adjust the threshold that separates reliable from unreliable pixels.
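To make the intuition concrete, here is a toy sketch (PyTorch; the per-pixel probabilities are hypothetical) of how a confused prediction still confidently rules out the lowest-ranked classes:

```python
import torch

# Hypothetical softmax output of one pixel over 5 classes.
probs = torch.tensor([0.42, 0.40, 0.10, 0.05, 0.03])
order = torch.argsort(probs, descending=True)  # classes ranked by probability
# The top-2 classes (0.42 vs. 0.40) are confused, so the prediction is
# unreliable as a positive label; the low-ranked classes, however, are
# confidently ruled out and can serve as negative samples for this pixel.
negative_classes = order[2:]
print(negative_classes)  # tensor([2, 3, 4])
```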
We evaluate the proposed U2PL on PASCAL VOC 2012 [14] and Cityscapes [10] under a wide range of training settings, where our method outperforms state-of-the-art competitors. Moreover, by visualizing the segmentation results, we find that our method achieves better performance on ambiguous regions (e.g., boundaries between different objects), thanks to its full use of unreliable pseudo-labels.
2. Related Work
Semi-supervised learning has two typical paradigms: consistency regularization [3,15,33,36,42] and entropy minimization [4,16]. More recently, a more intuitive yet effective framework, self-training [27], has become mainstream. Several methods [15,43,44] built on self-training utilize strong data augmentations such as CutOut [13], CutMix [45], and ClassMix [31]. However, these methods pay little attention to the characteristics of semantic segmentation, whereas our method focuses mainly on those unreliable pixels, which would be filtered out by most self-training based methods [34,43,44].
Pseudo-labeling is adopted to prevent overfitting to incorrect pseudo-labels when generating predictions for an input image with a teacher network [2,27]. FixMatch [37] utilizes a confidence threshold to select reliable pseudo-labels. UPS [34], a method based on FixMatch [37], takes both model uncertainty and data uncertainty into account. However, in semi-supervised semantic segmentation, our experiments show that including unreliable pixels in training improves performance.
Model uncertainty in computer vision is mainly measured by Bayesian deep learning methods [12,23,30]. In our setting, we are not concerned with how to measure uncertainty. We simply use the entropy of the pixel probability distribution as a metric.
Contrastive learning has been applied to self-supervised learning with great success [7,8,17]. In semantic segmentation, contrastive learning has become a promising new paradigm [1,28,40,47,49]. However, these methods ignore the false negative samples that are common in semi-supervised segmentation, where unreliable pixels may be incorrectly pushed away by the contrastive loss. Distinguishing the unlikely categories of unreliable pixels addresses this problem.
Negative learning aims to reduce the risk of incorrect information by lowering the probability of negative samples [24,25,34,39], but these negative samples are selected with high confidence. In other words, these methods still only use reliable predictions. In contrast, we propose to make full use of unreliable predictions for learning, rather than filtering them out.
3. Methodology
In this section, we first mathematically formulate the problem and give an overview of our proposed method in Sec. 3.1. Our strategy for filtering reliable pseudo-labels is presented in Sec. 3.2. Finally, we describe how to use unreliable pseudo-labels in Sec. 3.3.
3.1. Overview
Given a labeled set $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$ and a much larger unlabeled set $\mathcal{D}_u = \{x_i^u\}_{i=1}^{N_u}$, we aim to train a semantic segmentation model using the large amount of unlabeled data together with the smaller labeled set.
Figure 3 gives an overview of U2PL, which follows a typical self-training framework with two models of the same architecture, named teacher and student. The two models differ only in how their weights are updated. The weights $\theta_s$ of the student are updated by conventional back-propagation, while the weights $\theta_t$ of the teacher are an exponential moving average (EMA) of the student's weights. Each model consists of a CNN-based encoder $h$, a decoder with a segmentation head $f$, and a representation head $g$. In each training step, we equally sample $B$ labeled images $\mathcal{B}_l$ and $B$ unlabeled images $\mathcal{B}_u$. For each labeled image, we aim to minimize the standard cross-entropy loss in Equation (2). For each unlabeled image, we first feed it into the teacher model to obtain predictions. Then, based on pixel-level entropy, we ignore unreliable pixel-level pseudo-labels when computing the unsupervised loss in Equation (3), as detailed in Sec. 3.2. Finally, we use a contrastive loss to make full use of the unreliable pixels excluded from the unsupervised loss, which is presented in Sec. 3.3.
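As a minimal sketch of the teacher update, assuming PyTorch modules `student` and `teacher` with identical architectures and a hypothetical decay of 0.99, the EMA step might look like:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # teacher = decay * teacher + (1 - decay) * student
    # (batch-norm buffers, often copied or also averaged, are omitted for brevity)
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(decay).add_(s_p.data, alpha=1 - decay)
```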
Our optimization objective is to minimize the overall loss:
$$\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_c \mathcal{L}_c, \tag{1}$$
where $\mathcal{L}_s$ and $\mathcal{L}_u$ denote the supervised and unsupervised losses applied to labeled and unlabeled images, respectively, and $\mathcal{L}_c$ denotes the contrastive loss that makes full use of unreliable pseudo-labels. $\lambda_u$ and $\lambda_c$ are the weights of the unsupervised and contrastive losses, respectively. Both $\mathcal{L}_s$ and $\mathcal{L}_u$ are cross-entropy (CE) losses:
$$\mathcal{L}_s = \frac{1}{|\mathcal{B}_l|} \sum_{(x_i^l, y_i^l) \in \mathcal{B}_l} \ell_{ce}\big(f \circ h(x_i^l; \theta),\, y_i^l\big), \tag{2}$$
$$\mathcal{L}_u = \frac{1}{|\mathcal{B}_u|} \sum_{x_i^u \in \mathcal{B}_u} \ell_{ce}\big(f \circ h(x_i^u; \theta),\, \hat{y}_i^u\big), \tag{3}$$
where $y_i^l$ denotes the hand-annotated mask of the $i$-th labeled image, and $\hat{y}_i^u$ is the pseudo-label of the $i$-th unlabeled image. $f \circ h$ is the composition of $h$ and $f$, meaning the image is first fed to $h$ and then to $f$ to obtain the segmentation result.
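A minimal sketch of the two CE terms, assuming pixel-wise logits of shape (B, C, H, W) and that filtered-out pixels in the pseudo-label carry an assumed ignore index of 255:

```python
import torch.nn.functional as F

def supervised_and_unsupervised_loss(logits_l, y_l, logits_u, y_hat_u):
    # logits_*: (B, C, H, W); y_l / y_hat_u: (B, H, W) with 255 = ignore
    loss_s = F.cross_entropy(logits_l, y_l, ignore_index=255)
    loss_u = F.cross_entropy(logits_u, y_hat_u, ignore_index=255)
    return loss_s, loss_u
```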
$\mathcal{L}_c$ is the pixel-level InfoNCE [32] loss, defined as:
$$\mathcal{L}_c = -\frac{1}{C \times M} \sum_{c=0}^{C-1} \sum_{i=1}^{M} \log \left[ \frac{e^{\langle z_{ci},\, z_{ci}^+ \rangle / \tau}}{e^{\langle z_{ci},\, z_{ci}^+ \rangle / \tau} + \sum_{j=1}^{N} e^{\langle z_{ci},\, z_{cij}^- \rangle / \tau}} \right], \tag{4}$$
where $M$ is the number of anchor pixels per class and $z_{ci}$ denotes the representation of the $i$-th anchor of class $c$. Each anchor pixel is paired with one positive sample and $N$ negative samples, whose representations are $z_{ci}^+$ and $z_{cij}^-$, respectively. Note that $z = g \circ h(x)$ is the output of the representation head. $\langle \cdot, \cdot \rangle$ is the cosine similarity between features of two different pixels, which is restricted to the range $[-1, 1]$ and thus requires the temperature $\tau$. Following [28], we set $M = 50$, $N = 256$, and $\tau = 0.5$.
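The InfoNCE term can be seen as a (1+N)-way classification where the positive always sits at index 0. A minimal sketch under assumed shapes (anchors (M, D), one positive center (D,), negatives (M, N, D), all L2-normalized so dot products equal cosine similarities):

```python
import torch
import torch.nn.functional as F

def pixel_info_nce(anchors, positive, negatives, tau=0.5):
    pos = anchors @ positive / tau                               # (M,)
    neg = torch.einsum('md,mnd->mn', anchors, negatives) / tau   # (M, N)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)           # (M, 1+N)
    labels = torch.zeros(anchors.size(0), dtype=torch.long)      # positive at index 0
    return F.cross_entropy(logits, labels)
```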
3.2. Pseudo-Labeling
To avoid overfitting to incorrect pseudo-labels, we use the entropy of each pixel's probability distribution to filter high-quality pseudo-labels for supervision. Specifically, let $p_{ij} \in \mathbb{R}^C$ denote the softmax probability generated by the segmentation head of the teacher model at pixel $j$ of the $i$-th unlabeled image, where $C$ is the number of classes. Its entropy is computed as:
$$\mathcal{H}(p_{ij}) = -\sum_{c=0}^{C-1} p_{ij}(c) \log p_{ij}(c), \tag{5}$$
where $p_{ij}(c)$ is the value of $p_{ij}$ in the $c$-th dimension.
We then regard pixels with high entropy as unreliable pseudo-labels, which are not suitable for supervision. The pseudo-label of the $i$-th unlabeled image at pixel $j$ is therefore defined as:
$$\hat{y}_{ij}^u = \begin{cases} \arg\max_c p_{ij}(c), & \text{if } \mathcal{H}(p_{ij}) < \gamma_t, \\ \text{ignore}, & \text{otherwise}, \end{cases} \tag{6}$$
where $\gamma_t$ denotes the entropy threshold at the $t$-th training step. We set $\gamma_t$ to the quantile corresponding to $\alpha_t$, i.e., $\gamma_t = \texttt{np.percentile(H.flatten(), 100*(1-}\alpha_t\texttt{))}$, where $H$ is the per-pixel entropy map. To improve performance, we adopt the following adjustment strategies during the pseudo-labeling process.
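Putting Equations (5) and (6) together, a sketch of the per-step pseudo-labeling, assuming `probs` is the teacher's (B, C, H, W) softmax output and 255 is an assumed ignore index:

```python
import numpy as np
import torch

def make_pseudo_labels(probs, alpha_t, ignore_index=255):
    # Eq. (5): per-pixel entropy of the teacher's prediction
    entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=1)   # (B, H, W)
    # gamma_t is the (1 - alpha_t) quantile, so the top alpha_t fraction
    # of pixels by entropy is treated as unreliable
    gamma_t = np.percentile(entropy.detach().cpu().numpy().flatten(),
                            100 * (1 - alpha_t))
    # Eq. (6): argmax where reliable, ignore elsewhere
    pseudo = probs.argmax(dim=1)
    pseudo[entropy >= gamma_t] = ignore_index
    return pseudo
```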
Dynamic partition adjustment.
During training, pseudo-labels gradually become more reliable. Based on this intuition, we adjust the proportion $\alpha_t$ of unreliable pixels at each epoch with a linear strategy:
$$\alpha_t = \alpha_0 \cdot \left(1 - \frac{t}{\text{total epoch}}\right), \tag{7}$$
where $\alpha_0$ is the initial proportion, set to 20%, and $t$ denotes the current training epoch.
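A one-line sketch of this schedule, where `total_epochs` is assumed to match the training length (80 for PASCAL VOC, 200 for Cityscapes in our setting):

```python
def alpha_schedule(t, total_epochs, alpha0=0.20):
    # linearly shrink the unreliable fraction from alpha0 toward 0
    return alpha0 * (1 - t / total_epochs)
```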
Adaptive weight adjustment.
With reliable pseudo-labels in hand, we involve them in the unsupervised loss in Equation (3). The weight $\lambda_u$ of this loss is defined as the reciprocal of the fraction of pixels in the current mini-batch whose entropy is below the threshold $\gamma_t$, multiplied by a base weight $\eta$:
$$\lambda_u = \eta \cdot \frac{|\mathcal{B}_u| \times H \times W}{\sum_{x_i^u \in \mathcal{B}_u} \sum_{j} \mathbb{1}\big[\mathcal{H}(p_{ij}) \le \gamma_t\big]}, \tag{8}$$
where $\mathbb{1}[\cdot]$ is the indicator function and $\eta$ is set to 1.
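A sketch of this adaptive weight, assuming `entropy` is the (B, H, W) per-pixel entropy tensor of the current mini-batch and `gamma_t` the current threshold:

```python
def unsupervised_weight(entropy, gamma_t, eta=1.0):
    # reciprocal of the fraction of reliable (low-entropy) pixels, times eta
    frac_reliable = (entropy <= gamma_t).float().mean()
    return eta / frac_reliable.clamp_min(1e-6)
```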
3.3. Use of Unreliable Pseudo-Labels
Discarding unreliable pseudo-labels or down-weighting them has been widely adopted in semi-supervised learning tasks to prevent performance degradation [37,41,43,50]. We follow this intuition by filtering out unreliable pseudo-labels according to Equation (6).
However, simply discarding unreliable pseudo-labels leads to information loss, because unreliable pseudo-labels can still provide discriminative information. For example, the white cross in Figure 2 marks a typical unreliable pixel. Its distribution reflects the model's uncertainty in distinguishing the class person from the class motorbike. However, the same distribution also reflects the model's certainty: it is confident that the pixel does not belong to car, train, bicycle, and so on. This property provides the main basis for our proposal of using unreliable pseudo-labels for semi-supervised semantic segmentation.
The goal of U2PL is to exploit the information in unreliable pseudo-labels for better discrimination, which is consistent with the recently popular contrastive learning paradigm for distinguishing representations. However, due to the lack of labeled images in semi-supervised semantic segmentation, U2PL is built on a more sophisticated strategy. U2PL involves three components, named anchor pixels, positive candidates, and negative candidates, which are obtained from certain sets in a sampling manner to reduce the otherwise huge computational cost. Next, we describe how to select (a) anchor pixels (queries), (b) positive samples for each anchor, and (c) negative samples for each anchor.
Anchor pixels.
During training, we sample anchor pixels (queries) for each class appearing in the current mini-batch. We denote the feature set of all labeled candidate anchor pixels of class $c$ as $\mathcal{A}_c^l$:
$$\mathcal{A}_c^l = \{ z_{ij} \mid y_{ij} = c,\ p_{ij}(c) > \delta_p \}, \tag{9}$$
where $y_{ij}$ is the ground-truth label of the $j$-th pixel of labeled image $i$, $\delta_p$ denotes the positive threshold for all classes, set to 0.3 following [28], and $z_{ij}$ is the representation of the $j$-th pixel of labeled image $i$. For unlabeled data, the counterpart $\mathcal{A}_c^u$ is computed as:
$$\mathcal{A}_c^u = \{ z_{ij} \mid \hat{y}_{ij} = c,\ p_{ij}(c) > \delta_p \}, \tag{10}$$
which is similar to $\mathcal{A}_c^l$; the only difference is that we use pseudo-labels $\hat{y}_{ij}$ based on Equation (6) instead of hand-annotated labels, which implies that qualified anchor pixels are reliable, i.e., $\mathcal{H}(p_{ij}) \le \gamma_t$. Thus, for class $c$, the set of all qualified anchors is
$$\mathcal{A}_c = \mathcal{A}_c^l \cup \mathcal{A}_c^u. \tag{11}$$
Positive samples.
The positive sample is shared by all anchors of the same class: it is the center of all qualified anchors:
$$z_c^+ = \frac{1}{|\mathcal{A}_c|} \sum_{z \in \mathcal{A}_c} z. \tag{12}$$
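In code, the class-$c$ positive is simply the mean of all qualified anchor features; the re-normalization below is an assumption, keeping the center comparable under cosine similarity:

```python
import torch.nn.functional as F

def positive_center(anchors_c):
    # anchors_c: (K, D) representations of all qualified anchors of class c
    center = anchors_c.mean(dim=0)
    # re-normalize so that dot products with anchors remain cosine similarities
    return F.normalize(center, dim=0)
```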
Negative samples.
We define a binary variable $n_{ij}(c)$ to indicate whether the $j$-th pixel of image $i$ qualifies as a negative sample of class $c$:
$$n_{ij}(c) = \mathbb{1}[x_i \in \mathcal{D}_l] \cdot n_{ij}^l(c) + \mathbb{1}[x_i \in \mathcal{D}_u] \cdot n_{ij}^u(c), \tag{13}$$
where $n_{ij}^l(c)$ and $n_{ij}^u(c)$ indicate whether the $j$-th pixel of a labeled and an unlabeled image $i$, respectively, qualifies as a class-$c$ negative sample.
For the $i$-th labeled image, a qualified negative sample of class $c$ should (a) not belong to class $c$, and (b) be hard to distinguish between class $c$ and its ground-truth category. We therefore introduce the pixel-level category order $O_{ij} = \text{argsort}(p_{ij})$, sorted by descending probability. Clearly, $O_{ij}(\arg\max p_{ij}) = 0$ and $O_{ij}(\arg\min p_{ij}) = C - 1$. Based on this order, we define
$$n_{ij}^l(c) = \mathbb{1}[y_{ij} \ne c] \cdot \mathbb{1}\big[0 \le O_{ij}(c) < r_l\big], \tag{14}$$
where $r_l$ is the low-rank threshold, set to 3. The two indicators reflect criteria (a) and (b), respectively.
For the $i$-th unlabeled image, a qualified negative sample of class $c$ should (a) be unreliable, (b) probably not belong to class $c$, and (c) not belong to the most unlikely categories. Similarly, we use $O_{ij}$ to define $n_{ij}^u(c)$:
$$n_{ij}^u(c) = \mathbb{1}\big[\mathcal{H}(p_{ij}) > \gamma_t\big] \cdot \mathbb{1}\big[r_l \le O_{ij}(c) < r_h\big], \tag{15}$$
where $r_h$ is the high-rank threshold, set to 20. Finally, the set of negative samples of class $c$ is
$$\mathcal{N}_c = \{ z_{ij} \mid n_{ij}(c) = 1 \}. \tag{16}$$
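A sketch of the unlabeled-image case of Equation (15), assuming per-image tensors and the default thresholds $r_l = 3$, $r_h = 20$:

```python
import torch

def negative_mask_unlabeled(probs, entropy, gamma_t, c, r_l=3, r_h=20):
    # probs: (C, H, W) softmax for one unlabeled image; entropy: (H, W)
    order = probs.argsort(dim=0, descending=True)  # rank -> class
    rank_of_class = order.argsort(dim=0)           # class -> rank
    rank_c = rank_of_class[c]                      # (H, W) rank of class c
    unreliable = entropy > gamma_t                 # criterion (a)
    return unreliable & (rank_c >= r_l) & (rank_c < r_h)  # criteria (b), (c)
```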
Category-wise memory bank.
Due to the long-tailed class distribution of the dataset, negative candidates for certain categories can be very limited within a mini-batch. To maintain a stable number of negative samples, we use a category-wise memory bank $\mathcal{Q}_c$ (a FIFO queue) to store the negative samples of class $c$.
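A minimal sketch of such a bank; the per-class capacity below is a hypothetical choice:

```python
from collections import deque
import torch

class CategoryMemoryBank:
    """One FIFO queue of negative features per class."""
    def __init__(self, num_classes, capacity=30000):
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]

    def push(self, c, feats):
        # feats: (K, D) negative features for class c; deque(maxlen=...)
        # evicts the oldest entries automatically (FIFO behavior)
        self.queues[c].extend(feats.detach().unbind(0))

    def sample(self, c, n):
        q = self.queues[c]
        idx = torch.randint(len(q), (n,))
        return torch.stack([q[int(i)] for i in idx])
```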
Finally, Algorithm 1 summarizes the whole process of using unreliable pseudo-labels. Features of anchors are attached to the gradient and thus come from the student, while features of positive and negative samples come from the teacher.
4. Experiments
4.1. Setup
Dataset.
The PASCAL VOC 2012 [14] dataset is a standard semantic segmentation benchmark with 20 object classes and 1 background class. The training and validation sets contain 1464 and 1449 images, respectively. Following [9,21,43], we use SBD [18] as an augmentation set with 9118 additional training images. Since the SBD [18] dataset is coarsely annotated, PseudoSeg [50] uses only the standard 1464 images as the whole labeled set, whereas other methods [9,21] use all 10582 images as candidate labeled data. We therefore evaluate our approach on both the classic set (1464 candidate labeled images) and the blender set (10582 candidate labeled images). Cityscapes [10], a dataset designed for urban scene understanding, consists of 2975 training images with finely annotated masks and 500 validation images. For each dataset, we compare U2PL with other methods under the 1/2, 1/4, 1/8, and 1/16 partition protocols.
Network structure.
We use ResNet-101 [19] pre-trained on ImageNet [11] as the backbone and DeepLabv3+ [6] as the decoder. Both the segmentation head and the representation head consist of two Conv-BN-ReLU blocks, where both blocks preserve the feature-map resolution and the first block halves the number of channels. The segmentation head can be viewed as a pixel-level classifier mapping the 512-dimensional features output by the ASPP module into C classes. The representation head maps the same features into a 256-dimensional representation space.
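A sketch of the two heads under these constraints; the exact layer composition (e.g., whether the final projection carries BN/ReLU) is an assumption:

```python
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # 3x3 conv with padding=1 keeps the feature-map resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

num_classes = 21  # e.g., PASCAL VOC (20 objects + background)
# first block halves the channels (512 -> 256), second maps to the output
seg_head = nn.Sequential(conv_bn_relu(512, 256), nn.Conv2d(256, num_classes, 1))
rep_head = nn.Sequential(conv_bn_relu(512, 256), nn.Conv2d(256, 256, 1))
```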
Evaluation.
Following previous methods [15,21,33,48], images are center-cropped to a fixed resolution for PASCAL VOC 2012. For Cityscapes, previous methods apply sliding-window evaluation, and so do we. We then use the mean intersection over union (mIoU) as the evaluation metric. All results are measured on the val sets of Cityscapes [10] and PASCAL VOC 2012 [14]. Ablation studies are conducted on the blender PASCAL VOC 2012 [14] val set under the 1/4 and 1/8 partition protocols.
Implementation details.
For training on the blender and classic PASCAL VOC 2012 datasets, we use a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.001, weight decay of 0.0001, crop size of 513 × 513, batch size of 16, and 80 training epochs. For training on the Cityscapes dataset, we also use an SGD optimizer, with an initial learning rate of 0.01, weight decay of 0.0005, crop size of 769 × 769, batch size of 16, and 200 training epochs. In all experiments, the decoder's learning rate is ten times that of the backbone. During training, we anneal the learning rate with the poly schedule: $lr = lr_{\text{base}} \cdot \left(1 - \frac{\text{iter}}{\text{total iter}}\right)^{0.9}$.
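A sketch of the poly schedule reconstructed above; the power 0.9 is the common DeepLab-style choice and is an assumption here:

```python
def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    return base_lr * (1 - cur_iter / total_iter) ** power
```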
4.2. Comparison with Existing Alternatives
We compare our method with the following state-of-the-art semi-supervised semantic segmentation methods: Mean Teacher (MT) [38], CCT [33], GCT [22], PseudoSeg [50], CutMix [15], CPS [9], PC2Seg [48], and AEL [21]. We re-implement MT [38] and CutMix [45] for a fair comparison. For Cityscapes [10], we also reproduce CPS [9] and AEL [21]. All results use the same network architecture (DeepLabv3+ as the decoder and ResNet-101 as the encoder). Note that the classic and blender PASCAL VOC 2012 datasets differ only in their training sets; their validation sets are the same, each containing 1449 images.
Results for the classic PASCAL VOC 2012 dataset.
Tab. 1 compares our method with the other state-of-the-art methods on the classic PASCAL VOC 2012 dataset. U2PL outperforms the supervised-only baseline by +22.21%, +14.23%, +7.78%, and +4.47% under the 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively. For a fair comparison, we only list methods tested on classic PASCAL VOC 2012. Our method U2PL outperforms PC2Seg under all partition protocols, by +10.98%, +2.87%, +3.88%, and +3.11% under the 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively. Even under full supervision, our method outperforms PC2Seg by +5.34%.
Results for the blender PASCAL VOC 2012 dataset.
Tab. 2 shows the comparison results on the blender PASCAL VOC 2012 dataset. Our method U2PL outperforms all the other methods under most partition protocols. Compared to the baseline model (trained with only supervised data), U2PL achieves improvements of +9.34%, +7.46%, +3.50%, and +3.37% under the 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively. Compared with the existing state-of-the-art methods, U2PL surpasses them under all partition protocols. In particular, U2PL outperforms AEL by +1.44% and +1.24% under the 1/8 and 1/4 protocols, respectively.
Results on the Cityscapes dataset.
Tab. 3 shows the comparison results on the Cityscapes val set. U2PL improves over the supervised-only baseline by +9.16%, +3.95%, +4.08%, and +1.29% under the 1/16, 1/8, 1/4, and 1/2 partition protocols, and significantly outperforms existing state-of-the-art methods. In particular, U2PL outperforms AEL by +0.45%, +0.93%, +1.03%, and +0.11% under the 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively.
Note that when labeled data is extremely limited, e.g., when only 92 labeled images are available, our U2PL shows a clear advantage over previous approaches (+10.98% under the 1/16 partition on classic PASCAL VOC 2012), demonstrating the effectiveness of using unreliable pseudo-labels.
4.3. Ablation Study
Effectiveness of using unreliable pseudo-labels.
To prove our core insight, i.e., that using unreliable pseudo-labels promotes semi-supervised semantic segmentation, we conduct experiments on selecting negative candidates with different reliability (Sec. 3.3). Tab. 4 shows the mIoU results on the PASCAL VOC 2012 val set. "Unreliable" outperforms the other options, demonstrating that using unreliable pseudo-labels does help. Appendix B shows the effectiveness of using unreliable pseudo-labels on Cityscapes.
Effectiveness of probability rank thresholds.
Sec. 3.3 proposes probability rank thresholds to balance informativeness against the confusion caused by unreliable pixels. Tab. 5 verifies that this balance promotes performance: $r_l = 3$ and $r_h = 20$ outperform the other options by a large margin. When $r_l = 1$, false-negative candidates are not filtered out, so the intra-class features of those pixels are incorrectly pushed apart by $\mathcal{L}_c$. When $r_l = 10$, negative candidates tend to be semantically unrelated to the corresponding anchor pixel, which makes the discrimination less informative. Appendix D.2 studies the effectiveness of the PRT and $\alpha_0$.
Effectiveness of components.
Tab. 6 progressively ablates each component of U2PL. For a fair comparison, all ablations are conducted under the 1/4 partition protocol on the blender PASCAL VOC 2012 dataset. Above all, we use the model trained without $\mathcal{L}_c$ as our baseline, achieving 73.02% mIoU (MT in Tab. 2). Simply adding $\mathcal{L}_c$ without the DPA strategy improves the baseline by +4.06%. The category-wise memory bank $\mathcal{Q}_c$, together with PRT and high-entropy filtering, brings an improvement of +5.47% over the baseline. Dynamic partition adjustment (DPA) combined with high-entropy filtering improves the baseline by +6.05%. Note that DPA is a linear adjustment without tuning (see Equation (7)), which is simple yet effective. For the probability rank threshold (PRT) component, we set the corresponding parameters according to Tab. 5. Without high-entropy filtering, the improvement drops significantly to +4.55%. Finally, when adding all contributions together, our method achieves the state-of-the-art result of 79.30% mIoU under the 1/4 partition protocol. Following this result, we apply these components and the corresponding parameters in all experiments of Tab. 1 and Tab. 2.
Ablation study on hyperparameters.
We ablate the following important hyperparameter of U2PL. Tab. 7 studies the impact of different initial reliable-unreliable partitions, showing that $\alpha_0$ affects performance. We find that $\alpha_0 = 20\%$ achieves the best result. A small $\alpha_0$ introduces incorrect pseudo-labels for supervision, while a large $\alpha_0$ underutilizes information from some high-confidence samples. Appendix D.1 examines other hyperparameters.
4.4. Qualitative Results
Figure 4 shows the results of different methods on the PASCAL VOC 2012 val set. Benefiting from the use of unreliable pseudo-labels, U2PL outperforms the other methods. Note that applying contrastive learning without filtering unreliable pixels can sometimes hurt the model (see rows 2 and 4 in Figure 4), leading to worse results than training with labeled data only.
In addition, by visualizing the segmentation results, we find that our method performs better on ambiguous regions (e.g., boundaries between different objects). This visual difference demonstrates that our method ultimately strengthens the model's discriminative ability on unreliable predictions.
5. Conclusion
We have presented U2PL, a semi-supervised semantic segmentation framework that incorporates unreliable pseudo-labels into training. It outperforms many existing state-of-the-art methods, suggesting that our framework provides a new and promising paradigm for semi-supervised learning research. Our ablation studies verify the insights of this work, and the qualitative results give intuitive evidence of its effectiveness, especially on the boundaries between semantic objects and other ambiguous regions. Compared with fully supervised methods [5,6,29,35,46], training our method is time-consuming, which is a common drawback of semi-supervised learning tasks [9,20,21,33,43,48]. Due to the extreme scarcity of labels, semi-supervised learning frameworks usually have to trade training time for higher accuracy. Their training optimization is worth deeper exploration in the future.