Image Co-Segmentation via Examples Guidance

Given a collection of images containing objects from the same category, co-segmentation methods aim at simultaneously segmenting the common objects in each image. Most existing co-segmentation approaches rely on computing similarities between regions representing the foregrounds of these images. However, region similarity measurement is challenging due to the large appearance variations among objects of the same category. In addition, for real-world images with cluttered backgrounds, existing co-segmentation approaches lack the robustness needed to extract the common object from the background. In this paper, we propose a new co-segmentation method which takes advantage of the reliable segmentation of a few selected images to guide the segmentation of the remaining images in the collection. A random sample of images is first selected from the image collection. Then, the selected images are segmented using an interactive segmentation method. These segmentation results are used to construct positive/negative samples of the targeted common object and of the background regions, respectively. Finally, these samples are propagated to the remaining images in the collection by computing both local and global consistency. Experiments on the iCoseg and MSRC datasets demonstrate the performance and robustness of the proposed method.

Keywords—Co-segmentation; image segmentation; segmentation propagation; MRF-based segmentation


I. INTRODUCTION
Foreground segmentation is defined as the task of generating pixel-level foreground masks for all the objects in a given image or video. Accurate foreground segmentation is a fundamental problem in computer vision, since it has several potential applications such as content-based image retrieval [1], image editing [2] and action recognition [3].
In order to highlight the foreground region to be extracted, image segmentation approaches exploit different metrics at the pixel or region level, such as saliency, color, texture or shape. However, when dealing with images that have cluttered backgrounds, or images where the foreground has similar attributes to the background, the question of "what to segment out" becomes more problematic. Considering the limitations of single-image segmentation, jointly segmenting multiple images containing a common object has become very popular in recent years: the common patterns that exist in a set of similar images can compensate for the lack of information about the foreground object. This task of simultaneously segmenting multiple images which contain common or similar objects is known as image co-segmentation.

A. Motivation
Numerous co-segmentation approaches, with various formulations, have been proposed and have proven to be very effective in extracting common objects from a collection of related images. The main idea of all these approaches is to exploit the repeated pattern in the image collection to obtain a form of prior information about the common object to be extracted. On the one hand, this weak supervision is an attractive form of leverage which is not available in the case of single-image segmentation; on the other hand, the existing co-segmentation models also involve new challenges: 1) Even for images that contain a common object, similarity measurement is challenging due to the large appearance variations among objects of the same category. Also, for images with cluttered backgrounds, it can be quite difficult to distinguish the object from the background, and the image similarity calculation may then be useless. 2) Even with prior information obtained from the related images, the resulting fully automatic segmentation may be imperfect, and in some situations segmenting images individually performs better, as demonstrated in [4], [5]. Furthermore, in realistic applications, images often contain similar backgrounds (i.e. similar scenes), such as frames sampled from a video. For these images the co-segmentation process may provide random and insufficiently accurate results. 3) The co-segmentation problem is usually formulated using complex models which require a number of parameters to be tuned, especially when dealing with large datasets.

B. Contributions
To deal with the above challenges, the idea of this paper is to use the segmentation of a small sample of images to guide the segmentation process in the remaining images. All object/background segments in the sampled set are used as positive/negative samples, exploited as reliable prior information about the common object in the image collection. The segmentation of a given image is then mainly based on the similarity between candidate object regions extracted from this image and the positive/negative samples. In particular, the aim is to transfer the training samples to the unsegmented images by considering global and local consistency simultaneously. The main contributions of this paper are:
• Given the foreground segmentation of only a randomly selected subset of images, a simple local and global consistency propagation method is proposed to guide the segmentation process of the remaining unsegmented images.
• The proposed method is not limited to segmenting predefined object categories provided in a learning process, which is the case for fully supervised methods. As our method is partially interactive, it can segment any object based on the randomly sampled images.
• Instead of propagating only object segmentation samples to the unsegmented images, the proposed method considers both object and background samples in the propagation process. Indeed, this prior information about both the targeted object and the background can better discern the common object in the images, particularly when the image background shares the same features as the foreground.
The rest of this paper is organized as follows: an overview of the related work is presented in Section II. The proposed method is explained in Section III. Experimental results and discussion are given in Section IV, followed by concluding remarks in Section V.

II. RELATED WORK
Co-segmentation is a relatively recent branch of image segmentation. It is defined as the task of jointly segmenting the common region/object from multiple related images. This idea was first introduced in [6]. Since then, numerous formulations of the co-segmentation problem have been proposed, ranging from binary-class co-segmentation models (a single common foreground object) to multi-class co-segmentation and multi-group co-segmentation. In this study, we are interested in the binary-class co-segmentation model.
The first family comprises co-segmentation methods based on the Markov Random Field (MRF) model. The main idea behind these approaches is to extend the single-image segmentation model by adding a foreground similarity constraint to the traditional MRF segmentation model. Usually, a new global term which accounts for foreground similarity is added to the energy function. Several foreground similarity measurements have been designed, such as the L1-norm [6] and the L2-norm [11]. In the work of Hochman et al. [9], a rewarding similarity measurement is proposed instead of penalizing the foreground difference. This similarity measurement leads to a submodular energy function which can be easily optimized with graph cuts. Vicente et al. [13] compared the three aforementioned MRF-based models and derived a new effective model that was optimized using the dual decomposition method. Later, many works contributed to improving the foreground similarity measure by bringing scale invariance [12], [20]. In the same vein, Batra et al. [7] extended the traditional interactive segmentation method by developing an interactive image co-segmentation approach which segments common objects from the image collection through human interaction. Dong et al. [21] proposed a new interactive co-segmentation method formulated as a unified energy function which encodes the global scribble energy, the inter-image energy and the local smoothness energy. More recently, [22] introduced the use of higher-order energies to formulate the interactive image co-segmentation problem, where the higher-order term encodes the consistency between the labeled regions and all over-segmentation regions in the image. Instead of relying on user interaction, other methods use co-saliency, a task closely related to image co-segmentation, to estimate possible foreground locations; these co-saliency values are then exploited to construct the MRF data term. However, adding a foreground similarity constraint to the MRF model results in a non-submodular energy function which is not easy to optimize. Hence, the focus of MRF-based co-segmentation methods has been on improving approximate solutions, which in most cases leads to a coarse segmentation of the common object.
Other works formulate the co-segmentation problem as a clustering task. In [14], the authors handle the segmentation problem in a discriminative framework that combines bottom-up image segmentation with a kernel method to assign foreground/background labels jointly to all images. To deal with foreground appearance variations, they use multiple invariant features in the similarity measurement. The discriminative clustering based co-segmentation method [14] was extended in [16] to segment multiple common regions. This method introduces a spectral-clustering term and a discriminative term into a new energy function which can be optimized efficiently using the EM algorithm. A large-scale co-segmentation method was proposed by Kim et al. [15], where the joint segmentation task is modeled as temperature maximization with finite K heat sources on a linear anisotropic diffusion system. This can be represented as a K-way segmentation that maximizes the segmentation confidence of every pixel in an image. In theory, this temperature function is submodular, and thus at least a constant-factor approximation of the optimal solution is guaranteed by a greedy algorithm.
MRF-based methods and clustering-based approaches usually provide only coarse pixel-level segmentation; thus, large object variations and complicated image backgrounds degrade their performance. To this end, methods based on object proposals have attracted growing attention [5], [17]-[19], [23]. The main idea behind these methods is to select a subset of the object proposals by evaluating their consistency using region similarity.
These proposals are generated beforehand, and the selected ones are considered as common targets. In [5], the constraint that the common target has to be an object was added to the co-segmentation framework, and an off-line learning method was introduced to retrieve visually similar object proposals among different images. These new aspects contributed to a notable improvement in object co-segmentation performance. In [18], the object proposals of all images are represented with a directed graph where the similarity between adjacent object proposals is encoded by weighted edges; the common foreground selection is then achieved using a shortest-path algorithm. In the work of [23], additional information such as depth is used to improve proposal-based co-segmentation results. These approaches are easily affected by the quality of the generated proposals, and they fail to work well when there are no good proposals among the generated candidates.
All existing co-segmentation approaches exploit weak prior information, namely that the same object category is contained in the collection of images. These approaches constrain the correspondence relationship between common objects to better highlight them. For instance, they use additional priors such as an objectness measure [24], a saliency prior or a co-saliency measure [8]. By introducing these constraints into the object co-segmentation formulation, the common objects can be better segmented even under high appearance variations. Even so, these models still cannot obtain robust performance on real-world image collections, where the target objects are not salient or share similar features with the image background.
In this paper, we propose to use the segmentation of a few images to guide the segmentation of the remaining images in the collection. In contrast to fully supervised methods, which require a large amount of training data from a predefined set of object categories, we demonstrate in this work that propagating the segmentation of a few images from the image collection can considerably improve the segmentation performance, especially when segmenting a common object from a complex image collection.

III. THE PROPOSED METHOD
Given a collection of images all belonging to the same object category, the goal is to extract the common object from all these images. The basic idea of this work is to exploit the segmentation results of randomly selected image samples and use these results to guide the segmentation of the entire image collection. The work-flow of the proposed method is shown in Fig. 1. First, an image sample is randomly selected from the image collection (Fig. 1a). Then, from each selected image, foreground and background regions are extracted using an interactive segmentation method to form a set of positive and negative segments (Fig. 1b), such that the positive segments are instances of the targeted object which we aim to segment out, and the negative segments represent background regions. Finally, the main step of the proposed approach is to transfer this available information (i.e. the positive/negative segments) to the remaining images in the collection (Fig. 1c). To do so, multiple region candidates are generated from each remaining image. Afterwards, the positive/negative segments are transferred to each region candidate by considering both global and local region consistency. The different steps of the proposed co-segmentation method are detailed in Algorithm 1.

A. Random Image Sample Selection
Consider I = {I_1, I_2, ..., I_N} a large collection of N images, all of which contain instances of the same object category. From the image collection I, an image subset T = {I_1, ..., I_M} of M images is randomly selected. In the next step, these selected images will be used to extract the positive and negative samples.
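As a concrete illustration, the selection step amounts to uniform sampling without replacement; a minimal sketch follows (the seed parameter is our addition for reproducibility and is not part of the paper):

```python
import random

def select_sample(image_ids, m, seed=0):
    """Pick the subset T of M images that will be segmented interactively.

    The paper selects T uniformly at random from the collection I;
    the fixed seed here is only for reproducible experiments.
    """
    rng = random.Random(seed)
    return rng.sample(list(image_ids), m)
```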

Algorithm 1 The proposed co-segmentation method
1: From the image collection I = {I_1, I_2, ..., I_N}, select a random subset T = {I_1, ..., I_M} of M images.
2: Obtain the segmentation result for each image I_k in T using the GrabCut algorithm, and construct the positive/negative sample set from these segmentations.
3: for each remaining image I_i do
4:    generate a set of candidate regions {C_ij}, j = 1, ..., R
5:    retrieve a set N_i of most similar images I_k in T
6:    for each region C_ij do (global consistency)
7:        retrieve the n_s most similar samples in N_i using equation (5)
8:    for each local region r_ij do (local consistency)
9:        retrieve its most similar local regions in I_k using equation (8)
10:   obtain the final segmentation using the GrabCut algorithm

B. Positive/Negative Segments Extraction
In this step, we aim to generate positive and negative segments from the selected image subset T = {I_1, ..., I_M}. For that, we use the GrabCut method [25], which is an interactive segmentation method. Given an image I_i ∈ T, the goal is to estimate a label matrix L_i, where L_i(p) = y_i(p) denotes the binary label of pixel p, with y_i(p) ∈ {0, 1}. The label 0 denotes the background and 1 the foreground. The standard GrabCut framework [25] involves three steps: initial labeling, learning the appearance model using a Gaussian Mixture Model (GMM), and energy minimization.
• Initial labeling: Initially, the user provides a bounding box specifying the foreground and background regions. Label 1 is assigned to pixels within the foreground region and label 0 to pixels within the background region.
• Learning the appearance model using GMMs: In this step, the pixels inside and outside the bounding box are used to learn two Gaussian Mixture Models for the foreground and the background in RGB space. Let G_i^f and G_i^b denote these two mixture models. Then, the negative log-likelihood of a pixel p is computed as follows:

    D_i^f(p) = -log Pr(z_i(p) | G_i^f),    D_i^b(p) = -log Pr(z_i(p) | G_i^b)    (1)

where z_i(p) denotes the RGB color of pixel p in image I_i. This term reflects the cost of assigning a pixel to the foreground (or background) according to the GMM models.
• Energy minimization: The object extraction is performed by minimizing the following Gibbs energy function:

    E(L_i) = U(L_i) + V(L_i)    (2)

where U(L_i) is the data term encoding the probability that each pixel p belongs to the object or the background:

    U(L_i) = Σ_p D_i^{L_i(p)}(p)    (3)

with L_i(p) the label of pixel p, equal to 1 if p belongs to the object and 0 if it belongs to the background. V(L_i) is the smoothness term that penalizes assigning different labels to neighboring pixels with similar color features. It is defined as follows:

    V(L_i) = Σ_{(p,q) ∈ N} [L_i(p) ≠ L_i(q)] exp(-β ||z_i(p) - z_i(q)||²)    (4)

where N is the set of neighboring pixel pairs and β is a scaling parameter. Equation (2) is efficiently minimized using GrabCut, which applies five rounds of iterative refinement, alternating between learning the likelihood values with the GMMs and updating the label estimates.
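The data cost of equation (1) can be sketched as follows, assuming diagonal-covariance GMM components for simplicity (GrabCut fits full covariances, but the structure of the negative log-likelihood is the same):

```python
import numpy as np

def gmm_nll(pixels, weights, means, variances):
    """Negative log-likelihood of RGB pixels under a diagonal-covariance GMM.

    A simplified stand-in for the data cost D of Eq. (1): evaluate once
    with the foreground model G^f and once with the background model G^b.
    """
    pixels = np.asarray(pixels, dtype=float)          # shape (N, 3)
    likelihood = np.zeros(len(pixels))
    for w, mu, var in zip(weights, means, variances):
        mu = np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)
        norm = w / np.sqrt(np.prod(2.0 * np.pi * var))  # Gaussian normalizer
        diff = pixels - mu
        likelihood += norm * np.exp(-0.5 * np.sum(diff ** 2 / var, axis=1))
    return -np.log(likelihood + 1e-300)
```

A reddish pixel then incurs a lower cost under a red foreground model than under a blue background model, which is exactly what the data term of equation (3) rewards.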
After obtaining the segmentation results of all images in T, we extract positive and negative samples as shown in Fig. 1c: all object segmentation results, i.e. foreground regions, are considered as positive samples, and background regions are similarly collected as negative samples. In the next step, all extracted positive and negative regions will be propagated to each remaining image in the collection in order to guide its segmentation.

C. Segmentation via Examples Guidance
In the GrabCut-based segmentation method, the unary term U(L_i) describes the foreground model learned from the user scribbles on the image. In this step, we aim to substitute the pre-segmented image sample for the user interaction: the previously extracted positive/negative samples are used to guide the segmentation of the remaining images. Hence, for each unsegmented image, we define the unary term based on the proposed segmentation propagation process (discussed next), and then perform the GrabCut segmentation to extract the object foreground.
1) Object candidate generation: In order to transfer the available positive/negative samples to the unsegmented images, we first extract a number of region candidates which represent object and background regions. To ensure that the common object can be segmented as a local region, the method proposed in [18] is adopted. Namely, for each image I_i, the candidate set comprises three subsets: C1 consists of superpixels generated using the over-segmentation method [26], C2 contains the segmentation results obtained by the saliency detection method [27], and C3 includes the segmentations of objects detected in I_i using the object detection method [24]. Note that the extracted object regions will match the positive samples strongly; in the same way, particularly for images from similar scenes, background regions are more likely to match the negative samples.
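For intuition, the shape of the candidate pool can be mimicked with a toy multi-scale grid tiling; this is only a hypothetical stand-in for the superpixel (C1), saliency (C2) and detection (C3) proposals drawn from [26], [27] and [24]:

```python
import numpy as np

def grid_candidates(shape, scales=(2, 4)):
    """Toy region-candidate generator: tile the image with grids at several
    scales and return (y0, y1, x0, x1) boxes.

    Real candidates come from over-segmentation, saliency detection and
    object detection; this sketch only illustrates the data layout.
    """
    h, w = shape
    boxes = []
    for s in scales:
        ys = np.linspace(0, h, s + 1).astype(int)
        xs = np.linspace(0, w, s + 1).astype(int)
        for i in range(s):
            for j in range(s):
                boxes.append((ys[i], ys[i + 1], xs[j], xs[j + 1]))
    return boxes
```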
2) Segmentation propagation: After extracting region candidates from each unsegmented image, we propagate the previously constructed positive/negative samples to each region candidate based on region similarities. Furthermore, to deal with object variance among images, we need to propagate the available segmentation samples from the most similar pre-segmented images. Hence, for each image I_i ∈ I \ T, we first retrieve a set N_i of most similar images in T and estimate the common object in I_i guided by those images only. In order to account for the foreground region in the similarity measurement, the weighted Gist descriptor [28] is used to represent each image. Basically, given the saliency map S_i of image I_i, a coarse initial foreground/background estimate is computed by thresholding S_i with Otsu's method [29], i.e. S_i^f, S_i^b = Otsu(S_i), and these pixel estimates are then used as weights for the Gist descriptor. We define the segmentation propagation task using two components, namely global consistency and local consistency: the global consistency propagates overall information by considering the whole segment in the similarity measurement, while the local consistency, designed to deal with object appearance variations, propagates local information represented by local patches to the extracted region candidates.
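The Otsu threshold used for the coarse saliency split can be computed from a histogram of the saliency values; the following is a minimal histogram-based sketch:

```python
import numpy as np

def otsu_threshold(values, nbins=64):
    """Histogram-based Otsu threshold: pick the cut that maximizes
    the between-class variance of the two resulting groups."""
    values = np.asarray(values, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=nbins)
    hist = hist.astype(float)
    centers = 0.5 * (edges[:-1] + edges[1:])
    total = hist.sum()
    csum = np.cumsum(hist)                 # cumulative counts
    cmean = np.cumsum(hist * centers)      # cumulative weighted sums
    best_t, best_var = edges[1], -1.0
    for i in range(1, nbins):
        w0, w1 = csum[i - 1], total - csum[i - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cmean[i - 1] / w0
        m1 = (cmean[-1] - cmean[i - 1]) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, edges[i]
    return best_t
```

Thresholding the saliency map S_i at this value yields the coarse foreground estimate S_i^f used to weight the Gist descriptor.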
a) Global consistency: In the global consistency computation, the whole-segment information of the positive/negative samples is propagated to each unsegmented image. Given an image I_i and the set N_i of its most similar images among the randomly selected images T, for each object candidate C_ij in I_i we first retrieve the n_s most similar samples in N_i, one in each pre-segmented image I_k:

    S_{l(k)} = argmin_{S_kl ∈ I_k} D(C_ij, S_kl)    (5)

where S_kl is a positive or negative sample and D(C_ij, S_kl) is the chi-square distance between the features of C_ij and S_kl. Then the common object estimate of object candidate C_ij is given by:

    M_co(C_ij) = (1 / n_s) Σ_{I_k ∈ N_i} M(S_{l(k)})    (6)

where M(S_{l(k)}) is the object likelihood of the region sample S_{l(k)}. Clearly, if the regions S_{l(k)} retrieved by equation (5) are positive samples, their object likelihoods M(S_{l(k)}) are set to 1; as a result, the common object estimate of region C_ij is higher and this region is more likely to belong to the common object. Otherwise, if these regions are negative samples (object likelihood set to 0), the common object estimate M_co(C_ij) is lower. Note that the object candidates C_ij extracted from I_i may overlap, so a pixel I_i(p, q) at location (p, q) may belong to multiple object candidates and be assigned multiple common object estimates M_co(p, q). In this case, the largest one is selected as the common object estimate of the pixel.
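The retrieval and averaging of equations (5) and (6) can be sketched as follows, assuming each pre-segmented image contributes one nearest sample and likelihoods are 1 for positive and 0 for negative samples (the data layout is our own, hypothetical one):

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two feature histograms."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def global_consistency(cand_hist, samples_per_image):
    """Eq. (5)-(6) sketch: in each pre-segmented image, retrieve the sample
    closest to the candidate, then average the retrieved object likelihoods.

    `samples_per_image` maps an image id to a list of (histogram, likelihood)
    pairs, with likelihood 1 for positive and 0 for negative samples.
    """
    retrieved = []
    for pairs in samples_per_image.values():
        best_hist, best_lik = min(pairs, key=lambda s: chi_square(cand_hist, s[0]))
        retrieved.append(best_lik)
    return float(np.mean(retrieved))
```

A candidate whose histogram resembles the positive samples thus receives an estimate close to 1, and one resembling the negative samples an estimate close to 0.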
b) Local consistency: In real-world conditions, the global appearance of the common object is often inconsistent and difficult to capture due to large variations in viewpoint, scale and object pose. As a result, considering only the global consistency with the pre-segmented images may not be sufficient to properly estimate the common object in a given image.
To handle this problem, we also consider local consistency by transferring local regions of the positive/negative samples to the unsegmented image. To do so, a set of local patches, represented by windows, is extracted from both I_k and I_i. These local regions are then ranked to select the n_r = 10 most relevant patches {r_1^k, ..., r_{n_r}^k} representing local information in the pre-segmented image I_k, and {r_1^i, ..., r_{n_r}^i} for image I_i. We also obtain the local object likelihood m(r_l^k) of a region r_l^k by directly using the object likelihood value of its corresponding positive/negative sample in I_k. For a local region r_j^i, we search for its most similar local region in I_k based on the distance between the feature histograms h_ij and h_kl of the window regions r_j^i and r_l^k, respectively:

    r_{l*}^k = argmin_{r_l^k} D(h_ij, h_kl)    (8)

Similarly to the global consistency computation, we obtain the local common object estimate M_i^L(p, q) as:

    M_i^L(p, q) = m(r_{l*}^k)    (9)

where (p, q) is the pixel location and m(r_{l*}^k) is the object likelihood of the positive/negative local region sample (equal to 1 if it belongs to the object and 0 otherwise). As in the global consistency computation, a pixel (p, q) may be assigned several local common object estimates because the detected local regions overlap; the largest one is chosen as the common object estimate.
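The per-pixel handling of overlapping estimates, used by both the global and the local consistency maps, can be sketched as follows (the (y0, y1, x0, x1) window representation is our hypothetical choice):

```python
import numpy as np

def rasterize_estimates(windows, estimates, shape):
    """Turn per-window object estimates into a per-pixel map.

    Where windows overlap, a pixel keeps the largest estimate,
    as described for M_co(p, q) and M_i^L(p, q).
    """
    m = np.zeros(shape)
    for (y0, y1, x0, x1), e in zip(windows, estimates):
        m[y0:y1, x0:x1] = np.maximum(m[y0:y1, x0:x1], e)
    return m
```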

c) Common object extraction:
To obtain the final common object estimates, we combine the global and local consistency maps as follows:

    M_i^T = α M_i^G + (1 - α) M_i^L    (10)

where α is a scaling coefficient. From the common object estimates of image I_i, the object extraction is performed using the GrabCut algorithm described in Section III-B. Here, the initial label assignment of a pixel p is determined by thresholding M_i^T using the common Otsu method [29]:
    L_i(p) = 1 if M_i^T(p) ≥ τ, and 0 otherwise

where τ is the global threshold value. The final segmentation of image I_i is then obtained iteratively by alternating between learning the foreground/background GMMs and updating the label assignments (equations (1), (2), (4)).
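Putting the last steps together, equation (10) and the initial labeling can be sketched as follows (the map mean stands in for Otsu's threshold to keep the sketch short):

```python
import numpy as np

def combine_and_label(m_global, m_local, alpha=0.6):
    """Fuse the global and local consistency maps (Eq. (10)) and derive
    the initial GrabCut labels by thresholding the fused map.

    The paper thresholds with Otsu's method; the map mean is used here
    as a simplified stand-in.
    """
    m_total = alpha * np.asarray(m_global) + (1 - alpha) * np.asarray(m_local)
    tau = m_total.mean()                      # stand-in for the Otsu threshold
    labels = (m_total >= tau).astype(np.uint8)
    return m_total, labels
```

The resulting labels then initialize the GrabCut refinement of Section III-B on each unsegmented image.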

IV. EXPERIMENTAL RESULTS

A. Experimental Setting
To demonstrate the efficiency of the proposed method, experiments are conducted on two publicly available datasets, namely the iCoseg [7] and MSRC [30] datasets, which have been frequently used in previous co-segmentation studies. The MSRC dataset contains 14 categories with 418 images in total; the iCoseg dataset contains 38 categories with 643 images in total. Regarding the parameter setting, we set the number of randomly selected images to M = 6 and the number of nearest neighbors to n_s = 3. In (10), the coefficients α and 1 − α regulate the importance of the global and local consistency terms; we set α = 0.6 for all datasets. The color histogram is used for segmentation propagation on the iCoseg dataset. For the MSRC dataset, which exhibits more intra-group variation, color features are unreliable for matching segments; we therefore use dense SIFT features for matching.
Following the literature, two objective measures, Jaccard similarity (J) and precision (P), are used for the quantitative results. Denote by A_p^f, A_p^b, A_g^f and A_g^b the proposed foreground pixel set, proposed background pixel set, ground-truth foreground pixel set and ground-truth background pixel set, respectively. The Jaccard similarity is defined as the size of the intersection divided by the size of the union of the proposed and ground-truth foreground pixel sets:

    J = |A_p^f ∩ A_g^f| / |A_p^f ∪ A_g^f|

Precision [31] is defined as the percentage of pixels that have the same label in both the proposed and ground-truth masks:

    P = (|A_p^f ∩ A_g^f| + |A_p^b ∩ A_g^b|) / (|A_g^f| + |A_g^b|)

The quantitative comparisons between the state-of-the-art algorithms and ours are given in the following subsections.
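Given binary foreground masks, the two measures reduce to a few lines:

```python
import numpy as np

def jaccard(pred, gt):
    """Jaccard similarity J: |intersection| / |union| of the proposed
    and ground-truth foreground pixel sets."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def precision(pred, gt):
    """Precision P: fraction of pixels whose label (foreground or
    background) agrees with the ground-truth mask."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return float((pred == gt).mean())
```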

B. Comparison with the State-of-the-Art
The proposed method is compared with several state-of-the-art object co-segmentation algorithms: unsupervised joint object discovery and segmentation in Internet images [4] (ObjectDiscovery13), group saliency propagation for large-scale and quick image co-segmentation (GSP) [32], automatic image co-segmentation using geometric mean saliency (GMS) [28], image co-segmentation via saliency co-fusion (Kotes16) [33], and a semi-supervised method for image co-segmentation (Es-salhi17) [34].
We note that to compare with the work of Rubinstein et al. [4], the results are reproduced using their publicly available implementation. Moreover, the results of [28] and [32] are regenerated by running the code kindly provided by the authors. For co-segmentation via saliency co-fusion [33], the results reported in the paper are considered.
For the iCoseg dataset, the precision values obtained by each method on the different image groups are depicted in Fig. 2; the precision averages over all groups are shown in the first column. Clearly, the proposed method achieves the best result (92.71% average accuracy). Specifically, compared with the work in [34], which transfers the object segmentation of randomly selected images to the unsegmented images, the proposed method performs better. This demonstrates that transferring both the object segmentation and the background regions to the unsegmented images can accurately extract the common object from an interfering or complex background. This is particularly observable for the image groups bear (90.07% average accuracy) and brown bear (97.85%), where the images share similar foregrounds and backgrounds; this is also the case for stonehenge (97.30%), panda1 (91.03%), panda2 (84.29%), kendo (97.58%), kendo2 (98.93%) and taj mahal (93.45%), where the proposed method considerably improves the segmentation accuracy. Moreover, the proposed method outperforms the other methods on most groups.

Fig. 2: Comparison between the proposed method and the state-of-the-art methods ObjectDiscovery13 [4], GMS14 [28], GSP [32] and Es-salhi17 [34] on the iCoseg dataset.
We next objectively evaluate the proposed method with the Jaccard similarity metric (J). The results are summarized in Table 2. Our proposed method outperforms the existing methods on most image groups of the challenging iCoseg dataset. In particular, the method gives considerably better results than [28] and [4], even though they use dense correspondences to compute consistency between the images in a group. This is expected, since in a group of related images the object instances usually appear on similar backgrounds, and consequently computing correspondences between these images cannot accurately highlight the common object. This is not a crucial issue in our approach, where the prior information transferred from the positive/negative samples can accurately guide the segmentation of the common object.
Besides, it should be noted that some image groups in the iCoseg dataset contain a small number of images (fewer than ten), which is not appropriate for the random image selection step. For these groups, only the segmentation of one randomly selected image is propagated to guide the segmentation of the remaining images. Even under these conditions, our method gives appealing results, especially for the brown bear and taj mahal groups.
For an overall comparison, Table 1 shows the average precision and Jaccard similarity on the iCoseg dataset compared with the existing methods. Fig. 4 further illustrates visual results of the proposed method on 10 sample groups from the iCoseg dataset; the odd columns show the original images and the even columns display their segmentation results. We can clearly see that the proposed method achieves smooth segmentation results even when the common object appears against cluttered or similar backgrounds. We also evaluate our approach on the MSRC dataset; the quantitative results are presented in Fig. 3, where a class-wise comparison of our method with the state of the art is shown. In this comparison, 12 groups are used. It can be seen that our results are very competitive with the best methods [28] and [32]. In particular, our method outperforms the other existing methods on the "cow", "sheep", "plane" and "bird" groups.
Furthermore, it is interesting to notice that the proposed method achieves better results than [34] on almost all image groups. This is expected, since that method propagates only the positive segments (regions that contain the targeted object) to the other images, while the proposed method is based on transferring both positive and negative segments. This allows a better performance even when images share a similar background or when the common object appears against a very cluttered image background. Fig. 5 shows sample segmentation results from the MSRC dataset; we display images from 4 groups to show the performance of our method. The first column of each group contains the original images and the second column displays the segmented images. These qualitative results show that the proposed method can distinctly improve the segmentation accuracy.

Fig. 3: Comparison between the proposed method and the state-of-the-art methods GMS14 [28], GSP [32] and Es-salhi17 [34] on the MSRC dataset.

Table 3 lists the detailed Jaccard similarity results for the MSRC dataset. The proposed method achieves results comparable with the other methods, notably on the following image groups: cow (0.812), face (0.625), plane (0.596) and sheep (0.820).


Fig. 4: Sample segmentation results on the iCoseg dataset. There are eight groups of images. In each group, the first column shows the original images and the second column the segmentation results.

TABLE 1: Precision average P and Jaccard similarity average J on the iCoseg dataset.

TABLE 2: Comparison between the proposed method and other co-segmentation methods in terms of Jaccard similarity. Image groups of the iCoseg dataset are considered.

TABLE 3: Comparison between the proposed method and other co-segmentation methods in terms of Jaccard similarity. Classes of the MSRC dataset are considered.