Object Categorization Using Class-Specific Representations
Chunjie Zhang, Jian Cheng, Liang Li, Changsheng Li, and Qi Tian
Abstract— Object categorization refers to the task of automatically classifying objects based on their visual content. Existing approaches simply represent each image with visual features without considering the specific characteristics of images within the same class. However, objects of the same class may exhibit unique characteristics, which should be represented accordingly. In this brief, we propose a novel class-specific representation strategy for object categorization. For each class, we first model the characteristics of images within that class using a Gaussian mixture model (GMM). We then represent each image by calculating the Euclidean distance and the relative Euclidean distance between the image and the GMM of each class. We concatenate the representations of all classes for joint representation. In this way, we can represent an image by not only considering its visual content but also incorporating the class-specific characteristics. Experiments on several publicly available data sets validate the superiority of the proposed class-specific representation method over well-established algorithms for object category prediction.
Index Terms— Class-specific representation, image classification, object categorization, visual representation.
I. INTRODUCTION
Automatically categorizing images based on their visual content is hard due to the semantic discrepancy between visual features and human perception. Researchers have proposed various algorithms [1]–[3] to solve this problem, of which the bag-of-visual-words (BoW) model [1] is the most widely used for both its effectiveness and its simplicity. The BoW model tries to represent images by quantizing local features into visual words. Sparse coding [2] is also used to reduce the quantization loss. Recently, directly harvesting information from image pixels with deep convolutional neural networks (CNNs) [3] has become popular, and its performance is much better than that of the BoW model.
Manuscript received July 29, 2016; revised December 29, 2016;
accepted September 24, 2017. This work was supported in part
by the National Natural Science Foundation of China under
Grant 61303154 and Grant 61332016 and in part by the Scientific Research
Key Program of Beijing Municipal Commission of Education under
Grant KZ201610005012. The work of Q. Tian was supported in part by ARO
under Grant W911NF-15-1-0290, in part by the Faculty Research Gift Awards
by NEC Laboratories of America and Blippar, and in part by the National
Science Foundation of China under Grant 61429201. (Corresponding author:
Jian Cheng.)
C. Zhang is with the Research Center for Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences,
Beijing 100190, China, and also with the University of Chinese Academy of
Sciences, Beijing 100049, China (e-mail: [email protected]).
J. Cheng is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100049, China, also
with the University of Chinese Academy of Sciences, Beijing 100049, China,
and also with the Center for Excellence in Brain Science and Intelligence
Technology, Chinese Academy of Sciences, Beijing 100049, China (e-mail:
[email protected]).
L. Li is with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences,
Beijing 100049, China (e-mail: [email protected]).
C. Li is with the School of Computer Science and Engineering, University
of Electronic Science and Technology of China, Chengdu 611731, China
(e-mail: [email protected]).
Q. Tian is with the Department of Computer Sciences, The University of Texas at San Antonio, San Antonio, TX 78249 USA (e-mail:
[email protected]).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2017.2757497
Although effective, these methods have two problems. First, each image is represented using only its own visual information, without considering other images of the same class.
This is suboptimal as classifiers are trained to separate one class
of images from others. Due to the variations of images and noisy
information, the initial representations may be scattered and cannot
be well separated. Besides, there are often multiple objects whose
representations are mixed together. Researchers try to solve this
problem by modeling the contextual and co-occurrence information
of different objects [4]–[9], or leveraging the information of other
sources [10]–[13]. However, these methods are often conducted on
the image level without fully exploring the correlations of objects
within the same class. The joint modeling of images of the same class
can help to boost the discriminative power of image representations
for better categorization.
Moreover, the one-vs-all strategy is often used for categorizing objects. One class of objects is separated from objects of other classes by training discriminative classifiers. The images containing the objects to be classified are viewed as positive samples, while the other images are regarded as negative samples. The positive samples may exhibit similar and correlated characteristics, as they all belong to the same semantic class. However, the negative samples do not have such correlations, as they may belong to various classes with different visual appearances. Traditional methods [14]–[16] often treat them jointly without considering the diversity of the negative samples. The class information of images should be combined with the visual representation of each image to improve the discrimination of the final representations.
To solve the above-mentioned problems, in this brief, we propose a novel class-specific representation method for object categorization. Images are first represented with visual features. We then model the specific characteristics of each class by fitting the image representations with Gaussian mixture models (GMMs). We measure the similarities between each image and each class using the GMMs to represent the corresponding image. In this way, we can combine the visual information of images with the class-specific discrimination of images. We conduct object categorization experiments on several public image data sets, and the experimental results demonstrate the superiority of the proposed method over other baseline algorithms.
The main contributions of this brief lie in three aspects.
1) First, instead of using a GMM over all the images, we use a GMM to mine the specific characteristics of images within the same class. The GMM of each class encodes the class-level correlations of images for representation.
2) Second, we generate the class-specific representations by measuring the similarities of images to each class using the GMM models. The class-specific representations are more specific and discriminative than traditional BoW-based representations. The visual information and the class-specific representation are jointly combined for discriminative object categorization.
3) Third, the proposed class-specific representation method can be combined with more discriminative and efficient image representation algorithms (e.g., CNN and Fisher vector [17]) to further improve the categorization performance.
The rest of this brief is organized as follows. We give the
related work in Section II. The details of the proposed class-specific
representation method for categorization are given in Section III.
In Section IV, we give the experimental results and analysis. Finally,
we conclude in Section V.
II. RELATED WORK
Many algorithms [1]–[3] have been proposed to automatically categorize images based on their content. The BoW model was proposed by Sivic and Zisserman [1], which quantizes local features into visual words. Yang et al. [2] reduced the quantization loss with sparse coding and improved the performance. Krizhevsky et al. [3] used deep CNNs to extract image representations directly from pixels.
To make use of the contextual and co-occurrence information of objects, much work [4]–[9] has been done. Wang et al. [4]
proposed to use the local information for sparse coding, while
Zhang et al. [5] used low-rank decomposition to model the
co-occurrence information of images within the same class.
Chen et al. [6] used detection results as the context for classification. Armanfard et al. [7] proposed to select local features to get rid of the noisy information. Li et al. [8] explored the locality constraint for dictionary learning, while Zhang et al. [9] tried to contextually represent images using graphlet paths.
The use of information from other sources [10]–[13] has also been proposed. Zhang et al. [10] transferred the codebooks of other data sets, while Xiao and Guo [11] used kernel matching for semi-supervised domain adaptation. Li et al. [12] tried to harvest images from the Internet for semantic representation of images. Fu et al. [13] proposed to make use of the testing images through transductive multiview learning.
Researchers also tried to reorganize the training samples for better
representation [14]–[16]. Zhang et al. [14] represented images in sub-semantic spaces, while Rasiwasia and Vasconcelos [15] modeled
the contextual information of different semantics for recognition.
Ristin et al. [16] incrementally learned random forests for large-scale
classification.
Instead of only using visual features, representing images with attributes [18]–[20] and semantics [21]–[23] has also been widely studied. Farhadi et al. [18] used attributes for object description, while Lampert et al. [19] learned to detect objects by transferring attributes. Parikh and Grauman [20] compared the relative strength of different attributes to measure the degree of semantics. Zheng et al. [21]
proposed to use the deep learning approach for topic modeling.
Duygulu et al. [22] treated image annotation as machine translation, while Guillaumin et al. [23] addressed image annotation with discriminative metric learning.
Several image data sets [25]–[27] have been collected by researchers, along with various image representation and classification methods [17]–[51]. Sánchez et al. [17] used the Fisher vector to encode high-order information, while van de Sande et al. [28] explored
different color channels. Wu and Rehg [29] used histogram
intersection kernel for similarity measurement, while Hou et al. [30]
combined different features. Lian et al. [31] tried to learn codebooks with a max-margin constraint. Boureau et al. [32] went beyond visual
features by learning midlevel representations. Zhang et al. [33]
learned a number of codebooks for feature encoding, while
Gao et al. [34] targeted the fine-grained problem. Xie et al. [35]
used bin-ratio similarity, while Yuan and Yan [36] used sparse
representation for classification. Qi et al. [37] used concurrent
multiple instance learning, while Li et al. [38] made use of the
contextual information of features. Fu et al. [39] tried to refine the
tagging accuracy using view-dependent representations and then extended this work to weakly supervised settings [40].
III. CLASS-SPECIFIC REPRESENTATION FOR OBJECT CATEGORIZATION
In this section, we give the details of the proposed class-specific
image representation (CSR) method for object categorization.
A. Class-Specific Modeling
Suppose we have $N$ training images of $C$ classes as $(x_n, y_n)$, $n = 1, \ldots, N$, where $x_n \in \mathbb{R}^{d \times 1}$ is the feature vector of the $n$th image and $y_n \in \mathbb{R}^{C \times 1}$ is the corresponding label vector, with $y_{n,i} = 1$ if the image contains an object of the $i$th class and $y_{n,i} = 0$ otherwise. We try to generate class-specific information for representation. This is because images of the same class are correlated. One image may be contaminated with noisy information, which makes the corresponding visual representation less discriminative. However, if we jointly consider images of the same class, we are probably able to obtain more consistent and discriminative representations.
In this brief, we use the GMM to incorporate the class-level information of images. For the $i$th class, we try to use a GMM with $K_i$ clusters to model the images. Formally, let $S_i$ be the set of images of the $i$th class as
$$S_i = \{x_n \mid (x_n, y_n),\ y_{n,i} = 1\}. \tag{1}$$
We use $S_i$ to generate the GMM for the $i$th class as
$$y_{n,i} = \sum_{k=1}^{K_i} \pi_{k,i}\, \mathcal{N}(x_n \mid \mu_{k,i}, \sigma_{k,i}), \quad \forall x_n \in S_i \tag{2}$$
where $K_i$ is the mixture number for the $i$th class, $\mathcal{N}(x_n \mid \mu_{k,i}, \sigma_{k,i})$ is the Gaussian distribution with the parameters $\mu_{k,i}$ and $\sigma_{k,i}$ as the mean and standard deviation, and $\pi_{k,i}$ is the mixing parameter. This can be learned with the expectation–maximization (EM) algorithm [24]. The learned GMM captures the intrinsic characteristics of images containing the same objects. Similarly, we can learn the GMM for each class with the corresponding parameters $\{\pi_{k,i}, \mu_{k,i}, \sigma_{k,i}\}$, $k = 1, \ldots, K_i$, $i = 1, \ldots, C$.
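To make this step concrete, below is a minimal sketch of the per-class GMM fitting, assuming scikit-learn's EM-based GaussianMixture; the function name, array layout, and diagonal-covariance choice are illustrative assumptions rather than details taken from this brief.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features, labels, n_components=10, seed=0):
    """Fit one GMM per class via EM.

    features: (N, d) array of image representations x_n.
    labels:   (N, C) binary matrix with labels[n, i] = 1 if image n
              contains an object of the ith class.
    Returns a list of C fitted GaussianMixture models.
    """
    gmms = []
    for i in range(labels.shape[1]):
        S_i = features[labels[:, i] == 1]   # images of the ith class, Eq. (1)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        gmm.fit(S_i)                        # EM estimation of pi, mu, sigma, Eq. (2)
        gmms.append(gmm)
    return gmms
```

Each fitted model stores the mixing weights, means, and variances that play the roles of $\pi_{k,i}$, $\mu_{k,i}$, and $\sigma_{k,i}$.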
To model the class-specific characteristics, for each image $x_n$, we use a $2 \times K_i$ dimensional vector $h_{i,n}$ as the class-level representation of image $x_n$, which is defined as
$$h_{i,n} = [d(x_n, \{\pi_{1,i}, \mu_{1,i}, \sigma_{1,i}\}); \ldots; d(x_n, \{\pi_{K_i,i}, \mu_{K_i,i}, \sigma_{K_i,i}\})] \tag{3}$$
where $d(x_n, \{\pi_{1,i}, \mu_{1,i}, \sigma_{1,i}\})$ is the distance between image $x_n$ and the particular Gaussian model $\{\pi_{1,i}, \mu_{1,i}, \sigma_{1,i}\}$. We use the Euclidean distance and define $d(x, \{\pi, \mu, \sigma\})$ as
$$d(x, \{\pi, \mu, \sigma\}) = \pi \left[ \|x - \mu\|;\ \frac{\|x - \mu\|}{\|\mu\|} \right]. \tag{4}$$
The first and second dimensions measure the Euclidean distance and the relative Euclidean distance between the initial image representation and the corresponding Gaussian model. Note that other types of distance measurement strategies can also be used. In this way, we are able to model the specific characteristics of each class of images for representation. Similar images of different classes may generate similar GMM models. However, we generate the GMM for each class separately. The class information is used during the training process of the GMM. Besides, the learned GMMs are not exactly the same for visually similar classes. This helps to separate such classes with the final image representations. Since the GMMs are learned for each class independently, the proposed method can be parallelized to speed up computation.
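Continuing the hypothetical helper above, the sketch below computes the class-specific vector of (3) with the distance of (4); it assumes the scikit-learn GaussianMixture attributes weights_ and means_ hold $\pi_{k,i}$ and $\mu_{k,i}$.

```python
import numpy as np

def class_specific_representation(x, gmms):
    """Build h_n = [h_{1,n}; ...; h_{C,n}] for one image feature x of shape (d,)."""
    parts = []
    for gmm in gmms:  # one fitted GMM per class, as in fit_class_gmms above
        for pi_k, mu_k in zip(gmm.weights_, gmm.means_):
            dist = np.linalg.norm(x - mu_k)        # Euclidean distance, Eq. (4)
            rel = dist / np.linalg.norm(mu_k)      # relative Euclidean distance
            parts.extend([pi_k * dist, pi_k * rel])
    return np.asarray(parts)  # length sum_i 2 * K_i, the building block of Eq. (5)
```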
B. Image Representation and Object Categorization
After obtaining the class-specific representations, we can combine them for the final image representation. For the $n$th image, we concatenate the class-specific representations $h_{i,n}$ of all classes as
$$h_n = [h_{1,n}; \ldots; h_{C,n}] \tag{5}$$
which is of $\sum_{i=1}^{C} 2 \times K_i$ dimensions. We can use various values of $K_i$ to treat different classes separately; in this brief, we simply set them to the same number. Note that if we set $K_i$ to zero, the proposed method degenerates to traditional image representation methods. Besides, the proposed class-specific representation method can also cope with images containing multiple objects.
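As a worked dimension count under the settings reported later (15 classes on Scene-15 and $K_i = 10$ mixtures per class), the class-specific part alone has
$$\dim(h_n) = \sum_{i=1}^{C} 2 K_i = 15 \times 2 \times 10 = 300$$
dimensions, to which the $d$-dimensional visual feature is appended as described next.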
To utilize the discriminative power of both the class-specific representations and the initial visual representations, we concatenate them together for the final image representation as $\bar{h}_n = [h_n; x_n]$. To predict the categories of objects, we train a one-vs-all linear classifier for the $i$th class as
$$y_n = w_i^T \bar{h}_n \tag{6}$$
by minimizing the quadratic hinge loss with an $L_2$ constraint on the parameter $w_i$ as
$$w_i = \arg\min_{w_i} \sum_{n=1}^{N} \max\left(0,\ 1 - y_n w_i^T \bar{h}_n\right)^2 + \lambda \|w_i\|_2^2 \tag{7}$$
where $\lambda$ is the weighting parameter. After the classifiers are learned, we can predict the categories of images using (6) by representing each testing image with the class-specific feature. Algorithm 1 gives the procedure of the proposed object categorization with the class-specific representation method. Note that the softmax strategy can also be used for classifier training and image class prediction.
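Before turning to Algorithm 1, the following is a minimal sketch of the training and prediction steps, assuming scikit-learn's LinearSVC, whose default squared hinge loss with an $L_2$ penalty matches the objective in (7) up to the usual $C \approx 1/\lambda$ reparameterization; class_specific_representation is the hypothetical helper sketched in Section III-A.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_predict(train_x, train_y, test_x, gmms, lam=1.0):
    """One-vs-all linear classifiers on the concatenated representation [h_n; x_n]."""
    def final_repr(batch):
        return np.vstack([np.concatenate([class_specific_representation(x, gmms), x])
                          for x in batch])

    h_train, h_test = final_repr(train_x), final_repr(test_x)
    # LinearSVC minimizes the squared hinge loss with an L2 penalty, cf. Eq. (7);
    # multi-class labels are handled with a one-vs-rest scheme by default.
    clf = LinearSVC(loss="squared_hinge", C=1.0 / lam)
    clf.fit(h_train, train_y.argmax(axis=1))  # integer class labels from one-hot y_n
    return clf.predict(h_test)
```

A softmax classifier (e.g., scikit-learn's LogisticRegression with a multinomial setting) could be substituted here to play the role of the softmax strategy mentioned above.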
Algorithm 1 Procedure of Object Categorization With Class-Specific Representation Method
Input: The number of Gaussian mixtures $K_i$, $i = 1, \ldots, C$; the regularization parameter $\lambda$; the training images $(x_n, y_n)$, $n = 1, \ldots, N$; the testing images.
Output: The predicted categories of the testing images.
1: Learn the GMM model for each class using the training images as in Eq. (2);
2: Calculate the class-specific features using Eq. (3);
3: Concatenate the class-specific features using Eq. (5) and train the classifiers accordingly with Eq. (7);
4: Represent each testing image with the learned GMM models using Eqs. (3)-(5) and classify it accordingly using Eq. (6);
5: return The predicted categories of the testing images.
IV. EXPERIMENTS
To evaluate the effectiveness of the proposed class-specific representation method for object categorization, we conduct experiments on three publicly available image data sets: the Scene-15 data set [25], the Flower-17 data set [26], and the UIUC-Sports data set [27]. Fig. 1 gives some sample images of the three data sets.
Fig. 1. Sample images of (a) Scene-15 data set, (b) Flower-17 data set, and (c) UIUC-Sports data set.
A. Experimental Setup
To obtain the initial image representations, we make use of the sparse coding [2] and CNN-based [3] strategies. Note that the proposed method can also be combined with other types of image representation strategies (e.g., Fisher vector [17]). For the sparse coding-based image representation, we densely extract local features at multiple scales, with the minimum scale being 16 × 16 pixels. The SIFT feature is used as the local feature. For the flower images, color SIFT features [28] are also used. The codebook size is set to 1024, and we concatenate the max-pooled representations of each feature type for joint representation. For the CNN-based image representation, we use the seven-layer network proposed in [3]. Each image is represented with a vector of 4096 dimensions. For the local feature-based BoW scheme, the SIFT features are extracted from local patches and then encoded with sparse coding. We use class-specific image representation with sparse coding (CSR-Sc) to denote this method. For the CNN-based image representation, we combine it with the proposed CSR method for classification; this method is denoted by CSR-CNN. We also give the performance of CNN with the softmax strategy for classifier training (CSR-CNN-softmax). We use the same experimental setup and compare with the reported results of other methods directly for fair comparison. We use the classification accuracy for quantitative comparison.
B. Scene-15 Data Set
This data set has 15 classes of images: store, office, tallbuilding,
street, opencountry, mountain, insidecity, highway, forest, coast, livingroom, kitchen, industrial, suburb, and bedroom. We follow the
experimental setup of [25] and randomly select 100 images per class for training, testing on the remaining images. This random selection is repeated ten times. We give the performance
comparisons with other methods in Table I. We also give the
performance of the proposed CSR method when only using the sparse
coding strategy (CSR-Sc). The boxplot of the performance is given
in Fig. 2.
The proposed method is able to improve over many baseline
methods. This shows that the proposed method can make use of the
class information of images for more discriminative representations.
Specifically, CSR-Sc outperforms sparse coding spatial pyramid matching (ScSPM) by about 3.5% when only local features with sparse coding are used. Besides, we are also able to improve over object bank [12], which uses information from the Internet. Moreover, the proposed method also outperforms [30], which combines features within the k-nearest-neighbor framework for better classification.
As to the per-class performance, we can see from Fig. 2 that the indoor classes, which have more cluttered backgrounds, are harder to classify than the outdoor classes. The larger variance of the indoor classes also makes them relatively harder to model efficiently. The proposed method can model this class variance to some extent and helps to improve over other baseline methods. Specifically, the relative improvement of the proposed method over ScSPM [2] is larger on indoor classes than on outdoor classes. This demonstrates the effectiveness and usefulness of the proposed method.
TABLE I
PERFORMANCE COMPARISON OF THE PROPOSED CLASS-SPECIFIC REPRESENTATION METHOD FOR OBJECT CATEGORIZATION WITH OTHER METHODS ON THE SCENE-15 DATA SET. NUMERICAL VALUES STAND FOR MEAN AND STANDARD DEVIATION.
Fig. 2. Boxplot of the per-class performance on the Scene-15 data set (%). The numbers from 1 to 15 on the horizontal axis indicate store, office, tallbuilding, street, opencountry, mountain, insidecity, highway, forest, coast, livingroom, kitchen, industrial, suburb, and bedroom, respectively.
C. Flower-17 Data Set
There are 17 types of flowers (Buttercup, Colts foot, Daffodil, Daisy, Dandelion, Fritillary, Iris, Pansy, Sunflower, Windflower, Snowdrop, Lily valley, Bluebell, Crocus, Tigerlily, Tulip, and Cowslip) in this data set. Each class has 80 images. The train/validation/test splits provided in [26], with 40/20/20 images per class, respectively, are used for comparison. Table II gives the performance comparisons. We also give the boxplot of the per-class performance in Fig. 3.
We can see from Table II that the proposed method outperforms many baseline methods. Compared with image-only representation strategies, the proposed method can also make use of the class information. Since images of the Flower-17 data set are visually very similar, we can improve the categorization performance by also taking the intrinsic characteristics of each class into consideration. Besides, the proposed method is able to improve over ICT [10], which transfers the information of other flower data sets, by 1.5%. Moreover, the learning of classifiers makes the proposed method more discriminative and robust than the reconstruction-based method [36]. The experimental results on the Flower-17 data set again prove the effectiveness of the proposed CSR for object categorization.
TABLE II
PERFORMANCE COMPARISON OF THE PROPOSED CLASS-SPECIFIC REPRESENTATION METHOD FOR OBJECT CATEGORIZATION WITH OTHER METHODS ON THE FLOWER-17 DATA SET.
Fig. 3. Boxplot of the per-class performance on the Flower-17 data set (%). The numbers from 1 to 17 on the horizontal axis indicate Buttercup, Colts foot, Daffodil, Daisy, Dandelion, Fritillary, Iris, Pansy, Sunflower, Windflower, Snowdrop, Lily valley, Bluebell, Crocus, Tigerlily, Tulip, and Cowslip, respectively.
D. UIUC-Sports Data Set
The UIUC-Sports data set has eight classes of images with different
sport types: badminton, bocce, snow boarding, croquet, climbing,
polo, rowing, and sailing. There are 1792 images in total, with each class containing 137–250 images. We randomly select 70 images per class for training and use the rest for testing, following [27]. We give the performance comparisons on the UIUC-Sports data set in Table III and the boxplot in Fig. 4. We can draw similar conclusions as on the other two
data sets. By exploring the class-level information along with the
image-level representation, we can incorporate more discriminative
information for better categorization. Because images of this data set
often have people with varied appearances, the visual representation
of each image is often contaminated with noisy information. The
proposed method is able to alleviate this problem and improve the
performance.
TABLE III
PERFORMANCE COMPARISON OF THE PROPOSED CLASS-SPECIFIC REPRESENTATION METHOD FOR OBJECT CATEGORIZATION WITH OTHER METHODS ON THE UIUC-SPORTS DATA SET.
Fig. 4. Boxplot of the per-class performance on the UIUC-Sports data set (%). The numbers from 1 to 8 on the horizontal axis indicate badminton, bocce, snow boarding, croquet, climbing, polo, rowing, and sailing, respectively.
E. Influences of Gaussian Mixture Number
The Gaussian mixture number is an important parameter, which influences the performance. We give the performance changes with the number of Gaussian mixtures on the three data sets in Fig. 5. If the mixture number is too small, the resulting representation would be too coarse to separate images. However, if the mixture number is too large, it would represent images too finely and introduce some noise. Besides, visually similar images of different classes require a larger mixture number in order to be separated. Hence, we use ten Gaussian mixtures per class for the Scene-15 data set and the UIUC-Sports data set. As to the Flower-17 data set, 20 Gaussian mixtures per class are used.
Fig. 5. Performance changes with the number of Gaussian mixtures on the Scene-15 data set, the Flower-17 data set, and the UIUC-Sports data set.
F. CIFAR-100 Data Set
We also evaluate the proposed method on the CIFAR-100 data set [41]. This data set has 100 classes of images with 600 images for each class. The image size is 32 × 32 pixels. The 100 classes are divided into 20 superclasses. We follow the same experimental setup as in [41]. For each class, 500 images are used for training, and the other 100 images are used for testing. Since images of this data set are relatively small compared with those of the other three data sets, we only give the performance of the proposed method with the CNN scheme.
The classification performances of the proposed method and the baseline methods are given in Table IV. Compared with the CNN, the proposed method improves over [43] by representing images with class-specific representations. Besides, the softmax strategy helps to improve the classification accuracy over the one-vs-all strategy. Moreover, the CSR-CNN-based methods can also achieve performances comparable to those in [45] and [46], which improve over [43] with more efficient representations. The proposed method cannot perform as well as the all-CNN [47], which improves over the CNN of [43]. However, the proposed class-specific representation method can be combined with these schemes for better classification.
TABLE IV
PERFORMANCE COMPARISON OF THE PROPOSED CLASS-SPECIFIC REPRESENTATION METHOD FOR OBJECT CATEGORIZATION WITH OTHER METHODS ON THE CIFAR-100 DATA SET.
The softmax strategy can improve over the one-vs-all strategy on the Scene-15 data set, the Flower-17 data set, and the UIUC-Sports data set. However, since the CSR-CNN-based strategy already performs very well, it is hard to further improve the performance when the softmax strategy is used instead of the one-vs-all strategy. We have also compared the performances of the two strategies on the CIFAR-100 data set. The improvement of CSR-CNN-softmax over CSR-CNN is larger on this data set than on the other three data sets because of the larger number of image classes.
V. CONCLUSION
In this brief, we proposed a novel class-specific image representation method for object categorization. We first represented images with traditional visual strategies and then learned the specific characteristics of each class using the GMM. Images were then mapped to the Gaussian models of each class using the Euclidean distance and the relative Euclidean distance. We combined the discriminative power of each class by concatenating the class-specific vectors with the visual information for joint representation. The proposed method can be combined with discriminative image representation methods to improve the classification accuracy. We conducted object categorization experiments on the Scene-15 data set, the Flower-17 data set, the UIUC-Sports data set, and the CIFAR-100 data set. The experimental results demonstrated the usefulness of the proposed method.
REFERENCES
[1] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to
object matching in videos,” in Proc. Int. Conf. Comput. Vis., Oct. 2003,
pp. 1470–1477.
[2] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching
using sparse coding for image classification,” in Proc. Comput. Vis.
Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 1794–1801.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012,
pp. 1097–1105.
[4] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. Comput.
Vis. Pattern Recognit., 2010, pp. 3360–3367.
[5] C. Zhang, J. Liu, Q. Tian, C. Xu, H. Lu, and S. Ma, “Image classification
by non-negative sparse coding, low-rank and sparse decomposition,” in
Proc. Comput. Vis. Pattern Recognit., 2011, pp. 1673–1680.
[6] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 37, no. 1, pp. 13–27, Jan. 2015.
[7] N. Armanfard, J. P. Reilly, and M. Komeili, “Local feature selection
for data classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38,
no. 6, pp. 1217–1227, Jun. 2016, doi: 10.1109/TPAMI.2015.2478471.
[8] Z. Li, Z. Lai, Y. Xu, J. Yang, and D. Zhang, “A locality-constrained and
label embedding dictionary learning algorithm for image classification,”
IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 2, pp. 278–293,
Feb. 2017, doi: 10.1109/TNNLS.2015.2508025.
[9] L. Zhang, R. Hong, Y. Gao, R. Ji, Q. Dai, and X. Li, “Image categorization by learning a propagated graphlet path,” IEEE Trans. Neural Netw.
Learn. Syst., vol. 27, no. 3, pp. 674–685, Mar. 2016.
[10] C. Zhang, J. Cheng, J. Liu, J. Pang, Q. Huang, and Q. Tian, “Beyond
explicit codebook generation: Visual representation using implicitly
transferred codebooks,” IEEE Trans. Image Process., vol. 24, no. 12,
pp. 5777–5788, Dec. 2015.
[11] M. Xiao and Y. Guo, “Feature space independent semi-supervised
domain adaptation via kernel matching,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 37, no. 1, pp. 54–66, Jan. 2014.
[12] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei, "Object bank: A high-level image representation for scene classification & semantic feature
sparsification,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC,
Canada, 2010, pp. 1378–1386.
[13] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong, “Transductive multiview zero-shot learning,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 37, no. 11, pp. 2332–2345, Nov. 2015.
[14] C. Zhang et al., “Object categorization in sub-semantic space,” Neurocomputing, vol. 142, pp. 248–255, Oct. 2014.
[15] N. Rasiwasia and N. Vasconcelos, “Holistic context models for visual
recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 5,
pp. 902–917, May 2012.
[16] M. Ristin, M. Guillaumin, J. Gall, and L. van Gool, “Incremental learning of random forests for large-scale image classification,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 490–503, Mar. 2016.
[17] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the Fisher vector: Theory and practice,” Int. J. Comput. Vis.,
vol. 105, no. 3, pp. 222–245, 2013.
[18] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects
by their attributes,” in Proc. Comput. Vis. Pattern Recognit., Miami, FL,
USA, 2009, pp. 1778–1785.
[19] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect
unseen object classes by between-class attribute transfer,” in Proc.
Comput. Vis. Pattern Recognit., Miami, FL, USA, 2009, pp. 951–958.
[20] D. Parikh and K. Grauman, “Relative attributes,” in Proc. Int. Conf.
Comput. Vis., Barcelona, Spain, 2011, pp. 503–510.
[21] Y. Zheng, Y.-J. Zhang, and H. Larochelle, “A deep and autoregressive approach for topic modeling of multimodal data,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1056–1069, Jun. 2016,
doi: 10.1109/TPAMI.2015.2476802.
[22] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, “Object
recognition as machine translation: Learning a lexicon for a fixed image
vocabulary,” in Proc. Eur. Conf. Comput. Vis., 2002, pp. 97–112.
[23] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, "TagProp:
Discriminative metric learning in nearest neighbor models for image
auto-annotation,” in Proc. IEEE Int. Conf. Comput. Vis., Sep./Oct. 2009,
pp. 309–316.
[24] C. Bishop, Pattern Recognition and Machine Learning. New York, NY,
USA: Springer, 2007.
[25] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial
pyramid matching for recognizing natural scene categories,” in Proc.
Comput. Vis. Pattern Recognit., 2006, pp. 2169–2178.
[26] M.-E. Nilsback and A. Zisserman, “A visual vocabulary for flower
classification,” in Proc. Comput. Vis. Pattern Recognit., 2006,
pp. 1447–1454.
[27] L.-J. Li and L. Fei-Fei, “What, where and who? Classifying events
by scene and object recognition,” in Proc. Int. Conf. Comput. Vis.,
2007, pp. 1–8.
[28] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek, “Evaluating
color descriptors for object and scene recognition,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 32, no. 9, pp. 1582–1596, Sep. 2010.
[29] J. Wu and J. M. Rehg, “Beyond the Euclidean distance: Creating
effective visual codebooks using the histogram intersection kernel,” in
Proc. Int. Conf. Comput. Vis., 2009, pp. 630–637.
[30] J. Hou, H. Gao, Q. Xia, and N. Qi, “Feature combination and
the kNN framework in object classification,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 27, no. 6, pp. 1368–1378, Jun. 2016,
doi: 10.1109/TNNLS.2015.2461552.
[31] X.-C. Lian, Z. Li, B.-L. Lu, and L. Zhang, “Max-margin dictionary
learning for multiclass image categorization,” in Proc. Eur. Conf. Comput. Vis., Barcelona, Spain, Sep. 2010, pp. 157–170.
[32] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning midlevel features for recognition,” in Proc. Comput. Vis. Pattern Recognit.,
San Francisco, CA, USA, Jun. 2010, pp. 2559–2566.
[33] C. Zhang, C. Liang, L. Li, J. Liu, Q. Huang, and Q. Tian, “Fine-grained
image classification via low-rank sparse coding with general and class-specific codebooks," IEEE Trans. Neural Netw. Learn. Syst., vol. 28,
no. 7, pp. 1550–1559, Jul. 2017, doi: 10.1109/TNNLS.2016.2545112.
[34] S. Gao, I. W.-H. Tsang, and Y. Ma, “Learning category-specific dictionary and shared dictionary for fine-grained image categorization,” IEEE
Trans. Image Process., vol. 23, no. 2, pp. 623–634, Feb. 2014.
[35] N. Xie, H. Ling, W. Hu, and X. Zhang, “Use bin-ratio information
for category and scene classification,” in Proc. Comput. Vis. Pattern
Recognit., 2010, pp. 2313–2319.
[36] X.-T. Yuan and S. Yan, “Visual classification with multi-task joint
sparse representation,” in Proc. Comput. Vis. Pattern Recognit., 2010,
pp. 3493–3500.
[37] G.-J. Qi, X.-S. Hua, Y. Rui, T. Mei, J. Tang, and H.-J. Zhang, “Concurrent multiple instance learning for image categorization,” in Proc.
Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[38] T. Li, T. Mei, I.-S. Kweon, and X.-S. Hua, “Contextual bag-of-words for
visual categorization,” IEEE Trans. Circuits Syst. Video Technol., vol. 21,
no. 4, pp. 381–392, Apr. 2011.
[39] J. Fu, J. Wang, Y. Rui, X. J. Wang, T. Mei, and H. Lu, “Image tag
refinement with view-dependent concept representations,” IEEE Trans.
Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1409–1422, Aug. 2015.
[40] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, and Y. Rui, “Relaxing from
vocabulary: Robust weakly-supervised deep learning for vocabulary-free
image tagging,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 1985–1993.
[41] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009. [Online]. Available: https://doi.org/10.1.1.222.9220
[42] Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive
field learning for pooled image features,” in Proc. Comput. Vis. Pattern
Recognit., 2012, pp. 3370–3377.
[43] N. Srivastava and R. R. Salakhutdinov, “Discriminative transfer learning
with tree-based priors,” in Proc. Adv. Neural Inf. Process. Syst., 2013,
pp. 2094–2102.
[44] M. D. Zeiler and R. Fergus. (2013). “Stochastic pooling for regularization of deep convolutional neural networks.” [Online]. Available:
https://arxiv.org/abs/1301.3557
[45] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and
Y. Bengio, “Maxout networks,” in Proc. 30th Int. Conf. Mach. Learn.,
2013, pp. 1319–1327.
[46] T. Lin and H. Kung, “Stable and efficient representation learning with
nonnegativity constraints,” in Proc. 31st Int. Conf. Mach. Learn., 2014,
pp. 1323–1331.
[47] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. (2014).
“Striving for simplicity: The all convolutional net.” [Online]. Available:
https://arxiv.org/abs/1412.6806
[48] C. Zhang, J. Cheng, and Q. Tian, “Incremental codebook adaptation for
visual representation and categorization,” IEEE Trans. Cybern., to be
published, doi: 10.1109/TCYB.2017.2726079.
[49] C. Zhang, J. Cheng, and Q. Tian, "Multi-view label sharing for visual representations and classifications," IEEE Trans. Multimedia, to be published, doi: 10.1109/TMM.2017.2759500.
[50] C. Zhang, J. Cheng, and Q. Tian, "Structured weak semantic space construction for visual categorization," IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2017.2728060.
[51] C. Zhang, J. Sang, G. Zhu, and Q. Tian, "Bundled local features for image representation," IEEE Trans. Circuits Syst. Video Technol., to be published, doi: 10.1109/TCSVT.2017.2694060.