Semantic Analysis for Automatic Event Recognition and Segmentation of Wedding Ceremony Videos

Wen-Huang Cheng, Student Member, IEEE, Yung-Yu Chuang, Member, IEEE, Yin-Tzu Lin, Chi-Chang Hsieh, Shao-Yen Fang, Bing-Yu Chen, Member, IEEE, and Ja-Ling Wu, Fellow, IEEE

Abstract: A wedding is one of the most important ceremonies in our lives. It symbolizes the birth and creation of a new family.
In this paper, we present a system for automatically segmenting a
wedding ceremony video into a sequence of recognizable wedding
events, e.g. the couple’s wedding kiss. Our goal is to develop an
automatic tool that helps users efficiently organize, search,
and retrieve their treasured wedding memories. Furthermore,
the obtained event descriptions could benefit and complement
the current research in semantic video understanding. Based
on the knowledge of wedding customs, a set of audiovisual
features, relating to the wedding contexts of speech/music types,
applause activities, picture-taking activities, and leading roles,
are exploited to build statistical models for each wedding event.
Thirteen wedding events are then recognized by a hidden Markov
model, which takes into account both the fitness of observed
features and the temporal rationality of event ordering to
improve the segmentation accuracy. We conducted experiments
on a collection of wedding videos and the promising results
demonstrate the effectiveness of our approach. Comparisons with
conditional random fields show that the proposed approach is
more effective in this application domain.

Index Terms: Home videos, wedding ceremonies, semantic content analysis, event detection, video segmentation.

I. INTRODUCTION

A wedding ceremony is an occasion on which a couple’s families and friends gather together to celebrate, witness, and usher in the
beginning of their marriage. It is a public announcement of the
couple’s transition from two separate lives to a new family
unit. Often, the couples invite some videographers, whether
professional or amateur, to document the wedding as their
treasured memento of the ceremony. In this paper, wedding
videos refer to the raw, unedited footage recorded at a wedding.
Since a wedding video usually spans hours, the development
of automatic tools for efficient content classification, indexing,
searching, and retrieval becomes crucial.

This work was partially published in the ACM Workshop on Multimedia Information Retrieval (MIR), 2007 [41]. This work was partially supported by the National Science Council of R.O.C. under grants NSC 95-2622-E-002-018, NSC 95-2752-E-002-006-PAE, and NSC 95-2221-E-002-332. It was also supported by National Taiwan University under grant 95R0062-AE00-02.

Wen-Huang Cheng, Yung-Yu Chuang, Bing-Yu Chen, and Ja-Ling Wu are with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: {wisley, cyy, robin, wjl}@cmlab.csie.ntu.edu.tw).

Yin-Tzu Lin, Chi-Chang Hsieh, and Shao-Yen Fang are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: {known, nonrat, strawinsky}@cmlab.csie.ntu.edu.tw).

In this paper, we focus on the recognition of a wedding’s group actions, namely wedding events, whereby a wedding is interpreted as a series of meaningful interactions among participants. Based on the knowledge of wedding customs [1],
[2], we define thirteen wedding events, such as the couple’s
wedding vows, ring exchange, and so forth. Our goal is to
automatically segment a wedding video into a sequence of
recognizable wedding events. Without loss of generality, we
focus on one of the most popular wedding styles, the western
wedding, that follows the basic western tradition [1], [2] and
takes place in a church-style venue. Based on our observations,
a wedding video typically consists of four parts: preparation,
guest seating, main ceremony, and reception. For simplicity,
we deal with the third part alone because of its relative
significance. In the rest of this paper the term wedding refers
to the main ceremony.

In the literature, the study of wedding video analysis has long been ignored. The wedding video is simply treated as one of various content sources in research on home videos [3], [4], [5]. Although the wedding ceremony video shares some common properties with other kinds of home videos, such as frequent poor-quality content and unintentional camera operations [3], [4], several characteristics make it much more challenging to process and analyze:

• Restricted spatial information: Since most of the wedding events occur in a single place (e.g. the front of a church altar) and participants basically stay motionless during the ceremony, the conventional techniques based on scene, color, and motion information [3], [4], [6] cannot be used to pre-partition a wedding video or to group “similar” shots into basic units for further event recognition. Likewise, most of the other content-generic visual features, such as texture and edge, are not reliable enough to be used.
• Temporally continuous capture: The extraction of broken time stamps is a widely used technique for generating shot candidates or event units of home videos [7], [8]. However, to avoid missing anything important, videographers usually capture a wedding, especially the main ceremony, in a temporally continuous manner without any interruption. As a result, the temporal logs are not useful for wedding segmentation.
• Implicit event boundary: Although a wedding ceremony proceeds following a definite schedule, the boundaries between wedding events are often implicit and unclear. For example, the groom’s entrance to the venue sometimes overlaps with the start of the bride’s entrance. It is not easy to determine an accurate change point to separate two events. This phenomenon not only increases the difficulty of accurate video segmentation but also adds
uncertainty to the annotation of the event ground truth.

To recognize the thirteen wedding events, we adopt a set of audiovisual features, relating to the wedding contexts of speech/music types, applause activities, picture-taking activities, and leading roles, as the basic event features to build our wedding video segmentation framework. Each wedding event is represented by a set of statistical models in terms of the extracted features. Since these features are selected based on the understanding of wedding customs [1], [2], they are more discriminative in distinguishing wedding events than the aforementioned features, such as motion and texture. To effectively segment a wedding video, we develop a hidden Markov model (HMM) [9], in which every hidden state is associated with a wedding event and a state transition is governed by how likely the two corresponding wedding events are to take place in succession.
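
Decoding such a model amounts to finding the most probable state path, which is commonly done with the Viterbi algorithm. The following is a minimal sketch of that idea, not the system described in this paper: the function name `viterbi`, the three-event subset, and all probabilities are illustrative assumptions chosen only to show how per-second feature likelihoods and event-ordering preferences combine.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most probable state sequence of an HMM.

    log_init[s]     : log prior of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of the observation at time t under state s
    """
    n_states = len(log_init)
    # best[s] = best log score of any path that ends in state s at the current time
    best = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t - 1][s] = best predecessor of state s at time t
    for t in range(1, len(log_emit)):
        prev = best
        best, ptr = [], []
        for s in range(n_states):
            r = max(range(n_states), key=lambda q: prev[q] + log_trans[q][s])
            ptr.append(r)
            best.append(prev[r] + log_trans[r][s] + log_emit[t][s])
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def logs(row):
    return [math.log(p) for p in row]


# Hypothetical toy model: three events from the taxonomy, made-up numbers.
EVENTS = ["WV", "RE", "WK"]
init = logs([0.8, 0.1, 0.1])
trans = [logs([0.80, 0.19, 0.01]),   # WV tends to stay or move on to RE
         logs([0.01, 0.80, 0.19]),   # RE tends to stay or move on to WK
         logs([0.01, 0.01, 0.98])]   # WK rarely goes back
# One row per 1-second frame: likelihood of the observed features per event.
emit = [logs(row) for row in [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.4, 0.5, 0.1],
                              [0.2, 0.6, 0.2], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7]]]

decoded = [EVENTS[s] for s in viterbi(init, trans, emit)]
print(decoded)
```

Working in the log domain avoids numerical underflow on long sequences, which matters when a ceremony is decoded at one observation per second.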
The event sequence is, therefore, automatically determined by finding the most probable path. In summary, our event recognition framework not only uses the model similarity of extracted features, but simultaneously takes the temporal rationality of event ordering into account.

The main contributions of our work are twofold. First, an automatic system is proposed and realized for event-based
wedding segmentation. To the best of our knowledge, this
work is the first one to analyze and structure wedding videos
at the semantic-event level. Indeed, for any type of home
video, our work might also be the first one to achieve
semantic event analysis. The proposed methodology could
be extended to other kinds of home videos
that possess characteristics similar to those of weddings, such as
birthday parties and school ceremonies. Second, a taxonomy
is developed to categorize the wedding events, whereby we
adopted a set of carefully selected audiovisual features for
robust event modeling and recognition. The true power of these
features is that they are effective in discriminating various
wedding events, yet their extraction from videos is as easy
as that of the conventional ones. Furthermore, the obtained high-level
descriptions could benefit and complement the current research
in semantic video understanding.

The rest of this paper is organized as follows. After a discussion of related work, Section III presents the taxonomy
of wedding events. The extraction of event features and the
modeling and segmentation of wedding videos are described in
Section IV and Section V, respectively. Section VI depicts the
experimental results, and Section VII presents our concluding
remarks and the directions of future work.

II. RELATED WORK

In this section, we review previous studies on home video analysis. According to their applications, they are classified
into four major categories: scene-based segmentation, capture-
intent detection, photo-assisted summarization, and highlight
extraction. Their pros and cons as compared with
our approach are also briefly discussed.

Scene-based Segmentation. A basic segmentation process is to cluster relevant shots into groups called scenes. A scene is
defined as a subdivision of a video in which either the physical setting is fixed or a continuous action is presented in
one place [4], [6]. Since the home video content tends to be
close in time, the clustering can be simply confined to adjacent
shots. Gatica-Perez et al. [3] proposed a greedy algorithm
that initially treats each shot as a cluster and successively
merges adjacent ones until a Bayesian criterion is violated.
The merging order is determined by both the visual and the
temporal similarities, such as color, edge, and shot duration.
Zhai et al. [4] located scene boundaries using the optimization
technique – Markov chain Monte Carlo (MCMC). A color-
based similarity matrix is constructed for video shots, from
which the clusters with high intra- and low inter-similarities
are detected as the desired scenes.

Capture-intent Detection. A capture-intent refers to an idea, a feeling, a theme, or a message that makes us capture certain
video segments [5], [10], e.g. a sentimental sunset or baby
laughing. Since the user’s capture-intent is often expressed
through the use of cinematic principles, some researchers ex-
ploit the theory of computational media aesthetics for captur-
ing such intents [11]. Achanta et al. [5] proposed a framework
for modeling the capture-intents of four basic emotions, i.e.
cheer, serenity, gloom, and excitement. An emotion delivery
system is also developed for helping users to enhance the
original or to convey a new emotion to a given home video.
Mei et al. [10] further integrated the knowledge of psychology
to classify the capture-intents into seven categories, such as
close-up view, beautiful scenery, just record, etc. A learning-
based mechanism for classifying the capture-intents is then
presented using two kinds of feature sets: attention-specific
and content-generic features.

Photo-assisted Summarization. Personal photo albums can be viewed as an excellent abstract of the corresponding home
videos. Both capture most of the important moments, but photo
albums are relatively concise in presenting the contents. Since
a still image can be used to search videos, the summarization
task can be cast as the problem of template matching
between these two media. Aner-Wolf et al. [12] targeted
wedding videos. They represented each shot with one or
several mosaics that are aligned with the wedding
photos. All shots with successful alignments are collected
to generate a summarized wedding video. Similar ideas are
adopted by Takeuchi et al. [13], but they instead estimated
the user’s general preferences on the summarization. On the
other hand, Pan et al. [14] analyzed home videos in a finer
unit called a snippet that corresponds to a meaningful camera
motion pattern, such as a long static followed by a fast zoom.

Highlight Extraction. Highlights are the video segments with relatively higher semantic or perceptual attractions to users. Since it is still not possible to understand video semantics with the current computing technologies, detection of human attention provides an alternative way for detecting perceptual highlights [15], [16]. Hua et al. [17] proposed a home video editing system, in which attention-based highlight segments are selected to be aligned with a given piece of incidental music to generate an edited highlight video. Meanwhile, a set of professional editing rules is utilized to optimize the editing quality, e.g. motion activity should match the music tempo. Abowd et al. [18] presented a semi-automatic approach for highlight browsing. Home videos need to be manually annotated with a predefined tag hierarchy that helps to group together the highlight segments with similar semantic meanings, e.g. all clips of the child’s birthday wishing.

TABLE I
TAXONOMY OF WEDDING EVENTS

Code  Event                     Definition
ME    Main Group Entering †     Members of the main group walking down the aisle.
GE    Groom Entering            Groom (with the best man) walking down the aisle.
BE    Bride Entering            Bride (with her father) walking down the aisle.
CS    Choir Singing             Choir (with participants) singing hymns.
OP    Officiant Presenting      Officiants giving presentations, e.g. invocation, benediction, and homily.
WV    Wedding Vows              Couple exchanging wedding vows.
RE    Ring Exchange             Couple exchanging wedding rings.
BU    Bridal Unveiling          Groom unveiling his bride’s veil.
MS    Marriage License Signing  Couple (with officiants) signing the marriage license.
WK    Wedding Kiss              Groom kissing his bride.
AP    Appreciation              Couple thanking certain people, e.g. their parents or all participants.
ED    Ending                    Couple (followed by the main group) walking back down the aisle.
OT    Others                    Any events not belonging to the above, e.g. lighting a unity candle.

† The main group indicates all persons, except the ones in GE and BE, who are invited to walk down the aisle, e.g. flower girls, ring bearers, groomsmen, bridesmaids, honorary attendants, officiants, etc.

Fig. 1. Sample key-frames of the thirteen wedding events.

Some observations are made from the above discussions. First, the so-called event is a more semantic unit for video
segmentation as compared with the conventional ones such as
frames, subshots, shots, and scenes [19], [20]. It represents a
single human activity during a period of time. However, stud-
ies on semantic event analysis of home media are extremely
rare as compared with the other kinds of content sources
such as sports [19]. Second, the analysis of home media is
mostly from the perspective of a viewer or a videographer but
not the media owner or event participants. Helping them to
explicitly identify what happened in a video often seems
more crucial than simply indicating which parts would be more
significant. These observations motivate our development of
a comprehensive scheme for event-based video analysis and
segmentation.

III. WEDDING EVENT TAXONOMY

According to the western tradition [1], [2], a wedding ceremony, whether religious or secular, begins when an assigned attendant (such as an officiant or the bride’s mother) walks down the aisle and ends when the couple walks out of the wedding venue. The mid-process may vary depending on countries, religions, local customs, and the wishes of the couple, but the basic elements that constitute the western weddings are almost the same [1], [2]. Therefore, we define thirteen wedding events as listed in Table I. They are carefully specified to be mutually exclusive and collectively exhaustive [21]. The corresponding sample key-frames for these events are illustrated in Figure 1.

In addition to the traditions, the common perception of the relative event importance is also taken into account in the
development of our taxonomy for further applications such
as highlight extraction or video summarization. For example,
the three entering events (ME, GE, BE) are traditionally
viewed as a unity called a processional [1], [2], but they should
be explicitly separated because the couple’s arrival is generally
much more exciting than the others. By contrast, we classify
all of the officiants’ formal presentations like invocation and
benediction into a single wedding event (OP), because they
are often invariable in form and the verbal expressions are
basically predictable, often not beyond the scope of invoking
God’s blessing upon the marriage or inspiring the attendants’
religious spirits. It is evident that they are not as important as
the other events.

Furthermore, as shown in Table I, the taxonomy roughly follows the procession of a wedding ceremony, i.e. from the ME event to the ED event. However, it should be noted that the actual event ordering is based on each couple’s own wedding program and certain events could be repeated or removed in the ceremony. For example, the OP and the CS events are often interwoven with other ones. In addition, a simplified ceremony could contain only the four events of WV, RE, MS, and WK.

TABLE II
THE TENDENCY OF WEDDING EVENTS IN THEIR BEHAVIOR OF SPEECH/MUSIC TYPES, APPLAUSE ACTIVITIES, PICTURE-TAKING ACTIVITIES, AND LEADING ROLES (FROM THE SECOND TO THE FIFTH COLUMNS, RESPECTIVELY). *

Code  S/M a  App. b  Pic. c  Leading Roles d
ME    -      N       L+      main group
GE    -      N       -       groom, (best man)
BE    M      -       H+      bride, (bride’s father)
CS    M      -       L-      choir, (wedding participants)
OP    S      N       -       officiants
WV    S      N       H-      bride, groom, officiants
RE    S      N       H-      bride, groom, officiants
BU    S      -       H-      bride, groom
MS    -      N       -       bride, groom, (officiants)
WK    -      Y       H+      bride, groom
AP    -      Y       -       bride, groom, (wedding participants)
ED    M      Y       H-      bride, groom, (main group)
OT    -      -       -       -

* "-" in the blanks means no obvious tendency.
a S: speech events, M: music events.
b Y: applause events, N: non-applause events.
c L-, L+, H-, H+: events with the activity of picture-taking from low to high.
d People in parentheses are optional.

IV. EVENT FEATURES DEVELOPMENT AND EXTRACTION

Effective event modeling is built on top of reliable event features. The understanding of wedding customs [1], [2] gives
valuable insights to the process of feature exploration. Several
key observations, which are found to be useful in discriminating
the wedding events, are first presented in Section IV-A.
In Section IV-B, guided by these findings, we develop the corresponding
audiovisual features, including four audio features
and two visual features. They are collected together as event
features for later event modeling.

A. Key Observations

According to the western traditions [1], [2], wedding events are observed to behave differently in four main aspects: speech/music types, applause activities, picture-taking activities, and leading roles. In the following, we explain each of the key observations in detail and then give corresponding guidance on the development of relevant event features.

1) Speech/Music Types: Traditionally, some wedding events contain purely speech and others are accompanied by
music [2]. For example, in the OP and the WV events, all participants keep quiet to listen to an officiant or the couple speaking. In the CS and the BE events, a choir sings with piano accompaniment or the selected background music (e.g. Mozart’s Wedding March) is played during the event. The tendency of wedding events in speech/music types is shown in Table II. Obviously, the discrimination between speech and music types from recorded audio plays a key role in wedding event recognition. However, because the quality of the recorded audio is generally poor and often interfered with by environmental sound and background noise, the selected audio features related to the speech/music discrimination have to be robust enough to survive such a low-SNR audio input.

TABLE III
EXAMPLES OF FLASH DISTRIBUTIONS OF FOUR SUCCESSIVE WEDDING EVENTS IN A CEREMONY. *

                 1. OP    2. WV    3. RE    4. WK
Duration (sec)   674      234      142      12
Flashes (times)  19       55       8        73
Density (Hz)     0.0282   0.2350   0.0563   6.0833

* The three rows below the header are the durations, flash numbers (manually counted), and flash densities of the corresponding wedding events, respectively.

2) Applause Activities: Applause is usually expected from wedding attendants as the expression of approval or admiration
at certain moments during the ceremony. For example, in the
WK and the ED events, the couple routinely receives a burst
of applause at the moments when they are kissing or walking
back down the aisle. By contrast, in the OP and the WV
events, wedding attendants rarely applaud in order to keep
the solemnity and avoid interfering with the ongoing wedding
speech. Thus, effective applause detection is beneficial to the
recognition of wedding events, cf. Table II. Note that, for
our applications, applause refers specifically to that
created by a group of people rather than by an individual.
Specifically, the applause is generated by the group act of
hand clapping, and naturally the group members tend to clap
at slightly different rates. This phenomenon makes the sound
of applause difficult to analyze without the use of prior
knowledge [22], [23]. Therefore, a common technique is to
exploit the physical properties of applause [23], [24] to identify
its appearance in the audio track of wedding videos.

3) Picture-taking Activities: Wedding attendants, especially the couple’s family members and close friends, often take
pictures during the ceremony, and the number of pictures taken
roughly represents the relative importance of a wedding event.
Table II illustrates a relative comparison for the generally
observed frequency of taking pictures during various wedding
events. Since the occurrence of camera flashes correlates
closely with the activity of picture-taking [25], the estimation
of flash density could be an effective visual cue for wedding
event discrimination. Table III shows an example of flash
distributions for four successive wedding events in a ceremony.
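
The densities in Table III are simply the flash counts divided by the event durations. The short sketch below reproduces that arithmetic from the table's own numbers; the names `TABLE_III` and `flash_density` are illustrative and not part of the original system.

```python
# Durations (seconds) and manually counted flashes, copied from Table III.
TABLE_III = {
    "OP": (674, 19),
    "WV": (234, 55),
    "RE": (142, 8),
    "WK": (12, 73),
}


def flash_density(duration_s, flash_count):
    """Average number of flashes per second (Hz) over one event."""
    return flash_count / duration_s


densities = {event: flash_density(d, n) for event, (d, n) in TABLE_III.items()}
for event, hz in densities.items():
    print(f"{event}: {hz:.4f} Hz")
```

Normalizing by duration is what lets a 12-second event such as WK stand out against a much longer event such as OP, even though the raw counts alone are less separable.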
We observed high variations in flash distributions among
events. For example, the WK event is merely 12 seconds long,
but there are 73 flashes. Its density reaches six flashes per
second, on average. By contrast, the OP event is of relatively
less importance to the audience, as described in Section III,
and it contains a small number of flashes even though it lasts for a
much longer duration.

Fig. 2. Example of a music signal with (a) its spectrogram using short-time Fourier transform and (b) its corresponding line map. Axes in both panels: time (sec) and frequency (Hz).

4) Leading Roles: As shown in Table II, the leading roles involved in various wedding events are different. For example,
the groom and the best man are the main characters in the GE
event; the groom, his bride, and officiants are the main focuses
in the RE event. The main characters’ occurrence pattern
gives a visual hint for the event category. A naïve solution
would be to recognize all roles in videos. This is, however,
not a trivial task with today’s technology. Fortunately, there
are some simple tricks to detect the bride, inarguably the
most important focus of a wedding. According to the western
tradition [1], [2], the bride invariably wears a white gown and
veil as a symbol of purity but the other female roles have
flexibility in their dress color. Therefore, it is more reliable to
represent the bride’s appearance assuming she wears white.

B. Selected Features for Event Modeling

Based on the observations of Section IV-A, four kinds of audiovisual features, related to the scopes of speech/music discrimination, applause detection, flash detection, and bride indication, are developed as basic features for event modeling. In the following, we detail the development for each adopted event feature and give their definitions in mathematical forms.

1) Event Features Related to Speech/Music Discrimination: As mentioned in Section IV-A.1, the audio recordings of weddings are often of poor quality. Thus, the selected audio features have to be discriminative enough between speech/music types for the given low-SNR inputs. However, in the literature,
most studies address the speech/music discrimination problem
only for clean data or with the assumption of known noise
types [22], [26]. To identify the audio features that are resistant
to noise, we first collect a comprehensive set of candidate
features from the previous work [22], [26], [27] and determine
the more reliable ones using feature selection algorithms [28],
[29].

Initially, tens of audio features are collected to form a candidate set, including the short-time energy, energy crossing, band energy ratio, root mean square (RMS), normalized RMS variance, zero crossing (ZC), joint RMS/ZC, bandwidth, silent interval frequency, mel-frequency cepstral coefficients (MFCCs), frequency centroid, maximal mean frequency, harmonic degree, music component ratio, and so forth [22], [26], [27]. Each of the collected audio features is assessed by information theoretical measures [28], [29], so as to estimate its discriminability between the speech and the music types. In the end, three of them are chosen for their stable performance under various noise types. They are the one-third energy crossing (OEC), the silent interval frequency (SIF), and the music component ratio (MCR), as detailed below. Note that, for extracting the audio features, the audio track of a wedding video is first converted to 44,100-Hz mono-channel format.
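
The information-theoretic assessment of candidate features mentioned above can be illustrated with a small example. The text does not state which measure was used, so the sketch below assumes mutual information between a coarsely quantized feature and the speech/music label; the function name `mutual_information`, the bin count, and the toy values are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels, n_bins=4):
    """Mutual information (in bits) between a quantized feature and class labels."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    # Equal-width quantization; the maximum value is clamped into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in feature_values]
    n = len(labels)
    count_bin = Counter(bins)
    count_label = Counter(labels)
    count_joint = Counter(zip(bins, labels))
    mi = 0.0
    for (b, y), c in count_joint.items():
        # p(b, y) * log2( p(b, y) / (p(b) * p(y)) )
        mi += (c / n) * math.log2(c * n / (count_bin[b] * count_label[y]))
    return mi


# Toy data (assumed): one feature separates the classes, the other does not.
labels = ["music"] * 3 + ["speech"] * 3
informative = [0.10, 0.20, 0.15, 0.90, 0.80, 0.95]
uninformative = [0.10, 0.90, 0.10, 0.90, 0.10, 0.90]

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
print(mi_informative, mi_uninformative)
```

A feature whose score stays high when the same computation is repeated on noisy recordings would be a candidate for the kind of noise-resistant selection described here.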
For simplicity, let x(n) be a discrete-time audio signal with time index n, and let N denote the total number of samples in the interval from which features are extracted.

Fig. 3. Classification results of the audio types of speech (the left subplot) and music (the right subplot) on three audio datasets of (a) Internet radio, (b) Internet radio with added white noises (5 dB), and (c) audio tracks from home videos, using a multi-class SVM classifier built upon the three audio features proposed in Section IV-B.1. Each subplot reports precision and recall (%).

• One-third Energy Crossing (OEC). One of the characteristics of a speech signal is that the corresponding amplitude has more obvious variations than that of music. Given a fixed threshold $\delta$, the number of the audio energy waveform’s crossings over $\delta$ is often higher in speech than in music. For each audio track, we empirically set $\delta$ to one-third of the whole range of its average amplitude. Therefore, OEC is defined as a measurement of the audio’s energy-spectral content as follows:

\[ \mathrm{OEC} \triangleq \frac{1}{2} \cdot \sum_{n=2}^{N} \left| \operatorname{sign}_{\delta}\!\left(x^2(n)\right) - \operatorname{sign}_{\delta}\!\left(x^2(n-1)\right) \right| \tag{1} \]

where

\[ \operatorname{sign}_{\delta}(a) = \begin{cases} 1, & a > \delta \\ 0, & a = \delta \\ -1, & a < \delta \end{cases} \tag{2} \]

As suggested by previous work [27], [30], the audio track is uniformly segmented into non-overlapping 1-second audio frames. For each audio frame, one feature value is computed in every 20-ms interval and these 50 short-time feature values are averaged to generate the representative OEC feature for that 1-second frame. The same mechanism is used in SIF extraction, as described next.

• Silent Interval Frequency (SIF). Since a speech signal is a concatenation of a series of syllables, it contains more pronouncing pauses than a music signal does. Therefore, SIF is defined to measure the silent intervals of an audio signal as follows [27]:

\[ \mathrm{SIF} \triangleq I\big( (ZC = 0) \ \text{or} \ (E < \delta_l) \ \text{or} \ (E < 0.1\,E_{\max} \ \text{and} \ E < \delta_h) \big) \tag{3} \]

where I(·) is the indicator function, E is the RMS of the signal amplitude, and $E_{\max}$ is the maximum RMS value of the whole audio track. To be precise,

\[ E = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x^2(n)} \tag{4} \]

and

\[ ZC \triangleq \frac{1}{2} \cdot \sum_{n=2}^{N} \left| \operatorname{sign}_{0}\!\left(x(n)\right) - \operatorname{sign}_{0}\!\left(x(n-1)\right) \right| \tag{5} \]

In addition, the two thresholds $\delta_l$ and $\delta_h$ are empirically set to 0.5 and 2, respectively. As described in OEC extraction, we compute a representative SIF feature for each 1-second audio frame by taking the average of 50 short-time SIF values.

• Music Component Ratio (MCR). Harmonicity is the
most prominent characteristic of a music signal. A music
signal often contains spectral peaks at certain frequency
levels and the peaks last for a period of time. This can be
observed from the “horizontal lines” in the spectrogram
of a music signal, as shown in Figure 2. MCR is then
defined as the average horizontal line number of an audio
spectrogram within a second, and the line extraction
algorithm is as follows:
1) Segment the given audio track into 40-ms audio frames with a 10-ms overlap between two successive frames.
2) Compute the spectrogram (Figure 2(a)) of the audio frames using short-time Fourier transform.
3) Convert the spectrogram to a corresponding gray-level image by taking the absolute values of the Fourier coefficients.
4) Construct a line map (Figure 2(b)) from the image using the Sobel operation [31], and apply a 7-order median filter to remove outliers along each row of the map.
5) Identify all horizontal lines in the line map using the Hough transform [31].
6) For each 1-second frame, calculate the line number from every 4-pixel-wide window with 2-pixel advance in the line map, and take the average of the line numbers as the final MCR value.

As a result, we use OEC, SIF, and MCR to practically realize a multi-class SVM classifier for speech/music discrimination [32]. The classifier has been evaluated on three small audio datasets, each containing approximately three hours of audio. The first dataset is collected from Internet radio and the second is obtained by adding 5 dB white noise to the first one. In addition, we constitute the third one from audio tracks of two kinds of home videos, i.e. the wedding and the birthday party. Here, the sound of birthday parties is included because its audio contents have higher variations and contain more diversified sound effects. For example, some of the birthday parties take place in a quiet home, and others are in a very noisy environment, such as a restaurant with crowds laughing, talking, and cheering. Then, a fivefold cross-validation experiment [9] is conducted for the classifier on each of the datasets and the results measured by average precisions and recalls are illustrated in Figure 3. The classification performance shows that the proposed audio features discriminate music/speech quite well even for audio with a substantial amount of noise.

Fig. 4. Examples of (a) two power spectrums of a wedding audio from consecutive time instances, one with applause (the top solid curve) and another without applause (the bottom dotted curve), and (b) a sigmoidal filter function. Horizontal axes: frequency (kHz); vertical axis in (a): magnitude (dB).

2) Event Features Related to Applause Detection: The same feature selection mechanisms, as described in the previous section, are applied to identify the noise-resistant audio
features for detecting the presence of applause in low-SNR
audio recordings. However, based on our experiments, the
audio features in the previous section generally do not perform
very well. Instead, a specific audio feature is developed for
applause detection. This feature exploits the physical properties
of applause, indicated in Section IV-A.2: when applause
arises in the audio signal, a significant increase in
magnitude can be observed over the whole power spectrum
[23], [24]. An example is illustrated in Figure 4(a). For
comparison, two power spectrums taken from consecutive time
instances of a wedding audio are depicted in the same figure.
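
To make the "increase over the whole power spectrum" property concrete, the sketch below compares the power spectra (in decibels) of two consecutive frames and reports the fraction of frequency bins that rose by at least a given amount. It is only an illustration under stated assumptions: the function names, the 10-dB threshold, the 128-sample frames, and the synthetic signals are all hypothetical, and a naive DFT from the standard library is used where a real system would use an FFT routine.

```python
import cmath
import math
import random


def power_spectrum_db(frame):
    """Power spectrum in dB of one frame via a naive DFT (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        # The +1 keeps the logarithm finite for empty bins.
        spectrum.append(10.0 * math.log10(abs(coeff) ** 2 + 1.0))
    return spectrum


def broadband_rise(prev_frame, cur_frame, min_rise_db=10.0):
    """Fraction of frequency bins whose power rose by at least min_rise_db."""
    before = power_spectrum_db(prev_frame)
    after = power_spectrum_db(cur_frame)
    rises = [a - b >= min_rise_db for a, b in zip(after, before)]
    return sum(rises) / len(rises)


# Synthetic stand-ins (assumed): quiet noise, loud broadband noise, and the
# quiet frame with a single strong tone added at DFT bin 10.
random.seed(0)
N = 128
quiet = [random.gauss(0.0, 1.0) for _ in range(N)]
loud = [random.gauss(0.0, 30.0) for _ in range(N)]
tone = [q + 30.0 * math.sin(2 * math.pi * 10 * i / N) for i, q in enumerate(quiet)]

print("broadband burst:", broadband_rise(quiet, loud))
print("single tone:    ", broadband_rise(quiet, tone))
```

A broadband burst raises nearly every bin, whereas a loud but narrowband sound raises only a few, which is the distinction an applause-specific feature needs to capture.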
The spectrum with applause (the top solid curve) is around 20 dB larger in magnitude than the one without applause (the bottom dotted curve) for almost all frequencies. To capture the global variations of audio magnitudes, an audio feature, the weighted short-time energy (WSE), is employed.

• Weighted Short-time Energy (WSE). The feature value of weighted short-time energy is defined as the weighted sum over the spectrum power (in decibels) of an audio signal at a given time as follows:

$$\mathrm{WSE} = \frac{1}{\mathrm{WSE}_{\max}} \int_{0}^{\omega_s} W(\omega) \cdot 10 \log\left(|SF(\omega)|^2 + 1\right) d\omega \tag{6}$$

where $SF(\omega)$ is the short-time Fourier transform coefficient of the frequency component $\omega$, and $W(\omega)$ is the corresponding weighting function. In addition, $\omega_s$ denotes the sampling frequency and $\mathrm{WSE}_{\max}$ is the maximum WSE in the audio track, used as a normalization factor. The calculation of WSE is special in that the spectrum power is in a logarithmic unit of decibels. Summation in the decibel domain is the same as multiplication in the energy domain. Owing to this logarithmic nature, a large WSE value comes from a global trend of high power over the whole spectrum rather than from a few dominant frequencies. Furthermore, since human speech is commonly observed in a wedding and speech signals are bandlimited to around 3.2 kHz [26], $W(\omega)$ is chosen to be a sigmoidal function (cf. Figure 4(b)) in order to suppress the contributions from low frequencies. Specifically,

$$W(\omega) = \frac{1}{1 + e^{-\omega_1(\omega - \omega_2)}}, \tag{7}$$

where $\omega_1$ and $\omega_2$ are control parameters and are respectively set to 2.5 (kHz) and 5.0 (kHz).

Fig. 5. Precision-recall curves of the applause detection results using two different thresholds, T_max and T_mean. (See Section IV-B.2 for details.) Horizontal axis: recall; vertical axis: precision.

As mentioned in
Section IV-B.1, the input audio track is first segmented into non-overlapping 1-second audio frames. For each audio frame, one feature value is computed for every 50-ms interval with a 10-ms overlap. A median filter is then applied to diminish possible noise. Instead of aggregation, based on our experiments, the maximum of these 25 feature values is selected as the representative WSE feature for that 1-second frame.

To verify the capability of WSE, a simple trial is conducted to detect the applause present in audio recordings using two different thresholds: $T_{\max}$ and $T_{\mathrm{mean}}$. That is, given a series of WSE values, we compute two thresholds by multiplying the maximum value and the mean value, respectively, by a numerical factor in [0,1]. Applause is then located at the positions whose WSE values are higher than the chosen threshold.
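As a rough illustration of the WSE computation and the thresholding trial described above, the following Python sketch may help. It is not the authors' Matlab implementation. All function names are hypothetical, the spectrum is evaluated only up to the Nyquist frequency, the last window of each frame is zero-padded, and the median filter length (3 points) is an assumption, since the text does not state it.

```python
import numpy as np

def sigmoid_weight(freq_khz, w1=2.5, w2=5.0):
    """Sigmoidal weighting of Eq. (7); w1 and w2 take the values quoted in the text."""
    return 1.0 / (1.0 + np.exp(-w1 * (freq_khz - w2)))

def window_wse(window, sample_rate):
    """Unnormalized WSE of one short window: weighted sum of the dB power spectrum."""
    spectrum = np.fft.rfft(window)
    freq_khz = np.fft.rfftfreq(len(window), d=1.0 / sample_rate) / 1000.0
    power_db = 10.0 * np.log10(np.abs(spectrum) ** 2 + 1.0)
    return float(np.sum(sigmoid_weight(freq_khz) * power_db))

def frame_wse(frame, sample_rate, win_ms=50, overlap_ms=10):
    """Representative WSE of a 1-second frame: maximum of median-filtered window values."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * (win_ms - overlap_ms) / 1000)
    values = []
    for start in range(0, len(frame), hop):
        window = frame[start:start + win]
        window = np.pad(window, (0, win - len(window)))  # zero-pad the last window
        values.append(window_wse(window, sample_rate))
    values = np.asarray(values)
    # 3-point median filter with edge replication (filter length is an assumption).
    padded = np.pad(values, 1, mode="edge")
    smoothed = np.median(np.stack([padded[:-2], padded[1:-1], padded[2:]]), axis=0)
    return float(smoothed.max())

def track_wse(audio, sample_rate):
    """Per-second WSE values, normalized by the track maximum as in Eq. (6)."""
    n_frames = len(audio) // sample_rate
    raw = np.array([frame_wse(audio[i * sample_rate:(i + 1) * sample_rate], sample_rate)
                    for i in range(n_frames)])
    return raw / raw.max()

def detect_applause(wse_values, factor, mode="max"):
    """Flag frames whose WSE exceeds factor * max (T_max) or factor * mean (T_mean)."""
    wse_values = np.asarray(wse_values, dtype=float)
    base = wse_values.max() if mode == "max" else wse_values.mean()
    return wse_values > factor * base
```

With a 50-ms window and a 40-ms hop, each 1-second frame yields the 25 window values mentioned in the text. Sweeping `factor` over [0,1] and comparing the flags with ground truth would trace out precision-recall curves of the kind reported for the two thresholds.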
Figure 5 illustrates the precision-recall curves of the average detection results on 15 audio tracks from a set of collected home videos, including weddings and birthday parties. The inclusion of birthday parties is for the same reason as described in Section IV-B.1. Overall, the performance is acceptable, and it shows that WSE can capture applause effectively even for noisy home video recordings.

3) Event Features Related to Flash Detection: Flashes of picture-taking can be detected from abrupt and short increases
of the global intensity in a video frame. A visual feature, the flash density, as suggested in Section IV-A.3, can then be defined as follows.

• Flash Density (FLD). In home videos, the durations of observed flashes are seldom longer than two video frames. In every 1-second interval, we compute a feature value of the flash density as follows:

$$\mathrm{FLD} = \sum_{t=2}^{M-1} I\left( (\hat{f}^I_t - \hat{f}^I_{t-1} \geq \epsilon) \text{ and } (\hat{f}^I_t - \hat{f}^I_{t+1} \geq \epsilon) \right) \tag{8}$$

where $M$ and $\hat{f}^I_t$ are respectively the total number of video frames and the average intensity of the frame $f_t$, and the threshold $\epsilon = 5$ was suggested by previous work [25] for flash detection.

Fig. 6. Examples of (a) a video frame $f_t$ with (b) the thresholded image and (c) the bridal white map with projection histograms $h(x)$ and $h(y)$.

Fig. 7. Precision-recall curves of the bride indication results. (See Section IV-B.4 for details.) Horizontal axis: recall; vertical axis: precision.

To get more insight into the FLD feature, we apply the flash detection algorithm to one wedding video used in
later experiments, i.e., Clip-A in Table V. In terms of flash counts, 457 flashes are correctly detected among the 482 true ones, and there are 17 false positives. The detection precision and recall are therefore 96.41% (457/474) and 94.81% (457/482), respectively. This performance shows that flashes can be robustly captured with our feature.

4) Event Features Related to Bride Indication: As mentioned in Section IV-A.4, the bride is an important leading
role in wedding events and her appearance can be detected by the color of “bridal white”. However, due to various lighting conditions, the determination of real bridal white is extremely difficult and often needs a laborious training process similar to that of skin color detection [33]. Instead, our current implementation approximates a bridal white map for each video frame, whereby a corresponding visual feature, the bridal white ratio (BWR), can then be defined. The bridal white map is generated using the following procedure:

1) Convert a video frame $f_t$ to the HSI color space [31], in which the values are within the range of [0,255].
2) Empirically set two thresholds, $\tau^I_t$ and $\tau^S_t$, for the intensity and the saturation of the bridal white, respectively:
$$\tau^I_t = \min(240, \hat{f}^I_t + 80) \quad \text{and} \quad \tau^S_t = 75. \tag{9}$$
3) Construct a thresholded image $\bar{\Omega}_t$ from the video frame using the above two thresholds, cf. Figure 6(b). The thresholded image is defined as
$$\bar{\Omega}_t(p) = \begin{cases} 1, & \text{if } f^I_t(p) \geq \tau^I_t \text{ and } f^S_t(p) < \tau^S_t \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
where $p$ is a pixel, and $f^I_t(p)$ and $f^S_t(p)$ denote $p$'s intensity and saturation values, respectively.
4) Obtain a bridal white map $\Omega_t$ (cf. Figure 6(c)) by removing outliers of $\bar{\Omega}_t$ using a morphological closing (i.e., dilation followed by erosion) [31]. That is,
$$\Omega_t = \bar{\Omega}_t \bullet Se \tag{11}$$
where $Se$ is a disk structuring element with a radius of 5 pixels and $\bullet$ denotes the closing operation.

Fig. 8. Examples of wedding event models of (a) the RE event and (b) the WK event. Each panel shows the feature models for OEC, SIF, MCR, WSE, FLD, and BWR.

After constructing the bridal white map, the feature, bridal white ratio, is then defined as follows:

• Bridal White Ratio (BWR). To obtain BWR, the technique of histogram projection [34] is applied to improve the reliability of $\Omega_t$. Specifically, based on the observation that the bride roughly appears in the shape of a white vertical bar (cf. Figure 6(a)), we add a spatial constraint that the white distribution in the vertical direction should be wider than that in the horizontal one. Therefore, we project the bridal white map along the x and the y directions to construct two 1-D histograms (cf. Figure 6(c)), from each of which the isolated component with the maximum white ratio is selected. For example, in Figure 6(c), there are three isolated components in the horizontal histogram but only one in the vertical one.
We compute the standard deviations, $s^x_t$ and $s^y_t$, of the white distributions for the maximum components along both axes. In every 1-second interval, a feature value of BWR is defined as

$$\mathrm{BWR} = \frac{1}{M} \sum_{t=1}^{M} \rho(\Omega_t) \cdot I(s^x_t < s^y_t) \tag{12}$$

where $\rho(\Omega_t)$ returns the white ratio of $\Omega_t$, i.e., the number of white pixels relative to the map size. Note that we use the average white percentage to avoid making a hard decision on whether the bride is present in video frames or not.

To understand its performance, a simple trial is carried out for bride indication by making binary decisions (i.e., presence or absence) on the basis of the obtained BWR values. Given a predefined threshold, a BWR value above the threshold indicates the bride’s presence; otherwise, her absence.
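The bridal white map and the BWR feature described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, intensity and saturation use the common HSI formulas I = (R+G+B)/3 and S = 1 - 3·min(R,G,B)/(R+G+B) scaled to [0,255], and the closing is implemented in its standard form (dilation followed by erosion) with a disk of radius 5.

```python
import numpy as np

def intensity_saturation(rgb):
    """HSI intensity and saturation of an RGB frame, both scaled to [0, 255]."""
    rgb = rgb.astype(float)
    total = rgb.sum(axis=2)
    intensity = total / 3.0
    saturation = 255.0 * (1.0 - 3.0 * rgb.min(axis=2) / np.maximum(total, 1e-9))
    return intensity, saturation

def dilate(mask, offsets, radius):
    """Binary dilation of a boolean mask by the given structuring-element offsets."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy, dx in offsets:
        out |= padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
    return out

def closing(mask, radius=5):
    """Morphological closing with a disk: dilation followed by erosion."""
    span = range(-radius, radius + 1)
    offsets = [(dy, dx) for dy in span for dx in span if dy * dy + dx * dx <= radius * radius]
    dilated = dilate(mask, offsets, radius)
    # Erosion as the complement of dilating the complement (the disk is symmetric).
    return ~dilate(~dilated, offsets, radius)

def largest_run_std(hist):
    """Std. dev. of the white distribution inside the heaviest isolated run of nonzero bins."""
    best_sum, best_std, start = 0.0, 0.0, None
    for i, v in enumerate(np.append(hist, 0)):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            seg = hist[start:i].astype(float)
            pos = np.arange(start, i)
            if seg.sum() > best_sum:
                mean = np.average(pos, weights=seg)
                best_sum = seg.sum()
                best_std = float(np.sqrt(np.average((pos - mean) ** 2, weights=seg)))
            start = None
    return best_std

def bridal_white_ratio(frames):
    """BWR of Eq. (12) over the frames of one 1-second interval."""
    values = []
    for rgb in frames:
        intensity, saturation = intensity_saturation(rgb)
        tau_i = min(240.0, intensity.mean() + 80.0)           # Eq. (9)
        white = closing((intensity >= tau_i) & (saturation < 75.0))  # Eqs. (10), (11)
        s_x = largest_run_std(white.sum(axis=0))              # projection onto the x axis
        s_y = largest_run_std(white.sum(axis=1))              # projection onto the y axis
        values.append(white.mean() * float(s_x < s_y))
    return float(np.mean(values))
```

A frame containing a tall white bar contributes its white ratio, whereas a wide white bar of the same area contributes zero because the vertical spread is not larger than the horizontal one.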
Figure 7 illustrates the precision-recall curves of the detection results for a wedding video, i.e., Clip-A in Table V. The “hard-decision” performance is promising, and we believe that the resulting “soft-decision” BWR is helpful for our modeling task.

V. WEDDING MODELING

The objective of wedding modeling is to estimate the event sequencing of a wedding video. At each time instance, extracted event features are exploited to recognize the wedding events. In addition, a wedding video is a kind of sequential data. The occurrence of a wedding event highly depends on the category of its preceding neighbors. Thus, wedding modeling needs to consider not only how likely the acquired features match an event candidate but also the temporal rationality of whether the candidate is appropriate to immediately follow the existing sequence. Therefore, we use an effective learning tool, the hidden Markov model (HMM), to describe the spatio-temporal relations of events within a wedding video [9]. In Sections V-A and V-B, we first build statistical models for feature similarity and temporal ordering for each of the wedding events. Section V-C then devises an integrated HMM framework for both the event-based analysis and the wedding segmentation.

TABLE IV. AN EVENT TRANSITION MODEL OF THE WEDDING EVENTS.
Column order: ME GE BE CS OP WV RE BU MS WK AP ED OT. Each row lists its nonzero entries from left to right.
ME   0.80 0.11 0.09
GE   0.12 0.80 0.08
BE   0.80 0.04 0.16
CS   0.80 0.16 0.01 0.03
OP   0.07 0.80 0.03 0.01 0.01 0.02 0.02 0.04
WV   0.80 0.13 0.03 0.03
RE   0.03 0.80 0.13 0.03
BU   0.80 0.20
MS   0.07 0.80 0.07 0.07
WK   0.03 0.11 0.03 0.80 0.03
AP   0.12 0.04 0.80 0.04
ED   1.00
OT   0.05 0.14 0.02 0.80

Before proceeding, note that we uniformly divide the wedding video into a sequence of 1-second units. The main
reason for this uniform pre-segmentation is that we cannot use conventional video units, such as shots, as the basic analysis units. This is because shots of a wedding video cannot be reliably obtained using conventional techniques, as mentioned in Section I. In addition, uniform segmentation makes online processing possible. For convenience, let $E$ denote an index set [35] of the wedding events, where the indexing consists of a bijective mapping from the event set $E_S = \{ME, GE, \ldots, OT\}$ to a set of natural numbers, i.e. $E = \{1, 2, \ldots, |E_S|\}$. Similarly, $F$ is an index set corresponding to the collection of event features $F_S = \{OEC, SIF, MCR, WSE, FLD, BWR\}$. For the $t$-th video unit, let $e_t \in E$ be the corresponding state variable that indicates the occurrence of a specific wedding event, and let $x_t = (x^1_t, \ldots, x^{|F|}_t)$ be the feature vector associated with the specific event features $x^j_t$, $j \in F$.

A. Wedding Event Modeling

For each of the wedding events, a statistical feature model is constructed for each of the adopted event features. Specifically, a feature model is a probability distribution describing the likelihood of feature values. The use of statistical histograms [31] is a naïve approach, but their discrete nature often causes unwanted discontinuity in results, especially when a feature value lies near the boundaries of histogram bins.
Instead, we accumulate the probability by regarding each feature sample as a Gaussian centered at the sample. Assume that, for the $i$-th event, we have $N$ samples for the $j$-th feature, $\{x^j_1, \ldots, x^j_N\}$, extracted from the training clips. The distribution $p_{i,j}$ of the $j$-th feature for the $i$-th event can then be obtained as

$$p_{i,j}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{\sigma_j \sqrt{2\pi}} e^{-(x - x^j_n)^2 / 2(\sigma_j)^2}, \quad \forall i \in E, \forall j \in F, \tag{13}$$

where $\int_{x=-\infty}^{\infty} p_{i,j}(x)\,dx = 1$ and $\sigma_j$ is a confidence parameter specifying how much we trust the extracted values of the $j$-th feature. That is, if the extracted feature samples are more accurate and reliable, we can set $\sigma_j$ to a smaller value.

Since the feature models are used for discriminating the wedding events, the divergence among feature models of different wedding events should be as large as possible.
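To make the role of the confidence parameter concrete, the Gaussian-smoothed feature model of Eq. (13) can be sketched in Python as below. The function name `feature_model` and the sample values in the usage note are hypothetical; the sketch only illustrates the formula and is not taken from the authors' implementation.

```python
import numpy as np

def feature_model(samples, sigma):
    """Return p(x) of Eq. (13): an average of Gaussians centered at the training samples."""
    samples = np.asarray(samples, dtype=float)
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        diff = x[:, None] - samples[None, :]          # pairwise differences x - x_n
        return (norm * np.exp(-diff ** 2 / (2.0 * sigma ** 2))).mean(axis=1)

    return density
```

For example, `feature_model([0.2, 0.25, 0.7], sigma=0.05)` yields a smooth density with mass concentrated around 0.2 to 0.25 and around 0.7. A smaller `sigma` gives taller and narrower peaks, which tends to make the models of different events less overlapping.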
Quantitatively, the divergence of two probability distributions $p$ and $q$ can be defined by the symmetric Kullback-Leibler (SKL) distance [28]:

$$D_{SKL}(p, q) = \frac{1}{2} \int_y \left[ p(y) \log \frac{p(y)}{q(y)} + q(y) \log \frac{q(y)}{p(y)} \right] dy \tag{14}$$

For the $j$-th feature, the confidence parameter $\sigma_j$ is chosen to maximize the sum of divergences among the same kind of feature models. That is,

$$\sigma_j = \arg\max_{\sigma} \sum_{i,k \in E,\, i<k} D_{SKL}(p_{i,j}, p_{k,j}) \tag{15}$$

To find the optimal $\sigma_j$, we use exhaustive search and empirically set a search range (e.g. [0, 1]) with a desired precision (e.g. 0.05). The optimal confidence parameters we found are $\sigma_{OEC} = 0.005$, $\sigma_{SIF} = 0.015$, $\sigma_{MCR} = 0.5$, $\sigma_{WSE} = 0.0025$, and $\sigma_{BWR} = 0.01$. It is worth noting that FLD is an exception because its values are discrete. As a result, we manually set $\sigma_{FLD} = 0$ and apply a 9-point normalized filter to the sample sequences of FLD feature values as an alternative to the Gaussian-based smoothing.

Therefore, given a video unit (e.g. the $t$-th one), we can compute the probability that we observe $x_t$ given that this video unit belongs to the $i$-th wedding event:

$$p(x_t | e_t = i) = \prod_{j=1}^{|F|} p_{i,j}(x^j_t) \tag{16}$$

Note that, in practice, we compute the log-likelihood by taking the logarithm of the expression, and thus assign a contributive weight $\lambda_j$ to the $j$-th feature model, where $\sum_j \lambda_j = 1$. In our experiments, we used a fixed set of weights, i.e. $\lambda_{OEC} = 0.25$, $\lambda_{SIF} = 0.2$, $\lambda_{MCR} = 0.1$, $\lambda_{WSE} = 0.1$, $\lambda_{FLD} = 0.1$, and $\lambda_{BWR} = 0.25$. They are automatically specified by optimizing the recognition accuracy of wedding events through a cross-validation process (cf. Section VI) that is iteratively repeated among training clips. An interesting phenomenon is that the audio-based event features take about two-thirds of the weights. This implies that audio information seems more crucial for the wedding analysis.

Overall, the proposed event modeling has the following advantages. First, it has good tolerance to inaccuracy and
uncertainty of the extracted event features. The Gaussian component helps to reduce and spread out the influence of an inaccurate feature value. Second, it avoids the artifacts due to quantization errors in the constructed feature models. The distribution of feature values can be faithfully represented without approximation. Figure 8 gives examples of feature statistical models for two wedding events, RE and WK.

B. Event Transition Modeling

The event transition model (ETM) is constructed to describe the probability that a wedding event is immediately followed
by another in a wedding ceremony. In other words, it evaluates whether a temporal transition is to be allowed between each pair of the wedding events. Therefore, ETM can be defined by an $|E| \times |E|$ matrix $A$ as follows:

$$A_{i,k} = Pr(e_t = k | e_{t-1} = i), \quad \forall i, k \in E \tag{17}$$

where $A_{i,k}$ is the entry of the $i$-th row and the $k$-th column of $A$, and $t-1$, $t$ are two successive time instances in units of seconds. Since all possible transitions are enumerated in $A$, the marginal probability along each row is unity, that is

$$\sum_{k=1}^{|E|} A_{i,k} = 1, \quad \forall i \in E. \tag{18}$$

In fact, given a training set of wedding videos with the event ground truth, we can tabulate an approximation of ETM, namely $\tilde{A}$. However, the obtained probability distributions are often extremely biased. That is, most of the probability mass tends to concentrate on the diagonal entries, i.e. $\tilde{A}_{i,i}$. This phenomenon is due to the fact that transitions are counted in seconds. For example, assuming that we have two successive events which are both 100 seconds long, only one event transition will be counted during this 200-second period.
Therefore, for each row of $\tilde{A}$ (e.g. the $i$-th one), we exploit a regularization to balance the probabilities as follows:

$$A_{i,k} = \begin{cases} \alpha_i \tilde{A}_{i,k}, & i = k \\ \dfrac{1 - \alpha_i \tilde{A}_{i,i}}{1 - \tilde{A}_{i,i}} \cdot \tilde{A}_{i,k}, & i \neq k \end{cases}, \quad \forall k \in E \tag{19}$$

where $\alpha_i$ is the regularization factor in the range of [0, 1]. To be precise, we shift some of the diagonal probabilities to the off-diagonal ones but keep their relative ratios unchanged.
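A minimal Python sketch of this regularization is given below, assuming transition counts tabulated per second from annotated training videos. The function name, the argument `target_diag`, and the choice of the regularization factor as `target_diag` divided by the tabulated diagonal entry (capped at 1) are assumptions made for illustration. Rows whose diagonal is already 0 or 1 are left unchanged, and every row is assumed to have at least one counted transition.

```python
import numpy as np

def regularize_etm(counts, target_diag=0.8):
    """Row-normalize transition counts and rebalance each row following Eq. (19)."""
    counts = np.asarray(counts, dtype=float)
    a_tilde = counts / counts.sum(axis=1, keepdims=True)   # tabulated approximation
    a = np.empty_like(a_tilde)
    for i, row in enumerate(a_tilde):
        diag = row[i]
        if diag <= 0.0 or diag >= 1.0:
            a[i] = row          # nothing to shift (e.g. an absorbing final event)
            continue
        alpha = min(1.0, target_diag / diag)                # regularization factor in [0, 1]
        a[i] = (1.0 - alpha * diag) / (1.0 - diag) * row   # off-diagonal entries
        a[i, i] = alpha * diag                              # diagonal entry
    return a
```

For instance, a tabulated row (0.99, 0.005, 0.005) becomes (0.8, 0.1, 0.1): the diagonal mass is reduced to 0.8 and the two off-diagonal entries keep their 1:1 ratio.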
Empirically, all of the diagonal entries are regularized to take approximately 80% of the probability along each row, i.e. $A_{i,i} \approx 0.8$, after regularization.

Table IV shows the ETM we learnt from training videos, in which the blank entries represent zero probabilities. The sparsity of the ETM shows that only a few types of event transitions are allowed. It also demonstrates that the occurrence of wedding events has a strong temporal correlation. This fact helps to reduce the computation cost and to increase the reliability of the determined event sequencing.

C. Wedding Segmentation Using HMM

Fig. 9. A simplified example of the HMM for wedding segmentation, drawn as a trellis of three states $e^1$, $e^2$, $e^3$ over $t = 1, 2, 3, \ldots$ (See Subsection V-C for details.)

An HMM is a specific instance of state space models, in which the concept of hidden states is introduced to recognize the temporal pattern of a Markov process [9]. Since the sequence of wedding events can be viewed as first-order Markov data, as shown in Section V-B, we exploit an HMM framework for segmenting wedding videos, in which the wedding event
statistical models (Section V-A) and the event transition model (Section V-B) are integrated together.

Specifically, given an input wedding video $V$, it is first partitioned into $N$ 1-second video units, $V = \{v_1, \ldots, v_N\}$. For each video unit $v_t$, $t \in \{1, \ldots, N\}$, we have a set of $|F|$ event features associated with it, i.e. $x_t = (x^1_t, \ldots, x^{|F|}_t)$. Collecting all the observations $X = \{x_1, \ldots, x_N\}$, our goal is to find the most probable event sequencing $S$ for $V$, where $S = \{e_1, \ldots, e_N\}$. Therefore, we develop a left-to-right HMM with $|E|$ states $\{e^i | i \in E\}$, in which each state corresponds to one of the adopted event categories. The HMM is governed by a set of parameters, $\theta = \{\pi, A, \phi\}$, where $\pi$, $A$, and $\phi$ are the initial state probabilities, the state transition probabilities, and the emission probabilities, respectively [9]. Figure 9 illustrates a trellis representation of a simplified HMM with only three states. Clearly, $\phi$ and $A$ have been explicitly described by the wedding event models and the event transition model, respectively. Without loss of generality, $\pi$ is presumed to be a uniform distribution, i.e. $p(e_1 = i | \pi) = 1/|E|$, $\forall i \in E$. Accordingly, our goal of finding the optimal sequencing $S$ can be formulated as

$$\begin{aligned} S &= \arg\max_{s} Pr(X, S | \theta) \\ &= \arg\max_{s} \; p(e_1 | \pi) \prod_{t=2}^{N} p(e_t | e_{t-1}, A) \prod_{t=1}^{N} p(x_t | e_t, \phi) \\ &= \arg\max_{s} \; p(e_1 | \pi) \prod_{t=2}^{N} A_{e_{t-1}, e_t} \prod_{t=1}^{N} \prod_{j=1}^{|F|} p_{e_t, j}(x^j_t) \end{aligned} \tag{20}$$

where the second and the third terms are derived from Eqns. (17) and (16), respectively. Because the HMM trellis is equivalent to a directed tree (as shown in Figure 9), the solution of $S$ can be efficiently obtained using the Viterbi algorithm [9].

After labeling each 1-second unit of the input video, the temporal extent of a detected wedding event, also called an
event segment, is defined by collecting successive video units with the same event label. Finally, a smoothing scheme is applied to reduce possible labeling errors. Since, in general, a wedding event lasts for at least tens of seconds, we remove each short segment (less than 10 seconds in duration) by merging it into its neighbors. If its preceding and succeeding neighbors belong to different event categories, it is merged into the left one; otherwise, all three segments are merged into one event.

TABLE V. THE COLLECTION OF SIX WEDDING VIDEOS USED IN OUR EXPERIMENTS.
Clip             A      B      C      D      E      F
Duration (sec)   2215   410    4122   3790   1062   1350
Event #          17     8      35     23     15     14

TABLE VI. THE STATISTICS OF MEANS $\mu$ AND VARIANCES $\sigma^2$ OF EVENT DURATION FOR EACH OF THE EVENT CATEGORIES IN OUR VIDEO COLLECTION (UNIT: SECONDS).
Event              ME      GE      BE      CS      OP      WV      RE      BU      MS      WK      AP      ED      OT
(a) from all event samples
$\mu_i$            92.00   42.33   114.00  139.90  130.91  163.33  135.50  47.33   166.00  11.60   68.33   75.20   149.08
$\sigma_i$         38.11   36.25   67.73   104.62  182.28  61.71   13.20   6.66    62.60   1.14    6.66    13.48   67.13
(b) from half of the event samples with shorter durations
$\tilde{\mu}_i$    45.33   19.00   37.00   56.64   54.24   88.50   111.67  38.67   132.50  10.00   61.33   51.33   97.63
$\tilde{\sigma}_i$ 15.95   5.57    1.41    32.08   32.16   26.16   23.63   8.39    33.23   1.00    5.51    24.01   40.17

TABLE VII. THE RECOGNITION RESULTS OF ALL WEDDING EVENTS (UNIT: SECONDS).
Events     ME      GE      BE      CS      OP      WV      RE      BU      MS      WK      AP      ED      OT   RR(%)
ME        547       0      32       0       0       0       0       0       0       0       0       0       0   94.47
GE         25      99      18       0       0       0       0       0       0       0       0       0       0   69.72
BE         80       0     350       0       0       0       0       0       0       0       0       0       0   81.40
CS          0       0       0    2320      93       0       0       0       0      42      64       0     154   86.79
OP          1       0       5     212    3622     145     459       4       0       2      28       8     156   78.03
WV          0       0       0      43      77     602      73       0       0       0       0       0       0   75.72
RE          0       0       0       0      55     152     442       6       0       0       0       0       0   67.48
BU          0       0       0       0       0       0       0     183       0       2       0       0       0   98.92
MS          0       0       0       9     113       0       0       0     143       0       0       0       0   53.96
WK          0       0       0       0       0       0       0       0       0      87       0       0       0  100.00
AP         30       0       0      23       2       0       0       0       0       0     164       0       2   74.21
ED          0       0       0       0       3       0       0       0       0       0       0     427       0   99.30
OT          0       0       0     586     509     130      96      17       0       0      48       0     436   23.93
RP(%)   80.09  100.00   86.42   72.66   80.96   58.50   41.31   87.14  100.00   65.41   53.95   98.16   58.29

VI. EXPERIMENTAL RESULTS

This section first presents experimental results for the evaluation of the proposed framework in wedding event recognition (Section VI-A) and wedding ceremony video segmentation (Section VI-B). Then, we show comparisons with another well-known algorithm, linear-chain conditional random fields (LCRF), and an extension of our system to a practical scenario
in Sections VI-C and VI-D, respectively.

In our experiments, we used a total of six wedding video clips. Each of them contains a complete recording of a wedding ceremony. Three observers (none of them clip owners) collaboratively annotated the event ground truth. Table V summarizes the statistics of the videos used in the experiments and reports the durations and numbers of the annotated events for all six videos. Our experiments were performed using a leave-one-out cross-validation strategy, in which models were trained from five clips and tested on the remaining one, and the whole training-testing procedure was iterated six times. In addition, our current system is programmed using Matlab 7.2 without code optimization, and runs on a machine with an Intel P4 3.0 GHz CPU, 1.0 GB memory, and MS Windows XP Professional x32 Edition. Based on the experiments below in Section VI-A, the average testing time for a clip is about 15 times longer than its original video length, and the extraction of audiovisual features accounts for around 96% of the time.

A. Event Recognition Analysis

Table VII summarizes the event recognition results in units of seconds, presented in the form of a confusion matrix
[30], where the leftmost column represents the actual event categories while the topmost row indicates the resultant ones recognized by the HMM framework. The confusion matrix is accumulated from the results of all clips in the collection. The recognition precision (RP) and the recognition recall (RR) for each of the event categories are reported in Table VII. As described in Section I, since the actual boundaries between wedding events are not always precise, the recognition result of a video unit is claimed to be correct if it hits the ground truth

TABLE VIII. THE RECOGNITION RESULTS SOLELY BASED ON THE FEATURE SIMILARITY OF WEDDING EVENTS WITHOUT EXPLOITING THE EVENT TRANSITION MODELING.
Events     ME      GE      BE      CS      OP      WV      RE      BU      MS      WK      AP      ED      OT
(a) using audio features only
RP(%)   34.54   30.14   42.57   78.39   69.81       0       0   71.55       0   12.09       0   34.80   69.32
RR(%)   87.39   59.86   87.21   64.46   88.49       0       0   44.86       0   96.55       0   80.93   13.89
(b) using visual features only
RP(%)   20.30   12.58   30.64   47.10   62.68   36.61       0   20.40       0    4.32       0   16.59   45.49
RR(%)   87.74   66.20   29.07   45.23   14.81    8.43       0   44.32       0   87.36