
How to refine raw videos

Fabrizio Pedersoli (National Institute of Advanced Industrial Science and Technology (AIST))


The original raw video recordings of the AIST dance dataset contain a pre-roll and a post-roll around the actual dance performance (the main dancing part). The videos must therefore be cut precisely to remove the unwanted parts. In addition, the audio in the raw recordings is picked up by the camera microphones. This audio is of low quality and must be replaced with the corresponding original audio sample.

Because of the large volume of videos, an automatic algorithm for cutting them is needed. This algorithm must reliably detect the starting time of the dance performance with high precision (10 ms). Since the duration of the music piece is known in advance, the ending time of the dance performance is derived from the starting time. Moreover, each estimated starting time must be validated by a double-checking procedure so that possible errors are detected automatically as well.


The main idea of the proposed algorithm is to detect the sequence of eight beat clicks that comes before the music in the raw recordings. The beat-click sequence is characterized by two types of beat clicks, which we refer to as A beat (higher pitch) and B beat (lower pitch). The beat-click sequence is arranged as the pattern ABBBABBB. The objective of the proposed algorithm is to find the position of the first A beat, t_{A_0}.

Given t_{A_0}, the music start time t_s is obtained by adding the duration of the beat-click sequence: t_s = t_{A_0} + T_b. The duration T_b is known in advance because each music piece has a particular beat duration T that corresponds to a predetermined tempo.

To identify t_{A_0} reliably and with high precision, a template matching approach is used. For each dance style s, music tempo m and camera c, a random video is selected and the beat-click sequence is manually cut from its audio. This audio signal \tau_{s,m,c}[t] serves as the beat-click template for the corresponding category of videos (s, m, c).

Given the raw audio x[t] of a particular dance video, the corresponding template is selected and the first occurrences of A and B are automatically extracted from it. We denote their beat-click templates (temporal shapes) as b_A[t] and b_B[t], respectively. b_A[t] is used for estimating t_{A_0}, while b_B[t] is used for double-checking under the assumption that the tempo is known. An example of a beat-click template and its associated components is given in the figure below.

The templates b_A[t] and b_B[t] are matched against x[t] by computing the following error function, where b stands for either b_A or b_B and yields err_A[t] or err_B[t], respectively:

err[t] = \sum_{i} \left| x[t+i] - b[i] \right|^2
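The error function above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration on plain lists, not the original implementation; the function name `matching_error` and the toy signal are assumptions made for the example. For real audio, a vectorized implementation would be preferable.

```python
def matching_error(x, b):
    """Return err[t] = sum_i |x[t+i] - b[i]|^2 for every valid offset t."""
    if len(b) == 0 or len(b) > len(x):
        raise ValueError("template must be non-empty and no longer than the signal")
    n_offsets = len(x) - len(b) + 1
    return [
        sum((x[t + i] - b[i]) ** 2 for i in range(len(b)))
        for t in range(n_offsets)
    ]


# Toy signal: silence with one copy of the template starting at sample 3.
template = [1.0, -1.0, 0.5]
signal = [0.0, 0.0, 0.0, 1.0, -1.0, 0.5, 0.0, 0.0]

err = matching_error(signal, template)
# Offset with the smallest error, i.e. the best match of the template.
best = min(range(len(err)), key=err.__getitem__)
```

In this toy case the error drops to zero exactly where the template was placed, which is the behaviour the matching step relies on.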

The figure below shows an example of two error functions.

It is reasonable to suppose that the minima of err_A[t] are located at the positions of the A shapes. Similarly, the minima of err_B[t] are located at the positions of the B shapes. A candidate for t_{A_0} is the minimum of err_A[t]. However, there is no guarantee that this minimum corresponds to A_0, as it may also be found at the position of A_4 (see figure above). Nonetheless, we can leverage the fact that these minima should have a particular spacing, because the beat duration T is known. Thus, if we compute the function:

err_0[t] = err_A[t] + err_A[t + 4\cdot T]

we can take the location of the minimum of err_0[t] as the candidate for t_{A_0}. The two strongest minima of err_A should be spaced exactly 4T apart, so when the function is added to a copy of itself shifted by 4T, the two minima coincide only at the position of A_0, and the minimum of the pointwise sum falls there. This procedure clearly distinguishes A_0 from A_4.
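The effect of the shifted sum can be illustrated on a toy error curve. The sketch below assumes that T is expressed in whole samples; the names `shifted_sum_error` and `argmin` and all numeric values are hypothetical and chosen only for the illustration.

```python
def shifted_sum_error(err_a, beat_period):
    """Return err_0[t] = err_A[t] + err_A[t + 4*T] where both terms exist."""
    shift = 4 * beat_period
    n = len(err_a) - shift
    if n <= 0:
        raise ValueError("error curve is shorter than 4 beat periods")
    return [err_a[t] + err_a[t + shift] for t in range(n)]


def argmin(values):
    """Index of the smallest value (the first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


T = 2                      # toy beat period in samples (assumption)
err_a = [5.0] * 16         # flat error curve ...
err_a[3] = 1.0             # ... with a dip at A_0
err_a[3 + 4 * T] = 0.5     # ... and a slightly deeper dip at A_4

naive = argmin(err_a)                        # picks the dip at A_4
t_a0 = argmin(shifted_sum_error(err_a, T))   # picks the dip at A_0
```

Here the plain minimum of err_A lands on the deeper dip at A_4, while the minimum of the shifted sum lands on A_0, because only there do both terms of the sum hit a dip.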

The estimated t_{A_0} is then validated by the double-checking procedure. Since the beat-click sequence has a particular structure, we need to check that the remaining pattern occurs after t_{A_0}. For this purpose, another function based on err_B[t] is computed to detect t_{B_1}:

err_1[t] = err_B[t] + \sum_{i\in \{1,2,4,5,6\}} err_B[t+i\cdot T]

By the same reasoning as before, the minimum of err_1[t] should be located at the position of B_1.
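The index set in err_1 follows from the pattern ABBBABBB: counted from B_1, the six B clicks sit 0, 1, 2, 4, 5 and 6 beats later, and the gap at 3 is the second A click. A minimal sketch of this sum, again assuming T in whole samples and using hypothetical names and toy values:

```python
# Beat offsets of the six B clicks relative to B_1 in the pattern ABBBABBB.
B_OFFSETS = (0, 1, 2, 4, 5, 6)


def b_pattern_error(err_b, beat_period):
    """Return err_1[t] = sum of err_B[t + i*T] over i in B_OFFSETS."""
    n = len(err_b) - max(B_OFFSETS) * beat_period
    if n <= 0:
        raise ValueError("error curve is shorter than the B-click pattern")
    return [
        sum(err_b[t + i * beat_period] for i in B_OFFSETS)
        for t in range(n)
    ]


T = 2                       # toy beat period in samples (assumption)
A0 = 3                      # toy position of the first A click
err_b = [5.0] * 20          # flat error curve ...
for k in (1, 2, 3, 5, 6, 7):
    err_b[A0 + k * T] = 1.0  # ... with a dip at every B click

err_1 = b_pattern_error(err_b, T)
t_b1 = min(range(len(err_1)), key=err_1.__getitem__)
```

Only at the true position of B_1 do all six terms of the sum fall on a dip, so the minimum of err_1 is found one beat after the toy A_0.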

Finally, the double-checking procedure verifies that:

0.95 \leq \frac{t_{B_1} - t_{A_0}}{T} \leq 1.05

If this condition holds, the estimated t_{A_0} is considered valid and the music start time t_s is computed by adding the overall duration of the sequence of eight beat clicks. Otherwise, an error is flagged and t_s is identified manually. Finally, on the basis of t_s, the pre-roll and post-roll of the raw video are automatically trimmed away, and the noisy raw audio is automatically replaced with the noiseless original music piece.
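The validation step and the computation of t_s can be summarized as follows. This is a sketch under stated assumptions: times are in seconds, the duration of the click sequence is taken to be T_b = 8 T (eight clicks, one per beat), and the function name, parameter names and example values are hypothetical.

```python
def validate_and_locate_start(t_a0, t_b1, beat_duration, n_clicks=8, tol=0.05):
    """Return the music start time t_s, or None if the double check fails.

    The check requires 0.95 <= (t_B1 - t_A0) / T <= 1.05 for tol = 0.05.
    Assumption: the click sequence lasts n_clicks * T, i.e. T_b = 8 T.
    """
    ratio = (t_b1 - t_a0) / beat_duration
    if not (1.0 - tol <= ratio <= 1.0 + tol):
        return None  # flag for manual identification of t_s
    return t_a0 + n_clicks * beat_duration


T = 0.5  # beat duration in seconds, e.g. a 120 BPM piece (example value)

# B_1 found 0.51 s after A_0: ratio 1.02, accepted.
t_s = validate_and_locate_start(12.34, 12.85, T)

# B_1 found 0.56 s after A_0: ratio 1.12, rejected.
rejected = validate_and_locate_start(12.34, 12.90, T)
```

Returning `None` rather than a best guess keeps doubtful detections out of the automatic trimming step, so they can be handled manually.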