How to refine raw videos

Fabrizio Pedersoli (National Institute of Advanced Industrial Science and Technology (AIST))

Overview

The original raw video recordings of the AIST dance dataset contain a pre-roll and a post-roll around the actual dance performance (the main dancing part), so the videos must be cut precisely to remove these unwanted parts. In addition, the audio in the raw recordings is picked up by the camera's microphone. This audio is of low quality and must be replaced with the corresponding original audio track.

Due to the large volume of videos, the cutting must be done by an automatic algorithm. This algorithm must reliably detect, with high precision (10 ms), the starting time of the dance performance. Since the duration of the music piece is known in advance, the ending time of the dance performance is derived from the starting time. Moreover, each estimated starting time must be validated by a double-checking procedure so that possible errors are also detected automatically.

Algorithm

The main idea of the proposed algorithm is to detect the sequence of eight beat clicks that precedes the music in the raw recordings. The beat-click sequence consists of two types of beat clicks, which we refer to as the $A$ beat (higher pitch) and the $B$ beat (lower pitch), arranged in the pattern $ABBBABBB$. The objective of the proposed algorithm is to find the position of the first $A$ beat, $t_{A_0}$.

Given $t_{A_0}$, the music start time $t_s$ is obtained by adding the duration of the beat-click sequence: $t_s = t_{A_0} + T_b$. The duration $T_b$ is known in advance because each music piece has a particular beat duration $T$ that corresponds to a predetermined tempo.
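The relation between tempo, $T$ and $t_s$ can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: the tempo is given in beats per minute, and the eight clicks are spaced one beat apart with the music entering on the beat after the last click, so that $T_b = 8T$. The function names are hypothetical.

```python
def beat_duration(bpm: float) -> float:
    """Beat duration T in seconds for a tempo given in beats per minute."""
    return 60.0 / bpm


def music_start_time(t_a0: float, bpm: float, n_clicks: int = 8) -> float:
    """Music start time t_s = t_A0 + T_b.

    Assumption: T_b = n_clicks * T, i.e. the music starts one beat
    after the last of the eight clicks.
    """
    return t_a0 + n_clicks * beat_duration(bpm)


# Example: first A click at 2.0 s, tempo of 120 BPM (T = 0.5 s).
print(music_start_time(2.0, 120.0))  # 6.0
```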

In order to reliably identify $t_{A_0}$ with high precision, a template matching approach is used. For each dance style $s$, music tempo $m$ and camera $c$, a random video is selected and the beat-click sequence is manually cut out of its audio. This audio signal $\tau_{s,m,c}[t]$ serves as the beat-click template for the corresponding category of videos $(s, m, c)$.

Given the raw audio $x[t]$ of a particular dance video, the corresponding template is selected and the first occurrences of $A$ and $B$ are automatically extracted from it; we denote their beat-click templates (temporal shapes) as $b_A[t]$ and $b_B[t]$, respectively. $b_A[t]$ is used for estimating $t_{A_0}$, while $b_B[t]$ is used for double-checking under the assumption that the tempo is known. An example of a beat-click template and its associated components is given in the figure below.

The templates $b_A[t]$ and $b_B[t]$ are matched against $x[t]$ by computing the following error function, where $b = b_A$ gives $err_A[t]$ and $b = b_B$ gives $err_B[t]$:

$err[t] = \sum_{i} \left| x[t+i] - b[i] \right|^2$
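As an illustration, the error function can be evaluated for every offset $t$ with NumPy. The sketch below is not the original implementation: it assumes mono audio held in 1-D arrays, evaluates only offsets where the template fully overlaps the signal, and expands the square into $\sum x^2 - 2\sum x\,b + \sum b^2$ so that no large intermediate matrix is built. The function name `matching_error` is hypothetical.

```python
import numpy as np


def matching_error(x, b):
    """Return err[t] = sum_i (x[t+i] - b[i])**2 for every full overlap."""
    x = np.asarray(x, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = b.size
    if n == 0 or x.size < n:
        raise ValueError("template must be non-empty and no longer than the signal")
    # Energy of each length-n window of x, via a cumulative sum of squares.
    csum = np.concatenate(([0.0], np.cumsum(x * x)))
    window_energy = csum[n:] - csum[:-n]
    # Cross term sum_i x[t+i] * b[i] for every valid offset t.
    cross = np.correlate(x, b, mode="valid")
    return window_energy - 2.0 * cross + np.dot(b, b)


# Toy example: the template occurs exactly at offset 2.
x = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
b = [1.0, 2.0, 1.0]
err = matching_error(x, b)
print(int(np.argmin(err)))  # 2
```

For long recordings the direct correlation can be slow; an FFT-based correlation would be a natural substitute for the cross term.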

The figure below shows an example of two error functions.

It is reasonable to suppose that the minima of $err_A[t]$ are located at the positions of the $A$ shapes. Similarly, the minima of $err_B[t]$ are located at the positions of the $B$ shapes. A natural candidate for $t_{A_0}$ is the location of the global minimum of $err_A[t]$; however, there is no guarantee that this minimum falls at $A_0$, as it may also fall at $A_4$ (see figure above). Nonetheless, we can leverage the fact that these minima must have a particular spacing, because the beat duration $T$ is known. Thus, if we compute the function:

$err_0[t] = err_A[t] + err_A[t + 4\cdot T]$

we can assume that the location of the minimum of $err_0[t]$ is the candidate for $t_{A_0}$. The two strongest minima of $err_A$ should be spaced by exactly $4T$, and we add the function to a copy of itself shifted by $4T$. The two minima therefore coincide only at $t = t_{A_0}$, so the minimum of this pointwise sum should correspond to the position of $A_0$. This procedure lets us clearly distinguish $A_0$ from $A_4$.
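A small numerical sketch of this step, assuming $err_A$ is sampled on a regular grid and the beat duration has already been converted to an integer number of samples (`beat_samples`, a hypothetical parameter). The sum is only evaluated where both terms are defined, which is an implementation choice rather than something the text specifies.

```python
import numpy as np


def err_0(err_a, beat_samples):
    """Return err_0[t] = err_A[t] + err_A[t + 4*T], with T in samples."""
    err_a = np.asarray(err_a, dtype=np.float64)
    shift = 4 * beat_samples
    n = err_a.size - shift  # number of offsets where both terms exist
    if n <= 0:
        raise ValueError("error signal is shorter than four beats")
    return err_a[:n] + err_a[shift:shift + n]


# Toy error curve: dips at A_0 (sample 10) and A_4 (sample 30),
# with the A_4 dip slightly deeper than the A_0 dip.
err_a = np.full(80, 10.0)
err_a[10] = 2.0
err_a[30] = 1.0

print(int(np.argmin(err_a)))                         # 30, the A_4 position
print(int(np.argmin(err_0(err_a, beat_samples=5))))  # 10, the A_0 position
```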

The estimated $t_{A_0}$ is then validated by the double-checking procedure. Since the beat-click sequence has a particular structure, we need to check that the remaining pattern occurs after $t_{A_0}$. To this end, another function is computed from $err_B[t]$ in order to detect $t_{B_1}$, the position of the first $B$ beat:

$err_1[t] = err_B[t] + \sum_{i\in \{1,2,4,5,6\}} err_B[t+i\cdot T]$

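By the same reasoning as before, the minimum of $err_1[t]$ should be located at the position of $B_1$: in the pattern $ABBBABBB$, the remaining $B$ beats ($B_2$, $B_3$, $B_5$, $B_6$, $B_7$) lie $1, 2, 4, 5$ and $6$ beats after $B_1$, so all six terms reach a minimum together only at $t = t_{B_1}$.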
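The same idea can be sketched in code, again assuming a regularly sampled $err_B$ and an integer beat length in samples (`beat_samples`, hypothetical). The offsets $0, 1, 2, 4, 5, 6$ are taken directly from the definition of $err_1[t]$.

```python
import numpy as np

# Beat offsets of the six B clicks relative to B_1 in the pattern ABBBABBB.
B_OFFSETS = (0, 1, 2, 4, 5, 6)


def err_1(err_b, beat_samples):
    """Return err_1[t] = sum of err_B[t + k*T] over k in B_OFFSETS."""
    err_b = np.asarray(err_b, dtype=np.float64)
    shifts = [k * beat_samples for k in B_OFFSETS]
    n = err_b.size - shifts[-1]  # offsets where every term is defined
    if n <= 0:
        raise ValueError("error signal is shorter than six beats")
    return np.sum([err_b[s:s + n] for s in shifts], axis=0)


# Toy error curve with T = 4 samples and B_1 at sample 20.
err_b = np.full(60, 5.0)
err_b[[20, 24, 28, 36, 40, 44]] = 0.0

print(int(np.argmin(err_1(err_b, beat_samples=4))))  # 20
```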

Finally, the double-checking procedure verifies that:

$0.95 \leq \frac{t_{B_1} - t_{A_0}}{T} \leq 1.05$

If this condition holds, the estimated $t_{A_0}$ is considered valid and the music start time $t_s$ is computed by adding the overall duration of the sequence of eight beat clicks. Otherwise, an error is flagged and $t_s$ is identified manually. Finally, on the basis of $t_s$, the pre-roll and post-roll of the raw video are automatically trimmed away and the noisy raw audio is automatically replaced with the clean original music track.
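The decision logic of the double-checking step can be summarised in a short sketch. All names here are hypothetical, times are assumed to be in seconds, and the click-sequence duration $T_b$ and the music duration are passed in as known quantities. Returning `None` stands in for flagging the video for manual identification.

```python
def is_valid_start(t_a0, t_b1, beat, low=0.95, high=1.05):
    """Check that B_1 follows A_0 by roughly one beat duration T."""
    ratio = (t_b1 - t_a0) / beat
    return low <= ratio <= high


def cut_interval(t_a0, t_b1, beat, click_duration, music_duration):
    """Return (t_s, t_end) in seconds, or None if the check fails."""
    if not is_valid_start(t_a0, t_b1, beat):
        return None  # flagged: t_s has to be identified manually
    t_s = t_a0 + click_duration   # t_s = t_A0 + T_b
    return t_s, t_s + music_duration


# Consistent estimates: B_1 is found exactly one beat after A_0.
print(cut_interval(1.0, 1.5, 0.5, click_duration=4.0, music_duration=30.0))
# Inconsistent estimates: B_1 is found four beats after A_0.
print(cut_interval(1.0, 3.0, 0.5, click_duration=4.0, music_duration=30.0))
```

The first call yields the interval to keep, and the second returns `None` because the ratio falls outside $[0.95, 1.05]$.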