Learn it Up - Logs

12/05/2026:

13/05/2026:

14/05/2026:

15/05/2026:

IDEA: Consider training LSTM models using recurrent batch normalization and residual connections. Before deciding, I must search for existing applications of these techniques to LSTMs in other papers and look at their results.

16/05/2026:

18/05/2026:

QUESTION: DanceDanceConv and DanceDanceConvLSTM both use an architecture similar to an encoder-decoder. Could something like reversing the input improve the model, as it does with seq2seq?

QUESTION: Could adding this attention encoder-decoder mechanism to DanceDanceConvolution help increase performance without adding the computational costs of ConvLSTM?

19/05/2026:

QUESTION: Honestly, the original approach is much simpler, but I see the appeal of the transformer. I wonder if you would get great performance in RNN attention-based networks by adding residuals and/or normalization. Food for thought.

I’ve noticed a detail that I had glossed over in DanceDanceConvLSTM: it does utilize music information when doing step placement, which feels right to me. I shall also inspect the transformer-based generation paper Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts (Yi2023) to determine whether they do that too.

20/05/2026:

NOTES

As a somewhat experienced (average S16) player of Pump it Up with some experience writing my own charts for the game, I could instantly note a few issues with the architecture of the original DanceDanceConvolution paper, the main ones being:

1. The step placement process doesn’t align the steps with the BPM of the song (in real charts, the steps are always placed at fractions of a full measure); instead it places the steps at arbitrary positions, possibly misaligned with the BPM of the song.
2. The step selection process does not take music information into consideration; it only receives the step placements and tries to spit out the steps.
3. The chart generation models are trained on all available chart styles (technical, run, goofy). This is like trying to train a single model to translate text from English using datasets of text in various languages.
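To make point 1 concrete, here is a minimal sketch of snapping predicted step times to a beat grid. It assumes a constant BPM and a known time for beat 0; `snap_to_grid` and its parameters are hypothetical names for illustration, not something taken from DDC or DDCLSTM.

```python
def snap_to_grid(times, bpm, offset=0.0, subdivisions=4):
    """Snap timestamps (seconds) to the nearest 1/subdivisions of a beat.

    Assumes a constant BPM and that beat 0 falls at `offset` seconds.
    """
    step = 60.0 / bpm / subdivisions  # duration of one grid slot in seconds
    return [offset + round((t - offset) / step) * step for t in times]


# At 120 BPM a beat lasts 0.5 s, so 4 slots per beat are 0.125 s apart.
print(snap_to_grid([0.49, 1.13, 1.30], bpm=120))  # → [0.5, 1.125, 1.25]
```

With variable BPM the grid would have to be rebuilt per BPM segment, which this sketch does not attempt.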

The first two¹ were fortunately addressed in DanceDanceConvLSTM from 2025, but the third point is still untackled.

Additionally, the BPM detection often fails to detect the right offset for songs with variable or changing BPM. Since the algorithm used by DDCLSTM is the same one used by Arrow Vortex, the software I use for writing charts, I know that it works pretty badly for BPM-changing charts, requiring manual intervention and manual placement of BPM changes. This is even worse for charts with continuously changing BPMs. When training on Pump it Up charts, I should remove these BPM-changing songs.
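A minimal sketch of that filter, assuming the `#BPMS:beat=bpm,beat=bpm;` tag layout of `.sm` and `.ssc` files. `parse_bpms` and `has_bpm_changes` are hypothetical helpers; they only look at the first `#BPMS` tag and ignore stops, delays and any per-chart timing.

```python
import re


def parse_bpms(simfile_text):
    """Return the (beat, bpm) pairs of the first #BPMS tag, or [] if absent."""
    match = re.search(r"#BPMS:([^;]*);", simfile_text)
    if match is None:
        return []
    pairs = []
    for entry in match.group(1).split(","):
        entry = entry.strip()
        if entry:
            beat, bpm = entry.split("=")
            pairs.append((float(beat), float(bpm)))
    return pairs


def has_bpm_changes(simfile_text):
    """True when the song declares more than one distinct BPM value."""
    return len({bpm for _, bpm in parse_bpms(simfile_text)}) > 1


# Toy inputs, not real charts.
constant = "#TITLE:A;\n#BPMS:0.000=150.000;\n"
changing = "#TITLE:B;\n#BPMS:0.000=150.000,\n64.000=175.000;\n"
print(has_bpm_changes(constant), has_bpm_changes(changing))  # → False True
```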

Another thing to point out is that Pump it Up charts have lots of gimmicks and details that should be filtered out when doing modelling.

¹: When I initially read DanceDanceConvLSTM, I didn’t take into consideration that the model solves both problems 1 and 2, I thought it only solved number 2.

Next steps:

21/05/2026

Looking at the references for Bengio2012, I’ve read Chapters 2 and 4 of Algorithms for Classifying Recorded Music by Genre (Bergstra2006b). Chapter 2 surveys several basic but important concepts in audio processing, especially the Mel, Sone and Phon scales, sound pressure and intensity.

Additionally, it explains and tests various audio feature extraction methods, introducing Mel-scale Phon Coefficients (MPC) and Mel-scale Sone Coefficients, which make quite some sense.

22/05/2026

I’ve read Temporal Pooling And Multiscale Learning For Automatic Annotation And Ranking Of Music Audio (Hamel2011), since it is the reference for Bengio2012’s usage of PCA in features, but honestly I didn’t really understand the whole PCA feature thing.

23/05/2026

I’ve read Bengio2012. I didn’t really understand their PCA-based approach for musical features in their models, but this technique is not used in later work such as DDC or DDCL. Instead, what carries over is the approach of using multiple overlapping timescales to process information in onset detection.

I’ve read DDC once again. Their model is now very clear to me in terms of audio information retrieval, step placement and step selection. I have realized something interesting: you could technically make a program that receives live audio input from a microphone and generates the step placements on the fly, with a delay that depends on the LSTM unroll parameter for the encoder. For the parameters presented in DDC, this would be 2 seconds of delay.

I’ve read Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting (Shi2015) which introduces and explains ConvLSTM.

I’ve read the step placement model of DDCL once again, and I now understand how it works.

NOTES

24/05/2026

I’ve read Yi2023 once again.

NOTES

25/05/2026

I’ve started studying the SM file format and gimmicks in Pump it Up files, trying to understand warp effects.

26/05/2026

I’ve finished understanding the different categories of gimmick charts and how they’re used in StepMania 5 files.

BPM changes may be tricky, but fortunately the most common way of applying gimmicks is through scrolls and speed sections, which do not affect note timing. It should be possible to filter out warp and fake blocks.
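A minimal sketch of such a filter, assuming warps and fakes are declared through `#WARPS` and `#FAKES` tags as in StepMania 5 `.ssc` files. `gimmick_tags` is a hypothetical helper; it only checks whether those tags carry any value and does not look at the note data itself.

```python
import re

# Assumed names of the tags that hold timing-affecting gimmicks.
GIMMICK_TAGS = ("WARPS", "FAKES")


def gimmick_tags(simfile_text, tags=GIMMICK_TAGS):
    """Return the sorted names of the given tags that have a non-empty value."""
    found = set()
    for name, value in re.findall(r"#([A-Z]+):([^;]*);", simfile_text):
        if name in tags and value.strip():
            found.add(name)
    return sorted(found)


# Toy inputs, not real charts.
clean = "#BPMS:0.000=150.000;\n#WARPS:;\n#FAKES:;\n#SCROLLS:0.000=1.000;\n"
gimmicky = "#BPMS:0.000=150.000;\n#WARPS:32.000=4.000;\n#FAKES:;\n"
print(gimmick_tags(clean), gimmick_tags(gimmicky))  # → [] ['WARPS']
```

A song could then be kept for training only when the returned list is empty.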

29/05/2026

I’ve started writing the StepMania file parser and filter for the upcoming models.

31/05/2026

I’ve implemented absolute time audio extraction and stepfile simulation (for manually comparing the absolute times with the note times of real charts).
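For reference, a minimal sketch of the beat-to-seconds conversion behind absolute note times. It assumes a constant BPM, no stops or warps, 4-beat measures, and that beat 0 falls at `-offset` seconds (an assumption about the `#OFFSET` convention that should be checked against StepMania). Both function names are hypothetical.

```python
def beat_to_seconds(beat, bpm, offset=0.0):
    """Absolute time of a beat for a constant-BPM song.

    Assumes beat 0 falls at -offset seconds and there are no stops or warps.
    """
    return beat * 60.0 / bpm - offset


def measure_to_seconds(rows, measure_index, bpm, offset=0.0):
    """Times of the non-empty rows of one 4-beat measure.

    `rows` is a list of note rows such as "10000"; a row of only "0" is empty.
    """
    times = []
    for i, row in enumerate(rows):
        if set(row) != {"0"}:
            beat = 4 * (measure_index + i / len(rows))
            times.append(beat_to_seconds(beat, bpm, offset))
    return times


# Second measure (index 1) of a 120 BPM song, quarter-note rows.
print(measure_to_seconds(["10000", "00000", "00100", "00001"], 1, bpm=120))
# → [2.0, 3.0, 3.5]
```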

01/06/2026

I’ve studied the Fourier transform through characteristic functions to get a better understanding of Fourier analysis and synthesis.

As a result of my studies, here is an over-engineered fizzbuzz without branching or modular arithmetic:

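import math

# Frequencies are multiples of 1/15. The thirds must stay exact fractions:
# two-decimal roundings drift in phase and give wrong answers near x = 100.
freqs  = [-1.00, -0.80, -2/3 , -0.60, -0.40, -1/3 , -0.20,
           0.00,  0.20,  1/3 ,  0.40,  0.60,  2/3 ,  0.80]
# Cosine weights; they sum to 90 = 30 * 3, the value reached on multiples of 15.
norms  = [11.00,  6.00,  5.00,  6.00,  6.00,  5.00,  6.00,
          11.00,  6.00,  5.00,  6.00,  6.00,  5.00,  6.00]

def fizzbuzz(x):
    # For integer x, result is 30 * ([3 divides x] + 2 * [5 divides x]).
    result = 0
    for freq, norm in zip(freqs, norms):
        result += norm * math.cos(2 * math.pi * freq * x)
    i = result / 30
    return [x, 'Fizz', 'Buzz', 'FizzBuzz'][round(i)]

for i in range(1, 101):
    print(fizzbuzz(i))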
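The weights in that snippet can be read as 30 times the cosine series of two divisibility indicators: for integer x, the indicator of "n divides x" equals the mean of cos(2*pi*k*x/n) over k = 0..n-1, and the list index is [3 divides x] + 2 * [5 divides x]. A short check of that identity follows; the helper names are hypothetical, and the modulo appears only to verify the result.

```python
import math


def divides(n, x):
    """Indicator of n | x as the mean of cos(2*pi*k*x/n) for k = 0..n-1."""
    return sum(math.cos(2 * math.pi * k * x / n) for k in range(n)) / n


def fizzbuzz_index(x):
    """0 = number, 1 = Fizz, 2 = Buzz, 3 = FizzBuzz."""
    return round(divides(3, x) + 2 * divides(5, x))


# Compare against the usual modulo formulation for 1..100.
for x in range(1, 101):
    assert fizzbuzz_index(x) == (x % 3 == 0) + 2 * (x % 5 == 0)

print([fizzbuzz_index(x) for x in (3, 5, 15, 16)])  # → [1, 2, 3, 0]
```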

02/06/2026

I’ve started implementing feature extraction using the essentia library and the same parameters as DDC.