slurm hangs on "Sent DbdInit msg"

by Sebastien Mirolo on Tue, 27 Nov 2012

We are building a webapp that interacts with the MySQL database slurmdbd relies on. One of the features involves inserting entries into a cluster's assoc_table. One INSERT later, we fire an sbatch command. The command complains:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

I guessed it might be a cache issue, so I decided to restart slurmdbd just in case. That's when it went south. I restarted slurmdbd, then slurmctld, and slurmctld hung. I tried again, this time increasing the debugging level.

$ slurmdbd -D -vvvvv
$ slurmctld -D -vvvv

As it turns out, slurmctld hangs waiting for a message from slurmdbd while slurmdbd waits on a SQL procedure that runs in an infinite loop. Its definition is here:

slurm-2.4.2/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c:
    "create procedure get_parent_limits("

The infinite loop is caused by the entries in assoc_table: they are not structured as a tree. The project entry names demo as its parent, but demo has no entry of its own.

$ SELECT user, acct, parent_acct FROM cluster_assoc_table;
  user: 1
  acct: project
  parent_acct: demo
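
A dangling parent like this can be caught before restarting the daemons. The following is a minimal sketch, not part of slurm: find_dangling_parents is a hypothetical helper, and it assumes the rows have already been fetched as (user, acct, parent_acct) tuples, with an empty string or None meaning "not set". It treats an account as defined only when some row carries it in acct with no user attached.

```python
def find_dangling_parents(rows):
    """Return parent accounts that are referenced but have no row of their own.

    rows: iterable of (user, acct, parent_acct) tuples (hypothetical layout,
    mirroring the three columns selected above); '' or None means unset.
    """
    # Accounts that have their own row with no user attached.
    defined = {acct for user, acct, _ in rows if not user}
    # Every non-empty parent_acct that some row points to.
    referenced = {parent for _, _, parent in rows if parent}
    return sorted(referenced - defined)


# The table as it stood when slurmdbd hung.
before = [("1", "project", "demo")]
# The same table with a user-less, parent-less row added for demo.
after = before + [("", "demo", "")]

print(find_dangling_parents(before))  # ['demo']
print(find_dangling_parents(after))   # []
```

An empty result only says that every referenced parent exists. It does not detect cycles between accounts, which would need a separate walk up the parent chain.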

We fix the database by adding an entry for demo with no parent and no user.

$ INSERT INTO cluster_assoc_table (acct) VALUES ('demo');
$ SELECT user, acct, parent_acct FROM cluster_assoc_table;
  user: 1
  acct: project
  parent_acct: demo

  user:
  acct: demo
  parent_acct:

We are back in business: slurmdbd and slurmctld both complete their initialization.
