Jobs can stay in the Q (queued) state for an extended period of time because of license restrictions or other application-specific requirements. Follow this guide if you think all conditions are met but your job is still stuck in the Q state.
1 - Verify if the capacity is being provisioned
Verify if the capacity associated to the job is being provisioned by running the following command:
qstat -f <job_id> | grep select
If the compute_node value is set to tbd: the job is not eligible to run, for the reasons mentioned above.
If the compute_node value is set to compute_node=idea-<CLUSTER>-compute-ondemand-<JOB_ID>: IDEA has triggered CloudFormation and the capacity is being provisioned.
You can log in to the AWS Console and navigate to the CloudFormation console to verify that the CloudFormation stack associated with your job is in the CREATE_COMPLETE state. If it is not, check the Events tab for errors.
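The two checks in this step can be sketched as a pair of shell helpers. This is only an illustration: the function names are hypothetical, the parsing assumes the select line printed by qstat -f fits on a single line and contains a compute_node=<value> chunk, and the stack lookup assumes the CloudFormation stack is named after the compute_node value (as the stack_id in the job 103 example later in this guide suggests). The second helper needs the AWS CLI installed and configured.

```shell
# Hypothetical helpers, not shipped with IDEA.

# Read `qstat -f <job_id>` output on stdin and classify the compute_node value.
compute_node_state() {
  local value
  value=$(grep -o 'compute_node=[^:+ ]*' | head -n 1 | cut -d= -f2)
  case "$value" in
    "")  echo "unknown" ;;              # no compute_node chunk found
    tbd) echo "not-provisioned" ;;      # job is not eligible to run
    *)   echo "provisioning:$value" ;;  # CloudFormation has been triggered
  esac
}

# Print the status of the stack whose name is passed as $1.
# Assumption: the stack name equals the compute_node value.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text
}

# Usage on the scheduler host:
#   qstat -f 103 | compute_node_state
#   stack_status idea-demo-compute-ondemand-103
```

A stack_status result other than CREATE_COMPLETE is the cue to open the Events tab for that stack.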
2 - Verify the bootstrap logs for the compute node(s) being provisioned
If the capacity is being provisioned, next check that no errors occurred during the bootstrap sequence on the compute node(s) provisioned to run your job.
To verify that, review the logs located under /apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/
You will find the bootstrap and compute node logs for all EC2 instances being provisioned for your job:
Bootstrap:
/apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/bootstrap/<COMPUTE_NODE_JOB_ID>
Compute Node Startup Logs:
/apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/logs/<INSTANCE_HOSTNAME>
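A quick way to skim both log locations is a recursive search over the job directory. The sketch below is an assumption-laden convenience, not an IDEA tool: the function name and the third (root directory) argument are hypothetical, and the search pattern is a guess at what failure lines look like, so adjust it to what your logs actually contain.

```shell
# Hypothetical helper: search every log under a job directory for error-like lines.
# $1 = cluster name, $2 = job id, $3 = root directory (defaults to /apps)
scan_job_logs() {
  local cluster="$1" job_id="$2" root="${3:-/apps}"
  local job_dir="$root/$cluster/scheduler/jobs/$job_id"
  if [ ! -d "$job_dir" ]; then
    echo "no log directory: $job_dir" >&2
    return 2
  fi
  # -r recurse, -i ignore case, -n show line numbers, -E extended regex.
  # The pattern is an assumption about how failures are worded.
  grep -rinE 'error|failed|traceback' "$job_dir"
}

# Usage:
#   scan_job_logs idea-demo 103
```

The function prints matching lines prefixed with file name and line number, and prints nothing when no line matches the pattern.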
3 - Check if the compute node(s) is/are being registered on the scheduler
Verify that the compute node(s) are registering correctly with the scheduler.
Run pbsnodes -a and find the section specific to your job ID (see the example below for job 103):
ip-10-110-6-141
Mom = ip-10-110-6-141.us-east-2.compute.internal
Port = 15002
pbs_version = unavailable
ntype = PBS
state = state-unknown,down
resources_available.anonymous_metrics = True
resources_available.auto_scaling_group = idea-demo-compute-ondemand-103-AutoScalingComputeGroup-YXWJFA4XLGKQ
resources_available.availability_zone = us-east-2c
resources_available.base_os = rhel7
resources_available.cluster_name = idea-demo
resources_available.compute_node = idea-demo-compute-ondemand-103
resources_available.efa_support = True
resources_available.force_ri = False
resources_available.fsx_lustre = False
resources_available.host = ip-10-110-6-141
resources_available.ht_support = False
resources_available.instance_ami = ami-0c1c3220d0b1716d2
resources_available.instance_id = i-02447d465fbe84723
resources_available.instance_profile = AIPA6ERFY3V55DSVSI2OZ
resources_available.instance_type = c5n.9xlarge
resources_available.job_group = gf4ea4e11
resources_available.job_id = 103
resources_available.keep_ebs = False
resources_available.keep_forever = False
resources_available.launch_time = 1671017843
resources_available.placement_group = False
resources_available.provisioning_time = 1671017862089
resources_available.queue_type = compute
resources_available.root_size = 10gb
resources_available.scaling_mode = single-job
resources_available.scratch_iops = 0
resources_available.scratch_size = 0gb
resources_available.stack_id = arn:aws:cloudformation:us-east-2:<REDACTED>:stack/idea-demo-compute-ondemand-103/9caa5e10-7ba3-11ed-837d-060b0d4719a4
resources_available.subnet_id = subnet-064c905368b057b68
resources_available.system_metrics = False
resources_available.tenancy = default
resources_available.terminate_when_idle = 0
resources_available.vnode = ip-10-110-6-141
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = normal
resv_enable = True
sharing = default_shared
In this example, the host is still being configured, as its state is state-unknown,down.
If that's the case, wait a little longer. Your host will be ready to accept jobs when state = free (see below):
ip-10-110-6-141
Mom = ip-10-110-6-141.us-east-2.compute.internal
ntype = PBS
state = free
pcpus = 18
resources_available.anonymous_metrics = True
resources_available.arch = linux
resources_available.auto_scaling_group = idea-demo-compute-ondemand-103-AutoScalingComputeGroup-YXWJFA4XLGKQ
resources_available.availability_zone = us-east-2c
resources_available.base_os = rhel7
resources_available.cluster_name = idea-demo
resources_available.compute_node = idea-demo-compute-ondemand-103
resources_available.efa_support = True
resources_available.force_ri = False
resources_available.fsx_lustre = False
resources_available.host = ip-10-110-6-141
resources_available.ht_support = False
resources_available.instance_ami = ami-0c1c3220d0b1716d2
resources_available.instance_id = i-02447d465fbe84723
resources_available.instance_profile = AIPA6ERFY3V55DSVSI2OZ
resources_available.instance_type = c5n.9xlarge
resources_available.job_group = gf4ea4e11
resources_available.job_id = 103
resources_available.keep_ebs = False
resources_available.keep_forever = False
resources_available.launch_time = 1671017843
resources_available.mem = 96640680kb
resources_available.ncpus = 18
resources_available.placement_group = False
resources_available.provisioning_time = 1671017862089
resources_available.queue_type = compute
resources_available.root_size = 10gb
resources_available.scaling_mode = single-job
resources_available.scratch_iops = 0
resources_available.scratch_size = 0gb
resources_available.stack_id = arn:aws:cloudformation:us-east-2:<REDACTED>:stack/idea-demo-compute-ondemand-103/9caa5e10-7ba3-11ed-837d-060b0d4719a4
resources_available.subnet_id = subnet-064c905368b057b68
resources_available.system_metrics = False
resources_available.tenancy = default
resources_available.terminate_when_idle = 0
resources_available.vnode = ip-10-110-6-141
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = normal
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Wed Dec 14 11:49:10 2022
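On a cluster with many nodes, picking the right section out of pbsnodes -a by eye gets tedious. The following sketch filters the output down to the node name and state for one job ID. The function name is hypothetical, and the parsing assumes the layout shown in the examples above: a node name on its own line, followed by "attribute = value" lines, with no wrapped continuation lines.

```shell
# Hypothetical helper: read `pbsnodes -a` output on stdin and print
# "<node> <state>" for every node whose resources_available.job_id matches $1.
node_state_for_job() {
  awk -v job="$1" '
    function flush() { if (node != "" && matched) print node, state }
    # A non-empty line without " = " starts a new node record.
    NF && index($0, " = ") == 0 { flush(); node = $1; state = ""; matched = 0; next }
    $1 == "state" { state = $3 }
    $1 == "resources_available.job_id" && $3 == job { matched = 1 }
    END { flush() }
  '
}

# Usage:
#   pbsnodes -a | node_state_for_job 103
```

A result ending in free means the node is ready to accept the job; state-unknown,down means it is still being configured.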
4 - Restart the Scheduler
If needed, SSH to the scheduler machine and restart both OpenPBS and the idea-scheduler module.
To restart OpenPBS, run systemctl restart pbs, then confirm the service state with systemctl status pbs.
A valid output looks like this (see Active: active (running)):
[clusteradmin@ip-10-110-2-204 ~]$ systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-12-14 11:22:30 UTC; 2min 13s ago
Docs: man:pbs(8)
Process: 17146 ExecStop=/opt/pbs/libexec/pbs_init.d stop (code=exited, status=0/SUCCESS)
Process: 20115 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
Tasks: 9
Memory: 17.0M
CGroup: /system.slice/pbs.service
├─20191 /opt/pbs/sbin/pbs_comm
├─20206 /opt/pbs/sbin/pbs_sched
├─20249 /opt/pbs/sbin/pbs_ds_monitor monitor
├─20273 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
├─20283 postgres: logger process
├─20285 postgres: checkpointer process
├─20286 postgres: writer process
├─20287 postgres: wal writer process
├─20288 postgres: autovacuum launcher process
├─20289 postgres: stats collector process
├─20318 postgres: postgres pbs_datastore 10.110.2.204(60770) idle
└─20321 /opt/pbs/sbin/pbs_server.bin
If the service is not starting up correctly, verify the logs under:
/var/spool/pbs/sched_logs
/var/spool/pbs/server_logs
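The status check and the log lookup can be combined in a short sketch. It assumes that the most recently modified file in each log directory is the one currently being written; the function names are hypothetical.

```shell
# Hypothetical helper: print the path of the most recently modified file in $1.
latest_log() {
  local dir="$1" newest
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  if [ -n "$newest" ]; then
    echo "$dir/$newest"
  fi
}

# Report whether the pbs service is active; if not, show the tail of the
# newest server and scheduler logs.
check_pbs() {
  local dir log
  if systemctl is-active --quiet pbs; then
    echo "pbs is active"
    return 0
  fi
  echo "pbs is not active"
  for dir in /var/spool/pbs/server_logs /var/spool/pbs/sched_logs; do
    log=$(latest_log "$dir")
    if [ -n "$log" ]; then
      echo "--- $log"
      tail -n 50 "$log"
    fi
  done
  return 1
}

# Usage on the scheduler host (reading the logs may require root):
#   check_pbs
```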
Finally, try to restart idea-scheduler by running /opt/idea/python/latest/bin/supervisorctl restart scheduler
You can confirm that idea-scheduler has started correctly by checking the application logs located under /opt/idea/app/logs
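Besides reading the logs, the state reported by supervisorctl gives a quick first signal. The sketch below assumes the usual supervisorctl status layout (program name, state, description on one line) and reuses the program name scheduler from the restart command above; the function names are hypothetical.

```shell
# Path taken from the restart command in this guide.
SUPERVISORCTL=/opt/idea/python/latest/bin/supervisorctl

# Hypothetical helper: read `supervisorctl status` output on stdin and
# print the state column for the program named in $1.
supervisor_state() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Restart idea-scheduler, then print its state (RUNNING when healthy).
restart_idea_scheduler() {
  "$SUPERVISORCTL" restart scheduler
  "$SUPERVISORCTL" status scheduler | supervisor_state scheduler
}

# Usage on the scheduler host:
#   restart_idea_scheduler
#   ls -t /opt/idea/app/logs | head    # newest application logs first
```

Any state other than RUNNING is a reason to go straight to the application logs.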