Possible race condition between Cluster definition and job submission?

Posted: Mon Jun 07, 2021 7:25 pm
by bsync
I have a piece of code that creates a JobCluster instance with a fixed set of 'nodes' (all IP address strings at the moment). When I attempt to submit a job (to a specific node, at least), I get None back instead of a DispyJob. If I put a small sleep between the JobCluster definition and the job = cluster.submit_node(...) call, as follows, the returned job is legit.

import time
from dispy import JobCluster

c = JobCluster("myscript.py", depends=[...], nodes=['192.168.1.1', '192.168.1.2'])
time.sleep(1)  # Removing this sleep causes the following submission to fail with an 'invalid node' message
j = c.submit_node('192.168.1.1', 'myargs')

Looking at the code, it appears this could be related to this line:

https://github.com/pgiri/dispy/blob/857 ... _.py#L2876

where a Task is created and _job.job is supposed to be the DispyJob that gets returned. Perhaps the Task needs a little time to spin up and populate the _job.job attribute with the DispyJob reference; maybe a small loop there to wait for _job.job to become non-None?
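In the meantime, a caller-side workaround that polls instead of using a fixed sleep might look like the following. This is just an untested sketch: submit_when_ready is a hypothetical helper (not part of dispy), and it simply retries until the scheduler has discovered the node or a timeout expires.

import time

def submit_when_ready(cluster, node_ip, *args, timeout=30.0, poll=0.1):
    # Hypothetical helper: retry submit_node until the scheduler has
    # discovered and initialized the node, or give up after 'timeout'.
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = cluster.submit_node(node_ip, *args)
        if job is not None:
            return job
        time.sleep(poll)
    raise TimeoutError('node %s not available within %s seconds' % (node_ip, timeout))

j = submit_when_ready(c, '192.168.1.1', 'myargs')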

Anyhow... not sure if any of this is even reproducible, but posting just in case anyone else runs into it.

Re: Possible race condition between Cluster definition and job submission?

Posted: Tue Jun 08, 2021 2:28 am
by Giri
'submit_node' is supposed to be used with a node that has been discovered and initialized. Right after a cluster is created (i.e., 'JobCluster' has returned), the scheduler most likely won't yet have found the node passed to 'submit_node', causing the failure. To use 'submit_node', either the 'cluster_status' or 'job_status' callback must be used. See 'job_scheduler.py', 'node_update.py' and 'longrun.py', where 'submit_node' is used with 'cluster_status'.
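Roughly following the pattern in those examples, submitting from the 'cluster_status' callback might look like the sketch below. This is a minimal illustration, not a definitive implementation; the IP addresses and the compute function are placeholders taken from the original post.

import threading
import dispy

def compute(n):  # placeholder computation
    import time
    time.sleep(n)
    return n

jobs = []
node_ready = threading.Event()

# The scheduler calls the cluster_status callback with (status, node, job);
# for node events, job is None. Once a node reports Initialized, it is
# safe to target it with submit_node.
def status_cb(status, node, job):
    if status == dispy.DispyNode.Initialized:
        j = cluster.submit_node(node, 5)
        if j is not None:
            jobs.append(j)
        node_ready.set()

cluster = dispy.JobCluster(compute, nodes=['192.168.1.1', '192.168.1.2'],
                           cluster_status=status_cb)
node_ready.wait()       # block until at least one node has initialized
for j in jobs:
    print(j())          # calling a DispyJob waits for and returns its result
cluster.close()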