I have a piece of code that creates a JobCluster instance with a fixed set of 'nodes' (all IP address strings at the moment). When I attempt to submit a job to a specific node, I get None back instead of a DispyJob. If I put a small sleep between the JobCluster definition and the job = cluster.submit_node(...) call, as follows, the returned job is legit:
import time
from dispy import JobCluster

c = JobCluster("myscript.py", depends=[...], nodes=['192.168.1.1', '192.168.1.2'])
time.sleep(1)  # Removing this sleep causes the following submit to fail with an 'invalid node' message
j = c.submit_node('192.168.1.1', 'myargs')
Looking at the code, it appears this could be related to this line:
https://github.com/pgiri/dispy/blob/857 ... _.py#L2876
where a Task is created and _job.job is supposed to hold the DispyJob that gets returned. Perhaps the Task needs a little time to spin up and populate the _job.job attribute with the DispyJob reference; maybe a small loop there, waiting for _job.job to become non-None, would fix it?
Anyhow, I'm not sure whether any of this is even reproducible, but I'm posting it in case anyone else runs into it.
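In the meantime, here is a rough workaround sketch: since submit_node returns None for a node the scheduler hasn't discovered yet, poll for a bounded time instead of using a fixed sleep. The helper name, retry count, and delay below are my own invention, not dispy API.

import time

def submit_node_with_retry(cluster, node, *args, retries=20, delay=0.25):
    # Hypothetical helper: submit_node returns None until the
    # scheduler has discovered/initialized the node, so keep
    # retrying for a bounded time rather than sleeping blindly.
    for _ in range(retries):
        job = cluster.submit_node(node, *args)
        if job is not None:
            return job
        time.sleep(delay)
    return None

j = submit_node_with_retry(c, '192.168.1.1', 'myargs')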
Re: Possible race condition between Cluster definition and job submission?
'submit_node' is supposed to be used with a node that has been discovered and initialized. Right after a cluster is created (i.e., right after 'JobCluster' returns), the scheduler most likely won't yet have found the node passed to 'submit_node', causing the failure. To use 'submit_node', either the 'cluster_status' or 'job_status' callback must be used. See 'job_scheduler.py', 'node_update.py' and 'longrun.py', where 'submit_node' is used with 'cluster_status'.
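For example, a minimal sketch of that pattern (the computation, node addresses, and argument are placeholders; it assumes dispy's documented cluster_status callback signature and the DispyNode.Initialized status):

import dispy

def compute(n):
    # Placeholder computation standing in for the real script.
    import time
    time.sleep(n)
    return n

def status_cb(status, node, job):
    # Invoked by the scheduler for node and job status changes.
    # Only once a node reports Initialized is it safe to target
    # it directly with submit_node.
    if status == dispy.DispyNode.Initialized:
        cluster.submit_node(node, 1)  # placeholder argument

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, nodes=['192.168.1.1', '192.168.1.2'],
                               cluster_status=status_cb)
    # In a real program you'd coordinate shutdown more carefully;
    # wait() blocks only for jobs that have already been submitted.
    cluster.wait()
    cluster.close()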