Can't setup node in some cases

Questions, issues regarding dispy / pycos


dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Can't setup node in some cases

Post by dementiy »

Hello.

I ran into a couple of errors. In dispyscheduler.py:

Code: Select all

cluster = self._clusters[compute.id]
if node.ip_addr in cluster._dispy_nodes:
    continue
dispy_node = cluster._dispy_nodes.get(node.ip_addr, None)
if not dispy_node:
    dispy_node = DispyNode(node.ip_addr, node.name, node.cpus)
    cluster._dispy_nodes[node.ip_addr] = dispy_node
    dispy_node.tx = node.tx
    dispy_node.rx = node.rx
# ...
I think the first if-condition is redundant (with it in place, "get" will always return "None").
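
In other words, the block could be reduced to this (just a sketch of what I mean, not a tested patch):

Code: Select all

cluster = self._clusters[compute.id]
dispy_node = cluster._dispy_nodes.get(node.ip_addr, None)
if not dispy_node:
    # node is not yet tracked for this cluster, so register it
    dispy_node = DispyNode(node.ip_addr, node.name, node.cpus)
    cluster._dispy_nodes[node.ip_addr] = dispy_node
    dispy_node.tx = node.tx
    dispy_node.rx = node.rx
# ...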

And the second case: say dispynode was killed while, at the same time, the client was stopped and then started again, and then dispynode was started. The node now has dead jobs for an old computation that is no longer in _clusters:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters[_job.compute_id]
        del self._sched_jobs[_job.uid]
I think it should be:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters.get(_job.compute_id, None)
        if cluster is None:
            continue
        # ...
What do you think? Should I create a PR?
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Can't setup node in some cases

Post by Giri »

Thanks for reporting and finding fixes!

The first case is clear and I agree with the change you suggest. I think your suggestion for the second case also makes sense, but I'm not sure I understand the problem you mention (i.e., that you can't set up nodes in some cases). Can you give a bit more detail, e.g., how to reproduce it? Have you tested the proposed change to make sure it fixes the problem?

Please create a PR on GitHub.

Thanks.
dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Re: Can't setup node in some cases

Post by dementiy »

I created a PR for the first case only (https://github.com/pgiri/dispy/pull/219).

Here is a simplified example for the second case.

I run dispyscheduler.py like this:

Code: Select all

dispyscheduler.py -i 10.57.46.38 -d --daemon --clean --pulse_interval 6 --ping_interval=30 --zombie_interval=30
Then dispynode.py:

Code: Select all

dispynode.py -i 10.57.46.38 -d --daemon --clean --force_cleanup
And the SharedJobCluster client:

Code: Select all

def compute(n):
    import time
    time.sleep(n)


if __name__ == "__main__":
    import dispy
    import threading
    
    def close_cluster_after_n_secs(secs):
        import time
        time.sleep(secs)
        cluster.cancel(job)
        cluster.close(timeout=5, terminate=True)
    
    cluster = dispy.SharedJobCluster(compute, nodes=["*"], client_port=0, host="10.57.46.38",
                                     scheduler_host="10.57.46.38", loglevel=dispy.logger.DEBUG)
    job = cluster.submit(60)

    # close cluster after 30 secs
    t = threading.Thread(target=close_cluster_after_n_secs, args=(30,))
    t.start()
Before the job cluster is closed, I kill dispynode.py (kill -s 9 pid). Then I wait for the cluster to be closed by the thread. Then I run the job cluster again and start the node. Here is the output from dispyscheduler:

Code: Select all

2021-04-15 18:17:49 pycos - uncaught exception in tcp_req/140382734612432:
Traceback (most recent call last):
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 425, in tcp_req
    self.add_node(info)
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 1449, in add_node
    self.reschedule_jobs(dead_jobs)
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 1676, in reschedule_jobs
    cluster = self._clusters[_job.compute_id]
KeyError: 140382729516560
No new jobs on dispynode.

Yes, maybe the error is a little bit hard to reproduce because of the timings.
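
If it helps, the kill step can be scripted so the timing is deterministic; something like this (just a sketch; it assumes a single dispynode.py process is running, so pgrep returns exactly one pid):

Code: Select all

import os
import signal
import subprocess
import time

def kill_node_after(secs):
    # kill dispynode mid-job so that the scheduler later receives dead jobs
    # for a computation that is no longer in _clusters
    time.sleep(secs)
    # assumes exactly one dispynode.py process is running
    pid = int(subprocess.check_output(["pgrep", "-f", "dispynode.py"]).decode())
    os.kill(pid, signal.SIGKILL)
A second thread in the client script (alongside close_cluster_after_n_secs) could run kill_node_after(10), so the node dies before the cluster is closed at 30 seconds.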

For me this solution works fine:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters.get(_job.compute_id, None)
        del self._sched_jobs[_job.uid]
        if cluster is None:
            logger.debug("Cluster for computation %s is gone", _job.compute_id)
            continue
        # ...
But maybe some additional resources should be freed up as well.
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Can't setup node in some cases

Post by Giri »

I have committed a fix for this as well. Let me know if it works.