Can't setup node in some cases

Questions, issues regarding dispy / pycos


dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Can't setup node in some cases

Post by dementiy »

Hello.

I ran into a couple of errors. In dispyscheduler.py:

Code: Select all

cluster = self._clusters[compute.id]
if node.ip_addr in cluster._dispy_nodes:
    continue
dispy_node = cluster._dispy_nodes.get(node.ip_addr, None)
if not dispy_node:
    dispy_node = DispyNode(node.ip_addr, node.name, node.cpus)
    cluster._dispy_nodes[node.ip_addr] = dispy_node
    dispy_node.tx = node.tx
    dispy_node.rx = node.rx
# ...
I think the first if-condition is redundant (with it in place, "get" will always return "None").
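
In other words, the block could be reduced to this (just a sketch of what I mean, not a tested patch):

Code: Select all

cluster = self._clusters[compute.id]
dispy_node = cluster._dispy_nodes.get(node.ip_addr, None)
if not dispy_node:
    # node is not yet tracked for this cluster, so register it
    dispy_node = DispyNode(node.ip_addr, node.name, node.cpus)
    cluster._dispy_nodes[node.ip_addr] = dispy_node
    dispy_node.tx = node.tx
    dispy_node.rx = node.rx
# ...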

And the second case: say dispynode was killed while, at the same time, the client was stopped and then started again, and then dispynode was started. The node now has dead jobs for an old computation that is no longer in _clusters:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters[_job.compute_id]
        del self._sched_jobs[_job.uid]
I think it should be:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters.get(_job.compute_id, None)
        if cluster is None:
            continue
        # ...
What do you think? Should I create a PR?
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Can't setup node in some cases

Post by Giri »

Thanks for reporting and finding fixes!

The first case is clear and I agree with the change you suggest. I think your suggestion for the second case also makes sense, but I'm not sure I understand the problem you mention (i.e., that you can't set up nodes in some cases). Can you give a bit more detail, e.g., how to reproduce it? Have you tested the proposed change to make sure it fixes the problem?

Please create a PR on GitHub.

Thanks.
dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Re: Can't setup node in some cases

Post by dementiy »

I created a PR for the first case only (https://github.com/pgiri/dispy/pull/219).

Here is a simplified example for the second case.

I run dispyscheduler.py like this:

Code: Select all

dispyscheduler.py -i 10.57.46.38 -d --daemon --clean --pulse_interval 6 --ping_interval=30 --zombie_interval=30
Then dispynode.py:

Code: Select all

dispynode.py -i 10.57.46.38 -d --daemon --clean --force_cleanup
And the SharedJobCluster client:

Code: Select all

def compute(n):
    import time
    time.sleep(n)


if __name__ == "__main__":
    import dispy
    import threading
    
    def close_cluster_after_n_secs(secs):
        import time
        time.sleep(secs)
        cluster.cancel(job)
        cluster.close(timeout=5, terminate=True)
    
    cluster = dispy.SharedJobCluster(compute, nodes=["*"], client_port=0, host="10.57.46.38",
                                     scheduler_host="10.57.46.38", loglevel=dispy.logger.DEBUG)
    job = cluster.submit(60)

    # close cluster after 30 secs
    t = threading.Thread(target=close_cluster_after_n_secs, args=(30,))
    t.start()
Before the job cluster is closed, I kill dispynode.py (kill -s 9 pid). Then I wait for the cluster to be closed by the thread. Then I run the job cluster again and start the node. Here is the output from dispyscheduler:

Code: Select all

2021-04-15 18:17:49 pycos - uncaught exception in tcp_req/140382734612432:
Traceback (most recent call last):
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 425, in tcp_req
    self.add_node(info)
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 1449, in add_node
    self.reschedule_jobs(dead_jobs)
  File "/home/dementiy/.virtualenvs/dispy-env/bin/dispyscheduler.py", line 1676, in reschedule_jobs
    cluster = self._clusters[_job.compute_id]
KeyError: 140382729516560
No new jobs on dispynode.

Yes, maybe the error is a little bit hard to reproduce because of the timings.
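
If it helps, the kill step can be scripted so the timing is deterministic; something like this (just a sketch; it assumes a single dispynode.py process is running, so pgrep returns exactly one pid):

Code: Select all

import os
import signal
import subprocess
import time

def kill_node_after(secs):
    # kill dispynode mid-job so that the scheduler later receives dead jobs
    # for a computation that is no longer in _clusters
    time.sleep(secs)
    # assumes exactly one dispynode.py process is running
    pid = int(subprocess.check_output(["pgrep", "-f", "dispynode.py"]).decode())
    os.kill(pid, signal.SIGKILL)
A second thread in the client script (alongside close_cluster_after_n_secs) could run kill_node_after(10), so the node dies before the cluster is closed at 30 seconds.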

For me this solution works fine:

Code: Select all

def reschedule_jobs(self, dead_jobs):
    if not dead_jobs:
        return
    for _job in dead_jobs:
        cluster = self._clusters.get(_job.compute_id, None)
        del self._sched_jobs[_job.uid]
        if cluster is None:
            logger.debug("Cluster for computation %s is gone", _job.compute_id)
            continue
        # ...
But maybe some additional resources should be freed up as well.
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Can't setup node in some cases

Post by Giri »

I have committed a fix for this as well. Let me know if it works.