Debugging dispynode error

Post by **Anon** » Mon Oct 11, 2021 4:29 am

Hi guys, I have been using dispy to distribute computation between different servers in the cloud, but I'm getting an error after a few job submissions, and I'm not sure why it happens or what is causing it.

Code: Select all

2021-10-11 00:19:55 dispy - Failed to run 2513 on X.X.X.X: bytearray(b'Invalid job (Traceback (most recent call last):\n  File "/usr/local/bin/dispynode.py", line 1132, in job_request\n    client = self.clients[_job.compute_id]\nKeyError: 1633899704562\n)')
2021-10-11 00:19:55 dispy - Failed to run job 139797300820752 on X.X.X.X for computation compute

When this happens, the cluster gets stuck, and any subsequent attempt to send a job fails on every node. The failing line belongs to this code segment.

Code: Select all

    def tcp_req(self, conn, addr, task=None):

        def job_request(msg):
            error = None
            try:
                _job = deserialize(msg)
                client = self.clients[_job.compute_id]

So I'm guessing there could be an issue with the job I'm creating that triggers a bug in deserialization. Has anyone else faced a similar issue? How can I debug this to see whether the job I'm sending is causing a deserialization bug?
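
A note on the traceback: the KeyError is raised by the self.clients[_job.compute_id] lookup, which runs only after deserialize(msg) has returned, so the message itself was decoded. To rule out the payload independently, a pickle round-trip on the client side is a cheap check. This is a minimal sketch, assuming dispy serializes job arguments with the standard pickle module; check_picklable and the sample arguments are hypothetical names used for illustration.

```python
import pickle


def check_picklable(obj, label="payload"):
    """Round-trip obj through pickle and report its serialized size.

    Returns (ok, detail): detail is the byte count on success,
    or a short description of the exception on failure.
    """
    try:
        data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.loads(data)
    except Exception as exc:  # pickle can raise several unrelated exception types
        return False, f"{label}: {type(exc).__name__}: {exc}"
    return True, len(data)


if __name__ == "__main__":
    # Hypothetical job arguments; substitute the real ones.
    print(check_picklable({"task": "example", "values": [1, 2, 3]}))
    # A lambda cannot be pickled, so this call reports a failure.
    print(check_picklable(lambda x: x, label="callback"))
```

Running the same check on the value returned by compute would cover the result path as well. An unexpectedly large byte count is worth noting, since large messages are more exposed to interrupted connections.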

Post by **Giri** » Tue Oct 12, 2021 10:24 pm

Is this with the latest version?

It seems the node discarded the client, likely due to zombie detection. I think you can disable zombie checking at the node with '--zombie_interval=0'.

BTW, I have added support for on-demand hybrid cloud computing to dispy. If you are interested, email me.

Post by **Anon** » Thu Oct 14, 2021 8:31 pm

Unfortunately, it didn't solve the issue. After a few hours and a few jobs, the nodes start crashing. I'm using dispynode version 4.15.0 (Python 3.6.8).

It does seem that the connection is interrupted, or fails for some reason, and the node is then unable to resume the job.

Code: Select all

pycos - invalid task 140648621602776 to resume

EDIT: I also found this error in one of the nodes

Code: Select all

2021-10-15 01:37:05 pycos - uncaught exception in tcp_req/140668394794152:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3686, in _schedule
    retval = task._generator.throw(*exc)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 485, in _recvall
    recvd = self._rsock.recv_into(view, len(view), *args)
ConnectionResetError: [Errno 104] Connection reset by peer

2021-10-15 01:37:10 pycos - uncaught exception in tcp_req/140668485396616:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3688, in _schedule
    retval = task._generator.send(task._value)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 511, in _async_recvall
    self._read_result = bytearray(bufsize)
MemoryError

The funny thing is that this time I tried running it on five remote nodes and one local node (127.0.0.1). The local node lasted longer than the others but still crashed after some hours. What data is stored on the node after a job executes? Is there a way to do a cleanup (remove everything from memory and disk except the transferred dependency files)?
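
On the MemoryError traceback above: it is raised while allocating bytearray(bufsize) for an incoming message, which means the requested size could not be allocated. One possible cause, offered as a guess rather than a diagnosis, is a length header read from a corrupted stream or from a connection that does not speak the expected protocol. The sketch below shows generic length-prefixed framing with a size cap. The 4-byte big-endian header, MAX_MSG_SIZE, and the function names are assumptions for illustration and are not taken from pycos.

```python
import socket
import struct

MAX_MSG_SIZE = 64 * 1024 * 1024   # hypothetical cap; tune to the largest real payload
HEADER = struct.Struct(">L")      # assumed framing: 4-byte big-endian length prefix


def recv_exact(sock, n):
    """Read exactly n bytes, or raise ConnectionError if the peer closes early."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        count = sock.recv_into(view[got:], n - got)
        if count == 0:
            raise ConnectionError("peer closed the connection mid-message")
        got += count
    return bytes(buf)


def send_msg(sock, payload):
    """Send payload preceded by its length."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock, max_size=MAX_MSG_SIZE):
    """Read one length-prefixed message, rejecting implausible lengths."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    if length > max_size:
        raise ValueError(f"refusing to allocate {length} bytes for one message")
    return recv_exact(sock, length)


if __name__ == "__main__":
    a, b = socket.socketpair()
    send_msg(a, b"hello")
    print(recv_msg(b))
    # Bytes from a different protocol are misread as a length of about 1.1 GiB.
    a.sendall(b"GET / HTTP/1.1\r\n")
    try:
        recv_msg(b)
    except ValueError as exc:
        print("rejected:", exc)
    a.close()
    b.close()
```

With framing like this, the first four bytes of unrelated traffic are interpreted as a message length, so a stray connection can request a very large buffer. Restricting which hosts can reach the node port limits that exposure.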

BTW, yes, I'll PM you about the on-demand support.

Post by **Giri** » Sun Oct 17, 2021 6:48 pm

I have looked into the issue with intermittent loss of network connection and simulated it. The client currently simply discards nodes that are presumed dead. I am working on recovering from this. I will email you once I am done so you can try it.

Post by **Giri** » Tue Oct 19, 2021 3:49 am

Can you post or email me your client program without 'compute'? I would like to test with the way you create the cluster and submit jobs.

Post by **Anon** » Sun Oct 24, 2021 5:38 am

Sorry for the delay. I also found that my compute function had a major bug that was causing it to produce a lot of errors. I've been working on it and I'll run a major test today, but here is the code.

Code: Select all

def compute(args): # function sent to remote nodes for execution
    from external_lib import X
    result = X.start(args)
    return result


def job_status(job):
    global pending_jobs, jobs_cond, args
    if job.status in (dispy.DispyJob.Finished, dispy.DispyJob.Terminated,
                      dispy.DispyJob.Cancelled, dispy.DispyJob.Abandoned):
        jobs_cond.acquire()
        if job.id:
            if job.exception is None:
                if job() is not None:  # computation found something
                    saveResult(job())
            pending_jobs.pop(job.id)
            job.result = None
            if len(pending_jobs) <= len(nodes):
                jobs_cond.notify()
        jobs_cond.release()


if __name__ == '__main__':
    import dispy, sys, threading, os, time 

    #external lib has multiple files
    dependencies = []
    for root,subdir,files in os.walk('external_lib'):
        for file in files:
            dependencies.append(os.path.join(root,file))

    nodes = []
    cluster = dispy.JobCluster(compute, depends=dependencies,nodes=nodes, host='0.0.0.0', secret=SECRET, job_status=job_status)
    pending_jobs = {}
    jobs_cond = threading.Condition()

    #load computations and their arguments
    compute_list = ...
    #
    while compute_list:
        # take the next task before submitting it
        compute_task = compute_list.pop()
        job = cluster.submit(compute_task, compute_args)
        jobs_cond.acquire()
        if job.status == dispy.DispyJob.Created or job.status == dispy.DispyJob.Running:
            pending_jobs[job.id] = job
            if len(pending_jobs) >= len(nodes):
                jobs_cond.wait()
                cluster.print_status()
        # always release the lock, even if the job has already finished
        jobs_cond.release()
    cluster.wait()
    cluster.close()
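
In the snippet as posted, nodes is an empty list, so the len(nodes) threshold is zero and the loop waits after every submission. A fixed limit on outstanding jobs can be expressed with a semaphore instead of comparing dictionary sizes. The sketch below uses ThreadPoolExecutor purely as a stand-in for the cluster so that it runs on its own. run_bounded, max_pending, and workers are hypothetical names, and with dispy the release would presumably go in the job_status callback.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, work, max_pending=4, workers=4):
    """Run work(task) for every task with at most max_pending jobs in flight."""
    slots = threading.BoundedSemaphore(max_pending)
    results = []
    lock = threading.Lock()

    def on_done(future):
        # Called for every outcome, so a slot is returned even when work() fails.
        try:
            value = future.result()
        except Exception as exc:
            value = exc
        with lock:
            results.append(value)
        slots.release()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for task in tasks:
            slots.acquire()  # blocks while max_pending jobs are outstanding
            pool.submit(work, task).add_done_callback(on_done)
    # Leaving the with block waits for all submitted jobs to finish.
    return results


if __name__ == "__main__":
    squares = run_bounded(range(10), lambda x: x * x, max_pending=3)
    print(sorted(squares))
```

Because the slot is released in the completion callback for every outcome, a failed or abandoned job cannot leave the submitting loop blocked. Results arrive in completion order, not submission order.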

Post by **Anon** » Mon Nov 22, 2021 7:09 pm

Quick update: I have been able to run it for a bit more than a week without issues. I made major changes to the compute function to handle a lot of exceptions and fix some bugs. I also added firewall rules on the nodes and the cluster to block external connections, as they are not expected to interact with other hosts. I will continue to test, and hopefully it will run smoothly this month. Thanks for the support.

Post by **Anon** » Fri Jan 07, 2022 11:23 am

Bumping this topic again. After several weeks of running a computation, I am facing the same issues again. It does seem to be a network issue, or something corrupting the serialized job, because the cluster simply gets stuck.

The cluster gets errors like

Code: Select all

2022-01-07 11:13:44 dispy - Ignoring invalid reply for job 140544685245808 from X.X.X.X

and the node gets thousands of errors like

Code: Select all

2022-01-07 11:13:44 dispynode - Sending result for job 140544685245808

I don't know how to debug or inspect these issues to figure out whether it's a problem with the type of data I'm sending and receiving, or what else I can do to handle this kind of connectivity issue.
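
A first step that needs no changes to dispy is to measure how often each job's result is resent, using the node log lines quoted above. The sketch below counts "Sending result for job" lines per job id and lists the ids that repeat many times. The regular expression follows the log format shown in this thread, while resend_counts, stuck_jobs, the threshold, and the log file name are hypothetical.

```python
import re
from collections import Counter

# Matches node log lines in the format quoted in this thread.
SEND_RE = re.compile(r"dispynode - Sending result for job (\d+)")


def resend_counts(lines):
    """Count 'Sending result' log lines per job id."""
    counts = Counter()
    for line in lines:
        match = SEND_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def stuck_jobs(lines, threshold=10):
    """Return the job ids whose result was sent at least threshold times."""
    counts = resend_counts(lines)
    return sorted(job for job, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical log file name; point this at the real node log.
    with open("dispynode.log", encoding="utf-8", errors="replace") as log:
        for job_id in stuck_jobs(log):
            print("result resent repeatedly for job", job_id)
```

If a single job id accounts for nearly all of the repeats, comparing it with the id in the client's "Ignoring invalid reply" message would show whether the node keeps retrying one result that the client no longer accepts.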

Post by **Giri** » Sun Jan 09, 2022 2:59 pm

I did some work on this issue when you first reported it, but left it after your last comment that you had it working. I am now preparing the next release (hopefully in a couple of days). I can get back to your issue after the release. Email me so I can send you files to test.

Post by **jhzheng_fzu** » Sat Oct 15, 2022 3:01 pm

Hello! I am a graduate student at Fuzhou University in China. I am learning dispy to create a cluster, but my major is civil engineering. Could you tell me your email address so that we can communicate?

Junhao Zheng
Fuzhou University
