Debugging dispynode error
Posted: Mon Oct 11, 2021 4:29 am
by Anon
Hi guys, I have been using dispy to distribute computation between different servers in the cloud, but I'm getting an error after a few job submissions, and I'm not sure why it happens or what causes it.
Code:
2021-10-11 00:19:55 dispy - Failed to run 2513 on X.X.X.X: bytearray(b'Invalid job (Traceback (most recent call last):\n File "/usr/local/bin/dispynode.py", line 1132, in job_request\n client = self.clients[_job.compute_id]\nKeyError: 1633899704562\n)')
2021-10-11 00:19:55 dispy - Failed to run job 139797300820752 on X.X.X.X for computation compute
When this happens, the cluster gets stuck, and any subsequent attempt to send a job fails on every node. The line where it fails belongs to this code segment.
Code:
def tcp_req(self, conn, addr, task=None):
    def job_request(msg):
        error = None
        try:
            _job = deserialize(msg)
            client = self.clients[_job.compute_id]
So I'm guessing there could be an issue with the job I'm trying to create that is causing a bug in the deserialization. Has anyone else faced a similar issue? How can I debug this to see if the job I'm sending is causing a deserialization bug?
Re: Debugging dispynode error
Posted: Tue Oct 12, 2021 10:24 pm
by Giri
Is this with the latest version?
It seems the node discarded the client, likely due to zombie detection. I think you can disable zombie checking at the node with '--zombie_interval=0'.
BTW, I have added support for on-demand hybrid cloud computing to dispy. If you are interested, email me.
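To illustrate the zombie-detection explanation, here is a minimal sketch of how a node that drops idle clients could end up raising the KeyError from the first post. `NodeSketch`, `register`, and `purge_zombies` are hypothetical names made up for this example; it is not dispynode's actual code, only a toy model of a client table keyed by compute ID.

```python
class NodeSketch:
    """Hypothetical stand-in for a node's client table; not dispynode's code."""

    def __init__(self, zombie_interval):
        self.zombie_interval = zombie_interval  # seconds; 0 disables the check
        self.clients = {}  # compute_id -> time of last contact

    def register(self, compute_id, now):
        self.clients[compute_id] = now

    def purge_zombies(self, now):
        # Forget every client that has been silent for too long.
        if not self.zombie_interval:
            return
        for compute_id, last_seen in list(self.clients.items()):
            if now - last_seen > self.zombie_interval:
                del self.clients[compute_id]

    def job_request(self, compute_id):
        # Same lookup shape as the failing line: a missing key means the
        # node no longer knows the client that sent the job.
        try:
            self.clients[compute_id]
        except KeyError:
            return "Invalid job (KeyError: %s)" % compute_id
        return "accepted"


strict = NodeSketch(zombie_interval=60)
strict.register(1633899704562, now=0)
strict.purge_zombies(now=120)  # client silent for 120 s, limit is 60 s
print(strict.job_request(1633899704562))

relaxed = NodeSketch(zombie_interval=0)  # analogous to disabling the check
relaxed.register(1633899704562, now=0)
relaxed.purge_zombies(now=120)
print(relaxed.job_request(1633899704562))
```

In this toy model the client keeps submitting with a compute ID the node has already forgotten, so every later job is rejected until the client registers again.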
Re: Debugging dispynode error
Posted: Thu Oct 14, 2021 8:31 pm
by Anon
Unfortunately, it didn't solve the issue. After a few hours and a few jobs, the nodes start crashing. I'm using dispynode - version: 4.15.0 (Python 3.6.8).
It does seem like the connection is interrupted or fails for some reason, and then it's unable to resume the job.
Code:
pycos - invalid task 140648621602776 to resume
EDIT: I also found this error on one of the nodes:
Code:
2021-10-15 01:37:05 pycos - uncaught exception in tcp_req/140668394794152:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3686, in _schedule
    retval = task._generator.throw(*exc)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 485, in _recvall
    recvd = self._rsock.recv_into(view, len(view), *args)
ConnectionResetError: [Errno 104] Connection reset by peer
2021-10-15 01:37:10 pycos - uncaught exception in tcp_req/140668485396616:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3688, in _schedule
    retval = task._generator.send(task._value)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 511, in _async_recvall
    self._read_result = bytearray(bufsize)
MemoryError
The funny thing is that this time I tried running it on five remote nodes and one local node (127.0.0.1). The local node lasted longer than the others but still crashed after a few hours. What data is stored on the node after a job execution? Is there a way to do a cleanup (remove everything from memory/disk except the transferred dependency files)?
BTW yeah, I'll PM you about the on-demand support.
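One possible reading of the MemoryError in the second traceback, sketched here as an assumption rather than a diagnosis: if messages are framed with a fixed-size length prefix (assumed below to be a 4-byte big-endian integer), then any stray bytes arriving on the port get interpreted as a length, and the receiver tries to allocate a buffer of that size. The traceback only shows `bytearray(bufsize)` failing; the framing format, `MAX_MSG_SIZE`, and `parse_length` are assumptions introduced for illustration.

```python
import struct

MAX_MSG_SIZE = 64 * 1024 * 1024  # hypothetical cap; choose one that fits your jobs


def parse_length(header, limit=MAX_MSG_SIZE):
    """Decode a 4-byte big-endian length and refuse implausible sizes."""
    if len(header) != 4:
        raise ValueError("incomplete length header")
    (n,) = struct.unpack(">L", header)
    if n > limit:
        raise ValueError("refusing to allocate %d bytes" % n)
    return n


# A well-formed header decodes to the announced payload size.
print(parse_length(struct.pack(">L", 1024)))  # 1024

# The first four bytes of an unrelated HTTP request, read as a length,
# announce a payload of roughly 1.1 GiB.
stray = b"GET "
print(struct.unpack(">L", stray)[0])  # 1195725856

try:
    parse_length(stray)
except ValueError as exc:
    print("rejected:", exc)
```

Later in the thread, blocking external connections with firewall rules kept the nodes running for a while, which would fit this reading, though nothing in the thread confirms it.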
Re: Debugging dispynode error
Posted: Sun Oct 17, 2021 6:48 pm
by Giri
I have looked into the issue with intermittent loss of network connection and simulated it. The client currently simply discards nodes that are presumed dead. I am working on recovering from this. I will email you once I am done so you can try it.
Re: Debugging dispynode error
Posted: Tue Oct 19, 2021 3:49 am
by Giri
Can you post or email me your client program without 'compute'? I would like to test with the way you create the cluster and submit jobs.
Re: Debugging dispynode error
Posted: Sun Oct 24, 2021 5:38 am
by Anon
Sorry for the delay. I also found that my compute had a major bug that was causing it to yield a bunch of errors. I've been working on it and I'll run a major test today, but here is the code:
Code:
def compute(args):  # function sent to remote nodes for execution
    from external_lib import X
    result = X.start(args)
    return result

def job_status(job):
    global pending_jobs, jobs_cond, args
    if (job.status == dispy.DispyJob.Finished or
            job.status in (dispy.DispyJob.Terminated, dispy.DispyJob.Cancelled,
                           dispy.DispyJob.Abandoned)):
        jobs_cond.acquire()
        if job.id:
            if job.exception is None:
                if job() is not None:  # computation found something
                    saveResult(job())  # saveResult is not defined in this excerpt
            pending_jobs.pop(job.id)
            job.result = None
            if len(pending_jobs) <= len(nodes):
                jobs_cond.notify()
        jobs_cond.release()

if __name__ == '__main__':
    import dispy, sys, threading, os, time
    # external lib has multiple files
    dependencies = []
    for root, subdir, files in os.walk('external_lib'):
        for file in files:
            dependencies.append(os.path.join(root, file))
    nodes = []
    # SECRET is not defined in this excerpt
    cluster = dispy.JobCluster(compute, depends=dependencies, nodes=nodes, host='0.0.0.0', secret=SECRET, job_status=job_status)
    pending_jobs = {}
    jobs_cond = threading.Condition()
    # load computations and their arguments
    compute_list = ...
    #
    while len(compute_list) != 0:
        # compute_task and compute_args are not assigned before the first iteration in this excerpt
        job = cluster.submit(compute_task, compute_args)
        jobs_cond.acquire()
        if job.status == dispy.DispyJob.Created or job.status == dispy.DispyJob.Running:
            pending_jobs[job.id] = job
            if len(pending_jobs) >= len(nodes):
                jobs_cond.wait()
                cluster.print_status()
        jobs_cond.release()
        compute_task = compute_list.pop()
    cluster.wait()
    cluster.close()
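The throttling pattern in the client above (submit, track pending jobs under a condition variable, wait when too many are in flight) can be exercised without a cluster. The sketch below swaps dispy for a local `ThreadPoolExecutor`, so it says nothing about dispy's own behavior; `UPPER_BOUND`, `LOWER_BOUND`, `work`, `on_done`, and `run` are hypothetical names. It differs from the posted code in two ways: every job gets an explicit, non-zero ID before it is recorded (nothing in the posted code assigns `job.id`, so if that attribute is still unset when `job_status` runs, the `if job.id:` guard would skip the bookkeeping), and the wait sits in a `while` loop so the bound is re-checked after each wakeup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

UPPER_BOUND = 4  # hypothetical: stop submitting at this many pending jobs
LOWER_BOUND = 2  # hypothetical: wake the submitter once pending drops to this

pending = {}                  # job_id -> future, shared between threads
cond = threading.Condition()  # guards 'pending', 'results' and 'peak'
results = []
peak = 0                      # largest number of jobs seen in flight


def work(x):
    # Stand-in for the remote computation.
    return x * x


def on_done(job_id, future):
    # Plays the role of the job_status callback.
    with cond:
        results.append(future.result())
        pending.pop(job_id, None)
        if len(pending) <= LOWER_BOUND:
            cond.notify()


def run(inputs):
    global peak
    with ThreadPoolExecutor(max_workers=UPPER_BOUND) as pool:
        # IDs start at 1, so a truthiness check on the ID never skips a job.
        for job_id, x in enumerate(inputs, start=1):
            with cond:
                while len(pending) >= UPPER_BOUND:  # re-check after every wakeup
                    cond.wait()
                future = pool.submit(work, x)
                pending[job_id] = future
                peak = max(peak, len(pending))
            # Registered outside the lock; runs immediately if already done.
            future.add_done_callback(lambda f, j=job_id: on_done(j, f))


run(range(20))
print(len(results), "results, peak in flight:", peak)
```

Recording the job while still holding the lock means the completion callback can never observe a finished job that is missing from `pending`.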
Re: Debugging dispynode error
Posted: Mon Nov 22, 2021 7:09 pm
by Anon
Quick update: I have been able to run it for a bit more than a week without issues. I made major changes to the compute function to handle a lot of exceptions and fix some bugs. I also added firewall rules on the nodes and the cluster to block external connections, since it's not expected to interact with other hosts. I will continue to test, and hopefully it will run smoothly this month. Thanks for the support.
Re: Debugging dispynode error
Posted: Fri Jan 07, 2022 11:23 am
by Anon
Bumping this topic again. After several weeks of running a computation, I'm facing the same issues again. It does seem to be a network issue or something corrupting the serialized job, because it simply gets stuck.
The cluster gets errors like
Code:
2022-01-07 11:13:44 dispy - Ignoring invalid reply for job 140544685245808 from X.X.X.X
and the node gets thousands of errors like
Code:
2022-01-07 11:13:44 dispynode - Sending result for job 140544685245808
I don't know how to debug or inspect these issues to figure out whether it's a problem with the type of data I'm sending/receiving, or what else I can do to handle this kind of connectivity issue.
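One way to rule the payload in or out before a job is submitted is to round-trip the arguments (and, on the compute side, the result) through `pickle` locally. This is a minimal sketch that assumes the jobs are serialized with Python's `pickle`; it does not reproduce dispy's wire format, and `check_payload` is a hypothetical helper.

```python
import pickle
import threading


def check_payload(obj, label="payload"):
    """Round-trip obj through pickle and report its serialized size."""
    try:
        data = pickle.dumps(obj)
        pickle.loads(data)
    except Exception as exc:
        return {"label": label, "ok": False, "error": repr(exc), "size": None}
    return {"label": label, "ok": True, "error": None, "size": len(data)}


# Plain containers and numbers serialize without trouble.
print(check_payload({"n": 1, "values": [1.0, 2.0]}, "args"))

# Objects holding OS resources, such as locks, cannot be pickled.
print(check_payload({"lock": threading.Lock()}, "bad args"))
```

Logging the reported size for each job also shows whether unusually large arguments or results line up with the point where the cluster gets stuck.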
Re: Debugging dispynode error
Posted: Sun Jan 09, 2022 2:59 pm
by Giri
I did some work on this issue when you first reported it, but left it after your last comment that you got it working. I am now preparing the next release (hopefully in a couple of days). I can get back to your issue after the release. Email me so I can send you files to test.
Re: Debugging dispynode error
Posted: Sat Oct 15, 2022 3:01 pm
by jhzheng_fzu
Hello! I am a graduate student at Fuzhou University in China. I am learning to use dispy to create a cluster, but my major is civil engineering. Could you tell me your email address so that we can communicate?
Junhao Zheng
Fuzhou University