Unfortunately, it didn't solve the issue. After a few hours and a few jobs, the nodes start crashing. I'm using dispynode version 4.15.0 (Python 3.6.8).
It looks like the connection is interrupted, or fails for some reason, and then the node is unable to resume the job.
Code:
pycos - invalid task 140648621602776 to resume
EDIT: I also found these errors on one of the nodes:
Code:
2021-10-15 01:37:05 pycos - uncaught exception in tcp_req/140668394794152:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3686, in _schedule
    retval = task._generator.throw(*exc)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 485, in _recvall
    recvd = self._rsock.recv_into(view, len(view), *args)
ConnectionResetError: [Errno 104] Connection reset by peer
2021-10-15 01:37:10 pycos - uncaught exception in tcp_req/140668485396616:
Traceback (most recent call last):
  File "/usr/local/bin/dispynode.py", line 1696, in tcp_req
    msg = yield conn.recv_msg()
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 3688, in _schedule
    retval = task._generator.send(task._value)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 906, in _async_recv_msg
    data = yield self.recvall(n)
  File "/usr/local/lib/python3.6/site-packages/pycos/__init__.py", line 511, in _async_recvall
    self._read_result = bytearray(bufsize)
MemoryError
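One guess about the MemoryError, and it is only a guess: the traceback shows the buffer being allocated from a size (`bufsize`) right after a message read. If that size comes from a length header on the wire, then a corrupted or stray header after the reset could request a huge buffer. The sketch below is not pycos code. It assumes a 4-byte big-endian length prefix, and `read_exact`, `read_msg` and `MAX_MSG_SIZE` are hypothetical names, just to show how a receiver could reject an implausible length instead of allocating it.

```python
import io
import struct

# Assumption: messages are framed as a 4-byte big-endian length, then the payload.
HEADER = struct.Struct(">L")
# Hypothetical cap; pick whatever is realistic for the jobs being sent.
MAX_MSG_SIZE = 64 * 1024 * 1024


def read_exact(stream, n):
    """Read exactly n bytes from a file-like object, or raise EOFError."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed after %d of %d bytes" % (len(buf), n))
        buf.extend(chunk)
    return bytes(buf)


def read_msg(stream, max_size=MAX_MSG_SIZE):
    """Read one length-prefixed message, refusing lengths above max_size."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    if length > max_size:
        # Fail early instead of trying to allocate a buffer of this size.
        raise ValueError("declared length %d exceeds cap %d" % (length, max_size))
    return read_exact(stream, length)


if __name__ == "__main__":
    ok = io.BytesIO(HEADER.pack(5) + b"hello")
    print(read_msg(ok))
    garbage = io.BytesIO(b"\xff\xff\xff\xff" + b"junk")
    try:
        read_msg(garbage)
    except ValueError as exc:
        print("rejected:", exc)
```

If something like this is what happens, the crash would be a consequence of the dropped connection rather than of the jobs themselves using too much memory, but I can't confirm that from the logs alone.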
The funny thing is that this time I tried running it on five remote nodes and one local node (127.0.0.1). The local node lasted longer than the others but still crashed after a few hours. What data is stored on the node after a job execution? Is there a way to do a cleanup (remove everything from memory/disk except the transferred dependency files)?
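To check whether anything piles up on disk between jobs, a small stdlib-only helper like the one below could be run on a node before and after a batch. It is a rough sketch: `dir_usage` is a hypothetical name, and the directory to inspect is passed in as an argument because I'm not sure where dispynode keeps its files on a given setup.

```python
import os
import sys


def dir_usage(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # File disappeared (or is unreadable) between listing and stat.
                continue
            count += 1
    return count, total


if __name__ == "__main__":
    # Usage: python dir_usage.py /path/to/node/working/dir
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    files, size = dir_usage(root)
    print("%s: %d files, %d bytes" % (root, files, size))
```

Comparing the numbers after each job should at least show whether the disk side is growing. It says nothing about memory held by the node process.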
BTW, yeah, I'll PM you about the on-demand support.