Node fails when job is terminated
Posted: Fri Apr 16, 2021 1:56 pm
by dementiy
On Windows, the dispynode process was killed while canceling a job. This is because os.kill() with the CTRL_BREAK_EVENT argument does not work as expected on Windows: the event is delivered to the whole process group, so dispynode itself is killed. From dispynode.py:
Code: Select all
try:
    if os.name == 'nt':
        signum = signal.CTRL_BREAK_EVENT
    else:
        signum = signal.SIGINT
    os.kill(pid, signum)
except Exception:
    dispynode_logger.debug(traceback.format_exc())
Example to reproduce the problem without dispy (only for Windows):
Code: Select all
import multiprocessing
import os
import signal
import time


def do_nothing():
    for _ in range(1000):
        time.sleep(.2)


if __name__ == "__main__":
    p = multiprocessing.Process(target=do_nothing)
    p.start()
    print(f"Is alive: {p.is_alive()}")
    os.kill(p.pid, signal.CTRL_BREAK_EVENT)
    time.sleep(1)
    print(f"Is alive: {p.is_alive()}")
    p.terminate()
    # Main process should still work and print numbers from 0 to 999
    for i in range(1000):
        print(i)
        time.sleep(.2)
# Output:
# Is alive: True
# Is alive: True
# 0
# Process finished with exit code -1073741510 (0xC000013A interrupted by Ctrl+C)
For more details see the last comment in this thread:
https://bugs.python.org/issue26350.
If we replace CTRL_BREAK_EVENT with SIGTERM then everything works fine, except that we need to kill the child processes manually:
Code: Select all
if os.name == 'nt' and psutil:
    proc = psutil.Process(pid)
    dispynode_logger.info("Terminating children of process with pid: %s", pid)
    for child in proc.children(recursive=True):
        child.kill()
try:
    if os.name == 'nt':
        signum = signal.SIGTERM
    else:
        signum = signal.SIGINT
    os.kill(pid, signum)
except Exception:
    dispynode_logger.debug(traceback.format_exc())
Re: Node fails when job is terminated
Posted: Sat Apr 17, 2021 12:20 pm
by Giri
I occasionally test with Windows, so likely things are broken now. I also notice that I use CTRL_C_EVENT in dispycos, so perhaps that is better (instead of CTRL_BREAK_EVENT)? As you may know, the latest releases of dispy and pycos are supposed to first send KeyboardInterrupt to computations before terminating the process, so using SIGTERM to kill the process breaks this feature. I will look into this soon and post.
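For illustration, here is a minimal sketch (not from dispy's sources; the function name and the cancellation behavior are assumptions) of a computation that relies on this interrupt-first cancellation: if the node raises a KeyboardInterrupt in the job before terminating it, the cleanup branch below runs, whereas a plain SIGTERM would skip it.
Code: Select all
import time

def compute(n):
    # Assumption for this sketch: canceling the job raises KeyboardInterrupt
    # inside the computation before the worker process is terminated.
    try:
        for _ in range(n):
            time.sleep(1)
    except KeyboardInterrupt:
        # graceful cleanup on cancellation
        return 'canceled'
    return 'done'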
Re: Node fails when job is terminated
Posted: Sat Apr 17, 2021 9:21 pm
by Giri
I did a simple test replacing CTRL_BREAK_EVENT with CTRL_C_EVENT and it seems to work. I will commit this soon so you can try it.
Re: Node fails when job is terminated
Posted: Sat Apr 17, 2021 10:10 pm
by dementiy
Hmm... But it doesn't matter which signal is sent (CTRL_C_EVENT or CTRL_BREAK_EVENT): dispynode will be terminated/stopped by it (I checked this on my Windows machine). From the os.kill() implementation for Windows (
https://github.com/python/cpython/blob/ ... 7897-L7905):
Code: Select all
/* Console processes which share a common console can be sent CTRL+C or
   CTRL+BREAK events, provided they handle said events. */
if (sig == CTRL_C_EVENT || sig == CTRL_BREAK_EVENT) {
    if (GenerateConsoleCtrlEvent(sig, (DWORD)pid) == 0) {
        err = GetLastError();
        PyErr_SetFromWindowsErr(err);
    }
    else
        Py_RETURN_NONE;
}
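As a side note, here is a small sketch (not dispy code; it uses subprocess, since multiprocessing does not expose creationflags) of how that process-group semantics could be scoped on Windows: a child started with CREATE_NEW_PROCESS_GROUP can be sent CTRL_BREAK_EVENT by its own group id, so the parent console process survives.
Code: Select all
import os
import signal
import subprocess
import sys
import time

if __name__ == "__main__" and os.name == 'nt':
    # Start the child in its own process group so the event can be scoped to it.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
    )
    time.sleep(1)
    # child.pid doubles as the process-group id because of CREATE_NEW_PROCESS_GROUP,
    # so only the child's group receives CTRL+BREAK here.
    os.kill(child.pid, signal.CTRL_BREAK_EVENT)
    child.wait()
    print("child exited, parent still running")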
Re: Node fails when job is terminated
Posted: Sat Apr 17, 2021 10:46 pm
by Giri
Aiyee. I tested with CTRL_C_EVENT and it seems to work as expected. I use the ConEmu64 shell on Windows 10 to run dispynode; not sure if that matters. I am going to commit the changes as I tested them. Can you test and post (even if my fix is likely not valid)?
Re: Node fails when job is terminated
Posted: Sun Apr 18, 2021 3:24 am
by Giri
I remember reading the document you cited above about the issues with signals in Windows. I believe that is why I used CTRL_BREAK_EVENT (as that is the best that can be done), so my latest commit to use CTRL_C_EVENT is probably not any better! Anyway, as I said, in my tests it seems to work as I mentioned. Can you tell me how you run dispynode? From the "cmd" shell? If possible, attach a test program with dispy that kills dispynode when a job is terminated.
Re: Node fails when job is terminated
Posted: Sun Apr 18, 2021 9:48 pm
by dementiy
Hello, I tested your last commit with CTRL_C_EVENT and dispynode still stopped when the job was terminated:
Code: Select all
Terminating job ...
dispynode will quit when current computation ... closes.
Sending result ...
Computation ... from .... done
Sending TERMINATE to ...
Sending TERMINATE to ...
Computation .... already closed?
Process finished with exit code 0
And this is the expected behavior, because CTRL_C_EVENT is also sent to the process group.
I create dispynode as an instance of the class _DispyNode, for example:
Code: Select all
_dispy_node = _DispyNode(cpus=4, hosts=[...], scheduler_host=..., clean=True, force_cleanup=True, deamon=True)
while True:
    try:
        time.sleep(3600)
    except Exception as e:
        if os.path.isfile(...):
            _dispy_node.shutdown("exit")
        else:
            break
This works fine only if I register a signal handler to ignore SIGINT/SIGBREAK:
Code: Select all
def sighandler(signum, frame):
    dispynode_logger.debug("dispynode received signal %s. Ignore it", signum)

signal.signal(signal.SIGINT, sighandler)
signal.signal(signal.SIGBREAK, sighandler)
Re: Node fails when job is terminated
Posted: Mon Apr 19, 2021 5:33 am
by Giri
I think signals are not appropriate for terminating processes on Windows. I will address this in a couple of days. As I asked earlier, can you describe how you use dispynode (e.g., from the "cmd" shell?) and post a test program using dispy that terminates a job and thereby kills dispynode? In my testing termination works as expected, so I can't test this case.
Re: Node fails when job is terminated
Posted: Mon Apr 19, 2021 9:00 am
by dementiy
Yep, agree about signals.
Here is a simple test program.
I create an instance of _DispyNode (test_dispynode.py):
Code: Select all
import os
import time

from dispy.dispynode import _DispyNode, dispynode_logger

if __name__ == "__main__":
    dispynode_logger.setLevel(dispynode_logger.DEBUG)
    _dispy_node = _DispyNode(
        cpus=4, hosts=["10.57.46.38"],
        scheduler_host="10.57.46.38",
        clean=True,
        force_cleanup=True,
        deamon=True
    )
    while True:
        try:
            time.sleep(3600)
        except (Exception, KeyboardInterrupt) as e:
            if os.path.isfile(os.path.join(_dispy_node.dest_path_prefix, "config.pkl")):
                _dispy_node.shutdown("exit")
            else:
                break
Cluster (test_cluster.py):
Code: Select all
import dispy


def compute(n):
    import time
    time.sleep(n)


if __name__ == "__main__":
    import time

    cluster = dispy.SharedJobCluster(
        compute,
        nodes=["*"],
        client_port=0,
        host="10.57.46.38",
        scheduler_host="10.57.46.38",
        loglevel=dispy.logger.DEBUG
    )
    job = cluster.submit(10)
    time.sleep(3)
    cluster.cancel(job)
    while job.status == dispy.DispyJob.Created or job.status == dispy.DispyJob.Running:
        time.sleep(1)
And actual output:
Re: Node fails when job is terminated
Posted: Wed Apr 21, 2021 2:50 am
by Giri
I just committed a fix for it. Let me know if this works for you.