Node fails when job is terminated

dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Post by dementiy »

On Windows, the dispynode process is killed when a job is canceled. This is because os.kill() with the CTRL_BREAK_EVENT argument does not work as expected on Windows: the event goes to the whole process group and kills it, dispynode included. From dispynode.py:

Code:

try:
    if os.name == 'nt':
        signum = signal.CTRL_BREAK_EVENT
    else:
        signum = signal.SIGINT
    os.kill(pid, signum)
except Exception:
    dispynode_logger.debug(traceback.format_exc())
Example to reproduce the problem without dispy (only for Windows):

Code:

import multiprocessing
import os
import signal
import time

def do_nothing():
    # Keep the child alive long enough to be signalled
    for _ in range(1000):
        time.sleep(.2)

if __name__ == "__main__":
    p = multiprocessing.Process(target=do_nothing)
    p.start()
    print(f"Is alive: {p.is_alive()}")
    os.kill(p.pid, signal.CTRL_BREAK_EVENT)

    time.sleep(1)
    print(f"Is alive: {p.is_alive()}")
    p.terminate()
    
    # Main process should keep running and print numbers from 0 to 999
    for i in range(1000):
        print(i)
        time.sleep(.2)
 
 # Output:
 # Is alive: True
 # Is alive: True
 # 0
 # Process finished with exit code -1073741510 (0xC000013A interrupted by Ctrl+C)
 
For more details see the last comment in this thread: https://bugs.python.org/issue26350.
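
For reference, the usual way to keep CTRL_BREAK_EVENT away from the parent is to start the child in its own process group. Below is a minimal sketch, assuming the child is launched with subprocess (multiprocessing does not let you pass creation flags). The helper names `spawn_isolated` and `interrupt` are made up for illustration and are not part of dispy:

```python
import signal
import subprocess
import sys


def spawn_isolated(args):
    """Start a child that can be interrupted without touching the parent."""
    if sys.platform == "win32":
        # In a new process group the child's pid is a valid target for
        # CTRL_BREAK_EVENT, so the event is not delivered to our own group.
        return subprocess.Popen(
            args, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP
        )
    return subprocess.Popen(args)


def interrupt(proc):
    """Ask the child to stop: CTRL_BREAK_EVENT on Windows, SIGINT elsewhere."""
    if sys.platform == "win32":
        proc.send_signal(signal.CTRL_BREAK_EVENT)
    else:
        proc.send_signal(signal.SIGINT)


if __name__ == "__main__":
    child = spawn_isolated([sys.executable, "-c", "import time; time.sleep(5)"])
    interrupt(child)
    print("child exit code:", child.wait(timeout=10))
    print("parent is still running")
```

Note that a process started in a new process group has Ctrl+C handling disabled by default on Windows, so CTRL_BREAK_EVENT is the event to use with this approach.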

If we replace CTRL_BREAK_EVENT with SIGTERM, everything works fine, except that we need to kill the child processes manually:

Code:

if os.name == 'nt' and psutil:
    proc = psutil.Process(pid)
    dispynode_logger.info("Terminating childs for process with pid: %s", pid)
    for child in proc.children(recursive=True):
        child.kill()

try:
    if os.name == 'nt':
        signum = signal.SIGTERM
    else:
        signum = signal.SIGINT
    os.kill(pid, signum)
except Exception:
    dispynode_logger.debug(traceback.format_exc())
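
A slightly more defensive variant of the same idea, as a sketch only: collect the children before touching the parent, ask everything to terminate, and force-kill whatever is still alive after a grace period. `terminate_tree` and its `grace` parameter are hypothetical names, and psutil is treated as optional, as in the snippet above:

```python
import os
import signal
import subprocess
import sys

try:
    import psutil  # third-party and optional
except ImportError:
    psutil = None


def terminate_tree(pid, grace=3.0):
    """Terminate `pid` and its descendants; return how many were targeted."""
    if psutil is None:
        # Without psutil only the process itself can be signalled.
        os.kill(pid, signal.SIGTERM)
        return 1
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return 0
    # Snapshot the children first: once the parent is gone they are
    # no longer reachable through it.
    procs = parent.children(recursive=True) + [parent]
    for proc in procs:
        try:
            proc.terminate()  # SIGTERM on POSIX, TerminateProcess() on Windows
        except psutil.NoSuchProcess:
            pass
    # Wait a moment, then force-kill anything that has not exited.
    _, alive = psutil.wait_procs(procs, timeout=grace)
    for proc in alive:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass
    return len(procs)


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print("processes targeted:", terminate_tree(child.pid))
```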
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

I only occasionally test on Windows, so it is likely that things are broken now. I also notice that I use CTRL_C_EVENT in dispycos, so perhaps that is better than CTRL_BREAK_EVENT? As you may know, the latest releases of dispy and pycos are supposed to first raise KeyboardInterrupt in computations before terminating the process, so using SIGTERM to kill the process breaks this feature. I will look into this soon and post.
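
To make the trade-off concrete, here is roughly what a computation that cleans up on cancellation could look like, assuming the interruption arrives as KeyboardInterrupt. This is only a sketch: `compute` and `fake_cancel` are illustrative, and the `sleep` parameter exists just so the interruption can be simulated without a node; none of it is dispy API.

```python
import time


def compute(n, sleep=time.sleep):
    """Pretend to work for `n` seconds; report whether we were interrupted."""
    try:
        sleep(n)
    except KeyboardInterrupt:
        # Cancellation path: release resources, save partial results, etc.
        return "interrupted"
    return "finished"


def fake_cancel(_seconds):
    """Stand-in for the node interrupting the job while it sleeps."""
    raise KeyboardInterrupt


if __name__ == "__main__":
    print(compute(0))                # normal completion
    print(compute(10, fake_cancel))  # simulated cancellation
```

If the process is killed outright instead (which is what os.kill() does on Windows for anything other than the two console events), the except block never gets a chance to run.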
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

I did a simple test replacing CTRL_BREAK_EVENT with CTRL_C_EVENT and it seems to work. I will commit this soon so you can try it.
dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Re: Node fails when job is terminated

Post by dementiy »

Hmm... But it doesn't matter which signal is sent (CTRL_C_EVENT or CTRL_BREAK_EVENT): dispynode itself will be terminated by it (I checked this on my Windows machine). From the os.kill implementation for Windows (https://github.com/python/cpython/blob/ ... 7897-L7905):

Code:

    /* Console processes which share a common console can be sent CTRL+C or
       CTRL+BREAK events, provided they handle said events. */
    if (sig == CTRL_C_EVENT || sig == CTRL_BREAK_EVENT) {
        if (GenerateConsoleCtrlEvent(sig, (DWORD)pid) == 0) {
            err = GetLastError();
            PyErr_SetFromWindowsErr(err);
        }
        else
            Py_RETURN_NONE;
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

Aiyee. I tested with CTRL_C_EVENT and it seems to work as expected. I use the ConEmu64 shell on Windows 10 to run dispynode; not sure if that matters. I am going to commit the changes as I tested them. Can you test and post your results (even though my fix is likely not valid)?
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

I remember reading the document you cited above about issues with signals on Windows. I believe that is why I used CTRL_BREAK_EVENT (as that is the best that can be done). My latest commit to use CTRL_C_EVENT is no better! Anyway, as I said, in my tests it seems to work. Can you tell me how you run dispynode? From the "cmd" shell? If possible, attach a test program with dispy that kills dispynode when a job is terminated.
dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Re: Node fails when job is terminated

Post by dementiy »

Hello, I tested your last commit with CTRL_C_EVENT, and dispynode stops when a job is terminated:

Code:

Terminating job ...
dispynode will quit when current computation ... closes.
Sending result ...
Computation ... from ....  done
Sending TERMINATE to ...
Sending TERMINATE to ...
Computation .... already closed?

Process finished with exit code 0
And this is expected behavior because CTRL_C_EVENT is sent to the process group.

I create dispynode as an instance of class _DispyNode, for example:

Code:

_dispy_node = _DispyNode(cpus=4, hosts=[...], scheduler_host=..., clean=True, force_cleanup=True, deamon=True)
while True:
    try:
        time.sleep(3600)
    except Exception as e:
        if os.path.isfile(...):
            _dispy_node.shutdown("exit")
        else:
            break
This works only if I register a signal handler that ignores SIGINT/SIGBREAK:

Code:

def sighandler(signum, frame):
    dispynode_logger.debug("dispynode received signal %s. Ignore it", signum)

signal.signal(signal.SIGINT, sighandler)
signal.signal(signal.SIGBREAK, sighandler)
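
The same workaround in a form that also runs outside Windows, as a rough sketch: SIGBREAK only exists on Windows, so it is looked up by name, and the previous handlers are returned so they can be put back. `ignore_console_signals` and `restore_signals` are hypothetical helper names, not dispynode functions:

```python
import signal


def ignore_console_signals():
    """Ignore Ctrl+C / Ctrl+Break in this process; return the old handlers.

    Must be called from the main thread, like any signal.signal() call.
    """
    previous = {}
    for name in ("SIGINT", "SIGBREAK"):  # SIGBREAK is defined only on Windows
        signum = getattr(signal, name, None)
        if signum is not None:
            previous[signum] = signal.signal(signum, signal.SIG_IGN)
    return previous


def restore_signals(previous):
    """Put back the handlers saved by ignore_console_signals()."""
    for signum, handler in previous.items():
        if handler is not None:  # None means it was not installed from Python
            signal.signal(signum, handler)


if __name__ == "__main__":
    saved = ignore_console_signals()
    print("SIGINT handler:", signal.getsignal(signal.SIGINT))
    restore_signals(saved)
```

The obvious cost is that the node can no longer be stopped with Ctrl+C while the signals are ignored, so some other shutdown path is needed.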
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

I think signals are not appropriate for terminating processes on Windows. I will address this in a couple of days. As I asked earlier, can you describe how you run dispynode (e.g., from the "cmd" shell?) and share a test program using dispy where terminating a job kills dispynode? In my testing termination works as expected, so I can't reproduce this case.
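
One signal-free alternative, purely as an illustration and not necessarily what the eventual fix does: let the worker watch a pipe and exit on its own when the parent closes it. The sketch below assumes the worker is a separate Python process started with subprocess; `WORKER`, `start_worker` and `stop_worker` are made-up names:

```python
import subprocess
import sys

# Worker source: block until the parent closes our stdin, then clean up.
WORKER = """
import sys
sys.stdin.read()
print("worker: cleaning up")
"""


def start_worker():
    """Launch the worker with pipes attached to its stdin and stdout."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def stop_worker(proc, timeout=10):
    """Close the worker's stdin (no signal involved) and wait for it to exit."""
    out, _ = proc.communicate(timeout=timeout)
    return proc.returncode, out


if __name__ == "__main__":
    worker = start_worker()
    code, output = stop_worker(worker)
    print(code, output.strip())
```

This behaves the same on Windows and POSIX, and the worker gets to run its cleanup code, but it only works for computations that cooperate by checking the pipe.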
dementiy
Posts: 7
Joined: Tue Apr 13, 2021 5:26 pm

Re: Node fails when job is terminated

Post by dementiy »

Yep, agree about signals.

Here is a simple test program.

I create an instance of DispyNode (test_dispynode.py):

Code:

import os
import time
from dispy.dispynode import _DispyNode, dispynode_logger

if __name__ == "__main__":
    dispynode_logger.setLevel(dispynode_logger.DEBUG)
    _dispy_node = _DispyNode(
        cpus=4, hosts=["10.57.46.38"],
        scheduler_host="10.57.46.38",
        clean=True,
        force_cleanup=True,
        deamon=True
    )
    while True:
        try:
            time.sleep(3600)
        except (Exception, KeyboardInterrupt) as e:
            if os.path.isfile(os.path.join(_dispy_node.dest_path_prefix, "config.pkl")):
                _dispy_node.shutdown("exit")
            else:
                break
Cluster (test_cluster.py):

Code:

import dispy

def compute(n):
    import time
    time.sleep(n)

if __name__ == "__main__":
    import time
    cluster = dispy.SharedJobCluster(
        compute,
        nodes=["*"],
        client_port=0,
        host="10.57.46.38",
        scheduler_host="10.57.46.38",
        loglevel=dispy.logger.DEBUG
    )
    job = cluster.submit(10)
    time.sleep(3)
    cluster.cancel(job)
    while job.status == dispy.DispyJob.Created or job.status == dispy.DispyJob.Running:
        time.sleep(1)
 
And actual output:
Image
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Node fails when job is terminated

Post by Giri »

I just committed a fix for it. Let me know if this works for you.