
How to debug hanging processes?

Posted: Fri Mar 19, 2021 5:38 pm
by drchazewaz
Hello, I'm submitting a batch of jobs:


        for s in file_list:
            job = cluster.submit(s, **kwargs)

        while cluster._pending_jobs > 0:
            print('Pending jobs:', cluster._pending_jobs)
            time.sleep(2)
...and the pending jobs never complete. I'm submitting 10 jobs as a test, and my loop keeps reporting 10 jobs left. I also have the HTML monitor running, and it always shows Jobs Submitted: 0, Jobs Done: 0, Jobs Pending: 0.

I've verified that the function I pass to the cluster works locally on its own, without dispy. How do I diagnose why the cluster is hanging?

Re: How to debug hanging processes?

Posted: Sat Mar 20, 2021 2:49 am
by Giri
It looks like the dispy job scheduler didn't find any nodes. If so, there are several possible causes:
  1. Make sure the IP address where dispynode runs is a network address the scheduler can actually connect to; e.g., if the address announced by dispynode (or by the scheduler/client) is '127.0.0.1' (for IPv4), then they can't find / connect to each other. Typically, addresses should be of the form '192.168.x.y' (but it depends on your network configuration). You can force dispy to use a specific IP address (e.g., with the 'host' parameter).
  2. Make sure either the firewall is disabled on the nodes and the client, or the ports are allowed through the firewall (default ports are 61590-61593). dispynode and the scheduler announce their IP address and port when started.
  3. For testing, you can specify the host name / IP address of a node in the 'nodes' parameter of JobCluster (e.g., 'cluster = dispy.JobCluster(compute, nodes=["192.168.2.10"])'). When nodes are listed this way, TCP is used to connect to them; otherwise, UDP broadcast is used to discover nodes, which can sometimes fail (e.g., routers may drop UDP packets if there is a lot of traffic, or if WiFi is used).
  4. Use the 'loglevel' parameter to see more details; e.g., the scheduler should report discovered nodes (e.g., 'Discovered 192.168.x.y with n cpus').
  5. If using httpd, it should list the nodes it found. If no nodes are listed under 'Cluster', the scheduler can't submit jobs.
  6. You may already know this, but in case someone else finds this later: it is not advisable to use any variable that starts with an underscore, even for debugging; those are implementation details (like 'private' variables in other OO languages).
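To illustrate points 1, 3 and 4, here is a minimal sketch. The node address is a placeholder for your own network, and the trivial 'compute' function stands in for your real workload:

```python
def compute(n):
    # Runs on a node; must be self-contained (its own imports, no globals).
    return n * n

def make_cluster(node_addr):
    # Pin the scheduler to one known node over TCP (avoiding UDP discovery)
    # and enable debug logging, as suggested above. 'node_addr' is a
    # placeholder such as '192.168.2.10'.
    import dispy  # imported lazily so 'compute' can be tested without a cluster
    return dispy.JobCluster(compute, nodes=[node_addr],
                            loglevel=dispy.logger.DEBUG)
```

With debug logging on, the scheduler prints a 'Discovered ...' line for each node it finds; if no such line appears, the problem is likely one of the network issues listed above.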

Re: How to debug hanging processes?

Posted: Mon Mar 22, 2021 7:26 pm
by drchazewaz
Thank you very much, a firewall was indeed the problem.

What is the correct way to accomplish my little "how many jobs are left" print statement without using the "_" variables?

Sometimes I get a "None" result for my jobs, and I have to use trial and error until I get a valid result. E.g., this happens when I miss a dependency (in general, I'm not sure I'm using the depends= keyword correctly). How do I see what's happening on the nodes for debugging?

EDIT: Just recently I was getting "None" returned on jobs because input data files didn't exist on the nodes (though they did exist on the local machine where I was testing the code). It took me a while to figure this out - how could I have diagnosed this through dispy?

Re: How to debug hanging processes?

Posted: Wed Mar 24, 2021 12:06 am
by Giri
It is easy to pass the parameter 'loglevel=dispy.logger.DEBUG' to JobCluster or SharedJobCluster to debug issues and see progress (e.g., how many jobs have been submitted or are pending).

You can also call the 'cluster.status()' method at any time (e.g., periodically) to get the current status, although it is easier to view status with httpd.
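To answer your earlier question about the "how many jobs are left" loop: one way to do it without touching the underscore-prefixed internals is to keep the DispyJob objects returned by submit() and count their public 'status' attributes. A sketch (the tuple of terminal states assumes dispy's documented DispyJob status constants):

```python
import time

def count_pending(jobs, done_statuses):
    # Count jobs whose status is not yet terminal. Kept as a pure helper
    # so it can be tested without a running cluster.
    return sum(1 for job in jobs if job.status not in done_statuses)

def wait_for_jobs(jobs, poll=2):
    # Poll using the public DispyJob.status attribute instead of
    # cluster._pending_jobs; 'jobs' is the list of objects that
    # cluster.submit() returned.
    import dispy
    done = (dispy.DispyJob.Finished, dispy.DispyJob.Terminated,
            dispy.DispyJob.Cancelled, dispy.DispyJob.Abandoned)
    while True:
        pending = count_pending(jobs, done)
        if pending == 0:
            break
        print('Pending jobs:', pending)
        time.sleep(poll)
```

If you don't need per-iteration progress output, 'cluster.wait()' simply blocks until all submitted jobs finish.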

When a job has finished, it is a good idea to check its status before accessing the result. For example, see 'node_status.py' in the examples, which checks whether 'job.status == dispy.DispyJob.Finished' and accesses the result only if so. Otherwise, the exception trace, errors, etc. are available in 'job.exception', 'job.stderr', etc.
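That check can be wrapped in a small harvesting helper, which also addresses the earlier question about "None" results: when a job did not finish cleanly, its 'exception' and 'stderr' attributes usually show why (e.g., a missing input file on the node). A sketch:

```python
def classify(job, finished_status):
    # Pure helper: given a completed job, return either its result or its
    # diagnostics. Split out so it can be tested without dispy.
    if job.status == finished_status:
        return ('ok', job.result)
    return ('error', {'exception': job.exception, 'stderr': job.stderr})

def harvest(jobs):
    # Wait for each submitted job and separate good results from failures.
    import dispy
    results, failures = [], []
    for job in jobs:
        job()  # blocks until this job completes
        kind, payload = classify(job, dispy.DispyJob.Finished)
        (results if kind == 'ok' else failures).append(payload)
    return results, failures
```

Printing the 'failures' list after a run would have surfaced the missing-file problem immediately, since the traceback on the node ends up in 'job.exception'.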