How to debug hanging processes?

Questions, issues regarding dispy / pycos

Moderator: admin

drchazewaz
Posts: 2
Joined: Fri Mar 19, 2021 5:15 pm

How to debug hanging processes?

Post by drchazewaz »

Hello, I'm submitting a batch of jobs:


    for s in file_list:
        job = cluster.submit(s, **kwargs)

    while cluster._pending_jobs > 0:
        print('Pending Jobs', cluster._pending_jobs)
        time.sleep(2)

...and the pending jobs are not completing. I'm trying 10 as a test, and my loop keeps saying there are 10 jobs left. I've got the HTML monitor going too, and it always says Jobs Submitted : 0, Jobs Done : 0, Jobs Pending : 0.

I've verified that the function that cluster() is running works locally on its own without dispy. How do I diagnose why the cluster is hanging?
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: How to debug hanging processes?

Post by Giri »

It looks like the dispy job scheduler didn't find any nodes. If so, there are several things to check:
  1. Make sure the IP address that dispynode uses is a network address the scheduler can connect to. For example, if the address announced by dispynode (or by the scheduler / client) is '127.0.0.1' (with IPv4), they can't find or connect to each other. Typically, addresses should be of the form '192.168.x.y' (though this depends on your network configuration). You can force dispy to use a specific IP address (e.g., with the 'host' parameter).
  2. Either disable the firewall on the nodes and the client, or allow dispy's ports through the firewall (default ports are 61590-61593). dispynode and the scheduler announce their IP address and port when started.
  3. For testing, you can specify the host name / IP address of a node in the 'nodes' parameter of JobCluster (e.g., 'cluster = dispy.JobCluster(compute, nodes=["192.168.2.10"])'). When nodes are listed this way, TCP is used to connect to them; otherwise, UDP broadcast is used to discover nodes, which can sometimes fail (e.g., routers may drop UDP packets when there is a lot of traffic, or when WiFi is used).
  4. Use the 'debug' parameter to see more details; e.g., the scheduler should show that it discovered nodes (with a message such as 'Discovered 192.168.x.y with n cpus').
  5. If you are using httpd, it should list the nodes found. If no nodes are listed under 'Cluster', the scheduler can't submit jobs.
  6. You may know this already, but in case someone else finds this later: it is not advisable to use any variable that starts with an underscore, even for debugging; these are for the implementation only (like 'private' variables in other OO languages).
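
For the firewall check in item 2, one quick test from the client is whether a TCP connection to a node's port can be opened at all. The following is a minimal sketch using only Python's standard library; it is not part of dispy, 'can_connect' is a hypothetical helper name, and the port to test is assumed to be whatever dispynode announces when it starts.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within 'timeout' seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unreachable hosts
        return False
```

For example, 'can_connect("192.168.2.10", 61591)' would test one node; both the address and the port here are placeholders. A False result suggests a firewall or addressing problem. This only exercises TCP, so it says nothing about whether UDP discovery works.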
drchazewaz
Posts: 2
Joined: Fri Mar 19, 2021 5:15 pm

Re: How to debug hanging processes?

Post by drchazewaz »

Thank you very much, a firewall was indeed the problem.

What is the correct way to accomplish my little "how many jobs are left" print statement without using the "_" variables?

Sometimes I get a "None" result for my jobs, and I have to use trial and error until I get a valid result. E.g. this happens when I miss a dependency (in general I'm not sure I'm using the depends= keyword correctly). How do I see into what's happening on the nodes for debugging?

EDIT: Just recently I was getting "None" returned on jobs because input data files didn't exist on the nodes (but did exist on the local machine where I was testing code). It took me a while to figure this out - how could I have figured this out through the dispy implementation of the code?
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: How to debug hanging processes?

Post by Giri »

You can pass the parameter 'loglevel=dispy.logger.DEBUG' to JobCluster or SharedJobCluster to debug issues and see progress (e.g., how many jobs are submitted or pending).

You can also call the 'cluster.status()' method at any time (e.g., periodically) to get the current status, although it is easier to view status with httpd.
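
As a rough sketch of the "how many jobs are left" loop without underscore variables: the helper below polls 'cluster.status()'. It assumes the returned status object has a 'jobs_pending' attribute (that is my recollection of dispy's ClusterStatus, so please check it against your version); 'wait_for_jobs' and its parameters are hypothetical names used only for illustration.

```python
import time


def wait_for_jobs(cluster, poll_interval=2.0, report=print):
    """Poll cluster.status() until no jobs are pending, reporting the count each time."""
    while True:
        # Assumption: the status object exposes the pending count as 'jobs_pending'
        pending = cluster.status().jobs_pending
        if pending <= 0:
            return
        report("Pending jobs: %d" % pending)
        time.sleep(poll_interval)
```

After submitting all jobs, calling 'wait_for_jobs(cluster)' would replace the original while loop. If you don't need the progress output, 'cluster.wait()' should also block until the submitted jobs are done.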

When a job is finished, it is a good idea to check its status before accessing the result. For example, see 'node_status.py' in the examples, which checks whether 'job.status == dispy.DispyJob.Finished' and accesses the result if so. Otherwise, you can get the exception trace, error output etc. from 'job.exception', 'job.stderr' etc.
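
To illustrate the status check, here is a minimal sketch (not taken from 'node_status.py'). 'describe_job' is a hypothetical helper; it only reads the 'status', 'result', 'exception', 'stderr' and 'id' attributes of a job, and the caller passes in the value that means "finished", which with dispy would be 'dispy.DispyJob.Finished'.

```python
def describe_job(job, finished_status):
    """Return a short summary of a completed job, including error details on failure."""
    if job.status == finished_status:
        return "job %s finished: %r" % (job.id, job.result)
    # On failure the result is not meaningful; the exception trace and
    # stderr output are the places to look (e.g. for a missing input file).
    return "job %s failed (status %s)\nexception: %s\nstderr: %s" % (
        job.id, job.status, job.exception, job.stderr)
```

With dispy this could be used as 'print(describe_job(job, dispy.DispyJob.Finished))' once the job has completed. A job that returns None because an input file is missing on the node should then show the exception trace instead of just None.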