Scheduler Setup - Every first run fails, every second works

Questions, issues regarding dispy / pycos

Moderator: admin

odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

Hi

I have set up a separate scheduler. Currently every first (third, fifth, ...) run fails and every second (fourth, sixth, ...) run works. The dispy version is the one from pip.

For testing I use a modified example from the documentation:

Code: Select all

#!/usr/bin/python3

def compute(n):
    import time, socket
    time.sleep(n)
    host = socket.gethostname()
    return (host, n)

if __name__ == '__main__':
    import dispy, random
    import time
    cluster = dispy.SharedJobCluster(compute, host='a.b.c.1', client_port=0, scheduler_host='a.b.c.1')
    time.sleep(5)
    jobs = []
    for i in range(2000):
        # schedule execution of 'compute' on a node (running 'dispynode')
        # with a parameter (random number in this case)
        job = cluster.submit(1)  # random.randint(5, 20)
        # job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    cluster.wait() # waits for all scheduled jobs to finish
    for job in jobs:
        host, n = job() # waits for job to finish and returns results
        print('%s executed job %s at %s with %s' % (host, job.id, job.start_time, n))
        # other fields of 'job' that may be useful:
        # print(job.stdout, job.stderr, job.exception, job.ip_addr, job.start_time, job.end_time)
    cluster.print_status()
    cluster.close()

It fails with the following error message:

Code: Select all

2021-11-24 09:40:46 dispy - Creating job for "(5,)", "{}" failed with "Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/dispy/__init__.py", line 3330, in submit_job_id_node
    _job.uid = deserialize(msg)
  File "/usr/local/lib/python3.9/dist-packages/pycos/__init__.py", line 85, in deserialize
    return pickle.loads(pkl)
EOFError: Ran out of input
"
Traceback (most recent call last):
  File "/home/beekeeper/data/./test3.py", line 24, in <module>
    host, n = job() # waits for job to finish and returns results
TypeError: 'NoneType' object is not callable
Scheduler and client are running on the same machine.
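Since `cluster.submit` returned None for the failed job, the later `job()` call raised the secondary `TypeError: 'NoneType' object is not callable`. A minimal sketch of defensive result collection (plain Python, no dispy required; `collect` and the fake job handles are hypothetical stand-ins, not dispy API):

```python
# Sketch: skip failed submissions instead of crashing on the secondary
# TypeError. In dispy, a failed submit() returns None, while a good
# DispyJob handle is callable and blocks until the result is ready.

def collect(jobs):
    """Return results of successful jobs and a count of failed submissions."""
    results, failed = [], 0
    for job in jobs:
        if job is None:        # submit() failed on the scheduler side
            failed += 1
            continue
        results.append(job())  # waits for the job and returns its result
    return results, failed

# Hypothetical stand-ins for DispyJob handles (callables returning (host, n)):
fake_jobs = [lambda: ('WorkerBee0', 1), None, lambda: ('WorkerBee1', 1)]
results, failed = collect(fake_jobs)
print(results, failed)  # → [('WorkerBee0', 1), ('WorkerBee1', 1)] 1
```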

The scheduler setup:

Code: Select all

dispyscheduler.py -d --httpd --host=a.b.c.1 --node_secret 123456 --fair_cluster_scheduler
The node setup:

Code: Select all

dispynode.py --clean --scheduler_host=a.b.c.1 --secret=123456 --admin_secret 654321
Last edited by odo2063 on Wed Nov 24, 2021 10:07 am, edited 1 time in total.
odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Re: Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

By the way, is help needed with keeping the documentation up to date?
odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Re: Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

What confuses me is that, when it runs, the scheduler/client shows up in the node list:

Code: Select all

            Node |  CPUs |    Jobs |  Sec/Job | Node Time Sec |    Sent |    Rcvd
---------------------------------------------------------------------------------
       a.b.c.1   |     0 |       0 |      0.0 |           0.0 |     0 B |     0 B
      WorkerBee2 |    16 |      25 |      5.0 |         125.2 | 434.0 K |   6.1 K
      WorkerBee6 |    16 |      25 |      5.0 |         125.2 | 434.9 K |   6.1 K
      WorkerBee5 |    16 |      25 |      5.0 |         125.2 | 436.0 K |   6.1 K
      WorkerBee0 |    16 |      25 |      5.0 |         125.2 | 472.5 K |   6.1 K
      WorkerBee4 |    16 |      25 |      5.0 |         125.2 | 433.2 K |   6.1 K
      WorkerBee3 |    16 |      25 |      5.0 |         125.2 | 438.4 K |   6.1 K
      WorkerBee7 |    16 |      25 |      5.0 |         125.2 | 434.1 K |   6.1 K
      WorkerBee1 |    16 |      25 |      5.0 |         125.2 | 434.1 K |   6.1 K

Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Scheduler Setup - Every first run fails, every second works

Post by Giri »

I tested your program and setup as described above. Apparently using 'httpd' with dispyscheduler doesn't work. I will fix the issue and send you a patch in the next couple of days. In the meantime, you can try without the 'httpd' option to dispyscheduler and see if that works (for now).
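For reference, the scheduler invocation from above with only the 'httpd' option dropped (everything else unchanged) would be:

```shell
dispyscheduler.py -d --host=a.b.c.1 --node_secret 123456 --fair_cluster_scheduler
```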
odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Re: Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

Well, that seems to work (for now)... I thought --httpd was just for information...
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Scheduler Setup - Every first run fails, every second works

Post by Giri »

The 'httpd' option to dispyscheduler can be used to manage the cluster (add, remove, adjust nodes) used by the scheduler, as well as the jobs of all users, so a sysadmin has control over all nodes and jobs. Users (programs) can also use httpd to manage their own clusters / jobs.
odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Re: Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

There seems to be something else wrong with the scheduler (or with my understanding of how the scheduler works :oops: ) (or I am the only one trying to use dispyscheduler and now all the bugs come out :lol: ).

I tried a long-running program in parallel with a short-running program, but the short-running one never got executed. However, once the long-running program had finished, I could run the short-running program again from the same console, while its first invocation was still sitting unexecuted. This happened with all three scheduler paradigms.

Long-running program:

Code: Select all

def compute3(n):
    import time, socket
    time.sleep(n)
    host = socket.gethostname()
    n=2*n
    return (host, n)

if __name__ == '__main__':
    import dispy, random
    import time
    cluster = dispy.SharedJobCluster(compute3, host='a.b.c.1', client_port=0, scheduler_host='a.b.c.1')
    cluster.print_status()
    jobs = []
    for i in range(2000):
        job = cluster.submit(5)  # random.randint(5, 20)
        job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    cluster.wait() # waits for all scheduled jobs to finish
    for job in jobs:
        host, n = job() # waits for job to finish and returns results
        print('%s executed job %s at %s with %s' % (host, job.id, job.start_time, n))
    cluster.print_status()
    cluster.close()

Short-running program:

Code: Select all

def compute2(n):
    import time, socket
    time.sleep(n)
    host = socket.gethostname()
    n=2*n
    return (host, n)

if __name__ == '__main__':
    import dispy, random
    import time
    cluster = dispy.SharedJobCluster(compute2, host='a.b.c.1', client_port=0, scheduler_host='a.b.c.1')
    cluster.print_status()
    jobs = []
    for i in range(20):
        job = cluster.submit(5)  # random.randint(5, 20)
        job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    cluster.wait() # waits for all scheduled jobs to finish
    for job in jobs:
        host, n = job() # waits for job to finish and returns results
        print('%s executed job %s at %s with %s' % (host, job.id, job.start_time, n))
    cluster.print_status()
    cluster.close()

Last messages from the scheduler:

Code: Select all

scheduler_1  | 2021-11-26 13:31:50 dispyscheduler - Received reply for job 140104616614144 from a.b.c.13
scheduler_1  | 2021-11-26 13:31:50 dispyscheduler - Received reply for job 140104616613696 from a.b.c.10
scheduler_1  | 2021-11-26 13:31:50 dispyscheduler - Received reply for job 140104616613360 from a.b.c.13
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.16 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.17 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.14 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.15 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.13 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.10 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.11 for compute3 / 140104615836160
scheduler_1  | 2021-11-26 13:31:51 dispyscheduler - Closing node a.b.c.12 for compute3 / 140104615836160
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Scheduler Setup - Every first run fails, every second works

Post by Giri »

Committed the fix for 'httpd' option to dispyscheduler in github. Follow installation instructions there to build / install package from sources.

dispyscheduler, by default, executes computations in the order they were submitted. If client C1 submits all its jobs and then client C2 submits jobs, the scheduler finishes the jobs of C1 before running the jobs of C2. It has been some time since I looked into job schedulers, but I think the fair scheduler may run jobs of different clients simultaneously. Let me know if that doesn't work.
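The difference can be illustrated with a toy ordering sketch (plain Python; this is an assumption-laden model, not dispy's actual implementation): with default FIFO ordering, C2's jobs run only after all of C1's, while a fair scheduler round-robins between the two clients.

```python
from collections import deque
from itertools import zip_longest

# Toy model: client C1 submits 6 jobs first, client C2 submits 2 afterwards.
c1 = deque(f'C1-{i}' for i in range(6))
c2 = deque(f'C2-{i}' for i in range(2))

# Default (FIFO) ordering: all of C1's jobs before any of C2's.
fifo_order = list(c1) + list(c2)

# Fair ordering: round-robin between clients while both have jobs pending.
fair_order = [job for pair in zip_longest(c1, c2)
              for job in pair if job is not None]

print(fifo_order[6])  # → C2-0  (C2 starts only after C1's 6 jobs finish)
print(fair_order[1])  # → C2-0  (C2's first job is interleaved immediately)
```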
odo2063
Posts: 11
Joined: Sat Jan 23, 2021 9:27 am

Re: Scheduler Setup - Every first run fails, every second works

Post by odo2063 »

"httpd" seems to work. Thank you very much!

It also shows both cluster jobs, and it shows the nodes for the long-running process (cluster job), but never any nodes for the second process, although that cluster job itself is shown.
Giri
Site Admin
Posts: 58
Joined: Sun Dec 27, 2020 5:35 pm

Re: Scheduler Setup - Every first run fails, every second works

Post by Giri »

'Cluster' shows the nodes for the selected cluster (at the top of the page). I think if you select the second cluster, it should show the nodes for that cluster. Alternatively, you can select cluster '*', in which case it will show all nodes for all clusters.