LaVision DaVis PIV software?

Is anyone familiar with running LaVision’s DaVis PIV software on an HPC cluster? I have a user at my site who is trying to do so, but we are getting a series of strange errors and the worker processes all hang. We are trying to use OpenMPI, but everything just sits there, so we are missing a step somewhere.

The log file contains:

// DaVis Worker 10.2.1 (10.2.1.90329L-64) CL log file date: 10.08.23 time: 12:51:24

and that is as far as the worker gets before going to sleep. It appears to be waiting on a communication event that will never occur. The main process was started with srun (this is a SLURM environment).

This hang is most likely caused by a communication problem between the worker processes. Here are some steps you can take to diagnose and resolve the issue:

  1. Check MPI Configuration and Environment:
  • Ensure that OpenMPI is properly installed and configured on your cluster nodes.
  • Verify that the necessary environment variables for OpenMPI are set correctly. Common variables include LD_LIBRARY_PATH and PATH.
  2. SLURM Configuration:
  • Ensure that your SLURM job script is correctly configured to run your DaVis application. Make sure you have specified the correct number of tasks (--ntasks), nodes (--nodes), and tasks per node (--ntasks-per-node); see the sketch of a job script after this list.
  • Check that the SLURM environment itself is properly set up. Any misconfiguration here can affect how the job runs.
  3. Networking and Firewalls:
  • Ensure that all necessary ports for MPI communication are open and accessible between the cluster nodes.
  4. Debugging Options:
  • OpenMPI provides various debugging options to help identify issues. For example, you can use the --debug-devel flag when launching with mpirun to get more detailed debug information.
  • Consider using tools like ompi_info to inspect your OpenMPI installation and gather information about available communication methods.
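For reference, here is a minimal sketch of the kind of SLURM batch script that covers points 1, 2, and 4 above. The module name, install paths, worker executable name, and the --mpi plugin are assumptions for illustration and will almost certainly differ on your site.

```bash
#!/bin/bash
#SBATCH --job-name=davis-piv
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks=16
#SBATCH --time=02:00:00
#SBATCH --output=davis_%j.out
#SBATCH --error=davis_%j.err

# Load the MPI stack the DaVis workers were built against (module name is site-specific).
module load openmpi

# Put the DaVis worker binaries and libraries on the search paths
# (the install prefix below is a placeholder).
export PATH=/opt/davis/bin:$PATH
export LD_LIBRARY_PATH=/opt/davis/lib:$LD_LIBRARY_PATH

# Record the OpenMPI installation details in the job log for later debugging.
ompi_info | head -n 30

# Launch the workers with srun; the correct --mpi plugin (pmix vs. pmi2)
# depends on how your SLURM and OpenMPI were built.
srun --mpi=pmix davis_worker
```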

Without more insight from the error logs it is hard to pinpoint the issue. Start by looking at the SLURM job output and error files. If all else fails, contacting the cluster administrators is probably your best bet.
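For a stuck job, something along these lines usually tells you whether the workers are still allocated and what, if anything, they have written (the job ID and file names below are placeholders matching the sketch above):

```bash
# Is the job still running, and on which nodes?
squeue -j <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,NodeList

# Check the tail of the stdout/stderr files named in the #SBATCH directives.
tail -n 50 davis_<jobid>.out davis_<jobid>.err
```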

I’m going to answer my own question here so that people who hit this in future searches can see a solution.

It turns out that the DaVis worker was not properly set up on the cluster.

There is a directory under $HOME/.config that contained a log file indicating that DaVis could not read the license dongle. A dongle check should not happen for cluster back-end processing. However, rather than shutting down with an error message, the worker just sat there in a sleep state, occupying time and memory.
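If you hit the same symptom, something like this is how to spot the message; the exact subdirectory and log file names under $HOME/.config vary with the DaVis version, so treat the patterns below as illustrative:

```bash
# Find files DaVis wrote recently under ~/.config (last two hours here).
find "$HOME/.config" -type f -mmin -120 -iname '*davis*'

# Search those per-user logs for dongle/license messages that never reach the job output.
grep -ril -e dongle -e license "$HOME/.config"
```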

To correct the license-dongle lookup issue, we re-ran the setup script that came with the Linux worker code. After that, DaVis ran as expected and the workers no longer locked up. We now have a different error, but that one has more to do with the problem setup than with a configuration issue.