Rules and Guidelines
There is one rule: Respect your fellow users!
The computing servers are primarily meant to run large, long (non-interactive) jobs. There is only a limited number of compute servers, each with limited resources, and they are shared between everybody in the department. You have to share the available resources fairly and make sure you don't unnecessarily hinder other users. To help protect the active jobs and resources, new logins to a server are automatically disabled when a server becomes overloaded. This means that you will sometimes have to wait for other jobs to finish and at other times may have to kill a job to create space for other users.
The following guidelines will help you follow this rule, and provide some practical hints:
- Connect only directly from the bastion server to the login servers.
Connecting from one server to another creates unwanted load on the server in the middle and its network connection. If either of these fails, you lose the connection to your job.
- Always choose the login server with the lowest use (most importantly system load and memory usage).
See the current resource usage page or the `servers` command for information.
- Only use the storage best suited to your files.
See the file storage page for more information.
- Follow the specific instructions for that server.
Each server displays a message at login. Make sure you understand it before proceeding. This message includes the current load of the server, so look at it at every login.
- Run only one computing or memory intensive job per login server.
Leave enough resources for other users. When the combined number of running threads of all programs exceeds the number of cores in the server, or the combined virtual memory in use exceeds the server's memory, the efficiency of the server will be (severely) reduced.
Most multi-threaded applications (such as Java and Matlab) will automatically use all CPU cores of a server, and thus take processing power away from other jobs. If you can specify the number of threads, set it to at most 25% (¼) of the cores in that server. For a server with 8 cores, use at most 2 threads; this leaves enough processing capacity for other users.
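The 25% rule above can be sketched in a few lines of Python. This is only an illustration: the function name `max_threads` is hypothetical, and it is an assumption that your application honours the `OMP_NUM_THREADS` environment variable (many OpenMP-based numerical libraries do, but Java and Matlab have their own settings).

```python
import os


def max_threads(total_cores):
    """Return a quarter of the cores, rounded down, but never less than 1."""
    return max(1, total_cores // 4)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() may return None
    threads = max_threads(cores)
    # Assumption: the numerical libraries used later read OMP_NUM_THREADS.
    # It must be set before those libraries are imported.
    os.environ["OMP_NUM_THREADS"] = str(threads)
    print(f"{cores} cores detected, using at most {threads} thread(s)")
```

For applications that do not read this variable, look for their own thread-count option and pass the same value.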
- Do not run more than 4 jobs in total spread over all login servers.
Even when all login servers are idle now, other users might want to start a job in 5 minutes, so leave some space for others.
- Actively monitor the status of your jobs and the loads of the servers.
Make sure your job runs normally and is not hindering other jobs. Check the following at the start of a job and thereafter at least twice a day:
- If your job is not working correctly (or halted) because of a programming error, terminate it immediately; debug and fix the problem instead of just trying again (the result will almost certainly be exactly the same).
- If your screen's Kerberos ticket has expired, renew it so your job can successfully save its results.
- Use the `top` program to monitor the CPU (%CPU) and memory (%MEM) usage of your job. If either is too high, kill your job so it doesn't cause problems for other jobs. Do not leave `top` running unless you are continuously watching it; press q to quit.
- Watch the current resource usage page or the `servers` command, and if the server is running close to its limits (higher than 90% server load or memory, swap or disk usage), consider moving your job to a less busy server. If more than half of the servers are at their limits, consider killing one or more jobs to make some space for others.
- The `screen` program will keep your job running when you log out or lose your connection. Simply run `screen program`. For help, execute `man screen` on the server.
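As a rough illustration of the 90% check described above, the following Python sketch compares the load average with the number of cores. It assumes that "server load" can be approximated as the load average divided by the core count; the resource usage page and the `servers` command may compute it differently. The function name `near_limit` is hypothetical, and memory, swap and disk usage are not covered here.

```python
import os


def near_limit(load_average, cores, threshold=0.9):
    """Return True when the load average exceeds threshold * cores."""
    return load_average > threshold * cores


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    if near_limit(five_min, cores):
        print(f"load {five_min:.2f} on {cores} cores: close to the limit")
    else:
        print(f"load {five_min:.2f} on {cores} cores: below the limit")
```

Running this by hand at the start of a job and at each later check is enough; there is no need to run it in a loop.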
- Automate your job.
Prepare a script that runs all necessary steps automatically, so you don't have unnecessary delays and can rerun the job if necessary. Do the interactive pre- and post-processing, including creating and debugging the script, on your own computer as much as possible.
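A minimal sketch of such a driver script, assuming each step of the job can be started as a separate command. The function name `run_steps` and the step commands in the usage comment are hypothetical placeholders. The script stops at the first failing step instead of continuing or retrying, in line with the advice to fix the problem first.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; stop at the first failure.

    Returns 0 if every step succeeded, otherwise the failing exit code.
    """
    for cmd in steps:
        print("running:", " ".join(cmd), flush=True)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("step failed, stopping", file=sys.stderr)
            return result.returncode
    return 0


# Example usage (hypothetical step scripts):
# sys.exit(run_steps([
#     [sys.executable, "preprocess.py"],
#     [sys.executable, "compute.py"],
#     [sys.executable, "save_results.py"],
# ]))
```

Because the driver returns as soon as the last step finishes or a step fails, the job releases its resources without waiting for you to intervene.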
- Save the results frequently.
Your job can crash, the server can become overloaded, or the network shares can become unavailable. Write your code in a modular way, so that you can continue the job from the point where it last saved.
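One way to make a job resumable is to write a small checkpoint file after every unit of work and to read it back at startup. The sketch below is an illustration under assumptions: the state layout, the function names and the `stop_after` parameter (which only simulates an interruption) are all hypothetical, and the running sum stands in for the real computation.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint.

    The rename keeps the previous checkpoint intact if the job dies
    halfway through writing.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, stop_after=None):
    """Process steps 0..n_steps-1, saving a checkpoint after each one."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated interruption
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state
```

Calling `run` again with the same checkpoint path continues from the last saved step rather than starting over. For expensive steps, checkpointing after every step is reasonable; for very cheap steps, saving every N steps reduces the load on the file storage.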
- (Automatically) terminate your jobs when they are done.
Release the used resources so other users can use them. Have the script save the final results to file and exit. Don't forget to exit screen as well.