Posts
PACE quarterly maintenance – January ’15
Hi everybody,
Our January maintenance window is upon us. We’ll have PACE clusters down Tuesday and Wednesday next week, January 13 and 14. We’ll get started at 6:00am on Tuesday, and have things back to you as soon as possible.
Major items this time around include:
- routine patching on our DDN system that serves project directories for many users.
- routine patching on the file server that provides the storage for /usr/local and our virtual machine infrastructure (including most head nodes)
- firmware updates on some NFS project directory servers to address stability issues
Additionally, the Joe and Atlas cluster users have graciously offered to test out an upgraded version of the Moab/Torque scheduler software. Presuming we have success with these two clusters, we will look to roll out the upgrades to the rest of the PACE universe during our April maintenance period. If you use clusters other than Atlas and Joe, the rest of this announcement will not affect you next week. Users of Atlas and Joe can expect the following:
- The new version uses a different database than the current one, so we will not be able to migrate submitted jobs. The scheduler will start with an empty queue, and you will need to resubmit your jobs after the maintenance day (sorry for this inconvenience).
- We will start using “node packing”, which places as many jobs on a node as possible before moving on to the next one. With the current version, users can submit many single-core jobs, each landing on a separate node, making it more difficult for the scheduler to start jobs that require entire nodes.
- You will be able to use msub for interactive jobs (which is currently broken due to a bug), although the vendor recommends using “qsub” for everything (we confirmed that it’s much faster than msub).
- There will no longer be a discrepancy between job IDs generated by msub (Moab.###) and qsub (####). You will always see a single job ID (in plain number format) regardless of your msub/qsub preference.
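To illustrate the node packing idea described above, here is a minimal sketch in Python. It is not how Moab/Torque actually implements placement; the function names (`place_jobs`, `whole_nodes_free`) and the cluster size (4 nodes with 8 cores each) are hypothetical and chosen only for illustration. It compares a "packed" first-fit placement with a "spread" placement that always picks the emptiest node.

```python
def place_jobs(jobs, num_nodes, cores_per_node, pack=True):
    """Place jobs (given as core counts) on nodes; return free cores per node.

    pack=True  : first-fit, fill the lowest-numbered node that has room.
    pack=False : spread, pick the node with the most free cores.
    Hypothetical helper for illustration, not a real scheduler API.
    """
    free = [cores_per_node] * num_nodes
    for cores in jobs:
        candidates = [i for i, f in enumerate(free) if f >= cores]
        if not candidates:
            continue  # no room anywhere; a real scheduler would queue the job
        if pack:
            node = candidates[0]
        else:
            node = max(candidates, key=lambda i: free[i])
        free[node] -= cores
    return free


def whole_nodes_free(free, cores_per_node):
    """Count nodes that are still completely empty."""
    return sum(1 for f in free if f == cores_per_node)


# Four single-core jobs on a hypothetical 4-node, 8-core-per-node cluster.
jobs = [1, 1, 1, 1]
packed = place_jobs(jobs, num_nodes=4, cores_per_node=8, pack=True)
spread = place_jobs(jobs, num_nodes=4, cores_per_node=8, pack=False)

print("packed:", packed, "empty nodes:", whole_nodes_free(packed, 8))
print("spread:", spread, "empty nodes:", whole_nodes_free(spread, 8))
```

In this toy case, packing leaves three nodes completely free for jobs that need entire nodes, while spreading leaves none.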
Other improvements included in the scheduler upgrade:
- Speed – new versions of Moab and Torque are now multithreaded, making it possible for some query commands (e.g. showq) to return instantly regardless of the load on the scheduler. Currently, when a user submits a large job array, these commands usually time out.
- Introduction of cpusets. When a user is given X cores, they will not be able to use more than that. Currently, users can easily exceed their requested limits by spawning more processes and threads, and Torque cannot do much to stop that. This will significantly reduce job interference and allow us to finally use ‘node packing’ as explained above.
- Several other bug fixes and improvements, addressing issues such as zombie processes, lost output files, missing array jobs, and long job allocation times.
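With cpusets in place, a job that sizes its thread or process pool from the total core count of the node will oversubscribe the cores it was actually given. A minimal sketch in Python of sizing a pool from the allowed cores instead follows. It assumes the cpuset shows up in the process's CPU affinity mask, which is the usual behavior on Linux; `usable_cores` is a hypothetical helper name, not part of Torque or Moab.

```python
import os


def usable_cores():
    """Return how many cores this process is allowed to run on.

    On Linux, os.sched_getaffinity(0) reports the CPUs available to the
    current process. On platforms without that call, fall back to the
    total core count.
    """
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


# Size worker pools from the allowed cores, not from the whole machine.
n_workers = usable_cores()
print(f"node has {os.cpu_count()} cores; this process may use {n_workers}")
```

Inside a job that requested fewer cores than the node has, the second number would be expected to be the smaller one.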
We hope these improvements will provide you with a more efficient and productive computing environment. Please let us know (pace-support@oit.gatech.edu) if you have any concerns or questions regarding this maintenance period.
COMSOL 5.0 and Application Builder Workshop in Atlanta, GA (1/29)
You’re invited to our free workshop focusing on
COMSOL Multiphysics® version 5.0 and its new features
and additional capabilities. This event will take place on
Thursday, January 29th at Georgia Tech in Atlanta. All attendees
will receive a free two-week trial of the software.
During the workshop you will:
– Learn the fundamental modeling steps in COMSOL Multiphysics
– Set up and solve a simulation through a hands-on exercise
– Convert existing COMSOL models into Apps using the
COMSOL Application Builder
AM Session
————–
Program:
8:45am – 9:00am Registration
9:00am – 10:30am Simulations in COMSOL Multiphysics 5.0
10:30am – 10:45am Coffee Break
10:45am – 12:00pm Hands-on Tutorial
PM Session
————–
12:45pm – 1:00pm Registration
1:00pm – 2:30pm Simulations in COMSOL Multiphysics 5.0
2:30pm – 2:45pm Coffee Break
2:45pm – 4:00pm Hands-on Tutorial
Event details and registration: http://comsol.com/c/1izr
Seating is limited, so advance registration is recommended.
Feel free to contact me with any questions.
Contact information:
Miraj Desai
COMSOL, Inc.
1 New England Executive Park
Burlington, MA 01803
781-273-3322
miraj.desai@comsol.com
Free NVIDIA qwiklab tokens for GT Researchers!
In collaboration with NVIDIA, we are happy to announce the availability of *free* tokens for qwiklab classes for PACE users.
You can find a full list of available labs including CUDA (basics & expert), OpenACC and using GPUs in Matlab here:
https://nvidia.qwiklab.com/lab_catalogue
One nice thing about these labs is that they utilize Amazon Web Services (AWS) to provide each person with a GPU node, so you can try all examples and play with the codes on-the-fly, requiring nothing but a browser.
We currently have 40 tokens in total (NVIDIA promised more if there is demand) so please make sure you make good use of all the tokens that you requested and return them if you end up not taking the class. We allow for 4 tokens at a time, but you can definitely request more after you use all of them.
Here are the steps you need to follow:
1. Register at https://nvidia.qwiklab.com using your GT email
2. Email pace-support@oit.gatech.edu and specify:
– the email you used for your registration
– your PACE username
– number of tokens (4 max)
3. Repeat step #2 as you need more tokens
4. Provide feedback (each class should have a survey and NVIDIA folks repeatedly stated their interest in hearing from GT researchers)
Current tokens expire on June 3rd 2015.
Happy new year!
Georgia Tech mention in HPCWire Intel IPCC article
From the article:
Georgia Tech is conducting research that seeks to modernize quantum chemistry codes used in materials science. By designing a parallel code called GTFock, scientists can closely predict properties of materials using fundamental physical principles. This allows scalability to previously unattainable numbers of computing nodes. The team at Georgia Tech ran large batches of code on the Tianhe-2, one of the world’s most powerful computers, along with two Xeon Phi coprocessors. The experiment produced computations using more than 1.6 million cores, all working in parallel.
The code, GTFock, is developed by Xing Liu, Aftab Patel, and Associate Professor Edmond Chow of the School of Computational Science and Engineering, with assistance from Professor David Sherrill of the School of Chemistry and Biochemistry.
Original article found here:
http://www.hpcwire.com/off-the-wire/intel-piece-reveals-details-ipccs-penn-state-university-oregon-georgia-tech/?utm_source=rss&utm_medium=rss&utm_campaign=intel-piece-reveals-details-ipccs-penn-state-university-oregon-georgia-tech
Free Linux 101 Course
We at PACE are offering a beginning course on Linux. The target audience is those who have little or no Linux experience and need to start using PACE clusters for their research.
Date: 11/14/2014
Time: 10:00 am to 12:00pm
Location: Clough Undergraduate Learning Commons 262
Topics:
What is Linux?
Why use Linux?
Access to Linux
Common Commands on Linux
Editors
How to use man pages
Linux Usage Tips
Module usage on PACE
Please register for the course at the following link:
http://trains.gatech.edu/courses/index#view-12863
-PACE Team
Free Supercomputing in Plain English Workshop, Spring 2015
Free Supercomputing in Plain English (SiPE)
Available live in person and live via videoconferencing
These workshops focus on fundamental issues of High Performance
Computing (HPC) as they relate to Computational and Data-enabled
Science & Engineering (CDS&E), including:
* overview of HPC;
* the storage hierarchy;
* instruction-level parallelism;
* high performance compilers;
* shared memory parallelism (e.g., OpenMP);
* distributed parallelism (e.g., MPI);
* HPC application types and parallel paradigms;
* multicore optimization;
* high throughput computing;
* accelerator computing (e.g., GPUs);
* scientific and I/O libraries;
* scientific visualization.
Tuesdays starting Jan 20 2015, 1:30pm Central Time
(3:30pm Atlantic, 2:30pm Eastern, 12:30pm Mountain, 11:30am Pacific)
Live in person: Stephenson Research & Technology Center boardroom,
University of Oklahoma Norman campus
Live via videoconferencing: details to be announced
Registration coming soon!
http://www.oscer.ou.edu/education/
So far, the SiPE workshops have reached over 1500 people at
248 institutions, agencies, companies and organizations in
47 US states and territories and 10 other countries:
* 178 academic institutions;
* 29 government agencies;
* 26 private companies;
* 15 not-for-profit organizations.
SiPE is targeted at an audience of not only computer scientists
but especially scientists and engineers, including a mixture of
undergraduates, graduate students, faculty and staff.
The key philosophy of the SiPE workshops is that an HPC-based code
should be maintainable, extensible and, most especially, portable
across platforms, and should be sufficiently flexible that it can
adapt to, and adopt, emerging HPC paradigms.
Prerequisite:
1 semester of programming experience and/or coursework in any of
Fortran, C, C++ or Java, recently
Major Storage Issue (why were the head nodes unavailable?)
Yesterday (10/26) in the early evening (4:50pm), one of our primary storage units suffered a serious crash (a page fault in the kernel, if you want more detail), which took offline a good share of the storage supporting our VM infrastructure. Since most of the head nodes we run are in fact VMs, the head nodes themselves started having problems handling new job requests and allowing logins.
Please note that jobs which had already been submitted were not affected. Only jobs that were in the process of being submitted between 4:50pm yesterday and 8:30am this morning were affected.
We have restored functionality to this array and will be submitting tickets with the vendor shortly to evaluate what has occurred on the machine, and any remediations we can apply. We may need to reboot the head nodes affected by this to get them to their proper state as well, but we are evaluating where we are before making that call.
UPDATE 1:
Unfortunately, upon review, we will have to restart the head node VMs, and that process will start immediately so that folks can submit jobs as soon as possible.
UPDATE 2:
With the vendor's help, we have identified the likely cause of this problem. The permanent fix requires a reboot, which would interrupt service right now, so we will apply it during our January maintenance. Thankfully, a work-around for the bug that does not require a reboot is available and should keep the system stable until then. We have now applied that work-around.
PACE clusters ready for research
Greetings!
Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.
In general, all tasks were successfully completed. However, we do have some compute nodes that have not successfully applied the kernel update. We will keep those offline for the moment and continue to work through those tomorrow.
As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!
quarterly maintenance underway
Scheduled maintenance has begun. Please see our previous post for details.