Posts

Free Linux 101 Course

We at PACE are offering an introductory course on Linux. The target audience is those who have little or no Linux experience and need to start using the PACE clusters for their research.

Date: 11/14/2014
Time: 10:00am to 12:00pm
Location: Clough Undergraduate Learning Commons 262

Topics:
What is Linux?
Why use Linux?
Access to Linux
Common Commands on Linux
Editors
How to use man pages
Linux Usage Tips
Module usage on PACE

Please register for the course at the following link:

http://trains.gatech.edu/courses/index#view-12863

- PACE Team

Free Supercomputing in Plain English Workshop, Spring 2015

Free Supercomputing in Plain English (SiPE)
Available live in person and live via videoconferencing

These workshops focus on fundamental issues of High Performance
Computing (HPC) as they relate to Computational and Data-enabled
Science & Engineering (CDS&E), including:

* overview of HPC;
* the storage hierarchy;
* instruction-level parallelism;
* high performance compilers;
* shared memory parallelism (e.g., OpenMP);
* distributed parallelism (e.g., MPI; a short sketch of both OpenMP and MPI follows this list);
* HPC application types and parallel paradigms;
* multicore optimization;
* high throughput computing;
* accelerator computing (e.g., GPUs);
* scientific and I/O libraries;
* scientific visualization.
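
As a very small taste of the two parallel programming models named above, here is a hybrid hello-world in C: MPI starts several processes that can run on different nodes, while OpenMP spawns threads inside each process. This is not workshop material, just an illustrative sketch; the mpicc wrapper and mpirun launcher mentioned below are typical names and may differ on a given system.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Distributed-memory parallelism: multiple MPI processes,
           possibly spread across several compute nodes. */
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        /* Shared-memory parallelism: OpenMP threads inside each process. */
        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

A typical build and run might look like "mpicc -fopenmp hello.c -o hello" followed by "mpirun -np 2 ./hello".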

Tuesdays starting Jan 20 2015, 1:30pm Central Time
(3:30pm Atlantic, 2:30pm Eastern, 12:30pm Mountain, 11:30am Pacific)

Live in person: Stephenson Research & Technology Center boardroom,
University of Oklahoma Norman campus

Live via videoconferencing: details to be announced

Registration coming soon!

http://www.oscer.ou.edu/education/

So far, the SiPE workshops have reached over 1500 people at
248 institutions, agencies, companies and organizations in
47 US states and territories and 10 other countries:

* 178 academic institutions;
* 29 government agencies;
* 26 private companies;
* 15 not-for-profit organizations.

SiPE is targeted not only at computer scientists but especially at
scientists and engineers, including a mixture of undergraduates,
graduate students, faculty and staff.

The key philosophy of the SiPE workshops is that an HPC-based code
should be maintainable, extensible and, most especially, portable
across platforms, and should be sufficiently flexible that it can
adapt to, and adopt, emerging HPC paradigms.

Prerequisite:
One recent semester of programming experience and/or coursework in any of
Fortran, C, C++ or Java

Major Storage Issue (why were the head nodes unavailable?)

Yesterday (10/26), in the early evening (4:50pm), it appears one of our primary storage units decided to have a serious crash (a page fault in the kernel, if you want more detail), which took offline a good share of the storage allocated to supporting our VM infrastructure. Since most of the head nodes we run are in fact VMs, this of course meant that the head nodes themselves started having problems handling new job requests and allowing logins.

Please note that jobs which had already been submitted were not affected; only jobs in the process of being submitted between roughly 4:50pm yesterday and 8:30am this morning were impacted.

We have restored functionality to this array and will be submitting tickets with the vendor shortly to evaluate what has occurred on the machine, and any remediations we can apply. We may need to reboot the head nodes affected by this to get them to their proper state as well, but we are evaluating where we are before making that call.

UPDATE 1:
Unfortunately, upon review, we will have to restart the head node VMs, and that process will start immediately so that folks can submit jobs as soon as possible.

UPDATE 2:
With the engagement of the vendor, we have identified the likely cause of this problem, which will ultimately be addressed during our January maintenance since the full fix requires a reboot (which would be service-interrupting right now). Thankfully, a work-around for the bug is available that we could apply without a reboot, and it should keep the system stable until then. We have now enacted that work-around.

PACE clusters ready for research

Greetings!

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

In general, all tasks were successfully completed.  However, we do have some compute nodes that have not successfully applied the kernel update.  We will keep those offline for the moment and continue to work through those tomorrow.

As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!

PACE quarterly maintenance – October ’14

Hi everybody,

Our October maintenance window is rapidly approaching.  We’ll be back to the normal two-day event this time around – Tuesday, October 21 and Wednesday, October 22.

Major items this time around include the continued expansion of our DDN storage system.  This will complete the build out of the infrastructure portions of this storage system, allowing for the addition of another 600 disk drives as capacity is purchased by the faculty.

Also, we have identified a performance regression in the kernel deployed with RedHat 6.5.  With some assistance from one of our larger clusters, we have been testing an updated kernel that does not exhibit these performance problems, and will be rolling out the fix everywhere.  If you’ve noticed your codes taking longer to run since the maintenance in July, this is very likely the cause.

We will also be migrating components of our server infrastructure to RedHat 6.5.  This should not be a user visible event, but worth mentioning just in case.

Over the last few months, we’ve identified a few bad links in our network.  Fortunately, we have redundancy in place that allowed us to simply disable those links.  We will be taking corrective actions on these to bring those links back to full redundancy.

PACE clusters ready for research

Greetings!

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

In general, all tasks were successfully completed.  However, we do have some compute nodes that are still having various issues.  We will continue to work through those tomorrow.

As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!


PACE quarterly maintenance – July ’14

2014-07-15 at 6am: Maintenance has begun

Hi folks,

It is time again for our quarterly maintenance. We have a bit of a situation this time around and will need to extend our activities into a third day – starting at 6:00am Tuesday, July 15 and ending at or before 11:59pm Thursday, July 17. This is a one-time event, and I do not expect to move to three-day maintenance as a norm. Continue reading below for more details.

Over the years, we’ve grown quite a bit and filled up one side of our big Infiniband switch. This is a good thing! The good news is that there is plenty of expansion room on the other side of the switch. The bad news is that we didn’t leave a hole in the raised floor to get the cables to the other side of the switch. In order to rectify this, and install all of the equipment that was ordered in June, we need to move the rack that contains the switch as well as some HVAC units on either side. In order to do this, we need to unplug a couple hundred Infiniband connections and some ethernet fiber. Facilities will be on hand to handle the HVAC. After all the racks are moved, we’ll swap in some new raised-floor tiles and put everything back together. This is a fair bit of work, and is the impetus for the extra day.

In addition, we will be upgrading all of the RedHat 6 compute nodes and login nodes from RHEL6.3 to RHEL6.5 – this represents nearly all of the clusters that PACE manages. This image has been running on the TestFlight cluster for some time now – if you haven’t taken the opportunity to test your codes there, please do so. This important update contains some critical security fixes to go along with the usual assortment of bug fixes.

We are also deploying updates to the scheduler prologue and epilogue scripts to more effectively combat “leftover” processes from jobs that don’t completely clean up after themselves. This should help reduce situations where jobs aren’t started because compute nodes incorrectly appear busy to the scheduler.
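
For the curious, the basic idea behind this kind of epilogue cleanup is simple: after a job ends on a compute node, walk the process table and kill anything still owned by the job's user. The C sketch below only illustrates that idea; it is not PACE's actual epilogue, the uid argument stands in for whatever the scheduler passes to its epilogue scripts, and a production version would be far more careful about what it spares.

    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <uid-of-job-owner>\n", argv[0]);
            return 1;
        }
        uid_t target = (uid_t)atoi(argv[1]);

        DIR *proc = opendir("/proc");
        if (proc == NULL) {
            perror("opendir /proc");
            return 1;
        }

        struct dirent *entry;
        while ((entry = readdir(proc)) != NULL) {
            char *end;
            long pid = strtol(entry->d_name, &end, 10);
            if (*end != '\0' || pid <= 1)
                continue;                       /* skip non-process entries */

            char path[64];
            snprintf(path, sizeof path, "/proc/%ld", pid);
            struct stat st;
            if (stat(path, &st) != 0)
                continue;                       /* process already exited */

            /* /proc/<pid> is owned by the process's user; kill leftovers.
               A real epilogue would be more selective (e.g. sparing system
               daemons and the scheduler's own helpers). */
            if (st.st_uid == target && (pid_t)pid != getpid())
                kill((pid_t)pid, SIGKILL);
        }

        closedir(proc);
        return 0;
    }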

We will also be relocating some storage servers to prepare for incoming equipment. There should be no noticeable impact from this move, but just in case, the following filesystems are involved:

  • /nv/pase1
  • /nv/pb2
  • /nv/pc6
  • /nv/pcoc1
  • /nv/pface1
  • /nv/pmart1
  • /nv/pmeg1
  • /nv/pmicro1
  • /nv/py2

Among other things, these moves will pave the way for another capacity expansion of our DDN project storage, as well as a new scratch filesystem. Stay tuned for more details on the new scratch, but we are planning a significant capacity and performance increase. The projected timeframe is to go into limited production during our October maintenance window and ramp up from there.

We will also be implementing some performance tuning changes for the ethernet networks that should primarily benefit the non-GPFS project storage.

The /nv/pase1 filesystem will be moved back to its old storage server, which is now repaired and tested.

The tardis-6 head node will have some additional memory allocated.

And finally, some other minor changes – Firmware updates to our DDN/GPFS storage, as recommended by DDN, as well as installation of additional disks for increased capacity.

The OIT Network Backbone team will be upgrading the appliances that provide DNS & DHCP services for PACE. This should have negligible impact on us, as they have already rolled out new appliances for most of campus.

We will also replace a fuse for the in-rack power distribution in rack H33.

— Neil Bright

Update on widespread drive failures

After some consultation with members of the GT IT community (thank you specifically to Didier Contis for raising awareness of the issue), as well as with our vendor, we have identified the cause of the high rate of disk failures plaguing storage units purchased a little more than a year ago.

An update to the firmware running on the internal backplanes of the storage arrays was necessary, and performance and availability improved greatly immediately after it was applied to the arrays. This backplane firmware is normally manufacturer-maintained material and isn't readily available to the public the way controller firmware is, which led to some additional delays before repairs could be made.

That said, we have retained the firmware and the software used to apply it for future use, should other units have issues.

PACE Lecture Series: New Courses Scheduled

PACE is restarting its lecture and training course offerings.
Over the next two months, we are offering four courses:

  1. Introduction to Parallel Programming with MPI and OpenMP (July 22)
  2. Introduction to Parallel Application Debugging and Profiling (June 17)
  3. A Quick Introduction To Python (June 24)
  4. Python For Scientific Computing (July 8)

For details about where and when, or to register your attendance (each class is limited to 30 seats), visit our PACE Training page.