Posts

upcoming maintenance day, 7/17 – please test your codes

It’s that time of the quarter again: all PACE-managed clusters will be taken offline for maintenance on Tuesday, July 17. Jobs that cannot complete before then will be held by the scheduler and released once the clusters are up and running again, requiring no further action on your end. If you find that a job does not start running, you may want to check its walltime to make sure it does not extend past this date.
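
If you are not sure whether a queued job will finish in time, you can check the walltime it requested, or resubmit with a shorter walltime so that it completes before July 17. The commands below are a rough sketch assuming our usual Torque/Moab tools; the job ID and script name are placeholders:

  qstat -f 123456 | grep Resource_List.walltime     # walltime requested by a queued job
  msub -l walltime=24:00:00 myjob.pbs               # resubmit with a walltime that ends before the outage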

With this maintenance, we are upgrading our RedHat 6 clusters to RedHat 6.2, which includes many bugfixes and performance improvements. This version is known to provide better software and hardware integration with our systems, particularly with the 64-core nodes we have been adding over the last year.

We are doing our best to test existing codes with the new RedHat 6.2 stack. In our experience, codes currently running on our RedHat 6 systems continue to run without problems. However, we strongly recommend that you test your critical codes on the new stack. For this purpose, we have renovated the “testflight” cluster to include RedHat 6.2 nodes, so all you need to do for testing is submit your RedHat 6 jobs to the “testflight” queue. If you would like to recompile your code, please log in to the testflight-6.pace.gatech.edu head node. Please keep problem sizes small, since this cluster includes only ~14 nodes with core counts varying from 16 to 48, plus a single 64-core node. We have limited this queue to two jobs at a time per user. We hope the testflight cluster will be sufficient to test-drive your codes, but if you have any concerns, or notice any problems with the new stack, please let us know at pace-support@oit.gatech.edu.
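
As a rough illustration (adapt it to your own RedHat 6 job), a test submission to the testflight queue might look like the sketch below; the job name, resource requests, and executable are placeholders:

  #PBS -N testflight-check
  #PBS -q testflight
  #PBS -l nodes=1:ppn=16
  #PBS -l walltime=01:00:00
  cd $PBS_O_WORKDIR
  ./my_application        # placeholder for your own executable

Submit it as you normally would (e.g. with msub or qsub), keeping in mind the two-jobs-per-user limit mentioned above.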

We will also upgrade the software on our Panasas scratch storage. We have observed many ‘failover’ events under high load, resulting in brief interruptions of service and potentially incurring performance penalties on running codes. The new version should help address these issues.

We have new storage systems for Athena (/nv/pz2) and Optimus (/nv/pb2). During the maintenance day, we will move these filesystems off the temporary storage and onto their new servers.

More details will be forthcoming on other maintenance day activities, so please keep an eye on our blog at http://blog.pace.gatech.edu/

Thank you for your cooperation!

-PACE Team

Scheduler Problems

The job scheduler is currently under heavy load (heavier than any we have seen so far).

Any commands you run to query the scheduler (showq, qstat, msub, etc.) will probably fail because the scheduler can’t respond in time.

We are working feverishly to correct the problem.

scratch space improvements

While looking into some reports of less-than-desired performance from the scratch space, we have found and addressed some issues.  We were able to enlist the help of a support engineer from Panasas, who helped us identify a few places to improve configurations.  These were applied last week, and we expect to see improvements in read/write speed.

If you notice differences in the scratch space performance (positive or negative!) please let us know by sending a note to pace-support@oit.gatech.edu.

reminder – electrical work in the data center

Just a quick reminder that Facilities will be doing some electrical work in the data center, unrelated to PACE, tomorrow.  We’re not expecting any issues, but there is a remote possibility that this work could interrupt electrical power to various PACE servers, storage and network equipment.

Upcoming Quarterly Maintenance on 4/17

The first quarter of the year has already passed, and it’s time for the quarterly maintenance once again!

Our team will take all the clusters offline for regular maintenance and improvements on 04/17, for the entire day. We have a scheduler reservation in place to hold jobs that cannot complete before the maintenance day, so hopefully no jobs will need to be killed. Jobs with such long wallclock times will remain queued, but they will not be released until the maintenance is over.
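
If you would like to see whether one of your jobs is being held back by this reservation, the standard Moab client commands can usually tell you. A quick sketch, using a placeholder job ID (the exact output varies with the Moab version):

  checkjob 123456       # explains why a particular job has not started
  showq -b -u $USER     # lists your blocked/deferred jobs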

Please direct your concerns/questions to PACE support at pace-support@oit.gatech.edu.

Thanks!

FYI – upcoming datacenter electrical work

In addition to our previously scheduled maintenance day activities next Tuesday, the datacenter folks are scheduling another round of electrical work during the morning of Saturday, 4/21. Like last time, this should not affect any PACE-managed equipment, but just in case….

New rhel6 shared/hybrid queues are ready!

We are happy to announce the availability of shared/hybrid queues for all rhel6 clusters that participate in sharing. Please run “/opt/pace/bin/pace-whoami” to see which of these queues you have access to. We did our best to test and validate these queues, but some issues may still have slipped through. Please contact us at pace-support@oit.gatech.edu if you notice any problems. A short submission example follows the queue list below.

Here’s a list of these queues:

  • mathforce-6
  • critcelforce-6
  • apurimacforce-6
  • prometforce-6 (prometheusforce-6 was too long for the scheduler)
  • eceforce-6
  • cygnusforce-6
  • iw-shared-6
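
As a quick sketch of how to use them: once pace-whoami confirms your access, an existing rhel6 job can be pointed at one of these queues either inside the job script or at submission time. The queue and script name below are just examples; pick a queue from your own list:

  #PBS -q iw-shared-6                 # in your job script, or...
  msub -q iw-shared-6 myjob.pbs       # ...override the queue when submitting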

Happy computing!

Webinar: Parallel Computing with MATLAB on Multicore Desktops and GPUs

Mathworks is offering us a very interesting webinar:

“Parallel Computing with MATLAB on Multicore Desktops and GPUs”

Friday, March 30, 2012

2:00 PM EDT

REGISTER NOW

In this webinar, we introduce how you can use Parallel Computing Toolbox to fully leverage the computing power available on your desktop through multicore processors and GPUs.

Through demonstrations, you will learn how, with minimal changes to your code, you can speed up your MATLAB-based data analysis, design, and simulation work.

The webinar will last approximately 60 minutes. A Q&A session will follow the presentation and demos.

Mathworks contact:

Jamie Winter

508-647-7463

jamie.winter@mathworks.com

Enabling Discovery with Dell HPC Solutions!

Dell, along with partners Intel, Mellanox, APC/Schneider Electric, and Scientific Computing, would like to invite you to a one-day workshop to see how HPC solutions from Dell and partners can enable cutting-edge results in your research labs.

Thursday, April 26, 2012

Emory Conference Center 
1615 Clifton Road
Atlanta, GA  30329

Register Here: https://www.etouches.com/Emory

Agenda:

8:30 a.m. – 8:45 a.m.     Registration
8:45 a.m. – 9:00 a.m.     Dell Welcome
9:00 a.m. – 9:45 a.m.     “Dell HPC Solutions” (Dr. Glen Otero, Dell HPC Computer Scientist)
9:45 a.m. – 10:25 a.m.    Suresh Menon, Georgia Institute of Technology
10:25 a.m. – 10:35 a.m.   Break
10:35 a.m. – 11:15 a.m.   Centers for Disease Control Presentation
11:15 a.m. – 11:55 a.m.   Phil Moore, Savannah River National Laboratory
12:00 p.m. – 12:45 p.m.   Networking Lunch
12:45 p.m. – 1:25 p.m.    Boyd Wilson and Randy Martin, Clemson University
1:25 p.m. – 2:05 p.m.     Neil Bright, PACE, Georgia Institute of Technology
2:05 p.m. – 2:35 p.m.     “The Core to Faster Simulation and Greater Discovery” (Jim Barlow, Enterprise Technologist, Intel)
2:35 p.m. – 2:45 p.m.     Break
2:45 p.m. – 3:00 p.m.     Mellanox presentation
3:00 p.m. – 3:15 p.m.     APC presentation
3:15 p.m. – 4:00 p.m.     “HPC Panel of Experts” (Ask your HPC questions of this team of HPC experts from across the industry!)

Regarding the job scheduler problems over the weekend

We experienced a major problem with one of our file servers over the weekend, which caused some of your jobs to fail. We would like to apologize for this inconvenience and provide you with more details on the issue.

In a nutshell, the management blade of the file server we use for scratch space (iw-scratch) crashed, for a reason we are still investigating. This system has a failover mechanism that allows another blade to take over and keep operations running. As a result, you could still see your files and continue using the software stack hosted on this fileserver.

The node that runs the Moab server (job scheduler), on the other hand, mounts this fileserver through a different mechanism that relies on a static IP address. After the new blade took over operations, the Moab node kept trying to mount iw-scratch using the IP of the failed blade, needless to say, unsuccessfully.
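
To illustrate the difference in general terms (a simplified, NFS-style sketch with made-up addresses and paths, not our actual Panasas configuration): a mount pinned to a single blade’s IP stops working when the service moves to another blade, while a mount that uses a name the failover keeps up to date continues to work.

  # brittle: pinned to one specific blade IP, breaks after a failover
  10.0.0.11:/iw-scratch                      /scratch   nfs   defaults   0 0
  # resilient: a hostname that always resolves to the currently active blade
  scratch-server.example.edu:/iw-scratch     /scratch   nfs   defaults   0 0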

As a result, some jobs failed with messages similar to “file not found”. This problem also rendered the Moab server unresponsive until we rebooted it Saturday night. Even after the reboot, some problems persisted until we fixed the server this morning. We will keep you updated as we learn more about the nature of the problem. We are also in contact with the vendor to prevent this from happening again.

Thank you once again for your understanding and patience. Please contact us at pace-support@oit.gatech.edu for any questions and concerns.