Upcoming quarterly maintenance – 1/18/2012

This is a reminder that all PACE-managed clusters will be shut down on January 18 (Wednesday of next week) for regular maintenance.

All currently running jobs will complete before the shutdown.  Any jobs submitted to the scheduler between now and maintenance day will either complete before the shutdown or wait until after maintenance to start.
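The scheduler's behavior here amounts to a simple check against the maintenance window: a job starts only if its requested walltime fits before the shutdown; otherwise it waits in the queue. A minimal sketch of that logic (the shutdown time and walltimes below are illustrative assumptions, not the actual reservation):

```python
from datetime import datetime, timedelta

# Assumed maintenance start time for illustration only.
MAINTENANCE_START = datetime(2012, 1, 18, 6, 0)

def starts_before_maintenance(now, requested_walltime):
    """Return True if a job submitted now can finish before the shutdown."""
    return now + requested_walltime <= MAINTENANCE_START

# A 24-hour job submitted on Jan 13 fits; a 7-day job must wait.
now = datetime(2012, 1, 13, 12, 0)
print(starts_before_maintenance(now, timedelta(hours=24)))  # True
print(starts_before_maintenance(now, timedelta(days=7)))    # False
```

In practice the scheduler does this with a system reservation covering the maintenance window, so shorter jobs can keep backfilling right up until the shutdown.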

Major items on the list this time around are:
– Improving scratch filesystem performance
– Increasing scratch filesystem size
– Completing the migration of our server infrastructure to VMs
– Making further redundancy improvements to the core of the HPC network
– Adjusting the VM hypervisors, which we hope will improve login node performance
– Installing a new scheduler binary to remove a limit on the number of queues (this shouldn't change anything else, but we mention it just in case)
– Integrating some new clusters into the Infiniband fabric (again, done on maintenance day just in case something goes wrong)

For updates about maintenance, please check the PACE blog at http://blog.pace.gatech.edu/

If you have questions or concerns, please send a note to pace-support@oit.gatech.edu.

TACC-Intel Highly Parallel Computing Symposium

Tue Apr 10 – Wed Apr 11 2012

Texas Advanced Computing Center, Austin, Texas

http://www.tacc.utexas.edu/ti-hpcs12

Submissions due Wed Feb 15 2012

The TACC-Intel Highly Parallel Computing Symposium will take place on Tuesday April 10th – Wednesday April 11th 2012 at the Texas Advanced Computing Center (TACC) in Austin, TX.

In the past year, the Intel MIC program has advanced toward its first commercial many-core co-processor, code-named Knights Corner.

Accordingly, this symposium will expand to have two major focus areas: the Many-core Applications Research Community (MARC) for the Single-Chip Cloud Computer (SCC) experimental architecture, and the emerging community around the forthcoming Intel Many Integrated Core (MIC) architecture family of products for productivity solutions.

In April, researchers from different fields will present their current and future work.

For the SCC, the focus will be the implementation of advanced hardware architecture concepts and the use of the SCC to explore tools and software that take advantage of finer-grained data flow.

For Intel MIC the focus is on programming productivity for highly parallel applications.

The host site, TACC, will deploy the first large-scale supercomputer system based on Intel MIC in January 2013.

Interested researchers are invited to submit unpublished reports, on both work in progress and new results, regarding software for novel many-core hardware architectures.

While the Intel Single-Chip Cloud Computer (SCC) has served as the common research platform for most MARC members, the recent availability of development kits for the Intel MIC family of products has expanded the community for many-core applications research.

Some of the concepts of the SCC will be realized in production form when the Intel MIC product line becomes available. Other interesting research on next generation many-core platforms is also relevant for this event.

Topics of interest include, but are not limited to:

  • Operating system support for novel many-core architectures
  • Dealing with legacy software on novel many-core architectures
  • Traditional and new programming models for novel many-core hardware
  • Experiences porting, running, or developing applications
  • New approaches for leveraging on-die messaging facilities

All authors are invited to submit original and unpublished work as either regular papers (maximum 6 pages) for oral presentation or short papers (maximum 4 pages) for poster presentation. Papers describing work-in-progress are also welcome.

Paper submission is possible through EasyChair at http://www.easychair.org/conferences/?conf=tihpcs11

Submissions are due Wednesday February 15th, 2012.

For additional information please check the event website: http://www.tacc.utexas.edu/ti-hpcs12

Dan Stanzione, PhD
Deputy Director
Texas Advanced Computing Center, The University of Texas at Austin
dan@tacc.utexas.edu, 512-475-9411

PACE coverage during winter break

Greetings all, and happy holidays!

As I’m sure you are all acutely aware, campus will be closed next week! This includes us as well.

If you have troubles, please submit tickets as usual – preferably using our pace-support.sh script!  (See http://www.pace.gatech.edu/support.)  We will get to these as soon as things get back to normal in January.

If there is an immediate problem, please call OIT operations at (404) 894-4669 and leave a message.  One of the operators will be checking in occasionally and they have contact information for the PACE team.

Have a good break!

Department of Energy Computational Science Graduate Fellowship

Applications due Jan 10 2012

We are pleased to inform you that the application is now open for the Department of Energy Computational Science Graduate Fellowship (DOE CSGF) at https://www.krellinst.org/doecsgf/application/.

This is an exciting opportunity for doctoral students to earn up to four years of financial support along with outstanding benefits and opportunities while pursuing degrees in fields of study that utilize high performance computing technology to solve complex problems in science and engineering.

Benefits of the Fellowship:

  • $36,000 yearly stipend
  • Payment of all tuition and fees
  • $5,000 academic allowance in first year
  • $1,000 academic allowance each renewed year
  • 12-week research practicum at a DOE Laboratory
  • Yearly conferences
  • Career, professional and leadership development
  • Renewable up to four years

Applications for the next class of fellows are due Jan 10 2012.

For more information regarding the fellowship and to access the online application, visit: http://www.krellinst.org/csgf/

Updated: Network troubles, redux (FIXED)

We’ve got the switch back.  The outage looks to have caused our virtual machine farm to reboot, so connections to head nodes will have been dropped.

This also affected the network path between compute nodes and the file servers.  With a little luck, the NFS traffic should resume, but you may want to check on any running jobs to make sure.
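One quick way to check whether a node's view of the fileserver has recovered is to probe the mount with a timeout, since a plain stat() on a hung NFS mount can block indefinitely. A rough sketch of that idea (the path and timeout are generic placeholders, not PACE-specific):

```python
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def mount_responsive(path, timeout=5.0):
    """Probe `path` in a worker thread; report False if it hangs or errors.

    Note: a truly hung os.stat() leaves the worker thread blocked, which
    can delay interpreter exit -- acceptable for a one-off health check.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.stat, path)
    try:
        future.result(timeout=timeout)
        return True
    except (FutureTimeout, OSError):
        return False
    finally:
        pool.shutdown(wait=False)
```

Running this against each NFS mount point from a compute node gives a quick pass/fail list without risking a hung shell.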

Word from the network team is that they were following published instructions from the switch vendor to integrate the two switches when the failure occurred.  We'll be looking into this pretty intensely, as these switches are seeing a lot of deployments in other OIT functions.

Network troubles, redux – 11/10 3:00pm

Hi folks,

In an attempt to restore network redundancy from the switch failure on 10/31, the Campus Network team has experienced some troubles connecting the new switch.  At this point, the core of our HPC network is non-functional.  Senior experts from the network team are working on restoring connectivity as soon as possible.

Full filesystems this morning

This morning, we found the hp8, hp10, hp12, hp14, hp16, hp18, hp20, hp22, hp24, and hp26 filesystems full.  All of these filesystems reside on the same fileserver and share capacity.  The root cause was an oversight on our part – a lack of quota enforcement on a particular user's home directory.  The proper 5GB home directory quotas have been reinstated, and we are working with this user to move their data to their project directory.  We've managed to free up a little space for the moment, but it will take some time to move a couple TB of data.  We're also doing an audit to ensure that all appropriate storage quotas are in place.
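The audit boils down to comparing each home directory's usage against the 5GB quota. A simplified sketch of that check (a real audit would query the fileserver's quota tooling rather than walking the tree, and the functions here are illustrative, not our actual scripts):

```python
import os

QUOTA_BYTES = 5 * 1024**3  # the 5GB home-directory quota

def directory_usage(path):
    """Sum the sizes of all files under `path` (does not follow symlinks)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

def over_quota(path, quota=QUOTA_BYTES):
    """Return True if the directory's usage exceeds the quota."""
    return directory_usage(path) > quota
```

Looping such a check over every home directory on the shared fileserver is how one would flag the kind of runaway usage that filled these filesystems.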

This would have affected users on the following clusters:

  • Athena
  • BioCluster
  • Aryabhata
  • Atlantis
  • FoRCE
  • Optimus (not production yet)
  • ECE (not production yet)
  • Prometheus (not production yet)
  • Math (not production yet)
  • CEE (not production yet)

PACE staffing, week of November 14

Greetings all,

As I’m sure some of you are aware, next week is the annual Supercomputing ’11 conference in Seattle.  Many of the PACE staff will be attending, but Brian MacLeod and Andre McNeill have graciously agreed to hold the fort here.  The rest of us will be focused on conference activities but will have connectivity and can assist with urgent matters should it be required.

Updated: network troubles this morning (FIXED)

All head nodes and critical servers are back online (some required an emergency reboot).  The network link to PACE equipment in TSRB is restored as well.

We do not believe any jobs were lost.

All Inchworm clusters should be back to normal.

Please let us know via pace-support@oit.gatech.edu if you notice anything out of the ordinary at this point.

network troubles this morning – 0908

Looks like we have a problem with a network switch this morning.  Fortunately, our resiliency improvements have mitigated some of the impact, but not all of it, as we haven't yet extended those improvements down to the individual server level.  We're working with the OIT network team to restore full functionality as soon as possible.