Information Technology and Research Computing

Modern research relies heavily on information technology. CSHL offers robust access to both standard technology resources and advanced computational services to meet the needs of its innovative research programs. We provide reliable, high-speed connectivity across campus, as well as high performance computing and data storage housed in a state-of-the-art, 3,000-square-foot data center on campus.

CSHL offers a High Performance Compute Cluster for rapid processing of huge quantities of data. A cluster can be described as a number of computers (referred to as nodes) connected to each other through a fast data network and operated as a single, large computer. Researchers submit calculations to the cluster, and scheduler software distributes the calculations across a variable number of nodes, optimizing both use of the cluster and speed of the calculation. Modern computer processors (CPUs) have multiple computational units referred to as “cores”. Cluster nodes typically have more than one CPU, each with several cores. If your calculation takes a week on a single core, it could take as little as a day if you were able to distribute the task evenly across seven cores.
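The week-to-a-day arithmetic above can be sketched as follows. This is a minimal illustration, not a description of how the cluster scheduler works. The function names are hypothetical, and the `parallel_fraction` parameter (the share of the work that can actually be spread across cores, as in Amdahl's law) is an assumption added here to show why real speedups are usually smaller than the ideal figure.

```python
def ideal_runtime_days(single_core_days: float, cores: int) -> float:
    """Runtime if the work divides perfectly evenly across all cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return single_core_days / cores


def amdahl_runtime_days(single_core_days: float, cores: int,
                        parallel_fraction: float) -> float:
    """Runtime when only part of the work can run in parallel (Amdahl's law).

    parallel_fraction is a hypothetical input: 1.0 means the whole task
    parallelizes, 0.0 means none of it does.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_part = single_core_days * (1.0 - parallel_fraction)
    parallel_part = single_core_days * parallel_fraction / cores
    return serial_part + parallel_part


if __name__ == "__main__":
    # The example from the text: one week of work spread over seven cores.
    print(ideal_runtime_days(7, 7))
    # The same job if, say, only 90% of it can be parallelized.
    print(amdahl_runtime_days(7, 7, 0.9))
```

With a fully parallel task the two functions agree. Once some of the work must run serially, adding cores shortens only the parallel part, so the one-day figure is best read as a lower bound.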

The main institutionally shared compute cluster is based on the Intel Sandy Bridge (SB) microarchitecture. The cluster is a 1,728-core IBM x solution based on the M4 server line with Intel Xeon E5 (SB-EP) processors. It was built from 104 servers, each with 16 cores and 128 GB RAM, configured as development, compute, and management nodes. High-bandwidth 10 Gbps Ethernet networking interconnects the cluster nodes and connects the cluster to our data storage systems.
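The 1,728-core figure can be cross-checked against the node counts. The sketch below assumes (this is not stated explicitly in the text) that the total includes the two 32-core high memory nodes described in the next paragraph. The constant and function names are illustrative only.

```python
# Figures taken from the cluster description in the text.
STANDARD_SERVERS = 104
CORES_PER_STANDARD_SERVER = 16
HIGH_MEMORY_NODES = 2
CORES_PER_HIGH_MEMORY_NODE = 32


def total_cores() -> int:
    """Sum the cores of the standard servers and the high memory nodes."""
    standard = STANDARD_SERVERS * CORES_PER_STANDARD_SERVER      # 104 * 16
    high_memory = HIGH_MEMORY_NODES * CORES_PER_HIGH_MEMORY_NODE  # 2 * 32
    return standard + high_memory


if __name__ == "__main__":
    print(total_cores())  # 1664 + 64 = 1728
```

Under that assumption, the 104 standard servers contribute 1,664 cores and the high memory nodes contribute the remaining 64.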

In addition to the standard compute nodes, the cluster includes two high-memory nodes, each with 32 cores and 1.5 TB RAM. To support the full spectrum of institutional research computing needs, the cluster was designed to accommodate standard batch processing along with calculations implemented in the Hadoop framework. Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers. It is suitable for processing large data sets and has proven particularly useful in our genomics research.
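The MapReduce model that Hadoop derives from can be illustrated with a small, single-machine sketch. This is not Hadoop code and does not use any Hadoop API. It only mimics the three phases (map, shuffle, reduce) in plain Python. Counting k-mers in sequencing reads is chosen here as a hypothetical genomics-flavored example, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: combine the values for one key into a single count."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], 2))
```

In a real Hadoop deployment the map and reduce steps would run on many nodes at once, with the framework handling the shuffle and the distributed file system. The sketch shows only the shape of the computation.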

CSHL data storage consists of enterprise-grade equipment, primarily from IBM and Isilon (EMC). Scientific data storage, currently well in excess of 5 petabytes (PB), is expected to grow significantly over the coming years, and our platforms are scalable to many PB.

The Lab also offers an IBM Scale Out Network Attached Storage (SONAS) platform, which provides general-purpose data storage while also supporting the extreme performance requirements of an HPC environment. The SONAS platform is fully backed up, with automated replication of data between the primary and secondary data centers. As a modular, flexible system, SONAS can easily be enhanced to provide even higher performance and expanded to achieve vast storage capacity.