Information Technology and Research Computing

Modern research relies heavily on information technology. CSHL offers robust access to both standard technology resources and advanced computational services to meet the needs of its innovative research programs. At CSHL, we provide reliable, high-speed connectivity across campus as well as high-performance computing and data storage, housed within a state-of-the-art 3,000-square-foot data center on campus.

CSHL offers a High Performance Computing (HPC) cluster for rapid processing of large quantities of data. A cluster can be described as a number of computers (referred to as nodes) connected to each other through a fast data network and operated as a single, large system. Researchers submit computational jobs to the cluster, and scheduler software distributes them across a variable number of nodes, optimizing both overall cluster utilization and the turnaround time of each job. Modern computer processors (CPUs) contain multiple computational units referred to as “cores”. Cluster nodes typically have more than one CPU, each with many cores. If a computation takes a week on a single core of a personal desktop or laptop computer, it could take a day if the task were distributed across many cores on multiple nodes of a cluster.
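The idea of splitting independent work units across cores can be sketched in a few lines. This is an illustrative example only, not CSHL's actual scheduler or workload: the function `simulate_unit` is a hypothetical stand-in for one independent piece of a larger computation, and Python's standard `multiprocessing.Pool` plays the role that cluster cores and the scheduler play at larger scale.

```python
from multiprocessing import Pool

def simulate_unit(seed):
    # Hypothetical stand-in for one independent unit of work
    # (e.g., one sample, one genomic region, one parameter set).
    total = 0
    for i in range(10_000):
        total += (seed * 31 + i) % 97
    return total

def run_parallel(n_tasks, n_workers):
    # A pool of worker processes is the single-machine analogue of
    # cluster cores: tasks are distributed among workers automatically,
    # so adding workers (up to the core count) shortens the wall time.
    with Pool(processes=n_workers) as pool:
        return pool.map(simulate_unit, range(n_tasks))

if __name__ == "__main__":
    # Results are identical to running the tasks serially;
    # only the elapsed time changes.
    results = run_parallel(n_tasks=8, n_workers=4)
    print(len(results))
```

On a real HPC cluster the same pattern is expressed by submitting many jobs (or one array job) to the scheduler, which places them on free cores across the compute nodes.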

The institutionally shared HPC cluster is based on the Intel Xeon Gold/Platinum (Cascade Lake-SP) architecture. The cluster consists of 50 nodes: two head/management nodes, two development/login nodes, and 46 compute nodes. The nodes are connected via dual 25 Gb/s Ethernet (25 GbE) networks. A DDN GridScaler storage system is accessed over Ethernet via a high-performance parallel filesystem (GPFS).

Of the 46 compute nodes that run computational jobs, 42 have 768 GB of memory and 4 have 3 TB of memory. A subset of 14 of the compute nodes also have GPU co-processors. GPUs are specialized units providing faster processing than CPUs for certain applications such as image processing, machine learning, and sequencer base-calling.

CSHL data storage consists of enterprise-grade equipment, primarily from DDN. Scientific data storage, currently well in excess of 10 Petabytes (PB), is expected to grow significantly over the coming years on platforms that are scalable to many PB.

The Lab also offers an IBM Spectrum Scale-based platform which provides general-purpose data storage while also supporting the extreme performance requirements of an HPC environment. The platform is fully backed up, with automated replication of data between the primary and secondary data centers. As a modular, flexible system, it can easily be enhanced to provide even higher performance and expanded to achieve vast storage capacity.