Earlier this year, CRS was in the middle of solving a big problem for the Institute. A really big problem. The total storage space available for data at the Institute was hovering around 120 terabytes (abbreviated “TB” if you count bytes in powers of 1000, or “TiB” if you count in powers of 1024). Only about a fifth of that was free. We already had requests on file for another 10-15TB, and the Institute’s total storage need tends to grow by 2-5TB every year. The servers were already warning us that we’d be full soon.
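For anyone curious how much the TB versus TiB distinction matters at this scale, here is a minimal sketch of the conversion. The function name `tb_to_tib` is made up for illustration, and it assumes the 120 figure is in decimal terabytes.

```python
# Decimal (SI) and binary (IEC) storage units, expressed in bytes.
TB = 1000 ** 4   # terabyte: powers of 1000
TiB = 1024 ** 4  # tebibyte: powers of 1024


def tb_to_tib(terabytes: float) -> float:
    """Convert a size in decimal terabytes to binary tebibytes."""
    return terabytes * TB / TiB


if __name__ == "__main__":
    # Assumption: the 120 figure is decimal terabytes.
    print(f"120 TB is about {tb_to_tib(120):.1f} TiB")
```

The gap is roughly 9% at this size, which is why the same storage system can appear to have two different capacities depending on which unit a tool reports.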
I want to step aside quickly here and talk about how unimaginably vast 120TB of space is. It could hold roughly 5,800 copies of the entirety of English Wikipedia as of April 2, 2022 (20.69GB), around half a second of HD video for each person counted in the 2020 US Census, or 5.7 billion copies of the CRS budget spreadsheet.
With that out of the way, it’s also worth noting that the 120TB we were working with is what we’d call “usable” space. Behind that usable space is 175TB worth of “raw” hard drives. Hard drives are sensitive pieces of equipment, and some percentage of each batch is destined to fail much sooner than its rated 5-10 years. In the three years I’ve been at IBS, we’ve had to replace six or seven drives. These systems are all built so that one, two, or even three drives can fail at the same time while the system keeps chugging along without any data loss or disruption. We can even slot in a new hard drive while the system is running, which is a useful feature when a small system can have 70 or 80 drives and serve 50-200 people at once! The systems accomplish that feat by reserving space on multiple hard drives to hold copies of the same data, or enough information to regenerate it. The cost is that we give up roughly one third to one half of the raw space to provide redundancy for the data stored in the usable space.
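The raw-versus-usable arithmetic above can be sketched in a few lines. This is only an illustration using the figures quoted in this post. The function name `redundancy_overhead` is hypothetical, and it says nothing about which redundancy scheme the storage systems actually use.

```python
def redundancy_overhead(raw_tb: float, usable_tb: float) -> float:
    """Return the fraction of raw capacity given up to redundancy."""
    if raw_tb <= 0 or not 0 <= usable_tb <= raw_tb:
        raise ValueError("expected 0 <= usable_tb <= raw_tb and raw_tb > 0")
    return (raw_tb - usable_tb) / raw_tb


if __name__ == "__main__":
    # Figures from the text: 175TB of raw drives behind 120TB of usable space.
    overhead = redundancy_overhead(raw_tb=175, usable_tb=120)
    print(f"Redundancy overhead: {overhead:.1%}")
```

For the original 175TB of raw drives this comes out a little under a third, at the low end of the range mentioned above.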
Fortunately, with the help of Dell, CU’s Procurement office, and the Office of Information Technology datacenter operations group, we were able to quickly purchase and install 375TB of raw storage, which adds 200TB of usable space. Most of this is slower storage, meant for archiving large datasets that are accessed in their entirety only occasionally. The purchase also more than doubles the amount of really fast storage that is directly attached to our Citrix environment. That fast storage is SSD, or “Solid-State Drive,” which is more like a USB flash drive than physically spinning platters of magnetic media. The system that manages this storage is smart enough to place frequently used data on the faster storage and less frequently used data on the slower storage, in a way that’s mostly invisible to us. So if you notice that the exact same operation in Stata, run twice in a row, takes different amounts of time, that’s a big part of the explanation.
In total, we now have 320TB of usable storage. That is only enough to hold about 1.5% of the digital data that the Library of Congress ingests annually, but it’s enough for a lot of cool datasets.