
How much is 120 terabytes?

A close up of an array of 21 server hard drives labelled "SAS 500GB 10K" with identical arrays above and below.

Earlier this year, CRS was in the process of solving a big problem the Institute had. A really big problem. The total available storage space for data in the Institute was hovering around 120 terabytes (abbreviated “TB” if you count your bytes in powers of 1000, or “TiB” if in powers of 1024). We only had about a fifth of that free. There were pending requests for another 10-15TB, and our total storage need tends to grow by 2-5TB every year. The servers were already warning us that we’d be full soon.
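A quick aside on that TB/TiB distinction: the same capacity looks smaller when you count in binary units, which is why a drive's advertised size never quite matches what your operating system reports. A minimal Python check:

```python
# Decimal terabytes (TB) count in powers of 1000; binary tebibytes (TiB)
# count in powers of 1024, so the same capacity "shrinks" in TiB.
TB = 1000 ** 4   # 1 terabyte in bytes
TiB = 1024 ** 4  # 1 tebibyte in bytes

capacity_bytes = 120 * TB
print(f"120 TB is about {capacity_bytes / TiB:.1f} TiB")  # about 109.1 TiB
```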

I want to step aside quickly here and talk about how unimaginably vast 120TB of space is. It could contain 5725 copies of the entirety of English Wikipedia as of April 2, 2022 (20.69GB), a half hour of HD video for each person in the 2020 US Census, or 5.7 billion copies of the CRS budget spreadsheet.

With that out of the way, it’s also worth noting that the 120TB we were working with is what we’d call “usable” space. Behind that usable space sit 175TB worth of “raw” hard drives. Hard drives are sensitive pieces of equipment, and some percentage of each batch is destined to fail much sooner than its rated 5-10 years. In the three years I’ve been at IBS, we’ve had to replace six or seven drives, but these systems are all built so that any one, two, or even three drives can fail at the same time and the system keeps chugging along without any data loss or disruption. We can even slot in a new hard drive while the system is running, which is a useful feature when a small system can have 70 or 80 drives and serve 50-200 people at once! These systems accomplish that feat by reserving space on multiple hard drives for copies of the same data, or for enough information to regenerate it, so we give up between one third and one half of the raw space to provide redundancy for the data stored in the usable space.
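To see where the raw-versus-usable gap comes from, here is an illustrative sketch. The exact redundancy scheme our systems use isn't described above, so the double-parity (RAID-6-style) layout below is an assumption chosen just to show the arithmetic:

```python
# Illustrative only: assumes a RAID-6-style double-parity layout, where each
# group of n drives reserves 2 drives' worth of space for redundancy data.

def usable_fraction(drives_per_group: int, parity_drives: int = 2) -> float:
    """Fraction of raw capacity left over after reserving parity space."""
    return (drives_per_group - parity_drives) / drives_per_group

RAW_TB = 175
for group_size in (6, 10):
    usable = RAW_TB * usable_fraction(group_size)
    print(f"{group_size}-drive groups: about {usable:.0f} TB usable")
```

With 6-drive groups, the math lands at about 117TB usable out of 175TB raw, in the same ballpark as our actual 120TB.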

Fortunately, with the help of Dell, CU’s Procurement office, and the Office of Information Technology datacenter operations group, we were able to quickly purchase and install 375TB of raw storage, representing an increase of 200TB of usable space. Most of this is slower storage, meant for archiving large datasets that are accessed in their entirety only occasionally, but the purchase also more than doubles the amount of really fast storage (SSD or “Solid-State Drive,” which is more like a USB flash drive than physically spinning platters of magnetic media) that is directly attached to our Citrix environment. The system that manages this storage is smart enough to place more-often-used data onto faster storage and less-frequently-used data onto slow storage in a way that’s mostly invisible to us, so if you notice that the exact same operation in Stata, run twice in a row, takes different amounts of time, that’s a big part of the explanation.

In total, we now have 320TB of usable storage, which is only enough to store about 1.5% of the digital data that the Library of Congress ingests annually, but it’s enough for a lot of cool datasets.

What’s in a Name?

A collection of "Hello My Name Is" nametag stickers, with different names written in, like Vivian, Tom, Jen, and Tyler.

There’s a famous saying among computer scientists that one of the two hardest problems in their field is how to name things. They’re certainly not the only ones. In fact, I’d be surprised if there’s a single person in IBS who hasn’t had to struggle with a naming convention. Variables in code, folders on the O: drive, IBS administrative policies, and so many other things all need to be named such that their function is easy to understand at a quick glance. They have to contain a lot of information in as little space as possible while still allowing for additional related items, easy sorting, and categorization.

If you’ve been around IBS for a while, you probably remember the venerable P drive. Attaching it to your computer for the first time required typing in a network path, but the real name of the machine that hosted it was “Ogre.” You may also remember logging into the shared computing environment on Flash, and some people may have even done some computing on Galactus or Quicksilver. Fans of comic books will have already figured out what the naming convention for servers in the department was at that point, although I had to go google Ogre to find that he is, in fact, a comic book character. Today, only one server, named after Green Lantern, survives from that era of names.

Over the last three years, if you’ve had to manually map the O or R drives, or if you look in the lower right-hand corner of your desktop when logged into Citrix (and you grew up in America), you have probably figured out the naming convention that replaced comic book characters, thanks to Jim, Mitch, and Ashton. While CRS keeps Mahna Mahna, Miss Piggy, Beaker, and Bunsen for ourselves, you’ve almost certainly seen Gonzo, Floyd Pepper, and many of the other Muppets. It makes for some fun support calls with Dell: “I’m sorry, sir, did you say I should type Fozzie Bear into the connection window?”

More recently, we’ve had to move to less silly naming conventions for a variety of reasons–OIT would like all of our new servers to be prefixed with “IBS” in order to fit with their broader campus naming convention, for instance–but the big thing pushing us in that direction is space. The Wikipedia list of Muppets contains just over 100 names, and I have often plumbed the depths of that list to find Muppets I’d never heard of, like Johnny Fiama and Sal Minella. Many of the names (“Bill,” “Jill,” or “Denise”) aren’t obviously Muppets out of context. Some interfere with our obligation not to occupy names that might be needed by another group, like “Chip,” “GIL,” and “Mulch”; and some, like “Flower-Eating Monster,” would exceed the 16-character limit we have on naming computers. That leaves us with around 50-75 usable names.
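The winnowing we did by hand looks something like this in code. This is a hypothetical sketch: the 16-character limit and the reserved examples come from the paragraph above, but the candidate list is a tiny stand-in for the full Wikipedia list:

```python
# Hypothetical sketch: filter candidate server names by the constraints
# described above. The candidate list is a small stand-in for the real one.
MAX_HOSTNAME_LEN = 16
RESERVED = {"chip", "gil", "mulch"}  # names other campus groups might need

candidates = [
    "Fozzie Bear",
    "Flower-Eating Monster",  # too long
    "Chip",                   # reserved for another group
    "Gonzo",
    "Floyd Pepper",
]

usable = [
    name for name in candidates
    if len(name) <= MAX_HOSTNAME_LEN and name.lower() not in RESERVED
]
print(usable)  # ['Fozzie Bear', 'Gonzo', 'Floyd Pepper']
```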

A combination of modern server infrastructure practices and IBS’s obligation to keep some datasets on separate servers pushes us toward a larger number of named servers: we fluctuate between 85 and 100 virtual servers that need unique names on any given day (stored on only a handful of physical machines!). So while the Muppets have served us well these last three years, expect to see a lot more “IBS-XA-03” and “IBS-MAXQDA”-style names around here!

Do you remember other naming conventions used for IBS computing resources? Let me know by email, and your comment may be featured in a future newsletter!

Jim Dykes and Josh Goode departures

A green exit sign, with a stylized image of a person running out a door on the left, and a right-facing arrow on the right

It’s impossible for me to overstate the impact that Jim and Josh have had on CRS and IBS over the past several years. 

Under Jim’s leadership, the CRS team expanded to gain dedicated desktop support and server administration, including hiring most of the current team. Thanks to his vision, guidance, and mentorship, CRS’ combined expertise and infrastructure have grown immensely over the last six years. The implementation of the new Citrix server cluster and a completely redesigned website (for which he took many of the photographs!) are two of the most visible of the many projects he guided to completion. While we will miss him, Jim has taken a teaching position in Computer Science, and the department’s students and faculty will benefit greatly from his expertise. We will continue to build on what he helped create here.

In Josh’s time with CRS, he has directly helped so many people with invaluable consultations on both statistics and the wide variety of analysis software the Institute uses every day. Within CRS, Josh has developed processes for automatically testing new versions of Stata and for processing data that helps us optimize the Citrix server environment. Josh is moving on to the University of Michigan, where he will be doing research on family demography and biosocial processes as an NICHD postdoctoral fellow in the Population Studies Center. Even though we will no longer benefit from his honed pedagogical instincts and easy sense of humor as a teacher and consultant, there is no doubt Josh will do great things in his new position. We’ll certainly miss his easily accessible explanations of statistics and his in-depth knowledge of Stata.

Thank you to both Jim and Josh for everything that you’ve given to CRS, and all of us at IBS wish you the best of luck in your new positions!