Glotzilla Overview

Glotzilla is two things:

  • A C++ library for writing molecular simulations, data analysis codes and visualization tools.
  • A collection of executables for performing molecular simulations, data analysis and data visualization.

Introduction and Motivation

Molecular simulation has been applied to a wide range of problems in physics, chemistry, biology, and engineering that cannot easily be solved via conventional experiments. Applications include mapping the phase behavior of rare, expensive, or theoretical systems, or of common systems at experimentally inaccessible state points; studying activated events such as chemical reactions; and studying the behavior of systems on the atomic or molecular level. The range of applications accessible to molecular simulation grows as computing power improves. Thus, in the future, molecular simulation will make an increasingly significant contribution to scientific and engineering research.

Several freely available software packages have been developed to perform molecular simulations and visualize results.

  • LAMMPS: massively parallel simulations
  • Etomica: simple but powerful API, limited speed and scalability
  • CHARMM: protein simulations

(This list is incomplete.)

These codes perform simulations of atomic, polymeric or macromolecular systems in various thermodynamic ensembles, often with an emphasis on high-performance computing and parallelization. While such packages have proven useful for many applications, they can have several drawbacks. First, because there is little or no standardization between packages, each code involves a learning curve that may take months or years to surmount. Thus, in general, researchers are able to use, at best, one or two packages. Second, the packages are often inflexible and cannot be trivially modified to perform special tasks. Thus, researchers become limited by their software of choice. This has caused many researchers to seek alternatives such as developing simulation codes from scratch. This practice spends valuable research time on code development, and often produces simulation codes that are less accurate and efficient than the software packages that they replace.

Here, we propose to introduce a general framework code-named Glotzilla, which combines the flexibility of user-generated simulation codes with the speed and accuracy of widely available software packages. The main idea behind Glotzilla is to provide molecular simulation tools that work within the C/C++ and unix environments, which are two of the most commonly utilized standards in scientific computing. In this respect, standardization is inherent in Glotzilla; any user who is familiar with C/C++ and unix will understand how to use it.

In its final form, Glotzilla will consist of a two-level framework that is implemented both on the source code level and on the executable level of computer programming to give a unique combination of flexibility and usability. On the source-code level, Glotzilla will provide a pre-compiled C/C++ library that can be included in any new C/C++ molecular simulation code to facilitate development. On the executable level, Glotzilla will work within the “pipes and filters” framework of unix-based systems, which allows users to string together simulation-related processes in a modular fashion. Thus, Glotzilla will be capable of implementing all of the tasks currently available to molecular simulation researchers as well as aiding the development of new tasks. Because of its open-source philosophy, Glotzilla will always be current and will grow more powerful with time.

Background

Here, we review the pertinent functionalities underlying both C/C++ and unix, before explaining how we extend these functionalities to handle molecular simulations in the next section.

C/C++ Libraries

Glotzilla will implement a precompiled utility library that can be used to aid development of new simulation and data analysis codes in C/C++. Although the library will initially be written in C++, the general idea can be extended, in the form of a general API, to other programming languages that support precompiled libraries, such as Fortran 90 and Java. The idea underlying C libraries is straightforward: a common task that is performed by many programmers need not be re-written each time. Rather, a standard implementation of the task is made readily available via a one-line “#include” statement. For example, consider the C <stdio.h> library, which supplies a set of functions for printing to the screen from a C/C++ code (a process that involves thousands of lines of source code if written from scratch).

#include <stdio.h>   /* declares printf */

int main(int argc, char **argv)
{
   printf("Hello World!\n");
   return 0;
}

The program above calls the printf function to print the words “Hello World!” to the screen. The printf function is made available to the programmer by the directive “#include <stdio.h>” at the top of the file. This form of standardization 1) decreases the size of codes and 2) increases the accuracy and performance of codes. The C and C++ standard libraries provide dozens of headers for tasks such as managing memory, manipulating data, writing to files, and performing mathematical calculations.

Integration with Unix

The vast majority of molecular simulation research is carried out under unix-based operating systems such as macOS or Linux. Glotzilla will integrate with such environments by providing binaries for performing common simulation and data analysis tasks from the unix command line. In unix, a binary (i.e., a set of computer instructions) can be executed by calling its name from the terminal.

host $ ls
Desktop Movies Documents Pictures

The example above demonstrates the execution of the unix “ls” (list) command, which returns a directory listing to the screen (stdout) when executed. Users can execute custom binaries compiled from C/C++ or other computer languages in an identical fashion.

A particularly powerful feature of unix-based operating systems is the ability to “pipe” the output (stdout) of one process to the input (stdin) of another. This allows users to create chains of processes linked by their input/output (I/O) streams. Because pipes pass information in memory, chains of processes connected by pipes perform nearly as well as individual executables. The advantage is that linking executables together in a modular fashion yields many more useful combinations of processes than any single executable provides.

host $ ls
Desktop Movies Documents Pictures

host $ ls | grep Movies
Movies

This general idea is demonstrated in the example above, where a user sends the output of “ls” to the input of “grep” using the pipe operator “|”. The “grep” command acts as a filter, printing only the lines that match the regular expression “Movies”.

host $ ls | grep "Movies"
Movies

host $ ls | grep Movies | wc
       1       1       7

In the example above, the output of the grep command is passed to the “wc” (word count) command, which writes the number of lines, words, and characters to stdout. This general approach to data manipulation is referred to as the “pipes and filters” paradigm, since the flow of data through the unix pipeline is similar to the flow of mass through a physical pipeline. Here, each program in the pipeline acts as a filter and passes the data stream to the next program.
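The same pattern can be sketched in C++ to show what a filter looks like from the programmer's side. The code below is only an illustration, not part of Glotzilla: the function names filter_lines, count_lines and run_pipeline are hypothetical, the matching is a plain substring test rather than grep's regular expressions, and only the line count of “wc” is reproduced. Each stage reads from an input stream and writes to an output stream, and an in-memory stream stands in for the pipe.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Filter stage: copy to `out` only the lines of `in` that contain `pattern`.
// Plain substring matching is used here; grep itself matches regular expressions.
void filter_lines(std::istream& in, std::ostream& out, const std::string& pattern)
{
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(pattern) != std::string::npos) {
            out << line << '\n';
        }
    }
}

// Counting stage: write the number of lines read from `in` to `out`.
void count_lines(std::istream& in, std::ostream& out)
{
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++count;
    }
    out << count << '\n';
}

// In-process stand-in for a two-stage pipeline: the second stage reads
// exactly what the first stage wrote, as it would through a unix pipe.
std::string run_pipeline(const std::string& listing, const std::string& pattern)
{
    std::istringstream source(listing);
    std::stringstream matched;   // plays the role of the pipe
    filter_lines(source, matched, pattern);
    std::ostringstream result;
    count_lines(matched, result);
    return result.str();
}
```

Compiled as separate executables, each stage would have a main function that passes std::cin and std::cout to the corresponding function, and the shell's “|” operator would replace the in-memory stream.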

Proposed Work Plan

The natural extension of C/C++ and unix to molecular simulation applications is to create C/C++ libraries and unix executables that are specific to molecular simulation problems. Here, we summarize the basic features of the libraries and executables that we propose to implement.

C/C++ Libraries

Just as a programmer printing text to the screen benefits from the C/C++ <stdio.h> built-in library, a researcher wanting to create a molecular simulation would benefit from a library capable of carrying out common simulation tasks. Examples of such tasks include generating a custom simulation, visualizing the simulation, and analyzing the data. We propose to generate a C/C++ library containing many such functions, which are summarized in the API section of this wiki site. The functions could be included in a C/C++ executable using the #include directive, just as for <stdio.h>. An example C++ program using the <glotzilla++.h> library is given below.

Binaries

Glotzilla will include several compiled executables for performing basic simulation / data analysis tasks. The executables will interact in a way very similar to built-in unix functions via the pipes and filters paradigm. Figure 4 summarizes the executables that will be made available in the first release of Glotzilla. The framework will grow with time, as users will add their own executables to the framework. Note that users need not use the Glotzilla implementations of a given process; any process that reads from stdin and writes to stdout can be easily inserted into the pipeline.
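To illustrate the last point, here is a minimal sketch of a user-written filter that could sit in such a pipeline. Everything in it is an assumption made for illustration: the function name select_slab is hypothetical, and the input format (one particle per line, given as whitespace-separated x, y and z coordinates) is not a Glotzilla file format. The filter keeps only the particles whose z coordinate lies within a chosen range.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical user-written filter stage. ASSUMED input format (not a
// Glotzilla format): one particle per line, as whitespace-separated
// "x y z" coordinates. Lines whose z coordinate lies in [zmin, zmax] are
// copied to `out` unchanged; lines that do not begin with three numbers
// are dropped.
void select_slab(std::istream& in, std::ostream& out, double zmin, double zmax)
{
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        double x = 0.0, y = 0.0, z = 0.0;
        if (!(fields >> x >> y >> z)) {
            continue;  // not a coordinate line
        }
        if (z >= zmin && z <= zmax) {
            out << line << '\n';
        }
    }
}
```

A main function that calls select_slab with std::cin and std::cout would turn this into an executable that can be placed between any two processes that exchange data in the assumed format.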

Proposed Timeline of Work

In its final form, Glotzilla will be a “living” open-source framework that is maintained and updated by the molecular simulation community. However, the initial implementation of Glotzilla will be constructed by the Glotzer group at the University of Michigan.

  • Year 1: The existing group simulation and analysis codes will be transformed into a working C/C++ library. Concurrently, simulations, analysis modules and data filters will be constructed for manipulating data in unix.
  • Year 2: The framework will be extended to work with existing simulation packages by adding filters that convert data between various file formats. Concurrently, the framework will be tested by group members and other simulation groups, and will be improved based on the feedback obtained.
  • Year 3: A final website and CVS server will be constructed to distribute the code to the community.