For students with a good computational science background ...

The CMDF software architecture is designed to enable flexible mixing of methods, both across length and time scales and within a single scale, so that materials simulations can traverse scales. It does so through loosely coupled code components encapsulated under an object-oriented (OO) programming paradigm. Components can be processing units, bidirectional paradigm/scale handshaking units, or unidirectional cast/transform units. Currently, the simulation workflow of a particular multi-paradigm run over a particular molecular structure is controlled through a single-entry ASCII flat file that defines, from a set of keywords, the conditions and parameters for each major processing component to be used. Inter-paradigm/scale couplings are built in at this stage of development and are therefore hidden from the user's control. This is the typical mode of operation for non-expert users, provided the required functionality is available and a predetermined sequential flow is expected. On the other hand, this scheme limits an expert user's ability to define their own coupling conditions explicitly, or to select or modify the existing ones, through a single-entry interface.
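To make the idea of a single-entry keywords file concrete, the following is a minimal sketch of how such a file could be read into per-component settings. The file syntax shown here (bracketed component headers followed by "keyword value" lines), the component names, and the keywords are all assumptions made for illustration; they are not CMDF's actual format.

```python
# Hypothetical keywords-file syntax: "[component]" headers followed by
# "keyword value" lines. None of these names are taken from CMDF.
SAMPLE = """\
# comment lines start with '#'
[md]
timestep 1.0
steps 1000

[qm]
basis 6-31G
"""


def parse_keywords(text):
    """Return {component: {keyword: value}}, preserving file order."""
    components = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new component block begins.
            current = line[1:-1].strip()
            components[current] = {}
        else:
            if current is None:
                raise ValueError(f"keyword outside a component block: {line!r}")
            key, _, value = line.partition(" ")
            components[current][key] = value.strip()
    return components


workflow = parse_keywords(SAMPLE)
```

In this sketch the order of the component blocks is the only control-flow information available, which mirrors the limitation discussed above: couplings between components cannot be expressed in the file itself.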

Furthermore, the keywords file has no explicit constructs for defining component/module connectivity or events. This leaves the control flow, as interpreted by the user, potentially indeterminate; the ambiguity is currently resolved by imposing an assumed deterministic sequence of events derived from the main task described in the keywords file. It also prevents full exploitation of the concurrency built into the architectural model, which calls for a data-flow inference engine in which events are triggered by the availability of data and not necessarily by a time clock, and under which components can, in principle, run on a single computational node or on a distributed or massively parallel system. To exploit the parallelism required for the large-scale, long-term multiparadigm/multiscale simulations expected, we have adopted a task-based parallel computing model that supports event-driven execution, MPI, and threads. This allows parallel-ready code to run in parallel and non-thread-safe code to still benefit from parallelism under a functional parallel model. CMDF leverages many legacy code components that were written when computational resources were scarce; these are mainly single-threaded and not coded for parallel computation, so the task-based parallel model suits them well and avoids the cost of re-coding. Important challenges remain, such as making the non-threaded/non-parallel software components in CMDF as light (and as homogeneous in run time) as possible to favor improved load balancing in massively parallel simulations, and supporting automated compile-time and run-time parallelizers once these have reached a mature stage.
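The data-flow idea described above, in which a task fires as soon as its inputs exist rather than at a fixed point in a sequence, can be sketched with Python's standard `concurrent.futures` module. This is an illustrative sketch only, not CMDF's scheduler: the function `run_dataflow` and the task names are hypothetical, and a thread pool stands in for the MPI and event-driven back ends mentioned in the text.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dataflow(tasks, max_workers=4):
    """Run tasks as their inputs become available.

    tasks maps a task name to (function, [names of tasks whose results it
    consumes]). Results are passed positionally in the order listed.
    """
    results = {}
    pending = dict(tasks)
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Fire every pending task whose inputs have all been produced.
            ready = [name for name, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            for name in ready:
                func, deps = pending.pop(name)
                future = pool.submit(func, *(results[d] for d in deps))
                running[future] = name
            if not running:
                # Nothing is running and nothing can start: bad dependencies.
                raise ValueError("unsatisfiable dependencies: "
                                 + ", ".join(sorted(pending)))
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


# Hypothetical three-task workflow: two independent producers, one consumer.
tasks = {
    "charges": (lambda: 2.0, []),
    "geometry": (lambda: 3.0, []),
    "energy": (lambda q, r: q * r, ["charges", "geometry"]),
}
results = run_dataflow(tasks)
```

Here "charges" and "geometry" may run concurrently, while "energy" starts only once both results are available. No execution order is written down anywhere; it follows from the declared data dependencies.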

Areas of Opportunity. The following tasks related to the aforementioned software engineering processes in CMDF's development plan include specific opportunities for Caltech CS (undergraduate and graduate) students to participate in this project:

	Specification and construction of a relational or OO database to handle pre-specified and user-defined components/methods and simulation control flows (including test cases). This database should provide information about component interfaces and parameters in the architecture (i.e. I/O and parameter data types), component residence (i.e. where the component is physically located within the computational environment), component documentation (currently we use epydoc and Doxygen), component relative performance and error bars (e.g. condition numbers for serial or parallel runs), other component specifications (e.g. thread-safe or not, parallel or not), and the incremental integration of compound component/multiparadigm interfaces, as well as the appropriate metadata and search tools within the methods and parameter space. The database can reside on a single node, but it must be addressable coherently and securely within a distributed environment (particularly for user updates). Task ID: Components/Simulation database component.
	Design and development of a top-level web-native visual programming interface (connected to the lower-level database) to allow specification, loading, execution and termination of user-specified or library-based simulation runs. This web-based interface would become the new front end for CMDF, and it should allow secure remote setup of control flow and execution/termination of simulation jobs in interactive or batch mode. For the latter, users should be able to automatically generate a Python-level script that integrates the current keywords set plus any added control flow and simulation setup information; an appropriate intermediate language grammar specification could be required for this. It will also operate as the front end to our CMDF Application Server (users should be able to remotely configure and run their simulation jobs for instruction, research or study). It should allow the user to extract from the Components/Simulation database the necessary (black-box) specifications for establishing/adding simulation conditions, including connectivity, specifying simple structured programming operations, establishing component groupings, estimating error bars for collective operations, and establishing overall relative performance. Furthermore, it should provide the means for adding new modules/components to the CMDF methods dataset by allowing automated in-lining of Python-"wrappable" source code. Task ID: *Web-native Programming Interface component*. Under development (with Victor River and Jorge Victoria, PUJ-Cali).
	Development of a high-level task parallelizer and Web-native scheduler. Python does not have the native ability to handle multiple, distributed namespaces. For CMDF to fully exploit computational parallelism with the currently integrated code (as mentioned, several components are non-threaded and not parallel-safe), we will use a task push/pull scheme on top of existing MPI, shared-memory and threads support to improve the performance of large-scale simulations. The tasks should be identified a priori from the knowledge base of the components. Networking in the distributed sense will be handled using an event-driven networking engine, in the same manner that IPython1 has proposed to use Twisted. At this level (a layer on top of the Web-native Programming Interface component), the user should be able to allocate simulation resources (machine, processors, memory, time, etc.) or let the system query for available resources in a distributed environment before mapping the simulation task space, and to parameterize a conventional task scheduler (e.g. PBS). Furthermore, data parallelization will be pursued using spatial decomposition of the task space, again using the MPI standard. Task ID: Task-Parallelizer and scheduler component.
	Development of automated inter-method/scale couplings. This involves the development of specific methods for extracting parameters from a given scale/method and transforming them for consumption within a different scale/method, unidirectionally or bidirectionally. The resulting schemes are also components in their own right within the CMDF architecture. This requires fundamental theoretical and applied knowledge of schemes for optimization, classification and feature extraction, including computational Genetic Algorithms (GAs) and Artificial Neural Networks (ANNs), as well as other schemes appropriate for multi-paradigm/multiscale solutions, including numerical interpolation, homogenization and averaging (discrete and continuum PDE solvers in which heterogeneous/fluctuating quantities are replaced by homogeneous/constant ones). Task ID: *Method and Scale Inter-coupling component*. Under development (with Will Ford, Caltech).
	Narrowing the gap between the current architecture and the redesigned one by decoupling critical large components and executable components. This requires C/Fortran and Python coding skills to: profile current components; partition large, coupled low-level code (mainly C/Fortran 77), which also requires some fundamental scientific knowledge and the ability to use appropriate code tracers; optimize Python code to minimize data processing (leaving only control flow) by translating it to C/C++/Fortran code; and add efficient (f2py or SWIG) wrappers to form the corresponding CMDF components. Task ID: Gap-Fit component.
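As one possible starting point for the Components/Simulation database task listed above, the sketch below uses Python's built-in `sqlite3` module to record component properties and typed interfaces, and then queries which outputs could feed which inputs. The schema, table names, component names, and data-type strings are all hypothetical and serve only to illustrate the kind of information the database is meant to hold.

```python
import sqlite3

# In-memory database; a real deployment would need a shared, secured store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE component (
    name        TEXT PRIMARY KEY,
    location    TEXT NOT NULL,     -- where the component resides
    thread_safe INTEGER NOT NULL,  -- 0 or 1
    parallel    INTEGER NOT NULL   -- 0 or 1
);
CREATE TABLE port (
    component TEXT NOT NULL REFERENCES component(name),
    name      TEXT NOT NULL,
    direction TEXT NOT NULL CHECK (direction IN ('in', 'out', 'param')),
    dtype     TEXT NOT NULL,
    PRIMARY KEY (component, name)
);
""")

# Hypothetical entries: a quantum engine that produces charges and a
# molecular dynamics engine that consumes them.
conn.executemany("INSERT INTO component VALUES (?, ?, ?, ?)", [
    ("qm_engine", "local", 0, 0),
    ("md_engine", "local", 1, 1),
])
conn.executemany("INSERT INTO port VALUES (?, ?, ?, ?)", [
    ("qm_engine", "charges", "out", "float64[n]"),
    ("md_engine", "charges", "in", "float64[n]"),
    ("md_engine", "trajectory", "out", "float64[t,n,3]"),
])


def compatible_links(connection):
    """List (producer, output, consumer, input) pairs with matching dtypes."""
    return connection.execute("""
        SELECT o.component, o.name, i.component, i.name
        FROM port AS o
        JOIN port AS i
          ON o.dtype = i.dtype AND o.component <> i.component
        WHERE o.direction = 'out' AND i.direction = 'in'
        ORDER BY o.component, i.component
    """).fetchall()


links = compatible_links(conn)
serial_only = [row[0] for row in conn.execute(
    "SELECT name FROM component WHERE thread_safe = 0 ORDER BY name")]
```

A front end such as the visual programming interface could use a query like `compatible_links` to offer only valid connections, and a scheduler could use the `thread_safe` and `parallel` flags to decide which components must be run as whole tasks.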

Notes:

	This is not an exhaustive specification of the particular tasks.
	Knowledge of materials and process modeling and simulation, molecular simulations, UML and round-trip engineering methods is desirable but not required.
	Students would be introduced to the necessary scientific subjects, tutored in them, and guided throughout.
	Students can contribute to one or more tasks, depending on their level of knowledge and ability at the time of involvement.
	Participation time can vary according to each student's academic requirements and would be decided upon commitment.
	Students who participate will be expected to present their progress periodically and to attend ad hoc meetings (depending on the subject) scheduled by the CMDF leading team (students' academic schedules will be considered in doing so).

Multiscale Modeling and Simulation

Materials and Process Simulation Center (MSC)

California Institute of Technology
