ExaLearn is a US Department of Energy (DOE) Exascale Computing Project (ECP) center developing and applying machine learning methods in high-performance computing environments. This repository hosts collections of training and test data ("Projects") relevant to ExaLearn goals. See below for information about available Projects.
To work with a specific Project, select "Search <project-name>." You can then browse and search the Project's contents, and download individual data elements. If you log in, you can also use the "BagIt" button to aggregate a data subset defined via search for download or transfer.
This project contains data from several cosmological N-body dark matter simulations. The data are stored here both in raw form as NumPy arrays and assembled into TFRecord files that can be read by TensorFlow; the TFRecords contain data-label pairs that can be used for supervised learning problems. The simulations use MUSIC to generate the initial conditions and are evolved with pyCOLA, a multithreaded Python/Cython N-body code. The output of each simulation is binned into a 3D histogram of particle counts in a cube of fixed size, which is then sliced into sub-volumes and 2D sheets to produce data samples of more manageable size. Because the data are made available in these multiple formats for user convenience, the total size of each dataset stored here is several times larger than a single copy would be. More details on the process of generating these datasets can be found in the CosmoFlow paper.
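The binning and slicing steps described above can be sketched with NumPy as follows. This is an illustrative sketch only, not the CosmoFlow pipeline itself: the function names, the box size, the grid resolution, the number of splits, and the random particle positions are all hypothetical stand-ins, since the actual values are not given here.

```python
import numpy as np


def bin_particles(positions, box_size, n_bins):
    """Bin (N, 3) particle positions into an n_bins^3 cube of particle counts."""
    counts, _ = np.histogramdd(
        positions,
        bins=n_bins,
        range=[(0.0, box_size)] * 3,
    )
    return counts


def split_subvolumes(cube, n_split):
    """Cut a cubic array into n_split^3 equally sized sub-volumes."""
    step = cube.shape[0] // n_split
    return [
        cube[i * step:(i + 1) * step,
             j * step:(j + 1) * step,
             k * step:(k + 1) * step]
        for i in range(n_split)
        for j in range(n_split)
        for k in range(n_split)
    ]


def slice_sheets(cube):
    """Cut a cubic array into 2D sheets along its first axis."""
    return [cube[i] for i in range(cube.shape[0])]


# Hypothetical example: 1000 random particles in a box of side 100 (arbitrary units).
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 100.0, size=(1000, 3))

cube = bin_particles(positions, box_size=100.0, n_bins=8)
subvolumes = split_subvolumes(cube, n_split=2)
sheets = slice_sheets(cube)
```

Storing the full cube, the sub-volumes, and the sheets side by side is what makes the stored dataset several times larger than any one representation.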
The governing cosmological parameters of interest in each dataset are varied uniformly around a mean value with some pre-defined spread. For example, in the cosmoUniverse_2019_02_4parE dataset, the Hubble constant H_0 is varied around H_0 = 70 with a 30% spread. For the purpose of machine learning, it is convenient to have normalized data/labels, so the labels corresponding to these cosmological parameters (for the data in the TFRecords) are stored as normalized unit values within the range [-1, 1]. The mapping from these unit labels to the actual physical parameter values is given by P = m + U*h, where P is the actual physical parameter value, m is the mean physical parameter value being varied around, U is the unit parameter value, and h is the half-width of the spread.
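As a small sketch of the mapping P = m + U*h and its inverse, the helper functions below convert between unit labels and physical values. The function names are hypothetical, and the half-width of 21 used in the example is an assumption made only for illustration (it treats the 30% spread around H_0 = 70 as the half-width; whether the spread denotes the half-width or the full width is not specified here).

```python
def unit_to_physical(unit_value, mean, half_width):
    """Map a unit label U in [-1, 1] to a physical value P = m + U*h."""
    return mean + unit_value * half_width


def physical_to_unit(physical_value, mean, half_width):
    """Inverse mapping: U = (P - m) / h."""
    return (physical_value - mean) / half_width


# Assumed values for illustration only: m = 70, h = 21.
H0_MEAN = 70.0
H0_HALF_WIDTH = 21.0

h0_physical = unit_to_physical(0.5, H0_MEAN, H0_HALF_WIDTH)
h0_unit = physical_to_unit(h0_physical, H0_MEAN, H0_HALF_WIDTH)
```

Because both functions use only arithmetic operators, they also work element-wise on NumPy arrays of labels.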
The records in this project are public.
Search CosmoFlow
This project contains data from the TomoGAN project.
The records in this project are public.
Search TomoGAN