First Steps
To follow this tutorial, download the “haloproperites.hdf5” and “haloparticles.hdf5” files from the OpenCosmo Google Drive and set the environment variable. This file contains properties of dark-matter halos from a small hydrodynamical simulation run with HACC. You can easily open the data with the open function:
import opencosmo as oc
dataset = oc.open("haloproperties.hdf5")
print(dataset)
OpenCosmo Dataset (length=237441)
Cosmology: FlatLambdaCDM(name=None, H0=<Quantity 67.66 km / (Mpc s)>, Om0=0.3096446816186967, Tcmb0=<Quantity 0. K>, Neff=3.04, m_nu=None, Ob0=0.04897468161869667)
First 10 rows:
block fof_halo_1D_vel_disp fof_halo_center_x ... sod_halo_sfr unique_tag
km / s Mpc ... solMass / yr
int32 float32 float32 ... float32 int64
----- -------------------- ----------------- ... ------------ ----------
0 32.088795 1.4680439 ... -101.0 21674
0 41.14525 0.19616994 ... -101.0 44144
0 73.82962 1.5071135 ... 3.1447952 48226
0 31.17231 0.7526525 ... -101.0 58472
0 23.038841 5.3246417 ... -101.0 60550
0 37.071426 0.5153746 ... -101.0 537760
0 26.203058 2.1734374 ... -101.0 542858
0 78.7636 2.1477687 ... 0.0 548994
0 37.12636 6.9660196 ... -101.0 571540
0 58.09235 6.072006 ... 1.5439711 576648
The open function returns a Dataset object, which can retrieve the raw data from disk. It also holds additional information about the data, such as the simulation it was drawn from and the associated cosmology. You can easily access the data and cosmology directly as Astropy objects:
dataset.get_data()
dataset.cosmology
The first will return an astropy table of the data, with all associated units already applied. The second will return the astropy cosmology object that represents the cosmology the simulation was run with.
Basic Querying
Although you can access data directly, opencosmo provides tools for querying and transforming the data in a fully cosmology-aware context. For example, suppose we wanted to plot the concentration-mass relationship for the halos in our simulation above a certain mass. One way to perform this would be as follows:
dataset = dataset
.filter(oc.col("fof_halo_mass") > 1e13)
.take(1000)
.select(("fof_halo_mass", "sod_halo_cdelta"))
print(dataset)
OpenCosmo Dataset (length=1000)
Cosmology: FlatLambdaCDM(name=None, H0=<Quantity 67.66 km / (Mpc s)>, Om0=0.3096446816186967, Tcmb0=<Quantity 0. K>, Neff=3.04, m_nu=None, Ob0=0.04897468161869667)
First 10 rows:
fof_halo_mass sod_halo_cdelta
solMass
float32 float32
---------------- ---------------
11220446000000.0 4.5797048
17266723000000.0 7.4097505
51242150000000.0 1.8738283
70097712000000.0 4.2764015
51028305000000.0 2.678151
11960567000000.0 3.9594727
15276915000000.0 5.793542
16002001000000.0 2.4318497
47030307000000.0 3.7146702
15839942000000.0 3.245569
We could then plot the data, or perform further transformations.
Data Collections
This is cool on its own, but the real power of opencosmo comes from its ability to work with different data types. Go ahead and download the “haloparticles” file from the OpenCosmo Google Drive and try the following:
import opencosmo as oc
data = oc.open("haloproperties.hdf5", "haloparticles.hdf5")
This will return a data collection that will allow you to query and transform the data as before, but will associate the halos with their particles.
structures = data
.filter(oc.col("fof_halo_mass") > 1e13)
.take(1000, at="random")
for halo in structures.halos(["dm_particles", "star_particles"]):
halo_mass = halo["halo_properties"]["fof_halo_mass"]
dm_particles = halo["dm_particles"]
star_particles = halo["star_particles"]
# do_work
In each iteration, the “halo” object will be a dictionary containing a “halo_properties” dictionary as well as two opencosmo.Dataset objects, one containing the dark matter particles associated with the halos and the other containing the star particles. Because these are standard datasets, you can perform further transformaions on them as is useful for your analysis.
Core Concepts
The core design principle of opencosmo is lazy immutability. Data is not retrieved from disk until it is actaully needed, and dataset and collections cannot be modified once created. This approach leads to a number of performance benefits, and makes it possible to work with data that is much larger than the available memory. However, many data management libraries do not operate in this way so it is good to be aware of how opencosmo behaves.
Immutability
Immutability means that datasets do not change once they are created. Try to spot the problem with the following example:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
ds.filter(oc.col("fof_halo_mass") > 1e14)
data = ds.get_data("numpy")
assert np.all(data["fof_halo_mass"] > 1e14)
Running this code will give you an assertion errror. When we call ds.filter, the ds dataset object is not modified. Instead, ds.filter returns a new dataset, which is the same as the old dataset but with the new filter applied. Because we don’t assign this new dataset to anything, it simply disappears.
We can fix this problem by simply assigning the new dataset to our ds variable:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
ds = ds.filter(oc.col("fof_halo_mass") > 1e14)
data = ds.get_data("numpy")
assert np.all(data["fof_halo_mass"] > 1e14)
This pattern is quite common. Often, we don’t care about the original dataset once we’ve done our filter. In these cases, it is perfectly acceptable to simply discard the original dataset by binding the new dataset to the same variable name.
Because dataset operations return new datasets, we can chain together these operations when we do not care about the intermmediate results:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
ds = ds
.filter(oc.col("fof_halo_mass") > 1e14)
.select(("fof_halo_mass", "sod_halo_mass", "sod_halo_cdelta"))
.take(10_000, at="random")
data = ds.get_data("numpy")
assert np.all(data["fof_halo_mass"] > 1e14)
assert len(data) == 3 # Dictionary with 3 columns
assert len(data["fof_halo_mass"] == 10_000)
However there are some cases where we don’t to overwrite the original dataset. We could, if we wanted to, also hold onto the original dataset or create multiple derivative datasets:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
ds_high_mass = ds.filter(oc.col("fof_halo_mass") > 1e14)
ds_low_mass = ds.filter(oc.col("fof_halo_mass") < 1e14)
data_high_mass = ds_high_mass.get_data("numpy")
data_low_mass = ds_low_mass.get_data("numpy")
assert np.all(data_high_mass["fof_halo_mass"] > 1e14)
assert np.all(data_low_mass["fof_halo_mass"] > 1e14)
Laziness and Caching
Laziness means putting off the evlauation of some expression until its result is actually needed. In the context of opencosmo, this means data remains on disk until it is actually needed. Opening a dataset with opencosmo.open() doesn’t read any data into memory. Instead, it simple ensures that the file conforms to the opencosmo standard and reads enough of the metadata to perform operations and communicate to you about what is in the dataset.
In practice, this means that calling get_data or writing data to disk with opencosmo.write() will almost always be the slowest part of any script. These methods require the toolkit to actually read data from disk, and in some cases even create new columns based on operations like with_new_columns or evaluate. Depending on the quantity of data requested, or the complexity of the relationships between existing columns and new columns, this can take some time. o
However the advanatages of this approach are significant. opencosmo has no problem working with datasets that are much too large to fit in memory. You can open such a dataset and perform operations just as you would with a much smaller dataset. So long as the resultant dataset is small enough, everything will work when you call get_data. Of course, if you try to read in more data than can fit in memory, opencosmo will try to get that data for you even if it crashes your process.
However this approach leads to some complications. Consider what might happen in the following example.
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
ds = ds.filter(oc.col("fof_halo_mass") > 1e14)
data = ds.get_data()
ds_selected = ds.select(("fof_halo_mass", "sod_halo_mass"))
data_selected = ds_selected.get_data()
What happens in this situation? Do we load the fof_halo_mass and sod_halo_mass columns twice? This is inneficient and can quickly lead to memory problems. It’s good to avoid reading data until we actually need it, because we will often find that much of the data doesn’t end up having to be read at all. But once data is read from disk, we want to try to avoid reading it a second time if we can.
opencosmo has a caching layer which attempts to avoid reading data twice. In the example above, data read in the first get_data call is cached and made available to any downstream datasets. ds_selected has access to this cache. Since the fof_halo_mass and sod_halo_mass columns are already in memory, the final ds_selected.get_data() call does not require reading anything from disk or allocating new memory. It simply returns the data that has already been read.
This also works with more subtle cases:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
data = ds.get_data()
ds = ds.filter(oc.col("fof_halo_mass") > 1e14)
data_filtered = ds.get_data()
Because the all the data required to create data_filtered is already in memory, opencosmo will use that data to construct the data returned by the second get_data call. This is much faster than going to disk and reading the data a second time. However, it is not possible to avoid creating a copy in memory in this case. Halos above a mass of \(10^{14}\) are randomly distributed throughout the data. Selecting only the rows with those high-mass halos requires creating new arrays in memory.
The goal of opencosmo is to allow you to spend less time worrying about data management and more time worrying about your science. The toolkit will handle moving data in and out of memory as efficiently as possible. You don’t need to worry about the details.