Main Transformations API

opencosmo provides a simple but powerful API for transforming and querying datasets and collections. Both the main opencosmo.Dataset type and the various collection types have these transformations available, although the details of how they behave differ slightly. Individual collection types may also have additional convenience methods based on their purpose; see Working with Collections for more info. The main transformations are:

with_units: Change the unit convention of the dataset or collection.
filter: Filter a dataset based on the value of one or more columns.
select: Select a subset of columns from a dataset.
take: Select a subset of rows from a dataset.
sort_by: Sort a dataset by one of its columns.
bound: Limit a dataset or collection to a given spatial region.
with_new_columns: Combine columns in a dataset into a new column with automatic unit handling.
evaluate: Evaluate a computation over all the rows in a dataset or collection.

Each of these transformations returns a new dataset or collection with the transformation applied. Because transformations are applied lazily, chaining them together is efficient:

import opencosmo as oc

# Load a dataset
ds = oc.open("haloproperties.hdf5")

# Apply a series of transformations
ds = ds.with_units("scalefree")
ds = ds.filter(oc.col("fof_halo_mass") > 1e13)
ds = ds.select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
ds = ds.take(100, at="random")
ds = ds.with_units("physical")
data = ds.get_data()

In this example, we apply a cut in halo mass using the scalefree unit convention, meaning the filter keeps all halos with a mass above 1e13 Msun/h. We then select a subset of the columns, take 100 random rows, and transform to physical units, removing the factors of h in the final values. See below for more information about unit conventions.

When writing queries like this, it can feel a bit redundant to write ds = ds.transform(...) over and over. In practice it is often more readable to simply apply transformations on top of each other:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .with_units("scalefree")
   .filter(oc.col("fof_halo_mass") > 1e13)
   .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
   .take(100, at="random")
   .with_units("physical")
)

data = ds.get_data()

The surrounding parentheses are what allow Python to continue the expression across multiple lines. Alternatively, for example if you're working in a Jupyter notebook, you can use the line continuation character to split the query across multiple lines:

ds = oc.open("haloproperties.hdf5")

ds = ds \
   .with_units("scalefree") \
   .filter(oc.col("fof_halo_mass") > 1e13) \
   .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) \
   .take(100, at="random") \
   .with_units("physical")

data = ds.get_data()

You are also free to create multiple derivative datasets from the same original dataset:

ds = oc.open("haloproperties.hdf5")

low_mass_ds = (
   ds
   .filter(oc.col("fof_halo_mass") > 1e13, oc.col("fof_halo_mass") < 1e14)
   .with_units("physical")
   .select(["fof_halo_mass", "sod_halo_cdelta"])
)

high_mass_ds = (
   ds
   .filter(oc.col("fof_halo_mass") > 1e14)
   .with_units("physical")
   .select(["fof_halo_mass", "sod_halo_cdelta"])
)

data1 = low_mass_ds.data
data2 = high_mass_ds.data

However, you may also be interested in including all data that passes either filter in a single dataset. You can combine filters with boolean logic using the & and | operators:

ds = oc.open("haloproperties.hdf5")

high_mass_cut = oc.col("fof_halo_mass") > 1e14
low_mass_cut = oc.col("fof_halo_mass") < 1e12
low_concentration_cut = oc.col("sod_halo_cdelta") < 5

my_filter = (high_mass_cut | low_mass_cut) & low_concentration_cut
filtered_ds = ds.filter(my_filter)

Because transformations are evaluated lazily, you can have many derivative datasets without incurring a large memory overhead.
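
To make the idea of lazy evaluation concrete, here is a minimal sketch of how a dataset-like object can record transformations and only apply them when data is requested. LazyTable and its methods are hypothetical names invented for this illustration; this is not how opencosmo is implemented internally, only a toy model of the behavior described above.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated dataset (not opencosmo's implementation)."""

    def __init__(self, columns, steps=()):
        self._columns = columns      # shared between derivatives, never copied
        self._steps = tuple(steps)   # recorded filters, not yet applied

    def filter(self, column, predicate):
        # Record the step and return a new object; no rows are touched yet.
        return LazyTable(self._columns, self._steps + ((column, predicate),))

    @property
    def data(self):
        # Only now walk the rows and apply every recorded filter.
        n_rows = len(next(iter(self._columns.values())))
        keep = [
            i for i in range(n_rows)
            if all(pred(self._columns[col][i]) for col, pred in self._steps)
        ]
        return {name: [vals[i] for i in keep] for name, vals in self._columns.items()}


base = LazyTable({"mass": [1e12, 5e13, 2e14], "cdelta": [4.0, 6.0, 3.0]})

# Two derivative tables share the same underlying columns.
low = base.filter("mass", lambda m: m < 1e14)
high = base.filter("mass", lambda m: m >= 1e14)

print(low.data)   # rows are materialized only at this point
print(high.data)
```

Each derivative holds only a short list of recorded steps plus a reference to the shared columns, which is why creating many of them is cheap in this model.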

Unit Conventions

The with_units transformation is used to change the unit convention of the dataset. opencosmo supports the following unit conventions:

unitless: The dataset is read without applying any units
scalefree: The dataset is in “scale-free” units, meaning all lengths are in comoving Mpc/h and all masses are in Msun/h. This is the unit convention that the raw values are stored in.
comoving: Factors of h are absorbed into the values, but positions and velocities still use comoving coordinates.
physical: Factors of h are absorbed into the values, and positions and velocities are converted to physical coordinates.

When you initially load a dataset, it always uses the “comoving” unit convention. You can change this at any time on any dataset or collection by simply calling with_units with the desired unit convention. For more information, see Working with Units.
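
As a rough worked example of what these conventions mean numerically, the sketch below converts a mass and a position by hand. The helper functions, the value h = 0.7 and the redshift are all assumptions chosen for illustration; in practice with_units performs these conversions for you using the cosmology stored with the data.

```python
def scalefree_to_comoving(value, h):
    """Absorb one factor of h: Mpc/h -> Mpc, or Msun/h -> Msun."""
    return value / h


def comoving_to_physical_length(length, redshift):
    """Scale a comoving length by the scale factor a = 1 / (1 + z)."""
    return length / (1.0 + redshift)


h = 0.7          # assumed value, for illustration only
redshift = 1.0   # assumed redshift of the data, for illustration only

mass_scalefree = 1e13   # Msun/h
x_scalefree = 70.0      # comoving Mpc/h

mass_comoving = scalefree_to_comoving(mass_scalefree, h)         # Msun
x_comoving = scalefree_to_comoving(x_scalefree, h)               # comoving Mpc
x_physical = comoving_to_physical_length(x_comoving, redshift)   # physical Mpc

print(f"{mass_comoving:.3e} Msun")
print(f"{x_comoving:.1f} comoving Mpc, {x_physical:.1f} physical Mpc")
```

With these assumed numbers, a mass cut of 1e13 in the scalefree convention corresponds to roughly 1.43e13 Msun once the factor of h is absorbed.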

Adding Columns

You can add new columns to a given dataset that are derived from pre-existing columns by using oc.col() to construct the new columns and passing them to with_new_columns. The new columns will inherit the cosmological dependence of the columns they are created from, and can be used throughout the transformations API as usual.

ds = oc.open("haloproperties.hdf5")

fof_halo_vtotal = (oc.col("fof_halo_com_vx")**2 + oc.col("fof_halo_com_vy")**2 + oc.col("fof_halo_com_vz")**2)**(0.5)
fof_halo_com_p = oc.col("fof_halo_mass") * fof_halo_vtotal

ds = ds.with_new_columns(fof_halo_com_p = fof_halo_com_p)

The dataset will now contain a “fof_halo_com_p” column that can be used for filtering and selections as usual. Because the column definition was created outside the dataset itself, it can be used across multiple datasets as needed.

You can also simply pass values as a numpy array or astropy quantity:

import astropy.units as u
import numpy as np

random_angle = np.random.uniform(10, 50, len(ds))*u.arcmin
ds = ds.with_new_columns(angle = random_angle)

Columns can be added to collections as well, but there are some subtleties. See Working with Collections for more information.

Filtering

Filters operate on columns of a given dataset and return a new dataset that only contains the rows that satisfy the filter. Filters are constructed using the opencosmo.col() function, so they can be constructed independently of any single dataset. Available filters include:

Equality: col("column_name") == value
Inequality: col("column_name") != value
Greater than: col("column_name") > value
Greater than or equal to: col("column_name") >= value
Less than: col("column_name") < value
Less than or equal to: col("column_name") <= value
Membership: col("column_name").isin([value1, value2, ...])
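
To illustrate why filters like these can be built without reference to any particular dataset, here is a minimal sketch of a deferred column expression. Col and Mask are hypothetical classes written for this example and only cover a few of the operators listed above; they are not opencosmo's own classes.

```python
import operator


class Mask:
    """A deferred row test that can be combined with & and |."""

    def __init__(self, test):
        self.test = test  # function taking a row dict and returning a bool

    def __and__(self, other):
        return Mask(lambda row: self.test(row) and other.test(row))

    def __or__(self, other):
        return Mask(lambda row: self.test(row) or other.test(row))


class Col:
    """Hypothetical column reference; it stores only a column name."""

    def __init__(self, name):
        self.name = name

    def _compare(self, op, value):
        # Nothing is evaluated here; the comparison is stored for later.
        return Mask(lambda row: op(row[self.name], value))

    def __gt__(self, value):
        return self._compare(operator.gt, value)

    def __lt__(self, value):
        return self._compare(operator.lt, value)

    def isin(self, values):
        allowed = set(values)
        return Mask(lambda row: row[self.name] in allowed)


# The filter is defined before any data exists...
my_filter = ((Col("fof_halo_mass") > 1e14) | (Col("fof_halo_mass") < 1e12)) & (
    Col("sod_halo_cdelta") < 5
)

# ...and can then be applied to any table that has the named columns.
rows = [
    {"fof_halo_mass": 5e11, "sod_halo_cdelta": 3.0},
    {"fof_halo_mass": 5e13, "sod_halo_cdelta": 4.0},
    {"fof_halo_mass": 2e14, "sod_halo_cdelta": 7.0},
    {"fof_halo_mass": 3e14, "sod_halo_cdelta": 2.0},
]
kept = [row for row in rows if my_filter.test(row)]
print([row["fof_halo_mass"] for row in kept])
```

Note the parentheses around each comparison: in Python, & and | bind more tightly than > and <, so they are needed whenever comparisons are combined inline.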

Filters do not need to include units. However, a filter with incorrect units will raise an error:

import astropy.units as u
from astropy.cosmology import units as cu
import opencosmo as oc

ds = oc.open("haloproperties.hdf5")

# This will work fine
min_mass = oc.col("fof_halo_mass") > 1e13
ds = ds.filter(min_mass)

# This will work fine
min_mass_unitful = oc.col("fof_halo_mass") > 1e13 * u.Msun
ds = ds.filter(min_mass_unitful)

# This will fail: in the default comoving convention the masses are in Msun, not Msun / h
min_mass_unitful = oc.col("fof_halo_mass") > 1e13 * u.Msun / cu.littleh
ds = ds.filter(min_mass_unitful)

The behavior of filters on collections depends on the collection type. See the Working with Collections page for more information.

Selecting Columns

For small datasets, it is usually not an issue to request all the columns in a given dataset. However, for large datasets, loading everything into memory is slow and consumes significant quantities of memory. We can use the opencosmo.Dataset.select() transformation to select only the subset of columns that are useful for our analysis. Select transformations can be applied sequentially, in which case the second select will only work if it contains columns that were selected in the first select. For example:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
   .select(["fof_halo_mass", "fof_halo_center_x"])
)
# This is fine

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
   .select(["fof_halo_mass", "sod_halo_cdelta"])
)
# This will raise an error, because sod_halo_cdelta was not in the first select

select also accepts wildcards, allowing you to select large subsets of related columns without typing them all out explicitly:

ds = oc.open("haloproperties.hdf5")
ds = ds.select(["fof*", "*com*"])

There is also an equivalent drop function, which drops columns instead of selecting them:

ds = oc.open("haloproperties.hdf5")
ds = ds.drop(["fof*", "block"])
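
If you want to check which columns a set of wildcard patterns would pick up, you can experiment with plain column names. The sketch below assumes shell-style matching as provided by Python's fnmatch module, which may differ in detail from the matching select and drop actually use; the column list is a made-up sample.

```python
from fnmatch import fnmatchcase

# Made-up sample of column names, for illustration only
columns = [
    "fof_halo_mass",
    "fof_halo_com_vx",
    "sod_halo_com_x",
    "sod_halo_cdelta",
    "block",
]


def matching(columns, patterns):
    """Return the columns that match at least one pattern, in original order."""
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


# Analogue of select(["fof*", "*com*"]): keep anything matching either pattern
selected = matching(columns, ["fof*", "*com*"])

# Analogue of drop(["fof*", "block"]): keep everything that matches neither
dropped = matching(columns, ["fof*", "block"])
remaining = [c for c in columns if c not in dropped]

print(selected)
print(remaining)
```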

Filters and selects generally behave as you might expect. If you select after filtering, the resulting dataset will only have the columns that were selected for the rows that passed the filter. If you select before filtering, the filter can only use columns that were included in the select. For example, this works:

import opencosmo as oc
ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
   .filter(oc.col("fof_halo_mass") > 1e13)
)

as does this:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .filter(oc.col("fof_halo_mass") > 1e13)
   .select(["fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
)
# This is also fine

but this will raise an error:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .select(["fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"])
   .filter(oc.col("fof_halo_mass") > 1e13)
)
# fof_halo_mass is not in the dataset when "filter" is called.

Taking Rows

The opencosmo.Dataset.take() transformation is used to select a subset of rows from a dataset. The at argument can be used to specify how the rows are selected. The available options are:

at="random": Select a random subset of n rows from the dataset (default).
at="start": Select the first n rows from the dataset.
at="end": Select the last n rows from the dataset.

As with the select transformations, take transformations can be chained together. However, you cannot take more rows than are present in the dataset:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .take(100, at="random")
   .take(500, at="random")
)
# This will raise an error

You can also take a range of rows with opencosmo.Dataset.take_range(). As with all other transformations, this creates a new dataset so the following is valid:

ds = oc.open("haloproperties.hdf5")

ds = (
   ds
   .take_range(500, 1000)
   .take(100, at="start")
)

This will take the rows 500-1000 from the original dataset, and then take the first 100 rows from that new dataset. The original dataset is unchanged.
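
The way chained takes compose can be pictured with plain lists of row indices. The functions below are toy stand-ins written for this illustration, not opencosmo's implementation. In particular, treating the range as half-open (start included, end excluded) is an assumption made here.

```python
import random


def take_range(rows, start, end):
    # Assumption for this sketch: start is included, end is excluded.
    return rows[start:end]


def take(rows, n, at="random", seed=None):
    if n > len(rows):
        raise ValueError(f"cannot take {n} rows from a table with {len(rows)} rows")
    if at == "start":
        return rows[:n]
    if at == "end":
        return rows[len(rows) - n:]
    if at == "random":
        return random.Random(seed).sample(rows, n)
    raise ValueError(f"unknown value for at: {at!r}")


original = list(range(2000))  # stand-in for the row indices of a dataset

# Each call works on the result of the previous one, not on the original.
subset = take(take_range(original, 500, 1000), 100, at="start")

print(subset[0], subset[-1], len(subset))
print(len(original))  # the original list is untouched
```

Because the second take only sees the 500 rows produced by the first step, asking it for more than 500 rows would fail, which mirrors the error case shown earlier in this section.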

Sorting

You can re-order a dataset based on the value of some column with opencosmo.Dataset.sort_by(). By default, this sorts in ascending order (from lowest to highest). To sort in descending order, pass invert=True.

For example, to get the 100 most massive halos in a given simulation, ordered from most to least massive:

ds = oc.open("haloproperties.hdf5")
ds = ds.sort_by("fof_halo_mass", invert=True).take(100, at="start")

Or, to get the 100 least massive halos, ordered from least to most massive:

ds = ds.sort_by("fof_halo_mass").take(100, at="start")

You can also use take in clever ways to get other results. For example, to get the 100 most massive halos but ordered from least to most massive:

ds = ds.sort_by("fof_halo_mass").take(100, at="end")
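
The three combinations above can be checked against a small list of made-up masses. This sketch uses Python's built-in sorted and slicing as stand-ins for sort_by and take; the sort_by function defined here is a local helper for the example, not the opencosmo method.

```python
masses = [3e13, 8e14, 1e12, 5e13, 2e14]  # made-up values
n = 2


def sort_by(values, invert=False):
    """Local helper: ascending by default, descending when invert=True."""
    return sorted(values, reverse=invert)


# sort_by(..., invert=True) then take(n, at="start")
most_massive_desc = sort_by(masses, invert=True)[:n]

# sort_by(...) then take(n, at="start")
least_massive_asc = sort_by(masses)[:n]

# sort_by(...) then take(n, at="end")
most_massive_asc = sort_by(masses)[len(masses) - n:]

print(most_massive_desc)
print(least_massive_asc)
print(most_massive_asc)
```

The first and last results contain the same halos, only in opposite order.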

Spatial Querying

OpenCosmo data contains a spatial index which makes it efficient to perform spatial queries on the data. These queries can be performed by defining a region, and then passing it into opencosmo.Dataset.bound():

ds = oc.open("haloproperties.hdf5")
region = oc.make_box((20,20,20), (40,40,40))
bound_ds = ds.bound(region)

For lightcone data, spatial queries are performed using two-dimensional regions on the sky. For example:

import astropy.units as u
from astropy.coordinates import SkyCoord
import opencosmo as oc

ds = oc.open("lc_haloproperties.hdf5")
center = SkyCoord(45*u.deg, -30*u.deg)
radius = 30*u.arcmin
region = oc.make_cone(center, radius)
bound_ds = ds.bound(region)

See Regions for more information about constructing regions.

As with other transformations, spatial queries can be chained together to build complex query pipelines. If a given region contains no data, the spatial query will return a dataset with length zero.
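
Conceptually, a box query keeps the rows whose position falls inside the region. The sketch below shows that test by brute force on a few made-up halo centers. It assumes the box is described by two opposite corners and that points on the boundary count as inside; both are assumptions of this example, and the real query uses the spatial index rather than checking every row.

```python
def in_box(point, corner_min, corner_max):
    """True if every coordinate lies between the two corners (inclusive)."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, corner_min, corner_max))


# Made-up halo centers, for illustration only
halo_centers = [
    (25.0, 30.0, 35.0),
    (10.0, 30.0, 35.0),
    (39.9, 20.0, 40.0),
    (50.0, 50.0, 50.0),
]

inside = [p for p in halo_centers if in_box(p, (20, 20, 20), (40, 40, 40))]

# A region that contains no data simply yields an empty result.
empty = [p for p in halo_centers if in_box(p, (60, 60, 60), (70, 70, 70))]

print(inside)
print(len(empty))
```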

There are some complications that arise when working with spatial queries in an MPI context. See Working with MPI for more details.

Iterating Over Rows

If you want to work row-by-row, you can always iterate over the dataset with opencosmo.Dataset.rows():

ds = oc.open("haloproperties.hdf5")

for row in ds.rows():
   # Do something with the row
   print(row["fof_halo_mass"], row["fof_halo_center_x"])

At each iteration, the row will be a dictionary holding the values of each column for that row, with units applied. If you only need a subset of the columns, consider using opencosmo.Dataset.select() to select only those columns before iteration.
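
To see why selecting columns first helps, it can be useful to picture row iteration over column-oriented data. The generator below is a toy stand-in written for this illustration (it ignores units and is not opencosmo's implementation): every row dictionary has one entry per column, so fewer columns means less work per row.

```python
def rows(columns):
    """Yield one dict per row from a dict of equal-length columns."""
    names = list(columns)
    for values in zip(*columns.values()):
        yield dict(zip(names, values))


# Made-up column data, for illustration only
table = {
    "fof_halo_mass": [1e13, 2e13, 5e13],
    "fof_halo_center_x": [10.0, 20.0, 30.0],
}

for row in rows(table):
    print(row["fof_halo_mass"], row["fof_halo_center_x"])
```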

Evaluating Complex Expressions

Generally, basic data manipulation is not sufficient for science. We need to fit models and perform complex operations. The evaluate method can handle the low-level data management, leaving you to focus on building your model. See Evaluating Complex Expressions on Datasets and Collections for more information.