Dataset

class opencosmo.Dataset(header, state, tree=None)
Parameters:
property header: OpenCosmoHeader

The header associated with this dataset.

OpenCosmo headers generally contain information about the original data this dataset was produced from, as well as any analysis that was done along the way.

Returns:

header

Return type:

opencosmo.header.OpenCosmoHeader

property columns: list[str]

The names of the columns in this dataset.

Returns:

columns

Return type:

list[str]

property meta_columns: list[str]
property descriptions: dict[str, str | None]

Return the descriptions (if any) of the columns in this dataset as a dictionary. Columns without a description will be included in the dictionary with a value of None.

Returns:

descriptions – The column descriptions

Return type:

dict[str, str | None]

property units: dict[str, Unit | None]

Return the current units of all columns in the dataset. Columns without units will return None.

Returns:

units – The column units

Return type:

dict[str, astropy.units.Unit | None]

property cosmology: Cosmology

The cosmology of the simulation this dataset is drawn from as an astropy.cosmology.Cosmology object.

Returns:

cosmology

Return type:

astropy.cosmology.Cosmology

property dtype: str

The data type of this dataset.

Returns:

dtype

Return type:

str

property redshift: float | tuple[float, float] | None

The redshift slice or range this dataset was drawn from.

Returns:

redshift

Return type:

float | tuple[float, float] | None

property region: Region

The region this dataset is contained in. If no spatial queries have been performed, this will be the entire simulation box for snapshots or the full sky for lightcones.

Returns:

region

Return type:

opencosmo.spatial.Region

property simulation: HaccSimulationParameters | None

The parameters of the simulation this dataset is drawn from. May return None if the parameters are not included in the file.

Returns:

parameters

Return type:

Optional[opencosmo.dtypes.hacc.HaccSimulationParameters]

property sorted_by: str | None

The column this dataset is sorted by. If not sorted, returns None.

Returns:

column

Return type:

Optional[str]

property tree: Tree | None
property data: QTable | Quantity

Return the data in the dataset in astropy format. The value of this attribute is equivalent to the return value of Dataset.get_data("astropy").

Returns:

data – The data in the dataset.

Return type:

astropy.table.Table or astropy.table.Column

Deprecated since version 1.1.0: Accessing data through the .data attribute is deprecated and will be removed in a future version. Use get_data() instead.

get_metadata(columns=[], ignore_sort=False)
Parameters:
  • columns (str | list[str])

  • ignore_sort (bool)

get_data(format='astropy', unpack=True, metadata_columns=[], wrap_single=False, **kwargs)

Get the data in this dataset as an astropy table/column or as numpy array(s). Note that a dataset does not load data from disk into memory until this function is called. As a result, you should not call this function until you have performed any transformations you plan to apply to the data.

The method supports output into several different formats, including “astropy”, “numpy”, “pandas”, “polars”, “jax”, and “arrow”. Although astropy and numpy are core dependencies of OpenCosmo, the remaining formats require you to have the relevant libraries installed in your python environment. This method will check that it can import the necessary libraries before attempting to read data. Note that outputting as “polars” or “arrow” requires copying the data out of its original numpy arrays, which will impact performance.

If the dataset only contains a single column, it will not be put in a table or dictionary. “astropy”, “numpy” and “arrow” will return a single array in this case, while “polars” and “pandas” will return a Series object. Pass wrap_single=True to always return the format’s multi-column container (QTable, DataFrame, dict, …) regardless of column count.
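The single-column rule can be modelled in a few lines of plain Python. This is only a sketch of the behaviour described above for a dictionary-style output such as “numpy”, not OpenCosmo's implementation; package_columns and the sample values are hypothetical.

```python
def package_columns(columns, wrap_single=False):
    """Toy model of the single-column rule for a dict-style output format."""
    if len(columns) == 1 and not wrap_single:
        # Exactly one column: hand back the bare values, not a container.
        return next(iter(columns.values()))
    # Several columns, or wrap_single=True: always return the container.
    return dict(columns)


one_column = {"fof_halo_mass": [1.0e13, 2.5e13]}
two_columns = {"fof_halo_mass": [1.0e13, 2.5e13], "fof_halo_com_vx": [12.0, -3.5]}

bare = package_columns(one_column)                      # just the values
wrapped = package_columns(one_column, wrap_single=True)  # still a dict
table = package_columns(two_columns)                    # always a dict
```

Passing wrap_single=True is useful when downstream code should not need to special-case datasets that happen to have a single column.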

Parameters:
  • format (str, default="astropy") – The format to output the data in. Currently supported are “astropy”, “numpy”, “pandas”, “polars”, “arrow”, and “jax”.

  • wrap_single (bool, default=False) – If True, always return the format’s natural multi-column container even when only one column is present.

Returns:

data – The data in this dataset.

Return type:

Any

bound(region, select_by=None)

Restrict the dataset to some subregion. The subregion will always be evaluated in the same units as the current dataset. For example, if the dataset is in the default “comoving” unit convention, positions are always in units of comoving Mpc. Region objects themselves, however, do not carry units. See Regions for details of how to construct regions.

Parameters:
  • region (opencosmo.spatial.Region) – The region to query.

  • select_by (Optional[str])

Returns:

dataset – The portion of the dataset inside the selected region

Return type:

opencosmo.Dataset

Raises:
  • ValueError – If the query region does not overlap with the region this dataset resides in

  • AttributeError – If the dataset does not contain a spatial index

evaluate(func, vectorize=False, insert=True, format='astropy', batch_size=-1, allow_overwrite=False, _verify=True, **evaluate_kwargs)

Iterate over the rows in this dataset, apply func to each, and collect the result as new columns in the dataset.

This function is the equivalent of with_new_columns for cases where the new column is not a simple algebraic combination of existing columns. Unlike with_new_columns, this method will evaluate the results immediately and the resulting columns will not change under unit transformations. You may also choose to simply return the result instead of adding it as a column.

The function should take in arguments with the same name as the columns in this dataset that are needed for the computation, and should return a dictionary of output values. Any additional arguments needed by the function can be passed as keyword arguments to evaluate.

The dataset will automatically select the needed columns to avoid reading unnecessary data from disk. The new columns will have the same names as the keys of the output dictionary. See Evaluating on Datasets for more details. The keys of this dictionary must be different from the names of the columns that are already in the dataset, unless allow_overwrite is set to True.

If vectorize is set to True, the full columns will be passed to the function. Otherwise, rows will be passed to the function one at a time. If the function returns None, this method will also return None as output.
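The two calling conventions can be illustrated with a toy model that works on a plain dictionary of lists. This is a sketch under stated assumptions, not OpenCosmo's implementation: evaluate_sketch, mean_density, total_mass and the column names are all hypothetical, and selecting columns by inspecting the function signature is simply one way such matching could be done.

```python
import inspect


def evaluate_sketch(columns, func, vectorize=False, **kwargs):
    """Toy model of evaluate() on a dict of equal-length lists."""
    # Only columns named in the function signature are handed over.
    needed = [name for name in inspect.signature(func).parameters if name in columns]
    if vectorize:
        # One call, with whole columns as arguments.
        return func(**{name: columns[name] for name in needed}, **kwargs)
    # Otherwise: one call per row, collecting each output key into a column.
    n_rows = len(next(iter(columns.values())))
    collected = {}
    for i in range(n_rows):
        output = func(**{name: columns[name][i] for name in needed}, **kwargs)
        for key, value in output.items():
            collected.setdefault(key, []).append(value)
    return collected


def mean_density(mass, radius, prefactor):
    # Receives scalars when called row by row.
    return {"mean_density": prefactor * mass / radius**3}


def total_mass(mass):
    # Receives the whole column when vectorize=True.
    return {"total_mass": sum(mass)}


columns = {"mass": [1.0, 2.0, 4.0], "radius": [1.0, 1.0, 2.0], "unused": [0, 0, 0]}
row_wise = evaluate_sketch(columns, mean_density, prefactor=2.0)
vectorized = evaluate_sketch(columns, total_mass, vectorize=True)
```

In this model the "unused" column is never touched, which mirrors the point that only the columns a function asks for need to be read.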

Keyword arguments can be used to pass in external values that are not columns in the dataset. For example, we can compute each halo’s gas fraction bias — how much gas it retains relative to the cosmic baryon fraction — by passing the dataset’s cosmology object as a keyword argument:

def baryon_fraction_bias(sod_halo_mass_gas, sod_halo_mass, cosmology):
    f_gas = sod_halo_mass_gas / sod_halo_mass
    f_cosmic = cosmology.Ob0 / cosmology.Om0
    return {"sod_halo_baryon_bias": f_gas / f_cosmic}

ds = ds.evaluate(baryon_fraction_bias, cosmology=ds.cosmology, vectorize=True)
Parameters:
  • func (Callable) – The function to evaluate on the rows in the dataset.

  • vectorize (bool, default = False) – Whether to provide the values as full columns (True) or one row at a time (False). Ignored if batch_size is set.

  • insert (bool, default = True) – If true, the data will be inserted as a column in this dataset. The new column will have the same name as the function. Otherwise the data will be returned directly.

  • format (str, default = astropy) – The format in which to provide column data to your function. Supports the same formats as get_data (“astropy”, “numpy”, “pandas”, “polars”, “arrow”, “jax”). When insert=True, the function’s output is converted back to numpy before being stored.

  • allow_overwrite (bool, default = False)

  • batch_size (int, default = -1) – If set, feed data to the function in batches of the specified size. Default is -1, which disables batching. If set to another value, the vectorize flag is ignored.

  • **evaluate_kwargs (any,) – Any additional arguments that are required for your function to run. These will be passed directly to the function as keyword arguments. If a kwarg is an array of values with the same length as the dataset, it will be treated as an additional column.

  • _verify (bool)

Returns:

result – The new dataset with the evaluated column(s) or the results as numpy arrays or astropy quantities

Return type:

Dataset | dict[str, np.ndarray | astropy.units.Quantity]

filter(*masks)

Filter the dataset based on some criteria. See Querying Based on Column Values for more information.

Parameters:

*masks (Mask) – The masks to apply to the dataset, constructed with opencosmo.col()

Returns:

dataset – The new dataset with the masks applied.

Return type:

Dataset

Raises:

ValueError – If a given mask refers to columns that are not in the dataset, or if the filter would return zero rows.

rows(include_units=True, metadata_columns=[])

Iterate over the rows in the dataset. Rows are returned as a dictionary. For performance, it is recommended to first select the columns you need to work with.
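As a rough picture of what row iteration produces, the sketch below yields one dictionary per row from a plain dictionary of lists. It only models the shape of the output; iter_rows, the column names and the values are hypothetical, and no units are attached here.

```python
def iter_rows(columns):
    """Yield one {column name: value} dictionary per row."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


columns = {
    "fof_halo_tag": [7, 8, 9],
    "fof_halo_mass": [1.0e13, 2.5e13, 4.0e12],
}

# Keep the tags of rows whose mass exceeds a threshold.
heavy_tags = [
    row["fof_halo_tag"]
    for row in iter_rows(columns)
    if row["fof_halo_mass"] > 5.0e12
]
```

Each row carries every column in the dataset, which is why narrowing the columns first keeps the per-row work small.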

Parameters:
  • output (str, default = "astropy") – Whether to return values as “astropy” quantities or “numpy” scalars

  • include_units (bool)

Yields:

row (dict) – A dictionary of values for each row in the dataset with units.

Return type:

Generator[Mapping[str, float | Quantity | ndarray], None, None]

select(*columns, **derived_columns)

Create a new dataset from a subset of columns in this dataset. This function accepts wildcards. For example, “fof*” will select all columns that start with “fof”, while “*com*” will select all columns that have “com” somewhere in the middle.

You can also create new columns as part of this call, as long as they are derived from other columns in the dataset. For example:

import opencosmo as oc

dataset = oc.open("haloproperties.hdf5")
fof_halo_px = oc.col("fof_halo_mass")*oc.col("fof_halo_com_vx")

dataset = dataset.select("fof_halo_mass", "*com*", fof_halo_px=fof_halo_px)

This new dataset will contain the fof_halo_mass column, all the columns with “com” in the middle of their name (e.g. fof_halo_com_vx), and a new fof_halo_px column.
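To see which names a given wildcard would pick up, the patterns can be tried against a list of column names with Python's fnmatch module. This is a sketch that assumes the wildcards behave like shell-style globbing; match_columns is a hypothetical helper and sod_halo_com_x is an illustrative column name.

```python
from fnmatch import fnmatchcase


def match_columns(available, *patterns):
    """Return the available names matching any pattern, in original order."""
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


available = ["fof_halo_mass", "fof_halo_com_vx", "sod_halo_com_x", "sod_halo_mass"]

starts_with_fof = match_columns(available, "fof*")
contains_com = match_columns(available, "*com*")
combined = match_columns(available, "fof_halo_mass", "*com*")
```

The same patterns apply to drop, where the matching columns are removed instead of kept.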

Parameters:
  • *columns (str or list[str]) – The column or columns to select.

  • **derived_columns (DerivedColumn) – Any new derived columns that will be instantiated as part of the select

Returns:

dataset – The new dataset with only the selected columns.

Return type:

Dataset

Raises:

ValueError – If any of the given columns are not in the dataset.

drop(*columns)

Create a new dataset without the provided columns. This function accepts wildcards. For example, “fof*” will drop all columns that start with “fof”, while “*com*” will drop all columns that have “com” somewhere in the middle.

Parameters:

*columns (str or list[str]) – The columns to drop

Returns:

dataset – The new dataset without the dropped columns

Return type:

Dataset

Raises:

ValueError – If any of the provided columns are not in the dataset.

sort_by(column, invert=False)

Sort this dataset by the values in a given column. By default sorting is in ascending order (least to greatest). Pass invert = True to sort in descending order (greatest to least).

This can be used to, for example, select the largest halos in a given dataset:

import opencosmo as oc

dataset = oc.open("haloproperties.hdf5")
# Parentheses let the method chain span several lines.
dataset = (
    dataset
    .sort_by("fof_halo_mass", invert=True)
    .take(100, at="start")
)
Parameters:
  • column (Optional[str]) – The column to sort the dataset by. Pass None to remove sorting.

  • invert (bool, default = False) – If False (the default), ordering will be from least to greatest. Otherwise greatest to least.

Returns:

result – A new Dataset ordered by the given column.

Return type:

Dataset

take(n, at='random', mode='local')

Create a new dataset from some number of rows from this dataset.

Can take the first n rows, the last n rows, or n random rows depending on the value of ‘at’.
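The three values of ‘at’ can be pictured on a plain list of rows. The sketch below is a model of the described behaviour only; take_sketch and its seed argument are hypothetical, and returning random rows in their original order is an assumption of this sketch, not a documented guarantee.

```python
import random


def take_sketch(rows, n, at="random", seed=None):
    """Toy model of take() on a plain list of rows."""
    if at not in ("start", "end", "random"):
        raise ValueError(f"invalid value for at: {at!r}")
    if n < 0 or n > len(rows):
        raise ValueError("n must be between 0 and the number of rows")
    if at == "start":
        return rows[:n]
    if at == "end":
        return rows[len(rows) - n:]
    # Random: n distinct rows, kept in their original order (an assumption).
    picked = sorted(random.Random(seed).sample(range(len(rows)), n))
    return [rows[i] for i in picked]


rows = ["a", "b", "c", "d", "e"]
first_two = take_sketch(rows, 2, at="start")
last_two = take_sketch(rows, 2, at="end")
random_two = take_sketch(rows, 2, at="random", seed=0)
```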

Parameters:
  • n (int) – The number of rows to take.

  • at (str) – Where to take the rows from. One of “start”, “end”, or “random”. The default is “random”.

  • mode (str, "local" or "global", default = "local") –

    Controls how n is interpreted when running under MPI. Has no effect if you are not using MPI.

    • "local" (default): n rows are taken independently on each rank.

    • "global": n is the total number of rows to select across all ranks combined. Each rank receives the portion of those rows that it owns. If the dataset is sorted, ranks will coordinate to take from the globally-sorted dataset.

Returns:

dataset – The new dataset with only the selected rows.

Return type:

Dataset

Raises:

ValueError – If n is negative or greater than the number of rows in the dataset, or if ‘at’ is invalid.

take_range(start, end, mode='local')

Create a new dataset from a row range in this dataset. We use standard indexing conventions, so the rows included will be start -> end - 1.

Parameters:
  • start (int) – The beginning of the range.

  • end (int) – The end of the range (exclusive).

  • mode (str, "local" or "global", default = "local") –

    Controls how start and end are interpreted when running under MPI. Has no effect if you are not using MPI.

    • "local" (default): the range is applied independently on each rank.

    • "global": start and end index into the global row space across all ranks combined. Each rank receives the portion of that range it owns. If the dataset is sorted, ranks will coordinate to take from the globally-sorted dataset.
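    One way to picture "global" mode is to split a global half-open range into the part each rank owns. The sketch below assumes each rank holds one contiguous block of the global row space, in rank order; that layout and the helper global_range_per_rank are assumptions made for illustration, not a description of OpenCosmo's internals.

```python
def global_range_per_rank(rank_sizes, start, end):
    """Split a global half-open range [start, end) into per-rank local ranges."""
    local_ranges = []
    offset = 0  # global index of the first row owned by the current rank
    for size in rank_sizes:
        # Shift the global bounds into this rank's local indexing and clamp
        # them to the rows the rank actually owns.
        local_start = min(max(start - offset, 0), size)
        local_end = min(max(end - offset, 0), size)
        local_ranges.append((local_start, local_end))
        offset += size
    return local_ranges


# Three ranks with four rows each; take global rows 3 through 8.
local_ranges = global_range_per_rank([4, 4, 4], start=3, end=9)
rows_taken = sum(hi - lo for lo, hi in local_ranges)
```

    In "local" mode, by contrast, every rank would apply the same (start, end) pair to its own rows.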

Returns:

dataset – The new dataset with only the rows from start to end.

Return type:

Dataset

Raises:

ValueError – If start or end is negative or greater than the length of the dataset, or if end is less than start.

take_rows(rows)

Take the rows of a dataset specified by the rows argument. rows should be an array of integers.

Parameters:

rows (np.ndarray | DataIndex) – The integer indices of the rows to take.

Returns:

dataset – The dataset with only the specified rows included.

Raises:

ValueError – If any of the indices is less than 0 or greater than the length of the dataset.

with_new_columns(descriptions={}, allow_overwrite=False, **new_columns)

Create a new dataset with additional columns. These new columns can be derived from columns already in the dataset, a numpy array, or an astropy quantity array. When a column is derived from other columns, it will behave appropriately under unit transformations. Columns provided directly as astropy quantities will not change under unit transformations. See Adding Custom Columns for examples.
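The difference between derived columns and directly supplied values can be sketched conceptually: a derived column is a recipe that is re-applied to its inputs after they change units, while supplied values are stored as they are. The model below uses a plain scale factor in place of a real unit transformation; materialize and all column names are hypothetical, and this is not how OpenCosmo is implemented.

```python
def materialize(base, derived, direct, scale=1.0):
    """Build the visible columns after scaling the stored columns by `scale`."""
    # Stored columns follow the unit transformation (modelled as a scale factor).
    columns = {name: [scale * v for v in values] for name, values in base.items()}
    # Derived columns are recomputed from the transformed inputs.
    for name, recipe in derived.items():
        columns[name] = recipe(columns)
    # Columns supplied directly as values are carried over unchanged.
    columns.update(direct)
    return columns


base = {"radius": [1.0, 2.0]}  # think of these as Mpc
derived = {"diameter": lambda cols: [2.0 * r for r in cols["radius"]]}
direct = {"fixed_diameter": [2.0, 4.0]}

in_mpc = materialize(base, derived, direct)
in_kpc = materialize(base, derived, direct, scale=1000.0)
```

After the change of scale, "diameter" follows "radius" while "fixed_diameter" keeps its original numbers.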

If allow_overwrite is True, the new column may have the same name as a column that already exists in the dataset. This can be used to transform a column, for example:

log_mass = oc.col("fof_halo_mass").log10()
ds = ds.with_new_columns(fof_halo_mass=log_mass, allow_overwrite=True)

The “fof_halo_mass” column will now be the log of the original “fof_halo_mass” column.

Columns will be given the same name as the argument you use when you pass them into the function. For example, we could do the same as above but name the column “log_fof_halo_mass” with:

log_mass = oc.col("fof_halo_mass").log10()
ds = ds.with_new_columns(log_fof_halo_mass = log_mass)
Parameters:
  • descriptions (str | dict[str, str], optional) – A description for the new columns. These descriptions will be accessible through Dataset.descriptions. If a dictionary, should have keys matching the column names.

  • allow_overwrite (bool, default = False) – If false, attempting to add a new column with the same name as an existing column will throw an error. If true, overwrites are allowed.

  • **new_columns – The new columns to add. The name of the argument is the name the column will take.

Returns:

dataset – This dataset with the columns added

Return type:

opencosmo.Dataset

with_units(convention=None, conversions={}, **columns)

Create a new dataset from this one with a different unit convention, and/or convert one unit to another across the entire dataset, or convert individual columns.

Unit conversions are always performed after a change of convention, and changing conventions clears any existing unit conversions. Individual column conversions always take precedence over blanket unit conversions.
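The precedence between blanket and per-column conversions can be sketched with unit names standing in for real unit objects. This is a model of the rule stated above, not OpenCosmo's implementation; resolve_target_units is a hypothetical helper, the strings are placeholders for astropy units, and fof_halo_center_y is an illustrative column name.

```python
def resolve_target_units(current_units, conversions=None, **column_conversions):
    """Pick the unit each column ends up in under the precedence rule."""
    conversions = conversions or {}
    resolved = {}
    for column, unit in current_units.items():
        if column in column_conversions:
            # A per-column conversion always wins.
            resolved[column] = column_conversions[column]
        else:
            # Otherwise apply the blanket conversion for this unit, if any.
            resolved[column] = conversions.get(unit, unit)
    return resolved


current_units = {
    "fof_halo_center_x": "Mpc",
    "fof_halo_center_y": "Mpc",
    "fof_halo_mass": "Msun",
}
resolved = resolve_target_units(
    current_units,
    conversions={"Mpc": "lyr"},
    fof_halo_center_x="km",
)
```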

Calling this function without arguments will clear any existing unit conversions.

For more, see Working with Units.

import astropy.units as u

# this works
dataset = dataset.with_units(fof_halo_mass=u.kg)

# this clears the previous conversion
dataset = dataset.with_units("scalefree")

# This now fails, because the units of masses
# are Msun / h, which cannot be converted to kg
dataset = dataset.with_units(fof_halo_mass=u.kg)

# this will work, the units of halo mass in the "physical"
# convention are Msun (no h).
dataset = dataset.with_units("physical", fof_halo_mass=u.kg, fof_halo_center_x=u.lyr)

# Suppose you want all distances in lightyears, but the x coordinate of your
# halo center in kilometers, for some reason ¯\_(ツ)_/¯
blanket_conversions = {u.Mpc: u.lyr}
dataset = dataset.with_units(conversions = blanket_conversions, fof_halo_center_x = u.km)
Parameters:
  • convention (str, optional) – The unit convention to use. One of “physical”, “comoving”, “scalefree”, or “unitless”.

  • conversions (dict[astropy.units.Unit, astropy.units.Unit]) – Conversions that apply to all columns in the dataset with the unit given by the key.

  • **columns (astropy.units.Unit) – Custom unit conversions for one or more of the columns in this dataset.

Returns:

dataset – The new dataset with the requested unit convention and/or conversions.

Return type:

Dataset