Dataset
- opencosmo.col(column_name)
Create a reference to a column with a given name. These references can be combined to produce new columns or express queries that operate on the values in a given dataset. For example:
import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
query = oc.col("fof_halo_mass") > 1e14
px = oc.col("fof_halo_mass") * oc.col("fof_halo_com_vx")
ds = ds.with_new_columns(fof_halo_com_px=px).filter(query)
For more advanced usage, see Working with Columns
- Parameters:
column_name (str)
- Return type:
Column
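To make the idea of a column reference concrete, the following is a toy model in plain Python. It is not opencosmo's implementation: `Expr`, `ColRef`, and `evaluate` are hypothetical names used only for illustration. The sketch shows how operators applied to a reference can build a deferred expression that is evaluated later against a dataset's columns.

```python
import operator


class Expr:
    """A deferred element-wise expression over named columns (toy model)."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, columns):
        left = _resolve(self.left, columns)
        right = _resolve(self.right, columns)
        return [self.op(a, b) for a, b in zip(left, right)]

    def __mul__(self, other):
        return Expr(operator.mul, self, other)

    def __gt__(self, other):
        return Expr(operator.gt, self, other)


class ColRef(Expr):
    """A reference to a column by name; holds no data itself."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, columns):
        return list(columns[self.name])


def _resolve(operand, columns):
    """Evaluate nested expressions; repeat plain scalars once per row."""
    if isinstance(operand, Expr):
        return operand.evaluate(columns)
    n_rows = len(next(iter(columns.values())))
    return [operand] * n_rows


# Hypothetical in-memory columns standing in for a dataset.
columns = {
    "fof_halo_mass": [5e13, 2e14, 3e14],
    "fof_halo_com_vx": [100.0, -50.0, 10.0],
}
query = ColRef("fof_halo_mass") > 1e14
px = ColRef("fof_halo_mass") * ColRef("fof_halo_com_vx")
mask = query.evaluate(columns)       # one boolean per row
momentum = px.evaluate(columns)      # one product per row
```

Nothing is computed when `query` and `px` are created; the work happens only when `evaluate` is called, which is the property that lets a query or derived column be defined before any data is read.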
- class opencosmo.Dataset(handler, header, state, tree=None)
- Parameters:
handler (DatasetHandler)
header (OpenCosmoHeader)
state (DatasetState)
tree (Optional[Tree])
- property header: OpenCosmoHeader
The header associated with this dataset.
OpenCosmo headers generally contain information about the original data this dataset was produced from, as well as any analysis that was done along the way.
- Returns:
header
- Return type:
OpenCosmoHeader
- property columns: list[str]
The names of the columns in this dataset.
- Returns:
columns
- Return type:
list[str]
- property cosmology: Cosmology
The cosmology of the simulation this dataset is drawn from as an astropy.cosmology.Cosmology object.
- Returns:
cosmology
- Return type:
astropy.cosmology.Cosmology
- property dtype: str
The data type of this dataset.
- Returns:
dtype
- Return type:
str
- property redshift: float | tuple[float, float]
The redshift slice or range this dataset was drawn from.
- Returns:
redshift
- Return type:
float | tuple[float, float]
- property region: Region
The region this dataset is contained in. If no spatial queries have been performed, this will be the entire simulation box for snapshots or the full sky for lightcones.
- Returns:
region
- Return type:
opencosmo.spatial.Region
- property simulation: HaccSimulationParameters
The parameters of the simulation this dataset is drawn from.
- Returns:
parameters
- Return type:
HaccSimulationParameters
- property data: Table | Column
Return the data in the dataset in astropy format. The value of this attribute is equivalent to the return value of Dataset.get_data("astropy"). Unlike get_data, data retrieved via this attribute is cached, so further accesses of Dataset.data should be nearly instantaneous.
There is one caveat. If you modify the table, those modifications will persist if you later request the data again with this attribute. Calls to Dataset.get_data will be unaffected, and datasets generated from this dataset will not contain the modifications. If you plan to modify the data in this table, you should use Dataset.with_new_columns.
- Returns:
data – The data in the dataset.
- Return type:
astropy.table.Table or astropy.table.Column
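The caching caveat can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's code: `ToyDataset` is hypothetical, a plain dictionary plays the role of the file on disk, and `functools.cached_property` is used only to mimic "compute once, then reuse".

```python
import copy
from functools import cached_property


class ToyDataset:
    """Toy stand-in: `_disk` plays the role of the file on disk."""

    def __init__(self, disk):
        self._disk = disk

    def get_data(self):
        # Always builds a fresh copy, like an uncached read from disk.
        return copy.deepcopy(self._disk)

    @cached_property
    def data(self):
        # Computed on first access; later accesses return the same object.
        return self.get_data()


ds = ToyDataset({"mass": [1.0, 2.0, 3.0]})
table = ds.data
table["mass"][0] = -1.0      # modify the cached table in place

cached_again = ds.data       # same object, so the modification persists
fresh = ds.get_data()        # fresh read, so the modification is absent
```

The in-place edit is visible through `data` but not through `get_data`, which mirrors the behavior described above.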
- get_data(output='astropy')
Get the data in this dataset as an astropy table/column or as numpy array(s). Note that a dataset does not load data from disk into memory until this function is called. As a result, you should not call this function until you have applied all of the transformations you plan to perform on the data.
You can get the data in two formats, “astropy” (the default) and “numpy”. “astropy” format will return the data as an astropy table with associated units. “numpy” will return the data as a dictionary of numpy arrays. The numpy values will be in the associated unit convention, but no actual units will be attached.
If the dataset only contains a single column, it will be returned as an astropy.table.Column or a single numpy array.
This method does not cache data. Calling “get_data” always reads data from disk, even if you have already called “get_data” in the past. You can use Dataset.data to return data and keep it in memory.
- Parameters:
output (str, default="astropy") – The format to output the data in
- Returns:
data – The data in this dataset.
- Return type:
Table | Column | dict[str, ndarray] | ndarray
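The deferred-loading behavior described for get_data can be sketched with a toy class that records transformations and only touches its source when data is requested. `LazyDataset` and its simplified `filter(column, predicate)` signature are hypothetical and exist only for this illustration; a counter stands in for disk reads.

```python
class LazyDataset:
    """Toy model: transformations are recorded, data is read in get_data()."""

    def __init__(self, source, columns=None, predicates=()):
        self._source = source              # dict with "columns" and "reads"
        self._columns = columns
        self._predicates = tuple(predicates)

    def select(self, columns):
        return LazyDataset(self._source, list(columns), self._predicates)

    def filter(self, column, predicate):
        return LazyDataset(
            self._source, self._columns, self._predicates + ((column, predicate),)
        )

    def get_data(self):
        self._source["reads"] += 1         # count simulated disk reads
        cols = self._source["columns"]
        n_rows = len(next(iter(cols.values())))
        keep = [
            i for i in range(n_rows)
            if all(pred(cols[name][i]) for name, pred in self._predicates)
        ]
        names = self._columns or list(cols)
        return {name: [cols[name][i] for i in keep] for name in names}


source = {
    "reads": 0,
    "columns": {"mass": [1.0, 5.0, 9.0], "radius": [0.1, 0.5, 0.9]},
}
ds = LazyDataset(source).filter("mass", lambda m: m > 2.0).select(["radius"])
reads_before = source["reads"]   # no read has happened yet
first = ds.get_data()
second = ds.get_data()           # a second call reads again (no caching)
```

Building the filtered, column-selected dataset costs nothing; each call to `get_data` performs one read of only the rows and columns that survive the recorded transformations.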
- bound(region, select_by=None)
Restrict the dataset to some subregion. The subregion will always be evaluated in the same units as the current dataset. For example, if the dataset is in the default “comoving” unit convention, positions are always in units of comoving Mpc. Region objects themselves, however, do not carry units. See Regions for details of how to construct regions.
- Parameters:
region (opencosmo.spatial.Region) – The region to query.
select_by (str | None)
- Returns:
dataset – The portion of the dataset inside the selected region
- Return type:
- Raises:
ValueError – If the query region does not overlap with the region this dataset resides in
AttributeError – If the dataset does not contain a spatial index
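As a rough picture of what a spatial restriction does, the sketch below filters points against an axis-aligned box in plain Python. It does not use opencosmo's Region classes or spatial index: the functions are hypothetical, boundaries are treated as inclusive by assumption, and all numbers are simply taken to be in the dataset's current units, as the description above states.

```python
def in_box(point, p1, p2):
    """True if `point` lies inside the axis-aligned box with corners p1, p2."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, p1, p2))


def boxes_overlap(a1, a2, b1, b2):
    """True if two axis-aligned boxes share any volume or boundary."""
    return all(
        alo <= bhi and blo <= ahi for alo, ahi, blo, bhi in zip(a1, a2, b1, b2)
    )


def bound(points, dataset_box, query_box):
    """Keep the points inside `query_box`; all numbers share the dataset's units."""
    if not boxes_overlap(*dataset_box, *query_box):
        raise ValueError("query region does not overlap the dataset region")
    return [p for p in points if in_box(p, *query_box)]


# A hypothetical 100-unit simulation box and three object positions.
dataset_box = ((0.0, 0.0, 0.0), (100.0, 100.0, 100.0))
points = [(10.0, 10.0, 10.0), (60.0, 60.0, 60.0), (90.0, 20.0, 50.0)]
inside = bound(points, dataset_box, ((0.0, 0.0, 0.0), (50.0, 50.0, 50.0)))
```

A query box that lies entirely outside `dataset_box` raises ValueError, matching the documented error for non-overlapping regions.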
- filter(*masks)
Filter the dataset based on some criteria. See Querying Based on Column Values for more information.
- Parameters:
*masks (Mask) – The masks to apply to the dataset, constructed with opencosmo.col()
- Returns:
dataset – The new dataset with the masks applied.
- Return type:
- Raises:
ValueError – If a given mask refers to columns that are not in the dataset, or if the masks would return zero rows.
- rows()
Iterate over the rows in the dataset. Rows are returned as dictionaries. For performance, it is recommended to first select the columns you need to work with.
- Yields:
row (dict) – A dictionary of values for each row in the dataset with units.
- Return type:
Generator[dict[str, float | Quantity], None, None]
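The row iteration pattern can be sketched with a generator over column-oriented data. `iter_rows` is a hypothetical helper, not the library's method, and plain floats stand in for values with units. The example also hints at why selecting columns first helps: each yielded dictionary only carries the columns that were kept.

```python
def iter_rows(columns):
    """Yield one dict per row from a column-oriented mapping (toy model)."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


columns = {"mass": [1.0, 2.0], "radius": [0.1, 0.2], "vx": [5.0, 6.0]}

# Selecting the needed columns first means less work per row.
selected = {name: columns[name] for name in ("mass", "radius")}
rows = list(iter_rows(selected))
```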
- select(columns)
Create a new dataset from a subset of columns in this dataset.
- Parameters:
columns (str or list[str]) – The column or columns to select.
- Returns:
dataset – The new dataset with only the selected columns.
- Return type:
- Raises:
ValueError – If any of the given columns are not in the dataset.
- drop(columns)
Create a new dataset without the provided columns.
- Parameters:
columns (str or list[str]) – The columns to drop
- Returns:
dataset – The new dataset without the dropped columns.
- Return type:
- Raises:
ValueError – If any of the provided columns are not in the dataset.
- take(n, at='random')
Create a new dataset from some number of rows from this dataset.
Can take the first n rows, the last n rows, or n random rows depending on the value of ‘at’.
- Parameters:
n (int) – The number of rows to take.
at (str) – Where to take the rows from. One of “start”, “end”, or “random”. The default is “random”.
- Returns:
dataset – The new dataset with only the selected rows.
- Return type:
- Raises:
ValueError – If n is negative or greater than the number of rows in the dataset, or if ‘at’ is invalid.
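The three values of ‘at’ and the documented error cases can be summarized as index selection over the rows. The helper below is a hypothetical sketch, not the library's implementation. In particular, returning the random indices in sorted order (so the rows keep their original ordering) is an assumption made for this example.

```python
import random


def take_indices(n_rows, n, at="random", rng=None):
    """Return the row indices that a take(n, at) call would keep (toy model)."""
    if at not in ("start", "end", "random"):
        raise ValueError(f"invalid value for 'at': {at!r}")
    if n < 0 or n > n_rows:
        raise ValueError("n must be between 0 and the number of rows")
    if at == "start":
        return list(range(n))
    if at == "end":
        return list(range(n_rows - n, n_rows))
    rng = rng or random.Random()
    # Sample without replacement; sorting is an assumption of this sketch.
    return sorted(rng.sample(range(n_rows), n))


first_three = take_indices(10, 3, at="start")
last_three = take_indices(10, 3, at="end")
random_three = take_indices(10, 3, at="random", rng=random.Random(0))
```

Passing a seeded `random.Random` makes the random case repeatable in tests, while the default draws a different sample on each call.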
- take_range(start, end)
Create a new dataset from a row range in this dataset.
- Parameters:
start (int) – The first row to get.
end (int) – The last row to get.
- Returns:
table – The table with only the rows from start to end.
- Return type:
astropy.table.Table
- Raises:
ValueError – If start or end is negative or greater than the length of the dataset, or if end is less than start.
- with_new_columns(**new_columns)
Create a new dataset with additional columns. These new columns can be derived from columns already in the dataset, or a numpy array. When a column is derived from other columns, it will behave appropriately under unit transformations. See Creating New Columns for examples.
- Parameters:
**new_columns (DerivedColumn) – The columns to add, passed as keyword arguments. Each keyword is the name of the new column.
- Returns:
dataset – This dataset with the columns added
- Return type:
- with_units(convention)
Create a new dataset from this one with a different unit convention.
- Parameters:
convention (str) – The unit convention to use. One of “physical”, “comoving”, “scalefree”, or “unitless”.
- Returns:
dataset – The new dataset with the requested unit convention.
- Return type:
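For intuition about what changing the unit convention means, the sketch below converts a length between the comoving and physical conventions using the standard relation d_physical = a · d_comoving, with scale factor a = 1 / (1 + z). The function names are hypothetical, and the sketch covers lengths only; it does not show how opencosmo applies conventions to other quantities or how the “scalefree” and “unitless” conventions are handled.

```python
def comoving_to_physical_length(d_comoving, redshift):
    """Convert a comoving length to a physical (proper) length at `redshift`."""
    scale_factor = 1.0 / (1.0 + redshift)
    return d_comoving * scale_factor


def physical_to_comoving_length(d_physical, redshift):
    """Inverse of comoving_to_physical_length."""
    return d_physical * (1.0 + redshift)


# A 10 Mpc comoving separation corresponds to 5 Mpc physical at z = 1.
physical_at_z1 = comoving_to_physical_length(10.0, 1.0)

# At z = 0 the two conventions agree.
physical_at_z0 = comoving_to_physical_length(10.0, 0.0)
```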
- collect()
Given a dataset that was originally opened with opencosmo.open, return a dataset that is in-memory as though it was read with opencosmo.read.
This is useful if you have a very large dataset on disk, and you want to filter it down and then close the file.
For example:
import opencosmo as oc
with oc.open("path/to/file.hdf5") as file:
    ds = file.filter(oc.col("sod_halo_mass") > 0)
    ds = ds.select(["sod_halo_mass", "sod_halo_radius"])
    ds = ds.collect()
The selected data will now be in memory, and the file will be closed.
If working in an MPI context, all ranks will receive the same data.
- Return type:
- class opencosmo.Lightcone(datasets, z_range=None, hide_redshift=False)
A lightcone contains two or more datasets that are part of a lightcone. Typically each dataset will cover a specific redshift range. The Lightcone object hides these details, providing an API that is identical to the standard Dataset API. Additionally, the lightcone contains some convenience functions for standard operations.
- Parameters:
datasets (dict[str, Dataset])
z_range (tuple[float, float] | None)
hide_redshift (bool)
- property header: OpenCosmoHeader
The header associated with this dataset.
OpenCosmo headers generally contain information about the original data this dataset was produced from, as well as any analysis that was done along the way.
- Returns:
header
- Return type:
OpenCosmoHeader
- property columns: list[str]
The names of the columns in this dataset.
- Returns:
columns
- Return type:
list[str]
- property cosmology: Cosmology
The cosmology of the simulation this dataset is drawn from as an astropy.cosmology.Cosmology object.
- Returns:
cosmology
- Return type:
astropy.cosmology.Cosmology
- property dtype: str
The data type of this dataset.
- Returns:
dtype
- Return type:
str
- property region: Region
The region this dataset is contained in. If no spatial queries have been performed, this will be the entire simulation box for snapshots or the full sky for lightcones.
- Returns:
region
- Return type:
opencosmo.spatial.Region
- property simulation: HaccSimulationParameters
The parameters of the simulation this dataset is drawn from.
- Returns:
parameters
- Return type:
HaccSimulationParameters
- property z_range
The redshift range of this lightcone.
- Returns:
z_range
- Return type:
tuple[float, float]
- get_data(output='astropy')
Get the data in this dataset as an astropy table/column or as numpy array(s). Note that a dataset does not load data from disk into memory until this function is called. As a result, you should not call this function until you have applied all of the transformations you plan to perform on the data.
You can get the data in two formats, “astropy” (the default) and “numpy”. “astropy” format will return the data as an astropy table with associated units. “numpy” will return the data as a dictionary of numpy arrays. The numpy values will be in the associated unit convention, but no actual units will be attached.
If the dataset only contains a single column, it will be returned as an astropy.table.Column or a single numpy array.
This method does not cache data. Calling “get_data” always reads data from disk, even if you have already called “get_data” in the past. You can use Dataset.data to return data and keep it in memory.
- Parameters:
output (str, default="astropy") – The format to output the data in
- Returns:
data – The data in this dataset.
- Return type:
Table | Column | dict[str, ndarray] | ndarray
- property data
Return the data in the dataset in astropy format. The value of this attribute is equivalent to the return value of Dataset.get_data("astropy"). Unlike get_data, data retrieved via this attribute is cached, so further accesses of Dataset.data should be nearly instantaneous.
There is one caveat. If you modify the table, those modifications will persist if you later request the data again with this attribute. Calls to Lightcone.get_data will be unaffected, and datasets generated from this dataset will not contain the modifications. If you plan to modify the data in this table, you should use Lightcone.with_new_columns.
- Returns:
data – The data in the dataset.
- Return type:
astropy.table.Table or astropy.table.Column
- with_redshift_range(z_low, z_high)
Restrict this lightcone to a specific redshift range. Lightcone datasets will always contain a column titled “redshift”. This function always operates on this column.
This function also updates the value in Lightcone.z_range, so you should always use it rather than filtering on the column directly.
- Parameters:
z_low (float)
z_high (float)
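The reason to prefer this function over a direct filter on the “redshift” column can be seen in a toy model. `ToyLightcone` and `filter_redshift_directly` are hypothetical; treating the bounds as inclusive and setting z_range to exactly (z_low, z_high) are assumptions of this sketch, not documented behavior.

```python
class ToyLightcone:
    """Toy model: a list of row dicts plus a z_range metadata field."""

    def __init__(self, rows, z_range):
        self.rows = rows
        self.z_range = z_range

    def with_redshift_range(self, z_low, z_high):
        if z_low >= z_high:
            raise ValueError("z_low must be less than z_high")
        kept = [r for r in self.rows if z_low <= r["redshift"] <= z_high]
        # The metadata is updated together with the rows.
        return ToyLightcone(kept, (z_low, z_high))

    def filter_redshift_directly(self, z_low, z_high):
        kept = [r for r in self.rows if z_low <= r["redshift"] <= z_high]
        # Rows change, but z_range is left stale.
        return ToyLightcone(kept, self.z_range)


lc = ToyLightcone(
    [{"redshift": 0.1}, {"redshift": 0.5}, {"redshift": 0.9}], (0.0, 1.0)
)
narrowed = lc.with_redshift_range(0.2, 0.6)
direct = lc.filter_redshift_directly(0.2, 0.6)
```

Both results contain the same rows, but only `narrowed` reports a redshift range that matches its contents.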
- bound(region, select_by=None)
Restrict the dataset to some subregion. The subregion will always be evaluated in the same units as the current dataset. For example, if the dataset is in the default “comoving” unit convention, positions are always in units of comoving Mpc. Region objects themselves, however, do not carry units. See Regions for details of how to construct regions.
- Parameters:
region (opencosmo.spatial.Region) – The region to query.
select_by (str | None)
- Returns:
dataset – The portion of the dataset inside the selected region
- Return type:
- Raises:
ValueError – If the query region does not overlap with the region this dataset resides in
AttributeError – If the dataset does not contain a spatial index
- cone_search(center, radius)
Perform a search for objects within some angular distance of some given point on the sky. This is a convenience function around Lightcone.bound, and is exactly equivalent to:
region = oc.make_cone(center, radius)
ds = ds.bound(region)
- Parameters:
center (tuple | SkyCoord) – The center of the region to search. If a tuple without units is provided, it is assumed to be (RA, Dec) in degrees.
radius (float | astropy.units.Quantity) – The angular radius of the region to query. If no units are provided, assumed to be degrees.
- Returns:
new_lightcone – The rows in this lightcone that fall within the given region.
- Return type:
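The geometric test behind a cone search is an angular-separation cut. The sketch below computes the separation with the haversine formula in plain Python and keeps the points within a given radius. It illustrates the geometry only; the function names are hypothetical, inputs are (RA, Dec) in degrees, and it does not reflect how opencosmo evaluates the region internally.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    h = (
        math.sin(d_dec / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(points, center, radius_deg):
    """Keep the (RA, Dec) points within `radius_deg` of `center`."""
    return [
        p for p in points if angular_separation_deg(*p, *center) <= radius_deg
    ]


points = [(10.0, 0.0), (11.0, 0.0), (10.0, 1.5), (190.0, 0.0)]
nearby = cone_search(points, center=(10.0, 0.0), radius_deg=2.0)
```

The haversine form is used here because it stays numerically stable for the small separations typical of cone searches.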
- filter(*masks, **kwargs)
Filter the dataset based on some criteria. See Querying Based on Column Values for more information.
- Parameters:
*masks (Mask) – The masks to apply to the dataset, constructed with opencosmo.col()
- Returns:
dataset – The new dataset with the masks applied.
- Return type:
- Raises:
ValueError – If a given mask refers to columns that are not in the dataset, or if the masks would return zero rows.
- rows()
Iterate over the rows in the dataset. Rows are returned as dictionaries. For performance, it is recommended to first select the columns you need to work with.
- Yields:
row (dict) – A dictionary of values for each row in the dataset with units.
- Return type:
Generator[dict[str, float | Quantity], None, None]
- select(columns)
Create a new dataset from a subset of columns in this dataset.
- Parameters:
columns (str or list[str]) – The column or columns to select.
- Returns:
dataset – The new dataset with only the selected columns.
- Return type:
- Raises:
ValueError – If any of the given columns are not in the dataset.
- drop(columns)
Produce a new dataset by dropping columns from this dataset.
- Parameters:
columns (str or list[str]) – The column or columns to drop.
- Returns:
dataset – The new dataset without the dropped columns.
- Return type:
- Raises:
ValueError – If any of the given columns are not in the dataset.
- take(n, at='random')
Create a new dataset from some number of rows from this dataset.
Can take the first n rows, the last n rows, or n random rows depending on the value of ‘at’.
- Parameters:
n (int) – The number of rows to take.
at (str) – Where to take the rows from. One of “start”, “end”, or “random”. The default is “random”.
- Returns:
dataset – The new dataset with only the selected rows.
- Return type:
- Raises:
ValueError – If n is negative or greater than the number of rows in the dataset, or if ‘at’ is invalid.
- with_new_columns(*args, **kwargs)
Create a new dataset with additional columns. These new columns can be derived from columns already in the dataset, or a numpy array. When a column is derived from other columns, it will behave appropriately under unit transformations. See Creating New Columns for examples.
- Parameters:
columns (**)
- Returns:
dataset – This dataset with the columns added
- Return type:
- with_units(convention)
Create a new dataset from this one with a different unit convention.
- Parameters:
convention (str) – The unit convention to use. One of “physical”, “comoving”, “scalefree”, or “unitless”.
- Returns:
dataset – The new dataset with the requested unit convention.
- Return type:
- collect()
Given a dataset that was originally opened with opencosmo.open, return a dataset that is in-memory as though it was read with opencosmo.read.
This is useful if you have a very large dataset on disk, and you want to filter it down and then close the file.
For example:
import opencosmo as oc
with oc.open("path/to/file.hdf5") as file:
    ds = file.filter(oc.col("sod_halo_mass") > 0)
    ds = ds.select(["sod_halo_mass", "sod_halo_radius"])
    ds = ds.collect()
The selected data will now be in memory, and the file will be closed.
If working in an MPI context, all ranks will receive the same data.
- Return type: