arviz_base.dataset_to_dataframe

arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')

Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.

Parameters:
ds : xarray.Dataset
sample_dims : sequence of hashable, optional
labeller : labeller, optional
multiindex : {“row”, “column”} or bool, default False
new_dim : hashable, default “label”
Returns:
pandas.DataFrame

Examples

The dimensions in sample_dims are stacked into the rows of the DataFrame, and everything else (one entry per variable and combination of remaining coordinate values) becomes the columns. When there end up being many more columns than rows we might want to transpose the output, as done in the last example below:

from arviz_base import load_arviz_data, dataset_to_dataframe
idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
(0, 0) 1.715723 2.317391 1.450174 2.085550 2.227076 3.071507 2.712972 3.083764 1.460448 0.877494
(0, 1) 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 2) 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 3) 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 4) 2.017497 1.109120 0.818893 2.750620 1.928670 1.983162 1.029620 3.662744 2.167574 0.767934
... ... ... ... ... ... ... ... ... ... ...
(3, 495) 7.750625 11.477589 5.578327 9.321531 5.812095 5.437099 3.096142 9.731409 7.948321 3.020477
(3, 496) 6.922368 2.710763 8.646136 3.807844 7.543669 6.788881 6.595036 4.003042 5.275016 2.704639
(3, 497) 5.408836 11.406390 4.446937 9.210775 6.331074 4.150778 4.812302 9.693257 4.914656 2.236486
(3, 498) 7.721440 7.086139 12.311889 6.584301 10.286093 10.050167 11.859938 7.952268 9.754468 2.989656
(3, 499) 10.237157 10.464390 13.714306 10.261666 15.180098 10.916030 15.070900 14.923210 14.023129 3.051559

2000 rows × 10 columns
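
Since the result is a regular pandas.DataFrame, transposing it when that layout is more convenient is just a matter of calling .T on it (output omitted here):

# same conversion as above, with the 10 labelled columns turned into rows
# and the 2000 stacked (chain, draw) samples turned into columns
dataset_to_dataframe(idata.posterior.dataset).T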

The default is to return a single flat index, with the labels as columns and tuples of coordinate values of the stacked dimensions as rows. To keep all the data from all coordinates as a MultiIndex, use multiindex=True:

dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
label mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
variable mu theta theta theta theta theta theta theta theta tau
school nan Choate Deerfield Phillips Andover Phillips Exeter Hotchkiss Lawrenceville St. Paul's Mt. Hermon nan
sample chain draw
(0, 0) 0 0 1.715723 2.317391 1.450174 2.085550 2.227076 3.071507 2.712972 3.083764 1.460448 0.877494
(0, 1) 0 1 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 2) 0 2 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 3) 0 3 1.903481 0.889170 0.742949 3.125869 2.779524 2.834705 1.558939 2.487503 1.984379 0.802714
(0, 4) 0 4 2.017497 1.109120 0.818893 2.750620 1.928670 1.983162 1.029620 3.662744 2.167574 0.767934
... ... ... ... ... ... ... ... ... ... ... ... ...
(3, 495) 3 495 7.750625 11.477589 5.578327 9.321531 5.812095 5.437099 3.096142 9.731409 7.948321 3.020477
(3, 496) 3 496 6.922368 2.710763 8.646136 3.807844 7.543669 6.788881 6.595036 4.003042 5.275016 2.704639
(3, 497) 3 497 5.408836 11.406390 4.446937 9.210775 6.331074 4.150778 4.812302 9.693257 4.914656 2.236486
(3, 498) 3 498 7.721440 7.086139 12.311889 6.584301 10.286093 10.050167 11.859938 7.952268 9.754468 2.989656
(3, 499) 3 499 10.237157 10.464390 13.714306 10.261666 15.180098 10.916030 15.070900 14.923210 14.023129 3.051559

2000 rows × 10 columns
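
multiindex also accepts the strings “row” and “column”. The sketch below assumes, going only by the option names in the signature above, that these keep the MultiIndex on that single axis while flattening the other (output not shown here):

# assumed behaviour: keep the (chain, draw) MultiIndex on the rows only,
# leaving the columns as flat labels; "column" would do the opposite
dataset_to_dataframe(idata.posterior.dataset, multiindex="row")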

The only restriction on sample_dims is that they are present in all variables of the dataset. Consequently, we can compute statistical summaries, concatenate the results into a single dataset along a new dimension, and then use that new dimension as sample_dims:

import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.171 4.063 1.997 6.536
    theta    (summary, school) float64 256B 6.42 4.954 3.423 ... 9.237 7.939
    tau      (summary) float64 32B 4.321 3.511 2.191 5.669
Attributes:
    created_at:                 2025-01-19T14:32:33.071271+00:00
    arviz_version:              0.20.0
    inference_library:          pymc
    inference_library_version:  5.20.0
    sampling_time:              3.159093141555786
    tuning_steps:               1000

Then convert the result into a DataFrame for ease of viewing.

dataset_to_dataframe(summaries, sample_dims=["summary"]).T
mean median 1st quartile 3rd quartile
mu 4.171372 4.063302 1.996927 6.535536
theta[Choate] 6.420443 5.795054 2.601156 9.379063
theta[Deerfield] 4.954497 5.015449 1.697938 8.023527
theta[Phillips Andover] 3.422932 3.744714 0.271904 6.907635
theta[Phillips Exeter] 4.753565 4.690572 1.424864 7.960607
theta[Hotchkiss] 3.453035 3.618865 0.461903 6.645211
theta[Lawrenceville] 3.662959 3.904880 0.562143 7.143478
theta[St. Paul's] 6.505227 6.090589 3.059898 9.237491
theta[Mt. Hermon] 4.819780 4.645244 1.337334 7.938904
tau 4.321166 3.511275 2.190964 5.668695

Note that if all summaries were scalar (that is, if none of them added an extra dimension the way quantile does), it would not be necessary to use expand_dims or to rename dimensions; calling assign_coords on the concatenated result to label the newly created dimension would be enough. With the approach used here, however, we already generate a dimension with coordinate values and can also combine non-scalar summaries.
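
For comparison, a minimal sketch of that simpler pattern, using only the mean and median so that no extra dimension needs to be created before concatenating:

scalar_summaries = xr.concat(
    (post.mean(dims), post.median(dims)),  # no expand_dims needed here
    dim="summary",
).assign_coords(summary=["mean", "median"])  # label the new dimension afterwards
dataset_to_dataframe(scalar_summaries, sample_dims=["summary"]).T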