Automate all steps of santaR fitting, confidence band estimation and p-value calculation for one or multiple variables

santaR_auto_fit encompasses all the analytical steps for the detection of significantly altered time trajectories: input data preparation (get_ind_time_matrix), establishing group membership (get_grouping), spline modelling of individual and group time evolutions (santaR_fit), computation of group mean curve confidence bands (santaR_CBand), and identification of significantly altered time trajectories (santaR_pvalue_dist and/or santaR_pvalue_fit). As santaR is a univariate approach, multiple variables can be processed independently, which santaR_auto_fit can execute in parallel over multiple CPU cores.
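
The steps listed above can be sketched by hand for a single variable. This is only an illustration: the argument names (Yi, grouping, nBoot, nPerm) are assumptions that should be checked against each function's own help page, and the block does nothing if santaR is not installed.

```r
# Sketch only: manual equivalent of santaR_auto_fit for ONE variable.
# Argument names below are assumptions; check each function's help page.
if (requireNamespace("santaR", quietly = TRUE)) {
  library(santaR)
  meta <- acuteInflammation$meta
  y    <- acuteInflammation$data$var_1

  # 1. Input data preparation: individual x time matrix
  indTime  <- get_ind_time_matrix(Yi = y, ind = meta$ind, time = meta$time)
  # 2. Group membership of each individual
  grouping <- get_grouping(ind = meta$ind, group = meta$group)
  # 3. Spline modelling of individual and group time evolutions
  obj <- santaR_fit(indTime, df = 5, grouping = grouping)
  # 4. Confidence bands on the group mean curves (reduced nBoot for speed)
  obj <- santaR_CBand(obj, nBoot = 100)
  # 5. p-value based on inter-group mean curve distance (reduced nPerm for speed)
  obj <- santaR_pvalue_dist(obj, nPerm = 100)
}
```

santaR_auto_fit repeats this sequence for every column of inputData, which is why the variables can be distributed over several cores.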

Usage

santaR_auto_fit(
  inputData,
  ind,
  time,
  group = NA,
  df,
  ncores = 0,
  CBand = TRUE,
  pval.dist = TRUE,
  pval.fit = FALSE,
  nBoot = 1000,
  alpha = 0.05,
  nPerm = 1000,
  nStep = 5000,
  alphaPval = 0.05,
  forceParIndTimeMat = FALSE
)

Arguments

inputData: data.frame of measurements with observations as rows and variables as columns.
ind: Vector of subject identifiers (individuals) corresponding to each measurement.
time: Vector of the time corresponding to each measurement.
group: NA or vector of group membership for each measurement. Default is NA for no groups.
df: (float) Degrees of freedom to employ for fitting the individual and group mean smooth.spline.
ncores: (int) Number of cores to use for parallelisation. Default 0 for no parallelisation.
CBand: If TRUE calculate confidence bands for group mean curves. Default is TRUE.
pval.dist: If TRUE calculate p-value based on inter-group mean curve distance. Default is TRUE.
pval.fit: If TRUE calculate p-value based on group mean curve improvement in fit. Default is FALSE.
nBoot: (int) Number of bootstrapping rounds for confidence band calculation. Default 1000.
alpha: (float) Alpha for the confidence bands (0.05 for 95% confidence bands). Default 0.05.
nPerm: (int) Number of permutations for p-value calculation. Default 1000.
nStep: (int) Number of steps (granularity) employed for the calculation of the area between group mean curves (p-value dist). Default is 5000.
alphaPval: (float) Alpha for the confidence interval on the permuted p-value (0.05 for a 95% confidence interval). Default 0.05.
forceParIndTimeMat: If TRUE parallelise the preparation of input data by get_ind_time_matrix. Default is FALSE.

Value

A list of SANTAObj corresponding to each variable's analysis result.
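
As a minimal sketch of working with the returned list, the block below fits two variables with confidence bands and p-values switched off (to keep it fast) and extracts one result by name. The element name "var_1" comes from the column names of the example dataset; the internal layout of a SANTAObj is not described here, so the sketch only prints its top-level structure. The block does nothing if santaR is not installed.

```r
# Sketch only: fit without confidence bands or p-values to keep it fast,
# then pull out one variable's result by name.
if (requireNamespace("santaR", quietly = TRUE)) {
  library(santaR)
  meta <- acuteInflammation$meta
  SANTAObjList <- santaR_auto_fit(acuteInflammation$data[, 1:2],
                                  ind = meta$ind, time = meta$time,
                                  group = meta$group, df = 5,
                                  CBand = FALSE, pval.dist = FALSE)
  var1 <- SANTAObjList[["var_1"]]   # one SANTAObj per input column
  str(var1, max.level = 1)          # top-level structure only
}
```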

Details

Note

The calculation of confidence bands accounts for approximately a third of the time taken by santaR_auto_fit, while the identification of significantly altered time trajectories (either santaR_pvalue_dist or santaR_pvalue_fit) accounts for two thirds of the total time. The time taken by these steps increases linearly with their respective parameters: nBoot for confidence bands, nPerm and nStep for santaR_pvalue_dist, and nPerm for santaR_pvalue_fit. The default values of these parameters balance execution time against the precision of the estimates; increasing nPerm can tighten the p-value confidence intervals.
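
To illustrate why more permutations tighten the interval around a permuted p-value, the sketch below uses base R's binom.test (an exact binomial interval). This is an assumption chosen for illustration and is not necessarily the interval santaR computes; the count of extreme permutations is hypothetical.

```r
# Illustration only: width of an exact binomial 95% CI around a permuted
# p-value close to 0.05, for increasing numbers of permutations.
nPermValues <- c(100, 1000, 10000)

ciWidth <- sapply(nPermValues, function(nPerm) {
  # Hypothetical number of permutations at least as extreme as the observation
  nExtreme <- round(0.05 * nPerm)
  ci <- binom.test(nExtreme, nPerm, conf.level = 0.95)$conf.int
  diff(ci)  # upper bound minus lower bound
})

data.frame(nPerm = nPermValues, ciWidth = round(ciWidth, 4))
```

The interval width shrinks as nPerm grows, while the run time grows linearly with nPerm, which is the trade-off the defaults try to balance.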
If parallelisation is activated (ncores > 0), the fitting of spline models, the calculation of confidence bands on the group mean curves and the identification of altered trajectories are executed for multiple variables simultaneously. However, the preparation of input data (get_ind_time_matrix) is not parallelised by default, as the parallelisation overhead exceeds the time potentially gained for all but the most complex datasets. The overhead (instantiating worker nodes, duplicating and transferring inputs to the worker nodes, concatenating results) typically amounts to around 2 seconds, while executing get_ind_time_matrix usually takes milliseconds for a single variable (e.g. 7 time-points, 24 individuals, 1 variable), so the overhead far exceeds the time needed to process all variables sequentially. If the number of individual trajectories (subjects), time-points or variables is very large, forceParIndTimeMat enables the parallelisation of get_ind_time_matrix.
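
The overhead argument can be checked with a generic sketch using base R's parallel package. This is not santaR's internal parallelisation code; the task function work and the variable count nVar are hypothetical stand-ins for a cheap per-variable step such as get_ind_time_matrix.

```r
library(parallel)

# Hypothetical stand-in for a cheap per-variable task
work <- function(i) sum(sqrt(seq_len(1000)))
nVar <- 50

# Sequential execution
tSeq <- system.time(resSeq <- lapply(seq_len(nVar), work))

# Parallel execution, including the cost of starting and stopping workers
tPar <- system.time({
  cl <- makeCluster(2)                          # instantiate worker nodes
  resPar <- parLapply(cl, seq_len(nVar), work)  # transfer inputs, run, gather results
  stopCluster(cl)
})

identical(resSeq, resPar)  # same results either way
c(sequential = tSeq[["elapsed"]], parallel = tPar[["elapsed"]])
```

For a task this small the parallel run is expected to be slower, since cluster start-up dominates; the balance only shifts when each task, or the number of tasks, becomes large.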

Examples

## 2 variables, 56 measurements, 8 subjects, 7 unique time-points
## Default parameter values decreased to ensure an execution < 2 seconds
inputData     <- acuteInflammation$data[,1:2]
ind           <- acuteInflammation$meta$ind
time          <- acuteInflammation$meta$time
group         <- acuteInflammation$meta$group
SANTAObjList  <- santaR_auto_fit(inputData, ind, time, group, df=5, ncores=0, CBand=TRUE,
                                pval.dist=TRUE, nBoot=100, nPerm=100)
#> Input data generated: 0.01 secs
#> Spline fitted: 0.04 secs
#> ConfBands done: 0.5 secs
#> p-val dist done: 0.71 secs
#> total time: 1.25 secs
length(SANTAObjList)
#> [1] 2
names(SANTAObjList)
#> [1] "var_1" "var_2"