Automated command line analysis
Arnaud Wolfer
2019-10-03
The santaR package is designed for the detection of significantly
altered time trajectories between study groups, in short time-series.
Command line parallelisation and reporting functions allow the
automated analysis of multiple variables.
The automated command line functions are to be preferred to the GUI when processing a very large number of variables, as they are more efficient and can be integrated in scripts.
Using an example dataset, this vignette will:
- Detail the parallel processing function
- Detail the automated reporting function
- Save the processing results in a .RData file to be opened with the graphical interface for further analysis
Parallel processing
Within a single experiment, multiple variables can be measured and
explored dynamically (e.g. NMR or MS features, genes). As santaR’s
analysis is a univariate approach, each variable can be fitted
independently. This lack of dependency makes santaR’s analysis an
embarrassingly parallel workload.
The santaR_auto_fit() function is a wrapper for each of the analytical
functions (i.e. get_ind_time_matrix(), santaR_fit(), santaR_CBand(),
santaR_pvalue_dist() and santaR_pvalue_fit()), executing them in a
parallel fashion (for each individual function, see the help and the
advanced command line options vignette). The parallelisation relies on
the doParallel package for the instantiation of worker nodes and on
foreach for the distribution of tasks. This set of packages enables
parallelisation on all operating systems (Windows, Mac OS and most
Linux distributions).
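The column-wise distribution of work described above can be sketched with foreach and doParallel directly. This is not santaR's internal code: the toy data frame, the fit_one() helper and the choice of stats::smooth.spline() as the per-variable task are assumptions made for illustration only.

```r
library(foreach)
library(doParallel)

# Toy data (hypothetical): 7 time points, 3 variables stored as columns
set.seed(1)
time_pts <- c(0, 4, 8, 12, 24, 48, 72)
toy <- data.frame(
  var_a = sin(time_pts / 20) + rnorm(length(time_pts), sd = 0.1),
  var_b = cos(time_pts / 20) + rnorm(length(time_pts), sd = 0.1),
  var_c = time_pts / 72      + rnorm(length(time_pts), sd = 0.1)
)

# Hypothetical per-variable task: fit a smoothing spline to one column
# and return the fitted value at an intermediate time point
fit_one <- function(y, x) {
  fit <- stats::smooth.spline(x = x, y = y, df = 5)
  stats::predict(fit, x = 36)$y
}

# Instantiate two worker nodes and register them as the foreach backend
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each column is an independent task, so iterations can run on any worker
par_res <- foreach(j = seq_len(ncol(toy)), .combine = c) %dopar% {
  fit_one(toy[[j]], time_pts)
}

# Always release the worker nodes once the work is done
parallel::stopCluster(cl)

# Sequential reference: the same task applied column by column
seq_res <- vapply(toy, fit_one, numeric(1), x = time_pts)
```

Because no column depends on another, the parallel result should match the sequential one; only the wall-clock time differs, and for a dataset this small the cost of starting workers is likely to dominate.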
Observation values are expected as a data-frame with samples as rows
and variables as columns, the parallelisation taking place over the
columns. For a selected number of CPU cores (ncores parameter),
santaR_auto_fit() first instantiates worker nodes; if ncores=0, the
procedure is applied sequentially (no parallelisation). The conversion
of inputs by get_ind_time_matrix() is however not parallelised by
default, as the parallelisation overhead exceeds the time gained for
all but the most complex datasets. When the number of individuals,
unique time points, or variables is large, the forceParIndTimeMat
parameter enables the parallelisation of this step. All subsequent
analytical steps are automatically parallelised, with the calculation
of confidence bands on the group mean curves and the identification of
altered trajectories activated by default.
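As a minimal sketch of the sequential mode, the call below sets ncores=0 on the first three variables of the acuteInflammation example dataset. It assumes that CBand=FALSE and pval.dist=FALSE skip the confidence band and p-value steps, which keeps the run short; the res_seq name is arbitrary.

```r
library(santaR)

# Example data: samples as rows, variables as columns
tmp_data <- acuteInflammation$data
tmp_meta <- acuteInflammation$meta

# Sequential run (ncores=0) restricted to the first three variables.
# Assumption: FALSE disables the confidence band and p-value steps.
res_seq <- santaR_auto_fit(
  inputData = tmp_data[, 1:3],
  ind       = tmp_meta$ind,
  time      = tmp_meta$time,
  group     = tmp_meta$group,
  df        = 5,
  ncores    = 0,
  CBand     = FALSE,
  pval.dist = FALSE
)

# One result per input column
length(res_seq)
names(res_seq)
```

A short sequential run of this kind can be a convenient way to check that the inputs are accepted before committing several cores to the full analysis.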
santaR_auto_fit() returns a list of SANTAObj containing each variable’s
analysis results. In practice, santaR_auto_fit() is the function
employed for command line analysis as it caters for all possible use
cases.
library(santaR)
# Load example data
tmp_data <- acuteInflammation$data
tmp_meta <- acuteInflammation$meta
# Analyse data, with confidence bands and p-value
res_acuteInf_df5 <- santaR_auto_fit(inputData=tmp_data, ind=tmp_meta$ind, time=tmp_meta$time, group=tmp_meta$group, df=5, ncores=4, CBand=TRUE, pval.dist=TRUE)
# Input data generated: 0.13 secs
# Spline fitted: 1.05 secs
# ConfBands done: 18.98 secs
# p-val dist done: 35.43 secs
# total time: 55.59 secs
length(res_acuteInf_df5)
# [1] 22
names(res_acuteInf_df5)
# [1] "var_1" "var_2" "var_3" "var_4" "var_5" "var_6" "var_7" "var_8" "var_9" "var_10" "var_11" "var_12" "var_13" "var_14" "var_15" "var_16" "var_17" "var_18"
# [19] "var_19" "var_20" "var_21" "var_22"
Automated reporting
After multiple variables have been analysed using santaR_auto_fit(), a
reporting function helps assess significant results and summarise them
in an easily interpretable fashion. santaR_auto_summary() takes a list
of SANTAObj as generated by santaR_auto_fit() as input.
First, correction for multiple testing can be applied to generate
Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli corrected
p-values. P-values can be returned by the function, but also
automatically saved to disk as .csv. For a given significance cut-off
(plotCutOff parameter), the number of variables significantly altered
is reported and plots are automatically saved to disk in order of
increasing p-value. The appearance of the plots can be altered using
multiple options such as the representation of confidence bands
(showConfBand parameter) or the generation of a mean curve across all
samples (showTotalMeanCurve parameter), which can help assess
differences between groups when group sizes are unbalanced.
# Generate a summary
# without a defined 'targetFolder', no csv or plots can be saved
pval_acuteInf_df5 <- santaR_auto_summary(SANTAObjList=res_acuteInf_df5, targetFolder=NA)
# p-value dist found
# Benjamini-Hochberg corrected p-value
names(pval_acuteInf_df5)
# [1] "pval.all" "pval.summary"
pval_acuteInf_df5$pval.summary
Test | Inf 0.05 | Inf 0.01 | Inf 0.001 |
---|---|---|---|
dist | 17 | 8 | 0 |
dist_BH | 16 | 0 | 0 |
pval_acuteInf_df5$pval.all
Variable | dist | dist_upper | dist_lower | curveCorr | dist_BH |
---|---|---|---|---|---|
var_1 | 0.00999 | 0.0183 | 0.005434 | -0.243 | 0.02747 |
var_2 | 0.007992 | 0.0157 | 0.004054 | 0.0006572 | 0.02747 |
var_3 | 0.006993 | 0.01437 | 0.00339 | -0.131 | 0.02747 |
var_4 | 0.2098 | 0.2361 | 0.1857 | -0.3878 | 0.2148 |
var_5 | 0.005994 | 0.01302 | 0.002749 | -0.5635 | 0.02747 |
var_6 | 0.008991 | 0.017 | 0.004736 | -0.4767 | 0.02747 |
var_7 | 0.01399 | 0.02334 | 0.008347 | -0.5629 | 0.03077 |
var_8 | 0.00999 | 0.0183 | 0.005434 | -0.4679 | 0.02747 |
var_9 | 0.03896 | 0.05282 | 0.02863 | -0.389 | 0.05042 |
var_10 | 0.03497 | 0.04825 | 0.02524 | -0.05017 | 0.04808 |
var_11 | 0.01399 | 0.02334 | 0.008347 | 0.0568 | 0.03077 |
var_12 | 0.2148 | 0.2413 | 0.1904 | 0.153 | 0.2148 |
var_13 | 0.06693 | 0.08414 | 0.05304 | -0.4078 | 0.0775 |
var_14 | 0.1548 | 0.1786 | 0.1337 | -0.06504 | 0.1703 |
var_15 | 0.008991 | 0.017 | 0.004736 | 0.1268 | 0.02747 |
var_16 | 0.01598 | 0.02581 | 0.00986 | 0.5055 | 0.03197 |
var_17 | 0.01998 | 0.03067 | 0.01297 | 0.2798 | 0.03663 |
var_18 | 0.02997 | 0.04247 | 0.02107 | 0.4028 | 0.04396 |
var_19 | 0.05395 | 0.06973 | 0.04157 | 0.5015 | 0.06593 |
var_20 | 0.02398 | 0.03543 | 0.01616 | 0.3899 | 0.03768 |
var_21 | 0.02298 | 0.03425 | 0.01536 | 0.1458 | 0.03768 |
var_22 | 0.007992 | 0.0157 | 0.004054 | -0.2075 | 0.02747 |
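The relationship between the dist and dist_BH columns can be checked with base R. The sketch below is not santaR's internal code: it assumes that stats::p.adjust() with method "BH" is an equivalent Benjamini-Hochberg correction, and it starts from the rounded dist values printed in the pval.all table, so the results agree with the table only up to rounding.

```r
# 'dist' p-values copied from the pval.all table (rounded as printed)
p_dist <- c(
  var_1  = 0.00999,  var_2  = 0.007992, var_3  = 0.006993, var_4  = 0.2098,
  var_5  = 0.005994, var_6  = 0.008991, var_7  = 0.01399,  var_8  = 0.00999,
  var_9  = 0.03896,  var_10 = 0.03497,  var_11 = 0.01399,  var_12 = 0.2148,
  var_13 = 0.06693,  var_14 = 0.1548,   var_15 = 0.008991, var_16 = 0.01598,
  var_17 = 0.01998,  var_18 = 0.02997,  var_19 = 0.05395,  var_20 = 0.02398,
  var_21 = 0.02298,  var_22 = 0.007992
)

# Benjamini-Hochberg correction across the 22 variables
p_bh <- p.adjust(p_dist, method = "BH")

# Count the variables strictly below each significance cut-off
alpha <- c(0.05, 0.01, 0.001)
summary_tab <- rbind(
  dist    = sapply(alpha, function(a) sum(p_dist < a)),
  dist_BH = sapply(alpha, function(a) sum(p_bh < a))
)
colnames(summary_tab) <- paste("Inf", alpha)
summary_tab
#         Inf 0.05 Inf 0.01 Inf 0.001
# dist          17        8         0
# dist_BH       16        0         0
```

The counts match pval.summary. The other two corrections named above are available from the same base R function as method = "bonferroni" and method = "BY".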
Save results for GUI
In practice, time-dependent patterns for a given biological question
(e.g. a grouping of individuals) are assessed by parallelised fitting
and analysis using santaR_auto_fit() and reporting using
santaR_auto_summary(). When results are available, the most
significantly altered variables can be identified using the reports and
visually inspected for confirmation using the plots already saved to
disk.
Additionally, analysis results can be loaded into the GUI for
interactive visualisation or generation of plots. For that, the list of
SANTAObj generated by santaR_auto_fit() must be saved under the
variable name inSp in a .RData file: