| Title: | Adjust Estimates of Learning for Guessing |
|---|---|
| Description: | Provides tools to adjust estimates of learning for guessing-related bias in educational and survey research. Implements standard guessing correction methods and a sophisticated latent class model that leverages informative pre-post test transitions to account for guessing behavior. The package helps researchers obtain more accurate estimates of actual learning when respondents may guess on closed-ended knowledge items. For theoretical background and empirical validation, see Cor and Sood (2018) <https://gsood.com/research/papers/guess.pdf>. |
| Authors: | Gaurav Sood [aut, cre], Ken Cor [aut] |
| Maintainer: | Gaurav Sood <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.5.0 |
| Built: | 2026-06-07 06:53:14 UTC |
| Source: | https://github.com/finite-sample/guess |
Calculate expected values for goodness of fit test
calculate_expected_values(gamma_i, params, total_obs, model_type = "nodk")
gamma_i |
item-specific gamma value |
params |
estimated parameters for the item |
total_obs |
total observations for the item |
model_type |
"nodk" or "dk" model |
vector of expected values
Extract coefficients from guess_fit
## S3 method for class 'guess_fit'
coef(object, ...)
object |
guess_fit object |
... |
ignored |
parameter matrix
Converts the difference in ability estimates to a probability scale using the logistic function. Provides a comparable metric to posterior_learned() but without using the transition structure.
cross_sectional_irt(pre_test, pst_test, method = "logit", scale = 1)
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
method |
character: "logit" (default) or "rasch" |
scale |
numeric scaling factor for ability difference (default 1) |
numeric vector of learning probabilities in [0, 1]
sim <- simulate_lca(n = 100, gk = 0.30, seed = 123, return_classes = TRUE)
p_learned_cs <- cross_sectional_irt(sim$pre, sim$post)
cor(p_learned_cs, sim$learned)
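As a rough illustration of the conversion described above, the following base-R sketch maps an ability difference onto [0, 1] with the logistic function. It assumes the conversion has the form plogis(scale * (theta_post - theta_pre)); the helper name ability_diff_to_prob is hypothetical and not part of the package, whose internals may differ.

```r
# Hypothetical helper (not a package function): logistic transform of an
# ability difference. Assumption: p = plogis(scale * (theta_post - theta_pre)).
ability_diff_to_prob <- function(theta_pre, theta_post, scale = 1) {
  stopifnot(length(theta_pre) == length(theta_post))
  plogis(scale * (theta_post - theta_pre))
}

theta_pre  <- c(-1.0, 0.0, 0.5)
theta_post <- c( 1.0, 0.0, 0.0)

# No change in ability maps to 0.5; gains map above 0.5, losses below.
p <- ability_diff_to_prob(theta_pre, theta_post)
print(round(p, 3))
```

Under this assumed form, a larger scale pushes the same ability difference closer to 0 or 1.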
Estimates learning as the difference in ability between post and pre test. This ignores the transition structure that the LCA model uses.
cross_sectional_learning(pre_test, pst_test, method = "logit")
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
method |
character: "logit" (default) or "rasch" |
numeric vector of learning estimates (theta_post - theta_pre)
sim <- simulate_lca(n = 100, gk = 0.30, seed = 123, return_classes = TRUE)
learning_cs <- cross_sectional_learning(sim$pre, sim$post)
cor(learning_cs, sim$learned)
Splits individuals into k folds, fits on training, evaluates on held-out.
cv_individuals(pre_test, pst_test, k = 5L, priors = NULL, seed = NULL)
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
k |
integer number of folds |
priors |
optional numeric vector of starting parameters |
seed |
optional integer random seed |
list with fold_results, mean_ll, total_ll, perplexity, se
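The fold-splitting step described above can be sketched in base R as follows. This is a generic k-fold assignment, not the package's own code; the helper name make_folds is hypothetical, and the model fitting and held-out evaluation are left as comments.

```r
# Hypothetical helper (not a package function): assign each of n individuals
# to one of k folds of near-equal size.
make_folds <- function(n, k = 5L, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

fold_id <- make_folds(n = 23, k = 5L, seed = 42)

for (f in sort(unique(fold_id))) {
  train_rows <- which(fold_id != f)  # rows that would be used for fitting
  test_rows  <- which(fold_id == f)  # rows that would be held out
  cat("fold", f, ": train =", length(train_rows),
      ", test =", length(test_rows), "\n")
}
```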
Splits items into k folds, fits on training items, evaluates on held-out items.
cv_items(transmatrix, k = 5L, priors = NULL, seed = NULL)
transmatrix |
numeric matrix from multi_transmat() |
k |
integer number of folds |
priors |
optional numeric vector of starting parameters |
seed |
optional integer random seed |
list with fold_results, mean_ll, total_ll, perplexity, se
Estimates person ability using simple proportion correct (logit-transformed) or Rasch-style IRT. Ignores the transition structure between time points.
estimate_ability(responses, method = "logit", difficulty = NULL)
responses |
data.frame of binary responses (0/1) |
method |
character: "logit" (default) or "rasch" |
difficulty |
numeric vector of item difficulties (for rasch method) |
numeric vector of ability estimates (length = n individuals)
sim <- simulate_lca(n = 100, seed = 123)
theta_pre <- estimate_ability(sim$pre)
theta_post <- estimate_ability(sim$post)
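To make the "proportion correct (logit-transformed)" idea concrete, here is a minimal base-R sketch. The helper logit_ability and its eps smoothing term (added so that all-correct or all-wrong rows do not give infinite logits) are assumptions for illustration; the package may handle these edge cases differently.

```r
# Hypothetical helper (not a package function): logit of a smoothed
# proportion correct, one value per individual (row).
logit_ability <- function(responses, eps = 0.5) {
  n_items <- ncol(responses)
  correct <- rowSums(responses == 1, na.rm = TRUE)
  p <- (correct + eps) / (n_items + 2 * eps)  # keeps p strictly inside (0, 1)
  qlogis(p)
}

responses <- data.frame(item1 = c(1, 1, 0), item2 = c(1, 0, 0))
theta <- logit_ability(responses)
print(theta)
```

With two items, a respondent who answers one of them correctly gets an ability of 0 on this scale.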
Chi-square goodness of fit between true and model based multivariate distribution. Handles both data with and without don't know responses automatically.
fit_model(pre_test, pst_test, g, est_param, force9 = FALSE)
fit_dk(pre_test, pst_test, g, est_param, force9 = FALSE)
fit_nodk(pre_test, pst_test, g, est_param)
pre_test |
data.frame carrying pre_test items |
pst_test |
data.frame carrying pst_test items |
g |
estimates of gamma produced from |
est_param |
estimated parameters produced from |
force9 |
Optional. Force 9-column format even if no DK responses. Default is FALSE. |
Unified Goodness of Fit Statistics
matrix with two rows: top row carrying chi-square value, bottom row p-values
## Not run:
# Fit model first
transmatrix <- multi_transmat(pre_test, pst_test)
res <- lca_cor(transmatrix)

# Calculate goodness of fit
fit_stats <- fit_model(pre_test, pst_test,
                       res$params[nrow(res$params), ],
                       res$params[-nrow(res$params), ])
## End(Not run)
Format transition matrix result with appropriate row and column names
format_transition_matrix(transition_list, n_items, add_aggregate = FALSE)
transition_list |
list of transition vectors |
n_items |
number of items |
add_aggregate |
whether to add aggregate row |
formatted matrix
Adjusts observed 1s based on propensity to guess (based on observed 0s) and item level gamma. You can also put in your best estimate of hidden knowledge behind don't know responses.
group_adj(pre = NULL, pst = NULL, gamma = NULL, dk = 0.03)
pre |
pre data frame. Required. Each vector within the data frame should only take values 0, 1, and 'd'. |
pst |
pst data frame. Required. Each vector within the data frame should only take values 0, 1, and 'd'. |
gamma |
probability of getting the right answer without knowledge |
dk |
Numeric. Between 0 and 1. Hidden knowledge behind don't know responses. Default is .03. |
nested list of pre and post adjusted responses, and adjusted learning estimates
pre_test_var <- data.frame(pre = c(1,0,0,1,"d","d",0,1,NA))
pst_test_var <- data.frame(pst = c(1,NA,1,"d",1,0,1,1,"d"))
gamma <- c(.25)
group_adj(pre_test_var, pst_test_var, gamma)
Adjusts observed 1s based on item level parameters of the LCA model. Currently only takes data with Don't Know responses, and treats don't know responses as true confessions of ignorance. If NAs are observed in the data, they are treated as acknowledgments of ignorance.
lca_adj(pre = NULL, pst = NULL)
pre |
pre data frame |
pst |
pst data frame |
list of pre and post adjusted responses
pre_test_var <- data.frame(pre = c(1, 0, 0, 1, "d", "d", 0, 1, NA))
pst_test_var <- data.frame(pst = c(1, NA, 1, "d", 1, 0, 1, 1, "d"))
lca_adj(pre_test_var, pst_test_var)
guesstimate
lca_cor(
  transmatrix = NULL,
  nodk_priors = c(0.3, 0.1, 0.1, 0.25),
  dk_priors = c(0.3, 0.1, 0.2, 0.05, 0.1, 0.1, 0.05, 0.25)
)
transmatrix |
transition matrix returned from |
nodk_priors |
Optional. Vector of length 4. Priors for the parameters for model that fits data without Don't Knows |
dk_priors |
Optional. Vector of length 8. Priors for the parameters for model that fits data with Don't Knows |
list with two items: parameter estimates and estimates of learning
# Without DK
pre_test <- data.frame(item1 = c(1, 0, 0, 1, 0), item2 = c(1, NA, 0, 1, 0))
pst_test <- pre_test + cbind(c(0, 1, 1, 0, 0), c(0, 1, 0, 0, 1))
transmatrix <- multi_transmat(pre_test, pst_test)
res <- lca_cor(transmatrix)
Convenience wrapper: creates transition matrix and fits model.
lca_fit(pre_test, pst_test, ...)
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
... |
passed to lca_cor() |
output from lca_cor()
Fits an LCA model where item difficulty is parameterized using IRT-style difficulty parameters instead of raw gamma (guessing probability). This allows difficulty to be unbounded on the real line, which can improve optimization and makes difficulty parameters more interpretable.
lca_irt(
  transmatrix = NULL,
  base_rate = 0.25,
  nodk_priors = c(0.35, 0.3, 0.35, 0),
  dk_priors = c(0.25, 0.15, 0.1, 0.1, 0.15, 0.1, 0.15, 0)
)
transmatrix |
Transition matrix returned from |
base_rate |
Numeric. Minimum guessing probability (random chance). Default 0.25 (1/4 for 4-choice items). This is the floor for gamma when difficulty → +∞. |
nodk_priors |
Optional. Vector of length 4. Starting values for (gg, gk, kk, difficulty). First 3 must sum to 1. |
dk_priors |
Optional. Vector of length 8. Starting values for DK model. First 7 must sum to 1. |
IRT-Parameterized LCA Estimation
The relationship between difficulty (d) and gamma is: gamma = base_rate + (1 - base_rate) * logistic(-d)
Where logistic(x) = 1/(1+exp(-x)). This means:
d = 0: gamma = base_rate + 0.5*(1-base_rate) (middle difficulty)
d → +∞: gamma → base_rate (hard item, random guessing)
d → -∞: gamma → 1 (easy item, always correct even when guessing)
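The three limiting cases above can be checked numerically with a few lines of base R. The helper name difficulty_to_gamma is hypothetical; the formula is the one documented for the difficulty argument of simulate_lca(), gamma = base_rate + (1 - base_rate) * plogis(-difficulty).

```r
# Hypothetical helper (not a package function) implementing the documented
# mapping from difficulty to gamma.
difficulty_to_gamma <- function(d, base_rate = 0.25) {
  base_rate + (1 - base_rate) * plogis(-d)
}

d <- c(-10, -1, 0, 1, 10)
gamma <- difficulty_to_gamma(d)

# Very easy items approach 1, very hard items approach base_rate,
# and d = 0 sits halfway between base_rate and 1.
print(round(gamma, 4))
```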
A guess_fit object with additional components:
params |
Parameter matrix with "difficulty" row instead of "gamma" |
gamma |
Derived gamma values from difficulty (added for convenience) |
learning |
Learning estimates (gk or gk + kd) |
# Simulate data with known difficulty
sim <- simulate_lca(n = 500, n_items = 3, difficulty = c(1, 0, -1), seed = 123)
transmatrix <- multi_transmat(sim$pre, sim$post)

# Fit with IRT parameterization
fit_irt <- lca_irt(transmatrix)
fit_irt$params["difficulty", ]  # Should recover approximately c(1, 0, -1)
fit_irt$gamma  # Derived gamma values
Bootstrapped Standard Errors
lca_se(
  pre_test = NULL,
  pst_test = NULL,
  n_resamples = 100,
  seed = 31415,
  force9 = FALSE
)
pre_test |
data.frame carrying pre_test items |
pst_test |
data.frame carrying pst_test items |
n_resamples |
number of resamples, default is 100 |
seed |
random seed, default is 31415 |
force9 |
Optional. Force 9-column format even if no DK responses. Default is FALSE. |
list with:
se_params |
standard errors of parameters by item |
avg_effects |
mean learning estimates |
se_effects |
standard error of learning by item |
pre_test <- data.frame(pre_item1 = c(1, 0, 0, 1, 0), pre_item2 = c(1, NA, 0, 1, 0))
pst_test <- data.frame(
  pst_item1 = pre_test[, 1] + c(0, 1, 1, 0, 0),
  pst_item2 = pre_test[, 2] + c(0, 1, 0, 0, 1)
)
## Not run: lca_se(pre_test, pst_test, n_resamples = 10, seed = 31415)
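The general idea of a bootstrapped standard error can be sketched without the package. The example below resamples individuals with replacement and uses the raw gain per item (mean post minus mean pre) as a stand-in statistic; lca_se() bootstraps the LCA estimates instead, so this illustrates only the resampling scheme. The helper name boot_se_gain and the toy data are hypothetical.

```r
# Hypothetical helper (not a package function): bootstrap SE of the raw
# per-item gain, resampling rows (individuals) with replacement.
boot_se_gain <- function(pre, post, n_resamples = 100, seed = 31415) {
  set.seed(seed)
  n <- nrow(pre)
  gains <- replicate(n_resamples, {
    idx <- sample.int(n, n, replace = TRUE)
    colMeans(post[idx, , drop = FALSE], na.rm = TRUE) -
      colMeans(pre[idx, , drop = FALSE], na.rm = TRUE)
  })
  gains <- matrix(gains, nrow = ncol(pre))  # items x resamples
  apply(gains, 1, sd)                       # one SE per item
}

# Toy data: 200 individuals, 2 binary items.
set.seed(1)
pre  <- data.frame(item1 = rbinom(200, 1, 0.4), item2 = rbinom(200, 1, 0.5))
post <- data.frame(item1 = rbinom(200, 1, 0.6), item2 = rbinom(200, 1, 0.7))

se <- boot_se_gain(pre, post, n_resamples = 200)
print(se)
```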
Calculate log-likelihood for transition data
log_likelihood(params, data)
params |
numeric vector of length 4 (nodk) or 8 (dk) |
data |
numeric vector of transition counts |
scalar log-likelihood
params <- c(0.4, 0.3, 0.3, 0.25)
data <- c(x00 = 10, x01 = 5, x10 = 3, x11 = 12)
log_likelihood(params, data)
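For intuition about what such a log-likelihood involves, the sketch below derives cell probabilities for the no-DK case from the three latent classes documented under simulate_lca() (gg guesses at both times, gk guesses pre and is correct post, kk is correct at both times) and evaluates a multinomial log-likelihood without the constant term. The helper names are hypothetical, the cell labels assume the first index is the pre-test response and the second the post-test response, and the package's log_likelihood() may be parameterized differently.

```r
# Hypothetical helpers (not package functions).
# params = c(gg, gk, kk, gamma); xij = (pre = i, post = j) is an assumption.
nodk_cell_probs <- function(params) {
  gg <- params[[1]]; gk <- params[[2]]; kk <- params[[3]]; g <- params[[4]]
  c(x00 = (1 - g)^2 * gg,                   # gg guesses wrong twice
    x01 = (1 - g) * g * gg + (1 - g) * gk,  # wrong pre, right post
    x10 = g * (1 - g) * gg,                 # lucky pre, wrong post
    x11 = g^2 * gg + g * gk + kk)           # right at both times
}

multinom_loglik <- function(params, counts) {
  p <- nodk_cell_probs(params)
  sum(counts[names(p)] * log(p))  # multinomial kernel, constant dropped
}

params <- c(0.4, 0.3, 0.3, 0.25)
counts <- c(x00 = 10, x01 = 5, x10 = 3, x11 = 12)

p  <- nodk_cell_probs(params)
ll <- multinom_loglik(params, counts)
print(p)
print(ll)
```

The four cell probabilities sum to 1 whenever gg + gk + kk = 1, which gives a quick sanity check on the derivation.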
Needs an 'interleaved' dataframe (see interleave function). Each pre-test item should be followed by the corresponding post-test item, and so on. Don't knows must be coded as NA. Function handles items without don't know responses. The function is used internally. It calls transmat.
multi_transmat(
  pre_test = NULL,
  pst_test = NULL,
  subgroup = NULL,
  force9 = FALSE,
  agg = FALSE
)
pre_test |
Required. data.frame carrying responses to pre-test questions. |
pst_test |
Required. data.frame carrying responses to post-test questions. |
subgroup |
a Boolean vector identifying the subset. Default is NULL. |
force9 |
Optional. There are cases where DK data doesn't have DK. But we need the entire matrix. By default it is FALSE. |
agg |
Optional. Boolean. Whether or not to add a row of aggregate transitions at the end of the matrix. Default is FALSE. |
multi_transmat: transition matrix of all the items
matrix with one row per item, plus a final row carrying the aggregate distribution across items when agg = TRUE. The matrix has 4 columns when there is no don't know option and 9 columns when there is one.
pre_test <- data.frame(pre_item1 = c(1,0,0,1,0), pre_item2 = c(1,NA,0,1,0))
pst_test <- data.frame(pst_item1 = pre_test[,1] + c(0,1,1,0,0),
                       pst_item2 = pre_test[,2] + c(0,1,0,0,1))
multi_transmat(pre_test, pst_test)
Converts NAs to 0s
nona(vec = NULL)
vec |
Required. Character or Numeric vector. |
Character vector.
x <- c(NA, 1, 0); nona(x)
x <- c(NA, "dk", 0); nona(x)
Calculate perplexity from individual-level data
perplexity_individuals(lca_result, pre_test, pst_test, per_individual = FALSE)
lca_result |
output from lca_cor() or lca_fit() |
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
per_individual |
logical; return per-individual perplexity? |
numeric scalar or vector
Lower perplexity indicates better model fit.
perplexity_items(lca_result, transmatrix, item = NULL)
lca_result |
output from lca_cor() or numeric parameter vector |
transmatrix |
numeric matrix of transition counts (items x cells) |
item |
optional integer; specific item index (NULL = aggregate) |
numeric scalar perplexity
## Not run:
transmatrix <- multi_transmat(pre_test, pst_test)
res <- lca_cor(transmatrix)
perplexity_items(res, transmatrix)
## End(Not run)
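As a point of reference, perplexity is conventionally defined as the exponential of the negative average log-likelihood per observation. The sketch below applies that standard definition to a vector of transition counts and model cell probabilities; the helper name perplexity_from_counts is hypothetical, and the package's exact computation is assumed rather than confirmed to match.

```r
# Hypothetical helper (not a package function): exp(-mean log-likelihood).
perplexity_from_counts <- function(probs, counts) {
  keep <- counts > 0  # empty cells contribute nothing to the sum
  exp(-sum(counts[keep] * log(probs[keep])) / sum(counts))
}

counts <- c(x00 = 10, x01 = 5, x10 = 3, x11 = 12)

# A model that spreads probability evenly over the 4 cells has perplexity 4.
pp_uniform <- perplexity_from_counts(rep(0.25, 4), counts)

# Probabilities equal to the observed shares give the lowest possible value.
pp_empirical <- perplexity_from_counts(counts / sum(counts), counts)

print(c(uniform = pp_uniform, empirical = pp_empirical))
```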
Uses Bayes' rule to compute P(class | data) for each individual. The LCA model uses the joint transition structure across all items to separate true learning from lucky guessing.
posterior_class_probs(lca_result, pre_test, pst_test)
lca_result |
output from lca_cor() or lca_fit() |
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
data.frame with columns P_gg, P_gk, P_kk (rows = individuals)
sim <- simulate_lca(n = 100, gk = 0.30, seed = 123, return_classes = TRUE)
fit <- lca_fit(sim$pre, sim$post)
posteriors <- posterior_class_probs(fit, sim$pre, sim$post)
head(posteriors)
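The Bayes' rule step can be illustrated for one individual under simplifying assumptions: a no-DK model, a single gamma shared by all items, one latent class per individual, and responses that are independent across items given the class. The helper posterior_classes and its default prior values are hypothetical; posterior_class_probs() works from a fitted model and may differ in detail.

```r
# Hypothetical helper (not a package function): P(class | responses) for one
# individual, with pre and post given as 0/1 vectors over items.
posterior_classes <- function(pre, post,
                              prior = c(gg = 0.35, gk = 0.30, kk = 0.35),
                              gamma = 0.25) {
  guess_lik <- function(x) ifelse(x == 1, gamma, 1 - gamma)
  lik <- c(
    gg = prod(guess_lik(pre) * guess_lik(post)),  # guesses at both times
    gk = prod(guess_lik(pre) * (post == 1)),      # guesses pre, knows post
    kk = prod((pre == 1) * (post == 1))           # knows at both times
  )
  joint <- prior * lik
  joint / sum(joint)
}

# Wrong on both items before, right on both after: most consistent with gk.
post_learner <- posterior_classes(pre = c(0, 0), post = c(1, 1))

# Right on everything at both times: most consistent with kk.
post_knower <- posterior_classes(pre = c(1, 1), post = c(1, 1))

print(round(post_learner, 3))
print(round(post_knower, 3))
```

Adding items sharpens the separation, because a run of lucky guesses across several items becomes increasingly unlikely under gg.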
Returns P(gk | data) for each individual, representing the probability that the individual truly learned (vs. guessing or already knowing).
posterior_learned(lca_result, pre_test, pst_test)
lca_result |
output from lca_cor() or lca_fit() |
pre_test |
data.frame of pre-test responses |
pst_test |
data.frame of post-test responses |
numeric vector of P(learned | data) for each individual
sim <- simulate_lca(n = 100, gk = 0.30, seed = 123, return_classes = TRUE)
fit <- lca_fit(sim$pre, sim$post)
p_learned <- posterior_learned(fit, sim$pre, sim$post)
cor(p_learned, sim$learned)
Print method for guess_cv
## S3 method for class 'guess_cv'
print(x, ...)
x |
guess_cv object |
... |
ignored |
invisible(x)
Print method for guess_fit
## S3 method for class 'guess_fit'
print(x, ...)
x |
guess_fit object |
... |
ignored |
invisible(x)
Functions to generate simulated pre/post test data from known LCA parameters for validation and parameter recovery studies.
Simulate Pre-Post Test Data (No DK Model)
simulate_lca(
  n,
  n_items = 1,
  gg = 0.35,
  gk = 0.3,
  kk = 0.35,
  gamma = 0.25,
  difficulty = NULL,
  base_rate = 0.25,
  seed = NULL,
  return_classes = FALSE
)
n |
Integer. Number of individuals to simulate. |
n_items |
Integer. Number of test items. Default 1. |
gg |
Numeric. Proportion in guess->guess state (stable ignorance). Default 0.35. |
gk |
Numeric. Proportion in guess->know state (LEARNED). Default 0.30. |
kk |
Numeric. Proportion in know->know state (stable knowledge). Default 0.35. |
gamma |
Numeric. Probability of guessing correctly. Can be scalar (same for all items) or vector of length n_items. Default 0.25. |
difficulty |
Numeric vector. Optional IRT difficulty parameters. If provided, gamma is computed as base_rate + (1 - base_rate) * plogis(-difficulty). Higher difficulty = harder item (lower gamma). Ignored if NULL. |
base_rate |
Numeric. Minimum guessing probability (random chance). Used when difficulty is specified. Default 0.25 (1/4 for 4-choice items). |
seed |
Optional integer. Random seed for reproducibility. |
return_classes |
Logical. If TRUE, also return true latent class assignments. Default FALSE for backward compatibility. |
Generates simulated pre/post test data from a latent class model with known parameters. Useful for parameter recovery validation studies.
The model simulates three latent classes:
- **gg (guess->guess)**: Don't know at both times. Responses are random guesses.
- **gk (guess->know)**: Learned between tests. Random guess pre, correct post.
- **kk (know->know)**: Know at both times. Correct responses at both times.
Parameters must satisfy: gg + gk + kk = 1 (constraint enforced automatically).
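The data-generating process described by the three classes can be sketched in base R for a single item. This is an illustration of the documented class behavior, not the package's simulate_lca() code; the helper name simulate_three_class is hypothetical.

```r
# Hypothetical helper (not a package function): one item, three latent classes.
simulate_three_class <- function(n, gg = 0.35, gk = 0.30, kk = 0.35,
                                 gamma = 0.25, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  cls <- sample(c("gg", "gk", "kk"), n, replace = TRUE, prob = c(gg, gk, kk))
  guess_pre  <- rbinom(n, 1, gamma)  # outcome of a random guess at pre-test
  guess_post <- rbinom(n, 1, gamma)  # outcome of a random guess at post-test
  pre  <- ifelse(cls == "kk", 1L, guess_pre)   # only kk knows at pre-test
  post <- ifelse(cls == "gg", guess_post, 1L)  # gk and kk know at post-test
  data.frame(class = cls, pre = pre, post = post)
}

sim <- simulate_three_class(5000, seed = 1)

# Observed pre-test accuracy mixes knowledge and lucky guesses:
# expected value is kk + (gg + gk) * gamma = 0.35 + 0.65 * 0.25 = 0.5125.
print(mean(sim$pre))
print(prop.table(table(sim$class)))
```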
When difficulty is specified, gamma values are derived using an IRT-like transformation: gamma_i = base_rate + (1 - base_rate) * plogis(-difficulty_i). This means:
- difficulty = 0: gamma = base_rate + 0.5 * (1 - base_rate) (middle)
- difficulty → +∞: gamma → base_rate (hard item, random guessing)
- difficulty → -∞: gamma → 1 (easy item, always correct)
List with components:
pre |
Data frame of pre-test responses (0/1 for each item) |
post |
Data frame of post-test responses (0/1 for each item) |
true_class |
(If return_classes=TRUE) Factor with levels "gg", "gk", "kk" |
learned |
(If return_classes=TRUE) Logical vector: TRUE if individual is in gk class |
# Simulate data with 30% learning
sim <- simulate_lca(n = 500, gg = 0.35, gk = 0.30, kk = 0.35, gamma = 0.25, seed = 123)
fit <- lca_fit(sim$pre, sim$post)
fit$params["gk", ]  # Should be close to 0.30

# Multi-item simulation
sim_multi <- simulate_lca(n = 500, n_items = 3, seed = 456)

# Item-specific gamma (vector)
sim_vec <- simulate_lca(n = 500, n_items = 3, gamma = c(0.2, 0.25, 0.3), seed = 789)

# IRT-style difficulty parameters
sim_irt <- simulate_lca(n = 500, n_items = 3, difficulty = c(1, 0, -1), seed = 101)

# Return true class assignments for validation
sim_classes <- simulate_lca(n = 500, gk = 0.30, seed = 123, return_classes = TRUE)
table(sim_classes$true_class)
mean(sim_classes$learned)  # Should be close to 0.30
Generates simulated pre/post test data from a latent class model with Don't Know responses.
simulate_lca_dk(
  n,
  n_items = 1,
  gg = 0.25,
  gk = 0.15,
  gd = 0.1,
  kg = 0.1,
  kk = 0.15,
  kd = 0.1,
  dd = 0.15,
  gamma = 0.25,
  difficulty = NULL,
  base_rate = 0.25,
  seed = NULL
)
n |
Integer. Number of individuals to simulate. |
n_items |
Integer. Number of test items. Default 1. |
gg |
Numeric. Proportion: guess->guess (stable ignorance). Default 0.25. |
gk |
Numeric. Proportion: guess->know (learned). Default 0.15. |
gd |
Numeric. Proportion: guess->dk. Default 0.10. |
kg |
Numeric. Proportion: know->guess (forgot). Default 0.10. |
kk |
Numeric. Proportion: know->know (stable knowledge). Default 0.15. |
kd |
Numeric. Proportion: know->dk. Default 0.10. |
dd |
Numeric. Proportion: dk->dk. Default 0.15. |
gamma |
Numeric. Probability of guessing correctly. Can be scalar (same for all items) or vector of length n_items. Default 0.25. |
difficulty |
Numeric vector. Optional IRT difficulty parameters. If provided, gamma is computed as base_rate + (1 - base_rate) * plogis(-difficulty). Higher difficulty = harder item (lower gamma). Ignored if NULL. |
base_rate |
Numeric. Minimum guessing probability (random chance). Used when difficulty is specified. Default 0.25 (1/4 for 4-choice items). |
seed |
Optional integer. Random seed for reproducibility. |
The DK model has 7 latent classes representing transitions between guess (g), know (k), and don't know (d) states:
- **gg**: guess both times
- **gk**: guess -> know (learned)
- **gd**: guess -> dk
- **kg**: know -> guess (forgot)
- **kk**: know -> know
- **kd**: know -> dk
- **dd**: dk -> dk
Parameters must sum to 1 (constraint enforced automatically).
When difficulty is specified, gamma values are derived using an IRT-like transformation: gamma_i = base_rate + (1 - base_rate) * plogis(-difficulty_i).
List with two data frames:
pre |
Pre-test responses (character: "0", "1", or "d") |
post |
Post-test responses (character: "0", "1", or "d") |
# Simulate DK data
sim <- simulate_lca_dk(n = 500, gk = 0.15, seed = 123)
fit <- lca_fit(sim$pre, sim$post)
fit$params["gk", ]  # Should be close to 0.15

# Item-specific gamma (vector)
sim_vec <- simulate_lca_dk(n = 500, n_items = 3, gamma = c(0.2, 0.25, 0.3), seed = 456)

# IRT-style difficulty parameters
sim_irt <- simulate_lca_dk(n = 500, n_items = 3, difficulty = c(1, 0, -1), seed = 789)
Estimate of learning adjusted with standard correction for guessing. Correction is based on number of options per question.
The function takes separate pre-test and post-test dataframes. Why do we need dataframes? To accommodate multiple items.
The items can carry NA (missing). Items must be in the same order in each dataframe. Assumes that respondents are posed the same questions twice.
The function also takes a lucky vector: the chance of getting a correct answer if guessing randomly. Each entry is 1/(number of options).
The function also optionally takes a vector carrying names of the items. By default, the vector carrying adjusted learning estimates takes the same item names as the pre_test items. However, you can assign a vector of names separately via item_names.
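For readers unfamiliar with the standard correction, the sketch below implements the classic formula-scoring version: the number of right answers minus the number of wrong answers times lucky / (1 - lucky), expressed as a share of answered items. The helper name standard_correction is hypothetical, and dropping NAs from the denominator is an assumption; stnd_cor() may treat missing values and don't know responses differently.

```r
# Hypothetical helper (not a package function): classic guessing correction,
# computed per item (column).
standard_correction <- function(pre, post, lucky) {
  adj <- function(df) {
    right <- colSums(df == 1, na.rm = TRUE)
    wrong <- colSums(df == 0, na.rm = TRUE)
    n     <- colSums(!is.na(df))  # assumption: NAs are dropped
    (right - wrong * lucky / (1 - lucky)) / n
  }
  pre_adj  <- adj(pre)
  post_adj <- adj(post)
  list(pre = pre_adj, post = post_adj, learning = post_adj - pre_adj)
}

pre  <- data.frame(item1 = c(1, 0, 0, 1, 0))
post <- data.frame(item1 = c(1, 1, 1, 1, 0))

res <- standard_correction(pre, post, lucky = 0.25)
print(res)
```

In this toy case the raw gain is 0.4 (from 2/5 to 4/5 correct), while the corrected gain is 8/15, about 0.53, because more of the pre-test successes are attributed to guessing.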
stnd_cor(pre_test = NULL, pst_test = NULL, lucky = NULL, item_names = NULL)
pre_test |
Required. data.frame carrying responses to pre-test questions. |
pst_test |
Required. data.frame carrying responses to post-test questions. |
lucky |
Required. A vector. Each entry is 1/(number of options) |
item_names |
Optional. A vector carrying item names. |
a list of three vectors, carrying pre-treatment corrected scores, post-treatment scores, and adjusted estimates of learning
# Without DK
pre_test <- data.frame(item1 = c(1,0,0,1,0), item2 = c(1,NA,0,1,0))
pst_test <- pre_test + cbind(c(0,1,1,0,0), c(0,1,0,0,1))
lucky <- rep(.25, 2); stnd_cor(pre_test, pst_test, lucky)

# With DK
pre_test <- data.frame(item1 = c(1,0,0,1,0,'d',0), item2 = c(1,NA,0,1,0,'d','d'))
pst_test <- data.frame(item1 = c(1,0,0,1,0,'d',1), item2 = c(1,NA,0,1,0,1,'d'))
lucky <- rep(.25, 2); stnd_cor(pre_test, pst_test, lucky)
Summary method for guess_cv
## S3 method for class 'guess_cv'
summary(object, ...)
object |
guess_cv object |
... |
ignored |
invisible summary
Summary method for guess_fit
## S3 method for class 'guess_fit'
summary(object, ...)
object |
guess_fit object |
... |
ignored |
invisible summary object
Prints Cross-wave transition matrix and returns the vector behind the matrix. Missing values are treated as ignorance. Don't know responses need to be coded as 'd'.
transmat(pre_test_var, pst_test_var, subgroup = NULL, force9 = FALSE)
pre_test_var |
Required. A vector carrying pre-test scores of a particular item. Only |
pst_test_var |
Required. A vector carrying post-test scores of a particular item |
subgroup |
Optional. A Boolean vector indicating rows of the relevant subset. |
force9 |
Optional. There are cases where DK data doesn't have DK. But we need the entire matrix. By default it is FALSE. |
a numeric vector. Assume 1 denotes a correct answer, 0 and NA incorrect, and d 'don't know.' When there is no don't know option and no missing, the entries are: x00, x10, x01, x11. When there is a don't know option, the entries of the vector are: x00, x10, xd0, x01, x11, xd1, x0d, x1d, xdd.
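A small base-R sketch shows how a four-cell transition vector in the order x00, x10, x01, x11 can be built for the no-DK case. It assumes the first index is the pre-test response and the second the post-test response, which is how a column-major cross-tabulation of pre (rows) by post (columns) unrolls; the helper name transition_counts is hypothetical and this is not the package's transmat() code.

```r
# Hypothetical helper (not a package function): counts of (pre, post) pairs.
# Assumption: xij means pre = i, post = j.
transition_counts <- function(pre, post) {
  tab <- table(factor(pre, levels = c(0, 1)),
               factor(post, levels = c(0, 1)))
  # as.vector() unrolls column by column: (0,0), (1,0), (0,1), (1,1)
  setNames(as.vector(tab), c("x00", "x10", "x01", "x11"))
}

pre_test_var <- c(1, 0, 0, 1, 0, 1, 0)
pst_test_var <- c(1, 0, 1, 1, 0, 1, 1)

counts <- transition_counts(pre_test_var, pst_test_var)
print(counts)  # x00 = 2, x10 = 0, x01 = 2, x11 = 3
```

Fixing the factor levels keeps empty cells (here x10) in the vector as zeros rather than dropping them.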
pre_test_var <- c(1,0,0,1,0,1,0)
pst_test_var <- c(1,0,1,1,0,1,1)
transmat(pre_test_var, pst_test_var)

# With NAs
pre_test_var <- c(1,0,0,1,"d","d",0,1,NA)
pst_test_var <- c(1,NA,1,"d",1,0,1,1,"d")
transmat(pre_test_var, pst_test_var)
Validate that two data frames have compatible dimensions
validate_compatible_dataframes(pre_test, pst_test)
pre_test |
pre-test data frame |
pst_test |
post-test data frame |
TRUE if valid, throws error otherwise
Validate gamma parameter
validate_gamma(gamma)
gamma |
probability parameter |
TRUE if valid, throws error otherwise
Validate lucky vector for standard correction
validate_lucky_vector(lucky, n_items)
lucky |
vector of guessing probabilities |
n_items |
number of items to validate against |
TRUE if valid, throws error otherwise
Validate prior parameters
validate_priors(priors, expected_length, param_name)
priors |
vector of prior parameters |
expected_length |
expected length of priors vector |
param_name |
name of parameter for error messages |
TRUE if valid, throws error otherwise
Performs Monte Carlo simulations to assess parameter recovery of the LCA model. Useful for validating estimator performance.
validate_recovery(true_params, n = 500, n_items = 2, n_sims = 100, seed = NULL)
true_params |
Named numeric vector of true parameters. For no-DK model: c(gg=, gk=, kk=, gamma=) For DK model: c(gg=, gk=, gd=, kg=, kk=, kd=, dd=, gamma=) |
n |
Integer. Sample size per simulation. Default 500. |
n_items |
Integer. Number of items. Default 2. |
n_sims |
Integer. Number of Monte Carlo simulations. Default 100. |
seed |
Optional integer. Random seed for reproducibility. |
Data frame with one row per parameter containing columns: parameter (name), true_value, mean_estimate, bias (mean estimate minus true), rmse (root mean squared error), se (standard deviation of estimates), and coverage_95 (proportion of times 95
## Not run:
# Validate no-DK model recovery
results <- validate_recovery(
  c(gg = 0.35, gk = 0.30, kk = 0.35, gamma = 0.25),
  n = 500, n_sims = 50
)
print(results)

# Validate DK model recovery
results_dk <- validate_recovery(
  c(gg = 0.25, gk = 0.15, gd = 0.10, kg = 0.10,
    kk = 0.15, kd = 0.10, dd = 0.15, gamma = 0.25),
  n = 500, n_sims = 50
)
## End(Not run)
Validate transition matrix values
validate_transition_values(pre_test_var, pst_test_var)
pre_test_var |
pre-test variable vector |
pst_test_var |
post-test variable vector |
TRUE if valid, throws error otherwise