Peer-Review Record

dexter: An R Package to Manage and Analyze Test Data

Psych 2023, 5(2), 350-375; https://doi.org/10.3390/psych5020024
by Ivailo Partchev 1,*, Jesse Koops 1, Timo Bechger 2, Remco Feskens 1 and Gunter Maris 3
Submission received: 29 March 2023 / Revised: 20 April 2023 / Accepted: 21 April 2023 / Published: 28 April 2023
(This article belongs to the Special Issue Computational Aspects and Software in Psychometrics II)

Round 1

Reviewer 1 Report

Review of "dexter: An R Package to Manage and Analyze Test Data"

I read the article carefully and with great interest. Overall I find this is generally a good article introducing and highlighting features of a splendid piece of software, dexter.

The article appears correct and well supported by references, and has a number of great ideas. I think the article can be improved in a few sections, mainly for clarity, presentation and flow - I'll offer some suggestions below. I recommend publishing the article after these points are addressed.   

Specific comments:
Page 1:
- I'd not list the authors in the main text body so just "At that time, all four authors were employed at Cito ..." Also, I must say that this unblinded the article for me.
- Last not least -> Last but not least
Page 2:
- Sentence with "For example, we read that..." I'm not sure what is referred to - is this missing a citation or reference to a specific document? So "For example in the rule book for XXX, we read..."
- "Models" as in "footwear models"? this should be made specific as models was used earlier for referring to measurement models.    
- 2PL, 3PL, 4PL etc are not cited nor are the abbreviations explained.
- In the bullet point starting with "Conditional Maximum Likelihood (CML) ..." I'd add "of item parameters", so Conditional Maximum Likelihood (CML) of item parameters..."
Page 3:
I like the listing of the functions in dexter and the brief description of what they do. Maybe a table with an overview of function, main arguments and functionality would benefit the reader?
- It probably depends a bit on the reader, but in general I think it would be great to have more explanation of what happens in the included code chunks. For example, not everyone might know sprintf() and what it does, or that it is not a function from dexter; the same goes for pivot_longer(), which requires tidyr. Also, I wondered what the ":memory:" argument in start_new_project() is doing (it is not explained in the paragraph describing the function) - it also has curious syntax with the two colons. Similarly, the argument "one" in add_booklet is not explained. Just in general, being a bit more verbose about what is seen in the code chunks would improve the article and its usability.
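For concreteness, a minimal sketch of such a chunk along the lines of dexter's documented examples (verbAggrRules and verbAggrData are the example data shipped with the package; the chunk in the manuscript may differ):

    library(dexter)
    # ":memory:" is the SQLite convention for a temporary, in-memory database;
    # giving a file name instead (e.g. "project.db") keeps the project on disk.
    db <- start_new_project(verbAggrRules, db_name = ":memory:")
    # add_booklet() stores a data frame of responses under a booklet identifier;
    # the string "one" is simply the name chosen for that booklet.
    add_booklet(db, verbAggrData, booklet_id = "one")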
Page 5/Figure 1:
- I was first confused by the Pval and that this was referenced as "proportion correct", as well as what Rit and Rir stand for. The first is explained later in the article as an irritating misnomer, which I agree with - this should already be mentioned here to avoid the confusion. Also, it would make everything clearer if, in "On the bottom, we see the basic...", the abbreviations in the plot were added to the list in the main body, like this: proportion correct (Pval), item-total correlation (Rit) and item-rest correlation (Rir). I'd appreciate a more thorough discussion of what is seen in Figure 1 (i.e., that the categories are colored and labeled with the response, that the (1) means correct and (0) means distractor and so on) as this would help with the explanations that follow.
- On the bottom -> At the bottom     
- "When two of the wrong responses" Why not list which categories these were? e.g., "When two of the wrong responses, i.e., categories 3 (green line) and 4 (orange line) are so..." I understand that all of this can be deduced based on plausibility but I think it is better to be as clear and direct as possible in scientific writing, so there is no room for error in what the authors meant.
- It is stated that fit_inter fits two psychometric models, but in the next paragraph fit_enorm is mentioned and it is not clear at this point how the two functions differ. Also, there is never any mention of fit_inter later on, only fit_enorm. Could it be that it should have been fit_enorm instead of fit_inter, or the other way round, in the two paragraphs? I think this needs to be explained more clearly, as not everyone is as intimately familiar with these models and their relationships as the authors clearly are (see the short sketch below contrasting the two functions).
- The "more restrictive calibration model". At this point it is not clear yet what the calibration model is. This should be made more explicit.
- Can plot also be used for an object returned by fit_inter as for fit_enorm or not? Also instead of "output of fit_enorm" I'd say "object returned from fit_enorm" which makes it clear that there is a plot method for the class.
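For orientation, a minimal sketch of how the two functions relate, assuming dexter's documented interface (the item id "S1DoCurse" comes from the package's verbal-aggression example data):

    parms <- fit_enorm(db)  # the calibration model (Rasch / partial credit), CML by default
    fit   <- fit_inter(db)  # also fits the interaction model, the backdrop for judging item fit
    plot(fit, items = "S1DoCurse")  # observed data against the curves implied by both models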
Page 6:
Now the calibration model is described. I wasn't familiar with the extended terminology - does "extended" generally refer to models for the population? I'd mention that.
Page 7:
"this involves averaging over the samples" - do I understand this correctly as the returned point estimates are always the posterior means? What about the standard erors - are they the posterior standard deviation? Can the user change that, e.g., use the posterior median? I'm asking because it looks like the posterior could well be skewed with the specified Gamma priors.
- "In the final run". I'm not sure what this means - does this refer to the last run of the sampler or the last use of a function?
- primitive -> basic
- "no true individual differences in the data". I'd give the null hypothesis here; specifically, with individual I'm not sure if this refers to differences between people in the response patterns (interindividual) or to differences in responses for the same person (intraindividual), or both although I think the first (this might be expanded on with functions interindividual_differences or intraindividual_differences testing the respective ones).
Page 8
- I'm not sure what the relevance of the paragraph with 3DC is - it comes out of the blue and seems to be misplaced here, or at least the relevance is not explained (other than that dexter supports that). Please expand or rewrite.
- "Correct prior" is a bit of a difficult statement, which would suggest that there is one. Maybe "appropriate" is meant?
- In general I have some misgivings on how this part on priors and the Bayesian analysis is phrased. There are enough Bayesians who would object to the idea that one can just "try a better chosen prior" as the prior should reflect all the a priori information we have and thus there is nothing new to be learned about the prior when looking at the posterior. Also the language with "correct" and that "the posterior suggesting a different functional form than the prior" (I'm not sure what the latter means - is this related to conjugacy?) might be seen as problematic in a Bayesian setting (granted it makes sense if we don't follow the philosophical tenets of Bayesian inference but just need a good and flexible computational tool to fit our models, which seems to be the stance taken by the authors). I'd suggest to rewrite this with the above points in mind.
- I think the choice of priors is nice but I'd appreciate a reference that this indeed provides enough flexibility for modelling latent traits that are skewed or heavy-tailed (especially in light of the "correct prior" and choosing the prior statements from before). A mixture of two normals may be a bit arbitrary - what if there are 4 discrete-ish groups on the latent trait?
Page 9:
- Figure 3 should get a nicer width/height ratio. Also more explanation in the caption of what is seen and what the colors mean would be great.  
- In the code chunk I'm confused by person_properties=list(gender="unknown") and that later there is a call to DIF with gender as argument. Does this first set gender to unknown for everyone, and is DIF then calculated for all people as one group (instead of separated by male/female)? But that doesn't fit Figure 3, so I'm probably misunderstanding something (see the sketch below of how the two pieces interact).
- Maybe mention in the "Test for DIF" that the null is no DIF and that here we reject that there is no DIF (at least that is how I read it).
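A sketch of how the two pieces interact, assuming dexter's documented interface: the "unknown" string is only a default for respondents whose gender is not supplied, and DIF() uses the values actually stored in the database.

    db <- start_new_project(verbAggrRules, ":memory:",
                            person_properties = list(gender = "unknown"))
    add_booklet(db, verbAggrData, "agg")  # the gender values come in with the response data
    # DIF() contrasts the relative item difficulties between the groups defined by the property
    dif_gender <- DIF(db, person_property = "gender")
    plot(dif_gender)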
Page 10:
- "It is easy to see..." Isn't there more to see? Aren't this DIF scores between male and female so this should be reflected in the interpretation? For the explanantion of two groups of items based on the mode, I don't think one needs to look at the DIF plot. More explanation of what to be concluded from Figure 4 seems warranted.
- Please explain what the profiles are.  
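A profile, in this context, splits the test score into subscores defined by an item property. A sketch using the package's example data (the property "mode" distinguishes Do from Want items), assuming the documented interface:

    add_item_properties(db, verbAggrProperties)  # item property table with a 'mode' column (Do / Want)
    # compares how two groups with the same total score divide it over the Do and Want subscales
    profile_plot(db, item_property = "mode", covariate = "gender")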
Page 11:
Section 3 is in my opinion the most well written one.
- generalisation PCM -> generalisation of the PCM
Page 12:
- footnote 2: "whereas non-boldface variables are numbers" Is this meant to say scalar instead of number (x_{pij} is not boldface and not a number)?
- Does that script E in (1) denote the expected value? Please mention that.
- Is there an intuitive interpretation for the lambda parameter? (similar to b_{ij} being an item parameter)
- In the Bayesian model I'd appreciate a discussion of why Gamma priors are chosen for _every_ random variable (y, lambda, b). I suspect this is because the Gibbs sampler works efficiently and has some other nice properties, but it strikes me as needing an argument from a substantive point of view and/or an explanation of why the choice doesn't really matter (I can imagine that, for the use cases that dexter was envisioned for, we have such high sample sizes that the influence of the prior becomes inconsequential and one can just use what is algorithmically most convenient).
Page 13:
- just after (4) it says power law prior - this might be confusing since a Gamma prior was mentioned before; of course, in (4) I recognize that there is a power law with exponential cut-off of the form x^(a-1) exp(-bx) if (a-1) < 0, and this relates to the Gamma pdf, but that might not be obvious to all (the Gamma density is written out below for reference).
- This is mainly an issue of the typesetting of the MDPI style, but I find it hard to distinguish n and \eta - can you choose a different letter instead of \eta?
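For reference, the Gamma(a, b) density, which is exactly such a power law with an exponential cut-off:

    f(x) = \frac{b^{a}}{\Gamma(a)} \, x^{a-1} e^{-b x}, \qquad x > 0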
Page 14:
- There are 3 parameters for the Gamma distribution in (7) (Gamma is a two-parameter family); I think there is a typo.  
Figure 6:
- I'd use a better width/height setting.
Page 16:
- "is the model that fits classical test theory" sounds strange to me. Perhaps "is the IRT model that corresponds to the quantities of interest in classical test theory" or something like that.
- What does the tilde over the | in Theta \tilde{|} X signify?
- I'd type set the mu and sigma as \texttt{mu} and \texttt{sigma} so this corresponds to the type setting in the code chunk.  
Page 17:
- There is a "have have" in the second paragraph of 3.5
- I'd also refer to page 7 when the individual differences test is mentioned
- "use an item property to specify" does item property refer to code here? If so, I'd include an example code.
- "should note take longer" -> should not take    
- In section 4, the extensive example would benefit from more structure. I'd structure the analysis as in the steps described, with paragraphs/subsections. E.g.,
4.1 Downloading and Parsing
Analyzing  more recent waves...  
4.2. Create a Dexter database
Now start a new dexter project ...
etc.
Page 18:
- Why are the code chunks suddenly grey blocks? Does this signify anything? I think it should be consistent with all the other chunks.
- As mentioned before, a bit more verbose descriptions of what the code is doing would be great.
Page 19:
- "the correspondence is more than decent" I feel like there is a reference to Figure 7 missing somewhere. Same for Figure 8. Figures 7 and 8 are not referred to in the main text.
Page 20:
- I'd make a newline and indent before Zwitser et al.
Page 21:
The article ends abruptly. I'd recommend to write a discussion section.  

Author Response

Many thanks for this enormously helpful review. It is amazing how thorough it is, especially given the short time allowance. It has prompted me to rewrite Sections 1 and 2, while my betters have taken care of Section 3.

Page 1:
- I'd not list the authors in the main text body ... so just "At that time, all four authors were employed at Cito ..." Also, I must say that this unblinded the article for me.  what's done cannot be undone
- Last not least -> Last but not least corrected
Page 2:
- Sentence with "For example, we read that..." I'm not sure what is referred - is this missing a citation or reference to a specific document? So "For example in the rule book for XXX, we read..." corrected
- "Models" as in "footwear models"? this should be made specific as models was used earlier for referring to measurement models. corrected    
- 2PL, 3PL, 4PL etc are not cited nor are the abbreviations explained. fixed
- In the bullet point starting with "Conditional Maximum Likelihood (CML) ..." I'd add "of item parameters", so Conditional Maximum Likelihood (CML) of item parameters..." fixed
Page 3:
I like the listing of the functions in dexter and the brief description of what they do. Maybe a table with an overview of function, main arguments and functionality would benefit the reader? fixed
- It probably depends a bit on the reader, but in general I think it would be great to have more explanation of what happens in the included code chunks. For example, not everyone might know the sprintf() and its effects and that this is not a function from dexter; same for pivot_longer which needs tidyR. Also, I wondered what the ":memory:" argument in start_new_project() is doing (not explained in the paragraph describing the function) - it also has curious syntax with the two :. Similarly, the argument "one" in add_booklet is not explained. Just in general being a bit more verbose with the explanations of what is seen in code in the chunks would improve the article and its usability. all fixed
Page 5/Figure 1:
- I was first confused by the Pval and that this was referenced as "proportion correct" as well as what Rit and Rir stand for. The first is explained later in the article as an irritating misnomer which I agree with - this should already be mentioned here to avoid this confusion. Also it makes everything clearer if "On the bottom, we see the basic..." the abbreviations in the plot were added to the list in the main body like this: proportion correct (Pval), item-total (Rit) and item-test correlation(Rir). I'd appreciate a more thorough discussion of what is seen in Figure 1 (i.e., that the categories are colored and labeled with the response, that the (1) means correct and (0) means distractor and so on) as this would help with the explanations that follow.  fixed    
- On the bottom -> At the bottom     corrected
- "When two of the wrong responses" Why not list which categories these were? e.g., "When two of the wrong responses, i.e., categories 3 (green line) and 4 (orange line) are so..." I understand that all of this can be deduced based on plausibility but I think it is better to be as clear and direct as possible in scientific writing, so there is no room for error in what the authors meant. addressed

- It is stated that fit_inter fits two psychometric models, but in the next paragraph fit_enorm is mentioned and it is not clear at this point how the two functions differ. Also, there is never any mention of fit_inter later on, only fit_enorm. Could it be that it should have been fit_enorm instead of fit_inter, or the other way round, in the two paragraphs? I think this needs to be explained more clearly, as not everyone is as intimately familiar with these models and their relationships as the authors clearly are. thank you very much, it helped restructure and improve the argument
- The "more restrictive calibration model". At this point it is not clear yet what the calibration model is. This should be made more explicit. ditto
- Can plot also be used for an object returned by fit_inter as for fit_enorm or not? Also instead of "output of fit_enorm" I'd say "object returned from fit_enorm" which makes it clear that there is a plot method for the class. I don't particularly like plots with latent axes but I added a paragraph to say they can be produced
Page 6:
Now the calibration model is described. I wasn't familiar with the extended terminology - does "extended" generally refer to models for the population? I'd mention that. I tried to disambiguate, but it refers to a couple of very important papers that few people seem to read, and everyone connects it to LLTM. But it would open up a long discussion to fully address
Page 7:
- "this involves averaging over the samples" - do I understand this correctly, i.e., the returned point estimates are always the posterior means? What about the standard errors - are they the posterior standard deviations? Can the user change that, e.g., use the posterior median? I'm asking because it looks like the posterior could well be skewed with the specified Gamma priors. That's very interesting, because analytic approaches love the mode (pure optimization) and avoid the median (no derivatives), while with MCMC it is exactly the opposite (the mode is hell to estimate from data). I have shown how to do the median and the MAD. In the example, they are practically the same as the mean and the SD
- "In the final run". I'm not sure what this means - does this refer to the last run of the sampler or the last use of a function? British (I think) -- I changed it
- primitive -> basic fixed
- "no true individual differences in the data". I'd give the null hypothesis here; specifically, with individual I'm not sure if this refers to differences between people in the response patterns (interindividual) or to differences in responses for the same person (intraindividual), or both although I think the first (this might be expanded on with functions interindividual_differences or intraindividual_differences testing the respective ones). Yes, it was imprecise, fixed now
Page 8
- I'm not sure what the relevance of the paragraph with 3DC is - it comes out of the blue and seems to be misplaced here, or at least the relevance is not explained (other than that dexter supports that). Please expand or rewrite. Well, people do standard settings, some download the software, so it seemed worth mentioning
- "Correct prior" is a bit of a difficult statement, which would suggest that there is one. Maybe "appropriate" is meant? I fixed the whole part about priors, hopefully better now, and not confusing theory with practice
- In general I have some misgivings on how this part on priors and the Bayesian analysis is phrased. There are enough Bayesians who would object to the idea that one can just "try a better chosen prior" as the prior should reflect all the a priori information we have and thus there is nothing new to be learned about the prior when looking at the posterior. Also the language with "correct" and that "the posterior suggesting a different functional form than the prior" (I'm not sure what the latter means - is this related to conjugacy?) might be seen as problematic in a Bayesian setting (granted it makes sense if we don't follow the philosophical tenets of Bayesian inference but just need a good and flexible computational tool to fit our models, which seems to be the stance taken by the authors). I'd suggest to rewrite this with the above points in mind.
- I think the choice of priors is nice but I'd appreciate a reference that this indeed provides enough flexibility for modelling latent traits that are skewed or heavy-tailed (especially in light of the "correct prior" and choosing the prior statements from before). A mixture of two normals may be a bit arbitrary - what if there are 4 discrete-ish groups on the latent trait? We have the reference, and we have included it
Page 9:
- Figure 3 should get a nicer width/height ratio. Also more explanation in the caption of what is seen and what the colors mean would be great. fixed
- In the code chunk I'm confused by person_properties=list(gender="unknown") and that later there is a call to DIF with gender as argument. Does this first set gender to unknown for everyone, and is DIF then calculated for all people as one group (instead of separated by male/female)? But that doesn't fit Figure 3, so I'm probably misunderstanding something. I admit it is confusing, and I have explained what it means
- Maybe mention in the "Test for DIF" that the null is no DIF and that here we reject that there is no DIF (at least that is how I read it).
Page 10:
- "It is easy to see..." Isn't there more to see? Aren't these DIF scores between male and female, so this should be reflected in the interpretation? For the explanation of two groups of items based on the mode, I don't think one needs to look at the DIF plot. More explanation of what is to be concluded from Figure 4 seems warranted. fixed
- Please explain what the profiles are. explained and expanded
Page 11:
Section 3 is in my opinion the most well written one.
- generalisation PCM -> generalisation of the PCM fixed
Page 12:
- footnote 2: "whereas non-boldface variables are numbers" Is this meant to say scalar instead of number (x_{pij} is not boldface and not a number)? fixed
- Does that script E in (1) denote the expected value? Please mention that.
- Is there an intuitive interpretation for the lambda parameter? (similar to b_{ij} being an item parameter) fixed
- In the Bayesian model I'd appreciate a discussion of why Gamma priors are chosen for _every_ random variable (y, lambda, b). I suspect this is because the Gibbs sampler works efficiently and has some other nice properties, but it strikes me as needing an argument from a substantive point of view and/or an explanation of why the choice doesn't really matter (I can imagine that, for the use cases that dexter was envisioned for, we have such high sample sizes that the influence of the prior becomes inconsequential and one can just use what is algorithmically most convenient). hopefully explained by the colleagues
Page 13:
- just after (4) it says power law prior - this might be confusing since a Gamma prior was mentioned before; of course, in (4) I recognize that there is a power law with exponential cut-off of the form x^(a-1) exp(-bx) if (a-1) < 0, and this relates to the Gamma pdf, but that might not be obvious to all.
- This is mainly an issue of the typesetting of the MDPI style, but I find it hard to distinguish n and \eta - can you choose a different letter instead of \eta?
Page 14:
- There are 3 parameters for the Gamma distribution in (7) (Gamma is a two-parameter family); I think there is a typo. indeed, fixed now
Figure 6:
- I'd use a better width/height setting. fixed
Page 16:
- "is the model that fits classical test theory" sounds strange to me. Perhaps "is the IRT model that corresponds to the quantities of interest in classical test theory" or something like that. fixed
- What does the tilde over the | in Theta \tilde{|} X signify? removed
- I'd type set the mu and sigma as \texttt{mu} and \texttt{sigma} so this corresponds to the type setting in the code chunk. fixed
Page 17:
- There is a "have have" in the second paragraph of 3.5 fixed
- I'd also refer to page 7 when the individual differences test is mentioned fixed
- "use an item property to specify" does item property refer to code here? If so, I'd include an example code.
- "should note take longer" -> "should not take longer" fixed
- In section 4, the extensive example would benefit from more structure. I'd structure the analysis as in the steps described, with paragraphs/subsections. E.g.,
4.1 Downloading and Parsing
Analyzing more recent waves...
4.2. Create a Dexter database
Now start a new dexter project ...
etc.
partially fixed, I tried at least to alternate text with code and add more comments
Page 18:
- Why are the code chunks suddenly grey blocks? Does this signify anything? I think it should be consistent with all the other chunks. They are all supposed to be light gray unless I forgot something
- As mentioned before, a bit more verbose descriptions of what the code is doing would be great. Added a few but time is running out... five days, and today is the sixth!
Page 19:
- "the correspondence is more than decent" I feel like there is a reference to Figure 7 missing somewhere. Same for Figure 8. Figures 7 and 8 are not referred to in the main text. references added
Page 20:
- I'd make a newline and indent before Zwitser et al. added
Page 21:
The article ends abruptly. I'd recommend to write a discussion section. Added

 

Reviewer 2 Report

This is an interesting manuscript that provides an overview on the R package dexter, which aims at modeling data from large-scale assessments in an item response theory framework. The paper consists of two main parts, which also seem to have different authors:

1. In the first main part, which encompasses sections 1 and 2, the authors provide a brief overview of the package and some of its core ideas and functions. This part is rather easy to read.

2. In the second main part, which consists of section 3, the focus shifts to the underlying psychometric theory, which leads to a theoretical discussion of estimation methods and goodness-of-fit tests. This part is more technical and much more difficult to read than the first part.

 

The paper concludes with another application example. I think that the proposed package is useful for its stated purpose, and the authors illustrate this software well. In this review, I want to provide suggestions for improving this paper from the perspective of a potential user in the field of educational assessments.

 

Major points:

- As the authors mention themselves, there are several R packages with a similar functionality, e.g. eRm, mirt or TAM. Please state explicitly in the introduction what this package can or cannot do in comparison to these other packages, so that users can decide which package they should use. Of course, you can (and should!) also highlight advantages of your package in this overview. From what I see, its relative advantages are: a) It provides fast estimation algorithms and b) it provides specific item fit statistics that can be seen as complements (or maybe even replacements) for classical approaches, such as the DIF plot and the distractor plot. On the other hand, there are some practical restrictions (which might be related to technical characteristics of the used estimation methods), including: a) The package supports only specific IRT models, such as the Rasch model and the PCM (as stated on p. 5), which might or might not show a good fit to empirical data. b) Missing data are assumed to be missing by design, meaning that every respondent answered all items she or he was given (as stated on p. 3). To my understanding, dexter is also not able to handle data from an adaptive design, such as computerized adaptive testing or multistage testing, although the authors mention another package named dexterMST that seems to be designed for this type of data.

- Some functions of dexter are only mentioned, but it is not really explained what they do. Some examples include: "dexter has functions to support both paper-based and computerized standard setting." (p. 8) or "There are many more features, often of an innovative nature, that we cannot explain in detail." (p. 17). I understand that not every function can be demonstrated or explained in detail, but I suggest to include references or links where these details can be looked up. (Indeed, the authors did already include links in their manuscript.)

- Please explain what the DIF plots (Figures 3 and 4) are showing. To my understanding, the colors indicate how much estimates for item parameter differences for item pairs differ between female and male respondents, but this is not stated in the text.  

- Is it possible to include confidence intervals (or another measure of uncertainty) in the profile plot? Based on Figure 5, I do not see how the relevance of the difference between the lines can be assessed. To me, both lines seem very close, in particular for large scores of "Do", so I am not sure if these data show a meaningful gender effect.

- In the explanation of the ENORM model via equation (1), I am not sure how to get from the second line to the third, or from the third to the fourth in this equation. Since this model is somewhat central to this package, please elaborate on these steps or refer to previous work where these steps are explained. In particular, I do not understand what X in the third line stands for, and how the lambda and gamma functions in the fourth line are defined. I see, of course, that these functions are discussed later, but I do not see how the equations in (1) follow from this discussion.

- In a similar way, the derivations underlying section 3.2 are not clear to me. For instance, it is not clear to me how equation (3) follows from the equation above it (i.e., the definition of pi) and equation (2), or how equation (5) is used to get to equation (6). I do not think that these technical details are necessary to understand dexter, unless these details were not covered in previous work of the authors. I suggest to skip them and to insert references to previous work here, if possible. 

- Section 4 is a nice illustration, but maybe the code could be embedded in a general workflow that illustrates how dexter can be used for evaluating educational assessments, which could again relate to its strengths and practical limitations. In other words, please outline typical steps when working with this package, such as a) creating a project, b) estimating the model parameters, c) checking the model fit, including DIF. You could use the PISA data as a practical use case, if you like.
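To make the requested outline concrete, a brief end-to-end sketch under dexter's documented interface, using the package's example data rather than PISA (the steps, not the specifics, are the point):

    library(dexter)
    db <- start_new_project(verbAggrRules, ":memory:",           # a) create a project
                            person_properties = list(gender = "unknown"))
    add_booklet(db, verbAggrData, "agg")                          #    and add response data
    parms <- fit_enorm(db)                                        # b) calibrate (CML)
    pv    <- plausible_values(db, parms)                          #    person estimates
    fit   <- fit_inter(db)                                        # c) item fit against the interaction model
    dif   <- DIF(db, person_property = "gender")                  #    and a DIF check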

- In its current form, it is not really made clear what the code in section 4 is doing, which would be important for readers that are new to this package (which are, I feel, the target audience of this paper). Some of the R commands, such as |>, are not typical R code and need to be explained. Further, the code should lead to the reported output, i.e. Figures 7 and 8, which is currently not the case. To address these points, I suggest the following changes to the code: a) Include R comments that explain the code; this section is on a much higher technical level than the previous sections with R code. b) Expand it so that it includes model fit, and graphical output such as Figures 7 or 8. Maybe it is even possible to put the code for the PISA application in an external file (e.g., an R file in a data repository), where it can be curated or updated, and to refer to it by a link in the manuscript.

 

Minor points:

- Two closing brackets are missing in the code in section 4: The first at the end of the line where the db object is defined, and the second at the end of the line where url4 is defined.

- Please explain what you mean by "satellite packages". This is used in the abstract and in the main text.

- At the end of p. 2, it is stated that "all graphs are arranged in carousels", this is unclear and should be rephrased.

Author Response

I disagree with the review in one point: from version 4.1, the pipe operator |> is part of R base, and the preferred way to make an omelette in R is now eggs |> break() |> stir() |> fry(). Of course,  fry(stir(break(eggs))) is still valid.
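In base R terms (a plain, non-dexter illustration):

    # the two lines below are the same call; |> just feeds its left-hand side
    # into the first argument of the function on its right (R >= 4.1)
    c(3, 1, 2) |> sort() |> rev()
    rev(sort(c(3, 1, 2)))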

Everything else you said has been digested, accepted, and implemented as far as the time allowance of five days allowed. You will notice that more than half of the paper has been rewritten to profit from two extremely detailed and helpful reviews. We all thank you very much.

Major points:

- As the authors mention themselves, there are several R packages with a similar functionality, e.g. eRm, mirt or TAM. Please state explicitly in the introduction what this package can or cannot do in comparison to these other packages ... we have mentioned TAM and mirt, and I think there is a mention of eRm somewhere as we try to explain our different use of "extended". We have expanded the discussion of our choices and added some references 

- Some functions of dexter are only mentioned, but it is not really explained what they do. Some examples include: "dexter has functions to support both paper-based and computerized standard setting." (p. 8) or "There are many more features, often of an innovative nature, that we cannot explain in detail." (p. 17). I understand that not every function can be demonstrated or explained in detail, but I suggest to include references or links where these details can be looked up. (Indeed, the authors did already include links in their manuscript.) We have restructured this as a list while simultaneously adding a bit of material and cross-references. No time and space for more. 

- Please explain what the DIF plots (Figures 3 and 4) are showing. To my understanding, the colors indicate how much estimates for item parameter differences for item pairs differ between female and male respondents, but this is not stated in the text.  The plots have been remade and a lot of explanation added.

- Is it possible to include confidence intervals (or another measure of uncertainty) in the profile plot? Based on Figure 5, I do not see how the relevance of the difference between the lines can be assessed. To me, both lines seem very close, in particular for large scores of "Do", so I am not sure if these data show a meaningful gender effect. No, unfortunately not at this stage. Verhelst's own software has standard errors for profiles. On the plots, the step-like lines overlap or cross when there is no significant difference (the vast majority of plots from cognitive tests). Gaps like the one in the paper correspond to huge effects.

- In the explanation of the ENORM model via equation (1), I am not sure how to get from the second line to the third, or from the third to the fourth in this equation. Since this model is somewhat central to this package, please elaborate on these steps or refer to previous work where these steps are explained. In particular, I do not understand what X in the third line stands for, and how the lambda and gamma functions in the fourth line are. I see, of course, that these functions are discussed later, but I do not see how the equations in (1) follow from this discussion. This and the next ones: I let the two very busy colleagues who wrote the theoretical papers add as much as they can, given the short time; if anything is still not clear, the reader will have to look at the papers (references are provided). 

- In a similar way, the derivations underlying section 3.2 are not clear to me. For instance, it is not clear to me how equation (3) follows from the equation above it (i.e., the definition of pi) and equation (2), or how equation (5) is used to get to equation (6). I do not think that these technical details are necessary to understand dexter, unless these details were not covered in previous work of the authors. I suggest to skip them and to insert references to previous work here, if possible. 

- Section 4 is a nice illustration, but maybe the code could be embedded in a general workflow that illustrates how dexter can be used for evaluating educational assessments, which could again relate to its strengths and practical limitations. In other words, please outline typical steps when working with this package, such as a) creating a project, b) estimating the model parameters, c) checking the model fit, including DIF. You could use the PISA data as a practical use case, if you like. This has been partially fixed in the little time left from rewriting sections 1 and 2. The code and the explanations have been interspersed, and some comments added. Where the complete workflow could not be demonstrated (for example, the otherwise vitally important check of the items), an explanation has been added.

- In its current form, it is not really made clear what the code in section 4, is doing, which would be important for readers that are new to this package (which are, I feel, the target audience of this paper). Some of the R commands, such as |>, are not typical R code and need to be explained. Further, the code should lead to the reported output, i.e. Figures 7 and 8, which is currently not the case. References to the two figures were missing, added now. To address these points, I suggest the following changes to the code: a) Include R comments that explain the code; this section is on a much higher technical level than the previous sections with R code. b) Expand it so that it includes model fit, and graphical output such as Figures 7 or 8. Maybe it is even possible to put the code for the PISA application in an external file (e.g., an R file in a data repository), where it can be curated or updated, and to refer to it by a link in the manuscript. You are right in principle, but I think the persons who might want to download the cognitive data of PISA and analyze it themselves are probably few in number and advanced in R.

 

Minor points:

- Two closing brackets are missing in the code in section 4: The first at the end of the line where the db object is defined, and the second at the end of the line where url4 is defined. I take it as an enormous compliment that a reviewer has read so carefully. Fixed now.

- Please explain what you mean by "satellite packages". This is used in the abstract and in the main text. Changed to companion

- At the end of p. 2, it is stated that "all graphs are arranged in carousels", this is unclear and should be rephrased. Explained

(And no, the separate sections are not by different authors: they are in different languages and for different readers)

Once again, thank you very much for the useful and stimulating comments.

 
