Using Psychometric Testing Procedures for Scale Validity, Reliability, and Invariance Analysis: The PRETIE-Q Portuguese Version
Round 1
Reviewer 1 Report (Previous Reviewer 1)
The authors have revised their manuscript per the last revision; however, significant problems remain.
First, the authors attempted to conduct the requested two-tier two-correlated CFA model and two-tier two-correlated ESEM, but their syntax is incorrect. Specifically, the statement: WITH Neg@0 Pos@0; in both models is incomplete. It should be:
TOL POS WITH Neg@0 Pos@0;
Second, even though the syntax is incorrect, the authors disregarded these two-tier models as fitting the data better than the two-correlated factor ESEM and still reported and interpreted results from the two-correlated factor ESEM.
Third, the authors do acknowledge the value of the two-tier models in the data analysis section, but the language they used is nearly identical to my comment to the authors. The authors should have used my language to update their data analytic framework and not just copied and pasted it into the text.
Also, once the authors correctly specify and interpret the winning two-tier model, likely the ESEM version, they will need to update Tables 1 and 2 and the corresponding text.
Additionally, the MG ESEM results and findings in Table 3 and the text will need to be done based on the two-tier ESEM. The same will be true when conducting the SEM to examine correlations among the other latent variables.
Finally, the discussion section will need to be updated.
The writing can be understood, but the quality of the English language could be improved.
Author Response
Reviewer 1
The authors have revised their manuscript per the last revision; however, significant problems remain.
First, the authors attempted to conduct the requested two-tier two-correlated CFA model and two-tier two-correlated ESEM, but their syntax is incorrect. Specifically, the statement: WITH Neg@0 Pos@0; in both models is incomplete. It should be:
TOL POS WITH Neg@0 Pos@0;
Second, even though the syntax is incorrect, the authors disregarded these two-tier models as fitting the data better than the two-correlated factor ESEM and still reported and interpreted results from the two-correlated factor ESEM.
Third, the authors do acknowledge the value of the two-tier models in the data analysis section, but the language they used is nearly identical to my comment to the authors. The authors should have used my language to update their data analytic framework and not just copied and pasted it into the text.
Also, once the authors correctly specify and interpret the winning two-tier model, likely the ESEM version, they will need to update Tables 1 and 2 and the corresponding text.
Additionally, the MG ESEM results and findings in Table 3 and the text will need to be done based on the two-tier ESEM. The same will be true when conducting the SEM to examine correlations among the other latent variables.
Finally, the discussion section will need to be updated.
Response: Thank you for reviewing our manuscript in this round and the previous ones (ejihpe-2213456). We have carefully incorporated your feedback and suggestions in the past, and we appreciate the opportunity to address your concerns point by point as best we could. We used the track changes option in Microsoft Word to clearly indicate the revisions made throughout the manuscript in previous revisions and provided revisions for the other reviewers in this submission. After carefully considering your feedback and suggestions, we have made significant revisions to address your concerns. However, we have decided to remove the two-tier CFA and ESEM model analysis from our manuscript for several reasons, which we would like to explain in detail.
Firstly, we have considered the objective of our study and concluded that the inclusion of the two-tier CFA analysis does not contribute significantly to our research goals. Although there are reverse-coded items in the PRETIE-Q questionnaire (see Ekkekakis et al., 2008; Hall et al., 2014), we have followed the recommended procedures by the creators of the questionnaire (whom we contacted for clarification) and reverse-coded the items accordingly. Therefore, conducting the proposed analysis would not yield substantial additional insights since we have already addressed the reverse coding appropriately.
Secondly, the two-tier model analysis would be more appropriate if our study indicated the presence of several lower-order factors within the two-correlated factor model of the CFA. However, existing studies (Smirmaul et al., 2015; Wang et al., 2022), including our own findings, have consistently shown that the PRETIE-Q questionnaire does not exhibit higher- and lower-order factors. Our results align with this understanding, and thus, incorporating a two-tier model analysis would not provide meaningful or relevant results in the context of our study.
Lastly, we would like to address your comment regarding a perceived tendency to include unnecessary or excessive analyses. We want to assure you that we have taken great care to ensure the thoroughness and rigor of our research. We have already conducted complex analyses, provided detailed syntax, and thoroughly described the psychometrics of the questionnaire in the introduction section. However, it appears that these efforts have not satisfied your expectations.
We want to reiterate our commitment to producing a high-quality research manuscript that adheres to the objectives and focus of our study. While we value constructive criticism and appreciate the opportunity to enhance our work, we believe it is equally important to carefully evaluate the relevance and appropriateness of proposed analyses.
We hope you understand our rationale for removing the two-tier CFA and ESEM model analysis from the manuscript. We have made the necessary revisions to address your other suggestions, and we believe the current version of the manuscript reflects the most accurate and relevant findings in accordance with our research objectives. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your expertise and insights have undoubtedly contributed to its improvement.
Reviewer 2 Report (New Reviewer)
Dear Authors,
Please review the document.
Thanks.
Comments for author File: Comments.pdf
Author Response
Dear Authors, It is a work that makes a contribution to the intensity and tolerance of physical exercise evaluated through a questionnaire that can be useful for physical activity and sports professionals, especially to support evidence to promote adherence to the practice of physical activity by people. The latter as a projection of the investigation. For its part, it is a study that is transferable to the investigative and occupational field, especially for its statistical analysis and practical utility, respectively.
Response: We appreciate the time and effort dedicated to the review process. To facilitate the identification of changes, we have utilized the track change option in MS Word, which highlights all the modifications made throughout the document.
Some considerations.
 Verify and/or change keywords in: https://meshb.nlm.nih.gov/
Response: The keywords have been updated and revised (see line 34).
 Does the study have strengths? (Limitations and directions for further research section).
Response: In the conclusion section of our study, we have highlighted the strengths of our research. These strengths have been revised to ensure clarity and provide a comprehensive understanding of the robust aspects of our study (see lines 595-608).
Reviewer 3 Report (New Reviewer)
Dear Authors,
Thanks!
Please:
Abstract:
Explicitly state the conclusions of your research.
Introduction:
I would strongly advise the authors of this paper to rewrite their introduction to produce a more contextualised introduction toward a clear purpose.
Moreover:
Line 240: Please, insert aim and objectives of your research.
Discussion
Please, insert practical implications
References
The references must follow the guidelines.
Thank you for considering my suggestions, and I look forward to seeing your revised manuscript.
Sincerely,
Referee

Author Response
Abstract:
Explicitly state the conclusions of your research.
Response: Revised (lines 27-33).
Introduction:
I would strongly advise the authors of this paper to rewrite their introduction to produce a more contextualized introduction toward a clear purpose.
Response: The purpose of our study, as outlined in the provided text, aimed to address several important gaps in the literature regarding the psychometric properties and contextual validity of the PRETIE-Q Portuguese version in the domain of exercise. We appreciate the reviewer's concern about the clarity and contextualization of the introduction, and we acknowledge that the provided text clearly communicates the objectives and rationale of our research.
The introductory section of our study explicitly states the research goals, which include examining the factor structure of the PRETIE-Q in the context of exercise using various analytical methodologies, evaluating measurement invariance across different exercise modalities and exercise experience, and investigating the correlational validity with related constructs such as enjoyment, exercise intentions, and exercise frequency. These objectives are aligned with existing literature and address the need for further validation and contextual application of the PRETIE-Q instrument.
Furthermore, we acknowledge the previous validation study by Teixeira et al. [21] and their contributions to establishing the preliminary validity of the PRETIE-Q Portuguese version. However, our study expands upon their work by examining the scale's performance in different exercise types and considering exercise experience, which are important factors for evaluating the instrument's applicability in diverse contexts. We also highlight the necessity of cross-cultural validation and the importance of developing questionnaires tailored to specific scenarios, as advocated by Cid et al. [26] and Ekkekakis [27].
Moreover:
Line 240: Please, insert aim and objectives of your research.
Response: We would like to bring to the reviewer's attention that the aim and objectives of our research are clearly outlined in lines 204 to 220 of the revised manuscript. In this revised paragraph, we have made efforts to enhance the clarity and comprehensiveness of the introduction, ensuring that the purpose and objectives of the study are explicitly stated. We believe that the revisions provide a more contextualized introduction that aligns with the reviewer's recommendation (see lines 205-231).
Discussion
Please, insert practical implications
Response: We have taken into account your suggestion and have added a section on practical implications to the revised manuscript. This new section highlights the practical applications of our findings for researchers and exercise physiologists (see lines 620-631).
References
The references must follow the guidelines.
Response: The references in the manuscript have been meticulously revised to adhere to the guidelines specified by the journal. We would like to inform you that the minor revisions required for the citation style also will be taken care of by the editing team of the journal.
Thank you for considering my suggestions, and I look forward to seeing your revised manuscript.
Response: We appreciate the time and effort dedicated to the review process. To facilitate the identification of changes, we have utilized the track change option in MS Word, which highlights all the modifications made throughout the document.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Abstract
1. Rephrase opening sentence to: This study investigated the psychometric nature of preference for and tolerance of exercise intensity in physical activity
2. ESEM is not defined prior to its first use. Relatedly, remove ‘model’ after the phrase ‘ESEM model’ as this is redundant.
3. I appreciate the authors' reminder to researchers about the importance of employing “context verified measures to evaluate preference for and tolerance of exercise-intensity, based on the characteristics of the researched sample or in real-time context.” This is an often overlooked assumption where researchers assume the behavior of said instrument will apply across all contexts and samples.
Introduction
1. Line 40: change ‘components’ to ‘factors’
2. Lines 42-43: The authors state “Furthermore, it lacks evidence of construct validity, which is a measure of how effectively a test or questionnaire assesses what it is supposed to measure.” This definition is a bit dated as the new standards for psychological and educational testing have moved beyond this definition (see ch. 1 on validity, a free download is found at: https://www.testingstandards.net/openaccessfiles.html). The newer language would refer to this as structural validity.
3. Please update your definition around construct validity through the manuscript according to the new standards. Also, be sure to provide citations for such definitions.
4. Line 49: change ‘factor analyses,’ to ‘factor analytics techniques’
5. Lines 60-61: Rephrase this sentence to say: ‘CFA has been the go-to technique for assessing factor structures when it comes to scale development, refinement, and validation’
6. Line 62: change ‘compare correlations between components’ to ‘assess the relationship between latent factors’
7. At the end of Line 74 cite [1,5].
8. Line 87: Explain what is meant by ‘measurement and structural coefficients.’ Structural coefficients refer to factor pattern loadings and measurement coefficients refer to correlations among factors? Vice versa or do you mean something else? Please clarify.
9. Line 94: replace ‘interactions’ with ‘interrelationships’
10. Line 95: drop ‘model’ after ‘SEM model’ as it is redundant. Also, this is the first use of SEM
11. Remove the sentence ‘In this regard, ESEM may be superior to EFA, CFA, and SEM as separate statistical approaches.’ This sentence is not entirely correct because ESEM is just an aspect of SEM, in general.
12. Replace all use of the word ‘components’ with ‘factors’. You are not doing a PCA or formative model, you are doing reflective modeling.
13. Lines 98-99: Please rephrase to keep the focus on bifactor modeling in the SEM or CFA framework. Keep in mind that bifactor modeling also exists in the IRT framework (see e.g., Toland MD, Sulis I, Giambona F, Porcu M, Campbell JM. Introduction to bifactor polytomous item response theory analysis. J Sch Psychol. 2017 Feb;60:41-63. doi: 10.1016/j.jsp.2016.11.001. Epub 2016 Dec 29. PMID: 28164798)
14. See lines 103-106. Please update your definition of what a bifactor model does. You're missing the element about how the general and specific factors are orthogonal to one another.
15. Lines 110-111: change the end phrase to read ‘two or more related but independent factors.’ Bifactor modeling can of course be used when there are two or more correlated factors and is not limited to two factor models.
16. Line 113: SEM is already defined, so just use the abbreviation and don’t spell it out
17. Line 116: the ability for ESEM to model measurement error is a common feature of any SEM. Also, in line 117, what do you mean by ‘is a common issue in traditional bifactor models’? What do you mean by ‘traditional bifactor models’? Please clarify.
18. Line 117: change particular to specific
19. Lines 118-119, 126: Please rephrase your definitions or use of the phrases for ‘measurement residuals’. Please see https://www.analysisinn.com/post/measurementversusresidualerrorterms/ Stick with referring to the noise in any given item/variable as measurement error or unique item error (unreliability) or measurement noise within each observed variable/item. Otherwise, your definition conflates measurement error with residual error or disturbance.
20. Line 152: I believe the authors meant to say ‘With eight items per dimension (preference and tolerance), respondents use a 5-point Likert-type scale ranging from …
21. Lines 156-157: rephrase to ‘These items are reverse scored so that higher responses now reflect …’
22. Line 166: Please clarify if the Portuguese version includes all positive or some negatively phrased items.
23. Lines 197-200: In general, when two factors are highly correlated, it will not be a surprise to find the bifactor model fits better than a two factor CFA or two factor ECFA. The same is true when fitting an ESEM for any type of model (ESEMs will always fit better than constrained CFAs that do not allow for cross loadings). So, why do the bifactor ESEM model in the first place? The reason to use ESEM (bifactor or otherwise) is to be able to handle minor cross loadings for complex items: an item relevant to measuring a psychologically relevant construct is likely to end up reflecting several constructs; pure items (without cross loadings) don't exist. I don't know that I've ever seen a bifactor ESEM with only two specific factors before because with only two constructs of interest (well, plus the general) cross loadings should probably be considered as misfit and the item dropped. However, line 197 … of your paper does give a valid reason to use a bifactor model, but not a reason to use an ESEM. I think some people just view ESEM as the thing to do because it is new and shiny. What's the purpose of fitting the model in the first place? Just wanting to do it is not enough of a reason.
24. Given that previous work was based on showing two dimensions, why not focus on the negatively and positively oriented items as specific nuisance factors? Lines 197-200: Given that the PRETIE-Q consists of two factors each made up of half positively oriented and half negatively oriented items, why didn’t the authors consider other models that allowed for two correlated general factors (preference and tolerance) with two specific factors (positively oriented and negatively oriented items)? See Cai, L. A Two-Tier Full-Information Item Factor Analysis Model with Applications. Psychometrika 75, 581–612 (2010). https://doi.org/10.1007/s1133601091780 You could also consider a general factor with four specific factors, but doing so would conflate the orientation of the items and the content of each substantive dimension. This technique is easily extended to the ESEM realm. This two-tier model seems like a much more sensible extension of the literature to date than the bifactor ESEM you proposed.
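For illustration only, the two-tier structure described in comment 24 can be sketched numerically. This is a minimal sketch under stated assumptions, not the reviewer's or the authors' model: the eight toy items, every loading, and the 0.6 preference-tolerance correlation are hypothetical, and the helper names (`factor_covariance`, `implied_covariance`) are invented for this example. It only shows how two correlated substantive factors plus two orthogonal wording factors combine into a model-implied item covariance matrix.

```python
# Toy two-tier structure; every number below is hypothetical.
FACTORS = ["PREF", "TOL", "POS", "NEG"]

# Rows are items; columns are loadings on PREF, TOL, POS, NEG.
LOADINGS = [
    [0.7, 0.0, 0.3, 0.0],  # preference, positively worded
    [0.6, 0.0, 0.3, 0.0],  # preference, positively worded
    [0.7, 0.0, 0.0, 0.4],  # preference, negatively worded
    [0.6, 0.0, 0.0, 0.4],  # preference, negatively worded
    [0.0, 0.7, 0.3, 0.0],  # tolerance, positively worded
    [0.0, 0.6, 0.3, 0.0],  # tolerance, positively worded
    [0.0, 0.7, 0.0, 0.4],  # tolerance, negatively worded
    [0.0, 0.6, 0.0, 0.4],  # tolerance, negatively worded
]


def factor_covariance(pref_tol_corr):
    """Unit-variance factors; only PREF and TOL are allowed to covary."""
    size = len(FACTORS)
    phi = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    phi[0][1] = phi[1][0] = pref_tol_corr
    return phi


def implied_covariance(loadings, phi):
    """Model-implied item covariance: Lambda * Phi * Lambda' + Theta."""
    n_items, n_factors = len(loadings), len(phi)
    sigma = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            sigma[i][j] = sum(
                loadings[i][k] * phi[k][l] * loadings[j][l]
                for k in range(n_factors)
                for l in range(n_factors)
            )
    # Theta: unique variances chosen so that every item has total variance 1.
    for i in range(n_items):
        sigma[i][i] = 1.0
    return sigma


SIGMA = implied_covariance(LOADINGS, factor_covariance(0.6))
```

In this toy setup, a preference item and a tolerance item that share the same wording direction have a larger implied covariance than a pair with opposite wording, which is the pattern the wording factors are meant to absorb. The zeros in the factor covariance matrix play the role of the orthogonality constraints between substantive and wording factors.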
Methods
1. I appreciate that the authors attempted an a priori sampling calculation, but this calculator assumes you have a rationale or previous evidence to support the input values. Unfortunately, there are no supporting resources for how you chose the input values. Also, this calculator does not take into consideration missing data, assumes the variables are each being treated as linear or continuous, which your 5point Likerttype items are not, assumes MVN is tenable, which it may not be. More justification and supporting documentation is needed.
2. The Statistical analysis section clearly treats the item responses as linear or continuous given that MLR is invoked in Mplus. Moreover the authors treat the missing data as missing at random which is what is assumed by MLR. How do the authors know MAR is a tenable assumption for the missingness? Also, is the assumption of treating the item response data as linear or continuous tenable given the ordinal nature of the item responses? Please justify. Or, instead, use a technique that does not make this assumption and allows you to account for the missing data and ordered nature of the data. See BLIMP 2.0 by Enders and colleagues.
3. Lines 279-280: You indicate that various components were estimated freely using oblique rotations. In a bifactor model, the general and specific factors are orthogonal to one another, but here you allow them to correlate. Please explain.
4. In regard to model fit indices, please update your fit criteria to also include more modern approaches. See Peugh and Feldon (2020) https://doi.org/10.1187/cbe.20010016 What you have is okay, but add the equivalency testing strategy as well.
5. Lines 293-294: The interpretation of a standardized factor loading of .50 means that 25% of the variance in the observed item/indicator can be explained by the latent factor, controlling for all other latent factors or covariates in the model. Please revise.
6. Coefficient alpha is an outdated method for estimating reliability. Please see
Teo, T., Fan, X. Coefficient Alpha and Beyond: Issues and Alternatives for Educational Research. Asia-Pacific Edu Res 22, 209–213 (2013). https://doi.org/10.1007/s402990130075z
Flora, David B. "Your coefficient alpha is probably wrong, but which coefficient omega is right? A tutorial on using R to obtain better reliability estimates." Advances in Methods and Practices in Psychological Science 3.4 (2020): 484-501.
7. Please use Dueber, D. M. "Package ‘BifactorIndicesCalculator’." (2020) for calculating your bifactor indices values.
8. Lines 332-335: Please clarify if your correlational analysis was done using all latent variables or observed variables? Given that all of your variables, except one, consisted of 3 or more items, you should be using latent variables in this analytic model to account for measurement error during the analysis. Please confirm or rerun your analysis as such.
9. When examining Table 1, the model names make sense for each model except for the last four models. What exactly do the last four models represent? Make sure DNC is defined in your note. DNC = did not converge. Note, this nonconvergence makes sense because the model is trying to estimate too many coefficients under a high dimensional model situation. You will likely get the models to converge if you make all dimensions orthogonal as is traditionally done with bifactor models. Please see my earlier comment about using a two-tier modeling approach minus the cross loadings. You will likely find that the two general factors model with specific nuisance factors due to item phrasing will represent your data better and be more meaningful.
10. Go back and look at your results again. Your bifactor model did not converge so how can you say it did not have acceptable fit to the data?
11. Check Table 2 for spelling errors and typos
12. Table 2 also shows that Item 6 had the largest cross loading. This item likely needs to be further analyzed as it seems to be tapping both preference and tolerance, although more strongly tapping tolerance in this sample.
13. When you computed reliability in Table 2, how was it computed considering the cross loadings of all items? Please clarify if you only used the loadings on the intended factors and provide an appropriate citation to justify your approach.
14. Lines 375-376: It is no surprise that the ESEM fit the data better given that more parameters are being estimated.
15. Elaborate on where the measurement model fit each group well independently.
16. You can never confirm a hypothesis in a frequentist framework. That is like saying, the null hypothesis is true. It would be better to say that the measurement invariance results show that the multigroup analyses provide evidence that strict invariance is tenable across both exercise type and exercise experiences groupings (see Table 3).
17. Please provide the p value for your chi-square (with associated df) in the results section.
18. Why did the authors run so many independent SEMs? Each model is suggesting that variables across models are not related, which is not necessarily true. It would be much wiser if the authors constructed one single model with all DVs and predictors and then interpreted the model as correlational validity evidence.
19. Line 403: What do the authors mean by “hedonic” assumptions?
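As a worked illustration of comments 5 and 6 above, the arithmetic behind a standardized loading and a model-based reliability estimate can be sketched as follows. This is a minimal sketch under stated assumptions: the loadings are hypothetical and not taken from Table 2, the function names are invented for this example, and the omega formula shown applies to a one-factor congeneric model with uncorrelated unique errors, so it does not directly cover an ESEM solution with cross loadings.

```python
def explained_variance(std_loading):
    """Share of item variance explained by the factor (squared standardized loading)."""
    return std_loading ** 2


def coefficient_omega(std_loadings):
    """Omega for a one-factor congeneric model with uncorrelated unique errors."""
    common = sum(std_loadings) ** 2
    unique = sum(1.0 - loading ** 2 for loading in std_loadings)
    return common / (common + unique)


# Illustrative values only, not estimates from the manuscript.
HYPOTHETICAL_LOADINGS = [0.5, 0.6, 0.7, 0.8]

print(explained_variance(0.5))  # a loading of .50 corresponds to 25% explained variance
print(round(coefficient_omega(HYPOTHETICAL_LOADINGS), 3))
```

Unlike coefficient alpha, this estimate does not assume equal loadings across items, which is one reason the cited tutorials recommend omega-type coefficients.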
Discussion:
1. In general, the results suggest that the two factors of preference and tolerance are highly correlated in an ESEM framework, with item 6 showing some potential misfit due to its relatively substantial cross loading on preference, which is not its intended dimension. What reasons might the authors give for this misfit, and what are the next steps? Please update lines 427-430 accordingly as your current discussion does not consider this issue/flaw in the instrument.
Overall
1. Overall, the authors still need to address the larger issue I noted earlier. What is the justification for doing ESEM with only two factors?
2. The authors are advised to rerun their bifactor model as noted in my earlier comment regarding making the factors all orthogonal and then they will fit.
3. Also, the authors truly need to look at the two-tier model I suggested earlier to address the two correlated factors of substantive interest with two orthogonal factors, positively oriented items and negatively oriented items.
Author Response
Reviewer 1
The Abstract
 Rephrase opening sentence to: This study investigated the psychometric nature of preference for and tolerance of exercise intensity in physical activity
Response: Amendments done.
 ESEM is not defined prior to its first use. Relatedly, remove ‘model’ after the phrase ‘ESEM model’ as this is redundant.
Response: Amendments done.
 I appreciate the authors' reminder to researchers about the importance of employing “context verified measures to evaluate preference for and tolerance of exercise-intensity, based on the characteristics of the researched sample or in real-time context.” This is an often-overlooked assumption where researchers assume the behavior of said instrument will apply across all contexts and samples.
Response: Thank you for your positive feedback.
Introduction
 Line 40: change ‘components’ to ‘factors’
Response: Amendments done.
 Lines 42-43: The authors state “Furthermore, it lacks evidence of construct validity, which is a measure of how effectively a test or questionnaire assesses what it is supposed to measure.” This definition is a bit dated as the new standards for psychological and educational testing have moved beyond this definition (see ch. 1 on validity, a free download is found at: https://www.testingstandards.net/openaccessfiles.html). The newer language would refer to this as structural validity.
Response: Construct validity was revised to structural validity.
 Please update your definition around construct validity through the manuscript according to the new standards. Also, be sure to provide citations for such definitions.
Response: Done.
 Line 49: change ‘factor analyses,’ to ‘factor analytics techniques’
Response: Amendments done.
 Lines 60-61: Rephrase this sentence to say: ‘CFA has been the go-to technique for assessing factor structures when it comes to scale development, refinement, and validation’
Response: Amendments done.
 Line 62: change ‘compare correlations between components’ to ‘assess the relationship between latent factors’
Response: Amendments done.
 At the end of Line 74 cite [1,5].
Response: Amendments done.
 Line 87: Explain what is meant by ‘measurement and structural coefficients.’ Structural coefficients refer to factor pattern loadings and measurement coefficients refer to correlations among factors? Vice versa or do you mean something else. Please clarify.
Response: Measurement coefficients are factor loadings and structural coefficients are the covariances. Clarification added in the manuscript (lines 89-95).
 Line 94: replace ‘interactions’ with ‘interrelationships’
Response: Amendments done.
 Line 95: drop ‘model’ after ‘SEM model’ as it is redundant. Also, this is the first use of SEM
Response: Amendments done.
 Remove the sentence ‘In this regard, ESEM may be superior to EFA, CFA, and SEM as separate statistical approaches.’ This sentence is not entirely correct because ESEM is just an aspect of SEM, in general.
Response: Sentence was revised “ESEM may be an alternative to EFA, CFA, and SEM as separate statistical techniques in this regard.”
 Replace all use of the word ‘components’ with ‘factors’. You are not doing a PCA or formative model, you are doing reflective modeling.
Response: We agree. Revisions made.
 Lines 98-99: Please rephrase to keep the focus on bifactor modeling in the SEM or CFA framework. Keep in mind that bifactor modeling also exists in the IRT framework (see e.g., Toland MD, Sulis I, Giambona F, Porcu M, Campbell JM. Introduction to bifactor polytomous item response theory analysis. J Sch Psychol. 2017 Feb;60:41-63. doi: 10.1016/j.jsp.2016.11.001. Epub 2016 Dec 29. PMID: 28164798)
Response: Revisions made (lines 106-110).
 See lines 103-106. Please update your definition of what a bifactor model does. You're missing the element about how the general and specific factors are orthogonal to one another.
Response: We agree in part. In bifactor modeling, the general factor and specific factors are typically not assumed to be orthogonal. Rather, they are allowed to correlate with one another, as the general factor is assumed to account for a portion of the variance in each specific factor. The correlation between the general factor and specific factors is often referred to as the "factor correlation" or "bifactor correlation" and is estimated as a parameter in the bifactor model. This correlation reflects the extent to which the general factor and specific factors are related to each other, after accounting for their unique contributions to the observed variables. It is worth noting, however, that there are some situations where researchers may choose to impose orthogonality constraints on the general factor and specific factors in a bifactor model. This is often done when the specific factors are intended to represent measurement error or nuisance factors that are not of substantive interest. In such cases, imposing orthogonality constraints can simplify the model and improve its interpretability. For more see Gegenfurtner (2022), Morin et al. (2016), and Reise et al. (2010). Nevertheless, we tested both assumptions (see our response to your comment 23).
 Lines 110-111: change the end phrase to read ‘two or more related but independent factors.’ Bifactor modeling can of course be used when there are two or more correlated factors and is not limited to two factor models.
Response: Amendments done.
 Line 113: SEM is already defined, so just use the abbreviation, and don’t spell it out
Response: Amendments done.
 Line 116: the ability for ESEM to model measurement error is a common feature of any SEM. Also, in line 117, what do you mean by ‘is a common issue in traditional bifactor models’? What do you mean by ‘traditional bifactor models’? Please clarify.
Response: The sentence was removed as it did not add much to the discussion of the advantages of bifactor ESEM.
 Line 117: change particular to specific
Response: Amendments done.
 Lines 118-119, 126: Please rephrase your definitions or use of the phrases for ‘measurement residuals’. Please see https://www.analysisinn.com/post/measurementversusresidualerrorterms/ Stick with referring to the noise in any given item/variable as measurement error or unique item error (unreliability) or measurement noise within each observed variable/item. Otherwise, your definition conflates measurement error with residual error or disturbance.
Response: Amendments done.
 Line 152: I believe the authors meant to say ‘With eight items per dimension (preference and tolerance), respondents use a 5-point Likert-type scale ranging from …
Response: Amendments done.
 Lines 156-157: rephrase to ‘These items are reverse scored so that higher responses now reflect …’
Response: Amendments done.
 Line 166: Please clarify if the Portuguese version includes all positive or some negatively phrased items.
Response: Amendments done.
 Lines 197-200: In general, when two factors are highly correlated, it will not be a surprise to find the bifactor model fits better than a two factor CFA or two factor ECFA. The same is true when fitting an ESEM for any type of model (ESEMs will always fit better than constrained CFAs that do not allow for cross loadings). So, why do the bifactor ESEM model in the first place? The reason to use ESEM (bifactor or otherwise) is to be able to handle minor cross loadings for complex items: an item relevant to measuring a psychologically relevant construct is likely to end up reflecting several constructs; pure items (without cross loadings) don't exist. I don't know that I've ever seen a bifactor ESEM with only two specific factors before because with only two constructs of interest (well, plus the general) cross loadings should probably be considered as misfit and the item dropped. However, line 197 … of your paper does give a valid reason to use a bifactor model, but not a reason to use an ESEM. I think some people just view ESEM as the thing to do because it is new and shiny. What's the purpose of fitting the model in the first place? Just wanting to do it is not enough of a reason.
Response: Interesting comment. From a psychological perspective, while the traits of preference (i.e., inclination to choose a specific level of exercise intensity) and tolerance (i.e., inclination to continue exercising at an imposed level of intensity even when the activity is unpleasant/uncomfortable) are identified as distinct (Ekkekakis et al., 2005), both share the same principle: exercise intensity. Thus, we intended to examine if the factors inherent in the PRETIE-Q could be seen as the overall result of exercise intensity (i.e., global factor) or as specific factors (i.e., preference vs tolerance). From a statistical perspective, all PRETIE-Q translated versions “suffered” from item reduction adaptions to display acceptable fit. Since previous applications used CFA, authors could not examine if the items shared cross loadings that could be handled using ESEM.
Given that previous work was based on showing two dimensions, why not focus on the negatively and positively oriented items as specific nuisance factors? Lines 197-200: Given that the PRETIEQ consists of two factors, each made up of half positively oriented and half negatively oriented items, why didn’t the authors consider other models that allowed for two correlated general factors (preference and tolerance) with two specific factors (positively oriented items and negatively oriented items)? See Cai, L. A Two-Tier Full-Information Item Factor Analysis Model with Applications. Psychometrika 75, 581–612 (2010). https://doi.org/10.1007/s11336-010-9178-0 You could also consider a general factor with four specific factors, but doing so would conflate the orientation of the items and the content of each substantive dimension. This technique is easily extended to the ESEM realm. This two-tier model seems like a much more sensible extension of the literature to date than the bifactor ESEM you proposed.
Response: We appreciate your interesting point of view. We agree in part with your assumptions. However, we would like to comment that the PRETIEQ is still under-researched. As described, the measure has only been psychometrically tested in American English, Brazilian Portuguese, European Portuguese, and Chinese. The authors of the latter three described their work as preliminary analysis and stated in their discussions that more studies are warranted. As described in the manuscript: “…the Portuguese (ten items; five per construct; eight items represent low preference/tolerance) and the Chinese (8 items; four per construct; four items represent low preference/tolerance) versions presented a distinct final set of items compared to the original version, which may be crucial for the questionnaires quality comprehension. Patterson et al. [23] highlighted this while claiming that the questionnaire would benefit from redesigned and reduced scales, a problem that further psychometric testing could address.” Thus, the PRETIEQ is still in its early stages of refinement, and our study intends to provide further evidence of its applicability, considering the most contemporary psychometric testing analyses. Forthcoming studies could explore the assumption of a possible correlated four-factor model (two correlated general factors with two specific factors; i.e., positively oriented items and negatively oriented items). This suggestion was added to the limitation section.
Methods
I appreciate that the authors attempted an a priori sampling calculation, but this calculator assumes you have a rationale or previous evidence to support the input values. Unfortunately, there are no supporting resources for how you chose the input values. Also, this calculator does not take missing data into consideration, assumes the variables are each being treated as linear or continuous, which your 5-point Likert-type items are not, and assumes MVN is tenable, which it may not be. More justification and supporting documentation are needed.
Response: Previous studies testing the validity of the PRETIEQ have not conducted a priori sampling calculations (see Ekkekakis et al., 2015; Teixeira et al., 2020; Wang et al., 2022). In the study of Smirmaul et al. (2015), the authors stated that “the sample size of 122 participants provides sufficient statistical power to detect a 6.25% variance overlap between two correlated variables (r = 0.25), assuming a two-tailed test of significance, alpha of 5%, and 1-beta of 80%.” We revised the a priori sampling calculations according to the study of Smirmaul et al. (2015), assuming as default the anticipated effect size to be equal to 0.15. Increasing the anticipated effect size would decrease the recommended minimum sample size.
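For readers who want to check the arithmetic in the quoted power statement, the following is a minimal sketch using the Fisher z approximation for a single correlation. It is not the calculator the authors used (which is not identified here), and the function name `n_for_correlation` is an assumption introduced for illustration. Because it is an approximation, it should land within a few participants of the quoted 122 rather than reproduce it exactly.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect correlation r (two-tailed test).

    Uses the Fisher z approximation: N = ((z_alpha/2 + z_power) / atanh(r))^2 + 3.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    effect = math.atanh(r)                         # Fisher z transform of r
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


# r = 0.25 corresponds to r^2 = 6.25% shared variance, as in the quotation.
shared_variance_pct = 0.25 ** 2 * 100
required_n = n_for_correlation(0.25)
print(shared_variance_pct, required_n)
```

Note that this sketch addresses only a bivariate correlation; it says nothing about missing data, ordinal indicators, or the number of parameters in a latent variable model, which are the concerns raised in the comment.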
 The Statistical analysis section clearly treats the item responses as linear or continuous given that MLR is invoked in Mplus. Moreover, the authors treat the missing data as missing at random which is what is assumed by MLR. How do the authors know MAR is a tenable assumption for the missingness? Also, is the assumption of treating the item response data as linear or continuous tenable given the ordinal nature of the item responses? Please justify. Or, instead, use a technique that does not make this assumption and allows you to account for the missing data and ordered nature of the data. See BLIMP 2.0 by Enders and colleagues.
Response: WLSMV is not as efficient as ML (Muthén & Muthén, 2010). ML handles data missing at random, whereas WLSMV cannot, given its pairwise variable orientation.
Lines 279-280: You indicate that various components were estimated freely using oblique rotations. In a bifactor model, the general and specific factors are orthogonal to one another, but here you allow them to correlate. Please explain.
Response: Please, see our response to your comment number 14.
In regard to model fit indices, please update your fit criteria to also include more modern approaches. See Peugh and Feldon (2020) https://doi.org/10.1187/cbe.20-01-0016 What you have is okay, but add the equivalence testing strategy as well.
Response: We thank the reviewer for his/her comment. We think that the analysis made is enough for the purpose of the paper, and that including the equivalency test through another software could be confusing to the readers.
Lines 293-294: A standardized factor loading of .50 means that 25% of the variance in the observed item/indicator can be explained by the latent factor, controlling for all other latent factors or covariates in the model. Please revise.
Response: Amendments done.
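As a brief illustration of the interpretation requested in this comment, the sketch below squares a standardized loading to obtain the proportion of item variance attributable to the factor. The helper names are assumptions made for illustration, and the simple squaring applies cleanly only when the item loads on a single factor (or on mutually orthogonal factors); with cross loadings on correlated factors, the explained variance also involves the factor correlation.

```python
def variance_explained(std_loading):
    """Proportion of an item's variance explained by the factor (squared standardized loading)."""
    return std_loading ** 2


def unique_variance(std_loading):
    """Remaining proportion: item-specific variance plus measurement error."""
    return 1.0 - variance_explained(std_loading)


# Illustrative loadings only; these are not values from Table 2.
for loading in (0.30, 0.50, 0.70):
    print(loading, round(variance_explained(loading), 4), round(unique_variance(loading), 4))
```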
 Coefficient alpha is an outdated method for estimating reliability. Please see
Teo, T., Fan, X. Coefficient Alpha and Beyond: Issues and Alternatives for Educational Research. Asia-Pacific Edu Res 22, 209–213 (2013). https://doi.org/10.1007/s40299-013-0075-z
Flora, David B. "Your coefficient alpha is probably wrong, but which coefficient omega is right? A tutorial on using R to obtain better reliability estimates." Advances in Methods and Practices in Psychological Science 3.4 (2020): 484–501.
Response: The debate over which is better (alpha, omega, etc.) is not relevant for this paper, since some authors support abandoning alpha (e.g., Deng & Chan, 2017; McNeish, 2018) and others accept its utility (e.g., Raykov et al., 2019). Omega coefficients are described as better representing factor consistency in bifactor models than in CFA and SEM (Rodriguez et al., 2016), and have also received some warnings about their applicability (Cho, 2021). However, we calculated omega reliability coefficients, as they tend to better represent the shared variance of the items within the latent factor (see Rodriguez et al., 2016; 2016) compared to alpha.
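To make the omega calculation concrete, the following is a minimal sketch of coefficient omega for a single factor computed from standardized loadings. The loadings are hypothetical, not the values in Table 2, and the function name is an assumption. The sketch uses only loadings on the intended factor and assumes uncorrelated unique variances, so it does not settle how cross loadings in an ESEM should enter the computation.

```python
def omega_from_loadings(std_loadings):
    """Coefficient omega for one factor with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances),
    where each unique variance is taken as 1 - loading^2.
    """
    common = sum(std_loadings) ** 2
    unique = sum(1.0 - loading ** 2 for loading in std_loadings)
    return common / (common + unique)


# Hypothetical standardized loadings for a five-item factor.
example_loadings = [0.60, 0.70, 0.50, 0.65, 0.55]
omega = omega_from_loadings(example_loadings)
print(round(omega, 3))
```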
Please use Dueber, D. M. "Package ‘BifactorIndicesCalculator’." (2020) for calculating your bifactor indices values.
Response: Since the bifactor model specification did not converge, reporting bifactor indices values is not necessary.
Lines 332-335: Please clarify whether your correlational analysis was done using all latent variables or observed variables. Given that all of your variables, except one, consisted of 3 or more items, you should be using latent variables in this analytic model to account for measurement error during the analysis. Please confirm or rerun your analysis as such.
Response: As described “SEM with latent variables was performed for correlational analysis between…”
 When examining table 1, the model names make sense for each model except for the last four models. What exactly do the last four models represent?
Response: The two-correlated factor ESEM was run in each subsample as a means to provide preliminary evidence that the model would fit the data in each subsample before conducting invariance analysis. This is a standard procedure as described by Hair et al. (2019).
Make sure DNC is defined in your note. DNC = did not converge. Note, this non-convergence makes sense because the model is trying to estimate too many coefficients in a high-dimensional model situation. You will likely get the models to converge if you make all dimensions orthogonal, as is traditionally done with bifactor models. Please see my earlier comment about using a two-tier modeling approach minus the cross loadings. You will likely find that the two general factors model with specific nuisance factors due to item phrasing will represent your data better and be more meaningful.
Response: We reran the bifactor model using “rotation = target (orthogonal)” and the model achieved convergence. Still, fit indexes were not acceptable (see Table 1).
 Go back and look at your results again. Your bifactor model did not converge so how can you say it did not have acceptable fit to the data?
Response: See newly added fit indexes.
 Check Table 2 for spelling errors and typos
Response: Amendments done.
 Table 2 also shows that Item 6 had the largest cross loading. This item likely needs to be further analyzed as it seems to be tapping both preference and tolerance, although more strongly tapping tolerance in this sample.
Response: Factor loadings were revised as there was a typo. Item 6 is intended to load on the preference factor. Cross-loading is discussed on lines 432-438.
 When you computed reliability in Table 2, how was it computed considering the cross loadings of all items? Please clarify if you only used the loadings on the intended factors and provide an appropriate citation to justify your approach.
Response: We used omegas coefficient calculation procedures.
Lines 375-376: It is no surprise that the ESEM fit the data better given that more parameters are being estimated.
Response: We agree.
 Elaborate on where the measurement model fit each group well independently.
Response: Done (lines 381-385).
 You can never confirm a hypothesis in a frequentist framework. That is like saying, the null hypothesis is true. It would be better to say that the measurement invariance results show that the multigroup analyses provide evidence that strict invariance is tenable across both exercise type and exercise experiences groupings (see Table 3).
Response: Amendments done.
Please provide the p value for your chi-square with associated df in the results section.
Response: Added.
 Why did the authors run so many independent SEMs? Each model is suggesting that variables across models are not related, which is not necessarily true. It would be much wiser if the authors constructed one single model with all DVs and predictors and then interpreted the model as correlational validity evidence.
Response: We revised the SEM analysis. One model with three DVs and predictors was calculated. Please see the results on lines 393-400.
Line 403: What do the authors mean by “hedonic” assumptions?
Response: It was a typo. Amendments made.
Discussion:
In general, the results suggest that the two factors of preference and tolerance are highly correlated in an ESEM framework, with item 6 showing some potential misfit due to its relatively substantive cross loading on preference, which is not its intended dimension. What reasons might the authors give for this misfit, and what are the next steps? Please update lines 427-430 accordingly, as your current discussion does not consider this issue/flaw in the instrument.
Response: Revisions made. See our discussion on lines 435-442.
Overall
 Overall, the authors still need to address the larger issue I noted earlier. What is the justification for doing ESEM with only two factors?
Response: See previous comment 23.
 The authors are advised to rerun their bifactor model as noted in my earlier comment regarding making the factors all orthogonal and then they will fit.
Response: See previous comment 9.
Also, the authors truly need to examine the two-tier model I suggested earlier to address the two correlated factors of substantive interest with two orthogonal factors: positively oriented items and negatively oriented items.
Response: See previous comments 14 and 23.
Reviewer 2 Report
I have found some technical issues that I believe that should be improved to achieve the expected quality of a scientific manuscript.
The first regards the title of the manuscript. It seems that both bifactor models failed to converge; therefore, no further mention of them is made in the manuscript. If that is the case, I strongly suggest changing the title, because I was left expecting a lengthier discussion about bifactor models.
Please describe why you used MLR instead of WLSMV, as the latter is a common choice when you have five response options. It would be useful to see the reasoning for using an estimator that is not recommended for this scenario.
Please add a difference test for the models. It seems that the CFA and ESEM models are nested; however, there is no statistical test for the fit differences.
Some of the results don’t make sense to me. For instance, the unidimensional and two-correlated factor CFA had the same degrees of freedom; however, this is impossible, as the two models differ in the number of parameters that are estimated. Please describe in more detail how the models were specified, or, if possible, please share your Mplus syntax.
In the case of Table 3, which displays the invariance of the model, it is unclear how the fit changed so little. As the fit indicators aren’t that high, the model is expected to lose at least some fit when constraints are imposed. This problem is additionally concerning as no chi-square and degrees of freedom are displayed.
It also seems that the cutoff scores that are being used for your models are too loose. Please check the recommendations by Hu and Bentler (1999) regarding the subject.
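The degrees-of-freedom point can be checked by counting parameters. The sketch below is a minimal illustration, assuming a standard CFA with a mean structure, the first loading of each factor fixed to 1, free factor variances and covariances, free residual variances, and free item intercepts; the function names are assumptions made for illustration. Under these assumptions, a ten-item unidimensional model and a ten-item two-factor model cannot share the same df, because the second model estimates one additional parameter (the factor covariance).

```python
def sample_moments(n_items, mean_structure=True):
    """Number of unique variances/covariances (plus means if a mean structure is modeled)."""
    moments = n_items * (n_items + 1) // 2
    return moments + n_items if mean_structure else moments


def cfa_free_parameters(items_per_factor, mean_structure=True):
    """Free parameters in a simple-structure CFA with marker-variable identification."""
    n_items = sum(items_per_factor)
    n_factors = len(items_per_factor)
    loadings = n_items - n_factors                  # one loading per factor fixed to 1
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    residual_variances = n_items
    intercepts = n_items if mean_structure else 0
    return (loadings + factor_variances + factor_covariances
            + residual_variances + intercepts)


def cfa_df(items_per_factor, mean_structure=True):
    """Model degrees of freedom: sample moments minus free parameters."""
    n_items = sum(items_per_factor)
    return sample_moments(n_items, mean_structure) - cfa_free_parameters(
        items_per_factor, mean_structure)


df_unidimensional = cfa_df([10])   # one factor, ten items
df_two_factor = cfa_df([5, 5])     # two correlated factors, five items each
print(df_unidimensional, df_two_factor)
```

The difference of one df between the two models holds whether or not the mean structure is counted, since the intercepts add the same number of moments and parameters to both.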
Author Response
Reviewer 2
I have found some technical issues that I believe should be improved to achieve the expected quality of a scientific manuscript. The first regards the title of the manuscript. It seems that bifactor models failed to converge; therefore, no further mention of them is made in the manuscript. If that is the case, I strongly suggest changing the title, because I was left expecting a lengthier discussion about bifactor models.
Response: Title revised: “Using Psychometric Testing Procedures for Scale Validity, Reliability, and Invariance Analysis: The PRETIEQ Portuguese Version”.
Please describe why you used MLR instead of WLSMV, as the latter is a common choice when you have five response options. It would be useful to see the reasoning for using an estimator that is not recommended for this scenario.
Response: WLSMV is not as efficient as ML (Muthén & Muthén, 2010). ML handles data missing at random, whereas WLSMV cannot, given its pairwise variable orientation.
Please add a difference test for the models. It seems that the CFA and ESEM models are nested; however, there is no statistical test for the fit differences.
Response: We thank the reviewer for his/her comment. The purpose of the present study was to examine the psychometric properties of the PRETIEQ Portuguese version in a large sample of Portuguese adults engaged in a variety of exercise activities through CFA and ESEM analysis. Therefore, including a test of differences would not add anything new regarding the paper's purpose and could confuse the reader.
Some of the results don’t make sense to me. For instance, the unidimensional and two-correlated factor CFA had the same degrees of freedom; however, this is impossible, as the two models differ in the number of parameters that are estimated. Please describe in more detail how the models were specified, or, if possible, please share your Mplus syntax.
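For context on what the requested difference test would involve under MLR, the sketch below implements the Satorra-Bentler scaled chi-square difference computation as it is commonly described for MLR/MLM output: a difference-test scaling correction cd = (d0*c0 - d1*c1)/(d0 - d1) and a scaled statistic TRd = (T0*c0 - T1*c1)/cd, where T, c, and d are the reported chi-square, scaling correction factor, and df of the nested (0) and comparison (1) models. All numeric inputs below are hypothetical and are not taken from Table 1; the function name is an assumption, and the sketch presumes the two models are in fact nested.

```python
def scaled_chi_square_difference(t0, c0, d0, t1, c1, d1):
    """Satorra-Bentler scaled chi-square difference test statistic.

    t0, c0, d0: scaled chi-square, scaling correction factor, df of the nested model.
    t1, c1, d1: the same quantities for the less restrictive comparison model.
    Returns (TRd, df difference, cd).
    """
    df_diff = d0 - d1
    if df_diff <= 0:
        raise ValueError("The nested model must have more df than the comparison model.")
    cd = (d0 * c0 - d1 * c1) / df_diff   # difference-test scaling correction
    if cd <= 0:
        raise ValueError("Negative scaling correction; the standard formula does not apply.")
    trd = (t0 * c0 - t1 * c1) / cd       # scaled difference statistic
    return trd, df_diff, cd


# Hypothetical values: a two-factor CFA (nested) against a two-factor ESEM (comparison).
trd, df_diff, cd = scaled_chi_square_difference(
    t0=300.0, c0=1.20, d0=34,
    t1=150.0, c1=1.15, d1=26,
)
print(round(trd, 2), df_diff, round(cd, 4))
```

The resulting TRd would then be referred to a chi-square distribution with df_diff degrees of freedom; the p value is omitted here to keep the sketch free of external libraries.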
Response: It was a typo and all tested models were corrected. See below the Mplus syntaxes used for each model.
In the case of Table 3, which displays the invariance of the model, it is unclear how the fit changed so little. As the fit indicators aren’t that high, the model is expected to lose at least some fit when constraints are imposed. This problem is additionally concerning as no chi-square and degrees of freedom are displayed.
Response: Chi-square test and df added. We also revised the analyses and discovered typos regarding the strong and strict results.
It also seems that the cutoff scores that are being used for your models are too loose. Please check the recommendations by Hu and Bentler (1999) regarding the subject.
Response: As stated by Peugh and Feldon (2020) “Hu and Bentler (1998, 1999) conducted fit index Monte Carlo simulations to determine the cut-point values that reliably distinguished “good-fitting” from “bad-fitting” structural equation models. Results suggested CFI values ≥0.95 and RMSEA values ≤0.08 distinguished well-fitting from poorly fitting structural equation models. However, subsequent research has shown that model fit index values can also be influenced by sample size (Marsh et al., 2004), df (Chen et al., 2008), the number of variables analyzed (i.e., model complexity; Kenny and McCoach, 2003), and missing data (Davey, 2005; Savalei, 2011).” Nevertheless, we added the fit index cutoffs for well-fitting models proposed by Hu and Bentler (1999).
Mplus syntaxes
USEVARIABLES ARE
pt1 pt2 pt3 pt4 pt5
pt6 pt7 pt8 pt9 pt10;
Unidimensional (model 1)
ANALYSIS:
ESTIMATOR IS MLR;
MODEL:
UNI by pt1 pt2 pt3 pt4 pt5
pt6 pt7 pt8 pt9 pt10;
OUTPUT:
STDYX.
Correlated two-factor CFA (model 2)
ANALYSIS:
ESTIMATOR IS MLR;
MODEL:
TOL by pt1 pt3 pt8 pt9 pt10;
PREF by pt2 pt4 pt5 pt6 pt7;
OUTPUT:
STDYX.
Correlated two-factor ESEM (model 3)
ANALYSIS:
ESTIMATOR IS MLR;
ROTATION IS TARGET (OBLIQUE);
MODEL:
PREF by pt1-pt10
pt1~0 pt3~0 pt8~0 pt9~0 pt10~0(*1);
TOL by pt1-pt10
pt2~0 pt4~0 pt5~0 pt6~0 pt7~0(*1);
OUTPUT:
STDYX.
One bifactor and two-correlated CFA (Model 4)
ANALYSIS:
ESTIMATOR=MLR;
MODEL:
G by PT1-PT10;
TOL by pt1 pt3 pt8 pt9 pt10;
PREF by pt2 pt4 pt5 pt6 pt7;
OUTPUT:
STDYX.
One bifactor and two-correlated ESEM (Model 5)
ANALYSIS:
ESTIMATOR=MLR;
ROTATION=TARGET (orthogonal);
MODEL:
G by
PT1-PT10 (*1);
TOL BY
PT1-PT10
PT2-PT7~0 (*1);
PREF BY
PT2-PT7
PT1-PT10~0 (*1);
OUTPUT:
STDYX.
Round 2
Reviewer 1 Report
In the introduction section the authors indicate that the PRETIEQ Portuguese is ten items, which is consistent with Table 1, but the text (see Instruments section) says there are 8 items with 4 per subscale/dimension. Please advise. I believe it should be 5 for each subscale based on what was analyzed and presented in Table 1.
The authors state that the purpose of the study was to examine the psychometric properties of the PRETIEQ Portuguese version using CFA, ESEM, and bifactor methods. The 8-item PRETIEQ Portuguese version consists of two scales (intensity-preference and intensity-tolerance) with half of the items on each scale being positively oriented and half being negatively oriented. The authors tested various measurement models and concluded that the two-correlated factor ESEM is the best fitting based on CFI/TLI and SRMR/RMSEA showing the best fit overall. In my previous review I recommended that the authors account for the negatively and positively oriented items by treating them as nuisance factors within their study. Doing so will likely lead to more accurate results than just ignoring them during the data analyses. However, the authors punted on my comment and decided to note it as a limitation. A limitation is reserved for something that could not be done beforehand or was out of the control of the researchers; however, you can do a two-factor model with orthogonal nuisance factors. Since the authors are using Mplus, they could easily adapt the Mplus code below to look at the two-correlated factor CFA with nuisance factors, one for positive and one for negative item phrasing (and modify it for the ESEM approach that has cross loadings). Add this model to your study so you can have a better picture of the two dimensions while accounting for item phrasing nuisance factors, which are known method factors that introduce construct-irrelevant variance. Without it, you are ignoring a huge field within the broad field of measurement and psychometrics. Furthermore, doing these two simple analyses will bolster your data analyses so you are indeed using the most modern psychometric testing methods (within a linear SEM framework). Plus, you will be able to extend the literature, which is then pushing the science in this field.
Finally, if this model ends up being the best fitting model, you will not only be confirming prior literature about the twofactor structure, but extending the literature to argue why it is better to model item phrasings as nuisance factors than just ignore them.
Model:
Pref BY I2 I4 I5 I6 I7;
Tol BY I1 I3 I8 I9 I10;
Neg BY I2 I4 I1 I3; !assumes these are the negatively phrased items
Pos BY I5 I6 I7 I8 I9 I10; !assumes these are the positively phrased items
Pref Tol WITH Neg@0 Pos@0;
Neg WITH Pos@0;
In my original review I asked the authors to justify how they know their item response level data is MAR, but instead they repeated back to me that ML handles missing data as MAR. Again, how do you know your data is MAR vs MCAR or MNAR? The authors need to run diagnostic analyses to test these assumptions and present those results. Then, you would handle missing data.
The authors report that 3% of data is missing. Good, but what are the missing data patterns (this is something reported in the Mplus output) and %’s per pattern? Is the observed missing data related mostly to certain items (this would be important information to report)?
Also, I asked the authors to justify the treatment of the data as linear or continuous when item responses are Likert-type in nature, that is, ordered categorical data. The authors referred to WLSMV as a way to handle categorical data, but stated that it is not as efficient as ML because it handles missing data pairwise. Although true, you still have not justified treating the Likert-type data as linear or continuous. You have a few options in front of you. First, provide a source that argues when it is reasonable to treat 5-point Likert-type data as continuous and when it is not reasonable and then cite it. Second, treat your data as categorical, use WLSMV, and use multiple imputations, which then does not have to invoke pairwise deletion. Third, treat your data as categorical, use MLR, and specify in Mplus that your items are all categorical (which then invokes IRT in Mplus).
Author Response
Reviewer 1
In the introduction section the authors indicate that the PRETIEQ Portuguese is ten items, which is consistent with Table 1, but the text (see Instruments section) says there are 8 items with 4 per subscale/dimension. Please advise. I believe it should be 5 for each subscale based on what was analyzed and presented in Table 1.
Response: It was a typo. The Portuguese version comprises ten items, five items for each factor. Corrected in the manuscript.
The authors state that the purpose of the study was to examine the psychometric properties of the PRETIEQ Portuguese version using CFA, ESEM, and bifactor methods. The 8-item PRETIEQ Portuguese version consists of two scales (intensity-preference and intensity-tolerance) with half of the items on each scale being positively oriented and half being negatively oriented. The authors tested various measurement models and concluded that the two-correlated factor ESEM is the best fitting based on CFI/TLI and SRMR/RMSEA showing the best fit overall. In my previous review I recommended that the authors account for the negatively and positively oriented items by treating them as nuisance factors within their study. Doing so will likely lead to more accurate results than just ignoring them during the data analyses. However, the authors punted on my comment and decided to note it as a limitation. A limitation is reserved for something that could not be done beforehand or was out of the control of the researchers; however, you can do a two-factor model with orthogonal nuisance factors. Since the authors are using Mplus, they could easily adapt the Mplus code below to look at the two-correlated factor CFA with nuisance factors, one for positive and one for negative item phrasing (and modify it for the ESEM approach that has cross loadings). Add this model to your study so you can have a better picture of the two dimensions while accounting for item phrasing nuisance factors, which are known method factors that introduce construct-irrelevant variance. Without it, you are ignoring a huge field within the broad field of measurement and psychometrics. Furthermore, doing these two simple analyses will bolster your data analyses so you are indeed using the most modern psychometric testing methods (within a linear SEM framework). Plus, you will be able to extend the literature, which is then pushing the science in this field.
Finally, if this model ends up being the best fitting model, you will not only be confirming prior literature about the twofactor structure, but extending the literature to argue why it is better to model item phrasings as nuisance factors than just ignore them.
Model:
Pref BY I2 I4 I5 I6 I7;
Tol BY I1 I3 I8 I9 I10;
Neg BY I2 I4 I1 I3; !assumes these are the negatively phrased items
Pos BY I5 I6 I7 I8 I9 I10; !assumes these are the positively phrased items
Pref Tol WITH Neg@0 Pos@0;
Neg WITH Pos@0;
Response: We will try to discuss your comment and our revisions based on the first and the current round of review. As stated in the manuscript “The PRETIEQ, created by Ekkekakis et al. (2005), is a 16-item questionnaire designed to assess the traits of preference… and tolerance...” These authors were responsible for creating the positively and negatively oriented items for measuring only two factors. Teixeira and colleagues (2020) followed standard procedures (Brislin, 1980) by translating and validating the instrument to Portuguese “The translation of the PRETIEQ from English to Portuguese was done through the committee approach methodology…” (pp. 4). We considered the Portuguese version of the study and conducted all the reported analyses as a means to provide evidence of possible bifactor model specification and advance our understanding of the PRETIEQ measurement model. We did account for the negatively and positively oriented items by reverse coding them before conducting all factor analyses. This procedure was performed in IBM SPSS Statistics version 27 (we added this information in the manuscript for clarity, see lines 257-259). Thus, your proposed model would provide the exact same results as ours if we had not reverse coded the items. The limitation was considered since the reviewer suggested a four-correlated factor model considering the negatively vs positively coded items for preference and tolerance as distinct factors within each trait (see your comment 24 and our response).
In my original review I asked the authors to justify how they know their item response level data is MAR, but instead they repeated back to me that ML handles missing data as MAR. Again, how do you know your data is MAR vs MCAR or MNAR? The authors need to run diagnostic analyses to test these assumptions and present those results. Then, you would handle missing data.
Response: We compared models with and without missing data in Mplus. The significant chi-square difference test and the improvement in CFI and RMSEA values suggest support for MAR, since the MCAR assumption holds that the missingness is completely random and does not depend on any observed or unobserved data, which may not be true in real data (Muthén & Muthén, 2017). This information was added in the manuscript.
The authors report that 3% of data is missing. Good, but what are the missing data patterns (this is something reported in the Mplus output) and %’s per pattern? Is the observed missing data related mostly to certain items (this would be important information to report)?
Response: We calculated the percentage based on the output Mplus provides after running the model with missing data. It is a mean percentage considering missing values across the 10 items. This information was added. We believe that reporting the percentage for each missing data pattern is not relevant, since this study does not aim to explore models based on data missingness. It is worth mentioning that, in this study, we reported 3% of data MAR across the 10 item responses provided by 1117 participants, and we do have sufficient statistical power (with sample size calculations reported).
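The distinction at issue in this exchange (a single mean percentage versus per-item rates and per-pattern percentages) can be illustrated with a short sketch. The data below are invented for illustration and are not the study data; the function name and the pattern notation ("." for observed, "x" for missing) are assumptions. Mplus reports comparable pattern information in its own output, which this sketch does not reproduce.

```python
from collections import Counter


def missing_summary(rows):
    """Summarize missingness for a list of equal-length response tuples (None = missing).

    Returns (percent missing per item, percent of respondents per pattern,
    mean percent missing across items).
    """
    n_rows = len(rows)
    n_items = len(rows[0])
    per_item = [
        100.0 * sum(row[j] is None for row in rows) / n_rows
        for j in range(n_items)
    ]
    pattern_counts = Counter(
        "".join("x" if value is None else "." for value in row) for row in rows
    )
    per_pattern = {p: 100.0 * c / n_rows for p, c in pattern_counts.items()}
    overall = sum(per_item) / n_items
    return per_item, per_pattern, overall


# Hypothetical responses from five participants to three items.
toy_rows = [
    (1, 2, 3),
    (4, None, 2),
    (5, 3, None),
    (2, None, 1),
    (3, 3, 3),
]
per_item, per_pattern, overall = missing_summary(toy_rows)
print(per_item, per_pattern, overall)
```

In this toy example the same overall rate conceals that most of the missingness sits on one item, which is the kind of information the reviewer asks to see reported.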
Also, I asked the authors to justify the treatment of the data as linear or continuous when item responses are Likert-type in nature, that is, ordered categorical data. The authors referred to WLSMV as a way to handle categorical data, but stated that it is not as efficient as ML because it handles missing data pairwise. Although true, you still have not justified treating the Likert-type data as linear or continuous. You have a few options in front of you. First, provide a source that argues when it is reasonable to treat 5-point Likert-type data as continuous and when it is not reasonable and then cite it. Second, treat your data as categorical, use WLSMV, and use multiple imputations, which then does not have to invoke pairwise deletion. Third, treat your data as categorical, use MLR, and specify in Mplus that your items are all categorical (which then invokes IRT in Mplus).
Response: We opted to treat the data as categorical using the MLR estimator for several reasons. First, categorical data, such as ordinal responses, are inherently non-normal and may not meet the distributional assumptions of the WLSMV estimator, which assumes multivariate normality. On the other hand, the MLR estimator is robust to violations of normality assumptions and can provide more accurate parameter estimates in the presence of non-normal data (Muthén & Muthén, 2017). Second, categorical data often exhibit non-constant item variance, where the variance of responses may vary across different items. The WLSMV estimator assumes constant item variance, and violations of this assumption can result in biased estimates. In contrast, the MLR estimator can account for non-constant item variance, which can lead to more accurate and robust parameter estimates (Flora & Curran, 2004). Third, as stated in our previous response, the MLR estimator in Mplus can handle missing data using robust techniques, which provide robust parameter estimates even in the presence of missing data (Muthén et al., 2015). Fourth, research has shown that the MLR estimator tends to provide better model fit compared to the WLS estimator for categorical data, especially when the sample size is small or when the data are non-normally distributed (Muthén et al., 2015; Yan & Bentler, 2007). This content was added to the manuscript (see lines 280-293). Last, while we agree that MLR can invoke IRT, it is not our aim to discuss IRT parameters, such as the item discrimination, item difficulty, and guessing parameters. This study does not aim to explore all theories of factor analysis tests.
Reviewer 2 Report
I want to commend the authors for this subsequent version. I believe that this revision was more thorough than the previous one. I have three additional comments that I believe could improve the manuscript:
1) I'm thankful for the inclusion of the Mplus syntax in the reviewer response. I believe that if that syntax is included as complementary material or as an appendix, it could help boost the probability of citation of the manuscript.
2) Authors mention that they used MLR because it is more robust to data missingness when MCAR; however, there is no evidence of MCAR. I suggest adding the results of Little's test to further support this claim.
3) It seems that the best model has little evidence regarding strict invariance; however, in the discussion and the conclusions the authors seem to elude this result in their overall interpretation. Please include this result in order to clarify the overall limitations of the PRETIEQ.
4) In the conclusion it is stated that "The current work adds to the body of evidence supporting the PRETIEQ for exercise research, allowing academics to acquire reliable estimations of exercise-intensity quality for proper exercise prescription.". I think that this statement is a little misleading, as it seems to suggest that the PRETIEQ measures exercise intensity; however, this is not the case. I suggest rewriting this sentence.
Author Response
Reviewer 2
I want to commend the authors for this subsequent version. I believe that this review was more thorough than the previous one. I have three additional comments that I believe could improve the manuscript:
Response: We appreciate your positive feedback. Point-by-point responses are provided in this cover letter, and the amendments made to the manuscript are tracked using the track-changes option in MS Word.
1) I'm thankful for the inclusion of the Mplus syntax in the reviewer response. I believe that including that syntax as supplementary material or as an appendix could help boost the probability of citation of the manuscript.
Response: The Mplus syntax was added as supplemental material (Appendix 1).
2) The authors mention that they used MLR because it is more robust to data missingness under MCAR; however, there is no evidence of MCAR. I suggest adding the results of Little's test to further support this claim.
Response: For clarification, the data were assumed to be missing at random (MAR). We compared models with and without missing data in Mplus. The significant chi-square difference test and the improvement in CFI and RMSEA values suggest support for MAR, since the MCAR assumption holds that the missingness is completely random and does not depend on any observed or unobserved data, which may not be true in real data (Muthén & Muthén, 2017). This information was added to the manuscript.
3) It seems that the best model has little evidence regarding strict invariance; however, in the discussion and the conclusions the authors seem to elude this result in their overall interpretation. Please include this result in order to clarify the overall limitations of the PRETIEQ.
Response: We discussed the lack of strict invariance in more detail and added content to the conclusion section.
4) In the conclusion it is stated that "The current work adds to the body of evidence supporting the PRETIEQ for exercise research, allowing academics to acquire reliable estimations of exercise-intensity quality for proper exercise prescription." I think that this statement is a little misleading, as it seems to suggest that the PRETIEQ measures exercise intensity, which is not the case. I suggest rewriting this sentence.
Response: Sentence revised: “The current work adds to the body of evidence supporting the PRETIEQ for exercise research, allowing academics to acquire reliable estimations of exercise-intensity traits of preference and tolerance for proper exercise prescription.”
Round 3
Reviewer 1 Report
See attached file
Comments for author File: Comments.pdf
Author Response
- Thank you for updating the information about the number of items as 5 per factor, for a total of 10 items in the Portuguese version.
Response: We appreciate it.
- Also, thank you for adding your Mplus syntax for your analysis models.
Response: You are welcome.
- I appreciate the authors' historical overview of the PRETIEQ, its adaptation to the Portuguese version, and its further examination in their study. However, the authors seem to misunderstand the difference between reverse coding items prior to analysis and accounting for negatively and positively oriented items (whether reverse coding is done or not) by treating each type as an orthogonal nuisance factor. Reverse coding items prior to analysis only has the effect of changing the sign of coefficients (i.e., loadings). It does not account for construct-irrelevant item variance due to item phrasing (see any textbook on psychometrics or scale development, e.g., Debbie Bandalos's textbook). To address the construct-irrelevant variance, bifactor models or their extension to two-tier models can be used. So, when a researcher is interested in understanding how two factors relate and is not interested in specific factors that may be due to nuisance factors (i.e., item orientation factors, which are a source of construct-irrelevant variance or method effect), a two-tier model, which is an extension of the bifactor model (i.e., a general factor and one or more orthogonal specific factors), is appropriate. A two-tier model consisting of two correlated factors and two nuisance factors in a CFA or ESEM framework is therefore most appropriate for your study's purpose. These models will mesh perfectly with the authors' goal of studying the PRETIEQ factor structure. Additionally, the two-tier models will provide evidence on whether extended versions of the bifactor model specification can advance our understanding of the PRETIEQ measurement model, which the authors desire (at least based on your writing). Furthermore, as mentioned in my last review, adding these models to your study will allow you to create “a better picture of the two dimensions while accounting for item phrasing nuisance factors, which are known method factors that affect construct-irrelevant variance. Without it, you are ignoring a huge field within the broad field of measurement and psychometrics. Furthermore, doing these two simple analyses will bolster your data analyses so you are indeed using the most modern psychometric testing methods (within a linear SEM framework). Plus, you will be able to extend the literature, which is then pushing the science in this field. Finally, if this model ends up being the best fitting model, you will not only be confirming prior literature about the two-factor structure but extending the literature to argue why it is better to model item phrasings as nuisance factors than just ignore them.” Since the authors shared their Mplus syntax (thank you), I have taken the liberty of specifying the correlated two-factor CFA with orthogonal nuisance factors, which they can easily implement with their data. Note that I did not know which items within each factor are negatively and positively oriented, so I took a guess to provide an example. Modify appropriately. Then, extend the same model to ESEM. My prediction is that you will find that the correlated two-factor ESEM with orthogonal nuisance factors (model 7) fits better than your two-correlated-factors ESEM; updating Table 2 will then be straightforward, as you add two more columns for the positive and negative nuisance factors. Moreover, you will likely (hopefully) find that the noise in the inflated cross-loadings decreases (my $0.02). Assuming these models' results, when presented in your revised manuscript, show this improvement, you will need to update your other analyses (invariance testing, etc.).
Reference
Cai, L. A Two-Tier Full-Information Item Factor Analysis Model with Applications. Psychometrika 75, 581–612 (2010). https://doi.org/10.1007/s11336-010-9178-0
Two-tier model: Correlated two-factor CFA with orthogonal nuisance factors (model 6)
!note. this model assumes you have already reverse coded your negatively oriented items before analysis
MODEL:
TOL BY pt1 pt3 pt8 pt9 pt10;
PREF BY pt2 pt4 pt5 pt6 pt7;
Neg BY pt2 pt4 pt1 pt3; !put your negatively oriented items here
Pos BY pt5 pt6 pt7 pt8 pt9 pt10; !put your positively oriented items here
Pref Tol WITH Neg@0 Pos@0;
Neg WITH Pos@0;
Two-tier model: Correlated two-factor ESEM with orthogonal nuisance factors (model 7)
Repeat the syntax above, but as an ESEM model
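The structure that the model 6 syntax above implies can be made concrete with a small sketch. The Python below is an illustration only, not part of the reviewer's or the authors' materials: it copies the reviewer's guessed item assignments, and the names `FACTORS`, `FREE_COVARIANCES`, `loading_pattern`, and `covariance_is_free` are hypothetical. It shows that every item loads on one substantive factor and one nuisance factor, and that only the TOL-PREF covariance is estimated.

```python
# Illustrative sketch only: item-to-factor assignments copy the reviewer's
# example guess for model 6; all names here are hypothetical.
ITEMS = [f"pt{i}" for i in range(1, 11)]

FACTORS = {
    "TOL": ["pt1", "pt3", "pt8", "pt9", "pt10"],
    "PREF": ["pt2", "pt4", "pt5", "pt6", "pt7"],
    "Neg": ["pt2", "pt4", "pt1", "pt3"],
    "Pos": ["pt5", "pt6", "pt7", "pt8", "pt9", "pt10"],
}

# Only the two substantive factors may correlate; every covariance that
# involves a nuisance factor is fixed at 0 (the "@0" in the syntax).
FREE_COVARIANCES = {frozenset(("TOL", "PREF"))}


def loading_pattern(items, factors):
    """Return {item: {factor: 1 if the loading is free, else 0}}."""
    return {
        item: {name: int(item in members) for name, members in factors.items()}
        for item in items
    }


def covariance_is_free(factor_a, factor_b):
    """True when the covariance between two distinct factors is estimated."""
    return frozenset((factor_a, factor_b)) in FREE_COVARIANCES


pattern = loading_pattern(ITEMS, FACTORS)
for item in ITEMS:
    # One row per item, columns in the order TOL, PREF, Neg, Pos.
    print(item, [pattern[item][name] for name in FACTORS])
```

If the true item orientations differ from the guess, only the `Neg` and `Pos` lists would need to change.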
Response: We appreciate your comments and suggestions to add further analyses that could complement our research topic. Regarding your suggestion to include an additional analysis, we understand the potential value of the two-tier model in further exploring the measurement model of the PRETIEQ questionnaire. However, after careful consideration, we have concluded that it may go beyond the scope and objective of our study, which is specifically focused on testing already advanced bifactor analyses and contributing to the existing research on the PRETIEQ questionnaire. Considering your expertise and approval of the current statistical tests, it seems that the current analyses are appropriate and sufficient to address the research questions and objectives of our study effectively.
- This statement is not correct and should be revised accordingly. “In this study, researchers accounted for the negatively and positively oriented items by reverse coding them before conducting all factor analyses. This procedure was performed in IBM SPSS Statistics version 27.” Suggested revision: Prior to data analysis, all negatively oriented items were reverse scored to align with the polarity of the positively oriented items.
Response: Revised accordingly.
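What reverse scoring does, and does not do, can be shown with a minimal sketch. The Python below assumes a 1-5 response scale (the review mentions 5-point Likert-type items) and uses the reviewer's guessed set of negatively oriented items; `NEGATIVE_ITEMS`, `reverse_score`, and `reverse_negative_items` are hypothetical names, and the respondent record is made-up toy data. The authors report doing this step in SPSS; the sketch is only an equivalent illustration.

```python
# Illustrative sketch only: assumes a 1-5 response scale and the reviewer's
# guessed negatively oriented items; names and data are hypothetical.
NEGATIVE_ITEMS = {"pt1", "pt2", "pt3", "pt4"}


def reverse_score(response, low=1, high=5):
    """Mirror a response around the scale midpoint; keep missing as None."""
    if response is None:
        return None
    if not low <= response <= high:
        raise ValueError(f"response {response} is outside {low}-{high}")
    return low + high - response


def reverse_negative_items(record, negative_items, low=1, high=5):
    """Return a copy of one respondent's record with negative items reversed."""
    return {
        item: reverse_score(value, low, high) if item in negative_items else value
        for item, value in record.items()
    }


# Toy respondent record (not real data).
respondent = {"pt1": 1, "pt2": 4, "pt5": 2, "pt6": None}
print(reverse_negative_items(respondent, NEGATIVE_ITEMS))
```

The transformation only flips the direction of the affected items, which is why it changes the sign of their loadings without modeling any variance due to item wording.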
- I appreciate the authors' work to address my concern about the handling of missing data. Upon reading your revised paper, it would be best to remove the following statements: “researchers compared models with and without missing data. The significant chi-square difference test (p < 0.05) and the improvement in CFI (Δ = 0.08) and RMSEA (Δ = 0.01) values between models suggest support for data missing at random, since data missing completely at random assumption assumes that the missingness is completely random and does not depend on any observed or unobserved data, which may not be true base in real data.” First, because the sample size will differ across the two models, there is uncertainty about the behavior of the chi-square, RMSEA, and CFI change tests. Second, even if your data are MCAR, assuming or suspecting that your data are MAR and then using MLR, which makes this assumption, is not problematic and in fact gives you more power. Third, revise this statement: “To deal with the small amount of missing data at random observed at the item level (missing data mean = 3%) researchers used the Full Information Maximum Likelihood method.” To instead read: To address the small amount of missing data at the item level across all instruments (missing data mean = 3%) Full Information Maximum Likelihood (MLR) estimation was used for all data analyses which assumes data is missing at random (MAR).
Response: Statements removed and the “missing data" statement revised accordingly.
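A summary such as "missing data mean = 3%" is the average of the item-level missing proportions. A minimal sketch of that arithmetic follows, using made-up toy records rather than the study data; `item_missing_rates` and `mean_missing_rate` are hypothetical helper names, and missing responses are assumed to be coded as `None`.

```python
# Illustrative sketch only: toy records, not the study data. Missing
# responses are assumed to be coded as None; helper names are hypothetical.
def item_missing_rates(records, items):
    """Proportion of respondents with a missing response, per item."""
    n = len(records)
    return {
        item: sum(1 for record in records if record.get(item) is None) / n
        for item in items
    }


def mean_missing_rate(rates):
    """Average of the item-level missing proportions."""
    return sum(rates.values()) / len(rates)


toy_records = [
    {"pt1": 4, "pt2": 2},
    {"pt1": None, "pt2": 3},
    {"pt1": 5, "pt2": 1},
    {"pt1": 3, "pt2": 4},
]
rates = item_missing_rates(toy_records, ["pt1", "pt2"])
print(rates, mean_missing_rate(rates))
```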
- I appreciate the authors' attempt to address my concerns about treating the data as continuous vs. categorical. However, their response is filled with inaccuracy around WLSMV assuming multivariate normality (actually, WLSMV assumes multivariate normality of the latent variables/factors [y*'s], not the observed item responses). If indeed your observed ordinal variables follow highly kurtotic and/or skewed distributions, then WLSMV shouldn't be used. To support this argument, the authors should at minimum provide empirical evidence (a summary narrative would suffice) that this was true for the observed ordinal variables under study. Once this summary statement is added, the authors will have provided adequate justification for using MLR with observed ordinal item response data. Note, although necessary for this paper, the authors could address the sparse data issue for the observed ordinal item response data using Bayesian estimation. Or, if the data are not too sparse (say, at least 10 obs per category on all items), then IRT modeling could be used.
Response: We added justification for using the MLR “The authors of the study conducted an initial analysis at the level of item responses to evaluate the distributional properties of the data. They observed that some items 6, 7, and 10 displayed skewed (scores >7) distributions which could indicate departures from normality.”
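The item-level evidence requested above (skewness, kurtosis, and how sparse the response categories are) can be summarized with a short sketch. The Python below uses simple moment-based formulas, which is an assumption and may differ slightly from the bias-corrected values reported by SPSS or Mplus; `describe_item` is a hypothetical helper, the 1-5 category set is assumed, and the responses are toy values rather than study data.

```python
# Illustrative sketch only: moment-based skewness and excess kurtosis on
# toy responses; software packages may apply bias corrections instead.
from collections import Counter


def describe_item(responses, categories=(1, 2, 3, 4, 5)):
    """Summarize one item's observed (non-missing) responses."""
    values = [r for r in responses if r is not None]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("item has no variance")
    counts = Counter(values)
    return {
        "n": n,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
        # Smallest category frequency, to compare with a rule of thumb
        # such as the reviewer's "at least 10 obs per category".
        "min_category_count": min(counts.get(c, 0) for c in categories),
    }


symmetric = describe_item([1, 2, 3, 4, 5])
left_skewed = describe_item([5, 5, 5, 5, 4, 1])
print(symmetric)
print(left_skewed)
```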
- This statement is not true: “Researchers opt to treat data as categorical using the MLR estimator for several reasons,” because the authors used MLR and did not specify that the items were categorical during the data analysis. If they had, then indices such as RMSEA, CFI, and SRMR would not be provided in the output. These indices are only provided in Mplus when MLR is specified in the absence of categorical = or when WLSMV is specified in conjunction with categorical = …
Response: Sentences revised accordingly (see lines 280-289).
- The authors write “However, the two-correlated factors of the ESEM model solution achieved good fit (CFI and TLI <.90; and RMSEA >.08).” However, the signs are flipped; they should be > .90 and < .08.
Response: We appreciate your revision. Revised accordingly.
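The direction of the cutoffs discussed above is easy to get wrong, so a minimal check may help. The Python below encodes only the thresholds named in the comment (CFI and TLI > .90, RMSEA < .08); `meets_fit_cutoffs` is a hypothetical helper and the example index values are invented for illustration, not taken from the manuscript.

```python
# Illustrative sketch only: encodes the cutoffs named in the review
# (CFI and TLI > .90, RMSEA < .08); example values are invented.
def meets_fit_cutoffs(cfi, tli, rmsea, incremental_min=0.90, rmsea_max=0.08):
    """True when CFI and TLI exceed the minimum and RMSEA is below the maximum."""
    return cfi > incremental_min and tli > incremental_min and rmsea < rmsea_max


print(meets_fit_cutoffs(cfi=0.95, tli=0.93, rmsea=0.06))
print(meets_fit_cutoffs(cfi=0.88, tli=0.93, rmsea=0.06))
```

Higher values are better for CFI and TLI, while lower values are better for RMSEA, which is why the two inequalities point in opposite directions.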