Building reliable and valid teacher research questionnaires

By Ulas Kayapinar, PhD

Liberal Arts Department, American University of the Middle East, Kuwait



Teacher researchers often use questionnaires to collect data from respondents. However, some teacher research includes questionnaire difficulties that generate results that do not address research questions well. In fact, teacher research might even create false results by building survey questions that are neither valid nor reliable. The purpose of this paper is to review some difficulties of using questionnaires and provide insight to teacher researchers who wish to use questionnaires that are both valid and reliable. This paper is an attempt to help teacher researchers become more aware of the importance of scientific processes when they use questionnaires.

Keywords: Questionnaire, teacher research, validity, reliability.



Most people have responded to many questionnaires in their lives. In many qualitative teacher research studies, questionnaires are the major source of information because researchers think that questionnaires elicit information which can be discussed, are easy to construct and analyze, and fulfill the purpose of the study. Questionnaires are used especially when teachers are unable to obtain data by observation and want to know what people or learners do, think, know, feel, and want.

For these reasons, questionnaire design and development is seen more as an art than a science. However, developing questionnaires is a scientific activity, and those who create good questionnaires must understand the consequences of the decisions they make in survey design to meet conditions of validity, so that they can draw conclusions from accurate data by using appropriate research methods (Saris & Gallhofer, 2007). The current paper reviews literature on the validity and reliability aspects of questionnaire design as a way to advance teacher research that gathers data using questionnaires.


1. Validity and Reliability Issues

Validity has been defined as “the extent to which [a test] measures what it claims to measure” (Gregory, 1992). Any measure can be called “valid” if it measures what it is supposed to measure. Violations of instrument validity severely impair the functioning of any testing instrument, and this property of an instrument is often even less understood than reliability (Crocker & Algina, 1986; Gregory, 1992).

1.1. Validity

A valid questionnaire has undergone validation procedures to show that it accurately measures what it is intended to measure regardless of the hypothesis, the researcher’s interests, the respondent’s status, the timing of the response, or the researcher who administers it. The validity concept is best understood and examined within the context of its discrete facets, which are called content validity, construct validity, and criterion-oriented validity (predictive and concurrent).

1.1.1. Content Validity

Content validity involves items that cover a given area of content or ability (Bachman, 1990). Items should effectively act as a representative sample of all the possible questions that could have been derived from the construct (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992) showing that the items are a sample of a universe (Cronbach & Meehl, 1955). The following steps should be considered when evaluating the content validity of any measuring instrument (Crocker & Algina, 1986): (1) identify and outline the domain of interest, (2) gather resident domain experts, (3) develop consistent matching methodology of objectives and items to be used, and (4) analyze results from the matching task.
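
Step (4), analyzing the results of the matching task, can be illustrated with a minimal sketch. It assumes one simple way of summarizing the task: the proportion of experts who match each item to the objective it was written for. The item labels, objectives, expert choices, and the 0.75 cut-off are hypothetical and are not taken from Crocker and Algina (1986).

```python
def expert_agreement(expert_matches, intended_objective):
    """Return, for each item, the share of experts who matched it
    to the objective the item was written to cover."""
    agreement = {}
    for item, choices in expert_matches.items():
        if not choices:
            raise ValueError(f"no expert judgments recorded for {item}")
        hits = sum(1 for choice in choices if choice == intended_objective[item])
        agreement[item] = hits / len(choices)
    return agreement


# Hypothetical data: three items, each judged by four domain experts.
intended_objective = {"Q1": "vocabulary", "Q2": "grammar", "Q3": "reading"}
expert_matches = {
    "Q1": ["vocabulary", "vocabulary", "vocabulary", "vocabulary"],
    "Q2": ["grammar", "grammar", "vocabulary", "grammar"],
    "Q3": ["reading", "grammar", "vocabulary", "reading"],
}

agreement = expert_agreement(expert_matches, intended_objective)

# Assumed cut-off: items matched by fewer than 75% of experts go back for revision.
THRESHOLD = 0.75
flagged = [item for item, share in agreement.items() if share < THRESHOLD]

for item, share in agreement.items():
    print(f"{item}: {share:.2f}")
print("Items to revise:", flagged)
```

Items with low agreement are candidates for rewording or removal, since the experts do not recognize them as sampling the intended part of the domain.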

1.1.2. Construct Validity

Construct validity is the heart of any study in which researchers use a measure as an index of a variable that is not itself directly observable (Westen & Rosenthal, 2003), such as intelligence, success, attitude, and memory. “We have to remember that what we observe is not nature in itself but nature exposed to our method of questioning” (Heisenberg, 1958). In this sense, construct validity can be studied when the researcher has no definite criterion measure of the quality with which he is concerned and must use indirect measures (Cronbach & Meehl, 1955). The indirect measures or test scores obtained from instruments such as IQ, attitude, achievement, performance tests, or scales make related constructs observable and measurable.

1.1.3. Predictive Validity/Concurrent (Criterion-oriented) Validity

Criterion-oriented validity is based upon the comparison of the test scores with the scores of a criterion (Aiken, 2000). Predictive validity assesses the predictive utility of an instrument (Crocker & Algina, 1986). For example, standardized tests such as TOEFL or IELTS aim to predict later performance outcomes, and decisions are then made based on the test scores. Concurrent validity is studied when the test score and the criterion score are determined at essentially the same time, and it also applies when one test is proposed as a substitute for another (Cronbach & Meehl, 1955). For example, when a teacher/researcher wants to know whether a new multiple-choice form of a test really measures the intended behavior, a sample of students completes both tests (e.g., the short answer form and the new multiple-choice form of the test measuring the same behaviors). Here the researcher needs to show a strong, consistent relationship between the scores of the two tests, usually using a correlation.
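
The correlation between two sets of scores can be sketched as follows. The sketch assumes the Pearson product-moment correlation, which is one common choice and is not prescribed by the sources cited above. The function name and the score lists are hypothetical illustrations, not data from any study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of the same six students on the two forms of a test.
short_answer = [12, 15, 9, 18, 14, 7]
multiple_choice = [13, 16, 10, 17, 15, 8]

r = pearson_r(short_answer, multiple_choice)
print(f"r = {r:.2f}")
```

A coefficient close to 1 indicates that students are ranked in much the same way by both forms. How high the coefficient must be before one form can replace the other remains a judgment for the researcher.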

1.2. Reliability

Reliability indicates the consistency and closeness of the test results obtained from the same individuals at different points in time (Anastasi, 1982). It can be described as the capacity of a measuring tool to yield results free of error even when it is applied at different times. If a group of respondents given the same test on different occasions obtains similar results, the test can be called reliable (Brown, 1994).

1.2.1. Internal Consistency

The most common method of estimating internal consistency reliability is Cronbach’s alpha. It is important to understand that reliability estimates are a function of the test scores yielded by an instrument, not of the test itself (Thompson, 1999). Reliability estimates of .70 to .80 are generally regarded as acceptable for a measurement instrument in the social sciences (Nunnally & Bernstein, 1994).

Internal consistency methods are concerned with the consistency of scores within the test itself, which means the consistency of scores among the items (Crocker & Algina, 1986; DeVellis, 1991). Here, the key is to have a homogeneous set of items that reflects a unified underlying construct. High inter-item correlations among the items or subscales go together with high reliability estimates of this kind (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992; Henson, 2001).
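
As an illustration, Cronbach’s alpha can be computed from a respondents-by-items table of scores with the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below is a minimal version that assumes complete data (no missing answers) and items scored in the same direction. The response matrix is hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per respondent)."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != k for row in responses):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five respondents to four 5-point Likert items
# that are all intended to measure the same construct.
responses = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Because the formula relies on a total score, it is only meaningful when the items are summable, that is, when they are meant to reflect a single underlying construct.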

1.2.2. Test-retest

In the test-retest method, a test is administered to a set of examinees, a sufficient period of time elapses, and the same test is administered to them once again. Upon completion of the second administration, the correlation coefficient between scores on the two administrations is calculated, which yields information about how stable or consistent the test results are over time (Crocker & Algina, 1986; Gregory, 1992).

1.2.3. Alternate Forms

Another type of reliability estimate is the alternate form method. Like the test-retest method, this technique involves two administrations, but it evaluates the consistency of alternate forms of a single test (DeVellis, 1991). In this method, participants take one form of the test, a period of time elapses, and they then take a second form of the test. The correlation coefficient between the two sets of scores is calculated after results are gathered from both sessions. The result is a coefficient of equivalence (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992).


2. The questionnaire is not valid and reliable because of …

2.1. Inadequate Literature Review

In scale development, a critical first step is to develop a precise and detailed conception of the target construct and its theoretical context (Clark & Watson, 1995). This doesn’t mean that a set of interrelated theoretical concepts must be fully explained before embarking on development of a questionnaire. However, it is vital to review the relevant literature to see existing studies on the same set of constructs to be studied. In this sense, a review of literature is a ‘must’ to develop a precise conception of the target construct when building a valid questionnaire.

2.2. Item Development and Common Pitfalls

Once the type of information needed is identified, items should be drafted for each piece of information. To do this, an item pool that includes the items reflecting the construct of interest is generated. The more items in this pool, the more carefully you can choose those that will do the job you intend. It would not be unusual to begin with a pool of items that is three or four times as large as the final scale (DeVellis, 1991). In this item pool, all the items should represent the content of the construct to be measured with an appropriate item format.

2.2.1. Response Alternatives (Open/Close-Ended Questions)

Open-ended questions allow respondents to respond in their own words. The responses might be varied and difficult to report because the interpretation of each respondent may vary. A possible pitfall here could be responses beyond the researcher’s anticipation and the tendency to report the results for the entire group of participants. However, responses from several points of view might be useful when testing a new program, method, or technique. Close-ended questions allow respondents to choose among specific response options. Responses might be agree/disagree, true/false, multiple choice, and Likert type. Although this type of question is more likely to promote consistency, researchers must avoid forcing respondents to choose among pre-determined options that carry bias. However, close-ended choices allow the researcher to generalize results for the entire group when a total score is obtained.

2.2.2. Leading questions

These questions subtly prompt respondents to answer in a particular way. Examples are, “Do you believe that the Communicative Approach is useful in the classroom?” or “Do you think tablet use provides effective learning in the classroom?” These questions prompt the person to give an acceptable answer, which is “yes”.

2.2.3. Double-barreled questions

These questions are also called double-direct questions or compound questions. They address more than one construct in a single question. An example is, “Is the content graded according to the needs of the students and the requirements of the existing syllabus?” It might be possible that the content is graded according to the requirements of the existing syllabus but not to the needs of the students.

2.2.4. Unclear/ambiguous questions

These types of questions have two or more possible interpretations. An example is, “Does your teacher use the rules strictly in the language class?” Here each respondent answers based on his or her own definition of “strict,” and it is unclear if the question refers to rules for classroom, grammar, or discipline. A better question would clarify which types of rules are and are not to be counted. It is important to provide as much information as possible to make sure each question is crystal clear and interpreted the same way by everyone.

2.2.5. Personal/invasive questions

Respondents should have a right to privacy about their personal information, attitudes, opinions, or feelings. Questions intruding on one’s privacy should be avoided. Some examples are, “How important is religion in your life?” “What makes you angry with your teacher?” “Do you cheat?” “When you get a bad grade, what do your parents do to you?”

2.3. Invalidity

Not defining constructs. For clarity’s sake and to avoid “infinite frustration”, defining the variable to be measured is essential (Cronbach & Meehl, 1955). In this regard, if a questionnaire is being developed to measure a construct, the construct must be both operationalized and syntactically defined to be measured effectively (Benson, 1998; Crocker & Algina, 1986; Gregory, 1992).

Not gathering domain experts. To avoid manipulation, speculation, bias, or misleading inferences, each item and the questionnaire as a whole must be examined by experts. Close examination is also essential for scrutiny of content coverage and content relevance, which refer to how well each item represents and measures the intended content (Bachman, 1990).

Not developing consistent methodology. To avoid wrong analyses of results, the methodology and questionnaire design used should be accurate. Because a method refers to a means of gathering, analyzing, and interpreting data using generally recognized procedures (Richards, 2003), methodological evidence should be presented clearly considering the construct and the items. A consistent match between methodology and construct must be evident. A simple example is that different item formats need different methodology and different types of analysis.

Not analyzing results accurately. To obtain valid results, the data gathered through systematic collection must be analyzed accurately, because analysis transforms data into results or findings (Patton, 2002), and inferences or judgments about participants’ responses rest on those findings. Results lead to inferences and interpretations, which may take one of three forms: making the obvious obvious, making the obvious dubious, and making the hidden obvious (Schlechty & Noblit, 1982).

2.4. Unreliability

As mentioned earlier, reliability indicates the consistency and closeness of the test results obtained from the same individuals at different points in time (Anastasi, 1982). Contrary to popular belief, the “test results” mentioned above might not mean questionnaire results. In fact, in most cases it is impossible to calculate the reliability of a questionnaire. When researchers work on a questionnaire, they tend to focus on the reliability coefficient of the questionnaire, rather than on validity. They believe such reliability supports item strength and the inferences they make. However, reliability works for tests because tests measure constructs. Using a Likert-type questionnaire does not mean it measures a construct. Questionnaires usually include independent items that lack the ability to measure constructs but do capture the views of the respondents. These items can be like: “I think I’m a pretty good language learner”, “Learning a language may be important to my goals, but I don’t expect it to be much fun”, “The best method for me to learn is Communicative Approach”, or “I like getting to know native speakers in general”.

At this point, it is safe to say that most questionnaires are not used for measurement, and questionnaire items are usually independent of each other. Therefore, it is impossible to obtain a total score because such questionnaire items are not summable. In other words, the scores of these questionnaire items cannot be totaled if they, unlike test items, do not have measurement characteristics and do not have a total score. One way to determine the reliability of a questionnaire could be test-retest reliability; however, respondents might not volunteer to complete it twice after a period of time or might give the same answers so as to be viewed favorably by the researcher.
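
One way to report independent items without totaling them is to give the distribution of answers for each item separately. The following is a minimal sketch under that assumption; the three-point response scale and the answers are hypothetical, and the item wordings are borrowed from the examples above.

```python
from collections import Counter


def item_frequencies(responses, scale):
    """Return, for each item, the proportion of respondents choosing each option."""
    table = {}
    for item, answers in responses.items():
        if not answers:
            raise ValueError(f"no answers recorded for item: {item}")
        counts = Counter(answers)
        # Report every option on the scale, including those nobody chose.
        table[item] = {option: counts.get(option, 0) / len(answers) for option in scale}
    return table


# Hypothetical response scale and answers from four respondents.
scale = ["disagree", "neutral", "agree"]
responses = {
    "I like getting to know native speakers in general": [
        "agree", "agree", "neutral", "disagree",
    ],
    "The best method for me to learn is Communicative Approach": [
        "neutral", "agree", "agree", "agree",
    ],
}

frequencies = item_frequencies(responses, scale)
for item, shares in frequencies.items():
    print(item)
    for option in scale:
        print(f"  {option}: {shares[option]:.0%}")
```

Each item is summarized on its own, so no total score and no reliability coefficient based on a total score is implied.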



Although teacher research could be seen differently than academic research, individual teacher researchers do engage in research, collect data, and use academic research methods and techniques to analyze and interpret results, make decisions about themselves and their work, and better engage students in learning. As mentioned earlier, questionnaires are a common means of gathering data that can be discussed in teacher research, and they are easy to construct and analyze. Questionnaires also help teacher researchers better understand what people do and what people think, know, feel, and want. Developing a questionnaire is a scientific activity. To gain more accurate results about individuals or processes and to shape educational research in the future, scientific knowledge should be used by teacher researchers to help avoid false results or to avoid modifying research materials or processes in ways that create data that do not actually exist.

In particular, if the concern of a questionnaire is not the subjective views of respondents, some measurement can be included in the questionnaire to draw reliable conclusions. Drawing accurate and reliable conclusions based on accurate data requires validity. Validity first requires items that are theory-based, in that they include relevant constructs, and literature-based, in that they ask the intended questions in an appropriate format. Therefore, a valid measurement tool includes constructs that are operationally and syntactically defined based upon the judgments of domain experts and that have a total, and reliable, score. In this way, the entire content of the behaviour or construct is represented in the questionnaire with appropriate item formats. Valid items lead to valid findings with appropriate analysis methods.

Valid findings lead to valid results that provide accurate inferences and interpretations which, as Schlechty and Noblit (1982) state, make "the obvious obvious" avoiding "infinite frustration" (Cronbach & Meehl, 1955).



Aiken, L. R. (2000). Psychological Testing and Assessment (10th ed.). Boston, MA: Allyn and Bacon.

Anastasi, A. (1982). Psychological testing. New York: Macmillan Publishing.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OUP.

Benson, J. (1998). Developing a strong program of construct validation: a test anxiety example. Educational Measurement: Issues and Practice, 17, 10-17.

Brown, H. D. (1994). Principles of language learning and teaching. New Jersey: Prentice Hall Regents.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Philadelphia: Harcourt Brace Jovanovich College Publishers.

Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

DeVellis, R.F. (1991). Scale Development: Theory and applications. (Applied social research methods series, Vol. 26). Newbury Park: Sage Publications.

Gregory, R.J. (1992). Psychological testing: History, principles and applications. Boston: Allyn and Bacon.

Heisenberg, W. (1958). Physics and philosophy: The revolution in modern science. New York: Harper and Row.

Nunnally, J., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill Higher, Inc.

Oppenheim, A. N. (1966). Questionnaire design, interviewing and attitude measurement. London: Heinemann.

Patton, M. Q. (2002). Qualitative research and evaluation methods. California: Sage Publications.

Richards, K. (2003). Qualitative inquiry in TESOL. New York: MacMillan.

Saris, W. E., & Gallhofer, I. N. (2007). Design, evaluation, and analysis of questionnaires for survey research. New Jersey: John Wiley & Sons, Inc.

Schlechty, P., & Noblit, G. (1982). Some uses of sociological theory in educational evaluation. In R. Corwin (Ed.), Policy Research. Greenwich, CT: JAI Press, pp. 283-305.

Sudman, S. & Bradburn, N. M. (1974). Response effects in surveys. Chicago: Aldine.

Westen, D., & Rosenthal, R. (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84(3), 608-618.