ESL students’ oral performance in English language school-based assessment: results of an empirical study
Zhengdong Gan¹*, Emily Pey Tee Oon¹ and Chris Davison²

* Correspondence: zhengdonggan@umac.mo
¹ Faculty of Education, University of Macau, Macao, People’s Republic of China
Full list of author information is available at the end of the article
Abstract
Background: The English language school-based assessment (SBA) component of the Hong Kong Diploma of Secondary Education (HKDSE) Examination is innovative in that the assessment tasks involve assessing English oral language skills in a high-stakes context, but they are designed and implemented in the ESL classroom by school teachers in light of a regular reading and viewing program or the elective modules integrated into the school curriculum. While this certainly is a positive move towards better congruence between teaching, learning, and assessment activities, there has been concern about whether teachers are capable of applying the assessment criteria and standards consistently in spite of going through a variety of standardization meetings and sharing discussions initiated and mandated by the Hong Kong Examinations and Assessment Authority (HKEAA). In other words, there has been concern about the extent to which results provided by teachers in different schools are comparable. Also, how may task difficulty be reflected in students’ assessment results across the two SBA task types? It was to provide some research evidence on matters relating to these issues associated with teacher assessment results that the study described here was carried out.
Methods: With the help of Rasch analysis, the study aims to examine the psychometric qualities of this English language school-based assessment, how students’ assessment results may vary across different schools, and how task difficulty may vary across the two different task types.
Results: The findings indicated the following: (1) among the three schools involved in this study, the two band 2 schools demonstrated similar abilities across all task domains, as there were no significant differences in students’ SBA results in any assessment domain between these two band 2 schools, whereas significant differences were found in some assessment domains between the two band 2 schools and the band 3 school; (2) a markedly more fine-grained pattern of difference in the difficulty levels of different assessment domains was observed in students’ assessment results across the two task types in this study than in previous studies.
Conclusions: Implications of the results for teacher assessor training and test
task development are discussed.
Keywords: School-based assessment, Oral performance, Rasch analysis
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.
Gan et al. Language Testing in Asia (2017) 7:19
DOI 10.1186/s40468-017-0051-2
Background
In contrast to large-scale standardized testing in which the assessor is usually someone
who must remain objective and uninvolved throughout the whole assessment process,
school-based assessment tends to be embedded in the regular curriculum and assessed
by a teacher who is familiar with the student’s work (Davison 2007). Davison maintains
that school-based assessment derives its validity from building into its actual design the
capacity for triangulation and the collection of multiple sources and types of evidence
under naturalistic conditions over a lengthy period of time. Consequently, “the reliability of the assessment was also enhanced by having a series of assessments (rather than just one) by a teacher who was familiar with the student and by encouraging multiple opportunities for assessor reflection and standardization” (Davison 2007, p. 51). In other words, teachers are in the best position to determine the quality of student achievement over time and at specific points and to improve student learning (Wyatt-Smith et al. 2010).
However, drawing on his qualitative observation, Sadler (1998, pp. 80–82) made explicit the typical intellectual and experiential resources teachers rely on when making a judgment in classroom assessment:
- Superior knowledge about the content or substance of what is to be learned
- Deep knowledge of criteria and standards [or performance expectations] appropriate to the assessment task
- Evaluative skill or expertise in having made judgments about students’ efforts on similar tasks in the past
- A set of attitudes or dispositions towards teaching, as an activity, and towards learners, including their own ability to empathize with students who are learning; their desire to help students develop, improve, and do better; their personal concern for the feedback and veracity of their own judgments; and their patterns in offering help
Implicit in Sadler’s observation, then, is that teacher judgments might be characterized as remaining responsive to the influence of other knowledge and skills rather than the stated standards and criteria. Clapham (2000) further commented:
A problem with methods of alternative assessment, however, lies with their validity
and reliability: tasks are often not tried out to see whether they produce the desired
linguistic information; marking criteria are not investigated to see whether they
‘work’; and raters are often not trained to give consistent marks. (p. 152).
In a survey of a high-profile school-based assessment initiative in Hong Kong (Davison et al. 2010), teacher comments such as “I would like the HKEAA to take up my marks to see if I have interpreted the criteria correctly” revealed a lack of confidence among teachers about this teacher-mediated and context-dependent assessment initiative, with many doubting that they had the required knowledge and skills to carry out the assessment properly. Although this English language school-based assessment component has been implemented in schools for nearly 10 years, there has been almost no empirical evidence to illustrate the extent to which teacher assessment results from one school are comparable to the results of another school. Also, to what extent does the difficulty level of different task domains vary across the two task types in this assessment? It was
to provide some research evidence on matters relating to teacher assessment results
that the study described here was carried out.
School-based English language assessment (SBA) scheme in Hong Kong
The literature on school-based assessment has been growing for more than two decades (Davison 2007; Meisels et al. 2001; Gan 2012; Gan 2013; Qian 2014). School-based assessment, as an alternative to testing in the form of greater use of teachers’ assessment of their own students, has become increasingly popular in many countries around the world. Such curriculum-embedded performance assessments, often defined as integrated parts of students’ learning experience, rely heavily on teacher judgment. They differ from external assessments in that curriculum-embedded performance assessments are integrated into the daily curriculum and instructional activities of a classroom (Meisels et al. 2001). The thinking behind curriculum-embedded performance assessments is
based on a social-constructivist view of learning (Vygotsky 1978). The use of this
curriculum-embedded performance assessment is often advocated on the grounds that it
can be conducted as part of teaching and so provide formative feedback to students, thus
improving their learning (Crooks 1988). What characterizes this type of curriculum-
embedded performance assessment is that both the teacher and students are actively
engaged with every stage of the assessment process in order that they truly understand
the requirements of the process, and the criteria and standards being applied (Price et al.
2007). Essential to the operation of this type of assessment is the teacher’s ability to reconcile the dual role they are required to take in both promoting and judging learning (Harlen 2005). Harlen points out that the task of helping teachers take up this dual role can be particularly difficult in countries where a great deal of emphasis is given to examination results. For example, Choi (1999) suggested that in a highly competitive examination-driven school system such as Hong Kong’s, the success of a school-based assessment initiative hinges on the assessment training and resource support provided for teachers. Choi, however, noted that another difficulty in introducing a school-based assessment initiative is ensuring its credibility. This means that an effective and efficient quality assurance and quality control system needs to be established so that the users of examination results can be assured of the reliability of this scheme of assessment and have confidence in teachers’ judgments.
The school-based assessment (SBA) scheme in Hong Kong started out as a component of the Hong Kong Certificate of Education Examination (HKCEE) English Language in 2006. This assessment scheme, which was collaboratively initiated by the Education Bureau (EDB) and the Hong Kong Examinations and Assessment Authority (HKEAA), is innovative in that assessments are administered in schools and marked by teachers in the context of public assessment. Grounded within an “assessment for learning” framework, it is now incorporated into the new Hong Kong Diploma of Secondary Education (HKDSE) English Language Examination, which adopts a standards-referenced assessment system aiming not just to report on the full range of educational achievement but also to motivate learning in Hong Kong secondary schools. Besides accounting for 15% of the total subject mark in the HKDSE, this SBA component seeks to provide a more comprehensive evaluation of learners’ achievement by assessing those learning objectives which can hardly be assessed in public examinations while concurrently enhancing the capability for student
self-assessment and life-long learning (Davison 2007). Given the multiple functions of this SBA component, we believe this current school-based English language assessment can best be defined as:

The process by which teachers gather evidence in a planned and systematic way in order to draw inferences about their students’ learning, based on their professional judgment, and to report at a particular time on their students’ achievements (Harlen 2005, p. 247).
According to HKEAA, these two kinds of assessment tasks build on two different kinds
of learning programs embedded in the school curriculum in Hong Kong. One is a reading/
viewing program in which students read/view four texts over the course of 3 years and
undertake an individual presentation or a group interaction based on the books/videos/films
that they have read/viewed. The other is the elective module(s) in the school curriculum
where students carry out an individual presentation or a group interaction based on the
knowledge, skills, and experience gained in these elective modules.
Although SBA underwent a detailed research, development, and distribution process and has the advantages of providing teachers with a formative view of the progress of individual students and allowing them to address more effectively the specific needs of their students (Yip and Cheung 2005; Carless and Harfitt 2013), challenges and controversy arose, particularly when assessment for both formative and summative purposes is integrated into the regular teaching and learning process, with school teachers involved at all stages of the assessment cycle, from planning the assessment program, to identifying and/or developing appropriate assessment tasks, right through to making the final judgments (Davison 2007). While teachers’ and students’ responses to the underlying philosophy of SBA and its emphasis on improving the quality of teaching and learning were generally very positive, concern about the comparability of SBA scores across schools has been pervasive and still continues, with some of the more experienced teachers especially vocal in their negative comments on the administration of SBA in its initial stage (Qian 2014).
Reliability is often defined as the consistency of measurement (Bachman and Palmer
1996). In other words, the reliability of a test or assessment has to do with the consistency
of scoring and the accuracy of the administration procedures of the test or assessment
(Chiedu and Omenogor 2014). Chiedu and Omenogor suggest that in the case of teacher-directed classroom assessment, two teacher assessors may not necessarily interpret the assessment criteria the same way. In addition, as teacher-directed classroom assessment may vary in different contexts at different times, it may lead to inconsistent assessor judgment
(McNamara 1996). It has thus been widely believed that a major source of unreliability is the scoring of a test or assessment. Undoubtedly, reliability is as important an issue for school-based assessment as for traditional testing. Currently, in the case of English language SBA in Hong Kong, three methods, within-school standardization, inter-school sharing, and the HKEAA’s statistical moderation, are adopted by the HKEAA (2016, p. 22) to ensure the reliability and consistency of SBA scores across schools. Below is a description of each of these three methods.
Within-school standardization
“Within-school standardization” means that if there is more than one subject teacher
teaching the subject to the same cohort of students in the school, it is necessary for the
teachers involved to agree on the criteria for awarding marks so that the same standard
of assessment is applied to all students. Specifically, teachers teaching the same cohort
of students bring samples of video-recorded assessments of different levels (e.g., the
three highest and the three lowest assessments) to the school-level standardization
meeting where the video-recorded assessments are shown and discussed. The discussions at this school-level standardization meeting may lead to adjustments to scores
across classes in the school. This school-level standardization ensures that all the
teachers involved in SBA in the school will achieve a clear understanding of the shared
expectations of what students at particular levels should be able to do in order to
achieve a certain score.
Inter-school sharing
Following the within-school standardization meeting, “inter-school sharing” meetings
are organized by SBA District Coordinators. At the end of the school year, the SBA
District Coordinator will organize an inter-school meeting for professional sharing
among the schools within the group. The School Coordinators bring samples of video-
recordings and assessment records to this inter-school meeting where these samples of
student performance from different schools will be viewed and discussed with reference
to the assessment criteria. Each School Coordinator needs to report back to colleagues
in their own schools. If it is apparent that a particular school’s scores as a whole are markedly higher or lower than those from the other schools, the school team may wish to review their scores.
HKEAA’s statistical moderation
Despite the school-level teachers’ participatory and reflective professional sharing in the implementation of SBA, there is still the likelihood that teachers in one school may be harsher or more lenient in their judgments than teachers in other schools. Given this concern, a statistical moderation method is adopted by the HKEAA in moderating the SBA assessments submitted by schools, with the aim of ensuring the comparability of
SBA scores across schools. This statistical moderation is done by adjusting the average
and the spread of SBA scores of students in a given school with reference to the public
examination scores of the same group of students, supplemented with a review of samples of students’ work. The statistical moderation results will be compared to the
results from the sample review. Potential adjustments will be made to the statistical
moderation results so that the final moderated scores of these schools can properly
reflect the performance of their students in the SBA.
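The mean-and-spread adjustment described above can be sketched in a few lines. This is a hypothetical illustration only: the function name and the use of population standard deviation are assumptions, and the HKEAA’s actual procedure is further supplemented by the sample review of student work.

```python
import statistics

def moderate_scores(sba_scores, exam_scores):
    """Rescale a school's SBA scores so their mean and spread match the
    same cohort's public examination scores (a hypothetical sketch of
    mean-and-spread statistical moderation)."""
    sba_mean, sba_sd = statistics.mean(sba_scores), statistics.pstdev(sba_scores)
    exam_mean, exam_sd = statistics.mean(exam_scores), statistics.pstdev(exam_scores)
    # Preserve each student's relative standing (z-score) within the school,
    # but anchor the scale to the cohort's external examination results.
    return [exam_mean + exam_sd * (s - sba_mean) / sba_sd for s in sba_scores]

# A school whose SBA marks are generous relative to its cohort's exam results
# is pulled down while the rank order of its students is preserved.
print([round(x, 2) for x in moderate_scores([80, 85, 90], [50, 60, 70])])
# [50.0, 60.0, 70.0]
```

The design point is that moderation changes only the school-level mean and spread; the teacher’s relative ordering of students within the school is left intact.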
Kane (2010) makes a distinction between procedural fairness and substantive fairness.
Procedural fairness can be said to require that all test takers take the same test or
equivalent tests, under the same conditions or equivalent conditions, and that their
performances be evaluated using the same rules and procedures. Substantive fairness in testing requires that the score interpretation and any test-based decision rule be reasonable and appropriate and ‘that examinees of equal standing with respect to the construct the test is intended to measure should on average earn the same test score, irrespective of group membership’ (AERA et al. 1999, p. 74). In other words, substantive fairness is concerned with how well the program functions for different groups, and it requires that scores have comparable meaning in different groups. While the above
school-level processes of systematic, participatory, and reflective professional sharing may indeed help mitigate stakeholders’ potential concern about the fairness of SBA scores across schools, there has been almost no empirical evidence to illustrate the extent to which SBA results across different types of schools are comparable. To fill this research gap, the present study, with the help of Rasch analysis, aims to examine how students participating in SBA in different schools may vary with regard to their SBA scores on the assessment tasks.
Task-based L2 performance in the research literature
Currently, there are two competing theoretical perspectives on task-based L2 performance that aim to account for the impact of task type and task conditions on L2 spoken performance: the Tradeoff Hypothesis (Skehan 2009; Skehan 2014) and the Cognition Hypothesis (Robinson et al. 2009; Robinson 2007). Skehan’s theoretical framework views limitations in attention as fundamental to second language speech performance, which entails a need to analyze what attentional and working-memory demands a task makes and the consequences these may have for the language performance dimensions often referred to as accuracy, fluency, and complexity. Consequently, it is often assumed that more demanding tasks are likely to result in prioritization of fluency over accuracy and complexity and that tasks based on familiar or concrete information favor a concern for accuracy. Also, within this Tradeoff Hypothesis, it is suggested that interactive tasks, tasks requiring transformation or manipulation of materials, or tasks which have had pre-task planning
are likely to lead to greater linguistic complexity. Standing in clear opposition to Skehan’s Tradeoff Hypothesis, Robinson’s (2007, 2009) Cognition Hypothesis claims that there is no limit to human attentional resources, and as such the human mind can attend to different aspects of performance if certain conditions are met; language learners can access multiple attentional pools that do not compete, so depletion of attention in one pool has no effect on the amount remaining in another. Robinson (2007) also argues that the more demanding a task is in terms of its content, the more complex and accurate its linguistic performance will be.
Empirical studies that were guided by either Skehan’s or Robinson’s framework and conducted in pedagogic contexts, however, yielded mixed results. For example, Bygate
(1999) examined the complexity of the language of Hungarian secondary EFL learners on
a monologic narrative task and an argumentation task and found that the narrative tasks
might stretch the speakers more in terms of complexity of syntactic and lexical processing. Bygate’s study finding appeared to be echoed in the Michel et al. (2007) study, which
revealed that the dialogic (i.e., interactive) task tended to elicit shorter and structurally
simpler sentences than the monologic narrative task, although Michel et al. also found
that students made significantly fewer errors and were significantly more fluent in the dialogic task condition. In other words, Michel et al.’s study suggests that interactivity may affect structural complexity negatively. It was thus apparent that Skehan and his colleagues’ (Foster and Skehan 1996; Skehan and Foster 1997) observation that more interactive tasks lead to more complex language performance did not find support in the Bygate (1999) and Michel et al. (2007) studies. In language testing contexts, a few studies (e.g., Fulcher 1996; Bachman et al. 1995) reported significant but small differences in test scores across different types of test tasks. More recently, a number of studies conducted in experimental language testing settings that replicated Skehan’s or Robinson’s framework
concerning the impact of task performance conditions on task performance revealed results that did not lend much support to either of their theoretical frameworks. Given the mixed results of these studies on the relationship between task type and task performance, it is clear that this issue warrants further empirical research.
The context for the present study is innovative in that the assessment tasks in this
study involve speaking in a high-stakes language assessment context but they are
designed and implemented in the ESL classroom by school teachers in light of a regular
reading and viewing program or the elective modules integrated into the school curriculum (Gan 2013). The processes of selecting appropriate assessment tasks and making the actual assessments are undertaken collaboratively among the teachers concerned, taking into account the students’ backgrounds, needs, and skills. All the teachers
involved in the assessment, however, need to go through a series of within-school and
inter-school standardization meetings and discussions organized by the HKEAA to help
them to develop a clear understanding of the shared expectations of what students at
particular levels of performance should be able to do to achieve a certain grade.
Building on the research discussed above, the present study focuses on the following
research questions:
1. What is the variation of SBA results across schools in Hong Kong?
2. How may task difficulty be reflected in students’ assessment results across the two SBA task types?
Methods
Participants
The study is part of a large-scale longitudinal project investigating teachers’ and students’ perceptions of a high-profile school-based assessment initiative in Hong Kong
and using various measures to validate assessment tasks and assessment results. In an
earlier related study, a convenience sample of 373 secondary Form 6 students from
three different schools completed a questionnaire about their perceived difficulty of the
two task types on the school-based assessment. The students also reported on their
assessment results from the two assessment tasks. The study reported in this paper focused on analysis of the students’ assessment results collected in the earlier study.
Among the three schools involved in the study, schools A and B are both Catholic schools where English is used as the medium to teach core subjects such as English, Math, Chemistry, and Physics. School C became a Chinese-medium school after 1997, and at the time of this study, school C was making efforts to build up better discipline and a better learning atmosphere among its students. Note that schools A and B are ranked as band 2 schools whereas school C is ranked as a band 3 school in the traditional local school rankings.
Procedures
Prior to the students’ participation in the questionnaire survey, the students’ performance in the two SBA tasks was assessed by their teachers, who followed the assessment criteria for both group discussion and individual presentation that cover six levels (level 1 represents the lowest level and level 6 the highest) of oral English proficiency in the four major domains of English language performance. The two task types
are defined by HKEAA (2016) as follows:
An individual presentation, which may be quite informal, is defined as a single piece
of oral text in which an individual speaker presents some ideas or information over a
sustained period (3–5 min), with the expectation that they will not be interrupted. An
individual presentation requires comparatively long turns, hence generally needing
more pre-planning and a more explicit structure to ensure coherence. A presentation
may be followed by questions or comments from the audience, but this exchange is not
mandatory for the assessment for the individual presentation.
A group interaction is defined as an exchange of short turns or dialog with more than
one speaker on a common topic. An interaction is jointly constructed by two or more
speakers, hence generally needing less explicit structuring but more attention to turn-
taking skills and more planning of how to initiate, maintain, and/or control the interaction by making suggestions, asking for clarification, supporting and/or developing
each other’s views, and disagreeing and/or offering alternatives.
In each of the individual presentations or group discussions, each participant thus
received a separate score for each of the four domains of assessment criteria, as well as
a global score as a result of the aggregation of the domain scores (Gan 2012).
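The scoring scheme can be illustrated with a short sketch. Since the text states only that the global score aggregates the four domain scores, a simple sum is assumed here purely for illustration; the actual HKEAA aggregation rule may differ.

```python
def global_score(domain_scores):
    """Aggregate a participant's four 6-point domain scores into a global
    score. A simple sum is an assumption for illustration; the paper does
    not specify the exact aggregation rule used by HKEAA."""
    assert len(domain_scores) == 4, "one score per assessment domain"
    assert all(1 <= s <= 6 for s in domain_scores), "each domain uses a 6-point scale"
    return sum(domain_scores)

# Scores for pronunciation and delivery, communication strategies,
# vocabulary and language patterns, ideas and organization:
print(global_score([4, 5, 3, 4]))  # 16
```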
Data analysis
In some previous test validation studies, test psychometric properties and result interpretations were typically analyzed through conventional methods. For instance, the internal consistency of test items, in the form of Cronbach’s alpha, is usually examined as an indication of reliability; face and content validity are obtained solely from a panel of experts; and raw scores from each item are summed for a total mean score for comparison of students’ performance or for parametric statistical tests. Such conventional analyses of raw scores treat ordinal-scale data as if they were interval-scale data (Wright 1999), on which parametric statistical tests cannot properly be performed (Wright and Masters 1982; Boone et al. 2014; Liu 2010). When parametric tests are run on ordinal data, the results contain an element of error; in other words, the reliability and validity of the conclusions are jeopardized. In the present study, the psychometric properties of the test were assessed by Rasch modeling and raw scores were transformed into interval data (Rasch estimates in logit units) for the conduct of parametric statistical tests, features that clearly advance on the precursory studies.
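The logit metric underlying this transformation can be illustrated with the dichotomous Rasch model. The study’s 6-point domain scores are handled by the polytomous rating-scale extension estimated in FACETS; this minimal sketch shows only the core log-odds idea that places persons and items on one interval scale.

```python
import math

def rasch_p(theta, delta):
    """Dichotomous Rasch model: probability that a person of ability
    `theta` (in logits) succeeds on an item of difficulty `delta`."""
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

def logit(p):
    """Log-odds transform: maps a probability onto the interval logit
    scale on which Rasch person and item estimates are reported."""
    return math.log(p / (1 - p))

# When ability equals difficulty, the success probability is exactly 0.5
# (zero logits of advantage); the logit transform recovers theta - delta.
print(rasch_p(1.2, 1.2))                    # 0.5
print(round(logit(rasch_p(1.5, 0.3)), 6))   # 1.2
```

Because the logit scale is linear, differences between estimates mean the same thing anywhere on the scale, which is what licenses the subsequent ANOVA-style comparisons.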
In the current paper, the school-based English language assessment scores from 373 secondary Form 6 students from three schools were analyzed using Rasch analysis (Rasch 1980) with the FACETS software (Linacre 2017). In the analysis, each separate domain of task performance on the two SBA assessment tasks is referred to as an assessment “item,” scored on a 6-point scale (see Appendixes 1 and 2). A total of eight assessment items were included for analysis. This enables the psychometric quality of the instrument to be assessed. For this purpose, principal component analysis of residuals, fit statistics, and Rasch separation indices were examined. The Rasch model was used to transform the raw scores into interval-scale data for analysis. Specifically, raw scores were transformed into Rasch estimates that are linear and can readily be used in conventional statistical analyses, e.g., ANOVA for comparisons of variables. In order to evaluate whether scores on the eight items of the two assessment tasks were significantly different across schools, an interaction analysis between item difficulty and school was conducted. In order to examine the relative task difficulty across the two SBA task types, the difficulty estimates of the eight items were compared.
Results
Psychometric features of the assessment
The Rasch model expects unidimensionality, where scores are measures of only one latent trait. Principal component analysis (PCA) of residuals identifies any potential secondary dimension that distorts the measurement of the latent trait (Linacre 2014); it thereby assists the assessment of unidimensionality through the variance explained by the Rasch measures. The data are assumed to be unidimensional if the variance explained by the Rasch measures is greater than or equal to 50% (Linacre 2014). For the present study, the PCA of residuals showed that 77.5% of the variance was explained by the Rasch measures. This is an indication that the data are sufficiently unidimensional and appropriate for Rasch analysis, an attribute of construct validity (Bond and Fox 2015) and strong evidence that the scores are interpretable.
Fit statistics are also indicators of unidimensionality. Fit is assessed through Infit and Outfit Mean Squares (MnSq). Infit MnSq is derived mainly from on-target performance scores, while Outfit MnSq is influenced more by off-target scores. Data that fit the Rasch model perfectly would yield a fit of 1, an ideal situation that is impossible with actual real-world data. A MnSq fit range between 0.60 and 1.40 indicates good adherence to the model (Bond and Fox 2015; Wright and Linacre 1994). Misfitting statistics indicate that test items may measure more than one latent trait, and results for items falling outside the acceptable range should be interpreted with caution. Table 1 shows that all items reported acceptable Infit and Outfit MnSq, with values ranging between 0.86 and 1.18. In addition to the PCA results reported earlier, the item fit statistics indicated that the data were unidimensional and that the items performed according to the Rasch model’s expectations.
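A schematic sketch of how the two fit statistics are computed from model residuals may make the Infit/Outfit distinction concrete. This is illustrative only; FACETS computes these from its own estimated expectations and model variances.

```python
def fit_mean_squares(observed, expected, variance):
    """Schematic computation of the two Rasch fit statistics from model
    residuals. Outfit MnSq is the unweighted mean of squared standardized
    residuals, so off-target observations (with small model variance)
    dominate it; Infit MnSq weights each squared residual by the model
    variance of the observation, so on-target observations dominate."""
    outfit = sum((o - e) ** 2 / v
                 for o, e, v in zip(observed, expected, variance)) / len(observed)
    infit = (sum((o - e) ** 2 for o, e in zip(observed, expected))
             / sum(variance))
    return infit, outfit

# Residuals that match the model's expected variance yield MnSq = 1.0,
# the "perfect fit" value mentioned above.
print(fit_mean_squares([1, 0, 1], [0.5, 0.5, 0.5], [0.25, 0.25, 0.25]))
# (1.0, 1.0)
```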
Rasch modeling provides two separation indices that indicate whether the person and item estimates produced by the Rasch model are reliable. The person separation index indicates the replicability of person ordering, while the item separation index indicates the replicability of item placement on an interval scale (Bond and Fox 2015). The widely accepted threshold for the separation index is 3.0 (Bond and Fox 2015). The person and item separation indices for the present study were 5.07 and 4.31 (corresponding to 7.09 person strata and 6.08 item strata). These results indicate that this sample and these items are
Table 1 Item fit statistics
Entry Name Measure SE Infit MnSq Outfit MnSq
1 disProdel 0.06 0.08 1.01 1.05
2 disComstr −0.12 0.08 1.16 1.18
3 disVoclan 0.37 0.08 0.86 0.86
4 disIdeorg −0.67 0.08 0.91 0.86
5 indpreProdel 0.19 0.08 0.87 0.88
6 indpreComstr 0.33 0.08 1.14 1.11
7 indpreVoclan 0.25 0.08 0.86 0.88
8 indpreIdeorg −0.42 0.08 1.10 1.16
Notes: Entries 1–4 refer to the four performance domains (i.e., pronunciation and delivery, communication strategies,
vocabulary and language patterns, ideas and organization) of the SBA group interaction task; entries 5–8 refer to the four
performance domains (i.e., pronunciation and delivery, communication strategies, vocabulary and language patterns,
ideas and organization) of the SBA individual presentation task. Each domain of either task is scored on a 6-point scale.
See Appendixes 1 and 2
Gan et al. Language Testing in Asia (2017) 7:19 Page 9 of 21
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
separable into 6–7 levels of ability or difficulty levels, respectively (Bonk and Ockey
2003), from which they indicated that the person and item estimates from the Rasch
analysis are reliable and are replicable on interval scales (Bond and Fox 2015).
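The correspondence between separation indices and strata noted above follows Wright's strata formula, H = (4G + 1) / 3. The sketch below reproduces the paper's figures; the reliability helper uses the standard Rasch relation R = G² / (1 + G²), which is an addition for context and is not reported in the text.

```python
# Sketch of the separation-to-strata conversion implied above:
# Wright's strata index H = (4G + 1) / 3 for a separation index G.
# separation_reliability() is the standard Rasch relation R = G^2 / (1 + G^2),
# included for context only (the paper reports G and the strata, not R).

def strata(separation):
    return (4 * separation + 1) / 3

def separation_reliability(separation):
    return separation**2 / (1 + separation**2)

print(round(strata(5.07), 2))  # person strata -> 7.09
print(round(strata(4.31), 2))  # item strata -> 6.08
```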
What is the variation of SBA results across different schools in Hong Kong?
An interaction analysis between item difficulty and schools was conducted to examine
differences in students’ oral English performance on the eight assessment items
across the three schools. The result of the chi-square test showed that the interaction
effect was significant [χ² = 111.3, p < .05]. In other words, students from different
schools generally demonstrated significantly different performance on the items.
Rasch estimates are indicators of students’ ability and item difficulty. A positive value
indicates higher ability or higher difficulty; conversely, a negative value indicates
lower ability or lower difficulty. Students from school C scored higher on the discussion
task (Fig. 1), as they showed lower Rasch-calibrated item difficulties across the four
group discussion task domains (items). Item 1 (disProdel) (−0.34 logits) and item 2
(disComstr) (−0.41 logits) were particularly easier for students from school C than
for students from schools A and B. The item difficulties of item 1 for schools A and B
were 0.34 and 0.42 logits, while the item difficulties of item 2 were 0.18 and
0.07 logits, respectively. The differences between schools A/B and school C were significant
(p < .05). The differences in item difficulty for item 3 (disVoclan) and item 4 (disIdeorg)
between students from school B and school C were also significant (p < .05). Students
from schools A and B demonstrated similar abilities across the discussion items, as no
significant difference in item difficulty was observed between them on any of the
discussion task domains.
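One way such pairwise between-school contrasts could be tested is a Wald z on the difference of two Rasch difficulty estimates. The sketch below is illustrative only: school-level standard errors are not reported in the text, so the 0.08 SE is an assumption borrowed from Table 1's pooled estimates.

```python
# A sketch of a Wald z test for the difference between two Rasch item
# difficulty estimates (logits). ASSUMPTION: the 0.08 standard error is
# borrowed from Table 1; school-level SEs are not reported in the paper.
import math

def wald_z(d1, d2, se1, se2):
    """z statistic for the difference between two independent logit estimates."""
    return (d1 - d2) / math.sqrt(se1**2 + se2**2)

# Item 1 (disProdel): school A (0.34 logits) vs school C (-0.34 logits)
z = wald_z(0.34, -0.34, 0.08, 0.08)
print(round(z, 2))  # |z| > 1.96 -> significant at p < .05
```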
Fig. 1 Item difficulty estimates for the three schools

In general, students from school C showed poorer performance on the individual
presentation task domains, especially on items 6, 7, and 8. The item difficulties of
these three items for students from school C were 0.64, 0.77, and −0.03 logits,
respectively. In contrast, students from schools A and B performed significantly better
on these items (p < .05). The item difficulties of these three items were 0.18, −0.36,
and −0.67 logits for students from school A and 0.06, −0.03, and −0.82 logits for
students from school B. Students from schools A and B demonstrated similar performance
across the individual presentation task items, with no significant difference in item
difficulty observed.
The pattern of relative item difficulty for students from different schools is clearer in
Fig. 2. Students from schools A and B demonstrated a similar performance pattern. The
items in the discussion task appeared more difficult for schools A and B than for
school C. In contrast, items in the individual presentation task were easier for
schools A and B than for school C.

How may difficulty of items (i.e., assessment domains) be reflected in students’ assessment results across the two SBA task types?
Figure 3 lays out the locations of the students and the items on an interval scale. The
first column is the logit scale, and the second and third columns graphically depict
the locations of the students and the eight items, respectively. The fourth column is the
rating scale of the items. This map places the student measures and item measures on
a common interval scale in logit units. For the present study, the logit scale runs from
−10 to +9 logits. Students towards the top of the figure were higher in ability than
students at the bottom. Items near the top are more difficult, while those near the
bottom are less difficult.
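The logit positions on the map can also be read probabilistically: under a dichotomous Rasch model, a person of ability θ succeeds on an item of difficulty δ with probability 1 / (1 + e^−(θ−δ)). The sketch below is a simplification for illustration only, since the SBA domains are actually scored on a 6-point rating scale and the full analysis is polytomous.

```python
# Illustrative reading of the logit scale in Fig. 3 via the dichotomous
# Rasch model: P(success) = 1 / (1 + exp(-(theta - delta))).
# SIMPLIFICATION: the SBA items use a 6-point rating scale, so this
# dichotomous form is for intuition only.
import math

def rasch_prob(theta, delta):
    """Probability that a person at ability theta succeeds on an item at difficulty delta."""
    return 1 / (1 + math.exp(-(theta - delta)))

# A person at 0 logits facing the easiest discussion item (disIdeorg, -0.67 logits)
print(round(rasch_prob(0.0, -0.67), 2))  # 0.66
# ...and the hardest discussion item (disVoclan, 0.37 logits)
print(round(rasch_prob(0.0, 0.37), 2))   # 0.41
```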
Across the two task types, item 4 (disIdeorg) and item 8 (indpreIdeorg) appeared to
be the easiest items for students (Fig. 3); the former is a group discussion item while the
latter is an individual presentation item. Item 6 (indpreComstr), item 3 (disVoclan), and
item 7 (indpreVoclan) emerged as the most difficult items (Fig. 3); items 6 and 7
are individual presentation items while item 3 is a group discussion item. The
remaining items, 2 (disComstr), 5 (indpreProdel), and 1 (disProdel), appeared to be of
relatively medium difficulty. These results suggest a more fine-grained pattern
of difference in item difficulty across the two SBA task types, which was not reported
in previous studies.

Fig. 2 Relative item difficulty for the three schools
Discussion
Psychometric qualities of the English language school-based assessment
In the English language school-based assessment, the teacher assessor, who has
received rater training organized by the HKEAA before undertaking assessment, sits
nearby, assesses each participant, and assigns scores to students. Each student thus
receives two independent ratings for their oral English performance in either the
individual presentation or group interaction task and is scored on pronunciation and
delivery, communication strategies, vocabulary and language patterns, and ideas and
organization. Raw scores for each of the assessment tasks were assigned on a scale of
0–6 in each of the four rating categories, for a total score of 24. In analyzing test
datasets, the assumption of unidimensionality is perhaps one of the most discussed
features of Rasch models (Bonk and Ockey 2003). In our study, the statistics
reported above display adequate psychometric unidimensionality, suggesting that the
English language school-based assessment tends to assess a unidimensional latent trait,
i.e., oral English proficiency, as represented by ratings on the four scoring categories,
thus providing evidence of construct validity. The statistics also show that the
items in the SBA had satisfactory fit statistics, indicating that all items performed
consistently as expected. The person and item separation indices shown in the summary
statistics are above the widely accepted threshold, indicating that the SBA can
differentiate levels of proficiency. This means that the Rasch model generated in our
analysis could reliably separate examinees by ability.

Fig. 3 Person-item map on interval scale
Bonk and Ockey (2003) used Rasch measurement in their study of a group oral
assessment in English at a Japanese university, in which groups of three or
four students were assessed in conversation. Examinees were scored on five scoring
categories, i.e., pronunciation, fluency, grammar, vocabulary/content, and communication
skills/strategies. Although items in the form of these five scoring categories were found
to show acceptable fit, “communication skills” and “pronunciation” were the categories
with a possible degree of misfit. Two scoring categories of our study (see Appendixes 1
and 2) also measure “communication skills” and “pronunciation,” but unlike Bonk and
Ockey’s study, these two categories, as well as the other two categories in our study, all
demonstrated good fit. This means that all these assessment categories (pronunciation
and delivery, communication strategies, vocabulary and language patterns, ideas and
organization) belong to the same measurement domain. This makes sense, as
the focus of the school-based assessment is on students’ speaking ability to discuss
issues in depth and to convey their ideas clearly and concisely rather than on
memorization skills or their ability to provide highly specific factual details about what
they have read or viewed.
Variation of SBA results across schools in Hong Kong
This study showed that students from school C demonstrated significantly poorer
performance on three assessment domains (i.e., communication strategies, vocabulary
and language patterns, and ideas and organization) in the individual presentation
task compared with schools A and B. However, somewhat unexpectedly,
students from school C scored significantly higher on two assessment domains
(i.e., pronunciation and delivery, and communication strategies) in the group
discussion task than students from school A or school B, given the fact that school
C is a government-funded band 3 school. At the time of this study, school C was
struggling hard to improve its teaching quality and discipline among the students.
There are two possible interpretations of school C’s higher performance on those
two assessment domains. First, as a typical practice in Hong Kong, many
government-funded band 3 schools tend to designate a couple of classes from each
grade as “elite classes.” Such “elite” classes usually have the privilege of access to
the best teaching and learning resources in the school. For example, these classes
are usually taught by the best English teachers in the school and may also
participate in extra-curricular English learning tutorials offered by native
English-speaking teachers in the school. In this study, there was the likelihood that
a considerable proportion of elite class students from school C participated.
Second, there was the possibility that some teacher assessors from school C might
have been lenient in assessing their students’ oral performance in some assessment
domains of the group discussion task in the SBA. Overall, this study indicates that
students from school C were likely to demonstrate relatively unstable language
performance in the English language SBA in Hong Kong. This implies that despite
school-level teachers’ participation in the SBA standardization processes, there is
still the likelihood of variance in harshness or leniency in teachers’ judgments of
students’ performance across assessment domains in school C. This study thus
points to the need for the HKEAA to adopt a statistical moderation method in
moderating the SBA assessment results submitted by schools to ensure
comparability of SBA scores across schools.
Two of the three schools involved in this study, schools A and B, are Catholic
direct-subsidy schools that use English as the medium of instruction for the core school
subjects and have been ranked as band 2 schools in Hong Kong. This study shows
that students from these two Catholic subsidy schools demonstrated no statistically
significant differences in their school-based English language assessment results,
suggesting that teachers’ assessment scores appeared to be comparable across these
two schools. In other words, teacher judgments of student performance on the two
English language SBA tasks tend to be consistent across these two band 2 schools.
Such potentially reliable judgment of students’ performance on the SBA might have
to do with a range of standardization procedures within or across schools that enable
teachers to meet together; look at, listen to, and discuss student oral samples and the
tasks students have done; and talk about why they think a sample is at a particular
level on each domain. These procedures thus likely constitute important processes
that contribute to shared understanding and common ground among English teachers
involved in the SBA.
Difference in difficulty of different assessment domains across the two SBA task types
The notion of “task-induced variation” (Ellis 1994) means that the particular type of task
a learner is asked to perform will result in variation (Rahimpour 2007). This is
echoed by Tarone (1990), who argues that as second language learners perform different
tasks, their production of some grammatical, morphological, and phonological forms
will vary in a particular manner. Gan (2013) examined how learner L2 oral performance
may vary across two different task types in the current school-based assessment in
Hong Kong by analyzing both the discourse produced from the tasks and the teacher
rater assessments of students’ task performance. Gan’s study revealed a general trend
towards higher assessment scores on most of the assessment domains in the individual
presentation task than in the group discussion task. It needs to be pointed out, however,
that only 30 students’ assessment performance from one particular secondary school in
Hong Kong was analyzed in that study. With the help of Rasch analysis, the present
study examined the teacher rater assessments of 373 students across three different
schools and revealed a more fine-grained pattern of difference in the difficulty levels of
different assessment domains observed in students’ assessment performance across the
two SBA task types. Item 4 (disIdeorg) of the group discussion task and item 8
(indpreIdeorg) of the individual presentation task appeared to be the easiest task
domains for students. This result could be associated with the possibility that while
assessing student oral performance, the teacher rater was likely to attend more to the
grammatical, lexical, or phonological features of the test candidate’s language use than
to the organization of their ideas. This appears to corroborate the result that item 3
(disVoclan) of the group discussion task and item 7 (indpreVoclan) of the individual
presentation task emerged as the most difficult items, as these items represent
performance domains on which the teacher assessor was more likely to base their
decisions (Gan 2012). Note that
item 6 (indpreComstr) was also one of the most difficult items. This might be due to the
possibility that the conditions under which learners performed the individual presentation
task led them to concentrate on the accuracy and fluency of their language
production while overlooking the use of interactional skills. Consequently, these results
show that different aspects of the two SBA tasks may have different strengths in
measuring students’ speaking proficiency in the school-based assessment context. In
other words, the results provide evidence that the two SBA task types could be used to
complement each other in measuring the same construct of oral language proficiency
that they claim to measure. In past decades, there has been anxiety among educators
and researchers about the reliability of the group oral discussion format in the testing
literature. The results of this study lead us to concur with Bonk and Ockey that the
group oral may also be a reasonably solid basis upon which to make a valid overall
decision about students’ L2 oral ability.
Conclusions
This study was motivated by the concern in both research and practice that
teachers from different schools might not be able to provide comparable results,
given teachers’ necessarily subjective judgments and interpretations of assessment
data. We were thus interested in examining the extent to which teachers’ assessment
results from three different schools were comparable. The results suggest that
assessment results from the two band 2 schools appeared generally comparable, as
there was no significant difference in students’ SBA results in most assessment
domains across the two schools. Teachers’ assessment scores of students from the
band 3 school in this study could be less stable occasionally, as students from this
school scored significantly lower on some assessment domains but significantly
higher on others compared with the two band 2 schools.

Overall, the finding that students from two schools of similar banding level
demonstrated similar performance on the two assessment task types provides
empirical support for the reliability and fairness of the SBA as a component of the
public assessment of the English language subject at the secondary level in Hong
Kong. Meanwhile, the possibility that teacher raters’ leniency might have led to
higher scores in some domains of the group discussion task in the band 3 school
provides justification for the need for the HKEAA to adopt a statistical moderation
method in moderating the SBA assessment results submitted by schools to ensure
the comparability of SBA scores across schools. Finally, the observation of a more
fine-grained pattern of difference in the difficulty levels of different assessment
domains in students’ assessment results across the two task types clearly adds to
our understanding of the role of different task types in oral assessment in the
classroom assessment context. The generalizability of the specific results of this
study, however, could be limited by the small sample of schools involved. Future
studies should use a more representative sample of schools selected from a variety
of geographic areas across the region.
Appendix 1
Table 2 SBA assessment criteria for group interaction (GI)
I. Pronunciation and delivery II. Communication strategies III. Vocabulary and language patterns IV. Ideas and organization
6 Can project the voice appropriately for the
context without artificial aids. Can pronounce
all sounds/sound clusters and words clearly
and accurately. Can speak fluently and
naturally, with very little hesitation, while
using suitable intonation to enhance
communication.
Can use appropriate body language to display and
encourage interest. Can use a full range of
turn-taking strategies to initiate and maintain appropriate interaction, and can draw others into the
interaction (e.g., by summarizing for weaker
students’ benefit or by redirecting a conversation
to a quiet student). Can interact without the use of
narrowly formulaic expressions.
Can use a wide range of accurate and
appropriate vocabulary. Can use varied,
appropriate, and highly accurate
language patterns; minor slips do not
impede communication. Can self-correct
effectively. May occasionally glance at
notes but is clearly not dependent on
them.
Can express a wide range of relevant
information and ideas without any signs of
difficulty and without the use of notes. Can
consistently respond effectively to others,
sustaining and extending a conversational
exchange. Can use the full range of
questioning and response levels (see
Framework of Guiding Questions) to engage
with peers.
5 Can project the voice appropriately for the
context without artificial aids. Can pronounce
all sounds/sound clusters clearly and almost
all words accurately. Can speak fluently using
intonation to enhance communication, with
only occasional hesitation, giving an overall
sense of natural non-native language.
Can use appropriate body language to display and
encourage interest. Can use a good range of
turn-taking strategies to initiate and maintain
appropriate interaction and can help draw others
into the interaction (e.g., by encouraging
contributions, asking for opinions, or by responding
to group members’ questions). Can mostly interact
without the use of narrowly formulaic expressions.
Can use varied and almost always
appropriate vocabulary. Can use almost
entirely accurate and appropriate
language patterns. Can usually self-
correct effectively. May occasionally refer
to a note card.
Can express relevant information and ideas
clearly and fluently, perhaps with occasional,
unobtrusive, reference to a notecard. Can
respond appropriately to others to sustain
and extend a conversational exchange. Can
use a good variety of questioning and
response levels (see Framework of Guiding
Questions).
4 Can project the voice mostly satisfactorily
without artificial aids. Can pronounce most
sounds/sound clusters and all common
words clearly and accurately; less common
words can be understood although there
may be articulation errors (e.g., dropping
final consonants). Can speak at a deliberate
pace, with some hesitation but using
sufficient intonation conventions to convey
meaning.
Can use some features of appropriate body
language to encourage and display interest. Can
use a range of appropriate turn-taking strategies to
participate in interaction (e.g., by making
suggestions in a group discussion), and can
sometimes help draw others in (e.g., by asking for
their views). Can interact using a mixture of mainly
natural language and formulaic expressions.
Can use mostly appropriate vocabulary.
Can use language patterns that are
usually accurate and without errors that
impede communication. Can self-correct
when concentrating carefully or when
asked to do so. May refer to a note card
but is not dependent on notes.
Can present relevant literal ideas clearly in a
well-organized structure, perhaps with
occasional reference to a notecard. Can often
respond appropriately to others; can sustain
and may extend some conversational
exchanges. However, can do these things less
well when attempting to respond to
interpretive or critical questions, or when
trying to interpret information and present
elaborated ideas.
3 Volume may be a problem without artificial
aids. Can pronounce all simple sounds clearly
but some errors with sound clusters; less
common words may be misunderstood
unless supported by contextual meaning.
Can use appropriate body language to display
interest in the interaction. Can use appropriate but
simple turn-taking strategies to participate in, and
occasionally initiate, interaction (e.g., by requesting
repetition and clarification or by offering agreement).
Can use simple vocabulary and language
patterns appropriately and with errors
that only occasionally impede
communication. Can sometimes self-
correct simple errors. May suggest a level
Can present some relevant ideas sequentially
with some links among own ideas and with
those presented by others. Can respond to
some simple questions and may be able to
Table 2 SBA assessment criteria for group interaction (GI) (Continued)
I. Pronunciation and delivery II. Communication strategies III. Vocabulary and language patterns IV. Ideas and organization
Can speak at a careful pace and use
sufficient basic intonation conventions to be
understood by a familiar and supportive
listener; hesitation is present.
Can use mainly formulaic expressions as
communication strategies.
of proficiency above 3 but has provided
too limited a sample, or cannot be scored
accurately because of dependence on notes.
expand these responses when addressed
directly.
2 Volume may be a problem without artificial
aids. Can pronounce simple sounds/sound
clusters well enough to be understood most
of the time; common words can usually be
understood within overall context. Can
produce familiar stretches of language with
sufficiently appropriate pacing and
intonation to help listener’s understanding.
Can use appropriate body language when especially
interested in the group discussion or when
prompted to respond by a group member. Can use
simple but heavily formulaic expressions to respond
to others (e.g., by offering greetings or apologies).
Can appropriately use vocabulary drawn from
a limited and very familiar range. Can use
some very basic language patterns
accurately in brief exchanges. Can identify
some errors but may be unable to self-correct.
Provides a limited language sample or a
sample wholly spoken from notes.
Can express some simple relevant information
and ideas, sometimes successfully, and may
expand some responses briefly. Can make
some contribution to a conversation when
prompted.
1 Volume is likely to be a problem. Can
pronounce some simple sounds and
common words accurately enough to be
understood. Can use appropriate intonation
in the most familiar of words and phrases;
hesitant speech makes the listener’s task
difficult.
Can use restricted features of body language when
required to respond to peers. Can use only simple
and narrowly restricted formulaic expressions and
only to respond to others.
Can produce a narrow range of simple
vocabulary. Can use a narrow range of
language patterns in very short and
rehearsed utterances. The language sample is
too limited for a full assessment of proficiency.
Can occasionally produce brief information
and ideas relevant to the topic. Can make
some brief responses or statements made
when prompted.
0 Does not produce any comprehensible
English speech.
Does not use any interactional strategies. Does not produce any recognizable words
or sequences.
Does not produce any appropriate, relevant
material.
Appendix 2
Table 3 SBA assessment criteria for individual presentation (IP)
I. Pronunciation and delivery II. Communication strategies III. Vocabulary and language patterns IV. Ideas and organization
6 Can project the voice appropriately for the
context without artificial aids. Can pronounce
all sounds/sound clusters and words clearly
and accurately. Can speak fluently and
naturally, with very little hesitation, while
using suitable intonation to enhance
communication.
Can use appropriate body language to show
focus on audience and to engage interest. Can
judge timing in order to complete the
presentation. Can confidently invite and
respond to questions if this is required by the
task.
Can use a wide range of accurate and appropriate
vocabulary. Can use varied, appropriate and highly
accurate language patterns; minor slips do not
impede communication. Can choose appropriate
content and level of language to enable audience
to follow. Can self-correct effectively. Can present
without use of notes, but may glance at a note
card occasionally.
Can convey relevant information and ideas
clearly and fluently without referring to
notes. Can elaborate in detail on some
appropriate aspects of the topic, and can
consistently link main points with support
anddevelopment.Canbefollowedeasily
and with interest. Can reformulate a point
if the audience is unclear.
5 Can project the voice appropriately for the
context without artificial aids. Can pronounce
all sounds/sound clusters clearly and almost all
words accurately. Can speak fluently using
intonation to enhance communication, with
only occasional hesitation, giving an overall
sense of natural nonnative language.
Can use appropriate body language to show
focus on audience and to engage interest. Can
judge timing sufficiently to cover all essential
points of the topic. Can appropriately invite
and respond to questions or comments when
required for the task.
Can use varied and almost always appropriate
vocabulary. Can use almost entirely accurate and
appropriate language patterns. Can choose
content and level of language that the audience
can follow, with little or no dependence on notes.
Can usually self-correct effectively. May
occasionally refer to a note card.
Can convey relevant information and ideas
clearly and well, perhaps with occasional,
unobtrusive, reference to a note card. Can
elaborate on some appropriate aspects of
the topic, and can link main points with
support and development. Can be followed
easily. Can explain a point if the audience is
unclear.
4 Can project the voice mostly satisfactorily
without artificial aids. Can pronounce most
sounds/sound clusters and all common words
clearly and accurately; less common words can
be understood although there may be
articulation errors (e.g., dropping final
consonants). Can speak at a deliberate pace,
with some hesitation but using sufficient
intonation conventions to convey meaning.
Can use appropriate body language to display
audience awareness and to engage interest,
but this is not consistently demonstrated. Can
use the available time to adequately cover all
the most essential points of the topic. Can
respond to any well-formulated questions if
these are required by and directly related to
the task.
Can use mostly appropriate vocabulary. Can use
language patterns that are usually accurate and
without errors that impede communication. Can
choose mostly appropriate content and level of
language to enable audience to follow. Can
self-correct when concentrating carefully or when
asked to do so. May refer to a note card but is not
dependent on notes.
Can present relevant literal ideas clearly in
a well-organized structure, perhaps with
occasional reference to a note card. Can
expand on some appropriate aspects of
the topic with additional detail or
explanation, and can sometimes link these
main points and expansions together
effectively. Can be followed without much
effort.
3 Volume may be a problem without artificial
aids. Can pronounce all simple sounds clearly
but some errors with sound clusters; less
common words may be misunderstood
unless supported by contextual meaning.
Can speak at a careful pace and use sufficient
basic intonation conventions to be understood
Can use some appropriate body language,
displaying occasional audience awareness and
providing some degree of interest. Can present
basic relevant points but has difficulty
sustaining a presentation mode. Can respond
to any relevant, cognitively simple,
well-formulated questions required by the task.
Can use simple vocabulary and language patterns
appropriately and with errors that only occasionally
impede communication, but reliance on memorized
materials or written notes makes language and
vocabulary use seem more like written text spoken
aloud. Can choose a level of content and language
that enables audience to follow a main point, but
Can present some relevant literal ideas
clearly, and can sometimes provide some
simple supporting ideas. Can sometimes
link main and supporting points together.
May appear dependent on notes.
Table 3 SBA assessment criteria for individual presentation (IP) (Continued)
I. Pronunciation and delivery II. Communication strategies III. Vocabulary and language patterns IV. Ideas and organization
by a familiar and supportive listener; hesitation
is present.
needs to refer to
notes. Can sometimes self-correct simple errors,
may suggest a level of proficiency above 3, but
cannot be scored accurately because of
dependence on notes.
2 Volume may be a problem without artificial
aids. Can pronounce simple sounds/sound
clusters well enough to be understood most
of the time; common words can usually be
understood within overall context. Can produce
familiar stretches of language with sufficiently
appropriate pacing and intonation to help
listener’s understanding.
Can use a restricted range of features of body
language, but the overall impression is stilted.
Can present very basic points but does not
demonstrate use of a presentation mode and
is dependent on notes. Audience awareness is
very limited.
Can appropriately use vocabulary and language
patterns drawn from a limited and very familiar
range. Can read notes aloud but with difficulty.
Can identify some errors but may be unable to
self-correct. Provides a limited language sample
or a sample wholly spoken from notes.
Can make an attempt to express simple
relevant information and ideas, sometimes
successfully, and can attempt to expand
on one or two points. Can link the key
information sequentially. May be
dependent on notes.
1 Volume is likely to be a problem. Can pronounce
some simple sounds and common words
accurately enough to be understood. Can use
appropriate intonation in the most familiar of
words and phrases; hesitant speech makes the
listener’s task difficult.
Body language may be intermittently present,
but communication strategies appropriate to
delivering a presentation are absent. There is
no evident audience awareness.
Can produce a narrow range of simple vocabulary.
Can use a narrow range of language patterns in
very short and rehearsed utterances. Insufficient
sample to assess vocabulary and language patterns.
Can express a main point or make a brief
statement when prompted, in a way that
is partially understandable. The
presentation is wholly dependent on
notes or a written text.
0 Does not produce any comprehensible English
speech.
Does not attempt a presentation. Does not produce any recognizable words or
sequences.
Does not express any relevant or
understandable information.
Gan et al. Language Testing in Asia (2017) 7:19 Page 19 of 21
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Acknowledgements
We would like to thank Professor Antony John Kunnan for his helpful comments on an earlier version of the paper.
Authors’ contributions
All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author details
1 Faculty of Education, University of Macau, Macao, People’s Republic of China.
2 Faculty of Arts and Social Sciences, UNSW, Sydney, NSW 2052, Australia.
Received: 24 July 2017 Accepted: 2 November 2017
References
American Educational Research Association (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bachman, L., Lynch, B., & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model (3rd ed.). New York and London: Routledge, Taylor & Francis Group.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.
Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. Dordrecht: Springer.
Bygate, M. (1999). Quality of language and purpose of task: Patterns of learners’ language on two oral communication tasks. Language Teaching Research, 3(2), 185–214.
Carless, D. R., & Harfitt, G. (2013). Innovation in secondary education: A case of curriculum reform in Hong Kong. In K. Hyland & L. C. Wong (Eds.), Innovation and change in English language education (pp. 172–185). London: Routledge.
Chiedu, R. E., & Omenogor, H. D. (2014). The concept of reliability in language testing: Issues and solutions. Journal of Resourcefulness and Distinction, 8(1), 1–9.
Choi, C. C. (1999). Public examinations in Hong Kong. Assessment in Education, 6(3), 405–418.
Clapham, C. (2000). Assessment and testing. Annual Review of Applied Linguistics, 20, 147–161.
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58, 438–481.
Davison, C. (2007). Views from the chalk face: School-based assessment in Hong Kong. Language Assessment Quarterly, 4(1), 37–68.
Davison, C., Hamp-Lyons, L., Leung, W., Gan, Z., Poon, C., & Fung, V. (2010). Longitudinal study on the school-based assessment component of the 2007 Hong Kong Certificate of Education (HKCE) English Language Examination. The Hong Kong Examinations and Assessment Authority.
Ellis, R. (1994). The study of second language acquisition. Oxford: Oxford University Press.
Foster, P., & Skehan, P. (1996). The influence of planning time and task type on second language performance. Studies in Second Language Acquisition, 18, 299–323.
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13, 23–51.
Gan, Z. (2012). Complexity measures, task type, and analytic evaluations of speaking proficiency in a school-based assessment context. Language Assessment Quarterly, 9(2), 133–151.
Gan, Z. (2013). Task type and linguistic performance in school-based assessment situation. Linguistics and Education, 24, 535–544.
Harlen, W. (2005). Trusting teachers’ judgement: Research evidence of the reliability and validity of teachers’ assessment used for summative purposes. Research Papers in Education, 20(3), 245–270.
Hong Kong Examination and Assessment Authority (HKEAA) (2016). English language school-based assessment teachers’ handbook.
Kane, M. (2010). Validity and fairness. Language Testing, 27, 177–182.
Linacre, J. M. (2014). WINSTEPS (Version 3.81.0) [Computer software]. Retrieved 17 November 2016 from Chicago: winsteps.com.
Linacre, J. M. (2017). Facets computer program for many-facet Rasch measurement (Version 3.80.0). Retrieved 20 October 2017 from Beaverton, Oregon: winsteps.com.
Liu, X. (2010). Using and developing measurement instruments in science education: A Rasch modeling approach. Charlotte: Information Age Publishing.
McNamara, T. (1996). Measuring second language performance. London & New York: Longman.
Meisels, S. J., Bickel, D. D., Nicholson, J., Xue, Y., & Atkins-Burnett, S. (2001). Trusting teachers’ judgments: A validity study of a curriculum-embedded performance assessment in kindergarten to grade 3. American Educational Research Journal, 38(1), 73–95.
Michel, M. C., Kuiken, F., & Vedder, I. (2007). The influence of complexity in monologic versus dialogic tasks in Dutch L2. International Review of Applied Linguistics in Language Teaching, 45(2), 241–259.
Price, M., O’Donovan, B., & Rust, C. (2007). Putting a social-constructivist assessment process model into practice: Building the feedback loop into the assessment process through peer review. Innovations in Education and Teaching International, 44(2), 143–152.
Qian, D. D. (2014). School-based English language assessment as a high-stakes examination component in Hong Kong: Insights of frontline assessors. Assessment in Education: Principles, Policy & Practice, 21(3), 251–270.
Rahimpour, M. (2007). Task complexity and variation in L2 learners’ oral discourse. The University of Queensland Working Papers in Linguistics, 1, 1–9.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press.
Robinson, P. (2007). Task complexity, theory of mind, and intentional reasoning: Effects on L2 speech production, interaction, uptake, and perceptions of task difficulty. International Review of Applied Linguistics, 45(3), 193–213.
Robinson, P., Cadierno, T., & Shirai, Y. (2009). Time and motion: Measuring the effects of the conceptual demands of tasks on second language speech production. Applied Linguistics, 30(4), 533–544.
Sadler, D. R. (2006). Formative assessment: Revisiting the territory. Assessment in Education: Principles, Policy & Practice, 5(1), 77–84.
Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency, and lexis. Applied Linguistics, 30(4), 510–532.
Skehan, P. (Ed.) (2014). Processing perspectives on task performance. London: John Benjamins.
Skehan, P., & Foster, P. (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1(3), 185–211.
Tarone, E. (1990). On variation in interlanguage: A response to Gregg. Applied Linguistics, 11, 392–400.
Vygotsky, L. (1978). Mind in society: The development of higher psychological processes. Cambridge: Harvard University Press.
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65–104). Hillsdale: Lawrence Erlbaum.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wyatt-Smith, C., Klenowski, V., & Gunn, S. (2010). The centrality of teachers’ judgement practice in assessment: A study of standards in moderation. Assessment in Education: Principles, Policy & Practice, 17(1), 59–75.
Yip, D. Y., & Cheung, D. (2005). Teachers’ concerns on school-based assessment of practical work. Journal of Biological Education, 39(4), 156–162.