How to improve teaching practice? Experimental comparison of centralized training and in-classroom coaching, J Cilliers, B Fleisch, C Prinsloo, V Reddy

Tags: teachers, Coaching, lesson plans, teacher, class size, Training, professional development, observations, baseline, curriculum, reading proficiency, South Africa, government, developing countries, teaching techniques, reading comprehension, teaching activities, centralized training, centralized training program, David Evans, Stephen Taylor, teaching practices, control group, reading material, reading books, government curriculum, additional training, curriculum coverage, nearby schools, cent
Content: How to improve teaching practice? Experimental comparison of centralized training and in-classroom coaching.
Jacobus Cilliers, Brahm Fleisch, Cas Prinsloo,§ Stephen Taylor¶
February 2018
Abstract
The quality of a teacher matters greatly for how much a child learns in a year, yet there is limited consensus on how to systematically improve the instructional practices of existing teachers. In a randomized evaluation in 180 public primary schools in South Africa, we compare two structured pedagogic programs aimed at improving the teaching of home language reading in the early grades. In both programs teachers receive the same scripted lesson plans, aligned to the official literacy curriculum. We find that pupils exposed to two years of the program improved their reading proficiency by 0.12 standard deviations if their teachers received centralized Training, compared to 0.24 if their teachers received in-class Coaching. Classroom observations reveal that treated teachers, particularly those who received Coaching, were more likely to split pupils into smaller reading groups sorted by ability, which enabled individualized attention and more opportunities to practice reading. This made more of a difference in large classes.
The results reported in this study form part of the Early Grade Reading Study (EGRS), which was led by the government to find out how to improve early-grade reading. We are grateful for useful feedback from Servaas van der Berg, David Evans, Clare Leaver, and James Habyarimana. Janeli Kotze and Mpumi Mohohlwane provided excellent research assistance and management support. The International Initiative for Impact Evaluation provided funding for the evaluation. All errors and omissions are our own.
Affiliations: McCourt School of Public Policy, Georgetown University; University of Witwatersrand's School of Education, South Africa; §Human Sciences Research Council; ¶Department of Basic Education, Government of South Africa.
1 Introduction
Governments and other large organisations face the challenge of upskilling and improving worker productivity at a large scale. Obstacles to change typically include a combination of low worker capacity, low motivation when workers face weak incentives, and deeply entrenched norms of practice that may be outdated or ineffective. As a result, there are many examples of civil servants who do not perform the tasks they were trained to do, especially when these involve a higher degree of complexity. Public sector doctors in India, for example, only completed 18 percent of a checklist of questions required for diagnosis, despite many years of medical training (Das, Holla, Mohpal & Muralidharan 2016). Similar low levels of clinical adherence are observed throughout the developing world (Das, Hammer & Leonard 2008).1
Nowhere is the need for adopting improved practices more important, yet more challenging, than in education. Numerous studies have found that teachers play a critical role in shaping a child's learning trajectory and raising his/her future productivity,2 and good teaching practices correlate with faster learning.3 Yet, across the world, the quality of teaching is highly variable. In recognition of this, governments and donors invest billions of dollars annually to improve the teaching practices of the existing pool of teachers,4 but with disappointing results. For example, many studies in the United States have found no impact of professional development programs on student learning, especially when conducted by government at scale;5 and a recent meta-analysis of evaluations of in-service teacher training programs in developing countries concluded that "teacher training programs vary enormously, both in their form and their effectiveness" (Popova, Evans & Arancibia 2016).
Broadly defined, there are two approaches to introducing new skills into a workforce: taking workers out of the workspace over a brief time period to participate in a centralized training program, possibly combined with follow-up cues and reminders; or a more ongoing, individualized on-the-job training combining regular observation and feedback. The first approach may provide more time for a deeper conceptual understanding to develop before actually implementing the new techniques. The second approach ensures that implementation will actually happen since somebody is there to observe practice, which may in turn lead to learning by doing, while regular feedback may assure correct application of the new techniques. Both these approaches could facilitate a change in behavior through skill acquisition, since new tasks become easier to implement as a worker becomes more adept at them.
In education, governments typically opt for the former approach: one-off teacher training at a central venue. Many consider this approach to be ineffective, but a promising and potentially cost-effective way to strengthen the training is to provide scripted lesson plans that provide additional instructional guidance, and daily prompts and reminders to facilitate practice (Jackson & Makarin 2016). Although far less common, governments and donors have also experimented with the latter approach: pedagogical coaches, who visit schools on a regular basis to observe teaching, provide feedback, and demonstrate correct teaching techniques (Kraft, Blazar & Hogan 2016).
Is a short centralized training program --combined with daily lesson plans that prompt and guide the implementation of the new practice-- sufficient to ensure use of the new practice? How important is ongoing individualised observation and feedback, provided by expert coaches, for ensuring that new practices are implemented and implemented well? How does this depend on the characteristics of the teacher or the class size? Ultimately, which approach --training or coaching-- is more cost-effective at improving pupil learning?
To answer these questions, we conduct a randomized evaluation in 180 public primary schools in South Africa, comparing two different approaches to improving the teaching of home language reading in the early grades. The first approach (which we refer to as Training) follows the traditional model commonly employed by governments: short, intensive training held at a central venue.6 In the second approach (which we refer to as Coaching), specialist reading coaches visit the teachers on a monthly basis to observe teaching practice and provide feedback. The average duration of exposure to the programs over the course of the year is roughly equivalent.7 Both interventions also provide teachers with scripted lesson plans and educational materials such as graded reading booklets, flash cards, and posters. The lesson plans are based on the official government curriculum and mirror exactly the pedagogical techniques prescribed by government, but at a higher level of specificity. The programs therefore do not introduce a new curriculum, but are rather aimed at improving the delivery of the existing curriculum. Coaching costs roughly 62 USD per pupil annually, compared to 48 USD for Training.
1 See, for example:
2 (Rivkin, Hanushek & Kain 2005, Kane & Staiger 2008, Bau & Das 2017, Buhl-Wiggers, Kerwin, Smith & Thornton 2017, Staiger & Rockoff 2010, Cruz-Aguayo, Ibarrarán & Schady 2017, Chetty, Friedman & Rockoff 2014).
3 (Allen, Gregory, Mikami, Lun, Hamre & Pianta 2013, Kane & Staiger 2012, Araujo, Carneiro, Cruz-Aguayo & Schady 2016).
4 By some estimates the United States spends 18 billion USD annually on teacher professional development (Fryer 2017). According to a nationally representative survey conducted in 38 developed countries, 91 percent of teachers received professional development in the past 12 months (Strizek, Tourkin & Erberber 2014). And Popova, Evans and Arancibia (2016) calculate that nearly two thirds of World Bank-funded education programs include a professional development component.
5 (Harris & Sass 2011, Garet, Wayne, Stancavage, Taylor, Eaton, Walters, Song, Brown, Hurlburt, Zhu et al. 2011, Garet, Cronen, Eaton, Kurki, Ludwig, Jones, Uekawa, Falk, Bloom, Doolittle et al. 2008, Jacob & Lefgren 2004, Randel, Beesley, Apthorp, Clark, Wang, Cicchinelli & Williams 2011).
South Africa faces education challenges that are typical of developing countries. The learning trajectory in early grade reading in South Africa is low: a striking 78 percent of pupils still cannot read with meaning in any South African language after 4 years of schooling (Mullis, Martin, Foy & Hooper 2017).8 Such low learning trajectories in early grade reading are also prevalent in India and elsewhere in sub-Saharan Africa (Banerji, Bhattacharjea & Wadhwa 2013, Bold, Filmer, Martin, Molina, Stacy, Svensson & Wane 2017). Moreover, teacher content knowledge is extremely low --an estimated 79% of grade 6 mathematics teachers in South Africa demonstrated content knowledge below a grade 6/7 level (Venkat & Spaull 2015)-- but again not much worse than what has been documented in other countries in sub-Saharan Africa (Bold et al. 2017).
We measured both pupil learning and teaching activity in the classroom. We assessed the reading ability of a random sample of 20 pupils in each school at three points in time: once as they entered grade one prior to the roll-out of the interventions (February 2015), and again at the end of their first and second academic years (November 2015 and 2016 respectively). During these school visits, we also surveyed teachers and the school principal. We also conducted detailed lesson observations in a stratified random sample of 60 schools in October 2016 -- 20 schools in each evaluation arm. The lesson observation instrument was explicitly designed to capture the teaching practices prescribed by government and thus targeted by the program.
We find that, after two years of exposure to the program, pupils' reading proficiency increased by 0.12 and 0.24 standard deviations if their teachers received Training or Coaching respectively. The impacts are larger still --0.18 and 0.29 standard deviations respectively-- when we exclude the small sample of multi-grade classrooms, a setting where the program was never intended to work. These effect sizes are large relative to other education interventions.9 We conclude that Coaching is more cost-effective than Training, with an estimated 0.0041 standard deviation increase in reading proficiency per 1 USD spent per pupil annually, compared to 0.0023 in the case of Training.
Next, our classroom observations allow us to unpack mechanisms by measuring how teaching practice changed in the classrooms. We find that, even though there is no change in how frequently pupils practice reading in the classroom, there is a big change in how they practice reading: teachers in both treatment arms are more likely to practice a technically challenging teaching technique called group-guided reading, where the pupils read aloud in smaller groups sorted by ability.10 According to the government curriculum, this is supposed to take place on a daily basis, but almost no teachers in the control group are doing this. Moreover, the largest improvement is consistently observed in classrooms where the teachers received Coaching. As a result, pupils are more likely to read aloud in smaller groups and receive individual attention from the teacher when they read.
6 In our case, teachers receive two training sessions, once at the beginning and once in the middle of the year, each lasting two days.
7 We estimate that the average number of hours of exposure to the programs was 32 and 37 hours for the Training and Coaching arms respectively, that is, roughly four to five days in total.
8 This is the percentage of children scoring less than the low international benchmark score in reading, as defined by the Progress in International Reading Literacy Study (PIRLS). Failing to achieve this low benchmark means that children cannot identify and retrieve explicitly stated information from a reading comprehension passage.
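The cost-effectiveness comparison above can be illustrated with a rough calculation. The sketch below simply divides the headline two-year effect sizes by the annual per-pupil costs quoted in the text; the variable and function names are illustrative, and this naive ratio is an assumption about how such a figure might be formed, not the authors' stated method. It gives roughly 0.0025 (Training) and 0.0039 (Coaching) standard deviations per USD, close to but not identical to the reported 0.0023 and 0.0041, so the authors presumably used inputs that are not spelled out here.

```python
# Headline figures quoted in the text. Names are illustrative only.
EFFECT_SD = {"Training": 0.12, "Coaching": 0.24}  # two-year impact, in SD
COST_USD = {"Training": 48.0, "Coaching": 62.0}   # annual cost per pupil


def sd_per_usd(effect_sd: float, cost_usd: float) -> float:
    """Naive cost-effectiveness: SD gained per 1 USD spent per pupil per year.

    Assumption: a simple ratio of effect to cost. The paper may use
    different inputs, so results need not match its reported figures.
    """
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return effect_sd / cost_usd


ratios = {arm: sd_per_usd(EFFECT_SD[arm], COST_USD[arm]) for arm in EFFECT_SD}

for arm, ratio in ratios.items():
    print(f"{arm}: {ratio:.4f} SD per USD")
```

Under either set of figures the ordering is the same: Coaching yields more learning per dollar than Training.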
Notably, we see no change in other activities that are also required to take place on a daily basis, but are easier to teach.11 Relatedly, our classroom observations also reveal that far more pupils are reading the graded reading books in the Coaching arm compared to Training, even though teachers in both arms received the same number of books. Strikingly, virtually no pupils in the control group are reading any books, despite the fact that almost every teacher in the control group claims to have access to some. The graded reading books are meant to be used during group-guided reading. This result reveals the important interaction between resources, teaching practice, and use of resources: provision of reading material is insufficient if teachers do not apply the appropriate teaching techniques so that pupils have opportunities to use them. This is an encouraging counterpoint to other studies which have documented that provision of textbooks by itself is not sufficient to improve learning (Glewwe, Kremer & Moulin 2009, Sabarwal, Evans & Marshak 2014).
Finally, we find that the impacts on both teaching practice and pupil learning depend crucially on the number of pupils in the classroom. For both programs, the smallest impact is observed in the smallest classes. This is plausibly because the benefits of practicing group-guided reading are higher in the larger classes, where pupils would otherwise have received very little individual attention. Moreover, Training also had no impact on learning in the largest classes, and a smaller impact on the probability that a teacher practices group-guided reading.
This trend is not observed in the Coaching arm, plausibly because teachers who received coaching have developed better skills to implement the difficult teaching techniques in the more challenging context of large classes.12
Taken together, our results show that a combination of training and lesson plans can shift teaching practice and improve learning, but the shift is far larger when teachers receive ongoing observation and feedback from a coach, especially for the more difficult techniques. It is encouraging that the traditional model of training can shift learning if it is well-designed, well-implemented and combined with lesson plans that provide guidance and regular reminders and encourage the development of routines. But Coaching remains the more cost-effective option. Moreover, our results provide strong suggestive evidence for which classroom activities matter most for learning: providing pupils with more opportunities to practice reading, and more individual feedback from the teacher. This seems like a trivial result, but in large classrooms providing individual attention is a non-trivial task.
Our paper contributes to growing evidence from developing countries demonstrating that a bundled intervention of training, lesson plans, and coaching can dramatically improve pupils' proficiency in early-grade reading (Piper, Zuilkowski & Mugenda 2014, Piper & Korda 2011, Lucas, McEwan, Ngware & Oketch 2014, Kerwin, Thornton et al. 2017). This is also consistent with the conclusion from a recent review that structured pedagogic programs --a combination of highly specified curricula, training on instructional methods, and additional learning materials-- have great potential to improve learning (Snilstveit, Stevenson, Menon, Phillips, Gallagher, Geleen, Jobse, Schmidt & Jimenez 2016).
This paper makes a unique contribution in two important ways. First, we experimentally vary two common forms of teacher professional development: training versus coaching.
9 For example, a meta-analysis of education reviews by McEwan (2015) found that the categories of interventions with the largest effect sizes (in standard deviations) are: "instructional materials (0.08); computers or instructional technology (0.15); teacher training (0.12); smaller classes, smaller learning groups within classes, or ability grouping (0.12); student and teacher performance incentives (0.10); and contract or volunteer teachers (0.10)."
10 Group-guided reading forces pupils to engage individually with the text, rather than mimic what the teacher is reading with the class as a whole. It also opens up the possibility for individual feedback from a teacher, as s/he can now move between different reading groups. During in-depth surveys, some teachers in the training intervention complained that training was too short, so they did not have enough time to understand group-guided reading.
11 Phonics and letter recognition are also required to be taught daily and are typically taught through whole-class reading, where all the children in the classroom follow or read with the teacher. This is a far easier form of teaching.
This allows us to unpack which components are uniquely responsible for the learning gains, and test for the importance of observation and feedback in developing skills. This is important, since one-off training is the most common form of government teacher professional development, yet most research looks at a more resource-intensive model of coaching. Second, the detailed classroom observations, which were explicitly developed to measure the teaching practices emphasized by the program, shed light on the underlying mechanisms.
Our finding on the importance of providing pupils with individual attention is also in line with the growing body of evidence from developing countries that pedagogy that targets teaching to the level of the child can be highly effective at improving learning (Evans & Popova 2016). Randomized evaluations have found that remedial education programs (Banerjee, Cole, Duflo & Linden 2007), additional teacher assistants (Duflo & Kiessel 2014), or contract teachers (Duflo, Dupas & Kremer 2011) can improve test scores, since they free up resources to provide additional attention to worse-performing pupils. What is encouraging from our study is that we show that individual attention can be accomplished by changing the classroom management practices of the existing pool of teachers, without introducing additional teachers or computer technology.
More broadly, we believe that the results of this study also contribute to debates around teacher accountability and autonomy. There has been vast academic interest in motivating teachers to change their behavior by increasing the returns to effort, either through financial incentives,13 or reducing job security (Duflo et al. 2011, Muralidharan & Sundararaman 2013). In contrast, our study shows that it is possible to shift teaching practice by decreasing the cost of effort, through providing pedagogical support. This is encouraging, because policies aimed at improving teacher accountability are often not politically tractable due to strong teacher unions.14 Furthermore, there is often push-back against a prescribed curriculum and set pedagogical standards, especially in the United States, because of the fear that it will undermine teachers' autonomy and their ability to adapt teaching to the needs of the classroom. Our study demonstrates the benefits from providing lesson plans that precisely detail the teaching activities that should take place in the classroom. Teacher satisfaction with the program was high, underscoring the fact that teachers value the structure provided by standardized lesson plans. Importantly, there were no detectable negative impacts on any segment of the pupil population, so the reduced teacher autonomy does not come at a cost of lower learning for some types of pupils.
The paper proceeds as follows: section 2 describes the interventions and the motivating theoretical channels, section 3 describes the evaluation design and empirical strategy, sections 4 and 5 report and discuss the results, and section 6 concludes.
12 During the exit surveys, teachers complained that the activities prescribed in the lesson plans are too difficult to enact in larger classes.
13 (Glewwe, Ilias & Kremer 2010, Muralidharan & Sundararaman 2009, Cilliers, Kasirye, Leaver, Serneels & Zeitlin 2017)
2 Program description and theoretical framework
2.1 Program
Working with the South African government, we designed two related interventions aimed at improving early-grade reading in one's home language.15 Both interventions provide teachers with lesson plans, which prescribe in detail the content that should be covered and the pedagogical techniques that should be applied for each instructional day. In addition, teachers receive supporting materials, such as graded reading booklets, flash cards, and posters. The graded reading booklets provide a key resource for the teacher to use in group-guided reading (discussed in more detail below) so as to facilitate reading practice at an appropriate pace and sequence of progression. The program was led and managed by government, who appointed a service provider, Class Act, to implement the interventions.
The two interventions differ in their approach to improving teacher pedagogical practice. The first intervention trains the teachers on how to use the lesson plans and accompanying materials through central training sessions, each lasting two days and occurring twice yearly. During these training sessions, roughly a quarter of the training time was spent on teachers practicing the techniques. The trainers also performed follow-up visits to most of the schools, in order to encourage them to continue with the program. We refer to this intervention as Training.
The second intervention, which we refer to as Coaching, provides exactly the same set of instructional materials. However, instead of central training sessions, specialist reading coaches visit the teachers on a monthly basis to observe their teaching and provide feedback on how to improve. The coaches also hold information sessions with all the teachers at the start of each term to hand out new materials; and there were occasional (1 to 3) afternoon workshops with the coach and a small cluster of nearby schools that are part of this intervention. The coaches were educated --all three had at least a bachelor's degree-- and had past experience as both teachers and coaches. They received additional training from Class Act at the start of every term.16 The coaches also conducted the training, so the differences between the programs cannot be attributed to the expertise of those administering the programs.
Figure 1 shows the distribution of teacher exposure to the coaching program in 2016, based on data collected by Class Act. We see that the median number of visits that a teacher received was ten, but some teachers received far fewer visits. There was also high variation in the number of afternoon workshops that teachers attended. Putting this all together, we calculate that the average number of hours of exposure to the program was 36.7.17 According to administrative data, teacher attendance for Training was high --98 and 93 percent for the two sessions held in 2016-- and there were a total of 157 follow-up visits. The organization held follow-up training for the teachers who missed the initial training. The average number of hours of exposure to the program is roughly 34.18
It is important to note that both treatments follow the same curriculum as in the control. The lesson plans are fully aligned with the official government curriculum, both in terms of the topics covered and the instructional techniques prescribed. These standards are very detailed and go as far as specifying the weekly frequency with which different teaching activities should take place. The lesson plans are also integrated with the government-provided workbooks, which detail daily exercises to be completed by students. The "new" methods introduced through this program are therefore new relative to existing practice but not relative to the existing policy intention. Moreover, the control teachers also receive a high level of support from government.
14 For example, recent education reform in Indonesia attempted to put in place pay scale differentials based on performance metrics, but it was blocked by trade unions. The eventual policy led to an unconditional doubling of teacher salaries, with no resultant impact on learning (De Ree, Muralidharan, Pradhan & Rogers 2015).
15 In South Africa, most children are taught in their home language in grades one to three and then experience a transition to English as the language of instruction in grade four. Both linguistic theory and empirical evidence from South Africa (Taylor and Von Fintel, 2016) indicate that learning to read in one's home language enables better learning outcomes later in school.
For example, over 79 per cent of teachers in the control reported having received in-service training on teaching Setswana as a home language in the past year; and 96 per cent of teachers have at least some graded reading booklets in the classroom. Any difference we observe is therefore due to the modality of support the teachers receive, not the pedagogical content.
One aspect of the curriculum, group-guided reading, stands out as an important ingredient in learning how to read with comprehension. Group-guided reading is "an ability-group reading teaching strategy where all the members in the group read the same text under the direction of the teacher" (Department of Basic Education, 2011). This is expected to take place on a daily basis and also requires regular assessment of pupils' reading proficiency in order to assign pupils to the appropriate graded reading booklets. Note the relationship between resources and teaching activity: the reading booklets are necessary to effectively enact group-guided reading, but might have no impact on learning if they are not combined with group-guided reading.
16 The training focused on coaching and mentoring, school curriculum, and teaching skills.
17 Assuming that each information session lasts five hours, each coaching visit lasts one and a half hours (one hour observation and 30 minutes feedback), and each afternoon session lasts two hours.
18 Assuming that trainers spent on average 20 minutes talking to each teacher during a follow-up visit. The trainers were only supposed to talk to the school principals, but inevitably also talked to the teachers.
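The exposure calculation for the Coaching arm described in footnote 17 can be sketched as a simple weighted sum. The per-activity durations below come from that footnote; the function name and the example counts (four information sessions, assuming one per term over a four-term year, ten coaching visits as at the reported median, and two afternoon workshops) are illustrative assumptions rather than figures taken from the administrative data.

```python
# Durations per activity, as assumed in footnote 17 (in hours).
HOURS_PER_INFO_SESSION = 5.0
HOURS_PER_COACH_VISIT = 1.5   # one hour of observation + 30 minutes of feedback
HOURS_PER_WORKSHOP = 2.0


def coaching_exposure_hours(info_sessions: int, coach_visits: int,
                            workshops: int) -> float:
    """Total hours of exposure for one teacher in the Coaching arm."""
    if min(info_sessions, coach_visits, workshops) < 0:
        raise ValueError("activity counts must be non-negative")
    return (info_sessions * HOURS_PER_INFO_SESSION
            + coach_visits * HOURS_PER_COACH_VISIT
            + workshops * HOURS_PER_WORKSHOP)


# Hypothetical teacher: 4 information sessions (assumed), 10 visits
# (the reported median), 2 afternoon workshops (within the 1-to-3 range).
example_hours = coaching_exposure_hours(info_sessions=4, coach_visits=10,
                                        workshops=2)
print(example_hours)  # 39.0
```

A teacher at the median number of visits lands slightly above the reported average, which fits the observation that some teachers received far fewer visits.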
2.2 Theoretical framework
Leaving aside the possibilities for improving education through more appropriate schooling curricula, there is much that can be done through improving the delivery of existing curricula by teachers. In South Africa, for example, there is documented evidence of highly incomplete curriculum coverage (Taylor, 2011), ineffective curriculum sequencing and pacing by teachers (Hoadley, 2010), and the avoidance of more complex learning and teaching activities (Prinsloo, 2008). In the majority of classrooms there is a significant gap between existing practice and what is prescribed in the curriculum.19 Specifically when it comes to the teaching of reading, there are dominant norms of practice involving an over-reliance on teacher-directed strategies and whole-class activities, such as "chorusing", to the exclusion of the systematic teaching of decoding and other components of learning to read and the exclusion of opportunities for individual children to practice reading (Hoadley, 2010). To some extent, these ineffective practices may be linked to a lack of home language reading materials available in classrooms.
Changing ingrained teaching practices on a large scale presents a significant challenge. Simply running training workshops (on their own), as is often the approach to in-service teacher development, assumes that teachers will update their knowledge and then go away and change their classroom practice in the intended fashion. On the other hand, if one simply tries to address teacher motivation through increased accountability and strengthened incentives, one may be constrained by the extent to which teachers are able to respond in ways that actually lead to improved learning. These approaches are naive and fail to take account of the emotional and practical obstacles to implementing new practices and routines.
Both interventions of this study are built around the assumption that teaching is a skill that needs to be developed through regular practice, and teachers might need additional guidance and support to ensure consistent and correct application of the new techniques. Skill acquisition could lead to a sustained change in behavior, either by increasing the marginal product of effort for intrinsically motivated teachers (who now see the fruits of their labor), or by reducing the marginal cost of effort (once-difficult tasks now become easy to implement).
The lesson plans provide several mechanisms for ensuring that the methods are actually implemented and the topics are covered at the appropriate pace and sequence. Firstly, the provision of fully scripted lesson plans can reduce the effort cost of transition to a new set of practices, since teachers do not need to develop daily plans themselves. Secondly, even before a teacher has a deep understanding of the methods or curriculum topics, the lesson plans prompt enactment, thus creating the possibility for learning by doing. In this way, the regular routines embedded in the lesson plans foster an iterative relationship between knowing and doing through which the teacher's own instructional repertoire is expanded. Lesson plans also provide a way to ensure that new reading materials are used and are integrated into a lesson in a coherent way; some research indicates that additional resources are often ineffective since they are not used properly, if at all (Van der Berg, 2008). Lastly, lesson plans provide a focus for the entire intervention, guiding not only the use of time and materials but also providing a point of focus for all training or coaching interactions.
In these ways, lesson plans can be viewed as providing a set of mechanisms to encourage correct implementation of the curriculum and of what is taught at training sessions. This is an important perspective informed by a more nuanced understanding of the process of changing teaching practice at scale, rather than viewing it simplistically as either a matter of building capacity (through increased knowledge) or increased motivation (through, for instance, accountability or incentives).
A significant initial dose of training might be important if a thorough conceptual understanding of new topics and methods is necessary before effective implementation is possible. However, there may be other practical and emotional constraints to introducing a new set of routines and activities into an existing classroom space. The coaching intervention, whilst not relying on deep knowledge before implementation, does provide an additional set of mechanisms to ensure that new methods are being attempted (somebody is there to observe, thus playing a monitoring role), to facilitate an evaluation of how new practices are being implemented, and to encourage re-implementation in a better way, both through guidance and through coaches modelling best practice themselves.
19 This is because the curriculum has been revised several times in recent decades, but most teachers were not properly trained to implement new methods and did not have all the necessary reading materials.
3 Evaluation Design
3.1 Sampling and Random Assignment
The study is set in two districts in the North West Province, in which the main home language is Setswana. We chose this province since it is relatively homogeneous linguistically and is one of the poorer provinces in South Africa. Our sample is restricted to non-fee public schools that use Setswana as the main language of instruction and were identified as unlikely to practice multi-grade teaching.20 We randomly drew a sample of 230 schools from this population and created 10 strata of 23 similar schools based on school size, socio-economic status, and previous performance in the national standardized exam, called the Annual National Assessments (ANA). Within each stratum we then randomly assigned 5 schools to each treatment group and 8 to the control group. All treatment schools, with the exception of one in the Coaching arm, agreed to participate in the program.
We included the one Coaching school that did not agree to participate in the sample of treatment schools.21 We chose to exclude schools that practice multi-grade teaching, since the interventions are grade-specific and unlikely to work in such settings, but we were unable to exclude all such schools ex ante: roughly 6 per cent of grade two teachers in each treatment arm reported teaching pupils from multiple grades in the same classroom. For the sake of transparency we report results on both the full sample and the restricted sample that excludes pupils who were taught in a multi-grade setting.

20 Approximately 65% of South African children attend non-fee schools. Schools serving communities with higher socio-economic status are allowed to charge fees, but receive a smaller government subsidy as a consequence.
21 The full evaluation also included a third treatment arm focused on parental involvement, the results of which we will discuss in a separate paper.

3.2 Data collection

We visited each school three times: once prior to the start of the interventions (February 2015), again after the first year of implementation (November 2015), and finally at the end of the second year (November 2016). During these school visits we administered four different survey instruments: a pupil test on reading proficiency and aptitude conducted on a random sample of 20 pupils who entered grade one at the start of the study, a school principal questionnaire, a teacher questionnaire, and a parent/guardian questionnaire. We assessed the same pupils in every round of data collection, but surveyed a different set of teachers between midline and endline, because pupils generally have different teachers in different grades. Finally, we also conducted lesson observations on a stratified random sub-set of 60 teachers in September 2016. The data-collection and data-capture organizations are independent of the implementing organization and research team, and were blind to the treatment assignment. We registered a pre-analysis plan at the AEA RCT registry in October 2016, before we had access to the endline data.

3.2.1 Pupil assessment

The pupil test was designed in the spirit of the Early Grade Reading Assessment (EGRA) and was administered orally by a fieldworker to one child at a time. The letter recognition fluency, word recognition fluency and sentence reading components of the test were based on the Setswana EGRA instrument, which had already been developed and validated in South Africa. To this, we added a phonological awareness component in every round of assessment. The baseline instrument did not include all the same sub-tasks as the midline/endline instruments, because of the different levels of reading proficiency expected over a two-year period. For baseline, we also included a picture comprehension (or expressive vocabulary) test, since this was expected to be an easier pre-literacy skill testing vocabulary, and thus useful for avoiding a floor effect at the start of grade 1 when many children are not expected to read at all. Similarly, we included a digit span memory test.22 The logic of including this test of working memory is that it is known to be a strong predictor of learning to read and would thus serve as a good baseline control to improve statistical power. For the midline and endline, we added a writing and a paragraph reading sub-task.
Out of the 3,539 pupils surveyed at baseline, we were able to re-survey 2,951 at endline, yielding an attrition rate of 16.6 per cent. The attriters had either moved school (90 per cent of attriters) or were absent on the day of assessment (10 per cent of attriters). Moreover, an additional 13 per cent of our original sample were repeating grade one. Figure 2 shows the breakdown of groups. Column (1) in Table A.1 regresses attrition status on the treatment assignment dummies, after controlling for stratification. It shows there is no statistically significant difference in attrition rates across treatment arms. Columns (2) and (4) show that the attriters are slightly older and less likely to be female, but columns (3) and (5) show that the reduced sample remains balanced on these two indicators. Column (6) shows that the attriters did not perform significantly better or worse on the baseline reading tests.

22 This involved repeating from memory first two numbers, then three, and so forth up to six numbers, and the same 5 items for sequences of words.

3.2.2 Survey data and document inspection

The teacher survey contained questions on basic demographics (education, gender, age, home language), teaching experience, curriculum knowledge, and teaching practice. For curriculum knowledge, we asked how frequently the teacher performs the following activities: group-guided reading, spelling tests, phonics, shared reading, and creative writing. The prescribed frequency of performing these activities is stipulated in the government curriculum and also reflected in the lesson plans. Performing these activities at the appropriate frequency is thus a measure of knowledge and mastery of the curriculum, as well as fidelity to the scripted lesson plans. Note that even if there is a risk of social desirability bias, these measures still accurately capture knowledge of the appropriate routines, since some activities are supposed to take place infrequently.23 The questions on teaching practice covered important pupil-teacher interactions that flow from group-guided reading: whether teachers ask pupils to read out loud, provide one-on-one assessment, and sort reading groups by ability. Finally, the teacher survey also included a voluntary comprehension test, which was completed by 75, 89, and 98 per cent of teachers who completed the teacher survey at baseline, midline and endline respectively.

In the endline, we have teacher survey data for 275 teachers in 175 schools. As a result, for 81 per cent of the 2,951 pupils assessed at endline, we also have data on their teacher.24 In column (8) of Table A.1 we regress an indicator for whether a pupil's teacher also completed the teacher survey on the treatment assignment dummies. We see that teacher non-response was unrelated to treatment assignment.

We also conducted classroom and document inspection for the surveyed teachers. Fieldworkers counted the number of days that writing exercises were completed in the exercise book, and the number of pages completed in the government workbook.25 To minimize the risk of bias due to strategic selection of exercise books and workbooks, the teacher was asked to provide the books of one of the most proficient pupils in his/her class. Furthermore, fieldworkers indicated whether the teacher had a list of the reading groups, and rated on a 4-point Likert scale the sufficiency and quality of the following print material: a reading corner (box library), graded readers, Setswana posters, and flashcards.
The school principal survey includes basic demographic questions, questions on school policies, school location, school access to resources, and a rough estimate of parent characteristics: the language spoken most commonly in the community, and the highest level of education attained by the majority of parents.

23 With social desirability bias, we would expect teachers to say that they perform all activities more frequently.
24 We cannot tell what proportion of teachers did not respond, because children are randomly drawn at the school level, so we do not know how many teachers the pupils with missing teacher data would have matched with.
25 To reduce data capture error, we asked the fieldworker to count only the pages completed for three specific days. We chose three days that should have been covered by teachers by the end of the year, regardless of their choice of sequencing.

3.2.3 Lesson observations

To gain a better understanding of how teaching practice changed in the classroom, we also conducted detailed lesson observations in October 2016 in a stratified random subset of 60 schools, 20 schools per treatment arm. We observed the lesson of one teacher per school. We stratified by school-average pupil reading proficiency in order to ensure representation across the distribution of school performance. We also over-sampled urban schools, where the impacts of the programs were largest at midline.26 An expert on early-grade reading developed the classroom observation instrument, in close consultation with Class Act and the evaluation team. The instrument covered teaching and classroom activities that we expect to be influenced by the program. For example, the fieldworkers were required to record the number of pupils who read or handle books; the number of pupils who practice the different building blocks of reading (e.g. vocabulary development, phonics, word/letter recognition, reading sentences or extended texts); how reading is practiced in the classroom (e.g. read individually or in a group; read silently or aloud); and the frequency and types of writing activities taking place. The instrument also captured student-teacher interactions related to group-guided reading: whether reading groups are grouped by ability, how frequently pupils receive individual feedback from the teacher, and how frequently pupils are individually assessed. This final set of indicators mirrors the questions that were asked in the teacher survey.

The instrument was very detailed but, unlike some lesson observation instruments, did not require the fieldworkers to record the time devoted to different activities. Rather, questions related to the frequency of different activities were generally coded on a Likert scale.27 Since it was a detailed and comprehensive instrument, we decided to limit ourselves to six qualified fieldworkers, all of whom were proficient in Setswana and had at least a bachelor's degree in education. To further ensure consistency across fieldworkers, the project manager visited at least one school with each of the fieldworkers at the start of the data collection, and data quality checks were conducted after two days of data collection. After the completion of the lesson observations, the fieldworkers also asked the teachers some questions about the type of teaching support they had received in the past year. These were open-ended questions, which allowed us to code whenever a teacher mentioned receiving training or coaching from Class Act, or using its graded readers or scripted lesson plans.

26 In particular, we randomly drew schools from each treatment group in the following manner: (i) six urban schools; (ii) five schools in the top tercile and five schools in the bottom tercile in terms of average performance across both baseline and midline; (iii) four schools in the top tercile in terms of the largest improvement between baseline and midline.

3.2.4 Administrative data

To add precision to our estimates, we further complemented these survey measures with 2011 census data and results from a standardized primary school exam conducted in 2014.
From the 2011 census, we constructed a community wealth index derived from several questions about household possessions, and we also calculated the proportion of 13 to 18 year-olds in the community who are attending an educational institution.28 We also have data on each school's quintile in terms of socio-economic status, as coded by government.

27 For example, when coding the frequency of different types of reading activities, the fieldworkers recorded: never, sometimes, mostly, and always.
28 We acknowledge Stellenbosch University, and Asmus Zoch in particular, for constructing the dataset linking census data to schools data.

3.2.5 Aggregation of indicators

In order to minimize the risk of over-rejection of the null hypotheses due to multiple different indicators, we aggregated data in the following ways. First, for our main outcome measure of success, reading proficiency, we combined all the sub-tasks into one aggregate score using principal components. We did this separately for each round of assessment. For the midline and endline scores, we used the factor loadings of the control group to construct the index. This score was then standardized into a z-score by subtracting the control group mean and dividing by the control group standard deviation. The treatment impact on the aggregate score can thus be interpreted in terms of standard deviations.

Furthermore, we grouped the potential mediating factors of changed teaching practice and classroom environment into five broad categories that are theoretically distinct inputs into learning to read: (i) curriculum coverage; (ii) fidelity to the routine specified in the curriculum; (iii) teacher-pupil interactions related to group-guided reading; (iv) frequency of practicing different reading activities; and (v) pupils' use of reading materials. For each category we created a mean index, using the method proposed by Kling, Liebman and Katz (2007), which is an average of the z-scores of all the constituent indicators.

3.3 Balance and descriptive statistics

Table 1 shows balance and basic descriptive statistics of our evaluation sample. Each row represents a separate regression of the baseline variable on treatment assignments and strata dummies, clustering standard errors at the school level. The first column indicates the mean in the control. Columns (2) and (4) indicate the coefficient on the treatment dummies. Column (6) reports the number of observations, and column (7) reports the p-value for the test of equality between Training and Coaching.

Our sample of schools comes predominantly from poor communities: 46.3 per cent of schools are in the bottom quintile in terms of socio-economic status, and 85 per cent are from rural areas. In only 44 per cent of schools do the majority of parents have a high school degree or higher. In almost all schools the main language spoken in the community is Setswana. A small fraction of classrooms ended up being multi-grade classrooms (6.2 per cent of grade two classes). We were thus not perfectly able to identify and exclude ex ante all schools that do multi-grade teaching. The teachers are mostly female and are educated: 85 and 95 per cent of the grade one and two teachers respectively have a degree or diploma. Nonetheless, reading comprehension levels are low: the average score for the simple comprehension test is 66 per cent. We observe slight imbalance on baseline pupil reading proficiency and the school community's socio-economic status for the Training treatment arm.
We control for all these variables in the main regression specification.

3.3.1 Sub-sample where we conducted lesson observations

Table A.2 compares the sample where we conducted the lesson observations with the full evaluation sample. In each column we regress a different dependent variable on a dummy variable indicating whether the pupil/school is in the sample where we conducted the lesson observations. In columns (1) to (4) the data is at the individual level; in column (5) the data is at the school level. In column (1) the dependent variable is midline reading proficiency, including the full set of controls used in the main analysis (equation 1, below). A significant coefficient could thus be interpreted as the 'value-added', over and above the average learning trajectory of a pupil. Columns (1) to (4) in Table A.2 show that there is no statistically significant difference between schools where we conducted the lesson observations and the rest of our evaluation sample, both in terms of pupil reading proficiency evaluated at baseline, midline and endline, and a value-added measure between baseline and endline. As expected given our sampling strategy, a far higher proportion of schools where we conducted lesson observations are urban: 36.7 per cent, compared to 20 per cent in our overall sample. Figure ?? in the appendix further shows that the distribution of baseline and endline pupil reading proficiency is very similar when comparing the lesson observation sample with the rest of the evaluation sample. When conducting the Kolmogorov-Smirnov equality of distribution test for the baseline and endline measures of reading proficiency, we cannot reject the null that the distributions are the same.
In addition, Table A.3 shows that the reduced sample where we conduct our lesson observations is balanced between treatment groups.
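For readers unfamiliar with the equality-of-distribution test mentioned above, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical gap between the two empirical distribution functions. The function name and the toy score lists are hypothetical, and the sketch stops at the test statistic: it does not compute the p-value on which the conclusion in the text rests.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


# Hypothetical standardized reading scores for the two groups of schools.
observed_sample = [-0.4, 0.1, 0.3, 0.9]
rest_of_sample = [-0.5, 0.0, 0.2, 0.4, 1.1]
print(ks_statistic(observed_sample, rest_of_sample))
```

A statistic close to zero indicates that the two empirical distributions nearly coincide, whereas a value of one means the samples do not overlap at all.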
3.4 Empirical Strategy

Our main estimating equation is:

$$y_{icsb1} = \beta_0 + \beta_1(\text{Training})_s + \beta_2(\text{Coaching})_s + X_{icsb0}'\Gamma + \rho_b + \varepsilon_{icsb1}, \tag{1}$$

where $y_{icsb1}$ is the endline (end of second year) aggregate score of reading proficiency for pupil $i$ who is taught by a teacher in class $c$, school $s$ and stratum $b$; $(\text{Training})_s$ and $(\text{Coaching})_s$ are the relevant treatment dummies; $\rho_b$ refers to strata fixed effects; $X_{icsb0}$ is a vector of baseline controls; and $\varepsilon_{icsb1}$ is the error term, clustered at the school level. In order to increase statistical power, we control separately for each domain of reading proficiency collected at baseline: vocabulary, letter recognition, working memory, phonological awareness, word recognition, words read, and sentence comprehension. To further increase statistical power and account for any incidental differences that may exist between treatment groups, we control for individual and community-level characteristics which are highly correlated with $y_{icsb1}$ or were imbalanced at baseline.29 Where data is missing for some observations of the control variables, we imputed missing values and added a dummy indicating missingness as a control.30
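The treatment of missing control variables can be sketched as follows, assuming (as the accompanying footnote states) that missing categorical values are set to zero and missing continuous values to the sample mean, with a separate dummy flagging the imputed observations. The function name, the use of None to mark missing data, and the example values are hypothetical.

```python
def impute_with_flag(values, kind):
    """Impute one control variable and build its missingness dummy.

    values: list in which None marks a missing observation.
    kind: "continuous" (fill with the sample mean of observed values)
          or "categorical" (fill with zero).
    Returns (imputed_values, missing_flags).
    """
    flags = [1 if v is None else 0 for v in values]
    observed = [v for v in values if v is not None]
    if kind == "continuous":
        if not observed:
            raise ValueError("cannot take the mean of an all-missing variable")
        fill = sum(observed) / len(observed)
    elif kind == "categorical":
        fill = 0
    else:
        raise ValueError(f"unknown kind: {kind}")
    imputed = [fill if v is None else v for v in values]
    return imputed, flags


# Hypothetical community wealth index with one missing observation.
wealth, wealth_missing = impute_with_flag([1.0, None, 3.0], "continuous")
print(wealth, wealth_missing)
```

Both the imputed variable and its dummy would then enter the vector of baseline controls, so that observations with missing data are retained without the imputed value driving the estimated coefficient.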
4 Results

4.1 Quality of implementation

As a first step in our analysis, we examine the quality of implementation. Panels (a) to (c) in Figure 3 and rows (1) to (4) in Table 2 show results from the teacher questionnaire administered to all teachers in the evaluation sample. Panels (d) to (f) in Figure 3 and rows (4) to (6) in Table 2 show results from the in-depth teacher survey conducted in a sub-set of 60 schools. We see that the program was well-implemented: 97 and 94 per cent of teachers in the Training and Coaching arms respectively state that they received in-service training on teaching Setswana as a home language during that year. The support was also generally well-received: 45 and 66 per cent in the Training and Coaching arms respectively state that they received very good support in teaching Setswana, relative to 17 per cent in the Control.31 Teacher satisfaction also increased in the Coaching arm: teachers who received Coaching are 28.4 percentage points more likely to strongly agree with the statement: "I feel supported and recognized for their work". Moreover, results from the sample of teachers interviewed during the lesson observations reveal that exposure to the program was high: 90 per cent and 85 per cent of the regular grade 2 teachers in the Coaching and Training arms respectively state that they use the Class Act scripted lesson plans; 90 and 95 per cent respectively use the program's graded readers; and 85 per cent of teachers in the Coaching arm reported that they were visited by the program's reading coach that year.

29 The additional controls include: pupil gender, pupils' parents' education, a district dummy (schools were randomly spread across two districts), performance in the most recent standardized Annual National Assessments (ANA), a community-level wealth index, and the average secondary school attendance rate in the community surrounding the school.
30 For categorical variables, we set missing values to zero; for continuous variables, we set missing observations equal to the sample mean.
31 Interestingly, though, teachers in the Coaching arm are more likely to state that they received too much support.

4.2 Impacts on learning

Next we turn to the mean impacts of the programs on pupil reading proficiency, our main outcome of interest. Table 3 shows the regression results on our aggregate score of reading proficiency, estimated using equation (1). As recommended by Athey and Imbens (2017), all the p-values reported in Table 3 are constructed using randomization-based inference. We see from column (1) that Training and Coaching improved aggregate learning by 0.12 and 0.24 standard deviations respectively. Column (2) shows that, for both treatment arms, the impacts are larger when we exclude pupils in multi-grade classrooms: 0.18 and 0.29 standard deviations respectively. The program was never expected to be effective in such settings.32 In this sample the impact of Training is statistically significant. Moreover, column (3) shows that the impacts are also larger when we exclude repeaters. These are pupils who had shorter exposure to the program, because they were not taught by the treated teachers in the second year.

Table 4 further unpacks the results, looking separately at each domain of reading proficiency that constitutes the aggregate score. For the sake of comparability, we also standardize these measures to have a mean of zero and a standard deviation of one in the control. It is encouraging that the magnitude of the impact is roughly the same across all the domains of reading proficiency in the Coaching arm. Even though the impact is slightly smaller in the case of writing (0.15 standard deviations), it remains statistically significant. For the Training arm, we only see statistical significance for phonological awareness and non-word decoding. The largest difference between Training and Coaching is in comprehension (p = 0.011).
This is arguably the most important indicator, since the ultimate goal of literacy is reading with comprehension. By restricting the analysis to common items that were in both baseline and endline (letter and word recognition), we can also report results in terms of years of learning. Table 5 shows the average improvement in the control schools between baseline and endline for common constructs, as well as the respective treatment impacts at endline. The impact of Coaching is equivalent to 15 and 22 per cent of the gains in letter and word recognition that took place in the control over the two years of the program. The impact of Training is 5 and 10 per cent of the gains in letter and word recognition respectively. Moreover, for constructs that were not assessed at baseline, we can place an upper bound on learning by assuming that everyone in the control would have scored zero on the test at baseline. Given this method, the gains in reading comprehension in the Coaching treatment arm are at least 24 per cent higher than in the control.

32 In our sampling strategy we tried to exclude these schools.

4.3 Cost-effectiveness analysis

Since we found that the more costly program is more effective, it is important to determine which intervention is more cost-effective. For this purpose, we follow the approach of Dhaliwal et al. (2013) and calculate the ratio of gains to costs, i.e. the improvements in literacy per dollar spent.33

33 This implicitly assumes that the marginal social return to improving the aggregate score is constant and does not depend on the underlying distribution of reading proficiency. It also assumes that the social planner is risk-neutral and places
We consider two different outcomes: aggregate reading proficiency and the probability that a pupil can reach a minimum threshold of comprehension.34 We consider the latter indicator because reading with comprehension is arguably the ultimate goal of literacy development. For cost estimates, we use the program budget for the third year of implementation. We choose the third year because by this point many of the set-up challenges have been resolved and fixed costs have been paid (all the materials have already been developed). One would therefore not expect the per-pupil cost to be much different when the program is scaled up to more schools. Based on these estimates, the per-pupil costs of the Training and Coaching programs are 48 USD and 62 USD respectively per year.35 Coaching is roughly twice as effective at improving aggregate reading proficiency, but less than twice as expensive. It is thus more cost-effective. In particular, Coaching improves reading proficiency by 0.0041 standard deviations per dollar spent per pupil per year, compared to a 0.0023 increase in the case of Training. Moreover, Training and Coaching improved the proportion of literate pupils by 2 and 9.3 percentage points, respectively. Coaching is thus substantially more cost-effective at improving reading comprehension, with a 0.15 percentage point increase in the probability that a pupil can read with comprehension per dollar spent on a pupil per year, compared to a 0.04 percentage point increase in the Training arm.
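The arithmetic behind these ratios can be retraced with a short sketch. It assumes that the lower of the two budgets reported in the cost footnote belongs to Training and the higher to Coaching, as implied by the statement that the more costly program is the more effective one, and it uses the rounded point estimates quoted in the text. All variable and function names are hypothetical.

```python
SCHOOLS_PER_ARM = 50
PUPILS_PER_SCHOOL = 74.6  # average at the start of the program

# Assumption: the lower budget is Training's and the higher is Coaching's.
# Effects are the rounded estimates quoted in the text.
PROGRAMS = {
    "training": {"total_cost_usd": 179_853, "effect_sd": 0.12, "effect_pp": 2.0},
    "coaching": {"total_cost_usd": 230_860, "effect_sd": 0.24, "effect_pp": 9.3},
}


def per_pupil_cost(total_cost_usd):
    """Annual cost per pupil, given the total budget for one arm."""
    return total_cost_usd / (SCHOOLS_PER_ARM * PUPILS_PER_SCHOOL)


def gain_per_dollar(effect, total_cost_usd):
    """Effect size divided by the annual per-pupil cost."""
    return effect / per_pupil_cost(total_cost_usd)


for name, p in PROGRAMS.items():
    cost = per_pupil_cost(p["total_cost_usd"])
    sd_ratio = gain_per_dollar(p["effect_sd"], p["total_cost_usd"])
    pp_ratio = gain_per_dollar(p["effect_pp"], p["total_cost_usd"])
    print(f"{name}: {cost:.0f} USD per pupil, "
          f"{sd_ratio:.4f} SD per USD, {pp_ratio:.2f} pp per USD")
```

With these inputs the per-pupil costs round to 48 and 62 USD and the comprehension ratios to 0.04 and 0.15 percentage points per dollar, matching the figures above. The standard-deviation ratios come out near 0.0025 and 0.0039 rather than exactly 0.0023 and 0.0041, presumably because the reported ratios are based on unrounded estimates.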
4.4 Changing teaching practice

In this section we further investigate underlying mechanisms by measuring how the learning environment, teaching practice, and classroom activities changed as a result of the program. For this purpose we draw on three different data sources: the teacher survey and document inspection administered for the full evaluation sample of teachers, and lesson observations conducted in a stratified random sub-set of 60 schools.

As discussed in section 3, we grouped the potential mediating factors into five broad categories: (i) curriculum coverage; (ii) adherence to the teaching routine as prescribed in the curriculum; (iii) teacher-pupil interactions related to group-guided reading; (iv) frequency of practicing reading; and (v) pupils' use of reading material. The regression results are reported in Tables 6 to 8. For each category we construct a mean index that aggregates all the constituent indicators, as proposed by Kling, Liebman and Katz (2007). Many of the indicators are ordinal variables, but for ease of interpretation we report results for adapted binary variables. Results on statistical significance remain the same when running an ordered logit model on the ordinal variables, and the mean index is constructed using the ordinal variables, thus preserving all the information captured by fieldworkers. Our estimating equation is now:
$$M_{cs} = \alpha_1 + \beta_1(\text{Training})_s + \beta_2(\text{Coaching})_s + \rho_b + \varepsilon_{csb1}, \tag{2}$$
an equal weight on each pupil. Ideally we would calculate the welfare gains, but there are no credible causal estimates in South Africa of the long-term financial/social returns from improving early-grade literacy.
34 We construct a binary indicator, which is equal to one if a pupil got at least two of the four comprehension questions right.
35 Total costs for implementing the program in 50 schools are 179,853 USD and 230,860 USD in the Training and Coaching arms respectively. Given an average of 74.6 pupils per school at the start of the program, this amounts to per-pupil costs of 48 USD and 62 USD respectively.
where $M_{cs}$ is the mediating variable of interest for a teacher in class $c$ and school $s$. Standard errors are clustered at the school level for teacher survey data.36 With classroom observation data we also include day fixed effects to account for the fact that not all teaching activities observed were supposed to take place on a daily basis.37 Results are robust to the exclusion of day fixed effects.

To summarize the results, we observe that teachers in the treatment arms are more likely to follow the routine required by the lesson plans. Teachers were more likely to practice group-guided reading, with the largest change observed for teachers who received Coaching. Consequently, pupils were more likely to receive individual attention from a teacher and opportunities to practice reading aloud. Finally, even though there is no difference between Coaching and Training in access to resources, we find that far more pupils in the Coaching arm are reading the graded reading booklets.

(i) Curriculum coverage

Columns (1) to (5) in Table 6 show treatment impacts on curriculum coverage, as captured during document inspection. Fieldworkers inspected the exercise book and workbook of the most proficient pupil in the class,38 and then counted the number of days that writing exercises were completed in the exercise book, and the proportion of pages completed in the government workbook.39 Overall we see that there was a statistically significant increase in curriculum coverage of similar magnitude for both the Training and Coaching arms. The only difference between the two treatment arms is that teachers who received Coaching completed a higher proportion of pages in the workbook, and teachers who received Training completed more exercises in the exercise books. But the impact on the mean index is of similar magnitude.
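The mean index used throughout this section can be sketched in a few lines. The sketch assumes that each indicator is standardized against the control group's mean and sample standard deviation (mirroring how the reading score is standardized), that all indicators are signed so that higher values are better, and that no values are missing. The function names and the toy data are hypothetical.

```python
from statistics import mean, stdev


def control_zscores(values, is_control):
    """Standardize one indicator using the control group's mean and SD."""
    control = [v for v, c in zip(values, is_control) if c]
    mu, sigma = mean(control), stdev(control)
    if sigma == 0:
        raise ValueError("indicator has no variation in the control group")
    return [(v - mu) / sigma for v in values]


def mean_index(indicators, is_control):
    """Average the standardized indicators unit by unit.

    indicators: list of columns, one list of values per indicator.
    is_control: list of booleans, True for control-group units.
    """
    z_columns = [control_zscores(col, is_control) for col in indicators]
    return [mean(unit) for unit in zip(*z_columns)]


# Hypothetical data: two indicators observed for four teachers,
# of whom the first two are in the control group.
days_written = [1.0, 2.0, 3.0, 4.0]
pages_completed = [10.0, 20.0, 30.0, 40.0]
in_control = [True, True, False, False]
index = mean_index([days_written, pages_completed], in_control)
print(index)
```

By construction the index averages to zero in the control group, so a treatment coefficient on it can be read in control-group standard deviations.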
(ii) Teaching routine

In the teacher survey, we asked teachers how frequently they perform different types of teaching activities on a weekly basis: group-guided reading, spelling tests, phonics, shared reading, and creative writing.40 The frequencies of doing these activities are clearly stipulated in the government curriculum, so in principle the teachers in the Control should be performing them at the same frequency. Rows (1) to (6) in Table 6 indicate that teachers in both Training and Coaching schools are more likely to perform each activity at the appropriate frequency, especially teachers who received Coaching. Moreover, the p-value reported in column (7) shows that the difference between Coaching and Training is statistically significant. It is important to note that the treated teachers are not stating that they are more likely to perform all activities. They are more likely to perform activities that are required to be performed on a daily basis (group-guided reading and phonics), but state that they are less likely to perform the activity that should take place only once a week (correcting spelling). These results can therefore not be attributed purely to social desirability bias. At the very least, they show that treated teachers have better knowledge of the appropriate routine they should follow.

36 We only observed one teacher per school in the classroom observations, so there is no need to cluster our standard errors at the school level. But we surveyed all the grade 2 teachers in each school, often more than one teacher per school.
37 According to the lesson plans, creative writing is supposed to take place on Fridays, which provides fewer opportunities to practice reading.
38 This was done to minimize the risk of bias due to strategic selection of exercise books and workbooks.
39 To reduce data capture error, we asked the fieldworker to count only the pages completed for three specific days. We chose three days that should have been covered by teachers by the end of the year, regardless of their choice of sequencing.
40 Options were: less than once a week, once a week, 2-4 times a week, every day, twice a day.
(iii) Group-guided reading

Next we dig deeper to unpack the types of teaching activities related to group-guided reading, an activity that teachers in both the Training and Coaching arms report performing more frequently. There are three components of group-guided reading that we consider important (and practically measurable) for learning: individual attention from teachers, individual assessment, and sorting reading groups by ability. We asked about each of these indicators separately in the teacher questionnaire, and also measured these activities during the lesson observations.

Rows (1) to (5) in Table 7 show results from the teacher survey. There was an overall increase for both treatment arms in the activities that relate to group-guided reading, with a consistently larger impact for Coaching relative to Training. First, as a confirmation of the self-reported increase in conducting group-guided reading, we find that program teachers are more likely to provide a list of reading groups relative to the control (16.8 and 34.4 per cent in the Training and Coaching arms respectively), and this impact is significantly larger for teachers who received Coaching. We further find that teachers who received Coaching are more likely, compared to Training and Control teachers, to state that they frequently listen to pupils read out loud and perform one-on-one reading assessments.41 Teachers in both Training and Coaching are more likely to state that they stream groups by ability.

The results from the teacher survey provide strong evidence that group-guided reading was far more likely to take place in both treatment arms, with the largest increase observed for teachers who received Coaching. Moreover, the larger change seems to come from individual attention, rather than streaming. However, these results are all self-reported. To test whether these practices actually changed in the classroom, we next turn to results from the lesson observations.
Rows (6) to (11) in Table 7 show that the results from the teacher survey on group-guided reading are broadly supported by the lesson observations: there is a large, statistically significant increase in the mean index of 0.63 and 0.76 standard deviations in the Training and Coaching groups respectively. Looking at the different components of group-guided reading, we see a large increase in the Coaching arm in the probability that pupils read aloud in groups (39.7 percentage point increase) and that pupils read individually to the teacher (48.5 percentage point increase).42 The impact on these two indicators is smaller for the Training arm, and not always statistically significant. However, we do not find strong evidence of any improvement in the probability of providing individual assessment or grouping by ability.43
Note that not all types of reading activities are more likely to take place. For the sake of comparison, rows (12) to (14) show that teachers are no more likely to perform whole-class reading, where the whole class reads aloud with the teacher. Teachers are also neither more nor less likely to read aloud with the pupils following silently. Whole-class reading is an easy activity to perform in the classroom, and almost all teachers in the control are already doing it.
41 Original variables are ordinal, ranging from 1 "Never" to 5 "Nearly every day".
42 These indicators were first recorded as ordinal variables ranked from 1 to 4. For ease of interpretation we created a binary indicator for each of these two indicators, indicating if any activity took place.
43 There is a small increase in the probability of providing individual assessment, which is statistically significant only in the Training arm. Teachers that received Coaching are 24.7 percentage points more likely to have different reading groups assigned to different graded readers (compared to a 9.2 percentage point increase for teachers that received Training), but the difference is not statistically significant.
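The recoding described in footnote 42 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the lowest code (1) on the 1-to-4 scale means that no activity took place, and the function name and the sample ratings are hypothetical.

```python
def any_activity(rating, none_code=1):
    """Collapse an ordinal rating into a binary 'any activity took place' indicator.

    Assumption: `none_code` (here 1) is the category meaning 'no activity'.
    Missing ratings stay missing instead of being coded as zero.
    """
    if rating is None:
        return None
    return int(rating > none_code)


# Hypothetical fieldworker ratings for six observed lessons (1 = none ... 4 = most).
ratings = [1, 3, 2, 1, 4, None]
indicators = [any_activity(r) for r in ratings]

# Share of lessons with any activity, ignoring missing ratings.
observed = [x for x in indicators if x is not None]
share_with_activity = sum(observed) / len(observed)
```

With a binary outcome of this kind, a regression coefficient on a treatment dummy reads directly as a change in probability, which is why the text reports percentage point increases.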
(iv) Practicing reading and phonics
Results from rows (1) to (7) in Table 8 show that pupils are no more likely to practice reading in the classroom because of the programs, nor is there any evidence that teachers are more likely to teach phonics.44 Although the mean index for reading frequency is not significant, we see in rows (8) and (9) that pupils in both the Training and Coaching arms are more likely to read extended texts. This is not surprising, since practicing phonics and individual letters/words is typically done through whole-class reading, whereas reading of extended texts is supposed to take place during group-guided reading. This is also consistent with our finding that Training had no impact on the sub-indicators of reading proficiency that are taught during group-guided reading: reading paragraphs and comprehension.
(v) Pupil use of reading material
Rows (11) to (13) in Table 8 report results on the use of books and reading material. During the lesson observations, fieldworkers were required to count how many pupils had an opportunity to hold books (excluding the government workbooks) and how many pupils read the graded reading books. Even though there was no difference in access to the graded readers between any treatment arm and the control, we see a substantial increase in the use of reading material, especially in the number of children who have opportunities to read. The average number of pupils who read the booklets increased by 2.2 and 5.2 in the Training and Coaching arms respectively. The difference between Coaching and Training is statistically significant at the 5 percent level. This is important since teachers in the Training and Coaching arms were provided with exactly the same set of reading books.
4.5 Heterogeneous treatment impacts
Given the large average gains in pupil learning and teaching practice for both interventions, an obvious question is how these impacts depend on the characteristics of the teacher and the class.
It is not clear, ex ante, which types of teachers will be most positively influenced by the program. It is possible that the treatment impacts are larger for more qualified or competent teachers, because they are better able to apply the new teaching techniques specified in the lesson plans. On the other hand, it is possible that the programs have no impact for the best teachers, because they are already applying all the correct techniques. Similarly, we might observe no impact on learning for the more experienced teachers, as they are more reluctant to change behavior; yet we might observe the largest impact for experienced teachers, if they are more capable of applying the new techniques. The relationship between treatment impact and class size could also, in theory, go both ways. The benefit of individual attention that pupils receive through group-guided reading might be largest in the larger classes, where teachers find it difficult to provide individual attention otherwise. Yet teachers might find it too difficult to apply new teaching techniques in overly large classes.45
44 The fieldworkers were asked to record how many pupils in the classroom are involved with reading letters, words, sentences, or extended texts. The answers were recorded on a 5-point Likert scale, ranging from none to all the pupils. They also recorded the extent to which the teacher covers phonics on a 4-point Likert scale. As before, we construct binary variables for ease of interpretation (equal to one if at least some pupils are reading; and equal to one if the teacher teaches phonics at least some of the time).
45 During the exit survey some teachers did indeed complain that they found it difficult to implement the lesson plans in larger classes.
To test for heterogeneous treatment impacts, we estimate the following equation:
$$y_{icsb1} = \beta_0 + \beta_1(\text{Training})_s + \beta_2(\text{Coaching})_s + \beta_3(\text{Training} \times \Lambda)_c + \beta_4(\text{Coaching} \times \Lambda)_c + X'_{icsb0}\Gamma + \rho_b + \varepsilon_{icsb1}, \tag{3}$$
where $\Lambda_c$ is the moderating variable of interest and is now also included in the vector of baseline controls. We further re-weight the observations so that each teacher/class receives equal weight.46 Table 9 displays the regression results. Columns (1) to (4) show that treatment impacts do not depend on observable teacher characteristics, such as teacher qualifications, age, experience, and the number of books that the teacher has read in a year. Columns (6) and (7) in Table 9 show that, although there is no linear relationship between the number of pupils in a classroom and treatment impact, there is a strong non-linear (positive concave) relationship. Column (8) shows that this relationship is not driven by multi-grade schools. To further unpack this non-linear relationship between treatment impact and class size, we separately estimate the treatment impacts for each tercile of class size, using the following equation:
$$y_{icsb1} = \beta_{0,0} + \beta_{0,q}(\theta^q_c) + \beta_{1,q}(\text{Training} \times \theta^q_c)_c + \beta_{2,q}(\text{Coaching} \times \theta^q_c)_c + X'_{icsb0}\Gamma + \rho_b + \varepsilon_{icsb1}, \tag{4}$$
where $\theta^q_c$ is a dummy variable equal to one if class $c$ is in the $q$th tercile in terms of class size, $q \in \{1, 2, 3\}$. In our sample, the smallest third of classes have fewer than 38 pupils in a class, and the largest third of classes have more than 44 pupils. Figure 5 plots the relevant regression coefficients from equation 4. We see that for both interventions there is a concave positive relationship between treatment impact and class size, so that the smallest impact on learning is observed in the smallest classes. In fact, for neither intervention can we reject the null that it has no impact on learning in small classes. This might be because the benefits of individual attention brought on by group-guided reading are larger in the larger classes, and/or because learning is already high in the small classes. Indeed, Figure 5 shows that pupils in the control learnt 0.24 standard deviations more in small classes, relative to large classes (the omitted category in equation 4 is the largest classes). Moreover, whereas the impact of Coaching increases monotonically with class size, the impact of Training is largest in medium-sized classes (0.23 standard deviations) and more muted in the largest classes (0.17 standard deviations). In contrast, teachers who received Coaching have a similar impact in medium and large classes (0.39 and 0.4 standard deviations respectively). As a result, Coaching becomes increasingly more effective at improving learning, relative to Training, as class size increases.47
Since we observe large heterogeneous treatment impacts on learning by class size, we also investigate if there is a consistent trend in teaching practice. In Table 10 we regress the different indicators for group-guided reading on treatment assignment, class size, and the interaction with class size, controlling for stratum fixed effects.48 There is a statistically significant negative relationship between class size and the impact of Training on the probability that teachers practice group-guided reading. This is not the case for Coaching. A plausible interpretation of these results is that it is particularly challenging to implement group-guided reading in large classes, but teachers who have received Coaching are better able to overcome those challenges. As a result, the impact of Coaching relative to Training is larger in large classes.
46 We have the same number of pupils per school, but due to random sampling of pupils, we do not have the same number of pupils per teacher/class.
47 This pattern does not depend on the cut-off for class size. We also calculated non-parametric estimates of the treatment size, performing a kernel-weighted local polynomial regression of the predicted treatment size on class size percentile rank, which revealed the same trend. These figures are available upon request.
5 Discussion
Results from the teacher survey and lesson observations reveal three key insights. First, our results on the treatment impacts on group-guided reading suggest that coaches play an important role in the adoption of more technically challenging teaching techniques. Group-guided reading is particularly difficult to implement: teachers need to re-organize the classroom and keep the rest of the class busy while they provide targeted feedback to the smaller reading group. Indeed, during the exit surveys, teachers complained that group-guided reading is difficult, especially in larger classrooms, and that the training was too short for them to fully understand group-guided reading. Consistent with this, we find that teachers that received Coaching are more likely to conduct group-guided reading, compared to both Training and control. Moreover, teachers that received Training are less likely to implement group-guided reading in larger classrooms, but there is no similar trend for teachers who received Coaching. As a result, the difference in efficacy between Coaching and Training is more pronounced in larger classrooms.
Second, we believe that the individual attention that pupils receive, and the opportunities to individually practice reading, both provided through group-guided reading, are at least part of the explanation for the faster acquisition of reading proficiency in the Coaching arm.
Individual attention is arguably particularly important in contexts of large classrooms, as is common in developing countries, where children would otherwise receive very little individual attention. This interpretation is also consistent with a general conclusion drawn from education reviews that pedagogical interventions that allow teachers to target teaching to the level of the child are highly effective at improving learning (Evans & Popova 2015). Randomized evaluations have found that remedial education programs (Banerjee et al. 2007), additional teacher assistants (Duflo & Kiessel 2014), or contract teachers (Duflo et al. 2011) can improve test scores since they free up resources to provide additional attention to worse-performing pupils. What is encouraging in our context is that we show how re-organizing the classroom, without any additional resources, can also enable teachers to provide more individual attention.
Third, the substantial increase in the number of pupils reading the graded reading booklets in the Coaching arm relative to Training, despite equal access to these resources, reveals an important interaction between resources, teaching practice, and use of resources. The purpose of the graded readers is to provide individual opportunities to practice reading. Pupils are provided this opportunity during group-guided reading, an activity that teachers find challenging to implement. These resources therefore cannot be used without appropriate enactment of a new teaching method. Coaching thus enabled teachers to use the resources available to them more effectively. As further evidence of this, we find a strong correlation between the number of pupils reading a graded reader and whether the teacher practiced group-guided reading or not, even after controlling for treatment assignment and stratification.
48 We do not report results on lesson observations: the limited sample size makes it difficult to draw any definitive conclusions on heterogeneous treatment effects on teaching activities.
6 Conclusion
We report the results of a randomized evaluation of two different approaches to improving the instructional practices of early-grade reading teachers in public primary schools in South Africa. The first approach (Training) follows the traditional model of a once-off training conducted at a central venue. In the other approach (Coaching), teachers are visited on a monthly basis by a specialist reading coach who monitors their teaching, provides feedback, and demonstrates correct teaching practices. We find that Coaching had a large and statistically significant impact on pupil reading proficiency, more than twice the size of the impact in the Training arm. Coaching was also more cost-effective. We also find that teachers in both treatments are more likely to practice a difficult teaching technique called group-guided reading, although this impact was far larger for teachers that received Coaching. We also observe substantial heterogeneity by class size: Coaching is far more effective in large classes, whereas Training is most effective in medium-sized classes.
It is worth noting that Training had a large impact on both learning and teaching practice, and the impact on learning is statistically significant if we exclude multi-grade classes. It is encouraging that the traditional approach of one-off training at a central venue can succeed in changing teaching practice when it is well-designed, well-implemented, and combined with the appropriate learning material and lesson plans. However, it is unlikely that government training programs as currently implemented are effective. This program was implemented by a highly motivated non-governmental organization with strong incentives to demonstrate impact, and was based on strong pedagogical theory.
There were two rounds of one-off training, one at the beginning of each semester, and the schools also received follow-up calls to encourage them to continue with the program. The feedback from teachers was also overwhelmingly positive. Moreover, 78 percent of teachers in the control group also report having participated in some form of professional development. Although it is difficult to know what sort of training teachers in the control group received, the impact of the Training and Coaching interventions should be seen as relative to a counterfactual of some professional development opportunities rather than none.
An important question is the generalizability of the results to other contexts. Seen in the context of other evaluations of similar programs, we think it likely that these results generalize, at least within sub-Saharan Africa. Other studies in sub-Saharan Africa have found that the combination of reading coaches and supporting learning material can improve pupils' proficiency in early-grade reading (Piper et al. 2014, Piper & Korda 2011, Lucas et al. 2014, Kerwin et al. 2017). Moreover, a previous quasi-experimental evaluation of a very similar coaching program in a different province in South Africa also found positive impacts (Fleisch, Schöer, Roberts & Thornton 2016, Fleisch et al. 2016). Our contribution to this literature is to unpack the mechanisms in more depth and to understand which components are uniquely responsible for the success of the programs.
Another important consideration is scaling: should the Coaching program be expanded into new schools; if so, why and in which schools? One option is to make use of the existing subject advisors employed at district offices by government. The subject advisors are already supposed to visit schools on a regular basis and provide pedagogical support. They could, in principle, perform the role of coaches.
However, the ratio of subject advisors to schools does not allow them to visit each school more than about once a year. Moreover, the hiring process and employment contracts in government mean that they are unlikely to be best suited for this role, nor do they have the right combination of support and incentives to perform it. It is still an open empirical question whether less-qualified coaches can have the same impact on learning and teaching practice as the experts hired for this study. But scaling a program does not mean that it needs to be implemented in all schools at the same time. Another model would be to stagger implementation, where the same group of reading coaches visits a different cluster of schools every couple of years. A key question for scaling is thus whether the impacts persist. Future research will assess the sustained impact on pupil learning and teaching practice, even after support through Training or Coaching has been discontinued.
7 Tables and Figures
Table 1. Descriptive and balance statistics
                       (1)            (2)        (3)          (4)        (5)
                       Control mean   Coef.      Std error    Coef.      Std error    Obs      P value
Pupil Characteristics
Age                    6.481          0.0781     (0.0520)     -0.0244    (0.0524)     3,523    0.0669
Female                 0.479          -0.0156    (0.0220)     -0.0120    (0.0207)     3,518    0.884
Reading proficiency    0.0380         -0.209*    (0.118)      0.0666     (0.146)      3,539    0.0658
Grd 2 Teacher Characteristics
Diploma or degree
0.0127 (0.0312) 0.0413 (0.0253) 271
-1.566 (1.365)
-0.287 (1.217)
-0.0138 (0.0134) 0.00001 (0.00234) 271
Class size
-1.993 (1.464)
-3.174** (1.589)
0.0619 0.00698 (0.0333) 0.00253 (0.0293) 271
Comprehension test
-0.0425 (0.0304) -0.00419 (0.0326) 269
0.262 0.311 0.305 0.420 0.905 0.237
School characteristics
Setswana most common 1
-0.0418 (0.0284) -0.0216 (0.0213) 167
Most parents - highschool 0.443
-0.106 (0.0871) 0.0341 (0.0823) 179
-0.0700 (0.0679)
-0.110 (0.0691) 180
Bottom quintile (SES)
0.463 0.0975* (0.0520) -0.0425 (0.0392) 180
Pass rate (ANA)
-1.184 (0.894)
-0.981 (0.917)
Wealth index
-0.522 (0.497)
-0.616 (0.496)
Kenneth district
-0.0125 (0.0705)
0.0875 (0.0771) 180
Notes: Each row indicates a separate regression on treatment dummies controlling for strata indicators. Column one shows
the control mean, columns (2) and (4) the coefficient on the two treatment dummies. Standard errors are indicated in
columns (3) and (5) and are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Table 2: Implementation
                                      Training                 Coaching
                                      Coef.       Std error    Coef.       Std error    Control mean
General support in teaching Setswana
(1) Received training                 0.179***    (0.0471)     0.148***    (0.0514)
(2) Feel supported & respected        0.0874      (0.0744)     0.284***    (0.0692)
(3) Very good support                 0.287***    (0.0696)     0.490***    (0.0637)
Exposure to Class Act (in-depth teacher survey)
(4) Graded readers                    0.720***    (0.118)      0.677***    (0.117)
(5) Lesson plans                      0.732***    (0.124)      0.858***    (0.0938)
(6) Reading coach                     0.0167      (0.105)      0.792***    (0.104)
Notes: Each row represents a separate regression, including strata fixed effects. Data from rows (1) to (3) come from the teacher questionnaire administered to all teachers in the evaluation sample. Data from rows (4) to (6) come from the in-depth teacher survey conducted in a sub-set of 60 schools. Observations are at a teacher level. Standard errors are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Table 3. Main results
                          (1)         (2)         (3)         (4)
Training                  0.116       0.177**     0.165*      0.234***
                          (0.0791)    (0.0814)    (0.0851)    (0.0884)
Coaching                  0.242***    0.290***    0.318***    0.363***
                          (0.0778)    (0.0802)    (0.0841)    (0.0883)
Excluding multi-grade?    No          Yes         No          Yes
Excluding repeaters?      No          No          Yes         Yes
Notes: Each column represents a separate regression, using equation (1). All specifications include the following controls: baseline reading proficiency, gender, parents' education, school performance in standardized national exam, a district dummy, a community-level wealth index and high school attendance rates. In column (2) the sample is further restricted to schools that do not have multi-grade classrooms. In column (3) grade repeaters are excluded from the sample. In column (4) both multi-grade classrooms and repeaters are excluded. Standard errors are in parentheses and clustered at the school level. P-values are constructed using randomization inference. *** p<0.01, ** p<0.05, * p<0.1
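The notes to Table 3 state that p-values are constructed using randomization inference. The sketch below illustrates the general idea under simplifying assumptions that are not taken from the paper: it uses a plain difference in school-level mean outcomes instead of the full regression in equation (1), it compares a single treatment arm with the control, and all data and function names are hypothetical. Treatment labels are re-drawn among schools within each stratum, mirroring stratified school-level assignment.

```python
import random


def diff_in_means(outcome, treated):
    """Difference in mean outcome between treated and control schools."""
    t = [y for y, d in zip(outcome, treated) if d]
    c = [y for y, d in zip(outcome, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def permute_within_strata(treated, strata, rng):
    """Shuffle treatment labels among schools that share a stratum."""
    placebo = list(treated)
    members = {}
    for i, s in enumerate(strata):
        members.setdefault(s, []).append(i)
    for idx in members.values():
        labels = [treated[i] for i in idx]
        rng.shuffle(labels)
        for i, label in zip(idx, labels):
            placebo[i] = label
    return placebo


def ri_p_value(outcome, treated, strata, draws=2000, seed=0):
    """Share of placebo assignments with a statistic at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(diff_in_means(outcome, treated))
    extreme = 0
    for _ in range(draws):
        placebo = permute_within_strata(treated, strata, rng)
        # Small tolerance so ties are counted despite floating-point rounding.
        if abs(diff_in_means(outcome, placebo)) >= observed - 1e-12:
            extreme += 1
    return extreme / draws


# Hypothetical school-level mean scores: 12 schools in 3 strata, 2 treated per stratum.
strata = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
treated = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
outcome = [0.9, 1.1, 0.0, 0.1, 1.0, 1.2, 0.2, -0.1, 0.8, 1.3, 0.1, 0.0]
p_value = ri_p_value(outcome, treated, strata)
```

Because the placebo assignments respect the stratified design, the resulting p-value does not rely on large-sample approximations for the clustered standard errors.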
Table 4. Results on sub-indicators
                             (1)         (2)         (3)          (4)          (5)              (6)          (7)
                             Letters     Words       Non-words    Paragraph    Comprehension    Phon.        Writing
                                                                  reading                       awareness
Training                     0.0596      0.103       0.128*       0.122        0.0518           0.139**      0.105
                             (0.0889)    (0.0786)    (0.0755)     (0.0744)     (0.0730)         (0.0657)     (0.0789)
Coaching                     0.192**     0.224***    0.258***     0.232***     0.223***         0.156**      0.156*
                             (0.0927)    (0.0731)    (0.0743)     (0.0684)     (0.0719)         (0.0712)     (0.0851)
Observations                 2,951       2,951       2,951        2,951        2,951            2,951        2,951
R-squared                    0.147       0.158       0.142        0.151        0.121            0.071        0.124
Training=Coaching: P-value   0.141       0.098       0.064        0.115        0.011            0.793        0.51
Notes: Each column represents a separate regression, using equation (1) with the same set of controls as in Table 3. The column heading indicates the outcome. Standard errors are in parentheses and clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Table 5. Impacts as proportion of two years of learning in the control
Improvement in control
At most 1.23
Impact %
Impact %
Treatment impacts
1.572 4.7%
1.75 10.3% 0.0695
5.056 15.0%
3.80 22.4%
Note. The odd-numbered columns show the treatment impacts on three different sub-tests. Row (1) shows the change in the control over the two years of the study for these sub-tests. The even-numbered columns show the impacts as a proportion of two years of learning.
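The arithmetic behind Table 5 is a simple ratio: the treatment impact divided by the improvement observed in the control over the two years. The sketch below is illustrative only. It uses the impact and percentage pairs that survive in the table as printed above, and backs out the control-group improvement they imply; the function names are hypothetical and the implied gains are rounded values, not figures reported by the authors.

```python
def impact_as_share(impact, control_gain):
    """Treatment impact expressed as a proportion of the control group's two-year gain."""
    return impact / control_gain


def implied_control_gain(impact, share):
    """Back out the control gain from an impact and its reported share."""
    return impact / share


# (impact, share) pairs as printed in Table 5; units are those of the underlying sub-tests.
first_subtest = [(1.572, 0.047), (5.056, 0.150)]
second_subtest = [(1.75, 0.103), (3.80, 0.224)]

gains_first = [implied_control_gain(i, s) for i, s in first_subtest]
gains_second = [implied_control_gain(i, s) for i, s in second_subtest]
```

Within each sub-test the two implied control gains agree up to rounding, which is what one would expect if both percentages were computed against the same control-group improvement.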
Table 6. Curriculum coverage and routine
                                            Control    Training                  Coaching                           Training = Coaches
                                            mean       Coef.       Std. Error    Coef.       Std. Error    Obs      p value
Curriculum coverage
(1) Kling index                                        0.469***    (0.128)       0.317**     (0.139)       271
Days pupil completed:
(2) ---Any exercises                        23.57      16.64***    (3.348)       5.007       (3.778)       270      0.00679
(3) ---Writing exercises                    19.08      8.532***    (3.046)       6.306*      (3.478)       270      0.581
(4) ---Full sentence writing exercises      14.11      9.736***    (3.155)       5.539*      (3.044)       270      0.264
(5) Proportion of pages completed           0.761      -0.0441     (0.0555)      0.0840**    (0.0423)      258      0.0185
Routine
(6) Kling index                                        0.300***    (0.0811)      0.497***    (0.0652)      276
(7) Group-guided reading                               0.124*      (0.0738)      0.197***    (0.0674)      274
(8) Spelling test                                      0.155**     (0.0627)      0.238***    (0.0509)      273
(9) Phonics                                            -0.0708     (0.0745)      0.171**     (0.0720)      274
(10) Shared reading                                    0.183**     (0.0728)      0.171**     (0.0711)      274
(11) Creative writing                       0.310      0.301***    (0.0715)      0.383***    (0.0681)      274
Notes. Each row represents a separate regression, including stratification fixed effects. Data is at the teacher level. Standard errors are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Table 7. Types of reading activity
                                                         Control    Training                  Coaching                  Training = Coaching
                                                         mean       Coef.       Std. error    Coef.        Std. error   p-value
Group-guided reading (questionnaire)
(1) Kling index                                          0          0.210**     (0.0880)      0.415***     (0.0772)
(2) Teacher can provide list of groups                   0.430      0.168*      (0.0987)      0.344***     (0.0815)
(3) Listen to each pupil read out loud (almost daily)    0.578      0.0324      (0.0772)      0.237***     (0.0638)
(4) One-on-one reading assessment (at least weekly)      0.655      0.0877      (0.0755)      0.161**      (0.0638)
(5) Stream by ability                                    0.718      0.107*      (0.0579)      0.144**      (0.0580)
Group-guided reading (lesson observations)
(6) Kling index                                                     0.634**     (0.268)       0.762***     (0.243)      0.636
(7) Pupils read aloud in groups                                     0.0973      (0.173)       0.397**      (0.155)      0.0605
(8) Pupils read individually to teacher                  0.176      0.340*      (0.188)       0.485**      (0.183)      0.432
(9) Individual reading assessment                                   0.283       (0.179)       0.0913       (0.173)      0.295
(10) Reading groups, different texts                                0.0417      (0.152)       0.245        (0.174)      0.335
Whole class reading
(11) Teacher reads, class not following                             -0.169      (0.138)       -0.0233      (0.141)
(12) Teacher reads, class following silently             0.550      -0.167      (0.199)       -0.000186    (0.221)
(13) Whole class reads aloud with teacher                0.833      -0.0915     (0.181)       0.164        (0.129)
Notes. Each row represents a separate regression, including stratification fixed effects. Data is at the teacher level. Data from rows (1) to (5) come from the teacher survey conducted in the full evaluation sample. Data from rows (6) to (14) come from lesson observations conducted in a sub-sample of 60 schools. Regressions from rows (6) to (14) also include day-of-the-week fixed effects. *** p<0.01, ** p<0.05, * p<0.1
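The "Kling index" rows summarize several indicators in a single mean index. One common way to build such an index, assumed here for illustration and not confirmed by this section, is to standardize each component using the control group's mean and standard deviation and then average the resulting z-scores for each teacher. The function name and the data below are hypothetical.

```python
from statistics import mean, stdev


def mean_index(components, is_control):
    """Average of component z-scores, standardized on the control group.

    components: one list per indicator, all in the same teacher order.
    is_control: booleans marking which teachers are in the control group.
    """
    z_columns = []
    for values in components:
        control_values = [v for v, c in zip(values, is_control) if c]
        mu = mean(control_values)
        sd = stdev(control_values)  # sample standard deviation in the control group
        z_columns.append([(v - mu) / sd for v in values])
    # Average the z-scores across components, teacher by teacher.
    return [mean(zs) for zs in zip(*z_columns)]


# Hypothetical data: three control teachers followed by three treated teachers.
is_control = [True, True, True, False, False, False]
has_group_list = [0, 1, 0, 1, 1, 1]
weekly_assessments = [1, 2, 3, 3, 4, 4]
index = mean_index([has_group_list, weekly_assessments], is_control)
```

By construction the index averages to zero in the control group, which matches the control mean of 0 shown in row (1), so a coefficient on a treatment dummy can be read in control-group standard deviation units of the components.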
Table 8. Frequency of reading activity and use of reading material
                                  Training                 Coaching                 Training = Coaching
                                  Coef.      Std. error    Coef.       Std. error   P-value
Reading frequency
(1) Kling index                   0.00108    (0.148)       0.147       (0.151)      0.422
(2) Phonics                       0.194      (0.157)       0.154       (0.162)      0.770
(3) Letters                       -0.161     (0.198)       0.119       (0.191)      0.139
(4) 1-2 words                     -0.125     (0.181)       0.155       (0.225)      0.178
(5) 3-10 words                    -0.149     (0.141)       0.0803      (0.124)      0.0762
(6) 10+ words                     0.0533     (0.176)       0.192       (0.176)      0.420
(7) 1-2 sentences                 -0.282     (0.210)       -0.101      (0.225)      0.355
(8) 3-5 sentences                 0.301*     (0.169)       0.416**     (0.171)      0.550
(9) 5+ sentences                  0.193      (0.184)       0.318*      (0.176)      0.518
(10) Extended texts               0.0487     (0.188)       0.171       (0.195)      0.449
Use of reading material
(11) Kling index                  4.371**    (2.117)       12.45***    (3.550)      0.00897
(12) No. learners read readers    2.182**    (0.937)       5.225***    (1.518)      0.0194
(13) No. learners handle books    0.518      (0.793)       2.599*      (1.336)      0.0576
Notes. Each row represents a separate regression, including stratification fixed effects. Data is at the teacher level, each teacher at a different school. Standard errors are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Table 9. Teacher and class-level interaction effects
                               Degree      Age          Books read    Experience    Class size    Class size     Class size
Training                       0.108       0.396        0.148         0.182         -0.253        -3.022***      -2.747***
                               (0.132)     (0.445)      (0.111)       (0.164)       (0.369)       (0.832)
Training x group               0.0520      -0.00525     -0.00287      -0.00220      0.00905       0.155***       0.141***
                               (0.171)     (0.00892)    (0.0194)      (0.00731)     (0.00887)     (0.0431)       (0.0492)
Training x group squared                                                                          -0.00182***    -0.00164***
                                                                                                  (0.000540)     (0.000604)
Coaching                       0.231*      0.910**      0.361***      0.499***      -0.340        -2.270***      -2.916***
                               (0.133)     (0.461)      (0.105)       (0.151)       (0.329)       (0.626)
Coaching x group               0.186       -0.0127      -0.0213       -0.0105       0.0153*       0.118***       0.153***
                               (0.166)     (0.00950)    (0.0144)      (0.00705)     (0.00804)     (0.0319)       (0.0408)
Coaching x group squared                                                                          -0.00128***    -0.00172***
                                                                                                  (0.000397)     (0.000499)
Excl. multi-grade?             No          No           No            No            No            No             Yes
Excl. repeaters?               Yes         Yes          Yes           Yes           Yes           Yes            Yes
Observations                   2,065       2,301        2,168         2,276         2,285         2,285          2,121
R-squared                      0.176       0.171        0.160         0.170         0.178         0.190          0.189
Notes: Each column represents a separate regression, estimated using equation (3). Column headings indicate the variable that is being interacted with treatment dummies. Standard errors are in parentheses and clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
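To see what the quadratic class-size specification in Table 9 implies, the sketch below evaluates the treatment impact at a given class size and locates the class size at which it peaks. It is a worked illustration under stated assumptions: the coefficients are the rounded values from the first class-size column that includes a squared term, as read from the flattened table, and the function names are hypothetical.

```python
def implied_effect(main, linear, quadratic, class_size):
    """Implied treatment impact at a given class size:
    main + linear * size + quadratic * size^2."""
    return main + linear * class_size + quadratic * class_size ** 2


def peak_class_size(linear, quadratic):
    """Class size at which a concave quadratic impact is largest (first-order condition)."""
    return -linear / (2 * quadratic)


# Rounded coefficients as read from Table 9 (assumed column mapping).
training = dict(main=-3.022, linear=0.155, quadratic=-0.00182)
coaching = dict(main=-2.270, linear=0.118, quadratic=-0.00128)

training_peak = peak_class_size(training["linear"], training["quadratic"])
coaching_peak = peak_class_size(coaching["linear"], coaching["quadratic"])

training_effect_at_peak = implied_effect(class_size=training_peak, **training)
coaching_effect_at_peak = implied_effect(class_size=coaching_peak, **coaching)
```

With these rounded coefficients the implied impact peaks at roughly 43 pupils for Training and roughly 46 pupils for Coaching. The implied levels are sensitive to rounding in the reported coefficients, so they are best read as a qualitative check on the concave pattern and not as point estimates.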
Table 10. Group-guided reading by class size
                           (1)            (2)                 (3)                  (4)           (5)          (6)
                           Kling index    Teacher has list    Listen to each       One-on-one    Stream by    Group-guided
                                          of groups           pupil read out       reading       ability      reading
                                                              loud                 assessment
Training                   1.216***       1.216**             0.686**              0.272         0.224        -0.209
                           (0.397)        (0.522)             (0.325)              (0.312)       (0.303)      (0.431)
Training x Class size      -0.0238**      -0.0246**           -0.0157**            -0.00447      -0.00252     0.00820
                           (0.00926)      (0.0122)            (0.00756)            (0.00735)     (0.00701)    (0.0102)
Coaching                   0.654*         0.251               0.769***             0.162         0.129        0.368
                           (0.363)        (0.466)             (0.279)              (0.262)       (0.310)      (0.347)
Coaching x Class size      -0.00534       0.00304             -0.0131*             0.000372      5.18e-05     -0.00373
                           (0.00849)      (0.0112)            (0.00666)            (0.00613)     (0.00746)    (0.00843)
Observations               254            216                 253                  254           245          254
R-squared                  0.167          0.214               0.095                0.091         0.079        0.073
Notes: Each column represents a separate regression, estimated using equation (3). Column headings indicate the dependent variable. Standard errors are in parentheses and clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
Figure 1: Quality of implementation in Coaching arm (a) Number of coach visits
Figure 2: Attrition and repetition rates across treatment arms
Note: The figure shows the proportion of surveyed pupils by treatment group who: (i) were not present at endline for the reading assessment; (ii) were present, but are repeating grade one; (iii) were present and are in grade two.
Figure 3: Implementation quality
(a) Received training in Setswana
(b) Received good support
(c) Feel supported and recognized - strongly agree
(d) Use ClassAct's graded readers
(e) Use ClassAct's scripted lesson plans
(f) Monitored by ClassAct reading coach
Note: Panels (a) to (c) are based on teacher questionnaires answered by 276 teachers (118, 76, and 82 teachers in the Control, Training and Coaching arms respectively) in 175 schools (77 schools in the Control, and 49 in each intervention arm). Panels (d) to (f) are based on in-depth teacher surveys conducted with a sub-set of 60 teachers, 20 teachers in each treatment arm. The black lines indicate 90 per cent confidence intervals.
Figure 4: Mean impacts on learning
(a) Full sample
(b) Excluding multi-grade classes
(c) Excluding repeaters
Figure 5: Relationships between treatment impacts and class size
Note: The coefficients are estimated using equation 4. From left to right they are: $\beta_{1,1}$, $\beta_{1,2}$, $\beta_{1,3}$, $\beta_{2,1}$, $\beta_{2,2}$, $\beta_{2,3}$, $\beta_{0,1}$, and $\beta_{0,2}$. The data includes 2,285 pupils, who have been matched with 270 teachers. R-squared is 0.1759.
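The regressors behind equation 4 and Figure 5 can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the sample cut-offs stated in the text (fewer than 38 pupils for the smallest third, more than 44 for the largest third) and drops the main-effect dummy for the largest tercile, which is the omitted category. Function and key names are hypothetical.

```python
def size_tercile(class_size, low_cut=38, high_cut=44):
    """Return 1, 2 or 3 for small, medium and large classes."""
    if class_size < low_cut:
        return 1
    if class_size > high_cut:
        return 3
    return 2


def equation4_regressors(class_size, training, coaching):
    """Tercile dummies and their interactions with the treatment dummies."""
    q = size_tercile(class_size)
    row = {}
    for k in (1, 2, 3):
        in_k = int(q == k)
        if k < 3:
            # Main effect of the tercile; the largest tercile is the omitted category.
            row[f"tercile_{k}"] = in_k
        row[f"training_x_tercile_{k}"] = training * in_k
        row[f"coaching_x_tercile_{k}"] = coaching * in_k
    return row


# A Coaching-arm class with 45 pupils falls in the largest tercile.
example = equation4_regressors(class_size=45, training=0, coaching=1)
```

Each row has eight regressors, one for each of the eight coefficients plotted in Figure 5, so every interaction coefficient can be read as the treatment impact within that tercile.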
References
Allen, J., Gregory, A., Mikami, A., Lun, J., Hamre, B. & Pianta, R. (2013), 'Observations of effective teacher-student interactions in secondary school classrooms: Predicting student achievement with the classroom assessment scoring system-secondary', School Psychology Review 42(1), 76.
Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y. & Schady, N. (2016), 'Teacher quality and learning outcomes in kindergarten', The Quarterly Journal of Economics 131(3), 1415-1453.
Athey, S. & Imbens, G. W. (2017), 'The econometrics of randomized experiments', Handbook of Economic Field Experiments 1, 73-140.
Banerjee, A. V., Cole, S., Duflo, E. & Linden, L. (2007), 'Remedying education: Evidence from two randomized experiments in India', The Quarterly Journal of Economics pp. 1235-1264.
Banerji, R., Bhattacharjea, S. & Wadhwa, W. (2013), 'The annual status of education report (ASER)', Research in Comparative and International Education 8(3), 387-396.
Bau, N. & Das, J. (2017), 'The misallocation of pay and productivity in the public sector: Evidence from the labor market for teachers'.
Bold, T., Filmer, D., Martin, G., Molina, E., Stacy, B., Svensson, J. & Wane, W. (2017), 'Enrolment without learning: Teacher effort, knowledge, and skill in primary schools in Africa', Journal of Economic Perspectives 31(4).
Buhl-Wiggers, J., Kerwin, J., Smith, J. & Thornton, R. (2017), The impact of teacher effectiveness on student learning in Africa, in 'Centre for the Study of African Economies Conference'.
Chetty, R., Friedman, J. N. & Rockoff, J. E. (2014), 'Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood', The American Economic Review 104(9), 2633-2679.
Cilliers, J., Kasirye, I., Leaver, C., Serneels, P. & Zeitlin, A. (2017), 'Pay for locally monitored teacher attendance?'.
Cruz-Aguayo, Y., Ibarrarán, P. & Schady, N. (2017), 'Do tests applied to teachers predict their effectiveness?', Economics Letters 159, 108-111.
Das, J., Hammer, J. & Leonard, K. (2008), 'The quality of medical advice in low-income countries', The Journal of Economic Perspectives 22(2), 93-114.
Das, J., Holla, A., Mohpal, A. & Muralidharan, K. (2016), 'Quality and accountability in health care delivery: Audit-study evidence from primary care in India', American Economic Review 106(12), 3765-3799.
De Ree, J., Muralidharan, K., Pradhan, M. & Rogers, H. (2015), Double for nothing? Experimental evidence on the impact of an unconditional teacher salary increase on student performance in Indonesia, Technical report, National Bureau of Economic Research.
Dhaliwal, I., Duflo, E., Glennerster, R. & Tulloch, C. (2013), 'Comparative cost-effectiveness analysis to inform policy in developing countries: A general framework with applications for education', Education Policy in Developing Countries pp. 285-338.
Duflo, A. & Kiessel, J. (2014), 'Every child counts: Adapting and evaluating research results on remedial education across contexts', Society for Research on Educational Effectiveness.
Duflo, E., Dupas, P. & Kremer, M. (2011), 'Peer effects, teacher incentives, and the impact of tracking: Evidence from a randomized evaluation in Kenya', The American Economic Review 101(5), 1739-1774.
Evans, D. K. & Popova, A. (2015), 'What really works to improve learning in developing countries? An analysis of divergent findings in systematic reviews', World Bank Policy Research Working Paper (7203).
Evans, D. K. & Popova, A. (2016), 'What really works to improve learning in developing countries? An analysis of divergent findings in systematic reviews', The World Bank Research Observer 31(2), 242-270.
Fleisch, B., Schöer, V., Roberts, G. & Thornton, A. (2016), 'System-wide improvement of early-grade mathematics: New evidence from the Gauteng primary language and mathematics strategy', International Journal of Educational Development 49, 157-174.
Fryer, R. G. (2017), 'The production of human capital in developed countries: Evidence from 196 randomized field experiments', Handbook of Economic Field Experiments 2, 95-322.
Garet, M. S., Cronen, S., Eaton, M., Kurki, A., Ludwig, M., Jones, W., Uekawa, K., Falk, A., Bloom, H. S., Doolittle, F. et al. (2008), 'The impact of two professional development interventions on early reading instruction and achievement. NCEE 2008-4030', National Center for Education Evaluation and Regional Assistance.
Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., Song, M., Brown, S., Hurlburt, S., Zhu, P. et al. (2011), 'Middle school mathematics professional development impact study: Findings after the second year of implementation', National Center for Education Evaluation and Regional Assistance.
Glewwe, P., Ilias, N. & Kremer, M. (2010), 'Teacher incentives', American Economic Journal: Applied Economics 2(3), 205-227.
Glewwe, P., Kremer, M. & Moulin, S. (2009), Many children left behind? Textbooks and test scores in Kenya, Technical Report 1.
Harris, D. N. & Sass, T. R. (2011), 'Teacher training, teacher quality and student achievement', Journal of Public Economics 95(7), 798-812.
Jackson, C. K. & Makarin, A. (2016), Simplifying teaching: A field experiment with online "off-the-shelf" lessons, Technical report, National Bureau of Economic Research.
Jacob, B. A. & Lefgren, L. (2004), `The impact of teacher training on student achievement quasiexperimental evidence from school reform efforts in chicago', Journal of Human Resources 39(1), 50­ 79. Kane, T. J. & Staiger, D. O. (2008), Estimating teacher impacts on student achievement: An experimental evaluation, Technical report, National Bureau of Economic Research. Kane, T. J. & Staiger, D. O. (2012), `Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. research paper', Bill & Melinda Gates Foundation . Kerwin, J. T., Thornton, R. et al. (2017), `Making the grade: Understanding what works for teaching literacy in rural uganda', Unpublished manuscript. University of Illinois, Urbana, IL . Kling, J. R., Liebman, J. B. & Katz, L. F. (2007), `Experimental analysis of neighborhood effects', Econometrica 75(1), 83­119. Kraft, M. A., Blazar, D. & Hogan, D. (2016), `The effect of teacher coaching on instruction and achievement: A meta-analysis of the causal evidence'. Lucas, A. M., McEwan, P. J., Ngware, M. & Oketch, M. (2014), `Improving early-grade literacy in east africa: Experimental evidence from kenya and uganda', Journal of Policy Analysis and Management 33(4), 950­976. McEwan, P. J. (2015), `Improving learning in primary schools of developing countries: A meta-analysis of randomized experiments', Review of Educational Research 85(3), 353­394. Mullins, I., Martin, M., Foy, P. & Hooper, M. (2017), Pirls 2016 international results in reading, Technical report, International Association for the Evaluation of Educational Achievement. Muralidharan, K. & Sundararaman, V. (2009), Teacher performance pay: Experimental evidence from india, Technical report, National Bureau of Economic Research. Muralidharan, K. & Sundararaman, V. (2013), Contract teachers: Experimental evidence from india, Technical report, National Bureau of Economic Research. Piper, B. & Korda, M. (2011), `Egra plus: Liberia. 
program evaluation report.', RTI International . Piper, B., Zuilkowski, S. S. & Mugenda, A. (2014), `Improving reading outcomes in kenya: First-year effects of the primr initiative', International Journal of Educational Development 37, 11­21. Popova, A., Evans, D. K. & Arancibia, V. (2016), `Inside in-service teacher training: What works and how do we measure it?'. Randel, B., Beesley, A. D., Apthorp, H., Clark, T. F., Wang, X., Cicchinelli, L. F. & Williams, J. M. (2011), `Classroom assessment for student learning: Impact on elementary school mathematics in the central region. final report. ncee 2011-4005.', National Center for Education Evaluation and Regional Assistance . 42
Rivkin, S. G., Hanushek, E. A. & Kain, J. F. (2005), `Teachers, schools, and academic achievement', Econometrica 73(2), 417­458. Sabarwal, S., Evans, D. K. & Marshak, A. (2014), `The permanent input hypothesis: the case of textbooks and (no) student learning in sierra leone'. Snilstveit, B., Stevenson, J., Menon, R., Phillips, D., Gallagher, E., Geleen, M., Jobse, H., Schmidt, T. & Jimenez, E. (2016), `The impact of education programmes on learning and school participation in low-and middle-income countries'. Staiger, D. O. & Rockoff, J. E. (2010), `Searching for effective teachers with imperfect information', The Journal of Economic Perspectives 24(3), 97­117. Strizek, G. A., Tourkin, S. & Erberber, E. (2014), `Teaching and learning international survey (talis) 2013: Us technical report. nces 2015-010.', National Center for Education Statistics . Venkat, H. & Spaull, N. (2015), `What do we know about primary teachers mathematical content knowledge in south africa? an analysis of sacmeq 2007', International Journal of Educational Development 41, 121­130. 43
Appendix A Further tables
Table A.1 Treatment Status Regressions on Attrition Status
(1) Attrite
Reading proficiency
(8) Teacher attrition
Training Coaching Attrite
0.00605 (0.0222) -0.0136 (0.0183)
0.163*** (0.0425)
0.0870 (0.0530) -0.0246 (0.0513)
-0.0357* (0.0215)
-0.0307 (0.0241) -0.0139 (0.0230)
-0.0720 (0.0509)
-0.205* (0.121) 0.0827 (0.151)
-0.0239 (0.0363) -0.0234 (0.0378)
Strata fixed effects?
3,523 2,941
3,518 2,934
3,539 2,951
0.007 0.012
0.001 0.003
0.001 0.061
Control mean
Notes: Each column represents a separate regression. Column headings indicate the dependent variable. "Attrite" is a dummy variable
equal to one if the pupil was not surveyed at endline. "Teacher attrition" is a dummy variable equal to one if the pupil's teacher was not
surveyed at endline. In columns (3), (5), (7) and (8) the sample is restricted to non-attriting pupils. Standard errors are clustered at the
school level. *** p<0.01, ** p<0.05, * p<0.1.
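The specification described in the notes to Table A.1 (an outcome regressed on treatment dummies, with standard errors clustered at the school level) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are invented, strata fixed effects are omitted for brevity (they could be appended as extra dummy columns of `X`), no small-sample correction is applied to the cluster-robust variance, and the function name `ols_clustered` is hypothetical rather than taken from the study's code.

```python
from collections import defaultdict
from math import sqrt


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ols_clustered(y, X, clusters):
    """OLS coefficients with cluster-robust (sandwich) standard errors.

    Assumption: no finite-sample adjustment is applied to the variance.
    """
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    bread = invert(xtx)
    beta = [sum(bread[i][j] * xty[j] for j in range(k)) for i in range(k)]
    # Sum x_i * residual_i within each cluster (school).
    scores = defaultdict(lambda: [0.0] * k)
    for r, yi, g in zip(X, y, clusters):
        resid = yi - sum(b * x for b, x in zip(beta, r))
        for i in range(k):
            scores[g][i] += r[i] * resid
    meat = [[sum(s[i] * s[j] for s in scores.values()) for j in range(k)]
            for i in range(k)]
    vcov = matmul(matmul(bread, meat), bread)
    se = [sqrt(max(vcov[i][i], 0.0)) for i in range(k)]
    return beta, se


# Invented toy data: school -> (treatment arm, attrition dummy for each pupil).
toy = {
    "s1": ("control", [0, 0, 1, 1]),
    "s2": ("control", [0, 0, 0, 0]),
    "s3": ("training", [1, 0, 0, 0]),
    "s4": ("training", [0, 1, 1, 0]),
    "s5": ("coaching", [0, 0, 0, 0]),
    "s6": ("coaching", [1, 0, 0, 0]),
}
y, X, schools = [], [], []
for school, (arm, outcomes) in toy.items():
    for attrited in outcomes:
        y.append(float(attrited))
        # Columns: intercept, Training dummy, Coaching dummy.
        X.append([1.0, float(arm == "training"), float(arm == "coaching")])
        schools.append(school)

beta, se = ols_clustered(y, X, schools)
for name, b, s in zip(["Constant", "Training", "Coaching"], beta, se):
    print(f"{name}: {b:.4f} ({s:.4f})")
```

With only an intercept and two mutually exclusive treatment dummies, the intercept equals the control-group mean and each coefficient equals the difference between that arm's mean and the control mean. Adding strata indicators changes the point estimates but not the logic of the clustering step.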
Table A.2 Comparing lesson observation schools with full sample
Pupil reading proficiency
Value-added Endline Midline Baseline
(6) Location Rural
In sample Training
0.0594 -0.00586 0.0200 -0.0284 (0.0724) (0.0814) (0.0748) (0.119)
-0.250*** (0.0692)
3,148 3,337 3,539
0.000 0.000 0.000
Sample mean 0.0368 0.00873 0.0304 -0.0180 0.633
Notes: Each column represents a separate regression on a dummy variable indicating whether the
pupil/school is in the sample where we conducted the lesson observation or not. In columns (1) to
(4) the data is at the individual level; in column (5) the data is at the school level. In column (1) the
dependent variable is midline reading proficiency, and the regression includes the full set of
controls used in Table 2. Standard errors are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1.
Table A.3 Descriptive and balance statistics - Lesson observations sample
                                 (1)            (2)        (3)         (4)         (5)         (6)     (7)
                                 Control mean   Coef.      Std error   Coef.       Std error   Obs     R-squared
Pupil characteristics
  Age                            6.481          0.117      (0.0781)    0.0263      (0.0767)    1,194   0.021
  Female                         0.479          -0.0634    (0.0423)    -0.0653*    (0.0356)    1,191   0.008
  Reading proficiency            0.0404         -0.244     (0.253)     0.171       (0.224)     1,198   0.157
Teacher characteristics
  Diploma or degree              0.947          0.0451     (0.0444)    0.0559      (0.0547)    88      0.117
  Age                            48.92          0.108      (2.882)     0.368       (2.875)     89      0.103
  Female                         1              -0.0320    (0.0307)    0.00641     (0.0153)    87      0.213
  Class size                     42.17          -3.470     (2.692)     -7.309**    (3.057)     87      0.253
  Multi-grade                    0.0619         -0.0570    (0.0379)    -0.0183     (0.0382)    88      0.407
  Comprehension test             0.663          -0.0663    (0.0741)    0.0198      (0.0808)    88      0.128
School characteristics
  Majority parents - highschool  0.443          -0.247     (0.179)     -0.00595    (0.170)     59      0.277
                                                -0.0312    (0.151)     -0.144      (0.172)     60      0.239
  Bottom quintile (SES)                         0.0412     (0.108)     -0.0935     (0.0818)    60      0.791
  Pass rate (ANA)                               -0.215     (1.446)     0.773       (1.771)     60      0.542
Notes: Each row indicates a separate regression on treatment dummies controlling for strata indicators. Column one shows
the control mean, columns (2) and (4) the coefficient on the two treatment dummies. Standard errors (columns (3) and (5))
are clustered at the school level. *** p<0.01, ** p<0.05, * p<0.1
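As a reading aid for the significance stars, the sketch below recomputes them from a few coefficient and standard-error pairs reported in Table A.3. It assumes a standard normal approximation for the test statistic; the paper does not state which reference distribution was used, so results close to a threshold could differ. The helper names `two_sided_p` and `stars` are illustrative.

```python
from math import erfc, sqrt


def two_sided_p(coef, se):
    """Two-sided p-value for coef / se under a standard normal approximation."""
    z = abs(coef / se)
    return erfc(z / sqrt(2.0))


def stars(p):
    """Map a p-value to the star convention in the table notes."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


# (coefficient, standard error) pairs copied from Table A.3.
examples = {
    "Teacher class size, column (4)": (-7.309, 3.057),
    "Pupil female, column (4)": (-0.0653, 0.0356),
    "Pupil age, column (2)": (0.117, 0.0781),
}
for label, (coef, se) in examples.items():
    p = two_sided_p(coef, se)
    print(f"{label}: z = {coef / se:.2f}, p = {p:.3f} {stars(p)}")
```

Under this approximation the class-size difference is significant at the 5 percent level and the pupil gender difference at the 10 percent level, matching the stars printed in the table.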

File: how-to-improve-teaching-practice-experimental-comparison-of-centralized.pdf
Author: J Cilliers, B Fleisch, C Prinsloo, V Reddy
Published: Wed Mar 28 20:51:42 2018
Pages: 47
File size: 1.09 Mb
