AERO

Why AERO should take a long hard look at itself

How AERO’s failures fail us all: part one published yesterday

To look at AERO’s teaching model is to wonder whether the organisation is living in some other reality, a world in which there are no students who refuse to go to school, or leave school as soon as they can, or last the distance but leave with not much to show for it, or wag it, or bully and harass or are bullied and harassed, sometimes in the classroom often outside it, or have little or no sense of “belonging” at school or “attachment” to it. Why on this crowded stage is AERO putting the spotlight solely on what the teacher is doing in the classroom? Can teaching be expected to change the whole experience of being at school? Or is that somebody else’s problem?

And what about kinds of knowledge other than formal, out-there, discipline-derived knowledge, the staple that has launched a thousand curriculums — know-how, for example, knowing how to learn, how to work in groups, how to think through complicated life and ethical questions? And what about students’ knowledge of their own capabilities and options? The suspicion arises that what AERO is after is schooling for the poor, for the denizens of the “long tail of attainment,” cheap, narrowed down and dried out, a something that is better than nothing.

AERO is misconceived

As well as misconceiving, AERO is misconceived. Its job is to gather research from up there and package it for consumption down below. It wants teaching to be based on research evidence — on just two kinds of research evidence, in fact — as if what teachers and school leaders know from experience, debate and intuition isn’t really knowledge at all, as if it’s research evidence or nothing. That most teachers and others in schools don’t use research evidence very often is taken not as a judgement about priorities but as an “obstacle” to uptake.

AERO claims that “evidence-based practices are the cornerstone of effective teaching” without providing or citing evidence to support the claim. More, it implies that the “how” of teaching is the only thing that teachers should concern themselves with, that teaching and schooling are free of doubts and dilemmas, of messy questions of judgement, decision and purpose.

A deeply hierarchical idea

AERO’s deeply hierarchical idea of the relationship between researchers and practitioners is of a piece with its conception of the relationship between teacher and taught. It is, in fact, the kind of institution that John Hattie feared. “There’s a debate going on about building an evidence institute for teaching,” he told Larsen in 2018. “My fear is that it will become like [America’s] What Works Clearinghouse and people will be employed to take academic research and translate it into easy language for teachers.”

At the risk of an apparent sectarianism, let me suggest that Martin Luther had the necessary idea: the priest should not stand between God and the flock but beside the flock reading God’s Word for themselves and finding their own way to salvation. AERO should stand beside teachers and schools, and it should help them stand beside their students. But that is not what AERO was set up to do.

What is AERO?

Nominally the creation of the nine ministers of education and their departments, AERO is actually the handiwork of the NSW Department of Education, long the bastion of the “traditional” classroom, and Social Ventures Australia, or SVA, an organisation privately funded to “influence governments and policymakers to create large scale impact.”

SVA was much taken by Britain’s Education Endowment Foundation, and pitched to the Commonwealth the many benefits that would flow from an Australian equivalent. The pitch included some words about new things to be learned in new ways but more about a “robust evidence ecosystem” serving the cause of “continuous improvement” that would boost school performance. SVA wanted the new organisation to be independently funded and established through a tendering process. As the proposal made its way through the machinery of the “national approach” it was shorn of the progressive talk along with the independent funding and the creation by tender.

The organisation that emerged is indistinguishable from the NSW Department of Education in its underlying assumptions about “evidence-based teaching,” its “teaching model,” its definition of “evidence,” and its view of the relationship between theory and practice and of the control of schools. AERO’s board is chaired by the former chief executive of the Smith Family, a charity so committed to “explicit teaching” that it has taken ads in the mainstream media to urge its universal adoption. The chief executive is a former senior officer of the NSW department and AERO references the department’s Centre for Education Statistics and Evaluation (“the home of education evidence”). AERO’s “partner,” Ochre Education, a not-for-profit provider of “resources [that] support effective, evidence-based practices,” has one “partner” larger than all the rest put together, the NSW Department of Education.

Embedded in its own dogma

AERO is deeply embedded in its own dogma and in the national machinery that was supposed to deliver “top 5 by ’25.” Well, here we are in 2025 and no closer to the top of the OECD’s league tables than we were fifteen years ago when the boast was made. To the contrary, as the former head of Australia’s premier research organisation and of the OECD’s mighty education division concluded recently, inequality is rising, quality is falling, and the system is resistant to reform. What reason is there to expect that another fifteen years doing the same thing will produce a different result?

AERO is not going to go away, but perhaps it can be pressed to lighten up. It should be persuaded, first, to accept that teaching is a sense-making occupation and that schools are sense-making institutions. Schools should not be treated as outlets applying recipes and prescriptions dispensed by AERO or anyone else.

Second, AERO’s evidence should bear on the system in which schools do their work as well as on the schools and their teachers. That should include evidence about whether and how Australia’s schooling system should join schools and teachers as objects of reform.

Rethink the conception of “evidence”

Third, AERO should be pressed to rethink its conception of “evidence.” Schools do and must use many kinds of evidence, including some that they gather formally or informally themselves. Evidence derived from academic research may well be a useful addition to the mix, but that is all. It is — and AERO should say so — provisional and contingent, not altogether different from other kinds of evidence schools use. The contrary idea, that evidence generated by formal academic research is scientific and therefore beyond debate and disagreement, is encouraging the gross misconstructions of effectiveness research described by John Hattie.

AERO should also expand the range of academic sources it draws on and the kinds of evidence it embraces, going beyond the “how” to include the “what,” “why” and “whether to” — debates over evidence and evidence-use, and evidence from educational philosophy, sociology, economics and history as well as from that dubious discipline, psychology (the source of both the effectiveness paradigm and cognitive load theory), and from beyond the all-too-familiar Anglosphere.

The lens must be widened

And, most important of all: while many teachers are no doubt grateful for at least some of AERO’s output, and perhaps particularly for the resources distributed by AERO’s partner, Ochre, those resources go no further than helping teachers do a job in need of a fundamental rethink. The lens must be widened to include the organisation of students’ and teachers’ daily work and the organisation of students’ learning careers as well as what teachers can do in the classroom as it now exists. AERO should identify schools working to organise the curriculum around each student’s intellectual growth and the development of their capacities as individuals and as social beings. It should put those schools in touch with one another, and work with them on a different kind of research, on finding ways through an essential but immensely difficult organisational and intellectual task. •

This is part two of the story by Dean Ashenden on AERO. We published part one yesterday. This was first published in Inside Story. We are republishing with the permission of both Dean Ashenden and of Peter Browne, editor of Inside Story.

Dean Ashenden is a senior honorary fellow in the Melbourne Graduate School of Education, the University of Melbourne. He has worked in and around Australian education over decades as a teacher, academic, commentator and consultant. He is co-author, with Raewyn Connell, Sandra Kessler and Gary Dowsett, of the 1982 classic Making the Difference: Schools, Families and Social Division. Unbeaching the Whale: Can Australian schooling be reformed? was published last year.

AERO: Why and how its failures fail us all

The Vatican has the Dicastery for the Doctrine of the Faith. Australian schooling has AERO.

New, not very important but very symptomatic, the Australian Education Research Organisation fits snugly into the elaborate machinery of Labor’s “national approach” to schooling. As an “evidence intermediary,” its task is to make a certain kind of research finding more available to teachers and schools. But its key sponsors hope it will proclaim the doctrine in a system dependent on prescription, surveillance and compliance.

The doctrine is this: schooling is first and foremost about knowledge; teaching is first and foremost about getting prescribed knowledge into young heads; research has established the relative effectiveness (“effect size”) of teaching techniques and “interventions”; learning science has reinforced this evidence by showing how to “harmonise” teaching with the brain’s learning mechanisms; teaching must be based on evidence supplied by this research.

The faith: that in this way the long slide in the performance of Australian schools will at last be arrested and reversed.

AERO’s “gold standard”

In AERO’s view, though, there is no doctrine or faith. “Gold standard” research into effective teaching and findings on the workings of the brain have established scientific facts, clear and definitive.

Of AERO’s two intellectual pillars, effectiveness research is the much larger and stronger. Long-established and buttressed by a vast literature, it has become the lingua franca of education policy (including the policies promoted by the national approach) and has been absorbed by many teachers. But effectiveness research and its uses have also concerned and sometimes enraged many, including, surprisingly enough, John Hattie.

For many years Hattie has been by far the most influential exponent of the effectiveness idea in Australia, and perhaps around the world. But in a series of conversations with Danish philosopher Steen Nepper Larsen (published as The Purpose of Education in 2018) Hattie looks back over a formidable body of effectiveness research and his own work with schools and involvement in national policymaking to find flaws and limitations in the research itself, and gross misinterpretation and misuse of it by policymakers and schools alike.

Education research has (Hattie says) “privileged” quantitative studies over qualitative, and has been “obsessed” with the technical quality of studies at the expense of their importance and value. The focus of so much effectiveness research on basic outcomes (80–90 per cent of it by Hattie’s estimate) has been salutary, but has also obscured much of what schools do and should do.

“I want more,” Hattie says. He emphasises: “I want broader. I want schools and systems to value music, art, history, entrepreneurship, curiosity, creativity, and much more.”

Many ways of skinning the cat

In much the same way, measuring “effect size” was useful but has ended up being the reverse, Hattie argues. It helped teachers and school leaders to accept that there are many ways of skinning the educational cat and to rely less on habit, hunch and assumption. But the “effect sizes” summarised in his celebrated Visible Learning (2009) and many publications following are averages, he points out, and too often the fact, extent and causes of variation are forgotten — along with the importance of context. Effect-size tables have been taken as a kind of installer’s guide — policymakers look at them and say “tick, tick, tick to the top influences and no, no, no to the bottom,” thus missing the point entirely.
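A toy illustration of why that matters (the numbers below are invented, not drawn from Visible Learning or any real synthesis): two influences can carry the same average effect size and yet behave very differently from one context to the next, which is precisely what a ranking of averages conceals.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented study-level effect sizes for two hypothetical influences.
# Both average d of about 0.4, but one is consistent and one is wildly variable.
influence_a = rng.normal(0.4, 0.05, 60)   # consistent across contexts
influence_b = rng.normal(0.4, 0.45, 60)   # helps in some contexts, harms in others

for name, d in (("A", influence_a), ("B", influence_b)):
    print(f"influence {name}: mean d = {d.mean():.2f}, "
          f"range {d.min():.2f} to {d.max():.2f}, "
          f"negative in {(d < 0).mean():.0%} of studies")

# A league table built on the means alone ranks A and B identically:
# the 'tick, tick, tick' reading that misses variation and context.
```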

The point? To inform and prompt thinking, interpretation, explanation: what is this evidence telling us? What do these numbers mean? What’s going on here, and why? What, for example, should we do with evidence showing that smaller classes have not produced better performance? Just say: no more smaller classes? Or ask why smaller classes aren’t being used more effectively?

A sustained failure of policy

How can we actually do what effectiveness research has made possible? Research can go only so far; it reflects schooling as it is, not how it has to be; the rest is up to government and policy. Properly interrogated, Hattie concludes, the evidence first assembled in Visible Learning (2009) reveals a sustained failure of policy.

Hattie’s criticisms cover much but not all of the ground on which effectiveness research stands. He and others were convinced that education research could do for schooling what medical research had done for medicine. Research of the “gold standard” medical kind would reveal what worked in the classroom (or as Hattie later put it, what worked best). They were also convinced that the teacher was the crucial variable in the schooling equation, which made teachers and teaching “quality” the central objects of policy.

Not medical practitioners, not patients

But teachers are not like medical practitioners and students are not like patients. Teachers try to enlist students in their cause; students might or might not join in. They might do their best to make sense of what the teacher seems to want, or pretend that they’re trying to, or subvert or resist the teacher’s efforts in myriad ways. Much of what students learn is not what is taught but what students think has been taught; often it has not been taught at all, for students learn all kinds of other things in the classroom and everywhere else at school. They learn about themselves, the world, how the world treats them, and how they can and should treat others. Students are, in other words, co-producers of learning, of themselves, and of each other. They learn, and they grow.

What students learn and how they grow, taken in its full extent and complexity, depends partly on what teachers do but mostly on the circumstances in which they and teachers meet. Producing learning and growth is in many ways just like producing anything else. Any form of production combines people, time, space, task, expertise, objectives, rewards and sanctions in a specific way. The central question is not how to make teaching more effective (as effectiveness research assumes) but how to make schools more productive. Which combination of the many factors of production is most productive of what kinds of learning and growth for which students? The failure to ask what the evidence is telling us about what is going on and what could go on is the seed of the policy failure Hattie points to.

A less reliable vessel

“Learning science” is an even less reliable vessel. There is in fact no such thing as “learning science.” The learning sciences (plural) include experimental psychology, social and affective neuroscience, cognitive anthropology, developmental psychology, robotics and AI, neurology, systems theory and many others. AERO relies on a particular subset of a particular branch of the learning sciences, cognitive load theory, or CLT, which is held in low esteem by many for its failure to take into account “the neurodynamic, attitudinal, social, emotional and cultural factors that often play a major, if often invisible and unsung role in every classroom.”

Learning scientists who do pay attention to these “often invisible and unsung” factors reach conclusions very different from AERO’s. Two prominent psychologists, for example, concluded after career-long research that learners thrive when they feel competent and successful, challenged, purposeful, connected to community and culturally safe, working collaboratively on things relevant to their lives. A neuroscientist studying the relationship between young people’s behaviour, circumstances and neural development found that “support, safe spaces, and rich opportunities [to] think deeply about complex issues, to build personally relevant connections, and to find purpose and inspiration in their lives” are crucial to the brain’s development. Indeed, “the networks in the brain that are associated with these beneficial outcomes are deactivated during the kinds of fast-paced and often impersonal activities that are the staple of many classrooms” (emphasis added).

What about other kinds of classroom teaching?

One of the consequences of AERO’s use of CLT and effectiveness research is the assumption that teaching “knowledge” is the only game at school and there is only one way to play it. Of course knowledge is core business in schooling: knowledge of reading, writing, maths and science are “basic”; didactic teaching is for most kids and some purposes the shortest route between a fog and an aha! moment; the precepts of “explicit” teaching may well help to improve didactic teaching; and “effectiveness” research and its “effect sizes” can indeed make teachers and school leaders more aware of options and less reliant on hunch, habit and anecdote.

But what about other kinds of classroom teaching? And other ways of learning? Is AERO’s “teaching model” a one-punch knockout? The sovereign solution to the many things that students, teachers and schools contend with?

Tomorrow: How AERO can (and should) take a long hard look at itself.

This story by Dean Ashenden on AERO was first published in Inside Story. We are republishing with the permission of both Dean Ashenden and of Peter Browne, editor of Inside Story.

Dean Ashenden is a senior honorary fellow in the Melbourne Graduate School of Education, the University of Melbourne. He has worked in and around Australian education over decades as a teacher, academic, commentator and consultant. He is co-author, with Raewyn Connell, Sandra Kessler and Gary Dowsett, of the 1982 classic Making the Difference: Schools, Families and Social Division. Unbeaching the Whale: Can Australian schooling be reformed? was published last year.

Science and writing: Why AERO’s narrow views are a big mistake

Will narrow instructional models promoted by AERO crowd out quality teaching and learning?

A recent ‘practice guide’ from the Australian Education Research Organisation (AERO) on ‘Writing in Science’ raises significant questions about the peak body’s narrow views on teaching and learning. Is AERO leading us in the wrong direction for supporting teachers to provide a rich and meaningful experience for Australian students?

The guide explains the nature of simple, compound and complex sentences in science. It provides samples of student writing with feedback teachers could give to improve the writing. There are suggestions for teachers to generate and unpack exemplar sentences and lists of nouns and adjectives, accompanied by practice exercises.

Yet a close reading shows these analyses fall well short of best practice in analysing science writing. Further, this advice is missing any comprehensive linguistic account of grammar as a resource for meaning in text construction; any critical perspective on the function of different kinds of texts in making sense of science; and any attention to the commitment of teachers of science to developing science ideas.

We are world leaders

Yet, Australian researchers in literacy are world leaders in thinking about the functions of text in generating meaning across different genres and in writing to learn in science.

AERO has ignored such research. It  sacrifices what we know about engaging and meaningful teaching and learning practice on the altar of its ideological commitment to impoverished interpretations of explicit teaching. 

While the practice guide is useful for alerting teachers to the importance of explicit attention to writing in science, it could do better by drawing on our rich research base around meaningful pedagogies (which include explicit teaching elements) that engage students and enrich science teachers’ practice.

This story of ignoring a wealth of sophisticated Australian and international research to enforce a simplistic instructional model is repeated across multiple curriculum areas, including science and  mathematics. AERO’s ‘evidence based’ model of a ‘science of learning’ is based exclusively on studies involving one research methodology. It uses experimental and control conditions that inevitably restrict the range of teaching and learning strategies compared to those found in real classrooms. 

The research findings of the community of Australian and international mathematics and science education researchers, who have worked with students and teachers over many decades to establish fresh theoretical perspectives and rich teaching and learning approaches, have been effectively silenced.

What underpins this narrowing?

What underpins this narrowing of conceptions of teaching and learning that seems to have taken the Australian education system by storm? AERO bases its instructional model almost entirely on the theoretical framing of Cognitive Load Theory (CLT), particularly the research of John Sweller, who over four decades has established an impressive body of work outlining the repercussions of limitations in working memory capacity.

Sweller argues that when students struggle to solve complex problems with minimal guidance, they can fail to develop the schema that characterise expert practice. His conclusion is that teachers need to provide ‘worked examples’ that students can follow and practice to achieve mastery, an approach aligned with the ‘I do’, ‘we do’, ‘you do’ advocacy of AERO and the basis of the mandated pedagogy models of both New South Wales and Victoria. 

The argument that students can lose themselves in complexity if not appropriately guided is well taken. But this leap from a working memory problem to the explicit ’worked example’ teaching model fails to acknowledge the numerous ways, described in the research literatures of multiple disciplines, that teachers can support students to navigate complexity. In mathematics and science this includes the strategic setting up of problems, guided questioning and prompting, preparatory guidance, communal sharing of ideas, joint teacher-student text construction, or explicit summing up of schema emerging from students’ solutions. 

What really works

The US National Council of Teachers of Mathematics identifies seven, not one, effective mathematics teaching practices, some but not all of which involve direct instruction. An OECD analysis of PISA-related data identified three dominant mathematics teaching strategies, of which direct instruction was the most prevalent and the least related to mathematics performance, with active learning and, in particular, cognitive engagement strategies being more effective.

Sweller himself (1998) warned against overuse of the worked example as a pedagogy, citing student engagement as an important factor. Given these complexities, AERO’s silencing of the international community of mathematics and science educators seems stunningly misplaced. 

This global mathematics and science education research represents a rich range of learning theories, pedagogies, conceptual and affective outcomes, and purposes. The evidence in this literature overwhelmingly rejects the inquiry/direct instruction binary that underpins the AERO model. Further, the real challenge with learning concepts like force, image formation, probability or fractional operations has less to do with managing memory than with arranging the world to be seen in new ways. 

To be fair, the CLT literature has useful things to say about judging the complexity of problems, and the strong focus on teacher guidance is well taken, especially when the procedures and concepts to be learned are counter-intuitive. However, CLT research has mainly concerned problems that are algorithmic in nature, for which an explicit approach can more efficiently lead to the simple procedural knowledge outcomes involved. 

The short term advantage disappears

Even here, studies have shown that over the long term, the short-term advantage of direct instruction disappears. The real issues involved in supporting learning of complex ideas and practices are deciding when to provide explicit support, and of what type. This is where the teacher’s judgment is required, and it will depend on the nature of the knowledge, and the preparedness of students. To reduce these complex strategies to a single approach is the real offence of the AERO agenda, and of the policy prescriptions in Victoria and NSW. 

It amounts to the de-professionalisation of teachers when such decisions are short-circuited. 

Another aspect of this debate is the claim that a reform of Australian teaching and learning is needed because of the poor performance of students on NAPLAN and on international assessments such as PISA and TIMSS. While it is certainly true that we could do much better in education across all subjects, particularly with respect to the inequities in performance based on socio-economic factors and Indigeneity, our relative performance on international rankings is more complex than claimed.

Flies in the face of evidence

To claim this slippage results from overuse of inquiry and problem-solving approaches in science and mathematics flies in the face of evidence. In both subjects, teacher-centred approaches currently dominate. An OECD report providing advice for mathematics teachers based on the 2012 PISA mathematics assessment revealed that Australian students ranked ninth globally on self-reported use of memorisation strategies, and third-last on elaboration strategies (that is, making links between tasks and finding different ways to solve a problem). The latter strategies indicate the capability to solve the more difficult problems.

While it may be true that some versions of inquiry in school science and mathematics lack the necessary support structures, the wider evidence shows that a blanket imposition of explicit teaching as the corrective is a misguided overreaction.

How has it happened, that one branch of education research misleadingly characterised as ‘the’ science of learning, together with a narrow and hotly contested view of what constitutes ‘evidence’ in education, has become the one guiding star for our national education research organisation to the exclusion of Australian and international disciplinary education research communities? 

Schools are being framed as businesses

It has been argued AERO ‘encapsulates politics at its heart’ through its embedded links to corporate philanthropy and business relations and a brief to attract funding into education. Indeed, schools are increasingly being bombarded with commercial products. Schools are being framed as businesses. 

The teaching profession over the last decade has suffered concerted attacks from the media and from senior government figures. Are we seeing moves here to systematically de-professionalise teachers and restrict their practice through ‘evidence based’ resources focused on ‘efficient’ learning? Is this what we really want as our key purpose in education? In reality, experienced teachers will not feel restricted by these narrow versions of explicit teaching pedagogies and will engage their students in varied ways. How can they not? 

If the resources now being developed and promoted under the AERO rubric, as with ‘Writing in Science’, follow this barren prescription, we run the danger of a growing erosion of teacher agency and impoverishment of student learning.

We need a richer view of pedagogy

What we need, going forward, is a richer view of pedagogy based on the wider research literature, rather than the narrow base that privileges procedural practices. We need to engage with a more complex and informed discussion of the core purposes of education that is not circumscribed by a narrow insistence on NAPLAN and international assessments. We need to value our teaching profession and recognise the complex, relational nature of teaching and learning. Our focus should be on strengthening teachers’ contextual decision making, and not on constraining them in ways that will reduce their professionalism, and ultimately their standing.

  

Russell Tytler is Deakin Distinguished Professor and Chair of Science Education at Deakin University. He researches student reasoning and learning through the multimodal languages of science, socio-scientific issues and reasoning, school-community partnerships, and STEM curriculum policy and practice. Professor Tytler is widely published and has led a range of research projects, including current STEM projects investigating a guided inquiry pedagogy for interdisciplinary mathematics and science. He is a member of the Science Expert Group for PISA 2015 and 2025.

Proactive and preventative: Why this new fix could save reading (and more)

When our research on supporting reading, writing, and mathematics for older, struggling students was published last week, most of the responses missed the heart of the matter.

In Australia, we have always used “categories of disadvantage” to identify students who may need additional support at school and to provide funding for that support. Yet students who do not fit neatly into those categories have slipped through the gaps, and for many, the assistance came far too late, or achieved far too little. Despite an investment of $319 billion, little has changed, with inequity still baked into our schooling system.

Our systematic review, commissioned by the Australian Education Research Organisation, set out to identify the best contemporary model to identify underachievement and provide additional support – a multi-tiered approach containing three levels, or “tiers”, that increase in intensity.

[Figure: the three-tiered model of support (de Bruin &amp; Stocker, 2021)]

We found that if schools get Tier 1 classroom teaching right – using the highest possible quality instruction that is accessible and explicit – the largest number of students make satisfactory academic progress. When that happens, resource-intensive support at Tier 2 or Tier 3 is reserved for those who really need it. We also found that if additional layers of targeted support are provided rapidly, schools can get approximately 95% of students meeting academic standards before gaps become entrenched.

The media discussion of our research focused on addressing disadvantages such as intergenerational poverty, unstable housing, and “levelling the playing field from day one” for students starting primary school through early childhood education. 

These are worthy and important initiatives to improve equality in our society, but they are not the most direct actions that need to be taken to address student underachievement. Yes, we need to address both, BUT the most direct and high-leverage approach to reducing underachievement in schools is to improve the quality of instruction and the timeliness of intervention in reading, writing, and mathematics.

Ensuring that Tier 1 instruction is explicit and accessible for all students is both proactive and preventative. It means that the largest number of students acquire foundational skills in reading, writing, and mathematics in the early years of primary school. This greatly reduces the proportion of students with achievement gaps from the outset. 

This is an area that needs urgent attention. The current rate of underachievement in these foundational skills is unacceptable, with approximately 90,000 students failing to meet national minimum standards. These students do not “catch up” on their own. Rather, achievement gaps widen as students progress through their education. Current data show that, on average, one in every five students starting secondary school is significantly behind their peers and has the skills expected of a student in Year 4.


For students in secondary school, aside from the immediate issues of weak skills in reading, writing and mathematics, underachievement can lead to early leaving as well as school failure. Low achievement in reading, writing, and mathematics also means that individuals are more likely to experience negative long-term impacts post-school including aspects of employment and health, resulting in lifelong disadvantage. As achievement gaps disproportionately affect disadvantaged students, this perpetuates and reinforces disadvantage across generations. Our research found that it’s never too late to intervene and support these students. We also highlighted particular practices that are the most effective, such as explicit instruction and strategy instruction.

For too long, persistent underachievement has been disproportionately experienced by disadvantaged students, and efforts to achieve reform have failed. If we are to address this entrenched inequity, we need large-scale systemic improvement as well as improvement within individual schools. Tiered approaches, such as the Multi-Tiered System of Supports (MTSS), build on decades of research and policy reform in the US for just this purpose. These have documented success in helping schools and systems identify and provide targeted intervention to students requiring academic support. 

In general, MTSS is characterised by:

  • the use of evidence-based practices for teaching and assessment
  • a continuum of increasingly intensive instruction and intervention across multiple tiers
  • the collection of universal screening and progress monitoring data from students
  • the use of this data for making educational decisions
  • a clear whole-school and whole-of-system vision for equity

What is important and different about this approach is that support is available to any student who needs it. This contrasts with the traditional approach, where support is too often reserved for students identified as being in particular categories of disadvantage, for example, students with disabilities who receive targeted funding. When MTSS is correctly implemented, students who are identified as requiring support receive it as quickly as possible. 

What is also different is that the MTSS framework is based on the assumption that all students can succeed with the right amount of support. Students who need targeted Tier 2 support receive that in addition to Tier 1. This means that Tier 2 students work in smaller groups and receive more frequent instruction to acquire skills and become fluent until they meet benchmarks. The studies we reviewed showed that when Tiers 1 and 2 were implemented within the MTSS framework, only 5% of students required further individualised and sustained support at Tier 3. Not only did our review show that this was an effective use of resources, but it also resulted in a 70% reduction in special education referrals. This makes MTSS ideal for achieving system-wide improvement in equity, achievement, and inclusion.
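A minimal sketch of the decision logic described above, with invented thresholds and screening data (real MTSS implementations rely on validated screening tools, agreed benchmarks and ongoing progress monitoring rather than a single cut-off):

```python
from dataclasses import dataclass

@dataclass
class ScreeningRecord:
    student_id: str
    reading_score: float        # hypothetical universal screening score
    weeks_below_benchmark: int  # hypothetical progress-monitoring data after Tier 2 support

def assign_tier(record: ScreeningRecord, benchmark: float = 60.0) -> int:
    """Return the highest tier of support indicated by the screening data."""
    if record.reading_score >= benchmark:
        return 1   # core classroom instruction is enough
    if record.weeks_below_benchmark < 8:
        return 2   # small-group, more frequent, targeted instruction
    return 3       # individualised, sustained, intensive intervention

students = [
    ScreeningRecord("A", 72, 0),
    ScreeningRecord("B", 55, 3),
    ScreeningRecord("C", 48, 12),
]
for s in students:
    print(f"student {s.student_id} -> Tier {assign_tier(s)}")
```

The point of the sketch is simply that the data, rather than a category of disadvantage, is what triggers the extra support.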

Our research could not be better timed. The National School Reform Agreement (NSRA) is currently being reviewed to make the system “better and fairer”. Clearly, what is needed is a coherent approach for improving equity and school improvement that can be implemented across systems and schools and across states and territories. To this end, MTSS offers a roadmap to achieve these targets, along with some lessons learned from two decades of “getting it right” in the US. One lesson is the importance of using implementation science to ensure MTSS is adopted and sustained at scale and with consistency across states. Another is the creation of national centres for excellence (e.g., for literacy: https://improvingliteracy.org), and technical assistance centres (e.g., for working with data: https://intensiveintervention.org) that can support school and system improvement.

While past national agreements in Australia have emphasised local variation across the states and territories, our research findings highlight that systemic equity-based reform through MTSS requires a consistent approach across states, districts, and schools. Implemented consistently and at scale, MTSS is not just another thing. It has the potential to be the thing that may just change the game for Australia’s most disadvantaged students at last.

Dr Kate de Bruin is a senior lecturer in inclusion and disability at Monash University. She has taught in secondary school and higher education for over two decades. Her research examines evidence-based practices for systems, schools and classrooms with a particular focus on students with disability. Her current projects focus on Multi-Tiered Systems of Support with particular focus on academic instruction and intervention. Dr Eugénie Kestel has taught in both school and higher education. She taught secondary school mathematics, science and English and currently teaches mathematics units in the MTeach program at Edith Cowan University. She conducts professional development sessions and offers MTSS mathematics coaching to specialist support staff in primary and secondary schools in WA. Dr Mariko Francis is a researcher and teaching associate at Monash University. She researches and instructs across tertiary, corporate, and community settings, specializing in systems approaches to collaborative family-school partnerships, best practices in program evaluation, and diversity and inclusive education. Professor Helen Forgasz is a Professor Emerita (Education) in the Faculty of Education, Monash University (Australia). Her research includes mathematics education, gender and other equity issues in mathematics and STEM education, attitudes and beliefs, learning settings, as well as numeracy, technology for mathematics learning, and the monitoring of achievement and participation rates in STEM subjects. Ms Rachelle Fries is a PhD candidate at Monash University. She is a registered psychologist and an Educational &amp; Developmental registrar with an interest in working to support diverse adolescents and young people. Her PhD focuses on applied ethics in psychology.

AERO responds to James Ladwig’s critique

AERO’s response is below, with additional comments from Associate Professor Ladwig. For more information about the statistical issues discussed, a more detailed Technical Note is available at AERO.

On Monday, EduResearch Matters published a post by Associate Professor James Ladwig which critiqued the Australian Education Research Organisation’s Writing development: what does a decade of NAPLAN data reveal?


AERO: This article makes three key criticisms about the analysis presented in the AERO report, which are inaccurate.

Ladwig claims that the report lacks consideration of sampling error and measurement error in its analysis of the trends of the writing scores. In fact, those errors were accounted for in the complex statistical method applied. AERO’s analysis used both simple and complex statistical methods to examine the trends. While the simple method did not consider error, the more complex statistical method (referred to as the ‘Differential Item Analysis’) explicitly considered a range of errors (including measurement error, and cohort and prompt effects).

Associate Professor Ladwig: AERO did not include any of that in its report nor in any of the technical papers. There is no over-time DIF analysis of the full score – and I wouldn’t expect one. All of the DIF analyses rely on data that itself carries error (more below). There is no way for the educated reader to verify these claims without expanded and detailed reporting of the technical work underpinning this report. This is lacking in transparency, falls short of the standards we should expect from AERO and makes it impossible for AERO to be held accountable for its specific interpretation of their own results.

AERO: Criticism of the perceived lack of consideration of ‘ceiling effects’ in AERO’s analysis of the trends of high-performing students’ results omits the fact that AERO’s analysis focused on the criteria scores (not the scaled measurement scores). AERO used the proportion of students achieving the top 2 scores (not the top score), for each criterion, as the metric to examine the trends. Given only a small proportion of students achieved a top score for any criterion (as shown in the report statistics), there is no ‘ceiling effect’ that could have biased the interpretation of the trends.

Associate Professor Ladwig made his ‘ceiling effect’ comments while explaining how the NAPLAN writing scores are designed, not in relation to the AERO analysis.

AERO: The third major inaccuracy relates to the comments made about the ‘measurement error’ around the NAPLAN bands and the use of adaptive testing to reduce error. These are irrelevant to AERO’s analysis because the main analysis did not use scaled scores, it did not use bands, and adaptive testing is not applicable to the writing assessment.

Associate Professor Ladwig’s comment was about the scaling in relation to explaining the score development, not about the AERO analysis.

In relation to the AERO use of NAPLAN criterion score data in the writing analysis, however, please note that those scores are created either through scorer moderation processes or (increasingly, where possible) text-interpretative algorithms. Here again the reliability of these raw scores was not addressed, apart from one declared limitation noted, in AERO’s own terms:

Another key assumption underlying most of the interpretation of results in this report is that marker effects (that is, marking inconsistency across years) are small and therefore they do not impact on the comparability of raw scores over time. (p. 66)

This is where AERO has taken another short cut, with an assumption that should not be made. ACARA has reported the reliability estimates needed to include that error in the analysis of scores. It is readily possible to report those estimates and use them in trend analyses.

AERO: A final point: the mixed-methods design of the research was not recognised in the article. AERO’s analysis examined the skills students were able to achieve at the criterion level against curriculum documents. Given the assessment is underpinned by a theory of language, we were able to complement the quantitative analysis with a qualitative analysis that specifically highlighted the features of language students were able to achieve. This was validated by analysis of student writing scripts.

Associate Professor Ladwig says this is irrelevant to his analysis. The logic of this is also a concern. Using multiple methods and methodologies does not correct for any that are technically lacking. In relation to the overall point of concern, we have a clear example of an agency reporting statistical results in a manner that evades external scrutiny, accompanied by extreme media positioning. Any of the qualitative insights into the minutiae these numbers represent will probably be very useful for teachers of writing – but whether or not they are generalisable, big, or shifting depends on those statistical analyses themselves.

AERO’s writing report is causing panic. It’s wrong. Here’s why.

If ever there was a time to question public investment in developing reports using ‘data’ generated by the National Assessment Program, it is now, with the release of the Australian Education Research Organisation’s report ‘Writing development: What does a decade of NAPLAN data reveal?’

I am sure the report was meant to provide reliable diagnostic analysis for improving the function of schools. 

It doesn’t. Here’s why.

There are deeply concerning technical questions about both the testing regime which generated the data used in the current report, and the functioning of the newly created (and arguably redundant) office which produced this report.

There are two lines of technical concern which need to be noted. These concerns reveal reasons why this report should be disregarded – and why the media response is a beat-up.

The first technical concern for all reports of NAPLAN data (and any large scale survey or testing data) is how to represent the inherent fuzziness of estimates generated by this testing apparatus.  

Politicians and almost anyone outside of the very narrow fields reliant on educational measurement would like to talk about these numbers as if they are definitive and certain.

They are not. They are just estimates – and all of the reported summary statistics are just estimates.

The fact these are estimates is not apparent in the current report.  There is NO presentation of any of the estimates of error in the data used in this report. 

Sampling error is important and, as ACARA itself has noted (see, for example, the 2018 NAPLAN technical report), must be taken into account when comparing the different samples used for analyses of NAPLAN. This form of error is the estimate used to generate confidence intervals and calculations of ‘statistical difference’.

Readers who recall seeing survey results or polling estimates being represented with a ‘plus or minus’ range will recognise sampling error. 

Sampling error is a measure of the probability of getting a similar result if the same analyses were done again, with a new sample of the same size, with the same instruments, etc.  (I probably should point out that the very common way of expressing statistical confidence often gets this wrong – when we say we have X level of statistical confidence, that isn’t a percentage of how confident you can be with that number, but rather the likelihood of getting a similar result if you did it again.)  

In this case, we know about 10% of the population do not sit the NAPLAN writing exam, so we already know there is sampling error.  

This is also the case when trying to infer something about an entire school from the results of a couple of year levels. The problem here is that we know the sampling error introduced by test absences is not random, and accounting for it can very much change trend analyses, especially for sub-populations. So, what does this persuasive writing report say about sampling error?

Nothing. Nada. Zilch. Zero. 

Anyone who knows basic statistics knows that when you have very large samples, the amount of error is far less than with smaller samples.  In fact, with samples as large as we get in NAPLAN reports, it would take only a very small difference to create enough ripples in the data to show up as being statistically significant.  That doesn’t mean, however, the error introduced is zero – and THAT error must be reported when representing mean differences between different groups (or different measures of the same group).
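A minimal sketch of the point, using invented cohort sizes, means and standard deviations rather than anything drawn from NAPLAN: with samples in the hundreds of thousands, even a two-point gap on an 800-point-style scale clears the bar for ‘statistical significance’, yet the standard error and the interval around the difference still exist and still belong in the report.

```python
import math

# Invented numbers for illustration: two cohorts of 200,000 students on an
# 800-point-style scale, differing by just 2 scale points on average.
n1, n2 = 200_000, 200_000
mean1, mean2 = 560.0, 558.0        # hypothetical cohort means
sd1, sd2 = 70.0, 70.0              # hypothetical standard deviations

diff = mean1 - mean2
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)     # standard error of the difference
z = diff / se                                  # test statistic
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:.1f} points, standard error = {se:.2f}")
print(f"95% confidence interval = ({ci_low:.2f}, {ci_high:.2f}), z = {z:.1f}")
# The tiny gap is 'significant' (z is about 9) purely because n is huge;
# reporting the point estimate without the interval hides that context.
```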

Given the size of the samples here, you might think it OK to let that slide. However, that isn’t the only short cut taken in the report. The second most obvious measure ignored in this report is measurement error. Measurement error exists any time we create some instrument to estimate a ‘latent’ variable – that is, something you can’t see directly. We can’t SEE achievement directly – it is an inference based on measuring several things we can theoretically argue are valid indicators of that thing we want to measure.

Measurement error is by no means a simple issue but directly impacts the validity of any one individual student’s NAPLAN score and any aggregate based on those individual results. In ‘classical test theory’ a measured score is made up of what is called a ‘true score’ and error (+/-). In more modern measurement theories error can become much more complicated to estimate, but the general conception remains the same. Any parent who has looked at NAPLAN results for their child and queried whether or not the test is accurate is implicitly questioning measurement error.
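A minimal simulation of that ‘true score plus error’ idea (every number below is invented for illustration): two administrations of the same test give every student a somewhat different observed score even though nothing about the students has changed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: observed score = true score + measurement error.
# All values below are invented for illustration.
n_students = 100_000
true_scores = rng.normal(500, 70, n_students)   # latent ability we cannot see
noise_sd = 35                                    # hypothetical measurement noise

observed_1 = true_scores + rng.normal(0, noise_sd, n_students)
observed_2 = true_scores + rng.normal(0, noise_sd, n_students)  # a parallel test

reliability = true_scores.var() / observed_1.var()  # share of variance that is 'true'
typical_shift = np.abs(observed_1 - observed_2).mean()

print(f"reliability of the observed score is about {reliability:.2f}")   # roughly 0.8 here
print(f"typical score change between parallel tests is about {typical_shift:.0f} points")
```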

Educational testing advocates have developed many very mathematically complicated ways of dealing with measurement error – and have developed new testing techniques for improving their tests. The current push for adaptive testing is precisely one of those developments, in the local case rationalised on the grounds that adaptive testing (where the specific test items asked of the person being tested change depending on prior answers) does a better job of differentiating those at the top and bottom ends of the scoring range (see the 2019 NAPLAN technical report for this analysis).

This bottom/top of the range problem is referred to as a floor or ceiling effect. When a large proportion of students either don’t score anything or get everything correct, there is no way to differentiate those students from each other – adaptive testing is a way of dealing with floor and ceiling effects better than a predetermined set of test items. This adaptive testing has been included in the newer deliveries of the online form of the NAPLAN test.

Two important things to note. 

First, the current report claims the scores of high-performing students have shifted down – despite new adaptive testing regimes obtaining very different patterns of ceiling effect. Second, the test is not identical for all students (the tests never have been).

The process used for selecting test items is based on ‘credit models’ generated by testers. Test items are determined to have particular levels of ‘difficulty’ based on the probability of correct answers being given from different populations and samples, after assuming population-level equivalence in prior ‘ability’ AND creating difficulty scores for items while assuming individual student ‘ability’ measures are stable from one time period to the next. That’s how they can create these 800-point scales that are designed for comparing different year levels.

So what does this report say about any measurement error that may impact the comparisons they are making?  Nothing.

One of the ways ACARA and politicians have settled their worries about such technical concerns as accurately interpreting statistical reports is by introducing the reporting of test results in ‘Bands’. Now these bands are crucial for qualitatively describing rough ranges of what the number might mean in curriculum terms – but they come with a big consequence. Using ‘Band’ scores is known as ‘coarsening’ data – when you take a more detailed scale and summarise it in a smaller set of ordered categories – and that process is known to increase any estimates of error. This latter problem has received much attention in the statistical literature, with new procedures being recommended for how to adjust estimates to account for that error when conducting group comparisons using that data.
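A minimal sketch of what coarsening does, using an invented band width and invented scale scores rather than ACARA’s actual banding procedure: mapping each score to its band midpoint adds a layer of rounding error on top of the sampling and measurement error already present.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented scale scores and an invented band width, for illustration only.
scores = rng.normal(500, 70, 100_000)
band_width = 52
band_midpoints = (scores // band_width) * band_width + band_width / 2

rounding_error = band_midpoints - scores

print(f"standard deviation of scale scores:   {scores.std():.1f}")
print(f"standard deviation of band midpoints: {band_midpoints.std():.1f}")
print(f"extra error introduced by banding:    {rounding_error.std():.1f} points")
# Any group comparison computed from band scores inherits this extra error,
# which is why the statistical literature recommends adjusting for it.
```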

As before, the amount of reporting of that error issue? Nada.

 This measurement problem is not something you can ignore – and yet the current report is worse than careless on this question.

It takes advantage of readers not knowing about it. 

When the report attempts to diagnose which components of the persuasive writing tasks were of most concern, it does not bother reporting that each of the separate measures of those ten dimensions of writing carries far more error than the total writing score, simply because the number of marks for each is a fraction of the total. The smaller the number of indicators, the more error (and the less reliability).
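The Spearman-Brown relationship gives a rough feel for how much reliability a criterion score built from a handful of marks gives up relative to the 48-mark total; the reliability figure and mark counts below are assumptions for illustration, not values taken from the NAPLAN technical reports.

```python
def spearman_brown(full_test_reliability: float, fraction_of_marks: float) -> float:
    """Predicted reliability when a score uses only a fraction of the test's marks."""
    k = fraction_of_marks
    return k * full_test_reliability / (1 + (k - 1) * full_test_reliability)

TOTAL_MARKS = 48
ASSUMED_FULL_RELIABILITY = 0.90   # hypothetical reliability of the 48-mark total

for criterion_marks in (6, 4, 3):  # e.g. criteria worth only a few marks each
    r = spearman_brown(ASSUMED_FULL_RELIABILITY, criterion_marks / TOTAL_MARKS)
    print(f"{criterion_marks}-mark criterion: predicted reliability of about {r:.2f}")
# Output with these assumptions: roughly 0.53, 0.43 and 0.36 respectively,
# that is, far noisier than the total score built from all 48 marks.
```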

Now all of these technical concerns simply raise the question of whether or not the overall findings of the report will hold up to robust tests and rigorous analysis – there is no way to assess that from this report, but there is an even bigger reason to question why it was given as much attention as it was. That is, for any statistician, there is always a challenge to translate the numeric conclusions into some form of ‘real life’ scenario.

To explain why AERO has significantly dropped the ball on this last point, consider its headline claim that Year 9 students have had declining persuasive writing scores, and its representation of that as somehow a major new concern.

First note that the ONLY reporting of this using the actual scale values is a vaguely labelled line graph showing scores from 2011 until 2018 – skipping 2016, since the writing task that year wasn’t for persuasive writing (p. 26 of the report has this graph). Of those year-to-year shifts, the only two that may be statistically significant, and are readily visible, are from 2011 to 2012, and then again from 2017 to 2018. Why speak so vaguely? From the report, we can’t tell you the numeric value of that drop, because there is no reporting of the actual number represented in that line graph.

Here is where the final reality check comes in.  

If this data matches the data reported in the national reports from 2011 and 2018, the named mean values on the writing scale were 565.9 and 542.9 respectively. So that is a drop between those two time points of 23 points. That may sound like a concern, but recall those scores are based on 48 marks given for writing. In other words, that 23-point difference is no more than one mark difference (it could be far less, since each different mark carries a different weighting in the formulation of that 800-point scale).

Consequently, even if all the technical concerns were given sufficient attention and the pattern still held, the realistic headline for the Year 9 claim would be: ‘Year 9 students in the 2018 NAPLAN writing test scored one less mark than the Year 9 students of 2011.’

Now assuming that 23 point difference has anything to do with the students at all, start thinking about all the plausible reasons why students in that last year of NAPLAN may not have been as attentive to details as they were when NAPLAN was first getting started.   I can think of several, not least being the way my own kids did everything possible to ignore the Year 9 test – since the Year 9 test had zero consequences for them.  

Personally, I find these reports troubling for many reasons, including the use of statistics to assert certainty without good justification, but also because saying student writing has declined belies the obvious fact that it hasn’t been all that great for decades. This is where I am totally sympathetic to the issues raised by the report – we do need better writing among the general population. But using national data to produce a report of this calibre, by an agency beholden to government, really does little more than provide click-bait and knee-jerk diagnosis from all sides of a debate we don’t really need to have.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York. James is on Twitter @jgladwig
