On Interpretations of Statistical Findings and Their Meaning: Are We Becoming Dumber, and Why? (Column 150)
With God's help
A few days ago I read that at long last there is scientific confirmation of the thesis of the decline of the generations. A study conducted in Norway shows that the IQ of men has been declining in recent years (beginning with those born in 1975) relative to their parents at the same age. This is in contrast to what had been accepted until now, namely that each decade saw an average increase of 3 points on the IQ index (the Flynn effect). For me, this is a truly perfect finding, since it gives me a weapon against my children (they should listen to their father, because he is smarter), and at the same time it supports my position that there is no decline in the intelligence of the later authorities—and of us—relative to previous generations (as against that interpretation of the 'decline of the generations' in the context of Jewish law). I could not have hoped for a more efficient yield from modern science in the service of the public.
The article mentions that the researchers attribute the findings to one of two phenomena: a. the use of screens of all kinds keeps young people from reading books. b. a change in methods of teaching languages and mathematics. This immediately raises the question: on what basis do they say this? Do the findings themselves somehow point to these interpretations? And in general, what does one do with raw statistical findings? How does one choose among different interpretations? I have already mentioned here several times that in every report on research findings (usually it is a survey rather than a study, but I will not once again get into the matter of junk science here) I find myself wondering about the interpretation given to the findings. How much of it is grounded in the findings, and how much is just the researcher’s/surveyor’s gut feeling?
I should begin by saying that the purpose of my remarks here is not to criticize that study (which I have not read in the original), but to point to a common interpretive failure regarding statistical findings. I will illustrate it using this survey as it is presented in the above article, and if the description there is inaccurate (as popular descriptions of studies usually are), my apologies to the researchers. I am merely using them and the article about the survey they conducted for my own purposes; I am not discussing the study itself.
Methodological Reflections
I would ask my children not to read this section, so as not to deprive me of the force of the argument ('listen to Dad, he’s smarter').
Since I have not read these things in the original, I cannot criticize or challenge them, but only reflect on these results and their interpretation. I will not enter here into criticisms of the very pretension to measure intelligence and to engage in psychometrics (measurement of the psyche), since even if there is something to such criticisms, they suffer from the postmodern conflation of uncertainty with doubt. True, the measure is not absolute, and it is clear that it does not fully reflect the intelligence it purports to measure, but some correlation presumably exists. Therefore, so long as no better measure is offered, it is reasonable to use this one despite its shortcomings. Beyond that, if Flynn’s results are indeed consistent—that is, if there is a consistent difference in the results of different generations—then this means there is a meaningful trend here. It is probably not a random and meaningless measure. If there were no connection between intelligence and the tests that measure it (that is, if it were random and not reflective at all), I would not expect a consistent change, in the same direction and to the same extent, across generations.[1]
I now turn to the reflections. In psychometric tests or intelligence tests, it is customary to standardize every intelligence test to some fixed average, because each test is different and each group of examinees takes a different test, so one cannot compare the results of examinees from different groups. The assumption behind the standardization is that if the number of examinees is very large, their average psychometric score (indeed, the entire distribution) ought to be similar. In any such measurement, the pre-standardization results are measured relative to the group being tested and have little significance with respect to other groups (that is what standardization is for). After standardization, one can compare the results of examinees from different groups.
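To make the mechanics concrete, here is a minimal sketch of such a standardization (my own illustration, not from the study; the scale of mean 100 and standard deviation 15 is the conventional IQ scale, and the raw scores are invented):

```python
import statistics

def standardize_to_iq(raw_scores):
    """Rescale raw scores of one group of examinees so that the group
    mean is 100 and the standard deviation is 15 (the usual IQ scale)."""
    mu = statistics.mean(raw_scores)
    sigma = statistics.stdev(raw_scores)
    return [100 + 15 * (x - mu) / sigma for x in raw_scores]

# Two groups that took two *different* tests (invented raw scores):
group_a = standardize_to_iq([31, 42, 38, 55, 47])
group_b = standardize_to_iq([12, 19, 15, 23, 21])

# By construction both groups now average 100, so comparing the group
# means tells us nothing about which group is "smarter": cross-group
# comparison requires giving both groups the very same test.
```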
But comparative studies like the one described here compare different generations and groups, so they cannot all be standardized in the same way. And if they are not standardized, they cannot be compared. The conclusion is that in order to make a comparison without standardization, one must rely on the very same test. But what happens if we give the very same test to members of different generations? It depends on whether for this purpose we choose a test designed today (in the children’s generation) or one designed in the parents’ generation. The test designed today can no longer be given to the parents, because they are no longer the same age (the comparison is at the same age). Beyond that, they have more experience, but because of age they are slower than their children. This would be a problematic comparison from the outset. By contrast, a test designed in the parents’ generation (say, in 1975), which the parents took, is less suitable for psychometric measurement of the young people of 2000 and certainly of 2018, since it rests on forms of thought and formulation and on concepts current at that time, and it is reasonable to assume that these are not entirely familiar to the children and are less comfortable for them to use.
The conclusion is that, in practice, the comparison can be made only by administering the old test to this generation, but such a comparison is highly problematic and unrepresentative. Therefore it is hard for me to see how such comparative findings can be established, especially given that we are speaking of a change of 3 points, which is rather small and could certainly be the result of side effects such as these.
Moreover, it is worth noting that the cultural change we are undergoing is becoming ever faster, and therefore it will be harder for today’s young people to understand the tests of thirty years ago and succeed on them than it was for the young people of 1980 relative to tests formulated in 1950. The cultural and linguistic change there was smaller. Had they given the young people of 1980 the tests of the young people of 1900, perhaps we would have seen the same decline (because over 80 years then, changes occurred that today happen in ten years). This effect alone could also explain the difference in results without positing a real decline in intelligence. Of course, if we were to give the young people of then (who are today’s adults) the test designed today and compare them with today’s young people, that would ostensibly be more representative (because those adults live today as well and roughly understand the language, at least more than today’s young people understand the language of the past). But as I explained above, that too would be an incorrect comparison, and as far as I understand, this was not done in the study under discussion either.
The upshot is that until 1975, when the changes from one generation to the next were relatively small, an increase in IQ was demonstrated, and that is probably a measure that may perhaps be regarded as reflecting the actual situation. But from that point on, when cultural changes became very rapid, a decline is demonstrated. If so, it is possible that the real situation has not in fact changed at all: the rising trend across generations (the Flynn effect) may well have continued, while the rapid cultural changes beginning around that year create a downward measurement bias that offsets it.
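A toy model may make this offset concrete (entirely my own sketch; every number in it is invented for illustration):

```python
# Measured score on an old test = true ability + Flynn gain
# minus a familiarity penalty that grows with cultural distance.

def measured_iq(true_iq, flynn_gain, years_of_gap, penalty_per_year):
    return true_iq + flynn_gain - years_of_gap * penalty_per_year

# Before ~1975: slow cultural change, so the penalty is small.
print(measured_iq(100, flynn_gain=3, years_of_gap=10, penalty_per_year=0.05))
# -> 102.5 : the Flynn effect shows through as a measured increase.

# After ~1975: same true ability, same Flynn gain, but cultural change
# is faster, so the old test is far less familiar.
print(measured_iq(100, flynn_gain=3, years_of_gap=10, penalty_per_year=0.6))
# -> 97.0 : a measured "decline" with no real change in intelligence.
```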
Interim Summary
The implication is that the findings obtained admit two interpretations: a. indeed, from the end of the twentieth century there began a decline in IQ, and the Flynn effect has reversed; b. modern measurement is biased relative to its predecessors because of an acceleration in the rate of cultural change. I assume one can think of additional possible explanations, and of course one must carefully examine what exactly was done in this study (which, as noted, I have not done). But when there are two possible explanations, one cannot adopt one over the other without justification. One can of course ask which of them seems more plausible a priori (independently of the findings), but if we choose the explanation that seems more reasonable to us, we will fall into the interpretive fallacy that I will now explain.
Reflections on Interpretation: Is It Right to Adopt the Most Plausible Interpretation?
Up to this point I have discussed the validity of the findings themselves. But even if the findings are valid and significant, and the conclusion is that the measurement is not biased and that there has indeed begun a consistent decline in the IQ of recent generations, the question of interpreting these findings still remains (I do not mean here the previous question: whether this is a matter of the rate of change or of a real decline in IQ). The interpretation given by the researchers to the decline in IQ proposes two causes for these findings: changes in teaching methods and screen culture. I must say that both claims sound a priori plausible to me, but precisely for that reason I would be suspicious of them as interpretations proposed for the findings.
The reader may wonder why the very fact that these claims are plausible should cast a shadow over accepting them as interpretations of the findings. Should we prefer implausible claims? To understand this, it is important to clarify what statistics means in general. Unlike intuitive insights that rely on past experience and on forms of thinking that seem a priori reasonable to us for various reasons, statistics has added value because it anchors our conclusions in empirical facts. Suppose I think a priori that screen culture makes us stupid; why then do I need the above survey, that is, the statistical findings? I carry out the survey in order to test whether my intuitions are correct. My purpose is to strengthen that intuition and anchor it empirically. But if the findings obtained have several possible interpretations, then the statistics themselves admit several interpretations, and therefore they do not necessarily strengthen my a priori intuition. That intuition is of course not weakened, but using it as the explanation of the findings cannot make the findings into a tool that circles back and empirically reinforces that very intuition. Adopting it as the explanation stems from the fact that it seemed sensible to me from the outset, and therefore the findings as such do not really strengthen it.
Here we must distinguish between two situations. In a situation where we have several possible explanations before us, and this explanation is more plausible in terms of the sense it gives to the findings, then there may perhaps be room to say that adopting it as an explanation of the findings can in turn strengthen it itself (although even in the situation described here one can still sharpen the analysis). But if there are several explanations that all seem equally plausible to us as accounts of the findings, and only one of them is my a priori intuition (it seems more plausible to me in itself, a priori, independently of the findings), then it would be a mistake to use that explanation to generate empirical support for itself. In such a situation, the findings have added no strength to it. For example, if I believed that screens do not make us stupid, I could explain the findings by the change in teaching methods, and vice versa. In such a situation, the findings have not really changed anything relative to what I thought beforehand. Everyone may of course remain with his belief (just not claim empirical validity for it).
The conclusion is that a person who wants to prove that screens make us stupid will not be able to do so on the basis of such a survey, both because, as I noted, it is not clear that the survey really shows a process of becoming more stupid, and also because the explanation for that stupidity, even if it exists, is not necessarily screen culture. But we do have a tendency to take findings and use them to reinforce our a priori positions. This is a mistake in statistical thinking.
A Look at the Principles of Faith
I once made a similar argument regarding the principles of faith. I argued there that precisely if the principle in question is reasonable, that makes me suspect all the more that it was not given at Sinai. For example, the survival of the soul sounds very reasonable to a dualistic ear. Those who believe in the existence of a soul beyond the body naturally wonder what happens to it when the body dies (when it 'leaves' it). Materialists think that when the body dies the soul dies with it (because it is not a separate entity but a by-product that emerges from the material whole; this approach is called emergentism), but a dualist will naturally tend to believe a priori in the survival of the soul.
The claim then arises that the very fact that Jewish tradition transmitted to us the belief in the survival of the soul strengthens it even more. It is no longer merely an a priori line of reasoning, but information received and transmitted to us from Sinai. The reasoning has 'empirical' reinforcement. But I doubt that reinforcement, precisely because of how reasonable this belief is. If it is so reasonable, then it is entirely possible that the sages arrived at this belief by their own reasoning and not because of a tradition they received from Sinai. And because it seemed so reasonable to them, they established it as a binding principle. Therefore the tradition does not strengthen my belief in this principle. Again, it still seems reasonable to me, but because of the a priori consideration. The fact that a tradition reached me saying this does not strengthen the matter in my eyes.
One must understand that there is a difference between rational conclusions and information that reaches us by revelation. Rational conclusions, however persuasive they may appear, can always turn out to be mistaken. After all, anyone, however wise, can err. By contrast, information that comes by revelation is presumably true. God does not make mistakes, and therefore such information should not turn out to be false. Therefore the belief in the survival of the soul indeed remains reasonable in my eyes, but the confirmation it receives from the fact that it came to us through tradition does not especially strengthen it for me.
Moreover, it is important to understand that beliefs can be formed by sages in various ways. I described above one way: sages arrive at some rational conclusion and because of its rational force establish it as a binding belief. But there is also another way: sages, through their reasoning, interpret the verses in such a way (they conclude that the prophet or the verse in the Torah speaks about the survival of the soul), and now we already have prophecy, information by revelation, and that cannot turn out to be false. But again, this 'information' is the product of interpretation, and that can be mistaken. Especially if that very interpretation is itself produced by the methods I described above (it is adopted because it appears reasonable to us in itself, a priori). Therefore here too the interpretation of the verses does not necessarily strengthen my a priori intuition, despite my tendency to think that it does (because I incline to that conclusion in any case).
Here too one can divide between two situations:
- A situation in which there are two different interpretations of the same verses with equal interpretive weight or validity. I choose the one that seems more plausible to me a priori. In such a situation, the interpretation does not constitute reinforcement for the a priori intuition.
- A situation in which we have chosen the more plausible interpretation on textual grounds. If it converges with our reason, that may indeed strengthen it.
But even in the second situation, if there is someone who disagrees with us (a materialist), it is doubtful whether such an interpretive argument will have any value for him. He will assume the opposite and prefer to interpret the verses differently (even if the interpretation that leads to our approach is preferable on interpretive grounds, for it is commonly accepted that it is better to strain the language than the reasoning). I can only return us to the painful question of drawing conclusions from the study of the Hebrew Bible (Tanakh) (see columns 134–135).
Are We Less Intelligent?
If, in closing, we return to the question of the findings of the above survey, I am not sure I buy such claims. When logarithm tables were phased out (yes, yes, I really am a dinosaur) and calculators took over, everyone warned of the coming dumbing-down: children would no longer know how to perform the four basic arithmetic operations; the calculator would think for them. There is something to this, but it is clear that in the age of the calculator and the computer we need other skills, and therefore intelligence will still be required, only of a different kind. It is a pity to insist on preserving old skills just because we have grown used to them, instead of trying to develop skills that are relevant to the present situation.
True, screen culture is problematic. People do not have the patience to delve into arguments and read material. They want everything immediate, short, and focused. But there are advantages to this as well, and I must admit that I am not immune to them: it forces writers to be sharp and focused. Instead of trying to cling by force to the old ways of thinking and relating, it is worth thinking about what to do in the new situation. How does one develop writers’ ability to sharpen and focus, and how does one develop readers’ ability to read critically?
This, of course, is not black-and-white. Clearly the older abilities are also needed, and it is worth trying to preserve whatever can be preserved of them, but the picture is more complex. Perhaps by the parameters of the mid-to-late twentieth century people today are indeed stupider, but perhaps they have other skills in which they are better than their parents. We need only adapt the mode of measurement to contemporary culture (I do not know how to do that).[2]
[1] Of course, one can sharpen this further, but it seems to me entirely reasonable as a starting point for discussion.
[2] By the way, one possibility is to try to construct a test that would yield the same distribution and the same average IQ as in the parents’ generation. True, there is some begging the question here (namely, assuming that IQ ought not to decline), but no more than in the current measurement (which assumes that the skills that express IQ today are identical to those that did so then). Moreover, that is the assumption that underlies the standardization mechanism described above, and I am merely suggesting that it be extended.
The reader who was reminded of the posts on multiple intelligences and emotional intelligence (the end of Column 35, and Column 108) is indeed correct. These matters certainly require examination in light of what was said there, and vice versa.
Discussion
Quite right. If so, then these really are reflections that are not necessarily connected to the study, as I wrote.
With all due respect, the rabbi should consume his content from more serious sources than the Walla website and read the article where it was published.
I think the study actually does strengthen (to a limited extent, of course) the theory that screens cause a decline in intelligence. It is true that the study does not strengthen the theory relative to other explanations for the decline in intelligence, but it does strengthen it compared to the claim that screens do not cause a decline in intelligence. Overall:
Suppose we examine 3 possibilities:
1. Screens cause a decline in intelligence, and factor B does not cause it.
2. Factor B causes a decline in intelligence, and screens do not.
3. Neither screens nor factor B causes a decline in intelligence.
Suppose that a priori factor B does not seem plausible to us, so we give it no points, while screens do seem plausible to us, so we give them one point, but we also give one point to the possibility that there is no effect at all (because we have no evidence of any effect).
That is, explanations 1 and 3 are tied at 1, and explanation 2 has 0 points.
But after the experiment, the score situation is:
2 points for explanation 1
1 point for explanation 2
0 points (or 1) for explanation 3.
That is, after the experiment we have a win in favor of the screens explanation, even though before the experiment we were in a tie.
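One way to make this scoring precise is a simple Bayesian update (my own sketch; the priors and likelihoods are purely illustrative):

```python
# Hypotheses, following the comment above:
#   H1: screens cause the decline, factor B does not
#   H2: factor B causes the decline, screens do not
#   H3: neither causes a decline
priors = {"H1": 0.45, "H2": 0.10, "H3": 0.45}  # H1 and H3 start roughly tied

# Probability of observing a measured decline under each hypothesis.
# H3 keeps a small nonzero likelihood, since a decline could still
# appear there as a measurement artifact.
likelihood = {"H1": 0.9, "H2": 0.9, "H3": 0.1}

evidence = sum(priors[h] * likelihood[h] for h in priors)
posteriors = {h: priors[h] * likelihood[h] / evidence for h in priors}
for h in posteriors:
    print(h, round(posteriors[h], 2))
# H1 0.75, H2 0.17, H3 0.08 -- the prior tie between H1 and H3 is
# broken in favor of screens, exactly the commenter's point.
```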
Good evening,
I think I can add several significant points, an alternative proposal for interpreting the study’s results, and perhaps an important insight on the subject:
– An IQ test primarily purports to predict academic success, that is, success as the academic world defines it.
– The IQ test was built within an academic world that was based almost entirely on a person’s ability to store information and organize it independently, and thereby derive the maximum from it.
– IQ scores are influenced by the environment; educational systems encourage academic abilities. As long as the test represents relevant academic abilities, and as long as there is an increasing effort to promote academic abilities in the population, test scores rise.
– The abilities currently required in the academic world are steadily changing. The need for abilities to remember knowledge is steadily decreasing; even the ability to recall knowledge is no longer an important requirement. Instead, there is a growing understanding that what is needed for academic success is the ability to access knowledge that is available in a completely democratic way, to distinguish the essential from the secondary within it, and of course, as was also the case before, to identify the logical structure underlying it.
– The change in the abilities required for academic success also creates a change in educational systems, which aspire to impart those abilities to students.
– The abilities measured by the IQ test no longer match the abilities required for academic success, and therefore no longer match the skills that the educational system promotes. Therefore, one may propose as an alternative explanation for the decline in IQ scores a split between the abilities being measured and the abilities currently required for academic success.
– More generally, when the same test is used over many years, it is worth checking from time to time whether the test still predicts what it purports to predict…
Many thanks for the recommendation, but I do not see why. I have little interest in the subject of the article, so I do not see why it would be worthwhile for me to read it in the original. Here I only used it to illustrate a point that is unrelated to this study.
If I understood correctly, your point is that this survey rules out option 3 and leaves us with the first two. True, but that seems marginal to me. It is something like the argument connected to the raven paradox: suppose we have a theory that all ravens are black, and we want to put it to an experimental test. There is an equivalent claim: whatever is not black is not a raven. It is logically equivalent to our theory. How does one test it empirically? One examines something that is not black and verifies that it is not a raven. That is, seeing a pink chair confirms the claim that all ravens are black.
That is true, but negligible.
You are right that it is not always negligible. Sometimes there is more substantial confirmation here. It depends on the a priori probabilities assigned to option 3.
That is roughly what I wrote in more general terms. I would only note that an IQ test is not a psychometric test, and it is not necessarily meant to predict academic success or suitability.
The hypothesis of black ravens is not proved by seeing a pink chair, but only by seeing a pink non-raven.
Obviously, but when you see a pink chair, you also see a pink non-raven.
You are right that if one is conducting a planned experiment, one should look for pink objects and verify that they are chairs (or not ravens), and not look for chairs and verify that they are pink (or not black).
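A toy likelihood-ratio calculation (my own sketch, with invented numbers) shows both points at once: the pink-chair observation does confirm the hypothesis, but only negligibly, whereas sampling ravens directly confirms it far more strongly:

```python
# A toy world (all counts invented for illustration).
N_RAVENS = 100
N_NONBLACK_NONRAVENS = 900_000  # pink chairs and the like

# H : all ravens are black.
# H': some ravens are not black (assume half, for concreteness).
f_nonblack_ravens = 0.5

# Test A: sample a non-black object and check that it is not a raven.
p_A_given_H = 1.0
nonblack_under_Hp = N_NONBLACK_NONRAVENS + f_nonblack_ravens * N_RAVENS
p_A_given_Hp = N_NONBLACK_NONRAVENS / nonblack_under_Hp
print(p_A_given_H / p_A_given_Hp)   # ~1.000056: negligible confirmation

# Test B: sample a raven and check that it is black.
p_B_given_H = 1.0
p_B_given_Hp = 1.0 - f_nonblack_ravens
print(p_B_given_H / p_B_given_Hp)   # 2.0: much stronger confirmation
```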
From what I understood, in that article they were simply trying to rule out dependence on the genetics of the test subjects, and concluded that the reasons are not genetic but environmental.
As for the interpretation of what the environmental factor is, they did not even purport to give a clear explanation, but rather suggested that perhaps it is a change in dietary habits, perhaps exposure to screens, perhaps the improvement in living conditions, and perhaps some other environmental factor or a combination of environmental factors…