Is machine learning a pseudoscience?

Sridhar Mahadevan

Interesting question. The dictionary definition of a pseudoscience is:

  1. a collection of beliefs or practices mistakenly regarded as being based on scientific method.

Naturally, most ML researchers would probably be insulted if you referred to their field as a pseudoscience. My wife, who is a statistician and has taught at Harvard and Johns Hopkins, considers ML to largely be a “gosh and golly” field. In other words, ML, from her viewpoint as a statistician, is an unabashedly empirical field where the bulk of the results are somewhat suspect in terms of their statistical validity. She would undoubtedly classify ML as a pseudoscience.

Let’s examine the evidence. First, it is extremely important to distinguish between statistics and machine learning: I consider the former to unquestionably be a scientific field, as it defines the basic principles of data science. The analogy I often make in my graduate ML class is that ML is to statistics as electrical engineering is to physics. This is at best a weak analogy, as I don’t consider ML to be remotely as well-established a field as electrical engineering. Since I have two degrees in EE, I am familiar with a large set of its subfields, from the basic quantum physics of transistors to the large-scale engineering of power transmission.

EE is a scientific field because it builds on a solid foundation of physics (Maxwell’s equations, Ohm’s law), and it has built infrastructure that works incredibly reliably, for example the power grid that billions of people rely on each and every day. ML, to put it baldly, has not remotely achieved these capabilities, although the field has come a long way in the past 30 years. EE has also been around since Faraday’s discoveries in electromagnetism, so it has had a significantly longer period of time to mature into a scientific field.

Physicists are, by and large, interested in broad, sweeping principles and scientific laws that hold across the universe. For example, the conservation of energy holds true on Earth just as it does in the M51 galaxy, sometimes called the “Whirlpool” galaxy, estimated to be between 15 and 35 million light-years from us. This beautiful galaxy was discovered in 1773 by Charles Messier:

[Image: the Whirlpool Galaxy, M51]

Now, isn’t that pretty! It was not originally recognized as a separate galaxy; that understanding came from Hubble’s landmark work, based in part on observing so-called Cepheid variable stars, which pulsate with a known periodicity. Hubble later formulated the law that governs the redshift of galaxies far away from Earth, a prime example of a fundamental principle that is considered the bedrock of modern astronomy: it is widely cited as the basis for the expansion of the universe. This is the sort of activity that physicists do: they look for broad principles that explain the patterns of motion and energy across the universe.

Statisticians, similarly, look for broad principles that govern the sorts of inferences one can make from data. Consider, for example, the beautiful Rao-Blackwell theorem, one of the foundations of statistics. This theorem rests on the fundamental concept of a “sufficient statistic”: a function T(X) of a dataset X such that all the information the data carries about the parameter θ of some family of statistical models p(x; θ) is contained in T(X). Once T(X) is computed, the original dataset X can be discarded.

A simple example of a sufficient statistic is the sample mean. If the data X is assumed to be sampled i.i.d. from, say, a normal distribution with a given (unknown) mean that is to be inferred, the sample mean is a sufficient statistic, meaning that even if one has a million observations, computing the sample mean gives you the same information about the true mean as keeping around all million samples. One can replace a million numbers with one number! That’s the power of a sufficient statistic.
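As a minimal numerical sketch of my own (assuming, for simplicity, that the variance is known), one can check that the normal log-likelihood of the mean depends on the data only through the sample mean, up to an additive constant that does not involve the mean at all:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_mu, sigma = 1_000_000, 2.5, 1.0
x = rng.normal(true_mu, sigma, size=n)

xbar = x.mean()  # the sufficient statistic: one number summarizing a million

def loglik_full(mu):
    # log-likelihood term computed from all n observations
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

def loglik_suff(mu):
    # the part of the log-likelihood that depends on mu, written via xbar only
    return -0.5 * n * (xbar - mu) ** 2 / sigma**2

mus = np.linspace(2.0, 3.0, 5)
diff = [loglik_full(m) - loglik_suff(m) for m in mus]
# The difference is the same constant for every mu: all information about mu
# is carried by xbar alone, so the raw data can be thrown away.
print(np.allclose(diff, diff[0]))  # True
```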

So, what does the Rao-Blackwell theorem state? It says that if one starts with some estimator S(X) computed from the data X, the estimator can be uniformly improved by replacing it with its conditional expectation given a sufficient statistic:

Var( E[ S(X) | T(X) ] ) <= Var( S(X) )

Equality holds if and only if S(X) is itself already a function of the sufficient statistic. This is a very deep theorem, because it governs what one can reliably infer from data, and in fact it suggests how to improve an estimator. In machine learning, “Rao-Blackwellize” is now a verb: it means replacing an estimator with its conditional expectation given a sufficient statistic in order to reduce its variance. This is done routinely in approximate inference methods for graphical models, notably in particle filtering; it is used in SLAM algorithms in probabilistic robotics, and undoubtedly also in the localization of autonomous cars using GPS and other sensors.
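To make this concrete, here is a small textbook-style simulation of my own (not the theorem’s formal proof): for i.i.d. Bernoulli(p) data, the crude unbiased estimator S(X) = X_1 (just the first observation) can be Rao-Blackwellized by conditioning on the sufficient statistic T(X) = sum of the X_i, which gives E[X_1 | T] = T/n, i.e. the sample mean, with far smaller variance:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.3, 50, 100_000

# Each row is one dataset of n Bernoulli(p) observations.
data = rng.binomial(1, p, size=(trials, n))

crude = data[:, 0]             # S(X) = X_1: unbiased but high-variance
rao_black = data.mean(axis=1)  # E[S(X) | T(X)] = T(X)/n, the sample mean

print("both unbiased:", crude.mean().round(3), rao_black.mean().round(3))
print("variance of S(X):       ", crude.var().round(4))      # about p(1-p) = 0.21
print("variance after R-B step:", rao_black.var().round(4))  # about p(1-p)/n = 0.0042
```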

But are there such results in machine learning itself, without appealing to statistics? And what should we make of the vast number of papers in ML conferences that are largely empirical studies of some method on some specific datasets?

One genuine discovery of such a broad principle in machine learning is the principle of weak learning, or boosting, discovered by Robert Schapire (in his PhD thesis at MIT). Boosting is a process of taking a “weak learner”, meaning any classifier that performs only slightly better than random guessing, and turning it into a “strong” learner, meaning one whose accuracy on a dataset can be made extremely high. Boosting is both a well-established theory, by which one can build a high-quality classifier by combining a large number of very shallow, slightly-better-than-random classifiers, and a highly practical method that has been validated in countless studies.
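As a quick, hedged illustration (using scikit-learn’s AdaBoost implementation on a synthetic dataset, not any benchmark from the literature): a single depth-1 decision stump is the prototypical “weak learner”, and boosting a few hundred of them typically yields a far stronger classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic binary classification problem, purely for illustration.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single decision stump: a weak learner, only somewhat better than random.
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)

# AdaBoost combines many reweighted stumps into a strong learner.
boosted = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("single stump accuracy:  ", round(stump.score(X_te, y_te), 3))
print("boosted stumps accuracy:", round(boosted.score(X_te, y_te), 3))
```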

The best compliment paid to boosting came from the famous statistician Leo Breiman, who called boosting a way to “turn a sow’s ear into a silk purse”. Statisticians regard boosting as a genuine discovery of first-rate importance to their field, and something that was not previously known in statistics. There is only a small handful of such results in machine learning.

But there is a more troubling aspect of empirical machine learning that has no parallel in modern statistics. This is what my wife means when she says that machine learning is a “gosh and golly” field. By and large, papers at ICML and NIPS report on empirical studies of some method (e.g., GANs, deep learning classifiers for ImageNet, deep reinforcement learning, and so on) where there is simply no attempt to objectively analyze the performance of the proposed method in a statistically rigorous way. One routinely sees, even in the most recent ML conferences, experimental plots and tables that lack even a modicum of statistical rigor. In my experience in ML over a 30-year period, almost no one reports measures like p-values, statistical power, and other such criteria that are in fact routinely used in other fields.

If you tried to publish a medical study in a reputable journal on the efficacy of some drug on patients, and you did not report p-values or similar measures, you would be laughed off the face of the Earth. Your paper would not be accepted, as it would be considered “unscientific”, or even “pseudoscientific”. The same holds in fields like psychology, where experimental studies have long relied on similar statistical metrics.

Unfortunately, ML grew out of a different tradition, shaped more by an empirical “let’s try this idea and see if it kinda works” mindset, and for all the math one sees in ML conferences and journals, the field has not lost its “gosh and golly” foundation. I seriously doubt one can find a paper in deep (reinforcement) learning, for example, where the authors made a serious attempt to validate their results using statistical measures and showed whether the differences being reported could be due to chance alone. It is possible someone has done this, but it is not common in my long reviewing experience. To a highly trained statistician like my wife, reporting experimental results without something like p-values is just doing pseudoscience. The numbers being reported, she would argue, mean nothing unless one takes the care to check whether the observed improvement is in fact genuine.
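For what it’s worth, the kind of check she would ask for is not hard to run. Here is a minimal sketch (with entirely hypothetical accuracy numbers, assuming each model has been trained over the same ten random seeds), using a paired t-test and a bootstrap confidence interval from standard libraries:

```python
import numpy as np
from scipy import stats

# Hypothetical test accuracies of two models over the same 10 random seeds.
baseline = np.array([0.712, 0.708, 0.715, 0.709, 0.711,
                     0.714, 0.707, 0.713, 0.710, 0.712])
proposed = np.array([0.718, 0.716, 0.719, 0.714, 0.717,
                     0.720, 0.715, 0.718, 0.716, 0.719])

# Paired t-test: could the per-seed differences plausibly be due to chance alone?
t_stat, p_value = stats.ttest_rel(proposed, baseline)
print(f"mean improvement = {np.mean(proposed - baseline):.4f}, p = {p_value:.4g}")

# A distribution-free alternative: bootstrap the mean per-seed difference.
diffs = proposed - baseline
rng = np.random.default_rng(0)
boot = rng.choice(diffs, size=(10_000, len(diffs)), replace=True).mean(axis=1)
print("95% bootstrap CI for the improvement:", np.percentile(boot, [2.5, 97.5]).round(4))
```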

I am reminded of a famous commencement address Richard Feynman gave at Caltech, “Cargo Cult Science” (reprinted as a chapter in his book “Surely You’re Joking, Mr. Feynman!”), where he describes what makes an empirical study scientific rather than pseudoscientific. This lecture is well worth reading in its entirety. He cites one example of a study involving rats in a maze, which showed they were able to perform some amazing feats of learning to find the exits. Rather than simply report the findings, the researchers painstakingly analyzed each and every conceivable explanation for how the rats were able to do what they did, until they isolated the crucial cue the rats relied on. Once they discovered this, they found they could prevent the rats from running the maze successfully. According to Feynman, it is this sort of due diligence that elevated this particular empirical study from being pseudoscientific to being scientific.

A great example of such a scientific study, of interest to AI and ML, is the recent cracking of the code for faces in primate brains, a genuine discovery by Caltech biologists that may one day receive a Nobel Prize.

Cracking the Code of Facial Recognition | Caltech

What these biologists did was examine a small group of neurons in the inferotemporal (IT) cortex, a region of the visual cortex known to underlie face recognition in humans. By doing systematic experiments with monkeys, they were able to determine exactly what this group of neurons was computing, to the point where they could generate synthetic faces and predict what the firing rates of the neurons would be, as well as predict the face characteristics from the firing rates, using a simple regression model. They discovered, in effect, that primate brains code faces using a simple PCA-like basis. It is this explanation of the result, in comprehensible terms, that makes the study scientific.
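To give a rough sense of what a “simple PCA-like code” means, here is a purely hypothetical toy sketch of the linear-coding idea (made-up numbers, not the actual Caltech data or analysis): if each face is a point in a low-dimensional feature space and each neuron’s firing rate is roughly a linear projection of that point plus noise, then plain linear regression can decode the face features back from the firing rates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_faces, n_dims, n_neurons = 2000, 25, 200

# Toy "face space": each face is a 25-dimensional feature vector (PCA-like axes).
faces = rng.normal(size=(n_faces, n_dims))

# Toy neurons: each fires as a noisy linear projection of the face features.
axes = rng.normal(size=(n_dims, n_neurons))
rates = faces @ axes + 0.5 * rng.normal(size=(n_faces, n_neurons))

# Decode the face features from the firing rates with ordinary linear regression.
train, test = slice(0, 1500), slice(1500, None)
decoder = LinearRegression().fit(rates[train], faces[train])
print("decoding R^2 on held-out faces:",
      round(decoder.score(rates[test], faces[test]), 3))
```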

Contrast that with a deep learning study showing that a 400-layer deep net performs the ImageNet task with x% error rate. What can one conclude from such a result? What are the layers computing? What is the underlying theory? Very little can be deduced from such an empirical result. If ML aspires to be a science, results like this should not be publishable in their current form. Will this happen? It seems unlikely to me, judging from the large number of ICML, ICLR, and NIPS papers I review each year. One can only hope the field matures in this direction over time.