10.3 Learning Belief Networks

10.3.3 Missing Data

Data can be incomplete in ways other than having an unobserved variable. A data set could simply be missing the values of some variables for some of the tuples. When some of the values of the variables are missing, one must be very careful in using the data set because the missing data may be correlated with the phenomenon of interest.

Example 10.13.

Suppose there is a (claimed) treatment for a disease that does not actually affect the disease or its symptoms. All it does is make sick people sicker. If patients were randomly assigned to the treatment, the sickest people would drop out of the study, because they become too sick to participate. The sick people who took the treatment would drop out at a faster rate than the sick people who did not take the treatment. Thus, if the patients with missing data are ignored, it looks like the treatment works; there are fewer sick people among the people who took the treatment and remained in the study!
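The effect in this example can be reproduced with a small simulation. The following is a minimal sketch, not part of the original example. The function name `simulate_trial` and all of the probabilities are illustrative assumptions: half the patients are sick, assignment to the treatment is a fair coin flip, and sick patients drop out with probability 0.6 if treated and 0.2 otherwise.

```python
import random


def simulate_trial(n=100_000, seed=0):
    """Simulate a useless treatment with outcome-dependent dropout.

    All probabilities below are illustrative assumptions,
    not values taken from the example.
    """
    rng = random.Random(seed)
    # For each arm, track [number sick, number of patients].
    everyone = {"treated": [0, 0], "control": [0, 0]}
    remaining = {"treated": [0, 0], "control": [0, 0]}

    for _ in range(n):
        # Random assignment to the treatment.
        arm = "treated" if rng.random() < 0.5 else "control"
        # The treatment has no effect on who is sick.
        sick = rng.random() < 0.5
        # Only sick patients drop out; the treatment makes them sicker,
        # so treated sick patients drop out more often (assumed rates).
        if sick:
            p_drop = 0.6 if arm == "treated" else 0.2
        else:
            p_drop = 0.0
        dropped = rng.random() < p_drop

        everyone[arm][0] += sick
        everyone[arm][1] += 1
        if not dropped:
            remaining[arm][0] += sick
            remaining[arm][1] += 1

    def rates(counts):
        # Fraction of sick patients in each arm.
        return {arm: s / total for arm, (s, total) in counts.items()}

    return rates(everyone), rates(remaining)


if __name__ == "__main__":
    everyone, remaining = simulate_trial()
    print("fraction sick, all patients:     ", everyone)
    print("fraction sick, ignoring dropouts:", remaining)
```

Under these assumed probabilities, the fraction of sick patients is about 0.5 in both arms when everyone is counted. Among those who remain, it is about 2/7 for the treated arm and about 4/9 for the control arm, so ignoring the dropouts makes a useless treatment look effective.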

Data are missing at random when the reason the data are missing is not correlated with any of the variables being modeled. Data missing at random could be ignored or filled in using EM. However, “missing at random” is a strong assumption. In general, an agent should construct a model of why the data are missing or, preferably, it should go out into the world and find out why the data are missing.