One of the first things an analyst will want to do after receiving the data is explore it. PROC MEANS is a SAS procedure that gives analysts the elementary, yet most wanted, statistics with minimal code: N (number of non-missing records), Mean (in simple terms, the average), Std Dev (standard deviation, keyword STD), Min (minimum) and Max (maximum).
If we need a statistic other than the default ones, we can add its keyword to the PROC MEANS statement itself. For example, to see how skewed the data is, we can add SKEW to the PROC MEANS statement. Note that once any statistic keyword is listed, only the listed statistics are printed, so the defaults have to be named explicitly if they are still wanted.
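As a minimal sketch of this, the step below asks for skewness alongside a few of the default statistics. It assumes the SASHELP.CLASS sample data set that ships with SAS; the choice of variables is only for illustration.

```sas
/* Request skewness in addition to N, mean and standard deviation.   */
/* Because keywords are listed, MIN and MAX are no longer printed.   */
proc means data=sashelp.class n mean std skew;
    var height weight;
run;
```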
NMISS and N are two interesting statistics in PROC MEANS. From them, we can derive a new variable that gives the ratio of missing values to non-missing values. This gives us an idea of whether we should try imputing the missing values or whether the variable is not usable at all.
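One possible way to compute that ratio is sketched below. The data set WORK.DEMO, the variable INCOME and the output names N_NONMISS, N_MISS and MISS_RATIO are all made up for this illustration; they are not part of PROC MEANS.

```sas
/* Hypothetical toy data: 6 records, 2 of them missing INCOME. */
data work.demo;
    input id income;
    datalines;
1 52000
2 .
3 61000
4 .
5 48000
6 57000
;
run;

/* Write the non-missing and missing counts to a data set. */
proc means data=work.demo noprint;
    var income;
    output out=work.income_counts n=n_nonmiss nmiss=n_miss;
run;

/* Derive the ratio of missing to non-missing values. */
data work.income_ratio;
    set work.income_counts;
    if n_nonmiss > 0 then miss_ratio = n_miss / n_nonmiss;
run;

proc print data=work.income_ratio noobs;
    var n_nonmiss n_miss miss_ratio;
run;
```

With this toy data there are 4 non-missing and 2 missing values, so the ratio comes out as 0.5.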
Another feature that I like about PROC MEANS is that, apart from generating statistics for individual variables, it also lets us generate statistics for continuous variables with respect to categorical variables, such as PROC MEANS MEAN; BY GENDER; VAR WEIGHT HEIGHT; RUN;.
The only additional overhead in the above case is that the data must already be sorted by the BY variable. This overhead can also be avoided: the CLASS statement can be used instead of BY to generate statistics by a categorical variable, and it does not require the data to be sorted by the CLASS variable.
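The difference between the two approaches can be sketched as follows, again assuming the SASHELP.CLASS sample data, where SEX plays the role of the GENDER variable in the example above. WORK.CLASS_SORTED is just an illustrative name.

```sas
/* BY-group processing: the input must be sorted by the BY variable first. */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

proc means data=work.class_sorted mean;
    by sex;
    var weight height;
run;

/* CLASS processing: same grouping, no prior sort needed. */
proc means data=sashelp.class mean;
    class sex;
    var weight height;
run;
```

The BY version prints a separate table per group, while the CLASS version prints one table with a row per group.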
Different ways in which the PROC MEANS output can be made usable:
How useful can this be? Not very. Why? The number of rows generated is equal to the number of statistics requested for each variable multiplied by the number of variables.
Here, each required statistic for each variable is written out as a separate column. That way, we can merge the statistics back into the original file and use them.
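A rough sketch of this one-column-per-statistic layout, and of merging it back, is given below. It assumes the SASHELP.CLASS sample data; the names HEIGHT_MEAN, HEIGHT_STD, HEIGHT_DIFF and the WORK data sets are hypothetical and chosen only for this example.

```sas
/* One output row per SEX, with each statistic in its own named column. */
/* NWAY keeps only the rows for the full CLASS combination.             */
proc means data=sashelp.class noprint nway;
    class sex;
    var height;
    output out=work.height_stats (drop=_type_ _freq_)
           mean=height_mean std=height_std;
run;

/* The detail data must be sorted by the merge key. */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

/* Merge the group statistics back onto every original record. */
data work.class_with_stats;
    merge work.class_sorted work.height_stats;
    by sex;
    height_diff = height - height_mean;  /* example use of the merged statistic */
run;
```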
If we have to generate multiple statistics for multiple variables, naming each of them explicitly, as in the third method, can be tedious. To avoid this, there is a shortcut: AUTONAME.
We can simply say OUTPUT OUT = TEST N= Min= Max=/AUTONAME; and the statistics generated for each variable will be named in the format VariableName_Statistic. E.g.: Product_Price_Min.
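For instance, a minimal sketch using the SASHELP.CLASS sample data (an assumption for this example; any data set with numeric variables would do) looks like this:

```sas
/* AUTONAME builds the output names from the variable and the statistic. */
proc means data=sashelp.class noprint;
    var height weight;
    output out=work.test n= min= max= / autoname;
run;

/* Columns include Height_N, Weight_N, Height_Min, Weight_Min, */
/* Height_Max and Weight_Max, plus _TYPE_ and _FREQ_.          */
proc print data=work.test noobs;
run;
```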