How To Find Outliers In Data


We should know what outliers are and how they can impact our data. In the video, TSQL: How To Get the Outliers In A Data Set, we look at what outliers are with examples and at how to obtain the outliers in a data set using T-SQL.

You may hear the words "generalization" or "exception" in conversation. Whatever the context, a generalization describes the tendency of the data (a lack of outliers), while an exception describes an outlier. Using one of these words doesn't mean the observation is correct - for instance, a person may call something an exception when in reality it occurs frequently. Generalizations are usually true, in that they describe the tendency. Exceptions are extremely rare, in that they fall outside of the tendency. Generalizations don't have to be true 100% of the time, while exceptions must be extremely rare. It would not be an exception for Phoenix to have 100 degree Fahrenheit weather, while it would be exceptional for it to have -20 degree Fahrenheit weather.

Some questions to consider while watching the above video:

  • What are some words that we may hear or use in English that often indicate an outlier (if used accurately)? Let's keep in mind that some people overuse these words for things that aren't actually outliers.
  • When we consider average and median in the context of outliers for this video, which of these measurements are we using?
  • Related to the above point, what are some assumptions we must make when using that definition?
  • When considering the standard deviation and the bell curve, what should we consider about using a larger or smaller deviation relative to the bell curve shape?
  • With the example used in the video, how do outliers impact the data set? Consider a related example: how might outliers impact that data set?
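
As a rough illustration of the average-versus-median question above, here is a minimal Python sketch (Python is used only for convenience; the video works in T-SQL). The numbers are hypothetical and not taken from the video. It shows how a single extreme value pulls the average while the median barely moves.

```python
from statistics import mean, median

# Hypothetical measurements; the appended 500 is a deliberate outlier.
typical = [48, 50, 51, 49, 52, 50, 47]
with_outlier = typical + [500]

def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)

mean_before, median_before = summarize(typical)
mean_after, median_after = summarize(with_outlier)

print(f"without outlier: mean={mean_before:.2f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after}")
```

In this made-up set, the average more than doubles once the outlier is added, while the median stays at 50. This is one reason the choice between average and median matters when we define what counts as an outlier.
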

Despite what you may learn in class, there isn't a hard and fast statistical definition for measuring an outlier. It will depend on the actual data set. In all cases, however, we are looking for deviations outside the tendency of the data. An easy example to consider is that human body temperature tends to be 98.6 degrees Fahrenheit, and with as many measurements as we have, we can use very strict deviations from this value to determine that a person has a fever. Unlike data sets where we may see 20-point swings, a deviation of 1-2 degrees can be extreme when measuring fever. This is because in all of the data points we've captured on human temperature, it is extremely rare that our body temperature differs significantly from 98.6 degrees Fahrenheit.
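
One common way to express "deviation outside the tendency" is to flag values that sit more than some number of standard deviations from the average. The Python sketch below is only an illustration under that assumption, not the method from the video (which uses T-SQL). The function name `find_outliers`, the cutoff of 2 deviations, and both data sets are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, max_deviations):
    """Return values more than max_deviations standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # every value is identical, so nothing deviates
    return [v for v in values if abs(v - center) > max_deviations * spread]

# Hypothetical body temperatures (Fahrenheit): tightly clustered, one fever reading.
body_temps_f = [98.4, 98.6, 98.7, 98.5, 98.6, 98.8, 98.6, 98.5, 98.7, 100.9]

# Hypothetical readings from a data set where 20-point swings are ordinary.
wide_readings = [40, 60, 55, 45, 70, 30, 52, 48, 65, 35]

print(find_outliers(body_temps_f, 2))   # the fever reading is flagged
print(find_outliers(wide_readings, 2))  # nothing is flagged
```

The reading of 100.9 is only about 2 degrees above the rest, yet it is flagged because the temperatures are so tightly clustered. In the wider set, a value 20 points from the average is not flagged at all. The cutoff itself is a judgment call that depends on the data.
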

In the course Automating ETL, I provide a few practical examples of using outliers to solve problems or identify opportunities. Outliers can be used to identify big opportunities, as their result curve can be very steep. Once these are identified, though, they regress to the mean. In other words, they are no longer exceptional once they have been recognized.


Most things described as unusual events (outliers) are not actually unusual. For instance, we often hear weather patterns described as unusual when they occur frequently. A good example is a weather forecaster saying that "we had an unusual tornado" the other day in a city where I used to live, even though tornadoes happened there multiple times a year. This was an incorrect description - tornadoes there are expected events. In fact, most tornadoes in the world occur in the United States of America, and most of these occur in Tornado Alley (the "Great Plains" of the United States). An outlier event in this case would be a tornado in Siberia - that would not be an expected event. This demonstrates why we should understand what's normal - the tendency of the data we have.

We should also keep in mind that what may be abnormal in a short period of time may be normal over longer periods of time. For example, there's a popular idiom that goes "lightning doesn't strike twice in the same place" as a way of saying "it was a rare moment." That idiom holds true over short periods of time, when it is extremely rare for lightning to strike in the same or a nearby area. Over longer periods of time, lightning does strike in the same spot. The same holds true if we reflect back on "rare" moments - over a long enough period of time, they may not actually be that rare. Thus something we feel "never" or "always" happens may not actually behave that way. Those words are great examples of expressions we use in English that are statistically inaccurate.
