We should know what outliers are and how they can affect our data. In the video TSQL: How To Get the Outliers In A Data Set, we look at what outliers are, with examples, and at how to obtain the outliers in a data set using T-SQL.
You may hear the words "generalization" or "exception" in conversation. Whatever the context, these words inherently describe either the tendency, where outliers are absent (generalization), or an outlier (exception). Using one of these words doesn't make the observation correct - for instance, a person may call something an exception when in reality it occurs frequently. Generalizations are usually true, in that they describe the tendency, and they don't have to hold 100% of the time. Exceptions must be extremely rare, in that they fall outside of the tendency. It would not be an exception for Phoenix to have 100-degree Fahrenheit weather, while it would be exceptional for it to have -20-degree Fahrenheit weather.
Some questions and information that we can learn from watching the above video:
- What are some words that we may hear or use in English that often indicate an outlier (if used accurately)? Let's keep in mind that some people use these words for things that are not actually outliers.
- When we consider average and median in the context of outliers for this video, which of these measurements are we using?
- Related to the above point, what are some assumptions we must make when using that definition?
- When considering the standard deviation and the bell curve, what should we consider about using a larger or smaller deviation relative to the bell curve shape?
- With the example used in the video, how do outliers impact the data set? Consider a related example: how might outliers impact that data set?
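To make the questions about average, median, and standard deviation more concrete, here is a minimal sketch in Python rather than the T-SQL used in the video. It assumes one common definition of an outlier, a value more than `k` standard deviations from the mean; the video's own definition may differ. The function name `find_outliers`, the threshold `k`, and the temperature readings are all made up for illustration.

```python
from statistics import mean, median, pstdev


def find_outliers(values, k=2.0):
    """Return the values lying more than k population standard deviations from the mean.

    This mean-based definition is an assumption for this sketch, not
    necessarily the one used in the video.
    """
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # Every value is identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - center) > k * spread]


# Hypothetical daily highs in degrees Fahrenheit; the last reading is deliberately extreme.
temps = [98, 101, 99, 102, 100, 97, 103, 100, 99, -20]

outliers = find_outliers(temps, k=2.0)
print("outliers:", outliers)

# Compare how the outlier affects the mean versus the median.
print("with outlier:    mean =", mean(temps), " median =", median(temps))
cleaned = [v for v in temps if v not in outliers]
print("without outlier: mean =", round(mean(cleaned), 1), " median =", median(cleaned))
```

In this made-up data the single extreme reading pulls the mean down by roughly twelve degrees while the median barely moves. The choice of `k` matters as well: in a set this small, the extreme value inflates the standard deviation so much that a threshold of 3 would no longer flag it.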
In the course Automating ETL, I provide a few practical examples of using outliers to solve problems or identify opportunities. Outliers can point to big opportunities, as their result curve can be very steep. Once they are identified, though, they regress to the mean. In other words, they are no longer exceptional once they have been recognized.
Check out the highest-rated Automating ETL course on Udemy, if you're interested in data.
Most things described as unusual events (outliers) are not actually unusual. For instance, we often hear weather patterns described as unusual when they occur frequently. A good example is a weather forecaster saying that "we had an unusual tornado" the other day in a city where I used to live, when tornadoes happened there multiple times a year. This was an incorrect description - these are actually expected events. In fact, most tornadoes in the world occur in the United States of America, and most of these occur in Tornado Alley (the "Great Plains" of the United States). An outlier event in this case would be a tornado in Siberia - that would not be an expected event. This demonstrates why we should understand what's normal - the tendency of the data we have.
We should also keep in mind that what may be abnormal over a short period of time may be normal over longer periods of time. For example, there's a popular idiom that goes "lightning doesn't strike twice in the same place" as a way of saying "it was a rare moment." The idiom holds over short periods of time, when it is extremely rare for lightning to strike the same or a nearby spot twice. Over longer periods of time, lightning does strike in the same spot. The same holds true if we reflect back on "rare" moments - over a long enough period of time, they may not actually be that rare. Thus something we feel "never" or "always" happens may not actually happen that way. Those words are great examples of expressions we use in English that are statistically inaccurate.
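A small calculation can illustrate the point about time periods. This is a sketch under simplifying assumptions: the event has the same probability in every period and the periods are independent, which real weather does not strictly follow. The function name `prob_at_least_once` and the 1% figure are hypothetical, chosen only for illustration.

```python
def prob_at_least_once(p_per_period, periods):
    """Chance that an event occurs at least once across independent periods.

    Assumes the same probability p_per_period in every period, so the
    chance of it never happening is (1 - p_per_period) ** periods.
    """
    return 1 - (1 - p_per_period) ** periods


# Hypothetical event with a 1% chance in any single year.
for years in (1, 10, 100, 500):
    chance = prob_at_least_once(0.01, years)
    print(f"{years:>3} years: {chance:.1%} chance of at least one occurrence")
```

Under these assumptions, something that looks rare in any single year becomes more likely than not over a century, which is why the length of the window matters when we call an event an outlier.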