Monday, April 1, 2019

Statistics

Standard Deviation
The standard deviation indicates how homogeneous the data is. If the standard deviation is 0, every value in the data is the same.

Here is an explanation that I find easy to follow. In short, the mean can be misleading. To find out how spread out the numbers are, the Average Deviation can be used. The Standard Deviation is a variant of the Average Deviation calculation.

Interpretation of the Mean

When we say that the average value spent on meals was 50 USD - it means that if we take the total amount spent on fast foods and equally divide the sum among all the people who made the purchase - each person would get 50 USD.

However, this number hides a lot of information. We can get an average of 50 USD in a lot of different situations. One extreme is when everyone spends exactly 50 USD. Another extreme is when half the people spend 0 USD and the other half spend 100 USD. And there are an infinite number of situations in between that would give us a mean value of 50.

Average Deviation

Hence, we are interested in the variability of those amounts. One intuitive way to quantify how much variability there is is to calculate the average deviation from that mean value. So when we know the mean value, for each person we can calculate the difference between their spent amount and the mean value, and get the average of that:

$$MAD = \frac{\sum_i |x_i - \bar{x}|}{n}$$

This is "Mean Absolute Deviation" (MAD). It answers the question: among the customers - what is the average difference between their purchase and the average?

We can check what this score would be in the two extreme scenarios. If every purchase was equal to 50 USD then the average would be 50, and the MAD would be 0. And if half of the purchases were 0 and another half 100 then the mean would be 50 and the MAD would be 50.
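The two extreme scenarios can be checked with a short sketch. The function name `mean_abs_dev` and the sample purchase amounts are assumptions made for illustration; they are not part of the quoted explanation.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Mean absolute deviation: the average of |x_i - mean| over all values.
// Assumes a non-empty input; the name mean_abs_dev is hypothetical.
double mean_abs_dev(const std::vector<double>& v) {
    const double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    double total = 0.0;
    for (double x : v) {
        total += std::abs(x - mean);  // distance from the mean, sign dropped
    }
    return total / v.size();
}
```

Four purchases of 50 USD give a MAD of 0, while purchases of 0, 100, 0 and 100 USD give a MAD of 50, even though both sets have a mean of 50.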

Standard Deviation

Standard deviation is a variant of the MAD, but it's harder to interpret. Note that when we look for average differences from the mean in the MAD calculation, we take the absolute value. We want to get rid of the sign, because otherwise roughly half the deviations will be negative and half positive, and so they would cancel out. Standard deviation, instead of taking the absolute value, uses the square, which, just like the absolute value, transforms negative numbers into positive ones. It then transforms back by taking the square root:

$$SD = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$

The idea is the same. It is harder to interpret, but it has some nice properties. 
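As a rough illustration of the formula, the sketch below computes the population standard deviation (dividing by n). The function name `population_stddev` and the sample values are assumptions, not part of the quoted text.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Population standard deviation: sqrt( sum (x_i - mean)^2 / n ).
// Assumes a non-empty input; the name population_stddev is hypothetical.
double population_stddev(const std::vector<double>& v) {
    const double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    double sum_of_squares = 0.0;
    for (double x : v) {
        const double diff = x - mean;
        sum_of_squares += diff * diff;  // squaring removes the sign
    }
    return std::sqrt(sum_of_squares / v.size());
}
```

For the two extreme scenarios the result matches the MAD: 0 when every purchase is 50, and 50 when half are 0 and half are 100. The two measures differ once the deviations are unequal. For the values 0, 0, 0, 200 the MAD is 75 while the standard deviation is about 86.6, because squaring gives large deviations more weight.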

Normal Distribution
For a normal distribution, the meaning of the standard deviation is well defined. You can also look at the post on normal distribution with C++. The explanation is as follows.
Normally distributed variables have skew = 0.
About 68% of the data is expected to lie within 1 standard deviation of the mean, above or below.
About 95% of the data is expected to lie within 2 standard deviations of the mean, above or below.
About 99.7% of the data is expected to lie within 3 standard deviations of the mean, above or below.
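The percentages for 1, 2 and 3 standard deviations (roughly 68%, 95% and 99.7%) can be checked by simulation. The sketch below is only an illustration: the function name `fraction_within`, the seed and the sample count are arbitrary assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Fraction of simulated standard-normal draws that fall within k standard
// deviations of the mean. Seed and sample count are arbitrary choices.
double fraction_within(double k, std::size_t samples = 200000,
                       unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 1.0);  // mean 0, stddev 1
    std::size_t hits = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        if (std::abs(dist(gen)) <= k) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / samples;
}
```

Calling `fraction_within(1.0)`, `fraction_within(2.0)` and `fraction_within(3.0)` should give values close to 0.68, 0.95 and 0.997. The exact digits depend on the standard library implementation.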

Shape of the Population
The explanation is as follows.
... the rate of convergence to a normal distribution is very dependent on the shape of the population we are sampling from, in particular, if our population is very skew, we expect it to take a long time to converge to the normal.
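One way to see this effect is to simulate sample means drawn from a strongly skewed population and measure how skewed the means still are. The sketch below uses an exponential population, which is an assumption chosen for illustration. The function name and all parameters are hypothetical as well.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Skewness of the distribution of sample means, estimated by simulation.
// Each trial averages n draws from an exponential population (rate 1),
// which is strongly right-skewed. All parameters are illustrative choices.
double skewness_of_sample_means(std::size_t n, std::size_t trials = 20000,
                                unsigned seed = 7) {
    std::mt19937 gen(seed);
    std::exponential_distribution<double> population(1.0);
    std::vector<double> means(trials);
    for (double& m : means) {
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            total += population(gen);
        }
        m = total / n;  // one sample mean of size n
    }
    double mean = 0.0;
    for (double m : means) mean += m;
    mean /= trials;
    double m2 = 0.0, m3 = 0.0;  // second and third central moments
    for (double m : means) {
        const double d = m - mean;
        m2 += d * d;
        m3 += d * d * d;
    }
    m2 /= trials;
    m3 /= trials;
    return m3 / std::pow(m2, 1.5);  // 0 for a symmetric distribution
}
```

For an exponential population the skewness of the sample mean is 2 / sqrt(n) in theory, so it shrinks slowly: a sample size of 100 still leaves a skewness of about 0.2, whereas a symmetric population would start at 0.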

Example
1. First, the mean of the numbers is computed.
float avgValue;
float totalValue = 0;
foreach (var i in setOfNumbers) {
    totalValue += i;
}
avgValue = totalValue / setOfNumbers.Count();
2. The distance of each number from the mean is squared, and the results are summed.
float sumOfSquares = 0;
foreach (var i in setOfNumbers) {
    float diffFromAvg = i - avgValue;
    sumOfSquares += diffFromAvg * diffFromAvg;  // square of the distance
}
3. The variance is the sum from step 2 divided by how many numbers there are.
float variance = 0;
variance = sumOfSquares / setOfNumbers.Count();
4. The standard deviation is the square root of the variance.

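Example
We do it like this.
#include <cmath>    // std::sqrt
#include <numeric>  // std::accumulate, std::inner_product

// v is assumed to be a non-empty container of doubles, e.g. std::vector<double>
double sum = std::accumulate(v.begin(), v.end(), 0.0);
double mean = sum / v.size();

// Sum of squares, computed as the inner product of v with itself
double sq_sum = std::inner_product(v.begin(), v.end(), v.begin(), 0.0);
// Population standard deviation: sqrt(mean of squares - square of mean)
double stdev = std::sqrt(sq_sum / v.size() - mean * mean);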
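The one-pass formula above subtracts two numbers that can be large and nearly equal (the mean of the squares and the square of the mean), which can lose precision in floating point when the mean is large relative to the spread. A minimal sketch of one common alternative, Welford's online algorithm, is shown below. It is a swapped-in technique rather than the method of the example above, and the function name `welford_stddev` is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Population standard deviation via Welford's online algorithm.
// Assumes a non-empty input; the name welford_stddev is hypothetical.
double welford_stddev(const std::vector<double>& v) {
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the mean
    std::size_t count = 0;
    for (double x : v) {
        ++count;
        const double delta = x - mean;  // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean);       // uses the updated mean
    }
    return std::sqrt(m2 / count);
}
```

It still makes a single pass over the data, but it only ever accumulates deviations from the running mean, so the values it adds stay small even when the inputs themselves are large.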
Example
A simpler formula is as follows. It combines per-group sums and divides by totalCount - 1, so it yields the sample variance.
totalSum = Sum(sum_i)
totalSumSquared = Sum(sum_squared_i)
totalCount = Sum(count_i)

mean = totalSum / totalCount
variance = (totalSumSquared - mean * totalSum) / (totalCount - 1)
sd = sqrt(variance)
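The idea of merging per-group sums can be sketched in C++ as follows. The struct `PartialStats` and the functions `summarize` and `combined_sample_stddev` are hypothetical names introduced for illustration. The variance line uses the sum of squares minus mean times the total sum, divided by the count minus 1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-group summary: count, sum and sum of squares of the values.
// The struct and function names are hypothetical.
struct PartialStats {
    std::size_t count = 0;
    double sum = 0.0;
    double sum_squared = 0.0;
};

PartialStats summarize(const std::vector<double>& v) {
    PartialStats s;
    for (double x : v) {
        ++s.count;
        s.sum += x;
        s.sum_squared += x * x;
    }
    return s;
}

// Sample standard deviation (divides by n - 1) from merged summaries.
// Assumes the total count is at least 2.
double combined_sample_stddev(const std::vector<PartialStats>& parts) {
    PartialStats total;
    for (const PartialStats& p : parts) {
        total.count += p.count;
        total.sum += p.sum;
        total.sum_squared += p.sum_squared;
    }
    const double mean = total.sum / total.count;
    const double variance =
        (total.sum_squared - mean * total.sum) / (total.count - 1);
    return std::sqrt(variance);
}
```

Each group only needs to report three numbers, so the groups can be processed separately and combined at the end without revisiting the raw values.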
Variance
The classic formula is as follows. The mean of the numbers is found first. The distance of each number from the mean is squared, and the results are summed. The variance is this sum divided by how many numbers there are.
$$\frac{\sum_i (x_i - \bar{x})^2}{n}$$
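The C++ example earlier computes the variance as `sq_sum / v.size() - mean * mean` rather than summing squared distances. The short derivation below, which expands the square and uses the definition of the mean, shows why the two forms agree:

```latex
\begin{aligned}
\frac{1}{n}\sum_i (x_i - \bar{x})^2
  &= \frac{1}{n}\sum_i x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_i x_i + \bar{x}^2 \\
  &= \frac{1}{n}\sum_i x_i^2 - 2\bar{x}^2 + \bar{x}^2 \\
  &= \frac{1}{n}\sum_i x_i^2 - \bar{x}^2
\end{aligned}
```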
Stochastic Processes
The explanation is as follows.
Random walks, discrete time Markov chains, Poisson processes
Time Series 
The explanation is as follows.
Important for finance and tech, where there are lots and lots of measurement occasions
Another explanation is as follows.
If you are working as a Data Scientist you will definitely make forecasts occasionally. It is important that you understand patterns such as trends, unit roots, seasonalities, etc.
In practice you will be facing data with different frequencies such as monthly or quarterly data.

Read Forecasting: Principles and Practice in order to get an understanding of the applications of forecasting.
Modern Statistical Prediction and Machine Learning 
The explanation is as follows.
For the fancy new prediction stuff, also important for finance and tech.
Another explanation is as follows.
It is definitely worth knowing things such as training and test data. You will always build a model and test it.
Game Theory
The explanation is as follows.
Game theory is a theoretical concept which is barely directly applied in practice. In economic and psychological research it might be helpful, however it is not in the classical scope of a data scientist.
