QUANTITATIVE ANALYSIS
This interview is designed to evaluate quantitative reasoning and applied statistics. Quantitative reasoning tests knowledge of relevant mathematical/probabilistic/statistical concepts and how they relate to Facebook products. Applied statistics tests problems drawn from real-world data or estimation.
Scope:
Estimation and logical reasoning in the context of a real-world product.
Elements of descriptive statistics (mean/expected value, median, mode, percentiles).
Common distributions such as binomial or normal distributions.
What does real-world data typically look like?
Law of Large Numbers, Central Limit Theorem, Linear Regression.
Conditional probabilities, including Bayes' Theorem.
Sample Question:
What do you think the distribution of time spent per day on Facebook looks like? What metrics would you use to describe that distribution?
What won’t be covered: Advanced stats/math concepts: calculus or advanced statistical/ML models; more complex distributions like the exponential, Weibull, Beta, etc.; brainteasers or contrived estimation problems (e.g. how many golf balls fit in a 747).
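- Probability: FB ads have lazy reviewers and common reviewers. A reviewer is common with probability 0.8 and lazy with probability 0.2. A common reviewer gives a good review with probability 0.6 and a bad review with probability 0.4. A lazy reviewer only gives good reviews.
1.) The probability that an ad gets a good review.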
P(G) = P(G|L)P(L)+P(G|C)P(C) = 1*0.2+0.6*0.8 = 0.68
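2.) The expected number of good reviews among 100 ads.
E(X) = n*P(G) = 100*0.68 = 68
3.) Five ads from one reviewer all got good reviews. What is the probability that the reviewer is lazy?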
P(L|5G) = P(5G|L)P(L)/(P(5G|C)P(C) + P(5G|L)P(L)) = 1*0.2/(0.6^5*0.8 + 1*0.2) = 0.2/0.2622 = 0.763
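A minimal sketch of the three reviewer calculations, using exact fractions so the arithmetic is easy to check. It assumes the five good reviews come from a single reviewer and are independent given the reviewer type; the function names are just illustrative.

```python
from fractions import Fraction

# Numbers from the reviewer question.
P_LAZY = Fraction(2, 10)
P_COMMON = Fraction(8, 10)
P_GOOD_GIVEN_LAZY = Fraction(1)
P_GOOD_GIVEN_COMMON = Fraction(6, 10)


def p_good():
    """Law of total probability: P(G) = P(G|L)P(L) + P(G|C)P(C)."""
    return P_GOOD_GIVEN_LAZY * P_LAZY + P_GOOD_GIVEN_COMMON * P_COMMON


def p_lazy_given_k_good(k):
    """Bayes' theorem for k good reviews from one reviewer (assumed independent)."""
    lazy = P_GOOD_GIVEN_LAZY ** k * P_LAZY
    common = P_GOOD_GIVEN_COMMON ** k * P_COMMON
    return lazy / (lazy + common)


print(float(p_good()))                          # 0.68
print(float(100 * p_good()))                    # 68.0
print(round(float(p_lazy_given_k_good(5)), 3))  # 0.763
```

The posterior works out to 3125/4097, about 0.763: five straight good reviews move the lazy probability from 0.2 to roughly 0.76.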
- Explain variance in the simplest, most intuitive way.
Variance measures how spread out the data are around the mean: it is the average squared distance of the data points from the mean. When the spread is small, all the data points sit close to the mean and the variance is small; when the points are scattered far from the mean, the variance is large.
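A tiny illustration of that definition, written out by hand rather than with a library call. The two lists are made-up numbers with the same mean but different spread.

```python
def variance(values):
    """Population variance: the average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


tight = [9, 10, 10, 10, 11]   # hypothetical values close to the mean of 10
spread = [0, 5, 10, 15, 20]   # hypothetical values far from the mean of 10

print(variance(tight))   # small: the points hug the mean
print(variance(spread))  # large: the points are scattered
```

Both lists average 10, but the variance is 0.4 for the tight list and 50 for the spread-out one.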
- What does the distribution of posts per user per day look like?
The distribution of the number of posts per user per day should be right-skewed. The majority of users are passive: they probably view a lot but rarely or only occasionally post. A small proportion of active users post every day, and most of these are probably business accounts. An even smaller group of users creates multiple posts every day.
The median and mode should be 0, while the mean would be higher (maybe around 1) because the heavy posters pull it up.
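To see how a few heavy posters separate the mean from the median, here is a toy population. The counts are invented purely for illustration and are not real Facebook data.

```python
from statistics import mean, median, mode

# Hypothetical population of 1,000 users (made-up proportions):
# 700 post nothing, 200 post once, 80 post 3 times, 20 post 20 times.
posts_per_day = [0] * 700 + [1] * 200 + [3] * 80 + [20] * 20

med = median(posts_per_day)
mod = mode(posts_per_day)
avg = mean(posts_per_day)

print(f"median={med}, mode={mod}, mean={avg}")
```

With these made-up numbers the median and mode are both 0 while the mean is 0.84, which is the right-skewed pattern described above: the long right tail drags the mean up but leaves the median untouched.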
- Distribution of #comments per active user per day
The distribution of the number of comments per user per day should be right-skewed. The majority of users are passive: they probably view a lot of posts but rarely comment, or comment only occasionally, for example on a birthday or a special event. A small proportion of users comment every day, and most of these are probably very active users such as teenagers. An even smaller group of users writes a lot of comments every day.
The median and mode should be 0, while the mean would be higher (maybe around 1) because the heavy commenters pull it up.
- Conditional probability and Bayes' theorem
P(A|B) = P(A&B)/P(B) = P(B|A)P(A)/P(B)
- two approaches:
a. 5% chance to be an ad per post.
b. every 20 posts must have an ad in them.
1) compute each expected value and variance for number of ads in 100 posts.
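Approach a: let X denote the number of ads. X follows a binomial distribution, X ~ Binomial(n, p), with n = 100 and p = 0.05.
E(X) = np = 100*0.05 = 5
σ^2 = Var(X) = np*(1-p) = 100*0.05*(1-0.05) = 4.75
Approach b:
E(X) = 100/20 = 5
Var(X) = 0, because every 100 posts contain exactly 5 ads (assuming exactly one ad per block of 20 posts).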
2) probability of getting more than 10 ads in 100 posts with approach a.
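Approach a follows a binomial distribution with p = 0.05 and q = 1 - p = 0.95.
Solving it with the binomial distribution:
p(more than 10 ads) = 1 - p(less than or equal to 10 ads)
p(less than or equal to 10 ads) = p(ads = 0) + p(ads = 1) + ... + p(ads = 10)
Each term on the right-hand side can be computed with the binomial probability mass function.
But solving it this way is time-consuming. We can use a normal approximation instead (the sample is large enough, since np >= 5 and nq >= 5):
Here mu = 5, var = 100*0.05*0.95 = 4.75, sigma = sqrt(4.75) = 2.18
Z = (x - mu)/sigma = (10 - 5)/2.18 = 2.29
Without a Z table, you can bound it like this (I'm not sure this estimate would satisfy the interviewer, but it should be better than no estimate):
The familiar one-tailed value is p(Z < 1.96) = 0.975, so p(Z > 1.96) = 0.025
So p(Z > 2.29) < p(Z > 1.96) = 0.025
Answer:
The probability of more than 10 ads in 100 posts is at most 2.5%; the Z table gives 1.1%.
Using the binomial distribution directly, I computed it in Excel and got 1.1472%.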
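For checking these numbers away from the whiteboard, here is a rough sketch that computes the exact binomial tail and the normal approximation side by side, using only the standard library. The continuity-corrected variant (evaluating the normal tail at 10.5) is an extra added for comparison and is not part of the original answer.

```python
from math import comb, erf, sqrt

N_POSTS = 100
P_AD = 0.05


def binom_tail_gt(k, n=N_POSTS, p=P_AD):
    """Exact P(X > k) for X ~ Binomial(n, p)."""
    cdf = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - cdf


def normal_tail_gt(x, mu, sigma):
    """P(Y > x) for Y ~ Normal(mu, sigma^2), via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 - erf(z / sqrt(2)))


mu = N_POSTS * P_AD
sigma = sqrt(N_POSTS * P_AD * (1 - P_AD))

exact = binom_tail_gt(10)
approx = normal_tail_gt(10, mu, sigma)
approx_cc = normal_tail_gt(10.5, mu, sigma)  # with continuity correction

print(f"exact={exact:.4f} normal={approx:.4f} normal+cc={approx_cc:.4f}")
```

The exact tail and the plain normal approximation both land near 1.1%, matching the Excel and Z-table figures. The continuity-corrected value comes out smaller, which is a reminder that the normal approximation is only rough for a skewed binomial with p = 0.05.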
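3) expected number of back-to-back ads in 100 posts under each of the two approaches. In other words, on average how many times will you see two consecutive posts that are both ads?
4) Which is better: showing an ad once every 25 posts, or showing an ad with 4% probability per post?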
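The notes do not record an answer to question 3), so the following is only one possible way to set it up, using linearity of expectation over the 99 adjacent pairs of posts. For approach b it assumes the ad lands in a uniformly random slot within each block of 20; if the ad always sits in the same slot, back-to-back ads never happen.

```python
from fractions import Fraction


def expected_pairs_independent(n_posts, p_ad):
    """Approach a: each of the n_posts - 1 adjacent pairs is ad+ad with probability p_ad**2."""
    return (n_posts - 1) * p_ad ** 2


def expected_pairs_blocked(n_posts, block):
    """Approach b, ASSUMING one ad at a uniformly random slot in each block.

    Two ads can only be adjacent across a block boundary: the earlier block's
    ad must be in its last slot and the later block's ad in its first slot.
    """
    boundaries = n_posts // block - 1
    return boundaries * Fraction(1, block) ** 2


print(float(expected_pairs_independent(100, Fraction(1, 20))))  # 0.2475
print(float(expected_pairs_blocked(100, 20)))                   # 0.01
```

Under these assumptions approach a gives about 0.25 back-to-back pairs per 100 posts versus 0.01 for approach b, which is one argument for the fixed-frequency design in question 4).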
- There are two mobile restroom stalls at a construction site where I work. There are also three situations that have an equal chance of occurrence:
- none of them is occupied
- only one of them is occupied
- both are occupied
1) If I were to pick one at random, what is the probability that it is occupied?
P(occupied) = P(both occupied)*1 + P(only one occupied)*(1/2) + P(none occupied)*0 = 1/3 + (1/3)*(1/2) = 1/2
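2) follow up: If it turns out that the first one I go to is occupied and I decide to try the other one, what is the probability that the second one is also occupied?
P(S1|F1) = P(S1&F1)/P(F1) = (1/3)/(1/2) = 2/3, where F1 means the first stall is occupied and S1 means the second stall is occupied, so P(S1&F1) is the probability that both are occupied.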
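Both answers can be double-checked by enumerating every (state, pick) combination with exact fractions. This sketch assumes the "only one occupied" case is split evenly between the two stalls.

```python
from fractions import Fraction

# (stall A occupied, stall B occupied) -> probability.
# The three situations are equally likely; "only one occupied" is split
# evenly between the two stalls (a symmetry assumption).
states = {
    (0, 0): Fraction(1, 3),
    (1, 0): Fraction(1, 6),
    (0, 1): Fraction(1, 6),
    (1, 1): Fraction(1, 3),
}

p_first_occupied = Fraction(0)
p_both_occupied = Fraction(0)
for stalls, p_state in states.items():
    for first in (0, 1):              # pick the first stall uniformly at random
        second = 1 - first
        p = p_state * Fraction(1, 2)
        if stalls[first]:
            p_first_occupied += p
            if stalls[second]:
                p_both_occupied += p

print(p_first_occupied)                    # 1/2
print(p_both_occupied / p_first_occupied)  # 2/3
```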
- Explain the p-value. If you want to see whether two groups differ on some value, what method would you use? And if I want to test the difference between two groups on many, many different measurements, is it OK to run many t-tests? It also touched on how you would communicate your results to the sales team and convince them that your results are useful.
- The p-value is the probability of observing a statistic at least as extreme as the one observed, assuming the null hypothesis is true.
Based on the data collected, we can calculate the test statistic. If the probability of finding that test statistic or a more extreme value is very small, the result is very unlikely to occur when the null hypothesis is true. In this case, the null hypothesis can be rejected.
- If we need to compare the means of two groups, we can use a two-sample t-test. If we need to compare two proportions, we can use a z-test.
- When we run many t-tests, the chance of getting at least one significant result purely by chance goes up. To control this, we can use the Bonferroni correction, which divides the desired significance level by the number of tests.
- A test result is statistically significant when it is very unlikely to occur if the null hypothesis is true. Imagine that you flip a coin 10 times and get 7 heads, so you suspect the coin is weighted toward heads. Your null hypothesis is that the coin is fair. Under that hypothesis, work out how often you would get 7 or more heads in 10 flips (for example, by repeating the 10-flip trial many times with a fair coin and counting). If the probability of getting 7 or more heads is very small, the result is very unlikely to happen when the null hypothesis is true, so we can reject the null hypothesis.
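The coin example can be computed exactly instead of simulated. A minimal sketch, with a small helper for the Bonferroni threshold as well; the function names and the choice of 20 tests are illustrative, not from the original notes.

```python
from math import comb


def p_value_at_least(heads, flips, p_heads=0.5):
    """One-sided p-value: P(X >= heads) for X ~ Binomial(flips, p_heads)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


p = p_value_at_least(7, 10)
print(round(p, 4))  # 0.1719

# Hypothetical: 20 metrics tested at an overall 5% level.
per_test_alpha = bonferroni_alpha(0.05, 20)
print(per_test_alpha)
```

For 7 heads in 10 flips the p-value is 176/1024, about 0.17, so at the usual 5% level this particular result would not be enough to reject the fair-coin hypothesis.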
- Two A/B testing metrics, each with a confidence interval for the experiment result: one interval is very wide and the other is very narrow. What can you read from that?
A confidence interval is an interval within which the parameter is expected to fall, with a certain degree of confidence. A 95% confidence interval means that if we repeated the experiment many times, about 95% of the resulting intervals would contain the true value of the parameter. At the same confidence level, a wide CI is more likely to contain the null hypothesis value, so we are less likely to reject the null hypothesis when it is false. In other words, a wider CI comes with a higher type 2 error rate and lower power. A wide CI also means a less precise estimate of the effect, which usually reflects higher variability in the metric or a smaller sample.
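A small sketch of why one interval can be much wider than another: under the normal approximation the half-width is z * s / sqrt(n), so it grows with the metric's standard deviation and shrinks with sample size. The lift, standard deviation and sample sizes below are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def mean_ci(sample_mean, sample_std, n, z=Z_95):
    """Normal-approximation confidence interval for a mean."""
    half_width = z * sample_std / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Same hypothetical point estimate and noise, different sample sizes.
narrow = mean_ci(sample_mean=0.5, sample_std=4.0, n=10_000)
wide = mean_ci(sample_mean=0.5, sample_std=4.0, n=100)

print(narrow)  # lies entirely above 0
print(wide)    # straddles 0
```

With these numbers the narrow interval excludes 0 while the wide one includes it, so the same observed lift is significant in one case and inconclusive in the other.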
The interviewer also asked how you can confirm that a change in a metric was caused by a particular factor.
A designed experiment applies treatments to a group and records the effects. It is used to show causality by randomly assigning units to control and experiment groups and then comparing them. Random assignment makes it unlikely that units with something in common all end up in the same group; it creates roughly similar groups by approximately balancing potential confounding variables between them.
Since we already ran the designed experiment and got a significant result, if we can confirm that assignment was truly random (the core assumption of an A/B test) and the sanity checks passed, we can be reasonably confident the effect is due to the new feature.
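- planning to add a new feature on the video page, how would you make the decision?
(1. what metrics would you choose? (average_time_watched)
(2. what's the null hypothesis and the alternative hypothesis?
(3. anything to look out for when setting up the experiment? (This question is fairly broad but easy to answer. I focused on keeping the population distributions of the control and experiment groups consistent, so that as far as possible the only variable is the new feature.)
(4. the test you would use? (two-sample t-test) follow up: what statistics do you need to calculate? (I answered that you need the mean and variance of the two samples, but I didn't remember the exact formula.)
(5. say the p-value is 0.3, what does this mean? (Here we fail to reject H0. A p-value of 0.3 means that if the new feature had no effect, we would still see a difference at least as extreme as the observed one about 30% of the time. It does not mean the feature has no effect 30% of the time, which is roughly what I said; I felt my answer was logically muddled.)
- infrastructure DS - The first question gives you the thickness, weight, etc. of a pile of coins and asks you to decide whether they were all produced by the same factory.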
K-means
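The second question is similar: you are given two bags of coins, with the diameter of every coin in each bag, and asked to decide whether the two bags were produced by the same factory.
If the histograms look roughly normal, consider a two-sample t-test.
If they are not normal (although that is unlikely), you would measure the distance between the two empirical distributions, using KS or one of the other nonparametric methods.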
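As a rough sketch of the two options for the coin bags, the block below computes the t statistic from each sample's mean and variance, and the two-sample KS statistic as the largest gap between the empirical CDFs. It uses Welch's version of the t-test, which does not assume equal variances (a swapped-in choice, not something the notes specify), and the diameters are invented. In practice, `scipy.stats.ttest_ind(..., equal_var=False)` and `scipy.stats.ks_2samp` would also return p-values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical coin diameters in millimetres, invented for illustration.
bag_1 = [24.1, 24.3, 24.2, 24.0, 24.2, 24.4]
bag_2 = [24.6, 24.8, 24.7, 24.5, 24.9, 24.7]

t, df = welch_t(bag_1, bag_2)
d = ks_statistic(bag_1, bag_2)
print(f"t={t:.2f}, df={df:.1f}, KS D={d:.2f}")
```

A large |t| or a KS statistic close to 1 (as with these made-up, non-overlapping bags) suggests the two bags come from different distributions; turning either statistic into a p-value requires the t distribution or the KS distribution.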