STAT3005: Midterm Review

Fall 2020

Introduction

We will quickly review selected materials
Please refer to the lecture notes for complete materials
- e.g., be careful with details such as those in A3.3.2

Infinite-dimensional
- the number of parameters is not fixed or
- there are infinitely many parameters
Model-free
- DOES NOT mean no statistical model
- only invariant to a wider class of models
Vs parametric
- Represents different beliefs
- No silver bullet for all statistical problems

The tower property is useful in proofs to deal with dependence (example: A3.2)
Basic form: \[ \mathbb{E}(X) = \mathbb{E}\{ \mathbb{E}(X \mid Y) \} \]
Extension used in A3.2: \[ \mathbb{P}(X_2<X_1, X_3<X_1) = \mathbb{E}\{ \mathbb{P}(X_2<X_1 \mid X_1) \mathbb{P}(X_3<X_1 \mid X_1)\} \]
- Note that \(\mathbb{P}(X_2<X_1 \mid X_1)\) and \(\mathbb{P}(X_3<X_1 \mid X_1)\) are both functions of \(X_1\), so they are still dependent
- Thus we cannot yet split the outer expectation into a product of two expectations
Rewrite probability as expectation of indicator \[ \mathbb{P}(X_2<X_1) = \mathbb{E}( \mathbb{I}_{X_2<X_1} ) \]
- This is called the fundamental bridge between \(\mathbb{P}\) and \(\mathbb{I}\) in the notes
- It also explains why the tower in A3.2 is an extension
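
A quick Monte Carlo check can make the extension concrete. The sketch below assumes, purely for illustration, that \(X_1, X_2, X_3\) are i.i.d. Uniform(0,1), so that \(\mathbb{P}(X_2<X_1 \mid X_1) = X_1\); the actual setting of A3.2 may differ, and the seed and number of replications are arbitrary choices.

```r
# Illustration only: assumes X1, X2, X3 are i.i.d. Uniform(0, 1),
# which is NOT necessarily the setting of A3.2.
set.seed(3005)
n_rep <- 1e5
x1 <- runif(n_rep)
x2 <- runif(n_rep)
x3 <- runif(n_rep)

# Fundamental bridge: a probability is the mean of an indicator.
lhs <- mean(x2 < x1 & x3 < x1)

# Tower extension: given X1, the two events are independent,
# and each has conditional probability punif(X1) = X1.
rhs <- mean(punif(x1)^2)

# Naive product of the two unconditional probabilities.
naive <- mean(x2 < x1) * mean(x3 < x1)

round(c(lhs = lhs, rhs = rhs, naive = naive), 3)
```

Under this assumption, lhs and rhs should both be close to 1/3, while naive is close to 1/4, which illustrates why the product cannot be taken outside the expectation.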

If \(X \perp \!\!\! \perp Y\), then \(g(X) \perp \!\!\! \perp h(Y)\) for any functions \(g,h\)
- Usage in A1.1.4: consider \(g(R_1^+, \ldots, R_n^+) = R_1^+\). Apply Theorem 2.2 with this fact
Property of covariance in A1.2.3: \[ \textrm{Cov}(X_1+X_2,X_3+X_4) = \textrm{Cov}(X_1,X_3)+\textrm{Cov}(X_1,X_4)+\textrm{Cov}(X_2,X_3)+\textrm{Cov}(X_2,X_4) \]
- Other properties, e.g., \(\textrm{Cov}(a,X) = 0\) if \(a\) is non-random, can be found online easily
Useful trick: full sum of ranks is non-random if permutation does not affect its value
- Example in A2.1.3: \(\sum_{i=1}^{2n} \sqrt{R_i} \equiv \sum_{i=1}^{2n} \sqrt{i}\)
- \(\sum_{i=1}^n \sqrt{R_i}\) is not a full sum, thus still random
- \(\sum_{i=1}^{2n} i \sqrt{R_i}\) is a full sum but permutation affects its value, thus still random
Other basic relationships have been used less frequently, or not at all, so far
- Review the hypothesis testing terminology if you are not familiar with it
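
The full-sum trick above can be checked numerically. The following is a minimal sketch, assuming the ranks come from continuous data (so there are no ties); the sample size, seed, and helper name `one_draw` are hypothetical choices for illustration.

```r
# Sketch: ranks of 2n i.i.d. continuous observations (no ties assumed).
set.seed(3005)
n <- 5

one_draw <- function() {
  r <- rank(rnorm(2 * n))                      # a random permutation of 1, ..., 2n
  c(full     = sum(sqrt(r)),                   # full sum, permutation-invariant
    partial  = sum(sqrt(r[1:n])),              # only the first n ranks
    weighted = sum((1:(2 * n)) * sqrt(r)))     # full sum, but weights depend on position
}

draws <- replicate(1000, one_draw())           # 3 x 1000 matrix, one column per draw

# The full sum matches sum(sqrt(1:(2n))) in every draw (up to rounding);
# the other two vary from draw to draw.
full_const <- sum(sqrt(1:(2 * n)))
max(abs(draws["full", ] - full_const))
apply(draws, 1, sd)
```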

Focus on the principles and concepts
- Since you have 48 hours for the project, there is more than enough time to check any theory or example
- Thus memorizing formulas is not necessary either
Broad idea of ranks: the ordinal information is “better” than the value itself
- “better” can be, e.g., in terms of robustness
Broad idea of signs: the side (if symmetric) is “better” than the value itself
Common problems with ranks or signs (see A2.3.2):
- Data are dependent (e.g., time series)
- Data are discrete (tied observations)
- Sample size is small when we want to use asymptotic results
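
To see how tied observations affect ranks in practice, here is a small sketch using base R's `rank()` on made-up data. The data vector is hypothetical; the `ties.method` options shown are the ones documented for `rank()`.

```r
# Hypothetical discrete data with one tie (the two 3's).
x <- c(3, 1, 3, 2)

mid_ranks <- rank(x)                         # default ties.method = "average" (midranks)
min_ranks <- rank(x, ties.method = "min")    # tied values share the smaller rank

set.seed(3005)
rnd_ranks <- rank(x, ties.method = "random") # ties broken at random

rbind(mid_ranks, min_ranks, rnd_ranks)
```

With midranks, the tied values share the rank 3.5, so the ranks are no longer a permutation of \(1, \ldots, n\), and null distributions derived for untied data need not apply exactly.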

Testing is a large focus of the project
- As we can also tell from the mock
- Here we discuss the general flow of answering some question types

Discuss whether \(T\) is a sensible statistic for testing…

Identify the target problem (e.g. location, trend, scale)
Under \(H_0\), discuss the likely value of \(T\) and its standardized version
Under \(H_1\), discuss the likely value of \(T\) and its standardized version
Argue if \(T\) can reject \(H_0\) in favor of \(H_1\) (see mock 1.3 for a counterexample)
Additional comments, e.g., visualization, simulations, assumptions…
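
The flow above can be supported by a small simulation. The sketch below uses a hypothetical example that is not taken from the notes: the sign statistic \(T = \sum_i \mathbb{I}_{X_i>0}\) for a location problem, with normal data and a shift of 0.5 under \(H_1\) chosen arbitrarily.

```r
# Hypothetical example (not from the notes): T = number of positive
# observations, used to test H0: median = 0 against H1: median > 0.
set.seed(3005)
n <- 30
n_rep <- 2000

sign_stat <- function(x) sum(x > 0)
# Under H0, T ~ Bin(n, 1/2), so E(T) = n/2 and Var(T) = n/4.
standardize <- function(t, n) (t - n / 2) / sqrt(n / 4)

z_h0 <- replicate(n_rep, standardize(sign_stat(rnorm(n, mean = 0)), n))
z_h1 <- replicate(n_rep, standardize(sign_stat(rnorm(n, mean = 0.5)), n))

# Under H0 the standardized statistic centres at 0; under H1 it shifts upward.
summary_table <- rbind(H0 = c(mean = mean(z_h0), sd = sd(z_h0)),
                       H1 = c(mean = mean(z_h1), sd = sd(z_h1)))
round(summary_table, 2)
```

A standardized statistic that stays near 0 under \(H_0\) and moves away from 0 under \(H_1\) is the kind of evidence that \(T\) is sensible for the target problem.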

Propose two other sensible test statistics \(T_2\) and \(T_3\)…

Identify the target problem (e.g. location, trend, scale)
Check the summary table in Section 3.1
Use a few sentences to describe your ideas, e.g.
- (Mock Q3.2) I propose the van der Waerden test as I believe data from a normal distribution are quite common for Google trend. Doing so enhances robustness under Principle 3.1 in favor of a common data generating process.
- (Mock Q3.2) I propose the rank sum test as well because it is quite classical. While it may not be the most powerful for the problem, it serves as a benchmark for comparison.
- These are possible enhancements to our analysis
Conduct simulation experiments (if required by the question)

Test this claim by using the statistic \(T\). Is your test reasonable? Is it improvable? How?

State the assumptions, hypotheses, \(p\)-value and decision…
Arguing that a test is reasonable is similar to arguing that it is sensible; see the previous discussion
For real data, we may check data properties such as dependence, discreteness, small sample size…
Argue whether and how the test is improvable, based on the problems that we have pinpointed
Additional comments, e.g., visualization, simulations, reference readings…
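
As an illustration of stating the assumptions, hypotheses, \(p\)-value and decision, here is a minimal sketch using the sign statistic and base R's `binom.test()`. The data are made up, and the choice of test and of the 5% level are assumptions for this example only.

```r
# Hypothetical data, made up for illustration only.
x <- c(1.2, -0.4, 0.8, 2.1, 0.3, -1.1, 0.9, 1.7, 0.5, 0.6)
n <- length(x)

# Assumptions: observations are independent with a continuous distribution
# (no zeros or ties). Hypotheses: H0: median = 0 versus H1: median > 0.
t_obs <- sum(x > 0)  # sign statistic

# Under H0, T ~ Bin(n, 1/2); large values of T support H1.
p_value <- binom.test(t_obs, n, p = 0.5, alternative = "greater")$p.value

alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "do not reject H0"
c(T = t_obs, p_value = round(p_value, 4), decision = decision)
```

Here \(T = 8\) out of \(n = 10\) and the \(p\)-value is \(56/1024 \approx 0.055\), so \(H_0\) is not rejected at the 5% level. The small sample is one of the problems one could pinpoint when arguing how the test might be improved.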

Design a simulation experiment to compare their testing performance…

Assume we have chosen the tests and data generating process
Generate data with the models
- R function: typically r + distribution name, e.g., rexp, rnorm
Perform the tests on the data
- If you plan to use an “exact” test, generate the pivotal values only once (and set the seed)
Store the testing results in, e.g., an array
- Typical dimensions: replications, parameter of interest, models, tests
- Check the suggested code for the mock project
Plot the power curves
- Remember to change the parameters, e.g., axis labels, if you use our code
Comment on the plot
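
The steps above can be sketched as follows. This is not the suggested mock project code: the two tests (one-sample \(t\) and signed rank), the normal location-shift model, the sample size, and the number of replications are all assumptions chosen for illustration, and the "models" dimension is omitted to keep the sketch short.

```r
# Sketch only: tests, sample size, and model are illustrative assumptions.
set.seed(3005)
n <- 20                                 # sample size
n_rep <- 500                            # replications
deltas <- c(0, 0.25, 0.5, 0.75, 1)      # location shift (parameter of interest)
tests <- c("t", "signed_rank")
alpha <- 0.05

# Array of rejection indicators: replications x parameter x tests.
reject <- array(NA, dim = c(n_rep, length(deltas), length(tests)),
                dimnames = list(NULL, as.character(deltas), tests))

for (j in seq_along(deltas)) {
  for (i in seq_len(n_rep)) {
    x <- rnorm(n, mean = deltas[j])     # generate data under the model
    reject[i, j, "t"] <- t.test(x, alternative = "greater")$p.value < alpha
    reject[i, j, "signed_rank"] <-
      wilcox.test(x, alternative = "greater")$p.value < alpha
  }
}

# Empirical power = rejection rate over replications, per shift and test.
power <- apply(reject, c(2, 3), mean)

# Power curves; adjust labels to match your own parameter.
matplot(deltas, power, type = "b", pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "Location shift", ylab = "Empirical power")
abline(h = alpha, lty = 3)              # nominal size for reference
legend("bottomright", legend = tests, pch = 1:2, lty = 1:2, col = 1:2)
```

When commenting on the plot, check first that both curves start near the nominal size at zero shift, then compare how quickly each curve rises.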

Answer ALL questions
- A blank answer scores 0 (do not waste the chance!)
- We give partial credit if you can show understanding of the materials
The only thing that you CANNOT do is communicate with others
In other words, you can:
- Use WolframAlpha to check algebra
  - A3.2: integrate 2xabs(x) +abs(x) from -0.5 to 0.5
- Read Stack Overflow to debug R code
  - But you cannot ask questions there
- Watch YouTube video tutorials…
Remember to cite the source
Submit on time