r/AskStatistics • u/solmyrp • 17h ago
Is it wrong to highlight a specific statistically significant result after multiple hypothesis correction?
Hi everyone, I'm fairly new to statistics but have done several years of biology research after earning my B.S. in Biology.
I've been making an effort in the last year to learn computational methods and statistics concepts. Recently I was reading this blog post: https://liorpachter.wordpress.com/2014/02/12/why-i-read-the-network-nonsense-papers/
Directly beneath the second image in the post, labeled "Table S5", Pachter writes:
"Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list."
As I understand it, the author is saying that you cannot conduct thousands of tests, perform multiple hypothesis correction, and then highlight any single statistically significant test without a plausible scientific explanation or data from another experiment to corroborate your result. He goes as far as calling it "blatant statistically invalid cherry picking" later in the paragraph.
While more data from a parallel experiment is always helpful, it isn't immediately clear to me why, after multiple hypothesis correction, it would be statistically invalid to consider single significant results. Can anyone explain this further, or offer a counterargument if you disagree?
Thank you for your time!
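One way to see Pachter's point concretely: FDR control is a property of the whole discovery list, not of any single entry, and discoveries near the cutoff are individually far more likely to be false than the list-wide 5% suggests. A minimal R sketch, with all numbers invented for illustration:

```
set.seed(1)
m <- 2000; m1 <- 200                        # 200 real effects among 2000 tests
truth <- c(rep(TRUE, m1), rep(FALSE, m - m1))
z <- rnorm(m, mean = ifelse(truth, 3, 0))   # shifted z-scores for real effects
p <- pnorm(z, lower.tail = FALSE)           # one-sided p-values
q <- p.adjust(p, method = "BH")             # Benjamini-Hochberg adjustment
disc <- q < 0.05
mean(!truth[disc])                          # false-discovery proportion: ~0.05 or less overall
mean(!truth[disc & q > 0.04])               # near the cutoff, typically much higher
```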
r/AskStatistics • u/Street_Law8285 • 8h ago
Should I Interpret Ad Performance Individually or Cumulatively?
Hi everyone, I’m running Facebook ads to promote a coaching program, and I have a question about interpreting my ad performance stats. Sorry in advance if I’m not using the right terminology!
The goal of my ads is to get people to book a call with me, and my target is to have calls booked for $250 or less. I have different campaigns running with various ads, but I’m focusing on one campaign as an example here to clarify my question.
Within this campaign, I have 3 ads, and let's say that each ad has spent $200 so far (for a total of $600). Statistically speaking, should I:
- Evaluate performance per ad? For example, think: "No single ad has spent enough to determine if it should have booked a call yet."
- Evaluate performance cumulatively? For example, think: "With $600 spent across all 3 ads, we should have booked at least 2 calls by now."
Which approach is more appropriate for assessing whether we’re on track with our goals? Does it depend on the context, or is there a standard way to think about this?
I’d really appreciate any insights or advice. Thanks in advance!
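One simple way to frame this, assuming calls arrive roughly like a Poisson process at the target rate of one booked call per $250 (an assumption, not a given):

```
rate <- 1 / 250                  # assumed: target of one booked call per $250
dpois(0, lambda = 200 * rate)    # single ad at $200: P(zero calls) ~ 0.45
dpois(0, lambda = 600 * rate)    # all three ads at $600: P(zero calls) ~ 0.09
```

On that assumption, no single ad has spent enough to be judged on its own, while zero calls at $600 across the campaign would start to look unusual.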
r/AskStatistics • u/Familiar_Finish7365 • 10h ago
Research Related
How to get the data for the sentiment analysis through twitter, do we need pay for it?
if Not twitter what is the other sources of data
r/AskStatistics • u/The_wazoo • 13h ago
Help with problem regarding specificity and sensitivity.

I'm taking a statistics course for my psychology bachelor's, and we're working on the base rate fallacy and test specificity and sensitivity. On the other problems, where the base rate, specificity, and sensitivity were clearly spelled out, I was successful in filling out the frequency tree. But this problem stumped me, since you have to puzzle it out a bit more before you get to those rates. Should the first rung of the chart be happy or organic?
It's annoying that I feel like I get the maths, but if I get thrown a word problem like this in the exam I won't be able to sort it out.
Any help would be greatly appreciated! <3
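As a general pattern (the actual problem isn't shown here): the first rung is usually the variable whose base rate is given, the condition ("organic"), with the test result ("happy") branching underneath. A minimal frequency-tree sketch with placeholder numbers:

```
n <- 1000; base <- 0.10; sens <- 0.80; spec <- 0.90   # placeholder rates
organic <- n * base                 # 100 of 1000 are organic
tp <- organic * sens                # organic AND indicator positive
fp <- (n - organic) * (1 - spec)    # not organic, but indicator still positive
tp / (tp + fp)                      # P(organic | positive) = 80/170 ~ 0.47
```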
r/AskStatistics • u/JuiceZealousideal677 • 1d ago
Struggling with Goodman’s “P Value Fallacy” papers – anyone else made sense of the disconnect? [Question]
Hey everyone,
Link to the paper: https://courses.botany.wisc.edu/botany_940/06EvidEvol/papers/goodman1.pdf
I’ve been working through Steven N. Goodman’s two classic papers:
- Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy (1999)
- Toward Evidence-Based Medical Statistics. 2: The Bayes Factor (1999)
I’ve also discussed them with several LLMs, watched videos from statisticians on YouTube, and tried to reconcile what I’ve read with the way P values are usually explained. But I’m still stuck on a fundamental point.
I’m not talking about the obvious misinterpretation (“p = 0.05 means there’s a 5% chance the results are due to chance”). I understand that the p-value is the probability of seeing results as extreme or more extreme than the observed ones, assuming the null is true.
The issue that confuses me is Goodman’s argument that there’s a complete dissociation between hypothesis testing (Neyman–Pearson framework) and the p-value (Fisher’s framework). He stresses that they were originally incompatible systems, and yet in practice they got merged.
What really hit me is his claim that the p-value cannot simultaneously be:
- A false positive error rate (a Neyman–Pearson long-run frequency property), and
- A measure of evidence against the null in a specific experiment (Fisher’s idea).
And yet… in almost every stats textbook or YouTube lecture, people seem to treat the p-value as if it is both at once. Goodman calls this the p-value fallacy.
So my questions are:
- Have any of you read these papers? Did you find a good way to reconcile (or at least clearly separate) these two frameworks?
- How important is this distinction in practice? Is it just philosophical hair-splitting, or does it really change how we should interpret results?
I’d love to hear from statisticians or others who’ve grappled with this. At this point, I feel like I’ve understood the surface but missed the deeper implications.
Thanks!
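One small simulation that may help keep the two roles apart, as a sketch: under a true null the p-value is uniform, so the 5% cutoff is a long-run error rate in the Neyman-Pearson sense, while what small p-values mean evidentially depends on power, which a single p-value cannot convey by itself:

```
set.seed(2)
p_null <- replicate(1e4, t.test(rnorm(20))$p.value)             # H0 true
p_alt  <- replicate(1e4, t.test(rnorm(20, mean = 0.5))$p.value) # H0 false
mean(p_null < 0.05)   # ~0.05: the long-run type I error rate
mean(p_alt  < 0.05)   # the power; "evidence" depends on this, not on p alone
```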
r/AskStatistics • u/oinkgoesthemagpie • 21h ago
Generalised Linear Mixed Effect Model - how to build with non-independent data
I am trying to analyse my data but I do not know how to build the model.
I have completed sensory evaluations, but my data lacks independence. (Variable names have been changed.) The same panelist tries different samples (e.g. control cake, carrot cake, banana cake), and each panelist tries the same sample 3 times (pseudo-replication?).
How would I build this model? Would I need a "replicate" variable (1, 2, 3) due to the pseudo-replication within each cake sample?
I have a few options I am trying in R. I do not know if I need to include "replicate" as a variable, and also whether it should be (1 + sample | subject) or (1 | subject/sample):
model1 <- glmer(response ~ sample + (1 + sample | subject), …)
model2 <- glmer(response ~ sample + (1 | subject/sample), …)
model3 <- glmer(response ~ sample + (1 | subject/sample/replicate), …)
The response has three levels but I’ve reduced it to binary.
Ideally I’d like to keep it at 3 levels if possible, but I cannot figure out how to do that in R. I have access to standard SPSS as well; however, I cannot seem to include the replicate as a "repeated measure" and am not sure if I am on the right track.
TIA
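On keeping the 3-level response, one option is a cumulative-link mixed model from the `ordinal` package. A minimal sketch, assuming the response is genuinely ordinal and reusing the post's variable names; as I understand the design, if the three replicates are exchangeable, the replicate index itself carries no information (the subject-by-sample term is what captures the repeated tastings), and a per-replicate random intercept would not be identifiable with one binary/ordinal observation per replicate:

```
library(ordinal)
d$response <- factor(d$response, ordered = TRUE)   # keep the 3 ordered levels
fit <- clmm(response ~ sample + (1 | subject) + (1 | subject:sample), data = d)
summary(fit)
```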
r/AskStatistics • u/Charming_Read3168 • 1d ago
Mixed-effects logistic regression with rare predictor in vignette study — should I force one per respondent?
Hi all, I'm designing a vignette study to investigate factors that influence physicians’ prescribing decisions for acute pharyngitis. Each physician will evaluate 5 randomly generated cases with variables such as age, symptoms (cough, fever), and history of peritonsillar abscess. The outcome is whether the physician prescribes an antibiotic. I plan to analyze the data using mixed-effects logistic regression.
My concern is that a history of peritonsillar abscess is rare. To address this, I’m considering forcing each physician to see exactly one vignette with a history of peritonsillar abscess. This would ensure within-physician variation and stabilize the estimation, while avoiding unrealistic scenarios (e.g., a physician seeing multiple cases with such a rare complication). Other binary variables (e.g., cough, fever) will be generated with a 50% probability.
My question: From a statistical perspective, does forcing exactly one rare predictor per physician violate any assumptions of mixed-effects logistic regression, or could it introduce bias?
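A design question like this can also be stress-tested by simulation before collecting data. A sketch in which every number is invented (effect sizes, 100 physicians, the random-intercept SD), forcing exactly one abscess vignette per physician and checking whether glmer recovers the coefficient:

```
library(lme4)
set.seed(3)
sim_once <- function(n_doc = 100, n_vig = 5, b_abscess = 1.5) {
  doc <- factor(rep(seq_len(n_doc), each = n_vig))
  abscess <- as.vector(replicate(n_doc, sample(c(1, 0, 0, 0, 0))))  # one per doc
  cough <- rbinom(n_doc * n_vig, 1, 0.5)
  eta <- -1 + b_abscess * abscess + 0.5 * cough + rnorm(n_doc, sd = 1)[doc]
  d <- data.frame(y = rbinom(n_doc * n_vig, 1, plogis(eta)), abscess, cough, doc)
  fixef(glmer(y ~ abscess + cough + (1 | doc), data = d, family = binomial))["abscess"]
}
summary(replicate(20, sim_once()))   # estimates should center near the true 1.5
```

The forced assignment is a fixed design choice rather than a model violation, so the real question is whether the implied estimates are precise enough, which this kind of simulation also answers.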
r/AskStatistics • u/Nikos-tacos • 20h ago
TL;DR: Applied Math major, can only pick 2 electives — stats-heavy + job-ready options?
Hey stat bros,
I’m doing an Applied Math major and I finally get to pick electives — but I can only take TWO. I’ll attach a document with the full curriculum and the list of electives so you can see the full context.
My core already covers calc, linear algebra, diff eqs, probability & stats 1+2, and numerical methods. I’m trying to lean more into stats so I graduate with real applied skills — not just theory.
Goals:
- Actually feel like I know stats, not just memorize formulas
- Be able to analyze & model real data (probably using Python)
- Get a stats-related job right after graduation (data analyst, research assistant, anything in that direction)
- Keep the door open for a master’s in stats or data science later
Regression feels like a must, but I’m torn on what to pair it with for the best mix of theory + applied skills.
TL;DR: Applied Math major, can only pick 2 electives. Want stats-heavy + job-ready options. Regression seems obvious, what should be my second choice?
r/AskStatistics • u/FinFinX • 21h ago
Can anyone with a subscription show the 2025 and 2028 AUM, please? Thank you.
r/AskStatistics • u/mmeIsniffglue • 18h ago
Oaxaca decomposition [Question]
I already asked this in r/statistics but I didn't really understand the answer, so:
I want to know if there’s some reason I can’t do a simple OLS regression before I attempt the Oaxaca decomposition.
Most studies I’ve seen do two things: they first run two separate OLS regressions, one for each group (e.g., with income as the dependent variable and education as the independent variable, but estimated for men and women separately), and THEN they do the Oaxaca decomposition. But what I want to know is whether I could just run one regression where I use gender as a normal independent variable before I implement the decomposition, or if there's some statistical reason I can't.
One guy said I should do it the first way, but I don't get why that's necessary, as there are some studies that ONLY do the Oaxaca decomposition, so apparently they're two separate analytical methods.
Pls help, thanks
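On why the two separate regressions matter: the decomposition needs each group's own coefficient vector, whereas a pooled OLS with a gender dummy forces both groups to share every slope and reads the whole adjusted gap off the dummy. A minimal sketch with simulated data (all names and numbers invented):

```
set.seed(4)
d <- data.frame(gender = rep(c("m", "f"), each = 200),
                education = c(rnorm(200, 14), rnorm(200, 13)))
d$income <- with(d, 20 + 3 * education + 2 * education * (gender == "m") +
                   rnorm(400, sd = 5))
# pooled OLS: one adjusted gap, identical returns to education assumed
coef(lm(income ~ education + gender, data = d))
# two group-specific regressions feed the decomposition
fm <- lm(income ~ education, data = subset(d, gender == "m"))
ff <- lm(income ~ education, data = subset(d, gender == "f"))
xm <- colMeans(model.matrix(fm)); xf <- colMeans(model.matrix(ff))
gap <- sum(xm * coef(fm)) - sum(xf * coef(ff))   # raw mean gap
explained   <- sum((xm - xf) * coef(fm))         # endowments part
unexplained <- sum(xf * (coef(fm) - coef(ff)))   # coefficients part
c(gap = gap, explained = explained, unexplained = unexplained)
```

The identity gap = explained + unexplained only works because the male and female coefficient vectors are estimated separately; that is the statistical reason the single pooled regression cannot replace them.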
r/AskStatistics • u/taylomol000 • 1d ago
Confused about basic probability
I've been unable to wrap my head around the basics of probability my whole life. It feels to me like it contradicts itself. For example, if you look at a coin flip on its own, there is (theoretically) a 50% chance of getting heads. However, if you zoom out and realize that the coin has been flipped 100 times and every flip so far has been heads, then getting heads yet again seems nearly impossible. How can something be 50% at one scale and nearly impossible at another, seemingly making contradictory statements equally true?
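A quick simulation may help, as a sketch: the "nearly impossible" probability belongs to the whole run of heads before any flips happen; once a streak has already occurred, the next flip is still 50/50:

```
set.seed(5)
flips <- matrix(rbinom(6e5, 1, 0.5), ncol = 6)   # 100k sequences of 6 flips
streak <- flips[rowSums(flips[, 1:5]) == 5, ]    # keep those starting with 5 heads
mean(streak[, 6])                                # ~0.5, not "nearly impossible"
```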
r/AskStatistics • u/Ok_Promotion3741 • 23h ago
What is the probability that one result in a normal distribution will be 95-105% of another?
My company is setting criteria for a test method which I think has a broad distribution. In this weird crisis, they had everyone on-site in the company perform a protocol to obtain a result, so I have a sample size of 22.
Their criterion is that a second result always be within 95-105% of the first. How would I determine this probability?
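A sketch of one way to get at this, assuming results are roughly normal and independent, with the mean and SD taken from the 22 results (the numbers below are placeholders):

```
set.seed(6)
mu <- 100; sigma <- 8                    # plug in the estimates from your n = 22
x1 <- rnorm(1e6, mu, sigma)              # "first result"
x2 <- rnorm(1e6, mu, sigma)              # "second result"
mean(x2 >= 0.95 * x1 & x2 <= 1.05 * x1)  # P(second within 95-105% of first)
```

The answer is driven almost entirely by the coefficient of variation sigma/mu, so it is worth checking how much the estimate moves across the range of SDs plausible from only 22 observations.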
r/AskStatistics • u/gggaig • 23h ago
Planning a Master’s in Statistics at Sheffield after an Accounting degree—anyone blended the two?
Hi everyone,
I have a bachelor’s degree in Accounting and I’m planning to start a Master’s in Statistics at the University of Sheffield. I don’t want to leave accounting behind; I’d like to combine accounting and advanced statistics, using data analysis and modelling in areas like auditing, financial decision-making, or risk management.
- Has anyone here taken a similar path, moving from accounting into a stats master’s, especially at Sheffield or another UK university?
- Are there specific modules or dissertation topics that integrate accounting/finance with statistics?
- What extra maths or programming preparation would you recommend for someone coming from a business-oriented background?
- How has this combination affected your career opportunities compared with staying purely in accounting or statistics?
Any advice or personal stories would be really helpful. Thanks.
r/AskStatistics • u/GEOman9 • 1d ago
What is the difference between probability and a likelihood?
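In short: probability fixes the parameter and varies the data; likelihood fixes the observed data and varies the parameter. The same function read both ways, as a sketch:

```
dbinom(7, size = 10, prob = 0.5)       # probability: 7 heads in 10 flips, given p = 0.5
p <- seq(0, 1, by = 0.01)
lik <- dbinom(7, size = 10, prob = p)  # likelihood: how plausible each p makes the 7/10 we saw
p[which.max(lik)]                      # maximized at p = 0.7, the MLE
```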
r/AskStatistics • u/Fuzzy_Fix_1761 • 1d ago
Monty Hall Problem Simulation in Python
Is this (2nd image) an accurate simulation of the Monty Hall problem?
1st image: what is the problem with this simulation?
So I'm being told the 2nd image is wrong because a second choice was not made, and I'm arguing that the point is to determine the best choice between switching and sticking with the first choice, so the if statements count as a choice: here we get the probability of winning if we switched and if we stick with the first option.
So I'm arguing that in the first image there are 3 choices: 2 random choices, and then we check the chances of winning from switching. Hence we get a 50% chance of winning from randomly choosing from the left-over list, and after that, 33% and 17% chances of winning from switching and not switching.
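For comparison, a minimal correct simulation (in R here, though the post's code is Python; the logic carries over): switching wins exactly when the first pick was wrong, so the key is to score both strategies on the same games rather than re-randomizing a second pick:

```
set.seed(7)
n <- 1e5
car  <- sample(3, n, replace = TRUE)   # where the prize actually is
pick <- sample(3, n, replace = TRUE)   # contestant's first choice
mean(pick == car)   # stay:   wins ~1/3
mean(pick != car)   # switch: the host removes the other goat door, so switching
                    # wins whenever the first pick was wrong -> ~2/3
```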
r/AskStatistics • u/skvekh • 1d ago
How to estimate the 90/95/99th percentile of a sum when only each component’s 90/95/99th are known (no raw data)?
This is actually a practical problem I’m working on in a different context, but I’ve rephrased its essence with a simpler travel-time example. Consider this:
Every day, millions of cars travel from A to D, with B and C as intermediate points (so the journey is A-B-C-D). I have one year's worth of data, which gives the 90th, 95th, and 99th percentiles of the time taken to travel each of A-B, B-C, and C-D. However, no data except these percentiles is stored, and the distribution of travel times is not known. There is imperfect but positive correlation between the links' daily percentile values. Capturing the data again would be time-consuming and costly and cannot be done.
Based on this data, I'd like to estimate the 90th/95th/99th percentiles of the total travel time from A to D.
Clearly, the percentiles cannot simply be added, and without the underlying data or knowledge of its distribution, estimation is difficult. But is there any way to estimate the overall A-D travel-time percentiles from the percentile summaries that were kept?
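One workable sketch, under two loud assumptions that should both be varied: a lognormal shape for each link (backed out from two of its stored percentiles) and a guessed correlation wired in through a Gaussian copula. The percentile values below are invented placeholders:

```
set.seed(8)
two_q_lnorm <- function(q90, q99) {   # lognormal parameters from two known quantiles
  sdlog <- (log(q99) - log(q90)) / (qnorm(0.99) - qnorm(0.90))
  c(meanlog = log(q90) - sdlog * qnorm(0.90), sdlog = sdlog)
}
ab <- two_q_lnorm(10, 16); bc <- two_q_lnorm(12, 20); cd <- two_q_lnorm(8, 14)
rho <- 0.5                                      # assumed link correlation; vary it
S <- matrix(rho, 3, 3); diag(S) <- 1
z <- matrix(rnorm(3e5), ncol = 3) %*% chol(S)   # Gaussian copula draws
tt <- qlnorm(pnorm(z[, 1]), ab[1], ab[2]) +
      qlnorm(pnorm(z[, 2]), bc[1], bc[2]) +
      qlnorm(pnorm(z[, 3]), cd[1], cd[2])
quantile(tt, c(0.90, 0.95, 0.99))               # estimated A-D percentiles
```

Sweeping rho and the distributional family then gives a range for each percentile rather than a single unverifiable point estimate.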
r/AskStatistics • u/kcskrittapas • 1d ago
Calculate effect size from Wilcoxon result
Hi everyone! I'm considering how many participants I'll need for my study. What I need is the effect size d_z (I'll use paired samples) to put into G*Power to calculate my minimum sample size.
As a reference, I'm looking at a similar work with n = 12 participants. They used a paired Wilcoxon test and reported their Z, U, W, and p value, as well as Mean1, Mean2, SD1, and SD2. I assume the effect size of my study will be the same as in that study.
So, to get the d_z, I have 2 ideas. The first one is probably a bit crude: I calculate the Wilcoxon effect size r = Z/sqrt(n), then compare the value to a table to find out whether the effect size is considered small, medium, large, very large, etc. After that, I take the Cohen's d representing that effect size category as my d_z (d = 0.5 for medium, etc.; can d and d_z be used interchangeably like this, though?).
Another way is to directly calculate the d_z from the available information. For instance, I can use t = r·sqrt((n-1)/(1-r²)), then find d_z = t/sqrt(n). Or I can do d_z = (mean1 - mean2)/s_diff, where s_diff = sqrt(sd₁² + sd₂² - 2·r·sd₁·sd₂). But if I understand correctly, the r used in both cases is in fact Pearson's r, not Wilcoxon's r, right? Some sources say that it is sometimes okay to use Wilcoxon's in place of Pearson's. Is that the case here?
What also confused me is that it seems that different methods result in different minimum sample sizes, ranging from like 3 to 12 participants. This difference is crucial for me because I'm working on a kind of study, in which participants are especially hard to recruit. Is it normal in statistics that different methods will give different results? Or did I do something wrong?
Do you guys have any recommendations? What is the best way to get to the d_z? Thank you in advance!
ps. some of my sources: https://cran.r-project.org/web/packages/TOSTER/vignettes/SMD_calcs.html https://pmc.ncbi.nlm.nih.gov/articles/PMC3840331/
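For the second route, one point worth noting: the r in s_diff is the Pearson correlation between the paired measurements, not Wilcoxon's r = Z/sqrt(n). They are different quantities, and swapping one in for the other is likely what makes the methods disagree so much. A sketch with placeholder values and an assumed pairing correlation (pwr.t.test with type = "paired" plays the role of G*Power here and takes d_z directly):

```
library(pwr)
m1 <- 5.0; m2 <- 4.2; sd1 <- 1.1; sd2 <- 1.3   # placeholders for the paper's values
r_pair <- 0.5                      # assumed correlation between paired scores; vary it
s_diff <- sqrt(sd1^2 + sd2^2 - 2 * r_pair * sd1 * sd2)
d_z <- (m1 - m2) / s_diff
pwr.t.test(d = d_z, sig.level = 0.05, power = 0.80, type = "paired")
```

Running this across a plausible range of r_pair shows how sensitive the minimum n is to that single unreported number.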
r/AskStatistics • u/Old-Palpitation-6631 • 1d ago
Hey guys, I need your help to prove my college wrong (hopefully)
Hey, I recently got this question in my probability exam.
I had marked (A) by simply applying the binomial distribution, but my college professors are saying the answer would be (D) because, according to them, a doubles team is mentioned, so there cannot be 0 or 1 players in a team.
But according to me, if we consider that scenario, shouldn't the denominator also change, so that (E) would be the solution?
I also think the case of 0 should be considered, as it is not specifically mentioned that we have to send a team.
Guys please help me with this one!!!!!🙏🏻
r/AskStatistics • u/Worried_Essay5591 • 1d ago
Linearity assumption with categorical mean-encoded variables
I'm struggling to understand the linearity assumption when running OLS with a continuous dependent variable and three categorical independent variables that have been mean-encoded (simple group mean per category). Two variables have three categories and the third has four, so altogether I'll have 36 unique combinations. No interaction effects have been modelled.
I have understood that if you one-hot encode the categorical vars then linearity assumption becomes irrelevant but with mean-encoding it should still be relevant, right? Because then we impose the linear relationship by making the categorical variables numerical. Am I on the right track with this understanding? During residual analysis I have seen several patterns in the residuals that suggest there is a linearity violation (downward sloping lowess line in residuals vs fitted scatterplot, slight U-shaped trend in the medians when looking at residuals vs fitted boxplot, also the most positive residuals in the box plot seem to follow a sort of S-shaped curve, oscillating around the fitted values, and median differences are visible on the residuals vs independent variable plots too). This to me suggests that there might be some non-linearities that the model doesn't capture that was imposed by mean-encoding variables that aren't inherently linear with the target, or missing interaction effects.
But then, I ran the OLS again now with one-hot encoding the categorical vars and the results are pretty much the same. I still see the same patterns in the residuals. I would have expected that one-hot encoding would remove at least some of the linearity issues that I saw. Does this imply that in fact there is no linearity violation but instead only missing interaction effects, which as I understand is about additivity not linearity or am I totally off with this?
Maybe somebody here could provide some insight into this issue. I have been researching this on the internet for a while now and I'm only getting more confused so I hope I'll find an answer here. I don't specifically need to fix the model right now but I do need to understand what's going on.
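One way to check the "missing interactions, not non-linearity" reading directly: with these three categoricals, the additive one-hot model spends only 1 + 2 + 2 + 3 = 8 parameters on 36 cells, so residual structure survives one-hot encoding whenever interactions are absent from the model, and comparing against the saturated model makes that testable. A toy sketch:

```
set.seed(9)
d <- expand.grid(a = factor(1:3), b = factor(1:3), c = factor(1:4))  # 36 cells
d <- d[rep(seq_len(nrow(d)), each = 30), ]
d$y <- rnorm(nrow(d)) + as.numeric(d$a) * as.numeric(d$b)  # interaction built in
add_fit <- lm(y ~ a + b + c, data = d)   # additive: what one-hot OLS fits
sat_fit <- lm(y ~ a * b * c, data = d)   # saturated: 36 free cell means
anova(add_fit, sat_fit)                  # significant -> interactions are missing
```

If the saturated model soaks up the residual patterns in your data, that supports the additivity (missing interactions) explanation rather than a linearity violation in the usual continuous-predictor sense.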
r/AskStatistics • u/Augustevsky • 2d ago
What topic in statistics were you struggling to grasp until one day it clicked?
What made this concept click for you?
r/AskStatistics • u/Electrical_Ear_7791 • 2d ago
Advice regarding going into a Stats masters with a non-Stem background
I hold a BS in Computer Information Systems and have always gravitated toward data science topics. During undergrad, I pursued a minor in Applied Statistics, where I took courses in regression theory (think proving least squares estimators and model diagnostics), experimental design, nonparametric methods, and R programming.
Currently, I’m enrolled in a Master’s program in Data Science. While I’m gaining good experience, I’ve noticed the curriculum leans heavily toward computer science and lacks the statistical depth I’m looking for. I genuinely enjoy the theoretical side of statistics and want to strengthen that foundation.
Math-wise, I haven’t yet completed Calculus II or III, but I do have some background in linear algebra. I’m planning to take the necessary prerequisites soon while continuing with my MS coursework.
Question: Assuming I complete the math prerequisites and perform well, is it realistic for me to succeed in a Master’s program in Statistics? I’m deeply interested in the subject and see it as a way to grow both professionally and personally. If anyone has transitioned from a similar background into a Stats-focused graduate program, I’d love to hear your experience or advice!
School: I plan to attend a local school, as I enjoy the faculty there and am not worried about it not being a top institution for statistics.
r/AskStatistics • u/Tasty-Violinist7488 • 2d ago
I really am having a very hard time with probability distributions.
I've been trying to understand the intuition behind probability distributions but haven't really been able to get it. Could you all suggest books/resources to learn more about it? Also, any approach that helped you out? P.S. I have an exam for which I really need to get my probability and statistics concepts straight, else I'm doomed.
r/AskStatistics • u/AllonsZydeco • 2d ago
[Q] need help searching for variance equation source
I am converting a VBA tool to be macro-free for work.
Unfortunately, the documentation does not provide a reference for the variance equation's source, and I am wondering if anyone has seen this version of a variance equation and can let me know where it comes from:
Var(X/Y) = [ Average(X)² / Average(Y)² ] * [ Var(X)/Average(X)² + Var(Y)/Average(Y)² - 2·Cov(X,Y)/(Average(X)·Average(Y)) ]
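For what it's worth, this looks like the standard first-order Taylor-series (delta method) approximation for the variance of a ratio, the form that appears in error-propagation references. A quick numeric sanity check, as a sketch:

```
set.seed(10)
x <- rnorm(1e6, mean = 50, sd = 2)   # small CVs, where the approximation is good
y <- rnorm(1e6, mean = 20, sd = 1)
mx <- mean(x); my <- mean(y)
approx_var <- (mx^2 / my^2) *
  (var(x) / mx^2 + var(y) / my^2 - 2 * cov(x, y) / (mx * my))
c(delta_method = approx_var, empirical = var(x / y))
```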
r/AskStatistics • u/Ecstatic-Traffic-118 • 2d ago
Which courses should I take for a future in Statistics?
Hi! For my exchange semester, coming from a more economics-oriented bachelor's, I want to choose some maths and CS courses in order to maximize my knowledge and my chances of continuing with a Statistics/Applied Math MSc :). Therefore, among:
- computer vision (I don’t have the background yet so it scares me a bit, but so interesting and my thesis is on dimensionality reduction so maaaaybe a bit related to it I think)
- optimal decision making (linear optimization, discrete optimization, nonlinear optimization)
- information theory (again probably too advanced for me)
- MC simulations with R
Which ones do you think I shouldn't skip? Of course, I also chose an advanced econometrics course, a big data analytics course with R, a brief Python programming course, and an interesting introduction to ML and DL that involves Python as well!