I have a @statistics question that I'd like some help with. I've got an actual problem related to environmental science, but I'm going to frame it in terms of the fediverse, for various reasons. So, if you feel like asking, "Why do you want to know this?!?" please realize that it's an example question.
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:13:12 JST Evan Prodromou -
Embed this notice
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:17:38 JST Evan Prodromou @statistics So, let's say in my example question, I want to find the servers on the fediverse that have the highest rate of communication with other servers. To do this, I'm going to take a list of known servers, and then get samples from the public feeds of those servers. I'm going to count the unique domains of the addressees of all the replies in the public feeds.
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:19:26 JST Evan Prodromou @statistics so, on server1.example, if a@server1.example replied to b@server2.example and c@server3.example, and d@server1.example replied to e@server3.example and f@server4.example, we have 3 unique domains replied to (server2.example, server3.example, and server4.example).
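A minimal Python sketch of the count described in this post, using the same four replies. The helper names `domain_of` and `unique_reply_domains` are hypothetical, and excluding the home domain from the count is an assumption based on the phrase "other servers".

```python
def domain_of(address: str) -> str:
    """Return the domain part of a user@domain address, lowercased."""
    return address.rsplit("@", 1)[1].lower()


def unique_reply_domains(home_domain: str, replies) -> set:
    """Collect the distinct domains replied to from one server's public feed.

    `replies` is an iterable of (author, addressee) address pairs.
    Assumption: replies to accounts on the home domain are not counted.
    """
    return {
        domain_of(addressee)
        for _author, addressee in replies
        if domain_of(addressee) != home_domain
    }


# The replies from the example in the post.
replies = [
    ("a@server1.example", "b@server2.example"),
    ("a@server1.example", "c@server3.example"),
    ("d@server1.example", "e@server3.example"),
    ("d@server1.example", "f@server4.example"),
]

domains = unique_reply_domains("server1.example", replies)
print(sorted(domains))  # server2, server3 and server4
print(len(domains))     # 3
```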
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:21:50 JST Evan Prodromou @statistics so, when I do this analysis, I find that the servers with the most other domains replied to are also the servers with the most accounts on them. Number of accounts on the server is a confounding variable, here; I'm not finding out about cultural norms in connectedness.
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:24:03 JST Evan Prodromou @statistics So, in this fictional example, I first try dividing the number of replied-to domains by the number of total posts. This seems like it would be OK, but now I'm favouring very small, inactive servers instead. If a server has only one post, with 2 or 3 domains replied to, it's got a very high rate of domains per post.
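The small-server effect described in this post can be reproduced with a short sketch. The server names and the figures for the large server are made up for illustration; only the "one post, 3 domains" case comes from the post itself.

```python
def domains_per_post(unique_domains: int, total_posts: int) -> float:
    """Replied-to domains divided by total posts in the sample."""
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    return unique_domains / total_posts


# Hypothetical samples: name -> (unique replied-to domains, total posts).
servers = {
    "tiny.example": (3, 1),        # one post that replied to 3 domains
    "big.example": (400, 10_000),  # busy server, many domains
}

rates = {name: domains_per_post(d, p) for name, (d, p) in servers.items()}
ranked = sorted(rates, key=rates.get, reverse=True)

for name in ranked:
    print(f"{name}: {rates[name]:.2f} domains per post")
```

The single-post server ranks first even though the large server reached far more domains, which is the distortion the post describes.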
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:27:27 JST Evan Prodromou @statistics The best I've been able to do in this situation is set a threshold value that I consider statistically significant -- say, 100 posts/day. So, I don't get a distorted view from those very small servers. This is providing satisfactory results, but I still have questions.
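One way the threshold approach could look in code, as a sketch only. The 100 posts/day cutoff is the value mentioned in the post; the function name, the input layout, and all sample figures are assumptions made for illustration.

```python
MIN_POSTS_PER_DAY = 100  # cutoff mentioned in the post


def rank_active_servers(stats, min_posts_per_day=MIN_POSTS_PER_DAY):
    """Rank servers by domains per post, skipping low-activity servers.

    `stats` maps a server name to a hypothetical tuple of
    (unique replied-to domains, total posts, days sampled).
    """
    eligible = {}
    for name, (domains, posts, days) in stats.items():
        if days > 0 and posts / days >= min_posts_per_day:
            eligible[name] = domains / posts
    return sorted(eligible.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data.
stats = {
    "tiny.example": (3, 1, 1),        # 1 post/day, filtered out
    "mid.example": (150, 1_400, 7),   # 200 posts/day
    "big.example": (400, 10_000, 7),  # about 1,429 posts/day
}

for name, rate in rank_active_servers(stats):
    print(f"{name}: {rate:.3f} domains per post")
```

The cutoff removes the single-post server, but it is still a hand-picked constant, which is what the follow-up question in the thread is about.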
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:28:17 JST Evan Prodromou @statistics first, if I'm trying to get a measure of diversity in my different sample sets, is this an OK way to do it? and second, is there a way to determine what this threshold of statistical significance is?
Evan Prodromou (evan@cosocial.ca)'s status on Saturday, 23-Dec-2023 03:54:35 JST Evan Prodromou @mrcopilot @statistics unrelated to real question.
MrCopilot (mrcopilot@mstdn.social)'s status on Saturday, 23-Dec-2023 03:54:36 JST MrCopilot OK, so a quick question (I don't know how well it translates from the example).
You're sampling every post, not a normalized list with an identical number of replies?
For instance:
Post A has 8 replies on a tiny server
Post B has 1 reply on a large server
is different info than two 8-reply posts, no?
Joseph Szymborski :qcca: (jszym@cosocial.ca)'s status on Saturday, 23-Dec-2023 04:03:32 JST Joseph Szymborski :qcca: @evan @statistics I love these sorts of graph analysis questions, and I'm going to do some more reading, but my immediate thought is that you might have more success dividing by the log of total posts.
Also, using the number of accounts rather than posts sounds like a better approach intuitively. I think the log of the # of accounts may be interesting.
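A rough sketch of the log-normalization idea suggested here, assuming the natural log and made-up figures. `log_normalized` is a hypothetical helper; `size` can be either total posts or number of accounts. Since log(1) is 0, a server with a single post or account has no defined score under this formula, so the sketch rejects it.

```python
import math


def log_normalized(unique_domains: int, size: int) -> float:
    """Replied-to domains divided by the natural log of the server's size.

    `size` may be total posts or number of accounts.
    Sizes below 2 are rejected because log(1) == 0.
    """
    if size < 2:
        raise ValueError("size must be at least 2 so that log(size) > 0")
    return unique_domains / math.log(size)


# Made-up figures: (unique replied-to domains, total posts).
tiny = log_normalized(3, 2)
big = log_normalized(400, 10_000)

print(f"tiny: {tiny:.2f}")
print(f"big:  {big:.2f}")
```

With these numbers the large server scores higher than the two-post server, the reverse of the plain domains-per-post ranking. Whether that is the right amount of correction for real data is not something this sketch can settle.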