Just an FYI, on Thursday (2 days from now) we will be re-attempting our 5-in-1 update. It is a very major update, so expect as much as 3 days of downtime as we finally move to the new architecture. We **will** be up at the other end, hopefully with everything working and a few new features.
Seems our load balancer is a little underpowered and it's stalling momentarily from time to time. Most of you probably didn't even notice it, but some of you may have.
I am going to upgrade it today to the next level. This may result in up to 5 minutes of downtime (the new architecture will have multiple load balancers).
So while that last fix did make things better, there is still a bit of lingering slowness in a few places, but things are mostly working.
I worked all day (and it's midnight now) to bring up some very extensive monitoring tools to help the team diagnose the issue.
I will work through the weekend to help improve things further, though I may need to get some sleep before I can fully resolve this. At least things are working, other than a slight bit of lag. I will keep everyone on #QOTO updated.
#QOTO is back up after a short downtime. As far as I can tell the fix went smoothly. Hopefully that will address the last of the problems from migration.
In about 10 minutes #QOTO will be going down in an attempt to fix a 16 GB table that may be at the root of one small lingering problem post-migration. Luckily we have good backups, and the table can always be recreated from scratch.
We should be back up shortly, hopefully with the last needed fix in place, and then we can start the upgrades soon.
After another short downtime, everything is fixed! #QOTO had to recreate an index that was lost in the migration, which was slowing down the DB and everything else. No amount of extra resources was going to help.
But it is fixed now and all queues are empty or very close to it. We will now downgrade the DB to a more sane level (we had upgraded it temporarily to keep the system running). But it will still be a pretty hefty system for us, and we can always scale back up when needed.
TL;DR: everything is working fine now.
PS: we are now going to work on a staging environment to test upgrades so we can safely start moving the main server through the upgrade cycle. Stay tuned.
So we found the real problem haunting us. Turns out we didn't even need the bigger database. There was just an index that got dropped during the migration. We are working now to put it back in place, at which point things should be back up to their normal speed.
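For context, restoring a dropped index in Postgres is usually a single DDL statement. The index and table names below are illustrative only (Mastodon's schema has many such indexes), not necessarily the one that went missing here:

```sql
-- Hypothetical example: rebuild a missing index on a large, live table.
-- CONCURRENTLY lets reads and writes continue while the index is built,
-- at the cost of a slower build.
CREATE INDEX CONCURRENTLY index_statuses_on_account_id
    ON statuses (account_id);
```

Without an index like this, every lookup falls back to a sequential scan of the whole table, which is why throwing more hardware at the database didn't help.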
One last update before I go to bed and disappear for 12 hours.
The backlog has gone from 1.2 mil at its peak earlier today to 0.4 mil now after we reconfigured things. It is steadily going down and everything should be back in working order before I get up.
One or two people were able to get images loaded after a VERY long wait. So while images still aren't working, it seems very likely related to the backlog. In a few hours, when the backlog clears, I expect image uploads to work again. If not, I will check what the problem is in the morning.
Other than that most things appear to be working and everything should be functional soon.
The backlog is about 2/3rds complete. This afternoon it peaked at 1.2 million and now it is ~0.4 million. I just moved to PgBouncer to speed things up a bit. Looks like it more than doubled the processing rate. Almost there.
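For anyone curious, PgBouncer sits between the app and Postgres and pools connections, so Sidekiq's many worker threads share a small set of real backend connections instead of each holding their own. A minimal config sketch; the database name, host, and pool sizes are assumptions for illustration, not our actual settings:

```ini
; pgbouncer.ini — minimal sketch, values are illustrative
[databases]
mastodon = host=127.0.0.1 port=5432 dbname=mastodon_production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction        ; return the server connection after each transaction
default_pool_size = 25         ; backend connections per user/database pair
max_client_conn = 500          ; many Sidekiq threads, few real connections
```

One caveat worth knowing: with `pool_mode = transaction`, Mastodon must be run with `PREPARED_STATEMENTS=false`, since prepared statements don't survive connection reuse across transactions.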
So the Sidekiq backlog on #QOTO is still progressing. As of a few hours ago, at the peak of the problem, our backlog was 1.2 million jobs. As of right now it's down to 0.7 million jobs. It is steadily decreasing; there is just a lot to get through from the downtime. We expect everything to be back in working order when it's done, which should be around end of day. In the meantime things are still usable, but you may experience very long lag on some actions.
Images still can't be uploaded; we hope this is the same problem.
As spam comes in we will block the offending servers on the #Fediverse. Please be patient; this is happening across the whole fedi and we are working on better ways to address it.
So I went to bed last night and woke up to find the Sidekiq workers were backlogging and we had 800K backlogged jobs. It was due to a misconfiguration that I have now fixed, and it appears the backlog is quickly resolving itself.
If you noticed any weirdness, it should be resolved in the next few hours as the backlog clears.
Please keep in mind #QOTO will need quite a few hours to handle the backlog from the downtime. Tomorrow we are going to split out the workers so they can begin using the new scaling, so this should be fixed by then. For the next 24 hours expect things to be a bit slow, and uploading pictures probably won't work until we fix that.
Give it 24 hours and hopefully things will be back to normal at that time.
Please note today's server migration was moved to Friday. Expect some downtime on Friday, EST morning. We are migrating servers and beginning the upgrade process for several updates we have in the pipeline. Expect brand new features in the near future.
@mapto Hi, this is generally not a monitored account, so I'm sorry you didn't get a response for 5 days. I will tag the main admin and he will see this and be able to respond.
We only post announcements here and occasionally use polls to help us make decisions that affect our user base. We don't usually respond quickly to direct messages to this account. If you need help with anything related to the QOTO servers, including moderation, please contact one of our administrators. They are listed on our about page: https://qoto.org/about/more