Unexpected downtime September 2019

Message boards : Server backend and mirrors : Unexpected downtime September 2019

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 15581 - Posted: 21 Sep 2019, 11:51:58 UTC

Storage issues
The main storage server is experiencing issues with the RAID cards and the connections to the disks. Disks are dropping out at random and it is still unclear why.
A replacement RAID card was ordered, delivered and installed, but it failed completely after the first few tests. It looks like it was simply dead on arrival. A replacement for the replacement is now being ordered.

We are now running on a temporary solution and have had to rebuild the site from weekly backups plus daily deltas. Fortunately, no data was lost because the switch-over was instantaneous.
In a few hours the disk resyncs should be complete and a number of general maintenance tasks will begin, followed by creating another full restore point.
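
To illustrate how a rebuild from weekly backups plus daily deltas fits together, here is a minimal Python sketch. It is not the actual BURP restore tooling: the function name `restore_plan` and all dates are hypothetical, and it assumes each delta only depends on the backups taken before it.

```python
from datetime import date


def restore_plan(fulls, deltas, target):
    """Pick the newest full backup on or before `target`, plus the
    deltas that must be replayed on top of it (oldest first)."""
    candidates = [d for d in fulls if d <= target]
    if not candidates:
        raise ValueError("no full backup on or before the target date")
    base = max(candidates)
    # Only deltas taken after the base and up to the target are needed.
    chain = sorted(d for d in deltas if base < d <= target)
    return base, chain


# Hypothetical schedule: weekly fulls on Sundays, a delta on every other day.
fulls = [date(2019, 9, 8), date(2019, 9, 15)]
deltas = [date(2019, 9, day) for day in range(9, 21) if day != 15]

base, chain = restore_plan(fulls, deltas, date(2019, 9, 20))
print("restore full backup from", base)
for d in chain:
    print("apply delta from", d)
```

With this made-up schedule the plan is the full backup from 15 Sep followed by the five deltas from 16 to 20 Sep, applied in order.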

More storage issues
A BIOS issue with the SATA controllers has also been causing a separate set of disk-related problems. The BIOS has now been updated, which seems to have improved the situation. Unfortunately, updating the BIOS like this requires a full shutdown of the backend, so hopefully it will not be needed too often.

Storage space
We're trying to use less storage overall, and the first pass of the year-long compression project recently completed. Most of our old sessions have had their frames compressed losslessly, and the disk space occupied by the original files is now ready to be released.
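
As a rough sketch of the idea (not the actual compression pipeline), the Python below compresses a frame losslessly and checks the round trip before the original would be released. `zlib` is only a stand-in here, since the real format used for the frames is not stated above, and `compress_verified` and the sample data are hypothetical.

```python
import hashlib
import zlib


def compress_verified(raw: bytes) -> bytes:
    """Compress `raw` and confirm it decompresses to identical bytes."""
    packed = zlib.compress(raw, 9)  # 9 = strongest zlib compression level
    restored = zlib.decompress(packed)
    if hashlib.sha256(restored).digest() != hashlib.sha256(raw).digest():
        raise RuntimeError("round trip mismatch, keep the original file")
    return packed


# Stand-in for real frame data: a repetitive 16 KiB byte pattern.
frame = bytes(range(256)) * 64
packed = compress_verified(frame)
print(len(frame), "bytes ->", len(packed), "bytes")
```

Only once the check passes is it safe to delete the uncompressed original.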

Future work
Other issues currently on the list:
- We're running out of memory very often because the Java backend does not release it (and we currently have a few very memory-hungry sessions in the mix). An update to Java should hopefully fix this.
- There is an issue causing the SQL database to be very slow at starting up.
- The RAID autodetect feature on the server is broken and needs to be fixed and tested. For now the RAID is initialized manually.
- A new version of Blender is available and the farm needs to be updated.

While this is going on I'm in the process of selling the apartment that I live in. Things are very much in a state of flux.
ID: 15581 · Rating: 0
Dark Angel
Joined: 13 May 18
Posts: 10
Credit: 238,462
RAC: 3,398
Message 15582 - Posted: 1 Oct 2019, 3:19:01 UTC

I have been getting the same recurring error ever since this was done:

Tue 01 Oct 2019 13:18:22 AEST | BURP | update requested by user
Tue 01 Oct 2019 13:18:27 AEST | BURP | Sending scheduler request: Requested by user.
Tue 01 Oct 2019 13:18:27 AEST | BURP | Requesting new tasks for NVIDIA GPU
Tue 01 Oct 2019 13:18:30 AEST | BURP | Scheduler request completed: got 0 new tasks
Tue 01 Oct 2019 13:18:30 AEST | BURP | Project is temporarily shut down for maintenance
ID: 15582 · Rating: 0
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 15583 - Posted: 27 Nov 2019, 20:44:07 UTC - in response to Message 15581.  

Quick update:
- RAID is back in shape, and the RAID autodetect issue was resolved and tested.
- Java was updated but is still using quite a lot of memory after EXR image operations. Previews of big EXR sessions may have to be disabled for a while until a better solution can be found.
- The database startup issue was resolved and it now starts really quickly. Reboots are no longer multi-hour endeavours.

Also, the SSL certificate for the site was updated today.

Ongoing work:
- Work has begun on support for Blender 2.80, with the interesting addition of a new rendering engine: Eevee.
- Clearing broken sessions from the render queue so it can be restarted.
ID: 15583 · Rating: 0
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 15590 - Posted: 28 May 2020, 13:00:13 UTC

A quick update on the current situation:
- Blender 2.80 is ready and has been deployed to the farm
- Coronavirus took 2020 out of the calendar
- BURP is essentially in maintenance mode until the end of the year
ID: 15590 · Rating: 0