Minutes from the Tue Jan 16 infra call

Guilhem Moulin

Participants
 1. davido
 2. guilhem
 3. Brett
 4. Christian

Agenda
 * Upgrade ancient Gerrit version 2.11.8 to 2.13.9 (used by OpenStack
   and Wikimedia for years now, without any issue)
   - Q: according to my notes 2.11.8 was released on 2016-03-09 and
     2.13.9 on 2017-07-03?  Are there known vulnerabilities in 2.11.x,
     or is it about getting feature fixes and the newest, shiniest
     software?
     . No known vulnerability, but there are a bunch of new features,
       notably the inline edit feature
   - David: dedup scripts should keep working with 2.13.x
   - David: see old redmine ticket Norbert filed about migration
     . do you mean my comments in https://redmine.documentfoundation.org/issues/1587#note-4 ?
     . I meant this comment from Norbert: https://redmine.documentfoundation.org/issues/1587#note-8
   - Cloph: difficult to test everything as OAuth needs proper DNS setup
   - Cloph: can't copy the database to a test VM and grant access to
     everyone as we have private repos
   - Cloph: release-wise, it would be ideal to do that (switching the
     live instance) in March or so (after 6.0.1)
   - Q: Is Norbert coming to FOSDEM? Would be ideal time to brainstorm
     there
   - Roadmap:
     - Set up staging gerrit instance:
     - Synchronize production gerrit content to gerrit-test:
     - Simulate upgrade process:
       . Stop gerrit
       . Perform database and git repository backup
       . Update gerrit version
       . Update all external plugins (gerrit-oauth-provider)
       . Run init command in batch mode, all used internal plugins
         should be updated (double-check)
       . Run reindex command
       . Start gerrit
       . Verify that gerrit still works as expected
         → this is the (very) hard part, as the test instance cannot
           have all features enabled, and of course you can't think of
           every possible user mistake that had to be dealt with.
     - Schedule gerrit upgrade in production
   - AI guilhem/cloph: create redmine ticket to follow progress
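
The "Simulate upgrade process" steps above could be rehearsed on the staging instance with something like the following dry-run sketch. It only builds the command list and executes nothing. The site path, war file, plugin jar and backup directory are hypothetical arguments, not the real TDF layout, and the database backup is left out because it depends on the backend in use.

```python
import shlex


def upgrade_plan(site, new_war, oauth_plugin_jar, backup_dir):
    """Return the upgrade steps as a list of argv lists (dry run only)."""
    gerrit_sh = f"{site}/bin/gerrit.sh"
    return [
        # Stop gerrit
        [gerrit_sh, "stop"],
        # Back up the git repositories (database dump not shown: backend-specific)
        ["rsync", "-a", f"{site}/git/", f"{backup_dir}/git/"],
        # Update external plugins (gerrit-oauth-provider)
        ["cp", oauth_plugin_jar, f"{site}/plugins/"],
        # Run init in batch mode with the new war, which also updates the site
        ["java", "-jar", new_war, "init", "--batch", "-d", site],
        # Rebuild the secondary index
        ["java", "-jar", new_war, "reindex", "-d", site],
        # Start gerrit again
        [gerrit_sh, "start"],
    ]


def render(plan):
    """Render the plan as shell-quoted lines for review before running by hand."""
    return "\n".join(shlex.join(argv) for argv in plan)


if __name__ == "__main__":
    # All paths below are placeholders for illustration.
    print(render(upgrade_plan("/srv/gerrit-test", "/tmp/gerrit-2.13.9.war",
                              "/tmp/gerrit-oauth-provider.jar", "/srv/backup")))
```

Printing the plan rather than running it keeps the "double-check" and "verify" steps manual, which matches the caveat that verification is the hard part.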
 * Gerrit: added `git gc` to a weekly cronjob so crap doesn't accumulate
   and slow down the repo
   - Q: is the frequency suitable?  Also pass --aggressive (cloph: no)
     and/or --prune=<date>?
   - Cloph: slowness might be caused by gerrit keeping FDs open
 * Network issues (hypers, gustl) seem fixed since manitu plugged us
   into a new switch last week (Wed Jan 10)
   - Need to keep an eye on the RX FCS counter (gustl) and the link
     speed (hypers)
 * Saltstack:
   - mail.postfix state is now ready, see mail/README for the config
     options (and the pillar for usage examples: antares, vm150, vm194,
     vm202, etc.)
   - Proposal: more aggressive control for SSH and sudo access:
     . ACL for SSH access already in place (user must be in ssh-login
       Unix group, which is assigned — and possibly trimmed — with the
       ssh salt state)
     . Also limit the authorized_keys(5) list to the keys that are found
       in pillar?  This would avoid, e.g., leaving your key in
       ~root/.ssh/authorized_keys during a migration and forgetting
       about it afterwards → OK
     . Also assign — and possibly trim — the list of sudo group members
       in salt? → OK
         group_map:
           sudo: [ superdev1, superdev2 ]
           adm: other-username
     . Cloph: beware of shared users (eg, tdf@ci); yaml-foo to share ssh
       keys
     . These would provide a clear overview (in pillar) of who has
       access to what; the same could be done using NSS and pam_ldap(5).
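
The two "trim" proposals above boil down to comparing what is on the host with what pillar allows. A minimal sketch of that logic follows. It is not the actual salt state. The function names are made up for illustration, and key matching is simplified to looking for an allowed base64 blob among the whitespace-separated fields of each line.

```python
def trim_authorized_keys(lines, allowed_blobs):
    """Keep blank lines, comments, and entries whose key blob is allowed."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        # Simplification: the base64 blob is one of the whitespace-separated
        # fields, whether or not the entry starts with an options field.
        if any(field in allowed_blobs for field in stripped.split()):
            kept.append(line)
    return kept


def group_changes(desired_members, current_members):
    """Return (to_add, to_remove) so that the group matches the pillar list."""
    desired, current = set(desired_members), set(current_members)
    return desired - current, current - desired


if __name__ == "__main__":
    # "AAAA-old" and "AAAA-new" are dummy blobs; "olduser" is a made-up account.
    on_host = ["ssh-ed25519 AAAA-old forgotten@migration",
               "ssh-ed25519 AAAA-new superdev1@laptop"]
    print(trim_authorized_keys(on_host, {"AAAA-new"}))
    print(group_changes(["superdev1", "superdev2"], ["superdev1", "olduser"]))
```

Shared users (e.g., tdf@ci) would need their allowed set to be the union of several people's keys, which this sketch leaves to the caller.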
 * Backup
   - right now rsnapshot-based (using `ssh -lroot` from berta as rsync's
     remote shell)
     . do we really want to open a root shell to each host from berta?
       → Nope :-)
     . for rsync we could at least add restriction on the ssh-key
       (remount fs read-only, and use `rsync --server --sender …` as
       forcecommand)
   - databases are downloaded in full each time, using pg_dumpall(1) or
     mysqldump(1) and compressed locally
     . large databases clutter disk I/O and network bandwidth (even though
       we're far from saturating the link since the upgrade to the new
       switch, that's wasteful), for instance the bugzilla PostgreSQL
       database is currently 44.9GiB (20.3GiB after gzip compression),
       and takes around 95min to transfer at sustained 5MiB/s transfer
       rate *on the public interface*
       → AI guilhem: add a private interface on br1 to all VMs (brought
         that up before, haven't done it yet)
     . Q: do we know what is the bottleneck? local disk IO? local
       compression (zstd to the rescue)? network (probably not)? berta's
       disk IO (probably not)?
       → cloph: it's single-threaded compression maxing out CPU thread
     . full backup is wasteful, especially with large databases.  Does
       anyone have experience with PostgreSQL continuous archiving
       (PITR)?
       https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
       → AI guilhem: deploy that on some hosts to try out (push base dir
         to berta, + xlogs for incremental backups)
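
For the ssh-key restriction suggested above for rsync, the authorized_keys entry on each backed-up host could be generated along these lines. This is a sketch only: it assumes an OpenSSH version that supports the `restrict` option, the function name is made up, and the rsync arguments stay elided exactly as in the minutes.

```python
def restricted_key_line(public_key, forced_command):
    """Build an authorized_keys entry that only allows one forced command."""
    # Double quotes inside command="..." are escaped with a backslash.
    escaped = forced_command.replace('"', '\\"')
    # "restrict" disables port/agent/X11 forwarding, PTY allocation, etc.
    return f'restrict,command="{escaped}" {public_key}'


if __name__ == "__main__":
    # The key below is a dummy; the command is left elided as in the minutes.
    print(restricted_key_line("ssh-ed25519 AAAA-dummy backup@berta",
                              "rsync --server --sender …"))
```

With such an entry the key can no longer be used to open a root shell from berta; it can only start the one sender command.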
 * Pootle backmigration
   - in discussion with manitu
   - backend domain tdf.pootle.tech doesn't exist anymore (NXDOMAIN),
     added to vm183:/etc/hosts
 * Monitoring status update.
   - Q: Is Icinga the desired monitoring platform over something like
     TICK? telegraf (client); central server with influxdb, chronograf,
     kapacitor
     . chronograf pings home; need to turn that off
     . Q: which transport does it use?  brett: go server.  G. HTTPd?
     . Q: are all components BSD-licensed?  Telegraf, InfluxDB and
       Kapacitor: MIT, Chronograf: AGPL3
     . AI guilhem: setup test VM for influxdb, chronograf, kapacitor
     . Brett: get telegraf running on vm191 as a showcase
     . https://github.com/influxdata/telegraf#input-plugins
 * SSO adoption <https://user.documentfoundation.org>:
   - 500 accounts in total \o/
   - 54 new accounts created since the last infra call; 20 since
     2018-01-01 00:00:00 UTC
   - governance: 3/10 (new+old) board; 1/9 MC; 42/185 (23%) TDF members
     missing
   - contributors: 69/140 (49%) recent (last 90 days) wiki editors
     missing
 * Gluster: healing takes too long, perhaps due to 512B sharding?  or
   large files?
   - Cloph: shouldn't limit ourselves to delta for new VMs, but other
     volumes should be reconfigured to use arbiters, and images
     rebalanced evenly
 * Next call: Tuesday February 20 2018 at 18:30 Berlin time (17:30 UTC)

--
Guilhem.
