Episode 157

ZFS, The “Universal” File-system

00:00:00
/
01:22:42

August 31st, 2016

1 hr 22 mins 42 secs

Your Hosts
Tags

About this Episode

This week on BSDNow, we have an interview with Richard Yao, who will be telling us about the experience and challenges of porting ZFS to Linux. That plus the latest news and feedback is coming your way, on your place

This episode was brought to you by

iXsystems - Enterprise Servers and Storage for Open Source src="/images/2.png" alt="DigitalOcean - Simple Cloud Hosting, Built for Developers" />Tarsnap - Online Backups for the Truly Paranoid />


Headlines

Registration for MeetBSD 2016 is now Open

  • “Beastie’s coming home!” This year, MeetBSD will be held at UC Berkeley’s Clark Kerr Campus
  • November 11th and 12th, preceded by a two day FreeBSD Vendor/Dev Summit (Nov 9th and 10th)

MeetBSD can be traced back to its humble roots as a local workshop for BSD developers and users, hosted annually in Poland since 2004. Since then, MeetBSD’s popularity has spread, and it’s now widely recognized as its own conference with participants from all over the world.

  • The US version runs every two years in California since 2008, and now trades off with the east coast vBSDCon which runs on the odd years.
  • “MeetBSD 2016 uses a mixed unConference format featuring both scheduled talks and community-driven events such as birds-of-a-feather meetings, lightning talks, hackable presentations, stump the chumps, and speed geeking sessions. Speakers are to be determined – stay tuned for more information!”
  • Register before September 30th, and get $30 off
  • Kris and I will be there, along with lots of other FreeBSD Developers, Vendors, and Users.
  • MeetBSD’s unconference style does a very good job of mingling users with developers and is one of my favourite conferences. ***

Dual Booting FreeBSD and Windows UEFI

  • Looking to install FreeBSD alongside Windows 10? What happens if that that system is pre-installed and UEFI? Well you could run TrueOS, but if that isn’t your bag and you want vanilla FreeBSD we have you covered this week!
  • Over on Kevin Bowling’s blog, we have a detailed article showing exactly how to do that.
  • First up, as prep you’ll need to go into the Windows disk manager and shrink your existing NTFS partition.
  • You’ll need to next boot FreeBSD 11 or later.
  • From there the walkthrough takes us through disk partitioning using gpart, and setup of ZFS into a boot-environment friendly layout.
  • Once you get through the typical FreeBSD setup / extraction, the tutorial gives us a nice bonus, showing how to setup “rEFInd” for a graphical boot-menu.
  • A great walkthrough, and hopefully it encourages others to try out dual-booting “EFI-style”. ***

ZFS High-Availability NAS

  • Interested in a DiY HA ZFS NAS? Edmund White (ewwhite on github) has posted a very detailed look at how he has custom-rolled his own Linux + ZFS + HA setup.
  • Most of the concepts are already ones used in various other HA products, but it is interesting and informative to see a public detailed look at how ZFS and HA works.
  • In particular this setup require some very specific hardware, such as dual-port SAS drives, so you will have to pre-plan according.
  • The only bummer is this is a ZFS on Linux setup. Maybe this can serve as the guide / inspiration for somebody in our community to do their own FreeBSD + HA + ZFS setup and blog about it in similar detail. ***

First public release of chyves - version 0.1.0

  • As bhyve continues to mature we are seeing tooling evolve around it. Enter ‘chyves’ which started life as a fork of iohyve.
  • We are looking to do an interview with the author in the near future, but we still want to bring you some of the new features / changes in this evolution of bhyve management.
  • First up, nearly every function from iohyve has either been re-written in part or full.
  • Among the new features, a full logging system (master and per-vm logs), multiple pool configurations, properties stored outside of ZFS (for speed) and self-upgrading. (Will that work with pkg’d version?)
  • In addition to the above features, the website has a large chart showing the original ‘iohyve’ commands, and how that usage has changed moving to chyves.
  • Give it a spin, let the author know of issues! ***

Interview - Richard Yao - ryao@gentoo.org

Sr. Kernel Engineer at ClusterHQ - Major Contributor to ZFS on Linux


News Roundup

ZFS Deadlock: 'Directory of Death'

  • A user reports that when they try to install npm (the Node.js package manager), their system deadlocks
  • It turns out, this was also hitting the FreeBSD package building machines
  • PR 209158
  • The problem was a race condition in the way renames are handled in the FreeBSD VFS vs how ZFS does them internally
  • This bug has existed since the original import of ZFS, but some other change caused it to happen much more frequently
  • “ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Most importantly many things that FreeBSD VFS manages on behalf of all filesystems are implemented in ZPL in a different Way. Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS.”
  • “The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by VFS.”
  • The fixes have been merged to the 10.x and 11.x branches ***

New BSD Magazine out (2016-07)

  • Articles include: Implementing in-memory cache in the BeaST architecture, Docker Cleanup, FreeNAS Getting Started Guide, and starting at the very beginning with open source
  • The August issue is also out
  • This issue features two articles about MINIX 3, continues the FreeNAS getting started guide, Optimizes the in-memory cache for the BeaST architecture, and talks about fixing failed ports for Hardened and LibreBSD
  • We hope to have an interview with the creator of the BeaST architecture in the coming weeks ***

DragonflyBSD and UEFI

  • We’ve featured a few stories and walkthroughs about using UEFI to dual-boot BSD, and now its Dragonfly BSD’s turn.
  • Dave McFarlane writes into the DF mailing lists, telling us about the specific steps taken to get UEFI installed and boot-strapped on his system.
  • If you’ve done a FreeBSD manual UEFI install, the process looks very similar, but you will end up manually running ‘gpt’ to create partitions, installing dist files, and eventually installing boot1.efi into the FAT EFI partition.
  • Dave also ran into an issue with resulted in no /etc/fstab being present, and helpfully includes what his system needed to fully boot hammer properly.
  • Somebody should document this fully for DFLY, since I would expect to become more commonplace as commodity hardware is shipped with UEFI on by default. ***

Netflix and Fill

  • The Netflix team has produced a technical blog post describing how their OpenConnect appliances work
  • First the content is received from the content provider, and the Netflix content team makes it ready for deployment, by transcoding the various bitrates, packaging the subtitles, etc.
  • The finished files are then pushed to Amazon S3 storage
  • “We deploy the majority of our updates proactively during configured fill windows. An important difference between our OpenConnect CDN and other commercial CDNs is the concept of proactive caching. Because we can predict with high accuracy what our members will watch and what time of day they will watch it, we can make use of non-peak bandwidth to download most of the content updates to the OCAs in our network during these configurable time windows. By reducing disk reads (content serving) while we are performing disk writes (adding new content to the OCAs), we are able to optimize our disk efficiency by avoiding read/write contention. The predictability of off-peak traffic patterns helps with this optimization, but we still only have a finite amount of time every day to get our content pre-positioned to where it needs to be before our traffic starts to ramp up and we want to make all of the OCA capacity available for content serving.”
  • The OCA may actually contain more than one copy of the same video, because each disk in the OCA is independent, storing the same video on two different disks will provide twice the available read bandwidth
  • Normally the filesystem cache would obviate the need for this, but the Netflix OCA has so much storage, and not a lot of memory, and the requests from users are offset enough that the cache is useless
  • “OCAs communicate at regular intervals with the control plane services, requesting (among other things) a manifest file that contains the list of titles they should be storing and serving to members. If there is a delta between the list of titles in the manifest and what they are currently storing, each OCA will send a request, during its configured fill window, that includes a list of the new or updated titles that it needs. The response from the control plane in AWS is a ranked list of potential download locations, aka fill sources, for each title.”
  • “It would be inefficient, in terms of both time and cost, to distribute a title directly from S3 to all of our OCAs, so we use a tiered approach. The goal is to ensure that the title is passed from one part of our network to another using the most efficient route possible.”
  • The article then goes on to explain how they calculate the least cost filling source
  • “Now that Netflix operates in 190 countries and we have thousands of appliances embedded within many ISP networks around the world, we are even more obsessed with making sure that our OCAs get the latest content as quickly as possible while continuing to minimize bandwidth cost to our ISP partners.” ***

Beastie Bits:

Feedback/Questions