Ron Toland
  • Notes from Clojure/Conj 2017

    It was a good Conj. Saw several friends and former co-workers of mine, heard some great talks about machine learning, and got some good ideas to take back to my current gig.

    There were some dud talks, too, and others that promised one (awesome) thing and delivered another (boring) thing, but overall it’s inspiring to see how far Clojure has come in just ten years.

    My notes from the conference:

    DAY ONE

    KEYNOTE FROM Rich Hickey: Effective Programs, or: 10 years of Clojure

    • clojure released 10 years ago
    • never thought more than 100 people would use it
    • clojure is opinionated
      • few idioms, strongly supported
      • born out of the pain of experience
    • had been programming for 18 years when he wrote clojure, mostly in c++, java, and common lisp
    • almost every project used a database
    • was struck by two language designers that talked disparagingly of databases, said they’d never used them
    • situated programs
      • run for long time
      • data-driven
      • have memory that usually resides in db
      • have to handle the weirdness of reality (ex: “two-fer tuesday” for radio station scheduling)
      • interact with other systems and humans
      • leverage code written by others (libraries)
    • effective: producing the intended result
      • prefers this over the word “correctness”; none of his programs ever cared about correctness
    • but: what is computing about?
      • making computers effective in the world
      • computers are effective in the same way people are:
        • generate predictions from experience
        • enable good decisions
      • experience => information => facts
    • programming is NOT about itself, or just algorithms
    • programs are dominated by information processing
    • but that’s not all: when we start talking to the database or other libraries, we need different protocols to talk to them
    • but there’s more! everything continues to mutate over time (db changes, requirements change, libraries change, etc)
    • we aspire to write general purpose languages, but will get very different results depending on your target (phone switches, device drivers, etc)
    • clojure written for information-driven situated programs
    • clojure design objectives
      • create programs out of simpler stuff
      • want a low cognitive load for the language
      • a lisp you can use instead of java/c# (his common lisp programs were never allowed to run in production)
    • says classes and algebraic types are terrible for the information programming problem, claims there are no first-class names, and nothing is composable
    • in contrast to java’s multitude of classes and haskell’s multitude of types, clojure says “just use maps”
    • says pattern matching doesn’t scale, flowing type information between programs is a major source of coupling and brittleness
    • positional semantics (arg-lists) don’t scale, eventually you get a function with 17 args, and no one wants to use it
    • sees passing maps as args as a way to attach names to things, thinks it’s superior to positional args or typed args
    • “types are an anti-pattern for program maintenance”
    • using maps means you can deal with them on a need-to-know basis
    • things he left out deliberately:
      • parochialism: data types
      • “rdf got it right”, allows merging data from different sources, without regard for how the schemas differ
      • “more elaborate your type system, more parochial the types”
      • in clojure, namespace-qualified keys allow data merging without worrying about colliding schemas (should use the reverse-domain scheme, same as java, but not all clojure libraries do)
      • another point: when data goes out over the wire, it’s simple: strings, vectors, maps. clojure aims to have you program the same inside as outside
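    A minimal sketch (my illustration, not from the talk) of how namespace-qualified keys let data from different sources merge without colliding:

      ;; two systems describe the same entity with their own qualified keys
      (def billing-user {:com.billing.user/id 42, :com.billing.user/plan :pro})
      (def support-user {:com.support.user/id 42, :com.support.user/tickets 3})

      ;; merging never collides, because each key carries its namespace
      (merge billing-user support-user)
      ;; => {:com.billing.user/id 42, :com.billing.user/plan :pro,
      ;;     :com.support.user/id 42, :com.support.user/tickets 3}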
    • smalltalk and common lisp: both languages that were designed by people for working programmers, and it shows
      • surprisingly, the jvm has a similar sensibility (though java itself doesn’t)
    • also wanted to nail concurrency
      • functional gets you 90% of the way there
    • pulled out the good parts of lisp
    • fixed the bad parts: not everything is a list, packages are bad, cons cell is mutable, lists were kind of functional, but not really
    • edn data model: values only, the heart of clojure, compatible with other languages, too
    • static types: basically disagrees with everything from the Joy of Types talk
    • spec: clojure is dynamic for good reasons, it’s not arbitrary, but if you want checking, it should be your choice, both to use it at all and where to use it
    Learning Clojure and Clojurescript by Playing a Game
    • inspired by the gin rummy card game example in dr scheme for the scheme programming language
    • found the java.awt.Robot class, which can take screenshots and move the mouse, click things
    • decided to combine the two, build a robot that could play gin rummy
    • robot takes a screenshot, finds the cards, their edges, and which ones they are, then plays the game
    • lessons learned:
      • java interop was great
    • when clojurescript came out, decided to rebuild it, but in the browser
    • robot still functions independently, but now takes screenshot of the browser-based game
    • built a third version with datomic as a db to store state, allowing two clients to play against each other
    • absolutely loves the “time travel” aspects of datomic
    • also loves pedestal
    Bayesian Data Analysis in Clojure
    • using clojure for about two years
    • developed toolkit for doing bayesian statistics in clojure
    • why clojure?
      • not as many existing tools as julia or R
      • but: easier to develop new libraries than in julia (and most certainly R)
      • other stats languages like matlab and R don’t require much programming knowledge to get started, but harder to dev new tools in them
    • michaellindon/distributions
      • open-source clojure lib for generating and working with probability distributions in clojure
      • can also provide data and prior to get posterior distribution
      • and do posterior-predictive distributions
    • wrote a way to generate random functions over a set of points (for trying to match noisy non-linear data)
    • was easy in clojure, because of lazy evaluation (can treat the function as defined over an infinite vector, and only pull out the values we need, without blowing up)
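    A rough illustration of the lazy-evaluation point (mine, not the speaker's code): define samples over an effectively infinite domain and realize only the values you need:

      ;; an infinite lazy sequence of [x (f x)] samples
      (defn sampled-fn [f step]
        (map (fn [i] (let [x (* i step)] [x (f x)]))
             (range)))                      ; (range) is infinite and lazy

      ;; only the first five samples are ever computed
      (take 5 (sampled-fn #(Math/sin %) 0.1))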
    • …insert lots of math that i couldn’t follow…
    Building Machine Learning Models with Clojure and Cortex
    • came from a python background for machine learning
    • thinks there’s a good intersection between functional programming and machine learning
    • will show how to build a baby classification model in clojure
    • expert systems: dominant theory for AI through 2010s
      • limitations: sometimes we don’t know the rules, and sometimes we don’t know how to teach a computer the rules (even if we can articulate them)
    • can think of the goal of machine learning as being to learn the function F that, when applied to a set of inputs, will produce the correct outputs (labels) for the data
    • power of neural nets: assume that they can make accurate approximations for a function of any dimensionality (using simpler pieces)
    • goal of neural nets is to learn the right coefficients or weights for each of the factors that affect the final label
    • deep learning: a “fat” neural net…basically a series of layers of perceptrons between the input and output
    • why clojure? we already have a lot of good tools in other languages for doing machine learning: tensorflow, caffe, theano, torch, deeplearning4j
    • functional composition: maps well to neural nets and perceptrons
    • also maps well to basically any ML pipeline: data loading, feature extraction, data shuffling, sampling, recursive feedback loop for building the model, etc
    • clojure really strong for data processing, which is a large part of each step of the ML pipeline
      • ex: lazy sequences really help when processing batches of data multiple times
      • can also do everything we need with just the built-in data structures
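    As a sketch of that point (not code from the talk), the usual pre-processing steps compose as plain sequence operations over built-in data structures; the :amount and :label keys are hypothetical:

      ;; load, clean, shuffle, and batch training rows
      (defn preprocess [rows]
        (->> rows
             (map #(update % :amount double))   ; feature coercion
             (filter :label)                    ; drop unlabeled rows
             shuffle                            ; data shuffling
             (partition-all 128)))              ; batches of 128

      ;; batches are consumed one at a time by the training loop
      (first (preprocess [{:amount 12 :label 1} {:amount 7 :label 1}]))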
    • cortex: meant to be the theano of clojure
      • basically: import everything from it, let it do the heavy lifting
    • backend: compute lib executes on both cpu and gpu
    • implements as much of neural nets as possible in pure clojure
    • meant to be highly transparent and highly customizable
    • cortex represents neural nets as DAG, just like tensorflow
      • nodes, edges, buffers, streams
    • basically, a map of maps
      • can go in at any time and see exactly what all the parameters are, for everything
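    A toy version of the “map of maps” idea (the shape here is invented for illustration; it is not cortex's actual graph format):

      (def net
        {:nodes {:input  {:type :input :size 784}
                 :hidden {:type :dense :size 128 :activation :relu}
                 :output {:type :dense :size 10  :activation :softmax}}
         :edges [[:input :hidden] [:hidden :output]]})

      ;; because the network is just data, any parameter is directly inspectable
      (get-in net [:nodes :hidden :activation])   ;; => :relu
      (assoc-in net [:nodes :hidden :size] 256)   ;; tweak without special APIs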
    • basic steps to building model:
      • load and process data (most time consuming step until you get to tuning the model)
      • define minimal description of the model
      • build full network from that description and train it on the model
    • for example: chose a credit card fraud dataset
    Clojure: Scaling the Event Stream
    • director, programmer of his own company
    • recommends ccm-clj for cassandra-clojure interaction
    • expertise: high-availability streaming systems (for smallish clients)
    • systems he builds deal with “inconvenient” sized data for non-apple-sized clients
    • has own company: troy west, founded three years ago
    • one client: processes billions of emails, logs 10–100 events per email, multiple systems log in different formats, 5K–50K event/s
    • 10–100 TB of data
    • originally, everything logged on disk for analysis after the fact
    • requirements: convert events into meaning, support ad-hoc querying, generate reports, do real-time analysis and alerting, and do it all without falling over at scale or losing uptime
    • early observations:
      • each stream is a seq of immutable facts
      • want to index the stream
      • want to keep the original events
      • want idempotent writes
      • just transforming data
    • originally reached for java, since that’s the language he’s used to using
    • data
      • in-flight: kafka
      • compute over the data: storm (very powerful, might move in the direction of onyx later on)
      • at-rest: cassandra (drives more business to his company than anything else)
    • kafka: partitioning really powerful tool for converting one large problem into many smaller problems
    • storm: makes it easy to spin up more workers to process individual pieces of your computation
    • cassandra: their source of truth
    • query planner, query optimizer: services written in clojure, instead of throwing elasticsearch at the problem
    • recommends: Designing Data-Intensive Applications, by Martin Kleppmann
    • thinks these applications are clojure’s killer app
    • core.async gave them fine-grained control of parallelism
    • recommends using pipeline-async as add-on tool
    • composable channels are really powerful: let you set off several parallel ops at once and, as they return, have another process combine their results and produce another channel (see the sketch below)
    • but: go easy on the hot sauce, gets very tempting to put it everywhere
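    A minimal core.async sketch of that pattern (illustrative, not their production code): pipeline-async fans work out to an async step with bounded parallelism and funnels ordered results onto an output channel. The enrichment step here is made up:

      (require '[clojure.core.async :as a])

      (defn enrich-event!
        "Pretend async call to a downstream service; must close the result channel."
        [event out]
        (a/go
          (a/>! out (assoc event :enriched? true))
          (a/close! out)))

      (let [in  (a/chan 100)
            out (a/chan 100)]
        (a/pipeline-async 8 out enrich-event! in)   ; up to 8 calls in flight
        (a/onto-chan in [{:id 1} {:id 2} {:id 3}])  ; feed and close the input
        (a/<!! (a/into [] out)))                    ; collect combined results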
    • instaparse lib critical to handling verification of email addresses
    • REPL DEMO
    • some numbers: 0 times static types would have saved the day, 27 repos, 2 team members
    DAY TWO

    The Power of Lacinia and Hystrix in Production

    • few questions:
      • anyone tried to combine lacinia and hystrix?
      • anyone played with lacinia?
      • anyone used graphql?
      • anyone used hystrix?
    • hystrix : circuit-breaker implementation
    • lacinia: walmart-labs’ graphql
    • why both?
    • simple example: ecommerce site, aldo shoes, came to his company wanting to renovate the whole website
    • likes starting his implementations by designing the model/schema
    • in this case, products have categories, and categories have parent/child categories, etc
    • uses graphviz to write up his model designs
    • initial diagram renders it all into a clojure map
    • they have a tool called umlaut that lets them write a schema once, in a single language, then generate (via instaparse) representations of it in graphql, clojure schema, etc
    • lacinia resolver: takes a graphql query and returns json result
    • lacinia ships with a react application called GraphiQL that lets you explore your db through the browser (via live queries, etc)
    • gives a lot of power to the front-end when you do this, lets them change their queries on the fly, without having to redo anything on the backend
    • problem: the images are huge, 3200x3200 px
    • need something smaller to send to users
    • add a new param to the schema: image-obj, holds width and height of the image
    • leave the old image attribute in place, so don’t break old queries
    • can then write new queries on the front-end for the new attribute, fetch only the size of image that you want
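    Roughly what that additive change looks like as lacinia-style schema EDN (the field and resolver names here are guesses for illustration, not the talk's actual schema):

      {:objects
       {:Product
        {:fields
         {:name      {:type 'String}
          :image     {:type 'String}        ; old field stays, old queries keep working
          :image_obj {:type :ImageObj       ; new field: fetch only the size you need
                      :args {:width {:type 'Int} :height {:type 'Int}}
                      :resolve :resolve-image-obj}}}
        :ImageObj
        {:fields {:url    {:type 'String}
                  :width  {:type 'Int}
                  :height {:type 'Int}}}}}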
    • one thing he’s learned from marathon running (and stolen from the navy seals): “embrace the suck.” translation: the situation is bad. deal with it.
    • his suck: ran into problem where front-end engineers were sending queries that timed out against the back-end
    • root cause: front-end queries hitting backend that used 3rd-party services that took too long and broke
    • wrote a tiny latency simulator: added random extra time to round-trip against db
    • even with 100ms max, latency diagram showed ~6% of the requests (top-to-bottom) took over 500ms to finish
    • now tweak it a bit more: have two dependencies, and one of them has a severe slowdown
    • now latency could go up to MINUTES
    • initial response: reach for bumping the timeouts
    • time for hystrix: introduce a circuit breaker into the system, to protect the system as a whole when an individual piece goes down
    • hystrix has an official clojure wrapper (!)
    • provides a macro: defcommand, wrap it around functions that will call out to dependencies
    • if it detects a long timeout, in the future, it will fail immediately, rather than waiting
    • as part of the macro, can also specify a fallback-fn, to be called when the circuit breaker is tripped
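    A hedged sketch of what that can look like with the hystrix-clj wrapper (assuming its defcommand macro and :hystrix/fallback-fn metadata; the wrapped service call is hypothetical):

      (require '[com.netflix.hystrix.core :refer [defcommand]])

      (defn product-fallback
        "Used when the call times out or the circuit breaker is open."
        [product-id]
        {:id product-id :name "Unavailable" :cached? true})

      (defcommand fetch-product
        {:hystrix/fallback-fn product-fallback}
        [product-id]
        (call-slow-third-party-service product-id))   ; hypothetical dependency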
    • adding that in, the latency diagram is completely different. performance stays fast under much higher load
    • fallback strategies:
      • fail fast
      • fail silently
      • send static content
      • use cached content
      • use stubbed content: infer the proper response, and send it back
      • chained fallbacks: a little more advanced, like connecting multiple circuit breakers in a row, in case one fails, the other can take over
    • hystrix dashboard: displays info on every defcommand you’ve got, tracks health, etc
    • seven takeaways
      • MUST embrace change in prod
      • MUST embrace failure: things are going to break, you might as well prepare for it
      • graphql is just part of the equation, if your resolvers get too complex, can introduce hystrix and push the complexity into other service
      • monitor at the function level (via hystrix dashboard)
      • adopt a consumer-driven mindset: the users have the money, don’t leave their money on the table by giving them a bad experience
      • force yourself to think about fallbacks
      • start thinking about the whole product: production issues LOOK to users like production features
    • question: do circuit-breakers introduce latency?
      • answer: a little bit at the upper end, once it’s been tripped
    The Tensors Must Flow
    • works at magento, lives in philly
    • really wants to be sure our future robot masters are written in clojure, not python
    • guildsman: tensorflow library for clojure
    • tensorflow: ML lib from google, recently got a c api so other languages can call into it
    • spoiler alert: don’t get TOO excited. everything’s still a bit of a mess
    • but it DOES work, promise
    • note on architecture: the python client (from google) has access to a “cheater” api that isn’t part of the open c api. thus, there’s some things it can do that guildsman can’t because the api isn’t there
    • also: ye gods, there’s a lot of python in the python client. harder to port everything over to guildsman than he thought
    • very recently, tensorflow started shipping with a java layer built on top of a c++ lib (via jni), which itself sits on top of the tensorflow c api, some people have started building on top of that
    • but not guildsman: it sits directly on the c api
    • in guildsman: put together a plan, then build it, and execute it
    • functions like guildsman/add produce plan maps, instead of executing things themselves
    • simple example: adding two numbers: just one line in guildsman
    • another simple example: have machine learn to solve | x - 2.0 | by finding the value of x that minimizes it
    • tensorflow gives you the tools to find minima/maxima: gradient descent, etc
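    What that toy problem amounts to, written as plain Clojure rather than the guildsman API (so this is just the idea of gradient descent, not the library): nudge x downhill until the loss | x - 2.0 | bottoms out.

      (defn loss [x] (Math/abs (- x 2.0)))

      ;; the "gradient" of |x - 2.0| is just the sign of (x - 2.0)
      (defn step [x] (- x (* 0.1 (Math/signum (- x 2.0)))))

      (first (drop 100 (iterate step -3.0)))   ;; => ~2.0, the minimizer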
    • gradient gap: guildsman can use either the clojure gradients, or the c++ ones, but not both at once
      • needs help to port the c++ ones over to clojure (please!)
    • “python occupies the 9th circle of dependency hell”: been using python lightly for years, and still has problems getting dependencies resolved (took a left turn at the virtual environment, started looking for my oculus rift)
    • demo: using mnist dataset, try to learn to recognize handwritten characters
    The Dawn of Lisp: How to Write Eval and Apply in Clojure
    • educator, started using scheme in 1994, picked up clojure in 2009
    • origins of lisp: john mccarthy’s paper: recursive functions of symbolic expressions and their computation by machine, part i
    • implementation of the ideas of alonzo church, from his book “the calculi of lambda-conversion”
    • “you can always tell the lisp programmers, they have pockets full of punch cards with lots of closing parentheses on them”
    • steve russell (one of the creators of spacewar) was the first to actually implement the description from mccarthy’s paper
    • 1962: lisp 1.5 programmer’s manual, included section on how to define lisp in terms of itself (section 1.6: a universal lisp function)
    • alan kay described this definition (of lisp in terms of lisp) as the maxwell equations of software
    • how eval and apply work in clojure:
      • eval: send it a quoted list (data structure, which is also lisp), eval will produce the result from evaluating that list
        • ex: (eval '(+ 2 2)) => 4
      • apply: takes a function and a quoted list, applies that function to the list, then returns the result
        • ex: (apply + '(2 2)) => 4
    • rules for converting the lisp 1.5 spec to clojure
      • convert all m-expression to s-expressions
      • keep the definitions as close to original as possible
      • drop the use of dotted pairs
      • give all global identifiers a ‘$’ prefix (not really the way clojure says it should be used, but helps the conversion)
      • add whitespace for legibility
    • m-expressions vs s-expressions:
      • F[1;2;3] becomes (F 1 2 3)
      • [X < 0 -> -X; T -> X] becomes (COND ((< X 0) (- X)) (T X))
    • dotted pairs
      • basically (CONS (QUOTE A) (QUOTE B))
    • definitions: $T -> true, $F -> false, $NIL, $cons, $atom, $eq, $null, $car, $cdr, $caar, $cdar, $caddr, $cadar
      • note: anything that cannot be divided is an atom, no relation to clojure atoms
      • last few: various combos of car and cdr for convenience
    • elaborate definitions:
      • $cond: own version of cond to keep syntax close to the original
      • $pairlis: accepts three args: two lists, and a list of existing pairs, combines the first two lists pairwise, and combines with the existing paired list
      • $assoc: lets you pull key-value pair out of an association list (list of paired lists)
      • $evcon: takes list of paired conditions and expressions, plus a context, will return the result of the expression for the first condition that evaluates to true
      • $evlist: takes list of expressions, with a condition, and a context, and then evaluates the result of the condition + the expression in a single list
      • $apply
      • $eval
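    A hedged approximation of two of those definitions in modern Clojure (my reconstruction of the Lisp 1.5 behavior, not the speaker's exact code, using two-element lists since dotted pairs were dropped):

      ;; $pairlis: zip two lists into pairs and prepend them to an existing a-list
      (defn $pairlis [xs ys a-list]
        (concat (map list xs ys) a-list))

      ;; $assoc: find the pair in the a-list whose first element matches the key
      (defn $assoc [k a-list]
        (first (filter #(= k (first %)) a-list)))

      ($pairlis '(A B) '(1 2) '((C 3)))   ;; => ((A 1) (B 2) (C 3))
      ($assoc 'B '((A 1) (B 2) (C 3)))    ;; => (B 2)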
    • live code demo
    INVITED TALK FROM GUY STEELE: It’s time for a New Old Language
    • “the most popular programming language in computer science”
    • no compiler, but lots of cross-translations
    • would say the name of the language, but doesn’t seem to have one
    • so: CSM (computer science metanotation)
    • has built-in datatypes, expressions, etc
    • it’s beautiful, but it’s getting messed up!
    • walk-throughs of examples, how to read it (drawn from recent ACM papers)
    • “isn’t it odd that language theorists wanting to talk about types, do it in an untyped language?”
    • wrote toy compiler to turn latex expressions of CSM from emacs buffer into prolog code, proved it can run (to do type checking)
    • inference rules: Gentzen Notation (notation for “natural deduction”)
    • BNF: can trace it all the way back to 4th century BCE, with Panini’s sanskrit grammar
    • regexes: took thirty years to settle on a notation (51–81), but basically hasn’t changed since 1981!
    • final form of BNF: not set down till 1996, though based on a paper from 1977
    • but even then, variants persist and continue to be used (especially in books)
    • variants haven’t been a problem, because the common pieces are easy enough to identify and read
    • modern BNF in current papers is very similar to classic BNF, but with 2 changes to make it more concise:
      • use single letters instead of meaningful phrases
      • use bar to indicate repetition instead of … or *
    • substitution notation: started with Church, has evolved and diversified over time
    • current favorite: e[v/x] to represent “replace x with v in e”
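    For the Clojure-minded, e[v/x] is just tree substitution; a small illustration of mine using clojure.walk:

      (require '[clojure.walk :as walk])

      ;; e[v/x]: replace every occurrence of the symbol x with v inside e
      (defn substitute [e v x]
        (walk/postwalk-replace {x v} e))

      (substitute '(+ x (* x y)) 5 'x)   ;; => (+ 5 (* 5 y))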
    • number in live use has continued to increase over time, instead of variants rising and falling (!)
    • bigger problem: some substitution variants are also being used to mean function/map update, which is a completely different thing
    • theory: these changes are being driven by the page limits for computer science journals (all papers must fit within the page limit, usually around 10 pages)
    • overline notation (dots and parentheses, used for grouping): can go back to 1484, when chuquet used underline to indicate grouping
      • 1702: leibnitz switched from overlines to parentheses for grouping, to help typesetters publishing his books
    • three notations duking it out for 300 years!
    • vectors: notation -> goes back to 1813, and jean-robert argand (for graphing complex numbers)
    • nested overline notation leads to confusion: how do we know how to expand the expressions that are nested?
    • one solution: use an escape from the defaults, when needed, like backtick and tilde notation in clojure
    • conclusions:
      • CSM is a completely valid language
      • should be a subject of study
      • has issues, but those can be fixed
      • would like to see a formal theory of the language, along with tooling for developing in it, checking it, etc
      • thinks there are opportunities for expressing parallelism in it
    Day Three

    Declarative Deep Learning in Clojure

    • starts with explanation of human cognition and memory
    • at-your-desk memory vs in-the-hammock memory
    • limitation of neural networks: once trained for a task, it can’t be retrained to another without losing the first
      • if you train a NN to recognize cats in photos, you can’t then ask it to analyze a time series
    • ART architecture: uses two layers, F1 and F2, the first to handle data that has been seen before, the second to “learn” on data that hasn’t been encountered before
    • LSTM-cell processing:
      • what should we forget?
      • what’s new that we care about?
      • what part of our updated state should we pass on?
    • dealing with the builder pattern in java: more declarative than sending a set of ordered args to a constructor
    • his lib allows keyword args to be passed in to the builder function, don’t have to worry about ordering or anything
    • by default, all functions produce a data structure that evaluates to a dl4j object
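    A rough sketch of that keyword-argument style (the layer keys and shape are invented for illustration, not the library's actual API):

      ;; keyword args instead of a positional constructor or a chain of setters
      (defn dense-layer [& {:keys [n-in n-out activation]
                            :or   {activation :relu}}]
        {:layer-type :dense
         :n-in       n-in
         :n-out      n-out
         :activation activation})

      ;; argument order no longer matters
      (dense-layer :n-out 10 :n-in 128)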
    • live demos (but using pre-trained models, no live training, just evaluation)
    • what’s left?
      • graphs
      • front end
      • kafka support
      • reinforcement learning
    Learning Clojure Through Logo
    • disclaimer: personal views, not views of employers
    • logo: language to control a turtle, with a pen that can be up (no lines) or down (draws lines as it moves)
    • …technical difficulties, please stand by…
    • live demo of clojure/script version in the browser
    • turns out logo is a lisp (!): function call is always in first position, give it all args, etc
    • even scratch is basically lisp-like
    • irony: we’re using lisp to teach kids how to program, but then they go off to work in the world of curly braces and semicolons
    • clojure-turtle lib: open-source implementation of the logo commands in clojure
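    The classic first exercise looks roughly like this with clojure-turtle (assuming its forward and right commands; the square-drawing loop is mine):

      (require '[clojure-turtle.core :refer [forward right]])

      ;; draw a square: four sides, turning 90 degrees after each
      (dotimes [_ 4]
        (forward 60)
        (right 90))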
    • more live demos
    • recommends reading seymour papert’s book: “Mindstorms: Children, Computers, and Powerful Ideas”
    • think clojure (with the power of clojurescript) is the best learning language
    • have a tutorial that introduces the turtle, logo syntax, moving the turtle, etc
    • slowly introduces more and more clojure-like syntax, function definitions, etc
    • fairly powerful environment: can add own buttons for repeatable steps, can add animations, etc
    • everything’s in the browser, so no tools to download, nothing to mess with
    • “explaining too early can hurt”: want to start with as few primitives as possible, make the intro slow
    • can create your own lessons in markdown files, can just append the url to the markdown file and it’ll load (!)
    • prefer that you send in the lessons to them, so they can put them in the lessons index for everyone to benefit
    • have even translated the commands over to multiple languages, so you don’t have to learn english before learning programming (!)
    • lib: cban, which has translations of clojure core, can be used to offer translations of your lib code into something other than english
    • clojurescript repls: Klipse (replaces all clojure code in your page with an interactive repl)
    • comments/suggestions/contributions welcome
    → 5:00 AM, Oct 17
  • Follow the Tweeting Bot

    I have a problem.

    No, not my fondness for singing 80s country in a bad twang during karaoke.

    I mean a real, nerd-world problem: I have too many books to read.

    I can’t leave any bookstore without buying at least one. For a good bookstore, I’ll walk out with half a dozen or more, balancing them in my arms, hoping none of them fall over.

    I get them home and try to squeeze them into my bookshelf of “books I have yet to read” (not to be confused with my “books I’ve read and need to donate” or “books I’ve read and will re-read someday when I have the time” shelves). That shelf is full, floor to ceiling.

    My list of books to read is already too long for me to remember them all. And that’s not counting the ones I have sitting in ebook format, waiting on my Kobo or iPhone for me to tap their cover art and dive in.

    Faced with so much reading material, so many good books waiting to be read, my question is this: What do I read next?

    I could pick based on mood. But that usually means me sitting in front of my physical books, picking out the one that grabs me. I could pick based on which ones I’ve bought most recently, which would probably narrow things down to just my ebooks.

    But I want to be able to choose from all of my books, physical and virtual, at any time.

    So I wrote a bot to help me.

    It listens to my twitter stream for instructions. When I give it the right command, it pulls down my to-read shelf from Goodreads (yes, I put all of my books, real and electronic, into Goodreads. yes, it took much longer than I thought it would), ranks them in order of which ones I should read first, and then tweets back to me the top 3.

    I’ve been following its recommendations for about a month now, and so far, it’s working. Footsteps in the Sky was great. Data and Goliath was eye-opening. The Aesthetic of Play changed the way I view art and games.

    Now, if only I could train it to order books for me automatically…

    → 6:00 AM, Apr 27
  • Notes from Strange Loop 2015: Day One

    Unconventional Programming with Chemical Computing

    • Carin Meier
    • Living Clojure
    • @Cognitect
    • Inspired by Book - unconventional programming paradigms
    • "the grass is computing"
      • all living things process information via chemical reactions on molecular level
      •  hormones
      • immune system
      • bacteria signal processing
    • will NOT be programming with chemicals; instead, using the metaphor of molecules and reactions to do computing
      • nothing currently in the wild using chemical computing
    • at the heart of chemical programming: the reaction
    • will calculate primes two ways:
      • traditional
      • with prime reaction
    • uses clojure for the examples
    • prime reaction
      • think of the integers as molecules
      • simple rule: take a vector of 2 integers, divide them, if the mod is zero, return the result of the division, otherwise, return the vector unchanged
      • name of this procedure: gamma chemical programming
      • reaction is a condition + action
      • execute: replacement of original elements by resulting element
      • solution is known when it results in a steady state (hence, for prime reaction, have to churn over lists of integers multiple times to filter out all the non-primes)
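    A sketch of that reaction rule in Clojure (my reconstruction of the idea, not the speaker's exact code):

      ;; the reaction: condition (evenly divisible) + action (replace with the quotient)
      (defn prime-reaction [[a b]]
        (if (and (> a b) (zero? (mod a b)))
          [(/ a b) b]     ; a "reacts away" into a smaller molecule
          [a b]))         ; otherwise the pair is unchanged

      (prime-reaction [9 3])   ;; => [3 3]
      (prime-reaction [7 3])   ;; => [7 3]
      ;; keep reacting random pairs until the mixture stops changing: primes remain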
    • possible advantages:
      • modeling probabilistic systems
      • drive a computation towards a global max or min
    • higher order
      • make the functions molecules as well
      • fn could "capture" integer molecules to use as args
      • what does it do?
      • it "hatches" => yields original fn and result of applying fn to the captured arguments
      • reducing reaction fn: return fewer arguments than is taken in
      • two fns interacting: allow to exchange captured values (leads to more "stirring" in the chem sims)
    • no real need for sequential processing; can do things in any order and still get the "right" answer
    • dining philosophers problem
      • something chemical programming handles well
      • two forks: eating philosopher
      • one fork or no forks: thinking philosopher
      • TP with 2fs reacting with EAT => EP
    • "self organizing": simple behaviors combine to create what look like complex behaviors
    • mail system: messages, servers, networks, mailboxes, membranes
      • membranes control reactions, keep molecules sorted
      • passage through membranes controlled by servers and network
      • "self organizing"

    How Machine Learning helps Cancer Research

    • evelina gabasova
    • university of cambridge
    • cost per human genome has gone down from $100mil (2001) to a few thousand dollars (methodology change in mid-2000s paid big dividends)
    • cancer is not a single disease; underlying cause is mutations in the genetic code that regulates protein formation inside the cell
    • brca1 and brca2 are guardians; they check the chromosomes for mistakes and kill cells that have them, so suppress tumor growth; when they stop working correctly or get mutated, you can have tumors
    • clustering: finding groups in data that are more similar to each other than to other data points
      • example: clustering customers
      • but: clustering might vary based on the attributes chosen (or the way those attributes are lumped together)?
      • yes: but choose projection based on which ones give the most variance between data points
      • can use in cancer research by plotting genes and their expression and looking for grouping
    • want to be able to craft more targeted responses to the diagnosis of cancer based on the patient and how they will react
    • collaborative filtering
      • used in netflix recommendation engine
      • filling in cells in a matrix
      • compute as the product of two smaller matrices
      • in cancer research, can help because the number of people with certain mutations is small, leading to a sparsely populated database
    • theorem proving
      • basically prolog-style programming, constraints plus relations leading to single (or multiple) solutions
      • can use to model cancer systems
      • was used to show that chronic myeloid leukemia is a very stable system, that just knocking out one part will not be enough to kill the bad cell and slow the disease; helps with drug and treatment design
      • data taken from academic papers reporting the results of different treatments on different populations
    • machine learning not just for targeted ads or algorithmic trading
    • will become more important in the future as more and more data becomes available
    • Q: how long does the calculation take for stabilization sims?
      • A: for very simple systems, can take milliseconds
    • Q: how much discovery is involved, to find the data?
      • A: actually, whole teams developing text mining techniques for extracting data from academic papers (!)

    When Worst is Best

    • Peter Bailis
    • what if we designed computer systems for the worst-case scenarios?
    • a website built to serve 7.3 billion simultaneous users would, on average, have lots of idle resources
    • hardware: what if we built this chip for the mars rover? would lead to very expensive packaging (and a lot of R&D to handle low-power low-weight environments)
    • security: all our devs are malicious; makes code deployment harder
    • designing for the worst case often penalizes the average case
    • could we break the curve? design for the worst case and improve the average case too
    • distributed systems
      • almost everything non-trivial is distributed these days
      • operate over a network
      • networks make designs hard
        • packets can be delayed
        • packets may be dropped
      • async network: can't tell if message has been delayed or dropped
        • handle this by adding replicas that can respond to any request at any time
        • network interruptions don't stop service
    • no coordination means even when everything is fine, we don't have to talk
      • possible infinite service scale-out
    • coordinated multi-server transactions pay large penalty as we add more servers (from locks); get more throughput if we let access be uncoordinated
    • don't care about latency if you don't have to send messages everywhere
    • but what about the CAP theorem?
      • inktomi from eric brewer: for large scale services, have to trade off between always giving an answer and always giving the right answer
      • takeaway: certain properties of a system (like serializability) require unavailability
      • original proof paper: seth gilbert and nancy lynch
      • common conclusion: availability is too expensive, and we have to give up too much, and it only matters during failures, so forget about it
    • if you use worst case as design tool, you skew toward coordination-avoiding databases
      • high coordination is legacy of old db design
      • coordination-free designs are possible
    • example: read committed isolation
      • goal: never read uncommitted data
      • legacy implementation: lock records during access (coordination)
      • one way: copy on write (x -> x', do stuff -> write back to x)
      • or: versioning
      • for more detail, see martin's talk on saturday about transactions
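    A toy copy-on-write illustration (mine, not from the talk): writers build a new version off to the side and install it atomically, so readers never observe uncommitted work:

      (def db (atom {:x 1}))   ; readers deref; they only ever see committed values

      (defn commit! [f]
        ;; f runs against a snapshot copy; swap! installs the result atomically
        (swap! db f))

      (commit! (fn [snapshot] (assoc snapshot :x (inc (:x snapshot)))))
      @db   ;; => {:x 2}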
    • research on coordination-free systems have potential for huge speedups
    • other situations where worst-case thinking yields good results
      • replication for fault tolerance can also increase your request-serving capacity
      • fail-over can help deployments/upgrades: if it's automatic, you can shut off the primary whenever you want and know that the backups will take over, then bring the primary back up when your work is done
      • tail latency in services:
        • avg of 1.2ms (not bad) can mean 0.1% of requests have 100ms (which is terrible)
        • if you're one of many services being used to fulfill a front-end request, your worst case is more likely to happen, and so drag down the avg latency for the end-user
    • universal design: designing well for everyone; ex: curb cuts, subtitles on netflix
    • sometimes best is brittle: global maximum can sit on top of a very narrow peak, where any little change in the inputs can drive it away from the optimum
    • defining normal defines our designs; considering a different edge case as normal can open up new design spaces
    • hardware: what happens if we have bit flips?
    • clusters: what's our scale-out strategy?
    • security: how do we audit data access?
    • examine your biases

    All In with Determinism for Performance and Testing in Distributed Systems

    • John Hugg
    • VoltDB
    • so you need a replicated setup?
      • could run primary and secondary
      • could allow writes to 2 servers, do conflict detection, and merge all writes
      • NOPE
    • active-active: state a + deterministic op = state b
      • if do same ops across all servers, should end up with the same state
      • have client that sends A B C to coordination system, which then sends ABC to all replicas, which do the ops in order
      • ABC: a logical log, the ordering is what's important
      • can write log to disk, for later replay
      • can replicate log to all servers, for constant active-active updates
      • can also send log across network for cluster replication
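    The core idea as a tiny sketch (mine, not VoltDB code): if every replica folds the same logical log of deterministic ops over the same starting state, all replicas end in the same state:

      ;; a deterministic op is just data: what to do, and with which arguments
      (defn apply-op [state {:keys [op k v]}]
        (case op
          :put (assoc state k v)
          :inc (update state k (fnil + 0) v)))

      (def logical-log [{:op :put :k :balance :v 100}
                        {:op :inc :k :balance :v 25}])

      ;; replica A and replica B replay the same log and agree exactly
      (= (reduce apply-op {} logical-log)
         (reduce apply-op {} logical-log))   ;; => true; both are {:balance 125}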
    • look out for non-determinism
      • random numbers
      • wall-clock time
      • record order
      • external systems (ping noaa for weather)
      • bad memory
      • libraries that use randomness for security
    • how to protect from non-determinism?
      • make sure sql is as deterministic as possible
      • 100% of their DML is deterministic
      • rw transactions are hard to make deterministic, have to do a little more planning (swap row-scan for tree-index scan)
      • use seeded random-number generators that are lists created in advance
      • hash up the write ops, and require replicas to send back their computed hashes once the ops are done so the coordinator can confirm the ops were deterministic
      • can also hash the whole replica state when doing a transactional snapshot
      • reduce latency by sending condensed representation of ops instead of all the steps (the recipe name, not the recipe)
    • why do it?
      • replicate faster, reduces concerns for latency
      • persist everything faster: start logging when the work is requested, not when the work is completed
      • bounded sizes: the work comes in as fast as the network allows, so the log will only be written no faster than the network (no firehose)
    • trade-offs?
      • it's more work: testing, enforcing determinism
      • running mixed versions is scary: if you fix a bug, and you're running different versions of the software between the replicas, you no longer have deterministic transactions
      • if you trip the safety checks, we shut down the cluster
    • testing?
      • multi-pronged approach: acid, sql correctness, etc
      • simulation a la foundationDB not as useful for them, since they have more states
      • message/state-machine fuzzing
      • unit tests
      • smoke tests
      • self-checking workload (best value)
        • everything written gets self-checked; so to check a read value, write it back out and see if it comes back unchanged
      • use "nefarious app": application that runs a lot of nasty transactions, checks for ACID failures
      • nasty transactions:
        • read values, hash them, write them back
        • add huge blobs to rows to slow down processing
        • add mayhem threads that run ad-hoc sql doing updates
        • multi-table joins
          • read and write multiple values
        • do it all many many times within the same transaction
      • mix up all different kinds of environment tweaks
      • different jvms
      • different VM hosts
      • different OSes
      • inject latency, disk faults, etc
    • client knows last sent and last acknowledged transaction, checker can be sure recovered data (shut down and restart) contains all the acknowledged transactions

    Scaling Stateful Services

    • Caitie MacCaffrey
    • been using stateless services for a long time, depending on db to store and coordinate our state
    • has worked for a long time, but got to place where one db wasn't enough, so we went to no-sql and sharded dbs
    • data shipping paradigm: client makes request, service fetches data, sends data to client, throws away "stale" data
    • will talk about stateful services, and their benefits, but WARNING: NOT A MAGIC BULLET
    • data locality: keep the fetched data on the service machine
      • lower latency
      • good for data intensive ops where client needs quick responses to operations on large amounts of data
    • sticky connections and consistency
      • using sticky connections and stateful services gives you more consistency models to use: pipelined random access memory, read your write, etc
    • blog post from werner vogels: eventual consistency revisited
    • building sticky connections
      • client connecting to a cluster always gets routed to the same server
    • easiest way: persistent connections
      • but: no stickiness once connection breaks
      • also: mucks with your load balancing (connections might not all last the same amount of time, can end up with one machine holding everything)
      • will need backpressure on the machines so they can break connections when they need to
    • next easiest: routing logic in cluster
      • but: how do you know who's in the cluster?
      • and: how do you ensure the work is evenly distributed?
      • static cluster membership: dumbest thing that might work; not very fault tolerant; painful to expand;
      • next better: dynamic cluster membership
        • gossip protocols: machines chat about who is alive and dead, each machine on its own decides who's in the cluster and who's not; works so long as system is relatively stable, but can lead to split-brain pretty quickly
        • consensus systems: better consistency; but if the consensus truth holder goes down, the whole cluster goes down
    • work distribution: random placement
      • write anywhere
      • read from everywhere
      • not sticky connection, but stateful service
    • work distribution: consistent hashing
      • deterministic request placement
      • nodes in cluster get placed on a ring, request gets mapped to spot in the ring
      • can still have hot spots form, since different requests will have different work that needs to be done, can have a lot of heavy work requests placed on one node
      • work around the hot spots by having larger cluster, but that's more expensive
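    A bare-bones consistent-hashing sketch (illustrative only; real implementations add virtual nodes to smooth out the hot spots mentioned above):

      ;; place nodes on a ring; a request goes to the first node at or after its hash
      (def ring
        (sort-by first (map (fn [node] [(hash node) node])
                            ["node-a" "node-b" "node-c"])))

      (defn node-for [request-key]
        (let [h (hash request-key)]
          (or (some (fn [[point node]] (when (>= point h) node)) ring)
              (second (first ring)))))    ; wrap around to the start of the ring

      (node-for "user-1234")   ;; same node every time while membership is stable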
    • work distribution: distributed hash table
      • non-deterministic placement
    • stateful services in the real world
    • scuba:
      • in-memory db from facebook
      • believed to use static cluster membership
      • random fan-out on write
      • reads from every machine in cluster
      • results get composed by machine running query
      • results include a completeness metric
    • uber ringpop
      • nodejs library that does application-layer sharding for their dispatching services
      • swim gossip protocol for cluster membership
      • consistent hashing for work distribution
    • orleans
      • from Microsoft Research
      • used for Halo4
      • runtime and programming model for building distributed systems based on Actor Model
      • gossip protocol for cluster membership
      • consistent hashing + distributed hash table for work distribution
      • actors can take request and:
        • update their state
        • return their state
        • create a new Actor
      • request comes in to any machine in cluster, it applies hash to find where the DHT is for that client, then that DHT machine routes the request to the right Actor
      • if a machine fails, the DHT is updated to point new requests to a different Actor
      • can also update the DHT if it detects a hot machine
    • cautions
      • unbounded data structures (huge requests, clients asking for too much data, having to hold a lot of things in memory, etc)
      • memory management (get ready to make friends with the garbage collector profiler)
      • reloading state: recovering from crashes, deploying a new node, the very first connection of a session (no data, have to fetch it all)
      • sometimes can get away with lazy loading, because even if the first connection fails, you know the client's going to come back and ask for the same data anyway
      • fast restarts at facebook: with lots of data in memory, shutting down your process and restarting causes a long wait time for the data to come back up; had success decoupling memory lifetime from process lifetime, would write data to shared memory before shutting process down and then bring new process up and copy over the data from shared to the process' memory
    • should i read papers? YES!
    → 9:00 AM, Oct 5