<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Thoughts of an allocated mind.</title>
    <description>An open repository of freely available research, notes and thoughts on computer science, engineering, programming and everything related.</description>
    <link>https://ankurcha.github.io/malloc64/malloc64/</link>
    <atom:link href="https://ankurcha.github.io/malloc64/malloc64/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>Where Clause Options</title>
        <description>&lt;h1 id=&quot;thoughts-on-where-clauses-for-analytics-api&quot;&gt;Thoughts on &lt;code&gt;where&lt;/code&gt; Clauses for analytics API&lt;/h1&gt;

&lt;p&gt;The current API where clause is limited in terms of functionality and the set of operations it can support. This document represents a survey of the possible alternatives.&lt;/p&gt;

&lt;p&gt;Any alternative must support (either partially or fully) the following features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;An extensible set of operators, namely:
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;==&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt; equality and inequality operators (both point queries and ‘in’ queries)&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;gt;=&lt;/code&gt; range query operators&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;, &lt;code&gt;||&lt;/code&gt; and &lt;code&gt;!&lt;/code&gt; logical operators&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;(&lt;/code&gt;, &lt;code&gt;)&lt;/code&gt; grouping operators&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The alternatives below are evaluated on two axes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Ease of use
    &lt;ul&gt;
      &lt;li&gt;A new user should be able to read the documentation and examples and not find anything in the API difficult to use. This covers both the syntax and its readability. It also ties into how much effort is needed to on-board a new user: the effort required to go from zero to understanding a sizable chunk of the functionality, so that the developer feels productive.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Extensibility
    &lt;ul&gt;
      &lt;li&gt;This requirement is more forward facing and captures how easy it is to extend the functionality an option provides. A developer who adds a new piece of functionality should be able to do so without undermining the ease of use.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;option-1&quot;&gt;Option 1&lt;/h2&gt;
&lt;p&gt;This option suggests that the URI query string be used like a regular string and we construct queries based on a set of developer defined operators. This means that we could potentially specify a query string that is very readable but not necessarily similar to the key-value syntax of most URI query strings.&lt;/p&gt;

&lt;h3 id=&quot;extensibility&quot;&gt;Extensibility&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;The developer is expected to maintain a proper parser to ensure that a user request and all of its components are parsed into a well-defined parse tree. The query string may contain exactly the operators defined above, in infix notation.&lt;/li&gt;
  &lt;li&gt;There are many ways to build such a parser, from generators such as ANTLR to plain regular expressions combined with custom logic to interpret the final set of values.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;ease-of-use&quot;&gt;Ease of use&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;In terms of readability, the query string closely resembles an ordinary logical infix expression. The syntax is therefore mostly intuitive: once the user understands what each operator does, the intended meaning of a query is clear.&lt;/li&gt;
  &lt;li&gt;There may be some confusion about escaping the operators and arguments.&lt;/li&gt;
  &lt;li&gt;Many HTTP clients only support key-value query syntax, so the user may need to switch clients or find workarounds for this syntax.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example&quot;&gt;Example:&lt;/h3&gt;

&lt;p&gt;In this example we use &lt;code&gt;;&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt; to denote logical &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://api.brightcove.com/v1/accounts/1234567890/report?from=2014-01-04&amp;amp;to=now&amp;amp;offset=200&amp;amp;limit=100&amp;amp;dimensions=video&amp;amp;where=(video_duration%3E=300;video==1,2,3,4;video_view%3C=100),(video_view%3E=9000)&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(video_duration&amp;gt;=300;video==1,2,3,4;video_view&amp;lt;=100),(video_view&amp;gt;=9000)    
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The where clause is fairly easy to understand (once decoded): the user knows what the different operators mean, but it would still require some getting used to.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The current Google Analytics Reporting API v3 represents filters in the same way.&lt;/li&gt;
&lt;/ul&gt;
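&lt;p&gt;To make the parsing burden of Option 1 concrete, here is a minimal sketch (hypothetical, regex-based Python, not a full grammar) that tokenizes a clause like the decoded example above into OR-ed groups of AND-ed conditions:&lt;/p&gt;

```python
import re

# ';' means AND within a group, ',' inside a value denotes an 'in' list,
# and parenthesized groups are OR-ed together. Hypothetical sketch only.
COND = re.compile(r"(\w+)(==|!=|>=|<=|>|<)([\w,]+)")

def parse_group(group):
    """Split one parenthesized group into its AND-ed conditions."""
    conds = []
    for part in group.split(";"):
        m = COND.fullmatch(part)
        if not m:
            raise ValueError("bad condition: %r" % part)
        field, op, value = m.groups()
        conds.append((field, op, value.split(",")))
    return conds

def parse_where(clause):
    """Parse '(...),(...)' into a list of OR-ed groups."""
    return [parse_group(g) for g in re.findall(r"\(([^)]*)\)", clause)]
```

&lt;p&gt;Even this toy version surfaces the escaping questions noted above: field values containing &lt;code&gt;;&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt; or parentheses would need their own encoding rules.&lt;/p&gt;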

&lt;h2 id=&quot;option-2&quot;&gt;Option 2&lt;/h2&gt;
&lt;p&gt;This option proposes the use of a grammar that represents a &lt;a href=&quot;http://en.wikipedia.org/wiki/Conjunctive_query&quot;&gt;conjunctive query&lt;/a&gt; with negation. 
&lt;a href=&quot;https://parse.com/docs/rest#queries-basic&quot;&gt;Parse.com&lt;/a&gt; uses this form of API. As an example, consider:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -X GET \
  -H &quot;X-Parse-Application-Id: ${APPLICATION_ID}&quot; \
  -H &quot;X-Parse-REST-API-Key: ${REST_API_KEY}&quot; \
  -G \
  --data-urlencode &#39;where={&quot;score&quot;:{&quot;$gte&quot;:1000,&quot;$lte&quot;:3000}}&#39; \
  https://api.parse.com/1/classes/GameScore
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As is evident, the where parameter is a JSON document that is URL-encoded. This improves readability but also forces the user to think in a prefix (conjunctive-query-like) syntax for the expressions.&lt;/p&gt;

&lt;p&gt;Another representation may use a more familiar syntax by replacing JSON with “regular looking” function calls, e.g.:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;and(gte(video_duration, 200), or(eq(video, 1), eq(video, 2), eq(video, 3), eq(video_name, foo%2Cbar)))&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;extensibility-1&quot;&gt;Extensibility&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;The developer of this format would need to implement a parser that can parse the expression and then convert it into a corresponding database query.&lt;/li&gt;
  &lt;li&gt;Adding new operators would be as simple as registering a new &lt;code&gt;function&lt;/code&gt; name and the corresponding logic needed to generate the database query.&lt;/li&gt;
&lt;/ul&gt;
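&lt;p&gt;The registration idea in the last bullet can be sketched as follows (hypothetical Python; the operator names follow the example above and the SQL rendering is illustrative only):&lt;/p&gt;

```python
# A tiny operator registry plus a recursive-descent parser for the prefix
# "function call" syntax, e.g. and(gte(video_duration, 200), eq(video, 1)).
# Sketch only: no quoting, escaping or error recovery.
OPERATORS = {}

def operator(name):
    """Register the logic that renders one operator as a SQL fragment."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

@operator("and")
def _and(*args):
    return "(" + " AND ".join(args) + ")"

@operator("or")
def _or(*args):
    return "(" + " OR ".join(args) + ")"

@operator("eq")
def _eq(field, value):
    return "%s = %s" % (field, value)

@operator("gte")
def _gte(field, value):
    return "%s >= %s" % (field, value)

def parse(expr):
    node, rest = _parse(expr.strip())
    if rest:
        raise ValueError("trailing input: %r" % rest)
    return node

def _parse(s):
    i = 0
    while i < len(s) and s[i] not in "(),":
        i += 1                               # read a name or literal
    name, s = s[:i].strip(), s[i:]
    if s.startswith("("):                    # a function call: parse its arguments
        args, s = [], s[1:]
        while not s.startswith(")"):
            arg, s = _parse(s)
            args.append(arg)
            s = s.lstrip(", ")
        return OPERATORS[name](*args), s[1:]
    return name, s                           # a bare field name or value
```

&lt;p&gt;Adding a range operator is then just another &lt;code&gt;@operator(...)&lt;/code&gt; registration; the parser itself never changes.&lt;/p&gt;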

&lt;h3 id=&quot;ease-of-use-1&quot;&gt;Ease of use&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;As long as the content of the where parameter is properly URL-encoded, any regular HTTP client can easily handle the request.&lt;/li&gt;
  &lt;li&gt;After learning the set of available functions, the user of the API would need to think in prefix notation rather than the usual infix notation. Some novice users may therefore find on-boarding more involved than simply writing an ordinary logical expression.&lt;/li&gt;
&lt;/ul&gt;
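&lt;p&gt;For completeness, a sketch of the client side of Option 2 (Python standard library only; the field names and the Parse-style &lt;code&gt;$gte&lt;/code&gt;/&lt;code&gt;$in&lt;/code&gt; operator spellings are illustrative assumptions):&lt;/p&gt;

```python
import json
from urllib.parse import parse_qs, urlencode

# Build the query string for a JSON-style where parameter; any HTTP
# client that accepts key-value pairs can send this unmodified.
where = {"video_duration": {"$gte": 300}, "video": {"$in": [1, 2, 3, 4]}}
query = urlencode({"dimensions": "video",
                   "where": json.dumps(where, separators=(",", ":"))})

# The server decodes it back into the same structure.
decoded = json.loads(parse_qs(query)["where"][0])
```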
</description>
        <pubDate>Mon, 11 Aug 2014 00:00:00 +0000</pubDate>
        <link>https://ankurcha.github.io/malloc64/malloc64/2014/08/11/where-clause-options/</link>
        <guid isPermaLink="true">https://ankurcha.github.io/malloc64/malloc64/2014/08/11/where-clause-options/</guid>
      </item>
    
      <item>
        <title>Paper: Anti-caching: A new approach to database management system architecture</title>
        <description>&lt;p&gt;Traditionally, databases have involved heavily encoded disk storage format + buffer pool for caching hot segments of data. Executing query
first checks the buffer pool and if data is not present there an eviction occurs for the disk block needed and query resumes. This involves
substantial overhead in maintaining the buffer pool. In some cases almost 1/3rd of the CPU (if all data exists in the buffer pool).&lt;/p&gt;

&lt;h2 id=&quot;alt-main-memory-databases&quot;&gt;&lt;strong&gt;ALT&lt;/strong&gt; Main memory databases&lt;/h2&gt;

&lt;p&gt;Main memory databases store all data in memory and do not have a buffer pool. This yields a drastic improvement in performance, but it also requires all of the data to fit in memory; otherwise the OS page-faults (virtual memory paging) when we try to access data that is not in physical memory. This causes all txns to be stalled.&lt;/p&gt;

&lt;h2 id=&quot;alt-distributed-cache&quot;&gt;&lt;strong&gt;ALT&lt;/strong&gt; Distributed Cache&lt;/h2&gt;

&lt;p&gt;This is a widely adopted strategy wherein we use a main memory distributed cache (e.g. memcached) in front of a regular disk-backed DBMS. But,&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Objects are double buffered (DB buffer pool + distributed cache)&lt;/li&gt;
  &lt;li&gt;Requires apps to embed logic to update/maintain/invalidate cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;intro-anti-caching&quot;&gt;&lt;strong&gt;Intro&lt;/strong&gt; Anti-caching&lt;/h2&gt;

&lt;p&gt;DBMS runs with the data in memory and when memory is exhausted, it evicts the coldest tuples from memory to disk with minimal encoding. =&amp;gt; ‘hottest’
data resides in memory and ‘colder’ data is on disk. &lt;em&gt;Data is either on disk or in memory but never in both places&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Data starts off in memory and cold data is evicted to disk.&lt;/li&gt;
  &lt;li&gt;Allows for fine-grained control (at tuple level) for evictions.&lt;/li&gt;
  &lt;li&gt;Non-blocking fetches: when a txn needs a block that is not in memory, it is aborted, the tuples are fetched into memory (evictions may happen to accommodate this), and the txn is restarted when the blocks are available. Meanwhile, all other txns continue.&lt;/li&gt;
  &lt;li&gt;We can batch disk block reads so that multiple disk blocks can be read together - increases performance/throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The anti-caching architecture outperforms traditional disk-based and hybrid architectures for popular OLTP workloads.&lt;/p&gt;

&lt;h2 id=&quot;assumptions&quot;&gt;Assumptions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Restricts the scope of queries to fit in main memory.&lt;/li&gt;
  &lt;li&gt;All indexes fit in main memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;h-store-system-overview&quot;&gt;H-STORE system overview&lt;/h2&gt;

&lt;p&gt;Traditional DBMS: if the buffer pool is full, the DBMS chooses a block to evict to make space for the incoming one (from disk), and needs concurrency control mechanisms to allow other txns to continue while the stalled one is waiting.&lt;/p&gt;

&lt;p&gt;RAM has recently become cheap enough to store all or most of the dataset in memory for most OLTP workloads. This is the scenario H-Store attempts to target.&lt;/p&gt;

&lt;p&gt;Components of H-Store:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;H-Store node - a single computer that manages multiple partitions.&lt;/li&gt;
  &lt;li&gt;Partition - a disjoint subset of the data. Each partition is assigned a single-threaded execution engine that executes txns and queries.&lt;/li&gt;
  &lt;li&gt;H-Store can execute ad-hoc queries but is primarily targeted at stored procedures: a txn is an invocation of a stored procedure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A stored procedure has &lt;em&gt;control code&lt;/em&gt; that invokes predefined parameterized SQL code.&lt;/p&gt;

&lt;h3 id=&quot;workload--&quot;&gt;Workload&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Single-partition transactions&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Most txns are local to a single node.&lt;/li&gt;
  &lt;li&gt;The txn is examined in the user-space H-Store client, where params are substituted to form a runnable query, so the txn can be sent to the correct node and executed there completely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Multi-partition transactions&lt;/strong&gt;
Consist of multiple phases in which more than one partition is touched.&lt;/p&gt;

&lt;p&gt;Each transaction is given a unique txn identifier based on the time it arrived in the system. If a txn with a higher transaction id has already been executed, the incoming transaction is rejected.&lt;/p&gt;

&lt;p&gt;Multi-partition transactions use an extension of this protocol in which each local executor cannot run other transactions until the multi-partition transaction is completed.&lt;/p&gt;

&lt;p&gt;Each DBMS node continuously writes async snapshots of the database to the disk at fixed intervals. Between these intervals, it writes out a record, to a &lt;em&gt;command log&lt;/em&gt;, of each txn that completes successfully.&lt;/p&gt;

&lt;h2 id=&quot;anticaching-system-model&quot;&gt;Anticaching system model&lt;/h2&gt;

&lt;p&gt;The disk is used as a place to spill cold tuples when size of the database exceeds the size of main memory. &lt;strong&gt;A tuple is never copied&lt;/strong&gt;. It either lives in memory or on disk based on anti-cache.&lt;/p&gt;

&lt;p&gt;The DBMS evicts cold data to the anti-cache to make space for new data – constructs fixed sized blocks for LRU tuples to be sent to the anti-cache (disk).&lt;/p&gt;

&lt;p&gt;When a txn needs an evicted block, it switches to pre-pass mode to learn about all the blocks that it needs. The txn is then aborted (rolling back changes if needed) and held while the tuples are fetched into memory in the background. Once the data has been merged back into the memory-resident data, the txn is restarted. Other txns keep executing while data is being fetched from disk.&lt;/p&gt;
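&lt;p&gt;The abort-and-restart flow can be sketched in a few lines (hypothetical, single-threaded Python simulation; the real system fetches blocks asynchronously and lets other txns run in the meantime):&lt;/p&gt;

```python
# 'storage' is a toy two-level store: hot tuples in "memory", evicted
# tuples on "disk". A txn that touches evicted data raises KeyError,
# we un-evict what it needs, and the whole txn is re-executed.
def read(storage, tuple_id):
    if tuple_id not in storage["memory"]:
        raise KeyError(tuple_id)          # pre-pass discovers a missing tuple
    return storage["memory"][tuple_id]

def run_with_prepass(txn, storage):
    while True:
        try:
            return txn(storage)           # restart from the top on each attempt
        except KeyError as missing:
            tuple_id = missing.args[0]
            storage["memory"][tuple_id] = storage["disk"].pop(tuple_id)
```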

&lt;h3 id=&quot;storage-architecture&quot;&gt;Storage Architecture&lt;/h3&gt;

&lt;p&gt;Storage manager (in each partition) contains:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Disk resident hash table that stores evicted blocks of tuples called the &lt;strong&gt;Block Table&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;In-memory &lt;strong&gt;Evicted table&lt;/strong&gt; that maps evicted tuples to block ids.&lt;/li&gt;
  &lt;li&gt;In-memory &lt;strong&gt;LRU chain&lt;/strong&gt; of tuples for each table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These structures are all single threaded so no concurrency control mechanisms are needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently, we require that all the primary key and the secondary indexes fit in memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Table&lt;/strong&gt;: A hash table that maintains the blocks of tuples that have been evicted from the DBMS’s main memory storage. Each block has the same fixed size and is assigned a unique 4-byte key.
A block header contains the identifier of the single table that its tuples were evicted from and a timestamp for the block creation time. The body contains the serialized evicted tuples from that single table. Each evicted tuple is prefixed with its size and is serialized in a format that closely resembles the in-memory format. The key portion of the Block Table stays in memory, but the values are stored on disk without OS or filesystem caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evicted Table&lt;/strong&gt;: Keeps track of the tuples that have been evicted out to disk. Each evicted tuple is assigned a 4-byte identifier that corresponds to its offset in the block it resides in. The DBMS updates any indexes containing evicted tuples to reference the Evicted Table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU Chain&lt;/strong&gt;: Allows the DBMS to quickly determine at runtime the least recently used tuples to combine into a new block to evict. The LRU chain is a doubly linked list in which each tuple points to the next and previous most recently used tuple of its table. The DBMS embeds these pointers directly in the tuples’ headers; the pointer for each tuple is a 4-byte offset of that record in its table’s memory at the partition (instead of an 8-byte memory location). To reduce overhead, the DBMS selects only a fraction of the txns to monitor at runtime, because hot tuples are by definition accessed more frequently and are thus more likely to be updated in the LRU chain anyway. The rate at which transactions are sampled is tuned by a parameter 0 &amp;lt; \alpha &amp;lt; 1. Tables can be specifically marked as evictable during schema creation; any table not marked as evictable is not maintained in the LRU chain and remains entirely in memory.&lt;/p&gt;
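&lt;p&gt;A compressed sketch of the three structures working together (Python; an &lt;code&gt;OrderedDict&lt;/code&gt; stands in for the embedded-pointer LRU chain, and a plain dict stands in for the on-disk Block Table):&lt;/p&gt;

```python
from collections import OrderedDict

class Partition:
    """Toy per-partition storage manager mirroring the structures above."""

    def __init__(self, block_size=2):
        self.data = {}               # in-memory tuples: tuple_id -> value
        self.lru = OrderedDict()     # LRU chain: coldest entry first
        self.block_table = {}        # "disk": block_id -> list of (tuple_id, value)
        self.evicted = {}            # Evicted Table: tuple_id -> (block_id, offset)
        self.block_size = block_size
        self.next_block_id = 0

    def insert(self, tuple_id, value):
        self.data[tuple_id] = value
        self.lru[tuple_id] = True

    def access(self, tuple_id):
        if tuple_id in self.evicted:
            raise KeyError("evicted")      # the txn would abort and trigger a fetch
        self.lru.move_to_end(tuple_id)     # mark as most recently used
        return self.data[tuple_id]

    def evict_block(self):
        """Pack the coldest tuples into a fixed-size block and move it to 'disk'."""
        block_id, block = self.next_block_id, []
        self.next_block_id += 1
        for offset in range(self.block_size):
            tuple_id, _ = self.lru.popitem(last=False)   # coldest first
            block.append((tuple_id, self.data.pop(tuple_id)))
            self.evicted[tuple_id] = (block_id, offset)
        self.block_table[block_id] = block
        return block_id
```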

&lt;h3 id=&quot;block-retrieval&quot;&gt;Block Retrieval&lt;/h3&gt;

&lt;p&gt;The system first issues a non-blocking read to retrieve the blocks from disk. This allows other transactions to continue while the block is being read off the disk. Any transaction that attempts to read an evicted tuple in these blocks is aborted while the tuple is still on disk. Once the requested blocks are retrieved, the aborted transaction(s) are rescheduled. Before they start, the DBMS performs a “stop-and-copy” operation whereby all transactions are blocked at that partition while the un-evicted tuples are merged from the staging buffer into the regular table storage. It then removes all of the entries for these tuples from the Evicted Table and updates the table’s indexes to point to the real tuples. We may merge either the whole block (Block-Merging) or just the tuples needed (Tuple-Merging).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block-Merging&lt;/strong&gt;: Merge all the tuples from the retrieved block into memory, which may add tuples that will never be needed by transactions. This can either lead to continuous un-eviction/re-eviction cycles (and be detrimental) or can bring in tuples that would eventually be needed anyway (avoiding more stop-load-merge operations).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuple-Merging&lt;/strong&gt;: Only merge the tuples that are necessary for the transaction. Once the desired tuples are merged, the fetched block is discarded. This can lead to a lot of wasted effort if only a small number of tuples are merged and subsequent transactions cause the same set of blocks to be loaded and reloaded. It can also leave holes in the evicted blocks, which should be &lt;em&gt;compacted lazily&lt;/em&gt; using a lazy block compaction algorithm during the merge process.&lt;/p&gt;

&lt;h3 id=&quot;distributed-transactions&quot;&gt;Distributed Transactions&lt;/h3&gt;

&lt;p&gt;H-Store switches a distributed txn into “pre-pass” mode, just as with a single-partition txn, when it attempts to access evicted tuples at any one of its partitions. The txn is aborted and not re-queued until it receives a notification that all of the blocks it needs have been retrieved from the nodes in the cluster.&lt;/p&gt;

&lt;h3 id=&quot;snapshots--recovery&quot;&gt;Snapshots &amp;amp; Recovery&lt;/h3&gt;

&lt;p&gt;As in other main-memory DBMSs, snapshots and command logging [22, 29] are used. The DBMS serializes all the contents of the regular tables and index data, as well as the contents of the Evicted Table, and writes them to disk. At the same time, the DBMS also makes a copy of the Block Table on disk as it existed when the snapshot began. No evictions are allowed to occur while snapshotting is in progress.
To recover from a crash, the DBMS loads the last snapshot from disk, then replays the txns from the command log that were recorded after that snapshot was taken.&lt;/p&gt;

&lt;p&gt;To keep the size of snapshots small, the DBMS takes &lt;strong&gt;delta snapshots&lt;/strong&gt;. These delta-snapshots may be collapsed at regular intervals to avoid keeping a large number of deltas.&lt;/p&gt;

&lt;h2 id=&quot;results--evaluation&quot;&gt;Results / Evaluation&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;For highly skewed workloads (skews of 1.5 and 2.5), the anti-caching architecture outperforms MySQL by a factor of 9x for read-only, 18x for read-heavy and 10x for write-heavy workloads, for datasets 8x the size of memory.&lt;/li&gt;
  &lt;li&gt;It outperforms the hybrid MySQL + memcached architecture by a factor of 2x for read-only, 4x for read-heavy and 9x for write-heavy workloads, again at 8x memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lower performance of the hybrid MySQL architecture is due to the overhead of synchronizing values in memcached and in MySQL in the event of a write. For lower skew, there is a high cost of cache misses in this hybrid architecture.&lt;/p&gt;
</description>
        <pubDate>Sun, 01 Jun 2014 00:00:00 +0000</pubDate>
        <link>https://ankurcha.github.io/malloc64/malloc64/paper/databases/vldb/2014/06/01/paper-anti-caching/</link>
        <guid isPermaLink="true">https://ankurcha.github.io/malloc64/malloc64/paper/databases/vldb/2014/06/01/paper-anti-caching/</guid>
      </item>
    
      <item>
        <title>No one needs to know that ...</title>
        <description>&lt;p&gt;So, recently (more like yesterday), I was chatting with a person who I don’t really know. So without thinking a lot about it, I said hi. Now, I am not a very verbose person and my abilities of actually catching social queues is, well, let’s just say are not very polished. In any case, this person went on to point out how I have been quite inconsistent with my blog and haven’t posted anything in quite some time. This got me thinking and as this person is into the marketing side of things, so they do possibly have a different outlook on things and it might be very interesting to talk to them on the same topics I think about, I asked about their blog and something that they have written.&lt;/p&gt;

&lt;p&gt;Now, back to the original topic.
I assume, at this point, that I need to be a little more regular. So, I asked what Archie’s blog looks like, and the thing I got pointed to (and I will refrain from putting a link here!) is, well, let’s just say, nothing but a brain fart/vomit.&lt;/p&gt;

&lt;p&gt;The blog was just a continuous stream of cat pictures and spotify music posts. Now, I might not be a big editor or some scholar in creative writing, but even I can tell that this blog is of no consequence to anyone/anything.&lt;/p&gt;

&lt;p&gt;This brought me to a very important observation about our generation’s obsession with divulging every single detail of our lives. Most of the things that we do are NOT gems of our civilization and quite frankly are not something that you, or for that matter anyone, needs to share or know about. 
No one, I think, is bothered with how you got yourself out of bed in the morning, or how bad your nail painting abilities are, or how cute your cat looks sleeping for the 500th time.&lt;/p&gt;

&lt;p&gt;Sharing is one thing that we have somehow completely got out of whack. The idea behind it might be pretty simple, so, for a minute, rewind back to the stone age, where the caveman lived pretty much exposed and survived primarily because the herd stuck together.&lt;/p&gt;

&lt;p&gt;As a caveman, I shared my bison meat with others and others shared the fruits they foraged with me. In a similar way, the cavemen shared their knowledge: where to hunt, where to get food, places to stay away from, and predators. Now, if I were to waste everyone’s time with grunts about how I threw a stone at a duck and an egg came out, it might be funny at first and maybe even a good pick-up grunt at a cave-party (yea! I said it!). But sooner or later, someone would club me in the face.&lt;/p&gt;

&lt;p&gt;Now, let me be clear: by no means am I a hermit. I just think that there needs to be some form of personal accountability about what we share. Social media has made the world a closely knit and smaller place. People living across the street are almost the same as the ones living halfway across the globe. This would have been unthinkable even 10 years ago. The world economies have evolved around this concept and continue to do so. Newer areas of research have emerged, and this has allowed us to tap into newer avenues of collaboration. We have seen revolutions being reported, a new era of journalism rise to power and more accountability arise from this. Heck, I myself owe my bread and butter to this very concept.&lt;/p&gt;

&lt;p&gt;But it makes me sad, yes actually sad, when I hear that people are now wasting (yes, I do mean waste in the core sense of the word) their time in front of a screen, ogling at senseless posts on Facebook and Twitter about how much one of their acquaintances just loves Hershey’s kisses. A relief, once in a while, is fine. It lets us connect with other people on a personal level (which is just foo-bar to my brain). It informs us and connects us to the human race. But that’s where we get carried away.&lt;/p&gt;

&lt;p&gt;People overshare and then regret it. We hear about cyber stalkers, or companies who find out the stupid things that people shared on their &lt;strong&gt;public&lt;/strong&gt; Facebook pages without thinking. This is followed by people getting mad at service providers for sharing information in order to actually do their job (and dare I point out, you agreed to the T&amp;amp;C). It just makes my jaw drop at the stupidity of people.&lt;/p&gt;

&lt;p&gt;How dumb do you need to be? If you don’t want something online, don’t put it there! Period. It’s as simple as that. I have learnt this simple fact over the last two and a half decades of my existence. The human brain is a really good optimization engine. You give it any scenario, and it will work it out, share and collaborate with others, and process information to give you the best possible way of doing the task. Failing at something is just its way of working out the problem.&lt;/p&gt;

&lt;p&gt;If you apply this same strategy to stalkers and employers they are just optimizing, in a way, to know their target (obsessively) or reducing risk of the new employee being a business risk.&lt;/p&gt;

&lt;p&gt;To end this rather poorly structured post, I would just like to say that I don’t intend to come off as a person who is overly paranoid. But we live in a time where everything has more impact than you thought. Technologies are advancing much faster than we can take stock of. It is very possible to cross-reference anything with anything else if one has the time and patience. One might be bold enough to propose that the surgeon general should mandate a warning to be sent with every new internet bill saying “Sharing and surfing on the internet may pose a serious health risk.” Though that might be a little over the top ;)&lt;/p&gt;

&lt;p&gt;PS: Archie* here you go! I updated my blog and now you have a piece of my mind on something for you to ruminate upon.&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Nov 2013 00:00:00 +0000</pubDate>
        <link>https://ankurcha.github.io/malloc64/malloc64/reflection/thoughts/2013/11/29/no-one-needs-to-know-that/</link>
        <guid isPermaLink="true">https://ankurcha.github.io/malloc64/malloc64/reflection/thoughts/2013/11/29/no-one-needs-to-know-that/</guid>
      </item>
    
      <item>
        <title>Spark Summit 2013 notes</title>
        <description>&lt;h1 id=&quot;keynote-by-matei-zaharia&quot;&gt;Keynote by Matei Zaharia&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Spark is now one of the largest big data projects out there, surpassing Hadoop in the number of active contributors over the last 6 months.&lt;/li&gt;
  &lt;li&gt;Has more contributors than Storm, Giraph, Drill and Tez combined.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;spark-081-will-have&quot;&gt;Spark 0.8.1 will have&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;MLlib - machine learning algorithms out of the box.&lt;/li&gt;
  &lt;li&gt;YARN 2.2 resource manager support&lt;/li&gt;
  &lt;li&gt;codahale metrics based metrics reporting and a new monitoring UI&lt;/li&gt;
  &lt;li&gt;Spark Streaming comes out of alpha and will be stabilized&lt;/li&gt;
  &lt;li&gt;Shark improvements&lt;/li&gt;
  &lt;li&gt;EC2 support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;other-contributions&quot;&gt;Other contributions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;YARN support - Yahoo&lt;/li&gt;
  &lt;li&gt;Columnar compression of data - Yahoo&lt;/li&gt;
  &lt;li&gt;Metrics reporting - Quantifind&lt;/li&gt;
  &lt;li&gt;Fair scheduling and code generation optimization - Intel&lt;/li&gt;
  &lt;li&gt;New RDD operators&lt;/li&gt;
  &lt;li&gt;scala 2.10 support  (finally!)&lt;/li&gt;
  &lt;li&gt;Master HA&lt;/li&gt;
  &lt;li&gt;External hashing and sorting (0.9 release)&lt;/li&gt;
  &lt;li&gt;Better support for large number of tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;current-priorities&quot;&gt;Current priorities&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Standardise libraries&lt;/li&gt;
  &lt;li&gt;Deployment ease/automation&lt;/li&gt;
  &lt;li&gt;Out of the box usability - better defaults and configuration system&lt;/li&gt;
  &lt;li&gt;Enterprise support provided by Cloudera &amp;amp; Databricks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spark Streaming and Shark will be optimised for 24/7 operation and will get other performance improvements.&lt;/p&gt;

&lt;p&gt;SIMR - will be released soon: provides a seamless way to run Spark as a regular Hadoop job (it’s not production ready just yet). Upgrading to Hadoop 2.2 is recommended.&lt;/p&gt;

&lt;p&gt;In terms of raw performance Spark/Shark is comparable to Apache Impala&lt;/p&gt;

&lt;p&gt;Time spent is generally as follows: 90% of the time is spent just reading and writing in Hadoop. In Spark and other in-memory systems, the data is loaded just once and all subsequent operations are performed on it.&lt;/p&gt;
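&lt;p&gt;A toy illustration of that point (plain Python, no Spark): the dataset is materialized once and reused by every subsequent operation, instead of being re-read from storage for each job.&lt;/p&gt;

```python
load_count = {"n": 0}

def load_dataset():
    """Stand-in for an expensive read from distributed storage."""
    load_count["n"] += 1
    return list(range(10))

cached = load_dataset()                      # loaded into memory once
total = sum(cached)                          # operation 1 reuses the cached data
evens = [x for x in cached if x % 2 == 0]    # operation 2, no re-read
```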

&lt;p&gt;What happens when data does not fit in memory?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Spark manages swapping data as it is needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scale up vs scale out?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Depends on the application and requires looking at the workload, but given the ability to use spot instances, it is preferable to scale out and increase parallelism.&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
  &lt;li&gt;Value from Data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Questions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Insights, diagnosis
    &lt;ul&gt;
      &lt;li&gt;Why is X happening?&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Decision support
    &lt;ul&gt;
      &lt;li&gt;How do I do X?&lt;/li&gt;
      &lt;li&gt;What do I do to make X better?&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Interactive queries&lt;/li&gt;
  &lt;li&gt;Streaming queries&lt;/li&gt;
  &lt;li&gt;Sophisticated data processing (in time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenge #1&lt;/strong&gt;: need to maintain 3 stacks - expensive and difficult to make consistent&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge #2&lt;/strong&gt;: hard/slow to share data and code&lt;/p&gt;

&lt;p&gt;Spark helps in both areas by allowing for the same effort to be utilised for both batch and streaming with minimal change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unification of technologies&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;* Step 1: first cellular phone                        --- hadoop/batch processing
* Step 2: specialized devices (mp3 player, gps, pda)  --- storm, mahout, specialised systems
* Step 3: unification (iphone or smartphones)         --- spark (others also there but not as mature)
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
  &lt;li&gt;Unified realtime and historical data analysis&lt;/li&gt;
  &lt;li&gt;Unified streaming, historical &amp;amp; predictive analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why streaming ML?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;fraud detection&lt;/li&gt;
  &lt;li&gt;decision support&lt;/li&gt;
  &lt;li&gt;trend change detection&lt;/li&gt;
  &lt;li&gt;no point in suggesting a product a day later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise support by:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Databricks&lt;/li&gt;
  &lt;li&gt;Cloudera&lt;/li&gt;
  &lt;li&gt;WANdisco&lt;/li&gt;
  &lt;li&gt;Tuplejump&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BlinkDB will soon go out of alpha and reach stability:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Better HiveQL operator support&lt;/li&gt;
  &lt;li&gt;Performance improvements&lt;/li&gt;
  &lt;li&gt;Interesting numbers on Conviva datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrowdDB - SQL to Crowdsourcing answers (SIGMOD 2011)&lt;/p&gt;

&lt;h1 id=&quot;hadoop-and-spark-join-forces-at-yahoo&quot;&gt;Hadoop and Spark join forces at Yahoo&lt;/h1&gt;

&lt;p&gt;Pre-2012 - Editors’ decisions drive the yahoo homepage&lt;/p&gt;

&lt;p&gt;2012 - data driven editorial decisions for homepage&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;ML driven suggestions&lt;/li&gt;
  &lt;li&gt;Per user customisation&lt;/li&gt;
  &lt;li&gt;instantly saw a 300% increase in engagement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2013 - Personalised homepage&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All parts are personalised, tracked, trained&lt;/li&gt;
  &lt;li&gt;Personalised mobile experience&lt;/li&gt;
  &lt;li&gt;Personalised properties = websearch + vertical content, deep search -&amp;gt; video, maps, weather when you search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Science @ scale requires ability to do quick turnaround discovery&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Challenge #1: Science&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Single model for all items in homepage stream&lt;/li&gt;
      &lt;li&gt;millions of items&lt;/li&gt;
      &lt;li&gt;thousands of item/user features
        &lt;ul&gt;
          &lt;li&gt;categories&lt;/li&gt;
          &lt;li&gt;wikipedia entity names&lt;/li&gt;
          &lt;li&gt;Objective function
            &lt;ul&gt;
              &lt;li&gt;Relevance / User engagement&lt;/li&gt;
              &lt;li&gt;Freshness / popularity&lt;/li&gt;
              &lt;li&gt;Diversity&lt;/li&gt;
              &lt;li&gt;Algorithm exploration
                &lt;ul&gt;
                  &lt;li&gt;Logistic regression&lt;/li&gt;
                  &lt;li&gt;Collaborative filtering&lt;/li&gt;
                  &lt;li&gt;Decision trees&lt;/li&gt;
                  &lt;li&gt;Hybrid&lt;/li&gt;
                &lt;/ul&gt;
              &lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Challenge#2: Speed&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Freshness is very important
        &lt;ul&gt;
          &lt;li&gt;Surface relevant results while they are still relevant&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Challenge #3: Scale&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;150 PB of data on hadoop
        &lt;ul&gt;
          &lt;li&gt;data for model building and BI analytics&lt;/li&gt;
          &lt;li&gt;Avoid latency&lt;/li&gt;
          &lt;li&gt;hours of latency is not acceptable&lt;/li&gt;
          &lt;li&gt;35,000-node bare-metal hadoop cluster&lt;/li&gt;
          &lt;li&gt;Store all data on a giant netapp based nfs store&lt;/li&gt;
          &lt;li&gt;different log sources continuously ship data&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Hadoop + Spark&lt;/li&gt;
  &lt;li&gt;Hadoop is and will be the core for doing most of the heavy lifting&lt;/li&gt;
  &lt;li&gt;Spark for iterative research and queries over most of the processed data&lt;/li&gt;
  &lt;li&gt;Both running in the same cluster&lt;/li&gt;
  &lt;li&gt;YARN puts hadoop datasets and servers at scientists’ disposal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Projects that are using spark:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Collaborative filtering&lt;/li&gt;
  &lt;li&gt;Ecommerce
    &lt;ul&gt;
      &lt;li&gt;viewed-also-viewed&lt;/li&gt;
      &lt;li&gt;bought-also-bought&lt;/li&gt;
      &lt;li&gt;bought-after-bought&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
  &lt;li&gt;Huge speedup = 14 minutes vs 106 minutes&lt;/li&gt;
  &lt;li&gt;Stream Ads&lt;/li&gt;
  &lt;li&gt;Logistic regression algorithm implements
    &lt;ul&gt;
      &lt;li&gt;120 LOC in scala/spark - much easier to manage than Vowpal Wabbit&lt;/li&gt;
      &lt;li&gt;30 mins for 100M samples with 13K features and 30 iterations&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
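
&lt;p&gt;For flavour, a minimal plain-Python sketch of the kind of logistic regression loop mentioned above (batch gradient descent on a toy two-feature dataset; the Scala/Spark version additionally distributes the per-point gradient sum across workers):&lt;/p&gt;

```python
# Toy logistic regression via batch gradient descent. Everything here is an
# illustration, not the Yahoo implementation.
import math

def train(points, labels, iters=100, lr=0.5):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(iters):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in zip(points, labels):
            # sigmoid of the linear score
            p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            err = p - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        n = float(len(points))
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# tiny linearly separable dataset
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
print([predict(w, b, x) for x in xs])  # recovers the labels
```

&lt;p&gt;The per-point loop is exactly the part Spark parallelises: each worker sums gradients over its partition, and only the small gradient vector travels over the network per iteration.&lt;/p&gt;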

&lt;p&gt;&lt;strong&gt;0 to production experiments - 2 hours turnaround - Big win for experimentation and hypothesis validation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Slides have diagrams of the architecture used by Yahoo teams&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Standalone mode
    &lt;ul&gt;
      &lt;li&gt;YARN manages master and workers&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Client mode
    &lt;ul&gt;
      &lt;li&gt;YARN manages only workers&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
  &lt;li&gt;Future contributions
    &lt;ul&gt;
      &lt;li&gt;Dynamic resource allocation&lt;/li&gt;
      &lt;li&gt;Generic history server&lt;/li&gt;
      &lt;li&gt;Preemption&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Old Architecture&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Logs collection to huge NFS storage&lt;/li&gt;
  &lt;li&gt;Logs moved to hdfs&lt;/li&gt;
  &lt;li&gt;PIG/MR ETL processes - massive joins&lt;/li&gt;
  &lt;li&gt;Aggregations and pre-determined reports&lt;/li&gt;
  &lt;li&gt;Load data into DBMS&lt;/li&gt;
  &lt;li&gt;UI/API on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;massive data volumes - many TBs&lt;/li&gt;
  &lt;li&gt;Pure hadoop throughput&lt;/li&gt;
  &lt;li&gt;Report latency is high&lt;/li&gt;
  &lt;li&gt;Culprit:
    &lt;ul&gt;
      &lt;li&gt;Row data processing through MR is slow (IO)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Many chained stages - IO&lt;/li&gt;
  &lt;li&gt;massive joins&lt;/li&gt;
  &lt;li&gt;lack of an interactive DSL&lt;/li&gt;
  &lt;li&gt;expressing business logic as MR is difficult&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aggregates pre-computation problems&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Pre-calculated reports&lt;/li&gt;
  &lt;li&gt;counting distincts&lt;/li&gt;
  &lt;li&gt;Number of reports along dimensions
    &lt;ul&gt;
      &lt;li&gt;datacubes are just huge&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;hadoop is not built for this&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;No way to do data discovery&lt;/li&gt;
  &lt;li&gt;need a data workbench for BI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most spark clusters are “small” and hand managed&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;9.2 TB of addressable RAM&lt;/li&gt;
  &lt;li&gt;96GB/192GB RAM per machine&lt;/li&gt;
  &lt;li&gt;Tableau interface to shark (OLAP)&lt;/li&gt;
  &lt;li&gt;overlap analysis&lt;/li&gt;
  &lt;li&gt;time series analysis&lt;/li&gt;
  &lt;li&gt;construct cubes offline and use them for fast analysis (MOLAP and HOLAP)&lt;/li&gt;
  &lt;li&gt;column pruning (in shark)&lt;/li&gt;
  &lt;li&gt;map side joins (in shark)&lt;/li&gt;
  &lt;li&gt;cached table compression (in shark)&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
  &lt;li&gt;Satellite cluster pattern or per-team cluster on the same underlying data&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;making-spark-fly-creating-an-elastic-spark-cluster-on-amazon-emr&quot;&gt;Making Spark Fly: Creating an elastic Spark cluster on Amazon EMR&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;EMR integrates directly and seamlessly with all other Amazon services (including Kinesis)&lt;/li&gt;
  &lt;li&gt;Aggressively use spot instances to get all the processing done with the excess capacity when prices drop.&lt;/li&gt;
  &lt;li&gt;Task nodes can be added and removed whereas core nodes can only be added.
    &lt;ul&gt;
      &lt;li&gt;Use task nodes as spot instances (scalable workers) and use core nodes for a basic set of available workers at all times&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Spark installed as a part of a bootstrap script&lt;/li&gt;
  &lt;li&gt;1 TB memory cluster using spot prices can be very cheap.
    &lt;ul&gt;
      &lt;li&gt;63 x m1.xlarge = $4.44/hr vs $30/hr on demand&lt;/li&gt;
      &lt;li&gt;6 x  cc2.8xlarge = $4.64/hr&lt;/li&gt;
      &lt;li&gt;15 x m2.4xlarge = $2.25/hr&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Use ssd+ram for extra credit&lt;/li&gt;
  &lt;li&gt;Autoscaling spark
    &lt;ul&gt;
      &lt;li&gt;Using cloudwatch metrics
        &lt;ul&gt;
          &lt;li&gt;define metrics as&lt;/li&gt;
          &lt;li&gt;CPU and memory (probably scale up/down on both)&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Pick a good set of thresholds for “TotalLoad”&lt;/li&gt;
      &lt;li&gt;Spark 0.8.1 will provide cluster metrics that can also be used for scaling&lt;/li&gt;
      &lt;li&gt;Look up CloudwatchMetricsSink&lt;/li&gt;
      &lt;li&gt;When new nodes join the cluster the data is not eagerly rebalanced, but tasks will get scheduled to them and they will eventually get used.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Kinesis integration with sparkstreaming
    &lt;ul&gt;
      &lt;li&gt;Kinesis is just like Kafka&lt;/li&gt;
      &lt;li&gt;Upcoming KinesisReceiver using NetworkReceivers (where to get this?)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;C3 and i2 instance will be awesome!&lt;/li&gt;
  &lt;li&gt;Use placement groups and fetch data from S3 to local hdfs when working with it.&lt;/li&gt;
&lt;/ul&gt;
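
&lt;p&gt;The spot-pricing claim above is easy to sanity-check. In this sketch the hourly cluster prices are the figures quoted in the talk, while the per-node RAM number is my own approximation for m1.xlarge:&lt;/p&gt;

```python
# Back-of-the-envelope check of the "1 TB memory cluster for cheap" claim.
ram_gb_per_node = 15         # approximate m1.xlarge RAM (my assumption)
nodes = 63
spot_cluster_per_hr = 4.44    # figure quoted in the talk
on_demand_cluster_per_hr = 30.0

total_ram_tb = nodes * ram_gb_per_node / 1024.0
savings = 1.0 - spot_cluster_per_hr / on_demand_cluster_per_hr
print("aggregate RAM: %.2f TB" % total_ram_tb)
print("spot savings: %.0f%%" % (savings * 100))
```

&lt;p&gt;63 nodes at roughly 15 GB each is just under 1 TB of aggregate RAM, and the spot price works out to about an 85% saving over on-demand.&lt;/p&gt;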

&lt;h1 id=&quot;simr&quot;&gt;SIMR&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Wraps spark as a normal hadoop job so that you can run it on an existing cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;flint---deploying-bdas-on-aws-adobe&quot;&gt;FLINT - Deploying BDAs on AWS (Adobe)&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Shared nothing management&lt;/li&gt;
  &lt;li&gt;Use simpleDB and S3 to manage all state&lt;/li&gt;
  &lt;li&gt;Efficient and scalable&lt;/li&gt;
  &lt;li&gt;Access to all tools&lt;/li&gt;
  &lt;li&gt;Pending open source&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;adatao---r-and-python-with-nlp-over-spark&quot;&gt;ADATAO - R and Python with NLP over spark&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Use R and Python with spark&lt;/li&gt;
  &lt;li&gt;Seamlessly translates R to spark&lt;/li&gt;
  &lt;li&gt;Use Cases
    &lt;ul&gt;
      &lt;li&gt;BI&lt;/li&gt;
      &lt;li&gt;Adhoc business query&lt;/li&gt;
      &lt;li&gt;Sensor network analytics&lt;/li&gt;
      &lt;li&gt;Ad networks&lt;/li&gt;
      &lt;li&gt;MLLib + Rest API&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;tuplejump&quot;&gt;TupleJump&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Ubercube - distributed olap cube over spark and cassandra&lt;/li&gt;
  &lt;li&gt;Indexes for cassandra&lt;/li&gt;
  &lt;li&gt;Analytics s/w using spark&lt;/li&gt;
  &lt;li&gt;SnapFS&lt;/li&gt;
  &lt;li&gt;Calliope
    &lt;ul&gt;
      &lt;li&gt;R/W data with cassandra&lt;/li&gt;
      &lt;li&gt;Shark + calliope
        &lt;ul&gt;
          &lt;li&gt;builtin indexing (clustered)&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Stargate&lt;/li&gt;
      &lt;li&gt;Hydra: Common bus for messages&lt;/li&gt;
      &lt;li&gt;Ops center&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;realtime-analytics-processing---intel&quot;&gt;Realtime analytics processing - Intel&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Alibaba, Youku, Baidu&lt;/li&gt;
  &lt;li&gt;Value ?
    &lt;ul&gt;
      &lt;li&gt;descriptive analytics - SQL&lt;/li&gt;
      &lt;li&gt;predictive analytics - non-SQL&lt;/li&gt;
      &lt;li&gt;interactive&lt;/li&gt;
      &lt;li&gt;streaming/online&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;A series of mini batch jobs that flush aggregations to rdbms at regular intervals&lt;/li&gt;
  &lt;li&gt;colocate kafka and a spark worker
    &lt;ul&gt;
      &lt;li&gt;log collection is bottlenecked by n/w&lt;/li&gt;
      &lt;li&gt;processing is bottlenecked by cpu and mem
        &lt;ul&gt;
          &lt;li&gt;use MEMORY_ONLY_SER_2
            &lt;ul&gt;
              &lt;li&gt;tune spark.cleaner.ttl [ throughput*spark.cleaner.ttl &amp;lt; memory ]&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Don’t be scared to add columns, let shark worry about that&lt;/li&gt;
      &lt;li&gt;Complex machine learning
        &lt;ul&gt;
          &lt;li&gt;mostly matrix/graph analysis&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
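
&lt;p&gt;The &lt;code&gt;spark.cleaner.ttl&lt;/code&gt; rule of thumb above (throughput × TTL must stay below memory) amounts to a one-line calculation; the headroom factor in this sketch is my own addition:&lt;/p&gt;

```python
# Largest safe TTL is roughly memory / throughput: any older mini-batch data
# must already have been cleaned before memory fills up.
def max_cleaner_ttl_seconds(memory_bytes, throughput_bytes_per_sec, headroom=0.8):
    # keep some headroom rather than filling memory exactly (my assumption)
    return headroom * memory_bytes / throughput_bytes_per_sec

# e.g. 16 GB usable for cached stream data, ingesting 20 MB/s
ttl = max_cleaner_ttl_seconds(16 * 1024**3, 20 * 1024**2)
print("spark.cleaner.ttl should stay under ~%d seconds" % ttl)
```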

&lt;h1 id=&quot;sparrow---next-gen-spark-scheduling&quot;&gt;Sparrow - Next-Gen spark scheduling&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Current scheduler is not so good when jobs are very short
    &lt;ul&gt;
      &lt;li&gt;most of the time spent waiting in queue to be assigned to a worker.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Current scheduler bottlenecked at about 1500 tasks/sec
    &lt;ul&gt;
      &lt;li&gt;bad for big clusters with small tasks&lt;/li&gt;
      &lt;li&gt;Scheduler delay gives an estimate of amount of time spent in scheduler queue&lt;/li&gt;
      &lt;li&gt;Use: Batch sampling + Late binding&lt;/li&gt;
      &lt;li&gt;https://www.github.com/radlab/sparrow&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
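
&lt;p&gt;A toy simulation of the sampling idea behind Sparrow: instead of one central queue, each task probes a small random subset of workers and lands on the least loaded one. (This is the generic power-of-two-choices flavour of the idea, not Sparrow itself, which probes per batch of tasks and adds late binding.)&lt;/p&gt;

```python
# Decentralised scheduling sketch: per-task cost is just a couple of probes,
# yet the load stays close to the ideal even distribution.
import random

def schedule(num_workers, num_tasks, probes=2, seed=42):
    random.seed(seed)
    load = [0] * num_workers
    for _ in range(num_tasks):
        # probe a few random workers, place the task on the least loaded
        candidates = random.sample(range(num_workers), probes)
        best = min(candidates, key=lambda w: load[w])
        load[best] += 1
    return load

load = schedule(num_workers=100, num_tasks=10000)
print("max worker load:", max(load), "(ideal is 100)")
```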

&lt;h1 id=&quot;spark-as-a-service---ooyala&quot;&gt;Spark as a Service - OOYALA&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;raw events in cassandra&lt;/li&gt;
  &lt;li&gt;spark job server&lt;/li&gt;
  &lt;li&gt;anyone can submit a job&lt;/li&gt;
  &lt;li&gt;available as a service to avoid rebuilding the universe over and over
    &lt;ul&gt;
      &lt;li&gt;REST API
        &lt;ul&gt;
          &lt;li&gt;contexts
            &lt;ul&gt;
              &lt;li&gt;allows the user to create a context and keep it alive for later use&lt;/li&gt;
              &lt;li&gt;low latency query now possible&lt;/li&gt;
              &lt;li&gt;submit subsequent queries with &lt;code&gt;&amp;amp;context=&amp;lt;id&amp;gt;&lt;/code&gt; to run in an existing context&lt;/li&gt;
              &lt;li&gt;async and sync requests also possible&lt;/li&gt;
              &lt;li&gt;challenges/lessons
                &lt;ul&gt;
                  &lt;li&gt;spark is based on contexts&lt;/li&gt;
                  &lt;li&gt;we need a manager ( as contexts may take multiple seconds to come up and allocate lots of threads)&lt;/li&gt;
                &lt;/ul&gt;
              &lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Open source and currently a pull request&lt;/li&gt;
&lt;/ul&gt;
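
&lt;p&gt;A toy sketch of why reusable contexts give low-latency queries (all names here are hypothetical, not the actual job server API): creating a context is slow, so the server caches it and later jobs attach to it by id.&lt;/p&gt;

```python
# Minimal context-caching sketch: pay the startup cost once, reuse after.
import time

class ContextManager:
    def __init__(self):
        self._contexts = {}

    def get_or_create(self, context_id):
        if context_id not in self._contexts:
            time.sleep(0.05)  # stand-in for multi-second SparkContext startup
            self._contexts[context_id] = {"id": context_id, "jobs_run": 0}
        return self._contexts[context_id]

    def run_job(self, context_id, job):
        ctx = self.get_or_create(context_id)
        ctx["jobs_run"] += 1
        return job(ctx)

manager = ContextManager()
# first call pays the startup cost; the second reuses the warm context
manager.run_job("adhoc", lambda ctx: ctx["jobs_run"])
result = manager.run_job("adhoc", lambda ctx: ctx["jobs_run"])
print(result)  # 2: both jobs ran in the same cached context
```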

&lt;h1 id=&quot;spark-streaming&quot;&gt;Spark streaming&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;Many environments require processing the same data both live and as batch
    &lt;ul&gt;
      &lt;li&gt;no single framework does this&lt;/li&gt;
      &lt;li&gt;Traditionally, stateful event processing
        &lt;ul&gt;
          &lt;li&gt;each node has mutable state that is synchronized regularly with others&lt;/li&gt;
          &lt;li&gt;prone to loss of data when node is lost&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Storm
    &lt;ul&gt;
      &lt;li&gt;at least once&lt;/li&gt;
      &lt;li&gt;gets slow&lt;/li&gt;
      &lt;li&gt;Trident - adds transactions by using external db (slows down further and adds external dependency/hell)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Built on RDDs
    &lt;ul&gt;
      &lt;li&gt;divides stream into small deterministic batches (up to 0.5 sec each) with lineage&lt;/li&gt;
      &lt;li&gt;code looks and feels same as the batch system code&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Window based transformations
    &lt;ul&gt;
      &lt;li&gt;window(length of window, frequency)&lt;/li&gt;
      &lt;li&gt;updateByKey(updateFunction, key)&lt;/li&gt;
      &lt;li&gt;transform
        &lt;ul&gt;
          &lt;li&gt;allows us to combine historical/regular RDDs with streaming&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Applications
        &lt;ul&gt;
          &lt;li&gt;online ML&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;combine live and historical data&lt;/li&gt;
      &lt;li&gt;CEP style processing (http://en.wikipedia.org/wiki/Event-driven_architecture)&lt;/li&gt;
      &lt;li&gt;Data sources
        &lt;ul&gt;
          &lt;li&gt;kafka, hdfs, flume, akka actors, raw tcp&lt;/li&gt;
          &lt;li&gt;easy to write new receiver&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Fault tolerance
        &lt;ul&gt;
          &lt;li&gt;Batches of input data are replicated&lt;/li&gt;
          &lt;li&gt;data lost due to worker is recomputed from lineage
            &lt;ul&gt;
              &lt;li&gt;how long is the data kept? - uses LRU (0.9 - active throwaway)&lt;/li&gt;
              &lt;li&gt;spark 0.9 will have:
                &lt;ul&gt;
                  &lt;li&gt;automated master failure recovery&lt;/li&gt;
                  &lt;li&gt;perf improvement&lt;/li&gt;
                  &lt;li&gt;better monitoring/UI for streaming specifically&lt;/li&gt;
                &lt;/ul&gt;
              &lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Long term goals
        &lt;ul&gt;
          &lt;li&gt;MLlib for streaming&lt;/li&gt;
          &lt;li&gt;shark for streaming&lt;/li&gt;
          &lt;li&gt;python api&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
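
&lt;p&gt;The window transformation above is easy to picture in plain Python (this is a sketch of the semantics, not the Spark Streaming API): each window is just the union of the last few mini-batches, so windowed counts fall out of ordinary batch logic.&lt;/p&gt;

```python
# Sliding-window word counts over a stream of mini-batches.
from collections import deque, Counter

def windowed_counts(mini_batches, window_length=3, slide=1):
    recent = deque(maxlen=window_length)  # the mini-batches making up the window
    results = []
    for i, batch in enumerate(mini_batches):
        recent.append(batch)
        if (i + 1) % slide == 0:
            # the window is simply the concatenation of recent mini-batches
            window = [item for b in recent for item in b]
            results.append(Counter(window))
    return results

stream = [["a"], ["a", "b"], ["b"], ["c"]]
for counts in windowed_counts(stream):
    print(dict(counts))
```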

&lt;p&gt;Slides: &lt;a href=&quot;http://spark-summit.org/agenda/&quot;&gt;http://spark-summit.org/agenda/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hand notes: &lt;a href=&quot;http://www.flickr.com/photos/ankurc/sets/72157638371848076/&quot;&gt;Flickr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; It goes without saying that all the information in this post is what &lt;strong&gt;I&lt;/strong&gt; understood during the conference and whatever I could remember while writing things down.
There is a possibility that I got something wrong; if that is the case, please let me know and I’ll be happy to make the changes.&lt;/p&gt;
</description>
        <pubDate>Sun, 12 May 2013 00:00:00 +0000</pubDate>
        <link>https://ankurcha.github.io/malloc64/malloc64/spark/summit/conference/notes/2013/05/12/spark-summit-2013-notes/</link>
        <guid isPermaLink="true">https://ankurcha.github.io/malloc64/malloc64/spark/summit/conference/notes/2013/05/12/spark-summit-2013-notes/</guid>
      </item>
    
      <item>
        <title>Virtualised Brain - A thought experiment</title>
        <description>&lt;p&gt;What would it mean if we could combine the concepts of &lt;a href=&quot;http://en.wikipedia.org/wiki/Virtualization&quot;&gt;Virtualisation&lt;/a&gt; and the &lt;a href=&quot;http://en.wikipedia.org/wiki/Human_brain&quot;&gt;Brain&lt;/a&gt;? What would be possible? A couple of days ago at lunch with &lt;strong&gt;&lt;a href=&quot;http://www.linkedin.com/in/philion&quot;&gt;Paul&lt;/a&gt;&lt;/strong&gt;, I thought about this; let’s see where we ended up.&lt;/p&gt;

&lt;h2 id=&quot;assumptions&quot;&gt;Assumptions&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;The scene is set in a distant future.&lt;/li&gt;
  &lt;li&gt;Evolution has not imparted increased brain utilization.&lt;/li&gt;
  &lt;li&gt;People who are asleep or otherwise not using their brains, or people serving a death sentence, life imprisonment or other long-term imprisonment (ethical issues aside), may be put into a “base functions only” state.&lt;/li&gt;
  &lt;li&gt;The “suspended” brain only needs a fraction of its total capacity (in addition to the brain stem and the spine) to manage life supporting body functions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Important&lt;/strong&gt;: Humans have found a way to “install” a Xen/KVM like virtualization layer into brains.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;basics&quot;&gt;Basics&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Brain&lt;/strong&gt; is the key organ of every human being. Although the human brain represents only 2% of the body weight, it receives 15% of the cardiac output, 20% of total body oxygen consumption, and 25% of total body glucose utilization. This complex organ is responsible for all the advancements we see today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtualisation&lt;/strong&gt; technologies such as KVM and Xen have revolutionized computing and are a cornerstone of cloud computing. There are different forms of virtualization (complete, partial or para-virtualization) but in any form it is a big step forward.&lt;/p&gt;

&lt;p&gt;Given the current virtualisation architectures (such as the ones used by Amazon, Google and other open source projects such as KVM and Xen), there is a host system, Dom0, which provides the guest consciousness (maybe multiple) with access to the underlying wetware. This is an important bit: we want the host to have an elevated access level in terms of being able to manage the other brain instances (referred to as “instances” from now on). That way, if we ever have to shut down, revoke access to, or bill for brain power, the Dom0 instance is the one which does this.&lt;/p&gt;

&lt;h2 id=&quot;details&quot;&gt;Details&lt;/h2&gt;
&lt;p&gt;The Dom0 here is pretty thin and is the instance which is closest to the wetware. It provides the other instances with networking (the ability to connect to the wetware internet), on-demand computing resources (how much of the brain capacity are we allocating to this instance) and a billing subsystem (if we are doing this, I suppose we want some form of accounting on the usage).&lt;/p&gt;

&lt;h2 id=&quot;networking&quot;&gt;Networking&lt;/h2&gt;
&lt;p&gt;I propose that the host body is connected to a sort of internet via some non-surgically attached means - something like headgear which allows for a high bandwidth connection to the outside world. From what I understand (and I am not a neurosurgeon), this may require a one-time interface implant, which can be controlled via thought and a physical hardware switch - you always want a big red button on the back of your head which shuts down the virtualized brains.&lt;/p&gt;

&lt;p&gt;The different instances within the brain would communicate over the neurological pathways already existing inside the brain. The only thing which I can see as the bottleneck is the &lt;a href=&quot;http://en.wikipedia.org/wiki/Corpus_callosum&quot;&gt;corpus callosum&lt;/a&gt;, a part which is like a narrow bridge inside each of our brains. My point is, inter-hemisphere communication would be slower than intra-hemisphere communication, and the slowest of all would be inter-brain communication.&lt;/p&gt;

&lt;h2 id=&quot;compute-resources&quot;&gt;Compute resources&lt;/h2&gt;
&lt;p&gt;This is the real thing that we are supporting: having a virtually unlimited amount of brain power for on-demand use is the biggest thing. It would let the human race jump at least 2-3 orders of magnitude in terms of innovation ability. My assumption is that the current rate of progress in scientific and other fields is limited by the amount of “thinking” which people in these fields can do at a time. If brain virtualization were available and if we taught ourselves to package and construct thoughts in a way such that they can be run as compute jobs, or more like programs, scientists could potentially think continuously, work out hundreds or millions of thoughts in parallel, and eliminate the bad ones in a single day, which would otherwise take years.&lt;/p&gt;

&lt;h3 id=&quot;types-of-compute-resources&quot;&gt;Types of compute resources&lt;/h3&gt;
&lt;p&gt;As far as I know/understand the human brain’s parts are fundamentally different in terms of the kind of work/computation that they can perform. Hence, we could serve two kinds of resource types.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Left brain instances&lt;/strong&gt;: numerical computation (exact calculation, numerical comparison, estimation), direct fact retrieval, grammar/vocabulary and literal functions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Right brain instances&lt;/strong&gt;: numerical computation (approximate calculation, numerical comparison, estimation), intonation/accentuation, prosody, pragmatic and contextual functions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to the basic brain functions, there may also be some higher-level specialized brain “abilities”. Certain brains are proficient in certain abilities. Let’s say a brain provides “Calculus abilities”, so that calculus is a basic and highly performant capability: if a scientist needs to perform some complex math, she/he could get a specialized “calculus brain” which would be able to perform integral/differential calculus with much higher efficiency when compared to a “basic brain”. This does not mean that the “basic brain” is any less capable of performing the calculation, just that its neural pathways for doing these functions would be slower and less performant by an order of magnitude.&lt;/p&gt;

&lt;p&gt;This means there is an absolute incentive to learning more and diverse things; at the same time, specializing in certain fields would make certain attributes a basic capability of your brain. I draw these assumptions from personal experience: after years of math and computer science, I can work through certain thoughts almost natively, i.e. I don’t have to think much when I look at an algorithm and try to figure out the algorithmic complexity, or when I need to calculate powers of 2 or solve a simple/medium-level calculus problem.&lt;/p&gt;

&lt;h3 id=&quot;thinking-workflow-reimagined&quot;&gt;Thinking workflow reimagined&lt;/h3&gt;
&lt;p&gt;The new thought process would probably be something like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Think of a big new idea.&lt;/li&gt;
  &lt;li&gt;Figure out different parts of the idea, the different thought components involved.&lt;/li&gt;
  &lt;li&gt;Reserve instances of the types that you may need to reason about each of these thoughts.&lt;/li&gt;
  &lt;li&gt;Send thought “jobs” to these remote instances over some wetware communication media/network for processing and contemplation.&lt;/li&gt;
  &lt;li&gt;Watch the brain cluster for failures and re-initiate thoughts for which the brain instances got terminated.&lt;/li&gt;
  &lt;li&gt;Get processed results back from instances as they complete.&lt;/li&gt;
  &lt;li&gt;Repeat till desired result is obtained or you give up.&lt;/li&gt;
&lt;/ol&gt;
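
&lt;p&gt;Steps 3-6 above are really just scatter-gather with retry; a tongue-in-cheek Python sketch, where everything is hypothetical, including the unreliable “instances”:&lt;/p&gt;

```python
# Toy scatter-gather: send thought "jobs" to instances, watch for failures,
# and re-initiate any thought whose instance got terminated.
import random

def send_to_instance(thought, rng):
    # stand-in for a remote brain instance that sometimes gets terminated
    if rng.random() > 0.7:
        raise RuntimeError("instance terminated")
    return "processed: " + thought

def contemplate(thoughts, max_retries=20, seed=7):
    rng = random.Random(seed)
    results = {}
    pending = list(thoughts)
    for _ in range(max_retries):
        still_pending = []
        for thought in pending:
            try:
                results[thought] = send_to_instance(thought, rng)
            except RuntimeError:
                still_pending.append(thought)  # re-initiate next round
        pending = still_pending
        if not pending:
            break  # desired result obtained
    return results

results = contemplate(["big idea, part 1", "big idea, part 2", "part 3"])
print(len(results), "thoughts processed")
```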

&lt;p&gt;Now where does this take us? We have, at our disposal, unlimited brain power. Given that you can virtualize your own brain in case your original brain gets tired, you could background your thoughts, send them to another instance, sleep or get some downtime, and later retrieve your calculated/processed thoughts.&lt;/p&gt;

&lt;p&gt;With such technology humans would stop being isolated beings and would become more like hive consciousness beings.&lt;/p&gt;

&lt;p&gt;What does this mean for people who give out resources? When you sleep or are just not using your brain you can “rent out” your brain, then wake up and continue where you left off. In case you need more computation resources, terminate the guest instances and reclaim some more local resources.&lt;/p&gt;

&lt;h2 id=&quot;brain-resource-management-and-billing&quot;&gt;Brain Resource management and billing&lt;/h2&gt;
&lt;p&gt;The natural question here is what about the usual management of your own brain resources. Maybe you just feel tired or want to go on vacation and not have anything running in your brain except you. Being Dom0 or the host consciousness you have additional control over the capabilities of your resources. This allows you to suspend or maybe completely disconnect the brain from external entities and hence run in isolation.&lt;/p&gt;

&lt;p&gt;The host may find that some instance has gone astray and is misbehaving. This is something that should be caught either by the virtualization layer or by something like a monitor process available to the host. That way the instance can be terminated or, in extreme cases, banned completely - I am thinking of API token based access control if OAuth is too heavy-weight. :)&lt;/p&gt;

&lt;p&gt;The other aspect of doing all this may be purely financial. We just want to “sell” brain compute power. Billed by the second or minute or hour, whatever is appropriate. This might be the economy of the real “&lt;em&gt;information age&lt;/em&gt;”.&lt;/p&gt;

&lt;h2 id=&quot;brain-storage&quot;&gt;Brain Storage&lt;/h2&gt;
&lt;p&gt;Now that we have got around to performing computation, we need some way of storing the exabytes of thoughts and memories that we would generate, even if they are transient thoughts.&lt;/p&gt;

&lt;p&gt;The virtualization layer may provide access to “brain storage”, which may let a user store and process thoughts which are not constrained by the brain of the person initiating the thought. This is really important if we want the compute resources to be of any use. My hypothesis is that an original thought is probably really small, so the network utilization for transferring a thought is probably very small (a picture or phrase or some other brain wave chatter), but the scratch space required to build the thought tree and work through all the different things involved may be really large - maybe even beyond the capabilities of a single brain, needing &lt;em&gt;sharding of the thought&lt;/em&gt; across different brains.&lt;/p&gt;

&lt;p&gt;The basic requirement of this storage platform is complete isolation of thoughts. It has to be impossible for any other brain instance, or even the host, to look at the information/ideas stored by a different user - something like encrypting the thoughts with some form of encryption and signing them for integrity.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Where are we at the end of this? I believe we are making a huge amount of progress in the fields of science and technology. This might be just a pipe dream, but then so were flying and handheld computers. Having unlimited brain power would change the world in a fundamental way. It would not only push the human race years forward but also bring us much closer to each other. The concept of political boundaries would essentially cease to exist. The Internet would be a thing of the past (or just a minor subset of the &lt;strong&gt;Cognitive-Net or Cognet&lt;/strong&gt;). Peace would be just a thought that would become natural (I hope) to every human and essential to our continued survival.&lt;/p&gt;
</description>
        <pubDate>Tue, 10 Jul 2012 00:00:00 +0000</pubDate>
        <link>https://ankurcha.github.io/malloc64/malloc64/2012/07/10/thought-experiment-virtualised-brian/</link>
        <guid isPermaLink="true">https://ankurcha.github.io/malloc64/malloc64/2012/07/10/thought-experiment-virtualised-brian/</guid>
      </item>
    
  </channel>
</rss>
