Beyond The Relational - NoSQL Databases
Facilitated by Evan “Rabble” Henshaw-Plath
The web development world has gone mad over non-relational databases, the so-called NoSQL DBs. In this session we'll go over the post-relational landscape, from Memcached to HBase, Cassandra to CouchDB. What's the difference between key-value stores and map-reduce engines, or between document- and column-oriented DBs? This workshop is not particularly practical.
Session Notes
NOSQL (what databases people attending session use)
MySQL, Google App Engine, Cassandra, MongoDB, SQL Server, Oracle, Informix, CouchDB (has a JSON interface), object databases, Postgres
non-relational databases
in the last year or two people have gotten excited about non-relational databases, ranging from simple key-value stores to very complicated data stores
in web dev, people don't use relational databases the way they were designed. ex: locking issues, requests waiting for transactions to complete, not being able to afford hitting the hard drive so you have to store everything in memory, with limits of 16 MB, for example
performance started falling apart: joins being impossible in production database systems - non-normalized data has to be thrown out, as well as anything that isn't indexed, and also selects across multiple tables
writing to the master and reading off slave databases: the order and timing of writes and reads matters (a read from a slave can miss a write that has not replicated yet)
'/dev/null' MySQL engine (presumably the BLACKHOLE storage engine): using lots of MySQL servers to load balance
no indexes, no storage: bizarre (ex. Facebook engineers having to do weird things to make Facebook work: a giant server with gigabit Ethernet to a series of slaves, running the binary log at capacity... 400 MySQL servers doing replication)
which is why they built Cassandra
probably the Facebook engineers thought this was a good idea at the beginning, but then it went crazy: adding 400 extra servers
too much data to possibly fit on one computer, so they made distributed databases
Cassandra is designed around the queries you can do
Cassandra is a lossy database: an awesome database that loses data
Twitter and Facebook use it: they have a normal MySQL, and then Cassandra stores the cached versions of the original database
Flickr has hundreds or thousands of databases, 'sharding' by user id: users 'a-b' = database one, users 'c-d' = database two
which forces you to put the routing logic in your application
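The routing logic that sharding pushes into the application can be sketched roughly as follows. This is a minimal illustration, not Flickr's actual scheme: the shard names, the letter ranges, the catch-all shard and the `shard_for` helper are all made up for the example.

```python
# Hypothetical shard map: first letter of the user id -> database name.
# Ranges and names are illustrative assumptions, not a real layout.
SHARDS = [
    ("a", "b", "database_one"),
    ("c", "d", "database_two"),
]
DEFAULT_SHARD = "database_other"  # catch-all for ids outside the ranges above


def shard_for(user_id: str) -> str:
    """Return the name of the database that holds this user's rows."""
    if not user_id:
        raise ValueError("user_id must not be empty")
    first = user_id[0].lower()
    for low, high, name in SHARDS:
        if low <= first <= high:
            return name
    return DEFAULT_SHARD


print(shard_for("alice"))
print(shard_for("dave"))
```

Every query now has to call something like `shard_for` before it can pick a connection, and anything that spans users has to visit several databases.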
key-value store: memcached is the default one that people use. ex. key = user, value = cached user information
if you turn off the computer, all the data goes away
memcache was the magic sauce of the internet that no one talked about
mixi: parallel technology by Japanese engineers, Tokyo Tyrant
Key/Value stores: use them to make your site faster
memcached
Redis (might assign ids)
Tokyo Cabinet (very good and people use it on very big sites) - stored
Tokyo Tyrant: distributed version
to use these you need some software that does id assignments
memcached doesn't do replication and doesn't have a persistent disk option
cache invalidation: how do you know what was stored in the database, and how do you know whether the cached copy is still valid? set a timeout and be ok with inconsistent data for 15 minutes
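The "set a timeout and live with stale data" approach can be sketched as a cache-aside read. This is only an illustration: `TTLCache` is an in-memory stand-in written for this example (it is not the memcached client API), and `get_user` and `load_from_db` are hypothetical names.

```python
import time

TTL_SECONDS = 15 * 60  # accept up to 15 minutes of stale data


class TTLCache:
    """In-memory stand-in for a cache such as memcached (illustration only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl=TTL_SECONDS):
        self._items[key] = (self._clock() + ttl, value)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = load_from_db(user_id)  # slow path
        cache.set(key, user)
    return user
```

Nothing here notices when the database row changes; the cached copy is simply trusted until the timeout passes.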
Facebook was losing 95% of friends' newsfeed updates
Twitter was crashing: missing friends' tweets was not ok, so Twitter kept trying to get everything to show to users and then the servers would just crash
finding the updates you are interested in on write vs. on read: on write sets a flag for followers that this is a post they would want
on Facebook it is done on read, which limits friends on Facebook to 5000 (on-read lookup, which can't scale) [maybe someone can explain the logic of this better – I missed it]
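The difference between assembling a feed on write and on read can be sketched like this. It is a toy in-memory model under assumed names (`follow`, `post`, `timeline_on_write`, `timeline_on_read`), not a description of how Twitter or Facebook actually implement their feeds.

```python
from collections import defaultdict

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
posts_by_author = defaultdict(list)  # author -> [(seq, text)]
inboxes = defaultdict(list)          # user -> [(seq, author, text)], filled on write
_seq = 0


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post(author, text):
    """On write: store the post once, then copy it to every follower's inbox."""
    global _seq
    _seq += 1
    posts_by_author[author].append((_seq, text))
    for user in followers[author]:
        inboxes[user].append((_seq, author, text))


def timeline_on_write(user):
    """Cheap read: the inbox was already assembled when the posts were written."""
    return sorted(inboxes[user], reverse=True)


def timeline_on_read(user):
    """Expensive read: visit every followed author at request time."""
    items = [
        (seq, author, text)
        for author in following[user]
        for seq, text in posts_by_author[author]
    ]
    return sorted(items, reverse=True)
```

In this sketch the on-read cost grows with the number of people a user follows, while the on-write cost grows with the number of followers an author has.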
websites do not consistently know when/where all user data is because it is always in flux.
'generation id': making everything in the cache invalid every 15 minutes
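One way to read the 'generation id' idea (an assumption, since the notes don't spell out the mechanism) is to prefix every cache key with a counter that changes every 15 minutes. The helper names below are hypothetical.

```python
import time

GENERATION_SECONDS = 15 * 60  # one generation lasts 15 minutes


def generation_id(now=None):
    """Number of whole 15-minute windows since the epoch."""
    if now is None:
        now = time.time()
    return int(now // GENERATION_SECONDS)


def cache_key(name, now=None):
    """Prefix every key with the current generation.

    When the generation changes, old keys are never asked for again,
    so the whole cache is effectively invalidated in one step.
    """
    return f"g{generation_id(now)}:{name}"
```

Old entries are not deleted explicitly in this scheme; they just stop being read and eventually get evicted by the cache.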
what happens when the format of a relational database isn't right? ex. a project about a person's energy consumption - user - hundreds of fields per table
Efficiency 2.0 - also what FriendFeed implemented
Lucene inverted index
break relational database –
[ SOMEONE PLEASE FILL IN THE EXPLANATION HERE]
another idea: column-oriented data store
(he's not talking about object-oriented data stores because they do not apply)
Google App Engine (Java is the query language): column-oriented data store (one big table; you can't use Google's, but they published academic papers and people built replicas based on them)
Hadoop (Pig is the query language)
HBase + Cassandra: you have to write application-level code (no query language)
language of set math
documents – querying
MongoDB and CouchDB (JavaScript or whatever programming language): you pass the query to the data, it crawls over it, and you have to do the indexing before the query (?)
document-oriented databases: for a huge number of fields; document versions do the best version of replication
big tables: lots of columns (fields), everything is indexed; fill in the columns actually used, leave the other ones blank
state column - being able to move columns from one to another
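The "fill in the columns actually used, leave the other ones blank" idea can be sketched as a sparse table where each row keeps only its own columns. `WideTable` is a toy written for these notes; it is not the Bigtable, HBase or Cassandra data model, and the column names in the usage are invented.

```python
from collections import defaultdict


class WideTable:
    """Toy sparse table: each row stores only the columns it actually uses."""

    def __init__(self):
        self.rows = {}  # row key -> {column: value}
        # column -> value -> set of row keys
        self.index = defaultdict(lambda: defaultdict(set))

    def put(self, row_key, column, value):
        row = self.rows.setdefault(row_key, {})
        if column in row:  # drop the stale index entry before overwriting
            self.index[column][row[column]].discard(row_key)
        row[column] = value
        self.index[column][value].add(row_key)

    def get(self, row_key, column):
        """Unused columns simply read back as None ('blank')."""
        return self.rows.get(row_key, {}).get(column)

    def find(self, column, value):
        """Look up rows by any column, since every column is indexed."""
        return set(self.index[column][value])


table = WideTable()
table.put("u1", "city", "Berlin")
table.put("u1", "kwh_jan", 310)
table.put("u2", "city", "Berlin")
print(table.get("u2", "kwh_jan"))    # blank column
print(table.find("city", "Berlin"))
```

A row with three fields costs three entries, no matter how many hundreds of columns exist across the whole table.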
Plone uses an object database (ZODB)
when fields change you have to translate the schema: go open each object one by one
document data – have to tell them which data you want to store/persist
idea: storing everything as RDF triples
Ning uses a triple store
holy grail of semantic stores: if you stored everything as triples, then in theory you could know everything, but it didn't scale - only to millions; it now scales to billions
the subject-predicate-object way of seeing things doesn't map well to the way things are actually stored for web apps (ex. many-to-many relationships, which are used a lot)
academic institutions use triple stores a lot
relational data: field + value
ex usertype: person
most web apps don't need meaning – that meaning doesn't need to be stored
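To make the triple idea concrete, here is a toy in-memory store where every fact is a (subject, predicate, object) tuple and a query is a pattern with wildcards. `TripleStore` and the sample facts are invented for illustration; real RDF stores add URIs, typed literals and a query language such as SPARQL.

```python
class TripleStore:
    """Toy in-memory triple store; None in a pattern acts as a wildcard."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


store = TripleStore()
store.add("alice", "usertype", "person")
store.add("alice", "follows", "bob")
store.add("carol", "follows", "bob")

# Many-to-many relationship: who follows bob?
print(store.match(predicate="follows", obj="bob"))
```

Every lookup in this sketch scans all triples, which hints at why scaling such stores took real indexing work.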
log file data (big): more data than would fit on a hard drive
grep to search: requires opening files that don't even fit on a hard drive
distributed store of data: the query goes over the records one by one. ex. a list of referrals: map picks out the things that match the query, reduce combines them into a more limited set of results. called 'map reduce'
teeny - 'search for this, limit to this' - command you run over the data to extract useful information
the map reduce sessions run for a long time
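The map and reduce steps over log data can be sketched in a few lines. This runs in one process, so it only shows the shape of the computation; the log format, `LOG_CHUNKS` and the function names are assumptions made for the example, not Hadoop's API.

```python
from collections import defaultdict

# Hypothetical log lines: "<path> <referrer>", split into chunks as if
# each chunk lived on a different machine.
LOG_CHUNKS = [
    ["/home google.com", "/about twitter.com"],
    ["/home twitter.com", "/home google.com"],
]


def map_chunk(lines, path):
    """Map: emit (referrer, 1) for every line that matches the query."""
    for line in lines:
        hit_path, referrer = line.split()
        if hit_path == path:
            yield referrer, 1


def reduce_counts(pairs):
    """Reduce: collapse the emitted pairs into one count per referrer."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def referrers_for(path, chunks=LOG_CHUNKS):
    # Each chunk could be mapped on a different machine; here it is a loop.
    emitted = [pair for chunk in chunks for pair in map_chunk(chunk, path)]
    return reduce_counts(emitted)


print(referrers_for("/home"))
```

Because the map step looks at one chunk at a time, no single machine ever needs the whole log on its disk.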
making tables of results for every possible query the application would ever need to do: the data is already ready because the join was done at some point in the past
assembling at write time instead of read time
can write data to do this
Cassandra: open source db that Facebook built
update script: finds new data
Thrift: a protocol
realtime queuing system: looks for and pushes updates, deleting the least active queries over time
Google 'Bigtable', Hadoop
can do richer queries and prerun them
response formats have changed a lot lately: instead of a binary protocol between database driver and database, people are switching to a return format of JSON, where you can see the data structure
disadvantage: BSON (binary JSON) and other formats for transit data (Thrift, Protocol Buffers (Google)) - binary packing formats - getting back data that looks like tables and values
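The trade-off between a readable JSON response and a binary packing format can be seen with the standard library alone. This sketch uses Python's `struct` module as a stand-in for binary packing in general; it is not the BSON, Thrift or Protocol Buffers wire format, and the record and layout are invented.

```python
import json
import struct

record = {"user_id": 42, "score": 3.5}

# Text format: self-describing, readable in a terminal or browser.
as_json = json.dumps(record).encode("utf-8")

# Binary packing: little-endian unsigned 32-bit int + 64-bit float.
# The layout "<Id" is an arbitrary choice for this sketch; both sides
# must agree on it out of band, which is the job a schema does in
# Thrift or Protocol Buffers.
LAYOUT = struct.Struct("<Id")
as_binary = LAYOUT.pack(record["user_id"], record["score"])


def decode_binary(payload):
    """Turn the packed bytes back into a dict; field names come from code."""
    user_id, score = LAYOUT.unpack(payload)
    return {"user_id": user_id, "score": score}


print(as_json)
print(len(as_json), len(as_binary))
```

The packed form is smaller, but the bytes mean nothing without the layout, whereas the JSON shows its own structure.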
these databases have no object relations; there are no objects, so you store records
experimentation of data store examples – for different kinds of data stores
couch rest
active rest
active couch
REST - Representational State Transfer: the idea is to use HTTP as intended, with PUT, GET, POST, DELETE as the fundamental operations of your API
almost everything in web services is REST or REST-like unless it came from legacy/bank
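How the four HTTP verbs map onto operations on a resource can be sketched without a web server. `NotesResource`, its `/notes` paths and the status codes chosen are a simplified illustration, not a particular framework's behaviour.

```python
class NotesResource:
    """Toy resource collection: HTTP verbs mapped onto an in-memory dict."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, method, path, body=None):
        """Return (status_code, payload) for paths like /notes or /notes/1."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "notes" or len(parts) > 2:
            return 404, None
        if len(parts) == 1:
            if method == "GET":      # list the collection
                return 200, dict(self._items)
            if method == "POST":     # create; the server picks the id
                item_id = str(self._next_id)
                self._next_id += 1
                self._items[item_id] = body
                return 201, {"id": item_id}
            return 405, None
        item_id = parts[1]
        if method == "PUT":          # create or replace at a known id
            self._items[item_id] = body
            return 200, body
        if item_id not in self._items:
            return 404, None
        if method == "GET":
            return 200, self._items[item_id]
        if method == "DELETE":
            del self._items[item_id]
            return 204, None
        return 405, None
```

The URL names the thing and the verb names the action, so a document store that speaks HTTP needs very little extra API on top.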
USE CASES form generator -
plone – SOAP – python pickles
Lucene: indexing search engine
MyISAM: full text search doesn't scale well, wonky
Doug Cutting made Lucene: rich query language for searching text
Sphinx: smaller, lighter weight search indexing - Craigslist uses Sphinx - needing to remove stuff from the index
can use Sphinx with a standard relational database
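Full-text engines like Lucene are built around an inverted index: a map from each term to the documents that contain it. The class below is a bare-bones sketch of that structure with invented names; it leaves out ranking, stemming and everything else a real search engine does.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids
        self.docs = {}                    # doc id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Removing a document means touching every term it contained."""
        text = self.docs.pop(doc_id, "")
        for term in tokenize(text):
            self.postings[term].discard(doc_id)

    def search(self, query):
        """AND query: documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

The `remove` method shows why taking listings out of an index is its own chore: one deleted document touches many posting lists.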
Seaside: heretical web app framework written in Smalltalk, continuations-based - every part of a web page has its own execution of code - wait for pages that are pretty - mind fuck way of web development
roll your own OO database and your own indexing
lots of blog posts and conferences (NOSQL conferences ) - can see videos
no idea which will win: people are experimenting
CouchDB is written in 'Erlang', a difficult functional language
Couch is not cool anymore (wants to own your life): obliterating web servers, federated connections to other servers, lots of work, but ultimately you could own your own data
MongoDB is cool at the moment
Erlang only has constants (no variables; a value is assigned once) - written by Ericsson in the 1980s - always up, never goes down
Mongo: not as weird as CouchDB
collections, which are kind of like tables
written in C++
1.0 version
drop-in replacement for Rails 3
lead developer was CTO of DoubleClick
available on IRC, nice & very responsive people
'lionage version of php'
Hadoop people very friendly
Cloudera: provides commercial support for Hadoop