Beyond The Relational - NoSQL Databases

From DevSummit

Facilitated by Evan “Rabble” Henshaw-Plath

The web development world has gone mad over non-relational databases, so-called NoSQL DBs. In this session we'll go over the post-relational landscape, from Memcached to HBase, Cassandra to CouchDB. What's the difference between key-value stores and map-reduce engines, or between document- and column-oriented DBs? This workshop is not particularly practical.

Session Notes

NOSQL (what databases people attending session use)

mysql, google app engine, cassandra, mongodb, sql server, oracle, informix, couchdb (has a json interface), object databases, postgres

non-relational databases

in the last year or two people have gotten excited about non-relational databases, from simple key value stores to very complicated data stores

in web dev people don't use relational databases as they were designed, ex: locking issues, people waiting for transactions to complete, not being able to hit the hard drive so you have to store everything in memory – with limits of 16mb, for example

performance started falling apart, with joins being impossible in production database systems - non-normalized data has to be thrown out – as well as things that aren't indexes, and also doing selects across multiple things

writing and reading off slave databases – something about order of writing and reading and timing

'dev null' mysql engine (the BLACKHOLE storage engine) – using lots of mysql servers to load balance

no indexes, no storing – bizarre (ex. facebook engineers having to do weird things to make facebook work: giant server, gigabit ethernet to a series of slaves, running the binary log at capacity... 400 mysql servers acting as replication)

which is why they built cassandra

probably the facebook engineers thought this was a good idea at the beginning – but then it went crazy – adding 400 extra servers

too much data to possibly fit on one computer, so they made distributed databases – cassandra is designed around the queries you could do

cassandra is a lossy database – an awesome database that loses data

twitter and facebook use it – they have a normal mysql, and then cassandra stores the cached versions of the original database

flickr has hundreds or thousands of databases – 'sharding': user ids 'a-b' = database one, user ids 'c-d' = database two

which forces you to put the shard lookup logic in your application
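A minimal sketch of the kind of lookup logic sharding pushes into the application, assuming users are routed by the first letter of their id. The shard boundaries, the database names and the `shard_for` helper are all made up for illustration; this is not how flickr actually does it.

```python
import bisect

# Hypothetical shard map: (first letter covered, database name).
# Each shard covers ids from its letter up to the next shard's letter.
SHARDS = [("a", "database_one"), ("c", "database_two"), ("e", "database_three")]
_STARTS = [start for start, _ in SHARDS]


def shard_for(user_id: str) -> str:
    """Return the name of the database that holds this user."""
    first = user_id[0].lower()
    # Find the last shard whose starting letter is <= the id's first letter.
    index = bisect.bisect_right(_STARTS, first) - 1
    if index < 0:
        raise ValueError(f"no shard covers user id {user_id!r}")
    return SHARDS[index][1]
```

Every query for a user first has to call something like `shard_for("carol")` to find out which database to connect to, and a query that spans users on different shards has to be stitched together by the application.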

key value store – memcache is the default one that people use – ex. user id + cached user information

if you turn off computer – all the data goes away

memcache was the magic sauce of the internet that no one talked about

mixi – parallel technology by japanese engineers – tokyo tyrant

Key/Value stores – use to make your site faster:

memcache

redis (might assign ids)

tokyo cabinet (very good and people use it on very big sites) - stored on disk

tokyo tyrant – distributed version; to use it you need some software that does id assignments

memcache doesn't do replication and doesn't have a persistent disk option

cache invalidation – how do you know what was stored in the database – how do you know if it is still valid – set a timeout + be ok with inconsistent data for 15 minutes
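A rough sketch of the "set a timeout and live with stale data" approach, using a plain in-memory dict in place of memcache. The `TTLCache` class, the `fetch_user` helper and the `load_from_db` callback are hypothetical names for illustration, not any real client API.

```python
import time


class TTLCache:
    """In-memory key/value cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Too old: drop it so the caller goes back to the database.
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)


def fetch_user(user_id, cache, load_from_db):
    """Return the cached user if still fresh, otherwise reload and cache it."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(key, value)
    return value
```

With `TTLCache(15 * 60)` a change in the database can take up to 15 minutes to show up, which is the inconsistency the notes say you have to be ok with.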

facebook was losing 95% of newsfeed of friends

twitter was crashing – missing friends tweets was not ok – twitter kept trying to get everything to show to users and then the servers would just crash

finding the updates you are interested in on write vs. on read – on write sets a flag for followers that it's a post those users would want

on facebook – it is set on read – which limits friends on facebook to 5000 (on read lookup – which can't scale) [maybe someone can explain the logic of this better – I missed it]
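One way to picture the on-write vs. on-read difference is the toy sketch below. It is an assumption about what was meant, with made-up in-memory structures and function names, not a description of how twitter or facebook are built. On write, each post is copied to every follower's inbox when it is made; on read, the timeline is assembled by visiting every followed author at request time, so the cost grows with the number of friends.

```python
from collections import defaultdict

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
posts_by_author = defaultdict(list)  # author -> their posts, oldest first
inbox = defaultdict(list)            # user -> posts delivered at write time


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post(author, text):
    posts_by_author[author].append(text)
    # On write: deliver the post to every current follower right away.
    for user in followers[author]:
        inbox[user].append((author, text))


def timeline_on_write(user):
    # Reading is a single lookup; the work was done when the post was made.
    return list(inbox[user])


def timeline_on_read(user):
    # On read: one lookup per followed author, every time the page loads.
    return [(author, text)
            for author in sorted(following[user])
            for text in posts_by_author[author]]
```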

websites do not consistently know when/where all user data is because it is always in flux.

'generation id' – making everything invalid every 15 minutes
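A minimal sketch of how a generation id can invalidate everything at once, assuming the id is made part of every cache key. The `GenerationCache` class is hypothetical; a scheduled job calling `invalidate_all()` every 15 minutes would give the behaviour described in the notes.

```python
class GenerationCache:
    """Cache where bumping one counter makes every existing entry unreachable."""

    def __init__(self):
        self.generation = 0
        self._data = {}

    def _key(self, key):
        # The generation id is part of every key.
        return (self.generation, key)

    def get(self, key):
        return self._data.get(self._key(key))

    def set(self, key, value):
        self._data[self._key(key)] = value

    def invalidate_all(self):
        # Old entries stay in the dict but can no longer be looked up.
        # A real cache such as memcache would evict them over time.
        self.generation += 1
```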

what happens when the format of a relational database isn't right, ex. a project about a person's energy consumption - user – hundreds of fields per table (efficiency 2.0) – also what friendfeed implemented

lucene inverted index
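For anyone who has not seen one, an inverted index maps each word to the documents that contain it, so a search never has to scan the documents themselves. The sketch below is a toy version of the idea with hypothetical function names; it is not Lucene's API and skips stemming, ranking and everything else Lucene does.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```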

break relational database –


another idea: column oriented data store

(he's not talking about object oriented data stores because they do not apply)

google app engine (java is query language) – column oriented data store (bigtable; you can't use google's, but they published academic papers and people replicated it based on those)

hadoop (pig is query language)

hbase + cassandra: have to write application level code (no query language)

language of set math

documents – querying

mongo db and couch db (javascript or whatever prog. language) – passes the query to the data, crawls over it, and you have to do indexing before the query (?)
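A possible reading of "indexing before query" is the view style, where a small function is run over every document ahead of time and queries only touch the rows it emitted. The sketch below assumes that reading; `build_view`, `by_city` and `query_view` are made-up names, written in Python rather than the javascript these databases use.

```python
def build_view(documents, map_fn):
    """Run map_fn over every document and collect emitted (key, value) rows, sorted by key."""
    rows = []
    for doc in documents:
        rows.extend(map_fn(doc))
    rows.sort(key=lambda row: row[0])
    return rows


def by_city(doc):
    # Hypothetical map function: emit one row for each document that has a city.
    if "city" in doc:
        yield (doc["city"], doc["name"])


def query_view(rows, key):
    """Answer a query from the prebuilt rows without touching the documents."""
    return [value for row_key, value in rows if row_key == key]
```

Documents without the field are simply skipped, which is why no schema is needed.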

document oriented databases – for huge amounts of fields – document versions do the best version of replication

big tables - lots of columns – fields – everything is indexed – fill in the columns actually used, leave the other ones blank
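A small sketch of "fill in the columns actually used, leave the others blank", assuming each row is stored as its own mapping of column name to value. `SparseTable` is a hypothetical class for illustration and does not model indexing or how Bigtable lays data out on disk.

```python
from collections import defaultdict


class SparseTable:
    """Rows hold only the columns that were actually written."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # A blank column costs nothing; it just is not there.
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))
```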

state column - being able to move columns from one to another

plone is an object database

when fields change – translate schema – have to go open each one

document data – have to tell them which data you want to store/persist

idea – storing everything as rdf triple - ning uses a triple store

holy grail of semantic stores – if you stored everything in triples – then in theory you could know everything – but it didn't scale – only to millions, it now scales to billions

subject-predicate-object way of seeing things doesn't map well to the way things are actually stored for web apps (ex. many to many relationships – which are used a lot)
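To make the triple idea concrete, here is a toy sketch where everything is a (subject, predicate, object) tuple and a query is a pattern with blanks. The `match` helper and the sample data are made up; real triple stores use RDF and a query language such as SPARQL.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


# A many to many relationship (users in groups) becomes one triple per pair.
TRIPLES = [
    ("ann", "member_of", "cyclists"),
    ("ann", "member_of", "gardeners"),
    ("bob", "member_of", "cyclists"),
    ("ann", "usertype", "person"),
]
```

`match(TRIPLES, predicate="member_of", obj="cyclists")` finds the members of one group, but anything a relational database would answer with a join needs several such lookups combined in the application.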

academic institutions use triple stores a lot

relational data: field + value

ex usertype: person

most web apps don't need meaning – that meaning doesn't need to be stored

log file data (big) – getting more data than would fit on a hard drive

grep – to search – requires opening files that don't even fit on a hard drive

distributed store of data – the query runs over it piece by piece – ex. a map over a list of referrals – map to the things that match the query, which produces a more limited set of results; called 'map reduce'

teeny – 'search for this, limit to this' – command you run over the data to extract useful information

the map reduce sessions run for a long time
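A single-machine sketch of the map reduce shape described above, assuming log lines of the made-up form "status referrer". Each chunk stands in for a piece of the log stored on a different machine: the map step filters and counts within one chunk, and the reduce step merges the partial counts. The function names are hypothetical and nothing here is Hadoop's API.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines, wanted_status):
    """Map step: count referrers among the lines in one chunk that match the query."""
    counts = Counter()
    for line in lines:
        status, referrer = line.split()
        if status == wanted_status:
            counts[referrer] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def run_map_reduce(chunks, wanted_status):
    # In a real cluster each map_chunk call would run where its chunk is stored.
    partials = (map_chunk(chunk, wanted_status) for chunk in chunks)
    return reduce(reduce_counts, partials, Counter())
```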

making tables of results – every possible query application would ever need to do – data is already ready – did join at some point in the past

assembling at write time – instead of read

can write data to do this

cassandra – open source db that facebook built

update script – finds new data

thrift – a protocol

does realtime queuing system, looks for – pushes updates – deleting least active query over time

google 'bigtable', hadoop

can do richer queries – and prerun them

response formats have changed a lot lately – binary protocol between database driver and database – people are switching to a return format of json – you can see the data structure

disadvantage: bson (binary json) and other formats for transit data (thrift, protocol buffers (google)) are binary packing formats – getting back data that looks like tables and values
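The tradeoff can be seen with the standard library alone. In this sketch the same record is sent once as json and once in a made-up fixed binary layout; the layout is an assumption for illustration and is not bson, thrift or protocol buffers.

```python
import json
import struct

record = {"id": 42, "score": 3.5}

# json: anyone looking at the bytes can see the data structure.
as_json = json.dumps(record).encode("utf-8")

# Hypothetical binary layout: little-endian unsigned 32-bit id, then a 64-bit float.
LAYOUT = "<Id"
as_binary = struct.pack(LAYOUT, record["id"], record["score"])

# Reading the binary form back requires knowing the layout in advance,
# and what comes out is bare values with no field names attached.
user_id, score = struct.unpack(LAYOUT, as_binary)
```

The binary form is smaller, but without `LAYOUT` it is just opaque bytes.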

these databases have no object relations; there are no objects, so you store records

experimentation of data store examples – for different kinds of data stores: couch rest, active rest, active couch

REST - representational state transfer – idea is to use http as intended: put, get, post, delete as the fundamental operations of your api
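A toy sketch of the four verbs acting on resources, with no real http involved. The `ResourceStore` class, the paths and the choice of status codes are assumptions for illustration, not any framework's API.

```python
class ResourceStore:
    """Maps HTTP-style verbs onto create, read, replace and delete."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, method, path, body=None):
        """Dispatch one request and return (status code, payload)."""
        parts = [p for p in path.split("/") if p]
        if method == "POST" and len(parts) == 1:
            # POST to the collection creates a new resource.
            item_id = str(self._next_id)
            self._next_id += 1
            self._items[item_id] = body
            return 201, {"id": item_id}
        if len(parts) != 2:
            return 404, None
        item_id = parts[1]
        if method == "PUT":
            # PUT stores the body at exactly this address.
            self._items[item_id] = body
            return 200, body
        if item_id not in self._items:
            return 404, None
        if method == "GET":
            return 200, self._items[item_id]
        if method == "DELETE":
            del self._items[item_id]
            return 204, None
        return 405, None
```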

almost everything in webservices is rest or rest like unless it came from legacy/bank

USE CASES form generator -

plone – SOAP – python pickles

lucene – indexing search engine

myisam – full text search doesn't scale well, wonky

doug cutting made lucene – rich query language for searching text

sphinx – smaller, lighter weight search indexing – craigslist uses sphinx – needing to remove stuff from index – can use sphinx with a standard relational database

seaside – heretical web app framework written in smalltalk continuations-based – every part of webpage has execution of code – waite for pages that are pretty – mind fuck way of web development

roll own OO database – and own indexing

lots of blog posts and conferences (NOSQL conferences) - can see videos

no idea which will win – people are experimenting

couch db written in 'erlang' a difficult functional language

couch not cool anymore, (wants to own your life) obliterating webservers – federated connections to other servers, lots of work – but ultimately you could own your own data

mongo db is cool at the moment

erlang only has constants (no variables) – written by ericsson in the 1980s – always up, never goes down

Mongo – not as weird as couch db

collections – which are kind of like tables

written in c++

1.0 version

drop in replacement for rails 3

lead developer was CTO of doubleclick

available on irc, nice & very responsive people

'lionage version of php'

hadoop people very friendly

cloudera – provides commercial support for hadoop