DATA MUNGING w/ Jon Goldberg

Go around: what are folks' interests? What are the problem spaces we work in?

  • Organizations looking to get many different sources of data in many different forms all centralized, organized, de-siloed, standardized
  • Migration projects--Drupal, Joomla to WordPress, also handwritten CMSes from the 90s.
  • Donor lists, stuff in Excel or Access, trying to be pulled together; huge PHP files and MySQL queries are the current approach
  • Coming from a PM position, wanting to better understand the work and the tools
  • Migrations, more CRMs than CMSes, work with lots of orgs that do things in funny duct tape ways, series of Excel spreadsheets with macros no one understands
  • Campaign finance data; lots of feelings about bad gov't data that isn't standardized across jurisdictions
  • Amazon now has a 990 library that's OCR-readable; could have an impact if 990s are more accessible and usable; exploring ways to use that data.
  • Dealing with sensitive data, want to articulate the problems and issues with the data currently while not incurring risk
  • Putting numbers on value of environmental resources to make arguments against certain projects
  • CMS and CRM migrations (mentioned frequently)
  • Combining similar data from multiple sources, esp gov't data

What tools are folks using?

  • MySQL
  • sed
  • text editor
  • regular expressions
  • Pentaho data integration
  • OpenRefine
  • Excel and LibreCalc formulas and pivot tables
  • Custom scripts (PHP, Perl, bash etc)
  • WP-all-import (http://www.wpallimport.com/) in WordPress
  • Feeds module in Drupal and Backdrop
  • RSS and Atom feeds
  • QGIS
  • StackExchange!

What sorts of frustrations are folks encountering with what they're doing now?

  • Always inheriting really poor information architecture; no thought went into data structure, things are jumbled and jammed badly. Part of the headache is working with the client to help them understand information architecture so there's a home for this data to go to.
  • Frustration of doing migration or munging from sources that are very messy to start with, and then the client doesn't understand why it's not perfectly sensible in the new format. Process of explaining that there's no magical way to make data that didn't make sense before make sense now.
  • Documentation isn't available--maybe the architecture does make sense but you don't know what it means!
  • Proprietary formats: someone, somewhere, knows what all the fields are.
  • Tedium of manual processes; longing to learn more about automating them.
  • Automating things that are similar but not exactly the same; figuring out just how far to take it.
  • Working with small budgets and pouring so much of them into migrations, knowing more can be automated
  • Working with poorly validated data
  • Debt from previous migrations
  • Joomla in general

Strategies and tools we've used to mitigate the pain points we've discussed

  • Thinking about the problem of data architecture being obscure, proprietary or hard to understand: have made the mistake of just trying to figure it out alone, but it's more valuable to think about who else might know what it is and can figure it out more easily than me. Send a few example rows to somebody who can explain the structure; it saves a lot of time.
  • Problem: so many of these migrations are bad munges that there are so many different, individual cases that no one
  • Do as much planning up front as possible so you know as much as possible about the data, where it's going and what it means to the client before starting the technical work.
  • When you start using tools, you find problems in the data that throw the planning into some chaos. Come up with a data dictionary: each field in the source data and how we intend to map it to the destination data (see the sketch after this list). Work with people who aren't technical to develop that; once you start using the tools you see all the problems and flaws with the data dictionary, but it still forms the basis of a shared agreement with the client: "OK, this is what we said, but this is why we can't do that."
  • Hands-on process with the client: "You understand your data better than I ever will; I need your help to understand how to do it well." At very least available to answer questions about the data dictionary and plans.
  • Find the things that are irrelevant, never used properly, correct but too old to be useful; the smaller the data set, the less unique cases drag you down.
    • Old content that isn't actually useful or would be too much trouble to migrate
    • Old org CRM-type data
  • Working with own data and gov't data that's more or less opaque; if you can't find any documentation on a data set, try tracking down the tech folks who are working on it and talk to them directly.
  • Oftentimes when you're looking at the data as you prepare to migrate, you realize that the new data structure you're building might need to change to omit things that won't really be used.
  • Sometimes you need to lose things to realize you really didn't need them in the first place!
  • How do you convince clients that you don't have to migrate EVERYTHING?
    • Explain to the client--there are certain things you think you want to access once a year only; perhaps keep the old system (Filemaker on a desktop in this example) available to look that information up, but not go through the trouble of migration.
    • Keep archives and backups available in the original state in case that data is ever needed.
    • Through educating the client, they often make the decisions to drop some data themselves.
  • Consider migrating stuff into messy, admin-only fields
  • Sometimes it's even worth putting names on the columns in the spreadsheet/data set to record who put that data in, how, and why.
  • If data has been already extracted but it's not in a great state, can try going back to the source to get a better data dump.
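
A minimal sketch of the data dictionary idea referenced above: not any particular tool, just plain Python with hypothetical file and field names, showing how a source-to-destination field mapping can drive the remapping step and surface source fields the dictionary doesn't cover.

    import csv

    # Hypothetical data dictionary: source field -> destination field.
    # In practice this is developed with the client and kept somewhere shared;
    # the field names here are made up for illustration.
    DATA_DICTIONARY = {
        "Donor Name": "contact_display_name",
        "EMail": "email_primary",
        "Gift Amt": "contribution_amount",
        "Gift Date": "contribution_date",
    }

    unmapped_fields = set()

    with open("donors_export.csv", newline="", encoding="utf-8") as f:  # hypothetical export
        for row in csv.DictReader(f):
            # Destination-shaped row, driven entirely by the dictionary.
            mapped = {dest: row.get(src, "") for src, dest in DATA_DICTIONARY.items()}
            # Track source fields the dictionary doesn't cover: candidates for
            # dropping, or for a conversation with the client.
            unmapped_fields.update(k for k in row if k not in DATA_DICTIONARY)
            # ... hand `mapped` to whatever does the actual import

    print("Source fields not covered by the data dictionary:", sorted(unmapped_fields))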

LOOKING AT THE TOOLS

  • Lots of these problems I used to try to solve with spreadsheets and formulas; ran into a lot of similar problems. It's hard, takes a lot of time, not repeatable, not very good at surfacing the exceptional data and problem areas.
  • Started using some more advanced tools that folks mentioned
  • OpenRefine (http://openrefine.org/)
    • Bills itself as a tool for working with messy data; was originally from Google, got spun off into its own project
    • Available for all major operating systems
    • Decompress the file, doesn't need to be installed; a script runs, starts a web server on your computer, opens an address in your web browser
    • When you open a data set and create a project, you can do some cool things with the data
    • One of the coolest is a facet: click on the header of any column, then choose to do a facet that gives a count of how many records in the DB have each value, so you can see the exceptional values; e.g. out of about 48K records, only 1K have an email type of personal or work, and most are blank. If some said "wrok" or "summer home" or other things that should be combined, you can click on the items in the left column and merge them together. It will even try to guess some for you and put them together. (See the sketch after this list for the basic idea.)
    • When you do facets, you can limit your dataset to a particular facet value, run multiple facets at once, and only clean the data within your given facets
    • Numerical facets: show me all the numerical values, see the outliers, give you a distribution graph, see where outliers are and where your distributions fall
    • Change numbers--spot and merge the outliers
    • Filter by numerical and date ranges, particularly helpful when there's good reason to believe that how the tool was used has changed over time. If a field has 1K out of 48K non-blank entries: if only 4 records out of 48K exist, recommend dropping that field, but 1K out of 48K is enough to say "there's probably something useful there"; you can also see whether someone just cared about tracking that for a year and before and after that no one cared. Combine the "email addresses without type" facet with a date facet to see whether that's the case.
    • Can you script this? One person noted it can be made repeatable because it's got Python bindings; if you write a script in Python you can interact with its API
    • There are scripting steps you can write to create facets on more complex criteria, write some things in JavaScript to either facet data or transform it.
    • Once you drill down there are transformations you can run on it through OpenRefine. Some transformations are built in, you can create others yourself.
    • Use it to figure out the data dictionary before the munging; find it easier to use OpenRefine to do that investigation of the data and identifying problems in the data than MySQL. What are the possible values for this column? What's the distribution of those possible values?
    • Use it to separate content from one cell into multiple cells.
    • Some folks use it to identify duplicate data; not sure it would be really efficient on a mass scale. (Most folks aren't using OpenRefine for that.)
    • Advantages over LibreOffice Calc: autofilter results on columns can't easily be copied elsewhere in Calc, but they can be in OpenRefine.
    • Hint: putting MySQL on an entirely RAM disk can be useful when dealing with huge datasets
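
OpenRefine is a GUI tool, so there's nothing to paste from the session itself, but the text-facet idea above boils down to counting distinct values in a column. A rough stand-in in plain Python, with a hypothetical CSV and column name:

    import csv
    from collections import Counter

    SOURCE_FILE = "contacts_export.csv"  # hypothetical export
    COLUMN = "email_type"                # hypothetical column name

    counts = Counter()
    with open(SOURCE_FILE, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Light normalization so "Work", "work " and "WORK" land in one bucket,
            # roughly in the spirit of OpenRefine's clustering suggestions.
            counts[row.get(COLUMN, "").strip().lower() or "(blank)"] += 1

    # A text-facet-style summary: every distinct value and how many records
    # carry it, so outliers like "wrok" stand out immediately.
    for value, n in counts.most_common():
        print(f"{n:>8}  {value}")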

ETL tools - Extract, Transform, Load

  • Designed to pull data out of a data source, munge it, and put it in a destination (see the sketch after this list).
  • Largely used for data warehousing: a big biz with many databases, have to run reports across all of them, want a nightly script that dumps everything into one master data warehouse that you can run the big queries on.
  • Also super-useful for migrations; scripted and repeatable, which is useful when you're working to migrate data from constantly changing sources like a CRM or CMS in active use.
  • Right now, groups that do online advocacy need to match people to their voter record; you pay a lot of money to Salsa or other similar people who get the info in one easy-to-use place. National Voter File project is trying to do that for free, want to use ETL to repeat it year by year.
  • Many different tools; Jon uses Kettle by a company called Pentaho, aka Pentaho Data Integration (http://www.pentaho.com/product/data-integration)
    • Non-corporate download link for Pentaho Data Integration: https://sourceforge.net/projects/pentaho/
  • In Kettle, you configure different blocks to transform the data.
  • Multiple tasks make up a migration; each task is a different block, e.g. donations, events, contacts, participants, all different tables in a DB. Migrating them each separately; run in a particular order so you can depend on previously migrated data.
  • Start with a data import source, which can be fairly varied; currently pulling directly from a Postgres database, but it can pull from CSV files, MySQL, Excel sheets, legacy DB formats, and JSON, or query live feeds of the data
  • You can combine data inputs, doing joins across heterogeneous sources, e.g. CSV files, Excel, and JSON
  • Lets you break it down step by step, unlike trying to write SQL queries to do the same, which are difficult to troubleshoot because they're so dense.
  • Pentaho Kettle can let you import data, get the new ID from the database, put that new ID on your legacy data so it gets matched up.
  • As the transformation is running through different steps, you can pause and look at the state of all your data at any one of these steps. You can see it being transformed as it goes through; if it's not coming out the way you expect, you can go back and look at what the data looked like at an earlier successful state.
  • Pentaho Kettle documentation is decent; this tool is used often by large organizations. There's an enterprise version that's part of a suite, but not sure there's much difference between the community version and the enterprise version of Kettle.
  • ETL solves a lot of the problems in the room: cleaning up the messy data, but also making a lot of steps repeatable in a way that takes a lot of pain out of everything around it; e.g. when importing data into a data source, it not only cleans the data but imports it for you. Copy a file to the remote server, run the script to import the data, take a dump of the MySQL DB on the remote, save it locally with a particular name: that's a repeatable step.
  • You can run this on your local machine, but it can also work with remote data sources.
  • Once you create a job you can put it on a remote server and run it headlessly on a schedule.
  • Wrote a script that pulls data from original source, munges it, uploads it; scheduled Kettle job that runs nightly, sends an email if it fails. Much more common usage of Kettle
  • Using it for sanitizing data quite a bit, e.g. sensitive data running locally
  • Many people use Kettle to send data to third party services
  • There are lots of problems that come up over and over, e.g. email address issues, "jon2gmail.com", "jon@gmailcom" etc. Create your own transforms that you can use over and over again and can drag into new transformations. It can check for valid email addresses, either by format OR by sending test emails
  • Kettle jobs are XML, not super readable, but can be kept in version control for updating and deployment.
    • MegaphoneJon on GitHub (https://github.com/MegaphoneJon/civicrm_kettle_transforms) -- has a lot of reusable Kettle transforms up there, lots for migrating from proprietary databases to CiviCRM
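
Kettle itself is configured visually, so as a stand-in here is the extract/transform/load pattern sketched in plain Python under assumed file, column, and table names: read a CSV, do a simple format-only check on email addresses (one of the recurring problems mentioned above), and load the rows into SQLite.

    import csv
    import re
    import sqlite3

    # Format-only email check (not deliverability); catches "jon@gmailcom"-style typos.
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def extract(path):
        # Extract: read rows from a hypothetical legacy CSV export.
        with open(path, newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: normalize emails and blank out ones that fail the format check.
        for row in rows:
            email = row.get("email", "").strip().lower()
            row["email"] = email if EMAIL_RE.match(email) else ""
            yield row

    def load(rows, db_path="warehouse.db"):
        # Load: write the cleaned rows into a local SQLite "warehouse" table.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS contacts (name TEXT, email TEXT)")
        conn.executemany(
            "INSERT INTO contacts (name, email) VALUES (?, ?)",
            ((r.get("name", ""), r.get("email", "")) for r in rows),
        )
        conn.commit()
        conn.close()

    # Hypothetical source file; in Kettle each of these steps would be a block
    # you can pause and inspect mid-stream, and the whole job can run on a schedule.
    load(transform(extract("legacy_contacts.csv")))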

Deduping

  • Two ways: (1) I've got this person in one file 5 times; (2) I've got five files and need to dedupe them all without one common identifier ID. Kettle makes it easy to do lookups between heterogeneous data sources; you can do lookups to assign a common ID to do that (see the sketch at the end of these notes).
  • More classical deduping -- Datamade (https://github.com/datamade) is a company that has libraries of tools for the more classic deduping.
  • Use "group by" function in Kettle and munge the data.
  • Dedupes can take you on an adventure; adding an admin issues field where comments are dropped in about a record, might put "possible duplicate" or "no email address," give that back to the client with an army of interns who can work on those.
  • You can split your outputs into multiple files; run data validation steps at the end to find ones that have bad email, improperly formatted numbers, add an "Error" field that notes the problems; don't put them in the same file, put them in an errors output file, e.g. a CSV that you can share with the client so they can see the problems. Rerun the script every time they send you new data, put the problem data someplace where the client can look at it and help clean up. Tell them "fix it in your source data" if your tool is usable, unlike Excel.
  • Ability to say "anytime you do X, say Y", e.g. "every time you see Reverend as the title, change it to Rev" or "drop it into a separate field"
  • Data dictionaries and documentation: valuable to make those, but can be very complicated to translate your logic into a chart. Like to give at least a basic explanation to the client.
  • Visual ETL tools are in many ways self-documenting; you can see the process visually, find a problem later, pretty easy to see what step did it and why.
  • Will document the new data structure and any gotchas; will usually put it up someplace searchable on the web.
  • A lot of data documentation anxiety comes from staff turnover at organizations; people leave and no one knows why things were done. Ask "What would future me, or someone in the same position, want to know?" Write down a lot, but it varies.
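
To close the loop on the dedup and error-file discussion above (the sketch referenced earlier): a plain-Python illustration, with hypothetical file and column names, of assigning a crude common key across records and splitting problem rows into a separate errors file the client can review instead of mixing them into the main output.

    import csv
    from collections import defaultdict

    SOURCE = "combined_contacts.csv"  # hypothetical merged export

    def common_key(row):
        # A crude shared identifier: lowercased email, falling back to name.
        return (row.get("email", "").strip().lower()
                or row.get("name", "").strip().lower())

    groups = defaultdict(list)
    with open(SOURCE, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        for row in reader:
            groups[common_key(row)].append(row)

    out_fields = fieldnames + ["Error"]
    with open("clean.csv", "w", newline="", encoding="utf-8") as clean_f, \
         open("errors.csv", "w", newline="", encoding="utf-8") as err_f:
        clean = csv.DictWriter(clean_f, fieldnames=out_fields)
        errors = csv.DictWriter(err_f, fieldnames=out_fields)
        clean.writeheader()
        errors.writeheader()
        for key, rows in groups.items():
            for row in rows:
                # Flag rows instead of silently dropping them; the errors file
                # goes back to the client so they can fix the source data.
                if not key:
                    row["Error"] = "no email or name"
                elif len(rows) > 1:
                    row["Error"] = "possible duplicate"
                else:
                    row["Error"] = ""
                (errors if row["Error"] else clean).writerow(row)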