DATA MUNGING w Jon Goldberg

== Go around: what is folks' interests? What are the problem spaces we work in? ==
* Organizations looking to get many different sources of data in many different forms all centralized, organized, de-siloed, standardized
* Migration projects--Drupal, Joomla to WordPress, also handwritten CMSes from the 90s.
* Donor lists, stuff in Excel or Access, trying to be pulled together; huge PHP files and MySQL queries is the current approach
* Coming from a PM position, wanting to better understand the work and the tools
* Migrations, more CRMs than CMSes; work with lots of orgs that do things in funny duct-tape ways, series of Excel spreadsheets with macros no one understands
* Campaign finance data, lots of feelings about bad gov't data not standardized across jurisdictions
* Amazon now has a 990 library that's OCR-readable; could have an impact if they're more accessible and usable, exploring ways to use that data.
* Dealing with sensitive data, want to articulate the problems and issues with the data while not incurring risk
* Putting numbers on the value of environmental resources to make arguments against certain projects
* CMS and CRM migrations (mentioned frequently)
* Combining similar data from multiple sources, esp gov't data
== What tools are folks using? ==
* MySQL
* sed
* text editor
* regular expressions
* Pentaho data integration
* OpenRefine
* Excel and LibreCalc formulas and pivot tables
* Custom scripts (PHP, Perl, bash etc)
* [http://www.wpallimport.com/ WP-all-import] in WordPress
* Feeds module in Drupal and Backdrop
* RSS and Atom feeds
* QGIS
* StackExchange!
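A lot of these tools (sed, regular expressions, custom scripts) get pointed at the same kind of one-off cleanup. A minimal Python sketch of that kind of normalization; the messy phone formats here are made up for illustration:

```python
import re

# Hypothetical messy phone values as exported from a donor spreadsheet
raw = ["(212) 555-0183", "212.555.0199", "212 555 0142 x12"]

def normalize_phone(value: str) -> str:
    """Keep only digits, then reformat the first ten as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", value)[:10]
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

print([normalize_phone(v) for v in raw])
# ['212-555-0183', '212-555-0199', '212-555-0142']
```

The same idea scales down to a sed one-liner or up to a reusable transform in an ETL tool.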
== What sorts of frustrations are folks encountering with what they're doing now? ==
* Always inheriting really poor information architecture; no thought went into data structure, things are jumbled and jammed badly. Part of the headache is working with the client to help them understand information architecture so there's a home for this data to go to.
* Frustration of doing a migration or munge from sources that are very messy to start with, but then the client doesn't understand why it's not perfectly sensible in the new format. Process of explaining that there's no magical way to make data that didn't make sense before now make sense.
* Documentation isn't available--maybe the architecture does make sense but you don't know what it means!
* Proprietary formats: someone knows what the fields all are.
* Tedium of manual processes; longing to learn more about automating processes.
* Automating things that are similar but not exactly the same; figuring out just how far to take it.
* Working with small budgets and pouring so much of them into migrations, knowing more could be automated
* Working with poorly validated data
* Debt from previous migrations
* Joomla in general
== Strategies and tools we've used to mitigate the pain points we've discussed ==
* Thinking about the problem of data architecture being obscure, proprietary or hard to understand: have made the mistake of just trying to figure it out alone, but it's more valuable to think about who else might know what it is and can figure it out more easily than me. Send a few example rows to somebody who can explain the structure; saves a lot of time.
* Problem: so many of these migrations are bad munges that there are so many different, individual cases that no one
* Do as much planning up front as possible so you know as much as possible about the data, where it's going and what it means to the client before starting the technical work.
* When you start using tools, you find problems in the data that throw the planning into some chaos. Come up with a data dictionary: the field in the source data and how we intend to map it to the destination data. Work with people who aren't technical to develop that; once you start using the tools you see all the problems and flaws with the data dictionary, but it still forms the basis of a shared agreement with the client. "OK, this is what we said, but this is why we can't do that."
* Hands-on process with the client: "You understand your data better than I ever will; I need your help to understand how to do it well." At the very least, be available to answer questions about the data dictionary and plans.
* Find the things that are irrelevant, never used properly, or correct but too old to be useful; the smaller the data set, the less unique cases drag you down.
** Old content that isn't actually useful or would be too much trouble to migrate
** Old org CRM-type data
* Working with own data and gov't data that's more or less opaque; if you can't find any documentation on a data set, try tracking down the tech folks who are working on it and talk to them directly.
* Oftentimes when you're looking at the data as you prepare to migrate, you realize that the new data structure you're building might need to change to omit things that won't really be used.
* Sometimes you need to lose things to realize you really didn't need them in the first place!
* How do you convince clients that you don't have to migrate EVERYTHING?
** Explain to the client--there are certain things you think you want to access only once a year; perhaps keep the old system (Filemaker on a desktop in this example) available to look that information up, but not go through the trouble of migration.
** Keep archives and backups available in the original state in case that data is ever needed.
** Through educating the client, they often make the decision to drop some data themselves.
* Consider migrating stuff into messy, admin-only fields
* Sometimes even putting names on the columns in the spreadsheet/data set to know who put that data in, how and why.
* If data has already been extracted but it's not in a great state, can try going back to the source to get a better data dump.
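The data dictionary discussed above can be as simple as a field-by-field mapping both sides sign off on, including the shared decision not to migrate a field. A sketch in Python; all field names here are hypothetical:

```python
# Hypothetical data dictionary: source field -> destination field.
# A None destination records the agreed decision NOT to migrate a field.
data_dictionary = {
    "Donor Name": "contact.display_name",
    "eMail":      "email.primary",
    "Fax":        None,                          # client agreed to drop
    "GiftAmt":    "contribution.total_amount",
}

def map_row(source_row: dict) -> dict:
    """Apply the dictionary to one source record, dropping unmapped fields."""
    return {
        dest: source_row[src]
        for src, dest in data_dictionary.items()
        if dest is not None and src in source_row
    }

row = {"Donor Name": "A. Donor", "Fax": "none", "GiftAmt": "25.00"}
print(map_row(row))
# {'contact.display_name': 'A. Donor', 'contribution.total_amount': '25.00'}
```

Because the mapping is plain data, it doubles as the shared agreement with the client: "OK, this is what we said."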
== LOOKING AT THE TOOLS ==
* Lots of these problems I used to try to solve with spreadsheets and formulas; ran into a lot of similar problems. It's hard, takes a lot of time, not repeatable, not very good at surfacing the exceptional data and problem areas.
* Started using some more advanced tools that folks mentioned
* [http://openrefine.org/ OpenRefine]
** Bills itself as a tool for working with messy data; was originally from Google, got spun off into its own project
** Available for all major operating systems
** Decompress the file, doesn't need to be installed; a script runs, starts a web server on your computer, opens an address in your web browser
** When you open a data set and create a project, you can do some cool things with the data
** One of the coolest is a facet: click on the header of any column, then choose to do a facet that gives a count of how many records in the DB have each value, so you can see the exceptional values, e.g. out of about 48K records, only 1K have an email type of personal or work; most are blank. If some said "wrok" or "summer home" or things that should be combined, you can click on the items in the left column and merge them together. Will even try to guess some for you and put them together.
** When you do facets, you can limit your dataset to a particular one, run multiple facets at once and only clean the data within your given facets
** Numerical facets: show me all the numerical values, see the outliers; gives you a distribution graph so you can see where the outliers are and where your distributions fall
** Change numbers--spot and merge the outliers
** Filter by numerical and date ranges, particularly helpful when there's good reason to believe that how the tool was used has changed over time. If a field has 1K out of 48K non-blank entries: if only 4 records out of 48K exist, recommend dropping that, but 1K out of 48K is enough to say "there's probably something useful there"--but also you can see if maybe someone just cared about tracking that for a year and before and after that no one cares. Combine the "email addresses without type" facet with a date facet to see whether that's the case.
** Can you script this? One person doesn't use it to make things repeatable, but it's got Python bindings; if you write a script in Python you can interact with its API
** There are scripting steps you can write to create facets on more complex criteria, and you can write some things in JavaScript to either facet data or transform it.
** Once you drill down, there are transformations you can run on the data through OpenRefine. Some transformations are built in; you can create others yourself.
** Use it to figure out the data dictionary before the munging; find it easier to use OpenRefine than MySQL for investigating the data and identifying problems in it. What are the possible values for this column? What's the distribution of those possible values?
** Use it to separate content from one cell into multiple cells.
** Some folks use it to identify duplicate data; not sure it would be really efficient on a mass scale. (Most folks aren't using OpenRefine for that.)
** Advantages over LibreOffice Calc: autofilter results on columns can't be copied elsewhere easily; OpenRefine's can.
** Hint: putting MySQL on an entirely-RAM disk can be useful when dealing with huge datasets
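OpenRefine's text facet--counting every distinct value in a column so typos and outliers stand out--can be approximated in a few lines when the tool isn't handy. A stdlib Python sketch with made-up data:

```python
from collections import Counter

# Hypothetical (contact, email_type) pairs exported from a CRM
rows = [("Ana", "work"), ("Bo", "work"), ("Cy", "wrok"),
        ("Di", ""), ("Ed", "personal"), ("Fay", "")]

# Facet on the email_type column: how many records hold each value?
facet = Counter(email_type for _, email_type in rows)
print(facet.most_common())  # rare values like "wrok" stand out as typos

# Merge the typo into its canonical value, like OpenRefine's cluster-and-merge
fixes = {"wrok": "work"}
rows = [(name, fixes.get(t, t)) for name, t in rows]
```

The same count-then-merge pattern is what the facet UI does interactively; the script version is repeatable.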

== ETL tools - Extract, Transform, Load ==
* Designed to pull data out of a data source, munge it, and put it in a destination.
* Largely used for data warehousing: a big biz with many databases, have to run reports across all of them, want a nightly script that dumps everything into one master data warehouse that you can run the big queries on.
* Also super-useful for migrations; scripted and repeatable, which is useful when you're working to migrate data from constantly-changing sources like an in-use CRM or CMS.
* Right now, groups that do online advocacy need to match people to their voter record; you pay a lot of money to Salsa or similar people who get the info in one easy-to-use place. The National Voter File project is trying to do that for free, and wants to use ETL to repeat it year by year.
* Many different tools; Jon uses Kettle by a company called Pentaho, aka [http://www.pentaho.com/product/data-integration Pentaho Data Integration]
** [https://sourceforge.net/projects/pentaho/ Non-corporate download link for Pentaho Data Integration]
* In Kettle, you configure different blocks to transform the data.
* Multiple tasks make up a migration; each task is a different block, e.g. donations, events, contacts, participants--all different tables in a DB. Migrate them each separately; run in a particular order so you can depend on previously migrated data.
* Start with a data import source, which can be fairly varied; currently pulling directly from a Postgres database, but can pull from CSV files, MySQL, Excel sheets, legacy DB formats, JSON, or query live feeds of the data
* You can combine data inputs, doing joins across sources from CSV and Excel to JSON
* Lets you break it down step by step, unlike trying to write SQL queries to do the same; those are difficult to troubleshoot because they're so dense.
* Pentaho Kettle can let you import data, get the new ID from the database, and put that new ID on your legacy data so it gets matched up.
* As the transformation is running through different steps, you can pause and look at the state of all your data at any one of these steps. You can see it being transformed as it's going through; if it's not coming out as you expect, you can go back and look at what the data looked like at an earlier successful state.
* Pentaho Kettle documentation is decent; this tool is used often by large organizations. There's an enterprise version that's part of a suite, but not sure if Kettle differs much between the community version and the enterprise version.
* ETL solves a lot of the problems in the room: cleaning up the messy data, but it also makes a lot of steps repeatable in a way that takes a lot of pain out of everything around it. e.g. when importing data into a data source, it not only cleans the data but imports it for you. Copy a file to the remote server, run the script to import the data, take a dump of the MySQL DB on the remote, save to local with a particular name: that's a repeatable step.
* You can run this on your local machine, but it can also work with remote data sources.
* Once you create a job you can put it on a remote server and run it headlessly on a schedule.
* Wrote a script that pulls data from the original source, munges it, uploads it; a scheduled Kettle job runs nightly and sends an email if it fails. Much more common usage of Kettle
* Using it for sanitizing data quite a bit, e.g. sensitive data run locally
* Many people use Kettle to send data to third-party services
* There are lots of problems that come up over and over, e.g. email address issues: "jon2gmail.com", "jon@gmailcom" etc. Create your own transforms that you can use over and over again and drag into new transformations. It can check for valid email addresses, either by format OR by sending test emails
* Kettle jobs are XML, not super readable, but can be kept in version control for updating and deployment.
** [https://github.com/MegaphoneJon/civicrm_kettle_transforms MegaphoneJon on GitHub] -- has a lot of reusable Kettle transforms up there, lots for migrating from proprietary databases to CiviCRM
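The extract-transform-load shape Kettle gives you graphically can be prototyped in plain Python: extract rows, run a format check, and route failures away from the load step. A hedged sketch--the regex is a deliberately naive format check, and the rows are made up:

```python
import re

# Deliberately simple email format check -- real-world validation is messier
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

rows = [
    {"name": "Jon", "email": "jon2gmail.com"},   # missing @
    {"name": "Ana", "email": "ana@example.org"},
    {"name": "Bo",  "email": "bo@gmailcom"},     # no dot in the domain
]

clean, errors = [], []
for row in rows:
    if EMAIL_RE.match(row["email"]):
        clean.append(row)  # would continue on to the load step
    else:
        # annotate and divert to an errors output for the client to fix
        errors.append({**row, "Error": "bad email format"})

print(len(clean), len(errors))  # 1 2
```

In Kettle this split is a validation step with two output hops; the point is the same: bad rows land in a file the client can act on instead of silently polluting the destination.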
== Deduping ==

* Two ways: 1) I've got this person in one file 5 times; 2) I've got five files and need to dedupe them all without one common identifier ID. Kettle makes it easy to do lookups between heterogeneous data sources; you can do lookups to assign a common ID for that.
* More classical deduping -- [https://github.com/datamade Datamade] is a company that has libraries of tools for the more classic deduping.
* Use the "group by" function in Kettle and munge the data.
* Dedupes can take you on an adventure; adding an admin issues field where comments are dropped in about a record--might put "possible duplicate" or "no email address"--and give that back to a client with an army of interns who can work on those.
* You can split your outputs into multiple files; run data validation steps at the end to find the ones that have a bad email or improperly formatted numbers, and add an "Error" field that notes the problems. Don't put them in the same file; put them in an errors output file, e.g. a CSV that you can share with the client so they can see the problems. Rerun the script every time they send you new data, and put the problem data someplace where the client can look at it and help clean up. Tell them "fix it in your source data" if your tool is usable, unlike Excel.
* Ability to say "anytime you do X, say Y", e.g. "every time you see Reverend as the title, change it to Rev" or "drop it into a separate field"
* Data dictionaries and documentation: valuable to make those, but it can be very complicated to translate your logic into a chart. Like to give at least a basic explanation to the client.
* Visual ETL tools are in many ways self-documenting; you can see the process visually, and if you find a problem later it's pretty easy to see what step did it and why.
* Will document the new data structure and any gotchas; will usually put it up someplace searchable on the web.
* A lot of data documentation anxiety comes because of staff turnover at organizations; people leave and no one knows why things were done. "What would future me/someone in the same position want to know?" Write down a lot, but variable.
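The "group by" dedupe above amounts to bucketing records on a normalized key and flagging any bucket with more than one member. A stdlib Python sketch; the normalization rule (lowercased, trimmed email) is just one illustrative choice:

```python
from collections import defaultdict

# Hypothetical contact rows from two merged files
rows = [
    {"name": "Jon Goldberg", "email": "Jon@Example.org"},
    {"name": "J. Goldberg",  "email": "jon@example.org "},
    {"name": "Ana Lopez",    "email": "ana@example.org"},
]

# Group on a normalized key -- here a lowercased, whitespace-trimmed email
buckets = defaultdict(list)
for row in rows:
    buckets[row["email"].strip().lower()].append(row)

# Buckets holding more than one record are candidate duplicates
possible_dupes = {k: v for k, v in buckets.items() if len(v) > 1}
print(list(possible_dupes))  # ['jon@example.org']
```

Fuzzier matching (names, addresses) is where the Datamade-style libraries mentioned above earn their keep; exact-key grouping like this only catches the easy cases.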
Latest revision as of 18:50, 1 December 2017
DATA MUNGING w Jon Goldberg
Go around: what is folks' interests? What are the problem spaces we work in?
- Organizations looking to get many different sources of data in many different forms all centralized, organized, de-siloed, standardized
- Migration projects--Drupal, Joomla to WordPress, also handwritten CMSes from the 90s.
- Donor lists, stuff in Excel or Access, trying to be pulled together; huge PHP files and MySQL queries is current approach
- Coming from a PM position, wanting to better understand the work and the tools
- Migrations, more CRMs than CMSes, work with lots of orgs that do things in funny duct tape ways, series of Excel spreadsheets with macros no one understands
- Campaign finance data, lots of feelings about bad gov't data not-standardized across jurisdictions
- Amazon now has a 990 library that's OCR-readable; could have an impact if 990s are more accessible and usable; exploring ways to use that data.
- Dealing with sensitive data, want to articulate the problems and issues with the data currently while not incurring risk
- Putting numbers on value of environmental resources to make arguments against certain projects
- CMS and CRM migrations (mentioned frequently)
- Combining similar data from multiple sources, esp gov't data
What tools are folks using?
- MySQL
- sed
- text editor
- regular expressions
- Pentaho data integration
- OpenRefine
- Excel and LibreCalc formulas and pivot tables
- Custom scripts (PHP, Perl, bash etc)
- WP-all-import in WordPress
- Feeds module in Drupal and Backdrop
- RSS and Atom feeds
- QGIS
- StackExchange!
What sorts of frustrations are folks encountering with what they're doing now?
- Always inheriting really poor information architecture; no thought went into data structure, things are jumbled and jammed in badly. Part of the headache is working with the client to help them understand information architecture so there's a home for this data to go to.
- Frustrating to do migration or munging from very messy sources into a destination, only for the client not to understand why it's not perfectly sensible in the new format. Process of explaining that there's no magical way to make data that didn't make sense before now make sense.
- Documentation isn't available--maybe the architecture does make sense but you don't know what it means!
- Proprietary formats: someone knows what all the fields are.
- Tedium of manual processes; longing to learn more about automating them.
- Automating things that are similar but not exactly the same; figuring out just how far automation can go.
- Working with small budgets and pouring so much of them into migrations, knowing more can be automated
- Working with poorly validated data
- Debt from previous migrations
- Joomla in general
Strategies and tools we've used to mitigate the pain points we've discussed
- Thinking about the problem of data architecture being obscure, proprietary, or hard to understand: have made the mistake of just trying to figure it out alone, but it's more valuable to think about who else might know what it is and could figure it out more easily. Send a few example rows to somebody who can explain the structure; saves a lot of time.
- Problem: so many of these migrations are bad munges, with so many different, individual cases, that no one approach covers them all.
- Do as much planning up front as possible so you know as much as possible about the data, where it's going and what it means to the client before starting the technical work.
- When you start using tools, you find problems in the data that throw the planning into some chaos. Come up with a data dictionary: the field in the source data, how we intend to map it to the destination data. Work with people who aren't technical to develop that; once you start using the tools you see all the problems and flaws with the data dictionary but it still forms the basis of a shared agreement with the client. "OK, this is what we said, but this is why we can't do that."
- Hands-on process with the client: "You understand your data better than I ever will; I need your help to understand how to do it well." At very least available to answer questions about the data dictionary and plans.
- Find the things that are irrelevant, never used properly, or correct but too old to be useful; the smaller the data set, the fewer unique cases drag you down.
- Old content that isn't actually useful or would be too much trouble to migrate
- Old org CRM-type data
- Working with your own data and gov't data that's more or less opaque: if you can't find any documentation on a data set, try tracking down the tech folks who work on it and talk to them directly.
- Often times when you're looking at the data as you prepare to migrate, you realize that the new data structure you're building might need to change to omit things that won't really be used.
- Sometimes you need to lose things to realize you really didn't need it in the first place!
- How do you convince clients that you don't have to migrate EVERYTHING?
- Explain to the client--there are certain things you think you want to access once a year only; perhaps keep the old system (Filemaker on a desktop in this example) available to look that information up, but not go through the trouble of migration.
- Keep archives and backups available in the original state in case that data is ever needed.
- Through educating the client, they often make the decisions to drop some data themselves.
- Consider migrating stuff into messy, admin-only fields
- Sometimes even putting names on the columns in the spreadsheet/data set to know who put that data in, how and why.
- If data has been already extracted but it's not in a great state, can try going back to the source to get a better data dump.
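The data dictionary mentioned in the planning strategy above can be as simple as a small structure: one entry per source field, recording where it maps in the destination and the transform rule. The sketch below is hypothetical; all field names and rules are invented for illustration.

```python
# A hypothetical data dictionary: one entry per source field, recording the
# destination field and the transform rule. Field names are invented.
data_dictionary = {
    "FNAME":      {"dest": "first_name", "rule": "copy as-is"},
    "LNAME":      {"dest": "last_name",  "rule": "copy as-is"},
    "EMAIL_ADDR": {"dest": "email",      "rule": "lowercase; validate format"},
    "TITLE":      {"dest": "prefix",     "rule": "normalize 'Reverend' -> 'Rev.'"},
    "FAX":        {"dest": None,         "rule": "drop; client confirmed unused"},
}

def describe(dd):
    """Render the dictionary as a simple chart to share with the client."""
    lines = []
    for src, spec in dd.items():
        dest = spec["dest"] or "(not migrated)"
        lines.append(f"{src:12} -> {dest:15} {spec['rule']}")
    return "\n".join(lines)

print(describe(data_dictionary))
```

Even when the tools later reveal flaws in the plan, a chart like this forms the shared agreement with the client that the notes above describe.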
LOOKING AT THE TOOLS
- Lots of these problems I used to try to solve with spreadsheets and formulas; ran into a lot of similar problems. It's hard, takes a lot of time, not repeatable, not very good at surfacing the exceptional data and problem areas.
- Started using some more advanced tools that folks mentioned
- OpenRefine
- Bills itself as being a tool to work with messy data; was originally from Google, got spun off into own project
- available for all major operating systems
- Decompress the file; it doesn't need to be installed. A script runs, starts a web server on your computer, and opens an address in your web browser
- When you open a data set and create a project, you can do some cool things with the data
- One of the coolest is a facet: click on the header of any column, then choose a facet that counts how many records in the DB have each value, so you can see the exceptional values. E.g., out of about 48K records, only 1K have an email type of "personal" or "work"; most are blank. If some say "wrok" or "summer home", or there are values that should be combined, you can click on the items in the left column and merge them together. It will even try to guess some for you and put them together.
- When you do facets, you can limit your dataset to a particular one, run multiple facets at once and only clean the data within your given facets
- Numerical facets: show all the numerical values with a distribution graph, so you can see where the outliers are and where your distributions fall
- Change numbers--spot and merge the outliers
- Filter by numerical and date ranges, particularly helpful when there's good reason to believe that how the tool was used has changed over time. If a field has 1K non-blank entries out of 48K: if only 4 records existed, recommend dropping it, but 1K out of 48K is enough to say "there's probably something useful there." You can also see whether someone only cared about tracking that for a year and no one cared before or after; combine the "email addresses without type" facet with a date facet to see whether that's the case.
- Can you script this? One person uses the Python bindings to make it repeatable: if you write a script in Python, you can interact with OpenRefine's API
- There are scripting steps you can write to create facets on more complex criteria, write some things in JavaScript to either facet data or transform it.
- Once you drill down there are transformations you can run on it through OpenRefine. Some transformations are built in, you can create others yourself.
- Use it to figure out the data dictionary before the munging; find it easier to use OpenRefine to do that investigation of the data and identifying problems in the data than MySQL. What are the possible values for this column? What's the distribution of those possible values?
- Use it to separate content from one cell into multiple cells.
- Some folks use it to identify duplicate data; not sure it would be really efficient on a mass scale. (Most folks aren't using OpenRefine for that.)
- Advantages over LibreOffice Calc: in Calc, autofilter results on columns can't easily be copied elsewhere; in OpenRefine they can.
- Hint: putting MySQL on an entirely RAM disk can be useful when dealing with huge datasets
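The faceting idea above is easy to sketch outside OpenRefine. This is a rough, illustrative approximation, not OpenRefine's actual implementation; the column name and values are invented, echoing the "email type" example.

```python
from collections import Counter

# Rough sketch of what a text facet does: count every distinct value in a
# column so the exceptional ones stand out. Sample data is invented.
rows = [
    {"email_type": "work"}, {"email_type": "personal"},
    {"email_type": "work"}, {"email_type": "wrok"},  # typo a facet would surface
    {"email_type": ""},     {"email_type": ""},
]

facet = Counter(r["email_type"] for r in rows)
print(facet.most_common())  # rare values like "wrok" show up at the tail

# Merging facet values, roughly what OpenRefine's cluster-and-merge does:
merge = {"wrok": "work"}
for r in rows:
    r["email_type"] = merge.get(r["email_type"], r["email_type"])
```

The counts make the blanks and the one-off typos visible at a glance, which is exactly the investigation step the notes describe doing before the munge.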
ETL tools - Extract, Transform, Load
- designed to pull data out of a data source, munge it, and put it in a destination.
- Largely used for data warehousing: a big biz with many databases, have to run reports across all of them, want a nightly script that dumps everything into one master data warehouse that you can run the big queries on.
- Also super-useful for migrations; scripted and repeatable, which is useful when you're migrating data from constantly-changing sources like an actively used CRM or CMS.
- Right now, groups that do online advocacy need to match people to their voter record; you pay a lot of money to Salsa or other similar people who get the info in one easy-to-use place. National Voter File project is trying to do that for free, want to use ETL to repeat it year by year.
- Many different tools, Jon uses Kettle by a company called Pentaho, aka Pentaho Data Integration
- In Kettle, you configure different blocks to transform the data.
- Multiple tasks make up a migration; each task is a different block, e.g. donations, events, contacts, participants, all different tables in a DB. Migrating them each separately; run in a particular order so you can depend on previously migrated data.
- Start with a data import source, which can be fairly varied; currently pulling directly from a Postgres database, but can pull from CSV files, MySQL, Excel sheets, legacy DB formats, or JSON, or query live feeds of the data
- You can combine data inputs, doing joins across sources, from CSV files and Excel sheets to JSON
- Lets you break it down step by step, unlike writing SQL queries to do the same thing, which are difficult to troubleshoot because they're so dense.
- Pentaho Kettle can let you import data, get the new ID from the database, put that new ID on your legacy data so it gets matched up.
- As the transformation is running through different steps, you can pause and look at the state of all your data at any one of these steps. You can see it being transformed as it goes; if it's not coming out the way you expect, you can go back and look at what the data looked like at an earlier, successful step.
- Pentaho Kettle documentation is decent; this tool is used often by large organizations. There's an enterprise version that's part of a suite, but not sure there's much difference in Kettle between the community and enterprise versions.
- ETL solves a lot of the problems in the room: cleaning up the messy data, but also making a lot of steps repeatable in a way that takes the pain out of everything around it. E.g., when importing data into a data source, it not only cleans the data but imports it for you. Copy a file to the remote server, run the script to import the data, take a dump of the MySQL DB on the remote, save it locally with a particular name: that's a repeatable step.
- You can run this on your local machine, but it can also work with remote data sources.
- Once you create a job, you can put it on a remote server and run it headlessly on a schedule.
- Wrote a script that pulls data from the original source, munges it, and uploads it; a scheduled Kettle job runs nightly and sends an email if it fails. A much more common usage of Kettle
- Using it for sanitizing data quite a bit, e.g. sensitive data running locally
- Many people use Kettle to send data to third party services
- There are lots of problems that come up over and over, e.g. email address issues, "jon2gmail.com", "jon@gmailcom" etc. Create your own transforms that you can use over and over again and can drag into new transformations. It can check for valid email addresses, either by format OR by sending test emails
- Kettle jobs are XML, not super readable, but can be kept in version control for updating and deployment.
- MegaphoneJon on GitHub has a lot of reusable Kettle transforms up there, including lots for migrating from proprietary databases to CiviCRM
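The extract/transform/load pattern described above can be sketched as three small functions, so that, as with Kettle, you can stop between steps and inspect the rows. This is a toy illustration, not Kettle itself; the source data and field names are invented.

```python
import csv
import io

# Toy ETL pipeline in the spirit of the notes: each step is a separate
# function, so intermediate state can be inspected between steps.

def extract(csv_text):
    """Extract: pull rows out of a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: normalize the email field."""
    for row in rows:
        row["email"] = row["email"].strip().lower()
    return rows

def load(rows, out):
    """Load: write the munged rows to the destination."""
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

source = "name,email\nJon, JON@EXAMPLE.ORG \n"
rows = extract(source)   # step 1: read the source
rows = transform(rows)   # step 2: munge; inspect `rows` here while debugging
out = io.StringIO()
load(rows, out)          # step 3: write the destination
```

Breaking the work into named steps is what makes the process repeatable and debuggable, versus one dense SQL query or spreadsheet formula.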
Deduping
- Two ways: 1) I've got this person in one file 5 times; 2) I've got five files and need to dedupe across them without a common identifier. Kettle makes it easy to do lookups between heterogeneous data sources; you can do lookups to assign a common ID for that.
- More classical deduping -- Datamade is a company that has libraries of tools for the more classic deduping.
- Use "group by" function in Kettle and munge the data.
- Dedupes can take you on an adventure; add an "admin issues" field where comments about a record are dropped in, e.g. "possible duplicate" or "no email address," and give that back to the client, whose army of interns can work through them.
- You can split your outputs into multiple files: run data validation steps at the end to find rows with bad emails or improperly formatted numbers, and add an "Error" field that notes the problems. Don't put them in the same file; put them in an errors output file, e.g. a CSV that you can share with the client so they can see the problems. Rerun the script every time they send you new data, and put the problem data someplace where the client can look at it and help clean up. Tell them "fix it in your source data" if your tool is usable, unlike Excel.
- Ability to say "anytime you do X, say Y", e.g. "every time you see Reverend as the title, change it to Rev" or "drop it into a separate field"
- Data dictionaries and documentation: valuable to make those, but can be very complicated to translate your logic into a chart. Like to give at least a basic explanation to the client.
- Visual ETL tools are in many ways self-documenting; you can see the process visually, find a problem later, pretty easy to see what step did it and why.
- Will document the new data structure and any gotchas; will usually put it up someplace searchable on the web.
- A lot of data documentation anxiety comes from staff turnover at organizations: people leave and no one knows why things were done. "What would future me/someone in the same position want to know?" Write down a lot, but it varies.
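The validate-and-split step described above (bad rows get an "Error" note and go to a separate errors output for the client) can be sketched like this. The email check is a deliberately loose format test, and the sample rows are invented, reusing the malformed addresses mentioned earlier.

```python
import re

# Sketch of routing rows that fail a basic email-format check into a
# separate errors output, per the notes above. Sample rows are invented.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

source_rows = [
    {"name": "Jon", "email": "jon@example.org"},
    {"name": "Ann", "email": "jon2gmail.com"},  # missing "@"
    {"name": "Bea", "email": "jon@gmailcom"},   # no dot in the domain
]

clean, errors = [], []
for row in source_rows:
    if EMAIL_RE.match(row["email"]):
        clean.append(row)
    else:
        errors.append({**row, "Error": "bad email format"})

# `clean` goes into the import; `errors` goes to an errors CSV for the client.
```

Because the split is scripted, it can be rerun every time the client sends new data, which is what makes the "fix it in your source data" workflow practical.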