DATA MUNGING w Jon Goldberg
- 1 Go around: what is folks' interests? What are the problem spaces we work in?
- 2 What tools are folks using?
- 3 What sorts of frustrations are folks encountering with what they're doing now?
- 4 Strategies and tools we've used to mitigate the pain points we've discussed
- 5 LOOKING AT THE TOOLS
- 6 ETL tools - Extract, Transform, Load
- 7 Deduping
Go around: what is folks' interests? What are the problem spaces we work in?
- Organizations looking to get many different sources of data in many different forms all centralized, organized, de-siloed, standardized
- Migration projects--Drupal, Joomla to WordPress, also handwritten CMSes from the 90s.
- Donor lists, stuff in Excel or Access, trying to be pulled together; huge PHP files and MySQL queries is current approach
- Coming from a PM position, wanting to better understand the work and the tools
- Migrations, more CRMs than CMSes, work with lots of orgs that do things in funny duct tape ways, series of Excel spreadsheets with macros no one understands
- Campaign finance data, lots of feelings about bad gov't data not-standardized across jurisdictions
- Amazon now has 990 library that's OCR-readable; could have an impact if they're more accessible and usable, exploring ways to use that data.
- Dealing with sensitive data, want to articulate the problems and issues with the data currently while not incurring risk
- Putting numbers on value of environmental resources to make arguments against certain projects
- CMS and CRM migrations (mentioned frequently)
- Combining similar data from multiple sources, esp gov't data
What tools are folks using?
- text editor
- regular expressions
- Pentaho data integration
- Excel and LibreCalc formulas and pivot tables
- Custom scripts (PHP, Perl, bash etc)
- WP-all-import in WordPress
- Feeds module in Drupal and Backdrop
- RSS and Atom feeds
What sorts of frustrations are folks encountering with what they're doing now?
- Always inheriting really poor information architecture; no thought went into data structure, things are jumbled and jammed badly. Part of headache is working with client to help them understand information architecture so there's a home for this data to go to.
- Frustration to do migration or munging from sources that are very messy to start with to a destination, but then the client doesn't understand why it's not perfectly sensible in the new format. Process of explaining that there's no magical way to make data that didn't make sense before now make sense.
- Documentation isn't available--maybe the architecture does make sense but you don't know what it means!
- Proprietary formats: someone knows what the fields all are.
- Tedious of manual processes; long to learn more about automating processes.
- Automating things that are similar but not exactly the same; figuring just how far.
- Working with small budgets and pouring so much of them into migrations, knowing more can be automated
- Working with poorly validated data
- Debt from previous migrations
- Joomla in general
Strategies and tools we've used to mitigate the pain points we've discussed
- Thinking about problem of data architecture being obscure, proprietary or hard to understand: have made mistake of just trying to figure it out alone, but more valuable to think about who else might know what it is, can figure it out more easily than me. Send a few example rows to somebody who can explain the structure, saves a lot of time.
- Problem: so many of these migrations are bad munges that there are so many different, individual cases that no one
- Do as much planning up front as possible so you know as much as possible about the data, where it's going and what it means to the client before starting the technical work.
- When you start using tools, you find problems in the data that throw the planning into some chaos. Come up with a data dictionary: the field in the source data, how we intend to map it to the destination data. Work with people who aren't technical to develop that; once you start using the tools you see all the problems and flaws with the data dictionary but it still forms the basis of a shared agreement with the client. "OK, this is what we said, but this is why we can't do that."
- Hands-on process with the client: "You understand your data better than I ever will; I need your help to understand how to do it well." At very least available to answer questions about the data dictionary and plans.
- Find the things that are irrelevant, never used properly, correct but too old to be useful; the smaller the data set, the less unique cases drag you down.
- Old content that isn't actually useful or would be too much trouble to migrate
- Old org CRM-type data
- Working with own data and gov't data that's more or less opaque; if you can't find any documentation on a data set is trying to track down the tech folks who are working on that, talk to them directly.
- Often times when you're looking at the data as you prepare to migrate, you realize that the new data structure you're building might need to change to omit things that won't really be used.
- Sometimes you need to lose things to realize you really didn't need it in the first place!
- How do you convince clients that you don't have to migrate EVERYTHING?
- Explain to the client--there are certain things you think you want to access once a year only; perhaps keep the old system (Filemaker on a desktop in this example) available to look that information up, but not go through the trouble of migration.
- Keep archives and backups available in the original state in case that data is ever needed.
- Through educating the client, they often make the decisions to drop some data themselves.
- Consider migrating stuff into a messy, admin-only fields
- Sometimes even putting names on the columns in the spreadsheet/data set to know who put that data in, how and why.
- If data has been already extracted but it's not in a great state, can try going back to the source to get a better data dump.
LOOKING AT THE TOOLS
- Lots of these problems I used to try to solve with spreadsheets and formulas; ran into a lot of similar problems. It's hard, takes a lot of time, not repeatable, not very good at surfacing the exceptional data and problem areas.
- Started using some more advanced tools that folks mentioned
- Bills itself as being a tool to work with messy data; was originally from Google, got spun off into own project
- available for all major operating systems
- Decompress the file, doesn't need to be installed; a script runs,
starts a web server on your computer, opens an address in your web browser
- When you open a data set and create a project, you can do some cool things with the data
- One of the coolest is a facet: click on the header of any column, then choose to do a facet that gives a count of how many records in the DB have every value so you can see the exceptional values, e.g. out of about 48K records, only 1K as an email type of personal or work, most or blank. If some said "wrok" or "summer home" or things that should be combined, you can click on the items from the left column and merge them together. Will even try to guess some for you and put them together.
- When you do facets, you can limit your dataset to a particular one, run multiple facets at once and only clean the data within your given facets
- Numerical facets: show me all the numerical values, see the outliers, give you a distribution graph, see where outliers are and where your distributions fall
- Change numbers--spot and merge the outliers
- Filter by numerical and date ranges, particularly helpful when there's good reason to believe that how the tool was used previously has changed over time. If a field has 1K out of 48K non-blank entries; if out of 48K only 4 records exist, recommend dropping that, but 1K out of 48K is enough to say "there's probably something useful there" but also you can see if maybe someone just cared about tracking that for a year and before and after that no one cares. Combine the "email addresses without type" facet with a date facet to see whether that's
- Can you script this? One person doesn't use it to make it repeatable because it's got Python bindings; if you write a script in Python you can interact with its API
- Once you drill down there are transformations you can run on it through OpenRefine. Some transformations are built in, you can create others yourself.
- Use it to figure out the data dictionary before the munging; find it easier to use OpenRefine to do that investigation of the data and identifying problems in the data than MySQL. What are the possible values for this column? What's the distribution of those possible values?
- Use it to separate content from one cell into multiple cells.
- Some folks use it identify duplicate data; not sure it would be really efficient on a mass scale. (Most folks aren't using OpenRefine for that.)
- Advantages over LibreOffice Calc: autofilter results on columns can't be copied elsewhere easily, OpenRefine can.
- Hint: putting MySQL on an entirely RAM disk can be useful when dealing with huge datasets
ETL tools - Extract, Transform, Load
- designed to pull data out of a data source, munge it, and put it in a destination.
- Largely used for data warehousing: a big biz with many databases, have to run reports across all of them, want a nightly script that dumps everything into one master data warehouse that you can run the big queries on.
- Also super-useful for migrations, scripted and repeatable which is useful when you're working to migrate data from constantly-changing sources like an active use CRM or CMS.
- Right now, groups that do online advocacy need to match people to their voter record; you pay a lot of money to Salsa or other similar people who get the info in one easy-to-use place. National Voter File project is trying to do that for free, want to use ETL to repeat it year by year.
- Many different tools, Jon uses Kettle by a company called Pentaho, aka Pentaho Data Integration
- In Kettle,you configure different blocks to transform the data.
- Multiple tasks make up a migration; each task is a different block, e.g. donations, events, contacts, participants, all different tables in a DB. Migrating them each separately; run in a particular order so you can depend on previously migrated data.
- Start with a data import source, which can be fairly varied; currently pulling directly from a Postgres database but can pull from CSV files, MySQL, Excel Sheets, legacy DB formats, json, query live feeds of the data
- You can combine data inputs, do joins on CSV files from Excel to json
- Lets you break it down step by step, unlike trying to write SQL queries to do the same, they're difficult to troubleshoot because they're so dense.
- Pentaho Kettle can let you import data, get the new ID from the database, put that new ID on your legacy data so it gets matched up.
- As the transformation is running through different steps, you can pause and look at the state of all your data at any one of these steps. You can see it being transformed as it's going through; if it's not coming out you expect you can go back and look what the data looked like at an earlier successful state.
- Pentaho Kettle documentation is decent; this tool is used often by large organizations. There's an enterprise version that's part of a suite, but not sure if Kettle makes much difference between community version and enterprise version.
- ETL solves a lot of the problems in the room: cleaning up the messy data, but also makes a lot of steps repeatable in a way that takes a lot of pain out of everything around it. e.g. as importing data into a data source, it not only cleans the data but imports it for you. Copy a file to the remote server, run the script to import the data, take a dump of MySQL DB on remote, save to local with a particular name: that's a repeatable step.
- You can run this on your local machine, but it can also work with remote data sources.
- Once you create a job you can put it on a remote server and running it headlessly on a schedule.
- Wrote a script that pulls data from original source, munges it, uploads it; scheduled Kettle job that runs nightly, sends an email if it fails. Much more common usage of Kettle
- Using it for sanitizing data quite a bit, e.g. sensitive data running locally
- Many people use Kettle to send data to third party services
- There are lots of problems that come up over and over, e.g. email address issues, "jon2gmail.com", "jon@gmailcom" etc. Create your own transforms that you can use over and over again and can drag into new transformations. It can check for valid email addresses, either by format OR by sending test emails
- Kettle jobs are XML, not super readable, but can be kept in version control for updating and deployment.
- MegaphoneJon on GitHub -- has a lot of reusable Kettle transforms up there, lots for migrating from proprietary databases to CiviCRM
- two ways: 1) I've got this person in one file 5 times, 2) I've got five files and need to dedupe them all without one common identifier ID. Kettle makes it easy to do lookups between heterogeneous data sources, you can do lookups to assign a common ID to do that.
- More classical deduping -- Datamade is a company that has libraries of tools for the more classic deduping.
- Use "group by" function in Kettle and munge the data.
- Dedupes can take you on an adventure; adding an admin issues field where comments are dropped in about a record, might put "possible duplicate" or "no email address," give that back to the client with an army of interns who can work on those.
- You can split your outputs into multiple files; run data validation steps at the end to find ones that have bad email, improperly formatted numbers, add an "Error" field that notes the problems; don't put them in the same file, put them in an errors output file, e.g. a CSV that you can share with the client so they can see the problems. Rerun the script every time they send you new data, put the problem data someplace where the client can look at it and help clean up. Tell them "fix it in your source data" if your tool is usable, unlike Excel.
- Ability to say "anytime you do X, say Y", e.g. "every time you see Reverend as the title, change it to Rev" or "drop it into a separate field"
- Data dictionaries and documentation: valuable to make those, but can be very complicated to translate your logic into a chart. Like to give at least a basic explanation to the client.
- Visual ETL tools are in many ways self-documenting; you can see the process visually, find a problem later, pretty easy to see what step did it and why.
- Will document the new data structure and any gotchas; will usually put it up someplace searchable on the web.
- A lot of data ocumentation anxiety comes because of staff turnover at organizations, people leave and no one knows why things were done. "What would future me/someone in the same position want to know?" Write down a
lot, but variable.