Data Munging
DATA MUNGING w Jon Goldberg
Go around: what is folks' interests? What are the problem spaces we work in?
- Organizations looking to get many different sources of data in many
different forms all centralized, organized, de-siloed, standardized
- Migration projects--Drupal, Joomla to WordPress, also handwritten
CMSes from the 90s.
- Donor lists, stuff in Excel or Access, trying to be pulled together;
huge PHP files and MySQL queries is current approach
- Coming from a PM position, wanting to better understand the work and
the tools
- Migrations, more CRMs than CMSes, work with lots of orgs that do
things in funny duct tape ways, series of Excel spreadsheets with macros no one understands
- Campaign finance data, lots of feelings about bad gov't data
not-standardized across jurisdictions
- Amazon now has 990 library that's OCR-readable; could have an impact
if they're more accessible and usable, exploring ways to use that data.
- Dealing with sensitive data, want to articulate the problems and
issues with the data currently while not incurring risk
- Putting numbers on value of environmental resources to make arguments
against certain projects
- CMS and CRM migrations (mentioned frequently)
- Combining similar data from multiple sources, esp gov't data
What tools are folks using?
- MySQL
- sed
- text editor
- regular expressions
- Pentaho data integration
- OpenRefine
- Excel and LibreCalc formulas and pivot tables
- Custom scripts (PHP, Perl, bash etc)
- WP-all-import in WordPress
- Feeds module in Drupal and Backdrop
- RSS and Atom feeds
- QGIS
- StackExchange!
What sorts of frustrations are folks encountering with what they're doing now?
- Always inheriting really poor information architecture; no thought
went into data structure, things are jumbled and jammed badly. Part of headache is working with client to help them understand information architecture so there's a home for this data to go to.
- Frustration to do migration or munging from sources that are very
messy to start with to a destination, but then the client doesn't understand why it's not perfectly sensible in the new format. Process of explaining that there's no magical way to make data that didn't make sense before now make sense.
- Documentation isn't available--maybe the architecture does make sense
but you don't know what it means!
- Proprietary formats: someone knows what the fields all are.
- Tedious of manual processes; long to learn more about automating
processes.
- Automating things that are similar but not exactly the same; figuring
just how far.
- Working with small budgets and pouring so much of them into
migrations, knowing more can be automated
- Working with poorly validated data
- Debt from previous migrations
- Joomla in general
Open conversation: talk about the strategies and tools we've used to mitigate the pain points we've discussed.
- Thinking about problem of data architecture being obscure, proprietary
or hard to understand: have made mistake of just trying to figure it out alone, but more valuable to think about who else might know what it is, can figure it out more easily than me. Send a few example rows to somebody who can explain the structure, saves a lot of time.
- Problem: so many of these migrations are bad munges that there are so
many different, individual cases that no one
- Do as much planning up front as possible so you know as much as
possible about the data, where it's going and what it means to the client before starting the technical work.
- When you start using tools, you find problems in the data that throw
the planning into some chaos. Come up with a data dictionary: the field in the source data, how we intend to map it to the destination data. Work with people who aren't technical to develop that; once you start using the tools you see all the problems and flaws with the data dictionary but it still forms the basis of a shared agreement with the client. "OK, this is what we said, but this is why we can't do that."
- Hands-on process with the client: "You understand your data better
than I ever will; I need your help to understand how to do it well." At very least available to answer questions about the data dictionary and plans; they can be forced
- Find the things that are irrelevant, never used properly, correct but
too old to be useful; the smaller the data set, the less unique cases drag you down.
- Old content that isn't actually useful or would be too much trouble
to migrate
- Old org CRM-type data
- Working with own data and gov't data that's more or less opaque; if
you can't find any documentation on a data set is trying to track down the tech folks who are working on that, talk to them directly.
- Often times when you're looking at the data as you prepare to migrate,
you realize that the new data structure you're building might need to change to omit things that won't really be used.
- Sometimes you need to lose things to realize you really didn't need it
in the first place!
- How do you convince clients that you don't have to migrate EVERYTHING?
- Explain to the client--there are certain things you think you want to
access once a year only; perhaps keep the old system (Filemaker on a desktop in this example) available to look that information up, but not go through the trouble of migration.
- Keep archives and backups available in the original state in case
that data is ever needed.
- Through educating the client, they often make the decisions to drop
some data themselves.
- Consider migrating stuff into a messy, admin-only fields
- Sometimes even putting names on the columns in the spreadsheet/data
set to know who put that data in, how and why.
- If data has been already extracted but it's not in a great state, can
try going back to the source to get a better data dump.
LOOKING AT THE TOOLS
- Lots of these problems I used to try to solve with spreadsheets and
formulas; ran into a lot of similar problems. It's hard, takes a lot of time, not repeatable, not very good at surfacing the exceptional data and problem areas.
- Started using some more advanced tools that folks mentioned
- OpenRefine
- Bills itself as being a tool to work with messy data; was originally
from Google, got spun off into own project
- available for all major operating systems
- Decompress the file, doesn't need to be installed; a script runs,
starts a web server on your computer, opens an address in your web browser
- When you open a data set and create a project, you can do some cool
things with the data
- One of the coolest is a facet: click on the header of any column,
then choose to do a facet that gives a count of how many records in the DB have every value so you can see the exceptional values, e.g. out of about 48K records, only 1K has an email type of personal or work, most or blank. If some said "wrok" or "summer home" or things that should be combined, you can click on the items from the left column and merge them together. Will even try to guess some for you and put them together.
- When you do facets, you can limit your dataset to a particular one,
run multiple facets at once and only clean the data within your given facets
- Numerical facets: show me all the numerical values, see the outliers,
give you a distribution graph, see where outliers are and where your distributions fall
- Change numbers--spot and merge the outliers
- Filter by numerical and date ranges, particularly helpful when
there's good reason to believe that how the tool was used previously has changed over time. If a field has 1K out of 48K non-blank entries; if out of 48K only 4 records exist, recommend dropping that, but 1K out of 48K is enough to say "there's probably something useful there" but also you can see if maybe someone just cared about tracking that for a year and before and after that no one cares. Combine the "email addresses without type" facet with a date facet to see whether that's
- Can you script this? One person doesn't use it to make it repeatable
because it's got Python bindings; if you write a script in Python you can interact with its API
- There are scripting steps you can write to create facets on more
complex criteria, write some things in JavaScript to either facet data or transform it.
- Once you drill down there are transformations you can run on it
through OpenRefine. Some transformations are built in, you can create others yourself.
- Use it to figure out the data dictionary before the munging; find it
easier to use OpenRefine to do that investigation of the data and identifying problems in the data than MySQL. What are the possible values for this column? What's the distribution of those possible values?
- Use it to separate content from one cell into multiple cells.
- Some folks use it identify duplicate data; not sure it would be
really efficient on a mass scale. (Most folks aren't using OpenRefine for that.)
- Advantages over LibreOffice Calc: autofilter results on columns can't
be copied elsewhere easily, OpenRefine can.
- Hint: putting MySQL on an entirely RAM disk can be useful when
dealing with huge datasets
- ETL tools - Extract, Transform, Load
- designed to pull data out of a data source, munge it, and put it in a
destination.
- Largely used for data warehousing: a big biz with many databases,
have to run reports across all of them, want a nightly script that dumps everything into one master data warehouse that you can run the big queries on.
- Also super-useful for migrations, scripted and repeatable which is
useful when you're working to migrate data from constantly-changing sources like an active use CRM or CMS.
- Right now, groups that do online advocacy need to match people to
their voter record; you pay a lot of money to Salsa or other similar people who get the info in one easy-to-use place. National Voter File project is trying to do that for free, want to use ETL to repeat it year by year.
- Many different tools, Jon uses Kettle by a company called Pentaho,
- Non-corporate download link for Pentaho Data Integration
- In Kettle,you configure different blocks to transform the data.
- Multiple tasks make up a migration; each task is a different block,
e.g. donations, events, contacts, participants, all different tables in a DB. Migrating them each separately; run in a particular order so you can depend on previously migrated data.
- Start with a data import source, which can be fairly varied;
currently pulling directly from a Postgres database but can pull from CSV files, MySQL, Excel Sheets, legacy DB formats, json, query live feeds of the data
- You can combine data inputs, do joins on CSV files from Excel to json
- Lets you break it down step by step, unlike trying to write SQL
queries to do the same, they're difficult to troubleshoot because they're so dense.
- Pentaho Kettle can let you import data, get the new ID from the
database, put that new ID on your legacy data so it gets matched up.
- As the transformation is running through different steps, you can
pause and look at the state of all your data at any one of these steps. You can see it being transformed as it's going through; if it's not coming out you expect you can go back and look what the data looked like at an earlier successful state.
- Pentaho Kettle documentation is decent; this tool is used often by
large organizations. There's an enterprise version that's part of a suite, but not sure if Kettle makes much difference between community version and enterprise version.
- ETL solves a lot of the problems in the room: cleaning up the messy
data, but also makes a lot of steps repeatable in a way that takes a lot of pain out of everything around it. e.g. as importing data into a data source, it not only cleans the data but imports it for you. Copy a file to the remote server, run the script to import the data, take a dump of MySQL DB on remote, save to local with a particular name: that's a repeatable step.
- You can run this on your local machine, but it can also work with
remote data sources.
- Once you create a job you can put it on a remote server and running
it headlessly on a schedule.
- Wrote a script that pulls data from original source, munges it,
uploads it; scheduled Kettle job that runs nightly, sends an email if it fails. Much more common usage of Kettle
- Using it for sanitizing data quite a bit, e.g. sensitive data running
locally
- Many people use Kettle to send data to third party services
- There are lots of problems that come up over and over, e.g. email
address issues, "jon2gmail.com", "jon@gmailcom" etc. Create your own transforms that you can use over and over again and can drag into new transformations. It can check for valid email addresses, either by format OR by
- Kettle jobs are XML, not super readable, but can be kept in version
control for updating and deployment.
- MegaphoneJon on GitHub -- has a lot of reusable Kettle transforms up there, lots for migrating from proprietary databases to CiviCRM
- Deduping
- two ways: 1) I've got this person in one file 5 times, 2) I've got
five files and need to dedupe them all without one common identifier ID. Kettle makes it easy to do lookups between heterogeneous data sources, you can do lookups to assign a common ID to do that.
- More classical deduping -- Datamade is a company that has libraries
of tools for the more classic deduping.
- Use "group by" function in Kettle and munge the data.
- Dedupes can take you on an adventure; adding an admin issues field
where comments are dropped in about a record, might put "possible duplicate" or "no email address," give that back to the client with an army of interns who can work on those.
- You can split your outputs into multiple files; run data validation
steps at the end to find ones that have bad email, improperly formatted numbers, add an "Error" field that notes the problems; don't put them in the same file, put them in an errors output file, e.g. a CSV that you can share with the client so they can see the problems. Rerun the script every time they send you new data, put the problem data someplace where the client can look at it and help clean up. Tell them "fix it in your source data" if your tool is usable, unlike Excel.
- Ability to say "anytime you do X, say Y", e.g. "every time you see
Reverend as the title, change it to Rev" or "drop it into a separate field"
- Data dictionaries and documentation: valuable to make those, but can
be very complicated to translate your logic into a chart. Like to give at least a basic explanation to the client.
- Visual ETL tools are in many ways self-documenting; you can see the
process visually, find a problem later, pretty easy to see what step did it and why.
- Will document the new data structure and any gotchas; will usually put
it up someplace searchable on the web.
- A lot of data ocumentation anxiety comes because of staff turnover at
organizations, people leave and no one knows why things were done. "What would future me/someone in the same position want to know?" Write down a lot, but variable.