When Spreadsheets Run Their Course
Spreadsheets are convenient and often fun to use. They (normally) load quickly, are transportable using thumb drives/e-mail/etc., and stand a good chance of being openable in future versions of your favorite spreadsheet software. It is easy to open one up and simply start typing, formulating rules as you go along. This flexibility is essential for data analysis where you might want to just grab chunks of data and apply various mathematical transformations and/or tests against them.
Analysis is different than the aggregation, control, and data storage operations that you'll normally need outside of it however. You might have heard the following acronym: GIGO. It stands for 'garbage in, garbage out' and it represents a common problem inexperienced people find when trying to utilize data. When your data collection, control, and storage is not done properly that data can effectively become garbage. Any subsequent analysis of said data will suffer. This can and likely will happen even if you are using an online, shared type of spreadsheet.
Aggregate, Control, and Store
Aggregation refers to data collection processes, both manual and automatic. Let's say we doing some kind of weather project and are collecting temperature readings from various geographical points in a city. You have a person with some temperature reading equipment at the various points collecting readings every fifteen minutes for ten hours. The minimum number of readings you are going to get will be 10*4*x or 40x where 'x' represents the number of collection points for ten hours (4 readings per hour).
If you used a spreadsheet, or even a series of spreadsheets, to collect that temperature data you might find,
- a) Incomplete data -- various data points missed.
- b) Improperly entered data -- someone entered 800.5 degrees!
- c) Useless data -- operator was collecting from the wrong point.
Spreadsheets do not normally stop you from entering bad, insufficient, or incomplete data. Errors in their use exemplify the classic case of paying later for early convenience.
Control and storage issues are largely related. Control refers to which persons can create, view, modify, and delete data records and under what circumstances these actions can be performed. Returning to our temperature example, let's say that only a laboratory manager (and not a field technician) can delete any reading. A technician can flag the reading as potentially bad but cannot actually remove it. It would be difficult to enforce this rule on a spreadsheet (it is possible using protected ranges and code, but not trivial). The rule that denies the technician the freedom to remove the record is a data control issue. It impacts the integrity of the stored data.
Pure storage issues involve data retention, backups, and other performance-related concerns. If our temperature experiment expands to other cities the number of data points will expand significantly. Although today's spreadsheet software can easily contain millions of data points it is not the best medium for long-term storage due to the very real possibility of data loss/corruption. Having all of your data just a click away is a bit like having your life savings in cash on your desktop. It just is not safe.
What to Do
Get or build some software that has the following features:
- a) Controls when, how, and who can access the data.
- b) Allows for periodic and unattended backups.
- c) Keeps you and your staff from entering garbage.
- d) Makes it easy to periodically audit your backups (just because so-and-so says it can be recovered is meaningless).
Citation: Spreadsheets Are Dogmeat. (2015). Retrieved Wed Mar 22 22:12:39 2017, from http://www.limsexpert.com/cgi-bin/bixchange/bixchange.cgi?pom=limsexpert3;iid=readMore;go=1428683576