
Using Data Mining Techniques in the Healthcare Sector

Data mining is the process of exploring large datasets with the goal of identifying patterns and uncovering connections that help solve business-related problems. Since the amount of data that needs to be processed is huge, the analysis is done with specialized software applications.

Electronic Health Records, also known as EHRs, are the main sources of data in the health care industry. Through data mining, specialists can evaluate whether a certain treatment or drug is actually effective, for example.


The process begins by picking a subsample of the original data, which must be as relevant as possible. Why use sampling? Because even with today's lightning-fast computers, it would take far too much time to process all the information. Additionally, working with very large datasets increases costs.


It is crucial to pick a relevant sample, though; the chosen subset must be almost as representative as the entire dataset it was extracted from. Ideally, the sample should include all the patterns that can be found in the large dataset.


Data mining specialists who work in the health care industry often use the "sampling with replacement" technique. This is a statistical approach in which individual health records are extracted randomly and then reinserted into the initial dataset. Let us assume that we want to study the link between coffee consumption and blood pressure, using a database which stores the information for 100,000 people. The algorithm extracts a person's record without removing it from the population; this means that the next extraction could retrieve the same person's data, because each draw is made from the full set of 100,000 records rather than from the remaining 99,999. Because every draw is made under identical conditions, sampling with replacement helps scientists identify data patterns accurately.
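Here is a minimal sketch of sampling with replacement, assuming each record is a simple tuple; the field names and the sample size are hypothetical and not taken from any real EHR system.

```python
import random

# Hypothetical records: (person_id, cups_of_coffee_per_day, systolic_bp).
population = [(i, random.randint(0, 6), random.randint(100, 160))
              for i in range(100_000)]

# Sampling with replacement: every draw is taken from the full population,
# so the same record can show up more than once in the sample.
sample = [random.choice(population) for _ in range(5_000)]

# The standard library offers the same thing as a one-liner.
sample = random.choices(population, k=5_000)
```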


It is important to understand that sampling always leads to some loss of information: the smaller the sample, the greater the errors. With progressive sampling, the software starts by analyzing a small sample and then increases its size until the error is acceptable.
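A minimal sketch of progressive sampling, reusing the coffee/blood-pressure records from above; the stopping criterion (two consecutive samples agreeing on mean blood pressure within a tolerance) is an assumption made for illustration, not a standard rule.

```python
import random
import statistics

def progressive_sample(population, start=500, growth=2, tolerance=0.5):
    """Keep enlarging the sample until the estimate of interest stabilizes.

    Hypothetical stopping rule: two consecutive samples must agree on the
    mean systolic blood pressure to within `tolerance` mmHg.
    """
    size, previous = start, None
    while size <= len(population):
        sample = random.choices(population, k=size)            # with replacement
        estimate = statistics.mean(bp for _, _, bp in sample)
        if previous is not None and abs(estimate - previous) < tolerance:
            return sample, estimate
        previous, size = estimate, size * growth               # enlarge and retry
    return population, statistics.mean(bp for _, _, bp in population)
```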


The next step is data cleaning, where the program decides what happens with missing or erroneous information. To give you an idea, a person may have left the "age" field blank in a form because the application didn't make it compulsory to complete every field. When this happens, the algorithm may compute the average age of the participants and then use that value for all the missing "age" fields in the dataset, with the goal of keeping errors to a minimum.
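A minimal sketch of that mean-imputation step, using a few hypothetical records; real cleaning pipelines usually handle many more cases, such as typos, impossible values, and duplicates.

```python
import statistics

# Hypothetical records: None marks an "age" field the participant left blank.
records = [
    {"id": 1, "age": 34}, {"id": 2, "age": None},
    {"id": 3, "age": 51}, {"id": 4, "age": 46},
]

# Average the ages that were actually filled in ...
mean_age = statistics.mean(r["age"] for r in records if r["age"] is not None)

# ... and use that value wherever the field is missing.
for r in records:
    if r["age"] is None:
        r["age"] = mean_age
```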


The information is then mined using various techniques; here are a few of the most frequently used ones. Pattern tracking helps researchers detect abnormal variations of a variable over time. Variable association determines whether two or more variables are correlated. Regression allows specialists to estimate the relationship that may exist between several variables. Data classification groups similar chunks of information based on their common features. These are just a few examples, of course.
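To make the association and regression ideas concrete, here is a minimal sketch on a handful of made-up coffee/blood-pressure pairs; the numbers are invented for illustration, and the two statistics functions require Python 3.10 or newer.

```python
import statistics

# Made-up paired measurements: daily cups of coffee and systolic blood pressure.
cups        = [0, 1, 1, 2, 3, 3, 4, 5]
systolic_bp = [118, 120, 119, 124, 127, 126, 131, 135]

# Variable association: how strongly do the two variables move together?
r = statistics.correlation(cups, systolic_bp)           # Pearson's r

# Regression: fit a line describing the relationship between the variables.
fit = statistics.linear_regression(cups, systolic_bp)   # slope and intercept
print(f"r = {r:.2f}, bp ≈ {fit.slope:.1f} * cups + {fit.intercept:.1f}")
```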


Finally, the data is interpreted and evaluated by qualified personnel. At this stage, users can extract actionable information from the patterns highlighted by the mining algorithms.