Preparing your Data

Here are some things to think about during this stage:

Why would we need to clean data?

Data Accuracy: We don’t want data that is affected by faulty sensors or was recorded during atypical operating periods.
Data Consistency: We want to be able to make confident decisions based on any part of the data.
Data Validity: All our analyses should work on the data, so it needs to be in the right format.
Operating Ranges: Clean data ensures models are looking at the ‘correct’ optimum operating ranges.
Outliers: Clean data keeps the results of an analysis from being skewed by outliers.

What do we aim for when we’re cleaning?

Low noise
No errors or nulls
Fewer outliers
Global filters applied

Filtering for outliers can be a contentious topic!

A data scientist will tell you not to filter out outliers unless you can show statistically that they are genuine outliers. A metallurgist, however, knows that a flowmeter reading of -500 m³/h is physically impossible and should be excluded.

Context and understanding of the data are important!
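
As a rough illustration of the metallurgist's rule, here is a minimal Python sketch that drops readings which are missing or physically impossible. The readings, the limits (MIN_FLOW, MAX_FLOW) and the helper is_plausible are all hypothetical; the real limits would come from someone who knows the instrument and the process.

```python
import math

# Hypothetical flowmeter readings in m3/h; None and NaN stand in for missing values.
readings = [412.0, 398.5, -500.0, None, 405.2, float("nan"), 420.1]

# Assumed physical limits for this illustrative flowmeter (not from a real plant).
MIN_FLOW = 0.0
MAX_FLOW = 1000.0


def is_plausible(value, low=MIN_FLOW, high=MAX_FLOW):
    """Return True when a reading is present and inside the physical range."""
    if value is None or math.isnan(value):
        return False
    return low <= value <= high


# Keep only readings that pass the physical plausibility check.
cleaned = [r for r in readings if is_plausible(r)]
print(cleaned)
```

Note that this rule rests on process knowledge (a flow cannot be negative here) and not on a statistical test, which is why it can be applied without proving the point is an outlier.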

Another point to remember: don’t lose sight of how much data has been filtered or excluded. A data point count is a simple way to pick up potential filtering or data errors.

How many data points are in your raw data?
How many data points are left after filtering?
Would you feel confident making a decision based on the remaining data?
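
The first two questions can be answered with a simple count comparison. The sketch below is one possible way to do it in Python; the function name retention_report and the example counts are made up for illustration.

```python
def retention_report(raw_count, filtered_count):
    """Summarise how much data survived filtering."""
    if raw_count <= 0:
        raise ValueError("raw_count must be positive")
    if not 0 <= filtered_count <= raw_count:
        raise ValueError("filtered_count must be between 0 and raw_count")
    return {
        "raw": raw_count,
        "kept": filtered_count,
        "removed": raw_count - filtered_count,
        "kept_pct": round(100.0 * filtered_count / raw_count, 1),
    }


# Hypothetical counts: 10,000 raw points, 6,200 left after filtering.
report = retention_report(raw_count=10_000, filtered_count=6_200)
print(report)
```

A low kept percentage is not automatically wrong, but it is a prompt to check whether a filter is stricter than intended before relying on what remains.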

In the Clarofy app, the Global Filters are located in the top right-hand corner. You can use Numeric, Categorical or Temporal (time-based) filters for your cleaning and exploration.
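
To show what the three filter types mean in practice, here is a plain Python sketch that applies one of each to a few records. This is not Clarofy's API or how the app works internally; the column names (time, ore_type, feed_tph), the values and the thresholds are all hypothetical.

```python
from datetime import datetime

# Hypothetical records; column names and values are illustrative only.
rows = [
    {"time": datetime(2023, 1, 1, 6), "ore_type": "oxide", "feed_tph": 310.0},
    {"time": datetime(2023, 1, 1, 18), "ore_type": "sulphide", "feed_tph": 295.0},
    {"time": datetime(2023, 1, 2, 6), "ore_type": "oxide", "feed_tph": 40.0},
    {"time": datetime(2023, 1, 3, 6), "ore_type": "oxide", "feed_tph": 325.0},
]


def numeric_filter(row):
    """Numeric: keep feed rates inside an assumed normal range."""
    return 250.0 <= row["feed_tph"] <= 400.0


def categorical_filter(row):
    """Categorical: keep only the selected ore type."""
    return row["ore_type"] in {"oxide"}


def temporal_filter(row):
    """Temporal: keep rows from 1 Jan up to, but not including, 3 Jan."""
    return datetime(2023, 1, 1) <= row["time"] < datetime(2023, 1, 3)


# A row is kept only when it passes every filter.
filters = [numeric_filter, categorical_filter, temporal_filter]
kept = [r for r in rows if all(f(r) for f in filters)]
print(len(rows), "->", len(kept))
```

Because the filters combine with a logical AND, each one added can only shrink the dataset, which is another reason to keep an eye on the remaining data point count.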
