Data Transformation and Harmonization with Crosser

A common problem in our industry is data mismatch: we want to use data from different sources, but they deliver different formats, or we have data in one format and want to send it to a system that expects another. These are the situations where data transformation and harmonization come into play.

The Crosser streaming analytics solution is well suited to resolving these types of issues. In this blog post we will look at some examples of how this can be done.

Data Transformation

Transformation and harmonization are closely related. The only difference is that with harmonization we are dealing with multiple sources and need to apply different transformations to each of them, with the goal of converting them into a common format. Let’s start with some typical transformations.

Transformations can be divided into two groups: structural and content-related.

Structural transformations deal with the format of the data, such as:

  • Hierarchical structures
  • Arrays
  • Objects
  • Naming conventions

Depending on what we get and what we want we must be able to convert back and forth between these alternatives.
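As a rough illustration of one such structural conversion, the sketch below flattens a hierarchical object into a single-level object with dotted key names. It is plain Python, not a Crosser module, and the `flatten` helper and the sample message are made up for this example.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Hypothetical hierarchical message, invented for illustration.
nested = {"machine": {"id": "M1", "motor": {"rpm": 935}}}
flat = flatten(nested)
print(flat)  # {'machine.id': 'M1', 'machine.motor.rpm': 935}
```

Going the other way (splitting dotted keys back into a hierarchy) follows the same idea in reverse.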

Content transformations change the actual data, e.g.:

  • Scaling of values (e.g. change units)
  • Change resolution/sampling rates
  • Remove outliers and missing values
  • Remove noise

Let’s look at an example:

We want to get the values of 5 registers in a PLC every second and store them in a database. The data from the PLC looks like this:

[
   {"Name": "Reg1", "Value": 77},
   {"Name": "Reg2", "Value": 935},
   {"Name": "Reg3", "Value": "True"},
   {"Name": "Reg4", "Value": 18594},
   {"Name": "Reg5", "Value": "Good"}
]

We get an array of objects, one per register, each with corresponding “Name” and “Value” properties.

The database expects a key/value map so that values can be mapped to the correct column when we are adding a new row of data. The output we want should look like this:

{
   "Temperature": 25,
   "RPM": 935,
   "Running": true,
   "Pressure": 28.4,
   "Quality": "Good"
}

Let’s see what we need to do to get this output, starting with the input data we have. First the structural problems:

  • The array with name/value properties must be changed into an object with key/value pairs
  • The names we get from the PLC must be replaced with the proper column names in the database
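The two structural steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the Crosser modules are implemented. The register-to-column mapping in `COLUMN_NAMES` is an assumption inferred from the order of the example input and output, and `pivot_and_rename` is a name made up for this sketch.

```python
# Assumed mapping from PLC register names to database column names.
COLUMN_NAMES = {
    "Reg1": "Temperature",
    "Reg2": "RPM",
    "Reg3": "Running",
    "Reg4": "Pressure",
    "Reg5": "Quality",
}


def pivot_and_rename(registers, column_names):
    """Turn a list of {"Name", "Value"} objects into one key/value object."""
    row = {}
    for item in registers:
        name = item["Name"]
        # Keep the original register name if no column name is mapped.
        row[column_names.get(name, name)] = item["Value"]
    return row


plc_data = [
    {"Name": "Reg1", "Value": 77},
    {"Name": "Reg2", "Value": 935},
    {"Name": "Reg3", "Value": "True"},
    {"Name": "Reg4", "Value": 18594},
    {"Name": "Reg5", "Value": "Good"},
]

row = pivot_and_rename(plc_data, COLUMN_NAMES)
print(row)
```

At this point the structure matches what the database expects, but the values are still the raw ones delivered by the PLC.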

There are also a few content issues we need to deal with:

  • The temperature value we get from the PLC has the wrong unit. We get Fahrenheit while the database expects Celsius. A conversion is needed.
  • The running state is delivered as a string, while the database expects a boolean. A type conversion is needed.
  • The pressure value is delivered as a 16-bit integer with the range 0-65535, while it actually represents an analog value between 0 and 100 psi. A scaling is needed.
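The three content conversions listed above could look roughly like this in plain Python. The function names are made up for this sketch, and the pressure scaling assumes a strictly linear mapping with raw 0 at 0 psi and raw 65535 at 100 psi. Under that assumption the raw value 18594 comes out at roughly 28.4 psi.

```python
def fahrenheit_to_celsius(value_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (value_f - 32) * 5 / 9


def parse_bool(text):
    """Interpret the strings "True"/"False" (any case) as booleans."""
    normalized = str(text).strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError(f"not a boolean string: {text!r}")


def scale_raw(raw, raw_max=65535, engineering_max=100.0):
    """Linearly map a raw 0..raw_max count to 0..engineering_max."""
    return raw / raw_max * engineering_max


# The row after the structural steps: right shape, raw PLC values.
row = {"Temperature": 77, "RPM": 935, "Running": "True",
       "Pressure": 18594, "Quality": "Good"}

converted = dict(row)
converted["Temperature"] = fahrenheit_to_celsius(row["Temperature"])
converted["Running"] = parse_bool(row["Running"])
converted["Pressure"] = round(scale_raw(row["Pressure"]), 1)
print(converted)
```

Raising an error on an unexpected boolean string is a design choice for this sketch. In a live flow it may be preferable to pass the value through or flag the message instead.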

This example shows some basic transformations you may encounter when working with machine data. Implementing these types of transformations is easy with the Crosser Streaming Analytics system using standard functions from the Crosser module library. The transformations above would end up in a processing flow like this with Crosser:

[Figure: Crosser Data Transformation Example]

Other transformations such as removing outliers/noise and changing resolution (aggregation/filtering) can easily be added to the flow above using other standard modules from the library.
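To give a feel for what such steps do, here is a minimal Python sketch of a sliding median filter (one common way to suppress isolated spikes) followed by mean aggregation to reduce the sample rate. The technique choice, the function names, and the sample values are assumptions made for this illustration and do not describe specific Crosser modules.

```python
from statistics import mean, median


def median_filter(values, window=3):
    """Replace each value with the median of its neighbourhood."""
    half = window // 2
    filtered = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)  # window shrinks at the edges
        filtered.append(median(values[lo:hi]))
    return filtered


def downsample_mean(values, factor):
    """Average each group of `factor` consecutive samples into one value."""
    return [mean(values[i:i + factor]) for i in range(0, len(values), factor)]


# Made-up samples; 95.0 is an artificial spike.
samples = [20.0, 20.0, 95.0, 21.0, 21.0, 22.0]
smoothed = median_filter(samples)
reduced = downsample_mean(smoothed, 2)
print(smoothed)  # [20.0, 20.0, 21.0, 21.0, 21.0, 21.5]
print(reduced)   # [20.0, 21.0, 21.25]
```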

Data Harmonization

Data harmonization comes into play when we have multiple sources with different formats and we want to combine the data so that we can treat the data in the same way independent of the original source. To harmonize the data we typically apply different transformations to each of the sources to produce a common format.

[Figure: Crosser Data Harmonization Example]

The above example also introduces a transformation before the output. Sometimes it is advantageous to first transform each of the inputs to a format that is optimized for processing and then apply a transformation before the output to adapt the data to the requirements of the receiving system.
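The pattern of one transformation per input, a common internal format, and a final transformation before the output can be sketched as follows. Everything here is hypothetical: the two sources (`plc` and `gateway`), their message layouts, and the field names are invented to illustrate the idea in plain Python.

```python
def from_plc(msg):
    # Hypothetical source 1: name/value list, temperature in Fahrenheit.
    values = {item["Name"]: item["Value"] for item in msg}
    return {
        "temperature_c": (values["Reg1"] - 32) * 5 / 9,
        "rpm": values["Reg2"],
    }


def from_gateway(msg):
    # Hypothetical source 2: nested object, Celsius, speed in rev/second.
    sensor = msg["sensor"]
    return {
        "temperature_c": float(sensor["temp_c"]),
        "rpm": sensor["speed_rps"] * 60,
    }


INPUT_TRANSFORMS = {"plc": from_plc, "gateway": from_gateway}


def harmonize(source, msg):
    """Apply the source-specific transformation to reach the common format."""
    common = INPUT_TRANSFORMS[source](msg)
    common["source"] = source
    return common


def to_database_row(common):
    """Adapt the common format to what the receiving system expects."""
    return {"Temperature": common["temperature_c"], "RPM": common["rpm"]}


plc_row = to_database_row(harmonize(
    "plc", [{"Name": "Reg1", "Value": 77}, {"Name": "Reg2", "Value": 935}]))
gateway_row = to_database_row(harmonize(
    "gateway", {"sensor": {"temp_c": 24, "speed_rps": 15}}))
print(plc_row, gateway_row)
```

Adding a new source then only requires one more entry in `INPUT_TRANSFORMS`, while the processing and output steps stay unchanged.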

The transformations of each of the inputs are of the same type as described above. When harmonizing time series data from multiple sources there might be an additional issue that must be dealt with: data from different sources arriving at different times or with different sample rates.

Depending on the requirements of the processing and/or receiving system we might have to align data on common time steps. This can be done by shifting the data, if the sampling rate is the same, or by interpolating/aggregating data if different sampling rates are used.
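As a minimal sketch of aligning sources with different sampling rates, the Python below linearly interpolates each series at a shared set of timestamps. The helper names and the two sample series are made up for this example, and linear interpolation is just one of the options mentioned above.

```python
def interpolate_at(series, t):
    """Linearly interpolate a time-sorted list of (timestamp, value) at t."""
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return v0  # duplicate timestamps: avoid division by zero
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return None  # t lies outside the range covered by the series


def align(series_by_source, timestamps):
    """Build one row per common timestamp with a value from every source."""
    rows = []
    for t in timestamps:
        row = {"time": t}
        for name, series in series_by_source.items():
            row[name] = interpolate_at(series, t)
        rows.append(row)
    return rows


# Made-up series: one sampled every second, one every four seconds.
fast = [(0, 10.0), (1, 11.0), (2, 12.0), (3, 13.0), (4, 14.0)]
slow = [(0, 100.0), (4, 140.0)]

aligned = align({"fast": fast, "slow": slow}, [0, 2, 4])
print(aligned)
```

Note that interpolation needs a later sample before it can produce a value, so in a streaming setting it adds some delay compared with simply shifting or repeating values.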

This is especially important if the data will be used with machine learning models, since these expect each new sample to contain data from each of the sources the model was trained on. A similar problem arises when we are missing data from one source at a specific time. We may then need to fill in a value as best we can, for example by repeating the last known value or by interpolating a value from the data we have received.
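Repeating the last known value can be sketched like this in plain Python. The `forward_fill` helper and the sample rows are hypothetical and only meant to show the principle.

```python
def forward_fill(rows, keys):
    """Fill missing or None values with the last known value for each key."""
    last = {}
    filled = []
    for row in rows:
        out = dict(row)
        for key in keys:
            if out.get(key) is None:
                # Stays None until a first value has been seen for this key.
                out[key] = last.get(key)
            else:
                last[key] = out[key]
        filled.append(out)
    return filled


# Made-up rows: source "b" is missing at time 1, source "a" at time 2.
rows = [
    {"time": 0, "a": 1.0, "b": 5.0},
    {"time": 1, "a": 1.5},
    {"time": 2, "a": None, "b": 6.0},
]

filled = forward_fill(rows, ["a", "b"])
print(filled)
```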

Again, these types of data preparations are easily implemented using standard modules from the Crosser library.

Summary

Crosser provides the tools to help you make your data useful. Get insights to optimize your operations and take appropriate actions immediately based on your data. Contact us to discuss how Crosser can be relevant and how to get going in no time with the self-service capabilities of the platform.

About the author

Göran Appelquist (Ph.D.) | CTO

Göran has 20 years of experience in leading technology teams. He’s the lead architect of our end-to-end solution and is extremely focused on securing the lowest possible Total Cost of Ownership for our customers.

"Hidden lifecycle (employee) costs can account for 5-10 times the purchase price of software. Our goal is to offer a solution that automates and removes most of the tasks that are costly over the lifecycle.

My career started in the academic world where I got a PhD in physics by researching large scale data acquisition systems for physics experiments, such as the LHC at CERN. After leaving academia I have been working in several tech startups in different management positions over the last 20 years.

In most of these positions I have stood with one foot in the R&D team and another in the product/business teams. My passion is learning new technologies, using them to develop innovative products, and explaining the solutions to end users, technical or non-technical."
