5. Talend Data Integration: Data Processing Operations, Part 1

Talend's data processing components handle operations such as filtering, aggregation, joining, and lookup. In this Talend tutorial, we will look at how to work with the most frequently used data processing components.

The complete documentation of Talend can be found here.

Wait a moment! If you are new to Talend, I strongly recommend that you first go through the other tutorial posts, in the order listed here.

Filtering data based on rows

Rows can be filtered with the tFilterRow component. Rows that fail the filter can also be captured in a separate reject flow.

STEP1: Drag and drop one tFixedFlowInput, one tFilterRow, and two tLogRow components (the second tLogRow captures the rejected rows) from the Palette to the workspace.

STEP2: Connect tFixedFlowInput to tFilterRow with Row > Main, tFilterRow to tLogRow_1 with Row > Main, and tFilterRow to tLogRow_2 with Row > Reject.

STEP3: Open the tFixedFlowInput component properties, select Use Inline Content, and enter the content below.

Content

James;Maths;20
James;Science;25
John;Maths;25
Martin;Science;100
Mary;Maths;100

STEP4: Click Edit schema and define the three columns: StudentName, SubjectName, and Marks.

STEP5: Rows with Marks greater than 40 go to one flow (tLogRow_1), and the rows rejected by that condition go to another flow (tLogRow_2). In tFilterRow, add a condition on the Marks column: greater than 40.

STEP6: Select tLogRow_1 and set its mode to Table. Repeat for the other tLogRow component.

STEP7: Run the job and check the output.
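
For readers who think in code, the job above behaves roughly like the Java sketch below. It is only an illustration under assumptions, not the code Talend generates: the class `FilterRowSketch`, the `Row` type, and the `parse` and `filter` helpers are hypothetical names, and the column order StudentName, SubjectName, Marks is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tFilterRow job; not Talend-generated code.
class FilterRowSketch {

    // One input row. Column names are assumed from the tutorial text.
    static final class Row {
        final String studentName;
        final String subjectName;
        final int marks;

        Row(String studentName, String subjectName, int marks) {
            this.studentName = studentName;
            this.subjectName = subjectName;
            this.marks = marks;
        }

        @Override
        public String toString() {
            return studentName + ";" + subjectName + ";" + marks;
        }
    }

    // Same inline content as the tFixedFlowInput in STEP3.
    static final String CONTENT =
            "James;Maths;20\n"
          + "James;Science;25\n"
          + "John;Maths;25\n"
          + "Martin;Science;100\n"
          + "Mary;Maths;100";

    // Split the semicolon-delimited content into rows.
    static List<Row> parse(String content) {
        List<Row> rows = new ArrayList<>();
        for (String line : content.split("\n")) {
            String[] fields = line.trim().split(";");
            rows.add(new Row(fields[0], fields[1], Integer.parseInt(fields[2])));
        }
        return rows;
    }

    // Rows with marks above the threshold go to 'accepted' (the main flow);
    // all other rows go to 'rejected' (the reject flow).
    static void filter(List<Row> input, int threshold,
                       List<Row> accepted, List<Row> rejected) {
        for (Row row : input) {
            if (row.marks > threshold) {
                accepted.add(row);
            } else {
                rejected.add(row);
            }
        }
    }

    public static void main(String[] args) {
        List<Row> accepted = new ArrayList<>();
        List<Row> rejected = new ArrayList<>();
        filter(parse(CONTENT), 40, accepted, rejected);
        System.out.println("Accepted (tLogRow_1): " + accepted);
        System.out.println("Rejected (tLogRow_2): " + rejected);
    }
}
```

With the sample data, only the Martin and Mary rows (both with 100 marks) pass the condition; the other three rows end up in the reject list.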

Suppressing the columns

If we need to suppress a few columns, we can do it with the tFilterColumns component.

STEP1: Drag and drop the tFixedFlowInput, tFilterColumns, and tLogRow components. Connect tFixedFlowInput to tFilterColumns with Row > Main, and tFilterColumns to tLogRow with Row > Main.

STEP2: Edit the properties of tFixedFlowInput. The content is the same as in the filtering section above.

STEP3: Click Edit schema and define the three columns: StudentName, SubjectName, and Marks.

STEP4: Now we will suppress the SubjectName column. Edit the schema of tFilterColumns so that SubjectName is removed from the output columns.

STEP5: Click on tLogRow and ensure that Table mode is selected.

STEP6: Run the job and check the output as below.
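
The effect of suppressing a column can be sketched in plain Java as follows. This is a hedged illustration only: `FilterColumnsSketch` and `dropColumn` are hypothetical names, and the sketch assumes SubjectName is the second column (index 1) of each semicolon-delimited line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of column suppression; not Talend-generated code.
class FilterColumnsSketch {

    // Return each semicolon-delimited line without the column at dropIndex.
    static List<String> dropColumn(List<String> lines, int dropIndex) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < fields.length; i++) {
                if (i != dropIndex) {
                    kept.add(fields[i]);
                }
            }
            result.add(String.join(";", kept));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "James;Maths;20",
                "James;Science;25",
                "John;Maths;25",
                "Martin;Science;100",
                "Mary;Maths;100");
        // Index 1 is SubjectName in the assumed column order.
        for (String line : dropColumn(input, 1)) {
            System.out.println(line);
        }
    }
}
```

Each output line keeps only the student name and the marks, for example `James;20`.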

Sorting the data

Data can be sorted on one or more columns with the tSortRow component.

STEP1: Drag and drop the tFixedFlowInput, tSortRow, and tLogRow components. Connect tFixedFlowInput to tSortRow with Row > Main, and tSortRow to tLogRow with Row > Main.

STEP2: Edit tFixedFlowInput with the content below, and add the column names in Edit schema. This dataset lists each student's name, subject, and marks.

Content

James;Maths;20
James;Science;25
John;Maths;25
John;Science;65
Martin;Science;100
Martin;Maths;75
Mary;Maths;100
Mary;Science;80

STEP3: We will sort the data by Marks first and then by SubjectName. Enter the details of tSortRow as below.

STEP4: Edit tLogRow, set the mode to Table, and run the job.

You can see that the dataset is sorted by Marks first and then by SubjectName.
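
A two-level sort like this one can be sketched in Java with a chained comparator. The sketch makes assumptions the text does not confirm: Marks is sorted numerically in ascending order and SubjectName alphabetically in ascending order. The names `SortRowSketch`, `Row`, and `sortByMarksThenSubject` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a two-criteria sort; not Talend-generated code.
class SortRowSketch {

    static final class Row {
        final String studentName;
        final String subjectName;
        final int marks;

        Row(String studentName, String subjectName, int marks) {
            this.studentName = studentName;
            this.subjectName = subjectName;
            this.marks = marks;
        }

        @Override
        public String toString() {
            return studentName + ";" + subjectName + ";" + marks;
        }
    }

    // Same eight rows as the inline content in STEP2.
    static final List<Row> DATA = Arrays.asList(
            new Row("James", "Maths", 20),
            new Row("James", "Science", 25),
            new Row("John", "Maths", 25),
            new Row("John", "Science", 65),
            new Row("Martin", "Science", 100),
            new Row("Martin", "Maths", 75),
            new Row("Mary", "Maths", 100),
            new Row("Mary", "Science", 80));

    // Sort by marks (ascending, assumed), then by subject name (ascending, assumed).
    static List<Row> sortByMarksThenSubject(List<Row> input) {
        List<Row> sorted = new ArrayList<>(input);
        sorted.sort(Comparator
                .comparingInt((Row r) -> r.marks)
                .thenComparing((Row r) -> r.subjectName));
        return sorted;
    }

    public static void main(String[] args) {
        for (Row row : sortByMarksThenSubject(DATA)) {
            System.out.println(row);
        }
    }
}
```

The second criterion only matters for ties on the first: the two rows with 25 marks come out as Maths before Science, and so do the two rows with 100 marks.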

Aggregating the data

Data can be aggregated with tAggregateRow or tAggregateSortedRow. The difference between the two components is that tAggregateSortedRow expects the input data to be sorted already, so we have to sort it explicitly first.

Aggregation using tAggregateRow:

In this scenario, we will find the total marks of each student.

STEP1: Drag and drop the tFixedFlowInput, tAggregateRow, and tLogRow components. Connect tFixedFlowInput to tAggregateRow with Row > Main, and tAggregateRow to tLogRow with Row > Main.

STEP2: Edit tFixedFlowInput with the same content as in STEP2 of Sorting the data.

STEP3: Open the tAggregateRow properties and edit the schema details as below. Also fill in Group by and Operations. Since we need the total marks for each student, StudentName goes in Group by and sum goes in the Operations section.

STEP4: Edit tLogRow, set the mode to Table, and run the job.
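
The group-by-and-sum logic configured above can be sketched in Java with a map keyed by student name. This is an illustration under assumptions, not Talend's implementation: `AggregateRowSketch` and `totalMarksByStudent` are hypothetical names, and the sketch assumes StudentName is column 0 and Marks is column 2.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of group-by with sum; not Talend-generated code.
class AggregateRowSketch {

    // Same eight rows as the inline content used in Sorting the data.
    static final List<String> DATA = Arrays.asList(
            "James;Maths;20",
            "James;Science;25",
            "John;Maths;25",
            "John;Science;65",
            "Martin;Science;100",
            "Martin;Maths;75",
            "Mary;Maths;100",
            "Mary;Science;80");

    // Group by column 0 (StudentName) and sum column 2 (Marks).
    // The input does not need to be sorted: the map tracks every group at once.
    static Map<String, Integer> totalMarksByStudent(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            totals.merge(fields[0], Integer.parseInt(fields[2]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : totalMarksByStudent(DATA).entrySet()) {
            System.out.println(entry.getKey() + ";" + entry.getValue());
        }
    }
}
```

For the sample data the totals are James 45, John 90, Martin 175, and Mary 180.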

Aggregation of Sorted Data:

Unlike tAggregateRow, tAggregateSortedRow expects the data to be sorted first, so we have to place a tSortRow before tAggregateSortedRow.

STEP1: Drag and drop the tFixedFlowInput, tSortRow, tAggregateSortedRow, and tLogRow components. Connect tFixedFlowInput to tSortRow with Row > Main, tSortRow to tAggregateSortedRow with Row > Main, and tAggregateSortedRow to tLogRow with Row > Main.

STEP2: Edit tFixedFlowInput with the same content as in STEP2 of Sorting the data.

STEP3: Edit the tSortRow schema and sort criteria as below.

STEP4: Edit the tAggregateSortedRow schema and properties as below. We also need to specify the number of input rows; here we have 8 rows. Since we need the total marks for each student, StudentName goes in Group by and sum goes in the Operations section.

STEP5: Edit tLogRow, set the mode to Table, and run the job.
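
To see why sorted input matters for this style of aggregation, here is a minimal Java sketch of the general single-pass technique. It illustrates the idea only and does not describe Talend's internals: `AggregateSortedRowSketch`, `sortByStudentName`, and `sumSorted` are hypothetical names, and the sketch assumes the rows are sorted by StudentName (column 0) before aggregation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of aggregation over sorted input; not Talend-generated code.
class AggregateSortedRowSketch {

    // Same eight rows as the inline content used in Sorting the data.
    static final List<String> DATA = Arrays.asList(
            "James;Maths;20",
            "James;Science;25",
            "John;Maths;25",
            "John;Science;65",
            "Martin;Science;100",
            "Martin;Maths;75",
            "Mary;Maths;100",
            "Mary;Science;80");

    // Sort step: order the lines by column 0 so each student's rows are adjacent.
    static List<String> sortByStudentName(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing((String line) -> line.split(";")[0]));
        return sorted;
    }

    // Aggregate step: a single pass that keeps only one running total.
    // It emits a group's total as soon as the student name changes,
    // which is only correct when the input is sorted by that name.
    static List<String> sumSorted(List<String> sortedLines) {
        List<String> output = new ArrayList<>();
        String currentName = null;
        int runningTotal = 0;
        for (String line : sortedLines) {
            String[] fields = line.split(";");
            if (currentName != null && !currentName.equals(fields[0])) {
                output.add(currentName + ";" + runningTotal);
                runningTotal = 0;
            }
            currentName = fields[0];
            runningTotal += Integer.parseInt(fields[2]);
        }
        if (currentName != null) {
            output.add(currentName + ";" + runningTotal); // flush the last group
        }
        return output;
    }

    public static void main(String[] args) {
        for (String line : sumSorted(sortByStudentName(DATA))) {
            System.out.println(line);
        }
    }
}
```

If the rows were not sorted, a student whose rows are scattered through the input would be emitted more than once with partial totals, which is why the sort step comes first.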
