diff --git a/README.md b/README.md
new file mode 100644
index 0000000..72eaeec
--- /dev/null
+++ b/README.md
@@ -0,0 +1,100 @@
# Flink Side Output Sample

This is an example of working with Flink and side outputs.

## What this is

The pipeline processes log lines, turns them into metrics, reduces the
results, and groups them into time windows (tumbling windows, in Flink
jargon: consecutive, non-overlapping blocks of elements split by their event
time). Once the elements are grouped in their windows, we want them to go to
different sinks (to be stored).

## The layout

### Metrics

The metrics are the information being extracted from the log lines. We have
two different metrics: a simple one (`SimpleMetric`) that has a single value
and a more complex one (`ComplexMetric`) with more values.

The exercise is to have two different types of elements floating in the data
stream, which would require a different sink for each.

Even though those two are different elements, both implement the same trait
(interface), so even though both float in the data stream, they are processed
in the same way.

### The Source

In this example, we use a function (`SourceFunction`) to generate the
elements. It basically holds a list of lines and emits each one into the data
stream.

### Making it easier to deal with the lines

Because the lines are pure text, we need an easy way to extract the
information in them. For this, we used a `flatMap` to split the lines on
their separator (in this example, a tab character) and then name each field,
creating a map/object/dictionary (Scala and stream/functional processing
names colliding here). This way, when we actually create the metrics, we can
simply request the fields in the map by their names.

Note, though, that each line becomes a single map, so a `map` would also work
here.
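The core of that parsing step can be sketched in plain Scala, independent of
Flink's API. The field names below are assumptions for illustration; the
sample's actual log format may differ:

```scala
// Hypothetical field names -- the real log format may differ.
val fieldNames = List("timestamp", "level", "message")

// Split a tab-separated log line and name each field, producing a Map so
// later stages can look fields up by name instead of by position.
def parseLine(line: String): Map[String, String] =
  fieldNames.zip(line.split('\t')).toMap
```

In the pipeline itself this function would sit inside the `flatMap`, along
the lines of `stream.flatMap(line => Seq(parseLine(line)))`.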
We simply used `flatMap` because, instead of working with a single log line,
we could work with blocks of lines, and then the function would generate more
than one map/object/dictionary per element.

### Extracting metrics

To extract the metrics, we use another `flatMap`, this time because we are
extracting more than one metric from each line.

### Windows

As mentioned, we group elements by window, so we need to define how the
window works. The definition of which event time should create/close windows
is at the very start of the pipeline, where we indicated
`TimeCharacteristic.EventTime`, which means "use the time of the event,
instead of the time the log is being processed or some other information".

Because we are using event time, we need to indicate how the time is
extracted from each element. This is done in
`AssignerWithPeriodicWatermarks.extractTimestamp`. Another thing to notice is
that we also define the watermark: the point in time up to which we consider
events complete, so that when the watermark passes the end of a window, that
window is fired (sent downstream). In this example, the watermark trails 1
second behind the most recent event seen in the data stream.

The metrics are grouped by their key into windows of 1 minute, which survive
for another 30 seconds (the allowed lateness); everything arriving after that
is put on a side output for later processing.

### Reducing

Once an element is added to a window, it is reduced (in functional jargon)
along the other elements with the same key. This is what we do with
`ReduceFunction` and the fact that the `Metric` trait has an `add` method.

### Sinks

When the window fires, we divide the results into two different side outputs
-- remember, we have two different metrics and each requires a different
sink. The `ProcessFunction` does that: based on the class of the metric, it
sends each metric type to a different side output.
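A minimal sketch of that routing decision, with Flink's actual
`ProcessFunction`/`OutputTag` machinery stripped away (the fields on the two
metric types and the output names are assumptions for illustration):

```scala
sealed trait Metric { def add(other: Metric): Metric }

// Simplified stand-ins for the two metric types described above.
final case class SimpleMetric(value: Long) extends Metric {
  def add(other: Metric): Metric = other match {
    case SimpleMetric(v) => SimpleMetric(value + v)
    case _               => this
  }
}

final case class ComplexMetric(count: Long, sum: Long) extends Metric {
  def add(other: Metric): Metric = other match {
    case ComplexMetric(c, s) => ComplexMetric(count + c, sum + s)
    case _                   => this
  }
}

// The ProcessFunction's job, reduced to its essence: pick a side output
// (here just a label) based on the concrete class of the metric.
def route(metric: Metric): String = metric match {
  case _: SimpleMetric  => "simple-output"
  case _: ComplexMetric => "complex-output"
}
```

The shared `add` method is also what lets a single `ReduceFunction` handle
both types without caring which one it is holding.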
Those side outputs are then captured and sent to different sinks.

## Running

Simply install [SBT](https://www.scala-sbt.org/) and run `sbt run`. SBT will
download all the necessary dependencies and run the pipeline in standalone
mode.

We didn't test it using the full JobManager+TaskManager model of Flink and,
thus, this is left as an exercise to the reader. :)

## The problem

The problem here -- which this sample tries to demonstrate -- is that once
the windows are generated and processed by the `ProcessFunction` that sends
each metric to its side output, the side outputs are never captured again --
so even though the metrics are sent to side outputs, they die in limbo.
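To make the gap concrete: in Flink the missing step would be something along
the lines of calling `getSideOutput` on the processed stream and attaching a
sink to the result. The toy model below (plain Scala, not Flink code) shows
the same shape: elements emitted under a tag only go anywhere if a later
stage actually asks for that tag.

```scala
import scala.collection.mutable

// A toy model of side outputs: a processing stage emits elements under a
// tag; unless a later stage reads that tag, the elements just sit in a
// buffer -- the "limbo" described above.
class SideOutputs {
  private val buffers = mutable.Map.empty[String, mutable.Buffer[Int]]

  def emit(tag: String, element: Int): Unit =
    buffers.getOrElseUpdate(tag, mutable.Buffer.empty[Int]) += element

  // The step this sample never performs: capturing a tagged stream so it
  // can be wired to a sink.
  def getSideOutput(tag: String): List[Int] =
    buffers.get(tag).map(_.toList).getOrElse(Nil)
}
```

Without the equivalent of `getSideOutput` being called per tag, every
element routed by the `ProcessFunction` is emitted and then simply dropped.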