Saturday, September 11, 2021

Process Based Isolation

Well, that’s a fancy way of saying we run an exe/binary for every job we have, so that we don’t have to deal with (most of) the concurrency details.

When you have standard jobs to be done at set intervals, say every day or every hour, you need timers. When a timer kicks in, you start a job. We love the jobs that finish without our intervention. Since life is cruel to us human beings, we need to check every process to see whether it has finished, been interrupted by an error, or stopped prematurely.

One of the proper ways to do this is to create a service or a daemon (on Windows or Linux) and have a separate thread for every job. Developing multi-threaded applications is hard. Debugging them is even harder. Multi-threading is a headache for novice programmers, and experienced programmers are not immune to the same problems. Thanks to the Object Oriented Programming paradigm, we can create separate objects that have their own business logic, encapsulate their internal state, and so on, to isolate these jobs. But what if a thread collapses and drags the rest of the world into its black hole?


(By the way, this is our first recording system, back in 2004: a bunch of PCs connected to a monitor via a KVM switch, recording from ordinary DVB devices.)

Let me explain it from the very beginning. I used to be a recognition specialist working on audio and video streams. For this job, I first needed to record TV and radio broadcasts in order to analyze and process them, extract information (which is called Media Information Retrieval), compress them, and then back them up along with their metadata.

Here is one of my early recording applications. It can record one channel at a time. It reads its configuration from an ini file to know which hardware it is going to use, which disk and directory to save the media file to, and so on.
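The configuration part is simple enough to sketch. Below is a minimal, hypothetical C# version of what reading such an ini file could look like; the file name and the keys (CaptureDevice, OutputDirectory) are made up for illustration, not the original ones.

    using System;
    using System.Collections.Generic;
    using System.IO;

    var config = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (var line in File.ReadAllLines("recorder.ini"))
    {
        var trimmed = line.Trim();
        if (trimmed.Length == 0 || trimmed.StartsWith(";") || trimmed.StartsWith("["))
            continue;                                   // skip blanks, comments, section headers
        var parts = trimmed.Split('=', 2);
        if (parts.Length == 2)
            config[parts[0].Trim()] = parts[1].Trim();
    }

    string device    = config["CaptureDevice"];         // which hardware to use
    string outputDir = config["OutputDirectory"];        // which disk/directory to save to
    Console.WriteLine($"Recording from {device} into {outputDir}");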


Is it working? Yes. But what happens if the process terminates prematurely? That stays strictly between the process and the Windows OS; we have no way of knowing without checking it manually. Checking it manually? C’mon, in which century are we living?

There are several reasons for a process to go down. The first one is a memory leak. A sloppy application grabs memory when it needs it and never returns it to the system. That happens with the classical memory allocation paradigm, where the coder needs to free what they have created (pun intended). In Delphi, you need to free an object if you have created it. In C++ you need to delete a pointer if you acquired it with the new operator. In C you need to call free() for everything you allocated with malloc(). That’s all. It’s that easy. Well, not really. In real-life problems, sometimes the coder loses control over their code and forgets to free what they have created.

The managed memory model largely solves this problem by employing a garbage collector. This is all well and good if every object you use is a managed object. I’ve seen many colleagues whining about a memory leak in their .Net project, especially when they work with fancy bitmaps in a classic Windows Forms application. A memory leak in a .Net project? Yes. Coders who have never used the classic memory allocation model easily forget to call the Dispose() method when they need to return system memory to the operating system. This happens most often with bitmap images in C# .Net. Disposing of a bitmap when you are done with it requires writing a dispose method in the object that uses the bitmap; disposing of that object requires writing a dispose method in the form that contains it. Dispose is contagious. It contaminates its container objects.
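Here is a minimal C# sketch of that contagion, assuming a hypothetical ThumbnailCache class that wraps a Bitmap: because the inner object owns unmanaged memory, every container up the chain has to implement Dispose as well.

    using System;
    using System.Drawing;
    using System.Windows.Forms;

    class ThumbnailCache : IDisposable
    {
        private Bitmap _thumbnail;                      // wraps unmanaged GDI+ memory

        public ThumbnailCache(string path) => _thumbnail = new Bitmap(path);

        public void Dispose()
        {
            // Because we own a Bitmap, we have to expose Dispose ourselves...
            _thumbnail?.Dispose();
            _thumbnail = null;
        }
    }

    class MainForm : Form
    {
        private ThumbnailCache _cache = new ThumbnailCache("logo.png");

        protected override void Dispose(bool disposing)
        {
            // ...and the form that owns us has to remember to dispose us in turn.
            if (disposing)
                _cache?.Dispose();
            base.Dispose(disposing);
        }
    }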

Let’s go back to our multi-threaded application. We need only one executable to do our job. That application might run one or more threads to do the jobs. If one of our threads collapses, we can detect it in a number of ways. For example, we can update a timestamp every time the thread finishes a batch of commands; if, say, an hour later the timestamp has not moved, we can declare the thread dead, free its resources and restart it all over again. This usually happens when moving files over a network connection. There are lots of reasons for a move operation to fail: a disk might get corrupted, a file could be left open (so other processes cannot access it), or some jerk might unplug the network cable.
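A rough sketch of that heartbeat idea in C# (the class and method names are mine, not from the original application): each worker stamps its own timestamp after a batch of work, and a supervisor periodically looks for stale stamps.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    class Watchdog
    {
        // Each worker thread updates its own "last alive" timestamp after a batch of work.
        private readonly ConcurrentDictionary<string, DateTime> _heartbeats = new();

        public void Beat(string workerName) => _heartbeats[workerName] = DateTime.UtcNow;

        // Called periodically (e.g. from a timer) to find workers that look dead.
        public IEnumerable<string> FindStale(TimeSpan timeout)
        {
            foreach (var entry in _heartbeats)
                if (DateTime.UtcNow - entry.Value > timeout)
                    yield return entry.Key;             // caller frees resources and restarts it
        }
    }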

A DVB-S broadcast consists of a bunch of transport streams. A transport stream could carry 10, 20, or 50 radio and TV channels. I noticed that there are small MPEG headers in that stream, so I tried to record every packet in the stream to a different file using the stream number information. Voila! It worked! And just because I was able to code it using pipes, I coded it using pipes. I get a transport stream from the capture card and run it through my "capture graph" of DirectShow filters. Zero decoding, zero CPU usage! All I was doing was writing packets into different files. I then decided to use a separate process for each of these recordings (read: a different executable running for every channel I record). These programs connect to the main program's pipe and read their data from it. Here is the result:


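Incidentally, the per-channel reader side of that pipe setup could look roughly like this in C#; the pipe naming scheme, the channel argument, and the buffer size are assumptions for illustration, not the original DirectShow code.

    using System;
    using System.IO;
    using System.IO.Pipes;

    class ChannelRecorder
    {
        static void Main(string[] args)
        {
            string channelId = args[0];                           // e.g. a program/stream number
            using var pipe = new NamedPipeClientStream(".", "ts_" + channelId, PipeDirection.In);
            pipe.Connect();                                       // the main process owns the server end

            using var output = File.Create($"channel_{channelId}.ts");
            var buffer = new byte[188 * 64];                      // TS packets are 188 bytes each
            int read;
            while ((read = pipe.Read(buffer, 0, buffer.Length)) > 0)
                output.Write(buffer, 0, read);                    // no decoding, just copy the packets
        }
    }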
If you are running an application under Windows, a few things might happen when you lose the connection to the network or to a disk. Windows displays a dialog to inform you that the drive you are trying to reach is not accessible, and it does that by blocking the message queue. Your application stops doing everything, even repainting its own canvas. As soon as your threads try to get messages from the message queue, one thread fails and the others fail with it.

The same thing happens when you run out of memory or other system resources. Your thread collapses and drags the whole application down with it. You lose a lot of computing resources, and resources mean money. We learned that the hard way. After realizing that we could record two channels with one sound card (the left and right channels of a stereo input connected to different sound sources), we designed a new application that records multi-channel audio. By adding four sound cards to a computer, we could record 16 channels at once.


We needed more information about the recording process, so we broadcast UDP messages to the local network to inform the management panels. We also added new telemetry points such as the sound level, absolute silence, free disk space on the recording drive, the last time we got a message from the application, and so on.
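Something along these lines is enough to broadcast such telemetry in C#; the port number and the message format below are illustrative assumptions, not the original protocol.

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    class TelemetrySender
    {
        private readonly UdpClient _udp = new UdpClient { EnableBroadcast = true };
        private readonly IPEndPoint _target = new IPEndPoint(IPAddress.Broadcast, 9050); // made-up port

        public void Send(string channel, double soundLevelDb, long freeDiskBytes)
        {
            // channel; sound level; free disk space; timestamp of this message
            string message = $"{channel};{soundLevelDb:F1};{freeDiskBytes};{DateTime.UtcNow:O}";
            byte[] payload = Encoding.UTF8.GetBytes(message);
            _udp.Send(payload, payload.Length, _target);          // fire-and-forget broadcast
        }
    }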


Recording an audio file is easy; once you get the basics, you’re done. A .wav file consists of a few headers followed by the sound samples. Still, a few problems can occur, and one of the most frequent is running out of disk space. As long as you record, analyze and move the files to the backup server on time, there will be no problems. We would look at the management panel’s dashboard to see the whole picture: if all the machines were recording and we had buffer space on the server, we would sleep well that night.
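For the curious, those "few headers" boil down to something like this C# sketch, which writes a canonical PCM WAV header before the samples.

    using System.IO;
    using System.Text;

    static class WavWriter
    {
        public static void WriteHeader(BinaryWriter w, int sampleRate, short channels,
                                       short bitsPerSample, int dataLength)
        {
            short blockAlign = (short)(channels * bitsPerSample / 8);
            int byteRate = sampleRate * blockAlign;

            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + dataLength);                   // remaining file size after these 8 bytes
            w.Write(Encoding.ASCII.GetBytes("WAVE"));
            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                                // fmt chunk size for plain PCM
            w.Write((short)1);                          // audio format 1 = PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(byteRate);
            w.Write(blockAlign);
            w.Write(bitsPerSample);
            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(dataLength);                        // raw sound samples follow immediately
        }
    }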

Scaling up a simple system is not just multiplying the resources used by the number of channels. It’s not that easy. After we started to record 16 channels with one application, we noticed that if that application crashed for any reason, we lost 16 channels at once. Since we were reporting the advertisement broadcast logs, that meant losing money.

We also had file transformations, frequency analysis of audio files, and detection of stripe ads on TV recordings. We decided to run each of these in its own process space, which, in simple words, means we run an exe file for every job. If that exe collapses, it does not take the other processes down with it.
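The core of that approach fits in a few lines of C#. The worker binary name and arguments below are hypothetical; the point is that a crash inside transcoder.exe only ends that one process.

    using System;
    using System.Diagnostics;

    var job = new ProcessStartInfo
    {
        FileName  = "transcoder.exe",                   // hypothetical worker binary
        Arguments = "--input ch07_1300.ts --output ch07_1300.mp4",
        UseShellExecute = false,
    };

    using var worker = Process.Start(job);
    worker.WaitForExit();                               // a crash here cannot touch our other jobs
    Console.WriteLine($"Job finished with exit code {worker.ExitCode}");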


OK, now for the next question: how many processes can we run at the same time? Our CPU and memory are not infinite resources; there is a limit to how much of them we can use, because other processes also need them. We can run the infamous ffmpeg to convert our media files from one format to another, but converting media files means decoding and re-encoding them, and that takes a lot of computing power. Meanwhile, we have other processes working for our business. Now we need a process scheduler.


Without a process scheduler, we would launch as many applications (with the right parameters) as we have jobs. For example, let’s say we have 50 channels recording on one computer. We record files an hour long; at the beginning of a new hour, we close the recording and start a new file. Yes, we would have 50 transcoders, analyzers, file movers and so on running at the same time. When we are talking about media recording systems, this means anarchy, chaos and even catastrophe: media files are big, they cannot stay in the cache for long, and we would lose data (and of course, we learned that the hard way too). You can see our CUDA-powered stripe advertisement recognition system running as different processes at the same time on our Windows system.


A process scheduler works like a thread queue. Creating and destroying threads takes time, and having too many threads in your application consumes resources; for this reason, we create a thread and then feed its internal queue with jobs. It’s almost the same in our process scheduler. We do not care about the time it takes to start a process, because our media files are huge and it really does not matter. All we want is not to block the I/O devices and to leave some computing resources to the other processes.
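A rough sketch of such a scheduler in C#, assuming a fixed number of slots guarded by a semaphore; the class is mine, not the original one, but the shape is the same: queue the jobs and never run more than the limit at once.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    class ProcessScheduler
    {
        private readonly ConcurrentQueue<ProcessStartInfo> _jobs = new();
        private readonly SemaphoreSlim _slots;

        public ProcessScheduler(int maxConcurrentJobs) => _slots = new SemaphoreSlim(maxConcurrentJobs);

        public void Enqueue(ProcessStartInfo job) => _jobs.Enqueue(job);

        public async Task RunAsync()
        {
            var running = new List<Task>();
            while (_jobs.TryDequeue(out var job))
            {
                await _slots.WaitAsync();               // wait until one of the slots is free
                running.Add(Task.Run(() =>
                {
                    try
                    {
                        using var p = Process.Start(job);
                        p.WaitForExit();                // a failed job only costs this one slot
                    }
                    finally { _slots.Release(); }
                }));
            }
            await Task.WhenAll(running);                // let the last jobs drain
        }
    }

Constructed with something like new ProcessScheduler(16), the running-job count stays fixed at 16, which is the limit mentioned further below.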

After all these years, when I left the company, I had my own experience in my pocket. I never use GUI workers anymore (except for demo purposes). I create console workers and redirect their output to an MDI application. There is a limit on the number of console applications, and the MDI parent shows "fake" console applications by printing their STDOUT to its own window. It also redirects STDERR to a logger, which writes the errors to a database to check later.
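In C#, the redirection part could look roughly like this; the worker name and the two helper methods are placeholders standing in for the MDI display and the database logger.

    using System;
    using System.Diagnostics;

    var info = new ProcessStartInfo("analyzer.exe", "--job 42")   // hypothetical worker
    {
        UseShellExecute = false,
        CreateNoWindow = true,                          // no floating console windows on the desktop
        RedirectStandardOutput = true,
        RedirectStandardError  = true,
    };

    var worker = new Process { StartInfo = info };
    worker.OutputDataReceived += (_, e) =>
    {
        if (e.Data != null) ShowInMdiChild(e.Data);     // "fake" console inside the MDI parent
    };
    worker.ErrorDataReceived += (_, e) =>
    {
        if (e.Data != null) LogErrorToDatabase(e.Data); // errors go to the database for later checks
    };
    worker.Start();
    worker.BeginOutputReadLine();
    worker.BeginErrorReadLine();

    static void ShowInMdiChild(string line) => Console.WriteLine(line);           // placeholder
    static void LogErrorToDatabase(string line) => Console.Error.WriteLine(line); // placeholder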

This way I keep the number of processes fixed, and I do not have floating windows all over the desktop. I can connect to this MDI parent from another machine to see its queue and running jobs. If a process fails, the failure is contained in its own process space; the job gets removed from the queue, and another process is up and running in seconds.


This way I can keep the number of running jobs constant (you can see that it is set to 16 jobs at a time). I see the errors in red and normal console text in green (yes, a little bit of nostalgia is good for anyone).

You could call it a "poor man's microservice". Here is a screenshot of a working application:




That's all for now. There might be some points to be touched on.

