Joshua Johnson

Artificial Intelligence in the Real World: Part 2

As I discussed in the last post, I want to share my experience of trying to implement AI in the real world and some of the lessons I learned along the way. I left off describing how we wish to hand off the high-level job of finding an optimal trajectory to the agent in order to achieve some set of objectives. More specifically, we wish to have the agent learn an optimal policy that achieves these objectives. A policy can be thought of as the decisions the agent would take for a given environment state, or in other words, as the replacement for the current hand-tuned algorithm. To find this policy we need to train the agent with experience and reward or punish it based on our objectives. Just as a newborn baby learns to walk by receiving positive and negative feedback through its experience, the newborn agent will need to learn to do its job by gaining experience in its environment.

In our case, what constitutes experience? Every activation of the device, from the moment it starts until it reaches its end target. In the case of a vessel-sealing device this means that the surgeon or research engineer has pressed a button to start delivering energy, and after several seconds the energy stops when a target has been reached. The timestep data that was collected and acted upon is the experience needed to improve the agent's decisions. This data consists of parameters like the timestamp, the power level at each step, and other proprietary data that the instrument measures. Unfortunately, due to confidentiality I can't go into detail on how the current devices operate, but I can give an analogy.
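To make that a bit more concrete, here is a minimal sketch of how such experience might be logged. The field names are hypothetical stand-ins; the real instrument records additional proprietary measurements that I can't show here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TimestepRecord:
    """One timestep of logged device data (illustrative fields only)."""
    timestamp: float                            # seconds since activation
    power: float                                # power level applied at this step
    extras: Dict[str, float] = field(default_factory=dict)  # other measurements

@dataclass
class Episode:
    """One activation: from button press until the end target is reached."""
    steps: List[TimestepRecord] = field(default_factory=list)
```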

Let's say that you wanted to invent a robotic grill that can cook any steak to perfection. Mmmmmm... steak. You have a preference for medium-rare steak, so timing and flame size (i.e. power level) are important. You don't want your steak overcooked or undercooked. So, you decide to gather data on how you personally cook the steak. This data might consist of real-time power level readings and, let's say, the output of a built-in camera that captures the color of the steak. Bear with me on this. For argument's sake, you don't have a way to check the internal color of the steak, just the color of its surface. These two sensors, power and color, make up the state of your environment. Your action space is the amount of power you deliver to the meat, or tissue.

Now we start to see the first challenge of our problem domain. How do we know when we've finished cooking the steak? What is the target? The target is to cook the steak to medium rare, but how would we know when we got there? We don't have a direct measurement of "rareness", and the target itself is ambiguously defined. This problem is known more generally as a Partially Observable Markov Decision Process (POMDP). RL is used to solve problems that can be described as a Markov Decision Process (MDP), meaning the process the agent is engaging in can be described by states whose transition probabilities depend only on the current state and action, with no need to keep track of history. Additionally, it's assumed that the state space is fully observable. However, as in most real-world applications, we don't have sensors that measure everything we need in real time. The best we can do for now is to use a proxy for the target, say the color.

We can also start to see the second challenge, which was hinted at above. MDPs assume you don't need to remember the past. Here again, most real-world problems need to keep some form of memory. In this particular case, each steak cooks slightly differently because of its size and proportions. This can lead to aliasing of states: two different states that look like they're the same one. This condition can confuse the agent if not addressed, so we will need to bring history into the state space. Simply put, we include several steps of the past in the current state:

S_t = <p_t, c_t, p_{t-1}, c_{t-1}, p_{t-2}, c_{t-2}>
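As a rough illustration, here is how such a history-augmented state could be assembled from a stream of hypothetical power/color readings. The window size and observation values are assumptions for the steak analogy, not the device's actual parameters.

```python
from collections import deque
import numpy as np

HISTORY = 3  # current step plus the two previous steps, as in S_t above

def make_state(buffer):
    """Concatenate the buffered (power, color) pairs, newest first,
    to match S_t = <p_t, c_t, p_{t-1}, c_{t-1}, p_{t-2}, c_{t-2}>."""
    return np.concatenate(list(buffer)[::-1])

# Hypothetical stream of (power, color) observations.
observations = [(0.8, 0.10), (0.9, 0.12), (0.7, 0.15), (0.6, 0.22)]

buffer = deque(maxlen=HISTORY)
for p, c in observations:
    buffer.append(np.array([p, c]))
    if len(buffer) == HISTORY:
        s_t = make_state(buffer)   # 6-dimensional augmented state
        print(s_t)
```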

So we've more or less solved our first two problems, but before we can even train an RL agent we immediately hit another snag: we need a simulation of the environment to train the agent in. Most RL problems, such as the baseline examples in OpenAI's gym or MuJoCo, come with physics-based simulators for games or robotic applications. But for anyone striking out on a novel problem domain there is likely no open-source simulator. You will have to build it yourself. This is probably the biggest problem for most non-standard applications. RL methods are sample-inefficient and require the agent to explore trajectories that may not be obvious. Even on simple toy problems like gym's inverted pendulum or mountain car examples it may take hundreds of episodes to solve the problem. This makes simulation much more cost effective than cooking hundreds to thousands of steaks, or, in the case of our medical device instruments, sealing thousands to tens of thousands of vessels.
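If you do end up writing your own simulator, the interface itself is the easy part. Below is a bare-bones sketch following the familiar gym-style reset()/step() convention; the dynamics and reward are placeholders for the steak analogy, not a real tissue or cooking model.

```python
import numpy as np

class SteakGrillEnv:
    """Toy, gym-style environment skeleton. All dynamics and reward
    numbers here are made up for illustration."""

    def __init__(self, max_steps=200):
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.color = 0.0                      # surface color: 0 = raw, 1 = charred
        return np.array([0.0, self.color])    # state = (power, color)

    def step(self, power):
        # Placeholder dynamics: more power browns the surface faster.
        self.color = min(1.0, self.color + 0.002 * power + np.random.normal(0.0, 0.001))
        self.t += 1
        # Placeholder reward: stay close to a "medium rare" proxy color.
        reward = -abs(self.color - 0.6)
        done = self.t >= self.max_steps or self.color >= 1.0
        return np.array([power, self.color]), reward, done, {}
```

The real work, of course, is in replacing those placeholder dynamics with something that actually matches the recorded data, which is what the next part is about.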

One way to solve this is to build a time-series model of your environment using a feedforward Multi-Layer Perceptron (MLP) or a Recurrent Neural Network (RNN). If you have enough data from the real world, you can train a network that predicts the next state, S_t+1, given the current state, S_t, and action, a_t. This architecture follows model-based RL methods, which can improve sample efficiency. In practice, most environment dynamics are complex enough that they require fairly sophisticated models. For instance, I found that a simple MLP of 2-3 layers with fewer than 100 nodes each was not able to accurately predict the tissue dynamics over the course of several seconds of simulated time (thousands of time-step predictions). I then looked to RNNs and focused on a Long Short-Term Memory (LSTM) architecture, since those have been shown to be more accurate over longer periods of time. However, even LSTMs started to fall apart due to the difference between how the LSTM is trained and how it is run when predicting, or inferring. When it was trained, the LSTM had access to the next state's ground truth. But when it was inferring, the network had to feed its own predictions back into itself. Small errors compounded over time and led to large errors at the end of the episode, even though I fed it the exact same actions as the ground truth. Additionally, if the AI agent were to explore state-action pairs that the model had never seen before, the model may not be able to accurately predict the next state. In other words, the training data trajectories are skewed towards prior experience.
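For reference, a learned dynamics model of this kind can be as simple as the sketch below (PyTorch, with made-up layer sizes rather than the ones used on the real device data). The rollout function shows where the compounding-error problem comes from: at inference time, each predicted state is fed back in as the next input.

```python
import torch
import torch.nn as nn

class DynamicsMLP(nn.Module):
    """Small feedforward model of the environment: (state, action) -> next state.
    Layer sizes are illustrative only."""

    def __init__(self, state_dim=6, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def rollout(model, s0, actions):
    """Autoregressive rollout: each prediction is fed back in as the next
    input, which is where small one-step errors compound over an episode."""
    states, s = [], s0
    with torch.no_grad():
        for a in actions:
            s = model(s, a)
            states.append(s)
    return torch.stack(states)
```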

One approach that worked fairly well was to rescale prior-experience episodes to a maximum number of timesteps, downsample them, and use an autoencoder to recreate the trajectories. This method uses an encoder and a decoder to take the entire sequence of data, compress it into a latent space, and then decode, or recreate, the original data sequence from the latent space. This offers a couple of interesting benefits. First, the encoder must learn an efficient latent representation of the original data distribution. In other words, it must figure out how to encode the data such that relevant features are preserved in the vector space. By doing so, the engineer can take several samples of known classes, such as different types of tissue (arteries or thoracic vessels) or field conditions (wet or bloody environments), and observe where they lie in the latent space. The engineer can then generate new data sequences by sampling from around those locations in the latent space. Even more interesting, the engineer can do vector math in the latent space and generate sequences for which they don't have ground truth data. For instance, by finding the vector between dry and wet tissue samples, the engineer can add this vector to other classes of tissue, such as mesentery, in order to generate wet mesentery samples.
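A minimal sequence autoencoder along these lines might look like the following (PyTorch, with illustrative dimensions; the real architecture and sizes were tuned to the device data and aren't shown here):

```python
import torch
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    """Compresses a fixed-length, resampled episode into a latent vector
    and reconstructs the whole sequence from it."""

    def __init__(self, feat_dim=2, latent_dim=16, hidden=64, seq_len=128):
        super().__init__()
        self.seq_len = seq_len
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent_dim)
        self.from_latent = nn.Linear(latent_dim, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def encode(self, x):                       # x: (batch, seq_len, feat_dim)
        _, (h, _) = self.encoder(x)
        return self.to_latent(h[-1])           # (batch, latent_dim)

    def decode(self, z):
        h = self.from_latent(z)                # (batch, hidden)
        rep = h.unsqueeze(1).repeat(1, self.seq_len, 1)
        seq, _ = self.decoder(rep)
        return self.out(seq)                   # (batch, seq_len, feat_dim)

    def forward(self, x):
        return self.decode(self.encode(x))

# Training uses a reconstruction loss over the whole sequence, e.g.:
# loss = nn.functional.mse_loss(model(batch), batch)
```

With a trained model, the latent-space arithmetic described above amounts to something like z_wet_mesentery = z_mesentery + (z_wet - z_dry), followed by decode() to generate the synthetic sequence.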

Another benefit of the autoencoder architecture is that the loss is calculated over the entire sequence, which helps improve long-term accuracy. To use it in inference mode, I simply fed the cumulative sequence of states up to time t through the entire autoencoder in order to predict the t+1 timestep. By ignoring all the predictions after t+1, I was able to get a fairly accurate model of the tissue response.
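In code, that inference trick might look roughly like this; how the partial sequence is padded out to the model's fixed length is my assumption for the sketch, not a detail from the original work.

```python
import torch

def predict_next_state(model, history, seq_len=128):
    """One-step prediction with the trained sequence autoencoder:
    reconstruct the whole (padded) episode, then keep only the value
    at the next timestep and discard everything after it."""
    t, feat_dim = history.shape               # history: (t, feat_dim), t < seq_len
    padded = torch.zeros(1, seq_len, feat_dim)
    padded[0, :t] = history                   # known states up to time t
    with torch.no_grad():
        recon = model(padded)                 # (1, seq_len, feat_dim)
    return recon[0, t]                        # prediction for timestep t+1
```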

At this point I’ve shown you how to overcome some of the early problems I had in trying to build an AI agent for a medical device product. In the next post, I’ll discuss the RL methods that I used along with the pros and cons of each.
