013 - Ryan Summers
2023-08-18
In this episode, James chats with Ryan Summers about the process of developing protocols, the guidelines of working in safety critical and embedded engineering.
Originally recorded on 2023-08-03
Audio
FLAC
M4A
MP3
Show Notes
- 0:00:38 Preface + Intro
- 0:02:18 Explanation of Booster + Stabilizer
- 0:05:33 MQTT as control plane vs data plane, PubSub message brokers
- 0:09:05 How to integrate lightweight protocols and heavyweight stacks, telemetry of lab devices
- 0:13:27 Messaging schema for specifications at scale and developing for multiple projects
- 0:17:25 Point-to-point schema comprehension + Rust error handling
- 0:22:50 Using a protocol the way it was designed and its flexibility, bespoke protocols and the changes with MQTT Version 5.0
- 0:29:38 Development / evolution of protocols, seeing other people using it, designing firmware
- 0:43:54 Designing firmware dumb vs complex, building logic with strict real time requirements
- 0:48:00 Avionics, gathering data from multiple devices, communication between devices
- 0:57:57 Safety critical, systems problems and the varying levels of embedded systems teams
- 1:06:59 Protocols being Legacy Code
- 1:08:58 Partial understanding, serialization and data interpretation, building a virtual machine
Transcript
James Munns
So I think the thing that originally got us to reach out for a podcast episode was talking about MQTT right?
Ryan Summers
Yeah I think it was mostly MQTT and kind of runtime configuration of devices and kind of embedded and things like that.
James Munns
Okay, yeah, because I mean I think the thing that I've seen you work on the most is Stabilizer and I know that's got all of that I assume. I assume like every experiment run is configured and it's always talking over MQTT over ethernet I think, so I think you're basically the perfect person to talk about that, right?
Ryan Summers
Yeah, actually Stabilizer is not even the only project I've done for QUARTIQ. There's also Booster which is a big kind of, for you rack mount power amplifier for lasers as well. And so we've kind of tried to make a common stack for them that they can talk and configure to devices, because the idea is to have kind of all of these things in a lab sitting together and network them together, hook up lasers, maybe sometimes send signals all the way across Germany to other labs and things like that, and so the idea is that you could connect all of these devices, have them talk to each other and control things.
James Munns
Cool. So before we get too far with all of that — because I want to talk about all of that — Do you want to give yourself a quick introduction?
Ryan Summers
Yeah, my name's Ryan Summers. I have been doing embedded Rust for 4, 5 years now, something around that. I do embedded consulting and also have a startup doing embedded manufacturing automation. So if you ever want to look up forged.dev.
James Munns
Oh you're working — Oh that's right, you have mentioned that you're working with Noah…
Ryan Summers
Yeah I've done a Rust startup too.
James Munns
Very cool.
Ryan Summers
So there's many things we could talk about today if you want.
James Munns
So you mentioned Stabilizer and Booster. Do you want to give a quick example of what each of those are, and how those typically get used for scientific things?
Ryan Summers
Yeah, the easier one to start with is definitely Booster. The general idea is that you've got some kind of signal source that you're going to be using to drive some laser in a lab, and generally the laser needs to be running in some kind of high power, and your signal generator doesn't generate high power and so the idea is run coaxial into this thing, specify how much gain you want on the output and it handles all of the power management and generates the, the necessary gain so that you can actually drive your big laser load. And it's got all kinds of nice safety protection mechanisms. It's got fans in there that you can automatically spin up and keep things cool. It'll handle interlock tripping over power and all kinds of fancy things.
James Munns
So what does high power in this case mean? Because it's one of those domain-specific things where like sometimes high power means like, an amp and then sometimes high power means like, well it's thousands of amps. But I guess for lasers… like, because “blind you” starts pretty early with lasers. So I have no sense of the difference between like “blinds you” and like, “could cut wood” sort of laser power.
Ryan Summers
That's, yeah, no, that's a good question. I don't actually know, I don't do much in the physics realm. The only thing I do know is they use some of these lasers for things like both neutral atom and ion traps, and so they might need more high power lasers than what I normally would think a laser would be. So I don't know, good question. But there's some pretty beefy fans. The device is pretty large and it has direct mains wall connections. So I imagine you're probably in the few amps range.
James Munns
Gotcha. So that's your — it's your amplifier, but I'm assuming with a much lower latency and a much higher power than your typical, like, speaker amplifier type of system.
Ryan Summers
Yeah.
James Munns
Very cool. So then that's Booster, so then what is Stabilizer?
Ryan Summers
Yeah, Stabilizer, that's the hard one. The best way to describe it is that it's a Swiss Army Knife in that it takes in arbitrary analog signals, does all kinds of DSP math on that and then generates some analog output. And so essentially you get this big two channel input, do all the math you want in software realm, modify the signal, whatever you want, implement your own control loops, have PID controllers, have low pass filters, all kinds of things and then you generate your DAC signals and you write them out to the buffer. The whole idea is that it's pretty low latency, very deterministic, can handle a whole ton of throughput and it's got a pretty beefy CPU on there so we've got full live stream data streaming so you can offload it onto a computer for analysis. All kinds of good stuff. I think that's where we first started working with MQTT, because there was this need to be able to say, like, “Oh, well I need to adjust these filters,” or “My scientific experiment has these different requirements.” And so we're like, well, we don't want to flash the device every time we need to, like, update the filter parameters, because we don't know what they're going to be. We kind of need to adjust on the fly in the lab. And so what it actually allowed us to do was we put MQTT on there, our configuration software, and then we just run some Python scripts that automatically calibrate things.
James Munns
So you're using ethernet for the only data link to the device right? But I'm guessing you have a separate channel for high-throughput, digital-acquisition-data data. And then MQTT is primarily your control plane? Or is the data plane also MQTT, just sending serialized data out over MQTT or something like that?
Ryan Summers
Nah, so we've got it set up in a few different ways where the data streaming is just kind of raw UDP stream to an endpoint. I think we have it set up so that on MQTT you basically say — stream all of the raw data to this IP address at this port, and once you set that it will just start dumping it out there… but it would be interesting to see if you actually could run that over MQTT because honestly the protocol is pretty tiny on top of the actual data but it's a good question.
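To illustrate the split Ryan describes here (a small control-plane message picks the destination, and the bulk data then flows as raw UDP), here is a minimal sketch that is not taken from the Stabilizer firmware. The JSON payload shape with `ip` and `port` keys and both function names are assumptions made up for this example.

```python
import json
import socket


def parse_stream_target(payload: bytes) -> tuple[str, int]:
    """Decode a control-plane payload like b'{"ip": "127.0.0.1", "port": 9000}'.

    The payload layout is hypothetical; it only stands in for "stream to this
    IP address at this port" arriving over the control plane.
    """
    target = json.loads(payload)
    return target["ip"], int(target["port"])


def stream_samples(target: tuple[str, int], batches: list[bytes]) -> int:
    """Send each batch of raw bytes as one UDP datagram; return how many were sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for batch in batches:
            sock.sendto(batch, target)
    finally:
        sock.close()
    return len(batches)
```

The control message is tiny and infrequent, so its encoding hardly matters, while the data path avoids any per-message broker overhead.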
James Munns
This is one of those things that I was going to poke at: so, I mean, I've done IoT stuff for the last while and it's one of those things that, like, I came from like the avionic side and then went into IoT and first it was a huge culture shock of “What do you mean product cycles aren't measured in the better part of a decade?” or something like that. I was doing a lot of rapid prototyping stuff too for startups and things like that. The change of pace was just really interesting to me. And MQTT is not new, I mean like, it has a really interesting history of… I think it came from, like, oil rigs or something like that? The idea was that it was over basically the satellite text messages, and that was the idea was that it was this super low overhead, pub sub protocol so that when you were talking with middle of the ocean satellite ah rigs you could do broadcast and things like that, and then eventually it became the standard that people use for IoT stuff now… where nowadays it's almost exclusively over TCP. Even the embedded systems that I see use it will do it over like an ESP32 for WiFi or some kind of hardwired ethernet or something like that, and everyone's using, like, JSON for the message formats and that's even sort of now even codified by these cloud platforms like AWS Greengrass. It's like web requests over MQTT where it almost feels like you've taken what everyone describes as a lightweight protocol with very embedded-friendly and things like that. It's like well but you're doing TCP with TLS. Especially like five or six years ago when the ESP32 wasn't even really a thing there was really only like the 8266 and your Wi-Fi choices were usually super limited where everyone's like, “It's this embedded friendly platform!” But then everyone runs it on a Raspberry Pi or something like that and it's like... 
If you have a device that's doing TLS and doing all of this, why do you need a lightweight message protocol versus something like ZeroMQ or RabbitMQ or, like, any of the other PubSub message brokers, which get used a lot for backend services for either like control plane. I came to you and I reached out to you because you were saying very positive things about MQTT and I was like, I have this big chip on my shoulder from doing a lot of one-off demos where it worked and it's, like, a reasonable protocol and it has very, like, straightforward nice to work with semantics. But it always bugged me as someone who has done a lot of like — either really hard, real-time embedded stuff, or very like “Ah, we've got 8 kilobytes of RAM” kind of thing, when people describe it as like a lightweight embedded protocol and I'm like — as soon as you have TLS like you've sort of left the realm of lightweight protocol. But I guess since your control plane is largely wired ethernet and you're on I'm hoping a fairly well-segmented network… These aren't connected-to-the-internet sort of devices, these are connected-to-local-control-land kind of devices, hopefully.
Ryan Summers
Yeah, you make a very good point about the TCP/TLS being a very heavyweight stack, especially when you're coming into the embedded realm and you have this low bandwidth protocol, ideally. And in reality I kind of have the same assessment as you that it feels weird that you would write this very low overhead protocol and then require this super heavyweight TLS/TCP underneath. But in our use case, we did the exact same thing: we're taking JSON, so obviously not using these very condensed packets, and we're actually not using MQTT in a manner where we care about kind of throughput and data rates and things like that. I think the main reason we ended up going for it is because it's actually got kind of an ecosystem that's developed around it. Like now we could potentially allow someone who uses one of these devices in the lab to hook straight up to some of those, like, AWS services and start logging all of those things. And one of the nice things we have is suddenly you can spin up Grafana in 5 minutes and get yourself a dashboard that shows everything about the device over the last week and so you can see how hot things were, what kind of gains, how the control loop was behaving. We've actually used this to diagnose why test setups in labs were malfunctioning because like, oh, we see this huge correlation where suddenly when we get a few tenths of a degree C increase on this device, we start seeing, immediately, the control loop starts getting out of whack and our error starts increasing. I don't know if I'd use it for a really deep real-time, deeply embedded kind of application, but here it works really nicely where we just want to be able to have something that's connected to the network, not think about it, there's a well-established protocol and we can use that for telemetry and control.
James Munns
Yeah, that's one of those like make-or-breaks for embedded projects in my opinion: having a backbone to the device that you're talking to and having some sort of protocol where you can do multiple things over. So like, it's almost the first thing that I do on nearly every project: if it's over USB I set up some kind of data pipe over USB where I can send logging messages and commands and responses, and then ideally having things like instrumentation commands so you can trigger it to do behaviors either for testing or just for like, “Okay, now I've hooked up my oscilloscope onto this and I need to figure out why this relay is glitching a little bit,” and I can look at it and I can just press a button on my laptop and make it do things. And then when you start getting into devices where you have tens or hundreds or thousands especially, then having a network where you can talk to a fleet at once and address them using things like that. And you mentioned being able to use off-the-shelf Python libraries. I've built a lot of bespoke protocols. Like, when I do hobby stuff, I do bespoke everything because that's just, you know, that's what brings me joy. But for customer projects and stuff like that where you're like, “I'm not going to be the only one maintaining this. I don't have unlimited time and there are actually deadlines and things like that,” and, like you said, just being able to have something that everyone deals with, like MQTT plus JSON, is great because literally any language can download a library and if you point it at an IP address or the same broker that all of your end devices are talking to, it doesn't matter whether you speak Python or Rust or Bash or C or whatever, you have this sort of like common language where, even though it's not ideal in any sense of the word, the real value is everyone can use it.
And I think that's really what I saw for a lot of the IoT prototyping if nothing else was, even then Mosquitto was a good well-known broker and there were libraries like the Paho MQTT or dozens of other ones where it was always really easy: if you had a device that could do TLS or could do whatever networking you were doing, MQTT was always just one of those ‘set it and forget it’ kind of things. Then the problem is it doesn't really give you much on top of that. You've got topics which you can use to subscribe to. Then just the payloads are whatever, but everyone uses JSON, which means everything's freeform, which means everything... like there's no actual schema a lot of the time. It's just okay, we have LED on. We send a message that's like `LED state: "on"` and it gets like, really ad-hoc. Which on one hand means that you can power through something really quickly. I'm wondering actually how you handle that at scale, whether you just have like an internal spec of like, “Topics look like this. We use these kind of wildcards here. We expect all devices to listen to these topics with their name in the thing, or we expect the message schema to be this,” or just by convention like, it just is what it is and you have a markdown document somewhere that has all of the example JSON messages or something in there.
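For readers unfamiliar with the topic wildcards James mentions: MQTT topic filters use `+` to match exactly one level and `#` (only as the last level) to match everything below. The following is a minimal sketch of that matching rule, assuming a well-formed filter and ignoring the special handling of topics that start with `$`. The topic names in the usage note are invented for illustration.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent level and everything below it.
            return True
        if i >= len(topic_levels):
            # The filter has more levels than the topic.
            return False
        if level != "+" and level != topic_levels[i]:
            # A literal level must match exactly; "+" matches any single level.
            return False
    # Without "#", filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)
```

A per-device convention such as `lab/<device>/settings/...` then lets tooling subscribe to `lab/+/telemetry` for the whole fleet or to `lab/booster-01/#` for a single device.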
Ryan Summers
Yeah, so this is actually where I think things get really interesting with Rust. First of all, we're in the no TLS realm right now. I think the next thing on my plate over the next few months is like, “Maybe we could get embedded TLS running on this, and get that going and talking to Amazon Cloud.” But going back to the schema, one of the things we wanted on top, I talked about this, is kind of like the runtime configuration. So one would be able to say like, “How do I change a setting on the device in like a sustainable manner where we can reuse this through multiple projects?” So we actually developed a method using Rust derive macros and so you just write a struct in Rust and you just put this derive Miniconf on it and it'll automatically interpret that into an entire tree of strings and publish that automatically over MQTT, like, “This is what my current settings tree looks like,” and then you can modify those settings to whatever you want. And so we have like all of these settings structures for Booster, for Stabilizer, that have all of their different things that they're using and it's all just in a Rust struct. And you just hand that to the MQTT client and say like, “Hey, use this as the settings, derive your tree from it, publish everything that we currently have on boot-up,” so that someone listening knows what settings you're using and then maybe they want to change them. And so it makes it really nice because suddenly you can go like, “Ah, this setting isn't actually what I want.” Just restructure that in Rust and it automatically propagates itself over MQTT and handles all of the publication, so…
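The idea of deriving a tree of string paths from a settings struct can be sketched in a few lines. This is a Python analogy using dataclasses, not the Rust derive macro Ryan describes; the `Settings` and `Filter` types, their fields, and the `<device>/settings/<path>` topic layout are all hypothetical.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class Filter:
    gain: float = 1.0
    cutoff_hz: float = 1000.0


@dataclass
class Settings:
    telemetry_period_s: int = 10
    iir: Filter = field(default_factory=Filter)


def settings_tree(obj, prefix=""):
    """Yield (path, JSON text) for every leaf field of a nested dataclass."""
    for f in fields(obj):
        path = f"{prefix}/{f.name}" if prefix else f.name
        value = getattr(obj, f.name)
        if is_dataclass(value):
            # Nested struct: recurse so its fields become deeper path levels.
            yield from settings_tree(value, path)
        else:
            yield path, json.dumps(value)


def boot_messages(device_prefix: str, settings) -> dict:
    """Map each setting to the topic it would be published on at boot."""
    return {
        f"{device_prefix}/settings/{path}": payload
        for path, payload in settings_tree(settings)
    }
```

Calling `boot_messages("lab/stabilizer-01", Settings())` produces one topic per leaf setting, so renaming or nesting a field in the struct changes the published tree without any separate schema file.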
James Munns
I see, that's really interesting because that's one of the things I've struggled with, because in Rust typically I use like Serde a lot where your schema is your struct definition. I do a lot of like binary serialization with Postcard too, where you really need the schema to match. There's no like, “Oops I don't know about that field, so I'll just skip it,” because in a binary protocol, that just gets interpreted as the next field, which means now all of your deserialization is garbage and things like that. That’s one of those interesting things where for more dynamic protocols like JSON where you can have new fields and added fields and restructured fields: how do you handle the case where what your device is sending doesn't match what your tooling is expecting? You mentioned the devices send out their schema on boot, which I guess if you're using like Python or something like that, you can dynamically evaluate and make yourself, like, a dynamic class that has all the fields and things like that you could deal with, or in Rust you could just use like Serde value and interpret that, but when you have to send messages back, is it just something you can just tell when you've changed the schema because you go, “Wait a second… The tooling is expecting the word `temp` for temperature but someone changed it to `temp_C` or `temperature` or something…” Like, that's a really minor renaming change. But there's also a ton of like reorganization and reordering and things like that that can really screw up schemas… I wrote a whole post that was like, is there any value in partially understanding a message? Because if you don't understand it, the best case you can do is gracefully ignore it, like from a programmatic standpoint. It's a little different when you have a human operator. And maybe they get, like, a dynamically built GUI and they're expected to go in and, like, as a human, you know, respond to these sort of things. But like if you're writing a long-running script, if you all of a sudden are getting messages that you don't understand, the best thing you can hope for is: your program just goes “Huh, I don't know…” and doesn't do anything that might cause problems or something like that, and especially when you have high-powered lasers, you really don't want any, uh, misinterpretation of the kind of settings that you're doing. So I'm interested to hear how you handle that, or if you just do, like, careful deployment, or “we just don't touch the schemas for well-known messages” or…
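The difference James describes between positional binary formats and keyed formats can be shown with Python's `struct` and `json` modules. This is a sketch under stated assumptions: the two fixed layouts below are invented for illustration and are not Postcard's wire format, which is different.

```python
import json
import struct

# Hypothetical "v1" layout the tooling expects: temperature (f32), fan duty (u8).
V1_LAYOUT = "<fB"
# Hypothetical "v2" layout the firmware now sends: a new u8 flags field comes first.
V2_LAYOUT = "<BfB"


def decode_v1_binary(data: bytes) -> dict:
    """Decode by position. A changed layout is not detected, only misread."""
    temperature, fan_duty = struct.unpack_from(V1_LAYOUT, data)
    return {"temperature": temperature, "fan_duty": fan_duty}


def decode_v1_json(text: str) -> dict:
    """Decode by name. Unknown keys are ignored; a missing key raises KeyError."""
    message = json.loads(text)
    return {"temperature": message["temperature"], "fan_duty": message["fan_duty"]}
```

Feeding `struct.pack(V2_LAYOUT, 1, 25.5, 40)` to `decode_v1_binary` returns a plausible-looking but wrong reading with no error, whereas `decode_v1_json` tolerates the extra `flags` key and fails loudly if `temperature` has been renamed to `temp_C`.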
Ryan Summers
Yeah, so that's an interesting point. In regard to schemas, I don't know if you could strictly say that we're publishing a schema. What we are publishing when we first start up is like, these are all the settings that are available to you, and this is the current value, and right now we're using Serde JSON. A lot of work just happened in the last week or two with Robert refactoring miniconf that actually took away the assumptions of knowing that like there's Serde underneath, or that MQTT is used in a way. And the idea now is that you can use it to just map an arbitrary string of keys into an endpoint within your structure. And then anything that implements Serialize or Deserialize, you can pass that in and you pass in your key iterator and then you just say, “Okay, I'll deserialize it when I get to the terminal endpoint.” It'll figure out the type when it's there. The idea is like you could now use this kind of miniconf structure over any kind of data link layer, like if you had USB or UART or something like that. You could probably use Postcard with this just fine. But in terms of getting back to your question about schema: this is really where Rust error handling is pretty incredible because essentially when you get to the end, someone's given you some payload and said like, “Hey, set the setting to this value.” And really, we're leveraging Serde at that point where we say like, “Hey, try and turn that into the type we want.” And what's really nice is you can start just propagating errors out using just the `?` and letting things go all the way back. But then you can start catching them at the MQTT interface and start using that error formatting, and then print that back as a response over MQTT. And so suddenly, on your Python side or on your PC where you're trying to configure it, when someone tries to set a setting, you immediately get a response string that says like, “Hey, that deserialization failed because I expected this, like, `/` at this location and we didn't get it.” And so it's really cool to be able to see, like, proper error handling written in code immediately translate into this wonderful user experience on the tooling side, but it also makes it more complicated because as you're writing a library, you're like, “I need to make these errors right, because this propagates all the way back out to the end user.” It's not just code at this point. Someone needs to be able to read this and understand how to fix it.
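The flow Ryan outlines (walk a key path to a leaf, decode the payload there, let any failure propagate up, and format it as a response at the interface boundary) can be sketched as follows. This is a Python analogy in which exceptions play the role of Rust's `?` propagation; it is not the miniconf API, and `SettingError`, `set_by_path`, `handle_request`, and the response shape are all invented for the example.

```python
import json


class SettingError(Exception):
    """Raised when a path or payload cannot be applied to the settings tree."""


def set_by_path(settings: dict, path: str, payload: str) -> None:
    """Walk the path key by key and decode the payload only at the leaf."""
    *groups, leaf = path.split("/")
    node = settings
    for key in groups:
        node = node.get(key)
        if not isinstance(node, dict):
            raise SettingError(f"unknown group {key!r}")
    if leaf not in node or isinstance(node[leaf], dict):
        raise SettingError(f"unknown setting {leaf!r}")
    try:
        value = json.loads(payload)
    except json.JSONDecodeError as err:
        raise SettingError(f"invalid JSON: {err.msg}") from err
    expected = type(node[leaf])
    if expected is float and type(value) is int:
        value = float(value)  # accept "3" for a float setting
    if type(value) is not expected:
        raise SettingError(f"expected {expected.__name__}, got {type(value).__name__}")
    node[leaf] = value


def handle_request(settings: dict, path: str, payload: str) -> dict:
    """Interface boundary: turn any SettingError into a readable response."""
    try:
        set_by_path(settings, path, payload)
    except SettingError as err:
        return {"code": "error", "msg": f"{path}: {err}"}
    return {"code": "ok", "msg": ""}
```

The inner function never formats responses; it only raises. Because the message text ends up in front of whoever runs the configuration tooling, the wording of each error becomes part of the user interface.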
James Munns
Interesting. Yeah, so I mean that's definitely one of those things where yeah, having those kind of channels whether you're setting up a request response… Oh, that's another thing I kind of want to poke at is like your actual communication model of PubSub versus RPC. But having that ability to get that response back is huge, and the more fidelity that you can give back, either that just gets shoved into a log so you can see if this is like a one-off “Oops, see how the message was corrupted on the line,” or the message was partially dropped or things like that, or really, “No, you tried to set `temp` and we don't know what the field `temp` is, we only know `temperature`,” is one of those quick feedback things that becomes a really big deal. The messaging style, it's another thing that I sort of poke at MQTT because you have this world where everyone's connected to a broker. And what I see a lot of people end up doing is they just do a lot of point-to-point communication and almost to the point where they are doing mandatory responses. So every command you send gets an ack, or not just an ack but like an application layer acknowledgement. So you send the message, the protocol sends back, “Yes, I have received that message.” Depending on your QoS settings, you either then get, like, a double acknowledgement or whatever. And then the application deals with it and then it sends back a message that says, “Yes, that succeeded” or “No, that failed” and things like that where even though you have this PubSub world, you end up doing exclusively point-to-point links. It is still useful because all of your end devices end up talking back to the same endpoint which is very useful because it means you don't have to think about it, and once your tooling or whatever connects to that as well, you have at least the routing to all of that regardless of how they're actually wired up. But it's not really like PubSub-y, you know what I mean? Like if you're always sending one message to one device every single time, it's not very PubSub-y.
It's like you have a point-to-point link, like you have a star topology really, and especially when you have like, AWS handling this and then you go oh well, we're also like, you don't really broadcast messages to multiple devices, you end up getting this like point-to-point link enforced by security, which is also not something that MQTT as a protocol super understands, but all of the brokers, like the commercial brokers that I've seen... like, not Mosquitto, Mosquitto handles it where you send everything out and everyone could see anything and there's no real like ACL-ing or there's… Now there's probably ACL-ing, but in like managed brokers from AWS or... I'm picking on AWS just because I had a client a couple weeks ago that used Greengrass, so like this is in my brain right now. You end up like, enforcing at the ACL level that these are all point-to-point links and things like that. So is that something that you actually leverage, like “Ready, set everyone to this,” or is it really still like a point-to-point link for you? I'm just interested if I can find anyone in the field who has used it sort of the way the protocol was designed rather than, “Oh, this is a protocol that's close enough to what we actually want to do and it means that we don't have to write tooling, so we just use it because it's portable,” even if we're sort of like, not abusing, but like, evolution of how the actual protocol gets used.
Ryan Summers
So actually in MQTT v5 they updated a lot of the way MQTT works to support this kind of point-to-point communication a lot more because now messages have properties associated with them. And one of the properties is something called correlation data where it's just a binary blob that you say like, “This is this message” and the intended use case is that you send that message to someone, they inspect all the properties, they say, “Ah! There's this correlation data!” and when they generate a response, they're going to put that correlation data in there. And one of the properties is like, what is the response topic. So they've obviously recognized that a lot of people are using it in this kind of not traditional PubSub but point-to-point communication and in our case, we're also mainly using it for point-to-point for the configuration stuff. But I really think the main point is that you get the best of both worlds. Because we do use the telemetry for real PubSub usage, where you're broadcasting out your current state to everyone that's listening, and in some cases we've got our Telegraf scraper that's collecting it and putting it into a DB in Influx so that we can start visualizing our dashboard. We've also got our development tools up there looking at it so we can see like, “Ah, what's the state of the devices we're running, these control loops and stuff?” and so really, you get both things which is really nice. Back in my university days, I did a lot of work with ROS, the Robot Operating System, and that uses the traditional PubSub methodology. And really, being able to publish data somewhere makes developing your system a lot simpler because you no longer have to think about how tightly structured it is, you just start blasting out the data for everyone to listen to.
Maybe someone will need it and then when you're going and working on some other component in the future, you’re like, “Ah yeah, let me just subscribe to that, because I actually, I could use this specific piece of information and I'm going to go and implement my new functionality based on that.” And so really, you don't have to go and modify that original component you made now to get that data because it's just out there in the open for you. So I think really that the benefits of that PubSub style — it's really just flexibility in your design.
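The request/response pattern built on the MQTT 5 Response Topic and Correlation Data properties can be sketched without a network. The `ToyBroker` below is an in-process stand-in invented for this example (exact-match topics, synchronous delivery); it is not a real client library, and the topic names and the `key=value` payload format are assumptions.

```python
import uuid
from collections import defaultdict


class ToyBroker:
    """In-process stand-in for a broker: exact-match topics, synchronous delivery."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, payload, response_topic=None, correlation_data=None):
        for callback in list(self.subscribers[topic]):
            callback(payload, response_topic, correlation_data)


def attach_device(broker, settings):
    """Device side: apply 'key=value' commands and reply where the requester asked."""

    def on_request(payload, response_topic, correlation_data):
        key, _, value = payload.partition("=")
        known = key in settings
        if known:
            settings[key] = value
        if response_topic is not None:
            # Echo the correlation data so the requester can match the reply.
            reply = "ok" if known else f"unknown setting {key}"
            broker.publish(response_topic, reply, correlation_data=correlation_data)

    broker.subscribe("lab/booster-01/command", on_request)


def request(broker, payload):
    """Tooling side: send one command and return the reply that matches it."""
    correlation = uuid.uuid4().bytes
    replies = {}

    def on_reply(reply, _response_topic, correlation_data):
        replies[correlation_data] = reply

    broker.subscribe("lab/tool-01/response", on_reply)
    broker.publish(
        "lab/booster-01/command",
        payload,
        response_topic="lab/tool-01/response",
        correlation_data=correlation,
    )
    return replies.get(correlation)
```

The same broker still serves plain publish/subscribe: any number of listeners can subscribe to a telemetry topic without the device knowing they exist, which is the "best of both worlds" point above. A real tool would subscribe to its response topic once rather than on every request as this sketch does.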
James Munns
Yeah, and that's actually a really good point because when I was doing MQTT stuff, this is still in the 4-point-whatever days, and that's been five or six years, and you know I'm sure the actual use now that Greengrass has launched... I'm sure there are millions and millions and millions more users now of MQTT than when I was poking around it. I was working at an IoT... it was an IoT platform, and that was sort of our main ingress from a lot of these devices, MQTT. So we were building a lot of those things, and I think that was around the time when Greengrass launched for the first time and then I think maybe Azure had an offering that came out around the same time and things like that. So it was interesting to see what we did, and you could tell that there were a lot of these patterns being built because we weren't the only ones doing it that way. You know, you could see a lot of other people doing that. So it's interesting to hear that a lot of that has come out in MQTT5, so I might be just totally off base and going, “Ah, it doesn't do like that,” or, “It doesn't act like that” and the answer is maybe in MQTT4 it was awkward but in MQTT5, they’ve just gone, “No, this is how people are using it. Well, we'll live with it.” But it's interesting because I think you also nail a really big point. There's no, like, ideal pure usage of a tool the way it was designed. It was like you're solving problems, and if it's good enough and it means that you don't have to spend a ton of time developing it and all of that: done, ship it. So it's one of those things of like for my personal projects, I super overdo it. It's weird because my brain will turn off for personal projects versus customer projects, because if you ask me a question when I'm just doing hobby stuff I'll be like no, it must be perfect. It has to be like this.
But for customer projects I have a much better way of like turning off my brain and being like, “No, we're shipping in three weeks it needs to be done. Good, you know, good enough.” And I think the actual points that you raised of having that ability of a way to talk to the device and tooling that lets you talk to the device and that flexibility to go, “Hey, I need more logs, I need to add more logs, and I need to add the ability to understand these logs very quickly,” is one of those things that really does shine with tools like JSON or MQTT and things like that because you can always throw out another topic which existing tooling doesn't have to pay attention to, they can just pretend like those messages don't exist and you don't have to think about routing or filtering or things like that. It's just like — Ah, I will add up one service that just looks at the like hourly fault rate from every device because then I can see patterns and maybe correlate that with temperature or something like that where it does lend itself to the “Throw all the data into the ether, and catch it, and then worry about data analysis later” or figuring out correlation or trends and things like that and… especially when I was working on IoT devices like that's one of those things where trends end up being a lot more interesting than absolute values. Like in the scientific pursuits and things like that you are very interested in making sure that things are calibrated and accurate and you're very interested in the raw data captured or the process data that's captured. But especially when you have consumer devices or when you're, when you're dealing with the like more operational level rather than the scientific level, trends matter because it might not matter if you have 5 faults an hour or 10 faults an hour or whatever, and maybe some devices will be 5 and some will be 10 and some will be 20 or whatever for some metric of line noise. 
But when they start changing, that's the really interesting thing because that's when your devices experienced something weird, or you can see those trends even if the devices aren't calibrated to some exact reference where they have a different baseline, but when they all start going up then it becomes really interesting. So I mean, like I feel like that kind of “capture first, process later” or writing pieces of the- the system that only listen to the parts they care about I think that does really shine with MQTT and that's something I deal with some of my bespoke fancy protocols, because I've built this all to be perfect and efficient, then all of a sudden I want to change one thing and everything falls over. So I totally see the value of that… and I don't- well I didn't super mean to put you on the spot of defending MQTT and things like that because I.. I think it's an interesting protocol but it's one of those things where I learned it because I needed to learn it for work. So I mean you learn enough to become productive very quickly and as you're learning, you're sort of battling your preconceived notions in your past experience versus what you're learning and sometimes you hit these hitches where you go like… that's not how it should be, it should be something else… And sometimes the answer is, “Yeah. It should. But it isn't, so… you know… whatever,” and then the other time is, “Ah, you're missing this piece of information.” Which is why you think it doesn't click but it's just because you're missing some step in there. And I've always sort of wondered is my… lack of love or- or problems or chip on my shoulder about MQTT because yeah, it's just, you know, it's evolved. People just use it because it was close enough and don't try and worry about why it's not the way it isn't… or, am I just really missing something? 
So, that's why I was sort of poking around because I'm interested in hearing someone who really loves using the tool, and seeing if there's just a piece of it that didn't click, or if I'm just undervaluing the flexibility and I go, “Oh, it could be better if it wasn't flexible…” but then that's sort of the key value. You know what I mean?
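Editor's sketch: the point above, that a new topic can be added without existing tooling having to notice, follows from how MQTT subscriptions match topic names level by level, with `+` as a single-level wildcard and `#` as a multi-level wildcard. The matcher below is a simplified illustration in Python, not the code of any particular broker or client; the topic names are hypothetical, and special cases such as topics beginning with `$` are ignored.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a topic name.

    Simplified: assumes a well-formed filter ('#' only as the last level)
    and ignores the special handling of topics that start with '$'.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical topics: an existing dashboard only listens to telemetry,
# so a newly added fault_rate topic never reaches it.
DASHBOARD_FILTER = "lab/+/telemetry"
FAULT_SERVICE_FILTER = "lab/+/fault_rate"

if __name__ == "__main__":
    for topic in ("lab/booster1/telemetry", "lab/booster1/fault_rate"):
        print(topic,
              "dashboard:", topic_matches(DASHBOARD_FILTER, topic),
              "fault service:", topic_matches(FAULT_SERVICE_FILTER, topic))
```

A new service subscribes with its own filter, and nothing that already exists has to change.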
Ryan Summers
So I find it interesting that you're thinking that I'm like the MQTT evangelist here. I would phrase it more as a love-hate relationship. As I'm implementing a client and protocol, it's like, man, this is just so incredibly wonky sometimes. I mean, any protocol has got its own weird quirks, right? But ultimately what drove us to use MQTT was, back when we were first starting out on some of these projects, we had this need where we're like: We need to control these things. We've got an ethernet connection. We don't want to develop our own bespoke protocol, because we'd like to be able to leverage anything that's kind of out there. And so we were kind of looking around and saying like, “Okay, what- what protocols are there?” and I know CoAP was kind of on our list and MQTT was there… I don't know if we saw many others. But ultimately it was more along the lines of like: what is something that we can give back to the community that doesn't exist yet? Because this is a need that we have, and we can probably imagine that a lot of other people are going to have this. Especially as microcontrollers start having more and more capabilities, like you start seeing the STM32H7. It's incredible what you can do on that chip even if it's not power efficient. Things are getting more connectivity. You see network stacks more often. Libraries are getting much more optimized. You can start fitting things like smoltcp on very small devices. And so really, what we wanted to do is like, the beauty of this contract is that it's been open source work and so we were able to say: how can we find something, fill a niche in the community and publish it out there so that other people could also use it? So I think that was one of the real driving factors for MQTT too, because it's like, hey, this is a widely adopted protocol. There's nothing here that works on embedded in Rust yet, so why don't we go ahead and write a library so other people can use it too if they want.
Really, just being able to see kind of what that brings out is really fascinating, because honestly, I had never even heard of MQTT before we went on building this client. We're like yeah, this looks like it fills our needs, and then seeing what comes out as a result, suddenly now two or three years down the line, we have these 5 line Grafana configs and we get this whole dashboard of like all the time series information. And then suddenly they were able to debug like an atom trap and were able to say like, “Ah yeah, when we had the temperature go a little high, our error increased and we lost it.” You wouldn't have that capability without having kind of this connectivity to the device and this easy integration with existing tooling. Like yeah, the data is theoretically all there but… you made a good point, like when you're looking at these IoT or network connected devices, you're mainly interested in the data over time. You’re not interested in an immediate point like: oh cool, the temperature is 32.5 C. Like, that is absolutely meaningless to me. But suddenly when we start looking at it over the time of the day like, oh what's the difference at night versus like, what's the hottest point in the day? Is that going to affect some bar measurements, and is that affecting how the assessive setup is actually working and you can actually start seeing some of that. And that becomes way more fascinating when you aren't looking at it as an individual point, but as a collective whole, especially when we don't have the capability to store all of this data on the device and we're not going to write all this bespoke tooling on the computer to collect it all, and keep it all there, put it into a Postgres database and write all these visualizers. Like, no, I'm not interested in that. Being able to leverage what's out there is really what's been powerful from this outcome here.
Kind of looking back retrospectively, I don't know if we intended any of this when we first set out, but it's been cool to see.
James Munns
Yeah, and that's interesting. So how has that changed for you over the development cycle because, I'm not sure how early you came into the project but having that sort of backbone of connectivity and the ability to get that kind of data — How much of that did you build and lean on in the early stages of development? So not necessarily like developing the protocol, but I assume once you had the protocol up and you were working on it versus the longer term like, evolution of it or maintenance of it or… Has there been any like major shift there, or is that just generally, no, we built them as we needed them and it turns out sometimes they were really useful months later, or they haven't been useful yet, but it doesn't hurt to keep sending them — is that kind of connectivity… How you use that, has that changed over time?
Ryan Summers
Oh yeah, definitely. When I first came on to the project I think it had just recently gotten started up. It was 2019, I think the H7 series was relatively new, especially in Rust support, and I think what Robert had done was the very first ethernet smoltcp implementation on the H7 using, like, raw registers and stuff. And I think that eventually started making its way into the H7 HAL: Richard Cheung, I don't know if I'm pronouncing his name right, incorporated that into the HAL after looking at the Stabilizer repo, we like got it all in there, and so really, it's been interesting to track from when we first started off on the chip when there really wasn't much support, and kind of leveraging all of the things that we're adding into the project and start pushing that out as open source. We started out pretty low level, and we figured out what we needed as we went and we got this kind of like: Okay, now once we've got the basics of MQTT like… We started with a very minimal client. Let's just be able to connect. Let's be able to publish. Let's be able to subscribe. We're not going to worry about retained topics. We're not going to worry about topic aliasing or quality of service, like none of that jazz, I just want to fire and forget… because the rest of it seems like a lot of work. And so we kind of got that going and working and we're publishing telemetry and information about the device and suddenly we're like: Man, it would be really really nice if we could talk back to it. “Change your state, do something else,” command it, like not even to the extent of like modifying settings but even just like telling it, “Hey, I want you to perform some action now,” like set some relays or something, not like configure it like… I dunno, initiate a measurement for example if they're like long running. And so we started seeing like okay, there's these things that we kind of need from a high level in an application, how can we kind of build on top of this now?
We've got- like if you think of the OSI model: you've got TCP, you've got MQTT sitting on top of that, you’re like: Okay, well, we've now got this nice way to talk to a specific device, get these acknowledgements and verify that we're setting it properly, get responses, get feedback. So say for Booster, we've got an example where we want to set the gain on one of the channels, and so we say like set the gate source threshold voltage to 1.7 volts, and then in the response it tells us what the drain current is, and so you can now write your control loop on the computer side where you say like okay, I'm going to iterate across like all of the thresholds for the transistor gate, and once I start getting to the point that I want, okay, now let's start backing off and lower it until we get it right. So you can kind of tune it in response with this real-time feedback loop of send a message to the device, get a response back, analyze it, figure out what you need to do, and suddenly now you've offloaded this entire control algorithm from having to write any firmware where you need to think about timing, you need to think about how you're going to do these asynchronous control loops, where you set the transistor gate threshold, but then need to wait 10 milliseconds or something until the ADC update comes in and you need to do this 500, a thousand times and it's just, it becomes this whole mess. But suddenly if you just expose this functionality of set gate threshold, get the response, you can do all of that on the computer in Python. And you don't have to think about the complexities that go into that.
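Editor's sketch: the host-side tuning loop described above might look roughly like this in Python. Everything here is an assumption for illustration: `FakeChannel` stands in for a device reached over a request/response link, its transfer curve is invented, and `set_gate_threshold` is a hypothetical name, not Booster's actual interface. In a real setup each call would be one network round trip.

```python
class FakeChannel:
    """Stand-in for one amplifier channel behind a request/response link."""

    def set_gate_threshold(self, volts: float) -> float:
        """Apply a gate threshold and return the resulting drain current (A).

        Hypothetical transfer curve: no current below 1.0 V, quadratic above.
        """
        overdrive = max(0.0, volts - 1.0)
        return 0.2 * overdrive ** 2


def tune(channel, target_amps, start=0.0, stop=3.0, coarse=0.1, fine=0.01):
    """Sweep up coarsely until the target is reached, then back off finely."""
    volts = start
    # Coarse sweep upward until the reported current reaches the target.
    while volts < stop and channel.set_gate_threshold(volts) < target_amps:
        volts += coarse
    # Back off in fine steps until the current is at or below the target.
    while volts > start and channel.set_gate_threshold(volts) > target_amps:
        volts -= fine
    return volts, channel.set_gate_threshold(volts)


if __name__ == "__main__":
    volts, amps = tune(FakeChannel(), target_amps=0.05)
    print(f"settled at {volts:.2f} V, {amps * 1000:.1f} mA")
```

The firmware only has to expose one setter that returns a measurement; the sequencing and any waiting live in the script.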
And it's fascinating, because similar in this vein, one of the projects that just got finished up: QUARTIQ does more than just Booster and Stabilizer, there's all kinds of lab-grade hardware they're building for people, and one of the setups they had was a neutral atom trap, and I believe they actually had Stabilizer or one of their other controllers talking to another device, both of which were running MQTT on these miniconf clients. And basically, one was setting an interlock for the other and saying like: hey, disable yourself now based on its measurements, so it was performing like remote temperature sensing, remotely disabling the other device. To me, that was just really cool to see in the end, because you don't set out with the intent to build these kind of systems. When you're like, “Oh, I want to build an MQTT client,” like, you have no idea what it's going to be used for. But then when you see someone coming afterwards and like start connecting all these things together across multiple embedded devices and doing end to end communication, like, you see that as a buzzword all the time, you're like yeah, but when did people actually do that, and then suddenly like someone did it. It actually helped to solve their problem and it’s like whoa. That's really cool.
James Munns
Yeah, so you poked at a couple things that are interesting because there's very different environments for: we're building a consumer device that is connected for whatever reason, like it talks to MQTT for logging or updates or remote control or something like that, where- where it's intended to be, ah, a system to itself. It is doing what it was designed to do, and it does that. And then there's devices where you're designing them more to be lab equipment, if that makes sense, like you said: that it will be expected that you are not designing the functionality, you are designing the toolkit of functionality to let other people build what they're trying to build, and it's interesting. That never works for consumer products, for every consumer product I've ever seen that's like, “Oh, it's flexible and you could do that…” Like IFTTT is like, the only version of that that I've ever seen that's not really developer focused, but it's like you could make it do what you'd like it to do, or you know there's some platforms like Home Assistant or things like that. But when you're talking about like a normal retail device like your dishwasher or your fridge or something: like even if it has the ability to have some kind of sensor or send you a notification when your washing is done or something like that. They're not flexible. They are doing what they are meant to do. But that's interesting that you point out- I mean you're really working in a different field where you are building tools for other people. You're not designing these experiments. You say like, it's a power amplifier like, or it has this or you can write custom firmware or if someone wants something specific. Maybe you're writing some of the logic. Being able to offload that is really interesting and from there… It's always interesting: there's like 2 main approaches that I've seen to how people implement embedded systems.
It's a range but there's sort of like 2 polar opposites where: one is you design it dumb, like a control system really, like, in the same way that you would design a piece of electronics where you say like, yes it might have some states that it walks through and things like that, and there might be some logic that it handles but primarily I'm treating it as a very dumb, modelable device and you see a lot of things where people just have like a register table where you say like, there's a table of the gate voltage in out, whatever whatever. And then you know almost getting close to like ladder logic with PLCs and things like that where the logic is very limited but it's mostly parameters and you are tuning them and you might be able to control them remotely with some sort of orchestration or something like that. But the system itself does not think. It is set to a value and it maintains a control loop or a pattern or responds to certain events in certain ways like, “Ah, when I get over this threshold, I cut all of these outputs” or something like that. And the other side is people who build systems that are “All of the logic lives on the device.” So I guess that's mostly like where does the logic live, does it live somewhere else and the device that you are building is an extension of that logic that just does it. It expects that it is one piece in the puzzle and it doesn't think it just does what it is told versus people who build firmware that are very complex and thinks and goes, “Ah, when I am this state I have this whole decision tree of how I could react ,or I have this application logic that I'm thinking a lot of, where on one hand that allows you for for very fine grain control, but on the other side it makes it very inflexible like it is it is doing what it is designed to do because all the logic lives on that. 
I immediately said that there are 2 things and I'm saying that you're doing something in the middle where there are certain tasks like you said where you need that… I'm gonna- I'm going to invoke some audio like DSP language here because it's the only resource I have to draw from, where like if you're building a modular synth or something like that: there's typically 2 rates that you think of. You think of the audio rate, so I am sampling at 44.1 kilohertz or whatever, like I need to be processing samples and things like that. But then they'll typically have a control rate which is much lower than that, maybe like 1 kilohertz or 2 kilohertz, of like, “Yes, although I can sample 12 channels at 44 kilohertz, I can only make changes to my filter parameters, or I can only read the knob that's on the front of the device at 1 kilohertz, like once a millisecond,” or something like that. And they have this sort of like, you've given a name and given sort of like a model and a pattern of working where like there are 2 different worlds. There is the audio line world and there is the control world, and some things operate at control rate and some things operate at audio rate. I've worked on projects that were built really like- and I feel like those audio devices that I just described are much more in the like, it is a device that is configurable but it doesn't think a lot for itself. It's much more on that side. It's always interesting to see different groups of embedded people take one of those models or not take one of those models or really try to do the other one while standing in the wrong pile. Like you write all the internal device logic, but you assume that you're going to be configured by everyone so you write this whole REST API or MQTT API that expects you to do these command and control things but really you can't operate at that rate. So, I’ve sort of gone 4 or 5 different places here.
But: How much of what you're building ends up being “I have to build logic that knows how to to react immediately at like, hard real-time sort of functionality levels,” versus how much can you defer to being dumb, because you do have this network to talk over and you go, “Look, I'm not here for thinking, I'm here for doing,” and the Python script or the scientist or whoever's automating or controlling this is going to do all the thinking, and how does that rub between line rate of whatever you're sampling or responding to, or your hard real-time guarantees rub against- well how much latency can I expect to get from a Python script that's talking over MQTT to my device?
Ryan Summers
Yeah, that- you get into a really interesting point in embedded design, and coming from a background where I do a lot of medical devices, where you have these very strict hard real time requirements. It's an interesting point to bring up, where like there are designs where you expose a register level and like, configure me, do whatever you want with me, and then there's also designs where everything's self-contained. But in reality I think when you see a lot of the more complex embedded systems, they stand right in the middle where you're trying to abstract away the hard real-time nature of things. So the firmware itself like, especially on Stabilizer, there's a ton of very very hard real-time requirements on it because it's got to sample regularly, guaranteed, at like 100 megahertz or something like that. And there's all kinds of DMA channels that are coordinating. It's collecting tons of samples into a buffer before it's firing an interrupt, and thankfully we've got RTIC to be able to handle all of this kind of latency management so you can set your DMA to start interrupting you whenever you're in the middle of these kind of MQTT transactions. It's nice because suddenly you can abstract away all of that hard real-time nature and give yourself this super sloppy MQTT interface that you can talk over, and you're not going to interrupt this hard real-time behavior. And then as soon as it's able to, it's going to adjust itself based on what you've sent over the network, and then start behaving that way. So it's really a mix of the 2 of like: yeah, there's this subset of registers enabled, but you're just kind of making a black box underneath. Like, if you're looking at traditional SoC designs, you've got a register layer, but then you've got all the logic actually implemented behind that you’re configuring. And I think embedded system design is very similar to that.
You're not - I don't want to care about clock domain crossing when I'm writing registers, like I don't want to think about that ever in my day-to-day life. It's kind of interesting. In our case, a lot of the intent is that someone buys this hardware and they're going to have a debug probe that they plug into it. And they're most likely going to be forking the repository, writing their own DSP routines and making it do whatever they want. The hardware is very flexible and I think that's one of the things that really drove into our design of kind of this miniconf, specifying the domain space of the settings, because now you just have a single Rust derive macro at the top of your settings structure, and if someone forks the repository and wants to add their own settings or configuration, like set some GPIOs, they can enable that by just adding a new member and they don't have to think about it. And suddenly that propagates out to all the tooling that we made that doesn't know or care about any settings structure. And so it really makes this way more adaptable for someone that may not care about how the internal hard real-time stuff works. Someone besides me definitely doesn't care how all the DMA streams are collecting data and how timers are triggering. Like, I find that super awesome as an embedded guy, I'm sure you find that interesting too. But someone who's doing like PhD research in a quantum physics lab doesn't want to think about an embedded design, they're not an embedded engineer, they want to get their physics experiment up and running and they need to expose some functionality. And so if we can make that easy for them, while still giving all those hard real-time guarantees, that's where you see this really interesting value proposition.
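Editor's sketch: the idea above, that adding one member to a settings structure automatically exposes it to generic tooling, can be illustrated with Python dataclasses. This is an analogy only and is not miniconf's API (miniconf is a Rust crate built around a derive macro); the `Settings` fields, the slash-separated paths, and the JSON payloads are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class Channel:
    gain: float = 1.0
    enabled: bool = False


@dataclass
class Settings:
    telemetry_period_s: int = 10
    ch0: Channel = field(default_factory=Channel)
    # Adding a new member here is all that is needed to expose it below.


def paths(obj, prefix=""):
    """Yield one slash-separated path per leaf field of a settings tree."""
    for f in fields(obj):
        value = getattr(obj, f.name)
        path = f"{prefix}/{f.name}"
        if is_dataclass(value):
            yield from paths(value, path)
        else:
            yield path


def set_by_path(obj, path, payload):
    """Set a leaf from a JSON payload; unknown paths raise an error."""
    *parents, leaf = path.strip("/").split("/")
    for name in parents:
        obj = getattr(obj, name)
    if leaf not in {f.name for f in fields(obj)}:
        raise KeyError(path)
    setattr(obj, leaf, json.loads(payload))


if __name__ == "__main__":
    settings = Settings()
    print(list(paths(settings)))
    set_by_path(settings, "/ch0/gain", "2.5")
    print(settings)
```

The tooling side only ever sees paths and payloads, so it does not need to know what the structure contains.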
James Munns
Yeah, that makes a ton of sense. I spend a lot of time thinking about protocols. So, I joke that the thing that I am good at is making computers talk to other computers, and so much of embedded systems is just that — whether it's talking to a sensor, or you've got 100 nodes in the field that are talking to a backend or tooling or things like that… a huge amount of the value comes from making computers talk to other computers, which means I think a lot about protocol design and like, different ways of doing things. And it's interesting going back to when I worked in avionics — your airplane is really a network of computers. We even call them line replaceable units — they are purpose-built equipment that have one role and do it really well, and that's sort of the abstraction layer of where it becomes like: I don't think about business logic I think about what the device does and that's- that's device problem to deal with the best way of how do I do the signal processing for a weather radar, or a radar altimeter or something like that, or how do I filter a pressure sensor to get my altitude versus the incoming air speed versus the static altitude a pressure altitude and things like that. It was always interesting because you have all these devices and sometimes you need data from other devices, but also sometimes these devices need to coordinate with each other, where most of the time everything was dumb and so when you talk- when the devices talk to each other, they're mostly just sending their current state — like, and this is the very like control system-y approach of multiple devices I find is — you send what your current values are. 
So like, my state is this you send, “I would like your state to be this,” and you listen to what the other device's state is so it's not like um, request response like you're not querying the device what is the temperature, or you're not querying the device like what is your angle of attack or or whatever or something like that you say like, “The altitude is this, the altitude is this, the altitude is this, the altitude is this,“ and then when something controls it, you don't send an RPC command. You just listen and you hear ‘the angle of attack is 20°.’ And instead of making a request and response, you just say, “I want you to be 25, I want you to be 25, I want you to be 25,” and that way, you can end up writing these very like control systems-y algorithms on all these devices, where if you're running a PID loop on your input controller for this, it's not like, thinking about, “Oh, has this message been acknowledged or not?” or something like that. It's just listening in sort of like the very control system-y PID loop sort of way of like, I control this input into the device and I'm receiving this output to the device and I need to tune my control loop to operate like that. But there's always 1 or 2 devices that do actually need to coordinate. Sometimes it's because they share an antenna and they need to make sure that they have like coexistence, so only one of them is sending at a time, or there are a couple of like actual request response. So, I worked on collision avoidance systems and actually collision avoidance means talking to other planes and so you exchange where you are so you know where all the other planes are and all the other planes know where you are but it actually takes 2 systems to do that. 
There's the transponder which is saying “I'm here, I'm here, I'm here,” and then there's the TCAS unit which is sending messages like “where are you where are you.” You have one unit that's doing one responsibility and another that's doing another, but essentially your TCAS talks to someone else's transponder, and then their transponder talks back to your transponder. So like, you get this sort of back and forth, where if you are in a mode where you're like, “Hey, tell that plane not to come over here.” You need to send a one-shot message, because you don't want to send out like, “Send them this message, send them this message” — because you want that message to go out once and you want the response to come back once. And so it's always this interesting paradigm, where if you can design your systems to be dumb, then you don't have to think about what is my acknowledgement window or you don't have to think about a response timeout because you just say, “Well, if I stop hearing messages for longer than 500 milliseconds the other device is faulted, or the line is broken, and we're done here,” because it should always send at every 100 millisecond or something like that. This sort of gets back to that PubSub versus request response where when you get people coming from a web background a lot of the times they're thinking in terms of like rest requests and things like that and you end up with this sort of like always RPC world where you send a request and you wait for a response versus when you get the more like double-E brained people or people who have worked in control systems you go I don't want that because it's easier to just think of it in terms of a naive control system of there's inputs and outputs and we just are saying what we want and receiving back what they are. I was wondering, how much of the design of your system ends up being more like the web request response sort of world versus the control system “I want you to be this. 
I am this.” sort of back and forth?
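Editor's sketch: the "just keep sending your state" pattern described above replaces acknowledgements and response timeouts with a single staleness check on the listener. A minimal Python illustration follows, using the 500 millisecond fault threshold mentioned above; the class name, device names, and the choice to pass timestamps explicitly are assumptions for the example.

```python
class StateListener:
    """Tracks the latest broadcast state per device and flags silent ones."""

    def __init__(self, timeout_s: float = 0.5):
        self.timeout_s = timeout_s
        self.state = {}
        self.last_seen = {}

    def on_message(self, device: str, value, now: float) -> None:
        """Record a periodic state broadcast received at time `now`."""
        self.state[device] = value
        self.last_seen[device] = now

    def is_faulted(self, device: str, now: float) -> bool:
        """A device is faulted if never heard from, or silent for too long."""
        seen = self.last_seen.get(device)
        return seen is None or (now - seen) > self.timeout_s


if __name__ == "__main__":
    listener = StateListener()
    listener.on_message("altimeter", 1000, now=0.0)
    for t in (0.1, 0.4, 0.6):
        print(t, listener.is_faulted("altimeter", now=t))
```

There is no per-message bookkeeping: the listener either hears fresh state or declares the sender faulted.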
Ryan Summers
Yeah, you made me think of something really interesting. It's kind of when you look at these PubSub methodologies. It almost feels like a way for us to take this digitized state and transform that back into a pseudo-analog continuous time series value of like, “Hey, yeah, my altitude's 1, my altitude’s 1.1, my altitude’s 1.2,” like we discretely- it's a discrete system but suddenly, you've still got this kind of- you've got time associated with it. You can see trends over time. It's interesting when you bring up control systems because it definitely makes sense where you start tying that in, like: when you're thinking about control systems you want to understand it in terms of continuous signals, like you don't want to think of it in terms of like command response. It just doesn't make sense when you're trying to do things like that. When you're applying that in our use case, it’s interesting because when I think about it, like all of the things that are PubSub are these kind of control loop things or data logging, and then all the things that are command response are not related to controllers, like when we're trying to do the settings, you send a request like, “Do my setting thing.” And you get a response like, “I did your setting thing, and it worked.” And it's interesting because it's specific: when you're doing settings, that's not a control loop, but when you're doing telemetry and you're trying to get feedback about state, that is a control loop. That's a really interesting point.
James Munns
Yeah, because in aviation, you would just spec the bus and- this is one of those like safety critical- and you've worked in medical devices, so you know this — versus the, “Ah! 90% of the time we could be more efficient by doing this!” But in safety critical devices you go, “No. Worst case is the only one that matters, so I will just spec the bus so that it has enough bandwidth.” So like if you say, “I'm going to send configuration state, I'm going to send the entire configuration table on every message at the rate of 100 milliseconds or 500 milliseconds, because then I don't have to care is the device booted or not.” You just say, “I want this, I want this, I want this, I want this,” and then once it starts responding, if it responds with something that doesn't match that, you know it's not listening to you. And if it responds with, “My state is this, my state is this,” and it matches what you sent, you don't have to worry about, like, ack and responses and things like that. But from an efficiency perspective, if your whole configuration table is like, 3 kilobytes and you're sending that every 100 milliseconds over- ah you know ethernet's a wide pipe- but if in aviation you use a lot of like very weird archaic serial protocols that are very low bandwidth or you know if you're using I2C or something like that. That'd be awful for device life or like if the fact that you just have to handle that message every one hundred milliseconds is terrible versus like hey no, we're all you know computers here I change the one field that I want you to change and maybe I can also query you to dump your entire state, or you do that on boot up and I just listen. And you get those differentials which are great but it means that you have to think about the next level of complexity up because you go, “Have I heard this message from this device ever?” or if you reboot your tooling loop but not your embedded system, you have to query it. 
“Give me your your status because I just woke up and I've never heard it before even though you sent it 5 minutes ago, I've never heard it,” or something like that versus like the really dumb world of you just send the whole thing out all the time - no thinking only sending. You never have to think about that because you just go, “Okay. If I just listen then I know everyone's state and the first time I hear from them, I know that they're alive and awake.” But from an efficiency perspective, that's a terrible idea, and if you were to doing anything with battery life: terrible idea. Even for you, where you want to probably reserve as much bandwidth on your network for raw data samples and something like that: control data is probably not a huge fraction of that. But if you just have to spend the time serializing and deserializing those messages, that's time you could be spent doing DSP or something like that.
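Editor's sketch: the efficiency argument above comes down to simple arithmetic. Sending a configuration table of about 3 kilobytes every 100 milliseconds needs roughly 3000 × 8 × 10 = 240,000 bits per second before any framing overhead. The Python below compares that with a few raw bus rates; taking 3 kilobytes as 3000 bytes and ignoring protocol overhead are simplifying assumptions, and the choice of buses is only illustrative.

```python
def required_bits_per_second(payload_bytes: int, messages_per_second: int) -> int:
    """Raw payload bandwidth needed to resend the full state periodically."""
    return payload_bytes * 8 * messages_per_second


# Nominal raw signalling rates in bits per second, ignoring all overhead.
RAW_BUS_RATES = {
    "I2C standard mode": 100_000,
    "I2C fast mode": 400_000,
    "100 Mbit/s Ethernet": 100_000_000,
}


def utilization(payload_bytes: int, messages_per_second: int, bus_rate: int) -> float:
    """Fraction of the bus consumed; values above 1.0 do not fit at all."""
    return required_bits_per_second(payload_bytes, messages_per_second) / bus_rate


if __name__ == "__main__":
    for name, rate in RAW_BUS_RATES.items():
        share = utilization(3000, 10, rate)
        print(f"{name}: {share:.1%} of raw capacity")
```

On the wide pipe the full-state approach is nearly free in bandwidth terms; on a slow serial bus it does not fit, which is where the differential updates and their extra bookkeeping come in.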
Ryan Summers
That's actually kind of where the beauty of scatter/gather DMA comes in. Suddenly you're like okay, well, especially with RTIC you're like, “Okay, I need to serialize this huge buffer of data, and it's going to take a long time.” Like, obviously I'm going to be getting samples in that period, and being able to stop yourself when you're like, “Okay, I've serialized the first { of a JSON, and the first " of the first field, I'm halfway through the first word. Okay, now we got to go do some ADC, like do some DSP stuff.” It's really interesting to see that. One of the really interesting parts is also, when it comes down to real-time in embedded, in your codebase when you're writing functions and you have an edge case where you can say like, “Oh, yeah, we could return early here. We could avoid this expensive calculation.” Sometimes it's actually better to still do the expensive dumb calculation for no reason, because what it gives you is this known timing characteristic. Suddenly, when you've got this known timing constant, you don't have these weird conditions where most of the time when you're calling this function it works in a millisecond and then sporadically like, garbage collector comes around and it's 99 milliseconds all of a sudden. If you can keep your control loop consistent, you can kind of manage it better, and so sometimes the best path is not the most optimal execution path and you'll say like, even though we could break out early here we're intentionally not doing that. I don't know if that related too much to your earlier question about kind of the request response stuff, but it kind of jogged an idea in my head.
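Editor's sketch: one way to picture "intentionally not breaking out early" is a sort that always performs the same number of comparisons regardless of input order. The bubble sort here is an illustrative stand-in chosen by the editor, not code from Stabilizer or Booster. Python itself offers no real-time guarantees, so the example counts comparisons rather than measuring time, and the number of swaps still varies with the input.

```python
def fixed_work_bubble_sort(values):
    """Sort a copy of `values`, always doing n*(n-1)/2 comparisons.

    The usual "stop if a pass made no swaps" shortcut is deliberately
    left out so that the comparison count depends only on the length.
    """
    items = list(values)
    comparisons = 0
    n = len(items)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items, comparisons


if __name__ == "__main__":
    already_sorted = list(range(30))
    worst_case = list(range(29, -1, -1))
    print(fixed_work_bubble_sort(already_sorted)[1])
    print(fixed_work_bubble_sort(worst_case)[1])
```

Both calls report the same comparison count, so the best case costs as many comparisons as the worst case, which is the property that makes the bound easy to reason about.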
James Munns
Yeah, definitely for safety critical, that’s one of those things where if you have a sorting algorithm, you might just do insertion sort or bubble sort every single time because you know even though it's O(n^2), you know the number is never going to be greater than 30 and it always takes the same amount of time you might not even do the early return path, versus something that you go, “Well, 99% of the time, it's 10 milliseconds to sort these, versus 1 out of every thousand times, it'll be the worst upside down case and it takes 30 times longer and oops now you've overrun your timing domain” and like… Actually, I'd be interested to hear if your experience is the same is that… Consulting is interesting, because embedded in particular I find has a very different knowledge and maturity level company to company. Like I've been in some companies where the embedded systems team know a lot of best practices and know a lot of ways of like conceptualizing or thinking about systems problems or approaching them and things like that. But I've also helped a lot of startups, where they were building a lot of prototypes or or maybe even with ah a team that's not like classically from systems engineering or or embedded systems and you know they kind of got roped into building the first couple prototypes. And now, if they're building like a robot they're getting really close to safety critical like you know. Maybe they're not in a car or in an industrial control robot or something like that… As the startup expands scope, they're getting closer and closer to like, “Ah, that's- you're, you're getting real close to safety critical and you really need to be doing this the like, IEC61508 kind of way or ah, you know, whatever like safety critical standard of analysis…” It's interesting to see sometimes I come in and I get blown away because I learn a ton of stuff because they've just been doing this for forever and they have a process and a way of analyzing. 
And then sometimes you land at these companies where, especially when they have 1 or 2 embedded developers that were primarily self-taught or came from a very different field and got hired into a new domain, they just have no concept of analyzing things like worst case execution time, or of, “Hey. No, we need to build this as dumb as it can be. Because when you have to go through a qualification process, every line of code is a liability, and so like what's the dumbest way that we can build this and still get away with it?” And I have that sort of experience where, sitting down with them, we do exactly what you were talking about: what's the worst all of these numbers could ever be… like what is the longest latency our ADC could give back before it has a conversion complete, or what's the longest time it could take to send a message like this, or what's the longest time that we're going to allow this to happen before we declare a fault or something like that. It actually sometimes makes things way easier because you don't say, “This is 90% and this is 90%, so statistically, maybe these sometimes-” you just go, “No. Add, add, add, add, add. Do I have the CPU budget or the RAM budget or the code storage budget for all of these things, and does it pass or not?” And if the answer is no, we buy a bigger chip. And if the answer is yes, then we're good to go. Or if we can't buy a bigger chip, the question is what can we cut to make it go away. It's all exactly those things of like, you stop worrying about statistical O(n)-whatever, because in safety critical you only care about the worst case, because that's what matters, or how quickly can we respond to something going wrong.
So yeah, again, I don't know if any of this maps back to the protocol stuff, but it's interesting having that different domain experience, just seeing how different industries solve different problems. The way you must do it in safety critical would not fly in, like, consumer products, because everything would be overspec'ed and everything would run longer. There, it's more like: well, if it crashes once a week, ideally it does it at night when no one's going to notice. And then just, you know, don't display a splash screen and no one notices. That's one of my favorite things from the Pebble Watch: they had a ton of, like, fail-and-reboot-quickly stuff, where if you weren't looking at it when it crashed you probably wouldn't have even noticed that it crashed. And they would do things like take core dumps, so the next time you reattach to your phone, they would just kind of quietly upload a core dump of the crash that happened, so it's all sneaky stuff. Someone's not staring at it, and the question is: how bad do things get if it crashes? If your watch reboots once a month, 99% of the time you're not going to be looking at it so you won't even notice it, so the impact is 0. Versus in your analysis system, if once a day it crashed and it invalidated a scientist's research that day because all of a sudden the data was corrupted, or maybe it was really expensive in time or materials to set up that equipment, the answer of what's an acceptable failure is totally, totally different. And that's always like a rub in my head of like, which approach do I take for those kind of things… So I got really far off on that, but it would be interesting to hear if you've experienced really wide gaps in how people do it, either just stylistically or where they don't really know how to do it at all when you come in to help them for consulting.
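The two ideas in this turn, a sort that does the same work every time and a budget that is just "add, add, add", can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `MAX_ITEMS`, `sort_fixed_work`, `Step`, and `check_budget` are hypothetical names, and any worst-case numbers fed to it are invented rather than measured on real hardware.

```rust
/// Upper bound on the number of items, as in the "never greater than 30" example.
pub const MAX_ITEMS: usize = 30;

/// Bubble sort with no early exit: it always performs the same
/// n * (n - 1) / 2 compare-exchange steps, whatever the input order.
pub fn sort_fixed_work(values: &mut [u16; MAX_ITEMS]) {
    for pass in 0..MAX_ITEMS - 1 {
        for i in 0..MAX_ITEMS - 1 - pass {
            let (a, b) = (values[i], values[i + 1]);
            // Compare-exchange: smaller value first, larger value second.
            values[i] = a.min(b);
            values[i + 1] = a.max(b);
        }
    }
}

/// One step of a control loop together with its worst-case execution time.
pub struct Step {
    pub name: &'static str,
    pub worst_case_us: u32,
}

/// Add up every worst case and compare against the deadline.
/// Returns Ok(slack) if the loop fits, or Err(overrun) if it does not.
pub fn check_budget(steps: &[Step], deadline_us: u32) -> Result<u32, u32> {
    let total: u32 = steps.iter().map(|s| s.worst_case_us).sum();
    if total <= deadline_us {
        Ok(deadline_us - total)
    } else {
        Err(total - deadline_us)
    }
}
```

With hypothetical worst cases of 20 µs for the ADC conversion, 50 µs for the filter, and 200 µs for sending a message, `check_budget` against a 1000 µs deadline returns `Ok(730)`. Against a 250 µs deadline it returns `Err(20)`, which is the "buy a bigger chip or cut something" case.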
Ryan Summers
I've seen kind of all over the board, and I think it comes down to how embedded systems are taught nowadays, in that oftentimes they're not. You don't see very many embedded courses in university anymore. I think I had 1 class for 1 semester, and then I ended up doing an autonomous robotic submarine and did all of the electrical engineering and microcontrollers for that, and that's essentially how I got my start in embedded, and then I immediately went into consulting after I graduated. And it's like nobody knows how this stuff works, and I wonder if part of it comes down to the fact that it's so dependent on what the capabilities of your chip are. Like you can go from this multicore SoC that's got all this kind of connectivity all the way down to like an MSP430 that's been around for 30 years where you've got a hundred kilobytes of RAM. The capabilities between those 2 devices are so divergent that I think you end up seeing a lot of different design methodologies. And in embedded, it's almost more like an art, because there's not necessarily a right way to do it. If it conforms to spec and it doesn't crash, it works. And even sometimes crashing is totally a-okay. I mean, coming from safety critical, definitely not. But it's an interesting kind of way of looking at it. So I've seen a lot of startups where, like you made this point about, they go from startup and work their way into safety critical. And it can be a little bit of a dangerous path, because when you go into safety critical designs, a lot of it is about the process. It's not about the code itself. So when you take this startup behavior where we made the prototype, and you turn that into product, you don't have any of the safety process behind you verifying that what you did was the right way to do it.
And so like yeah, it might work but that doesn't really tell you anything about whether or not it's acceptable to use it in the safety critical context. Which is a weird thing to wrap your head around if you haven't done safety critical stuff before because suddenly like it's not about whether or not the device works. It's about all of the analysis behind it like you could publish the source code for a lot of medical devices and it's meaningless, because people couldn't use it because you can't go and get that approved by the FDA. You need all of this information back there. To be able to say like — hey, this device is safe. We've done all the analysis that no one is going to get hurt. These are all the failure modes and like none of that is there in code. It's all testing. And so you get these wildly different domains where like, we need a prototype in two weeks to verify this product idea. So that we can invest $2 million, hire all these engineers and get it done, but we need to know if it's possible. Versus like, we need to make this device that's not going to kill people that potentially could. And so you get these really interesting design patterns where… there's these weird interplays and it can be dangerous when they cross.
James Munns
Yeah… I think you really nailed it on how wide embedded is. Embedded spans everything from like, I'm writing assembly on an 8-bit micro, all the way up to like, embedded Linux deployment, and we call all of that embedded. Yeah, I did a podcast episode a couple years ago now with Francois from Interrupt, who used to be at Pebble, and we talked about this a lot. You talk about how people got into it: especially on the lower end, when you're talking about like your PIC8s or whatever, a lot of people were double-Es who were the one that was software inclined, and they figured out how to get the PID loop running on their PIC so they could have a better motor controller or something like that, and they came up from that. Or you get the people from the opposite side who were doing backend services and they had done a little bit of electronics stuff before, so they got roped in to writing the microcontroller code. And you really do have these 2 very different worlds that are smashing together in the middle, where you get this really weird gray area of everything from like 32-bit microcontrollers up to very small microprocessors. There is no one style and there's no one right answer, and particularly when everyone's cost optimizing or trying to get the most hard real-time out of everything that they're doing, there's no one-size-fits-all approach; which is, whew, it's something for sure.
Ryan Summers
There's also this interesting kind of issue where, um, once you build something in embedded, it's not like software. The second you release code it becomes Legacy Code. It exists out there in the world and you may or may not need to deal with it. At the moment that you're releasing it, you're like, yeah, this is great. It's perfect. It works fantastically. Obviously it never ends up being that way in a few months' time, and you patch it and make new code. But that stuff stays out there. And so it's kind of an interesting domain where you have to deal with these. Maybe this is a nice loop back into the kind of protocol discussion and kind of being able to deal with schemas, but like you've got all this old firmware that's sitting out there. It's a snapshot at a point in time. Like, a codebase is obviously organic. It changes as you design, as you learn how to program better or learn new design patterns; so does the codebase. But the second you put it on a device, it freezes. So you get these interesting problems. This is one of the things I've been wanting to solve in MQTT with some of our settings configuration, where you could publish some self-representative schema. I worked as an intern at SpaceX for a while, and from what I understood about their telemetry protocol, it was really interesting, because they'd reserve like 10% of the packet payload for metadata about what the message itself represented. And then the remaining 90% was like, the actual payload, but it was all binary and serialized, so all nice and compact. So essentially all someone had to do on the listening end was get a few of these messages and build up their metadata, and once they got that they could start interpreting all the messages.
So you didn't have to know anything about the structure of it, and you could create all this ground support equipment that's listening to data from the rocket, builds up its own metadata, and then just starts logging that into dumps where they can do analysis after the launch. It was just super fascinating to see, and I kind of want to be like: how could we kind of emulate that kind of structure where you have this self-describing data format?
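The "a slice of every packet is metadata, the rest is compact payload" idea can be sketched as below. This is not SpaceX's format, which is not described in any detail here. The `Packet` layout, the one-field-name-per-packet metadata, and the little-endian `u16` payload are all assumptions made up for the example. The listener starts out knowing nothing, collects field names as packets arrive, and begins decoding once it has a name for every field.

```rust
use std::collections::BTreeMap;

/// Hypothetical packet: a small metadata fragment plus a compact binary payload.
pub struct Packet<'a> {
    /// Total number of fields carried in `payload`.
    pub field_count: u8,
    /// Metadata fragment: the index and name of one field.
    pub meta_index: u8,
    pub meta_name: &'a str,
    /// `field_count` little-endian u16 values, back to back, with no framing.
    pub payload: &'a [u8],
}

/// A listener with no prior knowledge of the field names.
#[derive(Default)]
pub struct Listener {
    names: BTreeMap<u8, String>,
}

impl Listener {
    /// Record the metadata fragment, then decode the payload if the
    /// field names collected so far cover every field.
    pub fn receive(&mut self, pkt: &Packet<'_>) -> Option<Vec<(String, u16)>> {
        self.names.insert(pkt.meta_index, pkt.meta_name.to_string());

        let count = pkt.field_count as usize;
        if self.names.len() < count || pkt.payload.len() != count * 2 {
            return None; // schema incomplete, or payload malformed
        }

        let mut decoded = Vec::with_capacity(count);
        for (i, chunk) in pkt.payload.chunks_exact(2).enumerate() {
            let name = self.names.get(&(i as u8))?.clone();
            decoded.push((name, u16::from_le_bytes([chunk[0], chunk[1]])));
        }
        Some(decoded)
    }
}
```

A logger built this way could also store packets it cannot decode yet and reinterpret them once the metadata is complete, which is the part that makes the approach attractive for telemetry.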
James Munns
So I actually spent a ton of time thinking about this recently because Postcard does not encode its own schema, and it’s a super compact binary serialization. So like, there's no hints at what field starts and ends where or whatever. Like, CBOR is binary but it still has that JSON-like schema where you go like: field start, field end, whatever. Or like Protobuf has field number-whatever-whatever-whatever. Even if you can't really understand it, you can still, like, interpret some of it. Or if you have a partial understanding, it might be useful, but Postcard has none of that. If you don't know exactly how to interpret those bytes, it's garbage to you. Like, it might as well be meaningless, and I thought a lot about how I can encode that, and I came up with what I thought was a fairly clever way of encoding the schema, because Lachlan, someone who's hung out before, helped me write a derive macro so it actually walks the schema down, so you can get this sort of static struct that describes the recursive structure of the data, and then you could serialize that using some, like, well-known format, or even compress it, or even just take a hash of it to make sure that your schemas match up. And that's why I wrote that post, “What good is partial understanding?”, and I came to the same conclusion that you saw, which is: from a code-in-the-field perspective, there is no value. Because if you ship these 2 devices at the same time and you update one to send new data, the other one has no code to handle that new field. Like, “Okay, cool, you’re sending me humidity. I know you have sent me a field called humidity. But I don't know what that means, and the code that I shipped on that date knows nothing about humidity.” So like, the best it can do is ignore it.
But I think telemetry and logging is the 1 case where that's not true, because you can stick that schema in a database, or even if you heard the schema late, you could go back and reinterpret old messages and things like that. And I think telemetry, of which logging or tracing counts, is one of those things where you go, “If I know nothing about this device in the field, how can I tell what it's trying to tell me?” And I think that is one of those areas where it is really useful, but where I sort of lost interest is where I go, “This will never help me with device-to-device communication: it will only help me with, like, post-mortem analysis or something like that…” Which has a ton of value, like you've shown, but it doesn't help you in the field, which is the problem I was trying to tackle at the time, and that made me realize there is no way, because if you've written code that doesn't understand something, you can't make it understand it after the fact unless you go, like, really fancy with a dynamic programming language like Python or something like that. But then still, you're limited by the flexibility that you've put into the code when you shipped it and snapshotted it, like you said.
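The "static struct that describes the recursive structure of the data, then take a hash of it" idea might look roughly like this. It is a sketch only: the `Schema` enum here is hypothetical and is not Postcard's actual schema type or the output of its derive macro, and FNV-1a is swapped in simply because it is a small, well-known hash that needs no dependencies.

```rust
/// Hypothetical recursive schema description (not Postcard's real types).
pub enum Schema {
    U8,
    U16,
    F32,
    /// Named fields, each pointing at its own schema.
    Struct(&'static [(&'static str, &'static Schema)]),
}

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Walk the schema recursively, feeding type tags and field names into the hash.
fn hash_into(schema: &Schema, hash: u64) -> u64 {
    match schema {
        Schema::U8 => fnv1a(hash, &[0]),
        Schema::U16 => fnv1a(hash, &[1]),
        Schema::F32 => fnv1a(hash, &[2]),
        Schema::Struct(fields) => {
            let mut h = fnv1a(hash, &[3, fields.len() as u8]);
            for (name, ty) in fields.iter() {
                h = fnv1a(h, name.as_bytes());
                h = fnv1a(h, &[0xff]); // separator between name and type
                h = hash_into(ty, h);
            }
            h
        }
    }
}

/// A 64-bit fingerprint two devices can compare before exchanging data.
pub fn schema_hash(schema: &Schema) -> u64 {
    hash_into(schema, FNV_OFFSET)
}

/// Firmware shipped first: only knows about temperature.
pub static SENSOR_V1: Schema = Schema::Struct(&[("temp", &Schema::U16)]);

/// Later firmware: adds a humidity field.
pub static SENSOR_V2: Schema =
    Schema::Struct(&[("temp", &Schema::U16), ("humidity", &Schema::U16)]);
```

Comparing `schema_hash(&SENSOR_V1)` with `schema_hash(&SENSOR_V2)` lets the older device detect that the two sides disagree. As the discussion above points out, detecting the mismatch is all it can do: it still has no code that knows what humidity means.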
Ryan Summers
Hey, all we've got to do is start putting in deep neural networks on our microcontrollers to be able to interpret the telemetry. That's- it's going to be easy in the future. Trust me, large language models.
James Munns
My approach for doing the schema was I designed a compact Forth stack machine where you would encode how to decode the message… like, the schema wasn't just a schema. The schema was a small Forth program that decoded all of the fields for you by essentially, like, walking the stack of the data that you got in-
Ryan Summers
Nice.
James Munns
-which means that if you already knew the binary serialization of it, you could just do the raw thing. But if you wanted to dump it into logs, I figured out how, in like a couple dozen or 100 bytes, you could write a very compact Forth program. Because Serde only has like 29 data types, and so if you can recurse with a stack, you can actually decode all of them in a way where you get essentially, like, one byte per opcode. And if you feed that into a VM, that opcode can walk the payload essentially. And most of those, like, 100-and-whatever bytes were the field names. So if you gave up on field names, or if, like you said, you just send field names occasionally, where you send like 1 piece of it that you can reassemble and then deal with it later, you could end up getting a program that can decode it dynamically in just a stupid small amount of time. This is something that Whitequark- or Catherine suggested to me when I was talking about how do we compress this more when you don't know this information, and she was like, “Well, you make a virtual machine for it.” And at first I thought she was kidding or joking, and I was thinking about it and I go, “No. That is the right answer.” You build this sort of, like, self-decompressing format or something like that. I guess Forth is the very old approach to that, maybe ML is the new approach to that. But like, I literally wrote something, and I go: well, what do I do with this? Like, I can then print it to a console, but I can't do anything with it, because if I don't know what any of these fields mean then there's no semantic meaning that goes with it. Like, the data is recovered, but you can't teach the meaning after the fact to a program that's already shipped, unless you can send it an update script or an update firmware that then later can interpret it.
But then you've flashed new code, so it doesn't matter that it's a schema that it doesn't understand, because you just teach it to understand that. Like this is sort of like, lockstep I got myself into where I go — well, if you can update it, you don't need something that handles this, and if you can't, then there's nothing that you can do to make it understand this. So yeah, it's funny.
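To make "the schema is a small program that walks the payload" concrete, here is a toy interpreter in Rust. It is not James's design and not real Forth. The four opcodes, their numeric values, and the `decode` function are invented for this sketch, and it covers only a tiny subset of what a Serde-style data model would need. Each opcode is one byte, and a small stack tracks how many sequence elements are still to be decoded.

```rust
// Hypothetical one-byte opcodes, invented for this sketch.
pub const OP_U8: u8 = 0x01; // read one byte from the payload
pub const OP_U16: u8 = 0x02; // read a little-endian u16 from the payload
pub const OP_SEQ: u8 = 0x10; // read a u8 element count, then repeat the body
pub const OP_END: u8 = 0x11; // end of a sequence body

/// Run `program` over `payload`, returning the decoded values in order.
/// Returns None if the program is malformed or the payload does not fit it.
pub fn decode(program: &[u8], payload: &[u8]) -> Option<Vec<u32>> {
    let mut out = Vec::new();
    // Loop stack: (program counter of the body start, elements remaining).
    let mut stack: Vec<(usize, u8)> = Vec::new();
    let (mut pc, mut pos) = (0usize, 0usize);

    while pc < program.len() {
        let op = program[pc];
        pc += 1;
        match op {
            OP_U8 => {
                out.push(*payload.get(pos)? as u32);
                pos += 1;
            }
            OP_U16 => {
                let b = payload.get(pos..pos + 2)?;
                out.push(u16::from_le_bytes([b[0], b[1]]) as u32);
                pos += 2;
            }
            OP_SEQ => {
                let count = *payload.get(pos)?;
                pos += 1;
                if count == 0 {
                    // Empty sequence: skip ahead to the matching OP_END.
                    let mut depth = 1;
                    while depth > 0 {
                        match *program.get(pc)? {
                            OP_SEQ => depth += 1,
                            OP_END => depth -= 1,
                            _ => {}
                        }
                        pc += 1;
                    }
                } else {
                    stack.push((pc, count));
                }
            }
            OP_END => {
                let (start, remaining) = stack.pop()?;
                if remaining > 1 {
                    // More elements to decode: jump back to the body start.
                    stack.push((start, remaining - 1));
                    pc = start;
                }
            }
            _ => return None, // unknown opcode
        }
    }

    // Success only if every loop was closed and every byte was consumed.
    if stack.is_empty() && pos == payload.len() {
        Some(out)
    } else {
        None
    }
}
```

A four-byte program `[OP_U8, OP_SEQ, OP_U16, OP_END]` describes "one `u8`, then a length-prefixed list of `u16`". The sketch recovers values but carries no field names and no meaning, which matches the limitation described above: a logger can print the data, but already-shipped code still cannot act on it.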
Ryan Summers
I think that's the reason why, as much as I absolutely loathe serialization protocols like JSON, I think that's why people like them. Because it is a self-descriptive type system where you can just receive it and everyone knows how to understand that data. There's a few things missing, like you aren't able to say upper and lower bounds if you want, like, a setting, or if you have enums or something you can't say all of the various enums that you support. But by and large it kind of tells you everything you need with just glancing at it, and it's relatively human readable. As much as it's not efficient at all at a binary level, it's easy, and in cases, especially going back to MQTT, like when you've got a TCP connection over a 100 megabit PHY, like, I don't care about wasting a few bytes. It's not that big of a deal. Like yeah, we've got streaming data and stuff. But it's streaming data: if you miss it this time, you can get it next time. Like there'll be more. It's interesting, where it comes down to ease of use ultimately. Postcard obviously would be really cool, but you make a great point where, when you get into this machine-to-machine stuff, being able to understand the format is meaningless because you have to update it. But with these kind of telemetry protocols, there's usually not something that's trying to interpret it. It's purely just logging everything it gets. And then after the fact you can kind of go and look at it post-mortem. So yeah, it makes me wonder: I don't know if there is a good way that you could have a meaningful self-describing format that would be machine-to-machine.
James Munns
If you figure it out, let me know. I mean, Forth’s answer on this was: don't send data, send programs to each other. So I mean like, you get into things like PostScript where, back in the day when printers were brand new and, you know, switched networks were brand new and things like that, there's a good chance you couldn't actually rasterize. Your printer would have 10 times more CPU than your end terminal would, and so there was no way that your end terminal could rasterize a whole PDF, or something equivalent to a PDF, to send it to the printer. So what it did instead is it would send a program that drew it, so it would go, “There are lines here, and there is text here, with this font here, and this and this,” and so that was the other thing. But then you get into the super, like, crazy world of everyone's a VM for each other's commands, and that's the opposite of what you want in safety critical, where you go like: no, I want bounded determinism and strict understanding of each other, versus like, hey, we just throw programs at each other, and you don't have to think, you just do, because the common understanding is not the protocol. It's the VM essentially at that point.
Ryan Summers
But the buzzword of the last 2 to 3 years, at least in medical devices, has been cybersecurity, because the US released some presidential decrees that are like, “You shall be cybersecure.” And so a lot of the work I've been doing recently is like: how do we ensure that these devices are secure, and that you can't hack them, which is a very meaningful problem to solve. And when you start looking at kind of these remote ‘send me the program to execute’ designs, suddenly that becomes a giant, like, that's just like cross-site scripting for your embedded system. That sounds incredibly dangerous and like a giant hole.
James Munns
Yeah, I don't know if I'd recommend it. But it's the only solution I've been able to come up with to how you build systems that don't have to understand what is asked of them, and the answer is, well, then you get into the dynamic world of scripting languages, whether that's JavaScript or Python or whatever. But that's still the ability to send new code, and like, could you have done that better with over-the-air updates? But it's interesting. It's something that I spend a lot of time thinking about, for both protocols and like, how to think about stuff like that.
Ryan Summers
That's interesting.
James Munns
But we've been going for about an hour and a half, and I could go another hour and a half, but I don't think I can keep you here or- and I can't keep myself here.
Ryan Summers
Need to go and get dinner?
James Munns
Yeah, yeah, we're both in the same time zone. So it is about dinner time. But it's been excellent to talk to you and I would love to talk to you again soon. But thanks so much for coming on to chat today.
Ryan Summers
Yeah, thank you so much, it’s been a fun time.
James Munns
You mentioned that you're doing consulting in a couple of the companies that you're working for. Is there anything that you've been working on or that you'd like to plug, or to share before we wrap up?
Ryan Summers
Check out forged.dev. That's the only plug I've got. If you're building devices come and- come and talk to us.
James Munns
Do you want to give a quick ah, explanation of what that is? Because I know what it is, but do you want to give the- do you have the like 30 second elevator pitch of it?
Ryan Summers
Oh god I should have this… Okay, let's give it a shot. We automate flashing and programming devices, put it all on the web for you, so you track it. So if you're building a lot of something, we help you do that. I think that- I think that was 30 seconds.
James Munns
Excellent- ah, less than that. So it's all of the like factory installation and testing I assume, or at least basic like smoke test, “Do you wake up? Hello, here's your program,” sort of tooling for people who are building. Automation in the factory. Basically.
Ryan Summers
Yeah, but it's more than that. It's also like you need to put serial numbers on each device. You need to put all your test programs on. You need to run them, collect all your data, do trending, check device requirements, all kinds of that. So, that's the intent of Forged: that it centralizes all of that, puts it all in 1 place. You put your device in, you plug in the programming port, you hit start, and at the end it tells you if it passed or failed, and then you get all these trend lines for your devices over time and you're able to check if they're passing requirements or where things are going wrong in your factory process.
James Munns
That's another one of those things that I see startups really struggle with: they think that they're done once the hardware works, then you go, “No, no, now you start phase 2,” which is: how do you make it so that you can make thousands of these a day with a reasonable failure rate and know when something has gone wrong in your assembly process? It's like: you finish the first 90% of the project and then you start the next 90% of the project. So I'm super excited to see where that goes, but thank you so much for coming on, and I'm looking forward to talking to you again soon.
Ryan Summers
Yeah, thanks for having me!
James Munns
All right bye.
Ryan Summers
Bye.
Credits
This podcast is brought to you by OneVariable UG — a consultancy focused on advising and development services in the areas of systems engineering, embedded systems, and software development in the Rust programming language, based in Berlin, Germany. Check out our website at onevariable.com or send an email to contact@onevariable.com.
This interview was conducted on August 3rd, 2023.
Audio recording done by James Munns, edited and produced by Amanda Majorowicz. Special thanks to Louie Zong for the music.
Thanks for listening!