Kleiner - Wow, that made my brain hurt when I looked under the skirts of Computer Vision neural network theory. I will go out on a limb here: you are looking at multiple cameras using 'face detection' to determine which train is which and where it is. Is one camera directly overhead with a whole-layout view? That would seem to be the camera for determining the location of a train or trains. If a frame is processed at 1/2-second intervals, location may not be precise, although close enough for what is wanted. Wouldn't you also need other cameras just above the horizontal plane, at locations where foreground objects are not blocking the view, to determine more specific things about each train to help with identity? E.g., if you happen to have multiple GP30s from the same vendor, which could be identical from the top view, you would need to ID each one by its engine number or some other distinguishing feature using a side view while it is moving. Again, I guess you could grab a frame every 1/2 second or so to process. To me that would be a ton of processing. I took a quick look at the Jetson Nano: quad-core ARM processor rated around 4 GFLOPS, with a bunch of RAM. Is that rating per core, total, or for the GPU? I suspect it would reach its limits sooner rather than later. This is going to be interesting to follow!
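Just to make my guess concrete: once the overhead camera's detector spits out centroids every 1/2 second, keeping track of "which train is which" between frames could be as simple as nearest-centroid matching. A minimal sketch (the detector itself, the 50-pixel distance cutoff, and the class name are all my own assumptions, not anything from the actual project):

```python
import math

class CentroidTracker:
    """Hypothetical sketch: keep stable IDs for trains seen by an
    overhead camera, matching each new detection to the nearest
    previously seen centroid. Detections are assumed to come from
    some detector run on frames grabbed every ~1/2 second; here they
    are just (x, y) centroids in pixel coordinates."""

    def __init__(self, max_dist=50.0):
        self.next_id = 0
        self.trains = {}          # id -> last known (x, y)
        self.max_dist = max_dist  # farther than this = a new train

    def update(self, centroids):
        assigned = {}
        unclaimed = dict(self.trains)
        for c in centroids:
            # find the nearest previously seen, still-unclaimed train
            best, best_d = None, self.max_dist
            for tid, pos in unclaimed.items():
                d = math.dist(c, pos)
                if d < best_d:
                    best, best_d = tid, d
            if best is None:
                best = self.next_id   # nothing close: new train
                self.next_id += 1
            else:
                del unclaimed[best]   # claimed by this detection
            assigned[best] = c
        self.trains = assigned
        return assigned
```

With 1/2-second frames a train only moves a short distance between updates, so nearest-centroid matching usually holds an ID, and the side-view cameras would only be needed to pin the engine number to that ID once.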
I am also working toward some sort of smart control of trains that can circle the layout without my intervention while I am running a local or switching the yard somewhere. I have been using Atmel (now Microchip) parts as my go-to devices. My layout is bigger (30 x 40) than what you are going with, so cameras would not be practical because of the sheer count needed. Current detection looks to be the best way to do this in my case.
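For anyone following along, the logic behind current detection is straightforward: a block reads "occupied" only after the sensed current stays above a threshold for a few consecutive samples, which filters out dirty-track flicker. On an ATmega this would live in a timer ISR reading the ADC across a sense resistor; here is the same logic sketched in Python for clarity, with made-up threshold and sample-count values:

```python
class BlockDetector:
    """Hypothetical sketch of debounced current detection for one
    block: the occupancy state flips only after `debounce` consecutive
    samples disagree with it. Threshold is in raw ADC counts and is an
    assumed value, not from any particular detector design."""

    def __init__(self, threshold=12, debounce=3):
        self.threshold = threshold  # ADC counts (hypothetical)
        self.debounce = debounce    # consecutive samples required
        self.occupied = False
        self._run = 0               # samples disagreeing with state

    def sample(self, adc_value):
        above = adc_value > self.threshold
        if above != self.occupied:
            self._run += 1
            if self._run >= self.debounce:
                self.occupied = above   # state flips after N samples
                self._run = 0
        else:
            self._run = 0               # agreement resets the count
        return self.occupied
```

The nice part is that a brief one-sample dropout (dirty rail, insulated frog) never clears the block, so the rest of the automation sees a steady occupancy signal.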
I am also a believer in using DCC for train control only. The more stuff you hang on the bus, the more chances sh.t is gonna happen. On top of that, if DCC dies for a specific segment or section, so does everything else hung on that segment or section, and troubleshooting becomes a real pain.
Merry Christmas!