
Coordinating Actions Across a Network



I am designing a method for multiple application instances to coordinate an action across a network. I have previously implemented this using queues to coordinate asynchronous VIs within a single application instance; I now need to make it work across a local network. I have been looking at network shared variables, network streams, TCP/IP, and UDP. I would like to know whether anyone else has accomplished this and what you learned along the way.

The basic requirements are:

  • N independent applications need to coordinate some action (essentially a rendezvous).
  • No application knows whether the others exist (it knows neither their IP addresses nor anything else about them).
  • No application knows whether it will be the first to join or will join later.
  • All applications will share a common name for the rendezvous.
  • The first application should create the rendezvous.
  • As others join, they should see that the rendezvous already exists and join it instead of creating a new one.
  • All applications should be able to see how many have joined.
  • When all applications depart, the rendezvous should be destroyed by the last one to leave.

Thanks for any tips or examples you can share.

-John


There are a number of considerations here to be aware of:

  1. Once you're working across machines, things are not instantaneous, unlike a local rendezvous (RV), where everyone is guaranteed to get it "at the same time".
  2. That also means that you have potential for race conditions. For a RV, you have a single arbitrator. Here, you have multiple entities with no arbitrator.
  3. Firewall issues. You need to know that every machine is accessible to the others.
  4. What's the logic? Are you saying "once I get to this point in the program, I need to see that everyone else has also reached that point before I can proceed"?

Assuming the answer to 4 is yes, here's a basic design (a rough code sketch of the discovery steps follows the list). It's probably full of holes.

  1. Each program needs to continuously advertise itself (its IP address) with UDP and listen for all the others.
  2. Once a program hears another program, it needs to open a TCP connection to it (so across the system you have on the order of N squared connections; this will be an issue if N is large).
  3. This TCP connection is basically what you use to tell each of the other programs "I reached the RV".
  4. Since each program holds a connection to every one of the other programs, you need to see that all of them have sent the OK signal and then you can proceed and reset the RV.
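Since LabVIEW diagrams can't be pasted as text, here is a rough sketch of steps 1 and 2 in Python, whose socket calls map closely onto LabVIEW's UDP/TCP primitives. The port number, message format, and function names are assumptions for illustration, not part of any existing design:

```python
import json
import socket
import time

DISCOVERY_PORT = 52000        # assumed; any port all siblings agree on works
ANNOUNCE_INTERVAL_S = 1.0

def announce(my_tcp_port):
    """Step 1a: broadcast 'I exist; reach me on this TCP port' once a second."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    msg = json.dumps({"tcp_port": my_tcp_port}).encode()
    while True:
        sock.sendto(msg, ("<broadcast>", DISCOVERY_PORT))
        time.sleep(ANNOUNCE_INTERVAL_S)

def listen_and_connect(known_peers):
    """Steps 1b and 2: collect peer addresses; open one TCP link per new peer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", DISCOVERY_PORT))
    my_ip = socket.gethostbyname(socket.gethostname())  # crude self-filter
    while True:
        data, (ip, _) = sock.recvfrom(1024)
        peer = (ip, json.loads(data)["tcp_port"])
        if ip == my_ip or peer in known_peers:
            continue               # our own announcement, or already connected
        known_peers.add(peer)
        conn = socket.create_connection(peer, timeout=5)
        # ... hand 'conn' to the code that exchanges "I reached the RV" messages
```

The two loops would run in parallel (two while loops on a LabVIEW diagram). Note this naive version gives each pair of nodes two TCP links, one opened from each side; a real implementation would deduplicate them.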

Some issues with this:

  1. What happens if only some programs hear the UDP call and not others? You will have a big bug where some of them proceed and others don't.
  2. What happens if someone disconnects in the middle?
  3. What happens if someone connects in the middle?

Another option is to try to negotiate between the different programs to decide who's the master. This master will be the one who decides when to proceed, and it will regularly inform the others of the current status so that if it's taken down, someone else can take over. I have no good reference for a master-selection algorithm (I have done one in the past, but it relied on having a fixed point in the system), but I'm assuming there is some material on this. I'm guessing that mostly it involves the various programs shouting "I'm here" and waiting until enough of them agree on which one is the master, with some rule to make certain programs more likely to win (e.g. the lower the IP address, the better).
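For illustration only, the "lower IP wins" tie-breaker might look like the sketch below. This is a naive fragment, not a proper election algorithm; real ones also handle nodes joining mid-election and candidate lists that disagree:

```python
import ipaddress

def elect_master(candidate_ips, my_ip):
    """Return True if this node should act as master (lowest IP wins).

    Every node applies the same deterministic rule, so as long as the
    nodes have (roughly) the same candidate list after the shouting
    phase, they all converge on the same master with no arbitrator.
    """
    everyone = set(candidate_ips) | {my_ip}
    winner = min(everyone, key=ipaddress.ip_address)
    return winner == my_ip
```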


The answer to #4 is yes, in one circumstance. I have other use cases as well, but that is the basic one. So, it seems you are saying that using the low-level UDP and TCP functions is the way to go, versus using some sort of shared variables or network streams.

I figured the network streams were the wrong path, but I was hoping there was a way to do this with shared variables. Since each machine will be running the LabVIEW Run-Time Engine, each one could host a shared variable. The question becomes: can I create and destroy these variables at run time? Can others find them at run time?

If not, then your suggestion of using UDP and TCP seems like the only way.

As for the number of systems, currently we can have up to 50 'siblings' using the same shared rendezvous at a time, and there could be tens of these types of groups operating independently. So, the code would need to be able to differentiate between these groups and allow a sibling to join the right group without affecting the others.

All machines would be on the same local private network, but not necessarily the same subnet. There would be no firewall issues, as this is a closed private network.

To address the issues you listed, I would need to support a sibling leaving the group either in a controlled way or, if it crashed, by using timeouts to detect that. I already do this in the queue-based code.

My biggest concern is how to broadcast a uniquely named connection and have others join that connection while ignoring other, simultaneous broadcasts.
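One common answer to that concern is to tag every discovery datagram with the rendezvous name and have listeners filter on it, so tens of groups can share one discovery port without interfering. A hedged sketch, where the port and message format are assumptions:

```python
import json
import socket

DISCOVERY_PORT = 52000   # assumed port shared by all groups

def listen_for_group(group_name):
    """Yield (ip, tcp_port) only for siblings advertising our rendezvous name."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", DISCOVERY_PORT))
    while True:
        data, (ip, _) = sock.recvfrom(1024)
        try:
            msg = json.loads(data)
        except ValueError:
            continue                 # malformed datagram; ignore
        if msg.get("group") != group_name:
            continue                 # a sibling from another group; ignore
        yield ip, msg["tcp_port"]
```

The broadcasting side simply includes `"group": group_name` in the JSON it sends.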


A lot of the stuff you need for the network "architecture" is probably contained in Dispatcher in the CR. It can do the "clustering" simply by placing the dispatcher in the right place and pointing the publishers and subscribers to it (it can be on the same machine or centralised, and you can have multiple dispatchers spread out across many machines). What you send and what you do with it is then up to your implementation.

I wouldn't suggest UDP for this, however, unless you are going to write a robust protocol on top, which is a lot of work and negates a lot of the advantages.

Edited by ShaunR

What I understood from your original post is that you're working on a completely ad-hoc system. If you have a closed network which you can control, I would suggest using a server for this, which then makes the issue much simpler - every program talks to the server and it decides what to do. Then, all you have to worry about is various race conditions and what happens if the server is inaccessible (basically the system is dead).

Is having a server a viable design?

The question becomes: can I create and destroy these variables at run time? Can others find them at run time?

I'm pretty sure you can create, destroy and access SVs dynamically, but the key point is finding them - that's where the UDP broadcasts I suggested come in, as they allow others to find you.

All machines would be on the same local private network, but not necessarily the same sub-net.

This may also be an issue with UDP. Broadcasts are deliberately limited in scope so they don't bounce all over the internet; routers typically won't forward them between subnets, so they won't necessarily reach every node.


I've never really used SVs before, so I suppose I need to do some reading and experiments. But the problem I need to solve is that all the machines on the network will be siblings with no central server, so they all need to be servers to each other. I really want more of a peer-to-peer network where they can automatically find each other and coordinate.

As to UDP, the solution has to be 100% bulletproof, so perhaps that is too risky.


Well, I'm certainly no expert on P2P architectures, but my understanding is that some of them (at least those that work over the web) do rely on some nodes functioning as temporary servers (at least for facilitating connections). The Wikipedia article seems to suggest you need an overlay network, which is used to connect any two nodes which know each other. This is basically the direct TCP connection I suggested in my first reply (which applies to all nodes, in your case).

I don't think I would personally use SVs for this, mainly because they're designed for passing data, not commands.

UDP may not be reliable, but I don't think you have any other choice. Without a server, your choices for discovery are either cycling through all available addresses (probably impossible, since they're not all on the same subnet) or broadcasting, which I believe only UDP can do. The fact that UDP is not guaranteed to get your message across shouldn't necessarily matter, because you keep transmitting it regularly, and once you form TCP connections, your communication path is guaranteed and you can also use the TCP connections to pass along messages such as "I found this new guy". The only thing you have to make sure of is that the UDP messages can get to all nodes.
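The "I found this new guy" relay could be sketched as below; the message format is an assumption, and `connect` stands in for whatever routine opens the reliable TCP link:

```python
import json

def relay_new_peer(new_peer, tcp_connections):
    """Tell every node we already know about a peer we just discovered."""
    note = json.dumps({"type": "found", "peer": new_peer}).encode() + b"\n"
    for conn in tcp_connections:
        conn.sendall(note)

def handle_gossip(raw_line, known_peers, connect):
    """On receiving 'found' gossip, adopt the peer if it is new to us."""
    msg = json.loads(raw_line)
    if msg.get("type") == "found":
        peer = tuple(msg["peer"])
        if peer not in known_peers:
            known_peers.add(peer)
            connect(peer)        # open our own reliable TCP link to it
```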

I think the main issue you would have with a no-server system is making sure that you don't have conflicts in the network. In classic P2P systems you don't have this issue, because each node has to work with N other nodes but doesn't have to do any coordination. If you look at file sharing, for instance, a node might say "I want this part" to one node and "I want that part" to another, and the receiving node will make sure that it got all the parts and put them together correctly, but that has nothing to do with the other nodes.

I think this would require you to be able to either change the number of nodes you're waiting on dynamically (e.g. if more nodes are added to the system while you're waiting) or coordinate the list and lock it in place when you start waiting; otherwise, you leave room for race conditions. That means that when the first node wants to start waiting, it has to send a message saying "these are the nodes I'm waiting on", and it has to receive a confirmation of that from all connected nodes before proceeding; all those nodes then have to agree not to add anyone to that list until they're done waiting. They can agree to take someone off, but not to add.
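On each node, the frozen roster might be tracked with something like this sketch; the confirmation round-trip that distributes and acknowledges the initial member list is deliberately left out:

```python
class FrozenRendezvous:
    """Roster is locked when waiting starts: members may leave, never join."""

    def __init__(self, members):
        self.members = set(members)   # the list every node confirmed
        self.arrived = set()

    def remove(self, node):
        """A member crashed or departed; shrinking the roster is allowed."""
        self.members.discard(node)
        self.arrived.discard(node)

    def arrive(self, node):
        """Record an 'I reached the RV' message; True once everyone is in."""
        if node in self.members:
            self.arrived.add(node)
        return bool(self.members) and self.arrived == self.members
```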


We do something very much like this with independent state-based components. (We don't use the Rendezvous construct.) Components interact only by exchanging data via shared variables. Note that the shared variable engine (SVE) manages all connections. Note also that in our paradigm a command is just a type of data.

If you want to pursue this line of thought read about the Observer Pattern (Publish-Subscribe). The SVE implements the push version of this. Also read about component-oriented designs.
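For anyone new to the pattern, a bare-bones push-style publish-subscribe looks like this (plain Python for illustration; the SVE does the equivalent over the network):

```python
class Publisher:
    """Push-style Observer: new values are pushed to all subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, value):
        for callback in self._subscribers:
            callback(value)    # push: every subscriber is notified at once

# Components never reference each other, only the shared topic.
status = Publisher()
status.subscribe(lambda v: print("component A sees", v))
status.subscribe(lambda v: print("component B sees", v))
status.publish("RV reached")
```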


Why not just have a separate TCP/IP process (call it a moderator) running as a separate app? Have all your logic in there.

When each of your separate apps starts up, it checks to see whether the moderator is running (by attempting to open a connection to it). If not, it can spin one up.
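In sketch form (Python for illustration; the address and the executable name are assumptions):

```python
import socket
import subprocess
import time

MODERATOR_ADDR = ("localhost", 53000)   # assumed well-known address

def connect_or_spawn_moderator():
    """Connect to the moderator if one is running; otherwise launch it and retry."""
    try:
        return socket.create_connection(MODERATOR_ADDR, timeout=2)
    except OSError:
        subprocess.Popen(["moderator_app"])  # hypothetical executable name
        time.sleep(1.0)                      # crude wait for it to start listening
        return socket.create_connection(MODERATOR_ADDR, timeout=5)
```

Two apps can race to spawn the moderator at the same time; the moderator's listening-port bind is a natural tie-breaker, since the second instance's bind fails and it can simply exit.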


Having a central server can lead to a single point of failure for the entire system. So, I think I may have to back off on this goal and do something more rudimentary, like restricting the coordination to a single app instance and not allowing networked coordination. Since that solution already works well, I am inclined to stick with it. That just means my load balancing will have to be more granular and controlled at the client level and not by the server, and fail-over will not be a possibility. I think I can live with that more than I can live with the risk that all my systems go down if the single coordinating server goes down.


Well, the only way for the machines in my 'server farm' to be aware of each other is to have this central coordination application that they can register with. If I do not implement that, then I cannot transfer data from one instance to another.

This is not central to my overall goal of separating my view from the model and making my system 'client-server'. It just would have been a more flexible and 'clean' system if I could have decoupled this so that asynchronous processes that needed to coordinate actions could live within different app instances. Perhaps I will pursue this again in the future.

Thanks everyone for the input. It was all helpful.


Well, the only way for the machines in my 'server farm' to be aware of each other is to have this central coordination application that they can register with.

It doesn't have to be that central. The only stipulation is that the clients need to be told where the servers are (or discover them by network scanning).

Say, for example, you have 10 identical machines (with identical producers), each with a dispatcher. You tell each machine where the other 9 servers are (or probe to discover them). Each machine also has a client that, when it wants a service, uses one of the 9 to interrogate the dispatcher for services on that machine (I've chosen identical machines to keep it simple, but they don't have to be). If that fails, it then tries the second, third, fourth... and so on until it finds one that is available. Once an operating machine is found, it can then connect to that service. For the identical-machines scenario, the system is robust as long as there is one machine left in the system. For multiple machines with different services, it is robust as long as the machines remaining in the system, between them, supply all the services that are required. Additional machines in the latter case are then purely for redundancy.

You can get around needing a central repository of IP addresses by pinging addresses in a known address range and issuing a service request if an echo is returned. You can then cache that address list.
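Combining the two ideas, the client-side failover could be sketched as below; the address range and port are placeholders for whatever your network uses:

```python
import socket

# Assumed known address range for the dispatchers; adjust to your network.
CANDIDATES = [("10.0.0.%d" % n, 50000) for n in range(1, 11)]
_responders = []   # cached addresses that answered previously

def find_available_dispatcher():
    """Try cached responders first, then probe the known range in order."""
    for addr in _responders + [a for a in CANDIDATES if a not in _responders]:
        try:
            conn = socket.create_connection(addr, timeout=2)
            if addr not in _responders:
                _responders.append(addr)   # remember for next time
            return conn
        except OSError:
            continue                       # dead or unreachable; try the next
    raise RuntimeError("no dispatcher reachable in the known range")
```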

Edited by ShaunR
