FIX client can receive incoming messages but cannot send outgoing heartbeat messages


We have built a FIX client. It can receive incoming messages, but at some point it stops sending outgoing Heartbeat messages and does not reply to TestRequest messages. After the last heartbeat is sent, something causes the client to stop sending heartbeats altogether.

FIX version: FIX 5.0

The same incident has happened before, and we have a tcpdump capture for one session from that time.

We deploy each FIX session to a separate k8s pod.

  1. We suspected a CPU resource issue because the load average was high around the time of the issue, but adding more CPU cores did not solve it. We now think the load average was high because of the FIX reconnection.
  2. We suspected an I/O issue because we use AWS EFS, shared by the 3 sessions, for logging and the message store. It was still not solved after we used pod affinity to assign the 3 sessions to different nodes.
  3. It is not a network issue either, since we could still receive FIX messages and the other sessions worked well at that time. We have also disabled SNAT in the k8s cluster.

We are using QuickFIX/J 2.2.0 to build the FIX client. We have 3 sessions, deployed to k8s pods on separate nodes:

  1. Rate session: gets FX prices from the server.
  2. Order session: gets transaction (execution report) messages from the server; we only send Logon/Heartbeat/Logout messages to the server.
  3. Backoffice session: gets market status.

We use the Apache Camel QuickFIX/J component to simplify our code. It works well most of the time, but the sessions keep reconnecting to the FIX servers. This happens about once a month, and usually only 2 of the 3 sessions are affected.

heartbeatInt = 30s
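
For context, the FIX session layer expects each side to send a Heartbeat whenever it has sent nothing else for a full heartbeat interval. Below is a minimal sketch of that rule in Java. `HeartbeatClock` is a made-up helper for illustration only, not a QuickFIX/J class; QuickFIX/J does this timing internally.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative helper (NOT part of QuickFIX/J): decides when a FIX Heartbeat is due. */
final class HeartbeatClock {
    private final Duration heartBtInt;

    HeartbeatClock(Duration heartBtInt) {
        this.heartBtInt = heartBtInt;
    }

    /** Time at which the next Heartbeat must go out if nothing else is sent first. */
    Instant nextHeartbeatDue(Instant lastSent) {
        return lastSent.plus(heartBtInt);
    }

    /** True once a full interval has passed since the last outgoing message. */
    boolean isHeartbeatDue(Instant lastSent, Instant now) {
        return !now.isBefore(nextHeartbeatDue(lastSent));
    }
}
```

With a 30 s interval, `new HeartbeatClock(Duration.ofSeconds(30)).nextHeartbeatDue(lastSent)` is simply `lastSent` plus 30 seconds. If the send path blocks past that point, the counterparty sees silence and will probe with a TestRequest.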

FIX event log on the client side:

20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon

FIX incoming messages on the client side:

----- 21:10:53.203 Already disconnected ----

FIX outgoing messages on the client side:

---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---

Thread dump taken when the TestRequest message from the server was received. BTW, the gist is from our development environment, which has the same deployment.

We enabled debug logging in QuickFIX/J, but it gave little information: only logs for the messages received.


The sequence of events in chronological order:

  1. 20201101-23:56:02.742 The outgoing heartbeat should have been sent at this time. It looks like the send started but hung on the I/O write, with the thread in the Running state
  2. 20201101-23:56:18.651 Test message from the server side, used to trigger the thread dump
  3. 20201101-22:57:45.654 Server side began to close the connection
  4. 20201101-22:57:46.727 Thread dump (right)
  5. 20201101-23:57:48.363 Logon message
  6. 20201101-22:58:56.515 Thread dump (left)

Right (2020-11-01T22:57:46.727Z): while the session was hung. Left (2020-11-01T22:58:56.515Z): after reconnection.
[Image: the two thread dumps side by side]

It looks like the storage we are using, AWS EFS, caused the issue.
However, the feedback from AWS support is that nothing is wrong on the EFS side.
It may be a network issue between the k8s EC2 instances and AWS EFS.
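
One way to test a storage-stall theory like this is to time small synced writes in the directory that holds the logs and message store, and watch for outliers around the disconnects. A rough sketch, assuming plain Java NIO; `StorageLatencyProbe` is a hypothetical diagnostic, not something from our deployment or from QuickFIX/J.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical diagnostic: times one small synced write in a directory. */
final class StorageLatencyProbe {

    /** Writes a few bytes to a temp file in dir, forces them to storage, and returns the elapsed nanoseconds. */
    static long timeSyncedWriteNanos(Path dir) throws IOException {
        Path probe = Files.createTempFile(dir, "probe", ".tmp");
        try {
            long start = System.nanoTime();
            try (FileChannel ch = FileChannel.open(probe, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.wrap("probe".getBytes(StandardCharsets.US_ASCII));
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(true); // push data and metadata to the underlying storage
            }
            return System.nanoTime() - start;
        } finally {
            Files.deleteIfExists(probe); // leave no probe files behind
        }
    }
}
```

Calling `StorageLatencyProbe.timeSyncedWriteNanos(dir)` on a schedule, from a thread other than the FIX session thread, and logging the result would show whether writes to the shared volume occasionally take seconds rather than milliseconds.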

  1. First, we made logging asynchronous for all sessions, which made the disconnections less frequent.
  2. Second, for the market session, we wrote the sequence files to local disk; the disconnections stopped for that session.
  3. Finally, we replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions. It works great now.
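
The idea behind step 1 can be sketched as follows: hand each log line to a dedicated writer thread so that a slow write never blocks the thread that has to send heartbeats. This is only an illustration of the technique, using a plain single-thread executor; `AsyncLineLogger` is a made-up class and is not the QuickFIX/J, Camel, or logging-framework configuration we actually used.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative wrapper (NOT a QuickFIX/J or Camel API): moves slow log writes off the caller's thread. */
final class AsyncLineLogger implements AutoCloseable {
    // One writer thread keeps lines in submission order; its queue is unbounded.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Consumer<String> sink;

    AsyncLineLogger(Consumer<String> sink) {
        this.sink = sink;
    }

    /** Queues the line and returns immediately; the actual write happens on the writer thread. */
    void log(String line) {
        writer.execute(() -> sink.accept(line));
    }

    /** Stops accepting lines and waits up to 5 seconds for queued ones to be written. */
    @Override
    public void close() throws InterruptedException {
        writer.shutdown();
        writer.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The trade-off is that queued lines can be lost if the process dies before they are written, and the unbounded queue can grow if the storage stalls for a long time. That is acceptable for event logs, but the message store and sequence files still need reliable writes, which is why steps 2 and 3 were needed as well.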

BTW, AWS EBS is not highly available across availability zones, but that is better than FIX disconnections.