From Keras to Optimized Inference

2021: These instructions are obsolete as they refer to old versions of Keras and TensorFlow

Step 1: build a frozen TensorFlow model

The first step is to produce a frozen TensorFlow model.

prerequisite

A Python installation of Keras and TensorFlow with their dependencies, plus a few other goodies such as pathlib

keras2tf

One can find several implementations around. I started from keras_to_tensorflow by Amir Abdi, which I modified to make it Python 2.7 compliant; I also added a printout of the model summary, which turns out to be useful for configuring the next step. Here is my version: keras2tf.py
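For reference, the core of such a conversion script is small. Here is a minimal sketch of the freezing step, assuming the old Keras 2.x / TensorFlow 1.x APIs this page was written against (file and node names are just those of the example below):

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

K.set_learning_phase(0)          # inference mode: disables dropout
model = load_model('LWTNN_v10.h5')
model.summary()                  # useful for configuring the next step

# give the output node a stable, predictable name
tf.identity(model.outputs[0], name='output_node0')

sess = K.get_session()
# bake the trained variables into constants so the graph is self-contained
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), ['output_node0'])
tf.train.write_graph(frozen, '.', 'LWTNN_v10.pb', as_text=False)

The full script wraps this in the command-line interface shown below.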

~/pyTools/keras2tf.py -input_model_file LWTNN_v10.h5 -output_model_file LWTNN_v10.pb

[innocent@vinzen0 models]$ ~/pyTools/keras2tf.py -input_model_file LWTNN_v10.h5 -output_model_file LWTNN_v10.pb
usage: keras2tf.py [-h] [-input_fld INPUT_FLD] [-output_fld OUTPUT_FLD]
                   [-input_model_file INPUT_MODEL_FILE]
                   [-output_model_file OUTPUT_MODEL_FILE]
                   [-output_graphdef_file OUTPUT_GRAPHDEF_FILE]
                   [-num_outputs NUM_OUTPUTS] [-graph_def GRAPH_DEF]
                   [-output_node_prefix OUTPUT_NODE_PREFIX]
                   [-quantize QUANTIZE] [-theano_backend THEANO_BACKEND]
                   [-f F]

set input arguments

optional arguments:
  -h, --help            show this help message and exit
  -input_fld INPUT_FLD
  -output_fld OUTPUT_FLD
  -input_model_file INPUT_MODEL_FILE
  -output_model_file OUTPUT_MODEL_FILE
  -output_graphdef_file OUTPUT_GRAPHDEF_FILE
  -num_outputs NUM_OUTPUTS
  -graph_def GRAPH_DEF
  -output_node_prefix OUTPUT_NODE_PREFIX
  -quantize QUANTIZE
  -theano_backend THEANO_BACKEND
  -f F
('input args: ', Namespace(f=None, graph_def=False, input_fld='.', input_model_file='LWTNN_v10.h5', num_outputs=1, output_fld='', output_graphdef_file='model.ascii', output_model_file='LWTNN_v10.pb', output_node_prefix='output_node', quantize=False, theano_backend=False))
/usr/lib64/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
2018-03-03 11:39:47.500686: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
ins (InputLayer)             (None, 22)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               6900      
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 150)               45150     
_________________________________________________________________
dropout_2 (Dropout)          (None, 150)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                3020      
_________________________________________________________________
dropout_3 (Dropout)          (None, 20)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                210       
_________________________________________________________________
outs (Dense)                 (None, 1)                 11        
=================================================================
Total params: 55,291
Trainable params: 55,291
Non-trainable params: 0
_________________________________________________________________
('output nodes names are: ', ['output_node0'])
Converted 10 variables to const ops.
('saved the freezed graph (ready for inference) at: ', 'LWTNN_v10.pb')
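As a sanity check, the parameter counts in the summary follow directly from the layer widths: a Dense layer with n_in inputs and n_out outputs has (n_in+1)*n_out parameters. A few lines of Python reproduce both the total above and the FLOPs estimate printed by benchmark_model further down (counting 2 FLOPs per multiply-accumulate):

widths = [22, 300, 150, 20, 10, 1]   # ins, dense_1..dense_4, outs
params = sum((a + 1) * b for a, b in zip(widths, widths[1:]))
macs   = sum(a * b for a, b in zip(widths, widths[1:]))
print(params)     # 55291, matching "Total params" above
print(2 * macs)   # 109620, i.e. the ~109.62k FLOPs estimate below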

inspecting the model using TensorFlow tools

see for instance https://www.tensorflow.org/mobile/prepare_models

from the TensorFlow source area invoke

bazel run tensorflow/tools/graph_transforms:summarize_graph -- --in_graph=/tmp/innocent/models/LWTNN_v10.pb --print_structure=true

INFO: Running command line: bazel-bin/tensorflow/tools/graph_transforms/summarize_graph '--in_graph=/tmp/innocent/models/LWTNN_v10.pb' '--print_structure=true'
Found 1 possible inputs: (name=ins, type=float(1), shape=[?,22]) 
No variables spotted.
Found 1 possible outputs: (name=output_node0, op=Identity) 
Found 55291 (55.29k) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 14 Identity, 10 Const, 5 BiasAdd, 5 MatMul, 4 Relu, 1 Placeholder, 1 Sigmoid
To use with tensorflow/tools/benchmark:benchmark_model try these arguments:
bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/tmp/innocent/models/LWTNN_v10.pb --show_flops --input_layer=ins --input_layer_type=float --input_layer_shape=-1,22 --output_layer=output_node0
outs/bias (Const): [], value=Tensor
outs/bias/read (Identity): [outs/bias]
outs/kernel (Const): [], value=Tensor
outs/kernel/read (Identity): [outs/kernel]
dense_4/bias (Const): [], value=Tensor
dense_4/bias/read (Identity): [dense_4/bias]
dense_4/kernel (Const): [], value=Tensor
dense_4/kernel/read (Identity): [dense_4/kernel]
dense_3/bias (Const): [], value=Tensor
dense_3/bias/read (Identity): [dense_3/bias]
dense_3/kernel (Const): [], value=Tensor
dense_3/kernel/read (Identity): [dense_3/kernel]
dense_2/bias (Const): [], value=Tensor
dense_2/bias/read (Identity): [dense_2/bias]
dense_2/kernel (Const): [], value=Tensor
dense_2/kernel/read (Identity): [dense_2/kernel]
dense_1/bias (Const): [], value=Tensor
dense_1/bias/read (Identity): [dense_1/bias]
dense_1/kernel (Const): [], value=Tensor
dense_1/kernel/read (Identity): [dense_1/kernel]
ins (Placeholder): []
dense_1/MatMul (MatMul): [ins, dense_1/kernel/read]
dense_1/BiasAdd (BiasAdd): [dense_1/MatMul, dense_1/bias/read]
dense_1/Relu (Relu): [dense_1/BiasAdd]
dropout_1/Identity (Identity): [dense_1/Relu]
dense_2/MatMul (MatMul): [dropout_1/Identity, dense_2/kernel/read]
dense_2/BiasAdd (BiasAdd): [dense_2/MatMul, dense_2/bias/read]
dense_2/Relu (Relu): [dense_2/BiasAdd]
dropout_2/Identity (Identity): [dense_2/Relu]
dense_3/MatMul (MatMul): [dropout_2/Identity, dense_3/kernel/read]
dense_3/BiasAdd (BiasAdd): [dense_3/MatMul, dense_3/bias/read]
dense_3/Relu (Relu): [dense_3/BiasAdd]
dropout_3/Identity (Identity): [dense_3/Relu]
dense_4/MatMul (MatMul): [dropout_3/Identity, dense_4/kernel/read]
dense_4/BiasAdd (BiasAdd): [dense_4/MatMul, dense_4/bias/read]
dense_4/Relu (Relu): [dense_4/BiasAdd]
outs/MatMul (MatMul): [dense_4/Relu, outs/kernel/read]
outs/BiasAdd (BiasAdd): [outs/MatMul, outs/bias/read]
outs/Sigmoid (Sigmoid): [outs/BiasAdd]
output_node0 (Identity): [outs/Sigmoid]
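If building the bazel tools is inconvenient, the same structural information can be extracted with a few lines of Python directly from the frozen graph (a sketch, using the TensorFlow 1.x graph_def API):

from __future__ import print_function
import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.GFile('LWTNN_v10.pb', 'rb') as f:
    gd.ParseFromString(f.read())
# Placeholder nodes are the candidate inputs; a node no other node
# consumes is a candidate output.
for n in gd.node:
    print(n.name, n.op, list(n.input))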

you can then run the TensorFlow benchmark as

bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/tmp/innocent/models/LWTNN_v10.pb --show_flops --input_layer=ins --input_layer_type=float --input_layer_shape=1,22 --output_layer=output_node0

INFO: Running command line: bazel-bin/tensorflow/tools/benchmark/benchmark_model '--graph=/tmp/innocent/models/LWTNN_v10.pb' --show_flops '--input_layer=ins' '--input_layer_type=float' '--input_layer_shape=1,22' '--output_layer=output_node0'
2018-03-03 18:19:31.230132: I tensorflow/tools/benchmark/benchmark_model.cc:443] Graph: [/tmp/innocent/models/LWTNN_v10.pb]
2018-03-03 18:19:31.230236: I tensorflow/tools/benchmark/benchmark_model.cc:444] Input layers: [ins]
2018-03-03 18:19:31.230247: I tensorflow/tools/benchmark/benchmark_model.cc:445] Input shapes: [1,22]
2018-03-03 18:19:31.230253: I tensorflow/tools/benchmark/benchmark_model.cc:446] Input types: [float]
2018-03-03 18:19:31.230260: I tensorflow/tools/benchmark/benchmark_model.cc:447] Output layers: [output_node0]
2018-03-03 18:19:31.230274: I tensorflow/tools/benchmark/benchmark_model.cc:448] Num runs: [1000]
2018-03-03 18:19:31.230281: I tensorflow/tools/benchmark/benchmark_model.cc:449] Inter-inference delay (seconds): [-1.0]
2018-03-03 18:19:31.230287: I tensorflow/tools/benchmark/benchmark_model.cc:450] Inter-benchmark delay (seconds): [-1.0]
2018-03-03 18:19:31.230292: I tensorflow/tools/benchmark/benchmark_model.cc:452] Num threads: [-1]
2018-03-03 18:19:31.230298: I tensorflow/tools/benchmark/benchmark_model.cc:453] Benchmark name: []
2018-03-03 18:19:31.230304: I tensorflow/tools/benchmark/benchmark_model.cc:454] Output prefix: []
2018-03-03 18:19:31.230310: I tensorflow/tools/benchmark/benchmark_model.cc:455] Show sizes: [0]
2018-03-03 18:19:31.230316: I tensorflow/tools/benchmark/benchmark_model.cc:456] Warmup runs: [1]
2018-03-03 18:19:31.230323: I tensorflow/tools/benchmark/benchmark_model.cc:54] Loading TensorFlow.
2018-03-03 18:19:31.230335: I tensorflow/tools/benchmark/benchmark_model.cc:61] Got config, 0 devices
2018-03-03 18:19:31.230641: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-03 18:19:31.356688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-03 18:19:31.357039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:28:00.0
totalMemory: 5.93GiB freeMemory: 5.78GiB
2018-03-03 18:19:31.357057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-03 18:19:31.553262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-03 18:19:31.553301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-03-03 18:19:31.553308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-03-03 18:19:31.553484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5566 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:28:00.0, compute capability: 6.1)
2018-03-03 18:19:31.588386: I tensorflow/tools/benchmark/benchmark_model.cc:468] Initialized session in 0.358054s
2018-03-03 18:19:31.588433: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1 iterations, max -1 seconds without detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.722332: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1 curr=133865

2018-03-03 18:19:31.722373: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1000 iterations, max 10 seconds without detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.979278: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1000 first=725 curr=273 min=182 max=725 avg=254.677 std=24

2018-03-03 18:19:31.979315: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1000 iterations, max 10 seconds with detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.980920: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcupti.so.9.1 locally
2018-03-03 18:19:32.813738: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1000 first=99599 curr=676 min=607 max=99599 avg=760.863 std=3127

2018-03-03 18:19:32.813780: I tensorflow/tools/benchmark/benchmark_model.cc:561] Average inference timings in us: Warmup: 133865, no stats: 254, with stats: 760
2018-03-03 18:19:32.814551: I tensorflow/core/util/stat_summarizer.cc:358] Number of nodes executed: 46
2018-03-03 18:19:32.814699: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Run Order ==============================
2018-03-03 18:19:32.814712: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.814719: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYHtoD	   -0.423	    0.001	    0.001	  0.225%	  0.225%	     0.000	        1	edge_39__arg_ins_0_0 [MemCpy]
2018-03-03 18:19:32.814726: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYDtoH	    0.000	    0.001	    0.001	  0.225%	  0.450%	     0.000	        1	edge_40_output_node0 [MemCpy]
2018-03-03 18:19:32.814733: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.359	    0.005	    0.004	  0.801%	  1.251%	     0.000	        1	dense_1/MatMul [Kernel]
2018-03-03 18:19:32.814740: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.328	    0.002	    0.002	  0.472%	  1.723%	     0.000	        1	dense_1/BiasAdd [Kernel]
2018-03-03 18:19:32.814746: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.308	    0.002	    0.002	  0.455%	  2.178%	     0.000	        1	dense_1/Relu [Kernel]
2018-03-03 18:19:32.814752: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.283	    0.005	    0.006	  1.274%	  3.453%	     0.000	        1	dense_2/MatMul [Kernel]
2018-03-03 18:19:32.814758: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.259	    0.002	    0.002	  0.449%	  3.902%	     0.000	        1	dense_2/BiasAdd [Kernel]
2018-03-03 18:19:32.814764: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.239	    0.002	    0.002	  0.490%	  4.392%	     0.000	        1	dense_2/Relu [Kernel]
2018-03-03 18:19:32.814770: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.215	    0.003	    0.004	  0.796%	  5.188%	     0.000	        1	dense_3/MatMul [Kernel]
2018-03-03 18:19:32.814776: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.191	    0.003	    0.002	  0.448%	  5.636%	     0.000	        1	dense_3/BiasAdd [Kernel]
2018-03-03 18:19:32.814782: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.171	    0.002	    0.002	  0.450%	  6.086%	     0.000	        1	dense_3/Relu [Kernel]
2018-03-03 18:19:32.814788: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.148	    0.003	    0.003	  0.673%	  6.759%	     0.000	        1	dense_4/MatMul [Kernel]
2018-03-03 18:19:32.814794: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.124	    0.002	    0.002	  0.442%	  7.202%	     0.000	        1	dense_4/BiasAdd [Kernel]
2018-03-03 18:19:32.814800: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.105	    0.002	    0.002	  0.482%	  7.684%	     0.000	        1	dense_4/Relu [Kernel]
2018-03-03 18:19:32.814806: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.080	    0.005	    0.003	  0.691%	  8.375%	     0.000	        1	outs/MatMul [Kernel]
2018-03-03 18:19:32.814812: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.056	    0.002	    0.002	  0.447%	  8.822%	     0.000	        1	outs/BiasAdd [Kernel]
2018-03-03 18:19:32.814818: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:Sigmoid	   -0.036	    0.003	    0.002	  0.562%	  9.384%	     0.000	        1	outs/Sigmoid [Kernel]
2018-03-03 18:19:32.814824: I tensorflow/core/util/stat_summarizer.cc:468] 	                    NoOp	   -0.479	    0.011	    0.005	  2.092%	 11.476%	     0.000	        2	_SOURCE
2018-03-03 18:19:32.814830: I tensorflow/core/util/stat_summarizer.cc:468] 	                    _Arg	   -0.470	    0.009	    0.005	  1.188%	 12.664%	     0.000	        1	_arg_ins_0_0
2018-03-03 18:19:32.814836: I tensorflow/core/util/stat_summarizer.cc:468] 	                 _Retval	    0.032	    0.005	    0.005	  1.133%	 13.797%	     0.000	        1	_retval_output_node0_0_0
2018-03-03 18:19:32.814842: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.467	    0.015	    0.007	  1.639%	 15.437%	     0.000	        1	dense_1/kernel
2018-03-03 18:19:32.814848: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.458	    0.005	    0.005	  1.213%	 16.650%	     0.000	        1	dense_1/bias
2018-03-03 18:19:32.814854: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.451	    0.005	    0.006	  1.242%	 17.892%	     0.000	        1	dense_2/kernel
2018-03-03 18:19:32.814860: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.445	    0.008	    0.005	  1.048%	 18.940%	     0.000	        1	dense_2/bias
2018-03-03 18:19:32.814866: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.439	    0.004	    0.005	  1.037%	 19.977%	     0.000	        1	dense_3/kernel
2018-03-03 18:19:32.814872: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.433	    0.004	    0.004	  0.954%	 20.931%	     0.000	        1	dense_3/bias
2018-03-03 18:19:32.814878: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.428	    0.018	    0.004	  0.985%	 21.917%	     0.000	        1	dense_4/kernel
2018-03-03 18:19:32.814884: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.423	    0.004	    0.004	  0.937%	 22.854%	     0.000	        1	dense_4/bias
2018-03-03 18:19:32.814890: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.417	    0.004	    0.004	  1.000%	 23.854%	     0.000	        1	outs/kernel
2018-03-03 18:19:32.814897: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.412	    0.008	    0.004	  0.986%	 24.840%	     0.000	        1	outs/bias
2018-03-03 18:19:32.814903: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	 34.359%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.814909: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.352	    0.034	    0.023	  5.148%	 39.507%	     0.000	        1	dense_1/BiasAdd
2018-03-03 18:19:32.814915: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.327	    0.026	    0.018	  4.120%	 43.627%	     0.000	        1	dense_1/Relu
2018-03-03 18:19:32.814921: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 49.258%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.814927: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.280	    0.023	    0.020	  4.423%	 53.681%	     0.000	        1	dense_2/BiasAdd
2018-03-03 18:19:32.814933: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	   -0.036	    0.004	    0.004	  0.979%	 54.660%	     0.000	        1	output_node0
2018-03-03 18:19:32.814939: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.258	    0.024	    0.017	  3.921%	 58.581%	     0.000	        1	dense_2/Relu
2018-03-03 18:19:32.814945: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 64.048%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.814951: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.212	    0.025	    0.019	  4.373%	 68.420%	     0.000	        1	dense_3/BiasAdd
2018-03-03 18:19:32.814957: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.190	    0.020	    0.017	  3.882%	 72.302%	     0.000	        1	dense_3/Relu
2018-03-03 18:19:32.814963: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 77.636%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.814969: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 81.972%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.814975: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.124	    0.020	    0.017	  3.846%	 85.819%	     0.000	        1	dense_4/Relu
2018-03-03 18:19:32.814981: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 91.639%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.814986: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 96.047%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.814992: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	   -0.055	    0.021	    0.018	  3.953%	100.000%	     0.000	        1	outs/Sigmoid
2018-03-03 18:19:32.814998: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815004: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Top by Computation Time ==============================
2018-03-03 18:19:32.815010: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.815016: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	  9.518%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.815022: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 15.339%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.815028: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 20.970%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.815034: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 26.436%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.815040: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 31.771%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.815046: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.352	    0.034	    0.023	  5.148%	 36.919%	     0.000	        1	dense_1/BiasAdd
2018-03-03 18:19:32.815052: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.280	    0.023	    0.020	  4.423%	 41.342%	     0.000	        1	dense_2/BiasAdd
2018-03-03 18:19:32.815057: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 45.750%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.815063: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.212	    0.025	    0.019	  4.373%	 50.123%	     0.000	        1	dense_3/BiasAdd
2018-03-03 18:19:32.815069: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 54.459%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.815075: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815080: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Top by Memory Use ==============================
2018-03-03 18:19:32.815086: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.815092: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	  9.518%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.815098: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 15.149%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.815104: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 20.970%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.815110: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 26.304%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.815116: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 31.771%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.815122: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	   -0.036	    0.004	    0.004	  0.979%	 32.750%	     0.000	        1	output_node0
2018-03-03 18:19:32.815128: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	   -0.055	    0.021	    0.018	  3.953%	 36.703%	     0.000	        1	outs/Sigmoid
2018-03-03 18:19:32.815134: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 41.111%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.815140: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.124	    0.020	    0.017	  3.846%	 44.957%	     0.000	        1	dense_4/Relu
2018-03-03 18:19:32.815145: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 49.293%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.815155: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815161: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Summary by node type ==============================
2018-03-03 18:19:32.815167: I tensorflow/core/util/stat_summarizer.cc:468] 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
2018-03-03 18:19:32.815173: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	        5	     0.139	    32.783%	    32.783%	     2.816	        5
2018-03-03 18:19:32.815179: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	        5	     0.098	    23.113%	    55.896%	     0.000	        5
2018-03-03 18:19:32.815185: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	        4	     0.069	    16.274%	    72.170%	     0.000	        4
2018-03-03 18:19:32.815191: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	       10	     0.045	    10.613%	    82.783%	     0.000	       10
2018-03-03 18:19:32.815197: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	        1	     0.017	     4.009%	    86.792%	     0.000	        1
2018-03-03 18:19:32.815203: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	        5	     0.016	     3.774%	    90.566%	     0.000	        5
2018-03-03 18:19:32.815209: I tensorflow/core/util/stat_summarizer.cc:468] 	                    NoOp	        1	     0.009	     2.123%	    92.689%	     0.000	        2
2018-03-03 18:19:32.815214: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	        4	     0.007	     1.651%	    94.340%	     0.000	        4
2018-03-03 18:19:32.815220: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	        5	     0.006	     1.415%	    95.755%	     0.000	        5
2018-03-03 18:19:32.815226: I tensorflow/core/util/stat_summarizer.cc:468] 	                 _Retval	        1	     0.005	     1.179%	    96.934%	     0.000	        1
2018-03-03 18:19:32.815232: I tensorflow/core/util/stat_summarizer.cc:468] 	                    _Arg	        1	     0.005	     1.179%	    98.113%	     0.000	        1
2018-03-03 18:19:32.815237: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	        1	     0.004	     0.943%	    99.057%	     0.000	        1
2018-03-03 18:19:32.815243: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:Sigmoid	        1	     0.002	     0.472%	    99.528%	     0.000	        1
2018-03-03 18:19:32.815249: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYHtoD	        1	     0.001	     0.236%	    99.764%	     0.000	        1
2018-03-03 18:19:32.815255: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYDtoH	        1	     0.001	     0.236%	   100.000%	     0.000	        1
2018-03-03 18:19:32.815260: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815266: I tensorflow/core/util/stat_summarizer.cc:468] Timings (microseconds): count=1000 first=625 curr=459 min=411 max=625 avg=444.149 std=17
2018-03-03 18:19:32.815272: I tensorflow/core/util/stat_summarizer.cc:468] Memory (bytes): count=1000 curr=2816(all same)
2018-03-03 18:19:32.815277: I tensorflow/core/util/stat_summarizer.cc:468] 46 nodes observed
2018-03-03 18:19:32.815283: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.829018: I tensorflow/tools/benchmark/benchmark_model.cc:596] FLOPs estimate: 109.62k
2018-03-03 18:19:32.829044: I tensorflow/tools/benchmark/benchmark_model.cc:598] FLOPs/second: 430.43M

From these inspections we have learned:

  • the name of the input layer is ins
  • the shape of the input layer is 1,22 (actually more of a guess, as the batch dimension is unspecified)
  • the name of the output layer is output_node0 (as given to keras2tf)
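A quick way to double-check these guesses is to load the frozen graph in a plain TensorFlow 1.x session and evaluate it once; a minimal sketch (the zero input is of course arbitrary):

import numpy as np
import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.GFile('LWTNN_v10.pb', 'rb') as f:
    gd.ParseFromString(f.read())

with tf.Graph().as_default() as g:
    tf.import_graph_def(gd, name='')

with tf.Session(graph=g) as sess:
    x = g.get_tensor_by_name('ins:0')
    y = g.get_tensor_by_name('output_node0:0')
    print(sess.run(y, {x: np.zeros((1, 22), np.float32)}))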

Step 2: compile the model

To compile a model into an almost stand-alone library or executable, one can follow the instructions on the TensorFlow site about tfcompile.

prerequisite

TensorFlow 1.5 source code (from git) and its dependencies, including bazel.

(tip: bazel will fill up your home directory by populating ~/.cache/bazel. Make sure you have plenty of space there, for instance by creating a link to /tmp or similar. You can also use the bazel option --output_base=/tmp/bazel/output, but the probability that you forget about it is high, and bazel will then happily rebuild everything from scratch...)

Just build the tests (after ./configure) as

bazel build //tensorflow/compiler/aot/tests:all_tests

This will create everything needed to compile and build the self-contained inference engine.

write the "config file"

Once we have guessed the right name and shape of the input and the name of the output, writing the config is easy (at least for our simple model above):
cat dnn.config.pbtxt 
feed {
  id { node_name: "ins" }
  shape {
    dim { size: 1 }
    dim { size: 22 }
  }
}
fetch {
  id { node_name: "output_node0" }
  name: "output_node0"
}
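For a simple single-input, single-output model like this one, the config can also be generated programmatically; a small helper (hypothetical, not part of the TensorFlow tooling, using the names and shape found above):

template = '''feed {
  id { node_name: "%s" }
  shape {
    dim { size: 1 }
    dim { size: %d }
  }
}
fetch {
  id { node_name: "%s" }
  name: "%s"
}
'''

with open('dnn.config.pbtxt', 'w') as f:
    f.write(template % ('ins', 22, 'output_node0', 'output_node0'))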

compile!

We can now use the tfcompile tool to compile our model into a shared library for each intended target:
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+sse4.2"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_sse.so
rm *.o
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+fma,+avx2"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_avx2.so
rm *.o
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+fma,+avx512f"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_avx512.so
rm *.o
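Since the three invocations differ only in --target_features and in the name of the resulting library, a small driver script avoids the copy-paste (a sketch; the tfcompile path and flags are exactly those of the commands above):

import subprocess

TFCOMPILE = '/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile'
VARIANTS = {'sse': '+sse4.2', 'avx2': '+fma,+avx2', 'avx512': '+fma,+avx512f'}

for tag, feats in VARIANTS.items():
    # ahead-of-time compile the graph for this instruction set
    subprocess.check_call([
        TFCOMPILE, '--target_features=' + feats,
        '--graph=LWTNN_v10.pb', '--config=dnn.config.pbtxt',
        '--entry_point=__tensorflow_tfmydnn', '--cpp_class=MyDNN',
        '--target_triple=x86_64-pc-linux', '--out_header=tfmydnn.h',
        '--out_metadata_object=tfmydnn_tfcompile_metadata.o',
        '--out_function_object=tfmydnn_tfcompile_function.o'])
    # link the two objects into a per-architecture shared library
    subprocess.check_call(['c++', '-shared',
        'tfmydnn_tfcompile_metadata.o', 'tfmydnn_tfcompile_function.o',
        '-o', 'tfmydnn_%s.so' % tag])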

ls -l tfmydnn*
-rw-r--r--. 1 innocent zh   7493 Mar  3 19:09 tfmydnn.h
-rwxr-xr-x. 1 innocent zh 232952 Mar  3 18:37 tfmydnn.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  3 19:09 tfmydnn_avx2.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  3 19:09 tfmydnn_avx512.so
-rwxr-xr-x. 1 innocent zh 232952 Mar  3 19:09 tfmydnn_sse.so
ls -l *.pb
-rw-r--r--. 1 innocent zh  224235 Mar  2 17:09 LWTNN_v10.pb
essentially the shared library contains the model as constants in a function implementing the DNN; the generated header tfmydnn.h looks like this:
// Generated by tfcompile, the TensorFlow graph compiler.  DO NOT EDIT!
//
// This header was generated via ahead-of-time compilation of a TensorFlow
// graph.  An object file corresponding to this header was also generated.
// This header gives access to the functionality in that object file.
//
// clang-format off

#ifndef TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_  // NOLINT(build/header_guard)
#define TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_  // NOLINT(build/header_guard)


#include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
#include "tensorflow/core/platform/types.h"

namespace Eigen { struct ThreadPoolDevice; }
namespace xla { class ExecutableRunOptions; }

// (Implementation detail) Entry point to the function in the object file.
extern "C" void __tensorflow_tfmydnn(
    void* result, const xla::ExecutableRunOptions* run_options,
    const void** args, void** temps, tensorflow::int64* profile_counters);




// MyDNN represents a computation previously specified in a
// TensorFlow graph, now compiled into executable code. This extends the generic
// XlaCompiledCpuFunction class with statically type-safe arg and result
// methods. Usage example:
//
//   MyDNN computation;
//   // ...set args using computation.argN methods
//   CHECK(computation.Run());
//   // ...inspect results using computation.resultN methods
//
// The Run method invokes the actual computation, with inputs read from arg
// buffers, and outputs written to result buffers. Each Run call may also use
// a set of temporary buffers for the computation.
//
// By default each instance of this class manages its own arg, result and temp
// buffers. The AllocMode constructor parameter may be used to modify the
// buffer allocation strategy.
//
// Under the default allocation strategy, this class is thread-compatible:
// o Calls to non-const methods require exclusive access to the object.
// o Concurrent calls to const methods are OK, if those calls are made while it
//   is guaranteed that no thread may call a non-const method.
//
// The logical function signature is:
//   (arg0: f32[1,22]) -> (f32[1,1])
//
// Memory stats:
//   arg bytes total:    88
//   arg bytes aligned:  96
//   temp bytes total:   2412
//   temp bytes aligned: 2464
class MyDNN : public tensorflow::XlaCompiledCpuFunction {
 public:
  // Number of input arguments for the compiled computation.
  static constexpr size_t kNumArgs = 1;

  // Byte size of each argument buffer. There are kNumArgs entries.
  static const intptr_t* ArgSizes() {
    static constexpr intptr_t kArgSizes[kNumArgs] = {88};
    return kArgSizes;
  }

  // Returns static data used to create an XlaCompiledCpuFunction.
  static const tensorflow::XlaCompiledCpuFunction::StaticData& StaticData() {
    static XlaCompiledCpuFunction::StaticData* kStaticData = [](){
      XlaCompiledCpuFunction::StaticData* data =
        new XlaCompiledCpuFunction::StaticData;
      data->raw_function = __tensorflow_tfmydnn;
      data->arg_sizes = ArgSizes();
      data->num_args = kNumArgs;
      data->temp_sizes = TempSizes();
      data->num_temps = kNumTemps;
      data->result_index = kResultIndex;
      data->arg_names = StaticArgNames();
      data->result_names = StaticResultNames();
      data->program_shape = StaticProgramShape();
      return data;
    }();
    return *kStaticData;
  }

  MyDNN(AllocMode alloc_mode = AllocMode::ARGS_RESULTS_PROFILES_AND_TEMPS)
      : XlaCompiledCpuFunction(StaticData(), alloc_mode) {}

  MyDNN(const MyDNN&) = delete;
  MyDNN& operator=(const MyDNN&) = delete;

  // Arg methods for managing input buffers. Buffers are in row-major order.
  // There is a set of methods for each positional argument, with the following
  // general form:
  //
  // void set_argN_data(void* data)
  //   Sets the buffer of type T for positional argument N. May be called in
  //   any AllocMode. Must be called before Run to have an affect. Must be
  //   called in AllocMode::RESULTS_PROFILES_AND_TEMPS_ONLY for each positional
  //   argument, to set the argument buffers.
  //
  // T* argN_data()
  //   Returns the buffer of type T for positional argument N.
  //
  // T& argN(...dim indices...)
  //   Returns a reference to the value of type T for positional argument N,
  //   with dim indices specifying which value. No bounds checking is performed
  //   on dim indices.

  void set_arg0_data(void* data) {
    set_arg_data(0, data);
  }
  float* arg0_data() {
    return static_cast<float*>(arg_data(0));
  }
  float& arg0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][22]>(
        arg_data(0)))[dim0][dim1];
  }
  const float* arg0_data() const {
    return static_cast<const float*>(arg_data(0));
  }
  const float& arg0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][22]>(
        arg_data(0)))[dim0][dim1];
  }

  // Result methods for managing output buffers. Buffers are in row-major order.
  // Must only be called after a successful Run call. There is a set of methods
  // for each positional result, with the following general form:
  //
  // T* resultN_data()
  //   Returns the buffer of type T for positional result N.
  //
  // T& resultN(...dim indices...)
  //   Returns a reference to the value of type T for positional result N,
  //   with dim indices specifying which value. No bounds checking is performed
  //   on dim indices.
  //
  // Unlike the arg methods, there is no set_resultN_data method. The result
  // buffers are managed internally, and may change after each call to Run.

  float* result0_data() {
    return static_cast<float*>(result_data(0));
  }
  float& result0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }
  const float* result0_data() const {
    return static_cast<const float*>(result_data(0));
  }
  const float& result0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }

  float* result_output_node0_data() {
    return static_cast<float*>(result_data(0));
  }
  float& result_output_node0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }
  const float* result_output_node0_data() const {
    return static_cast<const float*>(result_data(0));
  }
  const float& result_output_node0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }

 private:
  // Number of result and temporary buffers for the compiled computation.
  static constexpr size_t kNumTemps = 4;
  // The 0-based index of the result tuple in the temporary buffers.
  static constexpr size_t kResultIndex = 1;

  // Byte size of each result / temporary buffer. There are kNumTemps entries.
  static const intptr_t* TempSizes() {
    static constexpr intptr_t kTempSizes[kNumTemps] = {-1, 8, 4, 2400};
    return kTempSizes;
  }

  // Array of names of each positional argument, terminated by nullptr.
  static const char** StaticArgNames() {
    return nullptr;
  }

  // Array of names of each positional result, terminated by nullptr.
  static const char** StaticResultNames() {
    return nullptr;
  }

  // Shape of the args and results.
  static const xla::ProgramShape* StaticProgramShape() {
    static const xla::ProgramShape* kShape = nullptr;
    return kShape;
  }
};


#endif  // TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_

// clang-format on

where it states that each instance will use ~2.5 KB of memory (see the "Memory stats" comment above)

Step 3: tests

prerequisite

the easiest is to copy locally the two TensorFlow libraries required to invoke the compiled function:
cp /data/vin/tensorflow/bazel-bin/tensorflow/compiler/aot/libruntime.so .
cp /data/vin/tensorflow/bazel-bin/tensorflow/compiler/tf2xla/libxla_compiled_cpu_function.so .

then one can copy the example from the TensorFlow tests and adapt it to our model. We have also made it possible to instantiate multiple (identical) models, and to run them multiple times in multiple threads:

#include "tfmydnn.h"

#include <array>
#include <algorithm>
#include <cmath>      // std::tanh used in the dummy loop below

#include <thread>
#include <functional>
#include <vector>

#include <iostream>
/*
vars=["trk_pt", "trk_eta", "trk_lambda", "trk_dxy", "trk_dz", "trk_dxyClosestPV", "trk_dzClosestPVClamped", 
"trk_ptErr","trk_etaErr", "trk_lambdaErr", "trk_dxyErr", "trk_dzErr", "trk_nChi2", "trk_ndof", "trk_nInvalid", "trk_nPixel", "trk_nStrip", "trk_nPixelLay", 
"trk_nStripLay", "trk_n3DLay", "trk_nLostLay", "trk_algo"]
*/


void  go(int neval, int ndnn) {

  if (ndnn==0) {
   float tot=0;
   std::cout <<"dummy running (compute tanh) "<<std::endl;
   for (int i=0; i<neval; ++i) {
       tot+= std::tanh(float((i%2) ? i : -i));
   }
   std::cout << tot << std::endl;
   return;
  }

  float tot=0;
  int N=neval/ndnn;
  MyDNN dnn[ndnn];  // run-time sized array: a GCC extension
  std::cout <<"running " << ndnn << " dnns" <<std::endl;
 
  for (int i=0; i<N; ++i) {
    for (int j=0; j<ndnn; ++j) {
      // fill the 22 input features (see the variable list above);
      // a few entries depend on i so that successive calls differ
      dnn[j].arg0_data()[0] = (i%2) ? 3. : 5;
      dnn[j].arg0_data()[1] = 0.;
      dnn[j].arg0_data()[2] = 0.;
      dnn[j].arg0_data()[3] = 0.;
      dnn[j].arg0_data()[4] = 0.;
      dnn[j].arg0_data()[5] = 0.;
      dnn[j].arg0_data()[6] = 0.;
      dnn[j].arg0_data()[7] = 0.1;
      dnn[j].arg0_data()[8] = 0.1;
      dnn[j].arg0_data()[9] = 0.01;
      dnn[j].arg0_data()[10] = 0.01;
      dnn[j].arg0_data()[11] = 0.1;
      dnn[j].arg0_data()[12] = (i%3) ? 1. : 1.2;
      dnn[j].arg0_data()[13] = 15;
      dnn[j].arg0_data()[14] = 0;
      dnn[j].arg0_data()[15] = 4;
      dnn[j].arg0_data()[16] = 12;
      dnn[j].arg0_data()[17] = 4;
      dnn[j].arg0_data()[18] = 8;
      dnn[j].arg0_data()[19] = 8;
      dnn[j].arg0_data()[20] = 1;
      dnn[j].arg0_data()[21] = (i%5) ? 4 : 6;

      dnn[j].Run();
      tot += dnn[j].result0_data()[0];
    }
  }
  std::cout << tot << ' ' << dnn[0].result0_data()[0]<< std::endl;

}


#include <cstdlib>

int main(int argc, char * argv[]) {
  typedef std::thread Thread;
  typedef std::vector<std::thread> ThreadGroup;

  if (argc<4) {
    std::cout << "please provide # total invocations, # dnns, # threads" << std::endl;
    return -1;
  }

  
  const int NUMTHREADS= atoi(argv[3]);
   ThreadGroup threads;
 
  if (NUMTHREADS>1) {
    threads.reserve(NUMTHREADS-1);
    for (int i=0; i<NUMTHREADS-1; ++i) {
      threads.push_back(Thread(go,atoi(argv[1]),atoi(argv[2])));
    }
  }  

  go(atoi(argv[1]),atoi(argv[2]));

  if (NUMTHREADS>1)
    std::for_each(threads.begin(),threads.end(), 
  		std::bind(&Thread::join,std::placeholders::_1));
 
  return 0;
}

we can now compile it as

 c++ -march=haswell -pthread -fPIC -Ofast -std=c++11 -Wall test_tfmydnn.cpp -I/data/vin/tensorflow/ ./tfmydnn.so ./libxla_compiled_cpu_function.so ./libruntime.so -o test_tfmydnn
where tfmydnn.so is a link to one of the libraries built above. One can switch from one architecture to another just by changing the link:
rm tfmydnn.so; ln -s tfmydnn_avx512.so tfmydnn.so
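Instead of the symlink, one could also pick the variant at run time from the CPU capabilities; a hypothetical helper for Linux, reading /proc/cpuinfo:

def best_variant(cpuinfo='/proc/cpuinfo'):
    flags = open(cpuinfo).read()
    # prefer the widest vector extension the CPU advertises
    for feature, lib in (('avx512f', 'tfmydnn_avx512.so'),
                         ('avx2',    'tfmydnn_avx2.so')):
        if feature in flags:
            return lib
    return 'tfmydnn_sse.so'

print(best_variant())

The chosen library can then be dlopen-ed, or linked as above.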

now just pack everything in a directory and you are ready for deployment

[innocent@vinzen0 tkdnn]$ pwd
/afs/cern.ch/user/i/innocent/w1/tkdnn
[innocent@vinzen0 tkdnn]$ ls -l
total 747
-r-xr-xr-x. 1 innocent zh   8048 Mar  4 09:29 libruntime.so
-r-xr-xr-x. 1 innocent zh  13520 Mar  4 09:29 libxla_compiled_cpu_function.so
-rwxr-xr-x. 1 innocent zh  22104 Mar  4 09:29 test_tfmydnn
-rw-r--r--. 1 innocent zh   2638 Mar  4 09:29 test_tfmydnn.cpp
-rw-r--r--. 1 innocent zh   7493 Mar  4 09:29 tfmydnn.h
-rwxr-xr-x. 1 innocent zh 237048 Mar  4 09:29 tfmydnn_avx2.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  4 09:29 tfmydnn_avx512.so
-rwxr-xr-x. 1 innocent zh 232952 Mar  4 09:29 tfmydnn_sse.so
[innocent@vinzen0 tkdnn]$ ldd test_tfmydnn
	linux-vdso.so.1 =>  (0x00007ffff772e000)
	./tfmydnn.so => not found
	./libxla_compiled_cpu_function.so (0x00007f7ff33e6000)
	./libruntime.so (0x00007f7ff31e4000)
	libstdc++.so.6 => /afs/cern.ch/user/i/innocent/w5/lib64/libstdc++.so.6 (0x00007f7ff2e60000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00007f7ff2b5d000)
	libgcc_s.so.1 => /afs/cern.ch/user/i/innocent/w5/lib64/libgcc_s.so.1 (0x00007f7ff2945000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7ff2729000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7ff2365000)
	/lib64/ld-linux-x86-64.so.2 (0x0000561859e06000)

Benchmarks

prerequisite

access to a variety of hardware platforms...

memory scaling

test performed on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (4 cores) running CC7, measured using ps -o "cmd rss vsz"
(for 0 instances the test just runs a plain tanh)

# evaluations  # instances  # threads  RSS (KB)  VSZ (KB)
1000000000               0          1      1552     21776
1000000000               0          4      1552     46364
1000000000               0          8      1556     79148
1000000                  1          1      1820     21776
1000000                 10          1      1820     21776
1000000                100          1      2084     22040
1000000               1000          1      4456     24440
1000000                  1          4      1832    242972
1000000               1000          4     12080    245636
1000000                  1          8      3596    537900
1000000               1000          8     24944    540564

which seems to be consistent with ~250 KB of shared data plus ~2.5 KB for each instance
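Indeed, a quick arithmetic check against the single-thread rows above and the Memory stats quoted in tfmydnn.h:

rss_1, rss_1000 = 1820, 4456          # RSS in KB for 1 and 1000 instances, 1 thread
print((rss_1000 - rss_1) / 999.)      # ~2.6 KB per instance, close to the
                                      # 2464+96 aligned bytes quoted in tfmydnn.h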

timing

performed using
rm tfmydnn.so ; ln -s tfmydnn_sse.so tfmydnn.so ; perf stat -d ./test_tfmydnn 1000000 100 4 | & egrep "GHz|elapsed|msec|instr"
etc

on Xeon machines we took care to run on a single socket

machine        arch    # threads  Freq (GHz)  time (s)  cpu time (s)  Ginstructions  ins/cycle
i7-6700K       SSE             4        3.9       4.89         18.8           230.9       3.14
i7-6700K       SSE             8        3.9       8.84         70.3           461.8       1.69
i7-6700K       AVX2            1        3.9       3.69          3.69           27.4       1.91
i7-6700K       AVX2            4        3.9       3.69         14.7            57.4       1.91
i7-6700K       AVX2            8        3.9       5.91         47.0           219.1       1.20
Ryzen7 1800X   SSE             8       3.62       4.34         34.6           461.2       3.68
Ryzen7 1800X   AVX2            8       3.62       4.34         34.6           219.1       1.75
E5-2650 v4     SSE             1       2.59       7.82          7.82           57.7       2.85
E5-2650 v4     SSE            12       2.47       8.20         97.5           692.5       2.87
E5-2650 v4     AVX2            1       2.55       6.49          6.49           27.4       1.65
E5-2650 v4     AVX2           12       2.43       6.85         81.0           328.8       1.67
Silver 4110    SSE             8       2.09       8.2          65.5           461.7       3.37
Silver 4110    AVX2            1       2.09       6.13          6.13           27.4       2.14
Silver 4110    AVX2            8       2.09       6.14         49.1           219.0       2.14
Silver 4110    AVX512          1       1.76       9.61          9.51           32.4       1.92
Silver 4110    AVX512          8       1.20      13.1         104.6           259.3       1.92
Phi 7210       SSE            64       1.28      42.17       2232            3697         1.29
Phi 7210       SSE           256       1.25     112         28320           14821         0.42
Phi 7210       AVX2            8       1.28      21.6         171             219         1.0
Phi 7210       AVX2           64       1.28      27.0        1484            1756         0.93
Phi 7210       AVX2          128       1.27      41.2        4480            3518         0.63
Phi 7210       AVX2          256       1.25      64.9       16238            7045         0.35
Phi 7210       AVX512          8       1.28      22.8         181             259.5       1.12
Phi 7210       AVX512         64       1.28      32.9        1582            2975         1.02
Phi 7210       AVX512        128       1.27      48.0        5202            4161         0.63
Phi 7210       AVX512        256       1.25      77.0       19290            8337         0.35

here are the ratios of real time sse/avx2 and avx2/avx512, for one thread per core

machine       sse/avx2  avx2/avx512
i7-6700K          1.33            -
Ryzen7 1800X      1.0             -
E5-2650 v4        1.22            -
Silver 4110       1.33         0.47
Phi 7210          1.56         0.82

The performance with AVX512 (in particular on the SKL Silver 4110) is pretty disappointing: besides the lower frequency (which is expected, see for instance https://en.wikichip.org/wiki/intel/xeon_silver/4110), note how the number of instructions is higher than with AVX2. This can also be seen with a more detailed perf analysis:

[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_sse.so tfmydnn.so ; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
                0      fp_arith_inst_retired_512b_packed_single #    0.000 K/sec                    (11.57%)
                0      fp_arith_inst_retired_256b_packed_single #    0.000 K/sec                    (11.43%)
      44575000479      fp_arith_inst_retired_128b_packed_single # 3379.500 M/sec                    (11.50%)
       2087690622      fp_arith_inst_retired_scalar_single #  158.280 M/sec                    (11.58%)
[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_avx2.so tfmydnn.so; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
                0      fp_arith_inst_retired_512b_packed_single #    0.000 K/sec                    (11.73%)
      21626826075      fp_arith_inst_retired_256b_packed_single # 2131.011 M/sec                    (11.73%)
        376225683      fp_arith_inst_retired_128b_packed_single #   37.072 M/sec                    (11.56%)
       8265440396      fp_arith_inst_retired_scalar_single #  814.440 M/sec                    (11.45%)
[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_avx512.so tfmydnn.so ; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
      12737652202      fp_arith_inst_retired_512b_packed_single #  605.915 M/sec                    (11.56%)
        189005228      fp_arith_inst_retired_256b_packed_single #    8.991 M/sec                    (11.49%)
          6313559      fp_arith_inst_retired_128b_packed_single #    0.300 M/sec                    (11.52%)
      12082932411      fp_arith_inst_retired_scalar_single #  574.771 M/sec                    (11.54%)

most probably either the code generated by TensorFlow is not (yet) optimized for AVX512, or there are alignment problems, or, more simply, it is our model (22x300x150x20x10x1) that does not fit well the 16-float wide registers of AVX512

we have tested a different model (22x320x160x32x16x1) and indeed the ratio avx512/avx2 improves a bit (the number of instructions becomes at least similar). Still, the model being bigger, the performance with AVX2 is worse (slower) than for the previous, smaller model.
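The register-width argument can be made quantitative by counting, for each layer width, the lanes left idle in its last (partial) vector register; a quick check for both models:

from __future__ import print_function

def idle_lanes(widths, lanes):
    # lanes left unused in the last vector register of each layer width
    return [(-w) % lanes for w in widths]

small = [300, 150, 20, 10, 1]   # 22x300x150x20x10x1
big   = [320, 160, 32, 16, 1]   # 22x320x160x32x16x1
for name, widths in (('small', small), ('big', big)):
    print(name, 'avx2:', idle_lanes(widths, 8),
          'avx512:', idle_lanes(widths, 16))
# the small model wastes up to 15 of the 16 AVX512 lanes per vector,
# while the big one does so only in the final single-output layer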

-- VincenzoInnocente - 2018-03-03
