From Keras to Optimized Inference

2021: These instructions are obsolete as they refer to old versions of Keras and TensorFlow

Step 1: build a frozen TensorFlow model

The first step is to produce a frozen TensorFlow model.

prerequisite

A Python installation of Keras and TensorFlow with their dependencies, plus a few other goodies such as pathlib

keras2tf

One can find several implementations around. I started from keras_to_tensorflow by Amir Abdi, which I modified to make it Python 2.7 compliant; I also added a printout of the model summary, which turns out to be useful for configuring the next step. Here is my version: keras2tf.py
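For reference, the core of such a conversion script is small. Here is a minimal sketch of the freezing step, assuming the old Keras 2.x / TensorFlow 1.x APIs this page was written against (file and node names are just those of the example below):

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

K.set_learning_phase(0)          # inference mode: disables dropout
model = load_model('LWTNN_v10.h5')
model.summary()                  # useful for configuring the next step

# give the output node a stable, predictable name
tf.identity(model.outputs[0], name='output_node0')

sess = K.get_session()
# bake the trained variables into constants so the graph is self-contained
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), ['output_node0'])
tf.train.write_graph(frozen, '.', 'LWTNN_v10.pb', as_text=False)

The full script wraps this in the command-line interface shown below.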

~/pyTools/keras2tf.py -input_model_file LWTNN_v10.h5 -output_model_file LWTNN_v10.pb

[innocent@vinzen0 models]$ ~/pyTools/keras2tf.py -input_model_file LWTNN_v10.h5 -output_model_file LWTNN_v10.pb
usage: keras2tf.py [-h] [-input_fld INPUT_FLD] [-output_fld OUTPUT_FLD]
                   [-input_model_file INPUT_MODEL_FILE]
                   [-output_model_file OUTPUT_MODEL_FILE]
                   [-output_graphdef_file OUTPUT_GRAPHDEF_FILE]
                   [-num_outputs NUM_OUTPUTS] [-graph_def GRAPH_DEF]
                   [-output_node_prefix OUTPUT_NODE_PREFIX]
                   [-quantize QUANTIZE] [-theano_backend THEANO_BACKEND]
                   [-f F]

set input arguments

optional arguments:
  -h, --help            show this help message and exit
  -input_fld INPUT_FLD
  -output_fld OUTPUT_FLD
  -input_model_file INPUT_MODEL_FILE
  -output_model_file OUTPUT_MODEL_FILE
  -output_graphdef_file OUTPUT_GRAPHDEF_FILE
  -num_outputs NUM_OUTPUTS
  -graph_def GRAPH_DEF
  -output_node_prefix OUTPUT_NODE_PREFIX
  -quantize QUANTIZE
  -theano_backend THEANO_BACKEND
  -f F
('input args: ', Namespace(f=None, graph_def=False, input_fld='.', input_model_file='LWTNN_v10.h5', num_outputs=1, output_fld='', output_graphdef_file='model.ascii', output_model_file='LWTNN_v10.pb', output_node_prefix='output_node', quantize=False, theano_backend=False))
/usr/lib64/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
2018-03-03 11:39:47.500686: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
ins (InputLayer)             (None, 22)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               6900      
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 150)               45150     
_________________________________________________________________
dropout_2 (Dropout)          (None, 150)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                3020      
_________________________________________________________________
dropout_3 (Dropout)          (None, 20)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                210       
_________________________________________________________________
outs (Dense)                 (None, 1)                 11        
=================================================================
Total params: 55,291
Trainable params: 55,291
Non-trainable params: 0
_________________________________________________________________
('output nodes names are: ', ['output_node0'])
Converted 10 variables to const ops.
('saved the freezed graph (ready for inference) at: ', 'LWTNN_v10.pb')
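As a sanity check, the parameter counts in the summary follow directly from the layer widths: a Dense layer with n_in inputs and n_out outputs has (n_in+1)*n_out parameters. A few lines of Python reproduce both the total above and the FLOPs estimate printed by benchmark_model further down (counting 2 FLOPs per multiply-accumulate):

widths = [22, 300, 150, 20, 10, 1]   # ins, dense_1..dense_4, outs
params = sum((a + 1) * b for a, b in zip(widths, widths[1:]))
macs   = sum(a * b for a, b in zip(widths, widths[1:]))
print(params)     # 55291, matching "Total params" above
print(2 * macs)   # 109620, i.e. the ~109.62k FLOPs estimate below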

inspecting the model using TensorFlow tools

see for instance https://www.tensorflow.org/mobile/prepare_models

from the TensorFlow source area invoke

bazel run tensorflow/tools/graph_transforms:summarize_graph -- --in_graph=/tmp/innocent/models/LWTNN_v10.pb --print_structure=true

INFO: Running command line: bazel-bin/tensorflow/tools/graph_transforms/summarize_graph '--in_graph=/tmp/innocent/models/LWTNN_v10.pb' '--print_structure=true'
Found 1 possible inputs: (name=ins, type=float(1), shape=[?,22]) 
No variables spotted.
Found 1 possible outputs: (name=output_node0, op=Identity) 
Found 55291 (55.29k) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 14 Identity, 10 Const, 5 BiasAdd, 5 MatMul, 4 Relu, 1 Placeholder, 1 Sigmoid
To use with tensorflow/tools/benchmark:benchmark_model try these arguments:
bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/tmp/innocent/models/LWTNN_v10.pb --show_flops --input_layer=ins --input_layer_type=float --input_layer_shape=-1,22 --output_layer=output_node0
outs/bias (Const): [], value=Tensor
outs/bias/read (Identity): [outs/bias]
outs/kernel (Const): [], value=Tensor
outs/kernel/read (Identity): [outs/kernel]
dense_4/bias (Const): [], value=Tensor
dense_4/bias/read (Identity): [dense_4/bias]
dense_4/kernel (Const): [], value=Tensor
dense_4/kernel/read (Identity): [dense_4/kernel]
dense_3/bias (Const): [], value=Tensor
dense_3/bias/read (Identity): [dense_3/bias]
dense_3/kernel (Const): [], value=Tensor
dense_3/kernel/read (Identity): [dense_3/kernel]
dense_2/bias (Const): [], value=Tensor
dense_2/bias/read (Identity): [dense_2/bias]
dense_2/kernel (Const): [], value=Tensor
dense_2/kernel/read (Identity): [dense_2/kernel]
dense_1/bias (Const): [], value=Tensor
dense_1/bias/read (Identity): [dense_1/bias]
dense_1/kernel (Const): [], value=Tensor
dense_1/kernel/read (Identity): [dense_1/kernel]
ins (Placeholder): []
dense_1/MatMul (MatMul): [ins, dense_1/kernel/read]
dense_1/BiasAdd (BiasAdd): [dense_1/MatMul, dense_1/bias/read]
dense_1/Relu (Relu): [dense_1/BiasAdd]
dropout_1/Identity (Identity): [dense_1/Relu]
dense_2/MatMul (MatMul): [dropout_1/Identity, dense_2/kernel/read]
dense_2/BiasAdd (BiasAdd): [dense_2/MatMul, dense_2/bias/read]
dense_2/Relu (Relu): [dense_2/BiasAdd]
dropout_2/Identity (Identity): [dense_2/Relu]
dense_3/MatMul (MatMul): [dropout_2/Identity, dense_3/kernel/read]
dense_3/BiasAdd (BiasAdd): [dense_3/MatMul, dense_3/bias/read]
dense_3/Relu (Relu): [dense_3/BiasAdd]
dropout_3/Identity (Identity): [dense_3/Relu]
dense_4/MatMul (MatMul): [dropout_3/Identity, dense_4/kernel/read]
dense_4/BiasAdd (BiasAdd): [dense_4/MatMul, dense_4/bias/read]
dense_4/Relu (Relu): [dense_4/BiasAdd]
outs/MatMul (MatMul): [dense_4/Relu, outs/kernel/read]
outs/BiasAdd (BiasAdd): [outs/MatMul, outs/bias/read]
outs/Sigmoid (Sigmoid): [outs/BiasAdd]
output_node0 (Identity): [outs/Sigmoid]
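If building the bazel tools is inconvenient, the same structural information can be extracted with a few lines of Python directly from the frozen graph (a sketch, using the TensorFlow 1.x graph_def API):

from __future__ import print_function
import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.GFile('LWTNN_v10.pb', 'rb') as f:
    gd.ParseFromString(f.read())
# Placeholder nodes are the candidate inputs; a node no other node
# consumes is a candidate output.
for n in gd.node:
    print(n.name, n.op, list(n.input))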

you can then run the TensorFlow benchmark as

bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/tmp/innocent/models/LWTNN_v10.pb --show_flops --input_layer=ins --input_layer_type=float --input_layer_shape=1,22 --output_layer=output_node0

INFO: Running command line: bazel-bin/tensorflow/tools/benchmark/benchmark_model '--graph=/tmp/innocent/models/LWTNN_v10.pb' --show_flops '--input_layer=ins' '--input_layer_type=float' '--input_layer_shape=1,22' '--output_layer=output_node0'
2018-03-03 18:19:31.230132: I tensorflow/tools/benchmark/benchmark_model.cc:443] Graph: [/tmp/innocent/models/LWTNN_v10.pb]
2018-03-03 18:19:31.230236: I tensorflow/tools/benchmark/benchmark_model.cc:444] Input layers: [ins]
2018-03-03 18:19:31.230247: I tensorflow/tools/benchmark/benchmark_model.cc:445] Input shapes: [1,22]
2018-03-03 18:19:31.230253: I tensorflow/tools/benchmark/benchmark_model.cc:446] Input types: [float]
2018-03-03 18:19:31.230260: I tensorflow/tools/benchmark/benchmark_model.cc:447] Output layers: [output_node0]
2018-03-03 18:19:31.230274: I tensorflow/tools/benchmark/benchmark_model.cc:448] Num runs: [1000]
2018-03-03 18:19:31.230281: I tensorflow/tools/benchmark/benchmark_model.cc:449] Inter-inference delay (seconds): [-1.0]
2018-03-03 18:19:31.230287: I tensorflow/tools/benchmark/benchmark_model.cc:450] Inter-benchmark delay (seconds): [-1.0]
2018-03-03 18:19:31.230292: I tensorflow/tools/benchmark/benchmark_model.cc:452] Num threads: [-1]
2018-03-03 18:19:31.230298: I tensorflow/tools/benchmark/benchmark_model.cc:453] Benchmark name: []
2018-03-03 18:19:31.230304: I tensorflow/tools/benchmark/benchmark_model.cc:454] Output prefix: []
2018-03-03 18:19:31.230310: I tensorflow/tools/benchmark/benchmark_model.cc:455] Show sizes: [0]
2018-03-03 18:19:31.230316: I tensorflow/tools/benchmark/benchmark_model.cc:456] Warmup runs: [1]
2018-03-03 18:19:31.230323: I tensorflow/tools/benchmark/benchmark_model.cc:54] Loading TensorFlow.
2018-03-03 18:19:31.230335: I tensorflow/tools/benchmark/benchmark_model.cc:61] Got config, 0 devices
2018-03-03 18:19:31.230641: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-03 18:19:31.356688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-03 18:19:31.357039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:28:00.0
totalMemory: 5.93GiB freeMemory: 5.78GiB
2018-03-03 18:19:31.357057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-03 18:19:31.553262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-03 18:19:31.553301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-03-03 18:19:31.553308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-03-03 18:19:31.553484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5566 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:28:00.0, compute capability: 6.1)
2018-03-03 18:19:31.588386: I tensorflow/tools/benchmark/benchmark_model.cc:468] Initialized session in 0.358054s
2018-03-03 18:19:31.588433: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1 iterations, max -1 seconds without detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.722332: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1 curr=133865

2018-03-03 18:19:31.722373: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1000 iterations, max 10 seconds without detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.979278: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1000 first=725 curr=273 min=182 max=725 avg=254.677 std=24

2018-03-03 18:19:31.979315: I tensorflow/tools/benchmark/benchmark_model.cc:308] Running benchmark for max 1000 iterations, max 10 seconds with detailed stat logging, with -1s sleep between inferences
2018-03-03 18:19:31.980920: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcupti.so.9.1 locally
2018-03-03 18:19:32.813738: I tensorflow/tools/benchmark/benchmark_model.cc:341] count=1000 first=99599 curr=676 min=607 max=99599 avg=760.863 std=3127

2018-03-03 18:19:32.813780: I tensorflow/tools/benchmark/benchmark_model.cc:561] Average inference timings in us: Warmup: 133865, no stats: 254, with stats: 760
2018-03-03 18:19:32.814551: I tensorflow/core/util/stat_summarizer.cc:358] Number of nodes executed: 46
2018-03-03 18:19:32.814699: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Run Order ==============================
2018-03-03 18:19:32.814712: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.814719: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYHtoD	   -0.423	    0.001	    0.001	  0.225%	  0.225%	     0.000	        1	edge_39__arg_ins_0_0 [MemCpy]
2018-03-03 18:19:32.814726: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYDtoH	    0.000	    0.001	    0.001	  0.225%	  0.450%	     0.000	        1	edge_40_output_node0 [MemCpy]
2018-03-03 18:19:32.814733: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.359	    0.005	    0.004	  0.801%	  1.251%	     0.000	        1	dense_1/MatMul [Kernel]
2018-03-03 18:19:32.814740: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.328	    0.002	    0.002	  0.472%	  1.723%	     0.000	        1	dense_1/BiasAdd [Kernel]
2018-03-03 18:19:32.814746: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.308	    0.002	    0.002	  0.455%	  2.178%	     0.000	        1	dense_1/Relu [Kernel]
2018-03-03 18:19:32.814752: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.283	    0.005	    0.006	  1.274%	  3.453%	     0.000	        1	dense_2/MatMul [Kernel]
2018-03-03 18:19:32.814758: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.259	    0.002	    0.002	  0.449%	  3.902%	     0.000	        1	dense_2/BiasAdd [Kernel]
2018-03-03 18:19:32.814764: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.239	    0.002	    0.002	  0.490%	  4.392%	     0.000	        1	dense_2/Relu [Kernel]
2018-03-03 18:19:32.814770: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.215	    0.003	    0.004	  0.796%	  5.188%	     0.000	        1	dense_3/MatMul [Kernel]
2018-03-03 18:19:32.814776: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.191	    0.003	    0.002	  0.448%	  5.636%	     0.000	        1	dense_3/BiasAdd [Kernel]
2018-03-03 18:19:32.814782: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.171	    0.002	    0.002	  0.450%	  6.086%	     0.000	        1	dense_3/Relu [Kernel]
2018-03-03 18:19:32.814788: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.148	    0.003	    0.003	  0.673%	  6.759%	     0.000	        1	dense_4/MatMul [Kernel]
2018-03-03 18:19:32.814794: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.124	    0.002	    0.002	  0.442%	  7.202%	     0.000	        1	dense_4/BiasAdd [Kernel]
2018-03-03 18:19:32.814800: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	   -0.105	    0.002	    0.002	  0.482%	  7.684%	     0.000	        1	dense_4/Relu [Kernel]
2018-03-03 18:19:32.814806: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	   -0.080	    0.005	    0.003	  0.691%	  8.375%	     0.000	        1	outs/MatMul [Kernel]
2018-03-03 18:19:32.814812: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	   -0.056	    0.002	    0.002	  0.447%	  8.822%	     0.000	        1	outs/BiasAdd [Kernel]
2018-03-03 18:19:32.814818: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:Sigmoid	   -0.036	    0.003	    0.002	  0.562%	  9.384%	     0.000	        1	outs/Sigmoid [Kernel]
2018-03-03 18:19:32.814824: I tensorflow/core/util/stat_summarizer.cc:468] 	                    NoOp	   -0.479	    0.011	    0.005	  2.092%	 11.476%	     0.000	        2	_SOURCE
2018-03-03 18:19:32.814830: I tensorflow/core/util/stat_summarizer.cc:468] 	                    _Arg	   -0.470	    0.009	    0.005	  1.188%	 12.664%	     0.000	        1	_arg_ins_0_0
2018-03-03 18:19:32.814836: I tensorflow/core/util/stat_summarizer.cc:468] 	                 _Retval	    0.032	    0.005	    0.005	  1.133%	 13.797%	     0.000	        1	_retval_output_node0_0_0
2018-03-03 18:19:32.814842: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.467	    0.015	    0.007	  1.639%	 15.437%	     0.000	        1	dense_1/kernel
2018-03-03 18:19:32.814848: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.458	    0.005	    0.005	  1.213%	 16.650%	     0.000	        1	dense_1/bias
2018-03-03 18:19:32.814854: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.451	    0.005	    0.006	  1.242%	 17.892%	     0.000	        1	dense_2/kernel
2018-03-03 18:19:32.814860: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.445	    0.008	    0.005	  1.048%	 18.940%	     0.000	        1	dense_2/bias
2018-03-03 18:19:32.814866: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.439	    0.004	    0.005	  1.037%	 19.977%	     0.000	        1	dense_3/kernel
2018-03-03 18:19:32.814872: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.433	    0.004	    0.004	  0.954%	 20.931%	     0.000	        1	dense_3/bias
2018-03-03 18:19:32.814878: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.428	    0.018	    0.004	  0.985%	 21.917%	     0.000	        1	dense_4/kernel
2018-03-03 18:19:32.814884: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.423	    0.004	    0.004	  0.937%	 22.854%	     0.000	        1	dense_4/bias
2018-03-03 18:19:32.814890: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.417	    0.004	    0.004	  1.000%	 23.854%	     0.000	        1	outs/kernel
2018-03-03 18:19:32.814897: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	   -0.412	    0.008	    0.004	  0.986%	 24.840%	     0.000	        1	outs/bias
2018-03-03 18:19:32.814903: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	 34.359%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.814909: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.352	    0.034	    0.023	  5.148%	 39.507%	     0.000	        1	dense_1/BiasAdd
2018-03-03 18:19:32.814915: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.327	    0.026	    0.018	  4.120%	 43.627%	     0.000	        1	dense_1/Relu
2018-03-03 18:19:32.814921: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 49.258%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.814927: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.280	    0.023	    0.020	  4.423%	 53.681%	     0.000	        1	dense_2/BiasAdd
2018-03-03 18:19:32.814933: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	   -0.036	    0.004	    0.004	  0.979%	 54.660%	     0.000	        1	output_node0
2018-03-03 18:19:32.814939: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.258	    0.024	    0.017	  3.921%	 58.581%	     0.000	        1	dense_2/Relu
2018-03-03 18:19:32.814945: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 64.048%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.814951: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.212	    0.025	    0.019	  4.373%	 68.420%	     0.000	        1	dense_3/BiasAdd
2018-03-03 18:19:32.814957: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.190	    0.020	    0.017	  3.882%	 72.302%	     0.000	        1	dense_3/Relu
2018-03-03 18:19:32.814963: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 77.636%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.814969: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 81.972%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.814975: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.124	    0.020	    0.017	  3.846%	 85.819%	     0.000	        1	dense_4/Relu
2018-03-03 18:19:32.814981: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 91.639%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.814986: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 96.047%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.814992: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	   -0.055	    0.021	    0.018	  3.953%	100.000%	     0.000	        1	outs/Sigmoid
2018-03-03 18:19:32.814998: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815004: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Top by Computation Time ==============================
2018-03-03 18:19:32.815010: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.815016: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	  9.518%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.815022: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 15.339%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.815028: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 20.970%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.815034: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 26.436%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.815040: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 31.771%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.815046: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.352	    0.034	    0.023	  5.148%	 36.919%	     0.000	        1	dense_1/BiasAdd
2018-03-03 18:19:32.815052: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.280	    0.023	    0.020	  4.423%	 41.342%	     0.000	        1	dense_2/BiasAdd
2018-03-03 18:19:32.815057: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 45.750%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.815063: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.212	    0.025	    0.019	  4.373%	 50.123%	     0.000	        1	dense_3/BiasAdd
2018-03-03 18:19:32.815069: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 54.459%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.815075: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815080: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Top by Memory Use ==============================
2018-03-03 18:19:32.815086: I tensorflow/core/util/stat_summarizer.cc:468] 	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
2018-03-03 18:19:32.815092: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.398	    0.089	    0.042	  9.518%	  9.518%	     1.280	        1	dense_1/MatMul
2018-03-03 18:19:32.815098: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.307	    0.032	    0.025	  5.631%	 15.149%	     0.768	        1	dense_2/MatMul
2018-03-03 18:19:32.815104: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.105	    0.035	    0.026	  5.820%	 20.970%	     0.256	        1	outs/MatMul
2018-03-03 18:19:32.815110: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.171	    0.027	    0.024	  5.334%	 26.304%	     0.256	        1	dense_4/MatMul
2018-03-03 18:19:32.815116: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	   -0.238	    0.028	    0.024	  5.467%	 31.771%	     0.256	        1	dense_3/MatMul
2018-03-03 18:19:32.815122: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	   -0.036	    0.004	    0.004	  0.979%	 32.750%	     0.000	        1	output_node0
2018-03-03 18:19:32.815128: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	   -0.055	    0.021	    0.018	  3.953%	 36.703%	     0.000	        1	outs/Sigmoid
2018-03-03 18:19:32.815134: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.077	    0.024	    0.020	  4.408%	 41.111%	     0.000	        1	outs/BiasAdd
2018-03-03 18:19:32.815140: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	   -0.124	    0.020	    0.017	  3.846%	 44.957%	     0.000	        1	dense_4/Relu
2018-03-03 18:19:32.815145: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	   -0.145	    0.026	    0.019	  4.336%	 49.293%	     0.000	        1	dense_4/BiasAdd
2018-03-03 18:19:32.815155: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815161: I tensorflow/core/util/stat_summarizer.cc:468] ============================== Summary by node type ==============================
2018-03-03 18:19:32.815167: I tensorflow/core/util/stat_summarizer.cc:468] 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
2018-03-03 18:19:32.815173: I tensorflow/core/util/stat_summarizer.cc:468] 	                  MatMul	        5	     0.139	    32.783%	    32.783%	     2.816	        5
2018-03-03 18:19:32.815179: I tensorflow/core/util/stat_summarizer.cc:468] 	                 BiasAdd	        5	     0.098	    23.113%	    55.896%	     0.000	        5
2018-03-03 18:19:32.815185: I tensorflow/core/util/stat_summarizer.cc:468] 	                    Relu	        4	     0.069	    16.274%	    72.170%	     0.000	        4
2018-03-03 18:19:32.815191: I tensorflow/core/util/stat_summarizer.cc:468] 	                   Const	       10	     0.045	    10.613%	    82.783%	     0.000	       10
2018-03-03 18:19:32.815197: I tensorflow/core/util/stat_summarizer.cc:468] 	                 Sigmoid	        1	     0.017	     4.009%	    86.792%	     0.000	        1
2018-03-03 18:19:32.815203: I tensorflow/core/util/stat_summarizer.cc:468] 	              gpu:MatMul	        5	     0.016	     3.774%	    90.566%	     0.000	        5
2018-03-03 18:19:32.815209: I tensorflow/core/util/stat_summarizer.cc:468] 	                    NoOp	        1	     0.009	     2.123%	    92.689%	     0.000	        2
2018-03-03 18:19:32.815214: I tensorflow/core/util/stat_summarizer.cc:468] 	                gpu:Relu	        4	     0.007	     1.651%	    94.340%	     0.000	        4
2018-03-03 18:19:32.815220: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:BiasAdd	        5	     0.006	     1.415%	    95.755%	     0.000	        5
2018-03-03 18:19:32.815226: I tensorflow/core/util/stat_summarizer.cc:468] 	                 _Retval	        1	     0.005	     1.179%	    96.934%	     0.000	        1
2018-03-03 18:19:32.815232: I tensorflow/core/util/stat_summarizer.cc:468] 	                    _Arg	        1	     0.005	     1.179%	    98.113%	     0.000	        1
2018-03-03 18:19:32.815237: I tensorflow/core/util/stat_summarizer.cc:468] 	                Identity	        1	     0.004	     0.943%	    99.057%	     0.000	        1
2018-03-03 18:19:32.815243: I tensorflow/core/util/stat_summarizer.cc:468] 	             gpu:Sigmoid	        1	     0.002	     0.472%	    99.528%	     0.000	        1
2018-03-03 18:19:32.815249: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYHtoD	        1	     0.001	     0.236%	    99.764%	     0.000	        1
2018-03-03 18:19:32.815255: I tensorflow/core/util/stat_summarizer.cc:468] 	          gpu:MEMCPYDtoH	        1	     0.001	     0.236%	   100.000%	     0.000	        1
2018-03-03 18:19:32.815260: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.815266: I tensorflow/core/util/stat_summarizer.cc:468] Timings (microseconds): count=1000 first=625 curr=459 min=411 max=625 avg=444.149 std=17
2018-03-03 18:19:32.815272: I tensorflow/core/util/stat_summarizer.cc:468] Memory (bytes): count=1000 curr=2816(all same)
2018-03-03 18:19:32.815277: I tensorflow/core/util/stat_summarizer.cc:468] 46 nodes observed
2018-03-03 18:19:32.815283: I tensorflow/core/util/stat_summarizer.cc:468] 
2018-03-03 18:19:32.829018: I tensorflow/tools/benchmark/benchmark_model.cc:596] FLOPs estimate: 109.62k
2018-03-03 18:19:32.829044: I tensorflow/tools/benchmark/benchmark_model.cc:598] FLOPs/second: 430.43M

From these inspections we have learned:

  • the name of the input layer is ins
  • the shape of the input layer is 1,22 (actually more of a guess, as the batch dimension is unspecified)
  • the name of the output layer is output_node0 (as given to keras2tf)
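A quick way to double-check these guesses is to load the frozen graph in a plain TensorFlow 1.x session and evaluate it once; a minimal sketch (the zero input is of course arbitrary):

import numpy as np
import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.GFile('LWTNN_v10.pb', 'rb') as f:
    gd.ParseFromString(f.read())

with tf.Graph().as_default() as g:
    tf.import_graph_def(gd, name='')

with tf.Session(graph=g) as sess:
    x = g.get_tensor_by_name('ins:0')
    y = g.get_tensor_by_name('output_node0:0')
    print(sess.run(y, {x: np.zeros((1, 22), np.float32)}))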

Step 2: compile the model

To compile a model into an almost stand-alone library or executable, one can follow the instructions on the TensorFlow site about tfcompile.

prerequisite

TensorFlow 1.5 source code (from git) and its dependencies, including bazel.

(tip: bazel will fill up your home directory by populating ~/.cache/bazel. Make sure you have plenty of space there, for instance by creating a link to /tmp or similar. You can also use the bazel option --output_base=/tmp/bazel/output, but the probability that you forget about it is high, and bazel will then happily rebuild everything from scratch...)

Just build the tests (after ./configure) as

bazel build //tensorflow/compiler/aot/tests:all_tests

This will create everything needed to compile and build the self-contained inference engine.

write the "config file"

Once we have guessed the right name and shape of the input and the name of the output, writing the config is easy (at least for our simple model above):
cat dnn.config.pbtxt 
feed {
  id { node_name: "ins" }
  shape {
    dim { size: 1 }
    dim { size: 22 }
  }
}
fetch {
  id { node_name: "output_node0" }
  name: "output_node0"
}
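For a simple single-input, single-output model like this one, the config can also be generated programmatically; a small helper (hypothetical, not part of the TensorFlow tooling, using the names and shape found above):

template = '''feed {
  id { node_name: "%s" }
  shape {
    dim { size: 1 }
    dim { size: %d }
  }
}
fetch {
  id { node_name: "%s" }
  name: "%s"
}
'''

with open('dnn.config.pbtxt', 'w') as f:
    f.write(template % ('ins', 22, 'output_node0', 'output_node0'))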

compile!

We can now use the tfcompile tool to compile our model into a shared library for each intended target:
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+sse4.2"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_sse.so
rm *.o
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+fma,+avx2"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_avx2.so
rm *.o
/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile --target_features="+fma,+avx512f"  --graph=LWTNN_v10.pb --config=dnn.config.pbtxt --entry_point=__tensorflow_tfmydnn --cpp_class=MyDNN --target_triple=x86_64-pc-linux --out_header=tfmydnn.h --out_metadata_object=tfmydnn_tfcompile_metadata.o --out_function_object=tfmydnn_tfcompile_function.o
c++ -shared tfmydnn_tfcompile_metadata.o tfmydnn_tfcompile_function.o  -o tfmydnn_avx512.so
rm *.o
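Since the three invocations differ only in --target_features and in the name of the resulting library, a small driver script avoids the copy-paste (a sketch; the tfcompile path and flags are exactly those of the commands above):

import subprocess

TFCOMPILE = '/data/vin/tensorflow/bazel-out/host/bin/tensorflow/compiler/aot/tfcompile'
VARIANTS = {'sse': '+sse4.2', 'avx2': '+fma,+avx2', 'avx512': '+fma,+avx512f'}

for tag, feats in VARIANTS.items():
    # ahead-of-time compile the graph for this instruction set
    subprocess.check_call([
        TFCOMPILE, '--target_features=' + feats,
        '--graph=LWTNN_v10.pb', '--config=dnn.config.pbtxt',
        '--entry_point=__tensorflow_tfmydnn', '--cpp_class=MyDNN',
        '--target_triple=x86_64-pc-linux', '--out_header=tfmydnn.h',
        '--out_metadata_object=tfmydnn_tfcompile_metadata.o',
        '--out_function_object=tfmydnn_tfcompile_function.o'])
    # link the two objects into a per-architecture shared library
    subprocess.check_call(['c++', '-shared',
        'tfmydnn_tfcompile_metadata.o', 'tfmydnn_tfcompile_function.o',
        '-o', 'tfmydnn_%s.so' % tag])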

ls -l tfmydnn*
-rw-r--r--. 1 innocent zh   7493 Mar  3 19:09 tfmydnn.h
-rwxr-xr-x. 1 innocent zh 232952 Mar  3 18:37 tfmydnn.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  3 19:09 tfmydnn_avx2.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  3 19:09 tfmydnn_avx512.so
-rwxr-xr-x. 1 innocent zh 232952 Mar  3 19:09 tfmydnn_sse.so
ls -l *.pb
-rw-r--r--. 1 innocent zh  224235 Mar  2 17:09 LWTNN_v10.pb
essentially the shared library contains the model as constants in a function implementing the DNN; the generated header tfmydnn.h looks like this:
// Generated by tfcompile, the TensorFlow graph compiler.  DO NOT EDIT!
//
// This header was generated via ahead-of-time compilation of a TensorFlow
// graph.  An object file corresponding to this header was also generated.
// This header gives access to the functionality in that object file.
//
// clang-format off

#ifndef TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_  // NOLINT(build/header_guard)
#define TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_  // NOLINT(build/header_guard)


#include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
#include "tensorflow/core/platform/types.h"

namespace Eigen { struct ThreadPoolDevice; }
namespace xla { class ExecutableRunOptions; }

// (Implementation detail) Entry point to the function in the object file.
extern "C" void __tensorflow_tfmydnn(
    void* result, const xla::ExecutableRunOptions* run_options,
    const void** args, void** temps, tensorflow::int64* profile_counters);




// MyDNN represents a computation previously specified in a
// TensorFlow graph, now compiled into executable code. This extends the generic
// XlaCompiledCpuFunction class with statically type-safe arg and result
// methods. Usage example:
//
//   MyDNN computation;
//   // ...set args using computation.argN methods
//   CHECK(computation.Run());
//   // ...inspect results using computation.resultN methods
//
// The Run method invokes the actual computation, with inputs read from arg
// buffers, and outputs written to result buffers. Each Run call may also use
// a set of temporary buffers for the computation.
//
// By default each instance of this class manages its own arg, result and temp
// buffers. The AllocMode constructor parameter may be used to modify the
// buffer allocation strategy.
//
// Under the default allocation strategy, this class is thread-compatible:
// o Calls to non-const methods require exclusive access to the object.
// o Concurrent calls to const methods are OK, if those calls are made while it
//   is guaranteed that no thread may call a non-const method.
//
// The logical function signature is:
//   (arg0: f32[1,22]) -> (f32[1,1])
//
// Memory stats:
//   arg bytes total:    88
//   arg bytes aligned:  96
//   temp bytes total:   2412
//   temp bytes aligned: 2464
class MyDNN : public tensorflow::XlaCompiledCpuFunction {
 public:
  // Number of input arguments for the compiled computation.
  static constexpr size_t kNumArgs = 1;

  // Byte size of each argument buffer. There are kNumArgs entries.
  static const intptr_t* ArgSizes() {
    static constexpr intptr_t kArgSizes[kNumArgs] = {88};
    return kArgSizes;
  }

  // Returns static data used to create an XlaCompiledCpuFunction.
  static const tensorflow::XlaCompiledCpuFunction::StaticData& StaticData() {
    static XlaCompiledCpuFunction::StaticData* kStaticData = [](){
      XlaCompiledCpuFunction::StaticData* data =
        new XlaCompiledCpuFunction::StaticData;
      data->raw_function = __tensorflow_tfmydnn;
      data->arg_sizes = ArgSizes();
      data->num_args = kNumArgs;
      data->temp_sizes = TempSizes();
      data->num_temps = kNumTemps;
      data->result_index = kResultIndex;
      data->arg_names = StaticArgNames();
      data->result_names = StaticResultNames();
      data->program_shape = StaticProgramShape();
      return data;
    }();
    return *kStaticData;
  }

  MyDNN(AllocMode alloc_mode = AllocMode::ARGS_RESULTS_PROFILES_AND_TEMPS)
      : XlaCompiledCpuFunction(StaticData(), alloc_mode) {}

  MyDNN(const MyDNN&) = delete;
  MyDNN& operator=(const MyDNN&) = delete;

  // Arg methods for managing input buffers. Buffers are in row-major order.
  // There is a set of methods for each positional argument, with the following
  // general form:
  //
  // void set_argN_data(void* data)
  //   Sets the buffer of type T for positional argument N. May be called in
  //   any AllocMode. Must be called before Run to have an affect. Must be
  //   called in AllocMode::RESULTS_PROFILES_AND_TEMPS_ONLY for each positional
  //   argument, to set the argument buffers.
  //
  // T* argN_data()
  //   Returns the buffer of type T for positional argument N.
  //
  // T& argN(...dim indices...)
  //   Returns a reference to the value of type T for positional argument N,
  //   with dim indices specifying which value. No bounds checking is performed
  //   on dim indices.

  void set_arg0_data(void* data) {
    set_arg_data(0, data);
  }
  float* arg0_data() {
    return static_cast<float*>(arg_data(0));
  }
  float& arg0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][22]>(
        arg_data(0)))[dim0][dim1];
  }
  const float* arg0_data() const {
    return static_cast<const float*>(arg_data(0));
  }
  const float& arg0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][22]>(
        arg_data(0)))[dim0][dim1];
  }

  // Result methods for managing output buffers. Buffers are in row-major order.
  // Must only be called after a successful Run call. There is a set of methods
  // for each positional result, with the following general form:
  //
  // T* resultN_data()
  //   Returns the buffer of type T for positional result N.
  //
  // T& resultN(...dim indices...)
  //   Returns a reference to the value of type T for positional result N,
  //   with dim indices specifying which value. No bounds checking is performed
  //   on dim indices.
  //
  // Unlike the arg methods, there is no set_resultN_data method. The result
  // buffers are managed internally, and may change after each call to Run.

  float* result0_data() {
    return static_cast<float*>(result_data(0));
  }
  float& result0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }
  const float* result0_data() const {
    return static_cast<const float*>(result_data(0));
  }
  const float& result0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }

  float* result_output_node0_data() {
    return static_cast<float*>(result_data(0));
  }
  float& result_output_node0(size_t dim0, size_t dim1) {
    return (*static_cast<float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }
  const float* result_output_node0_data() const {
    return static_cast<const float*>(result_data(0));
  }
  const float& result_output_node0(size_t dim0, size_t dim1) const {
    return (*static_cast<const float(*)[1][1]>(
        result_data(0)))[dim0][dim1];
  }

 private:
  // Number of result and temporary buffers for the compiled computation.
  static constexpr size_t kNumTemps = 4;
  // The 0-based index of the result tuple in the temporary buffers.
  static constexpr size_t kResultIndex = 1;

  // Byte size of each result / temporary buffer. There are kNumTemps entries.
  static const intptr_t* TempSizes() {
    static constexpr intptr_t kTempSizes[kNumTemps] = {-1, 8, 4, 2400};
    return kTempSizes;
  }

  // Array of names of each positional argument, terminated by nullptr.
  static const char** StaticArgNames() {
    return nullptr;
  }

  // Array of names of each positional result, terminated by nullptr.
  static const char** StaticResultNames() {
    return nullptr;
  }

  // Shape of the args and results.
  static const xla::ProgramShape* StaticProgramShape() {
    static const xla::ProgramShape* kShape = nullptr;
    return kShape;
  }
};


#endif  // TFCOMPILE_GENERATED___tensorflow_tfmydnn_H_

// clang-format on

where it states that each instance will use ~2.5 KB of memory (see the "Memory stats" comment above)

Step 3: tests

prerequisite

the easiest is to copy locally the two TensorFlow libraries required to invoke the compiled function:
cp /data/vin/tensorflow/bazel-bin/tensorflow/compiler/aot/libruntime.so .
cp /data/vin/tensorflow/bazel-bin/tensorflow/compiler/tf2xla/libxla_compiled_cpu_function.so .

then one can copy the example from the TensorFlow tests and adapt it to our model. We have also made it possible to instantiate multiple (identical) models, and to run them multiple times in multiple threads:

#include "tfmydnn.h"

#include <array>
#include <algorithm>
#include <cmath>      // std::tanh used in the dummy loop below

#include <thread>
#include <functional>
#include <vector>

#include <iostream>
/*
vars=["trk_pt", "trk_eta", "trk_lambda", "trk_dxy", "trk_dz", "trk_dxyClosestPV", "trk_dzClosestPVClamped", 
"trk_ptErr","trk_etaErr", "trk_lambdaErr", "trk_dxyErr", "trk_dzErr", "trk_nChi2", "trk_ndof", "trk_nInvalid", "trk_nPixel", "trk_nStrip", "trk_nPixelLay", 
"trk_nStripLay", "trk_n3DLay", "trk_nLostLay", "trk_algo"]
*/


void  go(int neval, int ndnn) {

  if (ndnn==0) {
   float tot=0;
   std::cout <<"dummy running (compute tanh) "<<std::endl;
   for (int i=0; i<neval; ++i) {
       tot+= std::tanh(float((i%2) ? i : -i));
   }
   std::cout << tot << std::endl;
   return;
  }

  float tot=0;
  int N=neval/ndnn;
  MyDNN dnn[ndnn];  // run-time sized array: a GCC extension
  std::cout <<"running " << ndnn << " dnns" <<std::endl;
 
  for (int i=0; i<N; ++i) {
    for (int j=0; j<ndnn; ++j) {
      // fill the 22 input features (see the variable list above);
      // a few entries depend on i so that successive calls differ
      dnn[j].arg0_data()[0] = (i%2) ? 3. : 5;
      dnn[j].arg0_data()[1] = 0.;
      dnn[j].arg0_data()[2] = 0.;
      dnn[j].arg0_data()[3] = 0.;
      dnn[j].arg0_data()[4] = 0.;
      dnn[j].arg0_data()[5] = 0.;
      dnn[j].arg0_data()[6] = 0.;
      dnn[j].arg0_data()[7] = 0.1;
      dnn[j].arg0_data()[8] = 0.1;
      dnn[j].arg0_data()[9] = 0.01;
      dnn[j].arg0_data()[10] = 0.01;
      dnn[j].arg0_data()[11] = 0.1;
      dnn[j].arg0_data()[12] = (i%3) ? 1. : 1.2;
      dnn[j].arg0_data()[13] = 15;
      dnn[j].arg0_data()[14] = 0;
      dnn[j].arg0_data()[15] = 4;
      dnn[j].arg0_data()[16] = 12;
      dnn[j].arg0_data()[17] = 4;
      dnn[j].arg0_data()[18] = 8;
      dnn[j].arg0_data()[19] = 8;
      dnn[j].arg0_data()[20] = 1;
      dnn[j].arg0_data()[21] = (i%5) ? 4 : 6;

      dnn[j].Run();
      tot += dnn[j].result0_data()[0];
    }
  }
  std::cout << tot << ' ' << dnn[0].result0_data()[0]<< std::endl;

}


#include <cstdlib>

int main(int argc, char * argv[]) {
  typedef std::thread Thread;
  typedef std::vector<std::thread> ThreadGroup;

  if (argc<4) {
    std::cout << "please provide # total invocations, # dnns, # threads" << std::endl;
    return -1;
  }

  
  const int NUMTHREADS= atoi(argv[3]);
   ThreadGroup threads;
 
  if (NUMTHREADS>1) {
    threads.reserve(NUMTHREADS-1);
    for (int i=0; i<NUMTHREADS-1; ++i) {
      threads.push_back(Thread(go,atoi(argv[1]),atoi(argv[2])));
    }
  }  

  go(atoi(argv[1]),atoi(argv[2]));

  if (NUMTHREADS>1)
    std::for_each(threads.begin(),threads.end(), 
  		std::bind(&Thread::join,std::placeholders::_1));
 
  return 0;
}

we can now compile it as

 c++ -march=haswell -pthread -fPIC -Ofast -std=c++11 -Wall test_tfmydnn.cpp -I/data/vin/tensorflow/ ./tfmydnn.so ./libxla_compiled_cpu_function.so ./libruntime.so -o test_tfmydnn
where tfmydnn.so is a link to one of the libraries built above. One can switch from one architecture to another just by changing the link:
rm tfmydnn.so; ln -s tfmydnn_avx512.so tfmydnn.so
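Instead of the symlink, one could also pick the variant at run time from the CPU capabilities; a hypothetical helper for Linux, reading /proc/cpuinfo:

def best_variant(cpuinfo='/proc/cpuinfo'):
    flags = open(cpuinfo).read()
    # prefer the widest vector extension the CPU advertises
    for feature, lib in (('avx512f', 'tfmydnn_avx512.so'),
                         ('avx2',    'tfmydnn_avx2.so')):
        if feature in flags:
            return lib
    return 'tfmydnn_sse.so'

print(best_variant())

The chosen library can then be dlopen-ed, or linked as above.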

now just pack everything in a directory and you are ready for deployment

[innocent@vinzen0 tkdnn]$ pwd
/afs/cern.ch/user/i/innocent/w1/tkdnn
[innocent@vinzen0 tkdnn]$ ls -l
total 747
-r-xr-xr-x. 1 innocent zh   8048 Mar  4 09:29 libruntime.so
-r-xr-xr-x. 1 innocent zh  13520 Mar  4 09:29 libxla_compiled_cpu_function.so
-rwxr-xr-x. 1 innocent zh  22104 Mar  4 09:29 test_tfmydnn
-rw-r--r--. 1 innocent zh   2638 Mar  4 09:29 test_tfmydnn.cpp
-rw-r--r--. 1 innocent zh   7493 Mar  4 09:29 tfmydnn.h
-rwxr-xr-x. 1 innocent zh 237048 Mar  4 09:29 tfmydnn_avx2.so
-rwxr-xr-x. 1 innocent zh 237048 Mar  4 09:29 tfmydnn_avx512.so
-rwxr-xr-x. 1 innocent zh 232952 Mar  4 09:29 tfmydnn_sse.so
[innocent@vinzen0 tkdnn]$ ldd test_tfmydnn
	linux-vdso.so.1 =>  (0x00007ffff772e000)
	./tfmydnn.so => not found
	./libxla_compiled_cpu_function.so (0x00007f7ff33e6000)
	./libruntime.so (0x00007f7ff31e4000)
	libstdc++.so.6 => /afs/cern.ch/user/i/innocent/w5/lib64/libstdc++.so.6 (0x00007f7ff2e60000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00007f7ff2b5d000)
	libgcc_s.so.1 => /afs/cern.ch/user/i/innocent/w5/lib64/libgcc_s.so.1 (0x00007f7ff2945000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7ff2729000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7ff2365000)
	/lib64/ld-linux-x86-64.so.2 (0x0000561859e06000)

Benchmarks

prerequisite

access to a variety of hardware platforms...

memory scaling

test performed on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (4 cores) running CC7, measured using ps -o "cmd rss vsz"
(for 0 instances the test just runs a plain tanh)

# evaluations  # instances  # threads  RSS (KB)  VSZ (KB)
1000000000               0          1      1552     21776
1000000000               0          4      1552     46364
1000000000               0          8      1556     79148
1000000                  1          1      1820     21776
1000000                 10          1      1820     21776
1000000                100          1      2084     22040
1000000               1000          1      4456     24440
1000000                  1          4      1832    242972
1000000               1000          4     12080    245636
1000000                  1          8      3596    537900
1000000               1000          8     24944    540564

which seems to be consistent with ~250 KB of shared data plus ~2.5 KB for each instance
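Indeed, a quick arithmetic check against the single-thread rows above and the Memory stats quoted in tfmydnn.h:

rss_1, rss_1000 = 1820, 4456          # RSS in KB for 1 and 1000 instances, 1 thread
print((rss_1000 - rss_1) / 999.)      # ~2.6 KB per instance, close to the
                                      # 2464+96 aligned bytes quoted in tfmydnn.h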

timing

performed using
rm tfmydnn.so ; ln -s tfmydnn_sse.so tfmydnn.so ; perf stat -d ./test_tfmydnn 1000000 100 4 | & egrep "GHz|elapsed|msec|instr"
etc

on Xeon machines we took care to run on a single socket

machine        arch    # threads  Freq (GHz)  time (s)  cpu time (s)  Ginstructions  ins/cycle
i7-6700K       SSE             4        3.9       4.89         18.8           230.9       3.14
i7-6700K       SSE             8        3.9       8.84         70.3           461.8       1.69
i7-6700K       AVX2            1        3.9       3.69          3.69           27.4       1.91
i7-6700K       AVX2            4        3.9       3.69         14.7            57.4       1.91
i7-6700K       AVX2            8        3.9       5.91         47.0           219.1       1.20
Ryzen7 1800X   SSE             8       3.62       4.34         34.6           461.2       3.68
Ryzen7 1800X   AVX2            8       3.62       4.34         34.6           219.1       1.75
E5-2650 v4     SSE             1       2.59       7.82          7.82           57.7       2.85
E5-2650 v4     SSE            12       2.47       8.20         97.5           692.5       2.87
E5-2650 v4     AVX2            1       2.55       6.49          6.49           27.4       1.65
E5-2650 v4     AVX2           12       2.43       6.85         81.0           328.8       1.67
Silver 4110    SSE             8       2.09       8.2          65.5           461.7       3.37
Silver 4110    AVX2            1       2.09       6.13          6.13           27.4       2.14
Silver 4110    AVX2            8       2.09       6.14         49.1           219.0       2.14
Silver 4110    AVX512          1       1.76       9.61          9.51           32.4       1.92
Silver 4110    AVX512          8       1.20      13.1         104.6           259.3       1.92
Phi 7210       SSE            64       1.28      42.17       2232            3697         1.29
Phi 7210       SSE           256       1.25     112         28320           14821         0.42
Phi 7210       AVX2            8       1.28      21.6         171             219         1.0
Phi 7210       AVX2           64       1.28      27.0        1484            1756         0.93
Phi 7210       AVX2          128       1.27      41.2        4480            3518         0.63
Phi 7210       AVX2          256       1.25      64.9       16238            7045         0.35
Phi 7210       AVX512          8       1.28      22.8         181             259.5       1.12
Phi 7210       AVX512         64       1.28      32.9        1582            2975         1.02
Phi 7210       AVX512        128       1.27      48.0        5202            4161         0.63
Phi 7210       AVX512        256       1.25      77.0       19290            8337         0.35

here are the ratios of real time sse/avx2 and avx2/avx512, for one thread per core

machine       sse/avx2  avx2/avx512
i7-6700K          1.33            -
Ryzen7 1800X      1.0             -
E5-2650 v4        1.22            -
Silver 4110       1.33         0.47
Phi 7210          1.56         0.82

The performance with AVX512 (in particular on the SKL Silver 4110) is pretty disappointing: besides the lower frequency (which is expected, see for instance https://en.wikichip.org/wiki/intel/xeon_silver/4110), note how the number of instructions is higher than with AVX2. This can also be seen with a more detailed perf analysis:

[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_sse.so tfmydnn.so ; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
                0      fp_arith_inst_retired_512b_packed_single #    0.000 K/sec                    (11.57%)
                0      fp_arith_inst_retired_256b_packed_single #    0.000 K/sec                    (11.43%)
      44575000479      fp_arith_inst_retired_128b_packed_single # 3379.500 M/sec                    (11.50%)
       2087690622      fp_arith_inst_retired_scalar_single #  158.280 M/sec                    (11.58%)
[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_avx2.so tfmydnn.so; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
                0      fp_arith_inst_retired_512b_packed_single #    0.000 K/sec                    (11.73%)
      21626826075      fp_arith_inst_retired_256b_packed_single # 2131.011 M/sec                    (11.73%)
        376225683      fp_arith_inst_retired_128b_packed_single #   37.072 M/sec                    (11.56%)
       8265440396      fp_arith_inst_retired_scalar_single #  814.440 M/sec                    (11.45%)
[innocent@olsky03 tkdnn]$ rm tfmydnn.so ; ln -s tfmydnn_avx512.so tfmydnn.so ; ~/scripts/doOCPerfSX "./test_tfmydnn 100000 100 16" | & egrep "fp_arith.*single.*#"
      12737652202      fp_arith_inst_retired_512b_packed_single #  605.915 M/sec                    (11.56%)
        189005228      fp_arith_inst_retired_256b_packed_single #    8.991 M/sec                    (11.49%)
          6313559      fp_arith_inst_retired_128b_packed_single #    0.300 M/sec                    (11.52%)
      12082932411      fp_arith_inst_retired_scalar_single #  574.771 M/sec                    (11.54%)

most probably either the code generated by TensorFlow is not (yet) optimized for AVX512, or there are alignment problems, or, more simply, it is our model (22x300x150x20x10x1) that does not fit well the 16-float wide registers of AVX512

we have tested a different model (22x320x160x32x16x1) and indeed the ratio avx512/avx2 improves a bit (the number of instructions becomes at least similar). Still, the model being bigger, the performance with AVX2 is worse (slower) than for the previous, smaller model.
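The register-width argument can be made quantitative by counting, for each layer width, the lanes left idle in its last (partial) vector register; a quick check for both models:

from __future__ import print_function

def idle_lanes(widths, lanes):
    # lanes left unused in the last vector register of each layer width
    return [(-w) % lanes for w in widths]

small = [300, 150, 20, 10, 1]   # 22x300x150x20x10x1
big   = [320, 160, 32, 16, 1]   # 22x320x160x32x16x1
for name, widths in (('small', small), ('big', big)):
    print(name, 'avx2:', idle_lanes(widths, 8),
          'avx512:', idle_lanes(widths, 16))
# the small model wastes up to 15 of the 16 AVX512 lanes per vector,
# while the big one does so only in the final single-output layer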

-- VincenzoInnocente - 2018-03-03
