Attempting to make a TensorFlow hass.io add-on by implementing the tensorflow component with gRPC support

So, based on some discussions from this thread, it seemed the main reason for the lack of TensorFlow in hass.io was the size of TensorFlow. My feeling is that the best way to tackle this is to put a gRPC stub inside the main hass.io container, so that all the TensorFlow machinery can live in its own container.

So, since I was the one who opened my big mouth, and since I really want to dockerize this component too, I’m giving it a shot myself. I’m probably not super qualified to do this, as I’m not a Python programmer or TensorFlow expert by trade, but I’m a generalist, so who knows, maybe I’ll pull it off. I’m putting this thread here should anyone else want to follow in my footsteps or give me (much-needed) advice along the way. Please feel free to try this yourself if you are so inclined. I hope I can give at least some useful information that helps push this project forward.


So, the first step is to get a TensorFlow Serving container spun up with gRPC support. I will be working from a hass.io instance I have installed on a 2011 Mac Mini running Debian Server. For now, I’m not using this instance as my main Home Assistant server, so I hope it can be a good testbed.

As I don’t really know TensorFlow in any meaningful way (other than having it running on my Hassbian instance by following the setup guides), I’m going to see if I can get the gRPC version running using a guide. Part 3 of this guide explains the general concept of what we are trying to do here, but I will use Part 2 to help me set up my Docker container.

There is an official TensorFlow Serving Docker image available here, but the author of the guide made his own, and since it’s his guide I will be using, I will try it with his container too (I am using the CPU version, but if you have a GPU, you can try that version).


So, let’s get started. I’m not going to bother trying to build my own Docker image like he shows in the guide; I’ll just try to use the one he made.

Pull the docker image:

docker pull gauravkaila/tf_serving_cpu

(It’s big and it takes a while)

Spin up the container:

docker run -it -d -P --name tf_serving_cpu -p 3000:3000 gauravkaila/tf_serving_cpu
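
Before loading a model, it’s worth a quick sanity check that the container actually stayed up (assuming the docker CLI on the host):

# Confirm the container is running and peek at its startup output
docker ps --filter name=tf_serving_cpu
docker logs tf_serving_cpu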

I’m also not going to bother building a model like he does (though I do aspire to someday); I’m just going to try with one of the prebuilt models from the model zoo.

In my case, I hope my old Mac Mini can manage faster_rcnn_inception_v2_coco, so I download that to my hass.io box and extract it.
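
For anyone following along, the direct link (taken from the TF1 object detection model zoo listing, so worth double-checking it’s still current):

wget http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_v2_coco_2018_01_28.tar.gz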

tar -xvzf faster_rcnn_inception_v2_coco_2018_01_28.tar.gz 
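
It’s worth a quick look at what comes out of the archive; as I understand the TF1 zoo models (so treat the exact layout as an assumption), the saved_model/ subfolder is the part the model server needs:

ls faster_rcnn_inception_v2_coco_2018_01_28
# expect something like: checkpoint files, frozen_inference_graph.pb,
# pipeline.config and a saved_model/ directory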

Then I use Portainer to create a shell in my tf_serving_cpu container, and I make a folder to hold my model:

mkdir /modeldir

Then I go back onto the host Debian server and copy my TensorFlow model over to the folder in the running tf_serving_cpu container:

docker cp ./faster_rcnn_inception_v2_coco_2018_01_28 tf_serving_cpu:/modeldir
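
To double-check the copy landed where I think it did (assuming docker exec is available on the host):

# List what actually ended up in the container
docker exec tf_serving_cpu ls -R /modeldir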

Then I pop back into the tf_serving_cpu container in my Portainer shell and try to start the model server…

cd /server

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=3000 --model_name=obj_det --model_base_path=/modeldir/saved_model &> obj_det &

As I don’t really know what I’m doing, I also tried:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=3000 --model_name=obj_det --model_base_path=/modeldir &> obj_det &

Unfortunately, in both cases I am met with a not-so-nice message :frowning:

Aborted (core dumped) 

When I check my log in obj_det, I find:

2019-02-15 21:30:09.726026: F external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:36] The TensorFlow library was compiled to use SSE4.2 instructions, but these aren't available on your machine.

Looks like he had a nicer CPU than mine… So I need to see what I can do to remedy this… Maybe one of the precompiled builds from here can save my bacon… Otherwise, I will try to compile it myself.

I’ll let you know what I come up with when I get a chance to try and fix it.


So, I tried a different image today. I pulled the standard tensorflow/serving image, which is meant to be more forgiving, and ran it with:

docker run -p 8501:8501 --mount type=bind,source=/usr/share/hassio/share/faster_rcnn_inception_v2_coco_2018_01_28/saved_model,target=/models/my_model -e MODEL_NAME=my_model -t tensorflow/serving

Unfortunately, that also ended in a very similar way:

/usr/bin/tf_serving_entrypoint.sh: line 3: 6 Illegal instruction (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"

I’m starting to be concerned that this machine is too weak to pull this off without a pre-built image to work from. From what I read, you want 10-12 GB of memory and 4-6 cores to compile with, whereas I currently have only 2 GB and 2 cores :(. I also don’t think I can just swap in a standard TensorFlow build, because TensorFlow Serving seems to be a different codebase… I’m not well versed enough in TensorFlow to know what workaround I can get away with here.
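
If you want to check what you’re actually working with before attempting a build, the standard Linux tools will tell you:

nproc      # how many CPU cores you have to compile with
free -h    # total and available memory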

I have another Mac Mini I plan to replace, and then I will have 8 GB for this machine, but that still might not be enough to build with… And that won’t be happening until summer.

I might have to build a whole new environment in a VM on one of my work Windows boxes, which would be strong enough… I’m not sure I will find the time any time soon to build another environment from scratch…

Assuming you are running Linux, you can check your CPU with:

cat /proc/cpuinfo | grep -i flags

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xsaves arat flush_l1d arch_capabilities

The file /proc/cpuinfo has the details on the processor, including the features it supports.
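
If you want to zero in on the extensions the prebuilt binaries complain about (SSE4.2 in your log; AVX is the other usual suspect, though the exact set depends on how the binary was built):

# Prints each extension found; no output means the CPU lacks them,
# which would explain the "Illegal instruction" crashes
grep -o -e sse4_2 -e avx /proc/cpuinfo | sort -u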

Yep, I did that already… I tried a build last night; I think it completed, which is a good sign that I don’t really need 10 GB to build it. I have a new container called stoic_newton; gonna see what it does.

edit: actually, that seems to be one of my previous attempts… Too many containers now; gonna clean them up and try to build again. Just noticed there was a

--local_resources=2048,.5,1.0

option, which might let me build under my low-memory conditions.

Okay, so I ran another build today which got me a Docker image… I ran it as follows (the default options are meant to use the right CPU flags for your architecture, so I put my faith in their Dockerfile):


sudo -i

git clone https://github.com/tensorflow/serving

cd ~/serving/tensorflow_serving/tools/docker

docker build --pull --build-arg TF_SERVING_BUILD_OPTIONS="--local_resources 1800,1.0,1.0" -t $USER/tensorflow_serving .

I ended up with a docker image called root/tensorflow_serving:latest
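
A quick check that the image is really there (and how big it came out):

docker images root/tensorflow_serving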

I tried launching a container with:

docker run -p 8501:8501 --mount type=bind,source=/usr/share/hassio/share/faster_rcnn_inception_v2_coco_2018_01_28/saved_model,target=/models/my_model -e MODEL_NAME=my_model -t root/tensorflow_serving:latest

but unfortunately I am still getting a core dump…

/usr/bin/tf_serving_entrypoint.sh: line 3: 6 Illegal instruction (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"

Anyone have an idea if it’s really choking on my CPU or if there’s something else wrong in what I’m doing?

I might try another build with more prescriptive compilation options, but I’m starting to run out of ideas.
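
In case anyone wants to beat me to it, my best guess (completely untested) at "more prescriptive" would be pushing compiler flags through TF_SERVING_BUILD_OPTIONS, something like:

# Untested sketch: pin codegen to this machine's CPU via copts, in case
# the default build config assumes newer extensions than the host has
docker build --pull \
  --build-arg TF_SERVING_BUILD_OPTIONS="--copt=-march=native --local_resources 1800,1.0,1.0" \
  -t $USER/tensorflow_serving .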

Side note: I just remembered that the facebox integration also works via an external server. So once you’ve got a working container, you might look at how the facebox integration works.

So, I gave up on the Mac Mini. The processor was probably too old, or the memory too small, to pull this off. I will do it on my work laptop, which is a new i7/32 GB machine… I set up an ESXi VM with Ubuntu Server on it, gave it 64 GB of disk and 20 GB of RAM to play with, and ran the usual hass.io install:

sudo -i

add-apt-repository universe

apt-get update

apt-get install -y apparmor-utils apt-transport-https avahi-daemon ca-certificates curl dbus jq network-manager socat software-properties-common

curl -fsSL get.docker.com | sh

curl -sL "https://raw.githubusercontent.com/home-assistant/hassio-build/master/install/hassio_install" | bash -s
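
Once that finishes, it takes a few minutes for the supervisor to pull everything up; a quick way to check (container names as hass.io created them at the time, so treat them as an assumption):

docker ps --format '{{.Names}}'
# expect to see hassio_supervisor and homeassistant in the list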

And then, just to see if I could get it to work without another core dump, I followed the TensorFlow Serving quick start from here:

# Download the TensorFlow Serving Docker image and repo
docker pull tensorflow/serving

git clone https://github.com/tensorflow/serving
# Location of demo models
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"

# Start TensorFlow Serving container and open the REST API port
docker run -t --rm -p 8501:8501 \
   -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
   -e MODEL_NAME=half_plus_two \
   tensorflow/serving &

# Query the model using the predict API
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
   -X POST http://localhost:8501/v1/models/half_plus_two:predict

# Returns => { "predictions": [2.5, 3.0, 4.5] }

And happy days, no core dump! I’m able to access the REST API with the command shown… So now I just have to map everything over. No problem :sweat_smile:
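
For the record, here’s my working sketch of how the mapping over might look (untested beyond the quick start above — the numeric version directory is how I understand TF Serving discovers models, and the paths and model name here are just my own choices):

# TF Serving looks for <model_base_path>/<numeric version>/ containing the
# SavedModel files, so the zoo model likely needs re-laying-out first
mkdir -p /usr/share/hassio/share/models/faster_rcnn/1
cp -r faster_rcnn_inception_v2_coco_2018_01_28/saved_model/* \
      /usr/share/hassio/share/models/faster_rcnn/1/

# Serve it, exposing both the gRPC (8500) and REST (8501) ports
docker run -t --rm -p 8500:8500 -p 8501:8501 \
   -v "/usr/share/hassio/share/models/faster_rcnn:/models/faster_rcnn" \
   -e MODEL_NAME=faster_rcnn \
   tensorflow/serving &

# Ask the server whether the model actually loaded
curl http://localhost:8501/v1/models/faster_rcnn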


Did you manage to get it working?

What else did you have to do?

Someone did something similar, but different, at https://github.com/blakeblackshear/frigate

See Local realtime person detection for RTSP cameras

Sorry, it will be a while before this project is done, I’m afraid. At the moment I just have a test environment; I still need to do all the programming of a component to do this (which I’m far from an expert in doing, so it requires a lot of research). I wouldn’t expect it in the next 60-90 days, but I will keep working on it when I find the free time to do so, if someone doesn’t come up with a better option.

That’s an interesting way to do it… I might see if I can try that out before spending too much energy on this (though I do think this is a better architecture, as it’s the way everyone builds TensorFlow clusters, so it’s much more standard).

edit: so looking at it more, it seems he’s way ahead of me and way more qualified to be doing this integration. It also has the advantage of not needing any additional components added to hass to work, so it can get up and running right away… I’m going to put my support behind his work for now; I don’t think I would catch up to or surpass it. But someone else is certainly welcome to try…
