looking for assistance with "parallel" makefile, willing to pay

Robert P. J. Day-2

  (not sure if this is appropriate, but i have a makefile just dropped
in my lap that has some issues, and if someone can clean it up, i'll
do better than virtual beer, i'll do real compensation. i'd be happy
to interac the problem solver $100 (CAD), so here we go.)

  here is a package that is being built for x86-64 using wind river
linux 9:

  https://github.com/juniper/jp4agent

as you can see, there has been some attempt to add parallelization to
this makefile, which appears to be exactly what the problem is.

  if you look at the top-level Makefile, a whole pile of things
depends on the same thing -- src/pi/protos:

  src/pi/protos: AFI
  src/pi/src: src/pi/protos
  src/jp4agent/src: src/pi/protos src/pi/src
  src/utils/src: src/pi/protos
  src/afi/src: src/pi/protos
  test/controller/src: src/pi/protos
  test/gtest/src: src/pi/protos
  src/targets/null/null/src: src/pi/protos

and it's just that Makefile under src/pi/protos that, intermittently,
generates the following build error (i don't think there's anything
here i need to redact for security):

... much snipped ...
19:29:13 | cp: cannot create regular file
'/build/jenkins/workspace/WRL9-build/daily/tmp/work/core2-64-poky-linux/jp4agent/git-r0/image/usr/include/jp4agent/src/pi/./protos/status.pb.h':
File exists
19:29:13 | cp: cannot create regular file
'/build/jenkins/workspace/WRL9-build/daily/tmp/work/core2-64-poky-linux/jp4agent/git-r0/image/usr/include/jp4agent/src/pi/protos/./p4info.grpc.pb.h':
File exists
19:29:13 | make[1]: *** [Makefile:166: install] Error 1
19:29:13 | make[1]: Leaving directory
'/build/jenkins/workspace/WRL9-build/daily/tmp/work/core2-64-poky-linux/jp4agent/git-r0/git/src/pi/protos'
19:29:13 | make: *** [Makefile:58: install-src/pi/protos] Error 2
19:29:13 | make: *** Waiting for unfinished jobs....
19:29:13 | make[1]: Leaving directory
'/build/jenkins/workspace/WRL9-build/daily/tmp/work/core2-64-poky-linux/jp4agent/git-r0/git/src/afi/src'
19:29:13 | make[1]: *** [Makefile:104: install] Error 1
19:29:13 | make[1]: Leaving directory
'/build/jenkins/workspace/WRL9-build/daily/tmp/work/core2-64-poky-linux/jp4agent/git-r0/git/src/pi/src'
19:29:13 | make: *** [Makefile:58: install-src/pi/src] Error 2

  those "File exists" errors appear to be a race condition as
described here:

https://unix.stackexchange.com/questions/116280/cannot-create-regular-file-filename-file-exists

and if you head down to src/pi/protos/Makefile, sure enough, at line
166:

.PHONY: install
install: $(TARGET_LIB)
        install -d $(DESTDIR)$(prefix)/lib64/
        install -m 644 $(wildcard *.so.*) $(DESTDIR)$(prefix)/lib64/
        rm -f $(DESTDIR)$(prefix)/lib64/${basename $(TARGET_LIB)}
        ln -s ${notdir $(TARGET_LIB)} $(DESTDIR)$(prefix)/lib64/${basename $(TARGET_LIB)}
        install -d $(DESTDIR)$(prefix)/include/jp4agent
        install -d $(DESTDIR)$(prefix)/include/jp4agent/src
        install -d $(DESTDIR)$(prefix)/include/jp4agent/src/pi
        install -d $(DESTDIR)$(prefix)/include/jp4agent/src/pi/protos
166     cp --parents `find -name \*.h` $(DESTDIR)$(prefix)/include/jp4agent/src/pi/protos

  as you can see, that cp command tries to copy a pile of header files
elsewhere, and that certainly seems to be the cause of this (alleged)
race condition.

  i've only started looking at this today, so it's going to take a
while to wrap my head around this, but it's hard to see how that error
can be other than a race condition based on the parallelization that's
been added to this code base.

  if someone wants to isolate the problem and fix it, i'd be a happy
camper as i have lots of other stuff to work on.

  thoughts?

rday

p.s. finding a way to reproduce the build error would be just ducky,
too.

Re: looking for assistance with "parallel" makefile, willing to pay

Kaz Kylheku (gmake)
On 2020-02-25 15:45, Robert P. J. Day wrote:
>   if someone wants to isolate the problem and fix it, i'd be a happy
> camper as i have lots of other stuff to work on.
>
>   thoughts?

If multiple targets in the top-level makefile are recursively invoking
make in src/pi/protos, running the install target, that's a problem.

The copies will interfere with each other, because cp is doing something
like

     unlink("target");
     fd = open("target", O_WRONLY | O_CREAT | O_EXCL, 0644);
     write(fd, data, len);

if another copy is running in parallel and creates the file between
the above unlink and open, then that will fail with EEXIST.
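For anyone who wants to poke at this outside the real build, here is a rough sketch of a reproducer. It assumes GNU cp (for --parents) and mktemp; the function name race_once, the scratch tree layout and the job count are all made up for illustration and have nothing to do with the actual jp4agent tree. Since this is a race, any single run may well succeed.

```shell
#!/bin/sh
# Hypothetical reproducer (not from the jp4agent repo): start several
# "cp --parents" runs at once, all copying the same headers into one
# fresh destination, and report whether any of them hit EEXIST.
# Returns 1 if "File exists" was seen, 0 if not, 2 on setup failure.
race_once() (
    jobs=${1:-8}
    work=$(mktemp -d) || return 2
    mkdir -p "$work/src/a/b" "$work/dest"

    # Create a pile of empty header files to copy.
    i=0
    while [ "$i" -lt 50 ]; do
        : > "$work/src/a/b/h$i.h"
        i=$((i + 1))
    done

    # Each background job stands in for one concurrent "make install".
    j=0
    while [ "$j" -lt "$jobs" ]; do
        ( cd "$work/src" &&
          LC_ALL=C cp --parents $(find . -name '*.h') "$work/dest"
        ) 2> "$work/err.$j" &
        j=$((j + 1))
    done
    wait

    # Look for the same message that shows up in the build log.
    if grep -q 'File exists' "$work"/err.*; then rc=1; else rc=0; fi
    rm -rf "$work"
    return "$rc"
)
```

Calling race_once 16 repeatedly until it returns 1 is one way to see whether the failure can be provoked on a given machine; if it never triggers, that says little either way about the real build.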

One possible way to fix it is to use a lock directory as a mutex:

At the start of the operation do something like:

    .PHONY: install
    install:
        mkdir -p $(DESTDIR) # ensure we have a place where .lockdir goes
        while ! mkdir $(DESTDIR)/.lockdir ; do sleep 1 ; done

        # do all the file copying

        rmdir $(DESTDIR)/.lockdir

mkdir fails when a directory exists, and I think there
is enough synchronization in NFS over directory creation that
it even works there (whereas lock files do not necessarily).

There is a risk of stale .lockdir jamming up the build, though.

It could be pre-emptively removed by some top-level prepare step
(that is not itself parallelized).
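As a rough illustration of the lock directory idea with a timeout added, so that a stale lock fails the build instead of hanging it, something like the following shell helper could be called from the recipe. The name with_lockdir, its argument order and its exit codes are invented for this sketch; it does not clean up if the process is killed while holding the lock.

```shell
#!/bin/sh
# Hypothetical helper: run a command while holding a lock directory.
# usage: with_lockdir LOCKDIR TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns 2 if the lock could never be created, 3 on timeout,
# otherwise the exit status of COMMAND.
with_lockdir() (
    lock=$1
    timeout=$2
    shift 2

    # If the parent is missing or not writable, mkdir would fail for a
    # reason other than "already exists", so give up immediately.
    parent=$(dirname "$lock")
    if [ ! -d "$parent" ] || [ ! -w "$parent" ]; then
        echo "with_lockdir: cannot create $lock" >&2
        return 2
    fi

    # mkdir either creates the directory or fails, so only one
    # caller at a time gets past this loop.
    waited=0
    until mkdir "$lock" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "with_lockdir: timed out waiting for $lock" >&2
            return 3
        fi
        sleep 1
        waited=$((waited + 1))
    done

    "$@"
    rc=$?
    rmdir "$lock"
    return "$rc"
)
```

The body runs in a subshell (note the parentheses), so the variables stay local to each call. A recipe line might then look like with_lockdir $(DESTDIR)/.lockdir 60 cp ..., assuming the helper is available to the shell that make starts.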

That rule could use a general stamp file so that it doesn't
wastefully repeat all the file copying. Serializing it is one
thing, but it's still wasteful to repeatedly do it.
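The stamp file idea can be sketched in shell as well. The helper name run_once and the stamp path are hypothetical; the check is only safe against concurrent callers if it runs while the lock is held, which this sketch does not do by itself.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if its stamp file is absent,
# and create the stamp only when the command succeeds.
# usage: run_once STAMPFILE COMMAND [ARGS...]
run_once() {
    stamp=$1
    shift

    # Work already done by an earlier caller: nothing to do.
    if [ -e "$stamp" ]; then
        return 0
    fi

    # A failed command leaves no stamp, so the next caller retries.
    "$@" && : > "$stamp"
}
```

With this, the first install to get through does the copying and later ones return immediately. The stamp would need to be removed by a clean or prepare step, otherwise a changed header would never be reinstalled.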

P.S. here's a more robust lock directory implementation, in Lisp. :)
That has a timeout, and checks for the situation that making
the directory fails for an error other than 17/EEXIST.

http://www.kylheku.com/cgit/tamarind/tree/lockdir.tl


Re: looking for assistance with "parallel" makefile, willing to pay

Robert P. J. Day-2
On Tue, 25 Feb 2020, Kaz Kylheku (gmake) wrote:

> On 2020-02-25 15:45, Robert P. J. Day wrote:
> >   if someone wants to isolate the problem and fix it, i'd be a happy
> > camper as i have lots of other stuff to work on.
> >
> >   thoughts?
>
> If multiple targets in the top-level makefile are recursively
> invoking make in src/pi/protos, running the install target, that's a
> problem.

... snip ...

  on top of everything you wrote, look at the number of "sleep" calls
sprinkled around ... pretty clear someone knew there were timing
issues:

$ grep -r sleep *
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/Makefile: sleep 2
AFI/test/afi-controller/Makefile: sleep 2
... etc etc ...

  man, this looks fragile.

rday