
2018 Tencent World AI Weiqi Competition #1554

Closed
bjiyxo opened this issue Jun 14, 2018 · 200 comments

Comments


bjiyxo commented Jun 14, 2018

http://weiqi.qq.com/special/109.html

IV. Contest System:

June 23-24: Preliminary contest (at site) as 5-7 rounds (subject to number of actual applicant teams) of Swiss-system contest. The top 8 enters the circle;

Early July: Top 8 circle (online). Seven rounds of circle contest on Tencent Foxwq platform (use black and white stones respectively for two games in each round). Top 4 enters site final;

Late July: Final (at site). The winner of 5-game contest in semi-final enters the final; the winner of 7-game contest in the final will be the champion.

(I) Contest Awards

Champion: CNY 400,000;
Runner-up: CNY 200,000;
3rd - 4th place: CNY 120,000;
5th - 8th place: CNY 60,000;
9th - 16th place: CNY 10,000.

@gcp I'm writing to ask for your suggestions:

  1. Was there anything we should improve on from the Fuzhou AI Go Competition?
  2. Do you have any ideas or requests for this Tencent Competition?

l1t1 commented Jun 14, 2018

gcp should hide the best weights in a safe place, so that they won't be copied by other programs.


l1t1 commented Jun 16, 2018

At the least, train a 20×256 network or larger that beats ELF; otherwise LZ is too weak to win the competition.


l1t1 commented Jun 16, 2018

Is it possible to prepare different weights for different opponents? For example, if ELF is not good at ladders, we can exploit that defect and choose one specific weight to deal with it.


gcp commented Jun 19, 2018

Was there anything we should improve on Fuzhou AI Go Competition?

I would use the next branch instead of some experimental one (even if it did not seem much worse).
Try to make sure the weights used are at least as good as the official ones.

There have been some suggestions to fiddle with max threads on machines with a lot of GPUs.


gcp commented Jun 19, 2018

At the least, train a 20×256 network or larger that beats ELF; otherwise LZ is too weak to win the competition.

We have no way of doing this right now (*). Indeed if someone enters something based on ELF weights directly, LZ would be expected to lose to it.

(*) We have ~230k ELF self-play games, but this is probably not enough to train a super-ELF from scratch.


l1t1 commented Jun 19, 2018

Is there any method to compute to what extent one weight is based on another?
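There is no exact method, but a rough signal, assuming both networks have the same architecture, is to compare the raw parameters directly: independently trained nets of the same shape are essentially uncorrelated, while a fine-tuned copy stays close to its parent. A minimal sketch on synthetic data; the helper below is illustrative, not part of Leela Zero:

```python
import numpy as np

def weight_similarity(a, b):
    """Cosine similarity between two flattened parameter vectors.
    Near 1.0 suggests one net was derived from the other;
    near 0.0 suggests independent training runs."""
    a, b = np.ravel(a), np.ravel(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy demonstration with random "weights":
rng = np.random.default_rng(0)
parent = rng.standard_normal(10_000)
finetuned = parent + 0.05 * rng.standard_normal(10_000)  # lightly fine-tuned copy
independent = rng.standard_normal(10_000)                # unrelated training run

print(weight_similarity(parent, finetuned))    # close to 1.0
print(weight_similarity(parent, independent))  # close to 0.0
```

This only gives a coarse answer: a net trained from another net's self-play games (rather than from its copied weights) would not show up this way.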

@Tsun-hung

@gcp Will you attend the competition yourself? We all hope you can come to China (at least that is my thought; I'm confident that LZ can reach the final).

@diadorak

@Tsun-hung
I am less optimistic. LZ is much stronger now. But since LZ's strength is well-known now, other closed-source programs could hardcode "traps" specifically designed for LZ...


l1t1 commented Jun 21, 2018

Is this true?

There are 11 AIs from China, Japan, Korea, Europe and the United States that will compete, including Fine Art, Octopus, Golaxy, and Northern Lights from China; AQ, AYa, and Raynz from Japan; Dolbaram and Baduki from Korea; Leela Zero from Belgium; and ELF OpenGo from the United States.



http://bbs.flygo.net/bbs/forum.php?mod=viewthread&tid=106979&extra=page%3D1


l1t1 commented Jun 21, 2018

confirmed http://weiqi.qq.com/news/9438.html


alreadydone commented Jun 21, 2018

The CITIC cup's champion prize of CNY 450,000 is even higher, though its total prize money (CNY 800,000) is lower. https://share.yikeweiqi.com/gonews/detail/id/18281/type/QQReadCount

@diadorak

Octopus and Northern Lights... New Go AIs keep coming out of nowhere.


jokkebk commented Jun 21, 2018

Seems quite likely that ELF OpenGo will beat at least Leela Zero...

@john45678

A top-4 finish for LZ would be great (considering how strong the competition is), and you never know, with a bit of luck...

@diadorak

AQ, AYa, Raynz, Dolbaram and Baduki should be weaker than LZ. Only Fine Art and ELF are clearly stronger.


l1t1 commented Jun 22, 2018

pytorch/ELF#62 (comment)
See what the ELF team says.


l1t1 commented Jun 22, 2018

Live log for the preliminaries of the games, June 23, 10:00:
http://duiyi.sina.com.cn/m/


l1t1 commented Jun 22, 2018

At the least, gcp had better use a hidden weight stronger than all public LZ weights (including 20×256), ELF excepted.


john45678 commented Jun 22, 2018

@diadorak "Only Fine Art and Elf are clearly stronger."

And Golaxy.
Edit: And the two unknown bots, Octopus and Northern Lights, could be as well.


l1t1 commented Jun 22, 2018

http://weiqi.qq.com/news/9439.html

Tian Yuandong: We have been encouraging everyone to use it since it was open-sourced. Over this past month and a half I believe many teams have found ways to improve it; for example, adding some extra conditions to prevent ladder bugs will make OpenGo stronger. If this competition gets everyone to use our platform, that achieves the goal. The team still has many tasks to complete, and we will not spend much effort on the outcome of this competition, so we do not expect OpenGo to get a high ranking. Of course, we still hope it can win.




lp200 commented Jun 22, 2018

PARANOiA-Rebirth on CGOS is probably AQ.
That's pretty strong.


l1t1 commented Jun 22, 2018

http://txwq.qq.com/act/mgame/index.html (webpage in Chinese)

@alreadydone

The official English name of "Northern Lights" (北极光) is Aurora. It's a registered participant of ICGA as well:

Game Type
    Go 19x19
Program Name
    Aurora
Author(s)
    Qi Zhang
    Dongbin Zhao
Operator
    Qi Zhang
Organization
    Institute of Automation, Chinese Academy of Sciences
Country
    China
Introduction
    A not-very-faithful re-implementation of Master. 1. Train a residual tower by supervised learning; mini-batches of data were sampled at random from the Fox, Baduk, and Tygem data sets above 9d level, including professional games. 2. Deep reinforcement learning from the supervised-learning starting point, playing millions of self-play games using a large number of GPUs from the National Supercomputer Center and CASIA. 3. Play moves with heuristic MCTS, combining a fast rollout policy with the policy and value networks.

Octopus is Leela Master with a customized engine.

@diadorak

ELF is run by Tencent... Since their program does not support multiple GPUs, ELF could indeed be one of the weaker entries in the competition.

pytorch/ELF#62
"Our understanding is that the tournament organizers plan to deploy the Fox server's ELF OpenGo version in the tournament. Thus, the hardware and software used will be whatever is running on Fox at the time of the tournament (hard to say).

The ELF OpenGo team is not officially participating but the tournament organizers have our permission to use the Fox bot as they wish."


l1t1 commented Jun 23, 2018

fineart vs golaxy (screenshot)

@diadorak

LZ won the first game against AQ.


l1t1 commented Jun 23, 2018

SGF records: http://duiyi.sina.com.cn/gibo/new_gibo.asp
lz vs aq: http://duiyi.sina.com.cn/cGibo/20186/ai1-1806233.sgf


l1t1 commented Jun 23, 2018

Round 2: lz (W) vs fineart (B), B+ (screenshot)

http://duiyi.sina.com.cn/cgibo/20186/ai2-1806237.sgf


l1t1 commented Jun 23, 2018

Who are they?
(screenshot)

http://sports.sina.com.cn/go/2018-06-23/doc-iheirxye5187956.shtml

And are they using a laptop in the game?


gcp commented Jul 23, 2018

as it seems with scalable infrastructure

As if this is easy!


gcp commented Jul 31, 2018

For the record, for the Fuzhou (WAIGO) tournament, I received about 20,000 RMB in prize money.

For this tournament (Tencent), I have not received any news regarding prize money.


morenus commented Aug 4, 2018

I suggest that Tencent should write up a report of this tournament on the Internet. In English. Searching for it is like trying to find an obscure secret that might exist, perhaps, in a foreign language. Or it might not. Tencent is a $300 billion (USD) technology company, don't they know how to put up a web page? Why keep the world in the dark about their big contest? Don't they want publicity? I theorize that there might be some big website, behind the Great Firewall of China, in Chinese characters, that is full of articles about what happened at the "2018 Tencent World AI Weiqi (Go) Competition". Or maybe there isn't one. It's really hard to tell. I can tell that there is very little in English (and even less in Spanish, or French). Tencent, please let everyone know what happened.


bjiyxo commented Aug 4, 2018

Please don't panic. We are negotiating with the Tencent committee now. They will give us a result next week.

@alreadydone

Many commentaries on the matches at http://foxwq.com/news/index/p/1.html (in Chinese).


gcp commented Nov 12, 2018

I received about 6000 EUR in prize money today from this tournament.


herazul commented Nov 12, 2018

Nice, definitely deserved! If it's not too personal, do you have a plan for it, or will you just keep it for personal/family use?
(What I mean is, those 2080 Tis look nice for training NNs :p)


baduk1 commented Nov 14, 2018 via email

@wonderingabout

Transparency is most appreciated, as is the personal effort.

@john45678

@gcp Great news. Of course, the work you and all the contributors have done over the last year to make LZ so strong is priceless. Thanks so much.


gcp commented Nov 15, 2018

do you have a plan with it
(What i mean is, thoose 2080ti looks nice for taining NN :p)

The RTX 2080 Ti doesn't look so attractive from a perf/price/watts/availability point of view right now. I'm considering getting an RTX 2080 instead and seeing if we can make the training work with fp16 (inference already works). That would be a nice speedup over the GTX 1080 Ti at lower power usage.

Other than that I plan to privately contact some of the people who made the largest code contributions and offer them a cut. It's probably impossible to do right by everyone here but I will use my best judgement and give it a shot.


herazul commented Nov 15, 2018

I'm considering getting an RTX 2080 instead and seeing if we can make the training work with fp16 (inference already works). That would be a nice speedup over the GTX 1080 Ti at lower power usage.

Your reasoning is sound =)

The RTX 2080 Ti doesn't look so attractive from perf/price/watts/availability point of view right now.

At the same time, the price seems to be dropping.
I was looking at LDLC delivery to Belgium, and there is 7-day availability under €1200 now (https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-graphique-interne/c4684/+fv121-16759.html?sort=1). Hopefully the trend keeps going; I might get one if the price drops under €1000.
Maybe waiting for Black Friday and looking for deals could be a good idea!


gcp commented Nov 16, 2018

@TFiFiE and @ihavnoid you should consider sending me an email so I can contact you.


wonderingabout commented Nov 16, 2018

fp16 is definitely interesting if possible.
Anything to optimize on AMD GPUs (like the RX 580/590 or Vega)?


Yakago commented Nov 17, 2018

The RTX 2080 Ti doesn't look so attractive from perf/price/watts/availability point of view right now. I'm considering getting an RTX 2080 instead and seeing if we can make the training work with fp16 (inference already works).

I've got an RTX card; how does fp16 work with inference? I seem to get similar performance using single/half precision. Do I have to do specific tuning for half, or something else? It detects the card as not having fp16 compute support.


gcp commented Nov 17, 2018

anything to optimize on amd gpus (like rx580/590 or vega) ?

We support both of these through fp16 compute support (during inference). The RX cards in theory should only benefit a little because they only save register space in fp16 mode, but empirically my RX560 actually becomes almost twice as fast in fp16/half mode. Vega should benefit a lot as it has fp16 compute, but I remember early reviews saying it is disabled in OpenCL (Edit: Some Googling shows newer drivers do have it enabled).

For training it all depends on how good TensorFlow's support for AMD cards is.


gcp commented Nov 17, 2018

It detects the card as not having fp16 compute support.

If the OpenCL driver doesn't expose fp16 support then we can't use it.

Edit: The rumors I read say that NVIDIA blocks this on consumer cards, but that it works on V100. If that's true then RTX cards are currently not very interesting for running LZ, though they may still be useful for training.

@alreadydone

@gcp What you found agrees with what I heard (that Vega can use fp16 compute now but RTX 20xx can't), except that the V100 actually doesn't work yet either. That's why people want CUDA/cuDNN etc. so dearly...


gcp commented Nov 17, 2018

except that V100 actually doesn't work yet.

What does "doesn't work yet" mean? This? V100 doesn't support half precision compute under OpenCL

If it still doesn't, then that's interesting; I found a claim that the V100 did but RTX doesn't, but if you're right then even that is bullshit. It should be easy to confirm by checking clinfo on a card and seeing if cl_khr_fp16 or the 'half' capability is reported.
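The clinfo check described above can be scripted: fp16 compute is advertised via the cl_khr_fp16 token in the device extension string. A small sketch parsing extension strings like the ones quoted in this thread (the sample values are abridged):

```python
def supports_fp16(device_extensions: str) -> bool:
    """Return True if an OpenCL device extension string advertises
    half-precision (fp16) compute via cl_khr_fp16."""
    return "cl_khr_fp16" in device_extensions.split()

# Abridged extension string as reported by the RTX 2080 Ti driver in this thread:
rtx_exts = ("cl_khr_global_int32_base_atomics cl_khr_fp64 "
            "cl_khr_byte_addressable_store cl_khr_icd")
# What a driver with fp16 enabled would additionally report:
vega_exts = rtx_exts + " cl_khr_fp16"

print(supports_fp16(rtx_exts))   # False
print(supports_fp16(vega_exts))  # True
```

Token-wise matching matters here: a plain substring search could be fooled by other extension names, so splitting on whitespace before comparing is the safer check.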

I can confirm that the RTX cards don't have drivers that are capable of fp16 in OpenCL:

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce RTX 2080 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  416.94
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               68
  Max clock frequency                             1635MHz
  Compute Capability (NV)                         7.5
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes
    char                                                 1 / 1
    short                                                1 / 1
    int                                                  1 / 1
    long                                                 1 / 1
    half                                                 0 / 0        (n/a)
    float                                                1 / 1
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              11811160064 (11GiB)
  Error Correction support                        No
  Max memory allocation                           2952790016 (2.75GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1114112 (1.063MiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                32
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  3
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer

So, NVIDIA wants everyone to rewrite to CUDA (exclusive to their platform) because they can't be arsed to write proper drivers. Apple wants everyone to rewrite to Metal (exclusive to their platform) because they can't be arsed to write proper drivers.

I really look forward to the next platform.


alreadydone commented Nov 18, 2018

Update:

V100's output

original post:

I asked a friend to run clinfo on a V100, but Google Cloud won't assign one to him at the moment, so here is the result on a P100 instead (truncated for some reason, but the Device Extensions are shown):

  Global memory size                              17071734784 (15.9GiB)
  Error Correction support                        Yes
  Max memory allocation                           4267933696 (3.975GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        917504 (896KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1


wonderingabout commented Nov 18, 2018

@gcp @alreadydone
This is the output I get on a Microsoft Azure Tesla V100 data-science batch Ubuntu image, following the Azure Google doc with low-priority cost that I just completed:
https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm-ubuntu?tab=Overview

It uses a custom image provided by Microsoft Azure with everything preinstalled (GPU driver, CUDA, cuDNN, OpenCL).

With clinfo on the preinstalled Data Science 16.04 image (tested working):

Preparing to unpack .../clinfo_2.1.16.01.12-1_amd64.deb ...
Unpacking clinfo (2.1.16.01.12-1) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up clinfo (2.1.16.01.12-1) ...
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.2.176
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     Tesla V100-PCIE-16GB
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  396.44
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Topology (NV)                            PCI-E, 00:00.0
  Max compute units                               80
  Max clock frequency                             1380MHz
  Compute Capability (NV)                         7.0
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              16945512448 (15.78GiB)
  Error Correction support                        Yes
  Max memory allocation                           4236378112 (3.945GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1310720
  Global Memory cache line                        128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                32
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     9
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  7
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
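For what it's worth, the "No platform" results above are the classic symptom of the OpenCL ICD loader finding no vendor `.icd` file. A minimal sketch of a check (the `/etc/OpenCL/vendors` path is the conventional Linux default; the function name is a made-up helper for illustration):

```python
import glob
import os

def registered_opencl_icds(vendor_dir="/etc/OpenCL/vendors"):
    """Return the contents of each vendor .icd file found under vendor_dir.

    Each .icd file names the vendor OpenCL library (e.g. libnvidia-opencl.so.1)
    that the ICD loader will dlopen; if this list is empty, clinfo reports
    "No platform" exactly as in the log above.
    """
    icds = {}
    for path in sorted(glob.glob(os.path.join(vendor_dir, "*.icd"))):
        with open(path) as f:
            icds[os.path.basename(path)] = f.read().strip()
    return icds

if __name__ == "__main__":
    found = registered_opencl_icds()
    if not found:
        print("no ICD files registered - OpenCL apps will see no platforms")
    for name, lib in found.items():
        print(f"{name}: {lib}")
```

Reinstalling the driver (or the ocl-icd / vendor ICD package) and rebooting, as in the logs below, is what normally restores the registration.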
Sat Nov 17 17:54:03 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00005AA6:00:00.0 Off |                    0 |
| N/A   29C    P0    35W / 250W |      0MiB / 16160MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then, during tuning, on the preinstalled datascience 16.04 image (testworking):

[100%] Linking CXX executable validation
[100%] Built target validation
Starting tuning process, please wait...
Net filename: networks/efa2fefd122cb212b990cadcd6cb07258ea266bc4904ce0b4ee3f69ede03bb2b.gz
net: efa2fefd122cb212b990cadcd6cb07258ea266bc4904ce0b4ee3f69ede03bb2b.
./leelaz --tune-only -w networks/efa2fefd122cb212b990cadcd6cb07258ea266bc4904ce0b4ee3f69ede03bb2b.gz
Leela Zero 0.16  Copyright (C) 2017-2018  Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

Using 2 thread(s).
RNG seed: 11755976039706366702
BLAS Core: built-in Eigen 3.3.5 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.2.176
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Tesla V100-PCIE-16GB
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 396.44
Device speed:  1380 MHz
Device cores:  80 CU
Device score:  1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-PCIE-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.2.176
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Tesla V100-PCIE-16GB
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 396.44
Device speed:  1380 MHz
Device cores:  80 CU
Device score:  1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-PCIE-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.
(1/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0484 ms (2438.5 GFLOPS)
(5/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0415 ms (2845.0 GFLOPS)
(13/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0410 ms (2880.6 GFLOPS)
(14/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0397 ms (2974.1 GFLOPS)
(27/290) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0371 ms (3178.6 GFLOPS)
(32/290) KWG=32 KWI=2 MDIMA=32 MDIMC=32 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0358 ms (3292.2 GFLOPS)
(82/290) KWG=16 KWI=8 MDIMA=16 MDIMC=16 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0348 ms (3389.0 GFLOPS)
(136/290) KWG=16 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0338 ms (3490.9 GFLOPS)
(139/290) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0335 ms (3517.6 GFLOPS)
(169/290) KWG=16 KWI=8 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0325 ms (3631.0 GFLOPS)
(172/290) KWG=32 KWI=8 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0315 ms (3746.3 GFLOPS)
Found Leela Version : 0.16
Tuning process finished
Starting thread 1 on GPU 0
{
    "cmd": "selfplay",
    "hash": 

@wonderingabout
Contributor

wonderingabout commented Nov 18, 2018

@alreadydone @gcp
You can use this very useful website to upload large text files; you can even link to them as plain-text HTML links, as I did below:
http://www.uploadedit.com/_to-upload-documents-onto-internet-PLAIN-TEXT-TXT-hosting.htm

I already had the log from a brand-new Ubuntu 18.04 install with the latest GPU driver from the PPA, on a Tesla V100, in test1000 (same script as test81):

clinfo on Ubuntu 18.04 + GPU drivers PPA + all latest packages

  • before-reboot log:

http://m.uploadedit.com/bbtc/15425314021.txt

clinfo after reboot (Ubuntu 18.04 + GPU drivers PPA + all latest packages: test1000):

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 10.0.185
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     Tesla V100-PCIE-16GB
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  410.73
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 00:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               80
  Max clock frequency                             1380MHz
  Compute Capability (NV)                         7.0
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              16914055168 (15.75GiB)
  Error Correction support                        Yes
  Max memory allocation                           4228513792 (3.938GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1310720 (1.25MiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                32
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  7
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
Sun Nov 18 08:52:13 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 000039C3:00:00.0 Off |                    0 |
| N/A   29C    P0    35W / 250W |      0MiB / 16130MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Tuning after reboot (Ubuntu 18.04 + GPU drivers PPA + all latest packages: test1000):

Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 10.0.185
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Tesla V100-PCIE-16GB
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 410.73
Device speed:  1380 MHz
Device cores:  80 CU
Device score:  1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-PCIE-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 10.0.185
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Tesla V100-PCIE-16GB
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 410.73
Device speed:  1380 MHz
Device cores:  80 CU
Device score:  1112
Selected platform: NVIDIA CUDA
Selected device: Tesla V100-PCIE-16GB
with OpenCL 1.2 capability.
Half precision compute support: No.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.
(1/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0527 ms (2236.6 GFLOPS)
(5/290) KWG=16 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0422 ms (2793.8 GFLOPS)
(13/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0417 ms (2826.5 GFLOPS)
(14/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0379 ms (3114.2 GFLOPS)
(32/290) KWG=32 KWI=2 MDIMA=32 MDIMC=32 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=2 VWN=2 0.0374 ms (3157.5 GFLOPS)
(131/290) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0348 ms (3387.5 GFLOPS)
(136/290) KWG=16 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0338 ms (3491.7 GFLOPS)
(139/290) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0330 ms (3573.0 GFLOPS)
(169/290) KWG=16 KWI=8 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0328 ms (3601.8 GFLOPS)
(172/290) KWG=32 KWI=8 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2 0.0315 ms (3747.3 GFLOPS)
Found Leela Version : 0.16
Tuning process finished
Starting thread 1 on GPU 0
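As an aside, the improving-configuration lines the tuner prints have a regular shape, so the fastest kernel parameters can be pulled back out of a saved log. A small sketch, assuming exactly the line format shown in the logs above (the helper name is made up):

```python
import re

# Matches tuner lines like:
# (172/290) KWG=32 KWI=8 ... VWM=4 VWN=2 0.0315 ms (3747.3 GFLOPS)
TUNER_LINE = re.compile(
    r"\(\d+/\d+\)\s+(?P<params>(?:\w+=\d+\s+)+)"
    r"(?P<ms>[\d.]+) ms \((?P<gflops>[\d.]+) GFLOPS\)"
)

def best_config(log_text):
    """Return (params_dict, gflops) for the fastest configuration, or None."""
    best = None
    for m in TUNER_LINE.finditer(log_text):
        gflops = float(m.group("gflops"))
        if best is None or gflops > best[1]:
            pairs = dict(p.split("=") for p in m.group("params").split())
            best = ({k: int(v) for k, v in pairs.items()}, gflops)
    return best
```

This only reads a log after the fact; the actual result leelaz uses is the one it writes to its own tuning cache file.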

Note:
test81 full log here for information:

http://m.uploadedit.com/bbtc/1542529282479.txt

Note 2:
For information, this is a full log from Ubuntu 18.04 with the latest PPA + packages:

http://m.uploadedit.com/bbtc/1542531465359.txt

For some reason the datascience image seems quite a bit slower than my own install with the PPA.

Note 3:
For information, the full manual-reboot script I used:
#1905 (comment)

/bin/bash -c 'sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y install nvidia-driver-410 linux-headers-generic nvidia-opencl-dev && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl ; rm -r leela-zero ; clinfo ; nvidia-smi ; git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

@TFiFiE
Contributor

TFiFiE commented Nov 18, 2018

@gcp, spend whatever cut you feel you owe me in any way you think will best contribute to the future development of Leela Zero.

@gcp
Member

gcp commented Nov 19, 2018

@gcp, spend whatever cut you feel you owe me in any way you think will best contribute to the future development of Leela Zero.

OK, noted!
